Import **pandas** as the required package for working with dataframes.

In [1]:
import pandas as pd

A dataframe is a two-dimensional table with labeled rows and columns.

# Creating a Dataframe


Usually you load a dataframe from a file, a SQL database, or a web resource. Here I will show you how to create one from scratch.

You can create a dataframe from lists, tuples, or arrays. Here we build it from a dictionary, where each key becomes a column name.

In [9]:
data = {
    'A': [1,2,3],
    'B': [4,5,6],
    'C': [7,8,9]
}
df = pd.DataFrame(data=data)
df

,A,B,C
0,1,4,7
1,2,5,8
2,3,6,9


A dataframe has columns (here: A, B, C) and rows. The index is created automatically and starts at 0.
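
Besides a dictionary, a dataframe can also be built from a list of rows or from a NumPy array. The following is a minimal sketch; the variable names and the index labels `x`, `y`, `z` are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# From a list of rows: each inner list is one row, column names are passed separately.
rows = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
df_from_rows = pd.DataFrame(rows, columns=['A', 'B', 'C'])

# From a NumPy array, with a custom index instead of the default 0, 1, 2.
arr = np.array(rows)
df_from_array = pd.DataFrame(arr, columns=['A', 'B', 'C'], index=['x', 'y', 'z'])

print(df_from_rows)
print(df_from_array)
```

Both frames hold the same values as the dictionary-based one; only the way the data is passed in differs.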

# Import and Export of Dataframes

You can export a dataframe to different formats such as Excel or JSON. Here I export it to a CSV file.

In [10]:
filename = 'df.csv' 
df.to_csv(filename, index=False)

Similarly, a dataframe can be imported with **pandas**. There are many read functions for importing from different formats.

In [6]:
df = pd.read_csv(filename)
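
The effect of `index=False` is easiest to see in a round trip. The sketch below writes to in-memory strings instead of files (an assumption made to keep the example self-contained) and also shows the JSON counterpart of the CSV functions.

```python
import io
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# With the index written out, read_csv brings it back as an extra, unnamed column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
print(with_index.columns.tolist())

# With index=False the round trip returns only the original columns.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print(without_index.columns.tolist())

# JSON works the same way via to_json / read_json.
json_text = df.to_json(orient='records')
df_json = pd.read_json(io.StringIO(json_text))
print(df_json)
```

If you forget `index=False` when exporting, you can still skip the extra column on import with `pd.read_csv(..., index_col=0)`.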

# Exploratory Data Analysis

You can explore the data with *head()* to see the first observations. If you are interested in the last observations, use *tail()*. The argument is the number of observations to show.

In [13]:
df.head(2)

,A,B,C
0,1,4,7
1,2,5,8
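
For comparison, a short sketch of *tail()* and of *head()* without an argument, using the same example data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# The last two observations (index labels 1 and 2).
last_two = df.tail(2)
print(last_two)

# Without an argument, head() returns up to five rows; here that is the whole frame.
first_rows = df.head()
print(len(first_rows))
```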


Statistical properties are shown with the *describe()* method.

In [12]:
df.describe()

,A,B,C
count,3.0,3.0,3.0
mean,2.0,5.0,8.0
std,1.0,1.0,1.0
min,1.0,4.0,7.0
25%,1.5,4.5,7.5
50%,2.0,5.0,8.0
75%,2.5,5.5,8.5
max,3.0,6.0,9.0
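
If you only need one of these statistics, the individual methods can be called directly. A minimal sketch on the same data:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Column-wise mean, returned as a Series indexed by column name.
means = df.mean()

# Sample standard deviation of a single column (pandas uses ddof=1 by default).
std_a = df['A'].std()

# Median of a single column, the same value as the 50% row of describe().
median_b = df['B'].median()

print(means)
print(std_a, median_b)
```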


A general summary of the dataframe is provided by the *info()* method.

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
A    3 non-null int64
B    3 non-null int64
C    3 non-null int64
dtypes: int64(3)
memory usage: 152.0 bytes


Often you are interested in the number of rows and columns. You can get both with the *shape* attribute, which returns a tuple (rows, columns).

In [15]:
df.shape

(3, 3)

The column names are stored in the attribute *columns*.

In [16]:
df.columns

Index(['A', 'B', 'C'], dtype='object')
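
Both attributes are often used in plain Python code. The sketch below unpacks the shape tuple and converts the column index to a list; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# shape is a (rows, columns) tuple and can be unpacked directly.
n_rows, n_cols = df.shape

# len() on a dataframe gives the number of rows.
assert len(df) == n_rows

# The column index converts to a regular Python list.
column_names = list(df.columns)

print(n_rows, n_cols, column_names)
```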

# Accessing Elements

In [23]:
# integer location
df.iloc[0, 0]  # first row, first column

1

In [24]:
df.iloc[1, :]  # 2nd row

A    2
B    5
C    8
Name: 1, dtype: int64

In [26]:
df.iloc[:, -1]  # last column

0    7
1    8
2    9
Name: C, dtype: int64
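
While *iloc* selects by integer position, *loc* selects by label. The following sketch contrasts the two on the same data; note that label slices include the end label, whereas position slices do not.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# A single column by name, returned as a Series.
col_a = df['A']

# A single value by row label and column name.
value = df.loc[1, 'B']

# Label-based slice: rows 0 through 1 (inclusive), columns A and C.
by_label = df.loc[0:1, ['A', 'C']]

# Position-based slice: only row 0, because the end position is exclusive.
by_position = df.iloc[0:1]

print(value)
print(by_label)
print(by_position)
```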

# Adding/Modifying Columns

In [29]:
df['D'] = list(range(10,13))
df

,A,B,C,D
0,1,4,7,10
1,2,5,8,11
2,3,6,9,12
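
A new column can also be computed from existing ones, and assigning to an existing column name overwrites it. A minimal sketch; the column name `total` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# New column computed element-wise from two existing columns.
df['total'] = df['A'] + df['B']

# Assigning to an existing name replaces that column.
df['A'] = df['A'] * 10

print(df)
```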


# Delete Rows or Columns

If you want to delete a column, use the *drop()* method and specify the column name. The argument *axis* must be 1 for columns. With *inplace=True* the dataframe is modified directly instead of a new one being returned.

In [30]:
df.drop('C', axis=1, inplace=True)
df

,A,B,D
0,1,4,10
1,2,5,11
2,3,6,12


Similarly, you can delete rows by specifying the index label of the row. Here *axis* is 0 for rows, and *inplace=True* again changes the dataframe directly.

In [31]:
df.drop(1, axis=0, inplace=True)
df

,A,B,D
0,1,4,10
2,3,6,12
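
Without *inplace*, *drop()* returns a new dataframe and leaves the original untouched. The sketch below also uses the `columns=` and `index=` keywords, which are an alternative to passing `axis`.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Returns a copy without column C; df itself is unchanged.
without_c = df.drop(columns=['C'])

# Returns a copy without the row labeled 1.
without_row_1 = df.drop(index=[1])

print(without_c)
print(without_row_1)
print(df.shape)  # the original still has all rows and columns
```

Note that the remaining rows keep their original index labels (0 and 2); call *reset_index(drop=True)* if you want a fresh 0, 1, ... index.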


# Apply a lambda function to a column

You can also apply a specific function to a column.

In [32]:
my_func = lambda x: x + 2

df['E'] = df['A'].apply(my_func)
df

,A,B,D,E
0,1,4,10,3
2,3,6,12,5
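
For simple arithmetic like this, a vectorised expression gives the same result as *apply()* and is usually faster. *apply()* is more useful when the function needs a whole row, as in this sketch with `axis=1`; the data mirrors columns A and B of the frame above.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3], 'B': [4, 6]}, index=[0, 2])

# Element-wise function applied to one column.
via_apply = df['A'].apply(lambda x: x + 2)

# The same computation as a vectorised expression.
vectorised = df['A'] + 2

# With axis=1 the function receives one row at a time.
row_sum = df.apply(lambda row: row['A'] + row['B'], axis=1)

print(via_apply.equals(vectorised))
print(row_sum)
```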


# Reshape your dataframe structure

You can reshape a dataframe from wide data to tidy data and vice versa. We start with wide data.

In [36]:
data = {
    'student': ['Stuart', 'Bob', 'Kevin'],
    'math': [2,3,3],
    'sport': [3,1,2],
    'art': [4,2,1]
    
}
df_wide = pd.DataFrame(data=data)
df_wide

,student,math,sport,art
0,Stuart,2,3,4
1,Bob,3,1,2
2,Kevin,3,2,1


In [44]:
df_tidy = df_wide.melt(id_vars=['student'],
                       var_name='subject',
                       value_name='grade')
df_tidy

,student,subject,grade
0,Stuart,math,2
1,Bob,math,3
2,Kevin,math,3
3,Stuart,sport,3
4,Bob,sport,1
5,Kevin,sport,2
6,Stuart,art,4
7,Bob,art,2
8,Kevin,art,1


In [47]:
df_wide2 = df_tidy.pivot(index='student', columns='subject')
df_wide2

,grade,grade,grade
subject,art,math,sport
student,,,
Bob,2,3,1
Kevin,1,3,2
Stuart,4,2,3
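
Because no `values` argument was given, the result has two column levels (`grade` on top of the subjects). A sketch of one way to get back a flat, wide table, assuming the same tidy data as above: pass `values='grade'` and then move `student` back into a regular column.

```python
import pandas as pd

df_tidy = pd.DataFrame({
    'student': ['Stuart', 'Bob', 'Kevin'] * 3,
    'subject': ['math'] * 3 + ['sport'] * 3 + ['art'] * 3,
    'grade': [2, 3, 3, 3, 1, 2, 4, 2, 1],
})

# values='grade' yields a single level of columns, one per subject.
df_flat = df_tidy.pivot(index='student', columns='subject', values='grade')

# Turn the 'student' index back into a column and drop the leftover axis name.
df_flat = df_flat.reset_index()
df_flat.columns.name = None

print(df_flat)
```

The columns come out in alphabetical order (art, math, sport), so the column order differs from the original wide table even though the values are the same.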


In [48]:
import numpy as np

In [54]:
# select the id column by position instead of by name;
# this gives the same tidy table as df_tidy above
df_wide.melt(id_vars=list(df_wide.columns[0:1]),
             var_name='subject',
             value_name='grade')