We start by importing our libraries under their conventional namespace aliases.

In [1]:
import numpy as np
import pandas as pd

## 1. Creating Dataframes

The core of Pandas is the dataframe. So let's see how we can create a Pandas dataframe from a Series or NumPy array.

In [2]:
sample_series = pd.Series([1,2,3,4,np.nan,6,7,8])
print('Sample series is: \n', sample_series)

Sample series is: 
 0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    6.0
6    7.0
7    8.0
dtype: float64


In [3]:
sample_df = pd.DataFrame(np.random.randn(12, 4), columns=list("ABCD"))
print('Sample dataframe is: \n', sample_df)

Sample dataframe is: 
            A         B         C         D
0   0.478231 -0.950501 -1.613626  0.633979
1  -1.841121 -0.106668 -0.929868  0.281903
2   0.318260  0.416277  3.175487  0.385991
3  -0.601547  0.690190 -0.193211  2.122562
4  -1.456766  0.193964  0.484125 -1.098988
5   0.366970  1.895161  1.770509 -0.362223
6   0.228527 -0.443287  0.801501  1.546570
7   2.575139 -1.851373 -1.061542  0.259036
8  -0.472712 -1.545774 -1.725854 -0.237869
9  -0.656836  2.144947  1.074198  2.002524
10 -0.444319  0.998143  0.898986  0.173071
11 -0.346956 -1.533176  0.285758 -0.964378
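To make the "from a Series or NumPy array" route explicit, here is a minimal sketch. The names `values`, `series_df`, `array_2d` and `array_df` are made up for this illustration and are not used elsewhere in the workbook.

```python
import numpy as np
import pandas as pd

# A Series becomes a one-column dataframe; its name becomes the column label.
values = pd.Series([1, 2, 3, 4, np.nan, 6, 7, 8], name="values")
series_df = values.to_frame()

# A 2D NumPy array becomes a dataframe once we supply column labels.
array_2d = np.arange(12).reshape(3, 4)
array_df = pd.DataFrame(array_2d, columns=list("ABCD"))

print(series_df.shape)  # (8, 1)
print(array_df.shape)   # (3, 4)
```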


Remember what I said about Pandas DataFrame being 2D. Let's see what it actually looks like with the example below. Here we are creating a Pandas dataframe from a dictionary object.

In [4]:
sample_dictionary = { "A": 1.0, 
                      "B": pd.Timestamp("20130102"),
                      "C": pd.Series(1, index=list(range(4)), dtype="float32"),
                      "D": np.array([3] * 4, dtype="int32"),
                      "E": pd.Categorical(["test", "train", "test", "train"]),
                      "F": "foo",
                    }

In [5]:
type(sample_dictionary)

dict

In [6]:
sample_dictionary

{'A': 1.0,
 'B': Timestamp('2013-01-02 00:00:00'),
 'C': 0    1.0
 1    1.0
 2    1.0
 3    1.0
 dtype: float32,
 'D': array([3, 3, 3, 3], dtype=int32),
 'E': ['test', 'train', 'test', 'train']
 Categories (2, object): ['test', 'train'],
 'F': 'foo'}

In [7]:
sample_dataframe = pd.DataFrame(sample_dictionary)

In [8]:
type(sample_dataframe)

pandas.core.frame.DataFrame

In [9]:
sample_dataframe

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


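See how the shape changes: the scalar entries (A, B and F) are repeated to fill the four rows set by the array-like entries (C, D and E). Which of the two objects do you think is computationally faster to work with?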
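One way to check the change in shape yourself is sketched below. It uses a cut-down, hypothetical dictionary (`small_dict`) rather than the full `sample_dictionary`, so the numbers are only illustrative.

```python
import numpy as np
import pandas as pd

small_dict = {
    "A": 1.0,                               # scalar, repeated for every row
    "D": np.array([3] * 4, dtype="int32"),  # length-4 array sets the row count
    "F": "foo",                             # scalar, repeated for every row
}
small_df = pd.DataFrame(small_dict)

print(len(small_dict))  # 3 keys in the dictionary
print(small_df.shape)   # (4, 3): four rows, three columns
print(small_df.dtypes)  # each column keeps its own dtype
```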

### Displaying DataFrame
Another way to display a dataframe is with the .head() and .tail() methods, which show the first and last five rows by default.

In [10]:
sample_df.head()

Unnamed: 0,A,B,C,D
0,0.478231,-0.950501,-1.613626,0.633979
1,-1.841121,-0.106668,-0.929868,0.281903
2,0.31826,0.416277,3.175487,0.385991
3,-0.601547,0.69019,-0.193211,2.122562
4,-1.456766,0.193964,0.484125,-1.098988


In [11]:
sample_df.tail()

Unnamed: 0,A,B,C,D
7,2.575139,-1.851373,-1.061542,0.259036
8,-0.472712,-1.545774,-1.725854,-0.237869
9,-0.656836,2.144947,1.074198,2.002524
10,-0.444319,0.998143,0.898986,0.173071
11,-0.346956,-1.533176,0.285758,-0.964378


Try printing out only a few rows from the end.

In [12]:
sample_df.tail(2)

Unnamed: 0,A,B,C,D
10,-0.444319,0.998143,0.898986,0.173071
11,-0.346956,-1.533176,0.285758,-0.964378


We can also explore the dataframe with the .describe() function.

In [13]:
sample_df.describe()

Unnamed: 0,A,B,C,D
count,12.0,12.0,12.0,12.0
mean,-0.154427,-0.007675,0.247205,0.395182
std,1.116345,1.319581,1.446526,1.048213
min,-1.841121,-1.851373,-1.725854,-1.098988
25%,-0.615369,-1.09617,-0.962786,-0.268957
50%,-0.395637,0.043648,0.384941,0.27047
75%,0.330438,0.767178,0.942789,0.862127
max,2.575139,2.144947,3.175487,2.122562


We can also display the column and index elements separately.

In [14]:
sample_df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [15]:
sample_df.index

RangeIndex(start=0, stop=12, step=1)

## 2. Sorting and Selection

### Sorting
We can sort a dataframe by its labels along an axis with .sort_index(), or by the values in a column with .sort_values().

In [16]:
sample_df.head()

Unnamed: 0,A,B,C,D
0,0.478231,-0.950501,-1.613626,0.633979
1,-1.841121,-0.106668,-0.929868,0.281903
2,0.31826,0.416277,3.175487,0.385991
3,-0.601547,0.69019,-0.193211,2.122562
4,-1.456766,0.193964,0.484125,-1.098988


In [17]:
sample_df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
0,0.633979,-1.613626,-0.950501,0.478231
1,0.281903,-0.929868,-0.106668,-1.841121
2,0.385991,3.175487,0.416277,0.31826
3,2.122562,-0.193211,0.69019,-0.601547
4,-1.098988,0.484125,0.193964,-1.456766
5,-0.362223,1.770509,1.895161,0.36697
6,1.54657,0.801501,-0.443287,0.228527
7,0.259036,-1.061542,-1.851373,2.575139
8,-0.237869,-1.725854,-1.545774,-0.472712
9,2.002524,1.074198,2.144947,-0.656836


In [18]:
sample_df.sort_values(by='B')

Unnamed: 0,A,B,C,D
7,2.575139,-1.851373,-1.061542,0.259036
8,-0.472712,-1.545774,-1.725854,-0.237869
11,-0.346956,-1.533176,0.285758,-0.964378
0,0.478231,-0.950501,-1.613626,0.633979
6,0.228527,-0.443287,0.801501,1.54657
1,-1.841121,-0.106668,-0.929868,0.281903
4,-1.456766,0.193964,0.484125,-1.098988
2,0.31826,0.416277,3.175487,0.385991
3,-0.601547,0.69019,-0.193211,2.122562
10,-0.444319,0.998143,0.898986,0.173071
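Sorting in descending order or by more than one column works the same way. The sketch below uses a small made-up dataframe (`scores`) so the result is easy to verify by eye. Note that sorting returns a new dataframe and leaves the original order untouched.

```python
import pandas as pd

scores = pd.DataFrame({
    "group": ["b", "a", "b", "a"],
    "score": [2, 5, 9, 1],
})

# Largest score first.
by_score_desc = scores.sort_values(by="score", ascending=False)

# Sort by group, then by score within each group.
by_group_then_score = scores.sort_values(by=["group", "score"])

print(by_score_desc.index.tolist())        # [2, 1, 0, 3]
print(by_group_then_score.index.tolist())  # [3, 1, 0, 2]
print(scores.index.tolist())               # [0, 1, 2, 3], original unchanged
```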


### Selection
There are various ways to make selections with a Pandas dataframe. First, let's look at selection of an entire column.

In [19]:
sample_df_column_a = sample_df["A"]
print('This is the A column of the sample_df: \n', sample_df_column_a)

This is the A column of the sample_df: 
 0     0.478231
1    -1.841121
2     0.318260
3    -0.601547
4    -1.456766
5     0.366970
6     0.228527
7     2.575139
8    -0.472712
9    -0.656836
10   -0.444319
11   -0.346956
Name: A, dtype: float64


In [20]:
print('This is the A column of the sample_df: \n', sample_df.A)

This is the A column of the sample_df: 
 0     0.478231
1    -1.841121
2     0.318260
3    -0.601547
4    -1.456766
5     0.366970
6     0.228527
7     2.575139
8    -0.472712
9    -0.656836
10   -0.444319
11   -0.346956
Name: A, dtype: float64


In [21]:
print('This is the A column of the sample_df: \n', sample_df.loc[:,["A"]])

This is the A column of the sample_df: 
            A
0   0.478231
1  -1.841121
2   0.318260
3  -0.601547
4  -1.456766
5   0.366970
6   0.228527
7   2.575139
8  -0.472712
9  -0.656836
10 -0.444319
11 -0.346956


Now let's see how we can make selections of a subset of elements based on location or values.

In [22]:
sample_df.loc[3:5, ["A", "B"]]

Unnamed: 0,A,B
3,-0.601547,0.69019
4,-1.456766,0.193964
5,0.36697,1.895161


In [23]:
sample_df[sample_df["A"] > 0]

Unnamed: 0,A,B,C,D
0,0.478231,-0.950501,-1.613626,0.633979
2,0.31826,0.416277,3.175487,0.385991
5,0.36697,1.895161,1.770509,-0.362223
6,0.228527,-0.443287,0.801501,1.54657
7,2.575139,-1.851373,-1.061542,0.259036
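Two details are easy to trip over here: .loc selects by label and includes the end label, while .iloc selects by position and excludes the end position. A short sketch on a made-up dataframe (`toy_df`, with an index that deliberately does not start at 0) shows the difference, along with how to combine two conditions.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {"A": [0.5, -1.2, 0.3, -0.6], "B": [-0.9, -0.1, 0.4, 0.7]},
    index=[10, 11, 12, 13],
)

# .loc selects by label, and the end label is included.
by_label = toy_df.loc[11:12, ["A"]]

# .iloc selects by integer position, and the end position is excluded.
by_position = toy_df.iloc[1:3, 0:1]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both_positive = toy_df[(toy_df["A"] > 0) & (toy_df["B"] > 0)]

print(by_label.equals(by_position))  # True
print(both_positive.index.tolist())  # [12]
```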


## 3. Handling Missing Data
We need to come up with a way to deal with our missing data points. Here are a few tricks to find and mark them.

In [24]:
# First we need to add some missing data to our sample_df dataframe.
sample_df2 = sample_df[sample_df > -1.0]

In [25]:
sample_df2.head()

Unnamed: 0,A,B,C,D
0,0.478231,-0.950501,,0.633979
1,,-0.106668,-0.929868,0.281903
2,0.31826,0.416277,3.175487,0.385991
3,-0.601547,0.69019,-0.193211,2.122562
4,,0.193964,0.484125,


### Dropping the NaN values.
We can use the .dropna() method to drop every row that contains a NaN.

In [26]:
sample_df2.dropna() 

Unnamed: 0,A,B,C,D
2,0.31826,0.416277,3.175487,0.385991
3,-0.601547,0.69019,-0.193211,2.122562
5,0.36697,1.895161,1.770509,-0.362223
6,0.228527,-0.443287,0.801501,1.54657
9,-0.656836,2.144947,1.074198,2.002524
10,-0.444319,0.998143,0.898986,0.173071
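By default .dropna() removes a row as soon as it has a single NaN, which can throw away a lot of data. The `how` and `subset` parameters give finer control. The sketch below uses a tiny made-up dataframe (`gaps_df`) for illustration.

```python
import numpy as np
import pandas as pd

gaps_df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0, np.nan],
    "B": [1.0, 2.0, np.nan, np.nan],
})

any_missing = gaps_df.dropna()           # drop rows with at least one NaN
all_missing = gaps_df.dropna(how="all")  # drop rows where every value is NaN
only_a = gaps_df.dropna(subset=["A"])    # only look at column A

print(any_missing.index.tolist())  # [0]
print(all_missing.index.tolist())  # [0, 1, 2]
print(only_a.index.tolist())       # [0, 2]
```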


### Filling the NaN values.
We can use the .fillna() method to replace NaN values with a value of our choice.

In [27]:
sample_df2.fillna(value='Missing')

Unnamed: 0,A,B,C,D
0,0.478231,-0.950501,Missing,0.633979
1,Missing,-0.106668,-0.929868,0.281903
2,0.31826,0.416277,3.17549,0.385991
3,-0.601547,0.69019,-0.193211,2.12256
4,Missing,0.193964,0.484125,Missing
5,0.36697,1.89516,1.77051,-0.362223
6,0.228527,-0.443287,0.801501,1.54657
7,2.57514,Missing,Missing,0.259036
8,-0.472712,Missing,Missing,-0.237869
9,-0.656836,2.14495,1.0742,2.00252


In [28]:
sample_df2.fillna(value=2)

Unnamed: 0,A,B,C,D
0,0.478231,-0.950501,2.0,0.633979
1,2.0,-0.106668,-0.929868,0.281903
2,0.31826,0.416277,3.175487,0.385991
3,-0.601547,0.69019,-0.193211,2.122562
4,2.0,0.193964,0.484125,2.0
5,0.36697,1.895161,1.770509,-0.362223
6,0.228527,-0.443287,0.801501,1.54657
7,2.575139,2.0,2.0,0.259036
8,-0.472712,2.0,2.0,-0.237869
9,-0.656836,2.144947,1.074198,2.002524


Let's see what our dataframe looks like now:

In [29]:
sample_df2.head()

Unnamed: 0,A,B,C,D
0,0.478231,-0.950501,,0.633979
1,,-0.106668,-0.929868,0.281903
2,0.31826,0.416277,3.175487,0.385991
3,-0.601547,0.69019,-0.193211,2.122562
4,,0.193964,0.484125,


Why didn't it change?
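A hint, sketched on a small made-up dataframe (`gaps_df`): .fillna() and .dropna() return a new dataframe and leave the one they were called on as it was. To keep the result, assign it to a name, for example `gaps_df = gaps_df.fillna(value=2)`.

```python
import numpy as np
import pandas as pd

gaps_df = pd.DataFrame({"A": [1.0, np.nan], "B": [np.nan, 4.0]})

# fillna returns a new dataframe; gaps_df itself is not modified.
filled = gaps_df.fillna(value=2)

print(int(gaps_df.isna().sum().sum()))  # 2: the original still has its NaN values
print(int(filled.isna().sum().sum()))   # 0: only the returned copy is filled
```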

### Dummy Variables
This is a handy dataframe tool to convert categorical variables into dummy/indicator variables. Let's add some categorical variables to sample_df2.

In [30]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

In [31]:
categorical_df = sample_df2

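In [32]:
# Caution: the names are swapped; positive_index holds the rows where B < 0.
positive_index = sample_df2.B[sample_df2.B < 0].index
negative_index = sample_df2.B[sample_df2.B > 0].index

In [33]:
# Cast B to object so it can hold strings, then assign with .loc.
categorical_df["B"] = categorical_df["B"].astype(object)
categorical_df.loc[positive_index, "B"] = 'Positive'
categorical_df.loc[negative_index, "B"] = 'Negative'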

In [34]:
categorical_df = categorical_df.fillna(0)

In [35]:
categorical_df.head()

Unnamed: 0,A,B,C,D
0,0.478231,Positive,0.0,0.633979
1,0.0,Positive,-0.929868,0.281903
2,0.31826,Negative,3.175487,0.385991
3,-0.601547,Negative,-0.193211,2.122562
4,0.0,Negative,0.484125,0.0


In [36]:
pd.get_dummies( categorical_df, columns=['B'] ).head()

Unnamed: 0,A,C,D,B_0,B_Negative,B_Positive
0,0.478231,0.0,0.633979,0,0,1
1,0.0,-0.929868,0.281903,0,0,1
2,0.31826,3.175487,0.385991,0,1,0
3,-0.601547,-0.193211,2.122562,0,1,0
4,0.0,0.484125,0.0,0,1,0


In [37]:
categorical_df = pd.get_dummies( categorical_df, columns=['B'] )
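For comparison, here is a more compact way to build sign labels and their indicator columns. It is a sketch on a made-up dataframe (`values_df`) and uses `np.where`, which is a swapped-in technique and not the approach taken in the cells above.

```python
import numpy as np
import pandas as pd

values_df = pd.DataFrame({"B": [-0.95, -0.11, 0.42, 0.69]})

# Label each value by its sign in one vectorized step.
values_df["B_sign"] = np.where(values_df["B"] > 0, "Positive", "Negative")

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(values_df, columns=["B_sign"], dtype=int)

print(dummies.columns.tolist())
# ['B', 'B_sign_Negative', 'B_sign_Positive']
```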

## 4. Pandas Operations

We can perform basic descriptive statistics using .mean(), .min(), .max() or .describe().
When you see a homework question starting with "Describe your data set", it is good practice to first run the .describe() command.

In [38]:
sample_df2.mean(numeric_only=True)  # skip column B, which now holds strings

A    0.144476
C    0.818609
D    0.531015
dtype: float64

In [39]:
sample_df2.min(numeric_only=True)  # skip column B, which now holds strings

A   -0.656836
C   -0.929868
D   -0.964378
dtype: float64

In [40]:
sample_df2.max(numeric_only=True)  # skip column B, which now holds strings

A    2.575139
C    3.175487
D    2.122562
dtype: float64

In [41]:
sample_df2.describe()

Unnamed: 0,A,C,D
count,10.0,9.0,11.0
mean,0.144476,0.818609,0.531015
std,0.958821,1.173128,0.982383
min,-0.656836,-0.929868,-0.964378
25%,-0.465614,0.285758,-0.032399
50%,-0.059214,0.801501,0.281903
75%,0.354792,1.074198,1.090275
max,2.575139,3.175487,2.122562


## 5. Dataframe Merging

Because life is difficult, you will end up merging various data sets in preparation for an ML application. Merging is never easy, always messy, and seldom successful on the first try.

### Concatenating dataframes

In [42]:
sample_df

Unnamed: 0,A,B,C,D
0,0.478231,-0.950501,-1.613626,0.633979
1,-1.841121,-0.106668,-0.929868,0.281903
2,0.31826,0.416277,3.175487,0.385991
3,-0.601547,0.69019,-0.193211,2.122562
4,-1.456766,0.193964,0.484125,-1.098988
5,0.36697,1.895161,1.770509,-0.362223
6,0.228527,-0.443287,0.801501,1.54657
7,2.575139,-1.851373,-1.061542,0.259036
8,-0.472712,-1.545774,-1.725854,-0.237869
9,-0.656836,2.144947,1.074198,2.002524


In [43]:
new_df = pd.concat(sample_df, sample_df2)

TypeError: first argument must be an iterable of pandas objects, you passed an object of type "DataFrame"

In [44]:
type(sample_df)

pandas.core.frame.DataFrame

🤦🏻‍♀️ I told you never easy!

In [45]:
new_df = pd.concat([sample_df, sample_df2])

In [46]:
new_df.head()

Unnamed: 0,A,B,C,D
0,0.478231,-0.950501,-1.613626,0.633979
1,-1.841121,-0.106668,-0.929868,0.281903
2,0.31826,0.416277,3.175487,0.385991
3,-0.601547,0.69019,-0.193211,2.122562
4,-1.456766,0.193964,0.484125,-1.098988


In [47]:
new_df.tail()

Unnamed: 0,A,B,C,D
7,2.575139,,,0.259036
8,-0.472712,,,-0.237869
9,-0.656836,Negative,1.074198,2.002524
10,-0.444319,Negative,0.898986,0.173071
11,-0.346956,,0.285758,-0.964378


What did concat do?
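A hint, sketched with two tiny made-up dataframes (`top` and `bottom`): concat stacks the rows of the second dataframe under the first and keeps each dataframe's original index labels, so labels can repeat. Passing `ignore_index=True` renumbers the rows.

```python
import pandas as pd

top = pd.DataFrame({"A": [1, 2]})
bottom = pd.DataFrame({"A": [3, 4]})

# Rows are stacked and the original index labels are kept.
stacked = pd.concat([top, bottom])

# ignore_index=True builds a fresh 0..n-1 index.
renumbered = pd.concat([top, bottom], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```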

### Merging the dataframes

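Merging is a directional way of combining dataframes: rows from a left and a right dataframe are matched on the values of a shared key column and placed side by side.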

In [48]:
left = pd.DataFrame({"key": ["val", "val"], "left_val": [1, 2]})

In [49]:
left

Unnamed: 0,key,left_val
0,val,1
1,val,2


In [50]:
right = pd.DataFrame({"key": ["val", "val"], "right_val": [3, 4]})

In [51]:
right

Unnamed: 0,key,right_val
0,val,3
1,val,4


In [52]:
pd.merge(left, right, on="key")

Unnamed: 0,key,left_val,right_val
0,val,1,3
1,val,1,4
2,val,2,3
3,val,2,4


In [53]:
left = pd.DataFrame({"key": ["val1", "val2"], "left_val": [1, 2]})

In [54]:
left

Unnamed: 0,key,left_val
0,val1,1
1,val2,2


In [55]:
right = pd.DataFrame({"key": ["val1", "val2"], "right_val": [3, 4]})

In [56]:
right

Unnamed: 0,key,right_val
0,val1,3
1,val2,4


In [57]:
pd.merge(left, right, on="key")

Unnamed: 0,key,left_val,right_val
0,val1,1,3
1,val2,2,4


Logical, right? Let's see what concat would do.

In [58]:
pd.concat([left,right])

Unnamed: 0,key,left_val,right_val
0,val1,1.0,
1,val2,2.0,
0,val1,,3.0
1,val2,,4.0


#nevereasy #alwaysmessy #seldomsuccessful.
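The merge calls above rely on the default inner join, which keeps only the keys present in both dataframes. When the keys do not line up, the `how` and `indicator` parameters of `pd.merge` help to see what happened. The sketch below redefines `left` and `right` with deliberately mismatched keys, so treat the data as hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["val1", "val2"], "left_val": [1, 2]})
right = pd.DataFrame({"key": ["val2", "val3"], "right_val": [3, 4]})

# Inner join (the default): only keys found in both dataframes survive.
inner = pd.merge(left, right, on="key")

# Outer join: keys found in either dataframe; indicator adds a _merge column
# that records where each row came from.
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

print(inner["key"].tolist())  # ['val2']
print(outer[["key", "_merge"]])
```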

## 6. Dataframe Grouping

One of the best features of pandas dataframes is grouping: split the rows into groups by the values of one or more columns, then summarize each group.

In [59]:
sample_df3 = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
    "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
    "C": np.random.randn(8),
    "D": np.random.randn(8),
})

In [60]:
sample_df3.groupby("A").median(numeric_only=True)  # B holds strings, so skip it

A,C,D
bar,-0.116556,-0.889811
foo,-1.119706,-0.701799


In [61]:
sample_df3.groupby("A").sum(numeric_only=True)  # B holds strings, so skip it

A,C,D
bar,0.546123,-2.93452
foo,-1.451459,-1.285673


In [62]:
sample_df3.groupby(["A","B"]).sum()

A,B,C,D
bar,one,-0.593487,-1.635066
bar,three,1.256167,-0.889811
bar,two,-0.116556,-0.409644
foo,one,-0.060919,-1.560409
foo,three,-1.119706,0.842827
foo,two,-0.270834,-0.568091
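Several summaries can be computed in one pass with `.agg()`. The sketch below uses named aggregation on a small made-up dataframe (`sales`); the output column names `c_mean`, `d_total` and `rows` are arbitrary choices for this example.

```python
import pandas as pd

sales = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10, 20, 30, 40],
})

# Named aggregation: new_column=(source_column, function).
summary = sales.groupby("A").agg(
    c_mean=("C", "mean"),
    d_total=("D", "sum"),
    rows=("C", "count"),
)

print(summary)
```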


## 7. Pandas with a Real Data Set

In [63]:
# Read the iris dataset from a CSV file
iris_df = pd.read_csv('./iris.csv')

# Print data type for df
print( type(iris_df) )

<class 'pandas.core.frame.DataFrame'>


Let's further explore what is in this data set.

In [64]:
iris_df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [65]:
iris_df.shape

(150, 5)

In [66]:
iris_df.min()

sepal_length       4.3
sepal_width          2
petal_length         1
petal_width        0.1
species         setosa
dtype: object

In [67]:
iris_df.max()

sepal_length          7.9
sepal_width           4.4
petal_length          6.9
petal_width           2.5
species         virginica
dtype: object

In [68]:
iris_df.describe()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


In [69]:
# First 5 values of petal length
print(iris_df.petal_length.head())

0    1.4
1    1.4
2    1.3
3    1.5
4    1.4
Name: petal_length, dtype: float64


In [70]:
# Create new sepal area feature
iris_df['sepal_area'] = iris_df.sepal_width * iris_df.sepal_length

In [71]:
iris_df['sepal_area'].head()

0    17.85
1    14.70
2    15.04
3    14.26
4    18.00
Name: sepal_area, dtype: float64

### Filtering, Segmentation and Aggregation

Let's create a mask: a boolean Series that is True for every row whose sepal width is greater than 3.

In [72]:
sepal_width_mask = iris_df.sepal_width > 3.0

In [73]:
iris_df[sepal_width_mask]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species,sepal_area
0,5.1,3.5,1.4,0.2,setosa,17.85
2,4.7,3.2,1.3,0.2,setosa,15.04
3,4.6,3.1,1.5,0.2,setosa,14.26
4,5.0,3.6,1.4,0.2,setosa,18.00
5,5.4,3.9,1.7,0.4,setosa,21.06
...,...,...,...,...,...,...
140,6.7,3.1,5.6,2.4,virginica,20.77
141,6.9,3.1,5.1,2.3,virginica,21.39
143,6.8,3.2,5.9,2.3,virginica,21.76
144,6.7,3.3,5.7,2.5,virginica,22.11


In [74]:
iris_df[sepal_width_mask].min()

sepal_length       4.4
sepal_width        3.1
petal_length         1
petal_width        0.1
species         setosa
sepal_area       14.08
dtype: object
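Masks can be combined and negated, and summing a mask counts the rows it selects. The sketch below does not read iris.csv; it retypes the first five rows shown by iris_df.head() into a hypothetical `mini_iris` dataframe so that it runs on its own.

```python
import pandas as pd

# First five rows of the iris data shown earlier, typed in by hand.
mini_iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "species": ["setosa"] * 5,
})

wide_mask = mini_iris.sepal_width > 3.0
long_mask = mini_iris.sepal_length >= 5.0

print(mini_iris[wide_mask & long_mask].index.tolist())  # [0, 4]
print(mini_iris[~wide_mask].index.tolist())             # [1]
print(int(wide_mask.sum()))                             # 4 rows pass the mask
```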

In [75]:
iris_df.groupby('species').median()

species,sepal_length,sepal_width,petal_length,petal_width,sepal_area
setosa,5.0,3.4,1.5,0.2,17.17
versicolor,5.9,2.8,4.35,1.3,16.385
virginica,6.5,3.0,5.55,2.0,20.06


👏 Congratulations, you have completed the Pandas Workbook!