# Introduction to Pandas
Pandas is the de facto standard Python library for working with tabular, relational-style data. The basic object in pandas is the DataFrame, "a two-dimensional tabular, column-oriented data structure with both row and column labels."

In [1]:
import pandas as pd
import numpy as np
import random

# Loading Data Into Pandas 
Before doing anything else, we need to know how to load our data into memory. The most common file format for relational data is the csv. Pandas includes a built-in function, pd.read_csv(), that loads data from a csv file into a dataframe.

In [2]:
df = pd.read_csv('boston_housing.csv')
print('dataframe shape: ', df.shape)
df.head()

dataframe shape:  (506, 14)


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


We can take a peek at the first 5 rows of the dataframe using the method DataFrame.head()

In [3]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


We can verify that the number of rows and columns matches our expectations using the attribute DataFrame.shape

In [4]:
df.shape

(506, 14)

We can get the number of rows in our dataframe using the function len()

In [5]:
len(df)

506

## Common pitfalls loading data into pandas

### Specifying delimiters
Sometimes csvs use a character other than a comma as a separator, such as a semicolon or pipe. If we read such a file with the default settings, pandas won't raise an error, but it will put each row into a single column.

In [6]:
delimdf = pd.read_csv('boston_housing_delimiter.csv')
print(delimdf.shape)
delimdf.head()

(506, 1)


Unnamed: 0,CRIM;ZN;INDUS;CHAS;NOX;RM;AGE;DIS;RAD;TAX;PTRATIO;B;LSTAT;MEDV
0,0.00632;18.0;2.31;0.0;0.5379999999999999;6.575...
1,0.02731;0.0;7.07;0.0;0.469;6.421;78.9;4.9671;2...
2,0.02729;0.0;7.07;0.0;0.469;7.185;61.1;4.9671;2...
3,0.03237;0.0;2.18;0.0;0.458;6.997999999999998;4...
4,0.06905;0.0;2.18;0.0;0.458;7.147;54.2;6.0622;3...


Notice that we still have 506 rows but only one column. If we look at the values, we see that they are separated by semicolons. The function pd.read_csv() has an argument sep with a default value of ',', so we need to pass sep=';' to split the columns correctly.


In [7]:
delimdf = pd.read_csv('boston_housing_delimiter.csv', sep=';')
print(delimdf.shape)
delimdf.head()

(506, 14)


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


### Encoding errors
Most csvs you'll encounter use UTF-8 encoding, which is the default for the pd.read_csv() function. However, sometimes you'll get a UnicodeDecodeError because the file was saved with a different encoding.

In [8]:
encdf = pd.read_csv('boston_housing_encoding.csv', encoding='latin1')
print(encdf.shape)
encdf.head()

# this didn't give the error that I expected
# I'll either find a csv with a different encoding or skip the example for this section

(506, 14)


Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


Encoding errors can be fixed by specifying the encoding argument in the pd.read_csv() function. After 'utf-8', the next most common encoding in my experience is 'latin1', which is especially common in government documents. After that, good luck.
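Since the file above loaded without complaint, here is a minimal sketch of what an encoding error looks like and how the encoding argument gets around it. The two-row table and its column names are made up for illustration, and an in-memory buffer stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up data: in latin1, 'é' is the single byte 0xE9, which is not valid UTF-8 here.
raw_bytes = "city,price\nMontréal,10\nQuébec,12\n".encode("latin1")

utf8_failed = False
try:
    pd.read_csv(io.BytesIO(raw_bytes))  # default encoding is UTF-8
except UnicodeDecodeError as err:
    utf8_failed = True
    print("Could not decode as UTF-8:", err.reason)

# Naming the actual encoding lets pandas decode the bytes correctly.
latin_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin1")
print(latin_df)
```

With a real file you would pass the file path instead of the buffer; the encoding argument works the same way.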

More info on encoding: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

## Loading data from a url

You can point pd.read_csv() to a URL instead of a file on your computer by passing the URL string where you would normally pass the file path.

## Other ways to read data

If you have data in another common file type, there is likely a pandas function for reading it into a dataframe, for example pd.read_excel() and pd.read_json().
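As a small sketch of one of these readers, the example below parses JSON records with pd.read_json(). The records are made up, and an in-memory text buffer stands in for a .json file.

```python
import io
import pandas as pd

# Made-up records; in practice this text would live in a .json file.
json_text = (
    '[{"zip_code": 10001, "population": 101200},'
    ' {"zip_code": 10002, "population": 98700}]'
)

# Each JSON object becomes a row and each key becomes a column.
json_df = pd.read_json(io.StringIO(json_text))
print(json_df)
```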

# Slicing Data in Pandas

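Each column in a dataframe is a Series, a one-dimensional labeled array that is similar to a numpy array. To get a specific column from a dataframe, the syntax is dataframe[column_name] or dataframe.column_name. The attribute notation only works when the column name is a valid Python identifier and doesn't clash with an existing DataFrame attribute or method.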

In [9]:
print(type(df['MEDV']))
df.MEDV.head()

<class 'pandas.core.series.Series'>


0    24.0
1    21.6
2    34.7
3    33.4
4    36.2
Name: MEDV, dtype: float64

Multiple columns can be selected using a list.

In [10]:
df[['AGE', 'DIS', 'RAD', 'TAX']].head()

Unnamed: 0,AGE,DIS,RAD,TAX
0,65.2,4.09,1.0,296.0
1,78.9,4.9671,2.0,242.0
2,61.1,4.9671,2.0,242.0
3,45.8,6.0622,3.0,222.0
4,54.2,6.0622,3.0,222.0


To filter a dataframe down to the rows that meet a specific condition, we pass it a Boolean series. For example, first we build a Boolean series that is True wherever the value in the AGE column is greater than 95.

In [11]:
df.AGE > 95

0      False
1      False
2      False
3      False
4      False
       ...  
501    False
502    False
503    False
504    False
505    False
Name: AGE, Length: 506, dtype: bool

We can pass this series into the dataframe like we did the list of columns.

In [12]:
df[df.AGE > 95].head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
7,0.14455,12.5,7.87,0.0,0.524,6.172,96.1,5.9505,5.0,311.0,15.2,396.9,19.15,27.1
8,0.21124,12.5,7.87,0.0,0.524,5.631,100.0,6.0821,5.0,311.0,15.2,386.63,29.93,16.5
20,1.25179,0.0,8.14,0.0,0.538,5.57,98.1,3.7979,4.0,307.0,21.0,376.57,21.02,13.6
23,0.98843,0.0,8.14,0.0,0.538,5.813,100.0,4.0952,4.0,307.0,21.0,394.54,19.88,14.5
31,1.35472,0.0,8.14,0.0,0.538,6.072,100.0,4.175,4.0,307.0,21.0,376.73,13.04,14.5


If we want only the MEDV of the rows that have an AGE > 95, we can select that column after filtering.

In [13]:
df[df.AGE > 95].MEDV.head()

7     27.1
8     16.5
20    13.6
23    14.5
31    14.5
Name: MEDV, dtype: float64

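You can compare a column to a constant, to another column, or to a computed value, and combine conditions using & (and) and | (or). Wrap each condition in parentheses, because & and | bind more tightly than comparison operators like == and >.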

In [14]:
df[(df.RAD == 5.0) & (df.DIS > df.DIS.mean())].head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
6,0.08829,12.5,7.87,0.0,0.524,6.012,66.6,5.5605,5.0,311.0,15.2,395.6,12.43,22.9
7,0.14455,12.5,7.87,0.0,0.524,6.172,96.1,5.9505,5.0,311.0,15.2,396.9,19.15,27.1
8,0.21124,12.5,7.87,0.0,0.524,5.631,100.0,6.0821,5.0,311.0,15.2,386.63,29.93,16.5
9,0.17004,12.5,7.87,0.0,0.524,6.004,85.9,6.5921,5.0,311.0,15.2,386.71,17.1,18.9
10,0.22489,12.5,7.87,0.0,0.524,6.377,94.3,6.3467,5.0,311.0,15.2,392.52,20.45,15.0
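The cell above combines conditions with &. A similar sketch for | (or) and ~ (not) is shown here on a small made-up dataframe, so the values are illustrative rather than taken from the housing data.

```python
import pandas as pd

# Toy data, made up for illustration.
homes = pd.DataFrame({
    "AGE": [65.2, 96.1, 100.0, 45.8],
    "RAD": [1.0, 5.0, 5.0, 3.0],
})

# Rows where either condition holds; each condition sits in its own parentheses.
old_or_rad3 = homes[(homes.AGE > 95) | (homes.RAD == 3.0)]

# ~ negates a Boolean series: rows where AGE is NOT greater than 95.
not_old = homes[~(homes.AGE > 95)]

print(old_or_rad3)
print(not_old)
```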


# Useful Methods on Dataframes

DataFrame.describe() provides useful summary statistics about each of the columns in the dataframe. DataFrame.info() provides information about data types and non-null counts.

In [15]:
df.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB


It's often helpful to have a list of columns in your dataframe, for example to iterate over.

In [17]:
df.columns

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT', 'MEDV'],
      dtype='object')

In [18]:
for column_name in df.columns:
    print(column_name, df[column_name].sum())

CRIM 1828.4429200000004
ZN 5750.0
INDUS 5635.209999999999
CHAS 35.0
NOX 280.6757
RM 3180.025
AGE 34698.9
DIS 1920.2916
RAD 4832.0
TAX 206568.0
PTRATIO 9338.5
B 180477.06000000003
LSTAT 6402.450000000001
MEDV 11401.600000000002


Data often comes with column names that are either hard to understand or too long to consistently type out. DataFrame.rename() lets us rename the columns. There are a couple of ways to use it, but the easiest to me is to pass the columns argument a dictionary with the old name as a key and the new name as a value. This also lets us practice dictionary comprehension.

In [19]:
df.rename(columns = {key:key.lower() for key in df.columns}).head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,b,lstat,medv
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


Methods like rename usually return a new dataframe. Above, the renamed dataframe was returned and displayed, but our original dataframe df was not actually altered. To make the change permanent, we can set df equal to the returned dataframe or we can use the argument inplace = True.
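A minimal sketch of the difference, using a made-up one-row dataframe rather than the housing data:

```python
import pandas as pd

# Toy dataframe, made up for illustration.
toy = pd.DataFrame({"CRIM": [0.00632], "ZN": [18.0]})

# The renamed dataframe is returned but thrown away, so toy is unchanged.
toy.rename(columns={"CRIM": "crim"})
columns_after_discard = list(toy.columns)

# Option 1: assign the returned dataframe back to the same name.
toy = toy.rename(columns={"CRIM": "crim"})
columns_after_assign = list(toy.columns)

# Option 2: modify the dataframe in place (the call then returns None).
toy.rename(columns={"ZN": "zn"}, inplace=True)
columns_after_inplace = list(toy.columns)

print(columns_after_discard)  # ['CRIM', 'ZN']
print(columns_after_assign)   # ['crim', 'ZN']
print(columns_after_inplace)  # ['crim', 'zn']
```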

It's sometimes helpful to make an independent copy of a dataframe with DataFrame.copy(), so that later changes to one don't affect the other.

When you make changes to a dataframe and want to save it to an external file, you can use DataFrame.to_csv(filepath)
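The sketch below illustrates both ideas on a made-up dataframe. It writes to an in-memory buffer instead of a real file path so that it runs anywhere; passing index=False is a choice made here so that the row labels are not written out as an extra column.

```python
import io
import pandas as pd

# Toy data, made up for illustration.
original = pd.DataFrame({"RM": [6.575, 6.421], "MEDV": [24.0, 21.6]})

# copy() gives an independent dataframe; changing original leaves backup alone.
backup = original.copy()
original["MEDV"] = original["MEDV"] + 1

# to_csv accepts a file path or a file-like object; here, an in-memory buffer.
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Read the saved text back to confirm the round trip.
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(backup)
print(reloaded)
```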

# Useful Methods on Series

Most of the methods that do the actual heavy lifting are on Series rather than dataframes.  Descriptive statistics like sum, mean, median, max are all very straightforward.

In [15]:
df.MEDV.median()

21.2

Series.unique() returns an array with each unique value in the series.

In [21]:
df.RAD.unique()

array([ 1.,  2.,  3.,  5.,  4.,  8.,  6.,  7., 24.])

Series.value_counts() returns a series with the value as an index and the number of times that value occurs in the column as the value

In [98]:
df.RAD.value_counts()

24.0    132
5.0     115
4.0     110
3.0      38
6.0      26
8.0      24
2.0      24
1.0      20
7.0      17
Name: RAD, dtype: int64

Series.sort_values() returns a sorted series. Dataframes have a sort_values() method too: use the argument 'by' to specify the column on which to sort the rows of the dataframe.

In [105]:
df.MEDV.sort_values().head()

398    5.0
405    5.0
400    5.6
399    6.3
414    7.0
Name: MEDV, dtype: float64

In [106]:
df.sort_values(by = 'MEDV').head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
398,38.3518,0.0,18.1,0.0,0.693,5.453,100.0,1.4896,24.0,666.0,20.2,396.9,30.59,5.0
405,67.9208,0.0,18.1,0.0,0.693,5.683,100.0,1.4254,24.0,666.0,20.2,384.97,22.98,5.0
400,25.0461,0.0,18.1,0.0,0.693,5.987,100.0,1.5888,24.0,666.0,20.2,396.9,26.77,5.6
399,9.91655,0.0,18.1,0.0,0.693,5.852,77.8,1.5004,24.0,666.0,20.2,338.16,29.97,6.3
414,45.7461,0.0,18.1,0.0,0.693,4.519,100.0,1.6582,24.0,666.0,20.2,88.27,36.98,7.0


Series.duplicated() returns a Boolean series that marks values which have already appeared earlier in the series. This is useful, for example, when you want to check if a person's name appears twice in your data.
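A short sketch with made-up names. By default only the repeat occurrences are flagged; keep=False flags every occurrence of a repeated value.

```python
import pandas as pd

# Toy data, made up for illustration.
names = pd.Series(["Ana", "Ben", "Ana", "Cy"])

# True only for the second and later occurrences of a value.
repeat_flags = names.duplicated()

# True for every occurrence of a value that appears more than once.
all_flags = names.duplicated(keep=False)

print(repeat_flags.tolist())        # [False, False, True, False]
print(all_flags.tolist())           # [True, False, True, False]
print(names[all_flags].tolist())    # ['Ana', 'Ana']
```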

## Changing, Adding, and Dropping Columns

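You can change a column in your dataframe simply by assigning to it with '='.

When changing columns or adding new columns, use df[column_name] on the left-hand side rather than the df.column_name notation. Assigning to df.column_name for a column that doesn't exist yet does not create the column: pandas issues a warning and sets a plain attribute on the object instead.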

In [26]:
df['RAD'] = df.RAD.astype(int)
df.RAD.head()

0    1
1    2
2    2
3    3
4    3
Name: RAD, dtype: int64

In [27]:
df['RAD'] = df.RAD.astype(float)

You can add a new column to your dataframe simply by setting dataframe[new_column_name] = new_column

In [28]:
df['int_RAD'] = df.RAD.astype(int)

In [29]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,int_RAD
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0,1
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,2
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7,2
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4,3
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,3
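To see why the bracket notation matters when adding a column, here is a minimal sketch on a made-up dataframe. The warning is silenced only to keep the output clean.

```python
import warnings
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"RAD": [1.0, 2.0, 3.0]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas emits a UserWarning on the next line
    toy.int_rad = [1, 2, 3]          # sets a plain Python attribute, not a column

created_by_attribute = "int_rad" in toy.columns

# Bracket notation actually creates the column.
toy["int_rad"] = toy["RAD"].astype(int)
created_by_brackets = "int_rad" in toy.columns

print(created_by_attribute)  # False
print(created_by_brackets)   # True
```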


You can delete an unwanted column by using df.drop(columns = [column_name1, column_name2])

In [30]:
df = df.drop(columns = 'int_RAD')

In [31]:
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


## Apply

Some of the most powerful data transformations we can do make use of the apply method.  Apply is a very robust method that can do all sorts of things, but in a nutshell it takes a function and applies it to each element of a Series.  Apply can be used with functions you define or with lambda functions defined in place.

In [43]:
# creating a dummy variable for if the row has a MEDV greater than the mean MEDV value
medv_mean = df.MEDV.mean()  # compute the mean once instead of once per row
df['MEDV_greater_than_mean'] = df.MEDV.apply(lambda x: 1 if x > medv_mean else 0)
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,MEDV_greater_than_mean
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0,1
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,0
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7,1
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4,1
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,1


In [45]:
df = df.drop(columns='MEDV_greater_than_mean')

You can call apply on a whole dataframe, using the axis argument to specify if you want the function to be applied to the rows or the columns.
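A small sketch of both directions on a made-up dataframe: with axis=0 the function receives each column as a Series, and with axis=1 it receives each row as a Series.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"RM": [6.0, 7.0], "AGE": [65.0, 80.0]})

# axis=0: the function is called once per column.
column_max = toy.apply(lambda col: col.max(), axis=0)

# axis=1: the function is called once per row; values are looked up by column name.
row_total = toy.apply(lambda row: row["RM"] + row["AGE"], axis=1)

print(column_max)
print(row_total.tolist())  # [71.0, 87.0]
```

For simple arithmetic like this, operating on whole columns (toy["RM"] + toy["AGE"]) gives the same result and is usually faster; row-wise apply is more useful when the logic doesn't fit a column expression.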

## Groupby

Like apply, groupby is a robust method that can do a lot. Roughly speaking, it allows you to group a dataframe by the values in a column and apply a function to each of those groups.

In [41]:
# by itself groupby just returns a groupby object
df.groupby(by = 'RAD')

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x7f1e6247dcc0>

In [42]:
# but we can call methods on this groupby object like it's a dataframe
df.groupby(by='RAD').MEDV.mean()

RAD
1.0     24.365000
2.0     26.833333
3.0     27.928947
4.0     21.387273
5.0     25.706957
6.0     20.976923
7.0     27.105882
8.0     30.358333
24.0    16.403788
Name: MEDV, dtype: float64
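If you want more than one statistic per group, one option is the agg method on the grouped column. The sketch below uses a made-up three-row dataframe, not the housing data.

```python
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({"RAD": [1, 1, 2], "MEDV": [20.0, 30.0, 10.0]})

# One row per RAD value, one column per requested statistic.
summary = toy.groupby(by="RAD")["MEDV"].agg(["mean", "count"])
print(summary)
```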

## Dealing with Missing Values

Real world data sets often come with missing values and it's important to be able to deal with these.

First, let's introduce some missing values into our data.

In [69]:
zip_codes = [10001, 10002, 10003, 10004, 10005, np.nan]
df['zip_code'] = pd.Series([random.choice(zip_codes) for i in df.index])
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,zip_code
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0,10005.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,10001.0
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7,10004.0
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4,10002.0
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,


If we call df.info() or df.describe() now, the non-null counts reveal the missing values: zip_code has only 419 non-null entries out of 506 rows.

In [70]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 15 columns):
CRIM        506 non-null float64
ZN          506 non-null float64
INDUS       506 non-null float64
CHAS        506 non-null float64
NOX         506 non-null float64
RM          506 non-null float64
AGE         506 non-null float64
DIS         506 non-null float64
RAD         506 non-null float64
TAX         506 non-null float64
PTRATIO     506 non-null float64
B           506 non-null float64
LSTAT       506 non-null float64
MEDV        506 non-null float64
zip_code    419 non-null float64
dtypes: float64(15)
memory usage: 59.4 KB


In [56]:
df.describe()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
count,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0,506.0
mean,3.613524,11.363636,11.136779,0.06917,0.554695,6.284634,68.574901,3.795043,9.549407,408.237154,18.455534,356.674032,12.653063,22.532806
std,8.601545,23.322453,6.860353,0.253994,0.115878,0.702617,28.148861,2.10571,8.707259,168.537116,2.164946,91.294864,7.141062,9.197104
min,0.00632,0.0,0.46,0.0,0.385,3.561,2.9,1.1296,1.0,187.0,12.6,0.32,1.73,5.0
25%,0.082045,0.0,5.19,0.0,0.449,5.8855,45.025,2.100175,4.0,279.0,17.4,375.3775,6.95,17.025
50%,0.25651,0.0,9.69,0.0,0.538,6.2085,77.5,3.20745,5.0,330.0,19.05,391.44,11.36,21.2
75%,3.677083,12.5,18.1,0.0,0.624,6.6235,94.075,5.188425,24.0,666.0,20.2,396.225,16.955,25.0
max,88.9762,100.0,27.74,1.0,0.871,8.78,100.0,12.1265,24.0,711.0,22.0,396.9,37.97,50.0


The method Series.isnull() returns a boolean series with True if the value is missing.

In [71]:
df.zip_code.isnull().head()

0    False
1    False
2    False
3    False
4     True
Name: zip_code, dtype: bool

We can count the number of null values in a column by just summing the boolean series, since True counts as 1 and False as 0.

In [72]:
df.zip_code.isnull().sum()

87

We can subset our dataframe to investigate the rows with missing values (in this case they are random but sometimes this might give you insight)

In [73]:
df[df.zip_code.isnull()].head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,zip_code
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,
5,0.02985,0.0,2.18,0.0,0.458,6.43,58.7,6.0622,3.0,222.0,18.7,394.12,5.21,28.7,
11,0.11747,12.5,7.87,0.0,0.524,6.009,82.9,6.2267,5.0,311.0,15.2,396.9,13.27,18.9,
24,0.75026,0.0,8.14,0.0,0.538,5.924,94.1,4.3996,4.0,307.0,21.0,394.33,16.3,15.6,
28,0.77299,0.0,8.14,0.0,0.538,6.495,94.4,4.4547,4.0,307.0,21.0,387.94,12.8,18.4,


Many functions include arguments to specify what to do with nan values.

In [74]:
df.zip_code.value_counts()

10005.0    93
10003.0    87
10002.0    81
10004.0    79
10001.0    79
Name: zip_code, dtype: int64

In [75]:
df.zip_code.value_counts(dropna = False)

 10005.0    93
 10003.0    87
NaN         87
 10002.0    81
 10004.0    79
 10001.0    79
Name: zip_code, dtype: int64

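The default value of such arguments varies from function to function (value_counts() drops NaN unless told otherwise), so check it; otherwise missing values can silently lead to wrong results.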

If we don't want the rows with missing values in our dataframe, we can drop them entirely using DataFrame.dropna()
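A minimal sketch of dropna on a made-up dataframe. Called with no arguments it drops a row if any column is missing; the subset argument restricts the check to the columns you name.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration.
toy = pd.DataFrame({
    "MEDV": [24.0, 21.6, 34.7],
    "RM": [6.575, np.nan, 7.185],
    "zip_code": [10005.0, 10001.0, np.nan],
})

# Drop rows with a missing value in any column.
complete_rows = toy.dropna()

# Drop rows only when zip_code is missing; a missing RM is tolerated.
has_zip = toy.dropna(subset=["zip_code"])

print(len(toy), len(complete_rows), len(has_zip))  # 3 1 2
```

Like rename, dropna returns a new dataframe, so assign the result if you want to keep it.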

## Creating New Dataframes

Sometimes you need to create a dataframe from scratch.  My preferred way to do this is by using the pd.DataFrame.from_dict() function.  Let's say we look up information about population by zipcode. 

In [87]:
zip_dict = {'zip_code':[10001, 10002, 10003, 10004, 10005], 'population':[101200, 98700, 67500, 12900, 79400],
           'area':[10, 11, 12, 13, 14]}
zip_dict['pop_density'] = [round(i/j, 1) for i, j in zip(zip_dict['population'], zip_dict['area'])]
zip_df = pd.DataFrame.from_dict(zip_dict, orient = 'columns')
zip_df

Unnamed: 0,zip_code,population,area,pop_density
0,10001,101200,10,10120.0
1,10002,98700,11,8972.7
2,10003,67500,12,5625.0
3,10004,12900,13,992.3
4,10005,79400,14,5671.4


## Merging Dataframes

While SQL is much more efficient for doing joins if all of your data is available in a database, sometimes you want to join data from two separate sources. In this case, doing the join in python will be easier than trying to create a new database. We can do joins in python using the DataFrame.merge() method. The important arguments are 'right' -- the data you want to join -- 'how' -- the type of join -- and 'on' -- the column on which to join. If the key column has the same name in both dataframes, pass it to 'on'; otherwise specify the two names using 'left_on' and 'right_on'.

In [88]:
df.merge(right = zip_df, how = 'inner', on='zip_code')

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV,zip_code,population,area,pop_density
0,0.00632,18.0,2.31,0.0,0.5380,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98,24.0,10005.0,79400,14,5671.4
1,0.08829,12.5,7.87,0.0,0.5240,6.012,66.6,5.5605,5.0,311.0,15.2,395.60,12.43,22.9,10005.0,79400,14,5671.4
2,0.14455,12.5,7.87,0.0,0.5240,6.172,96.1,5.9505,5.0,311.0,15.2,396.90,19.15,27.1,10005.0,79400,14,5671.4
3,0.22489,12.5,7.87,0.0,0.5240,6.377,94.3,6.3467,5.0,311.0,15.2,392.52,20.45,15.0,10005.0,79400,14,5671.4
4,0.80271,0.0,8.14,0.0,0.5380,5.456,36.6,3.7965,4.0,307.0,21.0,288.99,11.69,20.2,10005.0,79400,14,5671.4
5,1.23247,0.0,8.14,0.0,0.5380,6.142,91.7,3.9769,4.0,307.0,21.0,396.90,18.72,15.2,10005.0,79400,14,5671.4
6,0.67191,0.0,8.14,0.0,0.5380,5.813,90.3,4.6820,4.0,307.0,21.0,376.88,14.81,16.6,10005.0,79400,14,5671.4
7,1.00245,0.0,8.14,0.0,0.5380,6.674,87.3,4.2390,4.0,307.0,21.0,380.23,11.98,21.0,10005.0,79400,14,5671.4
8,0.17142,0.0,6.91,0.0,0.4480,5.682,33.8,5.1004,3.0,233.0,17.9,396.90,10.21,19.3,10005.0,79400,14,5671.4
9,0.22927,0.0,6.91,0.0,0.4480,6.030,85.5,5.6894,3.0,233.0,17.9,392.74,18.80,16.6,10005.0,79400,14,5671.4


In [91]:
df = df.merge(right = zip_df, how = 'left', on='zip_code')
df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,...,B,LSTAT,MEDV,zip_code,population_x,area_x,pop_density_x,population_y,area_y,pop_density_y
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,...,396.9,4.98,24.0,10005.0,79400.0,14.0,5671.4,79400.0,14.0,5671.4
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,...,396.9,9.14,21.6,10001.0,101200.0,10.0,10120.0,101200.0,10.0,10120.0
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,...,392.83,4.03,34.7,10004.0,12900.0,13.0,992.3,12900.0,13.0,992.3
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,...,394.63,2.94,33.4,10002.0,98700.0,11.0,8972.7,98700.0,11.0,8972.7
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,...,396.9,5.33,36.2,,,,,,,


In [93]:
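# The _x and _y suffixes show that df already contained population, area, and pop_density when this cell ran, most likely because the cell had been run once before. merge itself is not inplace, but reassigning df keeps each result, so running the cell again merges the same columns a second time. I'll leave it since it's illustrative.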
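When the key column is named differently in the two dataframes, 'left_on' and 'right_on' take the place of 'on'. The sketch below uses two made-up dataframes; the column name 'zip' is chosen here only to create the mismatch.

```python
import pandas as pd

# Toy data, made up for illustration.
homes = pd.DataFrame({
    "zip_code": [10001, 10002, 10009],
    "MEDV": [24.0, 21.6, 34.7],
})
zips = pd.DataFrame({
    "zip": [10001, 10002],
    "population": [101200, 98700],
})

# A left join keeps every row of homes; rows without a match get NaN.
merged = homes.merge(right=zips, how="left", left_on="zip_code", right_on="zip")

print(merged)
print(merged["population"].isnull().sum())  # 1 (zip code 10009 has no match)
```

Both key columns are kept in the result, so you may want to drop one of them afterwards.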