# Pandas II - Data Cleaning

_May 13, 2020_

Agenda today:
- Introduction to lambda functions
- Introduction to data cleaning in pandas

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Part I. Lambda function
Lambda functions are known as anonymous functions in Python. They let you write one-line functions, and they are often used together with `map()` and `filter()`.

Syntax of a lambda function: `lambda arguments: expression`. The body is a single expression, and its value is what the function returns.

In [2]:
# lambda function with one argument
# take in an integer, add 10
func_1 = lambda x: x + 10

In [4]:
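# using the function
func_1(1)                # returns 11, but is not displayed: a notebook cell only shows its last expression
(lambda x: x + 10)(10)   # the same lambda, defined and called in one step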

20

In [None]:
# lambda function with multiple arguments
func_2 = None
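
One possible way to fill in the placeholder above is sketched below. The names `add_two` and `power` are made up for illustration and are not part of the original notebook.

```python
# Hypothetical example: a lambda that takes two arguments and returns their sum
add_two = lambda x, y: x + y
print(add_two(3, 4))  # 7

# Like regular functions, lambdas can give an argument a default value
power = lambda base, exp=2: base ** exp
squared = power(5)    # uses the default exponent: 5 ** 2
cubed = power(2, 3)   # overrides the default: 2 ** 3
print(squared, cubed)  # 25 8
```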

In [6]:
# exercise: turn the below function into a lambda function
def count_zeros(li):
    """
    return a count of how many zeros are in a list
    """
    count = sum(x == 0 for x in li)
    return count

In [7]:
count_zeros([1,2,4,0,0,0])

3

In [21]:
li = [1,2,4,0,0,0]
len(list(filter(lambda x: x==0, li)))


3

In [26]:
sum(map(lambda x: x==0, li))

3
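
The two cells above count the zeros with `filter()` and `map()`, but neither wraps the logic in a reusable lambda. One possible answer to the exercise, using the illustrative name `count_zeros_lambda`, could look like this:

```python
# A sketch of one answer: the same logic as count_zeros, written as a lambda.
# `count_zeros_lambda` is a made-up name for illustration.
count_zeros_lambda = lambda li: sum(x == 0 for x in li)

result = count_zeros_lambda([1, 2, 4, 0, 0, 0])
print(result)  # 3
```

This works because `True` counts as 1 and `False` as 0 when summed.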

## Part II. Data Cleaning in Pandas
You might wonder what lambda functions are useful for: they are especially handy for data cleaning in pandas. You can apply them to a single column or to an entire dataframe to get the results you need. For example, you might want to convert a column of prices from US dollars to euros, or temperatures from Celsius to Fahrenheit. You will learn three new functions:

- `apply()` - on both series and dataframes

- `applymap()` - only on dataframes

- `map()` - only on series
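
The difference between the three is easier to see on a small example. The sketch below uses a made-up toy dataframe (not `auto-mpg.csv`) and a hypothetical `to_fahrenheit` lambda. One caveat: in pandas 2.1 and later, `DataFrame.applymap` is deprecated in favor of `DataFrame.map`, so the code picks whichever method the installed version provides.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({"celsius_low": [0.0, 10.0], "celsius_high": [20.0, 30.0]})

to_fahrenheit = lambda c: c * 9 / 5 + 32

# Series.map: called once per element of a single column
low_f = toy["celsius_low"].map(to_fahrenheit)

# Series.apply: also called once per element of a single column
high_f = toy["celsius_high"].apply(to_fahrenheit)

# DataFrame.apply: the function receives a whole column (default axis=0) or row (axis=1)
col_ranges = toy.apply(lambda col: col.max() - col.min())

# Element-wise on every cell of the dataframe:
# DataFrame.map in pandas >= 2.1, DataFrame.applymap in older versions
elementwise = toy.map if hasattr(toy, "map") else toy.applymap
all_f = elementwise(to_fahrenheit)

print(low_f.tolist())   # [32.0, 50.0]
print(high_f.tolist())  # [68.0, 86.0]
print(all_f)
```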

In [27]:
# import the dataframe 
df = pd.read_csv('auto-mpg.csv')
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [28]:
# examine the first few rows of it 
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino


In [30]:
# check the datatypes of the df
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   mpg           398 non-null    float64
 1   cylinders     398 non-null    int64  
 2   displacement  398 non-null    float64
 3   horsepower    398 non-null    object 
 4   weight        398 non-null    int64  
 5   acceleration  398 non-null    float64
 6   model year    398 non-null    int64  
 7   origin        398 non-null    int64  
 8   car name      398 non-null    object 
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


In [31]:
# check the columns of the df
df.columns

Index(['mpg', 'cylinders', 'displacement', 'horsepower', 'weight',
       'acceleration', 'model year', 'origin', 'car name'],
      dtype='object')

In [33]:
# check whether you have missing values
df.isnull().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model year      0
origin          0
car name        0
dtype: int64
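
Note that `isnull()` only counts values pandas already recognizes as missing (`NaN` or `None`). The `object` dtype reported for `horsepower` by `df.info()` hints that the column may contain some non-numeric strings, which `isnull()` would not flag. The sketch below shows the idea on a made-up toy column; the `"?"` placeholder is an assumption for illustration, so check what `auto-mpg.csv` actually contains before relying on it.

```python
import pandas as pd

# Toy column, made up for illustration; the "?" placeholder is an assumption
hp = pd.Series(["130", "165", "?", "150"])

# Every entry is a string, so isnull() counts nothing as missing
missing_before = hp.isnull().sum()

# errors="coerce" turns anything that cannot be parsed as a number into NaN
hp_numeric = pd.to_numeric(hp, errors="coerce")
missing_after = hp_numeric.isnull().sum()

print(missing_before, missing_after)  # 0 1
```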

In [34]:
# create a new column - shows the broadcasting property of pandas: the single value 'Yes' is copied to every row
df['usable?'] = 'Yes'
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,usable?
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,Yes
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,Yes
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,Yes
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,Yes
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,Yes


In [None]:
# check the dataframe


In [41]:
# time to use lambda and apply! with apply, applymap, and map, you don't need to "iterate through the rows"

# warm-up: create a new column called new_cylinders, which subtracts 1 from the `cylinders` column

df['new_cylinders'] = df.cylinders.apply(lambda x: x - 1)
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,usable?,new_cylinders
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,Yes,7
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,Yes,7
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,Yes,7
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,Yes,7
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,Yes,7


In [42]:
# create a new column called weight_in_tons, which multiplies the `weight` column by 0.0005

# 1 lb = 0.0005 short tons

df['weight_in_tons'] = df.weight.apply(lambda x: x * 0.0005)
df.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,usable?,new_cylinders,weight_in_tons
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,Yes,7,1.752
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,Yes,7,1.8465
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,Yes,7,1.718
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,Yes,7,1.7165
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,Yes,7,1.7245
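
For simple arithmetic like this, the same broadcasting that filled the `usable?` column offers an alternative to `apply`: multiplying the column by a scalar directly. The sketch below compares the two on a toy dataframe whose weights are copied from the first rows of the dataset; `toy`, `via_apply`, and `via_broadcast` are illustrative names.

```python
import numpy as np
import pandas as pd

# Toy weights in pounds, copied from the first rows of the dataset
toy = pd.DataFrame({"weight": [3504, 3693, 3436]})

# Element by element, with a lambda
via_apply = toy["weight"].apply(lambda x: x * 0.0005)

# Vectorized: the scalar 0.0005 is broadcast to every row
via_broadcast = toy["weight"] * 0.0005

# Both approaches give the same values (compared with a floating-point tolerance)
same = np.allclose(via_apply, via_broadcast)
print(same)
```

The vectorized form is usually shorter and faster, while `apply` with a lambda stays useful when the logic cannot be written as plain column arithmetic.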

In [55]:
# exercise - create a new column called years_old, which holds the age of each car as of 2020

# a car with model year `70` (i.e. 1970) would be 50 years old
df['years_old'] = df['model year'].apply(lambda x: 2020 - (x+1900))
df.head(300)

,mpg,cylinders,displacement,horsepower,weight,acceleration,model year,origin,car name,usable?,new_cylinders,weight_in_tons,years_old
0,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu,Yes,7,1.7520,50
1,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320,Yes,7,1.8465,50
2,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite,Yes,7,1.7180,50
3,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst,Yes,7,1.7165,50
4,17.0,8,302.0,140,3449,10.5,70,1,ford torino,Yes,7,1.7245,50
...,...,...,...,...,...,...,...,...,...,...,...,...,...
295,35.7,4,98.0,80,1915,14.4,79,1,dodge colt hatchback custom,Yes,3,0.9575,41
296,27.4,4,121.0,80,2670,15.0,79,1,amc spirit dl,Yes,3,1.3350,41
297,25.4,5,183.0,77,3530,20.1,79,2,mercedes benz 300d,Yes,4,1.7650,41
298,23.0,8,350.0,125,3900,17.4,79,1,cadillac eldorado,Yes,7,1.9500,41
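
The lambda above hard-codes 2020 as the current year. When the logic needs a name, a docstring, or a parameter, a regular function can be passed to `apply` instead, with extra keyword arguments forwarded to it. A minimal sketch, assuming a hypothetical helper `car_age` and toy model years taken from the rows above:

```python
import pandas as pd

def car_age(model_year, current_year=2020):
    """Return the age of a car whose two-digit model year is in the 1900s."""
    return current_year - (model_year + 1900)

# Toy model years taken from the rows shown above
years = pd.Series([70, 79])

# Without extra arguments, the default current_year=2020 is used
age_2020 = years.apply(car_age)

# Keyword arguments given to Series.apply are passed on to the function
age_2025 = years.apply(car_age, current_year=2025)

print(age_2020.tolist())  # [50, 41]
print(age_2025.tolist())  # [55, 46]
```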


#### What's Next?
- Pandas Groupby functions and aggregation
- Combining multiple dataframes

In [49]:
df.shape

(398, 13)