# Data Cleaning & Quality
<hr>

## Missing Data and Dealing with Outliers

### Cleaning Data
- Understand the **data quality**
- improving the quality of data
- Dealing with *missing data* (NA)
    - **Replacing:** with mean or median values
    - **Interpolation:** of the values
- Dealing with **data outliers**
    - Wrong values
- Removing **duplicates**
- The process requires **domain knowledge**

### Missing Data
- Two types of missing data will be considered.
    1. NaN data
    1. Rows in line series data

In [1]:
import numpy as np
import pandas as pd

In [2]:
# Not a Number
df = pd.DataFrame({'a': [np.nan, 2, 3], 'b':[4,5,np.nan]})
df

Unnamed: 0,a,b
0,,4.0
1,2.0,5.0
2,3.0,


In [5]:
# missing full rows of data in timeseries data
df = pd.DataFrame([i for i in range(10)], index=pd.date_range('2023-01-01',periods=10))
df = df.drop(['2023-01-03','2023-01-05','2023-01-06'])
df

Unnamed: 0,0
2023-01-01,0
2023-01-02,1
2023-01-04,3
2023-01-07,6
2023-01-08,7
2023-01-09,8
2023-01-10,9


Missing row values in a time series data is where `interpolation` comes into play. 

### Outliers
- To determine and outlier, it often requires domain knowledge

In [6]:
df = pd.DataFrame({'weight (kg)': [86, 83,8, 78,109,96, 0]})
df

Unnamed: 0,weight (kg)
0,86
1,83
2,8
3,78
4,109
5,96
6,0


<hr>

### Demonstrate how it affects results
- The goal is to predict the house price
- Explore how dealing with missing values impacts the prediction of a linear regression model

In [7]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [9]:
data = pd.read_csv('./data/home-data/train.csv',index_col=0)
data

Unnamed: 0_level_0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,8,2007,WD,Normal,175000
1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,Inside,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,Inside,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,4,2010,WD,Normal,142125


In [10]:
data = data.select_dtypes(include='number')
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1460 entries, 1 to 1460
Data columns (total 37 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   LotFrontage    1201 non-null   float64
 2   LotArea        1460 non-null   int64  
 3   OverallQual    1460 non-null   int64  
 4   OverallCond    1460 non-null   int64  
 5   YearBuilt      1460 non-null   int64  
 6   YearRemodAdd   1460 non-null   int64  
 7   MasVnrArea     1452 non-null   float64
 8   BsmtFinSF1     1460 non-null   int64  
 9   BsmtFinSF2     1460 non-null   int64  
 10  BsmtUnfSF      1460 non-null   int64  
 11  TotalBsmtSF    1460 non-null   int64  
 12  1stFlrSF       1460 non-null   int64  
 13  2ndFlrSF       1460 non-null   int64  
 14  LowQualFinSF   1460 non-null   int64  
 15  GrLivArea      1460 non-null   int64  
 16  BsmtFullBath   1460 non-null   int64  
 17  BsmtHalfBath   1460 non-null   int64  
 18  FullBath     

In [12]:
data.corr()['SalePrice'].sort_values(ascending=False)

SalePrice        1.000000
OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
ScreenPorch      0.111447
PoolArea         0.092404
MoSold           0.046432
3SsnPorch        0.044584
BsmtFinSF2      -0.011378
BsmtHalfBath    -0.016844
MiscVal         -0.021190
LowQualFinSF    -0.025606
YrSold          -0.028923
OverallCond     -0.077856
MSSubClass      -0.084284
EnclosedPorch   -0.128578
KitchenAbvGr    -0.135907
Name: SalePrice, dtype: float64

<hr>

### Helper Function
* Implement a helper function to calculate the $r^2$ score
- It takes in independent features `X` and dependent feature `y`
- Split into training and testing data
- Fit the training set
- Predict the test set
- Return the $r^2$ score

In [16]:
def regression_score(X,y):
    np.random.seed(42)
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)
    model = LinearRegression()
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    return r2_score(y_pred, y_test)

**Calculation**
- Find the $r^2 score$ using `data.dropna()`
- Then with `data.fillna(data.mean())`
- Then with `data.fillna(data.mode().iloc[0])`

In [17]:
# r-squared score when NaN are deleted
test_base = data.dropna()
regression_score(test_base.drop('SalePrice',axis=1),test_base[['SalePrice']])

0.6548289068325825

In [18]:
# r-squared score when NaN are replaced with mean
test_base = data.fillna(data.mean())
regression_score(test_base.drop('SalePrice',axis=1),test_base[['SalePrice']])

0.7439346373898401

In [None]:
# r-squared score when NaN are deleted
test_base = data.dropna()
regression_score(test_base.drop('SalePrice',axis=1),test_base[['SalePrice']])