# Imports and Data Loading

In [0]:
import pandas as pd
import numpy as np

# visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# preprocessing
from sklearn.preprocessing import Binarizer, MinMaxScaler, Normalizer, StandardScaler



# model



In [0]:
url_train = '/content/sample_data/california_housing_train.csv'
train = pd.read_csv(url_train)

# set display format
pd.set_option('display.width', 100)
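# use the full option name; the bare 'precision' pattern is ambiguous in recent pandas versions
pd.set_option('display.precision', 3)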



# EDA (Task 2)

## Understand Your Data With Descriptive Statistics

1. Take a peek at your raw data.
2. Review the dimensions of your dataset.
3. Review the data types of attributes in your data.
4. Summarize the distribution of instances across classes in your dataset.
5. Summarize your data using descriptive statistics.
6. Understand the relationships in your data using correlations.
7. Review the skew of the distributions of each attribute.


In [48]:
# peek at the data
train.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.494 | 66900.0 |
| 1 | -114.47 | 34.4 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.82 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.651 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.192 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.925 | 65500.0 |


In [25]:
# dimension of data
train.shape

(17000, 9)

In [58]:
# data types of attributes
# train.dtypes
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           17000 non-null  float64
 1   latitude            17000 non-null  float64
 2   housing_median_age  17000 non-null  float64
 3   total_rooms         17000 non-null  float64
 4   total_bedrooms      17000 non-null  float64
 5   population          17000 non-null  float64
 6   households          17000 non-null  float64
 7   median_income       17000 non-null  float64
 8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB


In [29]:
# class distribution 
train.groupby(by='median_house_value').size()

median_house_value
14999.0       4
17500.0       1
22500.0       3
25000.0       1
26600.0       1
           ... 
498800.0      1
499000.0      1
499100.0      1
500000.0     22
500001.0    814
Length: 3694, dtype: int64

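> On classification problems you need to know how balanced the class values are.
- **Highly imbalanced problems** (many more observations for one class than another) are common and may need special handling in the data preparation stage of your project.
- Imbalance may lead to **overfitting**, with the **model biased** toward the majority (**dominant**) class.
- Here `median_house_value` is continuous, so this is a regression target: the 17000 rows spread over 3694 distinct values. The one notable spike is the 814 rows at 500001.0, which is also the maximum value, suggesting the target is capped there.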
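
For a genuine classification target, class proportions are usually quicker to read than raw counts. The following is a minimal sketch, assuming a hypothetical label series `labels` that is made up for illustration and is not part of this dataset:

```python
import pandas as pd

# Hypothetical class labels, made up for illustration only
labels = pd.Series(["cheap"] * 90 + ["expensive"] * 10, name="price_class")

# Share of each class; values far from an even split signal imbalance
class_share = labels.value_counts(normalize=True)
print(class_share)
```

`value_counts(normalize=True)` returns fractions that sum to 1, so a heavily dominant class stands out immediately.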

In [49]:
# summary statistics
train.describe().iloc[:,2:]

| | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|
| count | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 | 17000.0 |
| mean | 28.589 | 2643.664 | 539.411 | 1429.574 | 501.222 | 3.884 | 207300.912 |
| std | 12.587 | 2179.947 | 421.499 | 1147.853 | 384.521 | 1.908 | 115983.764 |
| min | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.5 | 14999.0 |
| 25% | 18.0 | 1462.0 | 297.0 | 790.0 | 282.0 | 2.566 | 119400.0 |
| 50% | 29.0 | 2127.0 | 434.0 | 1167.0 | 409.0 | 3.545 | 180400.0 |
| 75% | 37.0 | 3151.25 | 648.25 | 1721.0 | 605.25 | 4.767 | 265000.0 |
| max | 52.0 | 37937.0 | 6445.0 | 35682.0 | 6082.0 | 15.0 | 500001.0 |


In [54]:
# correlation of features with the target
corr_matrix = train.corr(method='pearson')
corr_matrix['median_house_value'].sort_values(ascending=False)

median_house_value    1.000
median_income         0.692
total_rooms           0.131
housing_median_age    0.107
households            0.061
total_bedrooms        0.046
population           -0.028
longitude            -0.045
latitude             -0.145
Name: median_house_value, dtype: float64

> A **correlation** of **-1 or 1** shows a perfect **negative (-1)** or **positive (1)** linear correlation respectively, whereas a value of **0 shows no linear correlation** at all. As a rough rule of thumb:
-  **positive correlation (towards +1)**:
  - strong positive correlation: **0.5 to 1**
  - weak positive correlation: **above 0 and below 0.5**
-  **negative correlation (towards -1)**:
  - strong negative correlation: **-1 to -0.5**
  - weak negative correlation: **above -0.5 and below 0**

> Some **machine learning algorithms** like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset.
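
One way to spot highly correlated attributes is to scan the feature-to-feature correlation matrix for pairs above a cut-off. This is a minimal sketch on synthetic data; the column names, the 0.9 cut-off, and the `demo` frame are assumptions for illustration, not values from this notebook:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
rooms = rng.normal(size=500)

# Synthetic features: "bedrooms" is built to track "rooms" closely, "age" is independent
demo = pd.DataFrame({
    "rooms": rooms,
    "bedrooms": rooms * 0.9 + rng.normal(scale=0.1, size=500),
    "age": rng.normal(size=500),
})

# Absolute Pearson correlation between every pair of features
corr = demo.corr(method="pearson").abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the (assumed) cut-off are candidates for dropping or combining
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

On the real data you would run the same steps on `X` (the features without the target) and pick a cut-off that suits the model.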

In [56]:
# skew of the data on each attribute
skew = train.skew()
skew 

longitude            -0.304
latitude              0.472
housing_median_age    0.065
total_rooms           4.003
total_bedrooms        3.323
population            5.187
households            3.343
median_income         1.627
median_house_value    0.973
dtype: float64

> Skew describes how far a distribution departs from the symmetric **Gaussian (normal or bell curve)** shape by being shifted or squashed in one direction or another.
- Many **machine learning algorithms** assume a *Gaussian distribution*.
- Knowing that an **attribute** **has a skew** may allow you to **perform data preparation to correct the skew** and later **improve the accuracy of your models**.

> **Skewness**
- A positive value indicates a right skew (a long **right tail**).
- A negative value indicates a left skew (a long **left tail**).
- Values closer to zero show less skew, that is, a distribution closer to normal.
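
A common way to reduce a strong right skew is a log transform. The sketch below uses synthetic data standing in for a count-like column such as `total_rooms`; the distribution parameters are assumptions chosen only to produce a long right tail:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed values standing in for a column like total_rooms
skewed = pd.Series(rng.lognormal(mean=7.5, sigma=0.7, size=5000))

skew_before = skewed.skew()

# log1p computes log(1 + x): safe at zero and compresses the long right tail
skew_after = np.log1p(skewed).skew()

print(f"skew before: {skew_before:.3f}, after log1p: {skew_after:.3f}")
```

Whether a log transform helps on the real columns depends on the model; it is one option among several, not a required step.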

***Skew is easier to understand once you visualise the univariate distributions.***

## VISUALIZATIONS
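
A minimal sketch of how skew shows up in a histogram, using two synthetic columns (one roughly symmetric, one right-skewed). The `demo` frame and its column names are assumptions for illustration; on the real data you would plot the columns of `train` instead:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Synthetic data: one near-Gaussian column and one with a long right tail
demo = pd.DataFrame({
    "symmetric": rng.normal(loc=30, scale=10, size=1000),
    "right_skewed": rng.lognormal(mean=7, sigma=0.8, size=1000),
})

# One histogram per column, with the skew value in the title
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, col in zip(axes, demo.columns):
    ax.hist(demo[col], bins=40)
    ax.set_title(f"{col} (skew = {demo[col].skew():.2f})")
fig.tight_layout()
```

The right-skewed column piles up on the left with a thin tail stretching to the right, which is the pattern the skew values for `total_rooms`, `population`, and similar columns point to.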

# DATA PRE-PROCESSING (Task 3)
> **Choose your option wisely**
1.  Rescale Data
2.  Standardize Data
3.  Normalize Data
4.  Binarize Data (Make Binary)




In [75]:
# feature / column names
train.columns.to_list()

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value']

In [0]:
# Separate target from features
X = train.drop('median_house_value', axis=1)
y = train['median_house_value']

##  1. **Rescale Data**
- When your data is made up of **attributes with varying scales**, many **machine learning algorithms** can **benefit** from **rescaling the attributes to all have the same scale**.
- Often this is referred to as **normalization** (not to be confused with the row-wise normalization in section 3), and attributes are often rescaled into the **range between 0 and 1**.
- This is **useful for optimization algorithms** used in the core of machine learning algorithms, like **gradient descent**.
- It is also useful for *algorithms that weight inputs*, like **regression** and **neural networks**, and *algorithms that use distance measures*, like **k-Nearest Neighbors**.
- *You can rescale your data with **scikit-learn** using the **MinMaxScaler** class:*

  ```python
  # preprocessing imports
  from sklearn.preprocessing import MinMaxScaler
  ```

**let's do this!**

In [69]:
# Define MinMaxScaler instance
min_max_scaler = MinMaxScaler(feature_range=(0,1))

# Scale only the features X
X_rescaled = min_max_scaler.fit_transform(X)

# summary of rescaled x
pd.DataFrame(X_rescaled, columns=X.columns.to_list()).head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.175 | 0.275 | 0.148 | 0.199 | 0.028 | 0.077 | 0.069 |
| 1 | 0.984 | 0.198 | 0.353 | 0.202 | 0.295 | 0.032 | 0.076 | 0.091 |
| 2 | 0.975 | 0.122 | 0.314 | 0.019 | 0.027 | 0.009 | 0.019 | 0.079 |
| 3 | 0.974 | 0.117 | 0.255 | 0.04 | 0.052 | 0.014 | 0.037 | 0.186 |
| 4 | 0.974 | 0.109 | 0.373 | 0.038 | 0.05 | 0.017 | 0.043 | 0.098 |
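
To make the rescaling concrete, the sketch below checks `MinMaxScaler` against the per-column formula `(x - min) / (max - min)`. The small matrix `X_demo` is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Small made-up feature matrix: two columns on very different scales
X_demo = np.array([[1.0, 200.0],
                   [2.0, 400.0],
                   [4.0, 1000.0]])

scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X_demo)

# Same result by hand: (x - column_min) / (column_max - column_min)
col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)
manual = (X_demo - col_min) / (col_max - col_min)

print(np.allclose(scaled, manual))
```

Because the minimum and maximum are taken per column, a single extreme value in a column squeezes all its other values toward 0.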


## 2.  **Standardize Data**

- Standardization is a useful technique to transform attributes with a **Gaussian distribution and differing means and standard deviations** to a standard Gaussian distribution with:
 - **mean = 0**
 - **standard deviation = 1**
- It is most suitable for techniques that assume a Gaussian distribution in the input variables and work better with rescaled data, such as:
 - **linear regression**
 - **logistic regression**
 - **linear discriminant analysis**
- You can standardize data using **scikit-learn** with the **StandardScaler** class:
```python
  # preprocessing imports
  from sklearn.preprocessing import StandardScaler
```

**let's do it!**

In [74]:
# define the Standardize instance with (0 mean, 1 stdev)
standard_scaler = StandardScaler()

# fit the features X only
# standard_scaler.fit_transform(X) # one step process only for training X
standard_scaler.fit(X)

# Transform X
X_standard = standard_scaler.transform(X) 

# summary of standard x
pd.DataFrame(X_standard, columns=X.columns.to_list()).head()


| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.619 | -0.672 | -1.08 | 1.362 | 1.764 | -0.361 | -0.076 | -1.253 |
| 1 | 2.54 | -0.573 | -0.762 | 2.297 | 3.23 | -0.262 | -0.099 | -1.081 |
| 2 | 2.495 | -0.905 | -0.921 | -0.882 | -0.867 | -0.955 | -0.999 | -1.17 |
| 3 | 2.49 | -0.929 | -1.159 | -0.524 | -0.48 | -0.797 | -0.716 | -0.363 |
| 4 | 2.49 | -0.962 | -0.682 | -0.546 | -0.506 | -0.702 | -0.622 | -1.026 |
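
The two-step `fit` then `transform` form matters once there is held-out data: the scaler should learn its statistics from the training set only and reuse them elsewhere. This is a minimal sketch with hypothetical `X_train_demo` and `X_test_demo` arrays; the notebook itself only loads a training file:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical train and test splits, generated only for this illustration
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(50, 3))

scaler = StandardScaler()

# Learn the per-column mean and standard deviation from the training data only
X_train_std = scaler.fit_transform(X_train_demo)

# Reuse those training statistics on the test data; do not call fit again
X_test_std = scaler.transform(X_test_demo)

print(X_train_std.mean(axis=0).round(3), X_train_std.std(axis=0).round(3))
```

The transformed test columns will be close to, but not exactly, mean 0 and standard deviation 1, because they are scaled with the training statistics.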


## 3. Normalize Data

Normalizing in scikit-learn refers to rescaling each observation (row) to have a length of 1 (called a unit norm, or a vector with length 1, in linear algebra).
- This pre-processing method can be useful for **sparse datasets (lots of zeros)** with attributes of varying scales.
- It is typically used with:
  - **algorithms that weight input values**, such as **neural networks**
  - **algorithms that use distance measures**, such as **k-Nearest Neighbors**

You can normalize data in Python with **scikit-learn** using the **Normalizer** class:
```python
  # preprocessing imports
  from sklearn.preprocessing import  Normalizer
```

**let's do it**



In [79]:
# define the Normalizer instance
normalizer_scaler = Normalizer()

# fit the features X only (Normalizer is stateless, so fit() learns nothing from X)
# normalizer_scaler.fit_transform(X) # one step process only for training X
normalizer_scaler.fit(X)

# Transform X
X_normalized = normalizer_scaler.transform(X)

# summary of normalized X
pd.DataFrame(X_normalized, columns=X.columns.to_list()).head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| 0 | -0.019 | 0.006 | 0.003 | 0.957 | 0.219 | 0.173 | 0.08 | 0.0002546 |
| 1 | -0.014 | 0.004 | 0.002 | 0.959 | 0.238 | 0.142 | 0.058 | 0.0002281 |
| 2 | -0.138 | 0.041 | 0.02 | 0.868 | 0.21 | 0.402 | 0.141 | 0.001991 |
| 3 | -0.07 | 0.02 | 0.009 | 0.914 | 0.205 | 0.314 | 0.138 | 0.001943 |
| 4 | -0.07 | 0.02 | 0.012 | 0.886 | 0.199 | 0.38 | 0.16 | 0.001173 |
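
The row-wise nature of this transform is easy to verify: after normalizing, every row should have Euclidean length 1. A minimal sketch on a made-up matrix `X_demo`:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up rows; in the second row one value is far larger than the other
X_demo = np.array([[3.0, 4.0],
                   [1.0, 1000.0]])

X_unit = Normalizer(norm="l2").fit_transform(X_demo)

# Every row now has Euclidean (L2) length 1
row_lengths = np.linalg.norm(X_unit, axis=1)
print(X_unit)
print(row_lengths)
```

When one column is on a much larger scale than the rest, it dominates each normalized row, as `total_rooms` does in the output above.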


## 4. Binarize Data (Make Binary)

> You can transform your data using a **binary threshold**.
 - All values **above** the threshold are **marked 1**.
 - All values **equal to or below** are **marked 0**.

> This is called **binarizing** your data or **thresholding** your data.
- It can be useful when you have probabilities that you want to make into crisp values.
- It is also useful in feature engineering, when you want to add new features that indicate something meaningful.

You can create new binary attributes in Python using **scikit-learn** with the **Binarizer** class:

```python
   # preprocessing imports
  from sklearn.preprocessing import  Binarizer
```

In [83]:
# define the Binarizer instance
# Note: threshold=1.5 was chosen arbitrarily; in your case, use a threshold that fits your project requirements
binarizer_scaler = Binarizer(threshold=1.5)

# binarize the features X only; store the result under its own name so X_normalized is not overwritten
X_binarized = binarizer_scaler.fit_transform(X) # one step process only for training X

# summary of binarized X
pd.DataFrame(X_binarized, columns=X.columns.to_list()).head()

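| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 3 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 4 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |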
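
A single threshold applied to every column, as above, rarely means much when the columns have different units. In feature engineering it is more typical to binarize one column into a new indicator feature. The sketch below assumes a made-up frame `demo`, a hypothetical cut-off of 3.5, and a hypothetical new column name `high_income`:

```python
import pandas as pd
from sklearn.preprocessing import Binarizer

# Made-up income values in the same units as median_income
demo = pd.DataFrame({"median_income": [1.2, 3.5, 3.6, 8.0]})

# Hypothetical cut-off; values strictly above it become 1, the rest become 0
binarizer = Binarizer(threshold=3.5)
flags = binarizer.fit_transform(demo[["median_income"]].to_numpy())

# Flatten the single-column result and store it as an integer indicator
demo["high_income"] = flags.ravel().astype(int)
print(demo)
```

The value exactly equal to the threshold (3.5) maps to 0, matching the "equal to or below" rule described earlier.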


# Credits:
[Machine Learning Mastery with Python (sample book)](https://s3.amazonaws.com/MLMastery/machine_learning_mastery_with_python_sample.pdf?__s=xbk5xvmjh72bie7r2u3u)