# Prepare the Data for Machine Learning Algorithms
### Instead of just doing this manually, you should write functions to do that, for several good reasons:
        - This will allow you to reproduce these transformations easily on any dataset (e.g., the next time you get a fresh dataset).
        - You will gradually build a library of transformation functions that you can reuse in future projects.
        - You can use these functions in your live system to transform the new data before feeding it to your algorithms.
        - This will make it possible for you to easily try various transformations and see which combination of transformations works best.
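
A minimal sketch of what such a reusable transformation function could look like. The function name `add_income_category` and the toy data are made up for illustration; the bins are the ones used for `income_category` in this notebook.

```python
import numpy as np
import pandas as pd


def add_income_category(df, income_col="median_income"):
    """Return a copy of df with an extra income_category column (hypothetical helper)."""
    out = df.copy()  # work on a copy so the caller's DataFrame is left untouched
    out["income_category"] = pd.cut(
        out[income_col],
        bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
        labels=[1, 2, 3, 4, 5],
    )
    return out


# Toy data standing in for the real housing DataFrame
toy = pd.DataFrame({"median_income": [1.0, 2.5, 4.0, 5.0, 8.0]})
toy_with_category = add_income_category(toy)
print(toy_with_category)
```

Because the function returns a new DataFrame, the same call can be applied to a fresh dataset or to live data without side effects.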
    

In [60]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

housing = pd.read_csv(r"C:\Users\georg\Desktop\end_end\datasets\housing\housing.csv") 

In [61]:
housing["income_category"] = pd.cut(
    housing["median_income"],
    bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
    labels=[1, 2, 3, 4, 5],
)
split_indices = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split_indices.split(housing, housing["income_category"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
# standard by now

### We will revert to a clean training set, but we will separate the attributes that will be used for predictions (PREDICTORS) from the labels (target values)

In [62]:
housing = strat_train_set.drop("median_house_value",axis=1) 
# drop() removes rows or columns by name; axis=0 means rows, axis=1 means columns
# !! drop() also returns a copy of the data, so the original strat_train_set is not affected
housing_labels = strat_train_set["median_house_value"].copy() 

### Data Cleaning - Missing features (values missing for some instances of an attribute)
### 3 options:
        - get rid of the corresponding instances
        - get rid of the whole attribute
        - set the missing values to some value (mean, zero, median, etc.)

#### total_bedrooms attribute has some missing values, so let’s fix this.

In [63]:
housing["total_bedrooms"].isnull().sum()  
# isnull() checks for NaN/missing values
# sum() just adds them together

158

In [64]:
housing.dropna(subset=["total_bedrooms"]) # option 1
# dropna() removes missing values; the subset parameter takes a list of columns (here just total_bedrooms) and drops the rows where those columns have no value
# !!! this does not modify housing, because dropna() returns a new DataFrame without the NaN/missing values, so you have to assign the result to a variable.
housing["total_bedrooms"].isnull().sum()

158

In [65]:
housing_with_dropna = housing.dropna(subset=["total_bedrooms"]) 
housing_with_dropna["total_bedrooms"].isnull().sum()
# now it worked 

0

In [66]:
housing.drop("total_bedrooms", axis=1) # option 2
# same as dropna(): drop() returns a new DataFrame, so nothing changes without an assignment
housing["total_bedrooms"].isnull().sum()  

158

In [67]:
median = housing["total_bedrooms"].median() # option 3   
# we calculate the median of the values that are recorded in the column
print(median)
housing["total_bedrooms"].fillna(median, inplace=True)

433.0


#### Don’t forget to save the median value that you have computed. You will need it later to replace missing values in the test set when you want to evaluate your system, and also once the system goes live to replace missing values in new data.

In [68]:
housing["total_bedrooms"].isnull().sum()
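# this worked because of inplace=True, which modifies the existing data instead of returning a new object
# note: recent pandas versions warn about calling fillna(inplace=True) on a selected column;
# assigning the result back (housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)) is the more robust form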

0
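
A small sketch of the "save the median" advice above, on toy data. The `train` and `test` frames and the name `bedrooms_median` are made up for illustration; the point is only that the value is computed once on the training data and then reused everywhere else.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and test sets
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

# Compute the median on the training set only, and keep it in a variable
bedrooms_median = train["total_bedrooms"].median()

# Fill both sets with the same saved value, assigning the result back
train["total_bedrooms"] = train["total_bedrooms"].fillna(bedrooms_median)
test["total_bedrooms"] = test["total_bedrooms"].fillna(bedrooms_median)

print(bedrooms_median)
print(test)
```

The test set never contributes to the median, so the evaluation stays honest.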

### Scikit-Learn provides a handy class to take care of missing values: SimpleImputer
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

In [69]:
from sklearn.impute import SimpleImputer  # transformer for filling in missing values

imputer = SimpleImputer(strategy="median")   # we have to create an instance; strategy is the method you want to apply to all attributes with missing values

# !!!! the median strategy works only on numerical data, so ocean_proximity needs to be separated from the data (for now)
housing_numerical = housing.drop("ocean_proximity", axis=1)

In [70]:
imputer.fit(housing_numerical)
imputer.statistics_   #  array of all the median values of each attribute

array([-118.51  ,   34.26  ,   29.    , 2119.5   ,  433.    , 1164.    ,
        408.    ,    3.5409,    3.    ])

### The imputer has simply computed the median of each attribute and stored the result in its statistics_ instance variable. Only the total_bedrooms attribute had missing values, but we cannot be sure that there won’t be any missing values in new data after the system goes live, so it is safer to apply the imputer to all the numerical attributes.

### What you really have to understand is that the imputer is a class, not a function, and it works like a TRANSFORMER: it extracts values from some data (it "trains" on that data), and after that it can TRANSFORM data with what it learned.
    - in our case it extracted the median values from the housing_numerical dataframe
    - and we will use it to apply those values to the data
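
To make the fit/transform idea concrete, here is a minimal sketch on toy data (the column values and the names `toy_imputer`, `train` and `new_data` are made up for illustration). The imputer learns the medians from one DataFrame and then fills gaps in another one, including in a column that had no missing values during fitting.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Data the imputer "trains" on: only total_bedrooms has a gap
train = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0],
    "total_bedrooms": [2.0, np.nan, 6.0],
})

# New data with gaps in both columns
new_data = pd.DataFrame({
    "total_rooms": [np.nan, 40.0],
    "total_bedrooms": [5.0, np.nan],
})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(train)           # learn one median per column
print(toy_imputer.statistics_)   # the learned medians

filled = toy_imputer.transform(new_data)  # reuse the learned medians on new data
print(filled)
```

This is why fitting on all the numerical attributes is the safer choice: a median is ready for every column, whichever one turns out to have gaps later.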

In [75]:
housing_numerical.median().values  
# we can see that they are identical so it extracted them correctly
X = imputer.transform(housing_numerical)  # we apply the learned medians using the transform() method
X  # X is a plain NumPy array containing the transformed features

array([[-121.89  ,   37.29  ,   38.    , ...,  339.    ,    2.7042,
           2.    ],
       [-121.93  ,   37.05  ,   14.    , ...,  113.    ,    6.4214,
           5.    ],
       [-117.2   ,   32.77  ,   31.    , ...,  462.    ,    2.8621,
           2.    ],
       ...,
       [-116.4   ,   34.09  ,    9.    , ...,  765.    ,    3.2723,
           3.    ],
       [-118.01  ,   33.82  ,   31.    , ...,  356.    ,    4.0625,
           3.    ],
       [-122.45  ,   37.77  ,   52.    , ...,  639.    ,    3.575 ,
           3.    ]])

In [76]:
# now we can put it back into a Pandas DataFrame, keeping the original column names and row index
housing_transformed = pd.DataFrame(X, columns=housing_numerical.columns, index=housing_numerical.index)

In [79]:
housing_transformed.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
income_category       0
dtype: int64
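
One detail worth checking when rebuilding the DataFrame: the NumPy array carries no row index, so unless one is passed explicitly the new DataFrame gets a fresh 0..n-1 index and no longer lines up with the shuffled training set or its labels. A minimal sketch on toy data (all names here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index, like a shuffled training set
original = pd.DataFrame({"a": [1.0, 2.0, 3.0]}, index=[17, 4, 9])
X_toy = original.to_numpy()  # stands in for the array returned by transform()

# Without index=..., pandas creates a new RangeIndex
without_index = pd.DataFrame(X_toy, columns=original.columns)

# Passing the original index keeps rows aligned with the source data
with_index = pd.DataFrame(X_toy, columns=original.columns, index=original.index)

print(list(without_index.index))
print(list(with_index.index))
```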

# File number 7 will give more detailed explanations of how transformers work, and we will try to understand the design principles of the Scikit-Learn library.