# Although Scikit-Learn provides many useful transformers, you will need to write your own for tasks such as custom cleanup operations or combining specific attributes.

##### !! this part was very confusing at first but it's very important !!
The custom transformers we make have to **integrate** seamlessly with existing Scikit-Learn pipelines and other transformers.

<br> Now in the book we have _"and since Scikit-Learn relies on **duck typing** (not inheritance), all you need is to create a class and implement three methods: fit() (returning self), transform(), and fit_transform()"_.

<br> **First**, what is duck typing? It is a concept where the type or class of an object matters less than the methods it defines. With duck typing you do not check types at all; instead, you check for the presence of a given method or attribute.
<br> **Second**, you can also create custom transformers using inheritance, and sometimes it is much easier this way. As far as I can tell, the book's own example actually uses inheritance.
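To make the duck-typing idea concrete, here is a minimal sketch of a transformer that inherits from nothing and only provides the three methods the book lists. `LogTransformer` and `looks_like_transformer` are hypothetical names made up for this illustration; they are not part of Scikit-Learn.

```python
import numpy as np


class LogTransformer:
    """Hypothetical transformer with no Scikit-Learn base classes at all."""

    def fit(self, X, y=None):
        # nothing to learn here, but fit() must still return self
        return self

    def transform(self, X):
        return np.log1p(X)

    def fit_transform(self, X, y=None):
        # without TransformerMixin we have to write this one ourselves
        return self.fit(X, y).transform(X)


def looks_like_transformer(obj):
    # duck typing: ask which methods exist, not which class the object is
    return all(callable(getattr(obj, name, None))
               for name in ("fit", "transform", "fit_transform"))


data = np.array([[0.0, 1.0], [3.0, 7.0]])
logged = LogTransformer().fit_transform(data)
print(looks_like_transformer(LogTransformer()))
print(logged)
```

The check never calls `isinstance`; any object that offers these three methods passes it, which is the whole point of duck typing.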

<br> Scikit-Learn provides us with two great base classes, **TransformerMixin** and **BaseEstimator**. Inheriting from TransformerMixin means that we only need to write our fit and transform methods and we get fit_transform for free. Inheriting from BaseEstimator means we get get_params and set_params for free. In a transformer that learns nothing from the data, like the ones in this notebook, the fit method doesn’t need to do anything but return the object itself, so all we really need to do after inheriting from these classes is define the transform method, and we get a fully functional custom transformer that can be seamlessly integrated with a scikit-learn pipeline!
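A small sketch of what the two base classes give us. `Multiplier` is a hypothetical toy class invented for this example; only `BaseEstimator`, `TransformerMixin`, and the methods they provide come from Scikit-Learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class Multiplier(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: multiplies every value by `factor`."""

    def __init__(self, factor=2.0):
        # store constructor arguments unchanged so get_params() can find them
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.asarray(X) * self.factor


m = Multiplier(factor=3.0)
tripled = m.fit_transform(np.array([[1.0], [2.0]]))  # inherited from TransformerMixin
params = m.get_params()                               # inherited from BaseEstimator
m.set_params(factor=10.0)                             # inherited from BaseEstimator
print(tripled, params, m.factor)
```

Note that we never wrote `fit_transform`, `get_params`, or `set_params` ourselves.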

In [86]:
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

housing = pd.read_csv(r"C:\Users\georg\Desktop\Machine Learning\notebooks_detailed\datasets\housing\housing.csv") 

housing["income_category"] =pd.cut(housing["median_income"],bins=[0,1.5,3.0,4.5,6.,np.inf],labels=[1,2,3,4,5])
split_indices = StratifiedShuffleSplit(n_splits=1,test_size=0.2,random_state=42)
for train_index, test_index in split_indices.split(housing,housing["income_category"]): 
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index] 
# standard by now

In [87]:
from sklearn.base import BaseEstimator, TransformerMixin


In [88]:
housing["ocean_proximity"].value_counts() 
# so let's try to make a custom transformer for this categorical attribute
# I personally don't find the example in the book very friendly, but we will try to dissect it later

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

In [89]:
category_list = list(housing.ocean_proximity.unique()) 
# this looks like ['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']
numerical_values = [0,1,2,3,4]
# new values for each category

class CustomAttribute(BaseEstimator,TransformerMixin):  # we inherit from BaseEstimator and TransformerMixin 
    # we get fit_transform from TransformerMixin and get_params() and set_params() from BaseEstimator but will not use parameters
    # in this simple case
    
    def __init__(self):
        print("Initiated") #just prints something 
        
    def fit(self,X,y=None):
        return self
    
    def transform(self,X,y=None):
        X_ = X.copy() # we are making a copy just to be sure
        for c in range(0,len(category_list)): 
            X_.loc[(X_.ocean_proximity == category_list[c]), ["ocean_proximity"]] = numerical_values[c]
            # https://www.analyticsvidhya.com/blog/2020/02/loc-iloc-pandas/    read this, it's very useful
            # in short, we assign the new numerical value to each categorical value using .loc
            # loc[(condition on a column), [target column]] = new value
        
        return X_

In [90]:
x = CustomAttribute()
y =x.fit_transform(housing)

Initiated


In [91]:
y.ocean_proximity.value_counts()

1    9136
2    6551
3    2658
0    2290
4       5
Name: ocean_proximity, dtype: int64
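The transformer above reads `category_list` and `numerical_values` from global variables. As a possible alternative, here is a minimal sketch that passes the mapping in through the constructor and uses `Series.map` instead of a loop. `CategoryMapper` and its `column` / `mapping` parameters are hypothetical names for this illustration, and the tiny `demo` frame stands in for the real housing data.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class CategoryMapper(BaseEstimator, TransformerMixin):
    """Hypothetical variant: replaces categories using a dict given at construction."""

    def __init__(self, column="ocean_proximity", mapping=None):
        self.column = column
        self.mapping = mapping

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()  # never modify the caller's DataFrame
        if self.mapping is not None:
            X_[self.column] = X_[self.column].map(self.mapping)
        return X_


demo = pd.DataFrame({"ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "ISLAND"]})
mapping = {"NEAR BAY": 0, "<1H OCEAN": 1, "INLAND": 2, "NEAR OCEAN": 3, "ISLAND": 4}
encoded = CategoryMapper(mapping=mapping).fit_transform(demo)
print(encoded)
```

Because the mapping is now a constructor argument, it also shows up in `get_params()`, which the version with global variables cannot offer.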

## Now for the example in the book

In [92]:
housing.iloc[:,3]    # I will use iloc because it seems more logical to me
print(type(housing.iloc[:,3]))

<class 'pandas.core.series.Series'>


In [93]:
housing.values[:,3]   # this returns a NumPy array; if you find it easier to use, go ahead, either one works
print(type(housing.values[:,3]))

<class 'numpy.ndarray'>


In [94]:
result = housing.iloc[:,3] /  housing.iloc[:,6]   # total_rooms / households
result

0        6.984127
1        6.238137
2        8.288136
3        5.817352
4        6.281853
           ...   
20635    5.045455
20636    6.114035
20637    5.205543
20638    5.329513
20639    5.254717
Length: 20640, dtype: float64
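Selecting by position works, but it silently depends on the column order. A small sketch of the difference, using a made-up four-column `demo` frame (so the positions here are 0 to 3, not the 3 and 6 used on the full housing data):

```python
import pandas as pd

demo = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

by_position = demo.iloc[:, 0] / demo.iloc[:, 3]       # total_rooms / households
by_name = demo["total_rooms"] / demo["households"]   # same ratio, selected by name

# reorder the columns: names still work, hard-coded positions now point elsewhere
shuffled = demo[["households", "population", "total_bedrooms", "total_rooms"]]
name_after_shuffle = shuffled["total_rooms"] / shuffled["households"]
position_after_shuffle = shuffled.iloc[:, 0] / shuffled.iloc[:, 3]

print(by_name)
print(position_after_shuffle)
```

After the reorder, the positional version computes households / total_rooms instead, without raising any error.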

In [99]:
# indexes of the columns we will use, i.e. the positions of these columns along axis 1
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self 
    def transform(self, X):
        rooms_per_household = X.iloc[:, rooms_ix] /X.iloc[:, households_ix]
        population_per_household =X.iloc[:, population_ix] /X.iloc[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room =X.iloc[:, bedrooms_ix] / X.iloc[:, rooms_ix]
            return pd.concat([X, rooms_per_household, population_per_household , bedrooms_per_room],axis=1)
        else:
            return  pd.concat([X, rooms_per_household, population_per_household],axis=1)

        
attr_adder = CombinedAttributesAdder()
housing_extra_attribs = attr_adder.fit_transform(housing)

### pd.concat() with axis=1 just puts the original data and the new columns side by side
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
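A minimal sketch of how `pd.concat` behaves here, on a made-up two-row `demo` frame. A Series produced by dividing two differently named columns has no name, so `concat` labels it with an integer; giving it a name first with `Series.rename` avoids having to rename the columns afterwards.

```python
import pandas as pd

demo = pd.DataFrame({"total_rooms": [880.0, 7099.0], "households": [126.0, 1138.0]})
ratio = demo["total_rooms"] / demo["households"]   # the resulting Series has no name

# axis=1 places the pieces next to each other, aligned on the row index
unnamed = pd.concat([demo, ratio], axis=1)
named = pd.concat([demo, ratio.rename("rooms_per_household")], axis=1)

print(list(unnamed.columns))
print(list(named.columns))
```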

In [100]:
housing_extra_attribs
# we can see that it added the three new columns to the DataFrame, but they are labelled 0, 1 and 2, so now we have to rename them accordingly

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,income_category,0,1,2
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,5,6.984127,2.555556,0.146591
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,5,6.238137,2.109842,0.155797
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,5,8.288136,2.802260,0.129516
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,4,5.817352,2.547945,0.184458
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,3,6.281853,2.181467,0.172096
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND,2,5.045455,2.560606,0.224625
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND,2,6.114035,3.122807,0.215208
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND,2,5.205543,2.325635,0.215173
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND,2,5.329513,2.123209,0.219892


In [102]:
housing_extra_attribs.columns = list(housing.columns) + ["rooms_per_household", "population_per_household", "bedrooms_per_room"]
# we just added the names of the three new columns after the original ones
housing_extra_attribs

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,income_category,rooms_per_household,population_per_household,bedrooms_per_room
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,5,6.984127,2.555556,0.146591
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,5,6.238137,2.109842,0.155797
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY,5,8.288136,2.802260,0.129516
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY,4,5.817352,2.547945,0.184458
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY,3,6.281853,2.181467,0.172096
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND,2,5.045455,2.560606,0.224625
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0,INLAND,2,6.114035,3.122807,0.215208
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0,INLAND,2,5.205543,2.325635,0.215173
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0,INLAND,2,5.329513,2.123209,0.219892


# What is very important is that   add_bedrooms_per_room   is called a hyperparameter
### This hyperparameter will allow you to easily find out whether adding this attribute helps the Machine Learning algorithms or not. More generally, you can add a hyperparameter to any data preparation step that you are not 100% sure about.

### This is not exactly what a hyperparameter is supposed to do, but it's a very important concept that we will handle in the future. The basic idea is what matters for now.
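To see the `add_bedrooms_per_room` switch in action, here is a minimal sketch. `RatioAdder` is a hypothetical, name-based variant of `CombinedAttributesAdder` written only for this illustration, and `demo` contains just the first two rows of the relevant housing columns.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical name-based variant of CombinedAttributesAdder."""

    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy()
        X_["rooms_per_household"] = X_["total_rooms"] / X_["households"]
        X_["population_per_household"] = X_["population"] / X_["households"]
        if self.add_bedrooms_per_room:  # the switch decides whether this column exists
            X_["bedrooms_per_room"] = X_["total_bedrooms"] / X_["total_rooms"]
        return X_


demo = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "total_bedrooms": [129.0, 1106.0],
    "population": [322.0, 2401.0],
    "households": [126.0, 1138.0],
})

with_flag = RatioAdder(add_bedrooms_per_room=True).fit_transform(demo)

adder = RatioAdder()
adder.set_params(add_bedrooms_per_room=False)  # flip the switch without rebuilding the object
without_flag = adder.fit_transform(demo)

print(with_flag.shape, without_flag.shape, adder.get_params())
```

Because the switch is an ordinary constructor argument, tools that work through `get_params` / `set_params` can turn it on and off for us, which is how we can later test whether the extra attribute helps.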