<a href="https://colab.research.google.com/github/MamatkulovBunyodbek1999/X_prepared-data-Machine-Learning/blob/main/Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![Imgur](https://i.imgur.com/5pXzCIu.png)

# Data Science and Artificial Intelligence Practicum

## Module 5. Machine Learning

### Pipeline
<img src="https://www.researchgate.net/publication/334565019/figure/fig1/AS:782364141690881@1563541555043/The-Auto-Sklearn-pipeline-12-contains-three-main-building-blocks-a-Data.png"
alt="pipeline" width="500" height="250"/>

<img src="https://miro.medium.com/max/1100/1*zBHVqeUkMYlwGwh67cKHGw.png"
alt="standardization" width="500" height="250"/>

<img src="https://i0.wp.com/lifewithdata.com/wp-content/uploads/2021/04/pipeline.png?fit=681%2C431&ssl=1"
alt="pipeline" width="500" height="250"/>

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import sklearn # scikit-learn library

In [20]:
# URL of our dataset.
URL = "https://github.com/ageron/handson-ml2/blob/master/datasets/housing/housing.csv?raw=true"
df = pd.read_csv(URL)

from sklearn.model_selection import train_test_split   
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

housing = train_set.drop("median_house_value", axis=1) 
housing_labels = train_set["median_house_value"].copy()

housing_num = housing.drop("ocean_proximity", axis=1)

# Our custom transformer

In [21]:
from sklearn.base import BaseEstimator, TransformerMixin
# Indices of the columns we need (total_rooms, total_bedrooms, population, households in housing_num).
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room = True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # Nothing to learn here: this class only transforms data, it does not estimate anything.
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:  # the bedrooms_per_room column is optional
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
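
To see what the two base classes contribute, here is a minimal sketch on toy data. `RatioAdder` and `X_toy` are hypothetical names made up for this illustration and are not part of the housing code: `TransformerMixin` supplies `fit_transform`, and `BaseEstimator` supplies `get_params`, based on the arguments of `__init__`.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical toy transformer: appends column num_ix divided by column den_ix."""

    def __init__(self, num_ix=0, den_ix=1):
        self.num_ix = num_ix
        self.den_ix = den_ix

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        ratio = X[:, self.num_ix] / X[:, self.den_ix]
        return np.c_[X, ratio]


# Toy array (an assumption for illustration): two rows, two columns.
X_toy = np.array([[6.0, 2.0],
                  [9.0, 3.0]])

adder = RatioAdder(num_ix=0, den_ix=1)
X_out = adder.fit_transform(X_toy)  # fit_transform comes from TransformerMixin
params = adder.get_params()         # get_params comes from BaseEstimator

print(X_out)
print(params)
```

Because the hyperparameters are exposed through `get_params`, a transformer written this way can be placed inside a `Pipeline` like any built-in scikit-learn step.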

# Pipeline

### After converting the string column into integers, compare the original strings with the encoded values: <1H OCEAN becomes 0, INLAND becomes 1, and NEAR OCEAN becomes 4. In reality, NEAR OCEAN and <1H OCEAN areas are very close to each other, yet after encoding 4 and 0 are far apart, while INLAND (1) and <1H OCEAN (0) look like neighbors. A machine learning model will therefore assume that INLAND and <1H OCEAN are similar and that NEAR OCEAN and <1H OCEAN are not, which leads to wrong predictions. To avoid this mistake, we can use:
# One Hot Encoder
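
The difference between the two encodings can be sketched on a toy column. The array `cats` below is made up for illustration (only the category names are taken from the dataset). `OrdinalEncoder` assigns integers in sorted order, which implies an ordering and distances that do not exist, while `OneHotEncoder` gives every category its own 0/1 column.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy categorical column (an assumption for illustration), shape (n_samples, 1).
cats = np.array([["<1H OCEAN"], ["INLAND"], ["NEAR OCEAN"], ["INLAND"]])

# Ordinal encoding: one integer per category, in sorted order.
ordinal = OrdinalEncoder().fit_transform(cats)

# One-hot encoding: one binary column per category.
one_hot_encoder = OneHotEncoder()
one_hot = one_hot_encoder.fit_transform(cats).toarray()  # result is sparse by default

print(one_hot_encoder.categories_[0])
print(ordinal.ravel())
print(one_hot)
```

With one-hot encoding, any two different categories are equally far apart, so the model no longer sees a false "closeness" between them.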

# StandardScaler / Standardization
### This method does not squeeze the values into the range 0 to 1. Instead, it rescales each column to zero mean and unit variance, which brings the columns onto comparable scales.
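
As a small check of what standardization does, the sketch below compares `StandardScaler` with the formula `z = (x - mean) / std` computed by hand for each column. `X_toy` is a made-up array, not the housing data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for illustration): two columns on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# The same computation by hand: subtract the column mean, divide by the column std.
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(X_scaled)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, both columns have mean 0 and standard deviation 1, although individual values can still fall outside the range 0 to 1.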

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline = Pipeline([
          ('imputer', SimpleImputer(strategy='median')),  # replaces NaN values with the column median
          ('attribs_adder', CombinedAttributesAdder(add_bedrooms_per_room=True)),  # adds the new ratio columns
          ('std_scaler', StandardScaler())  # standardizes every column
])

In [23]:
# The numeric pipeline runs every step above in order and returns a clean array whose columns are on comparable scales.
num_pipeline.fit_transform(housing_num)

array([[ 1.27258656, -1.3728112 ,  0.34849025, ..., -0.17491646,
         0.05137609, -0.2117846 ],
       [ 0.70916212, -0.87669601,  1.61811813, ..., -0.40283542,
        -0.11736222,  0.34218528],
       [-0.44760309, -0.46014647, -1.95271028, ...,  0.08821601,
        -0.03227969, -0.66165785],
       ...,
       [ 0.59946887, -0.75500738,  0.58654547, ..., -0.60675918,
         0.02030568,  0.99951387],
       [-1.18553953,  0.90651045, -1.07984112, ...,  0.40217517,
         0.00707608, -0.79086209],
       [-1.41489815,  0.99543676,  1.85617335, ..., -0.85144571,
        -0.08535429,  1.69520292]])
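
A pipeline simply passes the output of each step to the next one. The sketch below illustrates this on a made-up array with missing values (`X_toy` and `toy_pipeline` are hypothetical names, and the custom attribute adder is left out to keep the example short): the pipeline result matches running the two steps by hand.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (an assumption for illustration) with one NaN in each column.
X_toy = np.array([[1.0, 10.0],
                  [np.nan, 20.0],
                  [3.0, 60.0],
                  [5.0, np.nan]])

toy_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
X_pipe = toy_pipeline.fit_transform(X_toy)

# The same two steps applied one after another, without a Pipeline.
X_step1 = SimpleImputer(strategy="median").fit_transform(X_toy)
X_manual = StandardScaler().fit_transform(X_step1)

# Medians learned by the imputer step during fitting.
medians = toy_pipeline.named_steps["imputer"].statistics_
print(medians)
print(np.allclose(X_pipe, X_manual))
```

The advantage of the pipeline is that everything learned during fitting (medians, means, standard deviations) is stored in one object that can be reused later.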

### The pipeline for numeric columns is ready, but what about text columns?

### For this we use the special ColumnTransformer object, which is itself a kind of pipeline. Inside the ColumnTransformer we also include the num_pipeline created above.
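
As a minimal sketch of the idea, the toy example below sends two numeric columns to a `StandardScaler` and one text column to a `OneHotEncoder`. The data frames `toy_train` and `toy_new` and their column names are made up for illustration, and `sparse_threshold=0` is set here only so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data (an assumption for illustration).
toy_train = pd.DataFrame({
    "rooms": [3.0, 5.0, 7.0, 9.0],
    "income": [1.0, 2.0, 3.0, 6.0],
    "proximity": ["INLAND", "NEAR OCEAN", "INLAND", "ISLAND"],
})
# A new row that the transformer has not seen during fitting.
toy_new = pd.DataFrame({"rooms": [5.0], "income": [3.0], "proximity": ["ISLAND"]})

toy_transformer = ColumnTransformer(
    [
        ("num", StandardScaler(), ["rooms", "income"]),
        ("cat", OneHotEncoder(), ["proximity"]),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

# Fit on the training data, then reuse the fitted transformer on new data.
train_prepared = toy_transformer.fit_transform(toy_train)
new_prepared = toy_transformer.transform(toy_new)

print(train_prepared.shape)
print(new_prepared)
```

The output columns follow the order of the transformers: first the scaled numeric columns, then one 0/1 column per category. New data is passed through `transform` only, so it is scaled with the means and standard deviations learned from the training data.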

# FULL PIPELINE 

In [24]:
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ['ocean_proximity']

full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', OneHotEncoder(), cat_attribs)
])

In [25]:
housing_prepared = full_pipeline.fit_transform(housing)

In [26]:
housing_prepared[0:5, :]

array([[ 1.27258656, -1.3728112 ,  0.34849025,  0.22256942,  0.21122752,
         0.76827628,  0.32290591, -0.326196  , -0.17491646,  0.05137609,
        -0.2117846 ,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [ 0.70916212, -0.87669601,  1.61811813,  0.34029326,  0.59309419,
        -0.09890135,  0.6720272 , -0.03584338, -0.40283542, -0.11736222,
         0.34218528,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.44760309, -0.46014647, -1.95271028, -0.34259695, -0.49522582,
        -0.44981806, -0.43046109,  0.14470145,  0.08821601, -0.03227969,
        -0.66165785,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [ 1.23269811, -1.38217186,  0.58654547, -0.56148971, -0.40930582,
        -0.00743434, -0.38058662, -1.01786438, -0.60001532,  0.07750687,
         0.78303162,  0.        ,  0.        ,  0.        ,  0.        ,
         1.        ],
       [-0.10855122,  0.5320839 ,  1