<center>
    <h1 id='pipelines' style='color:#7159c1'>🤖 Pipelines 🤖</h1>
    <i>Basic Step-by-Step Process of Data Preparation for Machine Learning</i>
</center>

When working with Machine Learning, you normally apply some steps that are common to the most datas. This notebook is a hand for you remember them.

```
- Reading Dataset and Splitting it up
- Separating Numerical Features from Categorical Ones
- Checking out for Inconsistent Data Entry
- Handling Good and Bad Labels
- Pipelines and Preprocessing
- Bundling Preprocessors
- Creating the Base Model
- Getting Better Results with Cross-Validation
- XGBoost
```

<h1 id='0-reading-dataset-and-splitting-it-up' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>0 | Reading Dataset and Splitting it up</h1>

In [1]:
import pandas as pd # pip install pandas
from sklearn.model_selection import train_test_split # pip install sklearn

autos_df = pd.read_csv('./datasets/autos.csv') # if needed, we solve the charset problem here with 'chardet' package
autos_df.head()

Unnamed: 0,symboling,make,fuel_type,aspiration,num_of_doors,body_style,drive_wheels,engine_location,wheel_base,length,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,alfa-romero,gas,std,2,convertible,rwd,front,88.6,168.8,...,130,mpfi,3.47,2.68,9,111,5000,21,27,13495
1,3,alfa-romero,gas,std,2,convertible,rwd,front,88.6,168.8,...,130,mpfi,3.47,2.68,9,111,5000,21,27,16500
2,1,alfa-romero,gas,std,2,hatchback,rwd,front,94.5,171.2,...,152,mpfi,2.68,3.47,9,154,5000,19,26,16500
3,2,audi,gas,std,4,sedan,fwd,front,99.8,176.6,...,109,mpfi,3.19,3.4,10,102,5500,24,30,13950
4,2,audi,gas,std,4,sedan,4wd,front,99.4,176.6,...,136,mpfi,3.19,3.4,8,115,5500,18,22,17450


In [2]:
autos_df.dtypes # if needed, we convert some features types, such as to datetime, int64/int32 or float64/float32

symboling              int64
make                  object
fuel_type             object
aspiration            object
num_of_doors           int64
body_style            object
drive_wheels          object
engine_location       object
wheel_base           float64
length               float64
width                float64
height               float64
curb_weight            int64
engine_type           object
num_of_cylinders       int64
engine_size            int64
fuel_system           object
bore                 float64
stroke               float64
compression_ratio      int64
horsepower             int64
peak_rpm               int64
city_mpg               int64
highway_mpg            int64
price                  int64
dtype: object

In [3]:
X = autos_df.copy()
y = X.pop('price')

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, random_state=20242101
    , train_size=0.70
    , test_size=0.30
)

<h1 id='1-separating-numerical-features-from-categorical-ones' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>1 | Separating Numerical Features from Categorical Ones</h1>

In [4]:
numerical_features = [
    feature for feature in X_train.columns
    if X_train[feature].dtype in ['int32', 'int64', 'float32', 'float64']
]

# features for ordinal encoding
categorical_ordinal_features = [
    feature for feature in X_train.columns
    if X_train[feature].dtype in ['object', 'o']
       and X_train[feature].nunique() >= 10
]

# features for one-hot encoding
categorical_onehot_features = [
    feature for feature in X_train.columns
    if X_train[feature].dtype in ['object', 'o']
       and X_train[feature].nunique() < 10
]

print(f'- Numerical Features: {numerical_features}')
print('---')
print(f'- Categorical Features for Ordinal Encoding: {categorical_ordinal_features}')
print('---')
print(f'- Categorical Features for One-Hot Encoding: {categorical_onehot_features}')

- Numerical Features: ['symboling', 'num_of_doors', 'wheel_base', 'length', 'width', 'height', 'curb_weight', 'num_of_cylinders', 'engine_size', 'bore', 'stroke', 'compression_ratio', 'horsepower', 'peak_rpm', 'city_mpg', 'highway_mpg']
---
- Categorical Features for Ordinal Encoding: ['make']
---
- Categorical Features for One-Hot Encoding: ['fuel_type', 'aspiration', 'body_style', 'drive_wheels', 'engine_location', 'engine_type', 'fuel_system']


<h1 id='2-checking-out-for-inconsistent-data-entry' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>2 | Checking out for Inconsistent Data Entry</h1>

In this example, we are checking out the 'make' feature only, however, in a real world analysis, we must do to every categorical one!!

In [5]:
import fuzzywuzzy # pip install fuzzywuzzy
from fuzzywuzzy import process

# Getting the unique makes
train_makes = X_train.make.unique()
train_makes.sort()

valid_makes = X_valid.make.unique()
valid_makes.sort()

print(f'- Makes in Training: {train_makes}')
print('---')
print(f'- Makes in Validation: {valid_makes}')

- Makes in Training: ['alfa-romero' 'audi' 'bmw' 'chevrolet' 'dodge' 'honda' 'isuzu' 'jaguar'
 'mazda' 'mercedes-benz' 'mitsubishi' 'nissan' 'peugot' 'plymouth'
 'porsche' 'saab' 'subaru' 'toyota' 'volkswagen' 'volvo']
---
- Makes in Validation: ['audi' 'chevrolet' 'dodge' 'honda' 'jaguar' 'mazda' 'mercedes-benz'
 'mercury' 'mitsubishi' 'nissan' 'peugot' 'plymouth' 'porsche' 'saab'
 'subaru' 'toyota' 'volkswagen' 'volvo']


In [6]:
# Getting the Matches to the Unique Make
train_matches = fuzzywuzzy.process.extract(
    'alfa-romero'
    , train_makes
    , limit=10
    , scorer=fuzzywuzzy.fuzz.token_sort_ratio
)

valid_matches = fuzzywuzzy.process.extract(
    'saab'
    , valid_makes
    , limit=10
    , scorer=fuzzywuzzy.fuzz.token_sort_ratio
)

print(f'- Train Matches for Alfa-Romero: {train_matches}')
print('---')
print(f'- Validation Matches for Saab: {valid_matches}')

- Train Matches for Alfa-Romero: [('alfa-romero', 100), ('jaguar', 35), ('mercedes-benz', 33), ('plymouth', 32), ('chevrolet', 30), ('volkswagen', 29), ('saab', 27), ('dodge', 25), ('mazda', 25), ('volvo', 25)]
---
- Validation Matches for Saab: [('saab', 100), ('mazda', 44), ('jaguar', 40), ('nissan', 40), ('subaru', 40), ('mitsubishi', 29), ('volkswagen', 29), ('audi', 25), ('honda', 22), ('toyota', 20)]


In [7]:
# Creating Function to Deal with the Inconsistent Entries
def replace_matches_in_feature(df, column, string_to_match, min_ratio):
    # getting list of strings and the matches
    strings = df[column].unique()
    matches = fuzzywuzzy.process.extract(
        string_to_match
        , strings
        , limit=10
        , scorer=fuzzywuzzy.fuzz.token_sort_ratio
    )
    
    # getting the most similar matches
    close_matches = [matche[0] for matche in matches if matche[1] >= min_ratio]
    
    # getting the rows with the close matches and replacing them
    rows_with_matches = df[column].isin(close_matches)
    df.loc[rows_with_matches, column] = string_to_match
    
# Calling the function and only replacing the values if the similarity score
# is equals to or greater than 75%
replace_matches_in_feature(X_train, 'make', 'alfa-romero', 75)
replace_matches_in_feature(X_valid, 'make', 'saab', 75)

<h1 id='3-handling-good-and-bad-labels' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>3 | Handling Good and Bad Labels</h1>

In [8]:
# Getting the Good and Bad Labels
good_labels_ordinal_features = [
    feature for feature in categorical_ordinal_features
    if set(X_valid[feature]).issubset(set(X_train[feature]))
]

bad_labels_ordinal_features = list(set(categorical_ordinal_features) - set(good_labels_ordinal_features))

good_labels_onehot_features = [
    feature for feature in categorical_onehot_features
    if set(X_valid[feature]).issubset(set(X_train[feature]))
]

bad_labels_onehot_features = list(set(categorical_onehot_features) - set(good_labels_onehot_features))

# Dropping the Bad Labels
good_labels_X_train = X_train.drop(bad_labels_ordinal_features, axis=1).copy()
good_labels_X_train = good_labels_X_train.drop(bad_labels_onehot_features, axis=1)

good_labels_X_valid = X_valid.drop(bad_labels_ordinal_features, axis=1).copy()
good_labels_X_valid = good_labels_X_valid.drop(bad_labels_onehot_features, axis=1)

<h1 id='4-pipelines-and-preprocessing' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>4 | Pipelines and Preprocessing</h1>

> **Scaling** - `tt's used to change the RANGE of the datas. The RANGE goes from 0 to 1`;

> **Standardization** - `it's like the Scale, but the scale range doesn't go from 0 to 1, it varies`;

> **Normalization** - `changes the distribution of the datas in order to get a Normal Distribution (Gaussian Distribution or Bell Curve)`.

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer

# Numerical Preprocessing Preprocessing
#
# / Imputer (strategy: mean|most_frequent|median|constant)
# / Scale | Standardization | Normalization (uncomment the code to change the preprocessing)
#
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean'))
    , ('scale', MinMaxScaler())
    #('standardization', StandardScaler()),
	#('robustscaler', RobustScaler()),
	#('normalization', Normalizer()),
])

# Ordinal Categorical Features Preprocessing
#
# / Imputer (strategy: most_frequent|constant)
# / Ordinal Encoder
#
categorical_ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))
    , ('ordinal_encoder', OrdinalEncoder())
])

# One-Hot Categorical Preprocessor
#
# / Imputer (strategy: most_frequent|constant)
# / One-Hot Encoder (handle_unknown=ignore, sparse=False)
#
categorical_onehot_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent'))
    , ('onehot_encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

<h1 id='5-bundling-the-preprocessors' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>5 | Bundling the Preprocessors</h1>

In [10]:
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numerical_transformer, numerical_features)
        , ('ordinal_categorical', categorical_ordinal_transformer, good_labels_ordinal_features)
        , ('onehot_categorical', categorical_onehot_transformer, good_labels_onehot_features)
    ]
)

<h1 id='6-creating-the-base-model' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>6 | Creating the Base Model</h1>

In [11]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Creating the Base Model and the Pipeline
base_model = RandomForestRegressor(n_estimators=250, random_state=20242101)
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
    , ('model', base_model)
])

# Training and Predicting
pipeline.fit(good_labels_X_train, y_train)
predictions = pipeline.predict(good_labels_X_valid)

# Evaluating
mae = mean_absolute_error(predictions, y_valid)
print(f'- Mean Absolute Error of the Base Model: {mae}')

- Mean Absolute Error of the Base Model: 1339.5122298850572


<h1 id='7-getting-better-results-with-cross-validation' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>7 | Getting Better Results with Cross-Validation</h1>

In [12]:
from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates 'negative' MAE
scores = -1 * cross_val_score(
    pipeline
    , pd.concat([good_labels_X_train, good_labels_X_valid], axis=0)
    , y
    , cv=5 # number of folds
    , scoring='neg_mean_absolute_error'
)

print(f'- Average Mean Absolute Error Across Cross-Validation Experiments: {scores.mean()}')
print('---')
print(f'- All Scores: {scores}')

- Average Mean Absolute Error Across Cross-Validation Experiments: 6589.285280637491
---
- All Scores: [6896.14960806 9021.31911282 4725.00336752 6122.34763233 6181.60668246]


<h1 id='8-xgboost' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>8 | XGBoost</h1>

In [20]:
from xgboost import XGBRegressor # pip install xgboost

xgb_X_train = preprocessor.fit_transform(X_train)
xgb_X_valid = preprocessor.transform(X_valid)

xgb_model = XGBRegressor(
    n_estimators=500
    , learning_rate=0.05
    , early_stopping_rounds=5
    , n_jobs=4
)

xgb_model.fit(
    xgb_X_train
    , y_train
    , eval_set=[(xgb_X_valid, y_valid)]
    , verbose=False
)

xgb_predictions = xgb_model.predict(xgb_X_valid)
mae = mean_absolute_error(xgb_predictions, y_valid)

print(f'- Mean Absolute Error of XGBoost Modedl: {mae}')

- Mean Absolute Error of XGBoost Modedl: 1458.6810597386852


---

<h1 id='reach-me' style='color:#7159c1; border-bottom:3px solid #7159c1; letter-spacing:2px; font-family:JetBrains Mono; font-weight: bold; text-align:left; font-size:240%;padding:0'>📫 | Reach Me</h1>

> **Email** - [csfelix08@gmail.com](mailto:csfelix08@gmail.com?)

> **Linkedin** - [linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)

> **GitHub:** - [CSFelix](https://github.com/CSFelix)

> **Kaggle** - [DSFelix](https://www.kaggle.com/dsfelix)

> **Portfolio** - [CSFelix.io](https://csfelix.github.io/).