# Machine Learning Pipeline using Scikit-Learn

## Introduction to the Machine Learning Pipeline Project

This project builds a machine learning pipeline with Scikit-Learn that chains the preprocessing and modeling stages into a single workflow for predicting customer churn. The key components of the pipeline include:

1. **Data Imputation:** Handling missing values to ensure data completeness.
2. **Feature Scaling:** Rescaling numerical features to a common range so that features with large magnitudes do not dominate the PCA step.
3. **PCA (Principal Component Analysis):** Reducing dimensionality while retaining significant variance.
4. **One-Hot Encoding:** Converting categorical variables into a machine-readable format.
5. **Model Fine-Tuning:** Optimizing the estimator to achieve the best predictive performance.

The dataset used in this project is the Churn Modelling dataset, which contains information about customer behavior and churn status. The pipeline is designed to process both numerical and categorical data, ensuring that all features are appropriately transformed before feeding into the machine learning model.

The main steps include:
- Loading and inspecting the dataset.
- Dropping irrelevant columns.
- Splitting the dataset into training and testing sets.
- Building separate pipelines for numerical and categorical data processing.
- Integrating the pipelines into a unified preprocessing workflow.
- Training and fine-tuning a Random Forest classifier to predict customer churn.

This project shows how Scikit-Learn's `Pipeline` and `ColumnTransformer` classes combine preprocessing and modeling into one estimator, so that every transformation learned on the training set is applied unchanged to the test set.

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

In [2]:
churn_df = pd.read_csv(r"Churn_Modelling.csv")

In [3]:
churn_df.head()

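|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |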


In [5]:
# dropping the unwanted columns (row number, customer ID and surname are identifiers, not features)
churn_df.drop(columns = ['RowNumber', 'CustomerId', 'Surname'], inplace = True)

In [7]:
#dividing the dataset into input features and target feature
X = churn_df.drop(columns = ['Exited'])
y = churn_df['Exited']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

print(f'Count of rows in Training Set : {X_train.shape[0]}')
print(f'Count of rows in Testing Set : {X_test.shape[0]}')

Count of rows in Training Set : 8000
Count of rows in Testing Set : 2000
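
The split above is purely random. If the two classes of `Exited` are not equally frequent (the cells above do not check this), a stratified split keeps the class proportions the same in both sets. The following is a minimal sketch on synthetic labels; `X_demo`, `y_demo` and the 20% positive rate are assumptions made for illustration and are not taken from the churn data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the 'Exited' target (hypothetical data).
rng = np.random.default_rng(42)
y_demo = (rng.random(1000) < 0.2).astype(int)   # roughly 20% positives (assumed)
X_demo = rng.normal(size=(1000, 3))

# stratify=y_demo asks train_test_split to preserve the class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive rate overall : {y_demo.mean():.3f}")
print(f"Positive rate in train: {y_tr.mean():.3f}")
print(f"Positive rate in test : {y_te.mean():.3f}")
```

In the notebook this would amount to passing `stratify = y` to the existing `train_test_split` call; whether that changes the reported score has not been verified here.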


In [8]:
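# pipeline for processing numerical data
num_pipeline = Pipeline([
    # replace missing values with the column mean learned from the training data
    ('num_imputation', SimpleImputer(strategy='mean')),
    # rescale every feature to the [0, 1] range
    ('feature_scaling', MinMaxScaler()),
    # a float n_components keeps as many components as needed to explain 98% of the variance
    ('pca', PCA(n_components=0.98))
])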

num_pipeline
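
When `n_components` is a float between 0 and 1, PCA chooses the number of components itself, so it can be useful to check how many it actually kept. The sketch below does this on synthetic data in which two columns are near-copies of others; the array `X_demo` is an assumption for illustration and has nothing to do with the churn features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Hypothetical data: 3 independent columns plus 2 noisy near-duplicates.
base = rng.normal(size=(500, 3))
X_demo = np.column_stack([
    base,
    base[:, 0] + 0.01 * rng.normal(size=500),
    base[:, 1] + 0.01 * rng.normal(size=500),
])

# Same order as in num_pipeline: scale first, then project.
X_scaled = MinMaxScaler().fit_transform(X_demo)
pca = PCA(n_components=0.98).fit(X_scaled)

# n_components_ is the number of components PCA decided to keep.
print("components kept    :", pca.n_components_)
print("cumulative variance:", np.cumsum(pca.explained_variance_ratio_).round(4))
```

On the fitted notebook pipeline, the same attribute should be reachable through `pipe.named_steps['preprocessor'].named_transformers_['numerical'].named_steps['pca'].n_components_` once `pipe` has been fit.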

In [16]:
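# pipeline for processing categorical data
catg_pipeline = Pipeline([
    # replace missing categories with the literal string 'missing'
    ('catg_imputation', SimpleImputer(strategy='constant', fill_value='missing')),
    # one column per category; categories unseen during fit are encoded as all zeros
    ('one_hot_encoding', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])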

catg_pipeline

In [19]:
# split the feature names by dtype; binary flags such as HasCrCard are integers and land in num_cols
num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = X.select_dtypes(include='object').columns.tolist()
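
An alternative to precomputing the column lists is `make_column_selector`, which defers the dtype lookup until the transformer sees the data. This is a sketch of that option, not what the notebook does; the frame `demo` is made up, and selecting categorical columns as "everything non-numeric" is an assumption that differs slightly from the `include = 'object'` filter above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Made-up frame with two numeric columns and one text column.
demo = pd.DataFrame({
    "CreditScore": [619, 608],
    "Geography": ["France", "Spain"],
    "Balance": [0.0, 83807.86],
})

# Each selector is a callable that returns the matching column names.
select_num = make_column_selector(dtype_include=np.number)
select_cat = make_column_selector(dtype_exclude=np.number)  # assumption: non-numeric means categorical

num_demo = select_num(demo)
cat_demo = select_cat(demo)
print(num_demo)
print(cat_demo)
```

The selectors can be passed to `ColumnTransformer` in place of `num_cols` and `cat_cols`.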

In [21]:
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  int64  
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB


In [23]:
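# apply each sub-pipeline to its own subset of columns and concatenate the results
preprocessor = ColumnTransformer([
    ('categorical', catg_pipeline, cat_cols),
    ('numerical', num_pipeline, num_cols)
])

# chain preprocessing and the classifier so that both are fit on the training data only
pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('estimator', RandomForestClassifier())  # default hyperparameters, no fixed random_state
])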

pipe.fit(X_train, y_train)

In [24]:
pipe.predict(X_test)

array([0, 0, 0, ..., 1, 0, 0], dtype=int64)

In [25]:
pipe.score(X_test, y_test) * 100

86.6
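
`pipe.score` reports plain accuracy, which is easier to interpret next to a trivial baseline. The sketch below shows how a classifier that always predicts the majority class scores on synthetic labels; the 20% positive rate is an assumption for illustration, since the cells above do not report the actual churn rate.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(7)

# Hypothetical labels with about 20% positives (assumed, not measured on the churn data).
y_true = (rng.random(2000) < 0.2).astype(int)
X_dummy = np.zeros((2000, 1))  # DummyClassifier ignores the features

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_pred = baseline.predict(X_dummy)

baseline_acc = accuracy_score(y_true, y_pred)
baseline_rec = recall_score(y_true, y_pred)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"baseline recall  : {baseline_rec:.3f}")
```

Running the same comparison with `y_test` and `pipe.predict(X_test)` would show how much of the 86.6% goes beyond the majority-class baseline, and `recall_score` would show how many churners are actually caught.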

In [30]:
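# hyperparameter tuning
# the 'estimator__' prefix routes each parameter to the pipeline step named 'estimator'
parameters = {
    'estimator__n_estimators' : [100, 150, 200],
    'estimator__max_depth' : [5, 7, 10, 15],
    'estimator__min_samples_split' : [2, 3, 4],
    'estimator__max_features' : [2, 4, 6, 8, 10]
}

# cv is not set, so GridSearchCV falls back to its default 5-fold cross-validation
grid_search = GridSearchCV(
    pipe,
    param_grid=parameters,
    n_jobs=1  # single process; n_jobs=-1 would use all available cores
)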

grid_search.fit(X_train, y_train)
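
A grid search fits one model per parameter combination per cross-validation fold, so the cost grows multiplicatively with each list in the grid. The sketch below counts the fits implied by the grid above using `ParameterGrid`; the fold count of 5 is the `GridSearchCV` default in recent scikit-learn versions, which is assumed here because `cv` is not set in the notebook.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook cell above.
parameters = {
    "estimator__n_estimators": [100, 150, 200],
    "estimator__max_depth": [5, 7, 10, 15],
    "estimator__min_samples_split": [2, 3, 4],
    "estimator__max_features": [2, 4, 6, 8, 10],
}

n_candidates = len(ParameterGrid(parameters))  # 3 * 4 * 3 * 5
n_folds = 5                                    # assumed default cv
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
# -> 180 candidates x 5 folds = 900 fits
```

One more fit follows at the end, because `GridSearchCV` refits the best candidate on the full training set by default.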

In [32]:
grid_search.best_params_

{'estimator__max_depth': 7,
 'estimator__max_features': 10,
 'estimator__min_samples_split': 4,
 'estimator__n_estimators': 100}
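
With the default `refit=True`, `GridSearchCV` already retrains the best candidate on the whole training set and exposes it as `best_estimator_`. The sketch below illustrates this on synthetic data with a deliberately tiny grid; `demo_pipe`, `search` and the data from `make_classification` are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic binary classification data (hypothetical stand-in for the churn set).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("estimator", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Small grid so the sketch runs quickly.
search = GridSearchCV(demo_pipe, {"estimator__max_depth": [3, 5]}, cv=3)
search.fit(X_tr, y_tr)

# best_estimator_ is a complete pipeline, already refit on all of X_tr and y_tr.
best_pipe = search.best_estimator_
test_acc = best_pipe.score(X_te, y_te)

print(search.best_params_)
print(f"mean CV accuracy: {search.best_score_:.3f}")
print(f"test accuracy   : {test_acc:.3f}")
```

In the notebook, `grid_search.best_estimator_.score(X_test, y_test)` would give the tuned pipeline's test accuracy directly.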

In [35]:
# rebuild the pipeline with the values reported by grid_search.best_params_
pipe2 = Pipeline([
    ('preprocessor', preprocessor),
    ('estimator', RandomForestClassifier(n_estimators = 100,
                                         max_features = 10,
                                         max_depth = 7,
                                         min_samples_split = 4))
])

pipe2.fit(X_train, y_train)
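
Retyping the tuned values by hand is easy to get wrong when the grid changes. One option is to clone the existing pipeline and pass the parameter dictionary to `set_params`. This is a sketch of that option, not the notebook's method; `demo_pipe` is a hypothetical two-step pipeline, and the dictionary simply copies the `best_params_` output shown above.

```python
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical pipeline with a step named 'estimator', as in the notebook.
demo_pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("estimator", RandomForestClassifier()),
])

# Values copied from the best_params_ output in the notebook.
best_params = {
    "estimator__max_depth": 7,
    "estimator__max_features": 10,
    "estimator__min_samples_split": 4,
    "estimator__n_estimators": 100,
}

# clone() returns an unfitted copy, so the original pipeline is left untouched.
tuned_pipe = clone(demo_pipe).set_params(**best_params)
tuned_rf = tuned_pipe.named_steps["estimator"]
print(tuned_rf)
```

In the notebook this would read `clone(pipe).set_params(**grid_search.best_params_)`, followed by a call to `fit`.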