In [14]:
import warnings
warnings.filterwarnings('ignore')

In [15]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

# Pipelines and grid search 

Grid search is a technique used in machine learning to find the best hyperparameters for a given model. Hyperparameters are parameters that are not learned from the data but are set before training the model. These can include things like learning rate, regularization strength, number of trees in a random forest, kernel type in a support vector machine, etc.
How Grid Search Works:

    Specify the hyperparameters to tune: You first decide which hyperparameters you want to optimize and define a list of possible values for each.

    Create a grid: The grid consists of all possible combinations of the candidate values. For example, if you want to tune two hyperparameters of a gradient-boosted ensemble, the learning rate and the number of trees, the grid might look like this:
        Learning rate: [0.001, 0.01, 0.1]
        Number of trees: [10, 50, 100]

    Evaluate all combinations: Grid search systematically evaluates all combinations of hyperparameters by training the model with each combination, often using cross-validation to assess the model’s performance.

    Select the best combination: Once all combinations are tested, the hyperparameter combination that provides the best performance (e.g., highest accuracy, lowest error) is selected.
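
To make the "all possible combinations" step concrete, here is a minimal sketch that enumerates the example grid by hand with the standard library. The dictionary keys `learning_rate` and `n_trees` are illustrative names, not parameters of a specific scikit-learn estimator.

```python
from itertools import product

# Candidate values from the example grid above
grid = {
    "learning_rate": [0.001, 0.01, 0.1],
    "n_trees": [10, 50, 100],
}

# Cartesian product of the value lists: one dict per combination
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))  # 3 * 3 = 9 combinations
print(combinations[0])    # {'learning_rate': 0.001, 'n_trees': 10}
```

A grid search trains and scores the model once per entry in `combinations` (and once per fold when cross-validation is used).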

```python

# Grid search example with a random forest

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Define the model
model = RandomForestClassifier()

# Specify the hyperparameters to tune
param_grid = {
    'n_estimators': [100, 500, 1000],
    'max_depth': [10, 20, None]
}

# Perform grid search with 5-fold cross-validation
# (X_train and y_train are assumed to be defined already)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get the best parameters
print(grid_search.best_params_)


```

Pros:

    Simple to implement and easy to understand.
    Guarantees finding the best-scoring combination among those in the defined grid, as measured by the chosen cross-validation metric. Values outside the grid are never tried.

Cons:

    Can be computationally expensive, especially if the search space is large, because it evaluates all combinations.
    Doesn’t scale well to models with many hyperparameters.
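
The cost grows multiplicatively: the number of fits is the product of the list lengths times the number of cross-validation folds. A rough sketch of that arithmetic, using the random forest grid from the example above (`n_fits` is a hypothetical helper written for this note, not a scikit-learn function):

```python
from math import prod

def n_fits(param_grid, cv):
    """Number of model fits in a full grid search (not counting the final refit)."""
    return prod(len(values) for values in param_grid.values()) * cv

param_grid = {"n_estimators": [100, 500, 1000], "max_depth": [10, 20, None]}
print(n_fits(param_grid, cv=5))  # 9 combinations * 5 folds = 45
```

Adding a third hyperparameter with four candidate values would already raise this to 180 fits.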

# Pipelines 

In machine learning, a pipeline is a way of organizing and streamlining the various steps in a machine learning workflow. It ensures that all steps, from data preprocessing to model evaluation, are executed in a consistent and reproducible manner.
Components of a Pipeline:

    Data Preprocessing: This includes steps like data cleaning (handling missing values), scaling or normalizing features, encoding categorical variables, and feature selection.

    Model Training: The actual machine learning algorithm (e.g., decision trees, support vector machines) is applied to the processed data to build a model.

    Model Evaluation: This step involves evaluating the model’s performance using metrics like accuracy, precision, recall, etc., typically with a validation set or using cross-validation.

Why Use Pipelines?

    Streamlined Workflow: Pipelines allow you to chain multiple steps (like preprocessing and model training) together into a single object. This reduces the risk of errors when manually performing each step individually.

    Consistency: With a pipeline, each preprocessing step is fitted on the training data only and then applied unchanged to the testing data. This keeps information from the test set out of training, which is crucial for an honest estimate of how the model generalizes.

    Reusability: Pipelines can be reused and shared with others, making it easier to apply the same sequence of operations to different datasets.

    Ease of Hyperparameter Tuning: When performing grid search or other hyperparameter optimization methods, pipelines ensure that all transformations are applied to each fold of the data in the correct order.
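
The consistency point can be checked directly. The sketch below uses made-up toy numbers in which the largest value sits in the test row, and shows that the scaler inside the pipeline only ever learns the range of the training rows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up): the largest feature value is in the test row
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[10.0]])

pipe = Pipeline([("scaler", MinMaxScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)

# The scaler learned its range from the training rows only
scaler = pipe.named_steps["scaler"]
print(scaler.data_min_, scaler.data_max_)  # [1.] [4.]

# predict() reuses the fitted range; nothing is refitted on X_test
print(scaler.transform(X_test))  # [[3.]]
print(pipe.predict(X_test))
```

The test value maps to 3.0, outside the [0, 1] range, because the scaler was never shown it during fitting.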

```python

# pipeline example 

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sample data (features and target)
X = ...  # feature matrix (assumed here to hold categorical columns only)
y = ...  # target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a pipeline
pipeline = Pipeline([
    ('ohe', OneHotEncoder(handle_unknown='ignore')),  # Step 1: One-hot encode the categories
    ('std', StandardScaler(with_mean=False)),  # Step 2: Scale (with_mean=False because the encoder output is sparse)
    ('clf', RandomForestClassifier())  # Step 3: Train a Random Forest model
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = (y_pred == y_test).mean()
print(f'Accuracy: {accuracy}')


```

# Pipelines with grid search

```python 

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'clf__n_estimators': [50, 100],  # 'clf' is the name of the RandomForest step in the pipeline
    'clf__max_depth': [10, 20]
}

# Perform grid search on the pipeline
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get best parameters
print("Best parameters:", grid_search.best_params_)

```
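
The `step__parameter` names used in the grid follow the pipeline's own naming scheme: the step name, a double underscore, then the parameter of that step. A small sketch of how to list and use those names, assuming a two-step pipeline with steps called `std` and `clf`:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("std", StandardScaler()),
    ("clf", RandomForestClassifier()),
])

# get_params() lists every tunable name; nested ones use '__'
clf_params = [name for name in pipeline.get_params() if name.startswith("clf__")]
print("clf__n_estimators" in clf_params)  # True
print("clf__max_depth" in clf_params)     # True

# The same names work with set_params
pipeline.set_params(clf__n_estimators=50)
print(pipeline.named_steps["clf"].n_estimators)  # 50
```

If a key in `param_grid` does not match one of these names, `GridSearchCV` raises an error when fitting.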

## Logistic regression without pipelines

In [16]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

In [17]:
# data loading
df =  pd.read_csv("data/titanic_data.csv",index_col="PassengerId")
df.head()

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


Age, Cabin, and Embarked contain null values. We drop Cabin, which is mostly empty, and impute Age.

In [19]:
df.drop("Cabin",axis=1,inplace=True)

We also drop Embarked, Ticket, and Name.

In [20]:
df.drop(["Embarked","Ticket","Name"],axis=1,inplace=True)

In [21]:
orginal_df = df
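
Note that plain assignment binds a second name to the same DataFrame instead of copying it, so the in-place Age imputation in the next cell also shows up in `orginal_df`. A minimal sketch of the difference, on a made-up three-row frame:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({"Age": [22.0, np.nan, 26.0]})  # made-up rows

alias = df_demo            # second name for the same object
snapshot = df_demo.copy()  # independent copy

# In-place column assignment, similar to the imputation step
df_demo["Age"] = df_demo["Age"].fillna(df_demo["Age"].mean())

print(alias is df_demo)              # True
print(alias["Age"].isna().sum())     # 0 -> the alias sees the imputed values
print(snapshot["Age"].isna().sum())  # 1 -> the copy keeps the missing value
```

Use `df.copy()` when an untouched snapshot of the data is needed later.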

In [22]:
imputer =  SimpleImputer(strategy="mean")
df["Age"] = imputer.fit_transform(df[["Age"]])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int64(4), object(1)
memory usage: 55.7+ KB


In [23]:
df.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
1,0,3,male,22.0,1,0,7.25
2,1,1,female,38.0,1,0,71.2833
3,1,3,female,26.0,0,0,7.925
4,1,1,female,35.0,1,0,53.1
5,0,3,male,35.0,0,0,8.05


### One-hot encode categorical columns

In [24]:
df =  pd.get_dummies(df,columns=["Sex"],drop_first=True,dtype=int)
df.head()

PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male
1,0,3,22.0,1,0,7.25,1
2,1,1,38.0,1,0,71.2833,0
3,1,3,26.0,0,0,7.925,0
4,1,1,35.0,1,0,53.1,0
5,0,3,35.0,0,0,8.05,1


In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Age       891 non-null    float64
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Fare      891 non-null    float64
 6   Sex_male  891 non-null    int32  
dtypes: float64(2), int32(1), int64(4)
memory usage: 52.2 KB


In [26]:
X = df.drop("Survived",axis =1)
y = df["Survived"]
X.head()

PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_male
1,3,22.0,1,0,7.25,1
2,1,38.0,1,0,71.2833,0
3,3,26.0,0,0,7.925,0
4,1,35.0,1,0,53.1,0
5,3,35.0,0,0,8.05,1


In [27]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    891 non-null    int64  
 1   Age       891 non-null    float64
 2   SibSp     891 non-null    int64  
 3   Parch     891 non-null    int64  
 4   Fare      891 non-null    float64
 5   Sex_male  891 non-null    int32  
dtypes: float64(2), int32(1), int64(3)
memory usage: 45.2 KB


In [28]:
# scale numerical columns to the [0, 1] range (MinMaxScaler, fitted here on all rows before the split)
scaler = MinMaxScaler()
X[["Age","Parch","Fare"]] = scaler.fit_transform(X[["Age","Parch","Fare"]])
X.head()

PassengerId,Pclass,Age,SibSp,Parch,Fare,Sex_male
1,3,0.271174,1,0.0,0.014151,1
2,1,0.472229,1,0.0,0.139136,0
3,3,0.321438,0,0.0,0.015469,0
4,1,0.434531,1,0.0,0.103644,0
5,3,0.434531,0,0.0,0.015713,1


In [29]:
# test_size=20 is an absolute count: exactly 20 rows are held out for testing
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=20,random_state=42)

In [30]:
model =  LogisticRegression(max_iter=1000)

model.fit(X_train,y_train)
y_pred = model.predict(X_test)

In [31]:
print(classification_report(y_pred=y_pred,y_true=y_test))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91        10
           1       1.00      0.80      0.89        10

    accuracy                           0.90        20
   macro avg       0.92      0.90      0.90        20
weighted avg       0.92      0.90      0.90        20



## With a Grid Search

In [32]:
# Define the hyperparameters to tune
param_grid = {
    'C': [0.1, 1, 10],            # Inverse of regularization strength (smaller values regularize more)
    'solver': ['liblinear', 'saga'],  # Optimization algorithms
    'penalty': ['l2', 'l1'],         # Regularization types
    'class_weight': [None, 'balanced']  # Handle imbalanced classes (optional)
}


In [33]:
# Set up GridSearchCV

grid_search = GridSearchCV(estimator=LogisticRegression(),param_grid=param_grid,cv=20)


In [34]:
# Run the grid search: fit every combination with cross-validation, then refit the best one on X_train
grid_search.fit(X_train,y_train)

In [35]:
grid_search.best_params_

{'C': 0.1, 'class_weight': None, 'penalty': 'l2', 'solver': 'saga'}

In [36]:
grid_search.predict(X_test)

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
      dtype=int64)

In [37]:
# Output the best parameters and score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

# Evaluate the model on the test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print("Test set score: {:.2f}".format(test_score))

Best parameters found:  {'C': 0.1, 'class_weight': None, 'penalty': 'l2', 'solver': 'saga'}
Best cross-validation score: 0.80
Test set score: 0.90
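
Besides `best_params_` and `best_score_`, the fitted search keeps the score of every combination in `cv_results_`. The sketch below shows one way to inspect it. It runs on synthetic data from `make_classification` instead of the Titanic frame, and the grid is reduced to two parameters to keep it short.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10], "class_weight": [None, "balanced"]},
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per combination, best-ranked first
results = pd.DataFrame(search.cv_results_)
cols = ["param_C", "param_class_weight", "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

Comparing `mean_test_score` with `std_test_score` shows whether the winning combination is clearly better or within fold-to-fold noise of the others.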


## With Pipelines 

In [38]:
# import pipelines
from sklearn.pipeline import Pipeline


In [39]:
# recheck data 
df.head()


PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male
1,0,3,22.0,1,0,7.25,1
2,1,1,38.0,1,0,71.2833,0
3,1,3,26.0,0,0,7.925,0
4,1,1,35.0,1,0,53.1,0
5,0,3,35.0,0,0,8.05,1


In [40]:
# re assign X and y
X=df.drop("Survived",axis=1)
y=df["Survived"]

In [41]:
# train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=20,random_state=42)

In [42]:
# Build a pipeline for scaling and modeling

pipe = Pipeline([
    ("one",MinMaxScaler()),
    ("model",LogisticRegression())
])



In [43]:
# fit pipeline and make predictions
pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)


In [44]:
print(classification_report(y_pred=y_pred,y_true=y_test))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91        10
           1       1.00      0.80      0.89        10

    accuracy                           0.90        20
   macro avg       0.92      0.90      0.90        20
weighted avg       0.92      0.90      0.90        20



## Pipelines with Grid Search

In [45]:
# redefine the parameter grid, prefixing each name with the pipeline step name ('model__')
param_grid = {
    'model__C': [0.1, 1, 10],            # Inverse of regularization strength (smaller values regularize more)
    'model__solver': ['liblinear', 'saga'],  # Optimization algorithms
    'model__penalty': ['l2', 'l1'],         # Regularization types
    'model__class_weight': [None, 'balanced']  # Handle imbalanced classes (optional)
}

In [46]:
# define a grid

grid = GridSearchCV(estimator=pipe,param_grid=param_grid)

In [47]:
# fit the grid search and predict with the best estimator
grid.fit(X_train,y_train)

y_pred = grid.predict(X_test)

In [48]:
print(classification_report(y_pred=y_pred,y_true=y_test))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91        10
           1       1.00      0.80      0.89        10

    accuracy                           0.90        20
   macro avg       0.92      0.90      0.90        20
weighted avg       0.92      0.90      0.90        20



## Changing models / steps in a pipeline


In [49]:
# switching to a decision tree (and to StandardScaler for the scaling step)
from sklearn.tree import DecisionTreeClassifier
pipe.set_params(one=StandardScaler())
pipe.set_params(model=DecisionTreeClassifier())



In [50]:
# fit and make predictions 
pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)



In [51]:
print(classification_report(y_true=y_test,y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.75      0.90      0.82        10
           1       0.88      0.70      0.78        10

    accuracy                           0.80        20
   macro avg       0.81      0.80      0.80        20
weighted avg       0.81      0.80      0.80        20



In [52]:
# TODO: switch to any other model and also run a grid search
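
One possible way to approach this exercise is sketched below. It runs on synthetic data instead of the Titanic frame, and the choice of a decision tree and of these candidate values is only an assumption for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

tree_pipe = Pipeline([
    ("one", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=42)),
])

# Parameter names are prefixed with the step name 'model'
tree_grid = GridSearchCV(
    estimator=tree_pipe,
    param_grid={"model__max_depth": [2, 4, 8, None], "model__min_samples_leaf": [1, 5, 10]},
    cv=5,
)
tree_grid.fit(X_tr, y_tr)

print(tree_grid.best_params_)
print(tree_grid.score(X_te, y_te))
```

Only the `model` step and the grid keys change; the rest of the workflow stays as it was for logistic regression.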

## Column Transformers

A **ColumnTransformer** in machine learning is used to apply different preprocessing techniques to different subsets of columns (features) in a dataset. It allows you to transform numerical and categorical columns with different operations, such as scaling numerical data or encoding categorical data, in a clean and efficient way.

### Benefits:
1. **Streamlined preprocessing**: You can apply different transformations to different columns in a single, unified step.
2. **Cleaner code**: Organizes preprocessing tasks and avoids manually separating data by column types.
3. **Flexibility**: You can specify custom transformations for each set of columns (e.g., scaling for numerical columns, one-hot encoding for categorical columns).
4. **Improved pipeline integration**: It integrates well within machine learning pipelines, ensuring consistency when training and testing models.

In [53]:
# imports 
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer

Original data inspection

In [54]:
orginal_df.head()

PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare
1,0,3,male,22.0,1,0,7.25
2,1,1,female,38.0,1,0,71.2833
3,1,3,female,26.0,0,0,7.925
4,1,1,female,35.0,1,0,53.1
5,0,3,male,35.0,0,0,8.05


Custom Function

In [55]:
# custom function 
def plus_one(x):
    return x + 1
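
A plain Python function becomes usable as a pipeline or `ColumnTransformer` step once it is wrapped in `FunctionTransformer`. The sketch below wraps `np.log1p` as an example. Applying a log transform to Fare is an assumption made here for illustration, not something the notebook does.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Fare values from the first rows of the data shown earlier
fares = np.array([[7.25], [71.2833], [7.925]])

# Wrap an ordinary function so it exposes fit/transform
log_fare = FunctionTransformer(np.log1p)
print(log_fare.fit_transform(fares))
```

The wrapped object can then be listed in a `ColumnTransformer` like any other transformer, for example as `('log', log_fare, ['Fare'])`.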

## Transformers

In [56]:
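# Creating the transformer
# Note: each entry runs independently on the original columns, so 'impute' and 'std'
# each output their own version of Age and Fare; they are not chained.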
transfomer = ColumnTransformer([
    ('ohe',OneHotEncoder(),['Sex']),
    ('impute',SimpleImputer(strategy='mean'),['Age','Fare']),
    ('std',MinMaxScaler(),['Age','Fare'])
])
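
A quick way to see what a `ColumnTransformer` produces is to check the shape of its output on a few rows. The sketch below rebuilds the same three entries on a made-up four-column frame. It shows that the entries run side by side instead of being chained, and that columns not listed anywhere (here `Pclass`) are dropped by default.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up rows in the same layout as the Titanic columns
demo = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "Fare": [7.25, 71.2833, 7.925],
    "Pclass": [3, 1, 3],
})

ct = ColumnTransformer([
    ("ohe", OneHotEncoder(), ["Sex"]),
    ("impute", SimpleImputer(strategy="mean"), ["Age", "Fare"]),
    ("std", MinMaxScaler(), ["Age", "Fare"]),
])

out = ct.fit_transform(demo)
print(out.shape)  # (3, 6): 2 one-hot + 2 imputed + 2 scaled; Pclass is dropped
```

To impute and then scale the same column, put both steps in a `Pipeline` and pass that pipeline as a single entry. Passing `remainder='passthrough'` keeps the unlisted columns.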


In [57]:
# Same idea with one Pipeline per column group, so imputation and scaling are chained
transformer2 = ColumnTransformer([
    ("cat",Pipeline([
        ("ohe",OneHotEncoder())
        ])
    ,["Sex"]
    ),
    ("num",Pipeline([
        ("imputer",SimpleImputer(strategy="mean")),
        ("scaler",MinMaxScaler())
        ]),
     ["Age"]
    )
])

In [58]:
# Pipeline
pipe = Pipeline([
    ('pre-pro',transfomer),
    ('model',LogisticRegression())
])

In [59]:
# re assign X and y
X=orginal_df.drop("Survived",axis=1)
y=orginal_df["Survived"]

In [60]:
# train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=20,random_state=42)

In [61]:
# fit and make predictions 
pipe.fit(X_train,y_train)
y_pred = pipe.predict(X_test)

In [62]:
y_pred

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0],
      dtype=int64)

In [63]:
print(classification_report(y_true=y_test,y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86        10
           1       0.89      0.80      0.84        10

    accuracy                           0.85        20
   macro avg       0.85      0.85      0.85        20
weighted avg       0.85      0.85      0.85        20



In [64]:
# cat_ord=['grades']
transformer3 = ColumnTransformer([
    ('ohe_sex',Pipeline([
        ('impute_mode',SimpleImputer(strategy='most_frequent')),
        ('ohe',OneHotEncoder())
    ]),['Sex']),
    
    ('pre_age',Pipeline([
        ('impute_mean',SimpleImputer(strategy='mean')),
        ('scaler',MinMaxScaler())
    ]),['Age'])

    # ('ordinal',Pipeline([

    # ]),cat_ord)
])

In [65]:
# Pipeline
pipe3 = Pipeline([
    ('pre-pro',transformer3),
    ('model',LogisticRegression())
])

In [66]:
# re assign X and y
X=orginal_df.drop("Survived",axis=1)
y=orginal_df["Survived"]

In [67]:
# train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=20,random_state=42)

In [68]:
# fit and make predictions 
pipe3.fit(X_train,y_train)
y_pred = pipe3.predict(X_test)

In [69]:
print(classification_report(y_true=y_test,y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86        10
           1       0.89      0.80      0.84        10

    accuracy                           0.85        20
   macro avg       0.85      0.85      0.85        20
weighted avg       0.85      0.85      0.85        20



In [70]:
from sklearn.ensemble import RandomForestClassifier
pipe3.set_params(model=RandomForestClassifier())
pipe3.fit(X_train,y_train)
y_pred = pipe3.predict(X_test)
print(classification_report(y_true=y_test,y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.67      0.80      0.73        10
           1       0.75      0.60      0.67        10

    accuracy                           0.70        20
   macro avg       0.71      0.70      0.70        20
weighted avg       0.71      0.70      0.70        20
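
With only 20 test rows, differences such as 0.85 versus 0.70 rest on a handful of passengers. Cross-validating the whole pipeline is one way to get a steadier estimate. The sketch below does this on synthetic data, with a simple scaler-plus-forest pipeline standing in for `pipe3`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

rf_pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Each fold refits the full pipeline, preprocessing included
scores = cross_val_score(rf_pipe, X_demo, y_demo, cv=5)
print(scores.mean(), scores.std())
```

The mean and spread across folds give a better basis for comparing models than a single small hold-out set.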

