# What Is a Pipeline?
A Pipeline lets you chain multiple steps together into a single estimator:

- 🧼 Preprocessing (e.g., one-hot encoding, scaling)

- 🔮 Model (e.g., DecisionTree, RandomForest)

Everything then runs in one flow: calling fit or predict on the pipeline applies each step in order, which keeps the code cleaner and more organized.
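As a minimal sketch of the idea, the block below chains a preprocessing step and a model on a tiny table. The toy data and the names `toy`, `toy_pipeline`, and `toy_predictions` are invented for this illustration and are not part of the notebook's dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, made up only to show the mechanics
toy = pd.DataFrame({
    "lunch": ["standard", "free/reduced", "standard",
              "free/reduced", "standard", "standard"],
    "score": [72, 41, 90, 38, 65, 30],
    "label": ["pass", "fail", "pass", "fail", "pass", "fail"],
})
toy_features = toy[["lunch", "score"]]

# Step 1: preprocessing (encode the text column, scale the numeric one)
# Step 2: model
toy_pipeline = Pipeline([
    ("preprocessing", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["lunch"]),
        ("num", StandardScaler(), ["score"]),
    ])),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# One fit call runs every step in order; predict reuses the fitted steps
toy_pipeline.fit(toy_features, toy["label"])
toy_predictions = toy_pipeline.predict(toy_features)
print(toy_predictions)
```

The raw DataFrame goes in at one end and predictions come out at the other, so there is no separately encoded or scaled copy of the data to keep track of.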

In [11]:
import pandas as pd 
df=pd.read_csv("StudentsPerformance.csv")


In [12]:
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75


In [13]:
df.describe()

Unnamed: 0,math score,reading score,writing score
count,1000.0,1000.0,1000.0
mean,66.089,69.169,68.054
std,15.16308,14.600192,15.195657
min,0.0,17.0,10.0
25%,57.0,59.0,57.75
50%,66.0,70.0,69.0
75%,77.0,79.0,79.0
max,100.0,100.0,100.0


In [14]:
df.isnull().sum()

gender                         0
race/ethnicity                 0
parental level of education    0
lunch                          0
test preparation course        0
math score                     0
reading score                  0
writing score                  0
dtype: int64

In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 8 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   gender                       1000 non-null   object
 1   race/ethnicity               1000 non-null   object
 2   parental level of education  1000 non-null   object
 3   lunch                        1000 non-null   object
 4   test preparation course      1000 non-null   object
 5   math score                   1000 non-null   int64 
 6   reading score                1000 non-null   int64 
 7   writing score                1000 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 62.6+ KB


In [16]:
df['average_score'] = (df['math score'] + df['reading score'] + df['writing score']) / 3
df['performance'] = df['average_score'].apply(lambda x: 'pass' if x >= 50 else 'fail')


In [17]:
df.head()

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score,average_score,performance
0,female,group B,bachelor's degree,standard,none,72,72,74,72.666667,pass
1,female,group C,some college,standard,completed,69,90,88,82.333333,pass
2,female,group B,master's degree,standard,none,90,95,93,92.666667,pass
3,male,group A,associate's degree,free/reduced,none,47,57,44,49.333333,fail
4,male,group C,some college,standard,none,76,78,75,76.333333,pass


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   gender                       1000 non-null   object 
 1   race/ethnicity               1000 non-null   object 
 2   parental level of education  1000 non-null   object 
 3   lunch                        1000 non-null   object 
 4   test preparation course      1000 non-null   object 
 5   math score                   1000 non-null   int64  
 6   reading score                1000 non-null   int64  
 7   writing score                1000 non-null   int64  
 8   average_score                1000 non-null   float64
 9   performance                  1000 non-null   object 
dtypes: float64(1), int64(3), object(6)
memory usage: 78.3+ KB


In [20]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

In [21]:
# Define features and target
X = df.drop(columns=['average_score', 'performance'])  
y = df['performance']


In [22]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [24]:
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['parental level of education', 'gender', 'race/ethnicity', 'lunch', 'test preparation course']),
        ('num', StandardScaler(), ['math score', 'reading score', 'writing score'])
    ]
)


In [73]:
print(X.columns.tolist())



['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course', 'math score', 'reading score', 'writing score', 'average_score']


In [81]:
print(X_train.columns.tolist())


['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course', 'math score', 'reading score', 'writing score']


In [25]:

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('scaling', MinMaxScaler()),
    ('model', DecisionTreeClassifier(random_state=42))
])


In [28]:
pipeline.fit(X_train, y_train)

In [45]:
# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

        fail       0.96      0.85      0.90        27
        pass       0.98      0.99      0.99       173

    accuracy                           0.97       200
   macro avg       0.97      0.92      0.94       200
weighted avg       0.97      0.97      0.97       200



In [46]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.975

# Cross Validation using Pipeline 
- Cross validation is a way to check how well your machine learning model will perform on new, unseen data.

- Instead of just training your model once and testing it once, you split your data into several parts (folds).

- You train your model on some parts and test it on the remaining part.

- You repeat this multiple times, each time with a different test part.

- At the end, you average the results to get a better estimate of how your model performs overall.

In [37]:
from sklearn.model_selection import cross_val_score



In [38]:
# Perform cross-validation
scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')

### Elaboration
- cross_val_score(...)
    - This function estimates how well your model performs on unseen data by repeatedly splitting the dataset into training and test parts.
- X, y
    - X: Your features (gender, parental level of education, test scores, etc.)
    - y: Your target; in this case, the performance label (pass or fail).
- cv=5
    - This stands for 5-fold cross-validation.
    - Your data is split into 5 parts of roughly equal size. For a classifier, scikit-learn uses stratified folds, so each part keeps about the same pass/fail ratio.
    - The model trains on 4 parts and is tested on the remaining part.
    - This repeats 5 times, each time using a different part for testing.
    - Averaging over the 5 runs gives a more reliable accuracy estimate than a single train/test split.
- scoring='accuracy'
    - This tells the function to use accuracy (the fraction of predictions that were correct) to evaluate the model.
- scores = ...
    - This saves the 5 accuracy scores (one per fold) into a NumPy array called scores.
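To make the fold mechanics concrete, here is a rough sketch of the loop that cross_val_score performs, written out by hand. It assumes synthetic data from make_classification and a small stand-in pipeline (`demo_pipeline`); neither comes from the notebook.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the students dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# Manual 5-fold loop: train on 4 parts, score on the held-out part
folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    fold_model = clone(demo_pipeline)  # fresh, unfitted copy for each fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The one-line version used in the notebook
library_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="accuracy")

print("manual :", manual_scores)
print("library:", library_scores)
```

Because the whole pipeline is refit inside each fold, the scaler only ever sees that fold's training rows, which is one practical reason to cross-validate the pipeline rather than a bare model.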


In [39]:
print("Cross-validation accuracy scores:", scores)
print("Average accuracy:", scores.mean())

Cross-validation accuracy scores: [0.96  0.98  0.98  0.975 0.98 ]
Average accuracy: 0.975


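**Why Cross-Validation Is Important:**
- It helps detect overfitting (a model that is too specific to its training data), because every score comes from data the model was not trained on.
- It gives a better estimate of model performance on new data than a single train/test split.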

# Grid search using pipeline
## What Is GridSearch?
- GridSearchCV stands for Grid Search Cross Validation.
- It automatically tries every combination of the parameter values you list and finds the combination that gives the highest accuracy (or any other score you choose).

## Cross validation vs GridSearch validation
- 1. What Cross-Validation Does
  - Cross-validation splits your data (e.g., into 5 parts).
  - It trains and tests your model multiple times, each time on a different split of the data.
  - It answers the question: "How well does your current model perform?"

- 2. What GridSearch Does
  - GridSearch tries many versions of your model using different settings (called hyperparameters).
  - For each version, it uses cross-validation to test it.
  - Then it picks the version that gives the best performance.
  - It's like saying: "Let's try different settings and see which ones give the best results."

| Aspect | Cross-Validation | GridSearchCV |
|---|---|---|
| Purpose | Evaluate how good your model is | Find the best model settings (hyperparameters) |
| Uses cross-validation? | ✅ Yes | ✅ Yes (internally) |
| Tests different models? | ❌ No, uses one model | ✅ Yes, tries many combinations |
| Returns best model? | ❌ No, just scores | ✅ Yes, gives best model with best settings |



- When to Use Each?
   - Use cross-validation to evaluate your model.
   - Use GridSearchCV when you want to improve your model by finding the best parameters.
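The relationship between the two can be sketched by writing a grid search by hand: loop over every parameter combination, cross-validate each one, and keep the best mean score. The synthetic data, the `demo_pipeline` stand-in, and the small `demo_grid` below are assumptions made for illustration only.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, ParameterGrid, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the students dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])
demo_grid = {
    "model__max_depth": [2, 4],
    "model__criterion": ["gini", "entropy"],
}

# Hand-written grid search: one cross-validation run per combination
results = []
for params in ParameterGrid(demo_grid):
    candidate = clone(demo_pipeline).set_params(**params)
    mean_score = cross_val_score(
        candidate, X_demo, y_demo, cv=5, scoring="accuracy"
    ).mean()
    results.append((mean_score, params))

best_manual_score = max(score for score, _ in results)

# GridSearchCV does the same bookkeeping and also refits the winner
search = GridSearchCV(demo_pipeline, demo_grid, cv=5, scoring="accuracy")
search.fit(X_demo, y_demo)

print("combinations tried:", len(results))
print("best manual score :", best_manual_score)
print("GridSearchCV score:", search.best_score_)
```

In practice GridSearchCV is the convenient choice, since it also stores per-combination results in cv_results_ and refits the best combination for you.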

In [5]:
from sklearn.model_selection import GridSearchCV

In [48]:

param_grid = {
    'model__max_depth': [3, 5, 10],
    'model__criterion': ['gini', 'entropy']
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

# Save the best pipeline found by the search (joblib must be imported first)
import joblib
joblib.dump(grid_search.best_estimator_, 'best_pipeline.pkl')

['best_pipeline.pkl']

## Elaboration
**What is param_grid?** 
- It’s a dictionary that defines which parameters to test and what values to try for each.
- model__max_depth
    - This is the maximum depth of the decision tree.
        - max_depth = 3: The tree can only go 3 levels deep.
        - max_depth = 5: It can go deeper, to 5 levels.
        - max_depth = 10: Even deeper tree, more complex model.
    - Deeper trees can fit the data better, but they might also overfit.

- model__criterion
    - This decides how the decision tree measures the quality of a split at each node:
        - 'gini': Uses Gini impurity to decide splits (default).
        - 'entropy': Uses information gain, based on Shannon entropy.

➡️ Both are measures of how "pure" a node is. 'entropy' is a bit more computationally heavy but often works similarly.
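For intuition, both purity measures can be computed by hand from the class counts in a node. The helper names `gini_impurity` and `entropy` below are illustrative, not scikit-learn functions; they follow the standard textbook definitions.

```python
import math

def gini_impurity(class_counts):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

def entropy(class_counts):
    """Shannon entropy in bits: sum(-p_k * log2(p_k)), skipping empty classes."""
    total = sum(class_counts)
    return sum(
        -(count / total) * math.log2(count / total)
        for count in class_counts
        if count > 0
    )

pure_node = [10, 0]   # all samples belong to one class
mixed_node = [5, 5]   # samples split evenly between two classes

print("pure :", gini_impurity(pure_node), entropy(pure_node))
print("mixed:", gini_impurity(mixed_node), entropy(mixed_node))
```

Both measures are zero for a pure node and reach their maximum for an even split, which is why the two criteria often lead to similar trees.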

#### Why model__max_depth and not just max_depth?
- This is because you're using a pipeline, and your decision tree model is inside the pipeline under the name 'model'. So you have to use the form step_name__parameter_name, with a double underscore between the step name and the parameter name.

- If your pipeline step is named 'model', then use:

   - 'model__max_depth'

   - 'model__criterion'
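One way to check which names are valid is to ask the pipeline itself. The sketch below uses a small stand-in pipeline (`demo_pipeline`, an assumption for this example) and lists the parameters that belong to the step named 'model'.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

# get_params() lists every tunable name; nested ones carry the step prefix
tunable_names = sorted(
    name for name in demo_pipeline.get_params() if name.startswith("model__")
)
print(tunable_names)

# The prefixed name reaches the tree inside the pipeline
demo_pipeline.set_params(model__max_depth=3)
print(demo_pipeline.named_steps["model"].max_depth)

# The bare name is rejected, because the Pipeline itself has no max_depth
try:
    demo_pipeline.set_params(max_depth=3)
    bare_name_rejected = False
except ValueError:
    bare_name_rejected = True
print("bare name rejected:", bare_name_rejected)
```

If the step were named 'tree' instead of 'model', the grid keys would become 'tree__max_depth' and 'tree__criterion'.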


In [95]:
print(grid_search.best_params_)       # Best parameter combination
print(grid_search.best_score_)        # Best cross-validated accuracy
best_model = grid_search.best_estimator_  # Pipeline with best params


{'model__criterion': 'entropy', 'model__max_depth': 10}
0.9862500000000001


In [96]:
best_model = grid_search.best_estimator_
predictions = best_model.predict(X_test)


##### What does grid_search.best_estimator_ contain?
It contains the full pipeline, refit on all of the training data passed to fit, with:

- The fitted preprocessing steps (here the ColumnTransformer with OneHotEncoder and StandardScaler, followed by MinMaxScaler)

- The fitted model (here DecisionTreeClassifier)

- The best parameters chosen by GridSearchCV

So when you save best_estimator_, you're saving everything needed to make predictions.
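A quick way to confirm this is to inspect the steps of best_estimator_ after a search. The sketch below runs a small search on synthetic data; `demo_pipeline`, the data, and the two-value grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the students dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", DecisionTreeClassifier(random_state=0)),
])

search = GridSearchCV(demo_pipeline, {"model__max_depth": [2, 4]}, cv=3)
search.fit(X_demo, y_demo)

best = search.best_estimator_

# The best estimator is itself a Pipeline with the same step names
step_names = list(best.named_steps)
print(step_names)

# The model step already carries the winning hyperparameter value
chosen_depth = best.named_steps["model"].max_depth
print(chosen_depth, search.best_params_)

# It is already fitted, so it can predict on raw (unscaled) rows directly
sample_predictions = best.predict(X_demo[:5])
print(sample_predictions)
```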

In [97]:
from sklearn.metrics import accuracy_score
test_accuracy = accuracy_score(y_test, predictions)
print("Test Accuracy:", test_accuracy)


Test Accuracy: 0.975


# Exporting the Pipeline
- Exporting a Pipeline in scikit-learn means saving your trained model (and all preprocessing steps in the pipeline) to a file, so that you can use it later without retraining. This is very useful in real-world applications where you want to deploy your model.

### Why Export a Pipeline?
- To reuse the trained model without retraining it every time.
- To deploy it in production (e.g., in a web app or mobile app).
- To share the model with others.
- To store the best model found via GridSearchCV.

### What is pickle?
- pickle is a module in Python's standard library that can:
    - Convert a Python object into bytes (this process is called serialization).
    - Save those bytes to a file.
    - Later, load (deserialize) the bytes back into the original object.

In short:
pickle helps you save Python data to a file and load it back when needed. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.

### What is joblib?
- joblib is a third-party Python library (it is not part of the standard library, but it is installed automatically as a dependency of scikit-learn) that is similar to pickle, but:
    - It is optimized for objects that contain large NumPy arrays, such as many machine learning models.
    - It is often faster than pickle for these large objects.

In short:
joblib is like a specialized version of pickle made for ML models and large data.

- Real-world analogy:
pickle = A basic USB stick (good for general use).

joblib = A high-speed SSD (best for large ML files).



#### Why are pickle and joblib used?
- In machine learning, training a model (like a decision tree or SVM) takes time and resources, and you don't want to retrain it every time you need it. So you save the trained model using pickle or joblib.

### Main reasons to use them:
| Purpose | Explanation |
|---|---|
| Save trained models | So you can reuse them later without training again. |
| Deploy models | You can load them in other apps (like websites or apps using your model). |
| Share models | You can give the file to someone else who can load and use it. |
| Save preprocessing steps | Pipelines (e.g. scalers, encoders) can also be saved and reused. |



#### Example of joblib:
import joblib
# Save the model (model stands for any fitted estimator or pipeline)
joblib.dump(model, 'my_model.pkl')
# Load it later
loaded_model = joblib.load('my_model.pkl')

#### Example of pickle
import pickle
# Save the model (model stands for any fitted estimator or pipeline)
with open('my_model.pkl', 'wb') as f:
    pickle.dump(model, f)
# Load it back
with open('my_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)
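To see the round trip end to end, the following sketch fits a small pipeline, saves it with joblib, loads it back, and checks that the restored copy predicts the same labels. The synthetic data, `demo_pipeline`, and the temporary file name are assumptions for this example; it writes to a temporary directory so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the students dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

demo_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", DecisionTreeClassifier(max_depth=3, random_state=0)),
])
demo_pipeline.fit(X_demo, y_demo)

# Save to a temporary file, then load it back as a new object
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    joblib.dump(demo_pipeline, model_path)
    restored_pipeline = joblib.load(model_path)

# The restored pipeline needs no retraining and gives identical predictions
same_predictions = np.array_equal(
    demo_pipeline.predict(X_demo), restored_pipeline.predict(X_demo)
)
print("identical predictions:", same_predictions)
```

A saved file is generally only safe to load with the same scikit-learn version that created it, so it is worth recording the library version alongside the model file.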
