**Before you dive into the implementations, I highly recommend first learning the heart of each algorithm—its core idea and how it works. You can explore this through YouTube tutorials, books, or online courses. This repository is meant to complement that knowledge by showing how to translate concepts into working code.**

# Pipelines in Machine Learning

A **pipeline** in machine learning is a sequence of data processing steps that transform raw data into a usable format for modeling and eventually into predictions. It automates the flow of data through various stages, such as preprocessing, feature engineering, model training, and evaluation.

Think of it as an **assembly line in a factory**:
- **Raw data** goes in at one end.
- Each step in the pipeline processes the data.
- The final output is a **trained model** or **predictions**.



## Why Use Pipelines?

1. **Automation**: Reduces manual effort by automating repetitive tasks.
2. **Reproducibility**: Ensures consistency in results by applying the same steps to new data.
3. **Efficiency**: Saves time by integrating multiple steps into a single workflow.
4. **Error Reduction**: Minimizes the risk of mistakes in data preprocessing or feature engineering.
5. **Scalability**: Makes it easier to handle large datasets and complex workflows.



## Key Components of a Machine Learning Pipeline

### 1. **Data Ingestion**
   - Loading raw data from various sources (e.g., databases, CSV files, APIs).
   - Example: Using `pandas` to load a CSV file.
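
As a minimal sketch of the ingestion step, the snippet below reads CSV data with `pandas.read_csv`. The column names and values are hypothetical, and an in-memory string stands in for a real file so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
# With a real file, pass its path to pd.read_csv instead.
csv_text = """age,city,churned
34,Paris,0
51,Berlin,1
29,Paris,0
"""

raw_df = pd.read_csv(io.StringIO(csv_text))

print(raw_df.shape)   # number of rows and columns that were loaded
print(raw_df.dtypes)  # column types inferred by pandas
```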

### 2. **Data Preprocessing**
   - Cleaning, transforming, and preparing data for modeling.
   - Steps include handling missing values, encoding categorical variables, scaling numerical features, etc.
   - Example: Using `sklearn.preprocessing` for scaling or encoding.
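
The preprocessing steps listed above could be combined roughly as follows. This is a sketch under assumptions: the toy DataFrame, its column names, and the choice of median imputation are illustrative and not part of this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: one numeric column with a missing value, one categorical column.
toy_df = pd.DataFrame({
    "age": [34.0, np.nan, 29.0, 51.0],
    "city": ["Paris", "Berlin", "Paris", "Rome"],
})

# Numeric columns: fill missing values with the median, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Apply a different transformation to each group of columns.
preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

processed = preprocessor.fit_transform(toy_df)
print(processed.shape)  # one scaled numeric column plus one column per city
```

A `ColumnTransformer` like this can itself be used as the first step of a `Pipeline`, followed by a model.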

### 3. **Feature Engineering**
   - Creating new features or selecting relevant features to improve model performance.
   - Example: Extracting date-related features (day, month, year) from a timestamp.
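
A short sketch of the date example, assuming a hypothetical `timestamp` column; the `.dt` accessor in pandas exposes the day, month, and year of each value.

```python
import pandas as pd

# Hypothetical event log with a single timestamp column.
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-15 08:30:00", "2024-03-02 17:45:00"]),
})

# Derive simple calendar features from the timestamp.
events["day"] = events["timestamp"].dt.day
events["month"] = events["timestamp"].dt.month
events["year"] = events["timestamp"].dt.year

print(events)
```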

### 4. **Model Training**
   - Training a machine learning model on the processed data.
   - Example: Using `sklearn` to train a Random Forest or Logistic Regression model.

### 5. **Model Evaluation**
   - Assessing the model's performance using metrics like accuracy, precision, recall, or F1-score.
   - Example: Using cross-validation to evaluate the model.
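
One way to cross-validate a whole pipeline is sketched below with `cross_val_score`. The 5-fold setting and the name `cv_pipeline` are choices made for this illustration. Because the scaler sits inside the pipeline, it is refit on the training portion of each fold, so the held-out fold does not influence the scaling.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

cv_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])

# Each fold fits the scaler and the model on the training part only,
# then scores on the held-out part.
scores = cross_val_score(cv_pipeline, X, y, cv=5, scoring="accuracy")

print(scores)
print(scores.mean())
```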

### 6. **Prediction**
   - Using the trained model to make predictions on new data.
   - Example: Deploying the model to predict customer churn.

**Let's look at a basic implementation using scikit-learn.**

In [25]:
#import necessary libraries
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd

In [43]:
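#Load the dataset
iris=load_iris()
X=iris.data
y=iris.target
#use the feature names as columns and keep the target as its own column
df=pd.DataFrame(X, columns=iris.feature_names)
df['target']=y
df.head()


,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0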


In [13]:
#split the data
X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=0.2, random_state=42)

In [14]:
#create a pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()), # standardize each feature to zero mean and unit variance
    ('model', LogisticRegression()) # classifier trained on the scaled features
])

In [15]:
pipeline

In [16]:
#train the model
pipeline.fit(X_train, y_train)
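
To make clear what `fit` and `predict` do on a pipeline, the sketch below compares the pipeline with the equivalent manual steps: the scaler is fitted on the training data only, and the same fitted scaler transforms the test data before prediction. The names `pipe`, `scaler`, and `model` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Pipeline version: one fit call, one predict call.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X_train, y_train)
pipeline_predictions = pipe.predict(X_test)

# Manual version: fit the scaler on the training data only,
# then reuse it to transform both the training and the test data.
scaler = StandardScaler().fit(X_train)
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
manual_predictions = model.predict(scaler.transform(X_test))

print(np.array_equal(pipeline_predictions, manual_predictions))
```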

In [37]:
#evaluate the model
predictions=pipeline.predict(X_test)
#scikit-learn metrics expect the true labels first, then the predictions
accuracy=accuracy_score(y_test, predictions)
class_report= classification_report(y_test, predictions)
print(f'accuracy: {accuracy}\n {class_report}')

accuracy: 1.0
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30



Next, let's apply PCA to the data before feeding it to the model. PCA reduces the number of dimensions while keeping most of the variance in the data.

In [48]:
#import pca
from sklearn.decomposition import PCA


In [30]:
#creating the pipeline
pipeline2 = Pipeline([
    ('scaler', StandardScaler()),#scaling the data
    ('pca', PCA(n_components=2)),#reduced to 2 dimensions
    ('model', LogisticRegression()) #model training
])


In [39]:
pipeline2

In [31]:
#train the model
pipeline2.fit(X_train, y_train)
predictions2=pipeline2.predict(X_test)


In [36]:
#evaluate the model
#scikit-learn metrics expect the true labels first, then the predictions
accuracy2= accuracy_score(y_test, predictions2)
class_report2= classification_report(y_test, predictions2)
print(f'accuracy: {accuracy2}\n {class_report2}')

accuracy: 0.9
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       0.88      0.78      0.82         9
           2       0.83      0.91      0.87        11

    accuracy                           0.90        30
   macro avg       0.90      0.90      0.90        30
weighted avg       0.90      0.90      0.90        30



## Observations

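**Without PCA:** Logistic Regression reaches 100% accuracy on this 30-sample test split, with all four scaled features available to the model.

**With PCA:** Accuracy drops to 90% because projecting the data onto 2 components discards some of the information that helps separate classes 1 and 2. Class 0 is still classified perfectly.

PCA is not necessary for this dataset. It is included here only to show how an extra step fits into a pipeline.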
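
How much variance the two components keep can be checked with the `explained_variance_ratio_` attribute of a fitted `PCA`. The sketch below is an illustration only: for simplicity it fits on the full standardized iris data rather than on the training split used above, so the numbers will differ slightly from what `pipeline2` sees.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first, as in the pipeline, then keep two components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of the total variance captured by each component, and their sum.
retained = pca.explained_variance_ratio_.sum()
print(pca.explained_variance_ratio_)
print(retained)
```

A high retained variance does not guarantee that accuracy is preserved, since the discarded directions can still carry information that separates the classes.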

In [47]:
#save the model
import joblib
joblib.dump(pipeline, 'pipeline.pkl')
joblib.dump(pipeline2, 'pipeline2.pkl')

['pipeline2.pkl']
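
A saved pipeline can later be restored with `joblib.load` and used for prediction without refitting. The sketch below is self-contained and writes to a temporary directory, which is an assumption made for this example; the notebook itself saves to `pipeline.pkl` in the working directory.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

fitted = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X, y)

# Save the fitted pipeline, then load it back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.pkl")
    joblib.dump(fitted, path)
    restored = joblib.load(path)

# The restored pipeline carries the fitted scaler and model with it.
same = np.array_equal(fitted.predict(X), restored.predict(X))
print(same)
```

Files written by `joblib` are pickles, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that created them.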

This notebook covers only a basic implementation. To gain more experience, apply pipelines to larger projects and practice deploying the resulting models.