In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, MinMaxScaler
import statsmodels.api as sm
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
import scipy.stats as stats
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE

In [None]:
# Fit the preprocessor on the training data only
#preprocessor.fit(X_train)

# Transform the training, validation, and test sets using the same preprocessor

#X_train_preprocessed = preprocessor.transform(X_train)
#X_val_preprocessed = preprocessor.transform(X_val)
#X_test_preprocessed = preprocessor.transform(X_test)

# Output the shapes of the processed datasets to confirm transformation

#print("Training set shape:", X_train_preprocessed.shape)
#print("Validation set shape:", X_val_preprocessed.shape)
#print("Test set shape:", X_test_preprocessed.shape)

In [2]:
data = pd.read_csv('cleaned_data_telecom.csv')  

In [3]:
data_no_total = data.drop(['total_charges'], axis=1).reset_index(drop=True)

In [6]:
from imblearn.pipeline import Pipeline  # Use imblearn's Pipeline to support SMOTE
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

# Assuming `data_no_total` is your DataFrame and `churn` is the target column
features = data_no_total.columns.drop('churn')
target = 'churn'

# Split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(data_no_total[features], data_no_total[target], test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Define categorical and numerical features
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
numerical_features = X_train.select_dtypes(include=[np.number]).columns.tolist()

# Define the ColumnTransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ]
)
# Create a pipeline with preprocessing and the model
# (SMOTE and feature selection are imported above but not used in this baseline)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

pipeline.fit(X_train, y_train)

# Predict on validation set
y_val_pred = pipeline.predict(X_val)

# Calculate validation accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")

# Display classification report
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred))


Validation Accuracy: 79.24%

Classification Report:
              precision    recall  f1-score   support

          No       0.82      0.92      0.87      1037
         Yes       0.66      0.42      0.52       365

    accuracy                           0.79      1402
   macro avg       0.74      0.67      0.69      1402
weighted avg       0.78      0.79      0.78      1402



In [17]:
pipeline

### Pipeline: Explanation
16. Now let's first explain what a pipeline is: A pipeline in machine learning is a tool for automating a sequence of data transformation and model-building steps. It provides a structured way to define, execute, and reproduce an end-to-end workflow, from data preprocessing to model fitting and evaluation, ensuring that each step is applied consistently and in the right order. Pipelines are especially useful in complex workflows involving multiple preprocessing, transformation, and modeling steps.
17. Key components of a pipeline:
- **Transformers: These are steps that apply transformations to the data, such as:**
- Scaling: Adjusts numerical data to a common scale, often with StandardScaler (standardizes each feature to mean 0 and variance 1) or MinMaxScaler (scales each feature to the [0, 1] range).
- Encoding: Converts categorical data into numerical format, such as one-hot encoding, which is necessary for algorithms that require numerical inputs.
- Imputation: Handles missing values by replacing them with statistical values (like mean, median) or other strategies.
- Feature Selection: Reduces the number of input features by selecting the most relevant ones.
- **Estimator: This is the machine learning model that will learn from the preprocessed data, such as:**
- Linear models: Logistic regression, linear regression.
- Tree-based models: Decision trees, random forests, gradient boosting.
- Support Vector Machines and other classifiers or regressors.
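- **Optional steps: data resampling or feature selection**
- Resampling: Handles class imbalance in the training data, for example by oversampling with SMOTE. Samplers are not supported by scikit-learn's own Pipeline, which is why imblearn's Pipeline is used in this notebook.
- Feature selection: Selects or reduces features based on their importance or correlation.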
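To make the components above concrete, here is a minimal sketch that combines imputation, scaling, encoding, feature selection and an estimator in one pipeline. It is not the notebook's actual pipeline: the data is synthetic, the column names (`tenure`, `monthly_charges`, `contract`) and all `toy_` variables are made up for illustration, and scikit-learn's own `Pipeline` is assumed because no resampling step is involved.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny synthetic frame; the column names are hypothetical.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "tenure": rng.integers(1, 72, n).astype(float),
    "monthly_charges": rng.normal(65, 20, n),
    "contract": rng.choice(["monthly", "yearly", "two_year"], n),
})
toy.loc[::25, "monthly_charges"] = np.nan       # inject a few missing values
toy_y = np.where(toy["tenure"] < 24, "Yes", "No")

# Transformers: imputation + scaling for numbers, one-hot encoding for categories.
toy_numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
toy_preprocessor = ColumnTransformer([
    ("num", toy_numeric, ["tenure", "monthly_charges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract"]),
])

# Full pipeline: transformers, then feature selection, then the estimator.
toy_pipeline = Pipeline([
    ("preprocessor", toy_preprocessor),
    ("select", SelectKBest(f_classif, k=3)),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=0)),
])

toy_pipeline.fit(toy, toy_y)
print(toy_pipeline.predict(toy.head()))
```

Every step is fitted inside `fit`, so the imputer's medians, the scaler's statistics and the selected features all come from the data passed to `fit`.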

18. Fitting the Pipeline: When fit is called on the pipeline, each transformation step is fitted on the data and then transforms it, in order, before the result is passed to the estimator.
- For example, a pipeline with scaling, encoding, and a classifier will first fit the scaler and encoder, scale and encode the data, and then train the classifier on the result.
- Predicting with the Pipeline: When predict is called, the pipeline applies the already fitted transformations to the new data in the same order (without re-fitting them) before passing it to the trained estimator for predictions.
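One way to see this order of operations is a pass-through transformer that only records which of its methods get called. `LoggingTransformer`, `calls` and the random data below are hypothetical helpers written for this sketch; they are not part of scikit-learn or of this notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

calls = []  # records the transformer methods in the order they run


class LoggingTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: leaves the data unchanged, logs each call."""

    def fit(self, X, y=None):
        calls.append("fit")
        return self

    def transform(self, X):
        calls.append("transform")
        return X


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo = Pipeline([
    ("log", LoggingTransformer()),
    ("clf", LogisticRegression()),
])

demo.fit(X_demo[:30], y_demo[:30])
fit_calls = list(calls)      # the transformer is fitted, then transforms the training data
print(fit_calls)

calls.clear()
demo.predict(X_demo[30:])
predict_calls = list(calls)  # only transform runs here; nothing is re-fitted
print(predict_calls)
```

During `predict` the transformer is never fitted again, which is exactly what keeps validation and test data out of the learned preprocessing.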
19. Pros and cons of pipelines:
- **Benefits of Pipelines**
- Consistency and Reproducibility: All transformations are consistently applied in the same order, ensuring that the validation and test sets are processed the same way as the training set. It’s easier to reproduce results and document each step.
- Avoiding Data Leakage: By keeping fitting confined to the training data, pipelines reduce the chance of data leakage. For example, when scaling, the StandardScaler learns the mean and standard deviation from the training set only; the validation/test sets are transformed with those statistics and never influence them.
- Simplifying Code: Rather than writing repetitive code, pipelines consolidate it into one workflow, making the code cleaner and easier to manage.
- Automating Cross-Validation: Pipelines can be used with techniques like cross-validation, allowing you to tune and evaluate models without having to handle data preprocessing manually for each fold.
- Flexibility with Experimentation: You can easily swap or add steps. For example, changing from SMOTE to a different resampling method (like ADASYN) or from Random Forest to Gradient Boosting without having to reapply preprocessing manually.
- **Cons of Pipelines**
- Inflexibility for Complex Workflows: Pipelines in libraries like scikit-learn are typically linear, meaning each step must proceed sequentially. For complex workflows that require parallel processing or conditional logic (e.g., using different preprocessing steps for different subsets of data), standard pipelines can be limiting.
- Debugging Challenges: When all steps are bundled into a single pipeline, tracing and debugging issues within individual steps becomes harder. For instance, if a specific transformer is failing, it may be less straightforward to isolate and fix the problem within a pipeline.
- Limited Control Over Intermediate Outputs: Pipelines are designed to pass data from step to step without exposing it along the way. Inspecting the transformed data at a given stage is still possible in a scikit-learn pipeline (for example by slicing the pipeline or accessing a fitted step through named_steps), but it takes extra code compared with running each step by hand.
- Not Always Ideal for Experimentation: In exploratory phases, pipelines may restrict flexibility since they standardize preprocessing and modeling into a rigid structure. When testing many models or trying various feature engineering steps, pipelines can feel less agile.
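The cross-validation benefit listed above can be sketched as follows. The data is synthetic and the names `X_cv`, `y_cv` and `cv_pipeline` are assumptions made for this example; the point is only that the whole pipeline, not just the model, is handed to `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
X_cv = rng.normal(loc=50, scale=10, size=(120, 3))
y_cv = (X_cv[:, 0] + X_cv[:, 1] > 100).astype(int)

cv_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# In each of the 5 folds the pipeline is cloned and fitted from scratch,
# so the scaler only ever sees that fold's training part.
scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5)
print(scores)
print(scores.mean())
```

Scaling the full dataset first and cross-validating only the model would let every fold's held-out part influence the scaler, which is the kind of leakage the pipeline avoids.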


### Pipeline: Creating and working with it
20. Now let's create a pipeline that includes only the preprocessor for now, just to show how much simpler and more straightforward it is than doing the same thing manually.
- Pipeline is a class in scikit-learn (here imported from imblearn, whose Pipeline extends it) that allows you to chain together multiple steps, like preprocessing and model training, in a sequential and organized way. The steps in the pipeline are executed in the order they’re listed. Each step in the pipeline consists of:
- A name (like 'preprocessor' or 'classifier') — this is an identifier for the step and can be any unique string.
- An operation — this is usually a transformation or a model, like a scaler, encoder, or classifier.

21. Think of it as a list of tasks that need to be done in a certain order to prepare your data and fit a model. Each task (or step) is a tuple, which contains two parts:
- A name for the step — this is just a string you can choose, like 'preprocessor' or 'classifier'.
- The operation or action that step performs — this could be a transformer (like scaling or encoding) or a model (like SVC for classification).
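The (name, operation) structure can be inspected directly on a pipeline object. This is a small sketch with a made-up two-step pipeline (`steps_demo`); the step names are arbitrary strings chosen for the example.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Each step is a (name, operation) tuple; order in the list is execution order.
steps_demo = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC()),
])

step_names = [name for name, _ in steps_demo.steps]
print(step_names)                                       # names, in order

print(type(steps_demo.named_steps["scaler"]).__name__)  # look a step up by name
print(type(steps_demo[-1]).__name__)                    # or by position
```

The names matter later on: they are how a step is looked up through `named_steps` and how its parameters are addressed when tuning.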

22. We will now declare a variable holding a pipeline (a list of operations) that for now contains only a preprocessor.
- The step is named 'preprocessor' and performs the operation we created earlier, where we defined the preprocessor for scaling and encoding. Since a working preprocessor is already declared, we simply reuse it.

In [None]:
pipeline_example = Pipeline([
    ('preprocessor', preprocessor)
])

In [None]:
pipeline_example

23. On its own this doesn't say much; it looks simple and straightforward. We see a diagram showing the preprocessor, which contains both the scaler and the encoder. The important part happens behind the scenes: once fit is called, the preprocessor is fitted to the training set without us doing it explicitly, and when we later predict on the validation and test sets (the test set at the very end), the pipeline automatically applies the transformation learned from the training set to the set we are predicting on. This keeps the process straightforward and clean, with no need to fit or apply the transformer to the other sets by hand. The transformation is fitted on the training set only.
- When working with separate train, validation, and test sets, consistency is key. By using a pipeline, the exact same preprocessing transformations (e.g., scaling and encoding) are applied to each dataset split. This ensures that the model sees data in the same format during training and evaluation, reducing the risk of data leakage or inconsistent transformations.
- The pipeline automatically applies each step in sequence. You specify each transformation in the order you want, and the pipeline handles fitting and transforming the data step-by-step, simplifying your code and reducing human error.
- Since the pipeline follows a defined structure, it prevents common errors like forgetting to apply the scaler or encoder to new data. Once set up, the pipeline will consistently follow the same sequence every time, which reduces the chance of skipping steps or applying them incorrectly.
- Using a pipeline makes the code more readable by abstracting away repetitive tasks. Each transformation step is encapsulated, so you can quickly understand what each part of the code does without getting bogged down in details.
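A tiny numeric sketch of "fitted on the training set only": the arrays `train` and `val` and the pipeline `scale_only` are made up for this example, with values small enough to check by hand.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])
val = np.array([[40.0]])

scale_only = Pipeline([("scaler", StandardScaler())])
scale_only.fit(train)              # statistics are learned from train only

scaler = scale_only.named_steps["scaler"]
print(scaler.mean_)                # mean of the training column: 20
print(scaler.scale_)               # its standard deviation: sqrt(200/3), about 8.165

# The validation value is scaled with the training statistics:
# (40 - 20) / 8.165, about 2.449
val_scaled = scale_only.transform(val)
print(val_scaled)
```

The validation value ends up well outside the training range, but it never changes the stored mean or standard deviation.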

24. I have stated many things but haven't shown proof of how exactly it works, so let's proceed.
25. We now have our 'transformer' in the form of the preprocessor; let's add the 'estimator'.
- I will try a Random Forest classifier first.
- All I am going to do is add another tuple to the pipeline with the name 'classifier' and, next to it, the model: RandomForestClassifier(n_estimators=100, random_state=42)
- It is going to look like this:

In [None]:
pipeline_example = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42)),
])

26. As we can see, adding a classification model to the pipeline was easy: all we did was add the tuple to the list. The pipeline is now defined but not yet fitted, so let's fit it and get some results out of it.
27. Pipeline fitting: training the pipeline with pipeline_example.fit(X_train, y_train):
- The pipeline first applies the preprocessor step on X_train, transforming the data (e.g., scaling, encoding).
- The transformed data is then passed to the classifier step, which trains the Random Forest model on the preprocessed data.

In [None]:
pipeline_example.fit(X_train, y_train)

28. As we can now see, the pipeline is fitted: the diagram is now blue, which indicates that it is fitted. This is also shown by the 'i' symbol in the top right corner of the diagram, which displays the current state of the pipeline (fitted or not fitted).
29. Now all that is left is to make a prediction:
- When predicting, the pipeline applies the same scaling and encoding transformations to new data before making predictions with the trained model.
- The predict phase of a pipeline in scikit-learn is where the trained pipeline takes in new data and outputs predictions.
- Let's do the prediction and store the result in a variable y_val_pred_example (this holds the predictions for X_val). We do that so we can pass the variable to scikit-learn's metrics, such as classification reports (classification_report) and accuracy evaluations (accuracy_score). Feature importances, in contrast, are read from the fitted classifier rather than from the predictions.
- Pipelines are mainly for transforming, fitting and predicting; the metrics come from other classes and functions.

In [None]:
y_val_pred_example = pipeline_example.predict(X_val)

30. When we call pipeline_example.predict(X_val), the following happens:
- X_val is the new data (usually the validation or test set) for which we want predictions.
- The pipeline processes this data through all the transformations defined in its steps, ending with the model making predictions based on the processed data.
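The two steps described above can be spelled out by hand and compared with `predict`. This is a sketch on synthetic data with a made-up pipeline (`pipe`); slicing with `pipe[:-1]` gives a sub-pipeline of every step except the final estimator, which is also a convenient way to inspect the transformed data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(7)
X_fit = rng.normal(size=(80, 3))
y_fit = (X_fit[:, 0] > 0).astype(int)
X_new = rng.normal(size=(10, 3))

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe.fit(X_fit, y_fit)

# Step 1: run the new data through every fitted step except the last one.
X_new_transformed = pipe[:-1].transform(X_new)
# Step 2: let the final estimator predict on the transformed data.
manual = pipe[-1].predict(X_new_transformed)

same = np.array_equal(manual, pipe.predict(X_new))
print(same)
```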

31. And that's all of it. It's pretty straightforward and easy to work with, and it looks clean (not in my case, because I split everything up and give a lot of explanations in between). If I now want to swap the model, I just edit the second element of the 'classifier' tuple, and I have another classification model that receives the same transformed data. All of this is consistent: the preprocessing is fitted on the training data only, so no information from the validation or test sets leaks into it. In our case this is useful, as we are going to try multiple models, approaches and tunings in order to get the best result (best as a combination of realism and predictive power).
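Swapping the model does not even require rewriting the pipeline definition. Below is a minimal sketch, assuming a step named 'classifier' as in this notebook; the data is synthetic and `base`, `candidates` and `val_scores` are names invented for the example. `set_params` replaces the step by name, and `clone` gives each candidate a fresh, unfitted copy of the pipeline.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only.
rng = np.random.default_rng(3)
X_all = rng.normal(size=(300, 4))
y_all = (X_all[:, 0] - X_all[:, 1] > 0).astype(int)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_all, y_all, test_size=0.25, random_state=3
)

base = Pipeline([
    ("preprocessor", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=3)),
])

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=3),
    "gradient_boosting": GradientBoostingClassifier(random_state=3),
    "logistic_regression": LogisticRegression(),
}

val_scores = {}
for label, model in candidates.items():
    # Same preprocessing, different final estimator.
    candidate = clone(base).set_params(classifier=model)
    candidate.fit(X_tr, y_tr)
    val_scores[label] = candidate.score(X_va, y_va)  # validation accuracy

print(val_scores)
```

Each candidate gets its own preprocessing fitted on the training split, so the comparison on the validation split stays fair across models.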

32. As I stated earlier, the steps I will now show were already done manually in another notebook, and I can say that it was much harder. The workflow wasn't straightforward: every model needed a new code block and new variables that could easily be forgotten. It was a really messy workflow that I could only keep track of because the dataset is small and I had a lot of time; in a real-world scenario it would be extremely tiring and confusing. On top of that, I could still have leaked data somewhere in the process, which is bad, and with pipelines that is much harder to do. That is why I will be proceeding with pipelines, even though they can be a little limiting for some things.