In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
import statsmodels.api as sm
import re
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
import scipy.stats as stats
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report
from imblearn.over_sampling import SMOTE

In [2]:
data = pd.read_csv('cleaned_data_telecom.csv')  

In [3]:
data_no_total = data.drop(['total_charges'], axis=1).reset_index(drop=True)

In [67]:
from imblearn.pipeline import Pipeline  # Use imblearn's Pipeline to support SMOTE
from imblearn.over_sampling import SMOTE
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

# Assuming `data_no_total` is your DataFrame and `churn` is the target column
features = data_no_total.columns.drop('churn')
target = 'churn'

# Split the data into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(data_no_total[features], data_no_total[target], test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Define categorical and numerical features
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
numerical_features = X_train.select_dtypes(include=[np.number]).columns.tolist()

# Define the ColumnTransformer for preprocessing
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(drop='first'), categorical_features)
    ]
)
# Create a pipeline with preprocessing and the model
# (no SMOTE or feature-selection step is included in this run)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', SVC(kernel='linear', probability=True, random_state=42))
])

pipeline.fit(X_train, y_train)

# Predict on validation set
y_val_pred = pipeline.predict(X_val)

# Calculate validation accuracy
accuracy = accuracy_score(y_val, y_val_pred)
print(f"Validation Accuracy: {accuracy * 100:.2f}%")

# Display classification report
print("\nClassification Report:")
print(classification_report(y_val, y_val_pred))

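# Retrieve the fitted classifier from the pipeline
model = pipeline.named_steps['classifier']

# Only tree-based models expose feature_importances_; an SVC does not,
# so this block prints nothing for the current classifier
if hasattr(model, 'feature_importances_'):
    feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
    feature_importances = model.feature_importances_
    importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
    print("\nFeature Importances:")
    print(importance_df.sort_values(by='Importance', ascending=False))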

Validation Accuracy: 81.88%

Classification Report:
              precision    recall  f1-score   support

          No       0.85      0.91      0.88      1037
         Yes       0.69      0.55      0.61       365

    accuracy                           0.82      1402
   macro avg       0.77      0.73      0.75      1402
weighted avg       0.81      0.82      0.81      1402



In [17]:
pipeline

16. First, what is a pipeline? In machine learning, a pipeline chains a sequence of data transformation and model-building steps into a single object. It provides a structured way to define, execute, and reproduce an end-to-end workflow, from data preprocessing to model fitting and evaluation, so that each step is applied consistently and in the right order. Pipelines are especially useful in workflows that combine several preprocessing, transformation, and modeling steps.
17. Key components of a pipeline:
- **Transformers: These are steps that apply transformations to the data, such as:**
- Scaling: Adjusts numerical data to a common scale, often with StandardScaler (standardizes each feature to mean 0 and variance 1) or MinMaxScaler (scales each feature to the [0, 1] range).
- Encoding: Converts categorical data into numerical format, such as one-hot encoding, which is necessary for algorithms that require numerical inputs.
- Imputation: Handles missing values by replacing them with statistical values (like mean, median) or other strategies.
- Feature Selection: Reduces the number of input features by selecting the most relevant ones.
- **Estimator: This is the machine learning model that will learn from the preprocessed data, such as:**
- Linear models: Logistic regression, linear regression.
- Tree-based models: Decision trees, random forests, gradient boosting.
- Support Vector Machines and other classifiers or regressors.
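- **Optional steps: data resampling or feature selection**
- Resampling: Handles class imbalance in the training data, for example by oversampling with SMOTE. This requires imbalanced-learn's Pipeline (imported above), because scikit-learn's own Pipeline does not accept samplers.
- Feature selection: Selects or reduces features based on their importance or correlation.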
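
The components listed above can be sketched in one small pipeline. This is a minimal illustration, not the notebook's actual model: the `toy` DataFrame, its column names, and the synthetic target are assumptions invented for the example, and `LogisticRegression` stands in for the estimator. It uses only scikit-learn, so the resampling step is left out.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the telecom table (assumption)
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=n).astype(float),
    "monthly_charges": rng.normal(65, 20, size=n),
    "contract": rng.choice(["monthly", "one_year", "two_year"], size=n),
})
toy.loc[::25, "monthly_charges"] = np.nan  # inject missing values to impute

# Synthetic target loosely tied to tenure so there is a signal to learn
toy_target = np.where(toy["tenure"] + rng.normal(0, 10, size=n) < 24, "Yes", "No")

# Transformers: imputation + scaling for numbers, one-hot encoding for categories
numeric_steps = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
toy_preprocessor = ColumnTransformer([
    ("num", numeric_steps, ["tenure", "monthly_charges"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["contract"]),
])

# Full pipeline: transformers -> feature selection -> estimator
toy_pipeline = Pipeline([
    ("preprocessor", toy_preprocessor),
    ("select", SelectKBest(score_func=f_classif, k=3)),
    ("classifier", LogisticRegression(max_iter=1000)),
])

toy_pipeline.fit(toy, toy_target)
print(toy_pipeline.predict(toy.head()))
```

Calling `fit` runs every step in order on the training data, and `predict` reuses the fitted imputer, scaler, encoder, and selector before the classifier sees the rows.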

18. Pros and cons for pipelines:
- **Benefits of Pipelines**
- Consistency and Reproducibility: All transformations are consistently applied in the same order, ensuring that the validation and test sets are processed the same way as the training set. It’s easier to reproduce results and document each step.
- Avoiding Data Leakage: Every step in a pipeline is fitted on the training data only, which reduces the chance of data leakage. For example, the StandardScaler learns the mean and standard deviation from the training set and then applies those same statistics to the validation/test sets. Resampling steps such as SMOTE (in an imbalanced-learn pipeline) run only during fitting, so the validation/test sets are never resampled.
- Simplifying Code: Rather than writing repetitive code, pipelines consolidate it into one workflow, making the code cleaner and easier to manage.
- Automating Cross-Validation: Pipelines can be used with techniques like cross-validation, allowing you to tune and evaluate models without having to handle data preprocessing manually for each fold.
- Flexibility with Experimentation: You can easily swap or add steps. For example, changing from SMOTE to a different resampling method (like ADASYN) or from Random Forest to Gradient Boosting without having to reapply preprocessing manually.
- **Cons of Pipelines**
- Inflexibility for Complex Workflows: A scikit-learn Pipeline is a linear chain, so each step runs in sequence. Column-wise branching is possible through ColumnTransformer (as in the preprocessor above), but workflows that need conditional logic (e.g., different preprocessing steps for different subsets of rows) do not fit the standard pipeline well.
- Debugging Challenges: When all steps are bundled into a single pipeline, tracing and debugging issues within individual steps becomes harder. For instance, if a specific transformer is failing, it may be less straightforward to isolate and fix the problem within a pipeline.
- Limited Visibility of Intermediate Outputs: Pipelines pass data from step to step without showing it. Inspecting the transformed data at a given stage takes extra work, such as accessing a fitted step through `named_steps` or slicing the fitted pipeline (e.g., `pipeline[:-1].transform(X)`).
- Not Always Ideal for Experimentation: In exploratory phases, pipelines may restrict flexibility since they standardize preprocessing and modeling into a rigid structure. When testing many models or trying various feature engineering steps, pipelines can feel cumbersome and less agile.
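
Several of the points above (cross-validation, swapping the estimator, and looking at intermediate outputs) can be tried on a small example. The sketch below is illustrative only: it assumes synthetic data from `make_classification` rather than the telecom dataset, and the names `demo_pipeline`, `X_demo`, and `y_demo` are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption, not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", SVC(kernel="linear", random_state=42)),
])

# Cross-validation: the scaler is refitted inside each fold on that fold's
# training part only, so the held-out part never influences the scaling
svc_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print("SVC mean accuracy:", svc_scores.mean())

# Experimentation: swap the estimator without touching the preprocessing
demo_pipeline.set_params(
    classifier=RandomForestClassifier(n_estimators=50, random_state=42)
)
rf_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5)
print("Random forest mean accuracy:", rf_scores.mean())

# Intermediate output: fit, then slice off the final estimator to see
# exactly what the classifier receives
demo_pipeline.fit(X_demo, y_demo)
X_scaled = demo_pipeline[:-1].transform(X_demo)
print(X_scaled.shape)
```

Slicing with `demo_pipeline[:-1]` returns a pipeline made of every step except the last, which is one way to work around the debugging and visibility concerns listed under the cons.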