# **MIT5672_Lab4_EthanRichards**

# Tackle the Telco Customer Churn dataset
In this lab assignment, you will work with the Telco Customer Churn dataset, a resource frequently employed in the telecommunications industry to forecast customer turnover. The dataset offers a range of customer-specific variables such as Monthly Charges and Contract Type, along with a 'Churn' indicator (Yes/No), signaling whether the customer has left the company.

Your objective is to apply five distinct ensemble techniques—Voting, Bagging, Random Forest, AdaBoost, and Stacking—to construct classification models that accurately predict customer churn. Ultimately, you will identify the most effective model based on its accuracy score.


Let's fetch the data and load it:

In [None]:
import pandas as pd

# Read data from URL
url = "https://raw.githubusercontent.com/IBM/telco-customer-churn-on-icp4d/master/data/Telco-Customer-Churn.csv"
df = pd.read_csv(url)

Let's first conduct exploratory data analysis (EDA) to understand the dataset better.

#### **Q1: Show the top few rows of the training set**

In [None]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


#### **Q2: Show basic information, e.g. the index dtype and columns, non-null values and memory usage**

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


#### **Q3: Use a method which returns description of the numerical data in the DataFrame, e.g. count, mean, std, min, 25%, 50%, 75%, max.**

In [None]:
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


In [None]:
# Define features and target
X = df.drop(['customerID', 'Churn'], axis=1)
y = df['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

In [None]:
# Identify numerical and categorical columns
num_cols = X.select_dtypes(include=['float64', 'int64']).columns
cat_cols = X.select_dtypes(include=['object']).columns

In [None]:
print(num_cols)

Index(['SeniorCitizen', 'tenure', 'MonthlyCharges'], dtype='object')


In [None]:
print(cat_cols)

Index(['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
       'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'PaperlessBilling', 'PaymentMethod', 'TotalCharges'],
      dtype='object')


In [None]:
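# Convert TotalCharges from strings to floats; unparseable entries become NaN.
# Note: X was created before this conversion, so X['TotalCharges'] is still an
# object column, and it is not listed in num_attribs or cat_attribs below.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

df['TotalCharges']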

0         29.85
1       1889.50
2        108.15
3       1840.75
4        151.65
         ...   
7038    1990.50
7039    7362.90
7040     346.45
7041     306.60
7042    6844.50
Name: TotalCharges, Length: 7043, dtype: float64
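
The conversion above relies on `errors='coerce'`, which replaces values that cannot be parsed with `NaN` instead of raising an error. A minimal sketch on made-up values (the toy series `raw` is an assumption for illustration, not data taken from the churn file):

```python
import pandas as pd

# Toy values (hypothetical): two numeric strings and one blank string.
raw = pd.Series(["29.85", " ", "1889.5"])

# errors='coerce' turns anything that cannot be parsed into NaN.
converted = pd.to_numeric(raw, errors="coerce")

print(converted.dtype)         # float64
print(converted.isna().sum())  # 1
```

Counting the `NaN` values after the conversion is a quick way to see how many rows an imputer would later have to fill.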

#### **Q4: Create preprocessors for both numerical and categorical features by using make_pipeline**

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Numerical features: fill missing values with the column mean, then standardize.
num_transformer = make_pipeline(
    SimpleImputer(strategy='mean'),
    StandardScaler())

# Categorical features: fill missing values with the most frequent category,
# then one-hot encode.
cat_transformer = make_pipeline(
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder())
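
To see what an impute-then-scale preprocessor does, here is a small sketch on a toy column. The array `toy` and the name `sketch_pipe` are assumptions for illustration; the sketch also shows how `make_pipeline` names its steps automatically.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric column (hypothetical values) with one missing entry.
toy = np.array([[1.0], [np.nan], [3.0]])

sketch_pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
scaled = sketch_pipe.fit_transform(toy)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in sketch_pipe.steps]
print(step_names)  # ['simpleimputer', 'standardscaler']

# The missing value is filled with the column mean (2.0), which scales to 0.
print(scaled.ravel())
```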



#### **Q5: Combine preprocessors by using ColumnTransformer**

In [None]:
from sklearn.compose import ColumnTransformer

num_attribs = ['SeniorCitizen', 'tenure', 'MonthlyCharges']
cat_attribs = ['gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
                'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
                'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
                'PaperlessBilling', 'PaymentMethod']

preprocessor = ColumnTransformer([
        ('num', num_transformer, num_attribs),
        ('cat', cat_transformer, cat_attribs)])

preprocessor
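
One detail worth checking with `ColumnTransformer` is that columns not listed in any transformer are dropped by default (`remainder='drop'`). A minimal sketch on a toy frame, where the frame `toy`, its `unused` column and the name `sketch_ct` are all assumptions for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (hypothetical values): one numeric, one categorical, one unlisted column.
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
    "unused": ["a", "b", "c", "d"],  # not listed below, so it is dropped
})

sketch_ct = ColumnTransformer([
    ("num", StandardScaler(), ["tenure"]),
    ("cat", OneHotEncoder(), ["Contract"]),
])
out = sketch_ct.fit_transform(toy)

# 1 scaled numeric column + 3 one-hot columns; 'unused' contributes nothing.
print(out.shape)  # (4, 4)
```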


#### **Q6: Build base models: LogisticRegression and DecisionTreeClassifier**

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

base_log_reg = LogisticRegression(random_state=42)

base_dt_clf = DecisionTreeClassifier(random_state=42)



In [None]:
!pip install catboost

from catboost import CatBoostRegressor

Collecting catboost
  Downloading catboost-1.2.1-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.7/98.7 MB 8.2 MB/s eta 0:00:00
Installing collected packages: catboost
Successfully installed catboost-1.2.1


#### **Q7: Create a dictionary named `ensemble_models` as a container to hold five separate ensemble models: Voting, Bagging, Random Forest, AdaBoost, and Stacking**

In [None]:
from sklearn.ensemble import VotingClassifier, BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, StackingClassifier

# Keys are display names; values are unfitted ensemble estimators.
ensemble_models = {
    "VotingClassifier": VotingClassifier(
        estimators=[('Log_reg', base_log_reg), ('dt_clf', base_dt_clf)], voting="hard"),
    "BaggingClassifer": BaggingClassifier(
        estimator=DecisionTreeClassifier(), bootstrap=True, n_estimators=500,
        max_samples=100, n_jobs=-1, random_state=42),
    "RandomForestClassifier": RandomForestClassifier(
        n_estimators=500, max_leaf_nodes=16, n_jobs=-1, random_state=42),
    "AdaBoostClassifer": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=2), n_estimators=30, learning_rate=0.5, random_state=42),
    "StackingClassifier": StackingClassifier(
        estimators=[("base_log_reg", base_log_reg), ("base_dt_clf", base_dt_clf)]),
    "StackingClassifier2": StackingClassifier(
        estimators=[("base_log_reg", base_log_reg), ("svc", SVC())]),
}

# Disabled experiments:
# - final_estimator=RandomForestClassifier(random_state=42) on the stacking
#   models: this actually decreases performance.
# - "StackingCat": a stack with CatBoost as a base estimator. It would need a
#   classifier (CatBoostClassifier) rather than CatBoostRegressor.





#### **Q8: Train-test split**

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train_clean = preprocessor.fit_transform(X_train)

X_test_clean = preprocessor.transform(X_test)

#### **Q9:**


1.   Construct a new pipeline which integrates the given `preprocessor` and a `classifier`.
2.   Utilize a `for` loop to iterate through each model in the ensemble_models dictionary.
3.   For each iteration, set the classifier in the pipeline to the current model.
4.   Train the pipeline using the `X_train` and `y_train` datasets.
5.   Compute the accuracy of the trained pipeline on the test dataset (`X_test` and `y_test`).
6.   Print out the accuracy of the model.




**Alternatively, you can create five models (Voting, Bagging, Random Forest, AdaBoost, and Stacking) individually instead of using `for` loop and `pipeline`.**
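
The numbered steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lab's reference solution: it uses synthetic data from `make_classification`, a `StandardScaler` standing in for the lab's `preprocessor`, a two-model dictionary `demo_models` standing in for `ensemble_models`, and a `DummyClassifier` as a placeholder for the final step until the loop replaces it.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); the lab uses the churn X_train / X_test.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# Small stand-in for the ensemble_models dictionary.
demo_models = {
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=42),
    "BaggingClassifier": BaggingClassifier(n_estimators=20, random_state=42),
}

# Step 1: a pipeline with a preprocessing step and a placeholder classifier.
pipe = Pipeline([
    ("preprocessor", StandardScaler()),
    ("classifier", DummyClassifier()),
])

demo_scores = {}
for name, model in demo_models.items():      # step 2: iterate over the models
    pipe.set_params(classifier=model)        # step 3: swap in the current model
    pipe.fit(Xd_train, yd_train)             # step 4: train the whole pipeline
    demo_scores[name] = accuracy_score(yd_test, pipe.predict(Xd_test))  # step 5
    print(f"Accuracy of {name} : {demo_scores[name]}")                  # step 6
```

Because the preprocessing lives inside the pipeline, it is refitted on the training data each time `fit` is called, so the raw `X_train` and `X_test` can be passed in directly.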

In [None]:
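from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
import numpy as np

# Fit each ensemble on the preprocessed training data and report test accuracy.
for name, v in ensemble_models.items():
    v.fit(X_train_clean, y_train)
    y_pred = v.predict(X_test_clean)
    score = accuracy_score(y_test, y_pred)
    print(f"Accuracy of {name} : {score}")

print("--------------------------------------")

# Optional cross-validation, left disabled. Passing the raw X would fail for
# these models because X still contains string columns; wrap each model in a
# pipeline with `preprocessor` first.
# for name, v in ensemble_models.items():
#     scores = cross_val_score(v, X, y, cv=5)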

Accuracy of VotingClassifier : 0.7913413768630234
Accuracy of BaggingClassifer : 0.8076650106458482
Accuracy of RandomForestClassifier : 0.8034066713981547
Accuracy of AdaBoostClassifer : 0.8055358410220014
Accuracy of StackingClassifier : 0.8218594748048261
Accuracy of StackingClassifier2 : 0.8197303051809794
--------------------------------------


In [None]:
X_train.shape

(5634, 19)

In [None]:
y_train.shape

(5634,)

#### **Q10: Click Share at the top right. Ensure sharing settings are set to "Anyone with the link can edit." Copy the shared link. Submit this link to the Canvas assignment page.**

#### **Bonus question (5pts): How do you get the feature importance of each variable?**

In [None]:
# Extracting the feature importances
#
# Draft attempts, kept disabled. Two issues to resolve before re-enabling them:
#   * X_train_scaled is never defined in this notebook; the preprocessed
#     matrix is X_train_clean.
#   * X_train.columns has 19 entries, but the model is fitted on 44 transformed
#     columns, so 'Attribute' cannot be built from X_train.columns.
#
# model = LogisticRegression()
# model.fit(X_train_scaled, y_train)
# importances = pd.DataFrame(data={
#     'Attribute': X_train.columns,
#     'Importance': model.coef_[0]
# })
# importances = importances.sort_values(by='Importance', ascending=False)
#
# model = DecisionTreeClassifier()
# model.fit(X_train_clean, y_train)
# importances = pd.DataFrame(model.feature_importances_)
# importances
#
# model = DecisionTreeClassifier()
# model.fit(X_train_clean, y_train)
# importances = pd.DataFrame(data={
#     'Attribute': X_train.columns,
#     'Importance': model.feature_importances_
# })
# importances = importances.sort_values(by='Importance', ascending=False)
#
# Reference:
# https://betterdatascience.com/feature-importance-python/#:~:text=Method%20%231%20%E2%80%94%20Obtain%20importances%20from,assigned%20to%20each%20input%20value.

In [None]:
X_train.columns.shape

(19,)

In [None]:
model.feature_importances_.shape

(44,)

In [None]:
# Getting the feature names after one-hot encoding (for categorical variables)





In [None]:
# Combining numerical and one-hot-encoded categorical feature names



In [None]:
# Sorting feature importances in descending order and taking the indices



AttributeError: ignored

In [None]:
# Printing feature importances




Feature ranking:
1. Feature tenure (0.09738713478805293)
2. Feature MonthlyCharges (0.09088795746478615)
3. Feature Contract_Month-to-month (0.04022445964842482)
4. Feature PaymentMethod_Electronic check (0.023049080518838076)
5. Feature OnlineSecurity_No (0.021380792531041467)
6. Feature TechSupport_No (0.017747767871540313)
7. Feature SeniorCitizen (0.015244380869988734)
8. Feature Contract_Two year (0.015072538527110262)
9. Feature OnlineBackup_No (0.015018239635189447)
10. Feature InternetService_Fiber optic (0.01499271258918124)
11. Feature InternetService_DSL (0.013998784436781413)
12. Feature gender_Male (0.013586002291611214)
13. Feature PaperlessBilling_Yes (0.013489463196401895)
14. Feature TechSupport_Yes (0.01343577109485863)
15. Feature gender_Female (0.013062465760759751)
16. Feature DeviceProtection_No (0.013046970513551812)
17. Feature PaperlessBilling_No (0.012352875692512281)
18. Feature Contract_One year (0.01219052984885086)
19. Feature OnlineSecurity_Yes (0.0120003