## DECISION_TREE_CLASSIFICATION / CLASSification_Handling_Imbalanced_Data_21.Februar


Lab | Classification, Handling Imbalanced Data: 
## ROUND 2: DECISION TREE
For this lab we will build a model for a customer churn binary classification problem. You will be using the file in this folder: https://drive.google.com/drive/folders/1yLmZrS-uQ2BY98vvlkTsJ4UOfKi_vIaH?usp=sharing

Scenario
You are working as an analyst for an internet service provider. You are provided with this historical data about your company's customers and their churn trends. Your task is to build a machine learning model that will help the company identify customers that are more likely to default/churn and thus prevent losses from such customers.

Instructions
In this lab, we will first take a look at the degree of imbalance in the data and correct it using the techniques we learned in class.
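
A quick way to look at the degree of imbalance is to count the classes of the target column. The snippet below is a minimal sketch on a hypothetical toy Series; in the lab the same calls would be applied to `df['Churn']` once the data is loaded.

```python
import pandas as pd

# Hypothetical toy target column; in the lab this would be df['Churn'].
churn = pd.Series(["No", "No", "No", "Yes", "No", "Yes", "No", "No"], name="Churn")

# Absolute counts and relative frequencies per class
class_counts = churn.value_counts()
class_shares = churn.value_counts(normalize=True)

# Ratio of majority to minority class as a single imbalance figure
imbalance_ratio = class_counts.max() / class_counts.min()

print(class_counts)
print(class_shares.round(3))
print(f"Majority/minority ratio: {imbalance_ratio:.2f}")
```

A ratio well above 1 means that plain accuracy can look good even for a model that rarely predicts the minority class.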

Here is the list of steps to be followed (building a simple model without balancing the data):


## Round 2: Fitting Decision Tree 

Fit a Decision Tree Classifier on the training data.
Check the accuracy on the test data.



In [1]:
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix,  ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score, recall_score

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

%matplotlib inline

In [2]:

df = pd.read_csv('churnData.csv')

In [3]:
len(df.columns)  # 16 Columns

16

In [4]:
df.columns


Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [5]:
df.tail()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
7038,Male,0,Yes,Yes,24,Yes,Yes,No,Yes,Yes,Yes,Yes,One year,84.8,1990.5,No
7039,Female,0,Yes,Yes,72,Yes,No,Yes,Yes,No,Yes,Yes,One year,103.2,7362.9,No
7040,Female,0,Yes,Yes,11,No,Yes,No,No,No,No,No,Month-to-month,29.6,346.45,No
7041,Male,1,Yes,No,4,Yes,No,No,No,No,No,No,Month-to-month,74.4,306.6,Yes
7042,Male,0,No,No,66,Yes,Yes,No,Yes,Yes,Yes,Yes,Two year,105.65,6844.5,No


In [6]:
# churnData is a second name for the same DataFrame (no copy is made)
churnData=df
churnData.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No,Yes,No,No,No,No,Month-to-month,29.85,29.85,No
1,Male,0,No,No,34,Yes,Yes,No,Yes,No,No,No,One year,56.95,1889.5,No
2,Male,0,No,No,2,Yes,Yes,Yes,No,No,No,No,Month-to-month,53.85,108.15,Yes
3,Male,0,No,No,45,No,Yes,No,Yes,Yes,No,No,One year,42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,No,No,No,No,No,Month-to-month,70.7,151.65,Yes


In [7]:
# checking for data type
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object

In [8]:
# converting an object('TotalCharges') to numeric
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
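
The effect of `errors='coerce'` can be seen on a small example: entries that cannot be parsed as numbers become `NaN` instead of raising an error. The values below are hypothetical and only mimic a numeric column that was read as strings.

```python
import pandas as pd

# Hypothetical values: two valid numbers stored as strings and one blank entry
raw = pd.Series(["29.85", " ", "108.15"])

# Unparseable entries are turned into NaN rather than raising a ValueError
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print("Unparseable entries:", converted.isna().sum())
```

This is why null values can appear in a column only after the conversion, even if the raw file had no empty cells.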


In [9]:
churnData.dtypes

gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

In [10]:
#checking for null values
null_counts = df.isnull().sum()
print(null_counts)
# The result shows that 15 columns have no null values; the only column with nulls is TotalCharges (11)

gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64


In [11]:
# Replace the null values in TotalCharges with the mean of the column.
# TotalCharges is the only column with nulls, so no other column needs imputation.
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].mean())
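
The cell above computes the mean over the full dataset, so a little information from the later test rows flows into the imputed values. A possible alternative, sketched here on a hypothetical toy frame, is to split first and learn the mean from the training rows only with scikit-learn's `SimpleImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with one missing value in a numeric column
toy = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10],
    "TotalCharges": [29.85, 1889.5, 108.15, np.nan, 151.65, 820.5, 1949.4, 301.9],
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

# Learn the column means on the training rows only, then apply them to both sets
imputer = SimpleImputer(strategy="mean")
train_imputed = imputer.fit_transform(toy_train)
test_imputed = imputer.transform(toy_test)

print("Learned means:", imputer.statistics_)
```

With only 11 missing values out of 7043 rows the practical difference is likely small; the sketch mainly shows the ordering of the steps.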

In [12]:
df.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
       'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
       'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [13]:
df.dtypes


gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object

In [14]:
#Use the following features: tenure, SeniorCitizen, MonthlyCharges and TotalCharges:
#Split the data into a training set and a test set.
from sklearn.model_selection import train_test_split

# Assuming 'df' is your DataFrame and null values in 'TotalCharges' have been handled

# Selecting specific features
X = df[['tenure', 'SeniorCitizen', 'MonthlyCharges', 'TotalCharges']]

# Assuming 'Churn' is the target variable
y = df['Churn']

# Splitting the data into training and test sets
# Adjust the test_size as needed (e.g., 0.2 for 20% test size) and random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", X_train.shape[0])
print("Test set size:", X_test.shape[0])

Training set size: 5634
Test set size: 1409
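
With an imbalanced target it can help to keep the class proportions identical in both sets. The sketch below uses the `stratify` argument of `train_test_split` on a hypothetical toy frame; for the lab data the equivalent would be passing `stratify=y`, which is an optional change and not part of the original split.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 5 churners and 15 non-churners
toy = pd.DataFrame({"x": range(20), "Churn": ["Yes"] * 5 + ["No"] * 15})

# stratify keeps the Yes/No proportions the same in the training and test sets
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Churn"]
)

train_share = (toy_train["Churn"] == "Yes").mean()
test_share = (toy_test["Churn"] == "Yes").mean()
print(f"Share of 'Yes' in train: {train_share:.2f}, in test: {test_share:.2f}")
```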


In [15]:
#Scale the features either by using normalizer or a standard scaler
from sklearn.preprocessing import StandardScaler

# Assuming X_train and X_test are already defined as shown in the previous step

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit on the training data
scaler.fit(X_train)

# Transform both the training and test data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [16]:
# Alternate scaler: Normalizer
from sklearn.preprocessing import Normalizer


# Initialize the Normalizer
normalizer = Normalizer()

# Fit on the training data is not needed as Normalizer works on the rows independently

# Transform both the training and test data
X_train_normalized = normalizer.transform(X_train)
X_test_normalized = normalizer.transform(X_test)
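
The difference between the two options can be checked directly: `StandardScaler` works per column (zero mean, unit variance), while `Normalizer` rescales each row to unit length. The rows below are hypothetical values in the order tenure, MonthlyCharges, TotalCharges.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical rows: [tenure, MonthlyCharges, TotalCharges]
X_demo = np.array([
    [1.0, 29.85, 29.85],
    [34.0, 56.95, 1889.50],
    [2.0, 53.85, 108.15],
])

# Row-wise scaling: every sample ends up with an L2 norm of 1
row_normalized = Normalizer().fit_transform(X_demo)
row_norms = np.linalg.norm(row_normalized, axis=1)

# Column-wise scaling: every feature ends up with mean 0 and standard deviation 1
col_standardized = StandardScaler().fit_transform(X_demo)
col_means = col_standardized.mean(axis=0)
col_stds = col_standardized.std(axis=0)

print("Row norms after Normalizer:", row_norms)
print("Column means after StandardScaler:", col_means.round(6))
print("Column stds after StandardScaler:", col_stds.round(6))
```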



In [17]:
#Fitting a logistic Regression model on the training data.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assuming X_train_scaled and y_train are your scaled training features and target variable, respectively

# Initialize the Logistic Regression model
log_reg = LogisticRegression()

# Fit the model on the scaled training data
log_reg.fit(X_train_scaled, y_train)


# Make predictions on the scaled training and test data (to evaluate the model)
y_train_pred = log_reg.predict(X_train_scaled)
y_pred = log_reg.predict(X_test_scaled)

# Evaluate the model's performance using accuracy as an example metric
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {train_accuracy:.4f}")
test_accuracy = accuracy_score(y_test, y_pred)
print(f'Test Accuracy: {test_accuracy:.4f}')  # test accuracy added to enable a train/test comparison

Training Accuracy: 0.7875
Test Accuracy: 0.8077


In [18]:
# Fit a KNN classifier (not a KNN regressor) on the training data.
knn_clf = KNeighborsClassifier(n_neighbors=5)

# Fit the model on the scaled training data
knn_clf.fit(X_train_scaled, y_train)

# Make predictions with the KNN model on the scaled training and test data
y_train_pred = knn_clf.predict(X_train_scaled)
y_test_pred = knn_clf.predict(X_test_scaled)

# Evaluate the model's performance using accuracy as an example metric.
# The test score has to use the KNN predictions. The previously recorded
# test accuracy of 0.8077 was computed from `y_pred`, the logistic regression
# predictions, so it is not repeated in the output below.
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")  # test accuracy added to enable a train/test comparison

Training Accuracy: 0.8371


In [19]:
## Recommendation: evaluate with recall, since the costly error here is a false negative (a churning customer predicted as not churning)
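
As a minimal sketch of that recommendation, recall and the confusion matrix can be computed with 'Yes' as the positive class. The labels below are hypothetical; in the lab they would be `y_test` and a model's test predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions, using the same 'Yes'/'No' coding as Churn
y_true_demo = ["Yes", "No", "No", "Yes", "Yes", "No", "No", "Yes"]
y_pred_demo = ["Yes", "No", "No", "No", "No", "No", "Yes", "Yes"]

# With labels=["No", "Yes"] the matrix is [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=["No", "Yes"])
tn, fp, fn, tp = cm.ravel()

# pos_label must name the positive class explicitly because the labels are strings
recall = recall_score(y_true_demo, y_pred_demo, pos_label="Yes")
precision = precision_score(y_true_demo, y_pred_demo, pos_label="Yes")

print(cm)
print(f"Recall: {recall:.2f}, Precision: {precision:.2f}, False negatives: {fn}")
```

Recall is TP / (TP + FN), so every missed churner lowers it directly.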

## ROUND 2 (21 February): DECISION TREE

In [26]:
# Fit a Decision Tree Classifier on the training data; the accuracy on the test data is checked in the next cell.
from sklearn.tree import DecisionTreeClassifier

dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train_scaled, y_train)


In [21]:
from sklearn.metrics import accuracy_score

# Predict on the test data
y_pred_dt = dt_classifier.predict(X_test_scaled)

# Calculate the accuracy
accuracy_dt = accuracy_score(y_test, y_pred_dt)
print(f"Accuracy of Decision Tree Classifier: {accuracy_dt}")


Accuracy of Decision Tree Classifier: 0.7281760113555713
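
An accuracy of about 0.73 is easier to judge next to a majority-class baseline, which on imbalanced data can already score high. The sketch below compares the two on synthetic data generated with `make_classification`; the 75/25 class weights are an assumption for illustration, not the lab's actual class distribution.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumed 75/25 split between the classes)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.75, 0.25], random_state=42,
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Baseline that always predicts the most frequent training class
baseline = DummyClassifier(strategy="most_frequent").fit(Xs_train, ys_train)
tree = DecisionTreeClassifier(random_state=42).fit(Xs_train, ys_train)

baseline_acc = baseline.score(Xs_test, ys_test)
tree_acc = tree.score(Xs_test, ys_test)
print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Decision tree accuracy: {tree_acc:.4f}")
```

If a model does not clearly beat this baseline, its accuracy says little about its ability to find churners.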


In [27]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score


In [30]:
def train_test_decision_tree(X_train, y_train, X_test, y_test, max_depth_list):
    """Train one tree per max_depth value and collect train/test metrics."""
    performance_metrics = []
    for max_depth_value in max_depth_list:
        # Decision Tree Classifier
        model = DecisionTreeClassifier(max_depth=max_depth_value, random_state=42)
        # Fit on training data
        model.fit(X_train, y_train)
        # Making predictions
        y_pred_train_dt = model.predict(X_train)
        y_pred_test_dt = model.predict(X_test)
        # Calculate the performance metrics; 'Yes' (churn) is the positive class
        accuracy_train = accuracy_score(y_train, y_pred_train_dt)
        precision_train = precision_score(y_train, y_pred_train_dt, pos_label='Yes')
        recall_train = recall_score(y_train, y_pred_train_dt, pos_label='Yes')
        accuracy_test = accuracy_score(y_test, y_pred_test_dt)
        precision_test = precision_score(y_test, y_pred_test_dt, pos_label='Yes')
        recall_test = recall_score(y_test, y_pred_test_dt, pos_label='Yes')

        # Append the metrics to the performance_metrics list
        performance_metrics.append({
            'max_depth': max_depth_value,
            'accuracy_train': accuracy_train,
            'precision_train': precision_train,
            'recall_train': recall_train,
            'accuracy_test': accuracy_test,
            'precision_test': precision_test,
            'recall_test': recall_test
        })

    # Return the collected metrics so the caller can inspect or tabulate them
    return performance_metrics
    


In [31]:
# Create a function that takes a list of integers as input and 
#trains and tests a Decision Tree Classifier using each integer as max_depth
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def train_test_decision_tree_with_depths(depths, X_train, y_train, X_test, y_test):
   
    accuracy_scores = {}
    for depth in depths:
        # Initialize the Decision Tree Classifier with the current max_depth
        clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
        
        # Train the model
        clf.fit(X_train, y_train)
        
        # Predict on the test set
        y_pred = clf.predict(X_test)
        
        # Calculate and store the accuracy
        accuracy_scores[depth] = accuracy_score(y_test, y_pred)
    
    return accuracy_scores

# Example usage
depths = [1, 2, 3, 4, 5, 10, 15, 20, None] # None means no maximum depth
accuracy_scores = train_test_decision_tree_with_depths(depths, X_train, y_train, X_test, y_test)

accuracy_scores


{1: 0.7352732434350603,
 2: 0.7920511000709723,
 3: 0.7920511000709723,
 4: 0.7913413768630234,
 5: 0.7906316536550745,
 10: 0.7636621717530163,
 15: 0.7437899219304471,
 20: 0.7345635202271115,
 None: 0.7281760113555713}

In [None]:
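## These results show how model complexity (max_depth) influences performance:
## test accuracy peaks at about 0.792 for depths 2 and 3, then falls steadily to
## about 0.728 for the unrestricted tree, which is a typical sign of overfitting.
## This is why max_depth is a critical hyperparameter when tuning the model.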
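
Choosing `max_depth` from test accuracy, as in the table above, risks tuning to the test set. One common alternative is cross-validation on the training data, for example with `GridSearchCV`. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration), so its result says nothing about the best depth for the churn data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; in the lab this would be X_train and y_train
X_syn, y_syn = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.75, 0.25], random_state=42,
)

# Candidate depths; None means the tree grows without a depth limit
param_grid = {"max_depth": [1, 2, 3, 4, 5, 10, None]}

# 5-fold cross-validation over the candidate depths
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_syn, y_syn)

print("Best max_depth:", search.best_params_["max_depth"])
print(f"Mean cross-validated accuracy: {search.best_score_:.4f}")
```

Given the recall recommendation earlier in the notebook, the `scoring` argument could also be set to a recall-based scorer instead of accuracy.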