Name: Isaac Ndirangu Muturi
Full-time student

# SyriaTel Customer Churn

Build a classifier to predict whether a customer will ("soon") stop doing business with SyriaTel, a telecommunications company. This is a binary classification problem.

Most naturally, your audience here would be the telecom business itself, interested in reducing how much money is lost because of customers who don't stick around very long. The question you can ask is: are there any predictable patterns here?

The graded elements for the Jupyter Notebook are:

Business Understanding

Business Problem

Import libraries and modules

Data Understanding

Data Preparation

Exploratory Data Analysis (EDA)

Modeling

Evaluation

Code Quality

# Business Understanding

As a data scientist assigned to investigate customer churn for SyriaTel, my main objective is to analyze the available data and develop a predictive classifier that can accurately determine whether a customer is likely to terminate their relationship with the telecommunications company. By understanding the underlying patterns and reasons behind customer churn, our aim is to assist SyriaTel in reducing financial losses and implementing targeted retention strategies. Through comprehensive data analysis and modeling techniques, we can identify key factors influencing churn and provide actionable insights to the business.

To achieve this goal, I will begin by conducting a thorough examination of the dataset, encompassing customer demographics, usage patterns, billing information, and customer service interactions. This exploratory analysis will enable me to gain a deep understanding of the data, identifying potential features that have a significant impact on customer churn. By leveraging statistical techniques and visualization methods, I can uncover correlations and patterns that will serve as the foundation for the subsequent modeling phase.

Once the dataset has been carefully examined, I will preprocess the data to handle missing values, encode categorical variables, and normalize numerical features. This preprocessing step is crucial to ensure the dataset is suitable for modeling, as it minimizes bias and enhances the quality of the input data. Additionally, I will employ feature selection techniques to identify the most relevant variables, or engineer new features that can provide valuable insights into customer churn. This process will involve assessing feature importance, conducting correlation analysis, and applying domain knowledge to select the most informative set of features.

After feature selection and engineering, I will select an appropriate machine learning algorithm for the classification task. Depending on the nature of the data and the problem at hand, algorithms such as logistic regression, decision trees, random forests, support vector machines (SVM), or gradient boosting algorithms like XGBoost or LightGBM may be considered. The chosen algorithm will be trained on the preprocessed dataset, employing suitable training techniques such as cross-validation to ensure the model's robustness and generalization capabilities. By iteratively refining the model's parameters and evaluating its performance, we can develop a reliable classifier for predicting customer churn.

# Import libraries and modules

In [1]:
# Import modules & packages

# Data manipulation 
import pandas as pd 
import numpy as np 

# Data visualization
import seaborn as sns 
import matplotlib.pyplot as plt 
import plotly.express as px 
import plotly.colors as colors
import plotly.graph_objs as go
from plotly.offline import iplot

# Modeling
from sklearn.model_selection import train_test_split,cross_val_score,GridSearchCV #splitting the dataset into test-train
from imblearn.over_sampling import SMOTE #SMOTE technique to deal with unbalanced data problem
from sklearn.metrics import accuracy_score,f1_score,recall_score,precision_score,confusion_matrix,roc_curve,roc_auc_score,classification_report # performance metrics
from sklearn.preprocessing import MinMaxScaler, LabelEncoder # to scale the numeric features
from scipy import stats

# Feature Selection, XAI, Feature Importance
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import SelectFromModel

# Algorithms for supervised learning methods
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Filtering future warnings
import warnings
warnings.filterwarnings('ignore')

ModuleNotFoundError: No module named 'mlxtend'

In [None]:
!pip install mlxtend

In [None]:
# Read data from csv file & create dataframe. Checking the first 5 rows.
df = pd.read_csv('bigml_59c28831336c6604c800002a.csv')
df.head()

**Summary of Features in the Dataset**

state: the state the customer lives in

account length: the number of days the customer has had an account

area code: the area code of the customer

phone number: the phone number of the customer

international plan: true if the customer has the international plan, otherwise false

voice mail plan: true if the customer has the voice mail plan, otherwise false

number vmail messages: the number of voicemails the customer has sent

total day minutes: total number of minutes the customer has spent on calls during the day

total day calls: total number of calls the customer has made during the day

total day charge: total amount of money the customer was charged by the Telecom company for calls during the day

total eve minutes: total number of minutes the customer has spent on calls during the evening

total eve calls: total number of calls the customer has made during the evening

total eve charge: total amount of money the customer was charged by the Telecom company for calls during the evening

total night minutes: total number of minutes the customer has spent on calls during the night

total night calls: total number of calls the customer has made during the night

total night charge: total amount of money the customer was charged by the Telecom company for calls during the night

total intl minutes: total number of minutes the customer has spent on international calls

total intl calls: total number of international calls the customer has made

total intl charge: total amount of money the customer was charged by the Telecom company for international calls

customer service calls: number of calls the customer has made to customer service

churn: true if the customer terminated their contract, otherwise false

In [None]:
# Check shape of dataframe - 3333 rows and 21 columns
df.shape

In [None]:
df.describe() # Concise statistical description of numeric features

# Data Preparation

In [None]:
# Check for missing values, no missing values.
df.isnull().sum()

In [None]:
# Check for duplicated rows, no duplicated rows to deal with.
df.duplicated().sum()

In [None]:
# Remove the phone number feature: it is contact information for the client and adds no value to the analysis
# Recheck dataframe
df.drop(['phone number'],axis=1,inplace=True)
df.head()

In [None]:
df.select_dtypes('number').columns

In [None]:
df.select_dtypes('object').columns

**Feature Types**



**Continuous Features:**

account length 

number vmail messages

total day minutes

total day calls

total day charge

total eve minutes

total eve calls

total eve charge

total night minutes

total night calls

total night charge

total intl minutes

total intl calls

total intl charge

customer service calls

**Categorical Features:**

state

area code

international plan

voice mail plan

**Transforming "Churn" Feature's Rows into 0s and 1s**

In [None]:
df['churn'] = df['churn'].map({True: 1, False: 0}).astype('int') 
df.head()

# Exploratory Data Analysis (EDA)

In [None]:
# Check the number of unique values in all columns to determine feature type
df.nunique()

**Analysis on 'churn' Feature**

Churn will be used as the dependent variable in this analysis.

Churn indicates whether a customer has terminated their contract with SyriaTel. True indicates they have terminated it; false indicates they have not and still have an existing account.

In [None]:
# Countplot of churn feature
print(df.churn.value_counts())
sns.countplot(data=df, x='churn');

Of the 3,333 customers in the dataset, 483 have terminated their contract with SyriaTel. That is 14.5% of customers lost.

The distribution of the two classes shows a data imbalance. This needs to be addressed before modeling, because a model trained on imbalanced data can reach a high accuracy by mostly predicting the majority class while missing many of the customers who actually churn.
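The arithmetic behind the churn rate can be checked directly. The short sketch below uses only the two counts quoted above; the "always predict no churn" baseline is added here as an illustration of why accuracy alone can be misleading on this dataset.

```python
# Counts taken from the text above: 483 churned customers out of 3,333.
total_customers = 3333
churned = 483

churn_rate = churned / total_customers
# Accuracy of a trivial model that always predicts "no churn"
majority_baseline_accuracy = 1 - churn_rate

print(f"Churn rate: {churn_rate:.1%}")
print(f"Accuracy of always predicting 'no churn': {majority_baseline_accuracy:.1%}")
```

Any classifier trained on the full dataset should be judged against that trivial baseline, and on recall and F1 for the churn class rather than on accuracy alone.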

**Analysis on area code**

In [None]:
# Donut chart of the area code feature
area = df['area code'].value_counts()
area_codes = area.index
quantity = area.values

# Draw the chart with plotly
figure = px.pie(df,
               values = quantity,
               names = area_codes,
               hole = .5,
               title = 'Distribution of Area Code Feature')
figure.show()

Half of the customers have the area code 415.

One fourth of customers have the area code 510 and another fourth have the area code 408.

**Boxplot to see which area code has the highest churn**

In [None]:
# Boxplot to see which area code has the highest churn
plt.figure(figsize=(14,5))
sns.boxplot(data=df,x='churn',y='customer service calls',hue='area code');
plt.legend(loc='upper right');

There are outliers, in all area codes, amongst the customers who have not terminated their accounts.

Customers who have terminated their account are more likely to have a 415 or a 510 area code.

In [None]:
# Create numeric & categorical lists
numeric_columns = ['account length','number vmail messages','total day minutes','total day calls','total day charge',
                'total eve minutes','total eve calls','total eve charge','total night minutes','total night calls',
                'total night charge','total intl minutes','total intl calls','total intl charge','customer service calls']
categoric_columns = ['state','area code','international plan','voice mail plan']

**Distribution Plots for Numeric Features**

In [None]:
f,ax=plt.subplots(2,3,figsize=(19,6),constrained_layout = True)
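# histplot replaces the deprecated sns.distplot; kde=True and stat="density" keep the histogram plus density curve
sns.histplot(df["account length"],bins=20,kde=True,stat="density",ax=ax[0,0]);

sns.histplot(df["total day calls"],bins=20,kde=True,stat="density",ax=ax[0,1]);

sns.histplot(df["total eve calls"],bins=20,kde=True,stat="density",ax=ax[0,2]);

sns.histplot(df["total night calls"],bins=20,kde=True,stat="density",ax=ax[1,0]);

sns.histplot(df["total intl calls"],bins=20,kde=True,stat="density",ax=ax[1,1]);

sns.histplot(df["customer service calls"],bins=20,kde=True,stat="density",ax=ax[1,2]);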

In the distribution plots above, all features except total international calls and customer service calls look approximately normally distributed. Total international calls is skewed to the right, so it is only roughly bell-shaped.

Customer service calls has several peaks, which indicates there are a few modes in the population. This makes sense because the number of customer service calls has to be an integer and takes only a handful of distinct values.

**Correlation Heatmap for Numeric Features**

In [None]:
corr_mat = df[numeric_columns].corr()
mask = np.triu(np.ones_like(corr_mat, dtype=bool))
plt.subplots(figsize=(15,12))
# Apply the mask computed above so each pair of features is shown only once
sns.heatmap(corr_mat, mask=mask, annot=True, cmap='Blues', square=True,fmt='.0g');
plt.xticks(rotation=90);
plt.yticks(rotation=0);

Most of the features are not correlated; however, some pairs are perfectly correlated.

Total day charge and total day minutes features are fully positively correlated.

Total eve charge and total eve minutes features are fully positively correlated.

Total night charge and total night minutes features are fully positively correlated.

Total intl charge and total intl minutes features are fully positively correlated.

It makes sense for these features to be perfectly correlated because the charge is a direct result of the minutes used.

The perfect correlation of 1 indicates the presence of perfect multicollinearity. It does not have the same impact on nonlinear models as it does on linear models. Some nonlinear models are impacted by perfect multicollinearity whereas others are not.
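To illustrate why a charge that is a fixed multiple of the minutes gives a correlation of exactly 1, here is a minimal sketch on simulated data. The per-minute rate and the simulated minutes are assumptions made for this example and are not taken from the SyriaTel data.

```python
import numpy as np

rng = np.random.default_rng(0)
minutes = rng.uniform(0, 350, size=1000)  # simulated call minutes
rate_per_minute = 0.17                     # hypothetical flat rate
charge = rate_per_minute * minutes         # charge is a linear function of minutes

# Pearson correlation between the two columns
r = np.corrcoef(minutes, charge)[0, 1]
print(f"Pearson r between minutes and charge: {r:.4f}")
```

Because one column carries no information beyond the other, keeping both adds nothing to a model, which motivates dropping one feature from each such pair.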

**Dropping Highly-Correlated Features**

Dropping features that have a correlation of 0.9 or above.

In [None]:
print("The original dataframe has {} columns.".format(df.shape[1]))
# Calculate the correlation matrix and take the absolute value
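# numeric_only=True skips the text columns, which recent pandas versions refuse to correlate
corr_matrix = df.corr(numeric_only=True).abs()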

# Create a True/False mask and apply it
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
tri_df = corr_matrix.mask(mask)

# List column names of highly correlated features (r > 0.90)
to_drop = [c for c in tri_df.columns if any(tri_df[c] >  0.90)]

reduced_df = df.drop(to_drop, axis=1) # Drop the features
print("The reduced dataframe has {} columns.".format(reduced_df.shape[1]))
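The masking logic used above can be followed on a toy example. The three-column dataframe below is hypothetical; it only shows how the lower-triangle mask makes each pair appear once and which member of a highly correlated pair ends up in `to_drop`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exactly 2 * a, so |r| = 1
    "c": [5, 1, 4, 2, 3],   # only weakly related to a and b
})

corr = toy.corr().abs()
# True on and above the diagonal; masking those cells keeps each pair once
upper = np.triu(np.ones_like(corr, dtype=bool))
lower_only = corr.mask(upper)

# A column is dropped if any entry below the diagonal in that column exceeds 0.90
to_drop = [col for col in lower_only.columns if (lower_only[col] > 0.90).any()]
print(to_drop)
```

Only one column of the pair is flagged, so the information it shares with its partner is still available to the model.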

**Pairplots for Numeric Features (Hue as "Churn")**

In [None]:
data_temp = df[["account length","total day calls","total eve calls","total night calls",
                "total intl calls","customer service calls","churn"]]
sns.pairplot(data_temp, hue="churn",height=2.5);
plt.show();

There seems to be an evident relationship between customer service calls and true churn values. After 4 calls, customers are a lot more likely to discontinue their service.

**Categorical Features Analysis**

Next, we look at the churn distribution for each categorical feature (for state, only the 15 most frequent values are shown), to see how much each one influences our target:

In [None]:
for i in categoric_columns:
    plt.figure(figsize=(10,4))
    sns.countplot(x=i, hue="churn", data=df,order= df[i].value_counts().iloc[0:15].index)
    plt.xticks(rotation=90)
    plt.legend(loc="upper right")
    plt.show()

**One-Hot Encoding**

Transforming categorical features into dummy variables as 0 and 1 to be able to use them in classification models.

In [None]:
dummy_df_area_code = pd.get_dummies(df["area code"],dtype=np.int64,prefix="area_code_is")
dummy_df_international_plan = pd.get_dummies(df["international plan"],dtype=np.int64,prefix="international_plan_is",drop_first = True)
dummy_df_voice_mail_plan = pd.get_dummies(df["voice mail plan"],dtype=np.int64,prefix="voice_mail_plan_is",drop_first = True)


df = pd.concat([df, dummy_df_area_code, dummy_df_international_plan, dummy_df_voice_mail_plan], axis=1)
df = df.loc[:,~df.columns.duplicated()]
df = df.drop(['area code', 'international plan', 'voice mail plan'],axis=1)

df.head()
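As a small illustration of what `drop_first = True` does in the cell above, the sketch below encodes a hypothetical four-row yes/no column with and without that option.

```python
import pandas as pd

plan = pd.Series(["yes", "no", "no", "yes"], name="international plan")  # hypothetical values

# Without drop_first: one column per category (the two columns always sum to 1)
full = pd.get_dummies(plan, dtype=int, prefix="international_plan_is")
# With drop_first=True: the first category ("no") is dropped, leaving a single 0/1 column
reduced = pd.get_dummies(plan, dtype=int, prefix="international_plan_is", drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
print(reduced["international_plan_is_yes"].tolist())
```

For a two-level feature the dropped column is redundant, since it is always one minus the column that is kept.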

The "state" column is converted using the LabelEncoder, which replaces each unique label with a unique integer. In this case, a label encoder is used instead of dummy variables because of the many distinct values, which, when converted into dummy variables, would distort, for example, PCA and the feature importances of the tree-based models.

In [None]:
le = LabelEncoder()
le.fit(df['state'])
df['state'] = le.transform(df['state'])
df.head()
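The following sketch shows the integer mapping a label encoder produces on a hypothetical list of five state codes. It is only an illustration of the mechanism, not output from the SyriaTel data.

```python
from sklearn.preprocessing import LabelEncoder

states = ["NY", "CA", "TX", "CA", "NY"]  # hypothetical sample of state labels

encoder = LabelEncoder()
codes = encoder.fit_transform(states)

# Classes are stored in sorted order; each label is replaced by its position
print(list(encoder.classes_))
print(codes.tolist())
# The mapping can be reversed when the original labels are needed again
print(encoder.inverse_transform(codes).tolist())
```

The integers imply an ordering between states that has no real meaning, which tree-based models tolerate better than linear models do.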

**The following interactive graph shows the distribution of each feature for customers with churn and for customers without churn. The slider can be used to switch between the different features.**

In [None]:
churn = df[df["churn"] == 1]
no_churn = df[df["churn"] == 0]

In [None]:
colors = colors.DEFAULT_PLOTLY_COLORS
churn_dict = {0: "no churn", 1: "churn"}

In [None]:
def create_churn_trace(col, visible=False):
    return go.Histogram(
        x=churn[col],
        name='churn',
        marker = dict(color = colors[1]),
        visible=visible,
    )

def create_no_churn_trace(col, visible=False):
    return go.Histogram(
        x=no_churn[col],
        name='no churn',
        marker = dict(color = colors[0]),
        visible = visible,
    )

features_not_for_hist = ["state", "churn"]
features_for_hist = [x for x in df.columns if x not in features_not_for_hist]
active_idx = 0
traces_churn = [(create_churn_trace(col) if i != active_idx else create_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
traces_no_churn = [(create_no_churn_trace(col) if i != active_idx else create_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_hist)]
data = traces_churn + traces_no_churn

n_features = len(features_for_hist)
steps = []
for i in range(n_features):
    step = dict(
        method = 'restyle',  
        args = ['visible', [False] * len(data)],
        label = features_for_hist[i],
    )
    step['args'][1][i] = True # Toggle i'th trace to "visible"
    step['args'][1][i + n_features] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active = active_idx,
    currentvalue = dict(
        prefix = "Feature: ", 
        xanchor= 'center',
    ),
    pad = {"t": 50},
    steps = steps,
)]

layout = dict(
    sliders=sliders,
    yaxis=dict(
        title='#samples',
        automargin=True,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='histogram_slider')

One interesting histogram is that of the feature "international_plan": the proportion of churn for customers who have the international plan is much higher than for customers without it.

The histograms for "total_day_minutes" and "total_day_charge" are very similar, and we can see that customers with higher values for these two features are more likely to churn. Interestingly, this does not apply to the number of day calls, which suggests that these customers make longer calls. The minutes, charge and number of calls for other times of the day (i.e. evening, night) do not show different distributions for customers with and without churn.

Another interesting pattern is shown by the "total_intl_calls" feature. The data for the customers with churn are more left skewed than the data of the customers who did not churn.

**Next, we take a look at the box plots for each feature. A box plot visualizes the following statistics:**

- the median
- the first quartile (Q1) and the third quartile (Q3), which bound the interquartile range (IQR)
- the lower fence (Q1 - 1.5 IQR) and the upper fence (Q3 + 1.5 IQR)
- the maximum and the minimum value
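The statistics listed above can be computed by hand. The sketch below does so with NumPy on a hypothetical sample; plotting libraries may use a slightly different quartile method, so their fences can differ a little from these values.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 12])  # hypothetical sample

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones a box plot draws individually
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers.tolist()}")
```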

In [None]:
def create_box_churn_trace(col, visible=False):
    return go.Box(
        y=churn[col],
        name='churn',
        marker = dict(color = colors[1]),
        visible=visible,
    )

def create_box_no_churn_trace(col, visible=False):
    return go.Box(
        y=no_churn[col],
        name='no churn',
        marker = dict(color = colors[0]),
        visible = visible,
    )

features_not_for_hist = ["state", "churn"]
features_for_hist = [x for x in df.columns if x not in features_not_for_hist]
# remove features with too few distinct values (e.g. binary features), because a boxplot does not make sense for them
features_for_box = [col for col in features_for_hist if len(churn[col].unique())>5]

active_idx = 0
box_traces_churn = [(create_box_churn_trace(col) if i != active_idx else create_box_churn_trace(col, visible=True)) for i, col in enumerate(features_for_box)]
box_traces_no_churn = [(create_box_no_churn_trace(col) if i != active_idx else create_box_no_churn_trace(col, visible=True)) for i, col in enumerate(features_for_box)]
data = box_traces_churn + box_traces_no_churn

n_features = len(features_for_box)
steps = []
for i in range(n_features):
    step = dict(
        method = 'restyle',  
        args = ['visible', [False] * len(data)],
        label = features_for_box[i],
    )
    step['args'][1][i] = True # Toggle i'th trace to "visible"
    step['args'][1][i + n_features] = True # Toggle i'th trace to "visible"
    steps.append(step)

sliders = [dict(
    active = active_idx,
    currentvalue = dict(
        prefix = "Feature: ", 
        xanchor= 'center',
    ),
    pad = {"t": 50},
    steps = steps,
    len=1,
)]

layout = dict(
    sliders=sliders,
    yaxis=dict(
        title='value',
        automargin=True,
    ),
    legend=dict(
        x=0,
        y=1,
    ),
)

fig = dict(data=data, layout=layout)

iplot(fig, filename='box_slider')

When we look at the box plot for the number of voice mail messages ("number_vmail_messages"), we can see that we have some outliers for the customers with churn, but most of them have sent zero voice mail messages. The customers who did not churn tend to have more voice mail messages.
Similar to our findings in the histograms, we can also see in the box plot that the median of the total day minutes and the total day charge for churn clients is higher than that of no-churn clients.

Looking at the total international calls ("total_intl_calls"), the box plot shows that both churn and no-churn customers make a similar number of international calls, but the churn customers tend to make longer calls, as their median for total international minutes is higher than for the no-churn customers.

Finally, the plot for the number of customer service calls shows that clients with churn have a higher median and a higher variance for the customer service calls.

**Outlier Detection & Treatment**

Dropping outliers past 3 standard deviations.

In [None]:
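print("Before dropping numerical outliers, length of the dataframe is: ",len(df))
def drop_numerical_outliers(df, columns, z_thresh=3):
    # Keep rows whose absolute z-score is below the threshold in every listed column.
    # Only the continuous features are checked: a z-score on a rare 0/1 flag
    # (less than 10% ones) would mark the whole minority category as outliers.
    constrains = df[columns].apply(lambda x: np.abs(stats.zscore(x)) < z_thresh) \
        .all(axis=1)
    df.drop(df.index[~constrains], inplace=True)
    
drop_numerical_outliers(df, numeric_columns)
print("After dropping numerical outliers, length of the dataframe is: ",len(df))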
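The three-standard-deviation rule can be seen on a tiny example. The sample below is hypothetical, and the z-score is computed by hand with NumPy using the population standard deviation, which matches the default of `scipy.stats.zscore`.

```python
import numpy as np

# Hypothetical sample: 29 ordinary values and one extreme value
values = np.array([10.0] * 29 + [100.0])

# z-score: distance from the mean in units of standard deviation
z_scores = (values - values.mean()) / values.std()

# Keep only the observations within 3 standard deviations of the mean
kept = values[np.abs(z_scores) < 3]

print(f"largest |z| = {np.abs(z_scores).max():.2f}")
print(f"kept {len(kept)} of {len(values)} values")
```

Note that a single value can only exceed the threshold when the sample is large enough, so the rule is of little use on very small samples.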

# Modeling

**Scaling Numerical Features**

Scaling is the process of transforming the values of several variables into a similar range. Typical approaches include standardizing a variable so that its mean is 0 and its variance is 1, or rescaling a variable so that its values range from 0 to 1.

In our example, Min-Max normalization is applied. MinMaxScaler maps each feature to the [0, 1] range using the feature's minimum and maximum. Because those two values define the range, the method is sensitive to extreme values, which is one reason the outliers were dropped beforehand.
MinMaxScaler is applied to every numeric column of the dataframe in the loop below.
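Min-max scaling computes (x - min) / (max - min) for each value of a feature. A minimal sketch on hypothetical feature values:

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 10.0])  # hypothetical feature values

# The smallest value maps to 0, the largest to 1, the rest fall in between
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled.tolist())
```

If the largest value were an extreme outlier, all other values would be squeezed towards 0, which is why this scaler works best after outliers have been handled.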

In [None]:
transformer = MinMaxScaler()

def scaling(columns):
    return transformer.fit_transform(df[columns].values.reshape(-1,1))

# Scale every numeric column to the [0, 1] range, one column at a time
for i in df.select_dtypes(include=[np.number]).columns:
    df[i] = scaling(i)
df.head()

**Train-Test Split**

Splitting the dataset into training and testing as 75% training and 25% testing

In [None]:
X=df.drop(['churn'],axis=1)
y=df['churn']

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=123)

**Applying SMOTE Technique to Resolve Unbalanced 'churn' Feature**

Synthetic Minority Oversampling Technique ("SMOTE") is an oversampling technique where synthetic samples are generated for the minority class. This algorithm helps to overcome the overfitting problem posed by random oversampling. It focuses on the feature space to generate new instances with the help of interpolation between the positive instances that lie together.

The technique balances the class distribution by adding synthetic minority class examples, rather than by replicating existing ones.
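The interpolation idea can be sketched in a few lines. This is a simplified illustration of how one synthetic point is placed between a minority sample and one of its neighbours; it is not the imbalanced-learn implementation, and the two points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(123)

# Two hypothetical minority-class points in a 2-feature space
x_i = np.array([0.2, 0.4])
x_neighbor = np.array([0.6, 0.8])

# The synthetic point lies at a random position on the segment between them
lam = rng.uniform(0, 1)
x_synthetic = x_i + lam * (x_neighbor - x_i)

print(f"lambda = {lam:.3f}, synthetic point = {x_synthetic}")
```

The real algorithm repeats this step for many minority samples, each time picking one of the k nearest minority neighbours (k_neighbors=5 in the cell below).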

In [None]:
df.churn.value_counts()

In [None]:
sm = SMOTE(k_neighbors=5, random_state=123)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
print('Before OverSampling, the shape of X_train: {}'.format(X_train.shape))
print('Before OverSampling, the shape of y_train: {}'.format(y_train.shape)) 
print('After OverSampling, the shape of X_train_over: {}'.format(X_train_over.shape))
print('After OverSampling, the shape of y_train_over: {}'.format(y_train_over.shape))

Checking for class imbalance again.

In [None]:
y_train_over.value_counts()

**Creating a function to generate predictions, precision, recall, accuracy, and F1 score**

In [None]:
def model_predictions(model, x_train, x_test, y_train, y_test):
    '''Enter model name and test/train sets to generate predictions, precision, recall, accuracy, and F1 score'''
    model.fit(x_train, y_train)
    y_hat_train = model.predict(x_train)
    y_hat_test = model.predict(x_test)
    print('Training Precision: ', precision_score(y_train, y_hat_train))
    print('Testing Precision: ', precision_score(y_test, y_hat_test))
    print('-----')

    print('Training Recall: ', recall_score(y_train, y_hat_train))
    print('Testing Recall: ', recall_score(y_test, y_hat_test))
    print('-----')

    print('Training Accuracy: ', accuracy_score(y_train, y_hat_train))
    print('Testing Accuracy: ', accuracy_score(y_test, y_hat_test))
    print('-----')

    print('Training F1-Score: ', f1_score(y_train, y_hat_train))
    print('Testing F1-Score: ', f1_score(y_test, y_hat_test))

**Baseline Model 1 - Logistic Regression Classifier**

Logistic regression is a classification algorithm, used when the value of the target variable is categorical in nature.

It is most commonly used when the data in question has binary output, so when it belongs to one class or another, or is either a 0 or 1.
This method will be used to create a baseline model.

In [None]:
# Object creation, fitting the data & getting predictions 
lr_vanilla= LogisticRegression()
lr_vanilla.fit(X_train_over,y_train_over) 
y_pred_lr_vanilla = lr_vanilla.predict(X_test) 

In [None]:
print(classification_report(y_test, y_pred_lr_vanilla, target_names=['0', '1']))

In [None]:
print("**************** LOGISTIC REGRESSION vanilla CLASSIFIER MODEL RESULTS **************** ")
print('Accuracy score for testing set: ',round(accuracy_score(y_test,y_pred_lr_vanilla),5))
print('F1 score for testing set: ',round(f1_score(y_test,y_pred_lr_vanilla),5))
print('Recall score for testing set: ',round(recall_score(y_test,y_pred_lr_vanilla),5))
print('Precision score for testing set: ',round(precision_score(y_test,y_pred_lr_vanilla),5))
cm_lr = confusion_matrix(y_test, y_pred_lr_vanilla)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_lr, annot=True, cmap='Blues', fmt='g', ax=ax)
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

According to the logistic regression classifier model, total day charge, number of voicemail messages and total evening charge are the top three important features.

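Model accuracy is 76.5%, which isn't bad. The F1 score, however, is only 50.2%. F1 is the harmonic mean of precision and recall for the churn class, so a value this low means the model misses many churners, raises many false alarms, or both.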
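To make the relationship between accuracy and F1 concrete, here is a worked example on hypothetical confusion-matrix counts. The numbers are invented for illustration and are not the results of the model above.

```python
# Hypothetical counts for the churn (positive) class
tp, fp, fn, tn = 60, 90, 30, 620

precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of real churners the model found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}, accuracy={accuracy:.2f}")
```

In this example accuracy is high because the many true negatives dominate, while F1 stays low because it ignores true negatives and only reflects performance on the churn class.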

**Hyperparameter Tuning of Logistic Regression Classifier**

Hyperparameters are tuned with GridSearchCV using 3-fold cross-validation.

In [None]:
lr_params = {'penalty': ['l1', 'l2'], 
             'C': np.logspace(0, 4, 5),
             'solver' : ['lbfgs', 'newton-cg', 'liblinear','saga'],
             'max_iter' : [5, 10]}
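Not every solver in the grid above supports every penalty: in scikit-learn, lbfgs and newton-cg do not support the l1 penalty, while liblinear and saga do. One possible way to avoid failed fits, sketched here as a suggestion rather than as the approach used in this notebook, is to pass GridSearchCV a list of grids that contains only compatible combinations. The name `compatible_grid` is introduced for this example.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(0, 4, 5)

compatible_grid = [
    # every listed solver supports the l2 penalty
    {"penalty": ["l2"], "solver": ["lbfgs", "newton-cg", "liblinear", "saga"],
     "C": C_values, "max_iter": [5, 10]},
    # l1 is restricted to the solvers that support it
    {"penalty": ["l1"], "solver": ["liblinear", "saga"],
     "C": C_values, "max_iter": [5, 10]},
]

# ParameterGrid enumerates the candidates GridSearchCV would try
n_candidates = len(ParameterGrid(compatible_grid))
print(f"{n_candidates} valid candidates")
```

The full cross product of the original grid has 80 candidates, so this version skips the 20 combinations that would fail to fit.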

In [None]:
lr_model2 = LogisticRegression()
lr_cv_model = GridSearchCV(lr_model2, lr_params, cv=3, n_jobs=-1, verbose=False)
lr_cv_model.fit(X_train_over,y_train_over)
print("Best parameters:"+str(lr_cv_model.best_params_))

Let's use the best hyperparameters we found.

In [None]:
# Refit a logistic regression with the best parameters found by the grid search
lr_model_GridSearchCV_Applied = LogisticRegression(**lr_cv_model.best_params_)
lr_model_GridSearchCV_Applied.fit(X_train_over,y_train_over)
y_pred_GridSearchCV_Applied = lr_model_GridSearchCV_Applied.predict(X_test)

In [None]:
print("**************** HYPERPARAMETER TUNED LOGISTIC REGRESSION MODEL RESULTS ****************")
print('Accuracy score for testing set: ',round(accuracy_score(y_test, y_pred_GridSearchCV_Applied),5))
print('F1 score for testing set: ',round(f1_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Recall score for testing set: ',round(recall_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Precision score for testing set: ',round(precision_score(y_test, y_pred_GridSearchCV_Applied),5))
cm_lr = confusion_matrix(y_test, y_pred_GridSearchCV_Applied)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_lr, annot=True, cmap='Oranges', fmt='g', ax=ax);
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

In [None]:
# LogisticRegression has no feature_importances_ attribute; the absolute coefficients are used instead
Importance =pd.DataFrame({"Importance": np.abs(lr_model_GridSearchCV_Applied.coef_[0])},index = X_train_over.columns)
Importance.sort_values(by = "Importance", axis = 0, ascending = True).tail(15).plot(kind ="barh", color = "r",figsize=(9, 5))
plt.title("Feature Importance Levels");
plt.show()

In [None]:
print(classification_report(y_test, y_pred_GridSearchCV_Applied, target_names=['0', '1']))

**Logistic Regression Models' Comparisons**

In [None]:
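# Compute the scores from the two sets of predictions instead of hard-coding them
lr_predictions = [y_pred_lr_vanilla, y_pred_GridSearchCV_Applied]
comparison_frame = pd.DataFrame({'Model':['Logistic Regression Classifier (Default)',
                                          'Logistic Regression Classifier (GridSearchCV Applied)'],
                                 'Accuracy (Test Set)':[round(accuracy_score(y_test, p),5) for p in lr_predictions],
                                 'F1 Score (Test Set)':[round(f1_score(y_test, p),5) for p in lr_predictions],
                                 'Recall (Test Set)':[round(recall_score(y_test, p),5) for p in lr_predictions],
                                 'Precision (Test Set)':[round(precision_score(y_test, p),5) for p in lr_predictions]}) 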

comparison_frame.style.highlight_max(color = 'lightgreen', axis = 0)

**Model 2 - Random Forest Classifier**

Random forest is an ensemble machine learning algorithm.
A forest is made up of trees, and it is often said that the more trees it has, the more robust the forest is. A random forest builds decision trees on randomly selected samples of the data, gets a prediction from each tree and selects the final answer by means of voting. It also provides a pretty good indicator of feature importance.
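The voting step described above can be sketched with a small array of hypothetical tree predictions. This is a simplified illustration of majority voting; scikit-learn's RandomForestClassifier combines its trees by averaging their predicted class probabilities, which usually gives the same label.

```python
import numpy as np

# Hypothetical 0/1 predictions from five trees (rows) for four customers (columns)
tree_votes = np.array([
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
])

# A customer is labelled churn (1) when more than half of the trees say so
forest_prediction = (tree_votes.mean(axis=0) > 0.5).astype(int)

print(forest_prediction.tolist())
```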

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

In [None]:
# Object creation, fitting the data & getting predictions 
rf_model_vanilla = RandomForestClassifier() 
rf_model_vanilla.fit(X_train_over,y_train_over) 
y_pred_rf = rf_model_vanilla.predict(X_test)

In [None]:
Importance =pd.DataFrame({"Importance": rf_model_vanilla.feature_importances_*100},index = X_train_over.columns)
Importance.sort_values(by = "Importance", axis = 0, ascending = True).tail(15).plot(kind ="barh", color = "r",figsize=(9, 5))
plt.title("Feature Importance Levels");
plt.show()

In [None]:
print(classification_report(y_test, y_pred_rf, target_names=['0', '1']))

In [None]:
print("**************** RANDOM FOREST vanilla MODEL RESULTS **************** ")
print('Accuracy score for testing set: ',round(accuracy_score(y_test,y_pred_rf),5))
print('F1 score for testing set: ',round(f1_score(y_test,y_pred_rf),5))
print('Recall score for testing set: ',round(recall_score(y_test,y_pred_rf),5))
print('Precision score for testing set: ',round(precision_score(y_test,y_pred_rf),5))
cm_rf = confusion_matrix(y_test, y_pred_rf)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_rf, annot=True, cmap='Reds', fmt='g', ax=ax)
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

According to the random forest classifier, total day charge, customer service calls and "international plan is yes" features have the highest impact on the model.

Accuracy and F1 score are much higher for this model, which is good news.

**Hyperparameter Tuning of Random Forest Classifier**

Hyperparameters are tuned with GridSearchCV using 3-fold cross-validation.
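A grid search fits one model per parameter combination and per fold, so its cost grows quickly with the size of the grid. The helper below is a hypothetical function written for this illustration, and `small_grid` is a deliberately reduced example grid.

```python
from math import prod

def count_grid_search_fits(param_grid, cv):
    """Number of model fits a grid search performs before the final refit."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * cv

small_grid = {"max_depth": [8, 15, 20], "n_estimators": [500, 1000]}  # hypothetical reduced grid
n_fits = count_grid_search_fits(small_grid, cv=3)

print(f"{n_fits} fits")
```

Each additional hyperparameter multiplies this number by its count of candidate values, which is worth checking before launching a search over forests with hundreds of trees.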

In [None]:
rf_params = {"max_depth": [8,15,20],
             "n_estimators":[500,1000],
             "min_samples_split":[5,10,15],
             "min_samples_leaf" : [1, 2, 4],
             # 'auto' was an alias of 'sqrt' for classifiers and is rejected by recent scikit-learn releases
             "max_features": ['sqrt'],
             "criterion":['entropy','gini']}

In [None]:
rf_model2 = RandomForestClassifier()
rf_cv_model = GridSearchCV(rf_model2, rf_params, cv=3, n_jobs=-1, verbose=False)
rf_cv_model.fit(X_train_over,y_train_over)
print("Best parameters:"+str(rf_cv_model.best_params_))
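
To get a feel for the cost of this search, the sketch below counts the candidate combinations in a grid shaped like `rf_params`. It uses only the standard library. The `grid` dict is a copy made for illustration, and it lists `'sqrt'` as the only `max_features` value because, in the scikit-learn versions that accepted `'auto'`, it was an alias of `'sqrt'` for classifiers.

```python
from itertools import product

# Illustrative copy of the search grid (not the notebook's own variable)
grid = {
    "max_depth": [8, 15, 20],
    "n_estimators": [500, 1000],
    "min_samples_split": [5, 10, 15],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt"],
    "criterion": ["entropy", "gini"],
}

# Every combination of one value per hyperparameter is one candidate model
combinations = list(product(*grid.values()))
n_candidates = len(combinations)

# With cv=3, each candidate is fitted once per fold
n_folds = 3
n_fits = n_candidates * n_folds

print(n_candidates, "candidates,", n_fits, "fits")
```

Since each fit here trains a forest of 500 or 1000 trees, the grid size is worth checking before launching the search.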

Let's use the best hyperparameters we found.

In [None]:
rf_model_GridSearchCV_Applied = RandomForestClassifier(criterion='gini', max_depth=20, max_features='sqrt', min_samples_leaf=1, min_samples_split=5, n_estimators=500)
rf_model_GridSearchCV_Applied.fit(X_train_over,y_train_over)
y_pred_GridSearchCV_Applied = rf_model_GridSearchCV_Applied.predict(X_test)

In [None]:
print("**************** HYPERPARAMETER TUNED RANDOM FOREST MODEL RESULTS ****************")
print('Accuracy score for testing set: ',round(accuracy_score(y_test, y_pred_GridSearchCV_Applied),5))
print('F1 score for testing set: ',round(f1_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Recall score for testing set: ',round(recall_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Precision score for testing set: ',round(precision_score(y_test, y_pred_GridSearchCV_Applied),5))
cm_rf = confusion_matrix(y_test, y_pred_GridSearchCV_Applied)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_rf, annot=True, cmap='Oranges', fmt='g', ax=ax);
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

In [None]:
Importance =pd.DataFrame({"Importance": rf_model_GridSearchCV_Applied.feature_importances_*100},index = X_train_over.columns)
Importance.sort_values(by = "Importance", axis = 0, ascending = True).tail(15).plot(kind ="barh", color = "r",figsize=(9, 5))
plt.title("Feature Importance Levels");
plt.show()

In [None]:
print(classification_report(y_test, y_pred_GridSearchCV_Applied, target_names=['0', '1']))

**Random Forest Models' Comparisons**

In [None]:
comparison_frame = pd.DataFrame({'Model':['Random Forest Classifier (Default)',
                                          'Random Forest Classifier (GridSearchCV Applied)'],
                                 'Accuracy (Test Set)':[0.91929,0.92434],
                                 'F1 Score (Test Set)':[0.74194,0.7619],
                                 'Recall (Test Set)':[0.71318,0.74419], 
                                 'Precision (Test Set)':[0.77311,0.78049]}) 

comparison_frame.style.highlight_max(color = 'lightgreen', axis = 0)

**Model 3 - Decision Tree Classifier**

A decision tree is a supervised learning technique that can be used for both classification and regression, although it is most often used for classification. It is a tree-structured classifier in which internal nodes test the features of a dataset, branches represent the decision rules, and each leaf node represents an outcome.

It is called a decision tree because, like a tree, it starts at a root node and expands into further branches.

Because each prediction follows an explicit chain of if/else rules, similar to how a person reasons through a decision, the logic of a decision tree is easy to understand.
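
The structure described above can be sketched as a hand-written tree in plain Python. The feature names come from the dataset, but the thresholds and the shape of the tree are hypothetical, chosen only to illustrate nodes, branches, and leaves. They are not what `DecisionTreeClassifier` learns from this data.

```python
# A hand-written tree: internal nodes test a feature, leaves hold the outcome.
# Feature names come from the dataset; the thresholds are hypothetical.
tree = {
    "feature": "customer service calls",
    "threshold": 3,
    "left": {"leaf": 0},  # at most 3 calls -> predict "stays"
    "right": {
        "feature": "total day charge",
        "threshold": 40.0,
        "left": {"leaf": 0},
        "right": {"leaf": 1},  # many calls and a high day charge -> predict "churns"
    },
}

def predict(node, customer):
    """Walk from the root to a leaf, following one branch per decision rule."""
    while "leaf" not in node:
        branch = "left" if customer[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]

loyal = {"customer service calls": 1, "total day charge": 55.0}
at_risk = {"customer service calls": 5, "total day charge": 48.5}
print(predict(tree, loyal), predict(tree, at_risk))
```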

In [None]:
# Object creation, fitting the data & getting predictions
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train_over,y_train_over)
y_pred_dt = decision_tree.predict(X_test)

In [None]:
feature_names = list(X_train_over.columns)
importances = decision_tree.feature_importances_
indices = np.argsort(importances)[-15:]  # positions of the 15 most important features

plt.figure(figsize=(8,6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='green', align='center')
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

In [None]:
print(classification_report(y_test, y_pred_dt, target_names=['0', '1']))

In [None]:
print("**************** DECISION TREE vanilla CLASSIFIER MODEL RESULTS **************** ")
print('Accuracy score for testing set: ',round(accuracy_score(y_test,y_pred_dt),5))
print('F1 score for testing set: ',round(f1_score(y_test,y_pred_dt),5))
print('Recall score for testing set: ',round(recall_score(y_test,y_pred_dt),5))
print('Precision score for testing set: ',round(precision_score(y_test,y_pred_dt),5))
cm_dt = confusion_matrix(y_test, y_pred_dt)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_dt, annot=True, cmap='Greens', fmt='g', ax=ax)
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

According to the decision tree classifier, customer service calls, total day charge, and total evening charge are the three most important features for the model.

The accuracy and F1 score of this model are not as high as those of Model 2 (random forest).

**Hyperparameter Tuning of Decision Tree Classifier**

In [None]:
dt_params = {
    'max_depth': [2, 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100],
    'criterion': ["gini", "entropy"],
    'max_features': ["sqrt"], # just sqrt is used because values of log2 and sqrt are very similar for our number of features (10-19)
    'min_samples_split': [6, 10, 14],
}

In [None]:
dt_model2 = DecisionTreeClassifier()
dt_model_GridSearchCV_Applied = GridSearchCV(dt_model2, dt_params, cv=3, n_jobs=-1, verbose=False)
dt_model_GridSearchCV_Applied.fit(X_train_over,y_train_over)
print("Best parameters:"+str(dt_model_GridSearchCV_Applied.best_params_))

Let's use the best hyperparameters we found.

In [None]:
# Rebuild the tree with the best parameters found by the grid search above (run once, right after that cell)
dt_model_GridSearchCV_Applied = DecisionTreeClassifier(**dt_model_GridSearchCV_Applied.best_params_)
dt_model_GridSearchCV_Applied.fit(X_train_over,y_train_over)
y_pred_GridSearchCV_Applied = dt_model_GridSearchCV_Applied.predict(X_test)

In [None]:
print("**************** HYPERPARAMETER TUNED Decision Tree MODEL RESULTS ****************")
print('Accuracy score for testing set: ',round(accuracy_score(y_test, y_pred_GridSearchCV_Applied),5))
print('F1 score for testing set: ',round(f1_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Recall score for testing set: ',round(recall_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Precision score for testing set: ',round(precision_score(y_test, y_pred_GridSearchCV_Applied),5))
cm_rf = confusion_matrix(y_test, y_pred_GridSearchCV_Applied)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_rf, annot=True, cmap='Oranges', fmt='g', ax=ax);
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

In [None]:
Importance =pd.DataFrame({"Importance": dt_model_GridSearchCV_Applied.feature_importances_*100},index = X_train_over.columns)
Importance.sort_values(by = "Importance", axis = 0, ascending = True).tail(15).plot(kind ="barh", color = "r",figsize=(9, 5))
plt.title("Feature Importance Levels");
plt.show()

In [None]:
print(classification_report(y_test, y_pred_GridSearchCV_Applied, target_names=['0', '1']))

**Decision Tree Models' Comparisons**

In [None]:
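# Compute the metrics from the predictions instead of hard-coding them
dt_preds = [y_pred_dt, y_pred_GridSearchCV_Applied]
comparison_frame = pd.DataFrame({'Model':['Decision Tree Classifier (Default)',
                                          'Decision Tree Classifier (GridSearchCV Applied)'],
                                 'Accuracy (Test Set)':[round(accuracy_score(y_test, p),5) for p in dt_preds],
                                 'F1 Score (Test Set)':[round(f1_score(y_test, p),5) for p in dt_preds],
                                 'Recall (Test Set)':[round(recall_score(y_test, p),5) for p in dt_preds],
                                 'Precision (Test Set)':[round(precision_score(y_test, p),5) for p in dt_preds]})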

comparison_frame.style.highlight_max(color = 'lightgreen', axis = 0)

**Model 4 - K-Nearest Neighbors (KNN)**

K-Nearest Neighbors (KNN) is a simple yet powerful supervised machine learning algorithm used for classification and regression tasks. In the context of customer churn prediction for SyriaTel, KNN can be utilized to classify customers as churned or active based on similarities in their feature values.

In KNN modeling, the algorithm classifies a new data point by comparing it to its K nearest neighbors in the training dataset. The value of K represents the number of neighboring data points considered for classification. The algorithm calculates the distance between the new data point and each of its neighbors using a distance metric such as Euclidean distance. The majority class among the K nearest neighbors determines the class label assigned to the new data point.

One advantage of KNN is its simplicity and intuitive nature. It does not make any underlying assumptions about the data distribution and can capture nonlinear relationships between features and the target variable. 

The choice of K is crucial: too low a value makes predictions noisy and prone to overfitting, while too high a value over-smooths the decision boundary and biases predictions toward the majority class. Additionally, KNN is sensitive to the scale of the features, so feature normalization may be necessary to give each variable comparable weight in the distance calculation.
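
The voting procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation. The toy points, their labels, and the helper name `knn_predict` are made up for the example, and the two features are assumed to be already scaled to the same range.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = [
        (math.dist(point, query), label)  # Euclidean distance to each training point
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy, already-scaled data: (customer service calls, day charge), both in [0, 1]
X_toy = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.8, 0.9), (0.9, 0.8), (0.85, 0.7)]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, (0.75, 0.8), k=3))
print(knn_predict(X_toy, y_toy, (0.2, 0.2), k=3))
```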

In [None]:
# Fitting our KNN classifier

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier()
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

In [None]:

print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))  # recall replaces explained variance, which is a regression metric
print(classification_report(y_test,y_pred))

In [None]:
#Hyperparameter Tuning using random search 
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import KFold
neighbor_range = np.arange(1, 41)
knn = KNeighborsClassifier()

params = {'n_neighbors' : neighbor_range,
         'weights' : ['uniform', 'distance'],
         'metric' : ['manhattan', 'euclidean', 'minkowski']}

kfolds = KFold(n_splits = 5)
rscv = RandomizedSearchCV(knn, params, cv = kfolds, random_state = 0)
rscv.fit(X_train, y_train)
print("Best parameters:", rscv.best_params_)

In [None]:
#Fitting the best parameters
knn_b = KNeighborsClassifier(n_neighbors=15, weights='distance',metric='euclidean')
#Train model 
knn_b.fit(X_train,y_train)
#Predict using model 
y_pred = knn_b.predict(X_test)

In [None]:
Knn= accuracy_score(y_test, y_pred)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))  # recall replaces explained variance, which is a regression metric
print(classification_report(y_test,y_pred))

**Hyperparameter Tuning of K-Nearest Neighbors (KNN)**

In [None]:
knn_params = {'weights' : ['uniform', 'distance'],
              'metric' : ['manhattan', 'euclidean', 'minkowski'],
              'n_neighbors': [5, 15, 25, 35, 45, 55, 65],
              'p': [1, 2, 10]}

In [None]:
knn_model2 = KNeighborsClassifier()
knn_model_GridSearchCV_Applied = GridSearchCV(knn_model2, knn_params, cv=3, n_jobs=-1, verbose=False)
knn_model_GridSearchCV_Applied.fit(X_train_over,y_train_over)
print("Best parameters:"+str(knn_model_GridSearchCV_Applied.best_params_))

Let's use the best hyperparameters we found.

In [None]:
# Rebuild the KNN model with the best parameters found by the grid search above (run once, right after that cell)
knn_model_GridSearchCV_Applied = KNeighborsClassifier(**knn_model_GridSearchCV_Applied.best_params_)
knn_model_GridSearchCV_Applied.fit(X_train_over,y_train_over)
y_pred_GridSearchCV_Applied = knn_model_GridSearchCV_Applied.predict(X_test)

In [None]:
print("**************** HYPERPARAMETER TUNED knn MODEL RESULTS ****************")
print('Accuracy score for testing set: ',round(accuracy_score(y_test, y_pred_GridSearchCV_Applied),5))
print('F1 score for testing set: ',round(f1_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Recall score for testing set: ',round(recall_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Precision score for testing set: ',round(precision_score(y_test, y_pred_GridSearchCV_Applied),5))
cm_rf = confusion_matrix(y_test, y_pred_GridSearchCV_Applied)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_rf, annot=True, cmap='Oranges', fmt='g', ax=ax);
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

In [None]:
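# KNN exposes no feature_importances_, so estimate importance by permutation on the test set
from sklearn.inspection import permutation_importance
perm = permutation_importance(knn_model_GridSearchCV_Applied, X_test, y_test, n_repeats=10, random_state=0)
Importance =pd.DataFrame({"Importance": perm.importances_mean},index = X_train_over.columns)
Importance.sort_values(by = "Importance", axis = 0, ascending = True).tail(15).plot(kind ="barh", color = "r",figsize=(9, 5))
plt.title("Permutation Importance (mean drop in accuracy)");
plt.show()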

In [None]:
print(classification_report(y_test, y_pred_GridSearchCV_Applied, target_names=['0', '1']))

**KNN Models' Comparisons**

In [None]:
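# Compute the metrics from the predictions instead of hard-coding them
# (`classifier` is the default KNN model fitted above)
knn_preds = [classifier.predict(X_test), y_pred_GridSearchCV_Applied]
comparison_frame = pd.DataFrame({'Model':['KNN Classifier (Default)',
                                          'KNN Classifier (GridSearchCV Applied)'],
                                 'Accuracy (Test Set)':[round(accuracy_score(y_test, p),5) for p in knn_preds],
                                 'F1 Score (Test Set)':[round(f1_score(y_test, p),5) for p in knn_preds],
                                 'Recall (Test Set)':[round(recall_score(y_test, p),5) for p in knn_preds],
                                 'Precision (Test Set)':[round(precision_score(y_test, p),5) for p in knn_preds]})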

comparison_frame.style.highlight_max(color = 'lightgreen', axis = 0)

**Model 5 - Support Vector Machine (SVM)**

Support Vector Machine (SVM) is a powerful supervised machine learning algorithm used for classification and regression tasks. It finds the hyperplane that separates the classes with the largest margin, and with kernel functions it can also handle high-dimensional and nonlinear data effectively.

SVM utilizes support vectors, which are the data points closest to the decision boundary, to define the separation between classes. It can handle both linearly separable and nonlinear data using different kernel functions such as linear, polynomial, RBF, and sigmoid.

One advantage of SVM is its ability to generalize well to unseen data and handle complex datasets. By maximizing the margin, SVM can provide good generalization performance and be less prone to overfitting. Because the decision boundary depends only on the support vectors, points far from the boundary do not affect it, although outliers that fall near or across the boundary can still do so.
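
For the linear case, the decision rule and the margin can be written out directly. The sketch below assumes a hypothetical, already-fitted linear SVM with weight vector `w` and intercept `b`. The numbers are invented for illustration and do not come from the models in this notebook.

```python
import math

# Hypothetical parameters of an already-fitted linear SVM in two dimensions
w = (3.0, 4.0)
b = -5.0

def decision_function(x):
    """Signed score w.x + b; its sign picks the class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(x):
    """Return 1 on the positive side of the hyperplane, 0 otherwise."""
    return 1 if decision_function(x) >= 0 else 0

# Width of the margin between the supporting hyperplanes w.x + b = +1 and w.x + b = -1
margin_width = 2 / math.hypot(*w)

print(classify((2.0, 1.0)), classify((0.0, 0.5)), margin_width)
```

A smaller norm of `w` gives a wider margin, which is the quantity that SVM training maximizes.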

In [None]:
from sklearn.svm import SVC
# Let's now build the svm model 
model = SVC()
# Train the model using the training set
model.fit(X_train,y_train)

# Predict the response for the test set
y_pred = model.predict(X_test)
y_pred

In [None]:
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))  # recall replaces explained variance, which is a regression metric
print(classification_report(y_test,y_pred))

In [None]:
# Optimizing our model

from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.1, 1],
              'gamma': [1, 0.1]
}
svm_grid = GridSearchCV(model,param_grid=param_grid)
svm_grid.fit(X_train,y_train)

In [None]:
print(svm_grid.best_params_)

In [None]:
#Using the best parameters from hyperparameter tuning:
Final= SVC(C = 1, gamma = 0.1)

#Fitting the model:
Final.fit(X_train,y_train)

#Predicting values:
y_pred = Final.predict(X_test)

In [None]:
Svm= accuracy_score(y_test, y_pred)
print(accuracy_score(y_test,y_pred))
print(confusion_matrix(y_test,y_pred))
print(precision_score(y_test,y_pred))
print(recall_score(y_test,y_pred))  # recall replaces explained variance, which is a regression metric
print(classification_report(y_test,y_pred))

**Hyperparameter Tuning of Support Vector Machine (SVM)**

In [None]:
svm_params = {'gamma': [1, 0.1],
              'kernel': ['linear'],
              'C': [0.1, 1, 10]}

In [None]:
svm_model2 = SVC()
svm_model_GridSearchCV_Applied = GridSearchCV(svm_model2, svm_params, cv=3, n_jobs=-1, verbose=False)
svm_model_GridSearchCV_Applied.fit(X_train_over,y_train_over)
print("Best parameters:"+str(svm_model_GridSearchCV_Applied.best_params_))

Let's use the best hyperparameters we found.

In [None]:
# Rebuild the SVM with the best parameters found by the grid search above (run once, right after that cell)
svm_model_GridSearchCV_Applied = SVC(**svm_model_GridSearchCV_Applied.best_params_)
svm_model_GridSearchCV_Applied.fit(X_train_over,y_train_over)
y_pred_GridSearchCV_Applied = svm_model_GridSearchCV_Applied.predict(X_test)

In [None]:
print("**************** HYPERPARAMETER TUNED svm MODEL RESULTS ****************")
print('Accuracy score for testing set: ',round(accuracy_score(y_test, y_pred_GridSearchCV_Applied),5))
print('F1 score for testing set: ',round(f1_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Recall score for testing set: ',round(recall_score(y_test, y_pred_GridSearchCV_Applied),5))
print('Precision score for testing set: ',round(precision_score(y_test, y_pred_GridSearchCV_Applied),5))
cm_rf = confusion_matrix(y_test, y_pred_GridSearchCV_Applied)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_rf, annot=True, cmap='Oranges', fmt='g', ax=ax);
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

In [None]:
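# SVC exposes no feature_importances_, so estimate importance by permutation on the test set
from sklearn.inspection import permutation_importance
perm = permutation_importance(svm_model_GridSearchCV_Applied, X_test, y_test, n_repeats=10, random_state=0)
Importance =pd.DataFrame({"Importance": perm.importances_mean},index = X_train_over.columns)
Importance.sort_values(by = "Importance", axis = 0, ascending = True).tail(15).plot(kind ="barh", color = "r",figsize=(9, 5))
plt.title("Permutation Importance (mean drop in accuracy)");
plt.show()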

In [None]:
print(classification_report(y_test, y_pred_GridSearchCV_Applied, target_names=['0', '1']))

**SVM Models' Comparisons**

In [None]:
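# Compute the metrics from the predictions instead of hard-coding them
# (`model` is the default SVC fitted above)
svm_preds = [model.predict(X_test), y_pred_GridSearchCV_Applied]
comparison_frame = pd.DataFrame({'Model':['SVM Classifier (Default)',
                                          'SVM Classifier (GridSearchCV Applied)'],
                                 'Accuracy (Test Set)':[round(accuracy_score(y_test, p),5) for p in svm_preds],
                                 'F1 Score (Test Set)':[round(f1_score(y_test, p),5) for p in svm_preds],
                                 'Recall (Test Set)':[round(recall_score(y_test, p),5) for p in svm_preds],
                                 'Precision (Test Set)':[round(precision_score(y_test, p),5) for p in svm_preds]})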

comparison_frame.style.highlight_max(color = 'lightgreen', axis = 0)

# Evaluation

**Models Comparison**

**ROC Curve**

In [None]:
classifiers = [LogisticRegression(),
               RandomForestClassifier(),
               DecisionTreeClassifier(),
               KNeighborsClassifier(),
               SVC(probability=True)]  # probability=True enables predict_proba for SVC


# Collect one row of ROC results per classifier
rows = []

# Train the models and record the results
for cls in classifiers:
    model = cls.fit(X_train_over, y_train_over)
    yproba = model.predict_proba(X_test)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, yproba)
    auc = roc_auc_score(y_test, yproba)

    rows.append({'classifiers': cls.__class__.__name__,
                 'fpr': fpr,
                 'tpr': tpr,
                 'auc': auc})

# Build the result table (DataFrame.append was removed in pandas 2.0)
# and set the classifier names as index labels
result_table = pd.DataFrame(rows).set_index('classifiers')

fig = plt.figure(figsize=(8,6))

for i in result_table.index:
    plt.plot(result_table.loc[i]['fpr'], 
             result_table.loc[i]['tpr'], 
             label="{}, AUC={:.3f}".format(i, result_table.loc[i]['auc']))
    
plt.plot([0,1], [0,1], color='orange', linestyle='--')

plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)

plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)

plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')

plt.show()

The ROC curve plots the true positive rate against the false positive rate of each classifier as the decision threshold varies.

The best-performing models have a curve that hugs the upper-left corner of the graph, which is the random forest classifier in this case.
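
To make the construction of the curve concrete, the sketch below computes the ROC points and the area under the curve by hand for four toy predictions. It is a simplified stand-in for `roc_curve` and `roc_auc_score`, with made-up labels and scores. The helper names `roc_points` and `auc_trapezoid` are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by sweeping the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    fpr, tpr = [0.0], [0.0]
    # Lower the threshold one distinct score at a time, highest first
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        tp = sum(p and t == 1 for p, t in zip(predicted, y_true))
        fp = sum(p and t == 0 for p, t in zip(predicted, y_true))
        tpr.append(tp / positives)
        fpr.append(fp / negatives)
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
        for i in range(1, len(fpr))
    )

# Toy labels (1 = churn) and predicted churn probabilities
y_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_points(y_toy, scores_toy)
print(fpr, tpr, auc_trapezoid(fpr, tpr))
```

An AUC of 0.5 corresponds to the dashed diagonal (random guessing), and 1.0 corresponds to a curve that reaches the upper-left corner.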

**Model Comparisons - F1 Score (10-fold cross-validated)**

In [None]:
models = [lr,rf_model_GridSearchCV_Applied,decision_tree]

rows = []

for model in models:
    names = model.__class__.__name__
    f1 = cross_val_score(model,X_test,y_test,cv=10,scoring="f1_weighted").mean()
    rows.append([names, f1*100])

# Build the table in one step (DataFrame.append was removed in pandas 2.0)
results = pd.DataFrame(rows, columns= ["Models","F1"])
    
sns.barplot(x= 'F1', y = 'Models', data=results, palette="coolwarm")
plt.xlabel('F1 %')
plt.title('F1 of the models');

In [None]:
results.sort_values(by="F1",ascending=False)

The F1 score is the harmonic mean of precision and recall.

It is a value between 0 and 1, with 1 being a perfect score, meaning that the model made no false positive or false negative predictions.

The random forest classifier had the highest F1 score. Because false negatives (churners the model misses) have the larger business impact, recall deserves particular attention.
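
As a quick check of the definition, the harmonic mean can be computed directly from the test-set precision and recall listed for the tuned random forest in the Random Forest Models' Comparisons table. The helper name `f1_from` is just for this illustration.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall taken from the random forest comparison table (GridSearchCV row)
f1 = f1_from(0.78049, 0.74419)
print(round(f1, 4))
```

The result agrees with the F1 score of 0.7619 listed in the same table.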

**Model Comparisons - Accuracy (Test Set)**

In [None]:
models = [lr,rf_model_GridSearchCV_Applied,decision_tree]

rows = []

for model in models:
    names = model.__class__.__name__
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    rows.append([names, accuracy*100])

# Build the table in one step (DataFrame.append was removed in pandas 2.0)
results = pd.DataFrame(rows, columns= ["Models","Accuracy"])
    
    
sns.barplot(x= 'Accuracy', y = 'Models', data=results, palette="coolwarm")
plt.xlabel('Accuracy %')
plt.title('Accuracy of the models');

In [None]:
results.sort_values(by="Accuracy",ascending=False)

Accuracy is the fraction of predictions a model gets right.

All else being equal, the best-performing model has the highest accuracy, although on an imbalanced target such as churn, accuracy should be read alongside recall and F1.

Of the three models compared here, the random forest classifier has the highest accuracy.
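
Accuracy on its own can hide missed churners, so it helps to see how the numbers behave on an imbalanced sample. The toy example below uses a made-up test set of 100 customers with 15 churners (not the SyriaTel data) and a "model" that never predicts churn.

```python
# Toy test set: 100 customers, 15 of whom churn (label 1). The split is made up.
y_true = [1] * 15 + [0] * 85

# A "model" that never predicts churn
y_pred = [0] * 100

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / len(y_true)
recall = true_positives / sum(y_true)

print(accuracy, recall)
```

The accuracy looks respectable even though every churner is missed, which is why recall and F1 are reported next to it.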

**Applying SFS (Sequential Feature Selector) Feature Selection Techniques**


Sequential Feature Selector (SFS) is a feature selection technique that iteratively selects the most relevant features for a given task. It reduces dimensionality and can improve model performance by choosing features according to a predefined criterion (here, the cross-validated F1 score). SFS explores different feature combinations and evaluates their impact on model performance, which can also improve interpretability and computational efficiency.

During each iteration, SFS evaluates different subsets of features by training and testing a machine learning model. It considers both the individual performance of features and their interactions with other selected features. This way, SFS explores different combinations of features to identify the most informative subset.
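
The forward variant of this procedure can be sketched without any library. In the minimal illustration below, `toy_score` stands in for the cross-validated model score that a real SFS run would compute. The `usefulness` values and the helper name `forward_select` are hypothetical.

```python
def forward_select(features, score, k):
    """Greedy forward selection: repeatedly add the feature that yields the best score."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical per-feature usefulness; a real run would use a cross-validated model score
usefulness = {
    "total day charge": 0.30,
    "customer service calls": 0.25,
    "total eve charge": 0.10,
    "state_is_AL": 0.01,
}

def toy_score(subset):
    # Stand-in for a cross-validated F1: summed usefulness minus a small cost per feature
    return sum(usefulness[f] for f in subset) - 0.02 * len(subset)

print(forward_select(usefulness, toy_score, k=2))
```

Each step costs one score evaluation per remaining feature, which is why a full SFS run with a large random forest and cross-validation can take a long time.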

In [None]:
reduced_df.columns

In [None]:
rf = RandomForestClassifier(max_depth=20,min_samples_split=5,n_estimators=500,criterion='entropy')
sfs1 = SFS(rf, k_features=10, forward=True, floating=False, verbose=False,scoring='f1',cv=3,n_jobs=-1)
sfs1 = sfs1.fit(X, y)
sfs1.subsets_

In [None]:
sfs1.k_feature_names_

In [None]:
print("Random Forest Model's", sfs1.scoring, "score is:",round(sfs1.k_score_,3))

In [None]:
pd.DataFrame.from_dict(sfs1.get_metric_dict()).T.iloc[0:, 0:]

In [None]:
fig = plot_sfs(sfs1.get_metric_dict(), kind='std_err')

In [None]:
reduced_df.columns

In [None]:
reduced_df_subsets = reduced_df[['number vmail messages',
 'total day charge',
 'total eve charge',
 'total night charge',
 'total intl charge',
 'customer service calls',
 'state_is_AL',
 'state_is_HI',
 'state_is_RI',
 'international_plan_is_yes','churn']]

In [None]:
X_reduced = reduced_df_subsets.drop(['churn'],axis=1)
y_reduced = reduced_df_subsets['churn']
# Split the reduced data. Assumption: test_size and random_state are placeholders,
# because the settings of the original train/test split are not shown in this section.
from sklearn.model_selection import train_test_split
X_train_sfs, X_test_sfs, y_train_sfs, y_test_sfs = train_test_split(
    X_reduced, y_reduced, test_size=0.25, random_state=42, stratify=y_reduced)

In [None]:
rf_model_SFS_Applied = RandomForestClassifier(max_depth=20,min_samples_split=5,n_estimators=500,criterion='entropy')
rf_model_SFS_Applied.fit(X_train_sfs,y_train_sfs) # Fitting the data into the algorithm
y_pred_rf_sfs = rf_model_SFS_Applied.predict(X_test_sfs) # Getting the predictions

In [None]:
print("**************** SFS APPLIED RANDOM FOREST MODEL RESULTS **************** ")
print('Accuracy score for testing set: ',round(accuracy_score(y_test_sfs,y_pred_rf_sfs),5))
print('F1 score for testing set: ',round(f1_score(y_test_sfs,y_pred_rf_sfs),5))
print('Recall score for testing set: ',round(recall_score(y_test_sfs,y_pred_rf_sfs),5))
print('Precision score for testing set: ',round(precision_score(y_test_sfs,y_pred_rf_sfs),5))
cm_rf_sfs = confusion_matrix(y_test_sfs, y_pred_rf_sfs)
f, ax= plt.subplots(1,1,figsize=(5,3))
sns.heatmap(cm_rf_sfs, annot=True, cmap='Reds', fmt='g', ax=ax);
ax.set_xlabel('Predicted Labels'); ax.set_ylabel('True Labels') ; ax.set_title('Confusion Matrix')
ax.xaxis.set_ticklabels(['0', '1']) ; ax.yaxis.set_ticklabels(['0', '1'])
plt.show();

# Code Quality

# Recommendations

# Next Steps

# Conclusion