## <font color='#2F4F4F'>1. Defining the Question</font>

### a) Specifying the Data Analysis Question

What is your research question? What problem is it that you are trying to solve?

### b) Defining the Metric for Success

What will convince you that your project has succeeded?

### c) Understanding the Context 

The background information surrounding the problem or research question.

### d) Recording the Experimental Design

The steps you will take from the beginning to the end of this project.

### e) Data Relevance

Is the provided data relevant to the problem or research question?

## <font color='#2F4F4F'>2. Data Cleaning & Preparation</font>

In [None]:
# load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# to display all columns
pd.set_option('display.max_columns', None)

# to display the entire contents of a cell
pd.set_option('display.max_colwidth', None)

In [None]:
# load and preview dataset
df = pd.read_csv('call-center-data-QueryResult.csv')
df.sample(3)

In [None]:
# load glossary
glossary = pd.read_csv('classification_analysis_glossary.csv',header = None)
glossary

In [None]:
# check dataset shape
df.shape

Our dataset has 12,892 records and 23 variables.

We will drop 'recordid' and 'customer_id' since we have no need of them and they would interfere with our analysis.

In [None]:
df.drop(columns = ['recordid', 'customer_id'], inplace = True)

In [None]:
# preview variable datatypes
df.dtypes

With the exception of the 'international_plan', 'voice_mail_plan', and 'churn' variables which are boolean, this dataset is numerical.

In [None]:
# check for duplicates
df.duplicated().sum()

7,892 duplicated records are found. We will drop them.

In [None]:
df = df.drop_duplicates()
df.shape

In [None]:
# check for missing values
df.isna().sum()

No missing values found. We will look at the unique values in each variable just to be safe.

In [None]:
columns = df.columns

for col in columns:
    print("Variable:", col)
    print("Number of unique values:", df[col].nunique())
    print(df[col].unique())
    print()

We can confirm that there are no missing values in this dataset.

An anomaly has been noted: 'total_intl_minutes_2' appears to duplicate 'total_intl_minutes', and 'total_intl_calls_2' appears to duplicate 'total_intl_calls'. Let's preview them:

In [None]:
# previewing the possibly duplicated columns
df[['total_intl_minutes', 'total_intl_calls', 'total_intl_minutes_2', 'total_intl_calls_2']]

The last two columns appear to be complete copies of the first two. We will confirm this so that we do not blindly drop them.

In [None]:
# selecting the total number of records where the values of 'total_intl_minutes' are equal to the values of 
# 'total_intl_minutes_2', AND the values of 'total_intl_calls' are equal to the values of 'total_intl_calls_2'
df[(df['total_intl_minutes'] == df['total_intl_minutes_2']) & (df['total_intl_calls'] == df['total_intl_calls_2'])].count()

We see that the columns are indeed duplicates so we can safely drop them.
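
A more compact way to run the same check is to compare each pair of columns element-wise and reduce the result to a single boolean. The snippet below is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# toy frame standing in for the real dataset; values are illustrative only
toy = pd.DataFrame({
    'total_intl_minutes':   [10.0, 13.7, 12.2],
    'total_intl_minutes_2': [10.0, 13.7, 12.2],
    'total_intl_calls':     [3, 3, 5],
    'total_intl_calls_2':   [3, 3, 5],
})

# element-wise comparison, then .all() is True only if every row matches
minutes_match = (toy['total_intl_minutes'] == toy['total_intl_minutes_2']).all()
calls_match = (toy['total_intl_calls'] == toy['total_intl_calls_2']).all()

print("minutes identical:", minutes_match)
print("calls identical:", calls_match)
```

Note that `==` treats missing values as unequal, so this shortcut is only appropriate here because we have already confirmed that the dataset has no missing values.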

In [None]:
df = df.drop(columns = ['total_intl_minutes_2', 'total_intl_calls_2'])
df.shape

Another anomaly is that some records have non-zero 'total_intl_minutes', 'total_intl_calls', or 'total_intl_charge' even though 'international_plan' is False.

In [None]:
# each comparison needs its own parentheses because | binds more tightly than >
df[(df['international_plan'] == False) & ((df['total_intl_minutes'] > 0) | (df['total_intl_calls'] > 0) |
                                         (df['total_intl_charge'] > 0))]

In [None]:
df.international_plan.value_counts()

Much as we'd like to remove these inconsistent records, doing so would result in a huge loss of data, since most customers do not have an international plan. We will therefore leave them as is, but flag them for future work.

We will check to confirm that there are no 'number_vmail_messages' when 'voice_mail_plan' is set to False.

In [None]:
df[(df['voice_mail_plan'] == False) & (df['number_vmail_messages'] > 0)]

After confirming this, we can now drop the 'voice_mail_plan' column.

In [None]:
df.drop(columns = ['voice_mail_plan'], inplace = True)

In [None]:
df.dtypes

In [None]:
# looking for outliers
num_cols = df.columns.to_list()
num_cols.remove('international_plan')
num_cols.remove('churn')

plt.figure(figsize = (14, 6))
df.boxplot(num_cols)
plt.xticks(rotation = 45)
plt.show()

We see that all the remaining variables have outliers, which we will not drop.
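
The boxplot whiskers in matplotlib default to 1.5 times the interquartile range (IQR), so the points drawn beyond them can also be counted numerically. The following is a rough sketch on a made-up series; the column name is borrowed for illustration and the values are not from the dataset:

```python
import pandas as pd

# toy column; the last value is deliberately extreme
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name='total_day_calls')

# same 1.5 * IQR rule that the default boxplot whiskers use
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# count the values that fall outside the whisker limits
n_outliers = ((s < lower) | (s > upper)).sum()
print("outliers in", s.name, ":", n_outliers)
```

Applying the same rule column by column would show how many records each variable flags, which helps in judging how much data would be lost if outliers were dropped.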

We now save our clean dataset to a new CSV file.

In [None]:
# save the data set to a clean CSV file

df.to_csv('call_center_clean.csv', index = False)

df = pd.read_csv('call_center_clean.csv')
df.head()

## <font color='#2F4F4F'>3. Data Analysis</font>

### 3.1 Univariate Analysis

In [None]:
# get the summary statistics
df.describe()

In [None]:
print(df.area_code.value_counts())

plt.figure(figsize = (6, 6))
df.area_code.value_counts().plot(kind = 'pie', autopct = '%1.1f%%')
plt.title('Pie Chart of Area Code')
plt.show()

Area code 415 accounts for almost half of the records in this dataset. Area code 510 very slightly outnumbers area code 408.

In [None]:
print(df.international_plan.value_counts())

plt.figure(figsize = (6, 6))
df.international_plan.value_counts().plot(kind = 'bar', rot = 0, color = ['skyblue', 'darkorange'])
plt.title('Distribution of International Plan')
plt.xlabel('International Plan')
plt.show()

Very few of the customers are subscribed to an international plan.

In [None]:
print(df.number_customer_service_calls.value_counts())

plt.figure(figsize = (8, 8))
df.number_customer_service_calls.value_counts().plot(kind = 'bar', rot = 0)
plt.xlabel("Number of Calls to Customer Service")
plt.show()

Most customers made exactly 1 call to customer service followed by those who made 2 calls, and then those who made 0 calls. Those who made more than 5 calls make up the minority.

In [None]:
print(df.churn.value_counts())

plt.figure(figsize = (6, 6))
df.churn.value_counts().plot(kind ='bar', rot = 0, color = ['darkgreen', 'darkred'])
plt.xlabel("Churn")
plt.show()

The majority of the customers in this dataset have not churned, which makes the target variable highly imbalanced.

In [None]:
# plotting the histograms of all our numerical variables with the
# exception of 'area_code' and 'number_customer_service_calls'
num_cols.remove('area_code')
num_cols.remove('number_customer_service_calls')

fig, axes = plt.subplots(nrows = 7, ncols = 2, figsize = (14, 30))
plt.suptitle('Histograms of Numerical Variables', fontsize = 20, y = 1.01, color = 'blue')

colors = ['#00FF7F', '#8B0000', '#C71585', '#0000FF', '#DB7093', '#FFFF00', '#FF4500',
          '#7B68EE', '#FF00FF', '#ADFF2F', '#FFD700', '#A52A2A', '#2F4F4F', '#8B008B']
for ax, column, color in zip(axes.flatten(), num_cols, colors):
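    # sns.distplot is deprecated; histplot with kde = True draws the same histogram + density curve
    sns.histplot(df[column], ax = ax, color = color, kde = True, stat = 'density', alpha = 0.75)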
    
plt.tight_layout()

The majority of the numerical variables appear to be approximately normally distributed. Apart from having most of its values in the 0-5 bin, the 'number_vmail_messages' variable also appears to be roughly normally distributed. The 'total_intl_calls' variable is skewed to the right and is discrete rather than continuous.

### 3.2 Bivariate Analysis

We will make 'churn' our target variable and look at how the other variables relate to it.

In [None]:
# churn by area code
plt.figure(figsize = (8, 6))
churn_area_code = sns.countplot(x = 'area_code', hue = 'churn', data = df)
churn_area_code.set(title = "Churn by Area Code", xlabel = 'Area Code', ylabel = 'Count')
plt.show()

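Area code 415 has the highest number of churned customers. It also has the most customers overall, so this count does not by itself imply a higher churn rate.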

In [None]:
# churn by international plan
plt.figure(figsize = (8, 6))
churn_area_code = sns.countplot(x = 'international_plan', hue = 'churn', data = df)
churn_area_code.set(title = "Churn by International Plan", xlabel = 'International Plan', ylabel = 'Count')
plt.show()

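In absolute numbers, more customers without an international plan churned than customers with one. Since very few customers have an international plan, these counts alone do not show which group churns at the higher rate.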

In [None]:
# churn by number of customer service calls
plt.figure(figsize = (8, 6))
churn_area_code = sns.countplot(x = 'number_customer_service_calls', hue = 'churn', data = df)
churn_area_code.set(title = "Churn by Number of Customer Service Calls",
                    xlabel = 'Number of Customer Service Calls', ylabel = 'Count')
plt.show()

The interesting thing to note here is that those who made 0 or 2 calls to customer service churned at around the same rate. Similarly, those who made more than 3 calls reported high churn rates.
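
Because the groups differ a lot in size, raw counts can hide differences in churn rate. One way to compare rates directly is to group by the variable of interest and take the mean of 'churn'. This is a minimal sketch on made-up data, not the actual dataset:

```python
import pandas as pd

# toy data: 4 customers with 0 calls, 2 customers with 4 calls
toy = pd.DataFrame({
    'number_customer_service_calls': [0, 0, 0, 0, 4, 4],
    'churn':                         [True, False, False, False, True, False],
})

# the mean of a boolean column is the share of True values, i.e., the churn rate
churn_rate = toy.groupby('number_customer_service_calls')['churn'].mean()
print(churn_rate)
```

The same pattern should work for 'area_code' and 'international_plan' by changing the column passed to `groupby`.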

### 3.3 Feature Engineering & Test for Multicollinearity

Before we test for multicollinearity (logistic regression assumes little or no multicollinearity among the features), we need to convert the boolean values of 'international_plan' and 'churn' to 0 and 1.

In [None]:
# casting booleans to integers maps False to 0 and True to 1
df['international_plan'] = df['international_plan'].astype(int)
df['churn'] = df['churn'].astype(int)
df.head()

In [None]:
# checking the correlations between the numerical variables
YOUR CODE HERE

# plotting the correlations onto a heatmap
YOUR CODE HERE
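
One possible way to approach this step is `DataFrame.corr()`, which returns the pairwise Pearson correlations. The sketch below uses a made-up frame in which the charge is a fixed multiple of the minutes (the 0.17 rate is an assumption for illustration only), so that pair is perfectly correlated:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
minutes = rng.uniform(0, 300, size=200)

# toy frame; the per-minute rate of 0.17 is assumed purely for illustration
toy = pd.DataFrame({
    'total_day_minutes': minutes,
    'total_day_charge': 0.17 * minutes,
    'total_day_calls': rng.integers(50, 150, size=200),
})

# pairwise Pearson correlations between the numerical columns
corr = toy.corr()
print(corr.round(2))

# the matrix can then be drawn as a heatmap, e.g. sns.heatmap(corr, annot = True)
```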

We see some perfect correlations between the following variables:
- 'total_day_minutes' and 'total_day_charge'
- 'total_eve_minutes' and 'total_eve_charge'
- 'total_night_minutes' and 'total_night_charge'
- 'total_intl_minutes' and 'total_intl_charge'

We will drop the minutes.

In [None]:
# drop the columns with minutes, e.g., 'total_day_minutes', etc.
YOUR CODE HERE

In [None]:
# checking the correlations between the numerical variables
YOUR CODE HERE

# plotting the correlations onto a heatmap
YOUR CODE HERE

We will then check the Variance Inflation Factor (VIF) scores to ensure there is no high multicollinearity.

In [None]:
# calculate VIF and plot the heatmap
YOUR CODE HERE
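
As a hedged sketch of one way to obtain VIF scores: the VIF of each feature equals the corresponding diagonal entry of the inverse of the features' correlation matrix. The toy data below is made up, with column 'c' built as a noisy copy of 'a' so that both receive high scores:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=500)
b = rng.normal(size=500)

# toy features; 'c' is nearly a copy of 'a', 'b' is independent
toy = pd.DataFrame({
    'a': a,
    'b': b,
    'c': a + 0.1 * rng.normal(size=500),
})

# the diagonal of the inverted correlation matrix holds the VIF scores
corr = toy.corr()
inv_corr = pd.DataFrame(np.linalg.inv(corr.values), index=corr.index, columns=corr.columns)
vif = pd.Series(np.diag(inv_corr), index=corr.index, name='VIF')
print(vif.round(2))

# inv_corr can be passed to sns.heatmap(inv_corr, annot = True) to read the scores off the diagonal
```

In practice the target ('churn') would be left out of the frame so that only the features are assessed. The `variance_inflation_factor` function in statsmodels is an alternative if that library is available.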

We don't see any VIF score of 5 and above, which means our dataset does not have high levels of multicollinearity. We are, therefore, good to go.

## <font color='#2F4F4F'>4. Data Modeling</font>

We will carry out 5 types of classification analysis, namely:
1. Logistic Regression
2. Gaussian Naive Bayes (NB) classification
3. Decision Trees Classification
4. K-Nearest Neighbors (KNN) Classification
5. Support Vector Machine (SVM) Classification

We will then compare the different classification models to assess the best performing one(s).

In [None]:
# dividing our dataset into features (X) and target (y)

YOUR CODE HERE

In [None]:
# splitting into 80-20 train-test sets

YOUR CODE HERE
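
The two steps above could be sketched as follows. The frame is made up, and the `random_state` and `stratify` arguments are our own assumptions: stratifying keeps the churn proportion the same in both sets, which seems sensible given how imbalanced the target is.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# toy frame with an imbalanced target: 80 non-churners, 20 churners
toy = pd.DataFrame({
    'total_day_charge': rng.uniform(0, 60, size=100),
    'number_customer_service_calls': rng.integers(0, 6, size=100),
    'churn': np.repeat([0, 1], [80, 20]),
})

# features are every column except the target
X = toy.drop(columns=['churn'])
y = toy['churn']

# 80-20 split; stratify = y preserves the class proportions in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape, X_test.shape)
```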

In [None]:
# performing feature scaling on our training data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# fitting and transforming X_train while transforming X_test
YOUR CODE HERE
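
The scaler should learn its mean and standard deviation from the training set only and then reuse them on the test set, so that no information from the test set leaks into training. A minimal sketch with made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy arrays standing in for the real train and test features
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()

# fit_transform learns the training mean/std and scales the training data
X_train_scaled = scaler.fit_transform(X_train)

# transform reuses the training mean/std; the scaler is NOT refitted on the test data
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
print(X_test_scaled)
```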

In [None]:
# loading our classification libraries
YOUR CODE HERE

# instantiating our classifiers
YOUR CODE HERE

# fitting our classifiers to the training data
YOUR CODE HERE

# making predictions
YOUR CODE HERE
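
One possible layout for this cell is to keep the five classifiers in a dictionary and loop over them. The sketch below uses synthetic data from `make_classification` rather than the call-center dataset, and it uses default hyperparameters throughout, which is an assumption and not a tuned setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# synthetic, imbalanced stand-in for the real data
X, y = make_classification(n_samples=300, n_features=6, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# instantiating the classifiers with default settings
classifiers = {
    'Logistic Regression': LogisticRegression(),
    'Gaussian Naive Bayes': GaussianNB(),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Support Vector Machine': SVC(),
}

# fitting each classifier to the training data and predicting on the test data
predictions = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    predictions[name] = clf.predict(X_test)

print({name: pred[:5] for name, pred in predictions.items()})
```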

In [None]:
# printing the classification report for each classifier to assess performance
from sklearn.metrics import classification_report

# classification report for Logistic Regression
print("Logistic Regression classification report:")
YOUR CODE HERE

# classification report for Gaussian Naive Bayes Classifier
print("Gaussian Naive Bayes classification report:")
YOUR CODE HERE

# classification report for Decision Tree Classifier
print("Decision Tree classification report:")
YOUR CODE HERE

# classification report for K-Nearest Neighbors Classifier
print("K-Nearest Neighbors classification report:")
YOUR CODE HERE

# classification report for Support Vector Machine Classifier
print("Support Vector Machine classification report:")
YOUR CODE HERE
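
For reference, `classification_report` takes the true labels followed by the predicted labels. The example below is a small sketch with hand-made labels; the class names passed to `target_names` are our own choice:

```python
from sklearn.metrics import classification_report

# hand-made labels: 4 customers stayed (0), 2 churned (1)
y_test = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

# text report with precision, recall, and F1 per class
report = classification_report(y_test, y_pred, target_names=['stayed', 'churned'])
print(report)

# output_dict = True returns the same numbers as a nested dictionary
report_dict = classification_report(y_test, y_pred, output_dict=True)
print("recall for churners:", report_dict['1']['recall'])
```

Since the target is imbalanced, the precision and recall of the churn class are likely to be more informative than overall accuracy when comparing the models.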

What have you noticed about the performance of the various models?

## <font color='#2F4F4F'>5. Summary of Findings</font>

Include your findings from the analysis and modeling stages.

## <font color='#2F4F4F'>6. Recommendations</font>

What recommendations can you provide?

## <font color='#2F4F4F'>7. Challenging your Solution</font>

What can you do to improve your project?