## **Telco Customer Churn Analysis and Prediction**


### **Business Understanding**

##### **Problem Statement**
Customer retention is central to most business models because it directly affects revenue and profit margins. Many companies now use machine learning classification models to analyze churn among their customers. In the highly competitive telecommunications industry, retaining customers is especially important. This project accesses and analyzes customer churn data from multiple sources and builds a classification model that predicts which customers are likely to churn, so that a telecommunications company can understand churn, its impact on profitability, and how to improve its retention strategies.

##### **Goal and Objectives**

•    To understand the current customer churn rate.

•    To identify factors (such as demographics and usage patterns) that influence customer churn, in order to gain a deeper understanding of customer behavior.

•    To build a machine learning model that predicts which customers of a telecommunications company are likely to churn.

##### **Stakeholders**
•	Company Executives and Management

•	Data Science and Analytics Team

•	Customer Service and Support Teams

•	Marketing, Sales, and Advertising Teams

•	Finance 

•	Legal and Compliance Team

##### **Key Metrics and Success Criteria**

•  Accuracy Requirement:

•	The model must achieve an accuracy of at least 85% when evaluated on balanced data, ensuring a high proportion of correct predictions.

•  F1 Score Benchmark:

•	Models should attain an F1 score greater than 0.80, indicating a strong balance between precision and recall, which is crucial for handling both false positives and false negatives.

•  ROC Curve Standard:

•	An area under the ROC curve (AUC) of at least 0.80 is desired, showing that the model separates churners from non-churners well across thresholds, with a good balance between sensitivity and specificity.

•  Baseline Models Requirement:

•	At least four different baseline models should be developed to serve as benchmarks. These could include logistic regression, decision trees, support vector machines, and k-nearest neighbors, providing a range of reference points for comparison.

•  Hyperparameter Tuning Condition:

•	Hyperparameter tuning will be conducted only on baseline models that achieve an F1 score above the 0.80 threshold. This concentrates tuning effort on models that show initial promise and meet the performance criteria.
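
As a rough illustration of how these criteria could be checked, the sketch below computes accuracy, precision, recall, and F1 from predicted and true labels, then applies the tuning condition. The function names and the ten labels are hypothetical and chosen only for illustration; they are not project results.

```python
def classification_metrics(y_true, y_pred, positive="Yes"):
    """Return accuracy, precision, recall and F1 for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def qualifies_for_tuning(metrics, f1_threshold=0.80):
    """Tuning condition: only models with F1 above the threshold qualify."""
    return metrics["f1"] > f1_threshold


# Hypothetical labels for ten customers (illustration only)
y_true = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"]
y_pred = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No", "Yes"]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
print("Tune this model:", qualifies_for_tuning(metrics))
```

In this toy case accuracy is 0.80 while F1 is only 0.75, so the model would fall short of both the accuracy requirement and the tuning condition. In the notebook itself, the equivalent values would normally come from `sklearn.metrics`.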


##### **Hypothesis**

- Null Hypothesis (Ho): There is no significant relationship between the total amount charged to a customer and their likelihood of churning.

- Alternative Hypothesis (H1): There is a significant relationship between the total amount charged to a customer and their likelihood of churning.


     




##### **Analytical Questions**
I. What is the overall churn rate?

- As part of our goals we would like to understand the overall churn rate. This will help us understand the current state of customer churn and identify areas for improvement.

II. What are the key demographic and behavioral characteristics of customers who churn compared to those who stay, and how do these  characteristics vary across different customer segments?

- Insights derived can include whether certain age groups or regions are more prone to churn, or whether specific behaviours are associated with higher churn rates.

III. Which factors have the highest influence on customer churn, and how do they interact with each other?

- This will help us understand which factors drive churn and how they interact, and identify areas for improvement in customer retention.

IV. How does the length of time a customer has been with the company (tenure) impact their likelihood of churning? Are newer customers more likely to churn than long-term customers?

- This will help us understand the customer lifecycle and identify areas for improvement in customer retention.

V. What is the overall churn rate compared across different contract types?

- Insights derived can include whether customers with shorter or longer contracts are more likely to churn.

##### **Scope and Constraints**
The constraints of this project include computational resources, model complexity, time limitations, stakeholder expectations, and ethical and legal considerations.

##### **Additional Information**

This project is to be completed in 4 weeks.




### **Data Understanding**

In [None]:
# Import libraries
# Data manipulation and analysis
import pandas as pd
import numpy as np
 
# Database connectivity
import pyodbc
 
# Database ORM (optional)
from sqlalchemy import create_engine
 
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
 
# Machine learning 
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
 


: 

#### Database connectivity

In [None]:
# Connect to the first data source: a SQL Server database
# Credentials are read from environment variables instead of being hard-coded;
# the variable names DB_USERNAME and DB_PASSWORD are assumptions for this notebook.
import os

# Define the connection string
server = 'dap-projects-database.database.windows.net'
database = 'dapDB'
username = os.environ['DB_USERNAME']
password = os.environ['DB_PASSWORD']
conn_str = f'DRIVER={{ODBC Driver 17 for SQL Server}};SERVER={server};DATABASE={database};UID={username};PWD={password}'

# Connect to the database
conn = pyodbc.connect(conn_str)

# Query the data
query = "SELECT * FROM dbo.LP2_Telco_churn_first_3000"
data1 = pd.read_sql(query, conn)

# Close the connection
conn.close()

# Display the data
print(data1.head())


: 

In [None]:
# Load the second data source: a CSV file hosted on GitHub

# URL of the CSV file
data2_url = "https://raw.githubusercontent.com/Azubi-Africa/Career_Accelerator_LP2-Classifcation/main/LP2_Telco-churn-second-2000.csv"

# Load the DataFrame from the URL
data2 = pd.read_csv(data2_url)

# Display the first few rows to verify
print(data2.head())


: 

In [None]:
# Load the third data source: a local Excel file

# File path of the Excel file
file_path = r"C:\Users\USER\Desktop\Telco-churn-last-2000.xlsx"

# Load Excel file into a DataFrame
Test_data = pd.read_excel(file_path)

# Display the first few rows of the DataFrame
print(Test_data.head())


: 

#### **Exploratory Data Analysis (EDA)**

- Data Quality Assessment, EDA & Data Cleaning

In [None]:
#Checking basic information for the first data set
data1.info()


: 

In [None]:
#Checking basic information for the second data set
data2.info()


: 

In [None]:
#Checking basic information for the third data set
Test_data.info()


: 

In [None]:
#concatenating data1 and data2

train_data = pd.concat([data1, data2], ignore_index=True)
train_data.tail()

: 

In [None]:
#check shape of train_data

train_data.shape


: 

In [None]:
#Check basic information for the train data set
train_data.info()



: 

In [None]:
#check for missing values in the train_data 

train_data.isnull().sum()

: 

In [None]:
#check for percentage of missing values in the train_data

(train_data.isnull().sum() / train_data.shape[0]) * 100

: 

In [None]:
#Check for duplicates in the train_data

train_data.duplicated().sum()

: 

In [None]:
#check for unique values in each column of the train_data

train_data.nunique()

: 

In [None]:
#Describe the train_data

train_data.describe().T

: 

In [None]:
#Describe the categorical columns in the train_data

train_data.describe(include='object').T

: 

In [None]:
#Describing the whole train_data

train_data.describe(include='all').T

: 

In [None]:
#list out all columns in train_data

train_data.columns

: 

In [None]:
#list out numerical columns in train_data

numerical_columns = train_data.select_dtypes(include=['int64', 'float64']).columns
numerical_columns

: 

In [None]:
#list out categorical columns in train_data

categorical_columns = train_data.select_dtypes(include=['object']).columns
categorical_columns

: 

#### Data Cleaning & Preprocessing

In [None]:
for column in categorical_columns:
    print(f"{column}")
    print(f'There are {train_data[column].unique().size} unique values')
    print(f'{train_data[column].unique()}')
    print('='* 50)
    


: 

In [None]:
#converting TotalCharges to numeric type float
train_data['TotalCharges'] = pd.to_numeric(train_data['TotalCharges'], errors='coerce')

# Check the data types after conversion
print(train_data.dtypes)


: 

In [None]:
# Check for null values in the TotalCharges column
null_counts = train_data['TotalCharges'].isnull().sum()

print(f"Number of null values in TotalCharges column: {null_counts}")

: 
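
The null check above matters because `errors='coerce'` silently turns any entry that cannot be parsed as a number into `NaN`. The sketch below shows this on a few made-up values; the series contents are hypothetical and only mimic a text column that should be numeric.

```python
import pandas as pd

# Hypothetical values mimicking a text column that should be numeric
raw = pd.Series(["29.85", "1889.5", " ", None, "108.15"])

# Unparseable entries (the blank string) and missing entries become NaN
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print("Entries that became NaN:", converted.isna().sum())
```

Comparing the null count before and after conversion is a simple way to see how many values were coerced rather than originally missing.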

In [None]:
# Define a dictionary that maps boolean values to 'Yes'/'No' and keeps existing string categories unchanged
mapping_new_cat_values = {
    'Partner': {True: 'Yes', False: 'No', 'No': 'No', 'Yes': 'Yes'},
    'Dependents': {True: 'Yes', False: 'No', 'No': 'No', 'Yes': 'Yes'},
    'PhoneService': {True: 'Yes', False: 'No', 'No': 'No', 'Yes': 'Yes'},
    'MultipleLines': {False: 'No', True: 'Yes', 'No': 'No', 'No phone service': 'No phone service', 'Yes': 'Yes'},
    'OnlineSecurity': {False: 'No', True: 'Yes', 'No': 'No', 'Yes': 'Yes', 'No internet service': 'No internet service'},
    'OnlineBackup': {False: 'No', True: 'Yes', 'No': 'No', 'Yes': 'Yes', 'No internet service': 'No internet service'},
    'DeviceProtection': {False: 'No', True: 'Yes', 'No': 'No', 'Yes': 'Yes', 'No internet service': 'No internet service'},
    'TechSupport': {False: 'No', True: 'Yes', 'No': 'No', 'Yes': 'Yes', 'No internet service': 'No internet service'},
    'StreamingTV': {False: 'No', True: 'Yes', 'No': 'No', 'Yes': 'Yes', 'No internet service': 'No internet service'},
    'StreamingMovies': {False: 'No', True: 'Yes', 'No': 'No', 'Yes': 'Yes', 'No internet service': 'No internet service'},
    'PaperlessBilling': {True: 'Yes', False: 'No', 'No': 'No', 'Yes': 'Yes'},
    'Churn': {True: 'Yes', False: 'No', 'No': 'No', 'Yes': 'Yes'},
    
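}

# Apply the mapping column by column; values without a mapping (e.g. missing values) are left unchanged
for column, mapping in mapping_new_cat_values.items():
    train_data[column] = train_data[column].map(lambda value: mapping.get(value, value))

train_data.tail()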


: 

In [None]:
#Refresh the lists of numerical and categorical columns in the train_data

numerical_columns = train_data.select_dtypes(include=['int64', 'float64']).columns
numerical_columns

categorical_columns = train_data.select_dtypes(include=['object']).columns
categorical_columns

: 

In [None]:
for column in categorical_columns:
    print(f"{column}")
    print(f'There are {train_data[column].unique().size} unique values')
    print(f'{train_data[column].unique()}')
    print('='* 50)
    

: 

In [None]:
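#Drop customerID column

train_data = train_data.drop('customerID', axis=1)

# Refresh the categorical column list so later loops do not reference the dropped column
categorical_columns = train_data.select_dtypes(include=['object']).columns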


: 

In [None]:
#check for missing values in the train_data

train_data.isnull().sum()

: 

In [None]:
#drop missing value i

: 

In [None]:
#Check updated info on train_data

train_data.info()

: 

In [None]:
#Understanding the target variable 'Churn'

# Calculate and print value counts for the 'Churn' column
churn_value_counts = train_data['Churn'].value_counts()
print("Churn Value Counts:\n", churn_value_counts)
print("=" * 50)

# Print unique values for the 'Churn' column
unique_churn_values = train_data['Churn'].unique()
print("Unique Churn Values:", unique_churn_values)


: 

#### **Univariate Analysis**

In [None]:
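#Plot bar charts of the percentage distribution for categorical columns
# The column list is taken from the current DataFrame so dropped columns are not referenced
for column in train_data.select_dtypes(include=['object']).columns:
    plt.figure(figsize=(8, 5))
    
    # Calculate the percentage of rows in each category
    counts = train_data[column].value_counts(normalize=True) * 100
    
    # Plotting
    counts.plot(kind='bar')
    plt.title(f'Bar chart of {column}')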
    plt.xlabel(column)
    plt.ylabel('Percentage')
    plt.xticks(rotation=45)
    
    # Annotate with percentages
    for i, v in enumerate(counts):
        plt.text(i, v + 1, f'{v:.1f}%', ha='center', va='bottom', fontsize=8)
    
    plt.show()


: 

- Univariate analysis for numerical columns

In [None]:
#Histogram distribution for numerical columns

numerical_columns = train_data.select_dtypes(include=['float64', 'int64']).columns

# Plot histograms for numerical columns with percentage annotations
for column in numerical_columns:
    plt.figure(figsize=(8, 5))

    # Drop missing values first; plt.hist cannot compute bin edges when NaN is present
    values = train_data[column].dropna()

    # Plot the histogram once and keep the bin counts and edges for the annotations
    counts, bins, _ = plt.hist(values, bins=10, edgecolor='black')
    plt.title(f'Histogram of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')

    # Calculate the percentage of values that fall in each bin
    bin_centers = 0.5 * (bins[:-1] + bins[1:])
    percentages = counts / len(values) * 100

    # Annotate each non-empty bin just above its bar
    for bin_center, count, percentage in zip(bin_centers, counts, percentages):
        if percentage > 0.01:
            plt.text(bin_center, count, f'{percentage:.1f}%', ha='center', va='bottom', fontsize=8)

    plt.show()


: 

In [None]:
#Further analysing using kdeplot for numerical columns

for column in numerical_columns:
    sns.kdeplot(data=train_data, x=column, hue='Churn', fill=True, alpha=0.5)
    plt.title(f'KDE plot of {column} by Churn')
    plt.xlabel(column)
    plt.ylabel('Density')
    plt.show()
    

: 

In [None]:
#Check for outliers in numerical columns using box plots

for column in numerical_columns:
    sns.boxplot(data=train_data, x='Churn', y=column)
    plt.title(f'Box plot of {column} by Churn')
    plt.xlabel('Churn')
    plt.ylabel(column)
    plt.show()
    

: 

#### **Bivariate Analysis**

In [None]:
# Correlation matrix for numerical columns

correlation_matrix = train_data[numerical_columns].corr()

# Plotting correlation matrix using heatmap

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix for Numerical Columns')
plt.show()

: 

In [2]:
#Multivariate analysis

sns.pairplot(train_data, hue='Churn', diag_kind='kde')
plt.show()

#### Answering Analytical Questions

I. What is the overall churn rate?



In [None]:
#What is the overall churn rate?

# Share of customers with a known Churn value who churned
churn_rate = (train_data['Churn'].dropna() == 'Yes').mean() * 100
print(f"Overall Churn Rate: {churn_rate:.1f}%")


: 

II. What are the key demographic and behavioral characteristics of customers who churn compared to those who stay, and how do these  characteristics vary across different customer segments?


In [None]:
# Key demographic and behavioral characteristics of customers who churn compared to those who stay

train_data['Churn_Category'] = train_data['Churn'].map({'Yes': 'Churn', 'No': 'Stay'})

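# Perform bivariate analysis between 'Churn_Category' and the other categorical columns,
# skipping the target itself and its derived label
segment_columns = [
    column for column in train_data.select_dtypes(include=['object']).columns
    if column not in ('Churn', 'Churn_Category')
]
for column in segment_columns: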
    cross_tab = pd.crosstab(train_data[column], train_data['Churn_Category'], normalize='index') * 100
    print(f"Cross-Tabulation of Churn_Category vs {column}:")
    print(cross_tab)
    print("=" * 50)
    
    # Plotting a stacked bar chart to visualize the relationship
    cross_tab.plot(kind='bar', stacked=True, figsize=(10, 6))
    plt.title(f'Stacked Bar Chart of Churn_Category vs {column}')
    plt.xlabel(column)
    plt.ylabel('Percentage')
    plt.legend(title='Churn_Category')
    plt.show()


: 

III. Which factors have the highest influence on customer churn and how do they interact with each other?

In [None]:

# factors that have the highest influence on customer churn and how do they interact with each other

# correlation matrix for numerical columns

correlation_matrix = train_data[numerical_columns].corr()

# Plotting correlation matrix using heatmap

plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix for Numerical Columns')
plt.show()



: 
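
The heatmap above only relates the numerical columns to one another. To see how each of them relates to churn itself, one option is to encode `Churn` as a 0/1 flag and correlate every numerical column with it. The sketch below is a minimal illustration under that assumption; the helper name `correlation_with_churn` and the toy values are hypothetical.

```python
import pandas as pd


def correlation_with_churn(df, numeric_columns, target="Churn"):
    """Pearson correlation of each numeric column with a 0/1 churn flag."""
    churn_flag = df[target].map({"Yes": 1, "No": 0})
    return df[list(numeric_columns)].corrwith(churn_flag).sort_values()


# Hypothetical toy data (illustration only, not project data)
toy = pd.DataFrame({
    "tenure": [1, 2, 5, 40, 60, 72],
    "TotalCharges": [80.0, 190.0, 350.0, 2000.0, 5400.0, 1440.0],
    "Churn": ["Yes", "Yes", "Yes", "No", "No", "No"],
})

result = correlation_with_churn(toy, ["tenure", "TotalCharges"])
print(result)
```

On the notebook's data this could be called as `correlation_with_churn(train_data, numerical_columns)`. A negative value means higher values of the column go with less churn. Categorical factors are not covered by this check and need a separate approach, such as the cross-tabulations in question II.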

IV. How does the length of time a customer has been with the company (tenure) impact their likelihood of churning? Are newer customers more likely to churn than long-term customers

In [None]:
#How does tenure affect the likelihood of churning?

#Distribution of tenure for churned and retained customers

sns.boxplot(data=train_data, x='Churn', y='tenure')
plt.title('Box plot of tenure by Churn')
plt.xlabel('Churn')
plt.ylabel('tenure')
plt.show()

# Are newer customers more likely to churn than long-term customers

sns.kdeplot(data=train_data, x='tenure', hue='Churn', fill=True, alpha=0.5)
plt.title('KDE plot of tenure by Churn')
plt.xlabel('tenure')
plt.ylabel('Density')
plt.show()


: 


V. What is the overall churn rate compared across different contract types?

In [None]:
#what 

: 
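
One possible way to answer question V is to group customers by contract type and compute the share that churned in each group. The sketch below assumes the contract type is stored in a column named `Contract`, which is not shown in this notebook; the helper name `churn_rate_by` and the toy rows are hypothetical.

```python
import pandas as pd


def churn_rate_by(df, column, target="Churn"):
    """Percentage of churned customers within each category of `column`."""
    # Ignore rows where the churn status is unknown
    known = df.dropna(subset=[target])
    churned = known[target].eq("Yes")
    return (churned.groupby(known[column]).mean() * 100).round(1)


# Hypothetical toy data (illustration only, not project data)
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "Month-to-month",
                 "Month-to-month", "One year", "One year",
                 "Two year", "Two year"],
    "Churn": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

rates = churn_rate_by(toy, "Contract")
print(rates)
```

With the notebook's data the call would be `churn_rate_by(train_data, 'Contract')`, and the result can be passed to `.plot(kind='bar')` for a quick visual comparison.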

#### **Hypothesis Testing**

- Null Hypothesis (Ho): There is no significant relationship between the total amount charged to a customer and their likelihood of churning.

- Alternative Hypothesis (H1): There is a significant relationship between the total amount charged to a customer and their likelihood of churning.


In [None]:
import pandas as pd
from scipy.stats import mannwhitneyu

# Drop rows with missing values in 'TotalCharges' or 'Churn'
train_data = train_data.dropna(subset=['TotalCharges', 'Churn'])

# Separate the data into two groups based on the 'Churn' column
churn_yes = train_data[train_data['Churn'] == 'Yes']['TotalCharges']
churn_no = train_data[train_data['Churn'] == 'No']['TotalCharges']

# Perform the Mann-Whitney U test
stat, p_value = mannwhitneyu(churn_yes, churn_no)

print(f'Mann-Whitney U test statistic: {stat}')
print(f'p-value: {p_value}')

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis (Ho). There is a significant relationship between TotalCharges and Churn.")
else:
    print("Fail to reject the null hypothesis (Ho). There is not enough evidence of a significant relationship between TotalCharges and Churn.")



: 
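
A p-value shows whether a difference is detectable, not how large it is. As a complement, the sketch below computes the common-language effect size: the probability that a randomly chosen value from one group exceeds a randomly chosen value from the other, with ties counted as half. The function name and sample values are hypothetical and for illustration only.

```python
def common_language_effect_size(group_a, group_b):
    """Probability that a random value from group_a exceeds one from group_b.

    Ties count as half. This equals U_a / (n_a * n_b), where U_a is the
    Mann-Whitney U statistic computed for group_a.
    """
    wins = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    return wins / (len(group_a) * len(group_b))


# Hypothetical total charges (illustration only, not project data)
churned = [50.0, 120.0, 300.0]
stayed = [200.0, 300.0, 1500.0, 4000.0]

effect = common_language_effect_size(churned, stayed)
print(effect)
```

A value near 0.5 suggests little difference between the groups, while values near 0 or 1 suggest one group tends to have clearly lower or higher charges. In the notebook this could be called with `churn_yes` and `churn_no`; the pairwise loop is simple rather than fast, which is acceptable for a few thousand rows.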