<a href="https://colab.research.google.com/github/Omprakash977/EDA/blob/main/Telecom_Churn_Analysis_Capstone_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Orange S.A., formerly France Télécom S.A., is a French multinational telecommunications corporation. The Orange Telecom Churn Dataset consists of cleaned customer activity data (features), along with a churn label specifying whether a customer canceled the subscription. </b>

## <b> Explore and analyze the data to discover key factors responsible for customer churn and come up with ways/recommendations to ensure customer retention. </b>

Exploratory Data Analysis (EDA) is the preliminary analysis of a dataset, using statistics and visualization tools, to discover relationships between measures and to gain insight into the trends and patterns among the entities it contains.

In [None]:
# Importing required libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Mounting drive
from google.colab import drive
drive.mount('/content/drive')

# Reading and understanding the data

In [None]:
# Data reading
df = pd.read_csv('/content/drive/MyDrive/Datasets/Telecom Churn.csv')

In [None]:
# Exploring the first five rows of the dataset
df.head()

In [None]:
# Exploring the last five rows of the dataset
df.tail()

In [None]:
# Inspecting the number of rows and columns of the dataset
df.shape

The dataset contains 3333 rows and 20 columns.

In [None]:
# Checking null values and data types of each column
df.info()

The columns have different data types: object, integer, float and boolean.

There are no null values in any column, so no null-value treatment is needed.
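
The absence of nulls can also be confirmed numerically instead of being read off `df.info()`. The following is a minimal sketch on a small hypothetical DataFrame (`toy`); the column names mirror the dataset, but the values, including the deliberately missing one, are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; values are made up, with one NaN planted on purpose
toy = pd.DataFrame({
    "State": ["WV", "MN", "NY", "OH"],
    "Account length": [128, 107, np.nan, 84],
    "Churn": [False, False, True, False],
})

# Number of missing values per column
null_counts = toy.isnull().sum()
print(null_counts)

# Columns that would need null-value treatment
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```

On the real data, an empty list from the last step would match the observation above.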

In [None]:
# Finding statistical measures of numerical columns
df.describe()

In [None]:
# Checking number of unique values of columns
df.nunique()

In [None]:
# Detecting duplicate rows
df.duplicated().sum()

The dataset doesn't contain any duplicate rows.

# Univariate Analysis


Univariate analysis explores each variable in the dataset separately. The goal of this step is to describe and summarize each variable and to identify the patterns present in it.

In [None]:
# Inspecting State column using bar plot
fig_dims = (20,6)
fig, ax = plt.subplots(figsize=fig_dims)  # subplots returns a (figure, axes) pair
state_counts = df['State'].value_counts()
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=state_counts.index, y=state_counts.values, ax=ax)
ax.set_xlabel('State',color = 'g', size = 20)
ax.set_ylabel('Number of users',color = 'g', size = 20)
ax.set_title("Number of users in each state",size = 15, color = 'royalblue')
plt.show()

The five states with the most customers are WV, MN, NY, AL and OH.
The states with the fewest customers are PA, LA and CA.

In [None]:
# Inspecting Account length column using histplot
sns.set(rc={'figure.figsize':(10,6)})
sns.histplot(df['Account length'])
plt.xlabel("Account length",size = 15)
plt.ylabel("Number of users",size = 15)
plt.title("Histogram plot of account length",size = 15, color = 'royalblue')
plt.show()

Most customers have held their account for 100 to 150 days.

No customer has held an account for more than a year.
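
Both statements can be checked numerically rather than by eye. This is a sketch on invented values (`toy` is a hypothetical stand-in for `df`), assuming account length is measured in days as described above.

```python
import pandas as pd

# Hypothetical account lengths; not taken from the real dataset
toy = pd.DataFrame({"Account length": [36, 101, 120, 149, 75, 243, 110]})

# Boolean mask of accounts between 100 and 150 (inclusive on both ends)
in_range = toy["Account length"].between(100, 150)
share_in_range = in_range.mean()  # fraction of customers in that range

# Longest account length, to compare against 365 days
max_length = toy["Account length"].max()

print(f"{share_in_range:.0%} of accounts fall in 100-150; longest is {max_length}")
```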

In [None]:
# Inspecting international plan column and Voice mail plan column using bar plot

fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# bar plot of international plan column
sns.barplot(ax=axes[0], x = df['International plan'].value_counts().keys(), y = df['International plan'].value_counts())
axes[0].set_xlabel('International Plan',color = 'g', size = 15)
axes[0].set_ylabel('Number of users',color = 'g', size = 15)

# bar plot of Voice mail plan column
sns.barplot(ax=axes[1], x = df['Voice mail plan'].value_counts().keys(),y = df['Voice mail plan'].value_counts())
axes[1].set_xlabel('Voice mail plan',color = 'g', size = 15)
axes[1].set_ylabel('Number of users',color = 'g', size = 15)

plt.show()


The plots above show that very few customers have opted for the international plan, and that only a minority have opted for the voice mail plan.


In [None]:
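# Comparison of total time spent on calls with respect to different times of a day using histplot with a KDE overlay

sns.set(rc={'figure.figsize':(20,6)})
# sns.distplot is deprecated; histplot with kde=True and stat='density' is its replacement
sns.histplot(df['Total day minutes'], bins = 15, kde = True, stat = 'density', label = 'day')
sns.histplot(df['Total eve minutes'], bins = 15, kde = True, stat = 'density', color = 'g', label = 'evening')
sns.histplot(df['Total night minutes'], bins = 15, kde = True, stat = 'density', color = 'r', label = 'night')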
plt.xlabel("Total time spent on calls(in minutes)",size = 15)
plt.ylabel("Density",size = 15)
plt.title("Total time spent on calls with respect to different times of a day",size = 15, color = 'royalblue')
plt.legend(fontsize=15)
plt.show()

The total time spent on calls in the evening and at night is slightly higher than during the day.

In [None]:
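# Comparison of number of calls with respect to different times of a day using histplot with a KDE overlay
sns.set(rc={'figure.figsize':(20,8)})
# sns.distplot is deprecated; histplot with kde=True and stat='density' is its replacement
sns.histplot(df['Total day calls'], bins = 10, kde = True, stat = 'density', label = 'day')
sns.histplot(df['Total eve calls'], bins = 10, kde = True, stat = 'density', color = 'g', label = 'evening')
sns.histplot(df['Total night calls'], bins = 10, kde = True, stat = 'density', color = 'r', label = 'night')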
plt.xlabel("Total number of calls",size = 15)
plt.ylabel("Density",size = 15)
plt.title("Number of calls with respect to different times of a day",size = 15, color = 'royalblue')
plt.legend(fontsize=15)
plt.show()


The total number of calls is nearly the same for all three times of the day.

In [None]:
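# Comparison of total charges on calls with respect to different times of a day using histplot with a KDE overlay

sns.set(rc={'figure.figsize':(20,8)})
# sns.distplot is deprecated; histplot with kde=True and stat='density' is its replacement
sns.histplot(df['Total day charge'], bins = 10, kde = True, stat = 'density', label = 'day')
sns.histplot(df['Total eve charge'], bins = 10, kde = True, stat = 'density', color = 'g', label = 'evening')
sns.histplot(df['Total night charge'], bins = 10, kde = True, stat = 'density', color = 'r', label = 'night')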
plt.xlabel("Total charges on calls",size = 15)
plt.ylabel("Density",size = 15)
plt.title("Total charges on calls with respect to different times of a day",size = 15, color = 'royalblue')
plt.legend(fontsize=15)
plt.show()

Charges at night are lower than at other times of the day.

In [None]:
#defining Seaborn color palette to use
sns.set(rc={'figure.figsize':(40,8)})
colors = sns.color_palette('pastel')[0:10]  # one color per slice (10 distinct call counts)

#creating pie chart for customer service column
plt.pie(df['Customer service calls'].value_counts(), labels = df['Customer service calls'].value_counts().keys(),
        explode = (0.01,0.01,0.01,0.01,0.1,0.3,0.3,0.3,0.3,0.3), colors = colors, autopct='%.0f%%')
plt.title('Percentage of customers who have called customer service')
plt.show()

Very few customers (around 8%) have called customer service more than 3 times.
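
The share quoted above can be computed directly instead of being summed from the pie slices. This is a sketch on hypothetical values; `toy` is an invented stand-in for `df`.

```python
import pandas as pd

# Hypothetical call counts, for illustration only
toy = pd.DataFrame({"Customer service calls": [0, 1, 1, 2, 3, 4, 1, 2, 0, 5]})

# The mean of a boolean mask is the fraction of rows where it is True
share_over_3 = (toy["Customer service calls"] > 3).mean()
print(f"{share_over_3:.0%} of customers called more than 3 times")
```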

In [None]:
#defining Seaborn color palette to use
colors = sns.color_palette('pastel')[2:4]

#creating pie chart for Churn column
plt.pie(df['Churn'].value_counts(), labels = df['Churn'].value_counts().keys(),colors = colors, explode = (0.2,0), shadow = True, autopct='%.0f%%')
plt.title('Percentage of customers churned')
plt.show()

About 14% of customers have churned.
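
The churn percentage shown in the pie chart can also be obtained as a number. This is a minimal sketch, assuming `Churn` is a boolean column as reported by `df.info()`; the `toy` data is hypothetical.

```python
import pandas as pd

# Hypothetical churn labels: 6 retained, 2 churned
toy = pd.DataFrame({"Churn": [False] * 6 + [True] * 2})

# Proportion of each class
churn_share = toy["Churn"].value_counts(normalize=True)
print(churn_share)

# For a boolean column, the mean is the churn rate
churn_rate = toy["Churn"].mean()
print(f"Churn rate: {churn_rate:.0%}")
```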

# Influence of numerical columns on churn

In [None]:
# KDE plot

sns.set(rc={'figure.figsize':(10,4)})
col = list(df.columns)
col1 = [i for i in col if i not in ['State','Area code','International plan','Voice mail plan',
 'Number vmail messages','Customer service calls','Churn']]
for i in col1:
  fig, ax = plt.subplots()

  # fill=True replaces the deprecated shade=True
  sns.kdeplot(df[df["Churn"]==False][i], fill=True, color="blue", label= "not churned", ax=ax)
  sns.kdeplot(df[df["Churn"]==True][i], fill=True, color="green", label= "churned", ax=ax)

  ax.set_xlabel(i)
  ax.set_ylabel("Density")
  plt.legend()

  fig.suptitle(f"{i} vs churn")

From the above plots it is clear that:
* Customers with an account length of about 100-130 are the most likely to churn.
* Churn is high when total day charges and total night charges are high, but this is not the case for total evening charges.
* Churn is high when the total number of international calls is low.
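
The differences read off the KDE plots can be cross-checked by comparing group means for churned and retained customers. The sketch below uses invented numbers; `toy`, the `status` labels and the values are all hypothetical.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration
toy = pd.DataFrame({
    "Total day charge": [30.0, 45.0, 28.0, 50.0, 32.0, 48.0],
    "Total intl calls": [5, 3, 6, 2, 4, 3],
    "Churn": [False, True, False, True, False, True],
})

# Readable group labels instead of True/False
status = toy["Churn"].map({True: "churned", False: "retained"})

# Mean of every numeric feature within each group
group_means = toy.drop(columns="Churn").groupby(status).mean()
print(group_means)
```

A clearly higher mean day charge and a lower mean number of international calls in the churned group would agree with the bullets above.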

# Influence of categorical columns on churn

In [None]:
# Creating dataframe having data of churned customers
df_churn = df[df['Churn']==True]

In [None]:
# Plotting the 10 states with the most churned customers

top_churn_states = df_churn['State'].value_counts()[:10]
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.barplot(x=top_churn_states.index, y=top_churn_states.values)
plt.xlabel("State",size = 15)
plt.ylabel("Number of users",size = 15)
plt.title("Top 10 states by number of churned customers",size = 15, color = 'royalblue')
plt.show()

Because churn is highest in the states shown in the plot above, the telecom company needs to focus more on these states.
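
The plot ranks states by the raw number of churned customers, which also depends on how many customers each state has. As a complementary check, the churn rate per state can be computed. This is a sketch on made-up data (`toy` and its values are hypothetical), not a result from the real dataset.

```python
import pandas as pd

# Hypothetical data: TX has more churned customers, NJ has the higher churn rate
toy = pd.DataFrame({
    "State": ["NJ", "NJ", "TX", "TX", "TX", "TX", "TX", "TX"],
    "Churn": [True, False, True, True, False, False, False, False],
})

# Per state: churned customers, total customers, and churn rate
by_state = toy.groupby("State")["Churn"].agg(
    churned="sum", customers="count", churn_rate="mean"
)
by_state = by_state.sort_values("churn_rate", ascending=False)
print(by_state)
```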

In [None]:
# Count plot of International plan vs churn

fig, ax = plt.subplots()
# countplot is the axes-level equivalent of catplot(kind="count") and accepts ax
sns.countplot(x="Churn", hue="International plan", data=df, palette={'No':"yellow", 'Yes':"red"}, ax=ax)
ax.set_xlabel("Churn",size = 15)
ax.set_ylabel("Number of users",size = 15)
ax.set_title("International plan vs churn",size = 15, color = 'royalblue')
plt.show()

* Most of the customers who churned had not opted for the international plan.

* Among the customers who have not churned, very few opted for the international plan.
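
Because few customers hold the international plan at all, the counts can be supplemented with the share of churners within each plan group. The following sketch uses `pd.crosstab` on hypothetical data; `toy`, `status` and the values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: 8 customers without the plan, 2 with it
toy = pd.DataFrame({
    "International plan": ["No"] * 8 + ["Yes"] * 2,
    "Churn": [True, True] + [False] * 6 + [True, False],
})
status = toy["Churn"].map({True: "churned", False: "retained"})

# Raw counts, as shown in the count plot
counts = pd.crosstab(toy["International plan"], status)
print(counts)

# Row-normalized shares: churn rate within each plan group
shares = pd.crosstab(toy["International plan"], status, normalize="index")
print(shares)
```

In this made-up sample most churners have no plan, yet the churn rate is higher among plan holders, so the two views can point in different directions.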

In [None]:
# Count plot of Voice mail plan vs churn

fig, ax = plt.subplots()
# countplot is the axes-level equivalent of catplot(kind="count") and accepts ax
sns.countplot(x="Churn", hue="Voice mail plan", data=df, palette={'No':"green", 'Yes':"blue"}, ax=ax)
ax.set_xlabel("Churn",size = 15)
ax.set_ylabel("Number of users",size = 15)
ax.set_title("Voice mail plan vs churn",size = 15, color = 'royalblue')
plt.show()

Most of the customers who churned had not opted for the voice mail plan.

In [None]:
# Count plot of Customer service calls vs churn

fig, ax = plt.subplots()
# countplot is the axes-level equivalent of catplot(kind="count") and accepts ax
sns.countplot(x="Churn", hue="Customer service calls", data=df, ax=ax)
ax.set_xlabel("Churn",size = 15)
ax.set_ylabel("Number of users",size = 15)
ax.set_title("Customer service calls vs churn",size = 15, color = 'royalblue')
plt.show()

Most of the customers who churned had called customer service once.

# Correlation matrix

In [None]:
sns.set(rc={'figure.figsize':(20,6)})
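# numeric_only=True skips text columns such as State, which would raise an error in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), cmap="YlGnBu", annot=True)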
plt.show()

Here total charge and total minutes are highly correlated for day, evening, night and international calls. For further analysis we can therefore drop one variable from each of these pairs.
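
Highly correlated pairs can also be listed programmatically instead of being read off the heatmap. In this sketch the `toy` data, the 0.17 charge rate and the 0.95 threshold are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

minutes = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
toy = pd.DataFrame({
    "Total day minutes": minutes,
    "Total day charge": minutes * 0.17,  # assumed linear rate, illustrative only
    "Total day calls": [90, 110, 85, 120, 100],
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the (assumed) threshold are candidates for dropping one member
high_pairs = pairs[pairs > 0.95]
print(high_pairs)
```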

In [None]:
# Dropping the minutes columns, which duplicate the information in the charge columns
df1 = df.drop(columns=['Total day minutes','Total eve minutes','Total night minutes','Total intl minutes'])

In [None]:
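# numeric_only=True skips text columns such as State, which would raise an error in pandas 2.x
sns.heatmap(df1.corr(numeric_only=True), cmap="YlGnBu", annot=True)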
plt.show()

From this matrix it can be concluded that Total day charge and Customer service calls are more strongly correlated with churn than the other variables.
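
To rank features by their correlation with churn without scanning the heatmap, the churn column can be correlated against every other column and the result sorted. This is a sketch on invented values; `toy` is a hypothetical stand-in for `df1`.

```python
import pandas as pd

# Hypothetical sample; values are invented for illustration
toy = pd.DataFrame({
    "Total day charge": [30.0, 45.0, 28.0, 50.0, 32.0, 48.0],
    "Total night calls": [100, 98, 102, 101, 99, 100],
    "Churn": [False, True, False, True, False, True],
})

# Pearson correlation of each feature with churn (cast to 0/1), strongest first
corr_with_churn = (
    toy.drop(columns="Churn")
       .corrwith(toy["Churn"].astype(int))
       .sort_values(ascending=False)
)
print(corr_with_churn)
```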

# Outlier detection

In [None]:
sns.set(rc={'figure.figsize':(8,4)})
plt.boxplot((df['Account length'], df['Number vmail messages'], df['Customer service calls']),
            labels = ['Account length','Number vmail messages','Customer service calls'])
plt.show()

In [None]:
plt.boxplot((df['Total day minutes'], df['Total eve minutes'], df['Total night minutes'], df['Total intl minutes']),
            labels = ['Total day minutes','Total eve minutes','Total night minutes','Total intl minutes'])
plt.show()

In [None]:
plt.boxplot((df['Total day calls'], df['Total eve calls'], df['Total night calls'], df['Total intl calls']),
            labels = ['Total day calls', 'Total eve calls', 'Total night calls','Total intl calls'])
plt.show()

In [None]:
plt.boxplot((df['Total day charge'], df['Total eve charge'], df['Total night charge'], df['Total intl charge']),
            labels = ['Total day charge','Total eve charge','Total night charge','Total intl charge'])
plt.show()

Most variables have some outliers on both sides of their distribution. For further analysis, we can remove these values or replace them with a suitable statistical measure to get better results.
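
One common way to flag and cap such outliers is the 1.5 × IQR rule, which is also the default whisker length in matplotlib box plots. The sketch below applies it to a hypothetical series. Capping is only one of several possible treatments, and it is an assumption here rather than the treatment used in this notebook.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 11, 40], name="Total intl calls")

# Interquartile range and the 1.5 * IQR fences
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers)

# Capping (winsorizing) replaces them with the nearest fence value
capped = values.clip(lower=lower, upper=upper)
print(capped)
```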

The box plot also shows that day charges are higher than charges at other times of the day, consistent with the distribution plot shown earlier.

# Conclusion from EDA

* 14% of customers are churning.

* Churn is higher in some states, such as New Jersey and Texas.
* The company needs to attract more customers in states such as California and Louisiana.

* Most customers have held their account for 3-5 months. No customer has an account length of more than a year.

* Charges at night are lower than at other times of the day.

* Very few customers (around 8%) have called customer service more than 3 times.

* Churn is high when the total number of international calls is low. However, very few customers have opted for the international plan.

* Since higher day charges go with more churn, the company needs to reduce day charges in order to reduce customer churn.

* Most of the customers who churned had called customer service once.

* Total day charge and customer service calls are each correlated with churn, with a correlation coefficient of 0.21.

* Most variables have some outliers on both sides of their distribution.