# **Project Name**    -  **Telecom Churn Analysis**



##### **Project Type**    - EDA
##### **Contribution**    - Individual


# **Project Summary -**



Customer churn is a major problem in any industry and one of the most pressing concerns for telecom companies. Its effect on revenue is large, which is why these companies want to predict which customers are likely to churn. In this highly competitive market, the telecommunications industry experiences an average annual churn rate of 15-25%, and acquiring a new customer costs 5-10 times more than retaining an existing one. Customer retention has therefore become even more important than customer acquisition.

Finding the factors that increase customer churn is therefore the first step toward reducing it. The main goal of this project is to understand the causes of customer churn, so that telecom operators can identify the customers most likely to churn and decide what to do to retain their most valuable customers.

# **GitHub Link -**

Provide your GitHub Link here.

#### **Define Your Business Objective?**

**Maximize:** The company's profit by retaining customers

**Minimize:** Customer churn by identifying its key causes

Find the factors and causes that influence customers to churn.

Retain at-risk customers by taking appropriate steps.

Provide offers based on the factors that drive churn.

# ***Let's Begin !***

# **Loading data and Importing modules**

In [None]:
# Import libraries
import pandas as pd
import numpy as np
#import visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


In [None]:
# Mount the drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Import data files
df=pd.read_csv('/content/Telecom Churn.csv')

## ***1. Know Your Data***

### Dataset First View

In [None]:
# Dataset First Look
df.head()

In [None]:
df.tail()

In [None]:
df.shape

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print(f'Shape of Telecom Churn Data is {df.shape}')

**Breakdown of Our Features:**

**State**: 51 unique state names

**Account length**: Length of the account

**Area code**: Area code, shared by several states

**International plan**: Yes indicates the customer has an international plan; No indicates no subscription to an international plan

**Voice mail plan**: Yes indicates the customer has a voice mail plan; No indicates no subscription to a voice mail plan

**Number vmail messages**: Number of voice mail messages, ranging from 0 to 50

**Total day minutes**: Total number of minutes spent on calls during the day

**Total day calls**: Total number of calls made during the day

**Total day charge**: Total charge to the customer for daytime calls

**Total eve minutes**: Total number of minutes spent on calls in the evening

**Total eve calls**: Total number of calls made in the evening

**Total eve charge**: Total charge to the customer for evening calls

**Total night minutes**: Total number of minutes spent on calls at night

**Total night calls**: Total number of calls made at night

**Total night charge**: Total charge to the customer for night calls

**Customer service calls**: Number of customer service calls made by the customer

**Churn**: Customer churn; True means a churned customer, False means a retained customer

### Dataset Information

In [None]:
# Dataset Info
df.info()

In [None]:
df.nunique()

In [None]:
df.describe(include = 'all')

In [None]:
#Printing the count of true and false in 'churn' feature
print(df.Churn.value_counts())

## ***2. Understanding Your Variables***

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
df.duplicated().sum()

As we can see, there are no duplicate rows in the dataset.

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
df.isnull().sum()

In [None]:
# Visualizing the missing values
missing = pd.DataFrame((df.isnull().sum())*100/df.shape[0]).reset_index()
plt.figure(figsize=(16,5))
ax = sns.pointplot(x='index',y=0,data=missing)
plt.xticks(rotation =90,fontsize =7)
plt.title("Percentage of Missing values")
plt.ylabel("PERCENTAGE")
plt.show()


In [None]:
# Checking Duplicate values
len(df[df.duplicated()])

### What did you know about your dataset?

*There are 3333 rows and 20 columns in the dataset.*

*Of these, 1 column is of boolean data type, i.e. Churn,*

*8 are of float data type,*

*8 are of integer data type,*

*3 are of object data type, i.e. categorical values.*

*There are no missing values, so no missing value imputation is needed.*

*There are also no duplicate rows.*
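
The type counts, missing values, and duplicate rows summarized above can also be tallied in code rather than read off `df.info()`. The following is a minimal sketch on a small hypothetical frame (`toy`, with made-up rows) standing in for the real data; the same three calls would apply to `df`.

```python
import pandas as pd

# Hypothetical stand-in for the churn data; values are made up for illustration
toy = pd.DataFrame({
    "State": ["KS", "OH", "NJ"],
    "Account length": [128, 107, 137],
    "Total day minutes": [265.1, 161.6, 243.4],
    "Churn": [False, False, True],
})

# Number of columns per data type
dtype_counts = toy.dtypes.astype(str).value_counts()
print(dtype_counts)

# Total missing values and fully duplicated rows
n_missing = int(toy.isnull().sum().sum())
n_duplicates = int(toy.duplicated().sum())
print(f"missing values: {n_missing}, duplicate rows: {n_duplicates}")
```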

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
# Inspect the unique values of the target column
df['Churn'].unique()

In [None]:
# Printing the count of true and false in 'churn' feature
print(df.Churn.value_counts())

In [None]:
# Pie chart to analyze churn
df['Churn'].value_counts().plot.pie(explode=[0.05,0.5], autopct='%1.1f%%',  startangle=90,shadow=True, figsize=(8,8))
plt.title('Pie Chart for Churn')
plt.show()

In [None]:
#Donut plot to analyze churn
data = df['Churn'].value_counts()
explode = (0,0.2)
plt.pie(data, explode = explode,autopct='%1.1f%%',shadow=True,radius = 2.0, labels = ['Not churned customer','Churned customer'],colors=['royalblue' ,'lime'])
circle = plt.Circle((0,0),1, color='white')
p=plt.gcf()
p.gca().add_artist(circle)
plt.title('Donut Plot for Churn')
plt.show()

In [None]:
# let's see churn by using countplot
sns.countplot(x=df.Churn)

After analyzing the Churn column, there is little to say beyond the fact that almost 15% of customers have churned. Let's see what the other features tell us and how they relate to churn.

**Analyzing State Column**

In [None]:
# printing the unique value of state column
df['State'].nunique()

In [None]:
# comparison churn with state by using countplot
sns.set(style='darkgrid')
plt.figure(figsize=(15,8))
ax = sns.countplot(x='State', hue='Churn', data=df)
plt.show()

In [None]:
# Group first so the state labels and churn rates share the same (alphabetical) order
s2=df.groupby(['State'])['Churn'].mean()
s1=s2.index
plt.rcParams['figure.figsize']=(18,7)
plt.plot(s1,s2,color='r', marker='o', linewidth=2, markersize=12)
plt.title('States churn rate', fontsize=20)
plt.xlabel('state', fontsize=15)
plt.ylabel('churn rate', fontsize = 15)
plt.show()

In [None]:
plt.rcParams['figure.figsize']=(12,7)
color = plt.cm.copper(np.linspace(0,0.5,20))
((df.groupby(['State'])['Churn'].mean())*100).sort_values(ascending = False).head(6).plot.bar(color = ['violet','indigo','b','g','y','orange','r'])
plt.title('State with most churn percentage', fontsize =20)
plt.xlabel('state', fontsize = 15)
plt.ylabel('percentage', fontsize = 15)
plt.show()

In [None]:
# calculate State VS Churn percentage
State_data = pd.crosstab(df['State'],df['Churn'])
State_data['Percentage_Churn'] = State_data.apply(lambda x : x[1]*100/(x[0]+x[1]),axis = 1)
print(State_data)

In [None]:
#show the top 10 states by churn rate, sorted in descending order
df.groupby(['State'])['Churn'].mean().sort_values(ascending = False).head(10)

There are 51 unique states, each with a different churn rate.

From the above analysis, CA, NJ, TX, MD, SC, and MI have the highest churn rates, at more than 21%.

The higher churn rate in particular states may be due to low coverage of the cellular network.
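
The per-category churn percentage is computed several times in this notebook, so it can help to wrap it in one helper that also reports the group size. This is a sketch under assumptions: `churn_rate_by` is a hypothetical helper name, and `toy` is made-up data, not the real dataset.

```python
import pandas as pd

def churn_rate_by(data: pd.DataFrame, column: str, target: str = "Churn") -> pd.DataFrame:
    """Return customer count, churned count and churn percentage per category."""
    summary = data.groupby(column)[target].agg(customers="count", churned="sum")
    summary["churn_pct"] = 100 * summary["churned"] / summary["customers"]
    return summary.sort_values("churn_pct", ascending=False)

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "State": ["CA", "CA", "NJ", "NJ", "NJ", "NJ", "TX", "TX"],
    "Churn": [True, False, True, False, False, False, False, False],
})

state_summary = churn_rate_by(toy, "State")
print(state_summary)
```

Keeping the `customers` column next to the percentage is a deliberate choice: with 3333 rows spread over 51 states, each state has only about 65 customers on average, so differences between states should be read with caution.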

**Analyzing "Area Code" column**

In [None]:
#calculate Area code vs Churn percentage
Area_code_data = pd.crosstab(df["Area code"],df["Churn"])
Area_code_data['Percentage_Churn'] = Area_code_data.apply(lambda x : x[1]*100/(x[0]+x[1]),axis = 1)
print(Area_code_data)

In [None]:
sns.set(style="darkgrid")
ax = sns.countplot(x='Area code', hue="Churn", data=df)
plt.show()

In the data above, there are only 3 unique values, i.e. 408, 415, and 510, and the churn rates of these area codes are almost the same.

We don't think there is any relation between "Area code" and "Churn" that would explain why customers leave the operator.

**Analyzing "Account Length" column**

In [None]:
#Separating churn and non churn customers
churn_df     = df[df["Churn"] == True]
not_churn_df = df[df["Churn"] == False]

In [None]:
#Distribution of account length (histplot replaces the deprecated distplot)
sns.histplot(df['Account length'], kde=True)

In [None]:
#comparison of churned account length and not churned account length
#histplot/kdeplot replace the deprecated distplot; stat='density' keeps all three curves on the same scale
sns.histplot(df['Account length'],color = 'yellow',label="All",stat='density',kde=True)
sns.kdeplot(churn_df['Account length'],color = "red",label="Churned")
sns.kdeplot(not_churn_df['Account length'],color = 'green',label="Not churned")
plt.legend()

After analyzing the "Account length" column, we didn't find any useful relation to churn, so we can't build any connection to churn as of now. Let's see what the other features say about churn.

**Analyzing "International Plan" column**

In [None]:
#Show count value of 'yes','no'
df['International plan'].value_counts()

In [None]:
#Show the unique data of "International plan"
df["International plan"].unique()

In [None]:
#Calculate the International Plan vs Churn percentage
International_plan_data = pd.crosstab(df["International plan"],df["Churn"])
International_plan_data['Percentage Churn'] = International_plan_data.apply(lambda x : x[1]*100/(x[0]+x[1]),axis = 1)
print(International_plan_data)

In [None]:
#To get the Donut Plot to analyze International Plan
data = df['International plan'].value_counts()
explode = (0, 0.2)
plt.pie(data, explode = explode,autopct='%1.1f%%',shadow=True,radius = 2.0, labels = ['No','Yes'],colors=['skyblue' ,'orange'])
circle = plt.Circle( (0,0), 1, color='white')
p=plt.gcf()
p.gca().add_artist(circle)
plt.title('Donut Plot for International plan')
plt.show()

In [None]:
#Analysing by using countplot
sns.countplot(x='International plan',hue="Churn",data = df)

There are 3010 customers who don't have an international plan.

There are 323 customers who have an international plan.

Among those who have an international plan, 42.4% churn.

Among those who don't have an international plan, only 11.4% churn.

So customers who bought an international plan are churning at a much higher rate.

This is probably because of connectivity issues or high call charges.
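
The figures quoted above can be combined into two quick checks: how much more likely plan holders are to churn, and what overall churn rate the two groups imply. This is a small arithmetic sketch using only the numbers stated in the text; the variable names are illustrative.

```python
# Figures quoted in the analysis above
with_plan, without_plan = 323, 3010          # customers in each group
churn_with, churn_without = 0.424, 0.114     # churn rate in each group

# How many times more likely plan holders are to churn
relative_risk = churn_with / churn_without

# Overall churn rate implied by the two groups (weighted average)
total = with_plan + without_plan
overall_churn = (with_plan * churn_with + without_plan * churn_without) / total

print(f"relative risk: {relative_risk:.2f}")
print(f"implied overall churn rate: {overall_churn:.1%}")
```

Plan holders churn roughly 3.7 times as often as other customers, and the implied overall rate of about 14.4% agrees with the "almost 15%" seen in the Churn column.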

**Analyzing "Voice Mail Plan" column**

In [None]:
#show the unique value of the "Voice mail plan" column
df["Voice mail plan"].unique()

In [None]:
#Calculate the Voice Mail Plan vs Churn percentage
Voice_mail_plan_data = pd.crosstab(df["Voice mail plan"],df["Churn"])
Voice_mail_plan_data['Percentage Churn'] = Voice_mail_plan_data.apply(lambda x : x[1]*100/(x[0]+x[1]),axis = 1)
print(Voice_mail_plan_data)

In [None]:
#To get the Donut Plot to analyze Voice mail plan
data = df['Voice mail plan'].value_counts()
explode = (0, 0.2)
plt.pie(data, explode = explode,autopct='%1.1f%%',startangle=90,shadow=True,radius = 2.0, labels = ['NO','YES'],colors=['skyblue','red'])
circle = plt.Circle( (0,0), 1, color='white')
p=plt.gcf()
p.gca().add_artist(circle)
plt.title('Donut Plot for Voice mail plan')
plt.show()

In [None]:
#Analysing by using countplot
sns.countplot(x='Voice mail plan',hue="Churn",data = df)

As we can see, there is no clear relation between the voice mail plan and churn, so we can't say anything definite. Let's move to the next voice mail feature, i.e. the number of voice mail messages, and see what it tells us.

**Analyzing "Number vmail messages" column**

In [None]:
#show the data of 'Number vmail messages'
df['Number vmail messages'].unique()

In [None]:
#Printing the data of 'Number vmail messages'
df['Number vmail messages'].value_counts()

In [None]:
#Show the details of 'Number vmail messages' data
df['Number vmail messages'].describe()

In [None]:
#Analysing the distribution with histplot (replaces the deprecated distplot)
sns.histplot(df['Number vmail messages'], kde=True)

In [None]:
#Analysing by using boxplot diagram between 'number vmail messages' and 'churn'
fig, ax = plt.subplots(figsize =(10, 8))
# Pass ax so the boxplot is drawn on the figure that receives the title
df.boxplot(column='Number vmail messages', by='Churn', ax=ax)
fig.suptitle('Number vmail message', fontsize=14, fontweight='bold')
plt.show()

From the voice mail data above, we get the insight that customers with more than 20 voice-mail messages tend to churn.

To address this, the voice mail quality may need to improve.

**Analyzing "Customer service calls" column**

In [None]:
#Printing the data of customer service calls
df['Customer service calls'].value_counts()

In [None]:
#Calculating the Customer service calls vs Churn percentage
Customer_service_calls_data = pd.crosstab(df['Customer service calls'],df["Churn"])
Customer_service_calls_data['Percentage_Churn'] = Customer_service_calls_data.apply(lambda x : x[1]*100/(x[0]+x[1]),axis = 1)
print(Customer_service_calls_data)

In [None]:
#Analysing using countplot
sns.countplot(x='Customer service calls',hue="Churn",data = df)

The analysis above suggests that poor customer service is a major reason people leave the operator.

Customers who called the service center 5 times or more have a churn percentage higher than 60%.

Customers who have called once also have a high churn rate, indicating their issue was not solved in the first attempt.

So the operator should work to improve its customer service.
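
One way to make the pattern in customer service calls easier to read is to group the call counts into a few buckets before computing the churn rate. The sketch below assumes hypothetical bucket edges (0, 1-3, 4+) and uses made-up rows in `toy`; neither comes from the original analysis.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "Customer service calls": [0, 1, 1, 2, 3, 4, 5, 6, 7, 2],
    "Churn": [False, False, True, False, False, True, True, True, False, False],
})

# Bucket edges are an assumption; intervals are (-1, 0], (0, 3], (3, inf)
bins = [-1, 0, 3, float("inf")]
labels = ["0", "1-3", "4+"]
toy["call_bucket"] = pd.cut(toy["Customer service calls"], bins=bins, labels=labels)

# Churn percentage within each bucket
bucket_rate = toy.groupby("call_bucket", observed=True)["Churn"].mean() * 100
print(bucket_rate)
```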

**Analyzing all calls minutes,all calls, all calls charge together**



Because these columns are numerical and 'Churn' is categorical, we compare them using the mean, the median, and box plots.
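
The mean and the median for several usage columns can be produced in a single table instead of one cell per column. A minimal sketch, assuming a made-up frame `toy` with two of the real column names:

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "Total day minutes": [100.0, 200.0, 300.0, 250.0, 350.0],
    "Total day charge": [17.0, 34.0, 51.0, 42.5, 59.5],
    "Churn": [False, False, False, True, True],
})

usage_cols = ["Total day minutes", "Total day charge"]

# One row per churn group, one (column, statistic) pair per output column
summary = toy.groupby("Churn")[usage_cols].agg(["mean", "median"])
print(summary)
```

Comparing the mean with the median in the same row also gives a quick hint about skew or outliers in each group.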

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total day calls'].mean())

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total day minutes'].mean())

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total day charge'].mean())

In [None]:
#show the relation using scatter plot
sns.scatterplot(x="Total day minutes", y="Total day charge", hue="Churn", data=df,palette='hls')

In [None]:
#show the relation using box plot
sns.boxplot(x='Total day minutes', y='Total day charge', hue='Churn', data= df, palette='hls')

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(['Churn'])['Total eve calls'].mean())

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total eve minutes'].mean())

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(['Churn'])['Total eve charge'].mean())

In [None]:
#show the relation using scatter plot
sns.scatterplot(x="Total eve minutes", y="Total eve charge", hue="Churn", data=df,palette='hls')

In [None]:
#show the relation using box plot plot
sns.boxplot(x="Total eve minutes", y="Total eve charge", hue="Churn", data=df,palette='hls')

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total night calls'].mean())

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total night charge'].mean())

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total night minutes'].mean())

In [None]:
#show the relation using scatter plot
sns.scatterplot(x="Total night minutes", y="Total night charge", hue="Churn", data=df,palette='hls')

In [None]:
#show the relation using box plot
sns.boxplot(x="Total night minutes", y="Total night charge", hue="Churn", data=df,palette='hls')

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total intl calls'].mean())

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total intl minutes'].mean())

In [None]:
#Print the mean value of churned and not churned customer
print(df.groupby(["Churn"])['Total intl charge'].mean())

In [None]:
#show the relation using scatter plot
sns.scatterplot(x="Total intl minutes", y="Total intl charge", hue="Churn", data=df,palette='hls')

In [None]:
#show the relation using box plot
sns.boxplot(x="Total intl minutes", y="Total intl charge", hue="Churn", data=df,palette='hls')

In [None]:
#Deriving a relation between overall call charge and overall call minutes
day_charge_perm = df['Total day charge'].mean()/df['Total day minutes'].mean()
eve_charge_perm = df['Total eve charge'].mean()/df['Total eve minutes'].mean()
night_charge_perm = df['Total night charge'].mean()/df['Total night minutes'].mean()
int_charge_perm= df['Total intl charge'].mean()/df['Total intl minutes'].mean()

In [None]:
print([day_charge_perm,eve_charge_perm,night_charge_perm,int_charge_perm])

In [None]:
sns.barplot(x=['Day','Evening','Night','International'],y=[day_charge_perm,eve_charge_perm,night_charge_perm,int_charge_perm])

After analyzing the above data, we noticed that total day/eve/night minutes, calls, and charges do not appear to cause churn. International call charges are high compared to the others, which is expected, but that may be a reason for international plan customers to churn.
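
The per-minute comparison behind this observation can be written as a small reusable function. This is a sketch under assumptions: `rate_per_minute` is a hypothetical helper, the rows in `toy` are made up, and it relies on the `Total <period> charge` / `Total <period> minutes` naming used in the dataset.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "Total day minutes": [100.0, 200.0],
    "Total day charge": [17.0, 34.0],
    "Total intl minutes": [10.0, 12.0],
    "Total intl charge": [2.7, 3.24],
})

def rate_per_minute(data: pd.DataFrame, period: str) -> float:
    """Total charge divided by total minutes for one call period."""
    return data[f"Total {period} charge"].sum() / data[f"Total {period} minutes"].sum()

rates = pd.Series({period: rate_per_minute(toy, period) for period in ["day", "intl"]})
print(rates)
```

Dividing the totals gives the same result as dividing the column means, because both sums run over the same number of rows.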

# **Graphical Analysis**

**UNIVARIATE ANALYSIS**

In univariate analysis we analyze one column of the numerical data at a time. For this we use three types of plot: box plot, strip plot, and distribution plot (displot).

In [None]:
#Printing boxplot for each numerical column present in the data set
df1=df.select_dtypes(exclude=['object','bool'])
for column in df1:
        plt.figure(figsize=(17,1))
        sns.boxplot(data=df1, x=column)
plt.show()

In [None]:
#Printing displot for each numerical column present in the data set
df1=df.select_dtypes(exclude=['object','bool'])
for column in df1:
    # displot is figure-level and creates its own figure, so size it via height and aspect
    sns.displot(data=df1, x=column, height=3, aspect=4)
plt.show()

In [None]:
#Printing strip plot for each numerical column present in the data set
df1=df.select_dtypes(exclude=['object','bool'])
for column in df1:
        plt.figure(figsize=(17,1))
        sns.stripplot(data=df1, x=column)
plt.show()

**BIVARIATE ANALYSIS**

In bivariate analysis we analyze two columns of the dataset together. Here we only use numerical columns, and for the visualization we use a **box plot and scatter plot**.

In [None]:
# Plot a scatter plot of churn against each numerical feature in the data set
df2= df.describe().columns
for col in df2:
  fig=plt.figure(figsize=(17,3))
  feature=df[col]
  label=df['Churn'].astype(int)
  # Pearson correlation between the feature and the churn flag
  correlation= feature.corr(label)
  plt.scatter(x=feature,y=label)
  plt.xlabel(col)
  plt.ylabel('Churn')
  plt.title(f'{col} vs Churn (correlation: {correlation:.2f})')
  plt.show()

In [None]:
#Plot the box plot of each numerical column grouped by churn
for col in df2:
  fig=plt.figure(figsize=(17,10))
  ax=fig.gca()
  # The numerical feature goes on the value axis and Churn defines the groups
  df.boxplot(column = col, by = 'Churn', ax = ax)
  plt.xlabel('Churn')
  plt.ylabel(col)
  plt.show()

**Multivariate Analysis**

In multivariate analysis we analyze more than two columns of the dataset together. For this we use a **correlation plot, correlation matrix, correlation heatmap, and pair plot**.
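
Before plotting, it can be useful to rank the numerical features by their correlation with churn as a plain table. The sketch below uses made-up rows in `toy`; `select_dtypes` and `corrwith` are standard pandas calls, and the text column is dropped because correlation is only defined for numerical data.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "State": ["CA", "NJ", "TX", "MD", "SC", "MI"],
    "Customer service calls": [0, 1, 5, 6, 1, 7],
    "Account length": [100, 120, 110, 105, 115, 125],
    "Churn": [False, False, True, True, False, True],
})

# Keep only numerical columns; this leaves out State and the boolean target
numeric = toy.select_dtypes(include="number")

# Pearson correlation of each numerical feature with the churn flag (True=1, False=0)
churn_corr = numeric.corrwith(toy["Churn"].astype(int)).sort_values(ascending=False)
print(churn_corr)
```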

In [None]:
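# visualization using correlation plot
plt.figure(figsize=(19,8))
# numeric_only=True skips the text columns, which would otherwise raise an error in recent pandas
df.corr(numeric_only=True)['Churn'].sort_values(ascending = False).plot(kind='bar',color = ['red','blue','yellow','indigo','orange','brown','pink'])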

In [None]:
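## plot the Correlation matrix
plt.figure(figsize=(17,8))
# numeric_only=True skips the text columns, which would otherwise raise an error in recent pandas
correlation=df.corr(numeric_only=True)
sns.heatmap(abs(correlation), annot=True, cmap='coolwarm')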

In [None]:
#create a correlation heatmap
#Assigning True=1 and False=0 to churn variable
df['Churn'] = df['Churn'].astype(int)
plt.figure(figsize=(17,9))
# numeric_only=True skips the text columns, which would otherwise raise an error in recent pandas
sns.heatmap(df.corr(numeric_only=True), cmap="Paired",annot=False)
plt.title("Correlation Heatmap", fontsize=20)

In [None]:
#plot the pair plot for all columns
sns.pairplot(df, height=3)

# **Conclusion**

After performing exploratory data analysis on the data set, this is what we have inferred from the data:

Some states have a higher churn rate than others, possibly due to low network coverage.

**Area code and Account length do not appear to play any role in the churn rate, so they are redundant columns for this analysis.**

Customers who have the International plan churn more, and international calling charges are also high, so these customers may be dissatisfied with network issues and high call charges.

In the voice mail section, customers with more than 20 voice-mail messages tend to churn, which may mean that the quality of the voice mail service is not good.

*Total day minutes, Total day calls, Total day charge, Total eve minutes, Total eve calls, Total eve charge, Total night minutes, Total night calls, and Total night charge did not appear to play any role in the churn rate.*

The international calls data shows that churn is high among customers who take the international plan, which suggests that international call charges are high and that there may be call drops or network issues.

The customer service calls data shows that the churn rate is high among unsatisfied customers who called the service center, which suggests that the service center did not resolve their issues.

### ***Hurrah! You have successfully completed your EDA Capstone Project !!!***