<a href="https://colab.research.google.com/github/Aakash0505/CAPSTONE_1/blob/main/Telecom_Churn_Analysis_Capstone_Project_AAKASH_AGRAWAL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## <b> Orange S.A., formerly France Télécom S.A., is a French multinational telecommunications corporation. The Orange Telecom Churn Dataset consists of cleaned customer activity data (features), along with a churn label specifying whether a customer canceled the subscription. </b>

## <b> Explore and analyze the data to discover key factors responsible for customer churn and come up with ways/recommendations to ensure customer retention. </b>

# Exploratory Data Analysis of the churn dataset
Exploratory Data Analysis (EDA) is an approach to analysing data. The first task of a data analyst is to look at the data and try to make sense of it. After that, we work out which questions we want to ask and how to use the available data to answer them.

EDA helps us to:
1) Delve into the data set
2) Examine the relationships among the variables
3) Identify any interesting observations
4) Develop an initial idea of possible associations between the predictors and the target variable.

The telecom market in the US is saturated and customer growth rates are low. The key focus of market players is therefore on retention and churn control. This project explores the churn dataset to identify the key drivers of churn and draw out the main insights.

# **`Importing all the necessary libraries for our project`**

In [None]:
#importing the required libraries
import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.ticker as mtick  
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# `Reading the dataset using pandas`

In [None]:
#reading a csv file and creating pandas dataframe
data=pd.read_csv("/content/drive/MyDrive/Telecom Churn .csv")

Check the dataset we have just loaded.
 

In [None]:
#head provides the first 5 rows of dataset
data.head()

In [None]:
# tail provides the last 5 rows of the dataset
data.tail()

# **General Overview of the dataset**

In [None]:
#row and column
data.shape

We have 3,333 rows and 20 columns in our dataset.

In [None]:
# information about the data
data.info()

# Checking the null values

In [None]:
# checking the sum of null values
data.isnull().sum()

In [None]:
# visualizing the percentage of missing values per column
missing = (data.isnull().sum() * 100 / data.shape[0]).rename_axis('column').reset_index(name='pct_missing')
plt.figure(figsize=(16,5))
# seaborn >= 0.12 requires x and y to be passed as keywords
ax = sns.pointplot(x='column', y='pct_missing', data=missing)
plt.xticks(rotation=90, fontsize=7)
plt.title("Percentage of Missing values")
plt.ylabel("PERCENTAGE")
plt.show()

Since there are no null values in any column of our data, no missing-value handling is needed.

If the dataset did contain null/missing values, we could replace each missing value with the mean of its attribute using commands like the following (kept inside a string so that the cell does not execute anything):

In [None]:
"""data_null  # a dataframe that contains missing values
data_null.mean(numeric_only=True)  # mean of each numeric column
data_null.fillna(data_null.mean(numeric_only=True))  # fill nulls with the mean of their own column"""
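
As a runnable illustration of the mean imputation described above, here is a minimal sketch on a tiny made-up dataframe. The name `data_null` and all of its values are hypothetical and are not taken from the churn dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with gaps; not part of the churn dataset.
data_null = pd.DataFrame({
    "Total day minutes": [180.0, np.nan, 220.0],
    "Customer service calls": [1.0, 3.0, np.nan],
    "State": ["KS", "OH", None],
})

# Column means of the numeric columns only (NaN values are skipped).
numeric_means = data_null.mean(numeric_only=True)

# fillna with a Series fills each column with the value stored under its own label;
# columns without an entry in the Series (here "State") are left untouched.
data_filled = data_null.fillna(numeric_means)
print(data_filled)
```

Non-numeric columns have no mean, so they would need a different strategy (for example the most frequent value) if they contained gaps.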

# **`Statistical summary of our dataset`**

In [None]:
# descriptive analysis of the data
data.describe().transpose()

# Some important points after descriptive analysis

*   The mean account length is about 101, and 25% of accounts have a length above 127.
*   The mean number of customer service calls is very low, roughly 1.
*   Comparing day, evening and night calls, charges are highest during the day, moderate in the evening and lowest at night.
*   International calls are charged at a much higher rate per minute than day, evening and night calls.
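
One way to compare charges across periods on an equal footing is to divide total charge by total minutes. The sketch below does this on invented rows; the column names `Total day charge`, `Total intl minutes` and `Total intl charge` are assumptions that follow the naming pattern of the minute columns used in this notebook, and the helper `rate_per_minute` is hypothetical.

```python
import pandas as pd

# Made-up rows that only mimic the assumed column naming of the churn dataset.
sample = pd.DataFrame({
    "Total day minutes": [100.0, 200.0],
    "Total day charge": [17.0, 34.0],
    "Total intl minutes": [10.0, 12.0],
    "Total intl charge": [2.7, 3.24],
})

def rate_per_minute(df, period):
    """Total charge divided by total minutes for one period (e.g. 'day')."""
    minutes = df[f"Total {period} minutes"].sum()
    charge = df[f"Total {period} charge"].sum()
    return charge / minutes

for period in ["day", "intl"]:
    print(period, round(rate_per_minute(sample, period), 3))
```

On the real dataframe the same helper could be called as `rate_per_minute(data, "eve")`, provided the columns exist under those names.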







In [None]:
# check the datatypes of columns
data.dtypes

# **`Plotting the number of customers who churned`**

In [None]:
# horizontal bar chart of the number of customers in each churn category
data['Churn'].value_counts().plot(kind='barh', figsize=(8, 6))
plt.xlabel("Count", labelpad=14)
plt.ylabel("Target Variable", labelpad=14)
plt.title("Count of TARGET Variable per category", y=1.02)

Let's see the percentage of each class of our target variable.


In [None]:
# in form of percentage
100*data['Churn'].value_counts()/len(data['Churn'])

14.49% of customers churned.
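
The same percentage can be obtained directly with `value_counts(normalize=True)`. A minimal sketch on an invented boolean series standing in for `data['Churn']`:

```python
import pandas as pd

# Toy stand-in for data['Churn']; the values are invented.
churn = pd.Series([False, False, False, True], name="Churn")

# normalize=True returns fractions that sum to 1; scale them to percent.
churn_pct = churn.value_counts(normalize=True).mul(100).round(2)
print(churn_pct)
```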

Let’s see the value count of our target variable

In [None]:
data["Churn"].value_counts()

In [None]:
# Descriptive Analysis of object and boolean type data
data.describe(include=['object', 'bool'])

Descriptive analysis of the non-numeric columns shows the number of unique values, the most frequent value and its frequency.

# **`Checking the correlation between the variables`**


In [None]:
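# Correlation table (numeric and boolean columns only)
corr = data.corr(numeric_only=True)
# Styler.set_precision was removed in pandas 2.0; format(precision=...) replaces it
corr.style.background_gradient().format(precision=1)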

It is evident from the correlation table that each Minutes variable has a perfect linear relationship with its corresponding Charge variable.
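
A correlation of 1 means the charge is a fixed linear function of the minutes. The sketch below checks this on invented numbers, where the charge is built with a made-up flat rate of 0.17 per minute; it is only meant to show what such a check could look like.

```python
import numpy as np
import pandas as pd

# Invented minutes; the charge uses a made-up flat rate of 0.17 per minute.
minutes = pd.Series([120.5, 265.1, 161.6, 243.4, 299.4])
charge = (minutes * 0.17).round(2)

# Pearson correlation between the two columns.
r = minutes.corr(charge)

# Least-squares line: charge = slope * minutes + intercept.
slope, intercept = np.polyfit(minutes.to_numpy(), charge.to_numpy(), deg=1)

print(round(r, 4), round(slope, 3), round(intercept, 3))
```

When one column is a fixed multiple of another, keeping only one of the two is usually enough for later modelling.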

# **`Now we move on to the visualization part, where we compare the different attributes to understand the churn problem`**

In [None]:
# Box Plot for Account Length attribute 
data.boxplot(column='Account length',by='Churn')

Account length is distributed almost identically in both churn groups; the medians shown by the box plot are nearly equal.

In [None]:
# map an account length (assumed to be in days) to a tenure category
def account_cat(a):
  if a < 30:
    return 'new'
  elif a < 90:
    return 'below 3 months'
  return 'regular user'

In [None]:
data['account category'] = data['Account length'].apply(account_cat)
data.head()

In [None]:
sns.countplot(x="account category",data=data)

In [None]:
sns.countplot(x="account category", hue='Churn', data=data)

In [None]:
# Box Plot for CustServ Calls attribute 
data.boxplot(column='Customer service calls',by='Churn')

There is a considerable difference in the number of customer service calls between the two churn groups.
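
A box plot draws the median rather than the mean, so it can help to compute both per churn group. A minimal sketch on invented rows (only the column names follow this notebook):

```python
import pandas as pd

# Invented rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "Churn": [False, False, False, True, True],
    "Customer service calls": [1, 0, 2, 4, 5],
    "Total day minutes": [150.0, 180.0, 170.0, 260.0, 240.0],
})

# Mean and median of each selected column within each churn group.
cols = ["Customer service calls", "Total day minutes"]
summary = toy.groupby("Churn")[cols].agg(["mean", "median"])
print(summary)
```

The same call on the real dataframe would replace `toy` with `data`.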

In [None]:
# Box Plot for Total day minutes attribute 
data.boxplot(column='Total day minutes',by='Churn')

Total day minutes differ noticeably between the two churn groups.

In [None]:
# Box Plot for Total eve minutes attribute 
data.boxplot(column='Total eve minutes',by='Churn')

There is a slight difference in evening minutes between the two churn groups. The box plot also shows the outliers and the range of each variable.

In [None]:
# Box Plot for Total night minutes  attribute 
data.boxplot(column='Total night minutes',by='Churn')

There is a slight difference in night minutes between the two churn groups. The box plot also shows the outliers and the range of each variable.

## **`Observing the relationship between the number of voice mail messages and churn`**

In [None]:
# Box Plot for Number vmail messages attribute 
data.boxplot(column='Number vmail messages',by='Churn')

Very few churned customers were using the voice mail message service.

In [None]:
sns.countplot(x="Voice mail plan", hue='Churn', data=data)

# `Graphical representation of each variable`

In [None]:
# Histogram for each variable
data.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8)

Each charge variable has the same distribution shape as its minutes variable, because the two are perfectly correlated.

In [None]:
# International Plan 
pd.crosstab(data['Churn'], data["International plan"], margins=True)

Let’s see how the churn rate relates to the International plan variable. Among customers with the International Plan, the churn rate is much higher: those who have selected the plan are more likely to leave the company’s service than those who have not. One possible explanation is that large, hard-to-control international call expenses cause billing disputes and dissatisfaction among the telecom operator’s customers.
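
To read churn rates rather than raw counts from a crosstab, the rows can be normalized. A minimal sketch on invented rows; the `"Yes"`/`"No"` labels are an assumption about how the plan column is encoded.

```python
import pandas as pd

# Invented rows with the same two columns as the crosstab above.
toy = pd.DataFrame({
    "International plan": ["No", "No", "No", "No", "Yes", "Yes"],
    "Churn": [False, False, False, True, True, False],
})

# normalize="index" divides each row by its row total, so every row sums to 1.
rates = pd.crosstab(toy["International plan"], toy["Churn"], normalize="index")
print(rates)
```

Putting the plan on the rows makes each row read as "share retained, share churned" for that plan.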

In [None]:
# Count plot of International Plan grouped by churn
plt.rcParams['figure.figsize'] = (8, 6)
sns.countplot(x="International plan", hue='Churn', data=data)


In [None]:
# Customer Service Calls 
pd.crosstab(data['Churn'], data["Customer service calls"], margins=True)

In [None]:
# Count plot of State grouped by churn
plt.rcParams['figure.figsize'] = (15,8)
sns.countplot(x="State", hue='Churn', data=data);

The graph shows no obvious pattern across states, so State does not appear to be a driver of customer churn.

In [None]:
# Count plot of Area Code grouped by churn
plt.rcParams['figure.figsize'] = (8, 6)
sns.countplot(x="Area code", hue='Churn', data=data);

The same holds for Area code: churn looks similar across the area codes.

In [None]:
# Count plot of Customer service calls grouped by churn
plt.rcParams['figure.figsize'] = (8, 6)
sns.countplot(x="Customer service calls", hue='Churn', data=data);

The plot shows that the churn rate rises sharply from 4 calls to the service center onwards. Customers who have called customer service three or fewer times have a markedly lower churn rate than customers who have called four or more times.
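
The threshold of four calls can be checked numerically by splitting customers into two groups and comparing churn rates. A minimal sketch on invented rows; the group label `4+ calls` is a hypothetical name introduced here.

```python
import pandas as pd

# Invented rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "Customer service calls": [0, 1, 2, 3, 1, 4, 5, 4, 6, 2],
    "Churn": [False, False, False, True, False, True, True, False, True, False],
})

# Flag customers at or above the threshold of four calls named in the text.
many_calls = (toy["Customer service calls"] >= 4).rename("4+ calls")

# The mean of a boolean column is the share of True values, i.e. the churn rate.
rate_by_group = toy.groupby(many_calls)["Churn"].mean()
print(rate_by_group)
```

Dividing the rate of the `True` group by that of the `False` group gives the "how many times as often" figure for whatever data is supplied.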

In [None]:
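# minutes are continuous, so a stacked histogram is used instead of one bar per distinct value
plt.rcParams['figure.figsize'] = (8, 6)
sns.histplot(data=data, x="Total day minutes", hue='Churn', multiple='stack');

In [None]:
plt.rcParams['figure.figsize'] = (8, 6)
sns.histplot(data=data, x="Total eve minutes", hue='Churn', multiple='stack');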

Customers with high day minutes have a higher churn rate. The same goes for customers with high evening minutes.
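
One way to quantify this is to bin the minutes and compute the churn rate per bin. A minimal sketch on invented rows; the bin edges 0/150/225/400 and the labels are arbitrary choices made for this illustration.

```python
import pandas as pd

# Invented rows; only the column names follow the notebook.
toy = pd.DataFrame({
    "Total day minutes": [90.0, 120.0, 150.0, 170.0, 190.0, 210.0, 250.0, 280.0, 300.0],
    "Churn": [False, False, False, False, True, False, True, True, True],
})

# Fixed-width bins (right edge inclusive); edges and labels are arbitrary for this sketch.
usage_bin = pd.cut(
    toy["Total day minutes"],
    bins=[0, 150, 225, 400],
    labels=["low", "medium", "high"],
)

# Share of churned customers within each usage bin.
churn_by_bin = toy.groupby(usage_bin, observed=True)["Churn"].mean()
print(churn_by_bin)
```

`pd.qcut` could be used instead of `pd.cut` if bins with equal numbers of customers are preferred.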

***Let us consider some of the insights we have gained into the churn data set through the use of exploratory data analysis.***


*   There is no significant relation between account length and churn.
*   The total percentage of customers churned is 14.49%.
*   The area code field and/or the state field are anomalous and can be omitted.
*   Customers with the International Plan tend to churn more frequently.
*   Customers with four or more customer service calls churn more than four times as often as the other customers.
*   Customers with high day minutes and evening minutes tend to churn at a higher rate than the other customers.
*   There is no obvious association of churn with the variables day calls, evening calls, night calls, international calls, night minutes, international minutes, account length, or voice mail messages.

