# Exploratory Analysis of Telco's Customer Churn (Step-by-Step)

### Introduction

This is my first *Python* practice with Jupyter Notebook, so I want to explain step by step the exploratory analysis I made, so that beginners like me have a complete and easy example of how to apply Python to a real data set. I would also really appreciate recommendations or feedback from Kaggle users to help me improve my Python skills.

#### Context

Exploratory analysis of Telco's customer database, which contains information about customer attributes, contracted services, and monthly and total charges, among others.

#### Aim of the Analysis

The aim of this analysis is to explore the data set and try to find out if there is a strong relation (correlation) between the different customer attributes and the target variable, "Churn".

#### Customer Churn

OK, we are going to analyze Telco's Customer Churn, but *what is Customer Churn?* *Customer Churn* is the percentage of customers that stopped using a company's product or service during a certain time frame. The churn rate is calculated by dividing the number of customers lost during that period by the total number of customers at the beginning of it. Obviously, companies should aim for a churn rate that is as close to 0% as possible.

For example, if you start your quarter with 400 customers and end with 380, your churn rate is 5% because you lost 20 of your 400 customers (20 / 400 = 5%).
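
The same calculation can be written as a small function. This is only a sketch: the name `churn_rate` and its parameters are made up for illustration and are not used anywhere else in this notebook.

```python
def churn_rate(customers_at_start: int, customers_lost: int) -> float:
    """Return the churn rate as a percentage of the starting customer base."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return customers_lost * 100 / customers_at_start


# Quarter from the example above: 400 customers at the start, 380 at the end.
lost = 400 - 380
print(churn_rate(400, lost))  # 5.0
```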

### Step 1 - Import libraries

First of all, we have to import the different libraries (with all their functions) that we are going to use in this analysis:

```python
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
```

- **pandas:** helps us organize and manipulate data by putting it in tabular form.
- **numpy:** is mainly used for working with arrays.
- **math:** gives us access to mathematical functions.
- **matplotlib.pyplot:** is a plotting library used for 2D graphics in Python.
- **seaborn:** is a data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- **LabelEncoder:** encodes target labels with values between 0 and n_classes-1; in other words, it converts categorical fields into numerical ones.

In [None]:
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

### Step 2 - Read document

Once we have imported the libraries, we are going to use the *read_csv* function of pandas to read the [Telco's Customer Churn data](https://www.kaggle.com/blastchar/telco-customer-churn) and create our pandas DataFrame.

```python
df = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")
```

This transforms the CSV data into a 2-D labeled data structure with columns of potentially different types. In other words, *a pandas DataFrame is an in-memory representation of an Excel sheet in Python*.

In [None]:
df = pd.read_csv("../input/telco-customer-churn/WA_Fn-UseC_-Telco-Customer-Churn.csv")

### Step 3 - Data Validation

To check that the data was loaded correctly into the DataFrame and that it is consistent with the data source (CSV), we can use some simple pandas commands to get the total number of rows, columns and values, list the different fields (columns), and look at a few rows for an overview.

```python
df.shape
df.size
df.columns
df.head(5)
df.sample(5)
```

- **df.shape:** returns the number of rows and columns of the DataFrame.
- **df.size:** returns the total number of elements of the DataFrame (row_count * column_count).
- **df.columns:** returns the column labels of the DataFrame.
- **df.head(x):** shows the first 'x' rows of the DataFrame.
- **df.sample(x):** returns a random sample of 'x' rows of the DataFrame.
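
To make the idea of "consistent with the data source" more concrete, here is a minimal sketch that compares the DataFrame against the raw CSV text. The three rows below are made up, and `csv_text`, `toy_df` and `expected_rows` are names I chose just for this example.

```python
import io

import pandas as pd

# Made-up CSV content standing in for the real file.
csv_text = "customerID,gender,Churn\nA-001,Female,No\nA-002,Male,Yes\nA-003,Male,No\n"

toy_df = pd.read_csv(io.StringIO(csv_text))

# Count the data lines in the raw text (all lines minus the header).
expected_rows = len(csv_text.strip().splitlines()) - 1
n_rows, n_cols = toy_df.shape

print(n_rows == expected_rows)          # the row count matches the source
print(toy_df.size == n_rows * n_cols)   # size is rows times columns
print(list(toy_df.columns))             # the column labels read from the header
```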

In [None]:
df.shape

In [None]:
df.size

In [None]:
df.columns

In [None]:
df.head(3)

In [None]:
df.sample(5)

### Step 4 - Data exploration

Once we have verified that the data was loaded correctly into the DataFrame, **it's time to explore this data set!** But wait ... *How do I start? What should I keep in mind?*

Sometimes it's hard to know how to start or how to take the first step in the right direction and avoid wasting time on useless tasks. Actually, this is happening to me right now!

The experts' recommendation is to always keep in mind the aim of the analysis as a whole. Every step should be oriented towards answering or explaining the problem raised.

OK! I get it ... So we can start by understanding the data in the data set: which customer attributes we have available and how they relate to the variable "Churn".

```python
df.info()
df.describe()
```

- **df.info():** gives us a concise summary of a DataFrame with the column count, the non-null values and their dtypes.
- **df.describe():** generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution. It is worth mentioning that this function can analyze both numeric and object series, but some metrics only make sense for numeric fields.

In [None]:
df.info()

In [None]:
df.describe(include='all') # Description of all fields without differentiating their dtypes.

In [None]:
df.describe() #Without the 'include' condition, this function will return only the numeric fields.

Based on the information above, we can confirm that there are no *NULL* values: each of the 21 columns has 7043 non-null entries. We can also see that the fields are split into 18 of type *object* and 3 *numeric* ones (2 integer and 1 float).
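
One caveat is that a blank string counts as a non-null value, so a text column can still hide missing data. The sketch below shows how that could be checked. The frame is made up, and whether any column of the real data set behaves like this is an assumption to verify on `df` itself.

```python
import pandas as pd

# Made-up frame: 'TotalCharges' is stored as text and one entry is a blank string.
toy_df = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200.0"],
})

# No NULL values are reported, because " " is a valid (non-null) string.
null_count = int(toy_df.isnull().sum().sum())

# Count the entries that are empty or whitespace-only in the text column.
blank_count = int(toy_df["TotalCharges"].str.strip().eq("").sum())

# Converting to numbers turns entries that cannot be parsed into NaN.
total_charges = pd.to_numeric(toy_df["TotalCharges"], errors="coerce")
nan_count = int(total_charges.isnull().sum())

print(null_count, blank_count, nan_count)
```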

If we focus only on the numeric fields, from the descriptive statistics we can conclude that at least 75% of Telco's customers are not considered Senior Citizens. *Let's check this and visualize it on a chart using pyplot!*

In [None]:
SeniorCitizens = df.SeniorCitizen[df.SeniorCitizen == 1].count()
print(f'The amount of Senior Citizens customers is {SeniorCitizens}')
SeniorCitizens_Ratio = round(SeniorCitizens / df.SeniorCitizen.count(),2)
print(f'The ratio of Senior Citizens is {SeniorCitizens_Ratio}' )
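
As an alternative, pandas can return the counts and the ratios directly with `value_counts`. This is a minimal sketch on a made-up series of 0/1 flags, not on the real column:

```python
import pandas as pd

# Made-up flags: 1 = senior citizen, 0 = not a senior citizen.
senior = pd.Series([0, 0, 1, 0, 1, 0, 0, 0], name="SeniorCitizen")

counts = senior.value_counts()                          # customers per value
ratios = senior.value_counts(normalize=True).round(2)   # share per value

print(counts.loc[1], ratios.loc[1])  # 2 0.25
```

Because the flag only takes the values 0 and 1, `senior.mean()` gives the same ratio.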

In [None]:
SeniorCitizens = df.SeniorCitizen[df.SeniorCitizen == 1].count()
NoSeniorCitizens = df.SeniorCitizen[df.SeniorCitizen == 0].count()

x_variables = ['Senior Citizens', 'No Senior Citizens']
y_variables = [SeniorCitizens, NoSeniorCitizens]
barChar = plt.bar(x_variables,y_variables, color=['Lightblue', 'Lightyellow'])

#Set descriptions:
plt.ylabel('Amount of Customers')
plt.title('Senior Citizens Customers')
#Display values:
plt.text(-0.06, 200, SeniorCitizens, fontsize=10, color= 'White')
plt.text(0.93, 200, NoSeniorCitizens, fontsize=10, color= 'Black')
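
The `plt.text` calls above place the values at hand-picked coordinates. A possible alternative, assuming matplotlib 3.4 or newer, is `bar_label`, which positions each value on its own bar. The counts below are made up for the sketch.

```python
import matplotlib.pyplot as plt

labels = ["Senior Citizens", "No Senior Citizens"]
values = [25, 75]  # made-up counts, not the real ones

fig, ax = plt.subplots()
bars = ax.bar(labels, values, color=["lightblue", "lightyellow"])
ax.set_ylabel("Amount of Customers")
ax.set_title("Senior Citizens Customers")

# Write each bar's height just above the bar, with no manual coordinates.
texts = ax.bar_label(bars)
```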

If we want to analyze a categorical variable like "Gender", we can follow the same approach to find the number of male and female customers and visualize them on a bar chart, this time using *seaborn*.

In [None]:
df_gender = df[['customerID','gender']]
df_gender = df_gender.rename(columns={'customerID':'Amount of Customers'})
gender_count = df_gender.groupby('gender').count()
gender_count['Percentage']=round((gender_count['Amount of Customers']/ df_gender['gender'].count()*100),2)
gender_count

In [None]:
sns.set_palette(['pink', 'lightblue'])
sns.set_context("talk", font_scale=0.8)
# catplot is a figure-level function: it creates its own figure, sized by height and aspect.
gender_chart = sns.catplot(x="gender",
                           hue="gender",
                           data=df,
                           kind="count",
                           height=4,
                           aspect=1.5).set(title="Customer Gender")


It is also possible to combine variables to get more detail about the customers. For example, below is a classification of customers by Gender and Partner.

In [None]:
GenderPartner = df[['customerID','gender','Partner']]
GenderPartner = GenderPartner.rename(columns={'customerID':'Amount of Customers'})
GenderPartner_count = GenderPartner.groupby([GenderPartner.gender, GenderPartner.Partner])[['Amount of Customers']].count()
GenderPartner_count['Percentage']=round((GenderPartner_count['Amount of Customers']/ GenderPartner['gender'].count()*100),2)
GenderPartner_count
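
A similar table can be built with `pd.crosstab`, which counts the combinations of two columns without renaming `customerID`. This is only a sketch on made-up rows, where `normalize="all"` divides every cell by the total number of rows.

```python
import pandas as pd

# Made-up customers, only to show the shape of the result.
toy_df = pd.DataFrame({
    "gender":  ["Female", "Female", "Female", "Female", "Male", "Male", "Male", "Male"],
    "Partner": ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No"],
})

# Number of customers for each gender / partner combination.
counts = pd.crosstab(toy_df["gender"], toy_df["Partner"])

# Same table as a percentage of all customers.
percentages = (pd.crosstab(toy_df["gender"], toy_df["Partner"], normalize="all") * 100).round(2)

print(counts)
print(percentages)
```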

### Step 5 - Churn

Since this is the most important variable, we should explore it and get some insights:

In [None]:
ChurnCustomers = df.Churn[df.Churn == "Yes"].count()
print(f'The amount of customers that left the company is {ChurnCustomers}')
ChurnCustomers_Ratio = round(ChurnCustomers / df.Churn.count(),2)
print(f'The churn ratio is {ChurnCustomers_Ratio}' )

In [None]:
x_variables = ['Churn', 'No Churn']
y_variables = [df.Churn[df.Churn == "Yes"].count(), df.Churn[df.Churn == "No"].count()]
barChar = plt.bar(x_variables,y_variables, color=['violet','orange'])

#Set descriptions:
plt.ylabel('Amount of Customers')
plt.title('Churn Customers')
#Display values:
plt.text(-0.06, 200,df.Churn[df.Churn == "Yes"].count(), fontsize=10, color= 'White')
plt.text(0.93, 200, df.Churn[df.Churn == "No"].count(), fontsize=10, color= 'White')

As we can see from the data above, churned customers represent 27% of the whole data set, which seems pretty significant.
Let's find out the total monthly charges that this 27% represents. I guess the CEO would like to know how much monthly revenue the company lost after the churn of these 1,869 customers.

In [None]:
TotalMonthlyCharges = df[['Churn','MonthlyCharges']]
Total = TotalMonthlyCharges.groupby('Churn').sum()
Total['Percentage']=round((Total['MonthlyCharges']/ TotalMonthlyCharges['MonthlyCharges'].sum()*100),2)
Total

In [None]:
x_variables2 = ['Churn', 'No Churn']
y_variables2 = [TotalMonthlyCharges[TotalMonthlyCharges.Churn== 'Yes']['MonthlyCharges'].sum(), TotalMonthlyCharges[TotalMonthlyCharges.Churn== 'No']['MonthlyCharges'].sum()]
barChar2 = plt.bar(x_variables2,y_variables2, color=['lightgreen','lightblue'])

#Set descriptions:
plt.ylabel('Monthly Charges')
plt.title('Monthly Charges Churn')
#Display values:
plt.text(-0.15, 10000,TotalMonthlyCharges[TotalMonthlyCharges.Churn== 'Yes']['MonthlyCharges'].sum(), fontsize=10, color= 'Grey')
plt.text(0.85, 10000, TotalMonthlyCharges[TotalMonthlyCharges.Churn== 'No']['MonthlyCharges'].sum(), fontsize=10, color= 'Grey')

As we can see, churned customers account for 30.5% of the total monthly charges, which is above their share of the customer base (27%). So we can say that the customers who left the company had, on average, a monthly charge above the overall mean.
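
A group's share of the charges is larger than its share of the customers exactly when its average charge is above the overall average. A way to check that directly is to compare the group means, as in this sketch with made-up values:

```python
import pandas as pd

# Made-up charges: two churned customers and four that stayed.
toy_df = pd.DataFrame({
    "Churn": ["Yes", "Yes", "No", "No", "No", "No"],
    "MonthlyCharges": [90.0, 80.0, 50.0, 60.0, 40.0, 70.0],
})

mean_by_churn = toy_df.groupby("Churn")["MonthlyCharges"].mean()
overall_mean = toy_df["MonthlyCharges"].mean()

print(mean_by_churn)
print(overall_mean)
print(mean_by_churn["Yes"] > overall_mean)  # True for this made-up data
```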

As we have already calculated the relation between gender and partner, we can add the variable "Churn" to see whether those variables influence the decision to leave the company.

In [None]:
GenderPartner_Churn = df[['customerID','gender','Partner', 'Churn']]
GenderPartner_Churn = GenderPartner_Churn.rename(columns={'customerID':'Amount of Customers'})
GenderPartner_Churn_count = GenderPartner_Churn.groupby([GenderPartner_Churn.gender, GenderPartner_Churn.Partner, GenderPartner_Churn.Churn])[['Amount of Customers']].count()
GenderPartner_Churn_count['Percentage']=round((GenderPartner_Churn_count['Amount of Customers']/ GenderPartner_Churn['Amount of Customers'].count()*100),2)
GenderPartner_Churn_count

From the data above, we can only say that customers who have a partner are less likely to leave the company. We can confirm this by calculating the ratios of churned customers by the variable "Partner":

In [None]:
Partner_Churn1 = df[df.Partner == 'Yes'][['customerID','Partner', 'Churn']]
Partner_Churn1 = Partner_Churn1.rename(columns={'customerID':'Amount of Customers'})
Partner_Churn_count1 = Partner_Churn1.groupby([Partner_Churn1.Partner, Partner_Churn1.Churn])[['Amount of Customers']].count()
Partner_Churn_count1['Percentage']=round((Partner_Churn_count1['Amount of Customers']/ Partner_Churn1['Amount of Customers'].count()*100),2)
Partner_Churn_count1

In [None]:
Partner_Churn2 = df[df.Partner == 'No'][['customerID','Partner', 'Churn']]
Partner_Churn2 = Partner_Churn2.rename(columns={'customerID':'Amount of Customers'})
Partner_Churn_count2 = Partner_Churn2.groupby([Partner_Churn2.Partner, Partner_Churn2.Churn])[['Amount of Customers']].count()
Partner_Churn_count2['Percentage']=round((Partner_Churn_count2['Amount of Customers']/ Partner_Churn2['Amount of Customers'].count()*100),2)
Partner_Churn_count2
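
The two cells above can also be collapsed into one step by turning "Churn" into a True/False flag and averaging it per group. This is a sketch on made-up rows, and the helper column `churned` is a name I introduced for the example.

```python
import pandas as pd

# Made-up customers: 4 with a partner, 4 without.
toy_df = pd.DataFrame({
    "Partner": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "Churn":   ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No"],
})

# The mean of a True/False flag is the share of True values in each group.
churn_rate_by_partner = (
    toy_df.assign(churned=toy_df["Churn"].eq("Yes"))
          .groupby("Partner")["churned"]
          .mean()
          .mul(100)
)

print(churn_rate_by_partner)
```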

Since the aim of the analysis is to find the correlation between the different attributes in this data set and the target variable *Churn*, we first need to convert all the non-numerical fields into numerical ones. Then we can calculate their correlations.

As we explained at the beginning of this notebook, we will use the LabelEncoder class from sklearn.preprocessing, which lets us convert the categorical fields into the numerical values we need to calculate the correlations.

In [None]:
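# Boolean mask of the columns stored as text (dtype 'object').
no_numerical = (df.dtypes == 'object')
no_numerical_list = list(no_numerical[no_numerical].index)

# Work on a copy so the original DataFrame keeps its text values.
encdata = df.copy()
enc = LabelEncoder()
columns = df.columns

# Replace each text column with integer codes (0 .. n_classes-1).
# Note: identifier columns such as customerID are encoded too, and their codes carry no meaning.
for col in no_numerical_list:
    encdata[col] = enc.fit_transform(encdata[col])

encdata = pd.DataFrame(encdata, columns=columns)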
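
To see which number each category receives, we can inspect the encoder's `classes_` attribute after fitting. This is a minimal sketch on a made-up "Contract" series. LabelEncoder sorts the labels, so the codes follow alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up contract values, only for illustration.
contract = pd.Series(["Two year", "Month-to-month", "One year", "Month-to-month"])

enc = LabelEncoder()
codes = enc.fit_transform(contract)

# classes_ holds the sorted labels; the position of a label is its code.
mapping = {str(label): code for code, label in enumerate(enc.classes_)}

print(mapping)         # {'Month-to-month': 0, 'One year': 1, 'Two year': 2}
print(codes.tolist())  # [2, 0, 1, 0]
```

Since the loop above reuses a single encoder, `enc.classes_` only keeps the labels of the last column that was fitted. To look up a specific column, fit a separate encoder on it as in this sketch.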

Once we have made this conversion, we will use seaborn to display a heatmap with the correlations between all the variables. This gives us more information than we need, so we then limit the map to the correlations that matter most to us (the ones related to "Churn").

In [None]:
plt.figure(figsize=(9,9))
sns.heatmap(encdata.corr(), vmin=-1, vmax=1,cmap=sns.diverging_palette(20, 220, n=200))

In [None]:
encadata2 = encdata.corr()
encadata2 = encadata2[['Churn']]
plt.figure(figsize=(3,9))
sns.heatmap(encadata2, annot = True, vmin=-1, vmax=1,cmap=sns.diverging_palette(20, 220, n=200))
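
If a chart is not needed, the same column can be read as a sorted list, which makes the strongest negative and positive correlations easy to spot. The sketch below uses a small made-up frame that is already numeric. On the real data, `encdata` would take its place.

```python
import pandas as pd

# Made-up, already encoded data (Churn: 1 = Yes, 0 = No).
encoded = pd.DataFrame({
    "tenure":         [1, 2, 30, 45, 60, 5],
    "Contract":       [0, 0, 1, 2, 2, 0],
    "MonthlyCharges": [80.0, 75.0, 50.0, 60.0, 40.0, 90.0],
    "Churn":          [1, 1, 0, 0, 0, 1],
})

# Correlation of every column with Churn, without Churn itself, in ascending order.
churn_corr = encoded.corr()["Churn"].drop("Churn").sort_values()

print(churn_corr)
```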

### Conclusions

After exploring and analyzing the data set, we can conclude that no attribute is strongly correlated with "Churn". Even so, some attributes do have a small influence on customers leaving the company. One of them is "tenure", which has a negative correlation with "Churn" (-0.35): customers who leave tend to have spent less time with the company. Also, as we verified before, there is a negative correlation with "Partner" (-0.15): customers who have a partner are more likely to stay. Finally, taking into account that the numerical values for "Contract" are 0 = Month-to-Month, 1 = One Year and 2 = Two Year, the negative correlation of -0.4 means that customers on longer contracts, such as two-year contracts, are less likely to churn.