# Telecommunication company case study

## Abstract

This report and research proposal analyses a data sample [available on IBM Watson website] [REFERENCE] on customer churn at a telecommunication company and presents the insights from that analysis. The notebook guides the reader through the main analytical steps needed to answer the business questions about churn and highlights the findings.

### The goal and scope of the analysis

Customer churn is an important issue for any business, since it directly affects a company's market share and profitability. Predicting churn, and understanding how to work with clients who are about to churn, is therefore very valuable. To that end, we will draw insights from the data that help define the patterns shared by churning customers and better understand the possible reasons for churn.

Based on the data we have, we'll focus primarily on giving insights about the following questions: 
1. When do clients churn? 
2. Who are the people who churn, and how do they differ from non-churners?

For the second question, we'll split the client base into segments and analyse a few extreme profiles to find similarities between churning clients. We're also interested in characterising the customers who are happy to stay with the company for a very long time and are not prone to churn. Hence we'll analyse the following segments in particular:
- Clients who are not churning and have the longest tenure 
- Clients who stayed with the company for a long time and eventually churned
- Clients who churned quickly 




## Data set description

We use client data from telecommunication companies, found on IBM Watson [LINK]. The dataset has 7043 rows, each belonging to a single client, and 21 columns describing the customers' demographic characteristics, the services they use, their payment patterns and their churn status.



## 1. Research and Analysis 

### 1.1. Data loading and preprocessing 

Before digging into the analysis, we need to load the data and inspect it to make sure that its formatting is suitable for the analysis.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
sb.set(style='darkgrid', context='poster', font='Monospace')
%matplotlib inline

df=pd.read_csv("../Telco_churn_data.csv")
df.head(n=10)
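
Beyond looking at the first rows, a quick structural check can confirm the table size, the column types and the number of missing values. The snippet below is a minimal sketch on a small invented table (`toy` is a hypothetical stand-in for `df`; only the column names come from the data set).

```python
import pandas as pd

# Invented rows; only the column names are taken from the data set.
toy = pd.DataFrame({
    'SeniorCitizen': [0, 1, 0],
    'tenure': [1, 34, 2],
    'Churn': ['No', 'No', 'Yes'],
})

n_rows, n_cols = toy.shape              # size of the table
column_types = toy.dtypes               # data type of each column
missing_per_column = toy.isna().sum()   # number of missing values per column
```

Running the same three calls on `df` would show whether the row and column counts match the description above.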

As we can see, some columns hold the values "1" / "0" rather than "Yes" / "No" or other descriptive values. We'll unify the formatting of these columns so that they are easier to read in the plots later. After that, we'll answer the research questions.

In [None]:
# Assign the result back: an in-place replace on a selected column may not update df
df['SeniorCitizen'] = df['SeniorCitizen'].replace([1, 0], ['Senior', 'Non-senior'])
df['Dependents'] = df['Dependents'].replace(['Yes', 'No'], ['Dependents', 'No dependents'])
df['Partner'] = df['Partner'].replace(['Yes', 'No'], ['Partner', 'No partner'])
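
As a self-contained illustration of this relabelling, the sketch below applies a dictionary mapping to a small invented table (`toy` is hypothetical; the column names and labels follow the cell above). Using `map` with a dictionary is one possible alternative to `replace`, not the notebook's own method; note that `map` turns any value missing from the dictionary into `NaN`.

```python
import pandas as pd

# Toy data with the same column names as the data set; the values are invented.
toy = pd.DataFrame({
    'SeniorCitizen': [1, 0, 0],
    'Partner': ['Yes', 'No', 'Yes'],
})

# Assigning the mapped column back avoids relying on in-place modification.
toy['SeniorCitizen'] = toy['SeniorCitizen'].map({1: 'Senior', 0: 'Non-senior'})
toy['Partner'] = toy['Partner'].map({'Yes': 'Partner', 'No': 'No partner'})

relabelled = toy['SeniorCitizen'].tolist()
```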

## Research question 1: When do clients churn? 

To understand how churn is distributed over a client's time with the company, we'll create a new dataframe of the clients who churned and build a histogram of their tenure.

In [None]:
churning_clients = df[df['Churn'] == 'Yes'].copy()

In [None]:
plt.figure(figsize=(10, 5))

# density=True replaces the removed normed=True argument
plt.hist(churning_clients['tenure'], density=True, bins=72, label='Month')
plt.legend()
plt.title('Histogram: Time of churn')
plt.xlabel('Months') ;

As the histogram shows, clients tend to churn at the beginning of their journey with the company. This may indicate that clients' expectations of the service were not met. Churn is highest within the first 10 months and then gradually flattens out. Let's see how many clients churned in each of the first 5 months.

In [None]:
tenure = pd.DataFrame(churning_clients['tenure'].value_counts())
tenure.sort_index(inplace=True)
tenure.columns = ['Number of clients']
tenure.head(n=5)

The first month is when the largest number of churning clients decide to leave the company.
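
Raw counts can be complemented with shares, which make it easier to say how much of all churn happens early. The sketch below uses invented tenure values (`churn_tenure` is a hypothetical stand-in for `churning_clients['tenure']`) and `value_counts(normalize=True)`.

```python
import pandas as pd

# Invented tenures (in months) of churned clients, for illustration only.
churn_tenure = pd.Series([1, 1, 1, 2, 2, 3, 5, 12, 30, 70], name='tenure')

# Share of churners leaving at each tenure, ordered by month.
share_by_month = churn_tenure.value_counts(normalize=True).sort_index()

# Cumulative share: how much of all churn has happened by a given month.
cumulative_share = share_by_month.cumsum()
```

Reading `cumulative_share` at a given month gives the fraction of churners who had already left by then.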

To get the full picture of client churn over time, we'll use intervals of 12 months and build a bar plot showing how clients churn year by year.

In [None]:
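# Label-based slices on the sorted tenure index are inclusive,
# so each slice covers a full 12-month block without skipping months
grouped_tenure = pd.DataFrame([tenure.loc[:12].sum(), 
                               tenure.loc[13:24].sum(), 
                               tenure.loc[25:36].sum(), 
                               tenure.loc[37:48].sum(),
                               tenure.loc[49:60].sum(),
                               tenure.loc[61:72].sum()])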

new_index = ['1', '2', '3', '4', '5', '6']
## [BINS = 3 OR 6] https://stackoverflow.com/questions/14451185/better-binning-in-pandas
## value_counts(normalize=True)
grouped_tenure.index = new_index 
grouped_tenure.plot.bar(figsize=(15, 7), title='Number of clients churning over years', label='Year');
grouped_tenure.head()

The number of churning clients falls dramatically over time: from about a thousand clients churning in their first year to fewer than a hundred in their sixth year.
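
The yearly grouping can also be expressed with `pd.cut`, which assigns each tenure to a 12-month interval instead of slicing by position. This is a minimal sketch on invented tenures (`churn_tenure`, `edges` and `year_labels` are hypothetical names); a tenure of 0 would fall outside the first interval unless `include_lowest=True` is passed.

```python
import pandas as pd

# Invented tenures (in months) of churned clients, for illustration only.
churn_tenure = pd.Series([1, 2, 12, 13, 24, 25, 48, 60, 61, 72])

# Bin edges 0, 12, ..., 72 give right-closed intervals (0, 12], (12, 24], ...
edges = list(range(0, 73, 12))
year_labels = ['1', '2', '3', '4', '5', '6']
year_of_churn = pd.cut(churn_tenure, bins=edges, labels=year_labels)

# Count churners per year of tenure, keeping the years in order.
churn_per_year = year_of_churn.value_counts().sort_index()
```

Because the intervals are right-closed, months 12, 24, 36 and so on are counted in the year they complete rather than being dropped.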

Let's now look into factors that could help predict whether a client is more likely to churn or to stay with the company. We'll compare the demographic characteristics and usage patterns of churners and non-churners in general, and then analyse specific groups of clients, taking their churn status into account.

## Research question 2: Who are the churners and how do they differ from non-churners?

### Churners vs non-churners

Let's see whether churners and non-churners differ in terms of their demographics and the services they use.

In [None]:
# catplot replaces the removed factorplot; its size argument is now called height
sb.catplot(data=df, x='Churn', kind='count', col='gender', height=5, aspect=1);
sb.catplot(data=df, x='Churn', kind='count', col='SeniorCitizen', height=5, aspect=1);
sb.catplot(data=df, x='Churn', kind='count', col='Dependents', height=5, aspect=1);
sb.catplot(data=df, x='Churn', kind='count', col='Partner', height=5, aspect=1);

In the demographic data we can't clearly identify a factor that leads to a significant difference between churning and non-churning clients. However, as can be seen from the plot split by seniority (2), the percentage of churn among senior clients is much higher than among non-senior clients.
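
Count plots show absolute numbers, so a percentage comparison like the one above is easier to verify with a row-normalised cross-tabulation. The sketch below uses invented records (`toy` is hypothetical; the column names and labels follow the notebook), so the resulting rates say nothing about the real data.

```python
import pandas as pd

# Invented records using the data set's column names and labels.
toy = pd.DataFrame({
    'SeniorCitizen': ['Senior', 'Senior', 'Senior', 'Senior',
                      'Non-senior', 'Non-senior', 'Non-senior', 'Non-senior'],
    'Churn': ['Yes', 'Yes', 'No', 'No', 'Yes', 'No', 'No', 'No'],
})

# Each row sums to 1, so the 'Yes' column is the churn rate within the segment.
churn_share = pd.crosstab(toy['SeniorCitizen'], toy['Churn'], normalize='index')
senior_rate = churn_share.loc['Senior', 'Yes']
non_senior_rate = churn_share.loc['Non-senior', 'Yes']
```

The same call with `df['SeniorCitizen']` and `df['Churn']` would give the churn rate per segment directly.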

Let's perform the analysis of services usage for churning and non-churning clients.

In [None]:
columns = df.columns

In [None]:
## for column in columns: 
##    sb.catplot(data=df, x='Churn', col=column, kind='count', height=5, aspect=1)

In [None]:
sb.catplot(data=df, x='Churn', y='MonthlyCharges', kind='violin', height=5, aspect=1);
sb.catplot(data=df, x='Churn', col='PhoneService', kind='count', height=5, aspect=1);

### Who are the clients who are not churning and have the longest tenure in the company?
We're interested in identifying the non-churning clients who have stayed with the company the longest. We'll take the 25% of clients with the longest tenure and conduct a descriptive analysis of this slice of the data.

In [None]:
non_churning_clients = df[df['Churn'] == 'No'].copy()
plt.figure(figsize=(20, 10))

plt.subplot(1, 2, 1)
plt.boxplot(non_churning_clients['tenure'])
plt.xticks([])  # hide the default "1" tick label under the single box
plt.title('Boxplot of tenure of current clients')
plt.ylabel('Months')

plt.subplot(1, 2, 2)
plt.hist(non_churning_clients['tenure'], bins=20, density=True)
plt.title('Histogram of tenure of current clients')
plt.xlabel("Months")

plt.show()

In [None]:
non_churning_clients['tenure'].describe()

In [None]:
# Bin edges run up to 72 so that the longest tenures are not dropped
df.groupby('Churn')['tenure'].plot.hist(alpha = 0.7, 
                                        legend=True, 
                                        bins = range(0, 73, 6), 
                                        figsize=(15, 7), 
                                        density = True);

The tenure of the current customers is spread almost evenly across the time period, with two peaks at the beginning and at the end. This suggests that the company has retained clients at a fairly steady rate over time. The peak at the end of the graph may be explained either by the first clients being very loyal and not churning, or by many more clients having joined the company at its start.

In [None]:
long_non_churn_clients = non_churning_clients[non_churning_clients["tenure"] > 61].copy()
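
The cut-off of 61 months is hard-coded in the cell above. Assuming it was read off the 75th percentile in the `describe()` output (the notebook does not state this), the threshold can be derived from the data instead. The sketch below shows the idea on invented tenures (`toy`, `threshold` and `longest_quarter` are hypothetical names).

```python
import pandas as pd

# Invented tenures (in months) of non-churning clients, for illustration only.
toy = pd.DataFrame({'tenure': [2, 10, 15, 24, 38, 50, 61, 70, 72]})

# The 75th percentile separates the longest-tenured quarter from the rest.
threshold = toy['tenure'].quantile(0.75)
longest_quarter = toy[toy['tenure'] > threshold]
```

Deriving the threshold this way keeps the segment definition valid if the data set is refreshed.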

In [None]:
long_non_churn_clients.groupby(['SeniorCitizen', 'gender']).size().plot(kind='barh', 
                                                                        title="Demographics of non-churning clients", 
                                                                        figsize=(15, 7));

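The same breakdown can also be drawn as a horizontal bar chart with the groups sorted by size:

In [None]:
# Count clients per (SeniorCitizen, gender) group, largest group first
sizes = long_non_churn_clients.groupby(['SeniorCitizen', 'gender']).size().sort_values(ascending=False)
labels = [f'{senior}, {gender}' for senior, gender in sizes.index]
sb.barplot(x=sizes.values, y=labels, label="Total", color="b");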

## Possible further research

### Limitations 

The data imposes some limitations on the analysis. Since there is no timestamp for a client's life with the company, we cannot say with certainty when a client churned. Thus, we might have clients who churned at different periods of time and for different reasons (behavioural, economic, etc.). Also, we have no data on client satisfaction, so we cannot really say whether a client who has not churned yet is going to stay with the company or leave soon.

If the data set had included information about competitors and any indication of which services the churned clients used afterwards, the analysis would have been more fruitful. In that case, it would have been possible to identify the reasons for churning in more depth.

Unfortunately, we also lack information on the region where the data was gathered. With it, the insights from the analysis could be used to characterise the telecom market in that location and serve as a basis for further recommendations to telecom companies.

### Other areas of research 

The dataset can be used to identify customer segments that are prone to churn, as well as to understand how various groups of customers use telecommunication services. It can also be used to identify the services that bring the most monetary value to the company, for example by looking at the clients who tend to have the highest bills. Using this data, the company can identify the profile of its AAA clients and find areas for improvement to serve these top clients better.
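
As a starting point for that kind of follow-up, the sketch below compares the average monthly bill across a service group and selects the highest-paying clients. It runs on invented records (`toy` is hypothetical; `PhoneService` and `MonthlyCharges` are column names used earlier in the notebook), and the 20% cut-off is an arbitrary assumption rather than a definition of AAA clients.

```python
import pandas as pd

# Invented records; the column names follow the data set, the values do not.
toy = pd.DataFrame({
    'PhoneService': ['Yes', 'Yes', 'No', 'Yes', 'No'],
    'MonthlyCharges': [80.0, 100.0, 30.0, 90.0, 40.0],
})

# Average monthly bill per service group, highest first.
avg_bill = toy.groupby('PhoneService')['MonthlyCharges'].mean().sort_values(ascending=False)

# Clients in the top 20% of monthly charges (the 20% cut-off is an arbitrary choice).
cutoff = toy['MonthlyCharges'].quantile(0.8)
top_clients = toy[toy['MonthlyCharges'] >= cutoff]
```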