**1. Intro**

This dataset contains information about Telco customers (gender, relationship status, etc.). Each row is a unique customer, and the last column indicates whether the client left the company within the last month (churn).

**Basic info:**

* number of rows: 7043
* number of columns: 21

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
sns.set(style="white")


import plotly
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.figure_factory as ff

from plotly.offline import plot, iplot, init_notebook_mode
init_notebook_mode(connected=True)

df = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')

## Data processing

In [None]:
df.head()

In [None]:
df.tail()

Let's look for null values:

In [None]:
df.isnull().sum()

At first glance there appear to be no null values, but that seemed too good to be true. On further inspection I found that TotalCharges contains blank values stored as strings, so I'm going to replace them with 0 and convert the column to float:

In [None]:
df['TotalCharges'] = df['TotalCharges'].str.replace(' ', '0').astype('float32')
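
A more defensive variant of this conversion is sketched below, assuming the blanks are whitespace-only strings. It uses `pd.to_numeric` with `errors='coerce'`, so the affected rows can be counted before they are filled. The toy Series is hypothetical stand-in data, not the real column.

```python
import pandas as pd

# Hypothetical stand-in for df['TotalCharges']: numbers stored as strings,
# with whitespace-only entries where the value is missing.
total_charges = pd.Series(['29.85', ' ', '1889.5', ' '])

# Anything that cannot be parsed as a number becomes NaN
converted = pd.to_numeric(total_charges, errors='coerce')
n_missing = int(converted.isna().sum())

# Fill the blanks with 0, mirroring the replacement used above
cleaned = converted.fillna(0).astype('float32')
print(n_missing, cleaned.tolist())
```

Unlike a plain string replacement, this only touches entries that fail to parse, and it leaves a count of how many rows were affected.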

**Churns in last month - Pie chart**

In [None]:
fig = go.Figure(data=[go.Pie(labels=['No churn', 'Churn'],
                             values=df['Churn'].value_counts().values)])

fig.update_traces(textinfo='value', textfont_size=20,
                  marker=dict(colors=['lime', 'khaki'], 
                  line=dict(color='black', width=3)))

fig.update_layout(
    height=600, width=600, title_text='Churn and no churn - Pie chart',
    title_x=0.5,
    
    font=dict(
            family="Times New Roman",
            size=18,
            color="black"),
    
    legend=dict(
            orientation="h",
            yanchor="bottom",
            y=-0.2,
            xanchor="right",
            x=0.75)
)

fig

The pie chart above shows that 26.5% of customers churned within the last month.
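
The chart is configured with `textinfo='value'`, so it labels slices with raw counts. A minimal sketch of getting the percentage directly, using a hypothetical toy column in place of `df['Churn']`:

```python
import pandas as pd

# Hypothetical toy column standing in for df['Churn']
churn = pd.Series(['No', 'No', 'Yes', 'No'])

# normalize=True returns fractions instead of counts
share = churn.value_counts(normalize=True) * 100
print(share.round(1))
```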

In [None]:
df.info()

There are 3 numerical columns that may be interesting for this EDA:

* tenure
* MonthlyCharges
* TotalCharges

In [None]:
def kde_plot(column_name, data_frame=df):
    """Show a KDE plot of column_name, split by churn.

       arguments: column_name: str,
                  data_frame: pandas DataFrame
       returns: None (the figure is displayed)"""
    
    #  in case the type is not str
    if not isinstance(column_name, str):
        raise TypeError(f'Expected str type, got: {type(column_name)}')
    
    #  in case the type is correct but the column is not in the data frame
    if column_name not in data_frame.columns:
        raise ValueError('column not in DataFrame!')
    
    churn = data_frame[data_frame['Churn'] == 'Yes']
    no_churn = data_frame[data_frame['Churn'] == 'No']

    fig  = ff.create_distplot([churn[column_name], no_churn[column_name]], group_labels = ['Churn', 'No Churn'],
                              bin_size = 7, show_hist=False, show_rug=False, 
                              colors=['rgb(0, 0, 101)','rgb(0,200,200)' ])
    fig.update_layout(width=800, height=400,
                     title=dict(text=f'KDE for {column_name}', x=0.5),
                     font=dict(family='Times New Roman', size=14),
                     legend=dict(bgcolor='lightblue', borderwidth=3),
                      plot_bgcolor='white'
                     )
    fig.update_xaxes(showgrid=False, showline=True, linewidth=2, linecolor='black')
    fig.update_yaxes(showgrid=False, showline=True, linewidth=2, linecolor='black')
    fig.show()

    
columns_to_plot = ['tenure', 'MonthlyCharges', 'TotalCharges']

for column in columns_to_plot:
    kde_plot(column)

The following conclusions can be drawn from the plots above:
* Recent clients are more likely to churn
* Clients who paid less at the beginning and then started paying more per month were more likely to churn
* Clients whose total charges were higher are more likely to churn
* MonthlyCharges is an important feature

**ECDF plot** of total charges, monthly charges and tenure:

In [None]:
def ecdf_plot(column_name:str, data_frame=df, color='Churn'):
    """Show an ECDF plot of column_name, one curve per value of color.

       arguments: column_name: str,
                  data_frame: pandas DataFrame,
                  color: str - default Churn
       returns: None (the figure is displayed)"""
    
    #  in case the type is not str
    if not isinstance(column_name, str):
        raise TypeError(f'Expected str type, got: {type(column_name)}')
    
    #  in case the type is correct but the column is not in the data frame
    if column_name not in data_frame.columns or color not in data_frame.columns:
        raise ValueError('column not in DataFrame!')
    
    #  plot the frame that was passed in, not the global df
    fig = px.ecdf(data_frame, column_name, color=color,
                 color_discrete_sequence=['rgb(0, 0, 101)','rgb(0,200,200)' ])
    
    fig.update_layout(width=800, height=400,
                     title=dict(text=f'{column_name}', x=0.5),
                     font=dict(family='Times New Roman', size=14),
                     legend=dict(bgcolor='lightblue', borderwidth=3),
                      plot_bgcolor='white')
    
    
    fig.show()
    

for column in columns_to_plot:
    ecdf_plot(column)
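
For reference, an ECDF pairs each sorted observation with its rank divided by the number of observations. The helper below is a minimal NumPy sketch of that idea; the name `ecdf` and the tenure values are hypothetical, and it is not meant to reproduce `px.ecdf` exactly.

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and their cumulative fractions (rank / n)."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

# Hypothetical tenure values (months)
x, y = ecdf([1, 5, 5, 24, 72])
print(x, y)
```

Reading the curve at a given x tells you what fraction of customers have a value at or below x, which makes the churn and no-churn groups easy to compare without choosing a bin size.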

Scatter plots of TotalCharges and MonthlyCharges against tenure:

In [None]:
fig = px.scatter(df, x='TotalCharges', y='tenure', color='Churn',
          width=600, height=450, color_discrete_sequence=['khaki', 'lime'])

fig.update_layout(title=dict(text='Total charges and tenure', x=0.5),
                            font=dict(family='Times New Roman', size=18),
                 plot_bgcolor='white')

fig.show()

fig = px.scatter(df, x='MonthlyCharges', y='tenure', color='Churn',
          width=600, height=450,color_discrete_sequence=['khaki', 'lime'])

fig.update_layout(title=dict(text='Monthly Charges and tenure', x=0.5),
                            font=dict(family='Times New Roman', size=18),
                 plot_bgcolor='white')

fig.show()

As we can see, the points fall within some clear boundaries.

There are two quantities I'm going to calculate:

* **total charges to tenure ratio**: total charges divided by tenure
* **monthly charges difference**: monthly charges minus the **total charges to tenure ratio** (the one above)

In [None]:
df['total_charges_to_tenure_ratio'] = df['TotalCharges'] / df['tenure']
df['monthly_charges_diff'] = df['MonthlyCharges'] - df['total_charges_to_tenure_ratio']
df['monthly_charges_diff'] = np.nan_to_num(df['monthly_charges_diff'])
kde_plot('monthly_charges_diff')
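
A small worked example of the two calculations, on hypothetical customers, shows what the sign of the difference means and why `np.nan_to_num` is needed: a customer with tenure 0 produces 0 / 0, which pandas evaluates to NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical customers; the first one has tenure 0 and no charges yet
toy = pd.DataFrame({'tenure': [0, 2, 10],
                    'MonthlyCharges': [50.0, 60.0, 80.0],
                    'TotalCharges': [0.0, 110.0, 820.0]})

# Average charge per month so far; 0 / 0 yields NaN for the new customer
ratio = toy['TotalCharges'] / toy['tenure']

# Positive: current charge is above the historical average; negative: below it
diff = toy['MonthlyCharges'] - ratio

# NaN (from tenure == 0) is replaced by 0
diff_filled = np.nan_to_num(diff)
print(list(diff_filled))
```

Here the second customer pays 5 more per month than their average so far, while the third pays 2 less.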

Not many patterns are visible in this chart, but the feature may be useful when combined with categorical features.

### 3. Categorical Features

This dataset has 16 categorical features:
* Six binary (Yes/No)
* Nine features with three unique values each
* One feature with four unique values
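
A breakdown like the one above can be checked with `nunique`. The sketch below uses a hypothetical miniature frame with three categorical columns; on the real `df`, the ID and numerical columns would have to be excluded first.

```python
import pandas as pd

# Hypothetical miniature of the dataset with three categorical columns
toy = pd.DataFrame({
    'Partner': ['Yes', 'No', 'Yes', 'No'],
    'Contract': ['Month-to-month', 'One year', 'Two year', 'One year'],
    'PaymentMethod': ['Electronic check', 'Mailed check',
                      'Bank transfer (automatic)', 'Credit card (automatic)'],
})

# Number of unique values in each column
unique_counts = toy.nunique()

# How many columns share each unique-value count
columns_per_count = unique_counts.value_counts().sort_index()
print(columns_per_count)
```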

### 3.1 Age and churn (SeniorCitizen column)

In [None]:
map_table = {1: 'Yes',
            0: 'No'}
df['SeniorCitizen'] = df['SeniorCitizen'].map(map_table)

In [None]:
def percentage_barplot(column_name:str, data_frame=df, churn_column='Churn', orientation='v'):
    """arguments: column_name:str, 
                  data_frame:pandas DataFrame, 
                  churn_column:str - default Churn,  
                  orientation:str - default v (vertical), 
                              set h to change to horizontal
                  
        returns: None; shows a bar plot grouped by column_name 
                 and churn, as a percentage of all rows in data_frame"""
    
    #  in case the type is not str
    for argument in (column_name, churn_column, orientation):
        if not isinstance(argument, str):
            raise TypeError(f'Expected str type, got: {type(argument)}')
    
    #  in case the type is correct but the column is not in the data frame
    if column_name not in data_frame.columns or churn_column not in data_frame.columns:
        raise ValueError('column not in DataFrame!')
    
    #  if orientation is not v or h
    if orientation not in ['v', 'h']:
        raise ValueError('Choose vertical: v or horizontal: h orientation!')
        
    #  count customers in each (column_name, churn) group
    group_by_churn_and_column_name = data_frame.groupby([column_name, churn_column]).count().reset_index()[[column_name, churn_column,'customerID']]
    group_by_churn_and_column_name.rename(columns={'customerID': 'count'}, inplace=True)
    #  use the size of the frame that was passed in, not the global df
    group_by_churn_and_column_name['percentage'] = group_by_churn_and_column_name['count'] / data_frame.shape[0] * 100
    
    if orientation == 'v': 
        fig = px.bar(group_by_churn_and_column_name, x=column_name, y='percentage',
            color=churn_column, color_discrete_sequence = ['khaki', 'lime'],barmode='group')
        fig.update_layout(yaxis=dict(ticksuffix='%'))
    else:
        fig = px.bar(group_by_churn_and_column_name, y=column_name, x='percentage',
            color=churn_column, color_discrete_sequence = ['khaki', 'lime'],barmode='group')
        fig.update_layout(xaxis=dict(ticksuffix='%'))
    
    fig.update_layout(width=500, height=600,font=dict(family='Times New Roman', size=14),
                 title=dict(font=dict(size=22), 
                            text=f'{column_name}',x=.5,), 
                  plot_bgcolor='white', 
                  legend=dict(bgcolor='white', borderwidth=3,title = 'Churn:')
                  )

    fig.show()

In [None]:
percentage_barplot('SeniorCitizen')
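
The bars drawn by `percentage_barplot` are shares of the whole customer base, so a group's churn rate has to be read as the ratio of its two bars. A minimal sketch of computing the within-group rate directly with `pd.crosstab` and `normalize='index'`, on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data: 2 seniors (1 churned), 4 non-seniors (1 churned)
toy = pd.DataFrame({
    'SeniorCitizen': ['Yes', 'Yes', 'No', 'No', 'No', 'No'],
    'Churn':         ['Yes', 'No', 'Yes', 'No', 'No', 'No'],
})

# normalize='index' makes every row (group) sum to 1
rates = pd.crosstab(toy['SeniorCitizen'], toy['Churn'], normalize='index') * 100
print(rates)
```

In this toy frame the senior group is half the size of the non-senior group but has twice the churn rate, which is the kind of contrast that is hard to see from shares of the total alone.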

### 3.2 Age and gender

In [None]:
# encode churn as 1/0 so that the mean of the column is the churn rate
df['ChurnRate'] = df['Churn'].map({'Yes': 1, 'No': 0})
g = sns.FacetGrid(df, col='SeniorCitizen', height=4, aspect=1,
                 sharex=False, sharey=False, ylim=(0,0.5))
g.map(sns.barplot, 'gender', 'ChurnRate')
g.set_axis_labels('gender', 'churn rate')
g.set_titles(col_template='Senior: {col_name}')
plt.show()

Conclusions on **gender, age and churn**:

* gender does not have any impact on churn rate
* seniors make up only 16% of all clients, but the churn rate in this group is high: 42%, well above that of non-senior clients

### 3.3 Internet services

There are 3 unique values for the internet service:
* [DSL](https://en.wikipedia.org/wiki/Digital_subscriber_line)
* [Fiber optic](https://en.wikipedia.org/wiki/Optical_fiber) 
* No (possibly only a phone or TV subscription without internet)

In [None]:
percentage_barplot('InternetService',orientation='h')

* **Fiber optic** has the **highest churn rate**, even though it seems to be the fastest and most popular internet service nowadays

* **DSL** subscribers are **less likely** to churn than fiber optic users

* **The lowest churn rate** is among customers with **no internet service**

### 3.4 Partner and dependents

In [None]:
percentage_barplot('Partner', orientation='h')

In [None]:
percentage_barplot('Dependents', orientation='h')

* customers who have no partner are more likely to churn
* customers without dependents are more likely to churn

### 3.5 Phone and multiple lines

In [None]:
percentage_barplot('MultipleLines', orientation='h')

**Conclusions:**
* the plot above shows that clients with multiple lines have almost the same churn rate as clients without them
* few customers have no phone service

In [None]:
fig = px.violin(df, x='MultipleLines', y='MonthlyCharges', color='Churn',
         color_discrete_sequence=['rgb(0, 0, 101)','rgb(0,200,200)'],box=True )
fig.update_layout(width=800, height=400,
                 title=dict(text= 'MultipleLines and churn',x=0.5, 
                           font=dict(size=24)),
                 font=dict(family='Times New Roman', size=14),
                 legend=dict(bgcolor='lightblue', borderwidth=3),
                 plot_bgcolor='white')

fig

The following conclusions can be drawn from the plot above:

* there is no clear pattern for users without phone service
* users without multiple lines whose monthly charges are above 60 dollars are more likely to churn
* users with multiple lines and monthly charges above 60 dollars are very likely to churn
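
The 60 dollar threshold read off the violin plot can also be checked numerically. The sketch below shows one way to do that with a grouped mean; the data is hypothetical and says nothing about the real rates, and the column names `charge_band` and `churned` are made up for the example.

```python
import pandas as pd

# Hypothetical toy customers
toy = pd.DataFrame({
    'MultipleLines':  ['Yes', 'Yes', 'Yes', 'No', 'No', 'No'],
    'MonthlyCharges': [95.0, 80.0, 55.0, 75.0, 45.0, 50.0],
    'Churn':          ['Yes', 'Yes', 'No', 'Yes', 'No', 'No'],
})

# Label each customer by which side of the 60 dollar threshold they fall on
toy['charge_band'] = (toy['MonthlyCharges'] > 60).map(
    {True: 'above 60', False: '60 or below'})
toy['churned'] = (toy['Churn'] == 'Yes').astype(int)

# The mean of a 0/1 column is the churn rate of each group
rates = toy.groupby(['MultipleLines', 'charge_band'])['churned'].mean()
print(rates)
```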

### 3.6 Additional services

There are 6 additional services for clients:
* online security
* online backup
* device protection
* tech support
* streaming tv
* streaming movies

In [None]:
columns_to_plot = ['OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
                   'StreamingTV', 'StreamingMovies']
customers_with_internet = df[df['InternetService'] != 'No']
for column in columns_to_plot:
    percentage_barplot(column, data_frame=customers_with_internet)

Conclusions:
* customers without technical support are more likely to churn than customers who have this kind of support
* streaming services do not seem to affect the churn rate, so they are of little use for predicting churn

## 4. Payment methods and contracts

### 4.1 Paperless billing methods

In [None]:
g = sns.FacetGrid(data=df, col='PaperlessBilling', aspect=1.5)
g.map(sns.barplot, 'Contract', 'ChurnRate')
g.set_titles(col_template='Paper Billings: {col_name}')
plt.show()

In [None]:
percentage_barplot('PaymentMethod')

The following conclusions can be drawn from the plots above:
* customers with paperless billing are more likely to churn
* electronic check has the highest churn rate of all payment methods
* customers on short-term contracts are more likely to churn

In [None]:
fig = px.box(df, x='Contract', y='MonthlyCharges', color='Churn',
         color_discrete_sequence=['rgb(0, 0, 101)','rgb(0,200,200)'] )
fig.update_layout(width=800, height=400,
                 title=dict(text= 'Contract and monthly charges with churn',x=0.5, 
                           font=dict(size=24)),
                 font=dict(family='Times New Roman', size=14),
                 legend=dict(bgcolor='lightblue', borderwidth=3),
                 plot_bgcolor='white')

fig

In every contract type, customers who churned had higher monthly charges than customers who stayed. Monthly charges on one-year and two-year contracts are also higher than on month-to-month contracts.

In [None]:
g = sns.FacetGrid(data=df, col='PaymentMethod',aspect=1.5)
g.map_dataframe(sns.boxplot, x='MonthlyCharges', y='Churn')
g.set_titles(col_template='Payment method: {col_name}')
plt.show()

The mailed check payment method shows a really big gap in monthly charges between customers who churned and those who did not.
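
A gap like this can be put into numbers by comparing medians. The sketch below does so with `groupby` and `unstack` on hypothetical data, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: two payment methods, two churned and two retained each
toy = pd.DataFrame({
    'PaymentMethod':  ['Mailed check'] * 4 + ['Electronic check'] * 4,
    'Churn':          ['Yes', 'Yes', 'No', 'No'] * 2,
    'MonthlyCharges': [70.0, 60.0, 25.0, 35.0, 85.0, 75.0, 70.0, 80.0],
})

# Median monthly charge per payment method and churn status
medians = (toy.groupby(['PaymentMethod', 'Churn'])['MonthlyCharges']
              .median()
              .unstack('Churn'))

# Gap between churned and retained customers for each method
medians['gap'] = medians['Yes'] - medians['No']
print(medians)
```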

## 5. Final conclusions

Numerical features are good predictors of churn. Senior customers are very likely to churn, as are customers with fiber optic internet. Customers with long-term contracts churn less often. On the other hand, gender and streaming services do not seem to have any impact on churn rate.