**CUSTOMER CHURN PREDICTION ANALYSIS**

**Hypothesis**

Female customers churn at a higher rate than male customers.

**Questions**

What is the split between customers who churned and customers who did not?

What is the churn rate by gender?

**Importing libraries and reading csv file**

In [24]:
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
import plotly.express as px
import warnings
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
init_notebook_mode(connected=True)

In [25]:
data = pd.read_csv(r"C:\Users\selas\OneDrive\Desktop\Telco-Customer-Churn.csv")
data.head(3)

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes


So far there appear to be no null values. However, empty or whitespace-only cells are not counted as nulls, so let's check for them and replace them with NaN.

In [26]:
data = data.replace(r'^\s*$', np.nan, regex=True)
data.isnull().sum()

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

Now we can see that the TotalCharges column has 11 null values.

Let's fill the missing values with the median of the total charges.
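To show why the replacement step matters, here is a minimal sketch on a toy frame. The values are hypothetical and are not taken from the dataset; only the regex is the same as the one used above. A cell holding a single space is not reported as missing until it is converted to NaN.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): one TotalCharges cell holds a single space.
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200.0"],
})

# Before the replacement, the blank string is not counted as missing.
missing_before = int(toy["TotalCharges"].isnull().sum())

# Same regex as above: empty or whitespace-only cells become NaN.
toy_clean = toy.replace(r"^\s*$", np.nan, regex=True)
missing_after = int(toy_clean["TotalCharges"].isnull().sum())

print(missing_before, missing_after)
```

This is also why TotalCharges is read as an object column: one non-numeric string is enough to stop pandas from parsing the column as floats.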

In [27]:
# Convert TotalCharges from object (string) to numeric; unparseable values become NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
# Fill the missing values with the column median
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())
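The convert-then-fill step can be checked on a small series. This is only a sketch with made-up numbers: the blank string stands in for an empty cell, and the median is computed from the remaining values because `median()` skips NaN by default.

```python
import pandas as pd

# Hypothetical values; " " stands in for a blank TotalCharges cell.
charges = pd.Series(["100.0", " ", "300.0", "200.0"])

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(charges, errors="coerce")

# The median is taken over 100, 200 and 300 only, since NaN is skipped.
median_value = numeric.median()
filled = numeric.fillna(median_value)

print(median_value)
print(filled.tolist())
```

The median is a reasonable default for a right-skewed column such as TotalCharges, since it is less affected by a few very large values than the mean.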

Let's check for duplicate rows.

In [28]:
data.duplicated().sum()

0

There are no duplicates.
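For reference, `duplicated()` flags a row only when every column matches an earlier row. The toy frame below uses hypothetical values and shows what a non-zero count would look like and how it could be removed.

```python
import pandas as pd

# Hypothetical rows: the third row repeats the first one exactly.
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "A1"],
    "tenure": [5, 10, 5],
})

# Count rows that are exact copies of an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates()

print(n_dupes, len(deduped))
```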

In [29]:
data.shape

(7043, 21)

There are 7043 customers and 21 features in the dataset. The data has 17 categorical features, 3 numeric features (tenure, MonthlyCharges and TotalCharges) and the prediction target (Churn).
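One way to get such a breakdown programmatically is `select_dtypes`. The sketch below uses a small hypothetical frame rather than the real data. Note that a dtype-based split would count SeniorCitizen as numeric because it is stored as int64, whereas the tally above treats it as a 0/1 category.

```python
import pandas as pd

# Hypothetical two-row frame with a mix of text and numeric columns.
toy = pd.DataFrame({
    "gender": ["Female", "Male"],
    "Contract": ["Month-to-month", "One year"],
    "tenure": [1, 34],
    "MonthlyCharges": [29.85, 56.95],
})

# Split the columns by dtype: numbers on one side, everything else on the other.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```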

In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [31]:
data.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges,TotalCharges
count,7043.0,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692,2281.916928
std,0.368612,24.559481,30.090047,2265.270398
min,0.0,0.0,18.25,18.8
25%,0.0,9.0,35.5,402.225
50%,0.0,29.0,70.35,1397.475
75%,0.0,55.0,89.85,3786.6
max,1.0,72.0,118.75,8684.8


**Exploring the Target Variable**

In [32]:
# Count customers per churn class; naming the index and the count column explicitly
# keeps the column names stable across pandas versions
churn_distribute = data['Churn'].value_counts().rename_axis('Category').reset_index(name='Count')
fig = px.pie(churn_distribute, values='Count', names='Category', color_discrete_sequence=["green", "red"],
             title='Distribution of Churn')
fig.show()

Churn distribution in the dataset:

The data is imbalanced: customers who did not churn far outnumber those who did.

No - 73.5% (5174 of 7043 customers did not churn)

Yes - 26.5% (1869 of 7043 customers churned)
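These percentages follow directly from the counts. As a quick check, the sketch below rebuilds a series with the same class counts (5174 "No" and 1869 "Yes") and lets `value_counts(normalize=True)` compute the shares. The series is constructed for illustration and is not the real column.

```python
import pandas as pd

# Rebuild a Churn-like series from the counts reported above.
churn = pd.Series(["No"] * 5174 + ["Yes"] * 1869, name="Churn")

# normalize=True returns proportions instead of raw counts.
shares = (churn.value_counts(normalize=True) * 100).round(1)

print(len(churn))
print(shares.to_dict())
```

An imbalance of this size means a model that always predicts "No" would already be right about 73.5% of the time, so accuracy alone is a weak measure here.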

In [39]:
fig = make_subplots(rows=1, cols=2, specs=[[{'type':'domain'}, {'type':'domain'}]])

# Take labels and values from the same value_counts() result so they stay aligned;
# unique() returns values in order of appearance, value_counts() in order of frequency
gender_counts = data['gender'].value_counts()
fig.add_trace(go.Pie(labels=gender_counts.index, values=gender_counts.values, name='Gender',
                     marker_colors=['red', 'green']), 1, 1)

fig.update_traces(hole=0.5, textfont_size=20, marker=dict(line=dict(color='black', width=2)))

fig.update_layout(
    title_text='<b>Gender Distributions</b>',
    # Add annotations in the center of the donut pies.
    annotations=[dict(text='Gender', x=0.19, y=0.5, font_size=20, showarrow=False)])
iplot(fig)

With the labels aligned to the counts, the pie chart shows that the customers are split almost evenly by gender: about 50.5% male and 49.5% female.
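The gender split alone does not answer the second question, which asks for the churn rate within each gender. One possible way to compute it is a row-normalized crosstab, sketched here on a small hypothetical frame. The resulting rates are illustrative only and say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical customers: 4 female (2 churned) and 4 male (1 churned).
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Female",
               "Male", "Male", "Male", "Male"],
    "Churn": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No"],
})

# normalize="index" divides each row by its total, so every row sums to 1
# and the "Yes" column is the churn rate within that gender.
rates = pd.crosstab(toy["gender"], toy["Churn"], normalize="index")
female_rate = rates.loc["Female", "Yes"]
male_rate = rates.loc["Male", "Yes"]

print(rates)
```

Comparing the two "Yes" rates on the real data, rather than the raw counts, is what the hypothesis about female and male churn rates calls for.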