## **`Business Problem`: A subscription-based service company wants to predict which customers are likely to cancel their subscriptions so they can take proactive measures to retain them.**

## **`Dataset`: [Telco Customer Churn Prediction Kaggle dataset](https://www.kaggle.com/datasets/blastchar/telco-customer-churn?resource=download)** 

In [69]:
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from scipy import stats

import warnings
warnings.filterwarnings('ignore')

In [13]:
# reading the dataset
df = pd.read_csv('../../data/raw/Telco-Customer-Churn.csv')

In [14]:
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [15]:
# showing the last 5 rows of the dataset
df.tail()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.8,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.2,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.6,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.4,306.6,Yes
7042,3186-AJIEK,Male,0,No,No,66,Yes,No,Fiber optic,Yes,...,Yes,Yes,Yes,Yes,Two year,Yes,Bank transfer (automatic),105.65,6844.5,No


In [16]:
# checking number of rows and columns available in the dataset
df.shape

(7043, 21)

In [17]:
# for a "`Series`", `size` returns the number of elements (rows)
# for a DataFrame it returns the number of rows "`multiplied`" by the number of columns
df.size

147903
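
As a quick check of the comment above, here is a minimal sketch on a small hypothetical frame (the `toy` data is illustrative and not taken from the dataset). It shows that `size` is rows times columns for a DataFrame and the element count for a Series. For the real dataset the same arithmetic gives 7043 × 21 = 147903.

```python
import pandas as pd

# Hypothetical 3x2 frame, used only to illustrate the attribute
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

rows, cols = toy.shape
frame_size = toy.size        # total number of cells: rows * cols
series_size = toy["a"].size  # number of elements in a single column

print(frame_size, series_size)
```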

In [18]:
df.ndim

2

In [None]:
# listing all the columns
df.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn'],
      dtype='object')

In [20]:
# checking for null values
df.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

In [21]:
# checking for duplicates
df.duplicated().sum()

np.int64(0)

In [27]:
# checking for detailed info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [43]:
# extracting numerical feature names
numerical_features = df.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f'Numerical features are: {numerical_features}')

Numerical features are: ['SeniorCitizen', 'tenure', 'MonthlyCharges']


In [71]:
# extracting categorical feature names
categorical_features = df.select_dtypes(include=['object', 'category']).columns.tolist()

print(f'Categorical features are: {categorical_features}')

Categorical features are: ['customerID', 'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'TotalCharges', 'Churn']


In [None]:
# statistical analysis on numerical features
df.describe()

Unnamed: 0,SeniorCitizen,tenure,MonthlyCharges
count,7043.0,7043.0,7043.0
mean,0.162147,32.371149,64.761692
std,0.368612,24.559481,30.090047
min,0.0,0.0,18.25
25%,0.0,9.0,35.5
50%,0.0,29.0,70.35
75%,0.0,55.0,89.85
max,1.0,72.0,118.75


In [36]:
# statistical analysis on categorical features
df.describe(include='object')

Unnamed: 0,customerID,gender,Partner,Dependents,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,TotalCharges,Churn
count,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043,7043.0,7043
unique,7043,2,2,2,2,3,3,3,3,3,3,3,3,3,2,4,6531.0,2
top,3186-AJIEK,Male,No,No,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,20.2,No
freq,1,3555,3641,4933,6361,3390,3096,3498,3088,3095,3473,2810,2785,3875,4171,2365,11.0,5174
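
The summary above lists `TotalCharges` among the object columns with 6531 unique values, which suggests the amounts were read as text. A minimal sketch of one way to convert such a column, assuming the cause is a few non-numeric entries such as blank strings (that cause is an assumption, and the `toy` frame below is hypothetical):

```python
import pandas as pd

# Hypothetical sample; the blank string stands in for a possible non-numeric entry
toy = pd.DataFrame({"TotalCharges": ["29.85", "1889.5", " ", "108.15"]})

# errors="coerce" turns anything that cannot be parsed into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Entries that failed to parse now show up as missing values
n_missing = int(toy["TotalCharges"].isna().sum())
print(toy["TotalCharges"].dtype, n_missing)
```

If the same conversion were applied to `df`, re-running `df.isnull().sum()` afterwards would show whether any rows failed to parse.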


In [41]:
# checking whether the target column is balanced or imbalanced
total_records_in_dataset = len(df)
total_number_of_a_class_in_target_feature = (df['Churn'] == 'No').sum()

another_class_in_target_column = total_records_in_dataset - total_number_of_a_class_in_target_feature

print(f"In target feature class A: {total_number_of_a_class_in_target_feature} and class B: {another_class_in_target_column}")
print('So this is considered as `Imbalanced dataset`')

In target feature class A: 5174 and class B: 1869
So this is considered as `Imbalanced dataset`


In [67]:
df['Churn'].value_counts()

Churn
No     5174
Yes    1869
Name: count, dtype: int64
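
To put the counts above in proportion, here is a minimal sketch that rebuilds the target column from the reported counts and derives class shares and inverse-frequency weights. The weighting formula `n_samples / (n_classes * class_count)` is one common heuristic, shown as an illustration and not as the approach this notebook commits to.

```python
import pandas as pd

# Rebuild the target column from the counts reported by value_counts()
churn = pd.Series(["No"] * 5174 + ["Yes"] * 1869, name="Churn")

counts = churn.value_counts()
shares = churn.value_counts(normalize=True)  # fraction of rows per class

# Inverse-frequency weights: n_samples / (n_classes * class_count)
class_weights = (len(churn) / (len(counts) * counts)).to_dict()

print(shares.round(3).to_dict())
print(class_weights)
```

The minority class (`Yes`) receives the larger weight, which is the usual way to counter a model's tendency to favour the majority class.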

In [158]:
# styling plots backgrounds
plt.style.available

plt.style.use( 'tableau-colorblind10' )

In [159]:
# find the outlier in statistical and visualized way
# find the skewness and kurtosis
# find the spread around the mean (standard deviation)
# find the covariance
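
The TODO items above could be approached along these lines. This is a minimal sketch on hypothetical values: the `charges` series is made up for illustration and includes one deliberately extreme value. It uses the 1.5 × IQR rule for outliers, which is one common choice and not necessarily the method intended here, together with pandas' `skew()`, `kurt()`, and `std()`.

```python
import pandas as pd

# Hypothetical monthly charges with one extreme value added for illustration
charges = pd.Series([20.0, 35.5, 50.0, 70.35, 89.85, 95.0, 400.0])

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1, q3 = charges.quantile(0.25), charges.quantile(0.75)
iqr = q3 - q1
outliers = charges[(charges < q1 - 1.5 * iqr) | (charges > q3 + 1.5 * iqr)]

skewness = charges.skew()  # > 0 means a longer right tail
kurtosis = charges.kurt()  # excess kurtosis (0 for a normal distribution)
spread = charges.std()     # sample standard deviation around the mean

print(outliers.tolist(), skewness, kurtosis, spread)
```

For the covariance between numeric columns, `DataFrame.cov()` is the analogous call.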

## **Observations**:
1. There are a total of `7043 records (rows)` and `21 features (columns)` in the dataset.
2. There are no `Null` or `Duplicate` values in the dataset (`isnull().sum()` and `duplicated().sum()` both return 0).

3. These are the columns available in the dataset with info:
    - `customerID` - Customer unique ID
    - `gender` - Whether the customer is `male` or `female`
    - `SeniorCitizen` - Whether the customer is a senior citizen or not `(1, 0)`
    - `Partner` - Whether the customer has a partner or not `(Yes, No)`
    - `Dependents` - Whether the customer has dependents or not `(Yes, No)` - people who are relying on them
    - `tenure` - Number of months the customer has `stayed` with the `company`
    - `PhoneService` - Whether the customer has a phone service or not `(Yes, No)`
    - `MultipleLines` - Whether the customer has multiple lines or not `(Yes, No, No phone service)`
    - `InternetService` - Customer’s internet service provider `(DSL, Fiber optic, No)`
    - `OnlineSecurity` - Whether the customer has online security or not `(Yes, No, No internet service)`
    - `OnlineBackup` - Whether the customer has online backup or not `(Yes, No, No internet service)`
    - `DeviceProtection` - Whether the customer has device protection or not `(Yes, No, No internet service)`
    - `TechSupport` - Whether the customer has tech support or not `(Yes, No, No internet service)`
    - `StreamingTV` - Whether the customer has streaming TV or not `(Yes, No, No internet service)`
    - `StreamingMovies` - Whether the customer has streaming movies or not `(Yes, No, No internet service)`
    - `Contract` - The contract term of the customer `(Month-to-month, One year, Two year)`
    - `PaperlessBilling` - Whether the customer has paperless billing or not `(Yes, No)`
    - `PaymentMethod` - The customer’s payment method `(Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))`
    - `MonthlyCharges` - The amount `charged to the customer monthly`
    - `TotalCharges` - The `total amount charged to the customer`
    - `Churn`  - Whether the customer churned or not `(Yes or No)` - "`Target Column`"

4. By dtype, the dataset has `3 numerical features` and `18 categorical features` (`TotalCharges` is stored as `object` even though it holds amounts)
5. Numerical features are:
    - `SeniorCitizen`,
    - `tenure`,
    - `MonthlyCharges`
6. Categorical features are:
    - `customerID`, `gender`, `Partner`, `Dependents`, `PhoneService`, `MultipleLines`, `InternetService`, `OnlineSecurity`, `OnlineBackup`, `DeviceProtection`, `TechSupport`, `StreamingTV`, `StreamingMovies`, `Contract`, `PaperlessBilling`, `PaymentMethod`, `TotalCharges`, `Churn`
7. The `count` of every feature is `7043`, which confirms that there are `no null values`
8. 1. **SeniorCitizen** column/feature - binary class (two values: 0, 1):
        - About `16.2%` of the `customers` in the company are `senior citizens` (mean of 0.162).
        - For a binary column the standard deviation (`0.37`) adds little information, because it is fully determined by the mean and does not describe an independent spread.
        - Minimum value - 0
        - All the percentiles (25, 50, 75) are `0`, because most of the customers in the company are `not senior citizens`.
        - Maximum value - 1
   2. **Tenure** - How long the customer has been with the company
        - Customers have been using the company's services for `32.4 months on average`.
        - Customer tenure varies quite a bit `(±24.6 months)` from the average
        - Some customers are `brand new (0 months)` to the company
        - A `quarter of customers have been with the service for 9 months or less`
        - `Half of customers` have been `with the service for 29 months or less`
        - `Three-quarters` of customers `have been with the service for 55 months or less`
        - The `longest-tenured customer` has been with the service for `72 months (6 years)`
   3. **MonthlyCharges**:
        - The `average monthly charge is about $64.76`
        - Monthly charges also `vary significantly (±$30.09)`
        - The `lowest monthly charge is $18.25`
        - A `quarter` of `customers pay $35.50 or less`
        - `Half` of customers `pay $70.35 or less`
        - `Three-quarters` of customers `pay $89.85 or less`
        - The `highest` monthly charge is `$118.75`
9. Most of the categorical features have only `2, 3 or 4` unique values.
10. The target feature/column is `Churn`, and it is `imbalanced`: of the total `7043` records, the `No` class has `5174` and the `Yes` class has `1869`. A model trained on this data without correction will tend to `favour the majority class` (`No` in our case).
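
For a 0/1 column such as `SeniorCitizen`, the standard deviation follows directly from the mean, which is why it carries little extra information. A short worked check using only the figures reported by `describe()` (the variable names are illustrative):

```python
import math

n = 7043      # rows in the dataset
p = 0.162147  # mean of SeniorCitizen reported by describe()

# For a 0/1 column the population variance is p * (1 - p);
# pandas reports the sample version (ddof=1), hence the n / (n - 1) factor
sample_std = math.sqrt(p * (1 - p) * n / (n - 1))

print(round(sample_std, 4))
```

The result agrees with the `std` of about 0.3686 shown in the `describe()` output.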

# **TODO:- Univariate, Bivariate, Multivariate Analysis**