## Project Description
Dive into India's telecom sector to analyze customer churn. Utilize pandas and machine learning to study datasets from top telecom firms, revealing demographic and usage patterns. Predict customer retention, merging data analysis and predictive modeling to sharpen your data science expertise.

![Cartoon of telecom customers](IMG_8811.png)


The telecommunications (telecom) sector in India is changing rapidly: new telecom businesses keep appearing, and many customers switch between providers. "Churn" refers to customers or subscribers who stop using a company's services or products. Understanding the factors that influence customer retention, and using them to predict churn, is crucial for telecom companies that want to improve service quality and customer satisfaction. As the data scientist on this project, you will explore customer behavior and demographics in the Indian telecom sector and predict customer churn using two datasets that cover four major telecom partners (Airtel, Reliance Jio, Vodafone, and BSNL):

- `telecom_demographics.csv` contains information related to Indian customer demographics:

| Variable             | Description                                                 |
|----------------------|-------------------------------------------------------------|
| `customer_id`        | Unique identifier for each customer.                        |
| `telecom_partner`    | The telecom partner associated with the customer.           |
| `gender`             | The gender of the customer.                                 |
| `age`                | The age of the customer.                                    |
| `state`              | The Indian state in which the customer is located.          |
| `city`               | The city in which the customer is located.                  |
| `pincode`            | The pincode of the customer's location.                     |
| `registration_event` | When the customer registered with the telecom partner.      |
| `num_dependents`     | The number of dependents (e.g., children) the customer has. |
| `estimated_salary`   | The customer's estimated salary.                            |

- `telecom_usage.csv` contains information about the usage patterns of Indian customers:

| Variable   | Description                                                  |
|------------|--------------------------------------------------------------|
| `customer_id` | Unique identifier for each customer.                         |
| `calls_made` | The number of calls made by the customer.                    |
| `sms_sent`   | The number of SMS messages sent by the customer.             |
| `data_used`  | The amount of data used by the customer.                     |
| `churn`    | Binary variable indicating whether the customer has churned or not (1 = churned, 0 = not churned).|


In [1]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
# Categorical columns are encoded with pd.get_dummies() below, so OneHotEncoder is not imported
from sklearn.preprocessing import StandardScaler

In [2]:
demographics = pd.read_csv('telecom_demographics.csv')
usage = pd.read_csv('telecom_usage.csv')

In [3]:
demographics.head(2)

Unnamed: 0,customer_id,telecom_partner,gender,age,state,city,pincode,registration_event,num_dependents,estimated_salary
0,15169,Airtel,F,26,Himachal Pradesh,Delhi,667173,2020-03-16,4,85979
1,149207,Airtel,F,74,Uttarakhand,Hyderabad,313997,2022-01-16,0,69445


In [4]:
usage.head(2)

Unnamed: 0,customer_id,calls_made,sms_sent,data_used,churn
0,15169,75,21,4532,1
1,149207,35,38,723,1


In [5]:
churn_df = usage.merge(demographics, on='customer_id', how='left')
churn_df.head()

Unnamed: 0,customer_id,calls_made,sms_sent,data_used,churn,telecom_partner,gender,age,state,city,pincode,registration_event,num_dependents,estimated_salary
0,15169,75,21,4532,1,Airtel,F,26,Himachal Pradesh,Delhi,667173,2020-03-16,4,85979
1,149207,35,38,723,1,Airtel,F,74,Uttarakhand,Hyderabad,313997,2022-01-16,0,69445
2,148119,70,47,4688,1,Airtel,F,54,Jharkhand,Chennai,549925,2022-01-11,2,75949
3,187288,95,32,10241,1,Reliance Jio,M,29,Bihar,Hyderabad,230636,2022-07-26,3,34272
4,14016,66,23,5246,1,Vodafone,M,45,Nagaland,Bangalore,188036,2020-03-11,4,34157
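
The left merge above assumes that each `customer_id` appears exactly once in both tables. A minimal sketch of how the `validate` and `indicator` parameters of `DataFrame.merge` could check that assumption; the two small frames below are made-up stand-ins, not the real CSVs:

```python
import pandas as pd

# Toy stand-ins for the two tables (values are illustrative, not from the dataset)
usage_toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "calls_made": [75, 35, 70],
    "churn": [1, 0, 1],
})
demographics_toy = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "age": [26, 74, 54],
})

# validate raises MergeError if customer_id is duplicated on either side;
# indicator adds a _merge column showing where each row came from.
merged_toy = usage_toy.merge(
    demographics_toy,
    on="customer_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Rows flagged "left_only" have usage data but no demographic record.
unmatched = merged_toy[merged_toy["_merge"] == "left_only"]
print(unmatched["customer_id"].tolist())
```

In this toy case only customer 3 lacks a demographic record, so it is the one row flagged as unmatched.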


In [6]:
churn_df.isna().sum()

customer_id           0
calls_made            0
sms_sent              0
data_used             0
churn                 0
telecom_partner       0
gender                0
age                   0
state                 0
city                  0
pincode               0
registration_event    0
num_dependents        0
estimated_salary      0
dtype: int64

In [7]:
churn_rate = churn_df['churn'].value_counts() / len(churn_df)
print(churn_rate)

churn
0    0.799538
1    0.200462
Name: count, dtype: float64
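
With roughly 80% of customers not churning, a model that always predicts "no churn" already reaches about 80% accuracy, so any classifier's accuracy should be read against that baseline. A minimal sketch of such a baseline using scikit-learn's `DummyClassifier`; the labels below are synthetic and only mimic the 80/20 imbalance:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels with the same 80/20 imbalance (illustrative only)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_acc)
```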


In [8]:
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6500 entries, 0 to 6499
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   customer_id         6500 non-null   int64 
 1   calls_made          6500 non-null   int64 
 2   sms_sent            6500 non-null   int64 
 3   data_used           6500 non-null   int64 
 4   churn               6500 non-null   int64 
 5   telecom_partner     6500 non-null   object
 6   gender              6500 non-null   object
 7   age                 6500 non-null   int64 
 8   state               6500 non-null   object
 9   city                6500 non-null   object
 10  pincode             6500 non-null   int64 
 11  registration_event  6500 non-null   object
 12  num_dependents      6500 non-null   int64 
 13  estimated_salary    6500 non-null   int64 
dtypes: int64(9), object(5)
memory usage: 711.1+ KB


In [9]:
cat = churn_df.select_dtypes('object')
cat.head(1)

Unnamed: 0,telecom_partner,gender,state,city,registration_event
0,Airtel,F,Himachal Pradesh,Delhi,2020-03-16


In [11]:
df_encoded = pd.get_dummies(churn_df, columns=['telecom_partner', 'gender', 'state', 
                                             'city', 'registration_event'], dtype='int')
df_encoded.head()

Unnamed: 0,customer_id,calls_made,sms_sent,data_used,churn,age,pincode,num_dependents,estimated_salary,telecom_partner_Airtel,...,registration_event_2023-04-24,registration_event_2023-04-25,registration_event_2023-04-26,registration_event_2023-04-27,registration_event_2023-04-28,registration_event_2023-04-29,registration_event_2023-04-30,registration_event_2023-05-01,registration_event_2023-05-02,registration_event_2023-05-03
0,15169,75,21,4532,1,26,667173,4,85979,1,...,0,0,0,0,0,0,0,0,0,0
1,149207,35,38,723,1,74,313997,0,69445,1,...,0,0,0,0,0,0,0,0,0,0
2,148119,70,47,4688,1,54,549925,2,75949,1,...,0,0,0,0,0,0,0,0,0,0
3,187288,95,32,10241,1,29,230636,3,34272,0,...,0,0,0,0,0,0,0,0,0,0
4,14016,66,23,5246,1,45,188036,4,34157,0,...,0,0,0,0,0,0,0,0,0,0
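
One-hot encoding `registration_event` creates one column per distinct registration date, which makes the feature matrix very wide and sparse. An alternative, sketched here under the assumption that the dates are ISO-formatted (`YYYY-MM-DD`), is to derive a few numeric features from the date. The `reference_date` is a hypothetical snapshot date chosen for illustration, and the column names `registration_year`, `registration_month` and `tenure_days` are not part of the original notebook:

```python
import pandas as pd

# Toy frame with dates in the same format as registration_event
toy = pd.DataFrame({
    "registration_event": ["2020-03-16", "2022-01-16", "2022-07-26"],
})

reg_date = pd.to_datetime(toy["registration_event"], format="%Y-%m-%d")

# Hypothetical reference date; in practice this would be the data snapshot date.
reference_date = pd.Timestamp("2023-05-03")

toy["registration_year"] = reg_date.dt.year
toy["registration_month"] = reg_date.dt.month
toy["tenure_days"] = (reference_date - reg_date).dt.days

# The raw string column is no longer needed once the numeric features exist.
toy = toy.drop(columns="registration_event")
print(toy)
```

This replaces the per-date dummy columns with three numeric ones. Whether that improves the models on the real data is not something this sketch establishes.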


In [18]:
features = df_encoded.drop(['customer_id', 'churn'], axis=1)
target = df_encoded['churn']

In [20]:
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)

In [21]:
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=42)
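
The cells above fit the scaler on the full dataset and split afterwards, so test-set statistics influence the scaling used for training. A sketch of the alternative order: split first (with `stratify` to keep the churn ratio similar in both parts), then let a pipeline fit the scaler on the training data only. The data comes from `make_classification` and is a synthetic stand-in, not the telecom dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the churn data (about 80% class 0)
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Split first; stratify keeps the class ratio similar in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# The pipeline fits the scaler on the training data only,
# then applies the same transform to the test data at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```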

In [22]:
print(f'Shape of X_train: {X_train.shape}')
print(f'Shape of y_train: {y_train.shape}')
print(f'Shape of X_test: {X_test.shape}')
print(f'Shape of y_test: {y_test.shape}')

Shape of X_train: (5200, 1263)
Shape of y_train: (5200,)
Shape of X_test: (1300, 1263)
Shape of y_test: (1300,)


In [23]:
logreg = LogisticRegression(random_state=42)
logreg.fit(X_train, y_train)

In [24]:
logreg_pred = logreg.predict(X_test)

In [25]:
# scikit-learn expects (y_true, y_pred): rows are actual classes, columns are predictions
print(confusion_matrix(y_test, logreg_pred), '\n')
print(classification_report(y_test, logreg_pred))

[[911 116]
 [243  30]] 

              precision    recall  f1-score   support

           0       0.79      0.89      0.84      1027
           1       0.21      0.11      0.14       273

    accuracy                           0.72      1300
   macro avg       0.50      0.50      0.49      1300
weighted avg       0.67      0.72      0.69      1300
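
The logistic regression identifies only a small share of the customers who actually churned. One common option for imbalanced targets, which the original notebook does not use, is `class_weight="balanced"`. The sketch below compares churn-class recall with and without it on synthetic data from `make_classification`; it makes no claim about the effect on the telecom dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Same model twice; the second reweights classes inversely to their frequency.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# recall_score takes (y_true, y_pred) and reports recall for class 1 by default
recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"Recall (churn class), unweighted: {recall_plain:.2f}")
print(f"Recall (churn class), balanced:   {recall_weighted:.2f}")
```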



In [26]:
r_forest = RandomForestClassifier(random_state=42)
r_forest.fit(X_train, y_train)

In [27]:
rf_pred = r_forest.predict(X_test)

In [28]:
# scikit-learn expects (y_true, y_pred): rows are actual classes, columns are predictions
print(confusion_matrix(y_test, rf_pred), '\n')
print(classification_report(y_test, rf_pred))

[[1024    3]
 [ 273    0]] 

              precision    recall  f1-score   support

           0       0.79      1.00      0.88      1027
           1       0.00      0.00      0.00       273

    accuracy                           0.79      1300
   macro avg       0.39      0.50      0.44      1300
weighted avg       0.62      0.79      0.70      1300
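
`cross_val_score` is imported at the top of the notebook but never used. A sketch of how it could give an estimate that depends less on a single train/test split, scored with F1 on the churn class rather than accuracy. The data is synthetic, and `n_estimators=50` is chosen only to keep the example fast:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the churn data
X_toy, y_toy = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=42
)

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
forest = RandomForestClassifier(n_estimators=50, random_state=42)

# scoring="f1" evaluates the positive (churn) class in each fold
f1_scores = cross_val_score(forest, X_toy, y_toy, cv=cv, scoring="f1")
print(f"F1 per fold: {f1_scores.round(2)}")
print(f"Mean F1: {f1_scores.mean():.2f}")
```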



In [29]:
higher_accuracy = "RandomForest"
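
The answer above is typed in by hand. A small sketch of deriving it from the accuracies in the two classification reports (0.72 for logistic regression, 0.79 for the random forest); the dictionary name `accuracies` is introduced here for illustration, and in the notebook the values could instead come from `sklearn.metrics.accuracy_score`:

```python
# Accuracies read off the classification reports above (rounded to two decimals)
accuracies = {"LogisticRegression": 0.72, "RandomForest": 0.79}

# Pick the model name whose accuracy is largest
higher_accuracy = max(accuracies, key=accuracies.get)
print(higher_accuracy)
```

Both values sit at or below the roughly 0.80 share of non-churners in the data, so accuracy alone says little about how well either model detects churn.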