# **ThinkHumble Assignment** : Customer Churn Prediction
* Name - Ajay
* University - VIT Bhopal University
* email id - ajaysinghpoonia805@gmail.com

## Tasks List:
1. Data Generation
2. Exploratory Data Analysis
3. Data Preprocessing
4. Feature Engineering
5. Model Building
6. Model Selection and Evaluation
7. Model Deployment

## 1. Data Generation

**Key note to keep in mind from the assignment**
* *Data Quality:* Introduce specific data quality issues such as missing values, outliers, or inconsistencies.

**Tasks in data generation**
* Generate a synthetic dataset of 5000 customer records containing the following features:
    * `CustomerID`
    * `Age`
    * `Gender`
    * `ContractType` (Month-to-month, One year, Two year)
    * `MonthlyCharges`
    * `TotalCharges`
    * `TechSupport`
    * `InternetService` (DSL, Fiber optic, No)
    * `Tenure`
    * `PaperlessBilling`
    * `PaymentMethod`
    * `Churn` (Yes/No)
* Introduce realistic distributions, correlations, and outliers to the data.
* Ensure a target churn rate of approximately 20%.
* Create derived features like `average_monthly_charges`, `customer_lifetime_value`

* The `faker` package is used to generate the data and make it more realistic, but only for the features `CustomerID`, `Age`, `Gender`, `ContractType`, `TechSupport`, `InternetService`, `PaperlessBilling` and `PaymentMethod`.
* To make numerical values such as `MonthlyCharges` and `Tenure` more realistic, they are drawn with `random` from `numpy`.

### 1.1 Import Libraries

In [3]:
import pandas as pd
import numpy as np
from faker import Faker
from sklearn.utils import shuffle

### 1.2 Generate a Dictionary for data

In [29]:
# initialise faker
fake = Faker()

# set random seed
Faker.seed(42)
np.random.seed(42)

# initialise the values according to the given tasks
no_data = 5000
churn_rate = 0.2 # 20% as a fraction is 0.2
null_percentage = 0.05 # 5% as a fraction is 0.05

# Functions to make the process more readable and the data more realistic
def gen_ContractType():
    return fake.random_element(elements=['Month-to-month', 'One year', 'Two year'])

def gen_MonthlyCharges(contract_type):
    if contract_type == 'Month-to-month':
        return np.random.uniform(30, 100)
    elif contract_type == 'One year':
        return np.random.uniform(40, 90)
    else:  # Two year
        return np.random.uniform(50, 80)

def gen_TotalCharges(monthly_charges, tenure):
    return monthly_charges * tenure

def gen_Tenure():
    return np.random.randint(1, 73)  # 1 month to 6 years

def gen_Churn():
    return np.random.choice(['Yes', 'No'], p=[churn_rate, 1 - churn_rate])

# define a dictionary that maps each feature name to a list of generated values
data = {
    'CustomerID': [fake.uuid4() for _ in range(no_data)],
    'Age': [fake.random_int(min=18, max=70) for _ in range(no_data)],
    'Gender': [fake.random_element(elements=['Male', 'Female']) for _ in range(no_data)],
    'ContractType': [gen_ContractType() for _ in range(no_data)],
    'TechSupport': [fake.random_element(elements=['Yes', 'No']) for _ in range(no_data)],
    'InternetService': [fake.random_element(elements=['DSL', 'Fiber optic', 'No']) for _ in range(no_data)],
    'Tenure': [gen_Tenure() for _ in range(no_data)],
    'PaperlessBilling': [fake.random_element(elements=['Yes', 'No']) for _ in range(no_data)],
    'PaymentMethod': [fake.random_element(elements=['UPI', 'check/DD', 'IMPS/NEFT', 'Card']) for _ in range(no_data)],
}

# Generate MonthlyCharges and TotalCharges features with correlations
data['MonthlyCharges'] = [gen_MonthlyCharges(contract) for contract in data['ContractType']]
data['TotalCharges'] = [gen_TotalCharges(data['MonthlyCharges'][i], data['Tenure'][i]) for i in range(no_data)]
data['Churn'] = [gen_Churn() for _ in range(no_data)]

# All features are now generated; next, convert the dictionary into a DataFrame for further operations
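In the cell above, `Churn` is sampled independently of every other feature, so it carries no relationship to `ContractType` or the charges. A minimal sketch of one way to tie churn to the contract type while keeping the average rate near 20% is shown here. The relative risk weights are hypothetical values chosen for illustration, not part of the assignment, and the sketch uses NumPy's `default_rng` instead of the global seed used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 5000
target_rate = 0.2

contracts = rng.choice(['Month-to-month', 'One year', 'Two year'], size=n_rows)

# Hypothetical relative churn risk per contract type (assumption for illustration)
risk_weights = {'Month-to-month': 3.0, 'One year': 1.5, 'Two year': 1.0}
weights = np.array([risk_weights[c] for c in contracts])

# Rescale the weights so that the mean churn probability equals the target rate
churn_prob = np.clip(weights * target_rate / weights.mean(), 0.0, 1.0)

# Draw one Bernoulli outcome per customer
churn = np.where(rng.random(n_rows) < churn_prob, 'Yes', 'No')

overall_rate = (churn == 'Yes').mean()
rate_by_contract = {c: (churn[contracts == c] == 'Yes').mean() for c in risk_weights}
print(overall_rate, rate_by_contract)
```

With this design the overall rate stays close to the target, while month-to-month customers churn more often than customers on two-year contracts.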

### 1.3 Convert the dictionary into a DataFrame

In [30]:
data_df = pd.DataFrame(data)
data_df.head()

Unnamed: 0,CustomerID,Age,Gender,ContractType,TechSupport,InternetService,Tenure,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,bdd640fb-0667-4ad1-9c80-317fa3b1799d,54,Male,Two year,No,Fiber optic,52,No,UPI,79.68516,4143.62832,Yes
1,23b8c1e9-3924-46de-beb1-3b9046685257,57,Female,Two year,No,Fiber optic,15,Yes,Card,70.529427,1057.941412,No
2,bd9c66b3-ad3c-4d6d-9a3d-1fa7bc8960a9,62,Female,One year,No,Fiber optic,72,Yes,check/DD,87.449033,6296.330401,No
3,972a8469-1641-4f82-8b9d-2434e465e150,66,Female,One year,Yes,DSL,61,No,UPI,47.127828,2874.797493,No
4,17fc695a-07a0-4a6e-8822-e8f36c031199,69,Male,Month-to-month,No,Fiber optic,21,No,IMPS/NEFT,56.749763,1191.745028,Yes


In [31]:
data_df.shape

(5000, 12)

Now we have **12** features and **5000** rows, but we still need the **2** derived features.

### 1.4 Derived Features `average_monthly_charges` and `customer_lifetime_value`

In [32]:
data_df['average_monthly_charges'] = data_df['TotalCharges'] / data_df['Tenure']
data_df['customer_lifetime_value'] = data_df['TotalCharges']
data_df.head()

Unnamed: 0,CustomerID,Age,Gender,ContractType,TechSupport,InternetService,Tenure,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,average_monthly_charges,customer_lifetime_value
0,bdd640fb-0667-4ad1-9c80-317fa3b1799d,54,Male,Two year,No,Fiber optic,52,No,UPI,79.68516,4143.62832,Yes,79.68516,4143.62832
1,23b8c1e9-3924-46de-beb1-3b9046685257,57,Female,Two year,No,Fiber optic,15,Yes,Card,70.529427,1057.941412,No,70.529427,1057.941412
2,bd9c66b3-ad3c-4d6d-9a3d-1fa7bc8960a9,62,Female,One year,No,Fiber optic,72,Yes,check/DD,87.449033,6296.330401,No,87.449033,6296.330401
3,972a8469-1641-4f82-8b9d-2434e465e150,66,Female,One year,Yes,DSL,61,No,UPI,47.127828,2874.797493,No,47.127828,2874.797493
4,17fc695a-07a0-4a6e-8822-e8f36c031199,69,Male,Month-to-month,No,Fiber optic,21,No,IMPS/NEFT,56.749763,1191.745028,Yes,56.749763,1191.745028
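Because `TotalCharges` was generated as `MonthlyCharges * Tenure`, the derived `average_monthly_charges` matches `MonthlyCharges` at this stage, as the output above shows. A small check on a toy frame (values copied from the first two rows above) illustrates this. The frame and variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame built the same way as the notebook data
toy = pd.DataFrame({
    'MonthlyCharges': [79.68516, 70.529427],
    'Tenure': [52, 15],
})
toy['TotalCharges'] = toy['MonthlyCharges'] * toy['Tenure']

# Same derived features as in the cell above
toy['average_monthly_charges'] = toy['TotalCharges'] / toy['Tenure']
toy['customer_lifetime_value'] = toy['TotalCharges']

# Compare with a tolerance, since floating point division is not exact
same_as_monthly = np.allclose(toy['average_monthly_charges'], toy['MonthlyCharges'])
print(same_as_monthly)
```

The two columns start to differ only once outliers are injected into `MonthlyCharges` and `TotalCharges` later on.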


### 1.5 Check the 20% churn rate and adjust it if needed

In [33]:
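# Share of 'Yes' labels produced by the random sampling
actual_churn_rate = data_df['Churn'].value_counts(normalize=True).get('Yes', 0)
if actual_churn_rate != churn_rate:
    # Number of rows that must be 'Yes' to hit the target rate exactly
    num_churn = int(no_data * churn_rate)
    # Reset every label to 'No', then mark the first num_churn rows as 'Yes'
    # (the rows are shuffled in section 1.6)
    data_df.loc[data_df['Churn'] == 'Yes', 'Churn'] = 'No'
    data_df.loc[data_df.index[:num_churn], 'Churn'] = 'Yes'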
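The adjustment above overwrites all sampled labels and relies on the later shuffle to spread the churners. As an alternative, a minimal sketch that flips only as many randomly chosen labels as needed could look like this. The helper name `adjust_churn_rate` and the demo frame are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def adjust_churn_rate(df, target_rate, seed=42):
    """Flip just enough randomly chosen labels to hit the target churn rate."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_target = int(round(len(out) * target_rate))
    yes_idx = out.index[out['Churn'] == 'Yes'].to_numpy()
    no_idx = out.index[out['Churn'] == 'No'].to_numpy()
    diff = n_target - len(yes_idx)
    if diff > 0:
        # Too few churners: turn some 'No' rows into 'Yes'
        out.loc[rng.choice(no_idx, size=diff, replace=False), 'Churn'] = 'Yes'
    elif diff < 0:
        # Too many churners: turn some 'Yes' rows into 'No'
        out.loc[rng.choice(yes_idx, size=-diff, replace=False), 'Churn'] = 'No'
    return out

# Demo on randomly sampled labels
demo = pd.DataFrame({
    'Churn': np.random.default_rng(0).choice(['Yes', 'No'], size=1000, p=[0.2, 0.8])
})
adjusted = adjust_churn_rate(demo, 0.2)
adjusted_rate = (adjusted['Churn'] == 'Yes').mean()
changed_rows = int((adjusted['Churn'] != demo['Churn']).sum())
print(adjusted_rate, changed_rows)
```

This keeps most of the originally sampled labels intact, which matters if churn is later made to depend on other features.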

### 1.6 Introduce outliers, shuffle the data and add some null values

In [35]:
# set random seed
Faker.seed(42)
np.random.seed(42)

# Shuffle the DataFrame
data_df = shuffle(data_df, random_state=42)

# Adding a few extreme outliers in MonthlyCharges and TotalCharges
num_outliers = 100
outlier_indices = np.random.choice(data_df.index, num_outliers, replace=False)
data_df.loc[outlier_indices, 'MonthlyCharges'] *= np.random.uniform(3, 10, num_outliers)
data_df.loc[outlier_indices, 'TotalCharges'] *= np.random.uniform(2, 5, num_outliers)

# Introduce null values: first pick a pool of rows (null_percentage of the data)
num_nulls = int(no_data * null_percentage)
null_indices = np.random.choice(data_df.index, num_nulls, replace=False)

# Nullify values in specific columns; each column gets num_nulls // 3 nulls
# drawn independently from that pool, so the same row may be hit more than once
data_df.loc[np.random.choice(null_indices, num_nulls // 3, replace=False), 'MonthlyCharges'] = np.nan
data_df.loc[np.random.choice(null_indices, num_nulls // 3, replace=False), 'TotalCharges'] = np.nan
data_df.loc[np.random.choice(null_indices, num_nulls // 3, replace=False), 'TechSupport'] = np.nan

data_df.head(10)

Unnamed: 0,CustomerID,Age,Gender,ContractType,TechSupport,InternetService,Tenure,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,average_monthly_charges,customer_lifetime_value
2115,8e2ac3ff-3c5d-4a42-8dd8-af09451b6267,46,Female,Two year,Yes,Fiber optic,13,No,IMPS/NEFT,240.490678,2913.2913,No,54.670778,710.720113
1940,96f01335-c5ab-48da-a435-4021392a9fc4,46,Female,Two year,No,Fiber optic,45,No,UPI,211.898122,9096.153535,No,62.531257,2813.906565
641,4082cbb9-48bd-4b3e-a2f5-df2badaa44ca,36,Male,One year,No,No,4,Yes,check/DD,489.409449,1032.705208,Yes,70.123459,280.493834
2980,4d4f6814-d385-4240-92a2-f8390acfc6ac,32,Male,One year,No,No,18,Yes,Card,562.296484,6463.180569,No,74.371843,1338.693175
4861,eb1313b7-95d1-4b6d-ae9a-e156325cba4e,58,Male,One year,No,DSL,50,No,check/DD,189.032991,4747.929909,No,42.237572,2111.878585
931,186b2880-ab54-4a15-a69d-01ff1634725b,30,Female,Month-to-month,Yes,DSL,52,Yes,check/DD,662.831613,13805.572524,Yes,96.129517,4998.734869
4743,0b070a68-c8f2-47a3-a221-3776427ae472,66,Female,Two year,,No,66,Yes,Card,559.65152,17820.700454,No,62.214929,4106.185311
3193,4648bbe8-f84c-47a9-b6cf-d2d52cd6e8df,43,Female,Month-to-month,No,No,11,No,IMPS/NEFT,272.65774,778.577633,No,30.824591,339.070505
2281,9399ad15-a37c-41c5-842b-266b1b02fa24,41,Male,One year,Yes,Fiber optic,54,Yes,UPI,387.311794,4907.239113,No,40.806232,2203.536545
4219,7f702973-7aca-4b76-a2ec-f870bb15894a,68,Female,Two year,Yes,Fiber optic,36,Yes,IMPS/NEFT,419.60667,8378.737389,No,75.452532,2716.291145


In [36]:
data_df.shape

(5000, 14)
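In the null-injection step of section 1.6, the pool holds `int(5000 * 0.05) = 250` rows and each column receives `250 // 3 = 83` missing values per run, which is roughly 1.7% of the rows per column. If the intent is for every column to reach the configured fraction, one option is to sample rows independently per column. The sketch below assumes that intent. The helper name `inject_nulls` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

def inject_nulls(df, columns, fraction, seed=42):
    """Set `fraction` of the rows to NaN in each listed column, sampled per column."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    n_nulls = int(len(out) * fraction)
    for col in columns:
        rows = rng.choice(out.index.to_numpy(), size=n_nulls, replace=False)
        out.loc[rows, col] = np.nan
    return out

# Toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    'MonthlyCharges': np.linspace(30, 100, 1000),
    'TechSupport': ['Yes', 'No'] * 500,
})
with_nulls = inject_nulls(toy, ['MonthlyCharges', 'TechSupport'], 0.05)

# Fraction of missing values per column
null_share = with_nulls.isna().mean()
print(null_share)
```

Here every listed column ends up with exactly `int(len(df) * fraction)` missing values, and the original frame is left unchanged.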

In [37]:
data_df = data_df.reset_index()
data_df.head()

Unnamed: 0,index,CustomerID,Age,Gender,ContractType,TechSupport,InternetService,Tenure,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,average_monthly_charges,customer_lifetime_value
0,2115,8e2ac3ff-3c5d-4a42-8dd8-af09451b6267,46,Female,Two year,Yes,Fiber optic,13,No,IMPS/NEFT,240.490678,2913.2913,No,54.670778,710.720113
1,1940,96f01335-c5ab-48da-a435-4021392a9fc4,46,Female,Two year,No,Fiber optic,45,No,UPI,211.898122,9096.153535,No,62.531257,2813.906565
2,641,4082cbb9-48bd-4b3e-a2f5-df2badaa44ca,36,Male,One year,No,No,4,Yes,check/DD,489.409449,1032.705208,Yes,70.123459,280.493834
3,2980,4d4f6814-d385-4240-92a2-f8390acfc6ac,32,Male,One year,No,No,18,Yes,Card,562.296484,6463.180569,No,74.371843,1338.693175
4,4861,eb1313b7-95d1-4b6d-ae9a-e156325cba4e,58,Male,One year,No,DSL,50,No,check/DD,189.032991,4747.929909,No,42.237572,2111.878585


In [38]:
data_df = data_df.drop("index", axis=1)
data_df.head()

Unnamed: 0,CustomerID,Age,Gender,ContractType,TechSupport,InternetService,Tenure,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn,average_monthly_charges,customer_lifetime_value
0,8e2ac3ff-3c5d-4a42-8dd8-af09451b6267,46,Female,Two year,Yes,Fiber optic,13,No,IMPS/NEFT,240.490678,2913.2913,No,54.670778,710.720113
1,96f01335-c5ab-48da-a435-4021392a9fc4,46,Female,Two year,No,Fiber optic,45,No,UPI,211.898122,9096.153535,No,62.531257,2813.906565
2,4082cbb9-48bd-4b3e-a2f5-df2badaa44ca,36,Male,One year,No,No,4,Yes,check/DD,489.409449,1032.705208,Yes,70.123459,280.493834
3,4d4f6814-d385-4240-92a2-f8390acfc6ac,32,Male,One year,No,No,18,Yes,Card,562.296484,6463.180569,No,74.371843,1338.693175
4,eb1313b7-95d1-4b6d-ae9a-e156325cba4e,58,Male,One year,No,DSL,50,No,check/DD,189.032991,4747.929909,No,42.237572,2111.878585


In [39]:
data_df.shape

(5000, 14)

**Now that the data is ready, let's save it to a CSV file so that we can use it later.**

### 1.7 Save the data in CSV format

In [40]:
data_df.to_csv('Customer_churn_data.csv', index=False)
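When a frame with missing values is written using `to_csv`, the `NaN` entries become empty fields, and `read_csv` turns them back into `NaN` on load. A small round-trip sketch on a toy frame illustrates this. It uses an in-memory buffer instead of a file, and the toy values are made up for illustration.

```python
import io
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'CustomerID': ['a1', 'b2', 'c3'],
    'MonthlyCharges': [79.7, np.nan, 56.7],
    'TechSupport': ['No', np.nan, 'Yes'],
})

# Write to an in-memory buffer without the index column, as in the cell above
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Read the same text back
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.isna().sum())
```

Passing `index=False` avoids an extra unnamed index column when the file is loaded again.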

## 2. Exploratory Data Analysis

**Tasks in EDA:**
* Perform in-depth EDA to understand the dataset characteristics.
* Calculate summary statistics for numerical columns.
* Analyze categorical data distributions.
* Visualize relationships between features and the target variable (churn).
* Identify potential correlations and patterns.
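
The EDA tasks listed above can be sketched with a few pandas calls. This is a minimal illustration on a small made-up frame, not the analysis of the real dataset. The `churn_flag` column is a helper introduced here so that churn can be averaged and correlated.

```python
import pandas as pd

# Small made-up sample with the same column names as the generated data
toy = pd.DataFrame({
    'ContractType': ['Month-to-month', 'One year', 'Two year',
                     'Month-to-month', 'Two year', 'One year'],
    'MonthlyCharges': [79.7, 70.5, 87.4, 47.1, 56.7, 62.0],
    'Tenure': [52, 15, 72, 61, 21, 4],
    'Churn': ['Yes', 'No', 'No', 'No', 'Yes', 'No'],
})

# Summary statistics for numerical columns
summary = toy.describe()

# Distribution of a categorical column
contract_share = toy['ContractType'].value_counts(normalize=True)

# Relationship between a feature and the target: churn rate per contract type
toy['churn_flag'] = (toy['Churn'] == 'Yes').astype(int)
churn_by_contract = toy.groupby('ContractType')['churn_flag'].mean()

# Pairwise correlations between numerical features and the churn flag
corr = toy[['MonthlyCharges', 'Tenure', 'churn_flag']].corr()

print(summary, contract_share, churn_by_contract, corr, sep='\n\n')
```

On the real data, the same calls would be applied to the frame loaded from `Customer_churn_data.csv`.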