## Customer Churn Prediction - Telco




## 1.0 Business Understanding
### 1.1 Introduction
Customer churn is a significant problem in the telecom industry because it reduces profit margins and undermines long-term sustainability.
Churn, which refers to customers discontinuing their service and moving to a competitor, can be driven by various factors such as pricing, customer service quality, network coverage, and the competitiveness of offerings. The implications of high churn rates are multifaceted:

- Reduced Profit Margin: Acquiring new customers often costs more than retaining existing ones due to marketing expenses, promotional offers, and the operational costs of setting up new accounts. When customers leave, the company not only loses the revenue these customers would have generated but also the investment made in acquiring them.

- Investment Recovery: Telecommunications companies make significant upfront investments in infrastructure and customer acquisition. Customer longevity is crucial for recovering these investments. High churn rates shorten the average customer lifespan, jeopardizing the return on these investments.

- Brand Reputation: High churn rates can signal dissatisfaction, potentially damaging the company's reputation. This perception can make it more challenging to attract new customers and retain existing ones.

- Operational Efficiency: High churn rates can lead to inefficiencies in resource allocation and operations. Companies may find themselves in a constant cycle of trying to replace lost customers, diverting resources from improving services and innovating.

### 1.2 Project Objective
`Problem Statement:`

`Goal:` To build a machine learning model that can predict churn

Null Hypothesis: There is no significant correlation between pricing and customer churn

Alternate Hypothesis: There is a statistically significant correlation between pricing and customer churn
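
One way to test the hypotheses above is to compare the monthly charges of churned and retained customers. The sketch below runs a two-sided Mann-Whitney U test on synthetic data. The choice of test, the 0.05 significance level, and the generated numbers are all assumptions for illustration, not results from the project data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic stand-in for MonthlyCharges; NOT taken from the project data.
rng = np.random.default_rng(42)
churned_charges = rng.normal(loc=80, scale=10, size=200)
retained_charges = rng.normal(loc=60, scale=10, size=200)

ALPHA = 0.05  # assumed significance level

# Two-sided test: do the two groups come from the same distribution?
result = mannwhitneyu(churned_charges, retained_charges, alternative="two-sided")
reject_null = result.pvalue < ALPHA

print(f"U = {result.statistic:.1f}, p = {result.pvalue:.3g}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

On the real data, the two arrays would come from filtering `MonthlyCharges` by the value of the `Churn` column.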

#### Analytical Questions
1. Does pricing affect churn?
2.  ?
3. Which feature has the strongest correlation with churn?
4. ?
5. ?
6. ?

#### Stakeholders
- Telco
- Vodafone

## 2.0 Data Understanding 🔍
The data for this project is in CSV format. The columns in the data are described below.

- **Gender**: Whether the customer is male or female

- **SeniorCitizen**: Whether a customer is a senior citizen or not

- **Partner**: Whether the customer has a partner or not (Yes, No)

- **Dependents**: Whether the customer has dependents or not (Yes, No)

- **Tenure**: Number of months the customer has stayed with the company

- **PhoneService**: Whether the customer has a phone service or not (Yes, No)

- **MultipleLines**: Whether the customer has multiple lines or not

- **InternetService**: The customer's type of internet service (DSL, Fiber optic, No)

- **OnlineSecurity**: Whether the customer has online security or not (Yes, No, No internet service)

- **OnlineBackup**: Whether the customer has online backup or not (Yes, No, No internet service)

- **DeviceProtection**: Whether the customer has device protection or not (Yes, No, No internet service)

- **TechSupport**: Whether the customer has tech support or not (Yes, No, No internet service)

- **StreamingTV**: Whether the customer has streaming TV or not (Yes, No, No internet service)

- **StreamingMovies**: Whether the customer has streaming movies or not (Yes, No, No internet service)

- **Contract**: The contract term of the customer (Month-to-Month, One year, Two year)

- **PaperlessBilling**: Whether the customer has paperless billing or not (Yes, No)

- **PaymentMethod**: The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))

- **MonthlyCharges**: The amount charged to the customer monthly

- **TotalCharges**: The total amount charged to the customer

- **Churn**: Whether the customer churned or not (Yes or No)
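
As a quick sanity check, the column list above can be compared against a loaded DataFrame before any analysis. This is a minimal sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the matching ignores case and spaces because the spelling in the actual file may differ from this list.

```python
import pandas as pd

# Column names taken from the list above; the real file may spell them differently.
EXPECTED_COLUMNS = [
    "Gender", "SeniorCitizen", "Partner", "Dependents", "Tenure",
    "PhoneService", "MultipleLines", "InternetService", "OnlineSecurity",
    "OnlineBackup", "DeviceProtection", "TechSupport", "StreamingTV",
    "StreamingMovies", "Contract", "PaperlessBilling", "PaymentMethod",
    "MonthlyCharges", "TotalCharges", "Churn",
]


def _normalise(name):
    """Lower-case a column name and drop spaces so 'Phone Service' matches 'PhoneService'."""
    return name.replace(" ", "").lower()


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df."""
    present = {_normalise(col) for col in df.columns}
    return [col for col in expected if _normalise(col) not in present]


# Example with an empty frame that only has three of the expected columns
example = pd.DataFrame(columns=["gender", "tenure", "Churn"])
print(missing_columns(example))
```

An empty result means every documented column was found, so a non-empty list is a prompt to check the file or the column list.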




### 2.1 Prerequisites

- Install the necessary packages

In [2]:
# Install necessary packages in quiet mode

%pip install --quiet numpy scipy pandas matplotlib seaborn plotly pyodbc python-dotenv scikit-learn imbalanced-learn catboost lightgbm xgboost

Note: you may need to restart the kernel to use updated packages.


- Import the required packages

In [2]:
# Environment variables
from dotenv import dotenv_values 

# Microsoft Open Database Connectivity (ODBC) library
import pyodbc 

# Data handling
import numpy as np
import pandas as pd 
               
# Visualization
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

# Statistical tests
from scipy.stats import kruskal, mannwhitneyu
from itertools import combinations

# Feature Processing
# from imblearn.over_sampling import SMOTE
from sklearn.impute import SimpleImputer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2  # Univariate Selection using KBest
from sklearn.model_selection import (
    cross_val_score,
    KFold,
    train_test_split,
    RandomizedSearchCV,
)  # explicit names instead of a wildcard import
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer

# Modelling
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
from xgboost import XGBClassifier
from sklearn.ensemble import AdaBoostClassifier

# Evaluation metrics (fbeta_score and make_scorer live in sklearn.metrics, not model_selection)
from sklearn import metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, fbeta_score,
    confusion_matrix, classification_report, make_scorer,
)

# Set pandas to display all columns
pd.set_option("display.max_columns", None)

# Suppress the scientific notation
pd.set_option("display.float_format", "{:.2f}".format)

# Disable warnings               
import warnings
warnings.filterwarnings('ignore')

# Other packages
import os
import pickle

print("🛬 Imported all packages.", "Warnings hidden. 👻")

