## Business Problem Understanding

### Context:
In the telecommunications industry, customer churn is a significant concern because it directly impacts revenue and profitability. The dataset contains customer profiles from a telco company, each labeled by whether the customer has churned. Understanding the factors that contribute to churn can help the company develop strategies to retain existing customers and improve overall customer satisfaction.

### Problem Statement:
The telco company wants to identify the customers who are likely to churn or leave the company. By understanding the characteristics and behavior patterns of these customers, the company can take proactive measures to prevent churn and implement targeted retention strategies.

### Goals:

- Develop a predictive model that can accurately identify customers who are likely to churn based on their profile information and service usage patterns.
- Gain insights into the key factors or variables that are closely associated with customer churn, such as tenure, services subscribed, contract type, monthly charges, and customer demographics.
- Use the insights and the predictive model to develop effective customer retention strategies and personalized offers or incentives to reduce churn.

### Analytic Approach:

- Perform exploratory data analysis (EDA) to understand the dataset, identify any missing or inconsistent data, and gain initial insights into potential relationships between features and customer churn.
- Preprocess the data by handling missing values, encoding categorical variables, and scaling numerical features, if necessary.
- Split the data into training and testing sets.
- Build and evaluate various machine learning classification models (e.g., logistic regression, decision trees, random forests, gradient boosting) to predict customer churn.
- Optimize the best-performing model(s) through techniques like hyperparameter tuning and ensemble methods.
- Interpret the model results to identify the most important features influencing customer churn.
- Validate the model's performance on the test set using appropriate evaluation metrics.
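
As a rough illustration of the feature-interpretation step listed above, the sketch below ranks features by a random forest's impurity-based importances. It uses synthetic data from `make_classification`, and the feature names (`feat_a` to `feat_e`) are hypothetical placeholders, not columns of the telco dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names; not the telco dataset
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e"]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Impurity-based importances are normalized to sum to 1; sort them from high to low
importances = (
    pd.Series(forest.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```

Impurity-based importances can favor high-cardinality features, so they are best read as a first indication rather than a definitive ranking.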

### Metric Evaluation:
The primary evaluation metric for this classification problem is the area under the receiver operating characteristic curve (ROC-AUC). ROC-AUC measures the model's ability to distinguish churners from non-churners across all classification thresholds.

Additionally, we can consider other relevant metrics, such as:
- Precision: The proportion of correctly identified churners among all predicted churners (to minimize false positives).
- Recall: The proportion of correctly identified churners among all actual churners (to minimize false negatives).
- F1-score: The harmonic mean of precision and recall, providing a balanced measure of model performance.

The choice of metric(s) will depend on the business priorities and the relative importance of minimizing false positives (incorrectly identifying non-churners as churners) or false negatives (missing potential churners).

By analyzing the model's performance using these metrics, the telco company can strike a balance between proactive customer retention efforts and efficient resource allocation.
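
To make the metric definitions above concrete, here is a small worked example. The labels are made up for illustration (ten hypothetical customers) and are not taken from the dataset.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels for ten customers (1 = churner, 0 = non-churner)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary 0/1 labels, confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / 5 = 0.60
recall = tp / (tp + fn)     # 3 / 4 = 0.75
f1 = 2 * precision * recall / (precision + recall)

# The scikit-learn helpers give the same values
assert abs(precision - precision_score(y_true, y_pred)) < 1e-12
assert abs(recall - recall_score(y_true, y_pred)) < 1e-12
assert abs(f1 - f1_score(y_true, y_pred)) < 1e-12
```

In this toy case the two false positives lower precision, while the single missed churner lowers recall.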

In this notebook, a Random Forest classifier is the main algorithm. The data is preprocessed, the minority class is oversampled with SMOTE, and feature selection is applied. The hyperparameters of the Random Forest classifier are then tuned with GridSearchCV.

### Importing Libraries

In [19]:
import pandas as pd
import numpy as np
import time
import sys
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer

### Data Reading


In [20]:
# Load Dataset
df = pd.read_csv("telcocucu.csv")

# Preprocess data
categorical_cols = ['Dependents', 'OnlineSecurity', 'OnlineBackup', 'InternetService', 'DeviceProtection', 'TechSupport', 'Contract', 'PaperlessBilling']
numerical_cols = ['tenure', 'MonthlyCharges']


The first line loads a CSV file into a pandas DataFrame. The next two lines list which columns contain categorical data and which contain numerical data. This distinction is needed for preprocessing, because categorical and numerical features are usually handled differently before training a machine learning model.
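
Before hard-coding the two column lists, it can help to check how pandas has typed each column and whether any values are missing. The sketch below does this on a toy frame. The values are hypothetical, and only the column names mirror those used above.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror those used above
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year"],
    "PaperlessBilling": ["Yes", "No", "Yes"],
    "tenure": [1, 48, 12],
    "MonthlyCharges": [70.5, 20.0, None],
})

# Columns pandas treats as non-numeric vs. numeric
inferred_categorical = toy.select_dtypes(exclude="number").columns.tolist()
inferred_numerical = toy.select_dtypes(include="number").columns.tolist()

# Number of missing values per column
missing_counts = toy.isna().sum()

print(inferred_categorical)
print(inferred_numerical)
print(missing_counts)
```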

### Encode Target Variable
This step encodes the target variable 'Churn' using a LabelEncoder, which assigns a numerical label to each category. This is necessary because many machine learning algorithms require the target variable to be numeric.

In [21]:
# Encode target variable
le = LabelEncoder()
y = le.fit_transform(df['Churn'])

# Split data into features and target
X = df.drop('Churn', axis=1)
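
A minimal sketch of what LabelEncoder does to a binary target is shown here. The values "Yes" and "No" are an assumption for illustration, since the actual labels in the 'Churn' column are not shown above.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values; the real 'Churn' column may use different labels
churn_values = ["No", "Yes", "No", "No", "Yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(churn_values)

# classes_ is sorted, so position 0 holds "No" and position 1 holds "Yes"
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Inspecting `classes_` this way confirms which integer stands for the positive (churn) class, which matters when reading precision and recall.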


### Check for string values in features and encode them:
This cell first checks whether the feature matrix X contains any string (object) columns. If it does, each such column is encoded with a LabelEncoder, which assigns an integer label to each category.

It then creates a categorical_transformer pipeline that one-hot encodes categorical features using OneHotEncoder. The handle_unknown='ignore' parameter ensures that categories seen only in the test set are encoded as all zeros instead of raising an error.

In [22]:
# Check for string values in X
string_cols = [col for col in X.columns if X[col].dtype == 'object']

# Encode string values in X
# (a fresh encoder is fitted per column so the target encoder `le` is not overwritten)
for col in string_cols:
    X[col] = LabelEncoder().fit_transform(X[col])

# Encode categorical features
categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
    ])
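
The effect of handle_unknown='ignore' can be seen in a small sketch. The category values below are hypothetical, and "Two year" is deliberately left out of the training data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test values for a single categorical column
train = pd.DataFrame({"Contract": ["Month-to-month", "One year", "Month-to-month"]})
test = pd.DataFrame({"Contract": ["One year", "Two year"]})  # "Two year" is unseen

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# transform returns a sparse matrix by default; toarray() makes it easy to inspect
encoded_test = encoder.transform(test).toarray()
print(encoded_test)
# First row: [0, 1] for "One year"; second row: [0, 0] for the unseen "Two year"
```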

### Define the numerical transformer and combine the transformers:
This line defines a numerical_transformer, which is set to 'passthrough' for now. If any preprocessing is required for numerical features, it can be defined here.

This step combines the categorical_transformer and numerical_transformer using the ColumnTransformer. The transformers parameter specifies which transformer to apply to which columns. categorical_cols and numerical_cols should be lists containing the names of categorical and numerical columns, respectively.

In [23]:
# Preprocess numerical features (if needed)
numerical_transformer = 'passthrough'

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols),
        ('num', numerical_transformer, numerical_cols)
    ])


### Train and evaluate the machine learning model:
This step includes the following:

1. Split the data into training and testing sets.
2. Apply preprocessing to the training and testing feature matrices (one-hot encoding for categorical columns; numerical columns are passed through unchanged).
3. Oversample the minority class in the training data using SMOTE.
4. Perform feature selection on the training data using a Random Forest Classifier.
5. Create a pipeline with a Random Forest Classifier as the estimator.
6. Perform hyperparameter tuning using GridSearchCV with cross-validation to find the best hyperparameters.
7. Train the model with the best hyperparameters on the training data.
8. Make predictions on the test set using the trained model.
9. Evaluate the model's performance using various metrics (ROC-AUC, precision, recall, and F1-score).

This step encompasses the core of the machine learning model training and evaluation process, including data preparation, model selection, hyperparameter tuning, and performance assessment.

In [24]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply the preprocessing to the feature matrix
print("Applying preprocessing to feature matrix...")
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Oversample the minority class
print("Oversampling minority class...")
smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Perform feature selection
print("Performing feature selection...")
selector = SelectFromModel(estimator=RandomForestClassifier(), threshold='median')
X_train_selected = selector.fit_transform(X_train_resampled, y_train_resampled)
X_test_selected = selector.transform(X_test)

# Prepare pipeline with Random Forest classifier
print("Preparing pipeline with Random Forest classifier...")
classifier_pipeline = Pipeline(steps=[
    ('classifier', RandomForestClassifier())
])

# Perform hyperparameter tuning
print("Performing hyperparameter tuning...")
param_grid = {
    'classifier__max_depth': [3, 5, 7],
    'classifier__n_estimators': [100, 200, 300]
}
grid_search = GridSearchCV(classifier_pipeline, param_grid, cv=5, scoring='roc_auc')

print("Training model...")
# Note: the loop below only animates a spinner for 60 seconds *before* training starts.
# It does not track the progress of grid_search.fit() and can be removed without
# affecting the results.
load_animation = "|/-\\"
idx = 0
start_time = datetime.now()

while True:
    current_time = datetime.now()
    elapsed_time = current_time - start_time
    if elapsed_time.total_seconds() > 60:
        break

    msg = f"\r[{load_animation[idx % len(load_animation)]}] Training in progress..."
    sys.stdout.write(msg)
    sys.stdout.flush()
    idx += 1
    time.sleep(0.2)

grid_search.fit(X_train_selected, y_train_resampled)

# Get the best estimator
best_classifier = grid_search.best_estimator_

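# Train the model
# (GridSearchCV already refits best_estimator_ on the full training data by default,
# so this second fit is redundant)
print("\nTraining best model...")
best_classifier.fit(X_train_selected, y_train_resampled)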

# Make predictions on the test set
y_pred = best_classifier.predict(X_test_selected)

print("\nTraining successful!")

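# Evaluate the model
# Note: ROC-AUC is computed from hard class predictions here, so it reflects a single
# threshold; passing best_classifier.predict_proba(X_test_selected)[:, 1] instead
# would score the ranking across all thresholds.
roc_auc = roc_auc_score(y_test, y_pred)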
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"ROC-AUC: {roc_auc:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

Applying preprocessing to feature matrix...
Oversampling minority class...
Performing feature selection...
Preparing pipeline with Random Forest classifier...
Performing hyperparameter tuning...
Training model...
[-] Training in progress...
Training best model...

Training successful!
ROC-AUC: 0.76
Precision: 0.60
Recall: 0.71
F1-score: 0.65
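
The ROC-AUC above was computed from hard class predictions (`y_pred`). The sketch below, using hypothetical labels and probabilities rather than this model's output, shows how the value can differ when predicted probabilities are passed instead.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predicted churn probabilities for eight customers
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.10, 0.20, 0.35, 0.60, 0.40, 0.55, 0.80, 0.90]

# Hard labels at the default 0.5 threshold
y_label = [int(p >= 0.5) for p in y_score]

auc_from_scores = roc_auc_score(y_true, y_score)  # 0.875: ranking over all thresholds
auc_from_labels = roc_auc_score(y_true, y_label)  # 0.75: one threshold only

print(auc_from_scores, auc_from_labels)
```

In this notebook, the equivalent change would presumably be to score `best_classifier.predict_proba(X_test_selected)[:, 1]` instead of `y_pred`. The resulting value is not known here and may differ from the 0.76 reported above.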


Based on the provided evaluation metrics, we can draw the following conclusions and provide recommendations for the telco company:

1. **Model Performance Analysis**:
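   - The ROC-AUC score of 0.76 indicates a reasonably good ability of the model to distinguish between customers who will churn and those who will not, although there is still room for improvement. Because this score was computed from hard class predictions rather than predicted probabilities, it reflects a single decision threshold.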
   - The precision score of 0.60 means that out of the customers predicted as churners, 60% of them are actually churners. This suggests that the model has a moderate level of precision in identifying true positives (churners).
   - The recall score of 0.71 implies that the model correctly identifies 71% of the actual churners. While this is a decent recall rate, it also means that 29% of actual churners are missed by the model.
   - The F1-score of 0.65 provides a balanced view of the model's performance, taking into account both precision and recall. It indicates a moderate level of overall performance.

2. **Recommendations for the Telco Company**:
   - While the model shows promising results, there is still room for improvement. The company should continue to explore additional feature engineering, data preprocessing techniques, and alternative machine learning algorithms to enhance the model's predictive power.
   - Given the moderate precision score, the company should exercise caution when targeting customers predicted as churners. It may be beneficial to prioritize those customers with the highest predicted probability of churning to optimize resource allocation and minimize the impact of false positives.
   - The recall score suggests that the company may be missing a significant portion of potential churners. To address this, the company could consider adjusting the classification threshold or exploring ensemble methods that combine multiple models to improve recall without significantly compromising precision.
   - The company should analyze the feature importance or model coefficients to gain insights into the most influential factors contributing to customer churn. This information can guide the development of targeted retention strategies and personalized offers or incentives for at-risk customers.
   - Continuously monitoring and updating the model with new data is crucial, as customer behavior and market dynamics can change over time. Regular model retraining and evaluation should be implemented to ensure the model remains relevant and effective.
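
The threshold adjustment mentioned in the recommendations can be sketched as follows. The labels, probabilities, and the 0.9 recall target are all assumptions for illustration. In practice the threshold would be chosen on a validation set, not on the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted churn probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.10, 0.20, 0.35, 0.60, 0.40, 0.55, 0.80, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds; drop the final point
target_recall = 0.9  # assumed business requirement, not taken from the notebook
meets_target = recall[:-1] >= target_recall

# Among thresholds that reach the target recall, keep the one with the best precision
best_idx = np.argmax(np.where(meets_target, precision[:-1], -1.0))
chosen_threshold = thresholds[best_idx]

# Customers at or above the chosen threshold are flagged as likely churners
y_pred_adjusted = (y_score >= chosen_threshold).astype(int)
print(chosen_threshold, y_pred_adjusted)
```

Here the chosen threshold is 0.40, which recovers all four hypothetical churners at the cost of one false positive. Sorting customers by `y_score` gives the priority order for retention offers.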

3. **Conclusion on the Effectiveness of the Machine Learning Approach**:
   The machine learning approach is reasonably effective at identifying potential churners, and its feature importances can offer insights into the factors that contribute to churn. However, the moderate performance metrics suggest that this model alone does not fully solve the churn problem.

   To maximize the benefits of the machine learning approach, the telco company should consider integrating the model's predictions with additional data sources, such as customer feedback, market analysis, and domain expertise. By combining the model's insights with qualitative information and business knowledge, the company can develop more comprehensive and effective customer retention strategies.

   Additionally, continuous model improvement through iterative feature engineering, algorithm selection, and parameter tuning will be essential to enhance the model's predictive capabilities and better address the customer churn problem over time.

Overall, while the current machine learning model provides valuable insights and predictions, it should be viewed as part of a broader customer retention strategy that incorporates multiple data sources, domain knowledge, and ongoing model refinement.
