## Business Problem Understanding

### Context:
In the telecommunications industry, customer churn is a significant concern because it directly reduces revenue and profitability. The dataset contains customer profiles from a telco company, including a `Churn` label that records whether each customer has left. Understanding the factors that contribute to churn can help the company develop strategies to retain existing customers and improve overall customer satisfaction.

### Problem Statement:
The telco company wants to identify the customers who are likely to churn or leave the company. By understanding the characteristics and behavior patterns of these customers, the company can take proactive measures to prevent churn and implement targeted retention strategies.

### Goals:

- Develop a predictive model that can accurately identify customers who are likely to churn based on their profile information and service usage patterns.
- Gain insights into the key factors or variables that are closely associated with customer churn, such as tenure, services subscribed, contract type, monthly charges, and customer demographics.
- Use the insights and the predictive model to develop effective customer retention strategies and personalized offers or incentives to reduce churn.

### Analytic Approach:

- Perform exploratory data analysis (EDA) to understand the dataset, identify any missing or inconsistent data, and gain initial insights into potential relationships between features and customer churn.
- Preprocess the data by handling missing values, encoding categorical variables, and scaling numerical features, if necessary.
- Split the data into training and testing sets.
- Build and evaluate various machine learning classification models (e.g., logistic regression, decision trees, random forests, gradient boosting) to predict customer churn.
- Optimize the best-performing model(s) through techniques like hyperparameter tuning and ensemble methods.
- Interpret the model results to identify the most important features influencing customer churn.
- Validate the model's performance on the test set using appropriate evaluation metrics.
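
The model-comparison step above can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not on the telco dataset; the candidate list, the hyperparameters, and the names `X_demo`, `y_demo`, and `cv_scores` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Hypothetical candidate models with illustrative settings.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated ROC-AUC for each candidate.
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc").mean()
    for name, model in candidates.items()
}
for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Comparing candidates by cross-validation on the training data keeps the test set untouched until the final validation step.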

### Metric Evaluation:
The primary evaluation metric for this classification problem will be the Area Under the Receiver Operating Characteristic Curve (ROC-AUC). ROC-AUC measures the model's ability to distinguish between churners and non-churners across all classification thresholds, rather than at a single cut-off.

Additionally, we can consider other relevant metrics, such as:
- Precision: The proportion of correctly identified churners among all predicted churners (to minimize false positives).
- Recall: The proportion of correctly identified churners among all actual churners (to minimize false negatives).
- F1-score: The harmonic mean of precision and recall, providing a balanced measure of model performance.

The choice of metric(s) will depend on the business priorities and the relative importance of minimizing false positives (incorrectly identifying non-churners as churners) or false negatives (missing potential churners).

By analyzing the model's performance using these metrics, the telco company can strike a balance between proactive customer retention efforts and efficient resource allocation.
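
As a worked example of the precision, recall, and F1 definitions above, the following sketch computes them by hand from confusion-matrix counts and checks the result against scikit-learn. The labels are made up for illustration and do not come from the telco dataset.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 1 = churner, 0 = non-churner (made up for illustration).
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners correctly flagged: 3
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-churners wrongly flagged: 2
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners missed: 1

precision_manual = tp / (tp + fp)  # 3 / 5 = 0.60
recall_manual = tp / (tp + fn)     # 3 / 4 = 0.75
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)

print(precision_manual, recall_manual, round(f1_manual, 4))
print(precision_score(y_true_toy, y_pred_toy),
      recall_score(y_true_toy, y_pred_toy),
      round(f1_score(y_true_toy, y_pred_toy), 4))
```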

## Data Understanding


In [184]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
import lightgbm as lgb
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer

In [185]:
# Load Dataset
df = pd.read_csv("telcocucu.csv")

# Preprocess data
categorical_cols = ['Dependents', 'OnlineSecurity', 'OnlineBackup', 'InternetService', 'DeviceProtection', 'TechSupport', 'Contract', 'PaperlessBilling']
numerical_cols = ['tenure', 'MonthlyCharges']


The first line loads a CSV file into a pandas DataFrame; the next two lines list which columns hold categorical data and which hold numerical data. The two groups are kept separate because they are preprocessed differently before model training: categorical columns are encoded, while numerical columns may be scaled or left as they are.
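
Instead of listing the columns by hand, the two groups could also be inferred from the column dtypes. The sketch below assumes a small made-up frame (`toy`) that borrows a few column names from the lists above; the values are invented, and the names `inferred_categorical` and `inferred_numerical` are hypothetical.

```python
import pandas as pd

# Hypothetical mini-frame; the values are invented for illustration.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year"],
    "PaperlessBilling": ["Yes", "No", "Yes"],
    "tenure": [1, 48, 12],
    "MonthlyCharges": [70.5, 25.0, 55.3],
    "Churn": ["Yes", "No", "No"],
})

features = toy.drop(columns="Churn")

# Non-numeric columns are treated as categorical, numeric ones as numerical.
inferred_categorical = features.select_dtypes(exclude="number").columns.tolist()
inferred_numerical = features.select_dtypes(include="number").columns.tolist()

print(inferred_categorical)
print(inferred_numerical)
print(toy.isna().sum())  # missing values per column
```

A hand-written list remains the safer choice when some numeric-looking columns are really categories, or when only a subset of columns should be used.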

In [186]:
# Encode target variable
le = LabelEncoder()
y = le.fit_transform(df['Churn'])

# Split data into features and target
X = df.drop('Churn', axis=1)


1. Create a LabelEncoder object, which maps class labels to integers.
2. Encode the 'Churn' column (which is likely a categorical variable indicating whether a customer has churned or not) into numerical values, and store the encoded values in the variable y.
3. Create a new DataFrame X that contains all the features (predictor variables) except for the 'Churn' column.

The next step would typically be to split the data (X and y) into training and testing sets, and then train a machine learning model using the training data.
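
To make the target encoding concrete, here is a small sketch on made-up labels. It assumes the `Churn` column holds the strings "Yes" and "No", which the source does not confirm; `toy_churn`, `toy_le`, and `toy_y` are names introduced for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label values; the real 'Churn' column may use different strings.
toy_churn = ["No", "Yes", "No", "No", "Yes"]

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_churn)

# Classes are sorted, so "No" maps to 0 and "Yes" maps to 1.
print(toy_le.classes_)
print(toy_y)

# The fitted encoder can translate integer predictions back to labels.
print(toy_le.inverse_transform([1, 0]))
```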

In [187]:
# Check for string values in X
string_cols = [col for col in X.columns if X[col].dtype == 'object']

# Encode string values in X with a fresh encoder per column, so the
# target encoder `le` fitted above stays available for inverse_transform
for col in string_cols:
    X[col] = LabelEncoder().fit_transform(X[col])

# Encode categorical features
categorical_transformer = Pipeline(steps=[
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])

This cell first converts every string column in X to integer codes with LabelEncoder. It then creates a Pipeline object called categorical_transformer that contains a single step: one-hot encoding categorical features using OneHotEncoder with the handle_unknown='ignore' parameter, so that categories not seen during fitting do not raise an error at transform time.
This categorical_transformer object can then be used in conjunction with other transformers and estimators to preprocess the categorical features and train a machine learning model.
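
The effect of `handle_unknown='ignore'` can be seen in a small sketch. The contract values below are illustrative, and "Three year" is a made-up category that is absent from the fitting data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature, three known categories.
train_contracts = np.array([["Month-to-month"], ["One year"], ["Two year"]])
# The second value was never seen during fitting.
new_contracts = np.array([["One year"], ["Three year"]])

toy_encoder = OneHotEncoder(handle_unknown="ignore")
toy_encoder.fit(train_contracts)

# The default output is sparse; convert to a dense array for display.
encoded = toy_encoder.transform(new_contracts).toarray()
print(encoded)
# [[0. 1. 0.]
#  [0. 0. 0.]]
```

The unseen category is encoded as a row of zeros instead of raising an error.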

In [188]:
# Preprocess numerical features (if needed)
numerical_transformer = 'passthrough'

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_cols),
        ('num', numerical_transformer, numerical_cols)
    ])


These lines create a ColumnTransformer called preprocessor that applies a different transformation to each group of columns: the categorical columns are one-hot encoded, while the numerical columns are passed through unchanged (the 'passthrough' keyword). Columns of X that appear in neither list are dropped, since that is ColumnTransformer's default for remaining columns.
The preprocessor can then be applied on its own, or used as a step in a Pipeline together with a machine learning model.
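
A small sketch of how such a ColumnTransformer behaves, using an invented frame. `customer_note` is a hypothetical column added only to show what happens to columns that no transformer lists.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

toy_X = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month"],
    "tenure": [1, 48, 12],
    "MonthlyCharges": [70.5, 25.0, 55.3],
    "customer_note": ["a", "b", "c"],  # hypothetical, not listed below
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
        ("num", "passthrough", ["tenure", "MonthlyCharges"]),
    ]
)

toy_out = toy_preprocessor.fit_transform(toy_X)

# 2 one-hot columns for Contract + 2 numerical columns; customer_note is dropped.
print(toy_out.shape)  # (3, 4)
```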

In [189]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply the preprocessing to the feature matrix
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)

# Oversample the minority class
smote = SMOTE()
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Perform feature selection
selector = SelectFromModel(estimator=RandomForestClassifier(), threshold='median')
X_train_selected = selector.fit_transform(X_train_resampled, y_train_resampled)
X_test_selected = selector.transform(X_test)

# Prepare pipeline with Random Forest classifier
classifier_pipeline = Pipeline(steps=[
    ('classifier', RandomForestClassifier())
])

# Perform hyperparameter tuning
param_grid = {
    'classifier__max_depth': [3, 5, 7],
    'classifier__n_estimators': [100, 200, 300]
}
grid_search = GridSearchCV(classifier_pipeline, param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train_selected, y_train_resampled)

# Get the best estimator; GridSearchCV (refit=True by default) has already
# refit it on the full training data, so no separate fit call is needed
best_classifier = grid_search.best_estimator_

# Make predictions on the test set
y_pred = best_classifier.predict(X_test_selected)

These lines split the data into training and test sets, apply the preprocessor, oversample the minority class in the training set with SMOTE, select features using a random forest's importances (keeping those at or above the median), tune a RandomForestClassifier with a 5-fold grid search scored by ROC-AUC, and use the best estimator to make predictions on the test data.
Note that you can try different classifiers (e.g., LogisticRegression, DecisionTreeClassifier, XGBClassifier) by replacing RandomForestClassifier() in the classifier_pipeline with the desired classifier and adjusting param_grid to that classifier's hyperparameters.
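
One way to swap the classifier without rebuilding the pipeline is `Pipeline.set_params`, which can replace a whole step by its name. The sketch below uses synthetic data; `X_syn`, `y_syn`, `demo_pipeline`, and `lr_pipeline` are names introduced for the example.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the preprocessed training data.
X_syn, y_syn = make_classification(n_samples=200, n_features=8, random_state=0)

demo_pipeline = Pipeline(steps=[
    ("classifier", RandomForestClassifier(random_state=0))
])

# Copy the pipeline, then replace the step named "classifier".
lr_pipeline = clone(demo_pipeline).set_params(
    classifier=LogisticRegression(max_iter=1000)
)
lr_pipeline.fit(X_syn, y_syn)

print(type(demo_pipeline.named_steps["classifier"]).__name__)
print(type(lr_pipeline.named_steps["classifier"]).__name__)
```

Grid keys such as `classifier__max_depth` refer to parameters of whichever estimator occupies the step, so the grid has to change along with the classifier.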

In [190]:
# Evaluate the model
roc_auc = roc_auc_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"ROC-AUC: {roc_auc:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")

ROC-AUC: 0.75
Precision: 0.60
Recall: 0.70
F1-score: 0.65


These evaluation metrics summarize the performance of the trained model on the test set. Precision, recall, and F1-score describe how well the model identifies churners at the default decision threshold. Note that the ROC-AUC here is computed from the hard class predictions in y_pred, so it reflects that single threshold and equals the balanced accuracy; passing predicted probabilities (from predict_proba) instead would give the threshold-independent ROC-AUC described under Metric Evaluation.
These scores can be used to assess the model and to compare it with other models or baselines.
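
The difference between scoring ROC-AUC on hard labels and on predicted probabilities can be illustrated with a short sketch. It uses synthetic data and a logistic regression as a stand-in model, so the numbers it prints say nothing about the telco results above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (not the telco dataset).
X_s, y_s = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_s, y_s, test_size=0.25, stratify=y_s, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
hard_labels = clf.predict(X_te)              # 0/1 predictions at the default threshold
churn_scores = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, churn_scores)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"ROC-AUC from hard labels:   {auc_from_labels:.3f}")
print(f"Balanced accuracy:          {bal_acc:.3f}")
print(f"ROC-AUC from probabilities: {auc_from_scores:.3f}")
```

With hard labels the ROC curve has only one interior point, so the area reduces to the mean of the true positive rate and the true negative rate, which is the balanced accuracy.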
