# Problem Statement:
This notebook predicts whether a customer will churn (cancel their subscription) from their behavior and account information, using the Telco Customer Churn dataset. The goal is to build a classification model that predicts churn accurately.

# Dataset Overview:
The dataset contains the following key features:

- Demographics: gender, SeniorCitizen, Partner, Dependents
- Account Information: tenure, PhoneService, MultipleLines, InternetService, Contract, PaymentMethod
- Services: OnlineSecurity, DeviceProtection, TechSupport
- Charges: MonthlyCharges, TotalCharges
- Target: Churn (Yes/No)

# Steps to be covered:
- Data Preprocessing: Handle missing values, encode categorical variables, and split the dataset.
- Model Training: Balance the training set with SMOTE, then train a Random Forest classifier.
- Model Evaluation: Evaluate the model using accuracy and confusion matrix.
- Save the Model: Save the trained model.
- Load and Predict: Use the saved model for predictions.

# Import Libraries and Load Dataset

In [None]:
# !pip install pandas scikit-learn joblib imbalanced-learn

In [None]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import joblib

# Load the dataset provided by the user
file_path = 'dataset_Telco-Customer-Churn.csv'
dataset = pd.read_csv(file_path)

# Display the first few rows of the dataset to understand its structure
# (print is needed because only the last expression in a cell is shown automatically)
print(dataset.head())

# Check the distribution of Churn in the original dataset
print("Original dataset class distribution:")
print(dataset['Churn'].value_counts())

# Data Preprocessing

In [None]:
# Convert 'TotalCharges' to numeric, coercing errors to handle any non-numeric data
dataset['TotalCharges'] = pd.to_numeric(dataset['TotalCharges'], errors='coerce')

# Handle missing values by filling 'TotalCharges' with 0
dataset['TotalCharges'] = dataset['TotalCharges'].fillna(0)

# Convert the 'Churn' column to binary (Yes -> 1, No -> 0)
dataset['Churn'] = dataset['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

# Check the conversion
print(dataset['Churn'].value_counts())

# Select relevant features and target
features = dataset.drop(columns=['customerID', 'Churn'])
target = dataset['Churn']

# Get categorical and numerical columns
categorical_columns = features.select_dtypes(include=['object']).columns
numerical_columns = features.select_dtypes(exclude=['object']).columns

# One-hot encoding for categorical features
features_encoded = pd.get_dummies(features, columns=categorical_columns)

# Use stratified split to ensure both classes are represented in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(features_encoded, target, test_size=0.2, stratify=target, random_state=42)

# Check that the stratified split preserved the original class proportions
# (the classes are still imbalanced at this point; SMOTE is applied later)
print("Training set class distribution:", y_train.value_counts(normalize=True))

# Display the shape of training and test data
X_train.shape, X_test.shape, y_train.shape, y_test.shape
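
The `TotalCharges` conversion at the top of this cell replaces unparseable entries with `NaN` and then with 0 without showing which rows were affected. The sketch below uses a small toy frame (hypothetical values, not rows from the dataset) to show how those rows could be inspected before filling them.

```python
import pandas as pd

# Toy stand-in for the real data; the blank string mimics a non-numeric entry.
# These values are illustrative assumptions, not taken from the dataset.
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "358.2", "1500.75"],
})

# errors='coerce' turns anything that cannot be parsed as a number into NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Inspect the affected rows before deciding how to fill them
missing_rows = toy[toy["TotalCharges"].isna()]
print(missing_rows)

# Fill with 0, as the notebook does
toy["TotalCharges"] = toy["TotalCharges"].fillna(0)
print(toy)
```

If the affected rows all have a tenure of 0, filling with 0 is a natural choice; otherwise a different fill value or dropping the rows may be worth considering.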

# Model Training

In [34]:
# Import SMOTE for oversampling
from imblearn.over_sampling import SMOTE

# Apply SMOTE to balance the training dataset
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

# Initialize the Random Forest model
rf_model = RandomForestClassifier(random_state=42)

# Train the model on the balanced data
rf_model.fit(X_train_balanced, y_train_balanced)

# Predict on the test data
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Output model accuracy and confusion matrix
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Confusion Matrix:")
print(conf_matrix)



Accuracy: 78.21%
Confusion Matrix:
[[906 129]
 [178 196]]


#### Accuracy: 78.21%
The model achieved an accuracy of 78.21%, which means that it correctly predicted whether a customer would churn or not in 78.21% of the test cases. Accuracy alone may not fully represent model performance, especially when dealing with imbalanced datasets (like churn vs. non-churn), so we should also examine the confusion matrix for a deeper analysis.
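
To put the accuracy in context, it helps to compare it with a trivial baseline that always predicts "No Churn". The sketch below derives both figures from the four counts in the confusion matrix printed above; the variable names are chosen for illustration.

```python
# Counts copied from the confusion matrix reported above
tn, fp, fn, tp = 906, 129, 178, 196

total = tn + fp + fn + tp

# Fraction of test cases the model classified correctly
model_accuracy = (tn + tp) / total

# A classifier that always predicts "No Churn" is right for every actual non-churner
baseline_accuracy = (tn + fp) / total

print(f"Model accuracy:    {model_accuracy:.2%}")
print(f"Majority baseline: {baseline_accuracy:.2%}")
```

The gap between the two numbers is a more honest measure of what the model has learned than the raw accuracy.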

#### Confusion Matrix:
The confusion matrix (rows are actual classes, columns are predicted classes) breaks the model's predictions down as follows:
- 906 (True Negatives): customers who did not churn and were correctly predicted not to churn.
- 129 (False Positives): customers who did not churn but were predicted to churn.
- 178 (False Negatives): customers who churned but were predicted not to churn.
- 196 (True Positives): customers who churned and were correctly predicted to churn.
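
The four cells listed above can be turned into precision, recall, and F1 for the churn class. This is a minimal sketch that re-enters the reported matrix by hand; in the notebook itself the existing `conf_matrix` variable could be used instead.

```python
import numpy as np

# Confusion matrix as reported above: rows = actual, columns = predicted
conf_matrix = np.array([[906, 129],
                        [178, 196]])

# For a binary problem, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = conf_matrix.ravel()

# Precision: of the customers predicted to churn, how many actually churned
precision = tp / (tp + fp)

# Recall: of the customers who actually churned, how many were caught
recall = tp / (tp + fn)

# F1: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision (churn): {precision:.3f}")
print(f"Recall (churn):    {recall:.3f}")
print(f"F1 (churn):        {f1:.3f}")
```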

#### Interpretation:
- True Negatives (906 of 1,035 non-churners, about 88%) show that the model identifies most customers who stay. True Positives (196 of 374 churners, about 52%) show that it catches only about half of those who leave.
- The False Negatives (178) are the main weakness: nearly half of the customers who actually churned were predicted to stay.
- The False Positives (129) are customers flagged as churn risks who did not churn, roughly 40% of all churn predictions (129 of 325).

#### Model Performance:
- While an accuracy of 78.21% is decent, the confusion matrix reveals that there is room for improvement in reducing false negatives (customers who churn but were predicted not to).
- A potential next step could involve tuning the model further or trying other models to improve the precision and recall for the churned class.
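
One tuning option, not used in this notebook, is to lower the probability threshold at which a customer is labeled as churn, trading precision for recall. The sketch below illustrates the idea on synthetic data from `make_classification` (an assumption for the sake of a self-contained example, not the Telco dataset), so the printed numbers say nothing about the real model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (about 27% positives); NOT the Telco dataset
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.73, 0.27], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)

# Probability of the positive ("churn") class for each test sample
churn_proba = model.predict_proba(X_te)[:, 1]

# Compare a threshold of 0.5 (close to the default decision rule) with a lower one
results = {}
for threshold in (0.5, 0.3):
    preds = (churn_proba >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, preds, zero_division=0),
        recall_score(y_te, preds),
    )
    print(f"threshold={threshold}: "
          f"precision={results[threshold][0]:.3f}, recall={results[threshold][1]:.3f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases; whether the accompanying loss in precision is acceptable depends on the relative cost of missing a churner versus contacting a customer who would have stayed.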

# Save the Model

In [35]:
# Save the trained model to a file
model_filename = 'telco_churn_rf_model.pkl'
joblib.dump(rf_model, model_filename)

print(f"Model saved to {model_filename}")


Model saved to telco_churn_rf_model.pkl


# Load and Predict using the Saved Model

In [None]:
# Load the saved model
loaded_model = joblib.load(model_filename)

# Make predictions on the test set with the loaded model
loaded_model_predictions = loaded_model.predict(X_test)

# Mapping predictions to human-readable output
prediction_labels = ['No Churn' if pred == 0 else 'Churn' for pred in loaded_model_predictions]

# Display the first few predictions in a more readable format
for i, prediction in enumerate(prediction_labels[:10]):
    print(f"Sample {i+1}: {prediction}")

# Create a DataFrame with the test data and the human-readable predictions
X_test_with_predictions = X_test.copy()
X_test_with_predictions['Predicted Churn'] = prediction_labels

# Save the DataFrame to a CSV file
output_file_path = 'prediction_customer_churn.csv'
X_test_with_predictions.to_csv(output_file_path, index=False)

output_file_path  # Return the path to the saved CSV file
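
The cell above predicts on `X_test`, which is already one-hot encoded. For genuinely new customer records, `pd.get_dummies` would only create columns for the categories present in the new data, so the result has to be aligned with the columns the model was trained on. The sketch below shows one way to do that with `reindex`; the toy frames and the name `training_columns` are hypothetical, and in the notebook `X_train.columns` would play that role.

```python
import pandas as pd

# Toy training data with a single categorical feature (illustrative values only)
train_raw = pd.DataFrame({
    "tenure": [1, 34, 2],
    "Contract": ["Month-to-month", "One year", "Two year"],
})
train_encoded = pd.get_dummies(train_raw, columns=["Contract"])

# Column layout the model would expect (X_train.columns in the notebook)
training_columns = train_encoded.columns

# A new record that contains only one of the three contract types
new_raw = pd.DataFrame({"tenure": [5], "Contract": ["One year"]})
new_encoded = pd.get_dummies(new_raw, columns=["Contract"])

# Add the missing dummy columns (filled with 0) and enforce the training column order
new_aligned = new_encoded.reindex(columns=training_columns, fill_value=0)
print(new_aligned)
```

The aligned frame can then be passed to `loaded_model.predict`. Saving the training column list alongside the model file would make this step reproducible in a separate session.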

