# Customer Churn Prediction

This project predicts whether a customer of a telecommunications company will churn, using a logistic regression classifier. It proceeds in five steps:
1. Data loading and exploration
2. Data preprocessing
3. Model training
4. Model evaluation
5. Conclusion

## 1. Data Loading and Exploration

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc
from sklearn.preprocessing import LabelEncoder, StandardScaler

In [None]:
# Load the data
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
data.head(10)

In [None]:
# Get information about the data
print(data.info())

In [None]:
# Check for missing data
print(data.isnull().sum())

In [None]:
# Check the number of unique values in columns
print(data.nunique())

In [None]:
# Visualize the distribution of the target variable
sns.countplot(x='Churn', data=data)
plt.title('Target value distribution')
plt.show()

## 2. Data Preprocessing

In [None]:
# Drop the unnecessary 'customerID' column
data.drop(['customerID'], axis=1, inplace=True)

In [None]:
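# Convert 'TotalCharges' to numeric; values that cannot be parsed become NaN
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
# Fill missing values in 'TotalCharges' only, using that column's median
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())
print(data.isnull().sum())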
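
The conversion step can be illustrated without pandas. The following is a minimal sketch, assuming a handful of made-up charge strings (`raw_charges` is hypothetical, not taken from the dataset) and a helper `to_number` introduced only for this example. It mimics the idea of coercing unparseable values to NaN and then filling those gaps with the median of the values that did parse.

```python
import math
import statistics


def to_number(value):
    """Return value as a float, or NaN if it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return math.nan


# Hypothetical raw values; the blank string stands in for a missing charge
raw_charges = ['29.85', '1889.5', ' ', '108.15']

parsed = [to_number(v) for v in raw_charges]

# Median of the values that parsed successfully (NaN entries are skipped)
median = statistics.median(v for v in parsed if not math.isnan(v))

# Replace only the NaN entries with that median
filled = [median if math.isnan(v) else v for v in parsed]
print(filled)
```

Only the entry that failed to parse is changed; the other values are left as they were.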

In [None]:
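# Encode categorical features as integer codes ('Churn' is converted separately)
# Note: the codes follow the sorted order of each column's values, not a meaningful ranking
categorical = data.select_dtypes(include=['object']).columns
for cat in categorical:
    if cat != 'Churn':
        data[cat] = LabelEncoder().fit_transform(data[cat])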
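
To make the effect of label encoding concrete, here is a rough pure-Python sketch of the idea: each distinct value is mapped to its position in sorted order. The function `label_encode` and the sample values are illustrative assumptions, not part of the notebook. Because the resulting integers imply an ordering, a linear model may treat unordered categories as if they were ranked; one-hot encoding is a common alternative when that is a concern.

```python
def label_encode(values):
    """Map each distinct value to its index in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes


# Illustrative values modelled on a contract-type column
contracts = ['Two year', 'Month-to-month', 'One year', 'Month-to-month']
codes, classes = label_encode(contracts)

print(classes)  # sorted distinct values
print(codes)    # integer code for each original entry
```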

In [None]:
# Convert 'Churn' to binary
data['Churn'] = data['Churn'].apply(lambda x: 1 if x == 'Yes' else 0)

In [None]:
# Split the data into features and target
features = data.drop('Churn', axis=1)
target = data['Churn']

In [None]:
# Split the data into training and testing sets before scaling,
# so that the test set does not influence the scaler's statistics
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
print('Shapes after splitting:')
print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')

In [None]:
# Scale the features: fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
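
Standardization subtracts a mean and divides by a standard deviation. The sketch below shows the arithmetic for a single feature, assuming the statistics are computed from the training values only and then reused for the test values, which is a common precaution against information from the test set leaking into preprocessing. The helper names and the numbers are hypothetical.

```python
import statistics


def fit_standardizer(train_values):
    """Return the mean and population standard deviation of the training values."""
    return statistics.fmean(train_values), statistics.pstdev(train_values)


def standardize(values, mean, std):
    """Apply (value - mean) / std to every value."""
    return [(v - mean) / std for v in values]


# Hypothetical single-feature data
train = [10.0, 20.0, 30.0]
test = [40.0]

mean, std = fit_standardizer(train)            # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)     # test reuses the train statistics

print(train_scaled, test_scaled)
```

The scaled training values have zero mean and unit variance by construction; the test values generally do not, which is expected.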

## 3. Model Training

In [None]:
# Initialize and train the logistic regression model with L2 regularization
regressor = LogisticRegression(penalty='l2', solver='liblinear')
regressor.fit(X_train, y_train)
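
For intuition about what L2 regularization adds, the sketch below evaluates a textbook form of the penalized objective: `C` times the summed log loss plus half the squared norm of the weights. This is an illustration under stated assumptions, not scikit-learn's implementation; in particular, the exact scaling of the objective and how the solver treats the intercept are not reproduced here. The function names and toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def penalized_log_loss(weights, bias, X, y, C=1.0):
    """Return C * (sum of log losses) + 0.5 * squared L2 norm of the weights."""
    data_loss = 0.0
    for row, label in zip(X, y):
        p = sigmoid(sum(w * x for w, x in zip(weights, row)) + bias)
        data_loss += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    penalty = 0.5 * sum(w * w for w in weights)
    return C * data_loss + penalty


# Toy data: two samples with two features each
X_demo = [[1.0, 2.0], [-1.0, 0.5]]
y_demo = [1, 0]

loss_zero = penalized_log_loss([0.0, 0.0], 0.0, X_demo, y_demo)
loss_w = penalized_log_loss([1.0, 0.0], 0.0, X_demo, y_demo)
print(loss_zero, loss_w)
```

A smaller `C` gives the penalty term relatively more weight, which pulls the fitted coefficients toward zero.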

## 4. Model Evaluation

In [None]:
# Make predictions
y_pred = regressor.predict(X_test)

In [None]:
# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
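
The four metrics above all derive from the counts of true and false positives and negatives. As a minimal sketch, assuming binary labels where 1 means churn, they can be computed by hand as follows. The function `classification_metrics` and the demo labels are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


# Made-up labels for illustration
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 0, 0]

metrics_demo = classification_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```

When churners are the minority class, accuracy can look high even if recall is low, so it helps to read these numbers together.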

In [None]:
# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Plot ROC curve
fpr, tpr, _ = roc_curve(y_test, regressor.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()
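
The area under the ROC curve is obtained by applying the trapezoidal rule to the (false positive rate, true positive rate) points. A small sketch of that calculation, assuming the points are already sorted by increasing false positive rate, is shown below; the function name and the demo points are hypothetical.

```python
def trapezoid_auc(fpr_points, tpr_points):
    """Integrate TPR over FPR with the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(fpr_points)):
        width = fpr_points[i] - fpr_points[i - 1]
        mean_height = (tpr_points[i] + tpr_points[i - 1]) / 2.0
        area += width * mean_height
    return area


# Made-up ROC points, sorted by increasing false positive rate
fpr_demo = [0.0, 0.0, 0.5, 1.0]
tpr_demo = [0.0, 0.5, 1.0, 1.0]

auc_demo = trapezoid_auc(fpr_demo, tpr_demo)
auc_diagonal = trapezoid_auc([0.0, 1.0], [0.0, 1.0])  # chance-level diagonal
print(auc_demo, auc_diagonal)
```

The diagonal reference line corresponds to an area of 0.5, which is why it serves as the chance-level baseline in the plot.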

## 5. Conclusion

In this project, we analyzed customer churn data from a telecommunications company. After preprocessing the data (dropping the customer ID, converting TotalCharges to numeric, encoding categorical variables, and scaling the features), we trained a logistic regression model with L2 regularization to predict customer churn. We evaluated the model on a held-out 20% test set using accuracy, precision, recall, and F1 score, together with a confusion matrix and the ROC curve with its AUC. The results indicate that the model performs reasonably well, but there is potential for improvement using more advanced techniques.