# Customer Churn Prediction

## 1. Introduction
This notebook predicts customer churn for a bank. We analyze customer data to find the factors most associated with churn and build a model that flags at-risk customers. It is a binary classification problem: the target column, `Exited`, indicates whether a customer left the bank (churned).

## 2. Data Loading and Initial Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('churn.csv')

# Display the first few rows
df.head()

In [None]:
# Get a summary of the dataframe
df.info()

## 3. Data Cleaning and Preprocessing

In [None]:
# Drop irrelevant columns
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

# Check for missing values
df.isnull().sum()

The check shows no missing values, so no imputation is needed. Next, we one-hot encode the categorical features `Geography` and `Gender`.

In [None]:
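# One-hot encode categorical variables; drop_first removes one redundant column per feature.
# dtype=int keeps the dummy columns as 0/1 integers (newer pandas versions default to booleans).
df = pd.get_dummies(df, columns=['Geography', 'Gender'], drop_first=True, dtype=int)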

df.head()

## 4. Exploratory Data Analysis (EDA)

In [None]:
# Target variable distribution
sns.countplot(x='Exited', data=df)
plt.title('Distribution of Customer Churn')
plt.show()
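
The count plot shows the class balance visually; the same information as numbers makes it easier to judge how imbalanced the target is. The sketch below is a minimal helper for that. The function name `class_balance` and the toy series are illustrative assumptions, not part of the original notebook or dataset.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.DataFrame:
    """Return the count and share of each class in a target column."""
    counts = target.value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


# Hypothetical toy target (not the bank data): 1 = churned, 0 = stayed
toy_target = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="Exited")
balance = class_balance(toy_target)
print(balance)
```

In this notebook the helper could be called as `class_balance(df['Exited'])`.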

In [None]:
# Correlation matrix
plt.figure(figsize=(12, 10))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
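
A full heatmap can be hard to scan when the question is only which features move with the target. One possible way to narrow it down is to pull out the target's column of the correlation matrix and sort it by absolute value. The helper name `correlations_with_target` and the toy frame are assumptions made for illustration; the column names `Age` and `Balance` in the toy data are hypothetical stand-ins.

```python
import pandas as pd


def correlations_with_target(frame: pd.DataFrame, target: str = "Exited") -> pd.Series:
    """Pearson correlation of each numeric/boolean column with the target,
    ordered from strongest to weakest by absolute value."""
    numeric = frame.select_dtypes(include=["number", "bool"])
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Hypothetical toy data, used only to show the call pattern
toy = pd.DataFrame({
    "Age": [25, 40, 52, 33, 61, 47],
    "Balance": [0.0, 500.0, 100.0, 900.0, 300.0, 700.0],
    "Exited": [0, 0, 1, 0, 1, 1],
})
ranked = correlations_with_target(toy)
print(ranked)
```

Correlation only captures linear association, so a weakly correlated feature may still be useful to a tree-based model.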

## 5. Model Building and Training

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define features and target
X = df.drop('Exited', axis=1)
y = df['Exited']

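# Split the data first so the test set stays unseen during preprocessing;
# stratify=y keeps the churn rate the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the scaler on the training data only, then apply it to both sets
# (Random Forest does not need scaling; it is kept so other models can reuse the data)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)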

# Train a Random Forest Classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
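
A single train/test split gives only one estimate of performance. As a hedged sketch of a complementary check, the function below runs stratified k-fold cross-validation with the scaler and classifier wrapped in a scikit-learn pipeline, so the scaler is refit on each training fold and never sees the held-out fold. The function name `cross_validated_recall`, the choice of recall as the score, and the synthetic demo data are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def cross_validated_recall(X, y, n_splits=5, random_state=42):
    """Recall of the positive class on each fold of a stratified k-fold split."""
    pipeline = make_pipeline(
        StandardScaler(),
        RandomForestClassifier(n_estimators=100, random_state=random_state),
    )
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return cross_val_score(pipeline, X, y, cv=folds, scoring="recall")


# Synthetic, imbalanced demo data (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)
scores = cross_validated_recall(X_demo, y_demo)
print(f"Mean recall: {scores.mean():.2f} (std {scores.std():.2f})")
```

In this notebook it could be called on the unscaled features and target as `cross_validated_recall(X, y)`.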

## 6. Model Evaluation

In [None]:
# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
print('\nClassification Report:')
print(classification_report(y_test, y_pred))
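# Plot the confusion matrix (rows: actual class, columns: predicted class)
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()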
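
The introduction sets out to identify the factors that contribute to churn. One way to read them off a fitted Random Forest is its impurity-based `feature_importances_` attribute, sketched below. The helper name `rank_feature_importances` and the synthetic demo data are assumptions for illustration. Impurity-based importances can favor features with many distinct values, so `sklearn.inspection.permutation_importance` is worth considering as a cross-check.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def rank_feature_importances(fitted_model, feature_names) -> pd.Series:
    """Pair a fitted tree ensemble's importances with feature names, largest first."""
    importances = pd.Series(fitted_model.feature_importances_, index=list(feature_names))
    return importances.sort_values(ascending=False)


# Synthetic demo data with hypothetical feature names (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
demo_names = [f"feature_{i}" for i in range(5)]
demo_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

ranking = rank_feature_importances(demo_model, demo_names)
print(ranking)
```

In this notebook the call would be `rank_feature_importances(model, X.columns)`, since `X` still carries the column names in the order the model was trained on.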

## 7. Conclusion
The Random Forest model provides a baseline for predicting customer churn. Accuracy alone can be misleading if churners are the minority class (the count plot in Section 4 shows the class balance), so the precision and recall for the churn class in the classification report are the more informative figures. Further improvements could come from trying other models such as Gradient Boosting, or from more advanced feature engineering.
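
As a rough sketch of the model comparison suggested above, the function below fits a Random Forest and scikit-learn's `GradientBoostingClassifier` on the same split and scores both with ROC AUC. The function name `compare_models`, the use of ROC AUC as the comparison metric, the default hyperparameters, and the synthetic demo data are all assumptions for illustration, not results from the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def compare_models(X_train, X_test, y_train, y_test, random_state=42):
    """Fit each candidate on the training set and return its test-set ROC AUC."""
    candidates = {
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=random_state),
        "gradient_boosting": GradientBoostingClassifier(random_state=random_state),
    }
    results = {}
    for name, estimator in candidates.items():
        estimator.fit(X_train, y_train)
        # Probability of the positive (churn) class
        churn_probability = estimator.predict_proba(X_test)[:, 1]
        results[name] = roc_auc_score(y_test, churn_probability)
    return results


# Synthetic, imbalanced demo data (not the bank dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)
auc_by_model = compare_models(Xd_train, Xd_test, yd_train, yd_test)
for name, auc in auc_by_model.items():
    print(f"{name}: ROC AUC = {auc:.3f}")
```

In this notebook it could be called as `compare_models(X_train, X_test, y_train, y_test)`.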