
# 📊 Task 3: Customer Churn Prediction

## 📝 Introduction and Problem Statement
This project involves predicting customer churn for a bank using structured data. 
The dataset contains features such as customer demographics, account information, and transactional behavior.
The objective is to classify whether a customer will exit (churn) based on these attributes; the target is the binary `Exited` column.
We'll perform data cleaning, exploratory data analysis (EDA), model training, and evaluation.


In [None]:

# Step 1: Import Libraries
# Import necessary libraries for data handling and visualization
import pandas as pd
import matplotlib.pyplot as plt  # For plotting
import seaborn as sns  # For statistical data visualization
# Import machine learning modules
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder  # For encoding categorical variables
from sklearn.ensemble import RandomForestClassifier  # Random Forest model
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score  # Evaluation metrics


In [None]:

# Step 2: Load Dataset
# Load the dataset from the CSV file into a pandas DataFrame
file_path = "Churn_Modelling.csv"  # Modify as needed
df = pd.read_csv(file_path)
# Display the first few rows to inspect the data structure
print(df.head())


In [None]:

# Step 2.1: Dataset Overview
print("Shape:", df.shape)
print("\nColumns:", df.columns.tolist())
print("\nFirst 5 rows:")
print(df.head())

print("\nInfo:")
df.info()  # info() prints its own summary and returns None, so no print() wrapper

print("\nMissing values:")
print(df.isnull().sum())
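
Alongside the overview above, it can help to check how balanced the target is, since accuracy alone is hard to interpret when one class dominates. The sketch below uses a small hypothetical DataFrame (`toy`, not the real data) and assumes the target column is named `Exited`, as in the later steps.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({"Exited": [0, 0, 0, 1, 0, 1, 0, 0]})

# Absolute count and relative share of each class
class_counts = toy["Exited"].value_counts()
class_share = toy["Exited"].value_counts(normalize=True)

print(class_counts)
print(class_share.round(3))
```

On the real data, the same two calls would be applied to `df["Exited"]`.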


In [None]:

# Step 3: Clean and Prepare Data
# Drop identifier columns that carry no predictive signal
df = df.drop(columns=["RowNumber", "CustomerId", "Surname"])

# Encode 'Gender' column as integers
# LabelEncoder sorts classes alphabetically, so Female -> 0 and Male -> 1
le = LabelEncoder()
df['Gender'] = le.fit_transform(df['Gender'])

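# One-hot encode 'Geography' column
# drop_first=True drops the first category, which becomes the implicit baseline;
# dtype=int gives 0/1 columns (pandas >= 2.0 would otherwise produce booleans)
df = pd.get_dummies(df, columns=['Geography'], drop_first=True, dtype=int)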
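
To make the effect of the two encoding steps concrete, here is a minimal sketch on a hypothetical three-row frame. The category values `"A"`, `"B"`, `"C"` and the names `toy`, `gender_mapping`, and `encoded` are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows; the Geography values are placeholders
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Geography": ["A", "B", "C"],
})

le = LabelEncoder()
toy["Gender"] = le.fit_transform(toy["Gender"])
# classes_ lists the original labels in the order of their integer codes
gender_mapping = {str(label): code for code, label in enumerate(le.classes_)}

# drop_first=True removes the first category ("A"), leaving it as the baseline
encoded = pd.get_dummies(toy, columns=["Geography"], drop_first=True, dtype=int)

print(gender_mapping)
print(encoded)
```

A row with zeros in every remaining `Geography_*` column belongs to the dropped baseline category.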


In [None]:

# Step 3.1: Exploratory Data Analysis (EDA)
# Histogram for numerical features
df.hist(figsize=(12, 10), bins=20, edgecolor='black')
plt.tight_layout()
plt.suptitle("Histograms of Features", y=1.02)
plt.show()

# Correlation heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Feature Correlation Heatmap")
plt.show()
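
A full heatmap can be hard to scan when the main question is how each feature relates to the target. One possible complement is to pull out only the target's column of the correlation matrix and sort it by absolute value. The sketch below uses a hypothetical toy frame; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; values are illustrative only
toy = pd.DataFrame({
    "Age": [25, 40, 52, 33, 61, 29],
    "Balance": [0.0, 5000.0, 200.0, 7000.0, 100.0, 3000.0],
    "Exited": [0, 0, 1, 0, 1, 0],
})

# Correlation of every feature with the target, strongest (by magnitude) first
target_corr = (
    toy.corr()["Exited"]
    .drop("Exited")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature being useful to a tree-based model.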


In [None]:

# Step 4: Split Features and Target
X = df.drop('Exited', axis=1)
y = df['Exited']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 5: Train Classification Model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Step 6: Make Predictions
y_pred = model.predict(X_test)
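
The split above is purely random. If the target turns out to be imbalanced, one option is to pass `stratify` to `train_test_split` so that both subsets keep the same class proportions. The sketch below demonstrates this on a hypothetical 80/20 target (`y_toy`) with a placeholder feature; it is not the notebook's actual split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 non-churners, 20 churners
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)  # placeholder feature

# stratify=y_toy keeps the 80/20 class ratio in both train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Churn share in train:", y_tr.mean())
print("Churn share in test:", y_te.mean())
```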


In [None]:

# Step 7: Evaluation
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
print("Accuracy Score:", accuracy_score(y_test, y_pred))
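
To show how the numbers in the classification report relate to the confusion matrix, the following sketch derives accuracy, precision, and recall by hand from hypothetical labels and predictions (`y_true` and `y_hat` are made up for illustration).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels and predictions, for illustration only
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of predicted churners, the share that actually churned
recall = tp / (tp + fn)     # of actual churners, the share the model caught

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

For churn, recall on the positive class is often the number to watch, because a missed churner (false negative) is typically the costly error.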


In [None]:

# Step 8: Feature Importance
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Plot feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x=importances, y=importances.index)
plt.title("Feature Importance in Churn Prediction")
plt.xlabel("Importance Score")
plt.ylabel("Features")
plt.tight_layout()
plt.show()
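
The `feature_importances_` attribute used above is impurity-based and computed from the training data. As a cross-check, scikit-learn also offers permutation importance, which measures how much a held-out score drops when one feature is shuffled. The sketch below is a swapped-in alternative, run on synthetic data from `make_classification` rather than the churn dataset; all variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the churn dataset
X_syn, y_syn = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out set and record the score drop
result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0)

# Print features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```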



## 📌 Conclusion

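- The dataset was preprocessed by removing identifier columns (`RowNumber`, `CustomerId`, `Surname`) and encoding the categorical variables `Gender` and `Geography`.
- Exploratory analysis used per-feature histograms and a correlation heatmap to examine how each feature relates to churn.
- A Random Forest Classifier (100 trees) was trained on an 80/20 train-test split and evaluated with a confusion matrix, a classification report, and accuracy (see the Step 7 output for the scores).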
- The feature-importance ranking places credit score, age, balance, and tenure among the influential predictors.
- This model can assist the bank in identifying customers at risk of leaving.
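
As a rough illustration of the last point, one way to surface at-risk customers is to rank them by the predicted probability of the positive class. The sketch below assumes synthetic data and hypothetical names (`customers`, `clf`, `at_risk`); it scores the training rows only to stay short, whereas in practice the model would score customers it has not seen.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the churn dataset
X_syn, y_syn = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=1
)
customers = pd.DataFrame(X_syn, columns=[f"f{i}" for i in range(4)])

clf = RandomForestClassifier(n_estimators=50, random_state=1).fit(customers, y_syn)

# Column 1 of predict_proba is the probability of the positive class (label 1)
churn_prob = clf.predict_proba(customers)[:, 1]

# Ten customers with the highest predicted churn probability
at_risk = (
    customers.assign(churn_prob=churn_prob)
    .sort_values("churn_prob", ascending=False)
    .head(10)
)
print(at_risk[["churn_prob"]])
```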
