# Customer Churn Prediction using Random Forest

## Objective:
This notebook predicts whether a customer will churn (i.e., exit a service) from features such as their demographics, account details, and behavior. We train a machine learning model that outputs each customer's probability of churning and judge it by the **Area Under the ROC Curve (AUC)**, which is the metric we aim to maximize.

## Dataset:
We work with two datasets:
1. **Train Dataset (`train.csv`)**: Contains customer details and the target variable, `Exited`, which indicates whether the customer has churned (1) or not (0).
2. **Test Dataset (`test.csv`)**: Contains customer details for which we need to predict the probability of them churning. The true `Exited` values are not provided in the test set, as it is used for final model evaluation.

### Key Features in the Dataset:
- `CreditScore`: Customer's credit score.
- `Age`: Customer's age.
- `Tenure`: Duration of the customer's account in years.
- `Balance`: The balance in the customer's account.
- `NumOfProducts`: Number of products the customer has with the company.
- `HasCrCard`: Whether the customer has a credit card (1 for Yes, 0 for No).
- `IsActiveMember`: Whether the customer is an active member (1 for Yes, 0 for No).
- `EstimatedSalary`: Estimated annual salary of the customer.
- `Geography`: Customer's geographic location.
- `Gender`: Customer's gender (Male or Female).

## Methodology:
1. **Data Preprocessing**:
   - Handle missing values using imputation techniques.
   - One-hot encode categorical features (`Gender`, `Geography`).
   - Split the data into training and validation sets.
   
2. **Model Selection**:
   - We use the **Random Forest Classifier**, an ensemble method, which is well-suited for binary classification tasks like customer churn prediction.

3. **Model Evaluation**:
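   - The model is evaluated using **Area Under the ROC Curve (AUC)**. AUC measures how well the predicted probabilities rank churners above non-churners, independent of any classification threshold, which makes it less sensitive to class imbalance than accuracy.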
   
4. **Prediction and Submission**:
   - The model's predicted probabilities are saved in a **submission file** in the correct format, which can be used for final evaluation.
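
To make the metric from step 3 concrete, here is a minimal sketch on four made-up customers. The labels and scores are invented for illustration and are not taken from the dataset. It computes AUC with scikit-learn and then again by hand, as the fraction of (churner, non-churner) pairs in which the churner receives the higher score.

```python
from itertools import product

from sklearn.metrics import roc_auc_score

# Toy labels and scores, invented purely for illustration
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC as computed by scikit-learn
auc = roc_auc_score(y_true, y_score)

# The same quantity by hand: the fraction of (churner, non-churner) pairs
# in which the churner gets the higher score (ties count as half)
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
pairwise_auc = wins / (len(pos) * len(neg))

print(auc, pairwise_auc)
```

Both values come out to 0.75 here, because three of the four pairs are ranked correctly. Only the ordering of the scores matters, so no classification threshold is involved.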

## Tools and Libraries Used:
- **Python Libraries**: pandas and scikit-learn (the only libraries the code in this notebook imports)
- **Machine Learning Model**: Random Forest Classifier
- **Evaluation Metric**: Area Under the ROC Curve (AUC)

## Instructions:
1. Ensure that the required libraries are installed before running the code.
2. The dataset files (`train.csv`, `test.csv`) should be loaded into the notebook environment.
3. After running the notebook, the model will generate predictions and save them in a `submission.csv` file, which can be used for final evaluation or submission to a competition platform (e.g., Kaggle).

## Final Notes:
- The model can be further improved by tuning hyperparameters and exploring other classification algorithms.
- Feature engineering, including interaction terms and advanced encoding methods, can also enhance performance.
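
As one possible starting point for the hyperparameter tuning mentioned above, the sketch below runs a small grid search scored by AUC. It uses synthetic data from `make_classification` as a stand-in for the churn features, and the grid values are arbitrary assumptions rather than recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the churn data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

# A deliberately tiny grid; the values are illustrative assumptions
param_grid = {"max_depth": [4, None], "min_samples_leaf": [1, 5]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    scoring="roc_auc",  # optimize the same metric the notebook reports
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated AUC: {search.best_score_:.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by the training features and target; a larger grid or `RandomizedSearchCV` may be more practical as the search space grows.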

### File Structure:
- `train.csv`: Training dataset
- `test.csv`: Test dataset
- `submission.csv`: Generated predictions for the test set


In [25]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.impute import SimpleImputer

In [26]:
#Load the data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

In [27]:
train_df.head()

|   | id | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 15674932 | Okwudilichukwu | 668 | France | Male | 33.0 | 3 | 0.0 | 2 | 1.0 | 0.0 | 181449.97 | 0 |
| 1 | 1 | 15749177 | Okwudiliolisa | 627 | France | Male | 33.0 | 1 | 0.0 | 2 | 1.0 | 1.0 | 49503.5 | 0 |
| 2 | 2 | 15694510 | Hsueh | 678 | France | Male | 40.0 | 10 | 0.0 | 2 | 1.0 | 0.0 | 184866.69 | 0 |
| 3 | 3 | 15741417 | Kao | 581 | France | Male | 34.0 | 2 | 148882.54 | 1 | 1.0 | 1.0 | 84560.88 | 0 |
| 4 | 4 | 15766172 | Chiemenam | 716 | Spain | Male | 33.0 | 5 | 0.0 | 2 | 1.0 | 1.0 | 15068.83 | 0 |


In [28]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14222 entries, 0 to 14221
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   id               14222 non-null  int64  
 1   CustomerId       14222 non-null  int64  
 2   Surname          14222 non-null  object 
 3   CreditScore      14222 non-null  int64  
 4   Geography        14222 non-null  object 
 5   Gender           14222 non-null  object 
 6   Age              14222 non-null  float64
 7   Tenure           14222 non-null  int64  
 8   Balance          14222 non-null  float64
 9   NumOfProducts    14222 non-null  int64  
 10  HasCrCard        14222 non-null  float64
 11  IsActiveMember   14222 non-null  float64
 12  EstimatedSalary  14222 non-null  float64
 13  Exited           14222 non-null  int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 1.5+ MB


In [29]:
# Impute missing numeric values with the column mean.
# Exclude the target ('Exited') and the identifier columns ('id', 'CustomerId'):
# identifiers are not features, and imputing them would cast 'id' to float,
# which would then be written as e.g. 15000.0 in the submission file.
# Note: train_df.info() above reports no missing values, so this step is a safeguard.
numeric_cols = train_df.select_dtypes(include=['float64', 'int64']).columns
numeric_cols = numeric_cols.drop(['Exited', 'id', 'CustomerId'])

imputer = SimpleImputer(strategy='mean')
# Fit on the training data only, then reuse the training means for the test set
train_df[numeric_cols] = imputer.fit_transform(train_df[numeric_cols])
test_df[numeric_cols] = imputer.transform(test_df[numeric_cols])
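
The fit-on-train, transform-on-test pattern used in this cell can be seen on a toy example. The frames below are invented for illustration; the point is that gaps in the test set are filled with means learned from the training set, not from the test set itself.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frames with gaps (not the real data)
train_demo = pd.DataFrame({"Age": [30.0, np.nan, 50.0], "Balance": [0.0, 100.0, 200.0]})
test_demo = pd.DataFrame({"Age": [np.nan], "Balance": [np.nan]})

imputer_demo = SimpleImputer(strategy="mean")

# fit_transform learns the training means (Age: 40, Balance: 100) and fills the gaps
train_filled = pd.DataFrame(imputer_demo.fit_transform(train_demo), columns=train_demo.columns)

# transform reuses those training means for the test set
test_filled = pd.DataFrame(imputer_demo.transform(test_demo), columns=test_demo.columns)

print(train_filled)
print(test_filled)
```

Reusing the training statistics keeps information from the test set out of preprocessing.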

In [30]:
# Handle categorical columns (e.g., 'Geography' and 'Gender')
# Fill missing categorical values with 'Unknown' (for consistency)
train_df['Geography'] = train_df['Geography'].fillna('Unknown')
train_df['Gender'] = train_df['Gender'].fillna('Unknown')
test_df['Geography'] = test_df['Geography'].fillna('Unknown')
test_df['Gender'] = test_df['Gender'].fillna('Unknown')

# One-Hot Encoding of 'Geography' and 'Gender' columns
train_df = pd.get_dummies(train_df, columns=['Geography', 'Gender'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Geography', 'Gender'], drop_first=True)

In [35]:
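# Align the test set's columns with the train set's so both have identical features.
# Dummy columns that are absent from the test set are created and filled with 0.
# This also adds a placeholder 'Exited' column (all 0) to test_df; it is dropped before prediction.
train_columns = train_df.columns
test_df = test_df.reindex(columns=train_columns, fill_value=0)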
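
The reason for the reindexing step is easiest to see on a toy case where the test set lacks one category. The sketch below uses invented data: encoding the two frames separately yields different columns, and `reindex` restores the missing one as all zeros.

```python
import pandas as pd

# Made-up frames: the test set contains no 'Germany' rows
train_demo = pd.DataFrame({"Geography": ["France", "Spain", "Germany"]})
test_demo = pd.DataFrame({"Geography": ["France", "Spain"]})

train_enc = pd.get_dummies(train_demo, columns=["Geography"], drop_first=True)
test_enc = pd.get_dummies(test_demo, columns=["Geography"], drop_first=True)

print(list(train_enc.columns))  # ['Geography_Germany', 'Geography_Spain']
print(list(test_enc.columns))   # ['Geography_Spain']

# Reindex the test frame to the training columns; missing dummies become 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

Without this alignment, a model fitted on the training columns would reject or misread the test frame.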

In [36]:
# Define features (X) and target (y) for training
X = train_df.drop(columns=['id', 'CustomerId', 'Surname', 'Exited'])  # Drop identifiers and the target
y = train_df['Exited']

# Train-validation split: hold out 20% of the training data for validation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
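
The split above is not stratified, so the churn rate in the validation set can drift from the rate in the training set. As an optional variation (not what this notebook does, and it would change the reported score), `train_test_split` accepts a `stratify` argument that preserves the class ratio. A minimal sketch on made-up labels with a 20% positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 80 non-churners, 20 churners
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Both parts keep the 20% churn rate of the full data
print(y_tr.mean(), y_va.mean())
```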

In [37]:
# Train the Random Forest model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate the model on the validation set
y_val_pred = model.predict_proba(X_val)[:, 1]
roc_auc = roc_auc_score(y_val, y_val_pred)
print(f"Validation ROC-AUC: {roc_auc}")

Validation ROC-AUC: 0.8704390926885004
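
A single hold-out split gives one estimate of AUC. To get a sense of how much that estimate varies, one option is k-fold cross-validation. The sketch below uses synthetic data as a stand-in for the churn features; the fold count and forest size are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the churn data (not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Five stratified folds, each used once as the validation set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo,
    y_demo,
    scoring="roc_auc",
    cv=cv,
)

print(f"AUC per fold: {scores.round(3)}")
print(f"Mean AUC: {scores.mean():.3f} (std {scores.std():.3f})")
```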


In [40]:
# Prepare the test set for prediction: drop the identifiers and the placeholder
# 'Exited' column that the earlier reindex step added to test_df
X_test = test_df.drop(columns=['id', 'CustomerId', 'Surname', 'Exited'])

# Get predicted probabilities for the test set
y_test_pred = model.predict_proba(X_test)[:, 1]

# Prepare the submission file
submission = pd.DataFrame({
    'id': test_df['id'],  # 'id' column to keep track of the customers
    'Exited': y_test_pred  # Predicted probabilities for the "Exited" column
})

# Save the submission file
submission.to_csv('submission.csv', index=False)
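
Before uploading, it can be worth checking the file's shape and value ranges. The helper below is a hypothetical sketch (`find_submission_problems` is not part of the notebook or of any library). It assumes the expected format is exactly the two columns written above, `id` and `Exited`, with one probability per customer, and it is demonstrated on a made-up frame round-tripped through CSV in memory.

```python
import io

import pandas as pd


def find_submission_problems(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if list(df.columns) != ["id", "Exited"]:
        problems.append("columns should be exactly ['id', 'Exited']")
        return problems
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    if df["Exited"].isna().any():
        problems.append("missing predictions")
    if not df["Exited"].between(0, 1).all():
        problems.append("probabilities outside [0, 1]")
    return problems


# Made-up submission, written to and read back from an in-memory CSV
demo = pd.DataFrame({"id": [1, 2, 3], "Exited": [0.12, 0.87, 0.05]})
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

print(find_submission_problems(round_trip))  # an empty list means the checks passed
```

On the real output, the same function could be applied to `pd.read_csv('submission.csv')`.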