# Customer Churn Prediction - Code Workflow Guide

## Project Overview

This notebook implements a Random Forest classifier to predict customer churn in a banking dataset. Below is a detailed explanation of each step in the code workflow.

---

## Part 1: Data Preprocessing

### Step 1: Import and Load Dataset

```python
import pandas as pd
dataset = pd.read_csv('churn_modelling.csv')
```

**What it does:**
- Imports the pandas library for data manipulation
- Loads the CSV file into a DataFrame called `dataset`
- The dataset contains customer information for churn prediction

### Step 2: Explore the Data

```python
dataset.head()
```

**What it does:**
- Displays the first 5 rows of the dataset
- Helps understand the structure and content of the data
- Shows column names and sample values

### Step 3: Check for Missing Data

```python
dataset.info()
```

**What it does:**
- Provides summary information about the DataFrame
- Shows data types of each column
- Displays count of non-null values (to identify missing data)
- Reports memory usage
- **Result**: Confirms there are no missing values in this dataset
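The missing-value check can also be made explicit by counting nulls per column. The sketch below uses a small made-up frame; the `toy` name and its values are illustrative and not taken from the churn dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing value
toy = pd.DataFrame({
    "CreditScore": [600, 720, np.nan],
    "Age": [40, 35, 52],
})

# Number of missing entries per column; all zeros means nothing to impute
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

Run against the real `dataset`, a column of zeros from `isnull().sum()` would match what `dataset.info()` reports.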

---

## Data Cleaning & Feature Engineering

### Step 4: Remove Non-Predictive Columns

```python
dataset.drop(['CustomerId', 'Surname'], axis=1, inplace=True)
```

**What it does:**
- Removes `CustomerId` and `Surname` columns
- `axis=1` specifies column removal (not row)
- `inplace=True` modifies the original DataFrame
- **Why**: These columns identify individual customers rather than describe their behavior, so they carry no predictive value
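If mutating `dataset` in place is undesirable (re-running an `inplace=True` drop cell raises a `KeyError` because the columns are already gone), a non-mutating variant is possible. This is a sketch on a made-up two-row frame with illustrative values.

```python
import pandas as pd

# Hypothetical toy frame standing in for the churn data
toy = pd.DataFrame({
    "CustomerId": [1, 2],
    "Surname": ["Smith", "Jones"],
    "CreditScore": [600, 720],
})

# drop(columns=...) returns a new DataFrame and leaves `toy` untouched
cleaned = toy.drop(columns=["CustomerId", "Surname"])
print(list(cleaned.columns))
print(list(toy.columns))
```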

### Step 5: Handle Geography (Categorical Variable)

```python
dataset['Geography'].unique()
```

**What it does:**
- Shows all unique values in the Geography column
- **Result**: ['France', 'Germany', 'Spain']

```python
geography_dummies = pd.get_dummies(dataset['Geography'], drop_first=True)
```

**What it does:**
- Converts categorical Geography into binary dummy variables
- `drop_first=True` removes one category to avoid multicollinearity
- Creates: `Germany` and `Spain` columns (France is the reference category)
- **Why**: Machine learning models need numerical input
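To see what `drop_first=True` changes, the following sketch encodes a short made-up `Geography` series with and without the option. The sample values are illustrative.

```python
import pandas as pd

geo = pd.Series(["France", "Germany", "Spain", "France"], name="Geography")

full = pd.get_dummies(geo)                      # one column per category
reduced = pd.get_dummies(geo, drop_first=True)  # first category (alphabetical) dropped

print(list(full.columns))
print(list(reduced.columns))

# A row with 0 in both remaining columns encodes the dropped category, France
france_rows = reduced.astype(int).sum(axis=1) == 0
print(france_rows.tolist())
```

No information is lost by dropping `France`: it is recoverable as "neither Germany nor Spain".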

```python
dataset = pd.concat([geography_dummies, dataset], axis=1)
dataset.drop(['Geography'], axis=1, inplace=True)
```

**What it does:**
- `pd.concat()` adds the dummy columns to the left of the dataset
- `axis=1` means concatenate horizontally (add columns)
- Removes the original `Geography` column
- **Result**: Two new binary columns (Germany, Spain) replace Geography
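As an alternative to the separate `get_dummies`, `concat`, and `drop` calls, pandas can encode and replace the column in one call. The sketch below uses a made-up frame. Note that this form is not what the notebook does: it prefixes the new column names and appends them at the end of the frame.

```python
import pandas as pd

# Hypothetical toy frame with the target as the last column
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain"],
    "CreditScore": [600, 720, 650],
    "Exited": [0, 1, 0],
})

# Encode and remove the original column in one step
encoded = pd.get_dummies(toy, columns=["Geography"], drop_first=True)
print(list(encoded.columns))
```

Because the dummy columns land after `Exited`, position-based selection of the target (last column) would no longer work with this variant; the target would have to be selected by name.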

### Step 6: Handle Gender (Binary Categorical Variable)

```python
dataset['Gender'].unique()
```

**What it does:**
- Shows unique values: ['Female', 'Male']

```python
dataset['Gender'] = dataset['Gender'].apply(lambda x: 0 if x == 'Female' else 1)
```

**What it does:**
- Uses a lambda function to encode gender as binary
- Female → 0, Male → 1
- `apply()` applies the function to each value in the `Gender` column
- **Why**: Simpler than dummy variables for binary categories
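An equivalent encoding can be written with `Series.map` and an explicit dictionary. The sketch below compares the two approaches on a made-up series.

```python
import pandas as pd

gender = pd.Series(["Female", "Male", "Male", "Female"])

# Lambda version from the notebook: anything that is not 'Female' becomes 1
via_apply = gender.apply(lambda x: 0 if x == "Female" else 1)

# Dictionary version: labels missing from the dict become NaN instead of 1
via_map = gender.map({"Female": 0, "Male": 1})

print(via_map.tolist())
print(via_apply.tolist() == via_map.tolist())
```

The practical difference is how unexpected labels are handled: the lambda silently maps them to 1, while `map` leaves a `NaN` that is easy to spot.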

---

## Train-Test Split

### Step 7: Separate Features and Target

```python
X = dataset.iloc[:, :-1].values
```

**What it does:**
- Selects all columns except the last one (features/inputs)
- `iloc[:, :-1]` means: all rows, all columns except last
- `.values` converts DataFrame to numpy array
- **X** = Independent variables (predictors)

```python
y = dataset.iloc[:, -1].values
```

**What it does:**
- Selects only the last column (target variable)
- `iloc[:, -1]` means: all rows, last column only
- **y** = Dependent variable (what we're predicting: Exited)
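The position-based selection above relies on `Exited` being the last column. The sketch below, on a made-up frame, shows the same split next to a name-based variant that does not depend on column order.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the churn dataset
toy = pd.DataFrame({
    "CreditScore": [600, 720, 650],
    "Age": [40, 35, 52],
    "Exited": [0, 1, 0],
})

# Position-based selection, as in the notebook
X_pos = toy.iloc[:, :-1].values
y_pos = toy.iloc[:, -1].values

# Name-based selection: independent of where `Exited` sits
X_named = toy.drop(columns="Exited").values
y_named = toy["Exited"].values

print(X_pos.shape, y_pos.shape)
```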

### Step 8: Split Data into Training and Test Sets

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

**What it does:**
- Splits data into 80% training, 20% testing
- `X_train`, `y_train`: Used to train the model
- `X_test`, `y_test`: Used to evaluate the model on unseen data
- `random_state=0`: Ensures reproducible splits
- **Why**: Evaluating on data the model has never seen gives an honest estimate of performance and reveals overfitting (it does not prevent it)
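The split proportions are easy to verify on a small synthetic array. The sketch below also passes `stratify`, which the notebook does not use; it is an optional assumption here that keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 samples, 2 features, 25% positive class
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 15 + [1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(X_tr.shape, X_te.shape)   # 80% / 20% of the rows
print(y_tr.sum(), y_te.sum())   # positives in each part
```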

---

## Part 2: Building and Training the Model

### Step 9: Create the Random Forest Model

```python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
```

**What it does:**
- Imports the Random Forest algorithm
- Creates a model with:
  - `n_estimators=100`: Uses 100 decision trees
  - `max_depth=4`: Limits each tree to a maximum depth of 4 levels
  - `random_state=0`: Ensures reproducible results
- **Why Random Forest**: Handles non-linear patterns, and averaging many trees makes it less prone to overfitting than a single decision tree

### Step 10: Train the Model

```python
model.fit(X_train, y_train)
```

**What it does:**
- Trains the Random Forest on the training data
- The model learns patterns from `X_train` to predict `y_train`
- Builds 100 decision trees, each on a bootstrap sample of the training rows, considering a random subset of features at each split
- **Result**: Trained model ready for predictions
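After fitting, the forest's structure can be inspected to confirm that the hyperparameters took effect. The sketch below trains on synthetic data from `make_classification` (a stand-in with 11 features, not the churn data) and checks the number and depth of the trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the churn features (11 columns, like the notebook's X)
X_syn, y_syn = make_classification(n_samples=300, n_features=11, random_state=0)

forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
forest.fit(X_syn, y_syn)

# estimators_ holds the individual fitted decision trees
n_trees = len(forest.estimators_)
deepest = max(tree.get_depth() for tree in forest.estimators_)
print(n_trees, deepest)
```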

### Step 11: Make Predictions

```python
y_pred = model.predict(X_test)
```

**What it does:**
- Uses the trained model to predict outcomes for test data
- `y_pred` contains predicted values (0 or 1)
- These predictions are compared against actual values (`y_test`)
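Besides hard 0/1 labels, a fitted forest can return class probabilities through `predict_proba`. The sketch below uses synthetic data and a hypothetical 0.3 threshold to illustrate how more customers could be flagged as at risk; neither the data nor the threshold comes from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: first 240 rows for training, last 60 for prediction
X_syn, y_syn = make_classification(n_samples=300, n_features=11, random_state=0)
forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
forest.fit(X_syn[:240], y_syn[:240])

proba = forest.predict_proba(X_syn[240:])  # shape (n_samples, 2)
churn_score = proba[:, 1]                  # estimated probability of class 1

# Hypothetical lower threshold to flag more at-risk customers
flagged = (churn_score >= 0.3).astype(int)
labels = forest.predict(X_syn[240:])       # default decision: most probable class

print(flagged.sum(), labels.sum())
```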

---

## Part 3: Model Evaluation

### Step 12: Predict Single Customer

```python
model.predict([[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])
```

**What it does:**
- Makes a prediction for one specific customer
- Input format: `[Germany, Spain, CreditScore, Gender, Age, Tenure, Balance, NumOfProducts, HasCrCard, IsActiveMember, EstimatedSalary]`
- This customer: France (0,0), Male, 40 years old, etc.
- **Output**: 0 (won't churn) or 1 (will churn)
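Because the model receives a bare list, the order of values matters. One way to reduce ordering mistakes is to build the row from named fields. This is a sketch: `FEATURE_ORDER` and `customer` are hypothetical helper names, and the values are those of the customer described above.

```python
# Column order the model was trained on (from the input format above)
FEATURE_ORDER = [
    "Germany", "Spain", "CreditScore", "Gender", "Age", "Tenure",
    "Balance", "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary",
]

customer = {
    "Germany": 0, "Spain": 0,   # both 0 means France
    "CreditScore": 600,
    "Gender": 1,                # 1 = Male
    "Age": 40,
    "Tenure": 3,
    "Balance": 60000,
    "NumOfProducts": 2,
    "HasCrCard": 1,
    "IsActiveMember": 1,
    "EstimatedSalary": 50000,
}

# Assemble the values in training order; a missing key raises KeyError
row = [customer[name] for name in FEATURE_ORDER]
print([row])  # this nested list is what model.predict(...) expects
```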

### Step 13: Confusion Matrix

```python
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)
```

**What it does:**
- Creates a 2x2 matrix comparing predictions vs actual values
- **Format** (rows are actual classes, columns are predicted classes):
  ```
  [[True Negative   False Positive]
   [False Negative  True Positive]]
  ```
- Shows where the model was correct/incorrect
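For a binary problem the four cells can be unpacked directly with `.ravel()`. The sketch below uses short made-up label lists rather than the notebook's `y_test` and `y_pred`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_hat = [0, 1, 0, 1, 0, 1, 0, 1]

# ravel() flattens [[TN, FP], [FN, TP]] row by row
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)
```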

### Step 14: Calculate Accuracy (Manual)

```python
(1521+208)/(1521+208+74+197)
```

**What it does:**
- Manually calculates accuracy from confusion matrix
- Formula: (Correct Predictions) / (Total Predictions)
- (TN + TP) / (TN + TP + FN + FP)
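The same calculation can be written against the matrix itself: accuracy is the sum of the diagonal divided by the sum of all cells. In the sketch below, which off-diagonal cell holds 74 and which holds 197 is an assumption; swapping them does not change the accuracy.

```python
import numpy as np

# Counts from the manual calculation above; the placement of 74 and 197
# in the off-diagonal cells is assumed and does not affect accuracy
cm = np.array([[1521, 74],
               [197, 208]])

accuracy = np.trace(cm) / cm.sum()  # correct predictions / all predictions
print(round(accuracy, 4))
```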

### Step 15: Calculate Accuracy (Automated)

```python
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
```

**What it does:**
- Uses sklearn's built-in accuracy function
- Compares `y_test` (actual) with `y_pred` (predicted)
- **Result**: ~86.45% accuracy
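`accuracy_score` can also return the raw count of correct predictions through `normalize=False`, which ties it back to the manual formula. A sketch on made-up labels:

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration
y_true = [0, 0, 0, 1, 1, 1, 0, 1]
y_hat = [0, 1, 0, 1, 0, 1, 0, 1]

fraction_correct = accuracy_score(y_true, y_hat)
n_correct = accuracy_score(y_true, y_hat, normalize=False)
print(fraction_correct, n_correct)
```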

### Step 16: Cross-Validation

```python
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator=model,
                             X=X,
                             y=y,
                             scoring='accuracy',
                             cv=10)
print(f"Average Accuracy: {accuracies.mean()*100:.2f} %")
print(f"Standard Deviation: {accuracies.std()*100:.2f} %")
```

**What it does:**
- Performs 10-fold cross-validation
- **Process**: 
  1. Splits data into 10 parts
  2. Trains on 9 parts, tests on 1 part
  3. Repeats 10 times with different test parts
  4. Calculates accuracy for each fold
- `accuracies.mean()`: Average accuracy across all folds
- `accuracies.std()`: Variation in accuracy (consistency measure)
- **Why**: More reliable than a single train-test split
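The process above can be sketched as an explicit loop. For classifiers, scikit-learn uses stratified folds when `cv` is an integer, so the sketch uses `StratifiedKFold`. It runs on synthetic data with 5 folds and 20 trees to stay fast; those sizes are assumptions, not the notebook's settings.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data and a smaller forest for speed
X_syn, y_syn = make_classification(n_samples=200, n_features=11, random_state=0)
forest = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=0)

folds = StratifiedKFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in folds.split(X_syn, y_syn):
    fold_model = clone(forest)  # fresh, unfitted copy for each fold
    fold_model.fit(X_syn[train_idx], y_syn[train_idx])
    fold_pred = fold_model.predict(X_syn[test_idx])
    manual_scores.append(accuracy_score(y_syn[test_idx], fold_pred))

# The built-in helper run over the same folds, for comparison
auto_scores = cross_val_score(forest, X_syn, y_syn, scoring="accuracy", cv=folds)

print(np.mean(manual_scores), np.std(manual_scores))
```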

---

## Key Takeaways

**Workflow Summary:**
1. **Load data** → Check structure and missing values
2. **Clean data** → Remove unnecessary columns
3. **Encode categorical variables** → Convert text to numbers
4. **Split data** → Training (80%) and Testing (20%)
5. **Train model** → Random Forest learns patterns
6. **Predict** → Generate predictions on test data
7. **Evaluate** → Measure accuracy and reliability

**Model Performance:**
- **Accuracy**: ~86% (about 86 of every 100 test-set customers are classified correctly)
- **Cross-validation**: Checks whether accuracy stays consistent across different data splits

**Use Case:**
The bank can use this model to identify customers likely to leave and take proactive retention measures.
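The encoding and modelling steps could also be bundled into a single scikit-learn `Pipeline`, so that raw customer records can be passed straight to `predict`. This is an alternative arrangement, not what the notebook does. The sketch uses a handful of made-up rows and only a subset of the columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France", "Germany", "Spain"],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Age": [40, 35, 52, 29, 61, 44],
    "Balance": [60000.0, 0.0, 125000.0, 3000.0, 98000.0, 15000.0],
    "Exited": [0, 0, 1, 0, 1, 0],
})
X_toy = toy.drop(columns="Exited")
y_toy = toy["Exited"]

# One-hot encode the text columns (dropping the first category of each);
# numeric columns pass through unchanged
encode = ColumnTransformer(
    [("categories", OneHotEncoder(drop="first"), ["Geography", "Gender"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("encode", encode),
    ("forest", RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)),
])
pipeline.fit(X_toy, y_toy)

# New customers are described with raw values; encoding happens inside the pipeline
new_customer = pd.DataFrame([{
    "Geography": "France", "Gender": "Male", "Age": 40, "Balance": 60000.0,
}])
prediction = pipeline.predict(new_customer)
print(prediction)
```

With this arrangement the encoder is fitted only on the data passed to `fit`, so the same object can be handed to `cross_val_score` without encoding the full dataset up front.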

# Random Forest Classifier

## Part 1 - Data Preprocessing

### Importing the dataset

In [None]:
import pandas as pd
dataset = pd.read_csv('churn_modelling.csv')

In [None]:
dataset.head()

### Checking missing data

In [None]:
dataset.info()

In [None]:
# The dataset we are using is related to customer churn modeling. It contains information about customers
# such as their credit score, geography, gender, age, tenure, balance, number of products, whether they have
# a credit card, whether they are active members, their estimated salary, and whether they exited (churned).
# 
# Our goal is to preprocess this data and build a machine learning model to predict customer churn. We will
# check for missing data, encode categorical variables, and then train a Random Forest classifier and
# evaluate its performance.


### Handling categorical variables

CustomerId and Surname columns

In [None]:
dataset.drop(['CustomerId', 'Surname'], axis = 1, inplace = True)

In [None]:
dataset.head()

Geography column

In [None]:
dataset['Geography'].unique()

In [None]:
geography_dummies = pd.get_dummies(dataset['Geography'], drop_first = True)

In [None]:
geography_dummies

In [None]:
dataset = pd.concat([geography_dummies, dataset], axis = 1)

In [None]:
dataset.head()

In [None]:
dataset.drop(['Geography'], axis = 1, inplace = True)

In [None]:
dataset.head()

Gender column

In [None]:
dataset['Gender'].unique()

In [None]:
dataset['Gender'] = dataset['Gender'].apply(lambda x: 0 if x == 'Female' else 1)

In [None]:
dataset.head(10)

### Creating the Training Set and the Test Set

Getting the inputs and output

In [None]:
X = dataset.iloc[:, :-1].values

In [None]:
y = dataset.iloc[:, -1].values

In [None]:
X

In [None]:
y

Getting the Training Set and the Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Part 2 - Building and training the model

### Building the model

In [None]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)

### Training the model

In [None]:
model.fit(X_train, y_train)

### Inference

In [None]:
y_pred = model.predict(X_test)

In [None]:
y_pred

In [None]:
y_test

### Predicting the result of a single observation

**Homework**

Use our model to predict whether the customer with the following information will leave the bank:

Geography: France

Credit Score: 600

Gender: Male

Age: 40 years old

Tenure: 3 years

Balance: \$ 60000

Number of Products: 2

Does this customer have a credit card? Yes

Is this customer an Active Member: Yes

Estimated Salary: \$ 50000

So, should we say goodbye to that customer?

In [None]:
model.predict([[0, 0, 600, 1, 40, 3, 60000, 2, 1, 1, 50000]])

## Part 3 - Evaluating the model

### Making the Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

### Accuracy

In [None]:
(1521+208)/(1521+208+74+197)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

### k-Fold Cross Validation

In [None]:
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = model,
                             X = X,
                             y = y,
                             scoring = 'accuracy',
                             cv = 10)
print(f"Average Accuracy: {accuracies.mean()*100:.2f} %")
print(f"Standard Deviation: {accuracies.std()*100:.2f} %")