# Data Preprocessing
Welcome! In this notebook, we will prepare our data so that an AI/Machine Learning model can learn from it. 

Most machine learning models only accept **numbers** as input, and many of them fail when they meet **missing data**. Preprocessing is where we fix these issues!

### Step 1: Import Libraries and Load Data

In [1]:
%pip install scikit-learn pandas numpy

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OrdinalEncoder

df = pd.read_csv('loan_data.csv')
df.head()



Note: you may need to restart the kernel to use updated packages.


Unnamed: 0,person_age,person_gender,person_education,person_income,person_emp_exp,person_home_ownership,loan_amnt,loan_intent,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,credit_score,previous_loan_defaults_on_file,loan_status
0,22.0,female,Master,71948.0,0,RENT,35000.0,PERSONAL,16.02,0.49,3.0,561,No,1
1,21.0,female,High School,12282.0,0,OWN,1000.0,EDUCATION,11.14,0.08,2.0,504,Yes,0
2,25.0,female,High School,12438.0,3,MORTGAGE,5500.0,MEDICAL,12.87,0.44,3.0,635,No,1
3,23.0,female,Bachelor,79753.0,0,RENT,35000.0,MEDICAL,15.23,0.44,2.0,675,No,1
4,24.0,male,Master,66135.0,1,RENT,35000.0,MEDICAL,14.27,0.53,4.0,586,No,1


### Step 2: Handle Missing Data
Our EDA suggested there might be missing values. For simplicity, we drop every row that contains one. In more advanced scenarios, you might "impute" them instead, that is, fill each gap with an estimate such as the column mean or median.

In [2]:
print("Rows before dropping missing values:", len(df))
df = df.dropna()
print("Rows after dropping missing values:", len(df))

Rows before dropping missing values: 45000
Rows after dropping missing values: 45000
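
Both counts are 45000, so this dataset has no missing values and nothing was dropped. For datasets that do have gaps, the imputation idea mentioned above could look roughly like the sketch below. It uses scikit-learn's `SimpleImputer` with the median strategy; the `toy` DataFrame and its values are made up for illustration and are not taken from `loan_data.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with gaps (NOT rows from loan_data.csv)
toy = pd.DataFrame({
    "person_income": [50000.0, np.nan, 70000.0, 90000.0],
    "loan_int_rate": [11.0, 13.0, np.nan, 15.0],
})

# Replace each missing value with the median of its own column
imputer = SimpleImputer(strategy="median")
toy_imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(toy_imputed)
```

In a full workflow you would usually fit the imputer on the training rows only and then reuse it on the test rows, for the same reason the scaler in Step 6 is fitted on the training data.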


### Step 3: Encoding Categorical Data (Converting Text to Numbers)
Most models can't work directly with text values like `"High School"` or `"Bachelor"`. We need to convert them to numbers.

`person_education` has an order (High School < Associate < Bachelor < Master < Doctorate). We use **Ordinal Encoding** for this.

In [3]:
education_order = [['High School', 'Associate', 'Bachelor', 'Master', 'Doctorate']]
ordinal_enc = OrdinalEncoder(categories=education_order)

# Create a new column with the encoded numbers
df['person_education_encoded'] = ordinal_enc.fit_transform(df[['person_education']])

# Now we drop the original text column
df = df.drop(columns=['person_education'])
df[['person_education_encoded']].head()

Unnamed: 0,person_education_encoded
0,3.0
1,0.0
2,0.0
3,2.0
4,3.0
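
To see how the codes relate to the category list, here is a small sketch on made-up values. Each code is simply the position of the label in the order we passed in, and `inverse_transform` maps the codes back to labels. The names `demo_order`, `demo_df`, and `demo_enc` are chosen for this illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same order as in the cell above, repeated so this sketch runs on its own
demo_order = [['High School', 'Associate', 'Bachelor', 'Master', 'Doctorate']]

# Hypothetical values, not rows from loan_data.csv
demo_df = pd.DataFrame({"person_education": ["Master", "High School", "Doctorate"]})

demo_enc = OrdinalEncoder(categories=demo_order)
codes = demo_enc.fit_transform(demo_df[["person_education"]])

# Each code is the index of the label within demo_order
print(codes.ravel().tolist())   # [3.0, 0.0, 4.0]

# inverse_transform turns the codes back into the original labels
labels = demo_enc.inverse_transform(codes)
print(labels.ravel().tolist())
```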


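For text columns without a natural order (like `person_gender` or `loan_intent`), we use **One-Hot Encoding**, which creates one True/False column per category. For beginners, `pd.get_dummies()` is a convenient way to do this in a single call. With `drop_first=True`, the first category of each column is left out, because it is already implied when all the other columns for that feature are False.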

In [4]:
# Also, map simple Yes/No to 1/0
df['previous_loan_defaults_on_file'] = df['previous_loan_defaults_on_file'].map({'Yes': 1, 'No': 0})

# One-Hot Encoding for remaining text columns
text_columns = ['person_gender', 'person_home_ownership', 'loan_intent']
df = pd.get_dummies(df, columns=text_columns, drop_first=True)

df.head()

Unnamed: 0,person_age,person_income,person_emp_exp,loan_amnt,loan_int_rate,loan_percent_income,cb_person_cred_hist_length,credit_score,previous_loan_defaults_on_file,loan_status,person_education_encoded,person_gender_male,person_home_ownership_OTHER,person_home_ownership_OWN,person_home_ownership_RENT,loan_intent_EDUCATION,loan_intent_HOMEIMPROVEMENT,loan_intent_MEDICAL,loan_intent_PERSONAL,loan_intent_VENTURE
0,22.0,71948.0,0,35000.0,16.02,0.49,3.0,561,0,1,3.0,False,False,False,True,False,False,False,True,False
1,21.0,12282.0,0,1000.0,11.14,0.08,2.0,504,1,0,0.0,False,False,True,False,True,False,False,False,False
2,25.0,12438.0,3,5500.0,12.87,0.44,3.0,635,0,1,0.0,False,False,False,False,False,False,True,False,False
3,23.0,79753.0,0,35000.0,15.23,0.44,2.0,675,0,1,2.0,False,False,False,True,False,False,True,False,False
4,24.0,66135.0,1,35000.0,14.27,0.53,4.0,586,0,1,3.0,True,False,False,True,False,False,True,False,False


### Step 4: Split Data into Features (X) and Target (y)
We want to predict `loan_status`. So, `loan_status` is our **Target (y)**. Everything else is our **Features (X)**.

In [5]:
X = df.drop(columns=['loan_status'])
y = df['loan_status']

### Step 5: Train / Test Split
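We must hide some data from our AI model while it trains, so that we can later test it on rows it has never seen. This is called the Train/Test Split. Here we hold out 20% of the rows (`test_size=0.2`), and `random_state=42` makes the split reproducible.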

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("Training data rows:", len(X_train))
print("Testing data rows:", len(X_test))

Training data rows: 36000
Testing data rows: 9000
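
If one class of the target is much rarer than the other, a purely random split can leave the train and test sets with different class proportions. A possible refinement, shown here on made-up data rather than on `loan_status`, is the `stratify` argument of `train_test_split`, which keeps the class shares the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 zeros and 20 ones
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("Share of class 1 in train:", y_tr.mean())
print("Share of class 1 in test:", y_te.mean())
```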


### Step 6: Feature Scaling
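Some features have large values (like `person_income` = 80000), and some have small ones (like `loan_int_rate` = 12). Models that rely on distances or gradient-based optimization, such as Logistic Regression, can be dominated by the large-valued features when the scales differ.
We use a **StandardScaler**, which rescales each feature so that it has a mean of 0 and a standard deviation of 1 on the training data.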

In [7]:
scaler = StandardScaler()

# We 'fit' (learn the scale) and 'transform' (apply the scale) on training data
X_train_scaled = scaler.fit_transform(X_train)

# We ONLY 'transform' the test data (so it uses the exact same scaling as the training data)
X_test_scaled = scaler.transform(X_test)
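
To make the scaling concrete, the sketch below reproduces what `StandardScaler` computes for a single column: z = (x - mean) / std, where the mean and standard deviation come from the training data only. The income values are made up, and the names `scaler_demo`, `train_col`, and `test_col` are used only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical income values, not rows from loan_data.csv
train_col = np.array([[20000.0], [40000.0], [60000.0], [80000.0]])
test_col = np.array([[50000.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_col)

# Same calculation by hand: z = (x - mean) / std (population std, ddof=0)
manual = (train_col - train_col.mean()) / train_col.std()
print("Matches manual formula:", np.allclose(train_scaled, manual))

# The test value is scaled with the TRAINING mean and std.
# 50000 equals the training mean, so it maps to 0.
print(scaler_demo.transform(test_col))
```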

### Next Steps: AI Model Training (Suggestions)

Congratulations! Your data is fully cleaned and preprocessed. 

The snippet below shows how you could train a **Logistic Regression** model and a **Random Forest** model on this processed data:

```python
# 1. Import Models and Metrics
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# 2. Train Logistic Regression (Simple, Fast)
log_model = LogisticRegression()
log_model.fit(X_train_scaled, y_train)

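# Predict and check accuracy
log_preds = log_model.predict(X_test_scaled)
print("Logistic Regression Accuracy:", accuracy_score(y_test, log_preds))
# Per-class precision, recall, and F1 (uses the classification_report import above)
print(classification_report(y_test, log_preds))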

# 3. Train Random Forest (More Powerful, Handles Non-Linear Data)
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Note: Random Forest doesn't strictly NEED scaling, but it doesn't hurt
rf_model.fit(X_train_scaled, y_train)

# Predict and check accuracy
rf_preds = rf_model.predict(X_test_scaled)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_preds))
```