## Credit Risk Model - LendingClub Dataset
#### Project Assignment #4

### 1. Data Loading
#### Step 1.1: Import Libraries
Start by importing pandas and NumPy, the two libraries used throughout this notebook.

In [1]:
import pandas as pd
import numpy as np

#### Step 1.2: Load the Dataset
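- Ensure the dataset (`accepted_2007_to_2018q4.csv`) is in the same folder as this notebook.
- Use `pd.read_csv` with `chunksize` to read it in pieces, then combine the pieces into one DataFrame.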

In [2]:
# Initialize an empty list to store the chunks
data_chunks = []

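In [6]:
# Load the dataset in chunks.
# Note: data_chunks is created in a separate cell, so re-running this cell
# without resetting the list appends every chunk again and duplicates rows.
chunk_size = 10000  # Adjust based on your system's memory
for chunk in pd.read_csv('accepted_2007_to_2018q4.csv', chunksize=chunk_size, low_memory=False):
    data_chunks.append(chunk)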

In [7]:
# Combine all chunks into a single DataFrame
data = pd.concat(data_chunks)
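
The loading cells above can also be folded into a single helper, so that the chunk list is rebuilt on every call and the combined frame gets a fresh index. The following is a minimal sketch, not the notebook's own code: `load_in_chunks` is a hypothetical name, and the in-memory CSV is a small stand-in built from the first three rows shown in this notebook.

```python
import io
import pandas as pd

def load_in_chunks(source, chunk_size=10000, usecols=None):
    """Read a CSV in chunks and return one DataFrame with a fresh 0..n-1 index."""
    # The list is local, so calling the function again cannot append to old chunks.
    chunks = list(pd.read_csv(source, chunksize=chunk_size, usecols=usecols))
    return pd.concat(chunks, ignore_index=True)

# Tiny in-memory stand-in for the real file (hypothetical, for illustration only).
sample_csv = io.StringIO(
    "loan_amnt,int_rate,grade\n"
    "3600,13.99,C\n"
    "24700,11.99,C\n"
    "20000,10.78,B\n"
)
demo = load_in_chunks(sample_csv, chunk_size=2, usecols=["loan_amnt", "grade"])
print(demo.shape)
```

Passing `usecols` reads only the columns that are needed later, which may reduce memory use on a file of this size.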

In [8]:
# Display the first few rows
data.head()

| | id | member_id | loan_amnt | funded_amnt | funded_amnt_inv | term | int_rate | installment | grade | sub_grade | ... | hardship_payoff_balance_amount | hardship_last_payment_amount | disbursement_method | debt_settlement_flag | debt_settlement_flag_date | settlement_status | settlement_date | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 68407277 | | 3600.0 | 3600.0 | 3600.0 | 36 months | 13.99 | 123.03 | C | C4 | ... | | | Cash | N | | | | | | |
| 1 | 68355089 | | 24700.0 | 24700.0 | 24700.0 | 36 months | 11.99 | 820.28 | C | C1 | ... | | | Cash | N | | | | | | |
| 2 | 68341763 | | 20000.0 | 20000.0 | 20000.0 | 60 months | 10.78 | 432.66 | B | B4 | ... | | | Cash | N | | | | | | |
| 3 | 66310712 | | 35000.0 | 35000.0 | 35000.0 | 60 months | 14.85 | 829.9 | C | C5 | ... | | | Cash | N | | | | | | |
| 4 | 68476807 | | 10400.0 | 10400.0 | 10400.0 | 60 months | 22.45 | 289.91 | F | F1 | ... | | | Cash | N | | | | | | |


### 2. Data Exploration
#### Step 2.1: Select Relevant Columns
For this project, we'll use three numeric and three categorical variables. Replace column names with ones from your dataset.

In [9]:
# Selecting columns
numeric_cols = ['loan_amnt', 'funded_amnt', 'int_rate']  # Example numeric variables
categorical_cols = ['term', 'grade', 'home_ownership']  # Example categorical variables

In [10]:
# Subsetting the data
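# .copy() makes selected_data an independent DataFrame, so later assignments
# do not trigger "setting a value on a copy of a slice" warnings
selected_data = data[numeric_cols + categorical_cols].copy()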

In [11]:
# Display basic information
selected_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4521402 entries, 0 to 2260700
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   loan_amnt       float64
 1   funded_amnt     float64
 2   int_rate        float64
 3   term            object 
 4   grade           object 
 5   home_ownership  object 
dtypes: float64(3), object(3)
memory usage: 241.5+ MB
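
The index in the output above runs from 0 to 2260700, which is 2,260,701 distinct labels, yet the frame reports 4,521,402 entries, exactly twice as many. One plausible explanation (an assumption, not something the notebook confirms) is that the loading cell ran twice against the same `data_chunks` list. A minimal sketch, using a toy frame, of how that situation can be detected and undone:

```python
import pandas as pd

chunk = pd.DataFrame({"loan_amnt": [3600.0, 24700.0, 20000.0]})

# Appending the same chunk twice mimics re-running the loading cell.
doubled = pd.concat([chunk, chunk])

# Repeated index labels are a quick sign that rows were loaded more than once.
has_repeated_labels = doubled.index.duplicated().any()
print(len(doubled), has_repeated_labels)

# Keep only the first occurrence of each index label.
deduplicated = doubled[~doubled.index.duplicated(keep="first")]
print(len(deduplicated))
```

If the same check on `data` reports repeated labels, every count and frequency in the summaries that follow would be doubled, while means, medians, and modes would be unaffected.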


#### Step 2.2: Summary Statistics
Summarize numeric and categorical variables to understand the data.

In [12]:
# Summary of numeric variables
selected_data[numeric_cols].describe()

| | loan_amnt | funded_amnt | int_rate |
|---|---|---|---|
| count | 4521336.0 | 4521336.0 | 4521336.0 |
| mean | 15046.93 | 15041.66 | 13.09283 |
| std | 9190.244 | 9188.412 | 4.832138 |
| min | 500.0 | 500.0 | 5.31 |
| 25% | 8000.0 | 8000.0 | 9.49 |
| 50% | 12900.0 | 12875.0 | 12.62 |
| 75% | 20000.0 | 20000.0 | 15.99 |
| max | 40000.0 | 40000.0 | 30.99 |


In [14]:
# Summary of categorical variables
selected_data[categorical_cols].describe()

| | term | grade | home_ownership |
|---|---|---|---|
| count | 4521336 | 4521336 | 4521336 |
| unique | 2 | 7 | 6 |
| top | 36 months | B | MORTGAGE |
| freq | 3219508 | 1327114 | 2222900 |


### 3. Handling Missing Values
#### Step 3.1: Identify Missing Values
Find out how many values are missing in each column.

In [15]:
# Checking missing values
selected_data.isnull().sum()

loan_amnt         66
funded_amnt       66
int_rate          66
term              66
grade             66
home_ownership    66
dtype: int64
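
Every selected column is missing exactly 66 values, which suggests (but does not prove) that the same rows are empty across all six columns. If so, those rows carry no information to impute from, and dropping them is an alternative to filling them. A minimal sketch on a toy frame; the column values are illustrative:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "loan_amnt": [3600.0, np.nan, 20000.0],
    "int_rate": [13.99, np.nan, 10.78],
    "grade": ["C", np.nan, "B"],
})

# Flag rows in which every column is missing.
fully_empty = toy.isna().all(axis=1)
print(int(fully_empty.sum()))

# Drop only the rows that are empty in all columns.
cleaned = toy.dropna(how="all")
print(int(cleaned.isna().sum().sum()))
```

Running `selected_data.isna().all(axis=1).sum()` on the real data would show whether the 66 missing values per column come from the same rows.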

#### Step 3.2: Impute Missing Values
We'll use the median for numeric variables and the mode for categorical ones.

In [17]:
# Impute numeric variables with their medians.
# Reassigning the result avoids chained `inplace=True` calls, which act on a
# temporary copy and raise FutureWarning / SettingWithCopyWarning in pandas.
selected_data = selected_data.fillna(selected_data[numeric_cols].median())

In [18]:
# Impute categorical variables with their modes (the first mode if there are ties).
# As above, reassign the result instead of calling fillna with inplace=True on a column.
selected_data = selected_data.fillna(selected_data[categorical_cols].mode().iloc[0])


In [19]:
# Confirm there are no missing values
selected_data.isnull().sum()

loan_amnt         0
funded_amnt       0
int_rate          0
term              0
grade             0
home_ownership    0
dtype: int64

### 4. Encoding Categorical Variables
#### Step 4.1: One-Hot Encoding
Convert categorical variables into numeric format using one-hot encoding.

In [20]:
# One-hot encoding
encoded_data = pd.get_dummies(selected_data, columns=categorical_cols, drop_first=True)

# Check the transformed dataset
encoded_data.head()

| | loan_amnt | funded_amnt | int_rate | term_ 60 months | grade_B | grade_C | grade_D | grade_E | grade_F | grade_G | home_ownership_MORTGAGE | home_ownership_NONE | home_ownership_OTHER | home_ownership_OWN | home_ownership_RENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3600.0 | 3600.0 | 13.99 | False | False | True | False | False | False | False | True | False | False | False | False |
| 1 | 24700.0 | 24700.0 | 11.99 | False | False | True | False | False | False | False | True | False | False | False | False |
| 2 | 20000.0 | 20000.0 | 10.78 | True | True | False | False | False | False | False | True | False | False | False | False |
| 3 | 35000.0 | 35000.0 | 14.85 | True | False | True | False | False | False | False | True | False | False | False | False |
| 4 | 10400.0 | 10400.0 | 22.45 | True | False | False | False | False | True | False | True | False | False | False | False |
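
The column name `term_ 60 months` in the output above contains a space after the underscore, which suggests the raw `term` values carry leading whitespace (an inference from the column name, not verified against the file). The sketch below, on a toy frame, strips that whitespace before encoding and uses `dtype=int` to get 0/1 columns instead of booleans. With `drop_first=True`, the first category of each variable in sorted order becomes the implicit baseline.

```python
import pandas as pd

toy = pd.DataFrame({
    "term": [" 36 months", " 60 months", " 36 months"],
    "grade": ["C", "B", "A"],
})

# Strip stray whitespace so the dummy column is 'term_60 months', not 'term_ 60 months'.
toy["term"] = toy["term"].str.strip()

# drop_first=True drops '36 months' and grade 'A', which become the baselines.
encoded = pd.get_dummies(toy, columns=["term", "grade"], drop_first=True, dtype=int)
print(encoded.columns.tolist())
```

Dropping one level per variable avoids perfectly collinear dummy columns, which matters for linear models such as logistic regression.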


### 5. Scaling Numerical Features
#### Step 5.1: Standardizing Numeric Features
Use standardization to scale numeric variables.

In [21]:
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Scale numeric features
encoded_data[numeric_cols] = scaler.fit_transform(encoded_data[numeric_cols])

# Check the scaled dataset
encoded_data.head()

| | loan_amnt | funded_amnt | int_rate | term_ 60 months | grade_B | grade_C | grade_D | grade_E | grade_F | grade_G | home_ownership_MORTGAGE | home_ownership_NONE | home_ownership_OTHER | home_ownership_OWN | home_ownership_RENT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.245558 | -1.245233 | 0.18567 | False | False | True | False | False | False | False | True | False | False | False | False |
| 1 | 1.050371 | 1.051154 | -0.228228 | False | False | True | False | False | False | False | True | False | False | False | False |
| 2 | 0.538956 | 0.539636 | -0.478637 | True | True | False | False | False | False | False | True | False | False | False | False |
| 3 | 2.171133 | 2.172139 | 0.363647 | True | False | True | False | False | False | False | True | False | False | False | False |
| 4 | -0.505638 | -0.505165 | 1.936461 | True | False | False | False | False | True | False | True | False | False | False | False |
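
Standardization replaces each value x with z = (x - mean) / std. As a rough check against the summary statistics reported earlier, (3600 - 15046.93) / 9190.244 is approximately -1.2456, which agrees with the scaled `loan_amnt` in the first row. The sketch below reproduces the calculation by hand with pandas on a toy column; it uses the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import pandas as pd

toy = pd.DataFrame({"loan_amnt": [3600.0, 24700.0, 20000.0, 35000.0, 10400.0]})

# z = (x - mean) / std, with the population standard deviation (ddof=0).
mean = toy["loan_amnt"].mean()
std = toy["loan_amnt"].std(ddof=0)
toy["loan_amnt_scaled"] = (toy["loan_amnt"] - mean) / std

# After scaling, the column has mean 0 and population standard deviation 1
# (up to floating-point error).
scaled_mean = toy["loan_amnt_scaled"].mean()
scaled_std = toy["loan_amnt_scaled"].std(ddof=0)
print(abs(scaled_mean) < 1e-9, abs(scaled_std - 1.0) < 1e-9)
```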


### 6. Final Steps
#### Step 6.1: Save Processed Data
Save the cleaned and transformed data to a CSV file for further use.

In [22]:
# Save the processed data
encoded_data.to_csv('processed_lending_club_data.csv', index=False)
print("Data saved successfully!")

Data saved successfully!
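
A quick way to confirm that a save worked is to read the file back and compare its shape and column names with the frame that was written. The sketch below shows the idea on a toy frame and writes to an in-memory buffer rather than to disk; the two columns are illustrative stand-ins for `encoded_data`.

```python
import io
import pandas as pd

processed = pd.DataFrame({
    "loan_amnt": [-1.245558, 1.050371],
    "grade_C": [True, True],
})

# Write to an in-memory buffer so the sketch leaves nothing on disk.
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
buffer.seek(0)

# Read the data back and compare shape and column names with the original.
restored = pd.read_csv(buffer)
same_shape = restored.shape == processed.shape
same_columns = restored.columns.tolist() == processed.columns.tolist()
print(same_shape, same_columns)
```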
