### Business Understanding
Project Idea: Use the LendingClub dataset to build a classification model that estimates the probability that a loan will default rather than be paid back, based on borrower information and loan details.
#### Objectives
1. Data Collection
2. Data Preprocessing
3. Exploratory Data Analysis (EDA)
4. Feature Engineering
5. Modeling
6. Evaluation
7. Conclusion

### Data Understanding
Data source: the LendingClub dataset, available on Kaggle.

### Load data

In [3]:
import pandas as pd

# Load the dataset
data = pd.read_csv('data/Loading_the_data.csv')
data.head()

FileNotFoundError: [Errno 2] No such file or directory: 'data/Loading_the_data.csv'
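
The `FileNotFoundError` above means the relative path does not resolve from the notebook's working directory. The following is a minimal sketch of a loader that fails early with a more informative message. `load_loans` is a hypothetical helper, and the demo file and its columns (`loan_amnt`, `loan_status` values) are made up for illustration; they are not taken from the real dataset.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_loans(csv_path):
    """Read a CSV into a DataFrame, failing early with a clearer message.

    Hypothetical helper: the name and error text are illustrative only.
    """
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; current working directory is {Path.cwd()}"
        )
    return pd.read_csv(path)


# Demo on a throwaway CSV so the sketch runs without the real dataset.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_loans.csv"
    demo_path.write_text(
        "loan_amnt,loan_status\n1000,Fully Paid\n2500,Charged Off\n"
    )
    demo = load_loans(demo_path)

print(demo.shape)
```

Printing `Path.cwd()` in the error makes it easier to see whether the notebook was started from a different directory than the one containing `data/`.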

### Preprocess the data

In [None]:
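# Handling missing values (note: this drops every row that has any missing value)
data = data.dropna()

# Encode the target as integer codes so it stays in a single 'loan_status' column
# (assumes loan_status is binary, as the modeling cells below require)
y_codes = data['loan_status'].astype('category').cat.codes

# Encoding categorical variables (features only, so the target is not one-hot encoded)
features = pd.get_dummies(data.drop(columns='loan_status'), drop_first=True)

# Normalizing numerical features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
data_scaled = scaler.fit_transform(features)

# Recombine so later cells can still use data['loan_status']
data = features.assign(loan_status=y_codes)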
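
Fitting the scaler on the full dataset lets test-set statistics influence the transformation. One common way to avoid this is to put preprocessing inside a scikit-learn `Pipeline`, so it is fitted on the training split only. The sketch below assumes a toy frame with made-up columns (`loan_amnt`, `int_rate`, `term`) and a 0/1 `loan_status`; the real dataset's columns may differ.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with made-up columns and values, for illustration only.
toy = pd.DataFrame({
    "loan_amnt": [1000, 2500, 4000, 1200, 3300, 5000, 800, 2700],
    "int_rate": [7.5, 13.2, 18.9, 8.1, 15.0, 21.3, 6.9, 12.4],
    "term": ["36m", "60m", "60m", "36m", "60m", "60m", "36m", "36m"],
    "loan_status": [0, 1, 1, 0, 0, 1, 0, 1],
})

X = toy.drop(columns="loan_status")
y = toy["loan_status"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scale numeric columns and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["loan_amnt", "int_rate"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["term"]),
])

# The pipeline fits the preprocessing on the training split only.
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)

proba = model.predict_proba(X_test)[:, 1]
print(proba.shape)
```

`handle_unknown="ignore"` keeps prediction from failing if a category appears in the test split that was absent from the training split.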


### EDA

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Distribution of target variable
sns.countplot(x='loan_status', data=data)
plt.show()

# Correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
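
The count plot shows the class balance visually. It can help to also print the counts and shares, since accuracy is hard to interpret when one class dominates. A minimal sketch, using made-up labels rather than values from the real dataset:

```python
import pandas as pd

# Made-up labels for illustration; not taken from the real dataset.
status = pd.Series(
    ["Fully Paid"] * 8 + ["Charged Off"] * 2, name="loan_status"
)

counts = status.value_counts()                # absolute counts per class
shares = status.value_counts(normalize=True)  # fraction of rows per class
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)
```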


### Feature Engineering

In [None]:
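# Feature selection based on absolute correlation with the target
# (loan_status correlates perfectly with itself, so it stays in the selection)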
correlation = data.corr()
target_corr = correlation['loan_status'].abs().sort_values(ascending=False)
important_features = target_corr[target_corr > 0.1].index

data = data[important_features]
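
The selection step can be sketched as a small function and checked on a toy frame. `select_by_target_corr` is a hypothetical helper that mirrors the cell above, and the columns `strong` and `unrelated` are invented for the demo. Note that computing correlations on the full dataset before splitting uses information from rows that later end up in the test set.

```python
import pandas as pd


def select_by_target_corr(frame, target, threshold=0.1):
    """Return columns whose absolute Pearson correlation with `target`
    exceeds `threshold`. The target itself always qualifies.

    Hypothetical helper mirroring the selection cell above.
    """
    corr = frame.corr()[target].abs().sort_values(ascending=False)
    return list(corr[corr > threshold].index)


# Toy data with invented columns.
toy = pd.DataFrame({
    "loan_status": [0, 0, 0, 0, 1, 1, 1, 1],
    "strong": [1, 2, 1, 2, 8, 9, 8, 9],     # tracks the target closely
    "unrelated": [5, 6, 6, 5, 5, 6, 6, 5],  # same mean in both classes
})

kept = select_by_target_corr(toy, "loan_status")
print(kept)
```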


### Modeling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb

# Split data
X = data.drop('loan_status', axis=1)
y = data['loan_status']

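# Stratify so both splits keep the same share of each loan_status class
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Logistic Regression (higher max_iter to reduce convergence warnings)
log_reg = LogisticRegression(max_iter=1000)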
log_reg.fit(X_train, y_train)

# Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

# Gradient Boosting
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)

# XGBoost
xgb_model = xgb.XGBClassifier()
xgb_model.fit(X_train, y_train)
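
A single train/test split gives one performance estimate per model. Stratified k-fold cross-validation is one way to see how stable that estimate is. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan data, so the printed scores say nothing about the real dataset; the model settings are assumptions chosen to keep the demo fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the loan data (not real results).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=42
)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Each fold keeps roughly the same class balance as the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_auc = {
    name: cross_val_score(est, X_demo, y_demo, cv=cv, scoring="roc_auc")
    for name, est in candidates.items()
}

for name, scores in cv_auc.items():
    print(f"{name}: mean AUC-ROC {scores.mean():.3f} over {len(scores)} folds")
```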


### Evaluate model performance

In [None]:
from sklearn.metrics import classification_report, roc_auc_score

# Fitted models from the modeling cell, keyed by display name
models = {
    'Logistic Regression': log_reg,
    'Random Forest': rf,
    'Gradient Boosting': gb,
    'XGBoost': xgb_model,
}

# Print the classification report and AUC-ROC for each model
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name)
    print(classification_report(y_test, y_pred))
    print('AUC-ROC:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
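
The printed reports are easier to compare once the key metrics sit side by side. The sketch below collects precision, recall, and AUC-ROC for the positive class into one table. It runs on synthetic data with two illustrative models, so the numbers are not results for the LendingClub dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not real results).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

fitted = {
    "Logistic Regression": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "Random Forest": RandomForestClassifier(
        n_estimators=50, random_state=0
    ).fit(X_tr, y_tr),
}

# One row of metrics per model.
rows = []
for name, model in fitted.items():
    pred = model.predict(X_te)
    rows.append({
        "model": name,
        "precision": precision_score(y_te, pred, zero_division=0),
        "recall": recall_score(y_te, pred, zero_division=0),
        "auc_roc": roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]),
    })

results = (
    pd.DataFrame(rows).set_index("model").sort_values("auc_roc", ascending=False)
)
print(results.round(3))
```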


### Report findings