## Simple Model Tests

The main objective of the first project is to validate the ability to split the dataset into train and test sets stratified by the target class, and to train and compare the performance of different models such as decision tree, random forest, and xgboost. For this project, we will use accuracy as the evaluation metric.


## NBFI Vehicle Loan repayment Dataset

link de kaggle: https://www.kaggle.com/datasets/meastanmay/nbfi-vehicle-loan-repayment-dataset

In this project, we will be exploring and learn from a dataset that contains information on customers who have taken out vehicle loans from a non-banking financial institution (NBFI). 

> Our main objective is to build a classification machine learning model that can predict whether or not a customer is likely to default on their loan repayment.

The capacity to **accurately** predict loan defaults is crucial for financial institutions to manage risk and ensure the financial health of their business. By the end of this lab, you will have gained valuable experience in building, evaluating, and selecting machine learning models for classification tasks, and how to apply them to real-world financial data.


### Loading and examining the data

**Task 1** Load the data and start the Exploratory Data Analysis (EDA). 

> EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task and provides a provides a better understanding of data set variables and the relationships between them. 

Read the dataset `NBFI_Train.csv` CSV file using read_csv pandas command, then answer the following questions:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
df = pd.read_csv('NBFI_Train.csv')
df.head()

  df = pd.read_csv('NBFI_Train.csv')


Unnamed: 0,ID,Client_Income,Car_Owned,Bike_Owned,Active_Loan,House_Own,Child_Count,Credit_Amount,Loan_Annuity,Accompany_Client,...,Client_Permanent_Match_Tag,Client_Contact_Work_Tag,Type_Organization,Score_Source_1,Score_Source_2,Score_Source_3,Social_Circle_Default,Phone_Change,Credit_Bureau,Default
0,12142509,6750,0.0,0.0,1.0,0.0,0.0,61190.55,3416.85,Alone,...,Yes,Yes,Self-employed,0.568066,0.478787,,0.0186,63.0,,0
1,12138936,20250,1.0,0.0,1.0,,0.0,15282.0,1826.55,Alone,...,Yes,Yes,Government,0.56336,0.215068,,,,,0
2,12181264,18000,0.0,0.0,1.0,0.0,1.0,59527.35,2788.2,Alone,...,Yes,Yes,Self-employed,,0.552795,0.329655,0.0742,277.0,0.0,0
3,12188929,15750,0.0,0.0,1.0,1.0,0.0,53870.4,2295.45,Alone,...,Yes,Yes,XNA,,0.135182,0.631355,,1700.0,3.0,0
4,12133385,33750,1.0,0.0,1.0,0.0,2.0,133988.4,3547.35,Alone,...,Yes,Yes,Business Entity Type 3,0.508199,0.301182,0.355639,0.2021,674.0,1.0,0


In [4]:
#How many null values present the dataset?
missing=df.isnull().sum()/df.shape[0]*100
missing.loc[missing>0].shape

(33,)

In [5]:
#Drop rows for columns which have more than 30% null values

df.dropna(subset=['Score_Source_1','Social_Circle_Default','Own_House_Age','Client_Occupation'],how='any',inplace=True)

**Task 2** Impute the null values of the varibles Car_Owned, Bike_Owned, Active_Loan,House_Own      
and Child_Count using simple imputer. 

In [6]:
from sklearn.impute import SimpleImputer
import numpy as np

si = SimpleImputer(strategy="most_frequent",missing_values=np.nan)

df.iloc[:, 2:7] = si.fit_transform(df.iloc[:,2:7])
# Verify that the null entries have been imputed
df.iloc[:, 2:7].isnull().sum()

Car_Owned      0
Bike_Owned     0
Active_Loan    0
House_Own      0
Child_Count    0
dtype: int64

In [7]:
df.iloc[:, 8:16] = si.fit_transform(df.iloc[:, 8:16])
df.iloc[:, 20:33] = si.fit_transform(df.iloc[:, 20:33])
df.iloc[:, 37:39] = si.fit_transform(df.iloc[:, 37:39])

**Task 3** Effective feature engineering is based on sound knowledge of the business problem and the available data sources. 

A machine learning algorithm needs to be able to understand the data it receives. There are plenty of methods to encode categorical variables into numeric and each method comes with its own advantages and disadvantages. 

First, run the following code:

In [8]:
df.dropna(how='any',inplace=True)
df_clean=df.reset_index(drop=True)

In [9]:
df_clean.isnull().sum()

ID                            0
Client_Income                 0
Car_Owned                     0
Bike_Owned                    0
Active_Loan                   0
House_Own                     0
Child_Count                   0
Credit_Amount                 0
Loan_Annuity                  0
Accompany_Client              0
Client_Income_Type            0
Client_Education              0
Client_Marital_Status         0
Client_Gender                 0
Loan_Contract_Type            0
Client_Housing_Type           0
Population_Region_Relative    0
Age_Days                      0
Employed_Days                 0
Registration_Days             0
ID_Days                       0
Own_House_Age                 0
Mobile_Tag                    0
Homephone_Tag                 0
Workphone_Working             0
Client_Occupation             0
Client_Family_Members         0
Cleint_City_Rating            0
Application_Process_Day       0
Application_Process_Hour      0
Client_Permanent_Match_Tag    0
Client_C

Apply the OneHotEncoder from scikit-learn to encode the categorical columns.

First, store the name of the categorical columns in a variable `categorical_columns`.

Concatenate the result with the numerical variables in a new dataframe called `data_preprocessed`.
For this task 

In [13]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical_columns = df_clean.drop(['Default','ID'], axis=1).select_dtypes(include=['object']).columns
df_clean[categorical_columns] = df_clean[categorical_columns].astype(str)

In [14]:
# OneHotEncoder
ohe = OneHotEncoder()
color_ohe = ohe.fit_transform(df_clean.loc[:, categorical_columns])

# Create DataFrame from encoded data
ohe_df = pd.DataFrame(color_ohe.toarray(), columns=ohe.get_feature_names_out(categorical_columns))

# Concatenate with the original DataFrame
data_preprocessed = pd.concat([df_clean.drop(categorical_columns, axis=1), ohe_df], axis=1)

**Task 4** It's time to separate the features and the target and split the dataset in train and test set.


In [15]:
from sklearn.model_selection import train_test_split

# target and features
X = data_preprocessed.drop(['Default','ID'], axis=1)
y = data_preprocessed['Default']

# split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2, random_state=0)

**Task 5** Normalize the dataset to ensure that all features are on a similar scale. This step is crucial for logistic regression, as it helps prevent certain features from dominating the others in the model's learning process. 

You should use StandardScaler to standardize the features and store the results in the variables `X_train_scaler` and `X_test_scaler`.

In [16]:
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object
scaler = StandardScaler()

# Fit the scaler on the training data
X_train_scaler = scaler.fit_transform(X_train)

# Use the fitted scaler to transform the testing data
X_test_scaler = scaler.fit_transform(X_test)

**Task 6** Train the **logistic regression** model using the training dataset. Apply the regularization technique during the training process to ensure the model generalizes well to unseen data. Utilize an optimization algorithm, such as gradient descent or a variant of it, to find the optimal set of coefficients that minimize the loss function.


In [17]:
from sklearn.linear_model import LogisticRegression

# Create a logistic regression model with regularization
logreg_model = LogisticRegression(C=1.0, penalty='l2')

# Fit the model on the normalized training data
logreg_model.fit(X_train_scaler, y_train)

**Task 7** Evaluate the performance of the trained logistic regression model using the testing dataset.

You should use the predict method of the trained logreg_model to make predictions on the normalized testing data. Calculate and store the predictions in the variable y_pred. Then, utilize appropriate evaluation metrics to assess the model's performance, such as accuracy, precision, recall, and F1-score. Store the results in their respective variables: accuracy, precision, recall, and f1_score.

In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Make predictions on the normalized testing data
y_pred = logreg_model.predict(X_test_scaler)

# Calculate evaluation metrics
print("accuracy =", accuracy_score(y_test, y_pred))
print("precision =", precision_score(y_test, y_pred))
print("recall =", recall_score(y_test, y_pred))
print("f1_score =", f1_score(y_test, y_pred))


accuracy = 0.9524324324324325
precision = 1.0
recall = 0.18518518518518517
f1_score = 0.3125


#### Conclusion
The objective of this lab was to explore and apply various techniques for feature engineering in the context of logistic regression. Feature engineering plays a crucial role in improving model performance by creating new features from existing data or selecting relevant features.