# Lending Club Data
Lending Club has made an anonymized dataset available for anyone to study. If you do not have a Lending Club account, the data is also available on Kaggle.

Lending Club is a platform that allows loans to be crowdfunded. Investors can browse the profiles of people applying for loans and decide whether or not to help fund them.

In this assignment you will build a model that predicts the largest loan amount that will be successfully funded for any given individual. This model can then be used to advise the applicants on how much they could apply for.

## Cleaning the data
### Cleaning Rejected Data Stats
Here I drop the variables that are not needed: policy code, application date, state, and zip code. Net income and credit do differ between states, but I removed the location variables for the following reasons.

First, zip codes are not ordered from the poorest neighborhood to the richest, so a smaller zip code number does not imply a lower income than a larger one. Even as a categorical variable the column is of limited use, because it contains only the first three digits of the zip code, which makes it hard to identify individual neighborhoods.

Second, I removed the state variable because I did not want the model to discriminate against people based on where they come from. An applicant may come from a state with lower average creditworthiness and still have an important reason to borrow. Consider applicants A and B who are identical in every respect except that A comes from a state with high creditworthiness and B does not: a model that uses the state could reject B for that reason alone. (Trying to be ethical while making money.)

Lastly, I tried to keep the Loan Title variable, since it is the borrower's stated reason for the loan. However, it has too many distinct values (73929 unique values), and training with it looked like it would take far too long on my local machine. So, for the sake of the assignment, I dropped that column as well.
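One possible way to keep some of the Loan Title information without tens of thousands of indicator columns is to normalize the text and keep only the most frequent titles. The sketch below is an assumption, not something done in this notebook: `bucket_titles` and `top_k` are hypothetical names, the titles are toy values, and the normalization only handles case and whitespace (it would not merge typos such as 'creditt card loan').

```python
from collections import Counter


def bucket_titles(titles, top_k=2):
    """Normalize free-text titles and keep only the top_k most common ones."""
    # Lowercase and collapse whitespace so 'Consolidating Debt ' == 'consolidating debt'
    normalized = [" ".join(str(t).lower().split()) for t in titles]
    common = {title for title, _ in Counter(normalized).most_common(top_k)}
    # Everything outside the most common titles falls into one 'other' bucket
    return [t if t in common else "other" for t in normalized]


# Toy titles (made up for illustration)
titles = [
    "Consolidating Debt", "consolidating debt ", "Credit Card",
    "credit card", "waksman", "mdrigo",
]
bucketed = bucket_titles(titles, top_k=2)
print(bucketed)
print(len(set(bucketed)))  # number of categories left to one-hot encode
```

With a fixed `top_k`, the number of indicator columns is bounded no matter how many distinct titles the raw data contains.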

### Cleaning the Accepted Data Stats
As the assignment instructions state, the accepted data has a large number of columns and many missing values. Here I drop the columns that have too many missing values, even if they seem relevant for screening a potential borrower, along with the columns that are unnecessary. Some columns also seemed redundant because they differ only in the time window they cover (30 days, 90 days, 120 days, etc.), so I kept the one from each group that seemed most important to me and dropped the rest. The reason is that this level of specificity might cause the model to overfit.
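The rule of dropping columns with too many missing values can be sketched as a small helper. This is a minimal illustration under assumptions: `drop_sparse_columns` and the 50% threshold are hypothetical choices, and the toy frame only stands in for the accepted data.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing_pct=50.0):
    """Return a copy of df without columns whose missing rate exceeds the threshold."""
    # Percentage of missing values per column, same formula as used in this notebook
    missing_pct = df.isnull().sum() * 100 / len(df)
    keep = missing_pct[missing_pct <= max_missing_pct].index
    return df[keep].copy()


# Toy frame standing in for the accepted data (values are made up)
toy = pd.DataFrame({
    "loan_amnt": [3600.0, 24700.0, 20000.0, 35000.0],
    "dti": [5.91, 16.06, np.nan, 17.06],
    "settlement_amount": [np.nan, np.nan, np.nan, 500.0],
})

cleaned = drop_sparse_columns(toy, max_missing_pct=50.0)
print(list(cleaned.columns))
```

Here `dti` (25% missing) is kept and `settlement_amount` (75% missing) is dropped; the right threshold for the real data is a judgment call.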

In [1]:
import pandas as pd
import pystan
import scipy.stats as sts
import numpy as np
import matplotlib.pyplot as plt
import re
import seaborn as sns
from sklearn import preprocessing

#first, manually download the CSV files
#then load them here
data_reject = pd.read_csv(r'C:\Users\green\Desktop\2022 Spring\CS156\Assignment2\rejected_2007_to_2018Q4.csv')
data_accept = pd.read_csv(r'C:\Users\green\Desktop\2022 Spring\CS156\Assignment2\accepted_2007_to_2018Q4.csv')



In [2]:
#Let's clean the data!
#Data cleaning: data_reject
print(data_reject.head())
#debt to income ratio seems to have '%' symbol that I don't want
#remove the symbol
data_reject['Debt-To-Income Ratio'] = data_reject['Debt-To-Income Ratio'].str.replace('%','')
print('---')
#the data has columns that will not help with the assignment,
#such as 'Zip Code', 'State', 'Policy Code', and 'Application Date'
data_reject.drop(columns=['Policy Code', 'Zip Code', 'State', 'Application Date'], inplace=True)
data_reject.head()

   Amount Requested Application Date                        Loan Title  \
0            1000.0       2007-05-26  Wedding Covered but No Honeymoon   
1            1000.0       2007-05-26                Consolidating Debt   
2           11000.0       2007-05-27       Want to consolidate my debt   
3            6000.0       2007-05-27                           waksman   
4            1500.0       2007-05-27                            mdrigo   

   Risk_Score Debt-To-Income Ratio Zip Code State Employment Length  \
0       693.0                  10%    481xx    NM           4 years   
1       703.0                  10%    010xx    MA          < 1 year   
2       715.0                  10%    212xx    MD            1 year   
3       698.0               38.64%    017xx    MA          < 1 year   
4       509.0                9.43%    209xx    MD          < 1 year   

   Policy Code  
0          0.0  
1          0.0  
2          0.0  
3          0.0  
4          0.0  
---


Unnamed: 0,Amount Requested,Loan Title,Risk_Score,Debt-To-Income Ratio,Employment Length
0,1000.0,Wedding Covered but No Honeymoon,693.0,10.0,4 years
1,1000.0,Consolidating Debt,703.0,10.0,< 1 year
2,11000.0,Want to consolidate my debt,715.0,10.0,1 year
3,6000.0,waksman,698.0,38.64,< 1 year
4,1500.0,mdrigo,509.0,9.43,< 1 year


In [3]:
#now what are the unique values?
Request_reject_unique = data_reject['Amount Requested'].unique()
LoanReason_reject_unique = data_reject['Loan Title'].unique()
EmployLength_reject_unique=data_reject['Employment Length'].unique()
print(Request_reject_unique, LoanReason_reject_unique,EmployLength_reject_unique) 
#the result shows that there are 12 unique values for employment length
#however, we still do not know how many unique reasons exist for other columns
print(len(Request_reject_unique))
print(len(LoanReason_reject_unique))
print(len(EmployLength_reject_unique))
#Let's see if there are enough information to use from the chosen columns
reject_missing = data_reject.isnull().sum()*100/len(data_reject) #percentage of missing data
reject_missing_df = pd.DataFrame({'Percent Missing in Rejected':reject_missing})
print(reject_missing_df)
#'Risk_Score' is missing for about 67% of the rows, so it seems unreasonable to keep that column
data_reject.drop(['Risk_Score'], axis=1,inplace=True)
data_reject.head()

[  1000.  11000.   6000. ...  73825. 114800.  64075.] ['Wedding Covered but No Honeymoon' 'Consolidating Debt'
 'Want to consolidate my debt' ... 'dougie03' 'freeup'
 'Business Advertising Loan'] ['4 years' '< 1 year' '1 year' '3 years' '2 years' '10+ years' '9 years'
 '5 years' '7 years' '6 years' '8 years' nan]
3640
73929
12
                      Percent Missing in Rejected
Amount Requested                         0.000000
Loan Title                               0.004713
Risk_Score                              66.902251
Debt-To-Income Ratio                     0.000000
Employment Length                        3.440862


Unnamed: 0,Amount Requested,Loan Title,Debt-To-Income Ratio,Employment Length
0,1000.0,Wedding Covered but No Honeymoon,10.0,4 years
1,1000.0,Consolidating Debt,10.0,< 1 year
2,11000.0,Want to consolidate my debt,10.0,1 year
3,6000.0,waksman,38.64,< 1 year
4,1500.0,mdrigo,9.43,< 1 year


In [4]:
#Let's do the same process with data_accept
print(data_accept.head()) #what? 151 columns? dang this is loooong
#see the labels
#1. See the columns that have too much missing data
accepted_missing = data_accept.isnull().sum()*100/len(data_accept)
accepted_missing_df = pd.DataFrame({'Percent Missing in Accepted':accepted_missing})
print(accepted_missing_df)
#2. Find the unique values
Request_accept_unique = data_accept['loan_amnt'].unique()
LoanReason_accept_unique = data_accept['title'].unique()
EmployLength_accept_unique=data_accept['emp_length'].unique()
print(Request_accept_unique, LoanReason_accept_unique,EmployLength_accept_unique) 
print(len(Request_accept_unique))
print(len(LoanReason_accept_unique))
print(len(EmployLength_accept_unique))

         id  member_id  loan_amnt  funded_amnt  funded_amnt_inv        term  \
0  68407277        NaN     3600.0       3600.0           3600.0   36 months   
1  68355089        NaN    24700.0      24700.0          24700.0   36 months   
2  68341763        NaN    20000.0      20000.0          20000.0   60 months   
3  66310712        NaN    35000.0      35000.0          35000.0   60 months   
4  68476807        NaN    10400.0      10400.0          10400.0   60 months   

   int_rate  installment grade sub_grade  ... hardship_payoff_balance_amount  \
0     13.99       123.03     C        C4  ...                            NaN   
1     11.99       820.28     C        C1  ...                            NaN   
2     10.78       432.66     B        B4  ...                            NaN   
3     14.85       829.90     C        C5  ...                            NaN   
4     22.45       289.91     F        F1  ...                            NaN   

  hardship_last_payment_amount disbursement_

In [5]:
#rename the columns of data_reject to make the code shorter to write
#the new names are taken from the accepted data
#so the rejected data has the same column names as the accepted data (for columns with similar meanings)
data_reject=data_reject.rename(columns={"Amount Requested":"loan_amnt", "Employment Length": "emp_length", "Debt-To-Income Ratio":"dti"})
#from the table, Employment Length is a categorical variable that contains different employment lengths
#so I will make a separate column for each category (one-hot encoding)
#this turns the categorical variable into indicator (dummy) variables
dummies = pd.get_dummies(data_reject.emp_length, prefix='emp')
#merge the dummy feature to the rejected
data_reject = pd.merge(data_reject, dummies, how='left', left_index = True, right_index=True)
#now drop the employment length(emp_length)
data_reject.drop(['emp_length'], axis=1, inplace=True)
#see if things are done correctly
data_reject.head()

Unnamed: 0,loan_amnt,Loan Title,dti,emp_1 year,emp_10+ years,emp_2 years,emp_3 years,emp_4 years,emp_5 years,emp_6 years,emp_7 years,emp_8 years,emp_9 years,emp_< 1 year
0,1000.0,Wedding Covered but No Honeymoon,10.0,0,0,0,0,1,0,0,0,0,0,0
1,1000.0,Consolidating Debt,10.0,0,0,0,0,0,0,0,0,0,0,1
2,11000.0,Want to consolidate my debt,10.0,1,0,0,0,0,0,0,0,0,0,0
3,6000.0,waksman,38.64,0,0,0,0,0,0,0,0,0,0,1
4,1500.0,mdrigo,9.43,0,0,0,0,0,0,0,0,0,0,1
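One detail of this encoding is worth checking: by default `pd.get_dummies` ignores missing values, so a row whose employment length is NaN ends up with 0 in every `emp_` column. As far as I can tell, that is why the missing rate for employment length no longer shows up after this step. The sketch below uses toy values; whether an explicit missing-value indicator (`dummy_na=True`) is preferable here is an assumption, not something the notebook tests.

```python
import numpy as np
import pandas as pd

# Toy employment-length column with one missing value (values are made up)
emp_length = pd.Series(["4 years", "< 1 year", np.nan], name="emp_length")

# Default behavior: the NaN row gets 0 in every indicator column
default_dummies = pd.get_dummies(emp_length, prefix="emp").astype(int)

# dummy_na=True adds an explicit indicator column for missing values
with_na_dummies = pd.get_dummies(emp_length, prefix="emp", dummy_na=True).astype(int)

# Row sums show whether each row is assigned to exactly one category
print(default_dummies.sum(axis=1).tolist())
print(with_na_dummies.sum(axis=1).tolist())
```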


In [6]:
#the same 'dti' column had different data types in the two datasets
print(type(data_accept['dti'][1]) == type(data_reject['dti'][1]))
#print(type(data_reject))
data_reject['dti'] = data_reject['dti'].replace({'%':""},regex=True)
#convert the strings to numbers with the public pandas API instead of the private _convert method
data_reject['dti'] = pd.to_numeric(data_reject['dti'])
print(type(data_accept['dti'][1]) == type(data_reject['dti'][1]))

False
True


In [7]:
#Now check the missing rate again
reject_missing = data_reject.isnull().sum()*100/len(data_reject) #percentage of missing data
reject_missing_df = pd.DataFrame({'Percent Missing in Rejected':reject_missing})
print(reject_missing_df)
#Cool, now reject data has been finally cleaned to be used 

               Percent Missing in Rejected
loan_amnt                         0.000000
Loan Title                        0.004713
dti                               0.000000
emp_1 year                        0.000000
emp_10+ years                     0.000000
emp_2 years                       0.000000
emp_3 years                       0.000000
emp_4 years                       0.000000
emp_5 years                       0.000000
emp_6 years                       0.000000
emp_7 years                       0.000000
emp_8 years                       0.000000
emp_9 years                       0.000000
emp_< 1 year                      0.000000


In [8]:
#The thing is, some of the loan titles have typos such as 'creditt card loan'
#I still think the loan title is important, but there are too many different reasons and my local machine has a hard time training with it.
#In the future, using it would require integer encoding -> one-hot encoding, but that would generate 63~73k different columns.
#Also, some of the columns have too much missing data. Columns like 'settlement' have 98% of their data missing
#So, I choose the data I want to work with: loan amount, employment length, debt-to-income ratio (dti)
data_accept = data_accept[["loan_amnt", "dti", "emp_length"]] #I thought total_pymnt was the total amount borrowed but it wasn't(some even had 0)
#for i in data_accept.total_pymnt:
#    if i == 0:
#        print(0)
#    else:
#        pass
data_reject.drop(['Loan Title'], axis=1, inplace=True)
dummy = pd.get_dummies(data_accept.emp_length, prefix = 'emp')
data_accept = pd.merge(data_accept, dummy, how='left', left_index = True, right_index =True)
data_accept.drop(['emp_length'], axis=1, inplace=True)
print(data_accept.head())

   loan_amnt    dti  emp_1 year  emp_10+ years  emp_2 years  emp_3 years  \
0     3600.0   5.91           0              1            0            0   
1    24700.0  16.06           0              1            0            0   
2    20000.0  10.78           0              1            0            0   
3    35000.0  17.06           0              1            0            0   
4    10400.0  25.37           0              0            0            1   

   emp_4 years  emp_5 years  emp_6 years  emp_7 years  emp_8 years  \
0            0            0            0            0            0   
1            0            0            0            0            0   
2            0            0            0            0            0   
3            0            0            0            0            0   
4            0            0            0            0            0   

   emp_9 years  emp_< 1 year  
0            0             0  
1            0             0  
2            0             0 

In [9]:
#check the missing rate just in case
accepted_missing = data_accept.isnull().sum()*100/len(data_accept)
accepted_missing_df = pd.DataFrame({'Percent Missing in Accepted':accepted_missing})
print(accepted_missing_df)
reject_missing = data_reject.isnull().sum()*100/len(data_reject) #percentage of missing data
reject_missing_df = pd.DataFrame({'Percent Missing in Rejected':reject_missing})
print(reject_missing_df)

               Percent Missing in Accepted
loan_amnt                         0.001460
dti                               0.077144
emp_1 year                        0.000000
emp_10+ years                     0.000000
emp_2 years                       0.000000
emp_3 years                       0.000000
emp_4 years                       0.000000
emp_5 years                       0.000000
emp_6 years                       0.000000
emp_7 years                       0.000000
emp_8 years                       0.000000
emp_9 years                       0.000000
emp_< 1 year                      0.000000
               Percent Missing in Rejected
loan_amnt                              0.0
dti                                    0.0
emp_1 year                             0.0
emp_10+ years                          0.0
emp_2 years                            0.0
emp_3 years                            0.0
emp_4 years                            0.0
emp_5 years                            0.0
emp_6 years

In [10]:
#Drop the NAs
#it is only the accepted dataset that has the NAs
data_accept = data_accept.dropna()
print("amount of data left in accepted dataset after cleaning: ", len(data_accept))
print("amount of data left in rejected dataset after cleaning: ", len(data_reject))
#now make a column that tells whether a person got the loan or not
#0 means the person was rejected, 1 means the person was accepted
data_reject['loan_get'] = 0
#print(data_reject.head())
data_accept['loan_get'] = 1
#print(data_accept.head())
#merge the two dataset (the rejected are 0 accepted are 1 with the columns)
data = pd.concat([data_accept, data_reject])
print(data.head())

amount of data left in accepted dataset after cleaning:  2258957
amount of data left in rejected dataset after cleaning:  27648741
   loan_amnt    dti  emp_1 year  emp_10+ years  emp_2 years  emp_3 years  \
0     3600.0   5.91           0              1            0            0   
1    24700.0  16.06           0              1            0            0   
2    20000.0  10.78           0              1            0            0   
3    35000.0  17.06           0              1            0            0   
4    10400.0  25.37           0              0            0            1   

   emp_4 years  emp_5 years  emp_6 years  emp_7 years  emp_8 years  \
0            0            0            0            0            0   
1            0            0            0            0            0   
2            0            0            0            0            0   
3            0            0            0            0            0   
4            0            0            0            0         

### After cleaning the data: Modeling (+ Explaining the extension)
Now I have the cleaned data to train on. I made a new column, 'loan_get', that tells whether the borrower actually got the loan. The loan_amnt column is the amount of money they got (if loan_get = 1) or the amount they requested (if loan_get = 0). [Optional part of the assignment]

Here, we can approach the problem in two ways.
1. Predict whether this person is going to get the loan or not
2. Predict how much this person can expect to receive if they do get the loan

The data are labeled by the 'loan_get' variable, so supervised learning methods are the natural choice.

The first can be done with a classifier (logistic regression), while the second can be done with a regression (linear regression). As an extension, I will train the classifier with cross-validation and different C values. The linear regression is another extension: it uses the dti values and employment length to estimate how much applicants can get if their loan proposals are accepted.

- After reading the meanings of the column labels and going through the data cleaning, I felt the need to make a simple model rather than something very complex. Having many more features to test might well give higher accuracy, but sometimes a simpler model works well enough (Occam's razor).

- So, I focused on the specific variables 'employment length' and 'loan amount' to see how they relate to loan acceptance. I also considered prospect theory, under which people feel a gain or loss less dramatically the further it is from 0 (for example, a 100 USD increase has a different utility depending on whether the base is 10 USD or 1M USD).
- However, I still wanted to choose something from outside the class, given that one variable (loan amount) is continuous in character while the other (employment length) is discrete/categorical. That is why I chose a mixed model, the 'Joint indicator model'.


In [11]:
#Before we train the model, we need to split the data and set up the environment
#Set up the environment
from sklearn.model_selection import train_test_split
from sklearn import linear_model
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import numpy as np 
import sklearn.metrics as met

#first, split the data
train, test = train_test_split(data, test_size = 0.2, random_state=42) #data is huge
variables = list(data.columns)
x_vars = variables[:13]

X_train = train[x_vars]
y_train = train['loan_get']

test_X = test[x_vars]
test_y = test['loan_get']
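Because accepted loans are a small minority of the rows, it may help to keep the class ratio identical in the training and test parts. The sketch below shows the `stratify` argument of `train_test_split` on toy labels; the 8% share and the placeholder feature are made up, and the notebook itself uses a plain random split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a similar kind of imbalance: 8% accepted (1), 92% rejected (0)
y = np.array([1] * 80 + [0] * 920)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, made up

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y both parts keep the same share of accepted loans
print(y_train.mean(), y_test.mean())
```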

In [12]:
#Logistic Regression 
#regr = linear_model.LogisticRegression()
#Logistic regression can overfit.
#according to https://realpython.com/logistic-regression-python/#multi-variate-logistic-regression
#regularization can reduce overfitting by penalizing the complexity of the model.
#In scikit-learn, C is the inverse of the regularization strength:
#a higher C means weaker regularization (more weight on fitting the training data, less on the penalty)
c = [0.1, 0.5, 1,2, 5, 10] #somehow 1, 2, and 100 are giving me nan
best_c = []
for i in c:
    log_reg = linear_model.LogisticRegression(C = i, solver = 'lbfgs')
    #something extra to do ()
    accuracy_crossval = np.array(cross_val_score(log_reg, X_train, y_train, cv = 5)).mean()
    print('mean cross validation score :', accuracy_crossval)
    #performance measure
    best_c.append(accuracy_crossval)

print(best_c)   
#so, in the output below, C = 1 has the highest mean cross-validation score.
#But all the scores are very close: the best and the worst differ by about 0.25 percentage points



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


mean cross validation score : 0.9311657563361238


mean cross validation score : 0.9309854515545009


mean cross validation score : 0.9311892452753201
mean cross validation score : 0.9311732376886435
mean cross validation score : 0.9311732376886435


mean cross validation score : 0.9286932743343141
[0.9311657563361238, 0.9309854515545009, 0.9311892452753201, 0.9311732376886435, 0.9311732376886435, 0.9286932743343141]
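The convergence warning in this output suggests either raising `max_iter` or scaling the data. A minimal sketch of the scaling option follows, on synthetic data rather than the Lending Club files: the feature ranges, the acceptance rule, and `max_iter=1000` are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one large-scale feature (like loan_amnt), one small-scale (like dti)
rng = np.random.default_rng(42)
n = 2000
amount = rng.uniform(1000, 40000, size=n)
dti = rng.uniform(0, 40, size=n)
X = np.column_stack([amount, dti])
# Made-up rule: applicants with a lower dti are more likely to be accepted
y = (dti + rng.normal(0, 5, size=n) < 15).astype(int)

c_values = [0.1, 0.5, 1, 2, 5, 10]
scores = {}
for c in c_values:
    # Scaling lives inside the pipeline, so each CV fold is scaled using only its training part
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=c, solver="lbfgs", max_iter=1000),
    )
    scores[c] = cross_val_score(model, X, y, cv=5).mean()

best_c = max(scores, key=scores.get)
print(scores)
print("best C:", best_c)
```

Putting the scaler in the pipeline, rather than scaling the whole dataset once, keeps the validation folds out of the scaling statistics.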


### Evaluation

In [13]:
#evaluating one of the results
log_reg = linear_model.LogisticRegression(C=1, solver = 'lbfgs').fit(X_train, y_train)
accuracy_crossval = np.array(cross_val_score(log_reg, X_train, y_train, cv = 5)).mean()
prediction = log_reg.predict(test_X)
accuracy = met.accuracy_score(test_y, prediction)
print("accuracy score : ", accuracy)
print("coeffs: ", log_reg.coef_, ', ',log_reg.intercept_)
print('precision score : ', met.precision_score(test_y,prediction))
print('recall :', met.recall_score(test_y, prediction))
print('f1 score :', met.f1_score(test_y, prediction))

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


accuracy score :  0.9235141786228964
coeffs:  [[ 2.11343078e-05 -3.37065578e-04  6.42535023e-02  4.98211344e-01
   1.18163721e-01  1.04779275e-01  8.16041575e-02 -2.57279290e-01
   6.54850830e-02  6.04013452e-02  5.88229615e-02  5.18678690e-02
  -2.57051081e+00]] ,  [-1.75669943]
precision score :  0.0
recall : 0.0
f1 score : 0.0


In [14]:
print("confusion matrix: ")
met.confusion_matrix(test_y, prediction)

confusion matrix: 


array([[5524037,    5111],
       [ 452392,       0]], dtype=int64)
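The zero precision, recall, and F1 can be reproduced directly from this confusion matrix. The sketch below only reuses the four counts printed above; the always-reject baseline is an added comparison, not something computed in the notebook.

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions
tn, fp = 5524037, 5111
fn, tp = 452392, 0

total = tn + fp + fn + tp
accuracy = (tp + tn) / total

# Guard against division by zero when a denominator is empty
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# Accuracy of a trivial baseline that predicts "rejected" (0) for everyone
baseline_accuracy = (tn + fp) / total

print("accuracy:", accuracy)
print("precision:", precision, "recall:", recall, "f1:", f1)
print("always-reject baseline accuracy:", baseline_accuracy)
```

Since there are no true positives, the high accuracy comes entirely from the rejected class, and the trivial baseline scores slightly higher than the fitted model.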

In [68]:
#Linear regression: predicting the loan amount they can get based on dti and employment length
variables = list(data.columns)
x_vars = variables[1:13]
X = data[x_vars]
y = data['loan_amnt']

X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.20, random_state=42)
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)
prediction = regr.predict(X_test)

            dti  emp_1 year  emp_10+ years  emp_2 years  emp_3 years  \
347953    10.26           0              0            0            0   
2189888    4.74           0              0            0            0   
19236929  11.32           0              0            0            0   
2448190   -1.00           0              0            0            0   
26522277  10.94           0              0            0            0   

          emp_4 years  emp_5 years  emp_6 years  emp_7 years  emp_8 years  \
347953              0            0            0            0            0   
2189888             0            0            0            0            0   
19236929            0            0            0            0            0   
2448190             0            0            0            0            0   
26522277            0            0            0            0            0   

          emp_9 years  emp_< 1 year  
347953              0             1  
2189888             0       

In [69]:
#Linear regression performance measure
print("coeffs: ", regr.coef_, ', ',regr.intercept_)
print("MSE : ", met.mean_squared_error(y_test, prediction))
print("median absolute error : ", met.median_absolute_error(y_test, prediction))
print("r2: ", met.r2_score(y_test, prediction))
print("mean absolute error: ", met.mean_absolute_error(y_test, prediction))

coeffs:  [4.18449665e-04 1.27374562e+02 4.85169837e+03 1.89273505e+03
 1.93891877e+03 2.27142323e+03 2.24359191e+02 3.40521249e+03
 3.91783840e+03 3.74155930e+03 3.45610791e+03 1.99467613e+03] ,  11383.448976762373
MSE :  212719157.15092295
median absolute error :  8378.12722046813
r2:  0.0037672345762671533
mean absolute error:  9292.046696446623
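To put these numbers in context: an R² near 0 means the regression explains almost none of the variance in loan_amnt, since predicting the mean for everyone gives exactly 0. The sketch below illustrates that with toy targets and converts the reported MSE into dollar units; `r_squared` is a hypothetical helper written out for clarity, not the scikit-learn function.

```python
import math


def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, the usual coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy targets (made up): predicting the mean for everyone gives R^2 = 0
y_true = [1000.0, 6000.0, 11000.0]
mean_prediction = [6000.0] * 3
print(r_squared(y_true, mean_prediction))

# RMSE puts the reported MSE back into the units of loan_amnt
mse = 212719157.15092295
rmse = math.sqrt(mse)
print(round(rmse, 2))
```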


### Summary
Variables included: dti, employment length, loan amount (requested for rejected applicants; requested and received for approved applicants)
- Reason: these variables were easy to convert and lost little information

Cleaning of the data & transformations:
1. Dropping redundant columns
2. Dropping columns that have too many NAs
3. One-hot encoding for employment length (from a categorical variable to indicator variables)
4. Making the accepted and rejected data have the same data type for each column

ML models of choice: Logistic Regression (classifying whether the applicant gets the money or not), Linear Regression (estimating how much money they can receive depending on dti and employment length)

Settings: C value (the inverse of the regularization strength: a smaller C penalizes model complexity more strongly, a larger C puts more weight on fitting the training data)

Specific methods: cross-validation (5 folds)

Model Performance:
- Logistic Regression: strange performance was found, with an F1 score of 0. The confusion matrix has no true positives: nearly every test example was predicted as rejected, which still gives a high accuracy because rejected applications make up about 92% of the test set.
    - accuracy score :  0.9235141786228964
    - coeffs:  [[ 2.11347307e-05 -3.36742421e-04  6.42533329e-02  4.98210030e-01  1.18163409e-01  1.04778998e-01  8.16039424e-02 -2.57278613e-01  6.54849104e-02  6.04011861e-02  5.88228065e-02  5.18677323e-02 -2.57050403e+00]] , [-1.7566948]
    - precision score :  0.0
    - recall : 0.0
    - f1 score : 0.0
- Linear Regression
    - coeffs:  [4.18449665e-04 1.27374562e+02 4.85169837e+03 1.89273505e+03 1.93891877e+03 2.27142323e+03 2.24359191e+02 3.40521249e+03 3.91783840e+03 3.74155930e+03 3.45610791e+03 1.99467613e+03] , 11383.4489767623734
    - MSE :  212719157.15092295
    - median absolute error :  8378.12722046813
    - r2:  0.0037672345762671533
    - mean absolute error:  9292.046696446623