<a href="https://colab.research.google.com/github/Hubert26/machine-learning/blob/main/project_Hubert_Szewczyk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#OPIS PROJEKTU

Lending Club to firma pożyczkowa typu peer-to-peer, która łączy pożyczkobiorców z
inwestorami za pośrednictwem platformy internetowej. Obsługuje osoby, które potrzebują
pożyczek osobistych w wysokości od 1000 do 40 000 USD. Pożyczkobiorcy otrzymują pełną
kwotę udzielonej pożyczki pomniejszoną o opłatę początkową, która jest uiszczana firmie.
Inwestorzy kupują weksle zabezpieczone osobistymi pożyczkami i płacą Lending Club
opłatę za usługę. Firma Lending Club udostępnia dane o wszystkich pożyczkach
udzielonych za pośrednictwem swojej platformy w określonych okresach.
Na potrzeby tego projektu zostały użyte dane dotyczące pożyczek udzielonych za
pośrednictwem Lending Club na przestrzeni lat 2007 -2011. Każda pożyczka jest opatrzona
informacją o tym, czy ostatecznie została spłacona (Fully Paid lub Charged off w kolumnie
loan_status).

W tym projekcie chcę zbudować model klasyfikacyjny, który na podstawie zebranych danych będzie przewidywał, czy potencjalny pozyczkobiorca spłaci swój dług.

##KOD

### Preprocessing danych:
1. [Import bibliotek](#0)
2. [Wygenerowanie danych](#1)
3. [Utworzenie kopii danych](#2)
4. [Zmiana typu danych i wstępna eksploracja](#3)
5. [LabelEncoder](#4)
6. [OneHotEncoder](#5)
7. [Pandas *get_dummies()*](#6)
8. [Standaryzacja - StandardScaler](#7)
9. [Przygotowanie danych do modelu](#8)

###IMPORT BIBLIOTEK

In [None]:
import numpy as np
import pandas as pd
import sklearn

In [None]:
!gdown --id 1LrKHVdjRK4pMQdsU3gPMUaCgYPbOTr_o

Downloading...
From: https://drive.google.com/uc?id=1LrKHVdjRK4pMQdsU3gPMUaCgYPbOTr_o
To: /content/Loan_data.csv
100% 46.9M/46.9M [00:00<00:00, 52.6MB/s]


In [None]:
df_raw = pd.read_csv("Loan_data.csv", sep=',', dtype={'0': int, '49': str})
df = df_raw.copy()

  df_raw = pd.read_csv("Loan_data.csv", sep=',', dtype={'0': int, '49': str})


In [None]:
df.head()

Unnamed: 0,id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,...,hardship_payoff_balance_amount,hardship_last_payment_amount,disbursement_method,debt_settlement_flag,debt_settlement_flag_date,settlement_status,settlement_date,settlement_amount,settlement_percentage,settlement_term
0,1077501,,5000.0,5000.0,4975.0,36 months,10.65%,162.87,B,B2,...,,,Cash,N,,,,,,
1,1077430,,2500.0,2500.0,2500.0,60 months,15.27%,59.83,C,C4,...,,,Cash,N,,,,,,
2,1077175,,2400.0,2400.0,2400.0,36 months,15.96%,84.33,C,C5,...,,,Cash,N,,,,,,
3,1076863,,10000.0,10000.0,10000.0,36 months,13.49%,339.31,C,C1,...,,,Cash,N,,,,,,
4,1075358,,3000.0,3000.0,3000.0,60 months,12.69%,67.79,B,B5,...,,,Cash,N,,,,,,


In [None]:
print(df.iloc[0])

id                       1077501
member_id                    NaN
loan_amnt                 5000.0
funded_amnt               5000.0
funded_amnt_inv           4975.0
                          ...   
settlement_status            NaN
settlement_date              NaN
settlement_amount            NaN
settlement_percentage        NaN
settlement_term              NaN
Name: 0, Length: 151, dtype: object


In [None]:
df.iloc[:, 49]

0             NaN
1             NaN
2             NaN
3             NaN
4             NaN
           ...   
42531    Mar-2008
42532    Jul-2010
42533    Jul-2010
42534    Jul-2010
42535    Jul-2010
Name: next_pymnt_d, Length: 42536, dtype: object

In [50]:
column_names = df.columns.tolist()
znalezione_wyrazy = [string for string in column_names if "fico" in string]
znalezione_wyrazy

['fico_range_low',
 'fico_range_high',
 'last_fico_range_high',
 'last_fico_range_low',
 'sec_app_fico_range_low',
 'sec_app_fico_range_high']

In [None]:
mask = df.isna()

In [None]:
naan_counts = [(df[column] == True).sum() for column in df.columns]


In [None]:
print(naan_counts)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 24, 3595, 0, 0, 0, 11247, 33, 0, 39, 2298, 13, 0, 21, 0, 0, 0, 0, 0, 0, 0, 1, 3, 5, 0, 2, 0, 0, 0, 0, 0, 0, 42535, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1846, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 17]


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42536 entries, 0 to 42535
Columns: 151 entries, id to settlement_term
dtypes: float64(120), object(31)
memory usage: 49.0+ MB


In [None]:
df.describe()

Unnamed: 0,member_id,loan_amnt,funded_amnt,funded_amnt_inv,installment,annual_inc,dti,delinq_2yrs,fico_range_low,fico_range_high,...,payment_plan_start_date,hardship_length,hardship_dpd,hardship_loan_status,orig_projected_additional_accrued_interest,hardship_payoff_balance_amount,hardship_last_payment_amount,settlement_amount,settlement_percentage,settlement_term
count,0.0,42535.0,42535.0,42535.0,42535.0,42531.0,42535.0,42506.0,42535.0,42535.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,160.0,160.0,160.0
mean,,11089.722581,10821.585753,10139.938785,322.623063,69136.56,13.373043,0.152449,713.052545,717.052545,...,,,,,,,,4272.137875,49.905875,1.2
std,,7410.938391,7146.914675,7131.598014,208.927216,64096.35,6.726315,0.512406,36.188439,36.188439,...,,,,,,,,3119.373774,15.56369,4.085255
min,,500.0,500.0,0.0,15.67,1896.0,0.0,0.0,610.0,614.0,...,,,,,,,,193.29,10.69,0.0
25%,,5200.0,5000.0,4950.0,165.52,40000.0,8.2,0.0,685.0,689.0,...,,,,,,,,1842.75,40.0,0.0
50%,,9700.0,9600.0,8500.0,277.69,59000.0,13.47,0.0,710.0,714.0,...,,,,,,,,3499.35,49.97,0.0
75%,,15000.0,15000.0,14000.0,428.18,82500.0,18.68,0.0,740.0,744.0,...,,,,,,,,5701.1,60.6525,0.0
max,,35000.0,35000.0,35000.0,1305.19,6000000.0,29.99,13.0,825.0,829.0,...,,,,,,,,14798.2,92.74,24.0


In [None]:
selected_columns = [column for column, count in zip(df.columns, naan_counts) if count > 15]

In [None]:
df[selected_columns]

Unnamed: 0,dti,delinq_2yrs,inq_last_6mths,mths_since_last_delinq,open_acc,pub_rec,revol_bal,total_acc,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_amnt,policy_code,acc_now_delinq,pub_rec_bankruptcies,tax_liens,settlement_term
0,27.65,0.0,1.0,,3.0,0.0,13648.0,9.0,0.00,0.0,0.00,171.62,1.0,0.0,0.0,0.0,
1,1.00,0.0,5.0,,3.0,0.0,1687.0,4.0,0.00,122.9,1.11,119.66,1.0,0.0,0.0,0.0,
2,8.72,0.0,2.0,,2.0,0.0,2956.0,10.0,0.00,0.0,0.00,649.91,1.0,0.0,0.0,0.0,
3,20.00,0.0,1.0,35.0,10.0,0.0,5598.0,37.0,16.97,0.0,0.00,357.48,1.0,0.0,0.0,0.0,
4,17.94,0.0,0.0,38.0,15.0,0.0,27783.0,38.0,0.00,0.0,0.00,67.30,1.0,0.0,0.0,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
42531,10.00,,,,,,0.0,,0.00,0.0,0.00,0.00,1.0,,,,
42532,10.00,,,,,,0.0,,0.00,0.0,0.00,32.41,1.0,,,,
42533,10.00,,,,,,0.0,,0.00,0.0,0.00,82.03,1.0,,,,
42534,4.00,,,,,,0.0,,0.00,0.0,0.00,205.32,1.0,,,,


In [38]:
for i in column_names:
  print(i)

id
member_id
loan_amnt
funded_amnt
funded_amnt_inv
term
int_rate
installment
grade
sub_grade
emp_title
emp_length
home_ownership
annual_inc
verification_status
issue_d
loan_status
pymnt_plan
url
desc
purpose
title
zip_code
addr_state
dti
delinq_2yrs
earliest_cr_line
fico_range_low
fico_range_high
inq_last_6mths
mths_since_last_delinq
mths_since_last_record
open_acc
pub_rec
revol_bal
revol_util
total_acc
initial_list_status
out_prncp
out_prncp_inv
total_pymnt
total_pymnt_inv
total_rec_prncp
total_rec_int
total_rec_late_fee
recoveries
collection_recovery_fee
last_pymnt_d
last_pymnt_amnt
next_pymnt_d
last_credit_pull_d
last_fico_range_high
last_fico_range_low
collections_12_mths_ex_med
mths_since_last_major_derog
policy_code
application_type
annual_inc_joint
dti_joint
verification_status_joint
acc_now_delinq
tot_coll_amt
tot_cur_bal
open_acc_6m
open_act_il
open_il_12m
open_il_24m
mths_since_rcnt_il
total_bal_il
il_util
open_rv_12m
open_rv_24m
max_bal_bc
all_util
total_rev_hi_lim
inq_fi
to

In [51]:
df[znalezione_wyrazy].describe()

Unnamed: 0,fico_range_low,fico_range_high,last_fico_range_high,last_fico_range_low,sec_app_fico_range_low,sec_app_fico_range_high
count,42535.0,42535.0,42535.0,42535.0,0.0,0.0
mean,713.052545,717.052545,689.922511,676.952039,,
std,36.188439,36.188439,80.818099,119.647752,,
min,610.0,614.0,0.0,0.0,,
25%,685.0,689.0,644.0,640.0,,
50%,710.0,714.0,699.0,695.0,,
75%,740.0,744.0,749.0,745.0,,
max,825.0,829.0,850.0,845.0,,


In [52]:
df[znalezione_wyrazy].iloc[0]

fico_range_low             735.0
fico_range_high            739.0
last_fico_range_high       739.0
last_fico_range_low        735.0
sec_app_fico_range_low       NaN
sec_app_fico_range_high      NaN
Name: 0, dtype: float64