Atalov S. (TSI AUCA)

# Home Loans

---

<img src="https://storage.googleapis.com/kaggle-datasets-images/2806004/4841702/2531be016c7f9ffbcbf7d848c847f9c5/dataset-cover.jpeg?t=2023-01-12-06-32-57">

---
## 0. Problem Statement

#### About the Company
Dream Housing Finance deals in all kinds of home loans and has a presence across urban, semi-urban, and rural areas. A customer first applies for a home loan; the company then validates the customer's eligibility for it.

#### Problem
The company wants to automate the loan eligibility process in real time, based on the details customers provide when filling out the online application form. These details include Gender, Marital Status, Education, Number of Dependents, Income, Loan Amount, Credit History, and others. To automate the process, the company has posed the problem of identifying the customer segments that are eligible for a loan, so that these customers can be targeted specifically. Only a partial data set is provided.

In [133]:
import pandas as pd

In [134]:
# read the datafile
df = pd.read_csv("train.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            552 non-null    object 
 1   Gender             539 non-null    object 
 2   Married            549 non-null    object 
 3   Dependents         537 non-null    object 
 4   Education          552 non-null    object 
 5   Self_Employed      525 non-null    object 
 6   ApplicantIncome    552 non-null    int64  
 7   CoapplicantIncome  552 non-null    float64
 8   LoanAmount         531 non-null    float64
 9   Loan_Amount_Term   540 non-null    float64
 10  Credit_History     502 non-null    float64
 11  Property_Area      552 non-null    object 
 12  Loan_Status        552 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 56.2+ KB


## 1. Data Preprocessing

### Drop columns

`Loan_ID`

In [135]:
df.drop(columns=["Loan_ID"])

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,Male,Yes,0,Graduate,No,2132,1591.0,96.0,360.0,1.0,Semiurban,Y
1,Female,No,1,Graduate,No,3481,0.0,155.0,36.0,1.0,Semiurban,N
2,Male,Yes,0,Graduate,No,2383,3334.0,172.0,360.0,1.0,Semiurban,Y
3,Male,Yes,3+,Not Graduate,No,4755,0.0,95.0,,0.0,Semiurban,N
4,Female,Yes,2,Graduate,No,1378,1881.0,167.0,360.0,1.0,Urban,N
...,...,...,...,...,...,...,...,...,...,...,...,...
547,Male,Yes,0,Graduate,No,2400,2167.0,115.0,360.0,1.0,Semiurban,Y
548,Male,No,0,Graduate,No,2500,20000.0,103.0,360.0,1.0,Semiurban,Y
549,Male,Yes,1,Graduate,No,3315,0.0,96.0,360.0,1.0,Semiurban,Y
550,Male,No,0,Graduate,Yes,11000,0.0,83.0,360.0,1.0,Urban,N


### Check missing values:

In [136]:
df["LoanAmount"].head(50)

0      96.0
1     155.0
2     172.0
3      95.0
4     167.0
5      81.0
6     120.0
7     225.0
8     138.0
9     113.0
10    164.0
11    181.0
12    125.0
13    160.0
14     67.0
15    135.0
16     75.0
17    140.0
18     94.0
19    132.0
20    144.0
21     66.0
22     60.0
23    100.0
24     70.0
25    110.0
26    180.0
27    108.0
28    210.0
29     90.0
30    120.0
31    234.0
32    120.0
33      NaN
34     84.0
35    135.0
36    105.0
37      NaN
38     88.0
39    150.0
40     90.0
41    151.0
42    182.0
43    120.0
44    225.0
45    140.0
46      NaN
47     55.0
48     85.0
49     88.0
Name: LoanAmount, dtype: float64

In [137]:
df.isnull().sum()

Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        27
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           21
Loan_Amount_Term     12
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

### Fill the missing values

In [138]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.linear_model import LogisticRegression

In [139]:
df.drop(columns=["Loan_ID"], inplace=True)

In [140]:
df.isnull().sum()

Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        27
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           21
Loan_Amount_Term     12
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

In [141]:

cleaned = df.dropna(subset=["Gender"])
X = cleaned[["ApplicantIncome","CoapplicantIncome"]]
y = cleaned["Gender"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

missing_indices = df[df['Gender'].isnull()].index

# Fill the missing values in "Gender" with the values predicted by the model
df.loc[missing_indices, 'Gender'] = knn.predict(df.loc[missing_indices, ['ApplicantIncome', 'CoapplicantIncome']])

In [142]:
cleaned = df.dropna(subset=["Credit_History"])
X = cleaned[["ApplicantIncome","CoapplicantIncome"]]
y = cleaned["Credit_History"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

missing_indices = df[df['Credit_History'].isnull()].index

# Fill the missing values in "Credit_History" with the values predicted by the model
df.loc[missing_indices, 'Credit_History'] = knn.predict(df.loc[missing_indices, ['ApplicantIncome', 'CoapplicantIncome']])

In [144]:

cleaned = df.dropna(subset=["Self_Employed"])
X = cleaned[["ApplicantIncome","CoapplicantIncome"]]
y = cleaned["Self_Employed"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
accuracy_score(y_test, predictions)

missing_indices = df[df['Self_Employed'].isnull()].index

# Fill the missing values in "Self_Employed" with the values predicted by the model
df.loc[missing_indices, 'Self_Employed'] = knn.predict(df.loc[missing_indices, ['ApplicantIncome', 'CoapplicantIncome']])
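The three cells above repeat the same pattern: fit a kNN classifier on the rows where a column is known, then predict that column for the rows where it is missing. A minimal sketch of that pattern as one reusable function follows. The helper name `knn_fill` and the toy data are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


def knn_fill(frame, target, features, n_neighbors=3):
    """Return a copy of `frame` with missing `target` values predicted by kNN.

    Hypothetical helper: fits on rows where `target` is known and predicts
    the rows where it is missing, using only the `features` columns.
    """
    missing = frame[target].isnull()
    if not missing.any():
        return frame.copy()
    known = frame[~missing]
    model = KNeighborsClassifier(n_neighbors=n_neighbors)
    model.fit(known[features], known[target])
    filled = frame.copy()
    filled.loc[missing, target] = model.predict(frame.loc[missing, features])
    return filled


# Toy data for illustration only (not the loan dataset).
toy = pd.DataFrame({
    "ApplicantIncome": [2000, 2100, 2200, 9000, 9100, 9200, 2050, 9050],
    "CoapplicantIncome": [0.0, 100.0, 0.0, 3000.0, 3100.0, 2900.0, 50.0, 3050.0],
    "Gender": ["Female", "Female", "Female", "Male", "Male", "Male", None, None],
})
toy = knn_fill(toy, "Gender", ["ApplicantIncome", "CoapplicantIncome"])
print(toy["Gender"].tolist())
```

With a helper like this, each imputed column would need a single call, and the model fitted for one column could not be reused for another by accident.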

In [145]:
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].mean())

In [146]:
df["Loan_Amount_Term"] = df["Loan_Amount_Term"].fillna(df["Loan_Amount_Term"].mean())

In [147]:
df["Married"] = df["Married"].fillna(df["Married"].mode()[0])

In [148]:
df["Dependents"] = df["Dependents"].fillna(df["Dependents"].mode()[0])

### Handle non-numeric columns

In [149]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 552 entries, 0 to 551
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Gender             552 non-null    object 
 1   Married            552 non-null    object 
 2   Dependents         552 non-null    object 
 3   Education          552 non-null    object 
 4   Self_Employed      552 non-null    object 
 5   ApplicantIncome    552 non-null    int64  
 6   CoapplicantIncome  552 non-null    float64
 7   LoanAmount         552 non-null    float64
 8   Loan_Amount_Term   552 non-null    float64
 9   Credit_History     552 non-null    float64
 10  Property_Area      552 non-null    object 
 11  Loan_Status        552 non-null    object 
dtypes: float64(4), int64(1), object(7)
memory usage: 51.9+ KB


In [150]:
df["Gender"] = df["Gender"].replace({"Male": 1, "Female": 0})

In [151]:
df["Married"] = df["Married"].replace({"Yes": 1, "No": 0})
df["Self_Employed"] = df["Self_Employed"].replace({"Yes": 1, "No": 0})
df["Education"] = df["Education"].replace({"Graduate": 1, "Not Graduate": 0})
df["Loan_Status"] = df["Loan_Status"].replace({"Y": 1, "N": 0})
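The cells above encode two-valued columns as 0/1 with `replace`, which silently leaves any unexpected label untouched. As a hedged alternative, the sketch below uses `Series.map`, which turns labels missing from the mapping into `NaN` so they can be detected before the model sees them. The toy frame is an assumption for illustration; the mappings are copied from the cells above.

```python
import pandas as pd

# Toy frame for illustration only; column names follow the notebook.
toy = pd.DataFrame({
    "Married": ["Yes", "No", "Yes"],
    "Education": ["Graduate", "Not Graduate", "Graduate"],
})

# Mappings copied from the cells above.
binary_maps = {
    "Married": {"Yes": 1, "No": 0},
    "Education": {"Graduate": 1, "Not Graduate": 0},
}

for column, mapping in binary_maps.items():
    encoded = toy[column].map(mapping)
    # map() yields NaN for labels that are not in the mapping,
    # so a typo or a new category is caught here.
    if encoded.isnull().any():
        raise ValueError(f"Unexpected label in column {column!r}")
    toy[column] = encoded.astype(int)

print(toy.dtypes)
```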

In [152]:
df["Credit_History"] = df["Credit_History"].astype(int)

In [153]:
df

Unnamed: 0,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,1,1,0,1,0,2132,1591.0,96.0,360.000000,1,Semiurban,1
1,0,0,1,1,0,3481,0.0,155.0,36.000000,1,Semiurban,0
2,1,1,0,1,0,2383,3334.0,172.0,360.000000,1,Semiurban,1
3,1,1,3+,0,0,4755,0.0,95.0,340.555556,0,Semiurban,0
4,0,1,2,1,0,1378,1881.0,167.0,360.000000,1,Urban,0
...,...,...,...,...,...,...,...,...,...,...,...,...
547,1,1,0,1,0,2400,2167.0,115.0,360.000000,1,Semiurban,1
548,1,0,0,1,0,2500,20000.0,103.0,360.000000,1,Semiurban,1
549,1,1,1,1,0,3315,0.0,96.0,360.000000,1,Semiurban,1
550,1,0,0,1,1,11000,0.0,83.0,360.000000,1,Urban,0


In [154]:
df = pd.get_dummies(df, columns=["Property_Area", "Dependents"])

In [155]:
df["Gender"] = df["Gender"].astype(int)
df["Married"] = df["Married"].astype(int)

### Train Test Split

In [156]:
X = df.drop(columns=["Loan_Status"])
y = df["Loan_Status"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

---
## 2. Modeling

### kNN

In [157]:
KNN = KNeighborsClassifier(n_neighbors=3)
KNN.fit(X_train,y_train)
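kNN compares rows by distance, so columns measured in thousands (such as `ApplicantIncome`) can outweigh the 0/1 columns. One common remedy, which the notebook does not use, is to standardize the features first. The sketch below shows this on synthetic data; the pipeline, the variable names, and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data: a large-scale "income" column that carries no signal,
# and a 0/1 flag that fully determines the label.
income = rng.normal(5000, 2000, size=200)
flag = rng.integers(0, 2, size=200)
X_toy = np.column_stack([income, flag])
y_toy = flag

# The scaler is fitted on the training rows only and reused at predict time.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_toy[:150], y_toy[:150])
toy_accuracy = scaled_knn.score(X_toy[150:], y_toy[150:])
print(toy_accuracy)
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling statistics tied to the training data, so the same object can later be applied to the test file.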

In [158]:
X_train

Unnamed: 0,Gender,Married,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area_Rural,Property_Area_Semiurban,Property_Area_Urban,Dependents_0,Dependents_1,Dependents_2,Dependents_3+
388,1,0,1,0,9167,0.0,185.0,360.0,1,True,False,False,False,False,False,True
516,0,1,1,0,3167,2283.0,154.0,360.0,1,False,True,False,True,False,False,False
210,0,0,1,0,3812,0.0,112.0,360.0,1,True,False,False,False,True,False,False
15,1,1,1,0,3276,484.0,135.0,360.0,1,False,True,False,False,False,True,False
336,1,1,0,0,4931,0.0,128.0,360.0,0,False,True,False,False,False,False,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,1,0,0,0,4885,0.0,48.0,360.0,1,True,False,False,True,False,False,False
106,1,1,1,0,4166,3369.0,201.0,360.0,1,False,False,True,False,True,False,False
270,1,1,1,0,2400,0.0,75.0,360.0,1,False,False,True,True,False,False,False
435,1,1,1,0,6065,2004.0,250.0,360.0,1,False,True,False,False,True,False,False


In [160]:
# Use the classifier fitted on the full feature set (`KNN`), not the
# imputation model `knn`, and compare predictions with the labels.
predictions = KNN.predict(X_train)
accuracy_score(y_train, predictions)

In [161]:
# Accuracy on the held-out split
predictions = KNN.predict(X_test)
accuracy_score(y_test, predictions)

### Logistic Regression

In [162]:
# fit the model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# evaluate the model on the training data
model.score(X_train, y_train)



0.8321995464852607

In [163]:
# show score on train data
model.score(X_train, y_train)




0.8321995464852607

In [164]:
# show score on test data

model.score(X_test, y_test)

0.7567567567567568
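The gap between the training score and the test score above comes from a single 80/20 split, so it may shift with a different `random_state`. A hedged way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in for the loan features, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared feature matrix (illustration only).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

# Five folds: each row is used for validation exactly once.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5)
print(scores.mean(), scores.std())
```

In the notebook, the same call could be made with `X` and `y` from the train/test split section in place of the synthetic arrays.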

---
## 3. Predict Test Data 

### Read and Prepare test data in the same way as was done above
`real_test.csv`

In [177]:
df_test = pd.read_csv("real_test.csv")
df_test.info()
# Copy the ID column so that adding "Loan_Status" later does not write to a view of df_test
submission = df_test[["Loan_ID"]].copy()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62 entries, 0 to 61
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            62 non-null     object 
 1   Gender             62 non-null     object 
 2   Married            62 non-null     object 
 3   Dependents         62 non-null     object 
 4   Education          62 non-null     object 
 5   Self_Employed      57 non-null     object 
 6   ApplicantIncome    62 non-null     int64  
 7   CoapplicantIncome  62 non-null     float64
 8   LoanAmount         61 non-null     float64
 9   Loan_Amount_Term   60 non-null     float64
 10  Credit_History     62 non-null     float64
 11  Property_Area      62 non-null     object 
dtypes: float64(4), int64(1), object(7)
memory usage: 5.9+ KB


In [178]:
missing_indices = df_test[df_test['Self_Employed'].isnull()].index

# Fill the missing values in "Self_Employed" with the values predicted by the model
df_test.loc[missing_indices, 'Self_Employed'] = knn.predict(df_test.loc[missing_indices, ['ApplicantIncome', 'CoapplicantIncome']])

In [179]:
missing_indices = df_test[df_test['LoanAmount'].isnull()].index

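# Fill the missing values in "LoanAmount" with the values predicted by the model.
# Note: `linear` is not defined in any cell shown in this notebook; it must be a
# regressor already fitted on ["ApplicantIncome", "CoapplicantIncome"] for this cell to run.
df_test.loc[missing_indices, 'LoanAmount'] = linear.predict(df_test.loc[missing_indices, ['ApplicantIncome', 'CoapplicantIncome']])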

In [180]:
df_test["Loan_Amount_Term"] = df_test["Loan_Amount_Term"].fillna(df_test["Loan_Amount_Term"].mean())
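The cell above fills the test column with the mean of the test file itself, whereas the training column was filled with the training mean. If the goal is to prepare the test data exactly as the training data, one option is to compute the fill value once on the training frame and reuse it. The sketch below shows this on toy frames; `train_toy`, `test_toy`, and `term_fill` are hypothetical names.

```python
import pandas as pd

# Toy frames for illustration only.
train_toy = pd.DataFrame({"Loan_Amount_Term": [360.0, 360.0, 180.0, None]})
test_toy = pd.DataFrame({"Loan_Amount_Term": [360.0, None]})

# Compute the statistic on the training data only (mean() skips NaN).
term_fill = train_toy["Loan_Amount_Term"].mean()

# Reuse the same value for both frames.
train_toy["Loan_Amount_Term"] = train_toy["Loan_Amount_Term"].fillna(term_fill)
test_toy["Loan_Amount_Term"] = test_toy["Loan_Amount_Term"].fillna(term_fill)
print(test_toy)
```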

In [181]:
df_test["Married"] = df_test["Married"].replace({"Yes": 1, "No": 0})
df_test["Self_Employed"] = df_test["Self_Employed"].replace({"Yes": 1, "No": 0})
df_test["Education"] = df_test["Education"].replace({"Graduate": 1, "Not Graduate": 0})

In [182]:
df_test["Gender"] = df_test["Gender"].replace({"Male": 1, "Female": 0})

In [183]:
df_test = pd.get_dummies(df_test, columns=["Property_Area", "Dependents"])
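`pd.get_dummies` creates one column per category that actually occurs, so a small test file can end up with fewer columns than the training frame, and scikit-learn then rejects the input because the feature names differ. A minimal sketch of one way to guard against this is to reindex the test frame to the training columns. The toy frames below are assumptions for illustration.

```python
import pandas as pd

# Toy frames for illustration only; the test frame lacks two categories.
train_toy = pd.DataFrame({
    "ApplicantIncome": [2500, 4000, 6000],
    "Property_Area": ["Rural", "Semiurban", "Urban"],
})
test_toy = pd.DataFrame({
    "ApplicantIncome": [3000, 5000],
    "Property_Area": ["Urban", "Urban"],
})

train_enc = pd.get_dummies(train_toy, columns=["Property_Area"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["Property_Area"], dtype=int)

# Match the training columns in name and order; absent categories become 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

In the notebook, the equivalent step would reindex the prepared test frame to `X_train.columns` after `Loan_ID` has been dropped.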

### Make a prediction using your best model:

In [184]:
df_test = df_test.drop(columns =["Loan_ID"])

In [185]:
predictions = model.predict(df_test)

In [187]:
submission["Loan_Status"] = predictions

### Save predictions as `studentname_predictions.csv` and submit in ecourse

HINT: Use `df.to_csv('studentname_predictions.csv', index=False)`

In [189]:
submission.to_csv('Zhunusov_Adylbek.csv', index=False)