## Credit Predictor using Logistic Regression

### Setting up the environment

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
import numpy as np
import math

In [14]:
df = pd.read_csv("german_data.csv", sep=" ", header=None)
df.columns
df[20] = np.where(df[20] == 2, 0, 1)
df[20] = df[20].astype("category", copy=False)
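
The recoding step can be checked on a small example. The sketch below uses a made-up label column and assumes the usual coding for this dataset (1 = good credit, 2 = bad credit); the names `labels` and `recoded` are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy labels in the assumed original coding: 1 = good, 2 = bad
labels = pd.Series([1, 2, 2, 1, 1])

# np.where(condition, value_if_true, value_if_false):
# every 2 becomes 0, every other value becomes 1
recoded = pd.Series(np.where(labels == 2, 0, 1)).astype("category")

print(recoded.tolist())
print(recoded.cat.categories.tolist())
```

Note that any value other than 2 is mapped to 1, so unexpected codes would silently end up in the positive class.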

In [15]:
df.head()
df.describe()


| | 1 | 4 | 7 | 10 | 12 | 15 | 17 |
|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 20.903 | 3271.258 | 2.973 | 2.845 | 35.546 | 1.407 | 1.155 |
| std | 12.058814 | 2822.736876 | 1.118715 | 1.103718 | 11.375469 | 0.577654 | 0.362086 |
| min | 4.0 | 250.0 | 1.0 | 1.0 | 19.0 | 1.0 | 1.0 |
| 25% | 12.0 | 1365.5 | 2.0 | 2.0 | 27.0 | 1.0 | 1.0 |
| 50% | 18.0 | 2319.5 | 3.0 | 3.0 | 33.0 | 1.0 | 1.0 |
| 75% | 24.0 | 3972.25 | 4.0 | 4.0 | 42.0 | 2.0 | 1.0 |
| max | 72.0 | 18424.0 | 4.0 | 4.0 | 75.0 | 4.0 | 2.0 |


### Here it is important to NORMALIZE the values of the predictors

#### Normalizing (or scaling) the predictors helps when variables differ greatly in scale (for example, Age vs. Salary).
#### It keeps variables measured in large units from dominating the fit, so the model can weigh every variable independently of its units of measurement.
#### This is done before the actual training process.

In [26]:
X = df[[1,4,7,10,12,15,17]]  ## Only the numerical columns; the categorical ones would first have to be encoded as numbers, which would require a pipeline
y = df[20]  ##Note that this column contains categories, not int64
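
For reference, this is roughly what including categorical columns could look like. It is a minimal sketch on made-up data, not part of this notebook's model: the column names (`status`, `duration`, `amount`, `label`) and values are hypothetical, and one-hot encoding inside a scikit-learn `Pipeline` is only one possible approach.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical and two numerical predictors
toy = pd.DataFrame({
    "status": ["low", "mid", "none", "low", "mid", "none", "low", "none"],
    "duration": [6, 48, 12, 42, 24, 36, 24, 12],
    "amount": [1100, 6000, 2100, 7900, 4900, 9100, 2800, 3100],
    "label": [1, 0, 1, 1, 0, 1, 1, 1],
})
categorical_cols = ["status"]
numerical_cols = ["duration", "amount"]
features = toy[categorical_cols + numerical_cols]

# One-hot encode the categorical column, standardize the numerical ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ("num", StandardScaler(), numerical_cols),
])

# Chain preprocessing and the classifier so both are fitted together
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])
model.fit(features, toy["label"])

# 3 one-hot columns (one per category) + 2 scaled numerical columns
n_features = model.named_steps["preprocess"].transform(features).shape[1]
print(n_features)
```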

In [34]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
## This is Z-score standardization: it subtracts the column mean from every value and then divides the result by
## the column's standard deviation, so most values usually fall within [-3, 3]
X_standarized = scaler.fit_transform(X)
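
The Z-score formula, z = (x - mean) / std, can be verified by hand. The sketch below uses a small made-up matrix (`X_toy` is hypothetical) and compares a manual NumPy computation with the output of `StandardScaler`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy matrix: column 0 on a small scale, column 1 on a large scale
X_toy = np.array([[20.0, 1000.0],
                  [30.0, 3000.0],
                  [40.0, 8000.0]])

# Z-score by hand: subtract each column's mean, divide by its standard deviation.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also the default of np.std.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, each column has mean 0 and standard deviation 1, regardless of its original units.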

### Preparing the data for training the model

In [28]:
from sklearn.model_selection import train_test_split

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X_standarized, y, test_size=0.3, random_state=990)
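
A common variation is to split first and fit the scaler on the training rows only, so that the test rows do not influence the mean and standard deviation used for scaling. The sketch below shows that ordering on synthetic data; `X_demo`, `y_demo`, and the `stratify` argument are assumptions for illustration and are not what the cells above do.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for 7 numerical predictors
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 7))
y_demo = rng.choice([0, 1], size=1000, p=[0.3, 0.7])

# stratify keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=990, stratify=y_demo
)

# Fit the scaler on the training rows only, then reuse it for the test rows
demo_scaler = StandardScaler().fit(X_tr)
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape)
```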

### Introducing the model and training it

In [30]:
from sklearn.linear_model import LogisticRegression

In [31]:
md = LogisticRegression()
md.fit(X_train, y_train)
md.coef_

array([[-0.43688133, -0.07862818, -0.20658385, -0.03123329,  0.30904876,
         0.06984367, -0.06817791]])
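
One way to read these coefficients is as odds ratios. Because the predictors were standardized, `exp(coefficient)` is the factor by which the odds of class 1 change when that predictor increases by one standard deviation, holding the others fixed. The sketch below copies the coefficients printed above; the variable names are illustrative.

```python
import numpy as np

# Coefficients copied from the md.coef_ output, in the order of
# the selected columns [1, 4, 7, 10, 12, 15, 17]
columns = [1, 4, 7, 10, 12, 15, 17]
coef = np.array([-0.43688133, -0.07862818, -0.20658385, -0.03123329,
                 0.30904876, 0.06984367, -0.06817791])

# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-standard-deviation increase in that standardized predictor
odds_ratios = np.exp(coef)

for col, c, ratio in zip(columns, coef, odds_ratios):
    print(f"column {col:>2}: coef={c:+.3f}  odds ratio={ratio:.3f}")
```

A ratio below 1 (as for column 1) lowers the odds of class 1, and a ratio above 1 (as for column 12) raises them. `LogisticRegression` applies L2 regularization by default, so the coefficients are somewhat shrunk toward zero.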

### Making predictions

In [32]:
predictions = md.predict(X_test)
predictions.view()

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [33]:
y_test

707    0
725    1
91     1
19     1
31     1
      ..
698    1
307    0
776    1
257    0
297    1
Name: 20, Length: 300, dtype: category
Categories (2, int64): [0, 1]

### Evaluating the model

In [38]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, predictions)
print(f'Accuracy: {accuracy:.2f}')

Accuracy: 0.73


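#### An accuracy of 73% is a reasonable result considering that most variables were left out of the model.
#### It should still be read against the class balance: the model predicts class 1 for the large majority of test rows, and this dataset is usually described as about 70% good-credit cases, so always predicting class 1 would likely score close to 70%.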
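
One way to put an accuracy figure in context is to compare it with a majority-class baseline and to inspect the confusion matrix. The sketch below uses small made-up label arrays (`y_true` and `y_pred` are hypothetical, not this notebook's results) to show the mechanics.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels (70% class 1) and predictions from a model
# that predicts class 1 almost everywhere
y_true = np.array([1] * 7 + [0] * 3)
y_pred = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Baseline: always predict the most frequent class seen during fit.
# The features are ignored by this strategy, so a column of zeros is enough.
dummy_X = np.zeros((len(y_true), 1))
baseline = DummyClassifier(strategy="most_frequent").fit(dummy_X, y_true)
baseline_pred = baseline.predict(dummy_X)

baseline_acc = accuracy_score(y_true, baseline_pred)
model_acc = accuracy_score(y_true, y_pred)
print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"model accuracy:    {model_acc:.2f}")

# Rows are true classes (0, 1), columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(cm)
```

In the notebook itself, the same calls could be applied to `y_test` and `predictions`. The confusion matrix shows how many class 0 cases the model actually catches, which accuracy alone hides.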