<div>
<img src=https://www.institutedata.com/wp-content/uploads/2019/10/iod_h_tp_primary_c.svg width="300">
</div>

# *Mini Project 2*
By: Stephanie Nduaguba

### Instruction

The purpose of the Mini Project is to reinforce skills that have been covered in recent modules.

A. Find a dataset with some missing values and perform EDA:

- Use Logistic Regression, SVC and Bayes    

B. Things to consider:

- Can be either binary or multiclass classification
- Confusion matrix
- Score and interpretability with LIME for one model (the best prediction model)
    
C. Discuss the outputs.

D. Optional:
- Cross validation
- GridSearch

### Dataset Description

The dataset contains information about credit card applications. The purpose of using this dataset is to perform classification tasks, specifically for building prediction models related to credit card approval. The dataset contains a diverse set of attribute types, including continuous attributes, nominal attributes with a small number of values, and nominal attributes with a larger number of values. The types of features in the dataset include categories, integers, and real numbers. There are a total of 690 instances and 15 features.

*Has Missing Values?*
- *Yes*

### Additional Details

Given the nature of the dataset, and to ensure data confidentiality, all attribute names and values have been replaced with meaningless symbols.

Attribute Information:

    A1: b, a. Gender
    A2: continuous. Age
    A3: continuous. Debt
    A4: u, y, l, t. Married
    A5: g, p, gg. BankCustomer
    A6: c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff. EducationLevel
    A7: v, h, bb, j, n, z, dd, ff, o. Ethnicity
    A8: continuous. YearsEmployed
    A9: t, f. PriorDefault
    A10: t, f. Employed
    A11: continuous. CreditScore
    A12: t, f. DriversLicense
    A13: g, p, s. Citizen
    A14: continuous. ZipCode
    A15: continuous. Income
    A16: +,- (class attribute) ApprovalStatus

A1-A15 are the features; A16 (+/-) is the label.

Source: https://archive.ics.uci.edu/dataset/27/credit+approval

For this project, we use a dataset from a data mining course at the Polytechnic University of Catalonia (https://www.cs.upc.edu/~belanche/Docencia/mineria/mineria.html). The dataset describes the customers (seniority, age, marital status, income, and other characteristics), the loan (the requested amount, the price of the item), and its status (paid back or not).

For our analysis, we use a copy of the dataset hosted on GitHub.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.naive_bayes import GaussianNB

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import label_binarize

from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score, classification_report, roc_curve, roc_auc_score, auc

%matplotlib inline

In [2]:
# Load and view data

CreditScoring = "https://github.com/gastonstat/CreditScoring/raw/master/CreditScoring.csv"

df = pd.read_csv(CreditScoring)
df

Unnamed: 0,Status,Seniority,Home,Time,Age,Marital,Records,Job,Expenses,Income,Assets,Debt,Amount,Price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4450,2,1,1,60,39,2,1,1,69,92,0,0,900,1020
4451,1,22,2,60,46,2,1,1,60,75,3000,600,950,1263
4452,2,0,2,24,37,2,1,2,60,90,3500,0,500,963
4453,1,0,1,48,23,1,1,3,49,140,0,0,550,550


In [4]:
# Lowercase all the column names
df.columns = df.columns.str.lower()
df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,1,9,1,60,30,2,1,3,73,129,0,0,800,846
1,1,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,2,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,1,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,1,0,1,36,26,1,1,1,46,107,0,0,310,910


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4455 entries, 0 to 4454
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   status     4455 non-null   int64
 1   seniority  4455 non-null   int64
 2   home       4455 non-null   int64
 3   time       4455 non-null   int64
 4   age        4455 non-null   int64
 5   marital    4455 non-null   int64
 6   records    4455 non-null   int64
 7   job        4455 non-null   int64
 8   expenses   4455 non-null   int64
 9   income     4455 non-null   int64
 10  assets     4455 non-null   int64
 11  debt       4455 non-null   int64
 12  amount     4455 non-null   int64
 13  price      4455 non-null   int64
dtypes: int64(14)
memory usage: 487.4 KB


We will treat zero values as missing values. In addition, several features are stored as integers even though they hold categorical codes rather than quantities.
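
One way to check this assumption is to count the zeros in each column before deciding which of them stand for missing values. The sketch below uses a small hypothetical frame (`sample`) with some of the dataset's column names rather than the full data, and it assumes that a zero is a missing-value code only in the categorical code columns (`home`, `marital`, `job`), while a zero in `seniority` or `debt` is a real quantity.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample reusing some of the dataset's column names
sample = pd.DataFrame({
    "seniority": [9, 17, 0, 0],
    "home": [1, 0, 2, 1],
    "marital": [2, 3, 0, 1],
    "job": [3, 1, 1, 0],
    "debt": [0, 0, 600, 0],
})

# Count how many zeros each column contains
zero_counts = (sample == 0).sum()
print(zero_counts)

# Assumption: zero is a missing-value code only in the categorical code columns
code_columns = ["home", "marital", "job"]
cleaned = sample.copy()
cleaned[code_columns] = cleaned[code_columns].replace(0, np.nan)

# Missing values per column after the replacement
print(cleaned.isna().sum())
```

Whether a zero in a numeric column such as `assets` or `debt` is missing or genuine is a judgment call that the zero counts alone cannot settle.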

### Dataset attributes

We can see that the DataFrame has the following columns:
    
    status: whether the customer managed to pay back the loan (1) or not (2)
    seniority: job experience in years
    home: type of homeownership: renting (1), a homeowner (2), and others
    time: period planned for the loan (in months)
    age: age of the client
    marital [status]: single (1), married (2), and others
    records: whether the client has any previous records: no (1), yes (2) (It’s not clear from the dataset description what kind of records we have in this column. For the purposes of this project, we may assume that it’s about records in the bank’s database.)
    job: type of job: full-time (1), part-time (2), and others
    expenses: how much the client spends per month
    income: how much the client earns per month
    assets: total worth of all the assets of the client
    debt: amount of credit debt
    amount: requested amount of the loan
    price: price of an item the client wants to buy
    
Although most of the columns are numerical, five are categorical: status, home, marital, records, and job. Their values are stored as numeric codes, so we need to translate them into the actual category names.

In [9]:
# Convert categorical variables to strings

# Create value mapping for status
status_values = {
    1: 'ok',
    2: 'default',
    0: 'unknown'
}

# Use dictionary to do the mapping
df.status = df.status.map(status_values)
df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,1,60,30,2,1,3,73,129,0,0,800,846
1,ok,17,1,60,58,3,1,1,48,131,0,0,1000,1658
2,default,10,2,36,46,2,2,3,90,200,3000,0,2000,2985
3,ok,0,1,60,24,1,1,1,63,182,2500,0,900,1325
4,ok,0,1,36,26,1,1,1,46,107,0,0,310,910


In [10]:
# Create value mapping for home
home_values = {
    1: 'rent',
    2: 'owner',
    3: 'private',
    4: 'ignore',
    5: 'parents',
    6: 'other',
    0: 'unknown'
}

df.home = df.home.map(home_values)

In [11]:
# Create value mapping for marital
marital_values = {
    1: 'single',
    2: 'married',
    3: 'widow',
    4: 'separated',
    5: 'divorced',
    0: 'unknown'
}

df.marital = df.marital.map(marital_values)

In [12]:
# Create value mapping for records
records_values = {
    1: 'no',
    2: 'yes',
    0: 'unknown'
}

df.records = df.records.map(records_values)

In [13]:
# Create value mapping for job
job_values = {
    1: 'fixed',
    2: 'parttime',
    3: 'freelance',
    4: 'others',
    0: 'unknown'
}

df.job = df.job.map(job_values)
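
`Series.map` with a dictionary turns any value that is not a key of the dictionary into `NaN`, so a code missing from one of the mappings above would silently become a missing value. A minimal sketch of a check, using a hypothetical series `codes` that deliberately contains a code (5) absent from the mapping:

```python
import pandas as pd

# Hypothetical job codes; 5 is deliberately absent from the mapping
codes = pd.Series([1, 2, 3, 0, 5], name="job")
job_values = {1: "fixed", 2: "parttime", 3: "freelance", 4: "others", 0: "unknown"}

mapped = codes.map(job_values)

# Codes that were not translated show up as NaN after mapping
unmapped_codes = sorted(codes[mapped.isna()].unique())
print(unmapped_codes)
```

An empty list would mean that every code was translated. Running the check on the original integer column, before it is overwritten, keeps the untranslated codes visible.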

In [14]:
# View categorical data transformation
df.head()

Unnamed: 0,status,seniority,home,time,age,marital,records,job,expenses,income,assets,debt,amount,price
0,ok,9,rent,60,30,married,no,freelance,73,129,0,0,800,846
1,ok,17,rent,60,58,widow,no,fixed,48,131,0,0,1000,1658
2,default,10,owner,36,46,married,yes,freelance,90,200,3000,0,2000,2985
3,ok,0,rent,60,24,single,no,fixed,63,182,2500,0,900,1325
4,ok,0,rent,36,26,single,no,fixed,46,107,0,0,310,910


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4455 entries, 0 to 4454
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   status     4455 non-null   object
 1   seniority  4455 non-null   int64 
 2   home       4455 non-null   object
 3   time       4455 non-null   int64 
 4   age        4455 non-null   int64 
 5   marital    4455 non-null   object
 6   records    4455 non-null   object
 7   job        4455 non-null   object
 8   expenses   4455 non-null   int64 
 9   income     4455 non-null   int64 
 10  assets     4455 non-null   int64 
 11  debt       4455 non-null   int64 
 12  amount     4455 non-null   int64 
 13  price      4455 non-null   int64 
dtypes: int64(9), object(5)
memory usage: 487.4+ KB
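
Because the zero codes were mapped to the string `'unknown'`, `df.info()` still reports every column as fully non-null. A possible way to surface those entries as missing values is to replace the label with `NaN`. The sketch below does this on a hypothetical mini-frame (`sample`), not on the project data, and the choice of `DataFrame.mask` is only one option.

```python
import pandas as pd

# Hypothetical mini-frame with already-translated categorical columns
sample = pd.DataFrame({
    "status": ["ok", "default", "ok", "unknown"],
    "home": ["rent", "unknown", "owner", "rent"],
    "job": ["fixed", "fixed", "unknown", "freelance"],
    "income": [129, 131, 200, 182],
})

categorical = ["status", "home", "job"]

# Count the 'unknown' labels per categorical column
unknown_counts = (sample[categorical] == "unknown").sum()
print(unknown_counts)

# Replace the 'unknown' label with NaN so pandas counts it as missing
with_nan = sample.copy()
with_nan[categorical] = sample[categorical].mask(sample[categorical] == "unknown")
print(with_nan.isna().sum())
```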
