# Logistic Regression Coding Challenge

© Explore Data Science Academy

## Honour Code

I **Shedrack**, **Udeh**, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the EDSA honour code (https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

## Overview

In this coding challenge, we get our first practical experience of building models for classification problems, starting with a basic logistic regression model.

<br></br>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/credit_card.jpg"
     alt="Learn good habits to avoid modeling debt"
     style="float: center; padding-bottom=0.5em"
     width=600px/>
Learn good habits to avoid modeling debt... Photo by <a href="https://unsplash.com/@rupixen?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"> Rupixen.com </a> on Unsplash.
</div>

The structure of this notebook is as follows:

 - First, we will start off by loading and viewing the dataset.
 - We will see that the dataset has a mixture of both numerical and non-numerical features; that it contains values from different ranges; and that it contains a number of missing entries.
 - Based upon the observations above, we will preprocess the dataset to ensure the machine learning model we choose can make good predictions.
 - After our data is in good shape, we will do some exploratory data analysis to build our intuitions.
 - Finally, we will build a machine learning model that can predict if an individual's application for a credit card will be accepted.

### Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score


## The Dataset
We'll use the [Credit Card Approval dataset](http://archive.ics.uci.edu/ml/datasets/credit+approval) from the UCI Machine Learning Repository.
    
We explore the variables within this dataset in the sections below. 

### Reading in the data

First, we load and view the dataset. Because the data are confidential, the contributor of the dataset has anonymized the feature names.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/89fee4463f428f55d31a254924e18501a3c468c3/Data/classification_sprint/cc_approvals.data',header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


The output may appear a bit confusing at first sight, but let's try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but [this blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) gives us a pretty good overview of the probable features. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code>.

This gives us a pretty good starting point, and we can map these features with respect to the columns in the output.   
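
As a minimal sketch of that mapping, the probable names can be attached to a copy of the data. The two rows below are copied from the `df.head()` output and stand in for the full dataset; the names are the blog's interpretation, not confirmed column names, and `probable_names`, `sample` and `named` are illustrative variables.

```python
import pandas as pd

# Probable feature names, in column order, as listed in the text above.
# These are an interpretation of the anonymized columns, not confirmed names.
probable_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Two rows copied from the df.head() output, standing in for the full dataset.
sample = pd.DataFrame([
    ["b", 30.83, 0.0, "u", "g", "w", "v", 1.25, "t", "t", 1, "f", "g", 202, 0, "+"],
    ["a", 58.67, 4.46, "u", "g", "q", "h", 3.04, "t", "t", 6, "f", "g", 43, 560, "+"],
])

# rename() returns a new frame, so the integer labels used later in the
# notebook (for example, dropping columns 11 and 13) are left untouched.
named = sample.rename(columns=dict(enumerate(probable_names)))
print(named[["Age", "Debt", "ApprovalStatus"]])
```

Working on a renamed copy keeps the rest of the notebook, which refers to columns by integer position, unchanged.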

As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing.

In [3]:
df.tail(20)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
670,b,47.17,5.835,u,g,w,v,5.5,f,f,0,f,g,465,150,-
671,b,25.83,12.835,u,g,cc,v,0.5,f,f,0,f,g,0,2,-
672,a,50.25,0.835,u,g,aa,v,0.5,f,f,0,t,g,240,117,-
673,?,29.5,2.0,y,p,e,h,2.0,f,f,0,f,g,256,17,-
674,a,37.33,2.5,u,g,i,h,0.21,f,f,0,f,g,260,246,-
675,a,41.58,1.04,u,g,aa,v,0.665,f,f,0,f,g,240,237,-
676,a,30.58,10.665,u,g,q,h,0.085,f,t,12,t,g,129,3,-
677,b,19.42,7.25,u,g,m,v,0.04,f,t,1,f,g,100,1,-
678,a,17.92,10.21,u,g,ff,ff,0.0,f,f,0,f,g,0,50,-
679,a,20.08,1.25,u,g,c,v,0.0,f,f,0,f,g,0,0,-



<ul>
<li>Our dataset contains both numeric and non-numeric data (of <code>float64</code>, <code>int64</code> and <code>object</code> types). Specifically, features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively), and all the other features contain non-numeric values.</li>
<li>The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000.</li>
<li>Finally, the dataset has missing values, which we'll take care of in this task. The missing values are labeled with '?', as can be seen in the last cell's output.</li>
</ul>
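
The observations above can be checked with a few pandas calls. The sketch below uses a small toy frame with hypothetical values rather than the real dataset; the same calls should apply to `df`.

```python
import pandas as pd

# Toy frame (hypothetical values) mimicking the raw data: '?' marks a missing entry.
toy = pd.DataFrame({
    0: ["b", "a", "?", "b"],
    1: ["30.83", "?", "24.5", "27.83"],  # numeric values read as text because of '?'
    2: [0.0, 4.46, 0.5, 1.54],
})

# dtype of each column: a '?' anywhere keeps the whole column non-numeric.
print(toy.dtypes)

# Number of '?' placeholders per column.
missing_per_column = (toy == "?").sum()
print(missing_per_column)

# Value range of the numeric columns.
print(toy.describe().loc[["min", "max"]])
```

Note that a column holding numbers plus a single '?' is not treated as numeric until the placeholder is replaced and the column is converted.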

## Data Cleaning

## Question 1 

Write a function to clean the given data. The function should:
* Replace the '?'s with NaN.
* Impute the missing values of numeric columns with mean imputation.
* Impute the missing values of non-numeric columns with the most frequent values as present in the respective columns.

_**Function Specifications:**_
* Should take a pandas DataFrame and column name as input and return a list as an output.
* The list should contain the count of each unique value in the column.

In [7]:
### START FUNCTION
def data_cleaning(data, column_name):
    # Replace the '?' placeholders with NaN (modifies `data` in place)
    data.replace('?', np.nan, inplace=True)
    
    # Impute missing values in numeric columns with the column mean
    numerical_cols = data.select_dtypes(include=np.number).columns
    data[numerical_cols] = data[numerical_cols].astype(float)
    data[numerical_cols] = data[numerical_cols].fillna(data[numerical_cols].mean())
    
    # Impute missing values in non-numeric columns with the most frequent value
    non_numerical_cols = data.select_dtypes(exclude=np.number).columns
    data[non_numerical_cols] = data[non_numerical_cols].fillna(data[non_numerical_cols].mode().iloc[0])
    
    # Return the count of each unique value in the column
    return data[column_name].value_counts().tolist()

### END FUNCTION

In [8]:
data_cleaning(df, 9)

[395, 295]

_**Expected Outputs:**_
    

>```
data_cleaning(df, 0) == [480, 210]
data_cleaning(df, 9) == [395, 295]
```
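
To illustrate the imputation steps from Question 1 in isolation, here is a minimal sketch on a toy frame with hypothetical values. It assumes we already know which column is meant to be numeric: a column that contains '?' is read as text, so it is converted explicitly with `pd.to_numeric` before taking the mean. This is one possible approach, not necessarily the one the challenge intends.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); '?' marks a missing entry.
toy = pd.DataFrame({
    "category": ["a", "b", "?", "b"],
    "amount": ["10.0", "?", "30.0", "20.0"],  # stored as text because of '?'
})

cleaned = toy.replace("?", np.nan)

# Assumption: we know 'amount' is meant to be numeric, so convert it explicitly.
cleaned["amount"] = pd.to_numeric(cleaned["amount"])

# Mean imputation for the numeric column: the mean of 10, 30 and 20 is 20.
cleaned["amount"] = cleaned["amount"].fillna(cleaned["amount"].mean())

# Mode imputation for the non-numeric column: 'b' is the most frequent value.
cleaned["category"] = cleaned["category"].fillna(cleaned["category"].mode().iloc[0])

print(cleaned)
print(cleaned["category"].value_counts().tolist())
```

Without the explicit conversion, `select_dtypes(include=np.number)` would skip the `amount` column and it would be filled with its most frequent string instead of its mean.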

## Data Preprocessing

## Question 2

Write a function to pre-process the data so that we can run it through the classifier. The function should:
* Convert the non-numeric data into numeric using sklearn's ```LabelEncoder```
* Drop the features 11 and 13 and convert the DataFrame to a NumPy array
* Split the data into features and labels
* Scale the features to the [0, 1] range using sklearn's ```MinMaxScaler```
* Split the data into 80% training and 20% testing data.
* Use the `train_test_split` function from `sklearn` to do this.
* Set `random_state` to 42 for this call.

_**Function Specifications:**_
* Should take a dataframe as input.
* The input should be the raw unprocessed dataframe df.
* Should return two `tuples` of the form `(X_train, y_train), (X_test, y_test)`.
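
Before applying them to the whole dataset, the two transformers named above can be tried on small hypothetical inputs. This is only a sketch: `LabelEncoder` assigns integer codes to the sorted unique labels, and `MinMaxScaler` rescales each column with `(x - min) / (max - min)`.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# LabelEncoder assigns integers to the sorted unique labels.
labels = np.array(["+", "-", "-", "+"])
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(encoder.classes_)  # sorted labels; the position in this array is the code
print(encoded)

# MinMaxScaler rescales each column to [0, 1] via (x - min) / (max - min).
values = np.array([[0.0], [5.0], [10.0], [20.0]])  # hypothetical feature column
scaled = MinMaxScaler().fit_transform(values)
manual = (values - values.min(axis=0)) / (values.max(axis=0) - values.min(axis=0))
print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because '+' sorts before '-', this encoding maps '+' to 0 and '-' to 1, which is worth keeping in mind when reading metrics that refer to the "positive" class.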

In [11]:
### START FUNCTION
def data_preprocess(df):  
    
    # Convert the non-numeric columns to integer codes using LabelEncoder
    label_encoder = LabelEncoder()
    df = df.apply(lambda x: label_encoder.fit_transform(x.astype(str)) if x.dtype == "object" else x)
    
    # Drop features 11 and 13
    df = df.drop([11, 13], axis=1)
    
    # Convert the DataFrame to a NumPy array
    data = df.values
    
    # Split the data into features (all but the last column) and labels (last column)
    X = data[:, :-1]
    y = data[:, -1]
    
    # Rescale the features to the [0, 1] range using MinMaxScaler
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)
    
    # Split the data into 80% training and 20% testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    return (X_train, y_train), (X_test, y_test)

### END FUNCTION

In [12]:
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[:1])
print(y_train[:1])
print(X_test[:1])
print(y_test[:1])

[[1.         0.25862069 0.48214286 1.         1.         0.38461538
  0.25       0.         0.         0.         0.         0.
  0.        ]]
[1.]
[[0.         0.20402299 0.05357143 0.5        0.         0.38461538
  0.25       0.         0.         1.         0.02985075 0.
  0.00105   ]]
[1.]


_**Expected Outputs:**_

```python
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[:1])
print(y_train[:1])
print(X_test[:1])
print(y_test[:1])
```

> ```
[[1.         0.25787966 0.48214286 1.         1.         0.42857143
  0.33333333 0.         0.         0.         0.         0.
  0.        ]]
[1.]
[[0.5        1.         0.05357143 0.66666667 0.33333333 0.42857143
  0.33333333 0.         0.         1.         0.02985075 0.
  0.00105   ]]
[1.]
```
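
The 80/20 split with a fixed seed can be sketched on hypothetical data as follows. The arrays here are made up for illustration; only the `train_test_split` arguments match those in the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)

# Using the same random_state again reproduces the same split.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_train, X_train_again))
```

Fixing `random_state` is what makes results comparable between runs; differences from the expected outputs are then more likely to come from the earlier cleaning and encoding steps than from the split itself.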

## Training the model

## Question 3.1

Now that we have formatted our data, we can fit a model using sklearn's `LogisticRegression` class with solver 'lbfgs'. Write a function that will take as input `(X_train, y_train)` that we created previously, and return a trained model.

_**Function Specifications:**_
* Should take two numpy `arrays` as input in the form `(X_train, y_train)`.
* The returned model should be fitted to the data.

In [19]:
### START FUNCTION
def train_model(X_train, y_train):
    
    # Creating and training a logistic regression model:
    model = LogisticRegression(solver='lbfgs')
    model.fit(X_train, y_train)
    
    return model 

### END FUNCTION

In [20]:
lm = train_model(X_train, y_train)
print(lm.intercept_[0])
print(lm.coef_)

3.048954832271885
[[ 0.10540556 -0.6212872   0.01130681  0.76133957  0.30195581 -0.2834454
  -0.49387636 -0.76575712 -3.43528863 -1.06426785 -0.82406864  0.04956249
  -1.35582238]]


_**Expected Outputs:**_

```python
lm = train_model(X_train, y_train)
print(lm.intercept_[0])
print(lm.coef_)
```
```
1.5068926456005878
[[ 0.25237869 -0.22847881 -0.01779302  2.00977742  0.23903441 -0.29504922
  -0.08952344 -0.83468871 -3.48756932 -1.07648711 -0.83688921  0.07860585
  -1.3077735 ]]
```
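
To connect the printed intercept and coefficients to the model's predictions, the sketch below recomputes the predicted probabilities by hand. It uses synthetic data from `make_classification` (not the credit card dataset), so the numbers are purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data (not the credit card dataset), used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

model = LogisticRegression(solver="lbfgs").fit(X, y)

# Linear score z = X @ w + b, then the sigmoid maps it to a probability.
z = X @ model.coef_.ravel() + model.intercept_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn.
sklearn_proba = model.predict_proba(X)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

In other words, for a binary problem each coefficient shifts the log-odds of class 1, and the intercept sets the log-odds when all features are zero.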

## Testing the model

### Question 3.2 

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. The ROC curve plots the true positive rate against the false positive rate at each threshold, and the AUC (the area under that curve) measures separability: it tells us how well the model can distinguish between the classes. Write a function which returns the ROC AUC score of your trained model when evaluated on the test set.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a `float` of the roc auc score of the model. This number should be between zero and one.

_**Hint**_  Use the positive class's probability as the score
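
One way to build intuition for the score: the ROC AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. The sketch below checks this on four hypothetical labels and scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores for the positive class.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count half.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc)
print(roc_auc_score(y_true, y_score))
```

Here three of the four pairs are ordered correctly (only 0.35 versus 0.4 is not), so both computations give 0.75.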

In [25]:
### START FUNCTION
def roc_score(lm, X_test, y_test):

    # Prediction of the probabilities of the positive class
    y_prediction_prob = lm.predict_proba(X_test)[:, 1]
    
    # Calculating ROC AUC score
    roc_auc = roc_auc_score(y_test, y_prediction_prob)
    
    return roc_auc

### END FUNCTION

In [26]:
print(roc_score(lm,X_test,y_test))

0.8821428571428571


_**Expected Outputs:**_
    
```python
print(roc_score(lm,X_test,y_test))
```
>```
0.8865546218487395
```

### Question 3.3

Write a function which calculates the Accuracy, Precision, Recall and F1 scores.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a tuple in the form (`Accuracy`, `Precision`, `Recall`, `F1-Score`)

In [29]:
### START FUNCTION
def scores(lm, X_test, y_test):

    # Making the predictions on the test set
    y_pred = lm.predict(X_test)
    
    # Calculating the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    
    # Calculating the precision
    precision = precision_score(y_test, y_pred)
    
    # Calculating the recall
    recall = recall_score(y_test, y_pred)
    
    # Calculating the F1-score
    f1 = f1_score(y_test, y_pred)
    
    return (accuracy, precision, recall, f1)

### END FUNCTION

In [30]:
(accuracy, precision, recall, f1) = scores(lm, X_test, y_test)    

print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)

Accuracy: 0.826087
Precision: 0.854839
Recall: 0.779412
F1 score: 0.815385
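
These four scores all derive from the confusion matrix. As a worked sketch, the counts below are hypothetical: they were chosen because they reproduce the scores printed above for a 138-sample test set, but they were not read from the notebook's actual predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical confusion-matrix counts, chosen to reproduce the printed scores.
tp, fp, fn, tn = 53, 9, 15, 61

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct / all
precision = tp / (tp + fp)                          # correct among predicted positives
recall = tp / (tp + fn)                             # correct among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print('%f %f %f %f' % (accuracy, precision, recall, f1))

# Rebuild label vectors with those counts and compare against sklearn.
y_true = [1] * tp + [0] * fp + [1] * fn + [0] * tn
y_pred = [1] * tp + [1] * fp + [0] * fn + [0] * tn
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```

Precision and recall are computed for the class labeled 1, which is the default `pos_label` in scikit-learn.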


_**Expected Outputs:**_
```python
(accuracy, precision, recall, f1) = scores(lm,X_test,y_test)
    
print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)
```
> ```
Accuracy: 0.833333
Precision: 0.846154
Recall: 0.808824
F1 score: 0.827068
```