<center> <h1>CreditCardApprovalPredictor:</h1><h2> A Logistic Regression Model for Credit Card Application Approval</h2></center>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/credit_card.jpg"
     alt="Learn good habits to avoid modelling debt"
     style="float: center; padding-bottom=0.5em"
     width=600px/>
Learn good habits to avoid modelling debt... Photo by <a href="https://unsplash.com/@rupixen?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"> Rupixen.com </a> on Unsplash. <br> Project completed by Ikechukwu Chilaka
</div>

### 1.0 Introduction
In this challenge project, I gain practical experience building models for classification problems, using a basic logistic regression model.

#### 1.1 Objective
- Implement a logistic regression model with scikit-learn to solve a classification problem.
- Apply learned concepts to preprocess data, fit the model, and evaluate its performance.
- Enhance problem-solving skills by addressing a real-world classification challenge.

#### 1.2 Method
- First, I loaded and viewed the dataset.
- I discovered that the dataset has a mixture of numerical and non-numerical features, that its values span different ranges, and that it contains a number of missing entries.
- Based on these observations, I carried out preprocessing steps on the dataset so that the machine learning model I chose could make good predictions.
- Once the data was in good shape, I did some exploratory data analysis to build my intuition.
- Finally, I built a machine learning model that predicts whether an individual's application for a credit card will be accepted.
 
### Imports

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

### 2.0  The Dataset
I will use the [Credit Card Approval dataset](http://archive.ics.uci.edu/ml/datasets/credit+approval) from the UCI Machine Learning Repository.

The variables within this dataset will be explored in the sections below. 

#### 2.1 Reading in the data

First, I load and view the dataset. Because the data are confidential, the contributor of the dataset has anonymized the feature names.

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/89fee4463f428f55d31a254924e18501a3c468c3/Data/classification_sprint/cc_approvals.data',header=None)
df.head()

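|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |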


The output may appear a bit confusing at first glance, but I will try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but [this blog](http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html) gives a pretty good overview of the probable features. The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code>.

This gives a pretty good starting point, and I can map these features to the columns in the output, in the order listed.

From a first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing.
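The mapping from column positions to probable feature names can be sketched as follows. This is only an illustration: the names `FEATURE_NAMES`, `sample_rows` and `named_rows` are my own, the two rows are copied from the `df.head()` output, and the renaming is done on a separate frame because the rest of this notebook refers to columns by number.

```python
import pandas as pd

# Probable feature names, in column order, as listed in the text above.
FEATURE_NAMES = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Two rows copied from the df.head() output, used here instead of the full df.
sample_rows = pd.DataFrame([
    ["b", 30.83, 0.0, "u", "g", "w", "v", 1.25, "t", "t", 1, "f", "g", 202, 0, "+"],
    ["a", 58.67, 4.46, "u", "g", "q", "h", 3.04, "t", "t", 6, "f", "g", 43, 560, "+"],
])

# rename() returns a new frame, so the integer-labelled original is left as is.
column_map = dict(zip(sample_rows.columns, FEATURE_NAMES))
named_rows = sample_rows.rename(columns=column_map)

print(named_rows[["Age", "Debt", "ApprovalStatus"]])
```

Keeping the integer labels on `df` itself means calls such as `data_cleaning(df, 0)` later in the notebook are unaffected.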

In [3]:
df.tail(20)

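|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 670 | b | 47.17 | 5.835 | u | g | w | v | 5.5 | f | f | 0 | f | g | 465 | 150 | - |
| 671 | b | 25.83 | 12.835 | u | g | cc | v | 0.5 | f | f | 0 | f | g | 0 | 2 | - |
| 672 | a | 50.25 | 0.835 | u | g | aa | v | 0.5 | f | f | 0 | t | g | 240 | 117 | - |
| 673 | ? | 29.5 | 2.0 | y | p | e | h | 2.0 | f | f | 0 | f | g | 256 | 17 | - |
| 674 | a | 37.33 | 2.5 | u | g | i | h | 0.21 | f | f | 0 | f | g | 260 | 246 | - |
| 675 | a | 41.58 | 1.04 | u | g | aa | v | 0.665 | f | f | 0 | f | g | 240 | 237 | - |
| 676 | a | 30.58 | 10.665 | u | g | q | h | 0.085 | f | t | 12 | t | g | 129 | 3 | - |
| 677 | b | 19.42 | 7.25 | u | g | m | v | 0.04 | f | t | 1 | f | g | 100 | 1 | - |
| 678 | a | 17.92 | 10.21 | u | g | ff | ff | 0.0 | f | f | 0 | f | g | 0 | 50 | - |
| 679 | a | 20.08 | 1.25 | u | g | c | v | 0.0 | f | f | 0 | f | g | 0 | 0 | - |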


<ul>
<li>The dataset contains both numeric and non-numeric data (of types <code>float64</code>, <code>int64</code> and <code>object</code>). Features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64, respectively), and all the other features contain non-numeric values.</li>
<li>The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000.</li>
<li>Finally, the dataset has missing values, which I will take care of in this task. The missing values in the dataset are labelled with '?', as row 673 in the last cell's output shows.</li>
</ul>
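These three observations can be checked with a few pandas calls. The sketch below runs them on a handful of values copied from the outputs above (columns 0, 2, 7, 10 and 14 only); `sample_cols` is a name I made up for the illustration, and on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# A few values copied from the outputs above, restricted to five columns.
sample_cols = pd.DataFrame({
    0: ["b", "a", "?", "a"],
    2: [0.0, 4.46, 2.0, 2.5],
    7: [1.25, 3.04, 2.0, 0.21],
    10: [1, 6, 0, 0],
    14: [0, 560, 17, 246],
})

# Column dtypes: text columns versus float and integer columns.
print(sample_cols.dtypes)

# Minimum and maximum of each numeric column, to compare value ranges.
numeric_cols = sample_cols.select_dtypes(include="number")
print(numeric_cols.agg(["min", "max"]))

# Number of '?' placeholders per column.
placeholder_counts = (sample_cols == "?").sum()
print(placeholder_counts)
```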

#### 2.2 Cleaning the data

* Replace each '?' with NaN.
* Impute the missing values of numeric columns with the column mean.
* Impute the missing values of non-numeric columns with the most frequent value in the respective column.
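As an aside, pandas can also turn the placeholder into NaN while reading the file, through the `na_values` parameter of `pd.read_csv`. The sketch below is an alternative I am assuming for illustration, not what this notebook does; the three short rows in `raw_text` are made up in the same comma-separated layout, and `parsed` and `age_filled` are hypothetical names.

```python
import io
import pandas as pd

# Three made-up rows in the same layout as the raw file, with '?' for missing values.
raw_text = io.StringIO(
    "b,30.83,0.0,+\n"
    "?,29.5,2.0,-\n"
    "a,?,2.5,-\n"
)

# na_values="?" tells the parser to treat '?' as missing, so column 1 is read as float.
parsed = pd.read_csv(raw_text, header=None, na_values="?")
print(parsed.dtypes)
print(parsed.isna().sum())

# Mean imputation for the numeric column: mean() ignores NaN by default.
age_filled = parsed[1].fillna(parsed[1].mean())
print(age_filled)
```

Reading the file this way lets a numeric column that contains '?' keep a numeric dtype, so it can be imputed with its mean.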

In [4]:
def data_cleaning(data, column_name):
    """
    Replaces '?' with NaN, imputes missing values, and counts the unique values in one column.
    The cleaning is applied to a copy; the DataFrame passed in is not modified.

    Parameters:
        data (pd.DataFrame): The pandas DataFrame to clean.
        column_name (str): The column name to calculate the unique value counts.

    Returns:
        list: A list of counts of unique values in the specified column.

    Raises:
        ValueError: If 'data' is not a DataFrame or 'column_name' is not in data.
    """
    # Ensure input is a DataFrame
    if not isinstance(data, pd.DataFrame):
        raise ValueError("Data input is not a DataFrame")

    # Ensure column name is in the DataFrame
    if column_name not in data.columns:
        raise ValueError(f"{column_name} is not a column in the data")

    # Replace '?' with NaN
    data = data.replace('?', np.nan)
    
    # Impute missing values
    # Numeric columns (float64/int64): fill with the column mean
    for column in data.columns:
        if data[column].dtype in ['float64', 'int64']:
            # mean() skips NaN values by default
            mean_calc = data[column].astype(float).mean()
            data[column] = data[column].astype(float).fillna(mean_calc)
        else:
            # Non-numeric columns (including any column that held '?' and was
            # therefore read as text): fill with the most frequent value
            most_frequent_value = data[column].mode().iloc[0]
            data[column] = data[column].fillna(most_frequent_value)
    
    # Return the count of unique values in the specified column
    return list(data[column_name].value_counts())

In [5]:
#Test
print(data_cleaning(df, 0))
print(data_cleaning(df, 9))

[480, 210]
[395, 295]


**Expected Outputs:**

```python

    data_cleaning(df, 0) == [480, 210]
    data_cleaning(df, 9) == [395, 295]
```
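Since `data_cleaning` returns only the value counts, the imputed data itself is discarded after each call. If the cleaned data is wanted for later steps, a variant that returns the DataFrame could look like the sketch below. The helper `clean_frame` and its `numeric_columns` parameter are hypothetical and not part of this notebook; the sketch also assumes that text columns holding numbers should be converted with `pd.to_numeric` before taking the mean.

```python
import numpy as np
import pandas as pd

def clean_frame(data, numeric_columns):
    """Return an imputed copy of `data` (hypothetical helper for illustration)."""
    cleaned = data.replace("?", np.nan)
    for column in cleaned.columns:
        if column in numeric_columns:
            # Convert to float first, so numeric columns read as text get a mean.
            values = pd.to_numeric(cleaned[column], errors="coerce")
            cleaned[column] = values.fillna(values.mean())
        else:
            # Fill text columns with their most frequent value.
            cleaned[column] = cleaned[column].fillna(cleaned[column].mode().iloc[0])
    return cleaned

# Small made-up frame: column 1 holds numbers stored as text because of the '?'.
sample_frame = pd.DataFrame({
    0: ["b", "a", "?", "a"],
    1: ["30.83", "?", "29.5", "37.33"],
    2: [0.0, 4.46, 2.0, 2.5],
})
cleaned_frame = clean_frame(sample_frame, numeric_columns=[1, 2])
print(cleaned_frame)
```

The input frame is left untouched, so the original and the cleaned version can be compared side by side.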

#### 2.3 Data Preprocessing

* Convert the non-numeric data into numeric using sklearn's `LabelEncoder`.
* Drop the features 11 and 13 and convert the DataFrame to a NumPy array.
* Split the data into features and labels.
* Rescale the features to the range [0, 1] using sklearn's `MinMaxScaler`.
* Split the data into 80% training and 20% testing data.
* Use the `train_test_split` method from `sklearn` to do this.
* Set `random_state` to 42 for this split.
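To make the encoding and scaling steps concrete, here is a minimal sketch on made-up values. `LabelEncoder` assigns integers to the distinct labels in sorted order, and `MinMaxScaler` maps each column to [0, 1] via (x - min) / (max - min); the arrays `toy_labels` and `toy_features` are assumptions for the illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# LabelEncoder: each distinct label becomes an integer, in sorted order.
toy_labels = ["b", "a", "?", "a"]
toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)
print(toy_encoder.classes_)
print(toy_codes)

# MinMaxScaler: per column, (x - min) / (max - min).
toy_features = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 40.0]])
toy_scaled = MinMaxScaler().fit_transform(toy_features)
toy_manual = (toy_features - toy_features.min(axis=0)) / (
    toy_features.max(axis=0) - toy_features.min(axis=0)
)
print(toy_scaled)
print(np.allclose(toy_scaled, toy_manual))
```

Note that `LabelEncoder` treats a placeholder such as '?' as a category of its own unless it has been replaced beforehand.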

In [6]:
def data_preprocess(df):
    """
    Processes the input DataFrame for classification tasks by encoding, scaling, and splitting.

    Parameters:
        df (pd.DataFrame): The raw, unprocessed dataframe.

    Returns:
        tuple: Two tuples containing the training and testing data splits: ((X_train, y_train), (X_test, y_test)).
    """
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()

    # Initialize LabelEncoder
    label_encoder = LabelEncoder()
    
    # Convert non-numeric columns to numeric using LabelEncoder
    for column in df.columns:
        if df[column].dtype == 'object':
            df[column] = label_encoder.fit_transform(df[column])

    # Drop the features 11 and 13
    df = df.drop(columns=[df.columns[11], df.columns[13]])

    # Convert DataFrame to a NumPy array
    data_np = df.values
    
    # Split data into features and labels (assuming the last column is the label) - standard convention
    X = data_np[:, :-1]
    y = data_np[:, -1]
    
    # Rescale each feature to the range [0, 1] using MinMaxScaler
    scaler = MinMaxScaler()
    X = scaler.fit_transform(X)
    
    # Split the data into 80% training and 20% testing data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    return (X_train, y_train), (X_test, y_test)

In [7]:
#Test
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[:1])
print(y_train[:1])
print(X_test[:1])
print(y_test[:1])

[[1.         0.25787966 0.48214286 1.         1.         0.42857143
  0.33333333 0.         0.         0.         0.         0.
  0.        ]]
[1.]
[[0.5        1.         0.05357143 0.66666667 0.33333333 0.42857143
  0.33333333 0.         0.         1.         0.02985075 0.
  0.00105   ]]
[1.]


**Expected Outputs:**

```python
(X_train, y_train), (X_test, y_test) = data_preprocess(df)
print(X_train[:1])
print(y_train[:1])
print(X_test[:1])
print(y_test[:1])
```

```
[[1.         0.25787966 0.48214286 1.         1.         0.42857143
  0.33333333 0.         0.         0.         0.         0.
  0.        ]]
[1.]
[[0.5        1.         0.05357143 0.66666667 0.33333333 0.42857143
  0.33333333 0.         0.         1.         0.02985075 0.
  0.00105   ]]
[1.]
```
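One design choice worth noting: `data_preprocess` fits the scaler on all rows before splitting. A common alternative is to split first and fit the scaler on the training rows only, so the test rows do not influence the minimum and maximum. The sketch below shows that ordering on synthetic data; it is an assumption about how one might restructure the step, not the method used above, and it would not reproduce the expected outputs of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_all = rng.uniform(0, 100, size=(50, 3))   # synthetic features
y_all = rng.integers(0, 2, size=50)         # synthetic binary labels

# Split first: 80% training rows, 20% testing rows.
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2, random_state=42)

# Learn min and max from the training rows only, then apply them to both splits.
train_scaler = MinMaxScaler().fit(X_tr)
X_tr_scaled = train_scaler.transform(X_tr)
X_te_scaled = train_scaler.transform(X_te)

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
print(X_te_scaled.min(axis=0), X_te_scaled.max(axis=0))
```

With this ordering the scaled test values can fall slightly outside [0, 1], because the test rows may contain values beyond the training minimum or maximum.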

### 3.0 Training the model

Now that the data is formatted, I can fit a model using sklearn's `LogisticRegression` class with the 'lbfgs' solver. I will write a function that takes as input the `(X_train, y_train)` created previously and returns a trained model.

In [8]:
from sklearn.linear_model import LogisticRegression

def train_model(X_train, y_train):
    """
    Trains a logistic regression model on the provided training data.

    Parameters:
        X_train (numpy.array): The training feature dataset.
        y_train (numpy.array): The training label dataset.

    Returns:
        LogisticRegression: A logistic regression model fitted to the training data.
    """
    # Create a logistic regression model with the 'lbfgs' solver
    model = LogisticRegression(solver='lbfgs')  

    # Fit the model to the training data
    model.fit(X_train, y_train)

    return model

In [9]:
#Test
lm = train_model(X_train, y_train)
print(lm.intercept_[0])
print(lm.coef_)

1.5068926456002565
[[ 0.25237869 -0.22847881 -0.01779302  2.00977742  0.23903441 -0.29504922
  -0.08952344 -0.83468871 -3.48756932 -1.07648711 -0.83688921  0.07860585
  -1.3077735 ]]


_**Expected Outputs:**_

```python
lm = train_model(X_train, y_train)
print(lm.intercept_[0])
print(lm.coef_)
```
```
1.5068926456005878
[[ 0.25237869 -0.22847881 -0.01779302  2.00977742  0.23903441 -0.29504922
  -0.08952344 -0.83468871 -3.48756932 -1.07648711 -0.83688921  0.07860585
  -1.3077735 ]]
```
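The intercept and coefficients printed above are what the model uses to turn features into a probability: it computes a linear score and passes it through the logistic (sigmoid) function. The sketch below checks this on synthetic data, which I generate purely for illustration; `X_demo`, `y_demo` and `demo_model` are made-up names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
# Labels from a noisy linear rule, so that both classes are present.
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] - 0.5 * X_demo[:, 1] + noise > 0).astype(int)

demo_model = LogisticRegression(solver="lbfgs").fit(X_demo, y_demo)

# Linear score z = intercept + X @ coefficients, then the logistic function.
z = demo_model.intercept_[0] + X_demo @ demo_model.coef_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn.
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

Read this way, a negative coefficient lowers the predicted probability of the class encoded as 1 as the corresponding scaled feature increases.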

### 4.0 Testing the model

#### 4.1 AUC - ROC curve

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. The ROC curve plots the true positive rate against the false positive rate as the threshold varies, and AUC, the area under that curve, measures separability: it tells how capable the model is of distinguishing between the classes. The function below returns the ROC AUC score of the trained model when tested with the test set.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a `float` of the roc auc score of the model. This number should be between zero and one.

_**Note:**_ The predicted probability of the positive class is used as the score.

In [10]:
def roc_score(lm, X_test, y_test):
    """
    Computes the ROC AUC score for the provided test data and a fitted model.

    Parameters:
        lm (LogisticRegression): The fitted logistic regression model.
        X_test (numpy.array): The test feature dataset.
        y_test (numpy.array): The actual labels of the test dataset.

    Returns:
        float: The ROC AUC score of the model on the test data.
    """
    # Get the probability scores for the positive class
    y_scores = lm.predict_proba(X_test)[:, 1]

    # Calculate the ROC AUC score
    auc_score = roc_auc_score(y_test, y_scores)

    return auc_score

In [11]:
#Test
print(roc_score(lm,X_test,y_test))

0.8865546218487395


_**Expected Outputs:**_
    
```python
print(roc_score(lm,X_test,y_test))
```
```
0.8865546218487395
```
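One way to build intuition for this number: the ROC AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The sketch below checks that on six made-up labels and scores; `y_true_toy` and `y_score_toy` are assumptions for the illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores.
y_true_toy = np.array([0, 0, 1, 1, 0, 1])
y_score_toy = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

pos_scores = y_score_toy[y_true_toy == 1]
neg_scores = y_score_toy[y_true_toy == 0]

# Compare every positive score with every negative score; ties count as half.
diff = pos_scores[:, None] - neg_scores[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc)
print(roc_auc_score(y_true_toy, y_score_toy))
```

Here 8 of the 9 pairs are ranked correctly, so both lines print the same value of 8/9.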

#### 4.2 Accuracy, Precision, Recall and F1 scores.

_**Function Specifications:**_
* Should take the fitted model and two numpy `arrays` `X_test, y_test` as input.
* Should return a tuple in the form (`Accuracy`, `Precision`, `Recall`, `F1-Score`)

In [12]:
def scores(lm, X_test, y_test):
    """
    Calculates the Accuracy, Precision, Recall, and F1-Score of the fitted logistic regression model.

    Parameters:
        lm (LogisticRegression): The fitted logistic regression model.
        X_test (numpy.array): The test feature dataset.
        y_test (numpy.array): The actual labels of the test dataset.

    Returns:
        tuple: A tuple containing the Accuracy, Precision, Recall, and F1-Score.
    """
    # Predict the labels for the test set
    y_pred = lm.predict(X_test)

    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    # zero_division=0 makes precision_score return 0 without a warning when the model
    # predicts no positive samples, i.e. when the denominator TP + FP is zero.
    precision = precision_score(y_test, y_pred, zero_division=0)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    return (accuracy, precision, recall, f1)

In [13]:
(accuracy, precision, recall, f1) = scores(lm,X_test,y_test)
    
print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)

Accuracy: 0.833333
Precision: 0.846154
Recall: 0.808824
F1 score: 0.827068


_**Expected Outputs:**_
```python
(accuracy, precision, recall, f1) = scores(lm,X_test,y_test)
    
print('Accuracy: %f' % accuracy)
print('Precision: %f' % precision)
print('Recall: %f' % recall)
print('F1 score: %f' % f1)
```
```
Accuracy: 0.833333
Precision: 0.846154
Recall: 0.808824
F1 score: 0.827068
```
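All four scores follow from the counts of true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). The sketch below works through the formulas with hypothetical counts that I chose to be consistent with the scores printed above, assuming 138 test rows (20% of 690); the notebook itself does not print a confusion matrix, so these counts are an inference, not a reported result.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical counts, chosen to be consistent with the scores printed above.
tp, fp, fn, tn = 55, 10, 13, 60

# Rebuild label arrays that produce exactly these counts.
y_true_demo = np.array([1] * tp + [0] * fp + [1] * fn + [0] * tn)
y_pred_demo = np.array([1] * tp + [1] * fp + [0] * fn + [0] * tn)

accuracy_demo = (tp + tn) / (tp + fp + fn + tn)   # share of correct predictions
precision_demo = tp / (tp + fp)                   # share of predicted positives that are right
recall_demo = tp / (tp + fn)                      # share of actual positives that are found
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print('Accuracy: %f' % accuracy_demo)
print('Precision: %f' % precision_demo)
print('Recall: %f' % recall_demo)
print('F1 score: %f' % f1_demo)

# The sklearn functions give the same values on the rebuilt arrays.
print(np.isclose(accuracy_demo, accuracy_score(y_true_demo, y_pred_demo)))
print(np.isclose(f1_demo, f1_score(y_true_demo, y_pred_demo)))
```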


<div align="center" style=" font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/ExploreAI_logos/EAI_Blue_Dark.png"  style="width:200px";/>
</div>