# Learning theory - continued


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, auc, mean_squared_error, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay, RocCurveDisplay


## Let's try to put it all into practice

We will use the German credit risk dataset, which classifies people as good or bad credit risks, from:
https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

First we need to load our data (you will also need to upload the data file to Colab first):

In [None]:
# Load the CSV file into a DataFrame
data = pd.read_csv('4_german_credit_data.csv')

# Display the first few rows to get an initial understanding
data.head()


After loading the dataset, it's essential to get a quick overview of its characteristics, such as the number of rows and columns, the data types of the columns, and basic statistics of numerical columns.

In [None]:
# Get a concise summary of the dataset
data.info()

# Get basic statistics of numerical columns
data.describe()


In [None]:
data.head()

What we can see is that the dataset contains 21 columns: 20 features and 1 target column (CreditRisk).
Features are a mix of categorical (e.g., CheckingAccountStatus, CreditHistory) and numerical (e.g., Duration, CreditAmount) data types.
The target column CreditRisk has two classes: 1 (Good Risk) and 2 (Bad Risk).

Note that we have lots of categorical data. We need to understand how these features behave and how their values are distributed. For example:


In [None]:
# Check the distribution of some categorical features
plt.figure(figsize=(10, 6))
data['CheckingAccountStatus'].value_counts().plot(kind='bar')
plt.title('Distribution of Checking Account Status')
plt.xlabel('Status')
plt.ylabel('Count')
plt.show()

In [None]:
# Plot the CheckingAccountStatus against the CreditRisk
plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='CheckingAccountStatus', hue='CreditRisk')
plt.title('Distribution of CreditRisk by CheckingAccountStatus')
plt.ylabel('Count')
plt.xlabel('Checking Account Status')
plt.legend(title='Credit Risk', loc='upper right')
plt.grid(axis='y')

plt.show()


1. A11 (< 0 DM): This category has the highest proportion of high-risk individuals.
2. A12 (0 <= ... < 200 DM): The number of low-risk individuals is slightly higher than high-risk individuals.
3. A13 (>= 200 DM / salary assignments for at least 1 year): A higher proportion of individuals are low risk, but there is still a significant number of high-risk individuals.
4. A14 (no checking account): A majority of individuals in this category are labeled as low risk.

We should really look at all the variables.

Numerical values are a bit easier to look at:

In [None]:
# Calculate the correlation of numerical features with the target variable 'CreditRisk'
correlations = data.corr(numeric_only=True)['CreditRisk'].sort_values()

correlations

Let's pick some features:

* Duration: As we observed, it has a positive correlation with the target variable.
* CreditAmount: The amount of the loan can be an indicator of risk.
* Age: Age might play a role in determining creditworthiness.
* CheckingAccountStatus: As seen in the plot, different levels of checking account status have different risk distributions.
* CreditHistory: The credit history of a person can be a significant indicator of their credit risk.
* Purpose: The purpose of the loan can influence the risk associated with it.
* SavingsAccountBonds: Savings can be an indicator of financial stability.

In [None]:
# Selecting initial set of features
selected_features = ['Duration', 'CreditAmount', 'Age', 'CheckingAccountStatus',
                     'CreditHistory', 'Purpose', 'SavingsAccountBonds']

# Plotting correlation matrix for these features and the target variable
correlation_matrix = data[selected_features + ['CreditRisk']].corr(numeric_only=True)

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Selected Features and Target Variable')
plt.show()


We still need to think about cleaning and transforming our data a little:

Handle Missing Data: Ensure that there's no missing data in the dataset or decide on a strategy to handle them (e.g., imputation).

Convert Categorical Data: Convert categorical variables into a format suitable for KNN, typically using one-hot encoding.

In [None]:
# Check for missing values in the dataset
missing_values = data.isnull().sum()
print(f"missing values - {missing_values}")


In [None]:
# Drop rows with missing values if any exist
data_cleaned = data.dropna()

We are going to apply one-hot encoding to the categorical features.

One-hot encoding converts categorical variables into a numerical form that machine learning algorithms can use. Many algorithms, including KNN, cannot work with categorical data directly, so the categories must first be converted to numbers. One-hot encoding is one of the most common ways to do this.

How it works:

1. Identify unique categories: for each categorical variable, identify the unique categories it can take.
2. Create new columns: for each unique category, a new binary (0 or 1) column is created.
3. Binary indication: for each record, the column corresponding to the category that record takes holds a '1', and all other new columns hold a '0'. This means that out of all the new columns for a categorical variable, only one can take the value '1' for a given record.

**Important** One drawback of one-hot encoding is that it can increase the dimensionality of the dataset significantly if the categorical variable has many unique values.
It can also lead to multicollinearity, a situation where two or more variables are highly correlated. In the context of one-hot encoding, one variable can be predicted from the others. This is called the dummy variable trap.
To avoid the dummy variable trap, one common practice is to drop one of the one-hot encoded columns (hence, for a variable with n categories, we keep n−1 dummy columns).
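
As a minimal sketch of dropping one dummy column, `pd.get_dummies` accepts a `drop_first` argument. The toy DataFrame and its `Colour` column are made up for illustration and are not part of the credit dataset.

```python
import pandas as pd

# Hypothetical toy data: one categorical column with three categories
toy = pd.DataFrame({"Colour": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one new column per category (3 columns)
full = pd.get_dummies(toy, columns=["Colour"])

# drop_first=True drops the first category in sorted order ('blue'),
# leaving 2 columns; a row with both set to 0 then means 'blue'
reduced = pd.get_dummies(toy, columns=["Colour"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Whether to drop a column depends on the model: it matters most for linear models, where the redundant column makes the coefficients ambiguous.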

In [None]:
# Identify numerical and categorical columns
numeric_features = ['Duration', 'CreditAmount', 'Age']
categorical_features = ['CheckingAccountStatus', 'CreditHistory', 'Purpose', 'SavingsAccountBonds']

# Apply one-hot encoding using pandas get_dummies method
data_encoded = pd.get_dummies(data_cleaned, columns=categorical_features)

# Display the first few rows of the encoded data
data_encoded.head()

In [None]:
data_encoded.columns

In [None]:
# get_dummies keeps the 17 unencoded columns first; the one-hot dummy columns follow them
encoded_features = data_encoded.columns[17:].to_list() + numeric_features

In [None]:
encoded_features

Let's create our data finally:

In [None]:
X = data_encoded[encoded_features]
y = data_encoded['CreditRisk'] - 1  # Convert to 0 and 1

Now remember, before we go any further we need to split our data. This is essential for assessing our model.

In [None]:
# Split your data

In [None]:
# Check the shapes
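
A minimal sketch of the split, run on synthetic data from `make_classification` rather than on the credit data. The 80/20 ratio, `random_state=42`, the `stratify` argument and all `*_demo` names are illustrative assumptions, not values prescribed by this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for X and y (assumption: 1000 rows, 10 features)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=0)

# Hold out 20% for testing; stratify keeps the class balance similar in both parts
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Check the shapes: features and labels must have matching row counts
print(X_train_demo.shape, X_test_demo.shape)
print(y_train_demo.shape, y_test_demo.shape)
```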

Now we can initialise our model:

In [None]:
# Initialise a KNeighborsClassifier with n_neighbors=5


Next we fit our model to the data. Be careful which data you use:

In [None]:
# Fit the model

Next we apply our model to make predictions, in this case on the test set:

In [None]:
# Generate predictions on the test set
y_pred =
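
The initialise, fit and predict steps might look like the following sketch. It runs on synthetic data so it stays self-contained; the data, the split settings and the `*_demo` and `X_tr`/`X_te` names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: 500 rows, 8 features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Initialise the model
knn_demo = KNeighborsClassifier(n_neighbors=5)

# Fit on the training data only, so the test set stays unseen
knn_demo.fit(X_tr, y_tr)

# Predict on the held-out test data
y_pred_demo = knn_demo.predict(X_te)
print(y_pred_demo[:10])
```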

Let's take a look at what's happening. It's always worth checking your confusion matrix.

```confusion_matrix(y_true, y_pred)```

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html


In [None]:
# Compute the confusion matrix (y_test holds the true test labels from your split)
cm = confusion_matrix(y_test, y_pred)

# Display the confusion matrix
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot(cmap=plt.cm.Blues)
plt.show()

Complete the functions below to calculate our metrics. You can make use of the confusion matrix:

![CM](https://miro.medium.com/v2/resize:fit:4800/format:webp/1*uR09zTlPgIj5PvMYJZScVg.png)

```
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
```
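
As one possible way to read the four counts, the sketch below works through a small hand-made example and cross-checks the results against scikit-learn's own metric functions. The labels and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels and predictions, invented for this illustration
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred_demo = np.array([0, 1, 1, 1, 0, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # correct predictions / all predictions
recall = tp / (tp + fn)                     # positives found / actual positives
precision = tp / (tp + fp)                  # correct positives / predicted positives

print(tn, fp, fn, tp)
print(accuracy, recall, precision)

# Cross-check against scikit-learn
print(accuracy_score(y_true_demo, y_pred_demo),
      recall_score(y_true_demo, y_pred_demo),
      precision_score(y_true_demo, y_pred_demo))
```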

In [None]:
def compute_accuracy(y_true, y_pred):
    """Compute accuracy: the proportion of predictions that were correct."""

def compute_recall(y_true, y_pred):
    """Compute recall (sensitivity): of all actual positives, how many did we find."""

def compute_precision(y_true, y_pred):
    """Compute precision. Of all positive predictions, how many were correct."""


Apply our new functions to the model:

Another metric of interest is the F1 score, which helps us by balancing precision and recall.

$$
F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
$$
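
A small worked example of the formula, using made-up labels and checking the result against `f1_score` from scikit-learn:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels and predictions, invented for this illustration
y_true_demo = [1, 1, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 0, 0, 1]

precision = precision_score(y_true_demo, y_pred_demo)  # 1 of 2 positive predictions is correct
recall = recall_score(y_true_demo, y_pred_demo)        # 1 of 4 actual positives is found

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)
print(f1_score(y_true_demo, y_pred_demo))
```

Here F1 (about 0.33) sits below the simple average of precision and recall (0.375), because the harmonic mean is pulled towards the smaller of the two values.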

Create a function to calculate the F1 score and apply it to our model:


In [None]:
def compute_f1(y_true, y_pred):
    """Compute f1."""


Remember we have the ability to adjust the focus of our classifier. The ROC curve lets us explore our options for this adjustment, and lets us compare classifiers by how many options they give us. Let's apply:
```
fpr, tpr, thresholds = roc_curve(y_true, y_score)
```
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html


In [None]:
# first we need the class probability estimates


In [None]:
# then we can use the roc_curve function


We are also interested in the area under the curve (AUC) metric. We can use:

```
roc_auc_score(y_true, y_score)
```

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

We can also calculate it from fpr and tpr with `auc(fpr, tpr)`:

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html
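
The two routes to the AUC can be compared on a tiny example. The scores below are hypothetical values standing in for the positive-class probabilities a classifier would produce.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical true labels and positive-class scores
y_true_demo = np.array([0, 0, 1, 1])
y_score_demo = np.array([0.1, 0.4, 0.35, 0.8])

# Each threshold yields one (fpr, tpr) point on the ROC curve
fpr, tpr, thresholds = roc_curve(y_true_demo, y_score_demo)

# Route 1: directly from labels and scores
auc_from_scores = roc_auc_score(y_true_demo, y_score_demo)

# Route 2: trapezoidal area under the (fpr, tpr) points
auc_from_curve = auc(fpr, tpr)

print(auc_from_scores, auc_from_curve)
```

Both values come out at 0.75 here: of the four (negative, positive) pairs, three are ranked correctly by the scores.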


Typically we like to visualise this curve. We can plot the values we generated ourselves or use the function:
```
RocCurveDisplay(fpr= , tpr= , roc_auc= ).plot()
```
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html

Not a great model! Still, we have some options to adjust it. Imagine we are very concerned with false negatives, that is, credit risks we missed. How can we adjust our classifier?

In [None]:
# Predict class probabilities using predict_proba(X_test)[:, 1]



In [None]:
# Adjust the threshold to decrease the FNR (and increase TPR)
threshold =
y_pred_adjusted = (y_prob > threshold).astype(int)

# Calculate and print metrics with the adjusted threshold
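
To illustrate the effect of the threshold, the sketch below compares two thresholds on made-up probabilities. The values 0.5 and 0.25 are arbitrary choices for the illustration, not recommendations for the credit model.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted positive-class probabilities
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_prob_demo = np.array([0.1, 0.3, 0.35, 0.6, 0.3, 0.45, 0.7, 0.9])

results = {}
for threshold in (0.5, 0.25):
    # Predict positive whenever the probability exceeds the threshold
    y_pred_adj = (y_prob_demo > threshold).astype(int)
    results[threshold] = (
        recall_score(y_true_demo, y_pred_adj),
        precision_score(y_true_demo, y_pred_adj),
    )
    print(f"threshold={threshold}: recall={results[threshold][0]:.2f}, "
          f"precision={results[threshold][1]:.2f}")
```

Lowering the threshold catches every positive in this example (recall rises from 0.5 to 1.0), at the cost of more false positives and therefore lower precision.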


## Exercises

OK, time for you to give it all a try. We have already created all the data you need and split it into training and test datasets. Your job:

First, fit a linear model (LinearRegression) and several KNN models with different values of k to the data, and see which you think performs best. Remember the steps:

* Initialise the model

* Fit the model

* Make predictions (a little trickier for LinearRegression, but you have lots of examples).

* Evaluate the model by evaluating the predictions
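
Predictions are trickier for LinearRegression because it returns continuous values rather than class labels. One common approach, sketched here on synthetic data, is to threshold the output. The 0.5 cut-off, the synthetic data and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 500 rows, 8 features)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Fit a linear regression to the 0/1 labels
lin_demo = LinearRegression().fit(X_tr, y_tr)

# The raw output is continuous and not confined to [0, 1]
y_cont = lin_demo.predict(X_te)

# Turn the continuous output into class labels with a threshold
y_pred_lin = (y_cont > 0.5).astype(int)

print(y_cont[:5])
print(accuracy_score(y_te, y_pred_lin))
```

The continuous output can also be passed to `roc_curve` as the score, which makes the linear model directly comparable with the KNN probabilities.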

Compare the models and decide the model that works best.

Using the model you believe works best, adjust the threshold to:


*   Maximise precision
*   Maximise recall
*   Maximise accuracy

Taking a look at the ROC curve might help.

Try to improve your model by using more (or fewer) of the features available in the dataset. That is, change the features we are including.
