In [None]:
import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt


from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    confusion_matrix, roc_curve, roc_auc_score, auc,
    mean_squared_error, accuracy_score, precision_score,
    recall_score, f1_score, ConfusionMatrixDisplay, RocCurveDisplay,
)


## Let's try to put it all into practice

We will use the German credit risk dataset, which classifies people as good or bad credit risks:
https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

First we will need to load our data (you will also need to upload the data to Colab first):

In [None]:
# Load the CSV file into a DataFrame
data = pd.read_csv('4_german_credit_data.csv')


After loading the dataset, it's essential to get a quick overview of its characteristics, such as the number of rows and columns, the data types of the columns, and basic statistics of numerical columns.

In [None]:
data.head()

What we can see is that the dataset contains 21 columns: 20 features and 1 target column (CreditRisk).
Features are a mix of categorical (e.g., CheckingAccountStatus, CreditHistory) and numerical (e.g., Duration, CreditAmount) data types.
The target column CreditRisk has two classes: 1 (Good Risk) and 2 (Bad Risk).
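
The quick overview described above (number of rows and columns, data types, basic statistics) can be sketched as follows. The `toy` DataFrame is a hypothetical stand-in with made-up rows; only the column names are taken from the text, and the category codes are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the credit data: made-up rows, column names from the text
toy = pd.DataFrame({
    "Duration": [6, 48, 12, 42],
    "CreditAmount": [1169, 5951, 2096, 7882],
    "CheckingAccountStatus": ["A11", "A12", "A14", "A11"],  # assumed category codes
    "CreditRisk": [1, 2, 1, 1],
})

n_rows, n_cols = toy.shape                        # number of rows and columns
column_types = toy.dtypes                         # data type of each column
numeric_summary = toy.describe()                  # summary statistics of numeric columns
class_counts = toy["CreditRisk"].value_counts()   # rows per target class

print(n_rows, n_cols)
print(column_types)
print(numeric_summary)
print(class_counts)
```

The same calls should work unchanged on `data` once the CSV is loaded.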



Let's pick some features:

* Duration: As we observed, it has a positive correlation with the target variable.
* CreditAmount: The amount of the loan can be an indicator of risk.
* Age: Age might play a role in determining creditworthiness.
* CheckingAccountStatus: As seen in the plot, different levels of checking account status have different risk distributions.
* CreditHistory: The credit history of a person can be a significant indicator of their credit risk.
* Purpose: The purpose of the loan can influence the risk associated with it.
* SavingsAccountBonds: Savings can be an indicator of financial stability.

In [None]:
# Identify numerical and categorical columns
numeric_features = ['Duration', 'CreditAmount', 'Age']
categorical_features = ['CheckingAccountStatus', 'CreditHistory', 'Purpose', 'SavingsAccountBonds']

# Selecting initial set of features
selected_features = numeric_features + categorical_features

In [None]:
print(selected_features)

Let's create our data for ML:

In [None]:
X = data[selected_features]
y = data['CreditRisk'] - 1  # Map classes 1/2 to 0/1 (0 = good risk, 1 = bad risk)

We are going to apply one-hot encoding to the categorical features.

One-hot encoding is a method for converting categorical variables into a numerical form that machine learning algorithms can use. Many machine learning algorithms, including logistic regression, cannot work with categorical data directly, so the categories must be converted to numbers first. One of the most common ways to do this transformation is one-hot encoding.

How It Works:

* Identify Unique Categories: For each categorical variable, identify the unique categories it can take.
* Create New Columns: For each unique category, a new binary (0 or 1) column is created.
* Binary Indication: For each record, the column corresponding to the category the variable takes will have a '1', and all other new columns will have a '0'. This means that out of all the new columns for a categorical variable, only one can take the value '1' for a given record.
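
The steps above can be illustrated on a tiny made-up column. This sketch uses `pd.get_dummies` only because it is compact; the `Purpose` values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Hypothetical values for one categorical feature
purpose = pd.DataFrame({"Purpose": ["car", "education", "car", "furniture"]})

# One new 0/1 column per unique category
dummies = pd.get_dummies(purpose, columns=["Purpose"], dtype=int)
print(dummies)

# Every record has exactly one '1' across the new columns
print(dummies.sum(axis=1))
```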

**Important** One drawback of one-hot encoding is that it can increase the dimensionality of the dataset significantly if the categorical variable has many unique values.
It can also lead to multicollinearity, a situation where two or more variables are highly correlated. In the context of one-hot encoding, one variable can be predicted from the others. This is called the dummy variable trap.
To avoid the dummy variable trap, one common practice is to drop one of the one-hot encoded columns (hence, for a variable with n categories, we keep n−1 dummy columns).

This time we will use `OneHotEncoder` from scikit-learn. Note that logistic regression can have a problem with multicollinearity.

`OneHotEncoder` is used to convert categorical data into a binary matrix. It creates one column for each unique category in the data, and for each sample, only the column corresponding to its category will have a value of 1 (all other columns are 0).

**Use Case:** OneHotEncoder is commonly used when working with machine learning models that cannot directly handle categorical data.

**Documentation:** [OneHotEncoder - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

**IMPORTANT** Before or after training and test split?


In [None]:
# Preprocess categorical features using fit or fit_transform or fit and transform





In [None]:
# Take a look at the data

If we want a DataFrame later, we will need the new column names; remember that these differ from the original column names. We can get them from the fitted encoder with `.get_feature_names_out(categorical_features)`.

Then we will need to create a new DataFrame, for example:

```
X_categorical_df = pd.DataFrame(X_categorical,
                                columns=encoded_column_names,
                                index=X.index)
```

In [None]:
# Check the head of the categorical data


Finally we will need to combine the pieces back together. We can use `concat` from pandas, e.g.:

```pd.concat([A, B], axis=1)```
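
A small sketch of why keeping the original index matters when concatenating: `pd.concat` with `axis=1` aligns rows by index label, not by position. The frames below are made up for illustration.

```python
import pandas as pd

# Hypothetical numeric and encoded parts sharing the same row index
numeric_part = pd.DataFrame({"Age": [25, 40, 33]}, index=[10, 11, 12])
encoded_part = pd.DataFrame({"Purpose_car": [1.0, 0.0, 1.0]}, index=[10, 11, 12])

combined = pd.concat([numeric_part, encoded_part], axis=1)
print(combined)

# If one part loses its index (here reset to 0, 1, 2), the rows no longer line up
reset_part = encoded_part.reset_index(drop=True)
misaligned = pd.concat([numeric_part, reset_part], axis=1)
print(misaligned)  # six rows, each with a missing value
```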

Next we will scale our quantitative data.

`StandardScaler` standardizes features by removing the mean and scaling them to unit variance. It transforms data so that each feature has a mean of 0 and a standard deviation of 1.

**Use Case:** StandardScaler is typically used for algorithms that rely on the scale of data, such as logistic regression, support vector machines, or neural networks.

**Documentation:** [StandardScaler - scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html)

**IMPORTANT** Before or after training and test split?
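
A minimal sketch of the scaling step on made-up numbers, assuming the scaler is fitted on the training rows only and then applied to the test rows. The frame names `train_num` and `test_num` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features, already split into train and test
train_num = pd.DataFrame({"Duration": [6.0, 12.0, 24.0, 48.0],
                          "Age": [22.0, 35.0, 41.0, 58.0]})
test_num = pd.DataFrame({"Duration": [18.0], "Age": [30.0]})

scaler = StandardScaler()
scaler.fit(train_num)                       # learns the mean and std of each column
train_scaled = scaler.transform(train_num)
test_scaled = scaler.transform(test_num)    # reuses the training mean and std

# transform returns a NumPy array, so rebuild a DataFrame if column names are needed
train_scaled_df = pd.DataFrame(train_scaled, columns=train_num.columns,
                               index=train_num.index)

print(scaler.mean_)
print(train_scaled_df.mean().round(6), train_scaled_df.std(ddof=0).round(6))
```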

In [None]:
# Create a standard scaler and fit


In [None]:
# transform the data


In [None]:
# Take a look at the new data


Create new dataframes for scaled data.

Finally combine back with one hot encoded data:

In [None]:
# Check the head


### Logistic regression

We will now apply the logistic regression model from scikit-learn.

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Take a look at the parameters:

* For now we want to set `penalty=None`. scikit-learn applies regularisation by default, and we will talk about regularisation later in the course. To ensure the unpenalised fit works well we should also set `solver` to `lbfgs` and `max_iter` to 1000.
* It's always useful to check how intercepts are dealt with for regression.

Now we can initialise our model:

In [None]:
# Initialise a Logistic Regression model


Next we fit our model to the data, be careful which data to use:

In [None]:
# Fit the model


Next we apply our model to make predictions, in this case on the test set:

In [None]:
# generate predictions on the test dataset


In [None]:
# Check if our model is making predictions or estimating probabilities


Make use of the scikit-learn functions to produce our metrics of interest:

```
accuracy_score, precision_score, recall_score, f1_score
```
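
As a small worked example of these functions, the labels below are made up so the values can be checked by hand (1 true positive, 2 false negatives, 1 false positive, 4 true negatives).

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# Hypothetical true labels and predictions
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 0, 0, 0]

accuracy = accuracy_score(y_true_demo, y_pred_demo)    # (TP + TN) / all = 5/8
precision = precision_score(y_true_demo, y_pred_demo)  # TP / (TP + FP) = 1/2
recall = recall_score(y_true_demo, y_pred_demo)        # TP / (TP + FN) = 1/3
f1 = f1_score(y_true_demo, y_pred_demo)                # harmonic mean of precision and recall
cm = confusion_matrix(y_true_demo, y_pred_demo)        # rows: true class, columns: predicted class

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(cm)
```

By default the precision, recall, and F1 functions treat the class labelled 1 as the positive class.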


In [None]:
# Calculate evaluation metrics


# Print the metrics


Remember that we can adjust the focus of our classifier by changing the decision threshold. The ROC curve lets us explore our options in this adjustment and lets us compare classifiers on how many options they give us. Let's apply
```
fpr, tpr, thresholds = roc_curve(y, scores)
```
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html
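
A minimal sketch of `roc_curve` on four made-up scores. The `scores_demo` values stand in for the class 1 probabilities that a fitted model would give, for example `predict_proba(...)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical true labels and class 1 scores
y_roc_demo = np.array([0, 0, 1, 1])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_roc_demo, scores_demo)

# Each threshold gives one (false positive rate, true positive rate) trade-off
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  fpr={f:.2f}  tpr={t:.2f}")
```

The first threshold is a sentinel above every score, so it corresponds to predicting no positives at all.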


In [None]:
# first we need the class probability estimates


We are also interested in the area under the curve metric. We can use:

```
roc_auc_score(y_true, y_score)
```

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html
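
As a quick check on made-up scores, `roc_auc_score` should agree with integrating the ROC curve points using `auc`. The arrays below are hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score, auc

# Hypothetical true labels and class 1 scores
y_auc_demo = np.array([0, 0, 1, 1])
scores_auc_demo = np.array([0.1, 0.4, 0.35, 0.8])

auc_direct = roc_auc_score(y_auc_demo, scores_auc_demo)   # straight from labels and scores

fpr_demo, tpr_demo, _ = roc_curve(y_auc_demo, scores_auc_demo)
auc_from_curve = auc(fpr_demo, tpr_demo)                  # trapezoidal area under the curve points

print(auc_direct, auc_from_curve)
```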

Typically we like to visualise this curve, we can use the values we generated or the function:
```
RocCurveDisplay(fpr= , tpr= , roc_auc= ).plot()
```
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html

We can also explore our model a little further. We can look at the coefficients:

In [None]:
# Get coefficients and intercept
coefficients =
intercept =
feature_names =

# Display coefficients alongside feature names

coef_df = pd.DataFrame(coefficients, columns=feature_names)
print("Coefficients:\n", coef_df)
print("Intercept:", intercept)

We might also want to look at residuals. Residuals in logistic regression are typically calculated as:

$$\text{Residual} = \text{Observed Value} - \text{Predicted Probability}$$

Goal: The residual plot should show no clear pattern. Patterns might indicate that the model is not capturing some important relationships in the data.

In [None]:
# Calculate residuals
# y_proba must be the 1-D predicted probability of class 1, e.g. predict_proba(...)[:, 1]
residuals = y_test - y_proba  # Observed - Predicted probabilities

# Choose a feature for plotting residuals
feature = 'Age'
x_axis = X_test[feature]

# Create a residual plot
plt.scatter(x_axis, residuals, alpha=0.7)
plt.axhline(0, color='red', linestyle='--')  # Add a horizontal line at 0
plt.title(f"Residuals vs {feature}")
plt.xlabel(feature)
plt.ylabel("Residuals (Observed - Predicted Probability)")
plt.show()


What to Look For:

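**No Pattern:** Because the observed values are 0 or 1, the points fall into two bands (positive residuals for class 1, negative residuals for class 0). Within that constraint, the residuals should balance out around the horizontal line at 0 across the range of the feature, with no systematic drift. This indicates the model is fitting well.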

**Systematic Patterns:** If the residuals form a curve or other pattern, it might indicate that:


*   A non-linear relationship exists, and a transformation of the feature could help.
*   An interaction term or additional feature may be missing.

**Heteroscedasticity:** If the spread of residuals increases or decreases with the feature's value, the model might not be well-calibrated across the range of the feature.
