In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    classification_report,
    precision_score,
    recall_score,
)
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Dataset Exploration
## a) Load the df_hackathon.csv dataset - (hint: use pandas)
Assign it to a variable called df.

## b) Using pandas examine the numerical features of the dataset

Examine the eight summary statistics (count, mean, std, min, Q1, median, Q3, max).

- Are there any missing feature values?
- Are there any unexpected (or extreme) feature values?

Visualise in histograms the numerical features.

- Is there significant data skewness in any of the variables?
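
The checks in this step can be sketched as follows. This is only an illustration on a hypothetical stand-in DataFrame (`toy`, with made-up `age` and `income` columns); in the notebook the same calls would be applied to `df`, whose actual column names may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; the real dataset's columns may differ.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.normal(40, 10, 200),
    "income": rng.lognormal(mean=10, sigma=1, size=200),  # right-skewed by construction
})
toy.loc[toy.index[:15], "income"] = np.nan  # inject some missing values

numeric = toy.select_dtypes(include="number")

summary = numeric.describe()    # count, mean, std, min, 25%, 50%, 75%, max
missing = numeric.isna().sum()  # number of missing values per column
skewness = numeric.skew()       # values far from 0 suggest a skewed distribution

print(summary)
print(missing)
print(skewness)

# Histograms (needs matplotlib): numeric.hist(bins=30, figsize=(10, 4))
```

A `count` lower than the number of rows is a quick sign of missing values, and a `max` or `min` far from the quartiles hints at extreme values worth a closer look.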

## c) Using pandas examine the categorical features of the dataset

- Are there any significant inequalities in the dataset?
- Are there any features with missing values?
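
One possible way to look at category shares and missing values is sketched below. The data is a hypothetical toy frame: the column names `gender` and `bloodline` are taken from later in this notebook, but the values are invented for illustration.

```python
import pandas as pd

# Hypothetical toy data; the category values are invented.
toy = pd.DataFrame({
    "gender": ["male"] * 70 + ["female"] * 25 + [None] * 5,
    "bloodline": ["A", "B", "C", "D"] * 25,
})

categorical = toy.select_dtypes(exclude="number")

# Missing values per categorical feature.
missing_per_feature = categorical.isna().sum()

# Share of each category, counting missing values as their own group.
shares = {
    col: categorical[col].value_counts(normalize=True, dropna=False)
    for col in categorical.columns
}

print(missing_per_feature)
for col, share in shares.items():
    print(f"\n{col}\n{share}")
```

Large gaps between category shares (here 70% versus 25%) are the kind of imbalance the first question asks about.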

## d) Further dataset exploration
Using the Facets library, visualise the dataset (use the code below). Try different combinations. Do you notice any patterns in the dataset that might influence the model's predictions?

In [None]:
from IPython.display import display, HTML

vis_df = df.to_json(orient='records')

HTML_TEMPLATE = """<script src="https://cdnjs.cloudflare.com/ajax/libs/webcomponentsjs/1.3.3/webcomponents-lite.js"></script>
        <link rel="import" href="https://raw.githubusercontent.com/PAIR-code/facets/1.0.0/facets-dist/facets-jupyter.html">
        <facets-dive id="elem" height="600"></facets-dive>
        <script>
          var data = {jsonstr};
          document.querySelector("#elem").data = data;
        </script>"""
html = HTML_TEMPLATE.format(jsonstr=vis_df)
display(HTML(html))

# 2. Feature Engineering

## a) Create a correlogram (hint: use pandas corr() function) and visualise it

- Are there any very strong correlations between numerical features? If yes, remove one feature from each strongly correlated pair.
- Explain how strong correlations can have an impact on some ML models.
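
A minimal sketch of finding strongly correlated pairs is shown below. The toy columns and the 0.9 threshold are assumptions made for illustration, not values prescribed by the exercise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "a_scaled" is a near-duplicate of "a".
rng = np.random.default_rng(0)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 2.0 * a + rng.normal(scale=0.05, size=300),
    "b": rng.normal(size=300),
})

# Absolute Pearson correlations between numerical features.
corr = toy.corr(numeric_only=True).abs()

# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

THRESHOLD = 0.9  # assumed cut-off for "very strong"
to_drop = [col for col in upper.columns if (upper[col] > THRESHOLD).any()]
reduced = toy.drop(columns=to_drop)

print(corr.round(2))
print("Dropping:", to_drop)

# To visualise (needs seaborn): sns.heatmap(corr, annot=True, cmap="coolwarm")
```

Which feature of a pair to drop is a judgment call; this sketch simply drops the later column.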

## b) Deal with extreme outliers

Reflect on your answer to 1b). Did you identify any features with extreme outliers?

If yes, explain the nature of these outliers and deal with them appropriately.
- How many are there (in proportion to the whole dataset)?
- Would dropping these rows lead to considerable data loss?

Also consider what in the data collection process could have produced them.
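
One common way to quantify extreme outliers is the interquartile-range rule, sketched here on invented data. The multiplier `k = 3.0` and the sentinel value 999 are assumptions for illustration; the outliers in the real dataset may need a different criterion.

```python
import numpy as np
import pandas as pd

# Hypothetical feature: 95 plausible values plus 5 sentinel-like entries.
values = pd.Series(
    np.concatenate([np.linspace(20, 60, 95), [999] * 5]),
    name="age",
)

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

k = 3.0  # assumed multiplier; 1.5 is the usual choice for mild outliers
is_outlier = (values < q1 - k * iqr) | (values > q3 + k * iqr)

share = is_outlier.mean()  # proportion of rows flagged
print(f"{is_outlier.sum()} outliers ({share:.1%} of rows)")

# Dropping the flagged rows keeps the remaining 95% of the data.
cleaned = values[~is_outlier]
```

A repeated implausible value such as 999 often indicates a placeholder for "unknown" rather than a real measurement, in which case treating it as missing may be more appropriate than dropping the row.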

## c) Deal with features with missing values

Start by splitting your data into training (X_train, y_train) and testing sets (X_test, y_test) in an 80-20 split (hint: use train_test_split from sklearn). Use random_state = 42.

Reflect again on exercise 1b. Did you identify any numerical features with missing values?

- Impute the values in the feature with the missing values using the numerical features.
- Check (e.g. by plotting) the distribution of the feature with the missing values before and after imputation. Ensure that the imputation did not bias/skew the feature's distribution.

Just as a reminder, the dependent variable (y) is gringotts_approved_loan.

`Tip:` It is essential to fit the imputation model only on the training data to avoid [data leakage](https://machinelearningmastery.com/data-preparation-without-data-leakage/). Then use the fitted imputation model to fill in the most appropriate values for the testing data as well. Please feel free to use the impute_missing_values function defined below.

Save the imputed DataFrames as `X_train_imputed` and `X_test_imputed`.
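
The split, the leakage-free imputation, and the before/after check described above can be sketched as follows. The toy data and its `age` and `income` columns are hypothetical, and the imputer is applied directly to the numerical columns rather than through the notebook's helper function.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data with missing values in "income".
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "income": rng.normal(50_000, 8_000, n),
})
toy.loc[rng.choice(n, size=50, replace=False), "income"] = np.nan
target = pd.Series(rng.choice(["Yes", "No"], size=n), name="gringotts_approved_loan")

X_train, X_test, y_train, y_test = train_test_split(
    toy, target, test_size=0.2, random_state=42
)

numeric_cols = X_train.select_dtypes(include="number").columns
imputer = KNNImputer(n_neighbors=5)

# Fit on the training data only, then reuse the fitted imputer on the test data.
X_train_imputed = X_train.copy()
X_test_imputed = X_test.copy()
X_train_imputed[numeric_cols] = imputer.fit_transform(X_train[numeric_cols])
X_test_imputed[numeric_cols] = imputer.transform(X_test[numeric_cols])

# Compare the feature's summary statistics before and after imputation.
comparison = pd.DataFrame({
    "before": X_train["income"].describe(),
    "after": X_train_imputed["income"].describe(),
})
print(comparison)
```

If the mean, spread, and quartiles stay close between the two columns, the imputation has not visibly distorted the distribution; overlaying two histograms gives the same check graphically.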



In [None]:
def impute_missing_values(feature_with_missing, X_train, X_test):
    """Impute `feature_with_missing` using a KNN imputer fitted on X_train only.

    Returns copies of X_train and X_test with that column imputed; the
    DataFrames passed in are left unchanged, so the distribution before
    and after imputation can still be compared.
    """
    # One-hot encode categorical features so that KNN can compute distances.
    X_train_encoded = pd.get_dummies(X_train)
    X_test_encoded = pd.get_dummies(X_test)

    # Keep only the columns present in both sets, in the same order.
    X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='inner', axis=1)

    # Impute using KNN imputer (model-based approach). Importantly, we are only fitting
    # the imputer on the training data.
    knn_imputer_full = KNNImputer(n_neighbors=5)
    X_train_imputed = knn_imputer_full.fit_transform(X_train_encoded)
    X_test_imputed = knn_imputer_full.transform(X_test_encoded)

    # Convert back to pandas DataFrame, ensuring column names are retained.
    X_train_imputed = pd.DataFrame(X_train_imputed, columns=X_train_encoded.columns, index=X_train_encoded.index)
    X_test_imputed = pd.DataFrame(X_test_imputed, columns=X_test_encoded.columns, index=X_test_encoded.index)

    # Write the imputed column into copies so the caller's DataFrames are not modified.
    X_train_out = X_train.copy()
    X_test_out = X_test.copy()
    X_train_out[feature_with_missing] = X_train_imputed[feature_with_missing]
    X_test_out[feature_with_missing] = X_test_imputed[feature_with_missing]

    return X_train_out, X_test_out

# 4. Train RF model

Train a [random forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model and evaluate its performance with 10-fold cross-validation and on the test set. You might find sklearn's [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) helpful for building the model (`tip`: use OneHotEncoder to encode the categorical features).

Evaluate the performance using accuracy_score and classification_report from sklearn. Comment on the model's performance considering precision and recall. Comment on the generalisability of the model (does it perform similarly on the test set and in cross-validation?).

Use `X_train_imputed`, `y_train`, `X_test_imputed`, `y_test`.
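
One way to wire these pieces together is sketched below on invented data. The names `X_tr`, `X_te`, `y_tr`, and `y_te` are hypothetical stand-ins for `X_train_imputed`, `X_test_imputed`, `y_train`, and `y_test`, and the hyperparameters are arbitrary choices rather than recommended settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the label depends only on income.
rng = np.random.default_rng(42)
n = 400
X = pd.DataFrame({
    "income": rng.normal(50_000, 8_000, n),
    "gender": rng.choice(["male", "female"], size=n),
})
y = pd.Series(np.where(X["income"] > 50_000, "Yes", "No"))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

categorical_cols = X_tr.select_dtypes(exclude="number").columns.tolist()

# One-hot encode categorical columns; pass numerical columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# 10-fold cross-validation on the training data.
cv_scores = cross_val_score(model, X_tr, y_tr, cv=10, scoring="accuracy")

# Fit on the full training set and evaluate on the held-out test set.
model.fit(X_tr, y_tr)
y_pred = model.predict(X_te)
test_accuracy = accuracy_score(y_te, y_pred)

print(f"CV accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
print(f"Test accuracy: {test_accuracy:.3f}")
print(classification_report(y_te, y_pred))
```

Putting the encoder inside the pipeline means it is refitted within each cross-validation fold, so no information from the validation fold leaks into preprocessing.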

# 5. Evaluate feature importance
Identify and visualise the top 10 most important features that influence whether Gringotts approves a loan.

Comment on the impact of the top few features.

# 6. Understanding model's performance further

In [None]:
!pip install fairlearn
from fairlearn.metrics import MetricFrame, selection_rate

Determine whether the model exhibits bias, i.e. whether it performs significantly differently across groups. For instance, assess whether the model performs better or worse on samples where the gender is male than on other groups.

Use the get_fairness_evaluation function to analyse the impact of this and other variables on model performance. This function uses Fairlearn to compare the model across the groups of one or more sensitive features, reporting accuracy, balanced accuracy, precision, and recall for each group.

In [None]:
def get_fairness_evaluation(X_test, y_test, y_pred, columns):
  sensitive_features_df = X_test[columns]

  def precision_wrapper(y_true, y_pred): return precision_score(y_true, y_pred, pos_label='Yes', zero_division=0)
  def recall_wrapper(y_true, y_pred): return recall_score(y_true, y_pred, pos_label='Yes', zero_division=0)

  mf = MetricFrame(metrics={
                      'accuracy': accuracy_score,
                      'balanced_accuracy': balanced_accuracy_score,
                      'precision': precision_wrapper,
                      'recall': recall_wrapper,
                      'count': lambda y_true, y_pred: y_true.shape[0]},
                  y_true=y_test,
                  y_pred=y_pred,
                  sensitive_features=sensitive_features_df)

  plot = mf.by_group.plot.bar(
    subplots=True,
    legend=True,
    figsize=[12, 8],
    title="Fairness Metrics Across Sensitive Features"
  )

  plt.tight_layout()
  plt.show()

  return mf.by_group

In [None]:
get_fairness_evaluation(X_test_imputed, y_test, y_pred, ['gender'])

Try out different combinations. For example, you can pass ['bloodline', 'gender'] as the columns argument.
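
To make explicit what a per-group comparison measures, the sketch below recomputes a few metrics by hand with pandas on a tiny invented example. It does not use Fairlearn, and the labels and predictions are made up; it only illustrates the kind of per-group numbers that a group-wise evaluation reports.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels and predictions for two groups.
results = pd.DataFrame({
    "gender": ["male"] * 4 + ["female"] * 4,
    "y_true": ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
    "y_pred": ["Yes", "Yes", "No", "No", "No", "Yes", "Yes", "No"],
})

rows = {}
for group, part in results.groupby("gender"):
    rows[group] = {
        "accuracy": accuracy_score(part["y_true"], part["y_pred"]),
        "recall": recall_score(part["y_true"], part["y_pred"], pos_label="Yes", zero_division=0),
        "selection_rate": (part["y_pred"] == "Yes").mean(),  # share predicted "Yes"
        "count": len(part),
    }
by_group = pd.DataFrame.from_dict(rows, orient="index")

# Largest difference between groups for each metric.
gaps = by_group[["accuracy", "recall", "selection_rate"]].agg(lambda s: s.max() - s.min())

print(by_group)
print(gaps)
```

In this toy case both groups are approved at the same rate, yet accuracy and recall differ, which shows why several metrics should be inspected together. Small group counts make such gaps noisy, so the `count` column matters when interpreting them.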

# Sociotechnical questions

Reflection questions: As part of the assessment of this bounty, you will also be graded on how you reflect on the following questions. When practising AI ethics, it is important to understand AI systems as socio-technical systems: although technical issues can be problematic, one needs to understand the systemic and social structures that give rise to them in the first place.

- Can all problems with bias be solved technically? If not, why?
- What decision-making systems should be using algorithmic/AI approaches and which should not?
- Is it just and fair to solve problems with bias by fixing technological constraints?