This final practical session is composed of 2 parts.
<br> <br>
In the first part, you will be guided through a classification problem on a credit default dataset. As a banker, your objective is to use supervised algorithms to estimate a probability of default from the history of your existing clients, and then to predict this probability for a set of new clients.
<br><br>
In the second part, you will get a dataset composed of XX points with XX features. Your objective will be to analyze this dataset to detect anomalies and report them. As there are no labels available, you will have to use unsupervised algorithms of your choice and compare the results of each method. Don't hesitate to be creative.

In [None]:
######### List of useful imports, don't hesitate to fill it with other packages #############

import numpy as np
import pandas as pd

### Learning
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.cluster import KMeans
from sklearn.metrics.cluster import contingency_matrix
from sklearn.metrics import confusion_matrix, roc_curve, auc

from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.linear_model import LinearRegression as LinReg
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import SpectralClustering
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
#...

### Visualization
from mpl_toolkits.mplot3d import Axes3D
import networkx as nx  # to play with graphs
import graphviz
from sklearn.tree import export_graphviz
import matplotlib.pyplot as plt
import seaborn as sns

## Guidelines
<br><br>
This notebook is yours: feel free to add any cell, any function and any comment in addition to the answer of the questions and the functions already provided.
<br><br>
Your work will be evaluated on your answers to both parts, credit default and anomaly detection, with roughly equal weight (~50/50).
<br><br>
Keep your code clean and well organized.
<br><br>
Give clear and precise answers to the explicit questions. You will be evaluated not only on your code but also on your understanding of the methods and your interpretation of the results.
<br><br>
Your work is **individual**, but feel free to ask me any question about the guidelines on Discord. You are also strongly encouraged to reuse functions from previous practical sessions (TP), in particular for visualization.

# Credit Default Estimation

In [None]:
train = pd.read_csv('german_credit_train.csv', index_col=0)
test = pd.read_csv('german_credit_test.csv', index_col=0)

In [None]:
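# Target variable: the Creditability column
y_train = train["Creditability"]
y_test = test["Creditability"]

# Features: every column except the target (dropped by name rather than by position)
X_train = train.drop(columns="Creditability")
X_test = test.drop(columns="Creditability")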

## Data Preparation and feature selection

*Question:* Try to understand the features in the dataset. Select some features that seem relevant to you and describe their nature and their distribution. If some variables don't seem relevant to you, explain why and discard them.

*Question:* Handle the categorical variables to make them suitable for training. Apply any transformation of the quantitative variables that seems relevant to you and explain your choice.
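
As a starting point, one common option is one-hot encoding for categorical variables and standardization for quantitative ones. The sketch below is only an illustration on a toy frame: the column names `purpose` and `amount` are hypothetical and are not taken from the credit dataset, and other encodings or transformations may suit your features better.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data with hypothetical column names (assumption, not the real dataset)
toy_train = pd.DataFrame({
    "purpose": ["car", "furniture", "car", "education"],
    "amount": [1200.0, 5400.0, 800.0, 3100.0],
})
toy_test = pd.DataFrame({
    "purpose": ["furniture", "business"],  # "business" never appears in training
    "amount": [2000.0, 950.0],
})

# One-hot encode the categorical column
train_enc = pd.get_dummies(toy_train, columns=["purpose"])
test_enc = pd.get_dummies(toy_test, columns=["purpose"])

# Align the test columns on the training columns:
# categories unseen in training are dropped, missing ones are filled with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

# Fit the scaler on the training set only, then apply it to both sets
scaler = StandardScaler()
train_enc[["amount"]] = scaler.fit_transform(train_enc[["amount"]])
test_enc[["amount"]] = scaler.transform(test_enc[["amount"]])

print(train_enc.columns.tolist())
```

Fitting the scaler on the training set alone avoids leaking information from the test set into the preprocessing.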

*Question:* Use the visualization tools of your choice to get a first intuition about the importance of the features in predicting the target variable. Explain your choice and interpret the figures.
<br><br>
**Requirements:** At least one categorical feature and one quantitative feature.

## First machine learning algorithms

*Question:* Train at least 3 different supervised learning algorithms on the training set (without trying to optimize their hyperparameters). Explain your choice. Display relevant metrics on both the training and the test set.
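
The loop below sketches one possible way to organize this comparison. It runs on synthetic data from `make_classification` as a stand-in for the prepared credit features (an assumption), and the three models and two metrics shown are examples, not a recommended selection.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared credit data (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# Three models with default hyperparameters
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbours": KNeighborsClassifier(),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(Xtr, ytr)
    scores[name] = {}
    # Evaluate on both splits to compare training and test performance
    for split, (Xs, ys) in {"train": (Xtr, ytr), "test": (Xte, yte)}.items():
        proba = model.predict_proba(Xs)[:, 1]  # estimated probability of class 1
        scores[name][split] = {
            "accuracy": accuracy_score(ys, model.predict(Xs)),
            "auc": roc_auc_score(ys, proba),
        }
    print(name, scores[name])
```

A large gap between the training and test scores of a model is a typical sign of overfitting.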

*Question:* Interpret your results. Compare the performance of each model on the training and test sets, and comment on the consistency between the two.

*Question:* Choose one of your models that can provide a measure of the importance of each feature. Print this measure and analyze it: what do you notice? If you identify something (not compulsory), feel free to modify the datasets (back to the feature selection process). If you performed this step, train the selected model again, observe whether there is a difference, and conclude.
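
For instance, tree-based models in scikit-learn expose a `feature_importances_` attribute after fitting. The sketch below uses synthetic data with placeholder feature names (both assumptions) to show how the values can be ranked; linear models would instead be inspected through `coef_`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names (assumption)
X_arr, y_imp = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_imp = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_imp, y_imp)

# Impurity-based importances, labelled by feature and sorted from most to least important
importances = pd.Series(tree.feature_importances_, index=X_imp.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```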

## Model Optimization 

*Question:* Choose one of these two families of algorithms: Decision Trees or SVM.
Optimize the hyperparameters of this model with a cross-validated grid search. Be careful to:
1. Carefully select the parameters concerned by the optimization
2. Explain their role and why you decided to optimize them
3. Carefully choose the range/values of these parameters to put in the grid search
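
The three points above can be sketched with `GridSearchCV`, here applied to a decision tree on synthetic data (an assumption). The grid values and the `roc_auc` scoring are illustrative choices only; justifying your own parameters and ranges is part of the question.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (assumption)
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid: both parameters limit the complexity of the tree
param_grid = {
    "max_depth": [2, 4, 8, None],     # None lets the tree grow until its leaves are pure
    "min_samples_leaf": [1, 5, 20],   # larger values give smoother, simpler trees
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,               # 5-fold cross-validation on the training data
    scoring="roc_auc",
)
search.fit(X_cv, y_cv)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is the model refitted on the whole training set with the best parameters, and `search.cv_results_` holds the score of every combination.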

*Question:* Benchmark the results of your cross-validation against the baseline algorithms of the previous section. Is the performance improved?

Below this cell, feel free to test any other approach not mentioned above to try to improve the performance of your model (not mandatory).

## Conclusion

*Question:* In light of your results, explain which model you would choose to predict the credit reliability of a new client of your bank, and which information you would take particular care to collect for this purpose.

# Open Problem

In this problem, you will try to solve a binary classification problem with both supervised and unsupervised algorithms.

First, run the following cell to load the datasets for this part. You will get a feature dataset $X$ and a target variable $y$.

In [None]:
X = pd.read_csv('X_part2.csv', index_col=0) # Features
y = np.array(pd.read_csv('y_part2.csv', index_col=0)).ravel() # Target variable

*Question 1:* Train some supervised algorithms of your choice to answer the problem. Comment on your choice and analyze their performance. These algorithms will be baselines to compare with the results of the next question.
<br><br>
*Question 2:* Try to use unsupervised algorithms to improve your performance. You can 1) use only unsupervised algorithms, or 2) use unsupervised algorithms to transform the features or find new ones, and then train a supervised algorithm on these features.
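
To illustrate the second option, the sketch below appends the distances to K-means centroids as extra features before fitting a logistic regression. It is one possible approach among many, run on synthetic data (an assumption), with an arbitrary number of clusters; there is no guarantee that it improves on the baseline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (assumption)
X_u, y_u = make_classification(n_samples=400, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(
    X_u, y_u, test_size=0.25, random_state=0, stratify=y_u
)

# Standardize first: K-means relies on Euclidean distances
scaler = StandardScaler().fit(Xtr)
Xtr_s, Xte_s = scaler.transform(Xtr), scaler.transform(Xte)

# Fit the clustering on the training features only (labels are not used)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(Xtr_s)

# KMeans.transform returns the distance of each point to each centroid
Xtr_aug = np.hstack([Xtr_s, kmeans.transform(Xtr_s)])
Xte_aug = np.hstack([Xte_s, kmeans.transform(Xte_s)])

baseline = LogisticRegression(max_iter=1000).fit(Xtr_s, ytr)
augmented = LogisticRegression(max_iter=1000).fit(Xtr_aug, ytr)

print("baseline accuracy:", baseline.score(Xte_s, yte))
print("augmented accuracy:", augmented.score(Xte_aug, yte))
```

Comparing the two test scores shows whether the cluster-based features add information beyond the original ones.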
<br><br>
*Question 3:* Write a report about your methodology and your main results.