# K-Nearest Neighbors (KNN) and Logistic Regression

We will apply **K-Nearest Neighbors** and **Logistic Regression** classifiers to a COVID-19 dataset.

**K-Nearest Neighbors (KNN)**
<img src="files/figures/KNN.jpg" width="350px"/>

https://fr.wikipedia.org/wiki/M%C3%A9thode_des_k_plus_proches_voisins

**Logistic Regression**
<img src="files/figures/LogReg.jpg" width="450px"/>

https://medium.datadriveninvestor.com/logistic-regression-explained-f51d32be904e

## Libraries

In [248]:
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV

# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it
from sklearn.metrics import classification_report, RocCurveDisplay

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

import matplotlib.pyplot as plt

## Data

- Download the **COVID-19 patient pre-condition dataset** (CSV file) from Kaggle:<br>
https://www.kaggle.com/datasets/tanmoyx/covid19-patient-precondition-dataset
- Import the data and inspect it with `pandas`.
- The goal is to predict the binary column `"icu"` (admission to an intensive care unit) using the other columns.
- Check the pairwise correlations among the variables.

In [252]:
df = pd.read_csv('../z_data/covid/covid.csv')
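
The inspection and correlation steps can be sketched as follows. This is only an illustration on a small synthetic frame: the column names and values in `toy` are assumptions standing in for the real file, and the same calls would apply to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (column names and values are illustrative assumptions)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=200),
    "diabetes": rng.choice([1, 2], size=200),
    "obesity": rng.choice([1, 2], size=200),
    "icu": rng.choice([1, 2, 97], size=200),
})

print(toy.head())        # first rows
print(toy.dtypes)        # column types
print(toy.isna().sum())  # missing values per column

# Pairwise Pearson correlations between all numeric columns
corr = toy.corr()
print(corr.round(2))
```

The resulting matrix can be displayed with `sns.heatmap(corr)` if a visual overview is preferred.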

## Data Preprocessing

- Check if there are missing values.
- Drop the column `"id"`.
- Drop the rows where the target variable `"icu"` is not applicable (=97), ignored (=98) or not specified (=99).
- For every binary variable, replace the values that are not 1 or 2 by `np.nan`; then replace all 2's by 0's.<br>
(See the file `catalogs.xlsx` for further details.)
- Split the data into train and test sets (80%/20%) (`train_test_split()`).
- Replace missing data (`np.nan`) in each column by the column median.<br>
https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html<br>
Fit an "imputer" *on the train set* and then transform both train and test sets.
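
A minimal sketch of the preprocessing steps above, run on synthetic data so that it is self-contained. The frame `toy`, its column names and the list `binary_cols` are assumptions made for illustration; in the real notebook the binary columns should be taken from `catalogs.xlsx`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (names and code frequencies are assumptions)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "id": np.arange(n),
    "age": rng.integers(18, 90, size=n),
    "diabetes": rng.choice([1, 2, 98], size=n, p=[0.2, 0.7, 0.1]),
    "obesity": rng.choice([1, 2, 98], size=n, p=[0.2, 0.7, 0.1]),
    "icu": rng.choice([1, 2, 97, 99], size=n, p=[0.1, 0.5, 0.3, 0.1]),
})

# Drop the identifier and the rows whose target is 97, 98 or 99
toy = toy.drop(columns="id")
toy = toy[toy["icu"].isin([1, 2])].copy()

# Binary columns: anything other than 1 or 2 becomes NaN, then 2 becomes 0
binary_cols = ["diabetes", "obesity", "icu"]  # hypothetical list
for col in binary_cols:
    toy[col] = toy[col].where(toy[col].isin([1, 2])).replace(2, 0)

X = toy.drop(columns="icu")
y = toy["icu"].astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Medians are learned on the train set only, then applied to both sets
imputer = SimpleImputer(strategy="median")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
```

Fitting the imputer on the train set alone keeps information from the test set out of the preprocessing.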

## K-Nearest Neighbors

- Instantiate a **KNN** (`KNeighborsClassifier()`).
- Do a **grid search with cross-validation** on the number of neighbors (`GridSearchCV()`):<br>
`np.arange(1, 15)`, i.e. the values 1 to 14, since the upper bound is excluded.
- Using the best model obtained via grid search, compute the **predictions** on the train and test sets.
- Compute the **classification reports** on the train and test sets (`classification_report`) and the **ROC curve** on the test set (`plot_roc_curve`; removed in scikit-learn 1.2, where `RocCurveDisplay.from_estimator` replaces it).
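
The KNN steps above might look like the sketch below. It uses data from `make_classification` as a stand-in for the preprocessed COVID data, and the 5-fold setting (`cv=5`) and the variable names are assumptions, not choices stated in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (stand-in for the real train/test sets)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Grid search over the number of neighbors, 1 to 14
param_grid = {"n_neighbors": np.arange(1, 15)}
knn_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
knn_search.fit(X_train, y_train)
print(knn_search.best_params_)

# Predictions from the best model (refit on the whole train set by default)
best_knn = knn_search.best_estimator_
y_pred_train = best_knn.predict(X_train)
y_pred_test = best_knn.predict(X_test)

print(classification_report(y_train, y_pred_train))
print(classification_report(y_test, y_pred_test))

# ROC curve points and area under the curve on the test set
test_scores = best_knn.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, test_scores)
knn_auc = roc_auc_score(y_test, test_scores)
print(knn_auc)
```

The arrays `fpr` and `tpr` can be plotted with `plt.plot(fpr, tpr)`. Since KNN relies on distances, standardizing the features first (the notebook imports `StandardScaler`) may be worthwhile, although that step is left out here.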

## Logistic Regression

- Instantiate a **logistic regression** (`LogisticRegression()`).
- Do a **grid search with cross validation** on the regularization parameter `C` from 1e-3 to 1e+3 (`GridSearchCV()`):<br>
`np.logspace(-3, 3, num=7)`.
- Using the best model obtained via grid search, compute the predictions on the train and test sets.
- Compute the **classification reports** on the train and test sets (`classification_report`) and the **ROC curve** on the test set (`plot_roc_curve`; removed in scikit-learn 1.2, where `RocCurveDisplay.from_estimator` replaces it).
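
A comparable sketch for the logistic regression steps, again on synthetic data. The values `cv=5` and `max_iter=1000` and the variable names are assumptions added for the example; in scikit-learn, smaller values of `C` mean stronger regularization.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data (stand-in for the real train/test sets)
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Grid search over C: 1e-3, 1e-2, ..., 1e+3
c_grid = np.logspace(-3, 3, num=7)
logreg_search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": c_grid}, cv=5
)
logreg_search.fit(X_train, y_train)
print(logreg_search.best_params_)

# Predictions from the best model
best_logreg = logreg_search.best_estimator_
y_pred_train = best_logreg.predict(X_train)
y_pred_test = best_logreg.predict(X_test)

print(classification_report(y_train, y_pred_train))
print(classification_report(y_test, y_pred_test))

# ROC curve points and area under the curve on the test set
test_scores = best_logreg.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, test_scores)
logreg_auc = roc_auc_score(y_test, test_scores)
print(logreg_auc)
```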