# Main notebook for COVID-19 detection in children

## Authors:

- Marc Garcia
- Jofre Poch
- Pau Tarragó
- Pau Matas
- Tomás Gadea

## Summary of steps for the project:
- **Exploratory Data Analysis (EDA)**: understand the data.


- **Data Cleaning**: remove NULLs, drop unimportant data, etc.


- **Feature Selection**: choose the explanatory variables that we will pass to the model.


- **Model Selection**: choose a classifier model (has COVID or not; if COVID => PCR test, otherwise => send home).


- **Model Training**: train the model; this only takes a few lines of code.


- **Model Validation**: see how well the model did, for example by looking at the squared errors between the predicted values and the true response.


- **Parameter Tuning**: change the parameters to improve the model validation results.
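
The steps above can be sketched end to end as follows. This is only a minimal illustration under stated assumptions, not the project's actual pipeline: the data is synthetic (generated with `make_classification`), the column names `feature_0` ... `feature_7` are hypothetical, and the choices of `DecisionTreeClassifier`, accuracy as the metric, and a `max_depth` grid are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the clinical data (assumption: NOT the real dataset).
X_arr, y_arr = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])
y = pd.Series(y_arr, name="target")

# Data cleaning: keep only rows without missing values.
complete_rows = X.notnull().all(axis=1)
X, y = X[complete_rows], y[complete_rows]

# Hold out a validation set before looking at the features.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Feature selection: keep the 4 features most correlated with the target,
# measured on the training split only.
selected = (
    X_train.corrwith(y_train).abs().sort_values(ascending=False).head(4).index
)
X_train, X_val = X_train[selected], X_val[selected]

# Model selection and training: a decision tree classifier as a placeholder.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Model validation: accuracy on the held-out split.
baseline_accuracy = accuracy_score(y_val, model.predict(X_val))

# Parameter tuning: cross-validated grid search over the tree depth.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, None]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
tuned_accuracy = accuracy_score(y_val, search.predict(X_val))

print(f"Selected features: {list(selected)}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Tuned accuracy: {tuned_accuracy:.3f} with {search.best_params_}")
```

Selecting features after the split keeps the validation rows out of every modelling decision, so the validation score is not inflated by information leaking from the held-out data.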

## Actual work:

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
%matplotlib notebook

In [None]:
path = './COPEDICATClinicSympt_DATA_2020-12-17_1642.csv'
df = pd.read_csv(path)
df.head()

In [None]:
# EDA:
print(f'Dataset has {df.shape[0]} rows and {df.shape[1]} columns.')

In [None]:
plt.figure(figsize=(10, 10))
nulls = df.isnull().sum().sort_values(ascending=False)
# nulls is sorted, so label the bars with its own index rather than df.columns
sns.barplot(y=nulls.index, x=nulls.values)
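
Raw null counts are easier to act on when expressed as a fraction of rows. The sketch below shows one possible way to do that and to drop sparse columns; the toy frame, its column names, and the 0.5 threshold are all assumptions for illustration and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df.
toy = pd.DataFrame(
    {
        "fever": [1.0, np.nan, 0.0, 1.0],
        "cough": [np.nan, np.nan, 1.0, np.nan],
        "age": [3, 7, 5, 9],
    }
)

# Fraction of missing values per column, most incomplete first.
null_fraction = toy.isnull().mean().sort_values(ascending=False)

# Assumed threshold: keep columns with at most 50% missing values.
max_null_fraction = 0.5
kept_columns = [
    column for column in toy.columns if null_fraction[column] <= max_null_fraction
]

print(null_fraction)
print(kept_columns)
```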

In [None]:
# Keep numerical variables only
impactful_variables = df.select_dtypes(include=['float64', 'int64']).columns

df = df[impactful_variables]
len(df.columns)

In [None]:
# Response distribution

Y = df['final_diagnosis_code']
print(Y.value_counts())
plt.figure()
# Recent seaborn versions require the column to be passed by keyword
sns.countplot(x='final_diagnosis_code', data=df)
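
If the response turns out to be imbalanced, it may help to look at class proportions and to preserve them when splitting the data. The following is a sketch on hypothetical labels (80 negatives, 20 positives), not on the real `final_diagnosis_code` column; it uses the `stratify` argument of `train_test_split`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 20% class 1.
labels = pd.Series([0] * 80 + [1] * 20, name="label")
features = pd.DataFrame({"x": range(100)})

# Class proportions instead of raw counts.
proportions = labels.value_counts(normalize=True)

# A stratified split keeps the same class proportions in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

print(proportions)
print(f"Positive rate in train: {y_train.mean():.2f}, in test: {y_test.mean():.2f}")
```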

In [None]:
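# Correlation of every variable with the response, as a column vector;
# -1 lets NumPy infer the number of variables instead of hardcoding it
klk = np.asarray(df.corr()['final_diagnosis_code']).reshape(-1, 1)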
plt.figure(figsize=(5, 20))
sns.heatmap(klk, annot=False, fmt="g", cmap='jet')
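
A heatmap without labels makes it hard to tell which variables stand out, so a sorted ranking can complement it. The sketch below ranks variables by absolute correlation with the response; the data is randomly generated and the column names (`target`, `strong`, `weak`, `noise`) are hypothetical stand-ins for the real columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)

# Hypothetical variables with different degrees of association to the target.
toy = pd.DataFrame(
    {
        "target": target,
        "strong": target + rng.normal(0, 0.1, size=200),
        "weak": target + rng.normal(0, 2.0, size=200),
        "noise": rng.normal(0, 1.0, size=200),
    }
)

# Correlation with the response, excluding the response itself.
correlations = toy.corr()["target"].drop("target")

# Rank by absolute value so strong negative correlations are not overlooked.
ranking = correlations.abs().sort_values(ascending=False)
print(ranking)
```

Correlation only captures linear association, so a low rank here does not by itself mean a variable is uninformative.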