# Catching Up

## Exploratory Data Analysis (EDA)

Our main goal while doing EDA is to [summarize the main characteristics of our dataset](https://en.wikipedia.org/wiki/Exploratory_data_analysis).

It's crucial that we understand what our data is composed of.

### First, we take a look.

````{toggle} configuration.py
    :show:
```{literalinclude} configuration.py   
```
````

In [5]:
from read_data import read_data # read data from the file at the destination defined in configuration.py
from get_scattered_chunks import get_scattered_chunks # get scattered chunks of data
from print_table import print_table # Styling and printing the table

In [4]:
data = read_data()
data_chunks = get_scattered_chunks(data, n_chunks=5, chunk_size=3)
print_table(data_chunks)

Unnamed: 0,test_name,swab_type,covid19_test_results,age,high_risk_exposure_occupation,high_risk_interactions,diabetes,chd,htn,cancer,asthma,copd,autoimmune_dis,smoker,temperature,pulse,sys,dia,rr,sats,rapid_flu_results,rapid_strep_results,ctab,labored_respiration,rhonchi,wheezes,days_since_symptom_onset,cough,cough_severity,fever,sob,sob_severity,diarrhea,fatigue,headache,loss_of_smell,loss_of_taste,runny_nose,muscle_sore,sore_throat,er_referral
0,SARS COV 2 RNA RTPCR,Nasopharyngeal,False,58,True,,False,False,False,False,False,False,False,False,36.95,81.0,126.0,82.0,18.0,97.0,,,False,False,False,False,28.0,True,Severe,,False,,False,False,False,False,False,False,False,False,False
1,"SARS-CoV-2, NAA",Oropharyngeal,False,35,False,,False,False,False,False,False,False,False,False,36.75,77.0,131.0,86.0,16.0,98.0,,,False,False,False,False,,True,Mild,False,False,,False,False,False,False,False,False,False,False,False
2,SARS CoV w/CoV 2 RNA,Oropharyngeal,False,12,,,False,False,False,False,False,False,False,False,36.95,74.0,122.0,73.0,17.0,98.0,,,,,,,,False,,,,,,,,,,,,,False
23498,"SARS-CoV-2, NAA",Nasal,False,35,False,False,False,False,False,False,False,False,False,False,37.0,69.0,136.0,84.0,12.0,100.0,,,False,,,,,False,,False,False,,False,False,False,False,False,False,False,False,
23499,"SARS-CoV-2, NAA",Nasal,False,24,False,True,False,False,False,False,False,False,False,False,36.75,70.0,128.0,78.0,12.0,99.0,,,,False,False,False,,False,,False,False,,False,False,False,False,False,False,False,False,
23500,"SARS-CoV-2, NAA",Nasal,False,52,False,False,False,False,False,False,False,False,False,False,37.0,94.0,165.0,82.0,12.0,98.0,,,,False,False,False,7.0,True,,False,False,,False,False,False,False,False,False,False,False,
46996,"SARS-CoV-2, NAA",Nasal,False,11,False,False,False,False,False,False,False,False,False,False,36.9,78.0,116.0,79.0,16.0,100.0,,,True,False,True,True,,False,,False,False,,False,False,False,False,False,False,False,False,
46997,"SARS-CoV-2, NAA",Nasal,False,30,False,False,False,False,False,False,False,False,False,False,,,,,,,,,,,,,,False,,False,False,,False,False,False,False,False,False,False,False,
46998,"SARS-CoV-2, NAA",Nasal,False,36,False,False,False,False,False,False,False,False,False,False,36.85,81.0,122.0,81.0,14.0,100.0,,,,False,,,7.0,True,Mild,False,True,Mild,True,True,True,False,False,False,True,False,
70494,Rapid COVID-19 PCR Test,Nasal,False,32,False,,False,False,False,False,False,False,False,False,,,,,,,,,,,,,,False,,,False,,False,False,False,False,False,False,False,False,
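
The source of `get_scattered_chunks` is not shown here, so the following is only a hypothetical sketch of how such a helper might work, assuming `data` is a pandas `DataFrame`. The name `scattered_chunks_sketch` and the even spacing of the chunks are assumptions made for illustration, not the course's actual implementation.

```python
import numpy as np
import pandas as pd


def scattered_chunks_sketch(
    df: pd.DataFrame, n_chunks: int = 5, chunk_size: int = 3
) -> pd.DataFrame:
    """Return n_chunks runs of chunk_size consecutive rows, spread evenly over df."""
    # Last position at which a full chunk can still start.
    last_start = max(len(df) - chunk_size, 0)
    # Evenly spaced starting positions between the first row and last_start.
    starts = np.linspace(0, last_start, num=n_chunks, dtype=int)
    chunks = [df.iloc[start:start + chunk_size] for start in starts]
    return pd.concat(chunks)


# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({"value": range(100)})
sample = scattered_chunks_sketch(toy, n_chunks=5, chunk_size=3)
print(sample.index.tolist())
```

Looking at a few short runs from the beginning, middle, and end of the file is a cheap way to notice whether the data changes character along the way, which `head()` alone would hide.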


In [6]:
from configuration import TARGET_COLUMN_NAME
from myst_nb import glue

target = data[TARGET_COLUMN_NAME]
n_positives = target.sum()

glue("n_observations", len(data), display=False)
glue("n_columns", len(data.columns), display=False)
glue("target_column_name", TARGET_COLUMN_NAME, display=False)
glue("n_positives", n_positives, display=False)

Alright! Some things that we can already learn about our dataset from this table are:
* It contains a total of {glue:}`n_observations` observations.
* There are {glue:}`n_columns` columns with mixed data types (numeric and categorical).
* Missing values certainly exist (we can easily spot `nan` entries).
* The subsample raises a strong suspicion that the dataset is imbalanced: in our target variable ({glue:}`target_column_name`) there seem to be far more negative observations than positive ones. A quick `sum()` call confirms that only {glue:}`n_positives` of the {glue:}`n_observations` observations are positive.
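
The observations listed above can also be checked programmatically. The sketch below assumes the data is held in a pandas `DataFrame`. The `toy` frame is a small made-up stand-in that only borrows a few column names from the table, so the numbers it prints say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "covid19_test_results": [False, False, True, False, False, False],
        "age": [58, 35, 12, 35, 24, 52],
        "temperature": [36.95, 36.75, np.nan, 37.0, np.nan, 37.0],
        "cough_severity": ["Severe", "Mild", None, None, None, "Mild"],
    }
)

n_observations, n_columns = toy.shape        # number of rows and columns
dtype_counts = toy.dtypes.value_counts()     # how many columns share each dtype
missing_per_column = toy.isna().sum()        # missing entries per column
# The mean of a boolean column is the fraction of True values.
positive_rate = toy["covid19_test_results"].mean()

print(n_observations, n_columns)
print(dtype_counts)
print(missing_per_column)
print(f"positive rate: {positive_rate:.2%}")
```

Reporting the positive rate as a fraction rather than a raw count makes the degree of imbalance easier to compare across datasets of different sizes.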

```{admonition} Missing Values
    :class: note
> Handling missing values requires careful judgement. Possible solutions include:
* Removing the entire feature (column) containing the missing values.
* Removing all observations with missing values.
* *Imputation*: "Filling in" missing values with some constant or statistic, such as the mean or the mode. 

The approach we'll take when dealing with missing values depends heavily on the structure of our data, for example:
* If a column contains only a small number of non-missing observations (relative to the size of the dataset), and the dataset is rich enough to offer other features that can be expected to be informative, it might be best to remove the column.
* If the dataset is large and the feature in question is crucial for the purposes of our analysis, it may be preferable to keep the feature and remove all observations with missing values.
* Imputation might sound like a good trade-off if there is good reason to believe some statistic adequately approximates the missing values, but it is also the subject of many misconceptions and is often used poorly.
* There are also ML methods that can handle missing values directly (such as some decision tree implementations). We will learn when and how these are used later in this course.
```
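
As a minimal sketch of the three options listed in the note above, assuming the data is a pandas `DataFrame`: the `toy` frame and its values are made up for illustration, and which option suits the real dataset is a judgement call that this example does not settle.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "age": [58, 35, 12, 24],
        "temperature": [36.95, np.nan, 36.75, 37.0],
        "cough_severity": ["Severe", "Mild", None, "Mild"],
        "rapid_flu_results": [np.nan, np.nan, np.nan, np.nan],
    }
)

# 1. Remove a feature (column) that is entirely or almost entirely missing.
without_sparse_column = toy.drop(columns=["rapid_flu_results"])

# 2. Remove every observation (row) that still contains a missing value.
complete_rows = without_sparse_column.dropna()

# 3. Impute: the mean for a numeric column, the mode for a categorical one.
imputed = without_sparse_column.copy()
imputed["temperature"] = imputed["temperature"].fillna(imputed["temperature"].mean())
imputed["cough_severity"] = imputed["cough_severity"].fillna(
    imputed["cough_severity"].mode().iloc[0]
)

print(complete_rows)
print(imputed)
```

Note that dropping rows halves this toy frame, while imputation keeps every row at the cost of inventing values. In a modelling workflow, imputation statistics should be computed on the training data only, so that information from the test set does not leak into the model.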