<img src="img/logo.jpg" width="40" height="40"></img>
`python-for-data-analysis.ipynb`, `flu_consultation_records.csv` and `patient_demographics.csv` (10 September, 2021) are provided to NHS England under licence from Faculty Science Ltd.

# Predicting flu consultation
---
There are two relevant datasets:

- `patient_demographics.csv` contains the demographics of each patient (age, gender, ethnicity and who the patient lives with)

- `flu_consultation_records.csv` contains 735660 rows and 4 columns:

   - `date`: the date of the record
   - `temperature`: the average temperature near the patient's home
   - `patient_id`: pseudonymised patient id
   - `has_flu_consultation`: whether the patient has a flu consultation on that day
 
The goal here is to find out whether we can predict that a patient will have a flu consultation on a given day, based on the features we have access to.

<i>Note: The data for this exercise has been generated randomly, so it may display some regularity that would not be expected of real-world data.</i>

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# set this to the number of CPUs you want to use
n_jobs = 1

## Data
---
Let's load the two datasets and see what they contain.

In [None]:
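# load the two datasets (assumes both CSV files are in the working directory)
demo_df = pd.read_csv("patient_demographics.csv")
flu_df = pd.read_csv("flu_consultation_records.csv")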

In [None]:
# return the first 5 rows of demo_df
demo_df.head()

In [None]:
# return the first 5 rows of flu_df
flu_df.head()

### Merge data
To relate the patient demographics to the flu consultation records, we need to merge the two datasets.

In [None]:
# Merge the dataset here

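A minimal sketch of such a merge on toy data. It assumes both tables share a `patient_id` column (the description above only confirms that column for `flu_consultation_records.csv`); the toy frames and the `merged_df` name are illustrative and not part of the exercise data.

```python
import pandas as pd

# Toy stand-ins for the two datasets (values are made up for illustration)
toy_demo = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [34, 71, 8],
})
toy_flu = pd.DataFrame({
    "patient_id": [1, 1, 2, 4],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-01"]),
    "has_flu_consultation": [0, 1, 0, 0],
})

# An inner join keeps only records whose patient appears in both tables.
# validate="many_to_one" raises if patient_id is duplicated in the demographics.
merged_df = toy_flu.merge(toy_demo, on="patient_id", how="inner", validate="many_to_one")
print(merged_df.shape)
```

Patient 4 has no demographics and patient 3 has no records, so neither survives the inner join. Whether to keep such rows (for example with `how="left"`) is a judgment call that depends on the real data.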

## Exploratory data analysis (EDA)
---

Check for missing values and duplicate rows using pandas, and remove any that are found.

In [None]:
print(f"Are there any Null values: {}")
print(f'Number of duplicated rows: {}')
# drop duplicates and null values here

print(f'Number of rows in the data: {}')

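One possible way to run these checks, sketched on a toy frame. The frame, its values and the names `toy_df` and `clean_df` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one exact duplicate row (made-up values)
toy_df = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "temperature": [5.0, 7.5, 7.5, np.nan],
})

has_nulls = toy_df.isnull().values.any()   # True if any cell is missing
n_duplicates = toy_df.duplicated().sum()   # rows identical to an earlier row

# Drop rows with missing values, then exact duplicates
clean_df = toy_df.dropna().drop_duplicates().reset_index(drop=True)

print(f"Are there any Null values: {has_nulls}")
print(f"Number of duplicated rows: {n_duplicates}")
print(f"Number of rows in the data: {len(clean_df)}")
```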
Conduct some basic checks on each column to see if there are erroneous data entries, and remove any that you find.

In [None]:
# try generating some descriptive statistics


In [None]:
# Are there any bad entries in the data? If yes, remove them

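A hedged sketch of filtering out implausible entries. The column names, toy values and plausibility bounds below are all hypothetical; sensible bounds for the real data should come from inspecting the descriptive statistics first.

```python
import pandas as pd

# Toy frame with some obviously implausible entries (made-up values)
toy_df = pd.DataFrame({
    "age": [34, -2, 71, 250],
    "temperature": [5.0, 7.5, 999.0, 3.2],
})

# Hypothetical plausibility bounds, chosen only for this illustration
valid = toy_df["age"].between(0, 120) & toy_df["temperature"].between(-30, 50)

# Keep only rows where every checked column is within its bounds
clean_df = toy_df[valid].reset_index(drop=True)
print(clean_df)
```

A boolean mask like `valid` also makes it easy to count how many rows would be removed (`(~valid).sum()`) before actually dropping them.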

Visualise the data for better understanding

In [None]:
# make some plots


What have you learned from these plots? Do you already have some insights about which demographics are more likely to have a flu consultation? Are there any features that don't seem useful?

## Preprocessing
---

In [None]:
# one-hot encode columns that are categorical (hint: pandas has built-in functions for this)

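As an illustration of the hint, `pd.get_dummies` is one pandas function that performs one-hot encoding. The toy frame and its `gender` values are made up; the real column names and categories may differ.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up values)
toy_df = pd.DataFrame({
    "age": [34, 71, 8],
    "gender": ["F", "M", "F"],
})

# get_dummies replaces each listed column with one indicator column per category;
# drop_first=True drops one level to avoid perfectly collinear columns
encoded_df = pd.get_dummies(toy_df, columns=["gender"], drop_first=True)
print(encoded_df.columns.tolist())
```

Dropping the first level matters mostly for linear models; tree-based models are generally indifferent to it, so `drop_first` is a choice rather than a requirement.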

In [None]:
# make some plots to sense-check how correlated the columns are


In [None]:
# define the columns you wish to include as features


In [None]:
# define the outcome variable column


### Train, test, validation split
Before you begin selecting and optimising a machine learning model, you should split your data into train, test (and maybe validation) sets. In some cases, you may only need a training and a validation set, for example when the test data has been held out from the beginning. You may also choose to use just a train/test split and apply cross-validation to your training data. The exact ratios for each dataset will depend on the amount of available data and the specifics of the problem, but an 80/20 train/test split is a good rule of thumb.

In [None]:
from sklearn.model_selection import train_test_split

# split the data into train/test sets and separate the features from the target.

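A small sketch of an 80/20 split on synthetic data. The arrays `X` and `y` are randomly generated stand-ins, and using `stratify` is an assumption about what is appropriate here; it is often helpful when the outcome classes are imbalanced.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # toy feature matrix
y = np.array([1] * 10 + [0] * 90)    # imbalanced toy target (10% positive)

# stratify=y keeps the class ratio the same in both splits;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape, y_test.mean())
```

Without `stratify`, a small test set drawn from imbalanced data can end up with very few positive cases, which makes test metrics noisy.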

## Model selection and tuning
---
There are many classification algorithms that could be used for this problem. It is up to you to decide which methods are most suitable for this binary classification task given what you have learned about the data so far. In general, sklearn can be used to quickly test different types of model. We suggest using cross-validation to compare the performance of a few classifiers on the training data, without worrying too much about hyperparameter tuning at this stage. Try to pick at least 3 models that differ in some significant way. Depending on which models you choose, you may need some extra preprocessing steps, e.g., normalising the data. You will need to consider what the important performance metrics are for a classification problem, and use these to decide which model is best for the task.

Note: Are the classes of the outcome variable balanced? If not, what can you do to improve the quality of the model fits?

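One option for handling class imbalance is to reweight the classes during fitting. The sketch below combines that with `cross_val_predict` on synthetic data. `LogisticRegression` is an arbitrary choice for illustration, not a recommendation, and the data is randomly generated.

```python
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 1.0).astype(int)   # toy imbalanced target: positives are the minority

# class_weight="balanced" weights each class inversely to its frequency
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Out-of-fold predicted probability of the positive class for every row
proba = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

print(f"Positive rate: {y.mean():.2f}")
print(f"Recall at 0.5 threshold: {metrics.recall_score(y, proba >= 0.5):.2f}")
```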
In [None]:
# import the sklearn models that you want to try
from sklearn.model_selection import cross_val_predict
from sklearn import metrics

In [None]:
# model 1

In [None]:
# model 2

In [None]:
# model 3

In [None]:
# Write a function that takes the model-predicted probabilities and the actual y as input.
# Compute the accuracy, precision, recall and f1 score.
# Return a dictionary that has these metric names as keys and the computed metrics as values.

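A possible shape for such a function, shown as a sketch. The function name `classification_scores` and the 0.5 decision threshold are assumptions; a different threshold may suit an imbalanced problem better.

```python
import numpy as np
from sklearn import metrics


def classification_scores(y_true, y_proba, threshold=0.5):
    """Convert predicted probabilities to labels and compute common metrics."""
    y_pred = (np.asarray(y_proba) >= threshold).astype(int)
    return {
        "accuracy": metrics.accuracy_score(y_true, y_pred),
        # zero_division=0 returns 0 instead of warning when a metric is undefined
        "precision": metrics.precision_score(y_true, y_pred, zero_division=0),
        "recall": metrics.recall_score(y_true, y_pred, zero_division=0),
        "f1": metrics.f1_score(y_true, y_pred, zero_division=0),
    }


# Toy check: one true positive, one false positive, one false negative, one true negative
scores = classification_scores([0, 0, 1, 1], [0.1, 0.6, 0.4, 0.9])
print(scores)
```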
In [None]:
# how do the different models perform?

Looking at these initial results, which model do you think is best to proceed with? Do you have any thoughts about why a certain model might perform better on this problem than another? What are the limitations of each model?

## Hyperparameter tuning
---
Select your best model from the above and see if you can increase its performance using hyperparameter tuning. You may find this link helpful. Depending on your model, an exhaustive grid search might take a very long time. Consider limiting your grid size by either selecting the one or two hyperparameters that you think are most important or searching over a small range of values for each hyperparameter. Alternatively, you could try a randomised search to speed things up.

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Do a grid search on your hyperparameter space. What are the best hyperparameters?

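A minimal grid search sketch on synthetic data. The estimator, the single hyperparameter `C`, its candidate values and the `f1` scoring choice are all illustrative assumptions; the right grid depends on the model you selected.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary target

# A deliberately small grid: one hyperparameter, four candidate values
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="f1",   # pick a metric that matches the problem
    cv=5,
    n_jobs=1,       # number of CPUs to use
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

For larger search spaces, `RandomizedSearchCV` from the same module samples a fixed number of candidates instead of trying every combination.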

## Model evaluation
---
Now compare the performance of your baseline model and the tuned model on the test set. Why is it important to compare performance on held-out data?

In [None]:
# compare the performance of the baseline and tuned models on the test set


## ROC vs Precision-Recall
---
Draw the precision-recall curve and ROC curve for the classifiers and calculate the area under the curve in both cases. Which curve do you think is more appropriate for this problem and how might the choice affect your evaluation of the model? (Hint: consider your class balance).

In [None]:
# get roc values and precision recall values using sklearn


In [None]:
# plot the curves


In [None]:
# calculate area under the curve.

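A sketch of computing both curves and their areas with sklearn on four toy predictions. The labels and scores are made up; with real data, `y_score` would be the predicted probability of the positive class on the test set.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 0, 1, 1])             # toy labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])   # toy predicted probabilities

# Points on each curve
fpr, tpr, _ = metrics.roc_curve(y_true, y_score)
precision, recall, _ = metrics.precision_recall_curve(y_true, y_score)

# Trapezoidal area under each curve
roc_auc = metrics.auc(fpr, tpr)
pr_auc = metrics.auc(recall, precision)

print(f"Area under precision-recall curve: {pr_auc:.3f}")
print(f"Area under ROC curve: {roc_auc:.3f}")
```

`metrics.average_precision_score` is an alternative summary of the precision-recall curve that avoids linear interpolation between points, so it can differ slightly from the trapezoidal area.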

In [None]:
print(f'Area under precision-recall curve: {}')
print(f'Area under ROC curve: {}')

## Save the model

We might want to re-use the trained model in the future without having to re-train it. To do this, we would need to save the model to disk and load the model when we need it again.

In [None]:
from joblib import dump, load

In [None]:
# Try saving your model with dump

# After saving, check that the model can be loaded with load

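A round-trip sketch with `dump` and `load`. The tiny model, the temporary directory and the file name `model.joblib` are assumptions for illustration; in practice you would save your tuned model to a path of your choosing.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.linear_model import LogisticRegression

# Fit a tiny model on toy data so there is something to save
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
model = LogisticRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")   # hypothetical file name
    dump(model, model_path)           # serialise the fitted model to disk
    loaded_model = load(model_path)   # read it back into memory

# The reloaded model should make the same predictions as the original
same_predictions = np.array_equal(model.predict(X), loaded_model.predict(X))
print(same_predictions)
```

Saved models are generally only safe to load with the same library versions that created them, so it is worth recording the sklearn version alongside the file.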


## Regression
---
Predicting individual records might be challenging: other factors may drive whether a patient has a flu consultation on a particular day, and we might not have access to those features.

Instead of predicting daily flu consultations per patient, we might be more interested in questions like: what is the total number of flu consultations for each patient in each month?

### Preprocess

Now let's rework the dataframe such that each row represents the total number of flu consultations of a given month and a given patient.

Hint: The data contains daily temperatures. What kind of aggregation do we need in order to establish a meaningful relationship between temperature and month?

In [None]:
# code

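One way to build a patient-month table, sketched on toy data. Taking the monthly mean of `temperature` is only one possible aggregation, and the output column names `total_consultations` and `mean_temperature` are hypothetical.

```python
import pandas as pd

# Toy daily records (made-up values)
toy_df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2],
    "date": pd.to_datetime(["2021-01-05", "2021-01-20", "2021-02-03", "2021-01-10"]),
    "temperature": [2.0, 4.0, 6.0, 3.0],
    "has_flu_consultation": [1, 1, 0, 0],
})

monthly_df = (
    toy_df
    .assign(month=toy_df["date"].dt.to_period("M"))   # e.g. 2021-01
    .groupby(["patient_id", "month"], as_index=False)
    .agg(
        total_consultations=("has_flu_consultation", "sum"),
        mean_temperature=("temperature", "mean"),
    )
)
print(monthly_df)
```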
#### Prepare inputs for modelling

As with the classification problem, what kind of preprocessing do we need to do?

In [None]:
# code

### Model

In [None]:
# code

### Check performance

Try plotting the actual total consultations against the model predicted total consultations. What is the R2? Does the model perform better than the mean? 

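To make the comparison with the mean concrete: by definition, $R^2$ is 0 for a model that always predicts the mean of the true values, so any $R^2$ above 0 beats that baseline. The numbers below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0, 1, 2, 3])                  # toy actual totals
y_model = np.array([0.5, 1.0, 2.0, 2.5])         # hypothetical model predictions
y_mean = np.full_like(y_true, y_true.mean(), dtype=float)   # baseline: always predict the mean

r2_baseline = r2_score(y_true, y_mean)
r2_model = r2_score(y_true, y_model)
print(f"R2 of the mean baseline: {r2_baseline:.2f}")
print(f"R2 of the toy model: {r2_model:.2f}")
```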
In [None]:
from sklearn.metrics import r2_score

In [None]:
# assumes y_test and y_pred were defined in the modelling step above
print(f"R2 for the test set is: {r2_score(y_test, y_pred)}")

In [None]:
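# plot the actual against the predicted total consultations
plt.plot(y_test, y_pred, 'r.')
plt.xlabel('Actual total consultations')
plt.ylabel('Predicted total consultations')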