# Machine Learning workflow

## Import libraries

### the all-time basics

In [12]:
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### the machine learning specifics

In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate, cross_val_score

## Import data

👇 Load the `WHO_ther_eff.csv` dataset into this notebook as a pandas DataFrame, and display its first 5 rows.

This dataset is provided by the WHO through their data download [page](https://apps.who.int/malaria/maps/threats/#/download). 

In [14]:
# data = pd.read_csv("../data/raw/WHO_int_conc.csv")
df = pd.read_csv("../data/raw/WHO_ther_eff.csv")
df.head()

| Unnamed: 0 | ID | COUNTRY_NAME | ISO2 | ADMIN2 | SITE_NAME | LATITUDE | LONGITUDE | YEAR_START | YEAR_END | DRUG_NAME | PLASMODIUM_SPECIES | SAMPLE_SIZE | FOLLOW_UP (days) | POSITIVE_DAY_3 (days) | TREATMENT_FAILURE_PP | TREATMENT_FAILURE_KM | DATA_SOURCE | CITATION_URL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Cambodia | KH | Ratanakiri | Veun Sai | 14.261043 | 106.794241 | 2010 | 2011 | Artemether-lumefantrine | P. falciparum | 60 | 28 | 3.2 | 5.0 | 4.9 | National Center for Parasitology, Entomology a... | |
| 1 | 15 | Cambodia | KH | Pailin Province | Pailin | 12.741784 | 102.633104 | 2011 | 2011 | Artesunate-mefloquine | P. falciparum | 28 | 42 | 51.7 | 0.0 | 0.0 | National Center for Parasitology, Entomology a... | |
| 2 | 23 | Cambodia | KH | Pailin Province | Pailin | 12.741784 | 102.633104 | 2010 | 2011 | Dihydroartemisinin-piperaquine | P. falciparum | 28 | 42 | 44.8 | 25.0 | 24.1 | National Center for Parasitology, Entomology a... | |
| 3 | 24 | Cambodia | KH | Pursat | Veal Veng | 12.546349 | 103.918749 | 2010 | 2010 | Dihydroartemisinin-piperaquine | P. falciparum | 56 | 42 | 10.2 | 10.7 | 10.5 | National Center for Parasitology, Entomology a... | |
| 4 | 25 | Cambodia | KH | Ratanakiri | Veun Sai | 14.261043 | 106.794241 | 2010 | 2010 | Dihydroartemisinin-piperaquine | P. falciparum | 59 | 42 | 0.0 | 0.0 | 0.0 | National Center for Parasitology, Entomology a... | |


## 0. Data Exploration and Visualisation

In [None]:
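# `data` is never defined and this dataset has no 'GarageFinish' column; DRUG_NAME is an assumed substitute
df['DRUG_NAME'].describe()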

❓How many *Plasmodium* species are there in this dataset?

In [19]:
df['PLASMODIUM_SPECIES'].nunique()

❓How many observations are there for each species in the dataset?

In [None]:
df['PLASMODIUM_SPECIES'].value_counts()

## 1. Data cleaning

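In [None]:
# A bare dropna() would also drop every row with an empty CITATION_URL; this subset is an assumption
df = df.dropna(subset=['SAMPLE_SIZE', 'TREATMENT_FAILURE_PP'])

In [None]:
# Remove exact duplicate rows
df = df.drop_duplicates()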
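
Before dropping rows, it can help to check how much is missing in each column. The sketch below uses a small toy frame (the values are made up for illustration, and the choice of columns to keep is an assumption) to show why a bare `dropna()` can empty a dataset when one column is always blank.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only
toy = pd.DataFrame({
    "SAMPLE_SIZE": [60, 28, 28, np.nan],
    "TREATMENT_FAILURE_PP": [5.0, 0.0, 25.0, 10.7],
    "CITATION_URL": [np.nan, np.nan, np.nan, np.nan],
})

# Share of missing values per column, highest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# A bare dropna() removes every row, because CITATION_URL is always empty
rows_after_bare_dropna = len(toy.dropna())
print(rows_after_bare_dropna)

# Restricting the check to the columns needed later keeps the usable rows
cleaned = (
    toy.dropna(subset=["SAMPLE_SIZE", "TREATMENT_FAILURE_PP"])
       .drop_duplicates()
)
print(len(cleaned))
```

Running `df.isna().mean()` on the real DataFrame first gives a quick view of which columns are safe to require.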

In [20]:
# `penguin_df` is never defined; the columns below are an assumed choice from the WHO data
sns.scatterplot(x='SAMPLE_SIZE', y='TREATMENT_FAILURE_PP', data=df, hue='PLASMODIUM_SPECIES');

In [None]:
# Plot like a pro - let's see all the relationships between our columns
sns.pairplot(df, hue="PLASMODIUM_SPECIES", corner=True);

## 2. Train

### Choosing your model

In [15]:
model = LinearRegression()


### Organising your dataset

In [17]:
X = df[['TBD']]
y = df[['TBD']]

plt.scatter(X, y, alpha = .2)

### Fitting your model

In [None]:
model1 = model.fit(X, y)

<details>
<summary> 👉Solution </summary>
You should get a mean accuracy of around 98%, which is more than 90%, so our algorithm beats the zoologist!

</details>

## 3. Cross-Validation

Cross-validation does not leave you with a trained model; it evaluates a hypothetical model on the dataset by fitting temporary copies on each fold. If you want to use the model afterwards, for example to make predictions, you need to train it outside of the cross-validation.
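
As a minimal sketch of this point, the example below runs 5-fold cross-validation with `cross_validate` on synthetic data (the data and the names `X_demo`, `y_demo`, and `demo_model` are assumptions made for illustration) and then checks that the estimator passed in is still unfitted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data: one feature with a noisy linear relationship
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

demo_model = LinearRegression()

# 5-fold cross-validation; for a regressor the default score is R^2
cv_results = cross_validate(demo_model, X_demo, y_demo, cv=5)
mean_r2 = cv_results["test_score"].mean()
print(mean_r2)

# cross_validate fits clones of the estimator, so demo_model itself
# has no learned coefficients yet
is_fitted = hasattr(demo_model, "coef_")
print(is_fitted)
```

The returned dictionary also holds `fit_time` and `score_time` for each fold, which can be useful when comparing models.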

❓ Go ahead and train the model on the full `X` and `y` (as we've already validated the model's score, and now will use it to predict). Save the trained model under the variable `model`.

## 4. Predict
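
A minimal sketch of this step, assuming a single numeric feature and synthetic data (`X_demo`, `y_demo`, and `X_new` are hypothetical names, not columns from the dataset): fit the model on all available rows, then call `predict` on new observations shaped like the training features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with a known linear relationship
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 10, size=(100, 1))
y_demo = 3.0 * X_demo[:, 0] + 2.0 + rng.normal(0, 0.5, size=100)

# Train on the full dataset; fit() returns the fitted estimator itself
model = LinearRegression().fit(X_demo, y_demo)

# New observations must be 2D: one row per sample, one column per feature
X_new = np.array([[4.0], [7.5]])
predictions = model.predict(X_new)
print(predictions)
```

With real data, `X_new` would be a DataFrame or array with the same columns, in the same order, as the `X` used for training.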

## 5. Improving the Model with More Features

Your friend who enjoys the NBA fantasy league comes to you with some insights 🏀

They say that when evaluating a player's Wins Above Replacement rating, they would typically also look at the number of ball possessions (`poss`), their defense/offense ratio, and their pacing.

❓ Visualize the correlation between these new features and the `win_rating`. You can use `matplotlib` or `seaborn`. Which **one** of the above features would you consider adding to your model?

<details>
    <summary>💡 Click here for a hint</summary>
    A seaborn <code>regplot</code> might be very handy here.
</details>
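
One way to approach this, sketched on synthetic data: compute each candidate feature's correlation with `win_rating`, then draw a `regplot` per feature. The frame `players`, the column name `pace`, and the generated values are all assumptions for illustration, so the numbers say nothing about real players.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in data; the relationships are invented for the demo
rng = np.random.default_rng(42)
n = 300
poss = rng.uniform(500, 6000, size=n)
players = pd.DataFrame({
    "poss": poss,
    "pace": rng.normal(100, 3, size=n),  # hypothetical column name
    "win_rating": 0.002 * poss + rng.normal(0, 2, size=n),
})

# Pearson correlation of each candidate feature with the target
correlations = players.corr()["win_rating"].drop("win_rating")
print(correlations.sort_values(ascending=False))

# One regression plot per candidate feature
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, feature in zip(axes, ["poss", "pace"]):
    sns.regplot(data=players, x=feature, y="win_rating", ax=ax,
                scatter_kws={"alpha": 0.3})
    ax.set_title(f"{feature} vs win_rating")
fig.tight_layout()
```

A feature whose points hug the regression line and whose correlation is far from zero is the stronger candidate to add.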
