# Machine Learning workflow

by Anaïs Pepey

🦟 In this notebook, we will import open data and apply basic machine learning concepts to analyse the relationship between different variables.

🗣️ Please call Asia or Anaïs if you feel stuck or have any questions!

🐍 Happy coding!

## Import libraries

### all-time basics

In [None]:
# Data manipulation
import pandas as pd
# Data visualisation
import matplotlib.pyplot as plt
import seaborn as sns

### machine learning specifics

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

## 0. Data Exploration and Visualisation

### Understanding our data

Let's have a look at a dataset provided by the [World Bank Open Data Page](https://data.worldbank.org/) for the year 2020.

In [None]:
df = pd.read_csv("./data/workshop/worldbank-df.csv")
df.head()

We can also look at random rows, instead of the first 5:

In [None]:
df.sample(5)

If you re-run the cell, you will obtain 5 different rows. Try it!
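
If you ever need the same "random" rows on every run, `sample` accepts a `random_state` seed. Here is a minimal sketch on a made-up dataframe (the `toy` data and the seed value `42` are illustrative assumptions, not part of the workshop dataset):

```python
import pandas as pd

# Toy dataframe, made up for illustration only
toy = pd.DataFrame({"country": list("ABCDEFGH"), "value": range(8)})

# Fixing the seed makes the random selection repeatable
first_draw = toy.sample(3, random_state=42)
second_draw = toy.sample(3, random_state=42)

# Both draws contain exactly the same rows
print(first_draw.equals(second_draw))
```

Without `random_state`, each call draws a new selection, which is what happens when you re-run the cell above.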

The pandas library simplifies our work by providing built-in functions to explore a dataset.

👇 Run the cell below to look at the shape of the dataframe, returned as `(number of rows, number of columns)`.

In [None]:
df.shape

The dataframe `df` includes 215 rows and 12 columns. It is small by machine learning standards!

👇 We can also peek at the first 5 rows again:

In [None]:
df.head()

👇 Run the function below to get a quick glance at the distribution of the numerical variables across the whole dataset.

In [None]:
df.describe()

👇 Run the function below to get the same information for a specific column.

In [None]:
df["incidence_per_1000"].describe()

❓ How would you look at the details of the `life_expectancy` column?

In [None]:
# type, uncomment and run your code here

<details><summary markdown='span'>View solution
</summary>

```python
df["life_expectancy"].describe()
```

</details>

❓ What is the mean value of access to electricity?

In [None]:
mean_elec = df['access_electricity'].mean()
mean_elec

We can round the value to the number of decimals we need, for example a single decimal:

In [None]:
n_decimal = 1

In [None]:
round(mean_elec, n_decimal)

We can create sentences that integrate variables. Take a minute to understand every element of the cell below:

In [None]:
var1 = round(df['rural_pop'].min(), n_decimal)
var2 = round(df['rural_pop'].max(), n_decimal)

print(f"In 2020, depending on the country, between {var1} and {var2}% of the population lived in rural areas.")
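
As an alternative to calling `round` beforehand, an f-string can format a number directly with a format specifier. This is a small sketch with made-up values (`low` and `high` are hypothetical, not taken from the dataset):

```python
# Made-up values, for illustration only
low, high = 0.0, 86.34567

# ":.1f" displays each number with exactly one decimal place
sentence = f"Between {low:.1f} and {high:.1f}% of the population lived in rural areas."
print(sentence)
```

The underlying variables keep their full precision; only the displayed text is rounded.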

❓ Can you find the maximum value of malaria incidence? Store it under the `max_malaria` variable.

In [None]:
# type, uncomment and run your code here

<details><summary markdown='span'>View solution
</summary>

```python
max_malaria = df['incidence_per_1000'].max()
max_malaria
```

</details>

Another easy way to see which countries have the highest malaria incidence is to sort the dataframe:

In [None]:
df.sort_values(by='incidence_per_1000', ascending=False)

We can also choose to display the ten countries with the highest malaria incidence:

In [None]:
df.sort_values(by='incidence_per_1000', ascending=False)[:10]
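
Sorting and slicing can also be done in one step with `nlargest`. Here is a minimal sketch on a made-up dataframe (the `toy` values are invented; only the column name mirrors the workshop dataset):

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "incidence_per_1000": [12.5, 310.0, 0.0, 87.2],
})

# Keep the 2 rows with the highest incidence, already sorted in descending order
top2 = toy.nlargest(2, "incidence_per_1000")
print(top2)
```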

### Plotting our data

A good practice is to plot the data to better understand the relationship between the different variables.

👇 Run the cells below to list the available columns and create a scatterplot.

In [None]:
df.columns

In [None]:
sns.scatterplot(x='incidence_per_1000', y='life_expectancy', data=df, alpha=0.5, hue = "life_expectancy");

We can also plot the relationship between the two variables by adding a regression line:

In [None]:
sns.scatterplot(x='incidence_per_1000', y='life_expectancy', data=df, alpha=0.5, hue = "life_expectancy");
sns.regplot(x='incidence_per_1000', y='life_expectancy', data=df, scatter=False, color = 'red')

### Plot like a pro!
Let's see all the pairwise relationships between our columns:

In [None]:
sns.pairplot(df, hue = 'continent', corner = True);
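
A numerical counterpart to this visual overview is a correlation matrix. The sketch below uses a made-up dataframe (values are invented for illustration) and keeps only numeric columns before calling `corr`:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "continent": ["Africa", "Asia", "Europe", "Africa"],
    "life_expectancy": [60.0, 72.0, 81.0, 63.0],
    "incidence_per_1000": [250.0, 20.0, 0.0, 180.0],
})

# Correlations are only defined for numeric columns, so text columns are excluded
corr = toy.select_dtypes("number").corr()
print(corr.round(2))
```

Values close to 1 or -1 suggest a strong linear relationship, while values close to 0 suggest a weak one.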

## 1. Data preparation

### Data cleaning

Let's save the initial shape of our dataframe to see how it evolves through cleaning. 

In [None]:
initial_shape = df.shape
initial_shape

Before using the dataset, we need to make sure it is usable.

👇 First, we remove all rows containing NA (Not Available, i.e. missing) values.

In [None]:
df_test = df.dropna()
df_test.shape

Oh, no! We lost a significant amount of data. Let's understand where those NAs were.

In [None]:
df.isna().sum()

It seems that two columns account for most of the NAs. Let's drop them.

In [None]:
cols_to_remove = ["population", "pop_growth"]

df_cleaned = df.drop(columns = cols_to_remove)
df_cleaned.head()
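
When deciding which columns to drop, the share of missing values per column can be easier to judge than raw counts. Here is a minimal sketch on a made-up dataframe (the data and the 50% `threshold` are illustrative assumptions, not a rule from this workshop):

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "life_expectancy": [70.1, 65.3, 80.2, 75.0],
    "population": [np.nan, np.nan, np.nan, 1_000_000],
    "rural_pop": [40.0, np.nan, 20.0, 35.0],
})

# isna() gives True/False per cell; the mean of booleans is the share of missing values
na_share = toy.isna().mean().sort_values(ascending=False)
print(na_share)

# Hypothetical threshold: flag columns with more than half of their values missing
threshold = 0.5
mostly_missing = na_share[na_share > threshold].index.tolist()
print(mostly_missing)
```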

In [None]:
df_cleaned.isna().sum()

Much better! Now we can delete the rows containing NAs without losing so much data.

In [None]:
df_cleaned = df_cleaned.dropna()
df_cleaned.shape

👇 Then, we remove all duplicate rows (if any).

In [None]:
df_cleaned = df_cleaned.drop_duplicates()
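
To check whether there was anything to remove, duplicates can be counted first. Here is a small sketch on a made-up dataframe (the `toy` data is invented for illustration):

```python
import pandas as pd

# Toy data with one repeated row, made up for illustration only
toy = pd.DataFrame({"country": ["A", "B", "B", "C"], "value": [1, 2, 2, 3]})

# duplicated() marks every row that repeats an earlier one
n_duplicates = toy.duplicated().sum()
deduplicated = toy.drop_duplicates()

print(n_duplicates, deduplicated.shape)
```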

In real life, cleaning data includes a lot more steps and can get tricky.

Cleaning and preprocessing are often the longest part of the process, and arguably the most important one.

## 2. Model training

### Choosing your model

We start with the easiest model of them all: the linear regression, imported at the beginning of the notebook from the [Scikit-Learn](https://scikit-learn.org) library.

<details><summary markdown='span'>Additional info
</summary>

  Scikit-Learn is an open-source, well-documented and honestly life-saving initiative for all machine learning enthusiasts.
  Do not hesitate to read their [documentation](https://scikit-learn.org/stable/user_guide.html) and explore their [tutorials](https://scikit-learn.org/stable/auto_examples/index.html). 

  A linear regression is a mathematical model that can be written as f(x) = ax + b
  where a is the slope and b is the intercept. 
</details>
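
To make the formula `f(x) = ax + b` concrete, here is a minimal sketch on synthetic data generated with a known slope of 2 and intercept of 1 (the data and the name `toy_model` are illustrative assumptions). After fitting, Scikit-Learn exposes the slope in `coef_` and the intercept in `intercept_`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data following y = 2x + 1 exactly, made up for illustration
x = np.arange(10).reshape(-1, 1)  # features must be 2D: (n_samples, n_features)
y = 2 * x.ravel() + 1

toy_model = LinearRegression().fit(x, y)

slope = toy_model.coef_[0]        # the "a" in f(x) = ax + b
intercept = toy_model.intercept_  # the "b" in f(x) = ax + b
print(round(slope, 2), round(intercept, 2))
```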


In [None]:
model = LinearRegression()

The model is now loaded under the `model` variable as shown by the cell below:

In [None]:
model

### Defining the features and the target

In [None]:
df_cleaned.columns

Let's start with a very simple model.

We assume that malaria incidence can be predicted with the life expectancy of a country.

We define that our feature `X` is `life_expectancy` and our target `y` is `incidence_per_1000`.

In [None]:
X = df_cleaned[['life_expectancy']]
y = df_cleaned[['incidence_per_1000']]

plt.scatter(X, y, alpha = .2)

### Reserving some data for validation

This is a very important step!

To be able to properly assess the efficacy of our model, we should keep some data 'unseen'.
It means dividing the dataset into a `training` part and a `testing` part. 

Steps:
1. the model learns on some part of the dataset
2. the model predicts from the other part: it uses the feature(s) `X` to make a prediction of the target `y`. 
3. we compare the prediction and the true values

We first choose the proportion of data kept for `testing`. It is usually 0.2 or 0.3, meaning 20 or 30%, selected at random. 

In [None]:
size = 0.3

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = size)


In [None]:
X_train.head()

In [None]:
y_train.head()

What do you expect the approximate shapes of `X_train` and `X_test` to be?

Print the shapes of `X_train` and `X_test` below.

In [None]:
# type, uncomment and run your code here

<details><summary markdown='span'>View solution
</summary>

```python
print(X_train.shape, X_test.shape)
```

</details>
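
Because the split is random, re-running it changes which rows land in each part, and therefore the scores obtained later. Here is a minimal sketch on a made-up array of 10 rows, assuming we want a repeatable split (the toy data and the seed `0` are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 rows, 1 feature, made up for illustration only
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

# With 10 rows and test_size=0.3, 3 rows are held out and 7 are used for training
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
print(X_tr.shape, X_te.shape)

# The same seed reproduces exactly the same split
X_tr2, X_te2, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=0)
print(np.array_equal(X_te, X_te2))
```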

### Fitting your model

We make the model learn from the training dataset:

In [None]:
model_fitted = model.fit(X_train, y_train)

plt.scatter(X_train, y_train, alpha = 0.2)

## Model prediction

Then, we can ask the model to predict malaria incidence values `y` from the values of life expectancy `X` it has not seen before.

In [None]:
prediction = model_fitted.predict(X_test)

Here, we can see the predicted values calculated from `X_test` in red, against the actual values from `X_train`, `y_train` in blue.

In [None]:
# Actual training data in blue
plt.scatter(X_train, y_train, alpha = .2)
# Predictions on the unseen test data in red
plt.scatter(X_test, prediction, alpha = .2, color = 'red')

It seems that our model is not doing an excellent job...

But how bad is the prediction? Let's measure it!

## Model evaluation

### R2 score

In [None]:
model_score_baseline = model_fitted.score(X_test, y_test)
print("Model R2 score:", round(model_score_baseline, 2))

<details><summary markdown='span'>What is the R2 score?
</summary>

In linear regression, the R-squared score (also known as the coefficient of determination) measures the proportion of the variance in the dependent variable that is predictable from the independent variable(s).

A higher R-squared score indicates that the model explains a larger portion of the variability in the dependent variable, suggesting a stronger relationship between the independent and dependent variables.
</details>
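
The R2 score can be computed by hand as 1 minus the ratio between the squared prediction errors and the squared deviations from the mean. The sketch below uses made-up values (`y_true` and `y_pred` are invented) and compares the manual result with Scikit-Learn's `r2_score`:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # what the model fails to explain
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variability around the mean
r2_manual = 1 - ss_res / ss_tot

# The manual computation matches the library function
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

A score of 1 means perfect predictions and 0 means the model does no better than always predicting the mean. On unseen data, the score can even be negative.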

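### Cross-validation score

In [None]:
# For a regressor, cross_val_score returns the R2 score of each fold by default (not an accuracy)
accuracy = cross_val_score(model_fitted, X_train, y_train).mean()
print("Mean cross-validated R2 score:", round(accuracy, 2))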
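
`cross_val_score` splits the data into several folds, trains on all folds but one, scores on the remaining fold, and repeats. The sketch below runs on synthetic data (the data, the seed and the explicit `cv=5` and `scoring="r2"` arguments are illustrative assumptions) and shows that one score per fold is returned:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: a linear trend plus some noise, made up for illustration
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(50, 1))
y_toy = 3 * X_toy.ravel() + rng.normal(0, 1, size=50)

# One R2 score per fold; the estimator is cloned and refitted for each fold
scores = cross_val_score(LinearRegression(), X_toy, y_toy, cv=5, scoring="r2")
print(scores.round(2))
print(round(scores.mean(), 2))
```

Averaging over folds gives a more stable estimate than a single train/test split, which matters with a dataset as small as ours.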

We will now try to improve our model and increase the scores, which are currently low.

## Improving the model

### Option 1: More variables

We can repeat this workflow with more variables from our dataset.
To do so, we change our features `X` to include more columns, and we keep `y` as is.

In [None]:
# Each feature is listed once: a duplicated column adds no information to the model
X_poly = df_cleaned[['life_expectancy', 'forest_area', 'rural_pop', 'access_electricity', 'GDP_growth', 'working_age_pop']]
y = df_cleaned[['incidence_per_1000']]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_poly, y, test_size = size)


In [None]:
X_train.head()

The shape of our features `X` will now be different from before:

In [None]:
X_train.shape

However, the shape of `y_train` remains the same:

In [None]:
y_train.shape

### Fitting the model 

In [None]:
# Fit a fresh model so that the baseline `model_fitted` is not overwritten
model_fitted2 = LinearRegression().fit(X_train, y_train)

In [None]:
model_score2 = model_fitted2.score(X_test, y_test)
print("Model R2 score:", round(model_score2, 2))

In [None]:
# cross_val_score returns R2 scores for a regressor, not an accuracy
accuracy2 = cross_val_score(model_fitted2, X_train, y_train).mean()
print("Mean cross-validated R2 score:", round(accuracy2, 2))

Compare these scores with the baseline ones: adding variables should have improved the model!

### Feature importance

In [None]:
# Pair each feature name with its fitted coefficient
features = pd.DataFrame(X_train.columns)
importance = pd.DataFrame(model_fitted2.coef_).T
result = pd.concat([features, importance], axis=1)
result.columns = ['feature', 'coefficient']
result

In [None]:
# sns.barplot(data = result, x = 'coefficient', y = 'feature')
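
A word of caution: raw coefficients depend on the unit of each feature, so they are only comparable when features share the same scale. The sketch below uses synthetic data (the column names `small_scale` and `large_scale` are hypothetical, and using `StandardScaler` is an assumption, not a step of this workshop) to show how standardising features can change which one looks most influential:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data, made up for illustration: two features on very different scales
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "small_scale": rng.normal(0, 1, 200),
    "large_scale": rng.normal(0, 1000, 200),
})
y_toy = 5 * X_toy["small_scale"] + 0.01 * X_toy["large_scale"]

# Raw coefficients: effect of a one-unit change, so they depend on each feature's unit
raw_coefs = pd.Series(LinearRegression().fit(X_toy, y_toy).coef_, index=X_toy.columns)

# Standardised coefficients: effect of a one-standard-deviation change in each feature
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_coefs = pd.Series(LinearRegression().fit(X_scaled, y_toy).coef_, index=X_toy.columns)

print(raw_coefs.round(2))
print(scaled_coefs.round(2))
```

Here `large_scale` has the smaller raw coefficient but the larger standardised one, because its values vary over a much wider range.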