## PACKAGES 

In [2]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.compose import make_column_selector
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
import time 

# Dataset 

https://www.kaggle.com/mariaren/covid19-healthy-diet-dataset

# Problem definition 

We chose a dataset that combines food supply by category, worldwide obesity and undernourishment rates, and COVID-19 case counts from countries around the world.

The idea is to understand how a healthy eating style could help combat the coronavirus by identifying the diet patterns of countries with a lower COVID-19 infection rate.

Our goal here is to provide diet recommendations based on our findings.

The data comes as several datasets, each measuring the same food categories in a different way depending on what we want to focus on:

- fat quantity,
- energy intake (kcal),
- food supply quantity (kg),
- protein for different categories of food

To which have been added:

- obesity rate
- undernourished rate
- the most up-to-date confirmed, deaths, recovered, and active case counts.

We are going to focus on the fat quantity dataset.

Let's start by loading the data

In [7]:
fat_quantity = pd.read_csv("./data/Fat_Supply_Quantity_Data.csv")
# Copying DataFrame 
df_fat = fat_quantity.copy()

In [9]:
# Maximum number of rows and columns shown when displaying a DataFrame
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_columns', 50)

## Data Exploration and Processing

Now let's explore the dataset:
- check the head
- the columns
- the variable types

In [42]:
# df shape 
print('df shape', df_fat.shape)
print("")

# Column names
display('column names :', df_fat.columns)
print("")

# check the head 
display(df_fat.head(10))
print("")

# variable types 
display('variable types', df_fat.dtypes)
print("")

df shape (170, 31)



'column names :'

Index(['Country', 'Alcoholic Beverages', 'Animal Products', 'Animal fats',
       'Aquatic Products, Other', 'Cereals - Excluding Beer', 'Eggs',
       'Fish, Seafood', 'Fruits - Excluding Wine', 'Meat', 'Miscellaneous',
       'Milk - Excluding Butter', 'Offals', 'Oilcrops', 'Pulses', 'Spices',
       'Starchy Roots', 'Stimulants', 'Sugar Crops', 'Sugar & Sweeteners',
       'Treenuts', 'Vegetal Products', 'Vegetable Oils', 'Vegetables',
       'Obesity', 'Undernourished', 'Confirmed', 'Deaths', 'Recovered',
       'Active', 'Population'],
      dtype='object')




Unnamed: 0,Country,Alcoholic Beverages,Animal Products,Animal fats,"Aquatic Products, Other",Cereals - Excluding Beer,Eggs,"Fish, Seafood",Fruits - Excluding Wine,Meat,Miscellaneous,Milk - Excluding Butter,Offals,Oilcrops,Pulses,Spices,Starchy Roots,Stimulants,Sugar Crops,Sugar & Sweeteners,Treenuts,Vegetal Products,Vegetable Oils,Vegetables,Obesity,Undernourished,Confirmed,Deaths,Recovered,Active,Population
0,Afghanistan,0.0,21.6397,6.2224,0.0,8.0353,0.6859,0.0327,0.4246,6.1244,0.0163,8.2803,0.3103,1.0452,0.196,0.2776,0.049,0.098,0.0,0.0,0.7513,28.3684,17.0831,0.3593,4.5,29.8,0.125149,0.005058,0.098263,0.021827,38928000.0
1,Albania,0.0,32.0002,3.4172,0.0,2.6734,1.6448,0.1445,0.6418,8.7428,0.017,17.7576,0.2933,3.1622,0.1148,0.0,0.051,0.527,0.0,0.0,0.9181,17.9998,9.2443,0.6503,22.3,6.2,1.733298,0.0358,0.87456,0.822939,2838000.0
2,Algeria,0.0,14.4175,0.8972,0.0,4.2035,1.2171,0.2008,0.5772,3.8961,0.0439,8.0934,0.1067,1.1983,0.2698,0.1568,0.1129,0.2886,0.0,0.0,0.8595,35.5857,27.3606,0.5145,26.6,3.9,0.208754,0.005882,0.137268,0.065604,44357000.0
3,Angola,0.0,15.3041,1.313,0.0,6.5545,0.1539,1.4155,0.3488,11.0268,0.0308,1.2309,0.1539,3.9902,0.3282,0.0103,0.7078,0.1128,0.0,0.0,0.0308,34.701,22.4638,0.1231,6.8,25,0.050049,0.001144,0.02744,0.021465,32522000.0
4,Antigua and Barbuda,0.0,27.7033,4.6686,0.0,3.2153,0.3872,1.5263,1.2177,14.3202,0.0898,6.6607,0.1347,1.3579,0.0673,0.3591,0.0449,1.0549,0.0,0.0,0.202,22.2995,14.4436,0.2469,19.1,,0.15102,0.005102,0.140816,0.005102,98000.0
5,Argentina,0.0,30.3572,3.3076,0.0,1.3316,1.5706,0.1664,0.2091,19.2693,0.0,5.8512,0.1878,0.064,0.0213,0.0213,0.111,0.2475,0.0,0.0,0.1366,19.6449,17.3147,0.1878,28.5,4.6,3.31274,0.090444,2.953302,0.268993,45377000.0
6,Armenia,0.0,29.6642,6.2619,0.0,2.5068,1.6196,0.2218,0.5468,10.8165,0.0361,10.4709,0.2734,0.6602,0.0774,0.0103,0.0567,1.8002,0.0,0.0,0.9542,20.3384,12.8127,0.8717,20.9,4.3,5.029838,0.084675,4.261367,0.683796,2956000.0
7,Australia,0.0,24.1099,4.603,0.0,0.9908,0.7017,0.4515,0.4028,11.6002,0.052,6.5196,0.2339,1.2929,0.026,0.1007,0.0422,0.7926,0.0,0.0,1.6145,25.8901,20.3612,0.2144,30.4,<2.5,0.108907,0.003526,0.099748,0.005634,25754000.0
8,Austria,0.0,27.8268,12.8517,0.0,1.2297,1.2147,0.4259,0.2249,8.1099,0.0,5.1497,0.075,1.1367,0.012,0.102,0.045,0.4439,0.0,0.0,0.8398,22.1762,17.9323,0.2039,21.9,<2.5,3.646522,0.050819,3.187952,0.407752,8914000.0
9,Azerbaijan,0.0,32.1198,7.7987,0.0,5.4481,2.0197,0.2122,0.594,11.9993,0.017,9.9202,0.1612,0.1867,0.0255,0.017,0.1697,1.3663,0.0,0.0,2.2573,17.8802,7.1538,0.6534,19.9,<2.5,1.770736,0.01945,1.13614,0.615146,10108000.0





'variable types'

Country                      object
Alcoholic Beverages         float64
Animal Products             float64
Animal fats                 float64
Aquatic Products, Other     float64
Cereals - Excluding Beer    float64
Eggs                        float64
Fish, Seafood               float64
Fruits - Excluding Wine     float64
Meat                        float64
Miscellaneous               float64
Milk - Excluding Butter     float64
Offals                      float64
Oilcrops                    float64
Pulses                      float64
Spices                      float64
Starchy Roots               float64
Stimulants                  float64
Sugar Crops                 float64
Sugar & Sweeteners          float64
Treenuts                    float64
Vegetal Products            float64
Vegetable Oils              float64
Vegetables                  float64
Obesity                     float64
Undernourished               object
Confirmed                   float64
Deaths                      




In [44]:
# unique values 
display('Unique values per variable', df_fat.nunique())

'Unique values per variable'

Country                     170
Alcoholic Beverages           3
Animal Products             170
Animal fats                 169
Aquatic Products, Other       6
Cereals - Excluding Beer    170
Eggs                        169
Fish, Seafood               170
Fruits - Excluding Wine     168
Meat                        170
Miscellaneous               137
Milk - Excluding Butter     169
Offals                      167
Oilcrops                    170
Pulses                      160
Spices                      155
Starchy Roots               166
Stimulants                  169
Sugar Crops                  11
Sugar & Sweeteners            9
Treenuts                    162
Vegetal Products            170
Vegetable Oils              170
Vegetables                  168
Obesity                     120
Undernourished               98
Confirmed                   164
Deaths                      154
Recovered                   162
Active                      161
Population                  170
dtype: i

In [72]:
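# Dropping the columns that have very few unique values

# List of columns with fewer than 15 unique values
n_unique = df_fat.nunique()
drop_col_list = n_unique[n_unique < 15].index.tolist()

# Drop those columns; note that index=1 also drops the row labelled 1 (Albania),
# which is why the shape below is (169, 27) rather than (170, 27)
df_fat.drop(columns=drop_col_list, index=1, inplace=True)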

Let's **check for missing data** and compute **the fraction of missing values** for each variable.

In [74]:
# percentage of missing values per variable
print('percentage of missing values per variable')
(df_fat.isna().sum()/len(df_fat)).sort_values()

percentage of missing values per variable


Country                     0.000000
Vegetables                  0.000000
Vegetable Oils              0.000000
Vegetal Products            0.000000
Treenuts                    0.000000
Stimulants                  0.000000
Starchy Roots               0.000000
Pulses                      0.000000
Oilcrops                    0.000000
Offals                      0.000000
Spices                      0.000000
Miscellaneous               0.000000
Meat                        0.000000
Fruits - Excluding Wine     0.000000
Fish, Seafood               0.000000
Eggs                        0.000000
Cereals - Excluding Beer    0.000000
Animal fats                 0.000000
Animal Products             0.000000
Milk - Excluding Butter     0.000000
Population                  0.000000
Obesity                     0.017751
Confirmed                   0.035503
Deaths                      0.035503
Recovered                   0.035503
Undernourished              0.041420
Active                      0.047337
d

Delete the countries for which values are missing.

In [75]:
df_fat2 = df_fat.copy()

In [81]:
len(df_fat)
df_fat.shape

(169, 27)

In [78]:
df_fat2.dropna(inplace=True)

In [82]:
len(df_fat2)
df_fat2.shape

(153, 27)

Look at the different data types for each variable.

Explore the variables that are not of float type and see if you can convert them to float type.
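The head of the dataset suggests that `Undernourished` is stored as `object` because some entries are written as `<2.5`. Here is a minimal sketch of one possible conversion, on a tiny stand-in frame. Treating `<2.5` as 2.5 is an assumption made for illustration, not a rule from the dataset documentation.

```python
import pandas as pd

# Tiny stand-in frame mimicking the kind of values seen in the head of the dataset.
sample = pd.DataFrame({
    "Country": ["Afghanistan", "Australia", "Antigua and Barbuda"],
    "Undernourished": ["29.8", "<2.5", None],
})

# Assumption: replace the censored value "<2.5" by its upper bound, 2.5.
cleaned = sample["Undernourished"].str.replace("<", "", regex=False)

# errors="coerce" turns anything that still cannot be parsed into NaN.
sample["Undernourished"] = pd.to_numeric(cleaned, errors="coerce")
```

The same two steps could be applied to the `Undernourished` column of the real DataFrame before dropping missing values.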

### CREATING X AND Y 

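In [None]:
# For y, the variables directly tied to the target must also be dropped from X to avoid biasing the model
# Assumption: the target is 'Deaths', and the other case counts are the variables tied to it
y = df_fat2['Deaths']
X = df_fat2.drop(columns=['Deaths', 'Confirmed', 'Recovered', 'Active'])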

# Clustering

## Data preparation

Scale the dataset
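One common way to scale is `StandardScaler`, already imported at the top of the notebook. The sketch below uses synthetic values as a stand-in for the two columns of interest; the ranges are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for two numeric columns of the cleaned dataset.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Obesity": rng.uniform(2, 40, size=50),
    "Deaths": rng.uniform(0, 0.15, size=50),
})

# Each column is centred to mean 0 and rescaled to unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(demo)  # returns a NumPy array
```

Scaling matters for distance-based methods such as K-Means, since otherwise the column with the largest numeric range dominates the distances.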

## Plot some data

Now, we want to visualize some variables for each country. To do so, we use Plotly Express, which lets us hover over the points of a scatter plot and read the statistics of each country more clearly, as explained in the Plotly documentation:

https://plotly.com/python/hover-text-and-formatting/#:~:text=Basic%20Charts%20tutorials.-,Hover%20Labels,having%20a%20hover%20label%20appear.

Plot the "Obesity" vs "Deaths" statistics

Plot the "Animal fats" vs "Deaths" statistics

## K-means and Elbow method

We start with the K-Means model:
- use the scikit-learn method
- use the method you implemented.

Use a graphical tool, the elbow method, to estimate the optimal number of clusters k for a given task.
- Determine the optimal number of clusters for the previous 2 plots.

In [1]:
from sklearn.cluster import KMeans
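A minimal sketch of the elbow method with scikit-learn. The data is synthetic (three blobs from `make_blobs`) and only stands in for the scaled two-variable datasets; the range of `k` is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with 3 blobs, standing in for the scaled (Obesity, Deaths) pairs.
points, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

ks = range(1, 9)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    # inertia_ is the within-cluster sum of squared distances to the centroids.
    inertias.append(model.inertia_)
```

Plotting `inertias` against `ks` (for example with `plt.plot(ks, inertias, marker="o")`) shows the curve; the "elbow" is the value of `k` after which the inertia stops dropping sharply.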


Plot the obtained clusters

## Other clustering methods

We are going to explore other clustering methods, such as Mean-Shift.

You can read more about it in the following resource:
https://scikit-learn.org/stable/modules/clustering.html


Apply the method to our datasets made of 2 variables ("Obesity" vs "Deaths")

In [2]:
# Mean-Shift
from sklearn.cluster import MeanShift, estimate_bandwidth
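A hedged sketch of how Mean-Shift could be applied, again on synthetic blobs rather than the real data. The `quantile` value passed to `estimate_bandwidth` is an assumption that usually needs tuning.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled (Obesity, Deaths) pairs.
points, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

# estimate_bandwidth derives a kernel width from nearest-neighbour distances.
bandwidth = estimate_bandwidth(points, quantile=0.2, random_state=42)
mean_shift = MeanShift(bandwidth=bandwidth).fit(points)

labels = mean_shift.labels_
n_clusters = len(mean_shift.cluster_centers_)
```

Unlike K-Means, Mean-Shift does not take the number of clusters as input: it is a consequence of the bandwidth.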


Plot the obtained clusters

Check out other algorithms such as DBSCAN or OPTICS. Why are these algorithms interesting, and in which cases?

In [36]:
from sklearn.cluster import DBSCAN
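As a hint for the question above, the sketch below runs DBSCAN on a synthetic "two moons" dataset, a shape that centroid-based methods tend to split badly. The values of `eps` and `min_samples` are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-convex clusters.
points, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# eps is the neighbourhood radius, min_samples the density threshold.
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(points)

labels = dbscan.labels_                    # -1 marks points treated as noise
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
```

Density-based methods do not need the number of clusters in advance, can follow arbitrary cluster shapes, and flag isolated points as noise instead of forcing them into a cluster.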


# Classification and prediction

Given this dataset and the emphasis we have already placed on deaths through clustering, it would be interesting to use this dataset for a prediction task (a regression, since the target is continuous) and see how accurately we can predict the mortality rate as a function of the given features.

## Creating train and test sets 

Let's separate the data into training and testing sets using random selection.

Now drop the labels from the training set and create a new variable for the labels.

Scale the datasets.
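The three steps above can be sketched as follows. The features and target are synthetic, and the 80/20 split ratio is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(150, 5))                                  # synthetic features
y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=150)   # synthetic target

# Random split; the labels y are kept separate from the features X.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training set only, then reuse it on the test set
# so that no information from the test set leaks into training.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```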

## Random Forest

Let's try a random forest model on the prepared fat_quantity training set.

RandomForestRegressor(random_state=42)

Now we predict.

Let's perform a 10-fold cross-validation and display the resulting scores:
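A minimal sketch of such a cross-validation, run on a synthetic regression problem rather than the prepared training set. The number of trees is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training set.
X, y = make_regression(n_samples=150, n_features=10, noise=10.0, random_state=42)

forest = RandomForestRegressor(n_estimators=50, random_state=42)

# scikit-learn maximises scores, so the MSE is returned with a negative sign.
neg_mse = cross_val_score(forest, X, y, scoring="neg_mean_squared_error", cv=10)
rmse_scores = np.sqrt(-neg_mse)
```

`rmse_scores.mean()` and `rmse_scores.std()` then summarise the typical error and its variability across the 10 folds.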

## Learning Curves analysis 

Use the function seen in **Module 1 to plot learning curves with cross validation.** 

In [31]:
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, axes=None, ylim=None, cv=None,
                        n_jobs=None, train_sizes=np.linspace(.1, 1.0, 5)):
....

Try to interpret the obtained learning curve.
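To help with the interpretation, here is a sketch that computes the numbers behind a learning curve directly with `learning_curve`. It runs on synthetic data and does not reproduce the Module 1 plotting function.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the prepared training set.
X, y = make_regression(n_samples=150, n_features=10, noise=10.0, random_state=42)

train_sizes, train_scores, valid_scores = learning_curve(
    RandomForestRegressor(n_estimators=30, random_state=42),
    X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
)

# Average over the folds; the scores are R^2, the regressor's default scorer.
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
```

A large and persistent gap between `train_mean` and `valid_mean` usually points to overfitting, while two low curves that converge suggest underfitting.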

Perform a grid search to try to obtain the best hyperparameters. What is the best score that you obtained?
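One way to set up the grid search, sketched on synthetic data. The parameter grid below is deliberately small and its values are assumptions, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the prepared training set.
X, y = make_regression(n_samples=150, n_features=10, noise=10.0, random_state=42)

# Illustrative grid: every combination is evaluated with 5-fold cross-validation.
param_grid = {"n_estimators": [20, 50], "max_depth": [None, 5]}

search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X, y)

best_params = search.best_params_
best_rmse = (-search.best_score_) ** 0.5  # best_score_ is a negated MSE
```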

## SVM

Use the SVM regressor to estimate the death rate. See if you can get a better model than with the Random forest regressor.

In [28]:
from sklearn.svm import SVR

SVR(epsilon=0.2)
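A hedged sketch of how the SVR could be evaluated in the same way as the random forest. The data is synthetic, and the kernel and `C` value are assumptions; only `epsilon=0.2` comes from the cell above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the prepared training set.
X, y = make_regression(n_samples=150, n_features=10, noise=10.0, random_state=42)

# SVR is sensitive to feature scale, so scaling is done inside the pipeline
# and refitted on each training fold.
svr_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=0.2))

neg_mse = cross_val_score(svr_model, X, y, scoring="neg_mean_squared_error", cv=10)
svr_rmse = np.sqrt(-neg_mse)
```

Comparing `svr_rmse.mean()` with the random forest RMSE obtained with the same folds gives a like-for-like comparison of the two models.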

## Linear regression

In [3]:
from sklearn.linear_model import LinearRegression

# Dimensionality reduction

Let's take a look at the whole dataset and see if there are any clusters.

To do this, perform and plot a PCA with 2 components.
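A minimal sketch of a 2-component PCA, on a synthetic feature matrix standing in for the numeric columns of the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
features = rng.normal(size=(150, 10))  # synthetic stand-in for the numeric columns

# PCA is variance-based, so the features are standardised first.
scaled = StandardScaler().fit_transform(features)

pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

# Share of the total variance captured by each of the two components.
explained = pca.explained_variance_ratio_
```

A scatter plot of `components[:, 0]` against `components[:, 1]` is then the 2-D view in which clusters can be looked for.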

Dimensionality reduction is a way to reduce the number of features in your dataset without losing much information, while preserving the model’s performance. Check out the Random Forest based method and PCA for dimensionality reduction in the following resource:

https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/

## Random Forest feature selection

Plot the feature importance graph.
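The importances can be read from a fitted forest through `feature_importances_`. The sketch below uses synthetic data, and the feature names are hypothetical placeholders.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem where only 3 of the 6 features are informative.
X, y = make_regression(n_samples=150, n_features=6, n_informative=3,
                       noise=5.0, random_state=42)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Impurity-based importances; they are normalised to sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names).sort_values()
```

`importances.plot.barh()` draws the graph, with the most important feature at the top.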

Comment on the graph.

## PCA dimensionality reduction

PCA is a technique that extracts a new, smaller set of variables from an existing large set of variables. Apply clustering methods to this new set of variables. Are the resulting clusters different from the clusters obtained on "Obesity" vs "Deaths"?

Apply the Elbow method to determine the right number of clusters.

Use diverse methods to cluster the countries.