# The Perfect Diet for battling COVID-19

We use a Kaggle dataset that records the average diet of each country's population, together with the country's obesity rate, undernourishment rate, and COVID-19 confirmed cases, deaths and recoveries.

Importing pandas to work with the data and matplotlib to plot it:

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

Importing our CSV file and printing the first 5 rows to check that the file has been imported correctly:

In [None]:
data = pd.read_csv('../input/covid19-healthy-diet-dataset/Food_Supply_Quantity_kg_Data.csv')
# To display the top 5 rows
data.head(5)

As we can see, the average diet of a specific country is given in terms of proportions of different foods, drinks and substances. 

On the far right, the COVID-19 figures are also given: the rates of confirmed cases, deaths and recoveries.

INTERPRETATION OF DATASET -

The first few columns of the dataset show each food category's share of the total food intake, as a percentage, for an average person.

Consider the first row (Afghanistan): the value 0.0014 means that alcoholic beverages make up 0.0014% of the total food intake of an average person in Afghanistan. The intake of animal fats is 0.1973%, that of animal products is 9.4341% (considerably more than the previous two), and so on.

The column Recovered at the end shows the percentage of the total population that recovered from COVID-19. For example, if the total population is 38928000.0 and the recovered percentage is 0.065141, then approximately 25,358 patients recovered.
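The conversion from a percentage of the population to a head count can be checked with a few lines of Python. This is only a sketch of the arithmetic, using the two figures quoted above; the variable names are illustrative and are not columns of the dataset.

```python
# Convert a "percentage of the population" figure into a head count.
population = 38_928_000    # total population from the example above
recovered_pct = 0.065141   # 'Recovered' value, expressed as a percentage

# A percentage has to be divided by 100 before it is applied to the population.
recovered_people = population * recovered_pct / 100
print(round(recovered_people))
```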

We take Recovered as our target variable: based on the diet (the intake of the various types of food, which are our independent variables), we try to predict how well the population of each country recovers. Further down, the recovery percentage is mapped to two classes (lower or higher recovery), so the model predicts which class a country falls into.

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
!conda install -c conda-forge Skater -y

Importing the necessary dependencies :

In [None]:
import skater
from skater.core.explanations import Interpretation
from skater.model import InMemoryModel
from skater.core.local_interpretation.lime.lime_tabular import LimeTabularExplainer

In [None]:
from sklearn.model_selection import train_test_split

We want to find the dietary features that matter most for recovery. First, we drop the rows in which Recovered is missing:

In [None]:
index=data[data['Recovered'].isnull()].index
data.drop(index,inplace=True)

In [None]:
#import math
#data.replace((data['Confirmed']==math.nan)==True,0)

As we only want to find the foods, drinks and substances associated with recovery, we leave out the other columns, such as the obesity rate, the undernourishment rate and the remaining coronavirus figures (confirmed cases and deaths). Recovered is kept only as the target.

In [None]:
features = ['Alcoholic Beverages', 'Animal fats', 'Animal Products', 'Aquatic Products, Other',
            'Cereals - Excluding Beer', 'Eggs', 'Fish, Seafood', 'Fruits - Excluding Wine',
            'Meat', 'Milk - Excluding Butter', 'Miscellaneous', 'Offals', 'Oilcrops', 'Pulses',
            'Spices', 'Starchy Roots', 'Stimulants', 'Sugar & Sweeteners', 'Sugar Crops',
            'Treenuts', 'Vegetable Oils', 'Vegetables', 'Vegetal Products']
data[features] = data[features].astype(float)
data['Recovered'] = data['Recovered'].astype(float)

Our target variable is the COVID-19 recovery rate, mapped to a binary class (1 if it is above the threshold of 0.047469, 0 otherwise):

In [None]:
X = data[features]
y = data['Recovered']
# mapping the target to a binary class 
y = y.apply(lambda x: 0 if x <= 0.047469 else 1)

# quickly check that we have a balanced target partition
y.sum() / len(y)
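To make the thresholding step concrete, here is a minimal sketch on made-up recovery values. It assumes the cut-off is the median of the target, which would give the balanced split that the check above looks for. The notebook does not say how 0.047469 was actually chosen, and the numbers below are hypothetical.

```python
from statistics import median

# Hypothetical recovery percentages for six countries (not taken from the dataset).
recovered = [0.01, 0.03, 0.05, 0.08, 0.20, 0.002]

# Assumption: use the median as the cut-off so that both classes are equally large.
threshold = median(recovered)

# Same rule as in the notebook: 0 at or below the threshold, 1 above it.
labels = [0 if value <= threshold else 1 for value in recovered]

# Share of positive labels; a value near 0.5 means the classes are balanced.
positive_share = sum(labels) / len(labels)
print(threshold, labels, positive_share)
```

With a pandas Series `y`, the same idea could be written as `(y > y.median()).astype(int)`.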

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train.shape, X_test.shape

Training the model :

In [None]:
from xgboost import XGBClassifier, plot_importance

In [None]:
model = XGBClassifier(objective='binary:logistic', random_state=33, n_jobs=-1)
model.fit(X_train, y_train)

Predicting the values using the model we trained :

In [None]:
xgb_predictions = model.predict(X_test)

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

In [None]:
# function for plotting the confusion matrix
def plot_confusion_matrix(predict_y, test_y):
    # rows are the true classes, columns the predicted classes, both ordered 0, 1
    C = confusion_matrix(test_y, predict_y)
    labels = ['0', '1']
    plt.figure(figsize=(10, 7))
    # the cells hold integer counts, so format them as integers
    sns.heatmap(C, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Confusion matrix")
    plt.show()

In [None]:
plot_confusion_matrix(xgb_predictions,y_test)
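The counts in a confusion matrix translate directly into summary metrics. The sketch below tallies a 2x2 matrix by hand on hypothetical labels, using the same layout as scikit-learn's `confusion_matrix` (rows are true classes, columns are predicted classes, ordered 0 then 1), and derives accuracy, precision and recall from it. The label lists are invented for illustration and are not the notebook's predictions.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for truth, prediction in zip(y_true, y_pred):
        matrix[truth][prediction] += 1
    return matrix

# Hypothetical true labels and predictions.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

(tn, fp), (fn, tp) = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # how many predicted 1s are real 1s
recall = tp / (tp + fn)                      # how many real 1s were found
print(accuracy, precision, recall)
```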

Visualising the features (dietary components) that matter most to the model's recovery prediction:

In [None]:
# plotting XGBoost default feature importances
fig = plt.figure(figsize = (18, 10))
title = fig.suptitle("Native Feature Importances from XGBoost", fontsize=14)

ax1 = fig.add_subplot(2, 2, 1)
plot_importance(model, importance_type='weight', ax=ax1, color='red')
ax1.set_title("Feature Importance with Feature Weight");

ax2 = fig.add_subplot(2, 2, 2)
plot_importance(model, importance_type='cover', ax=ax2, color='red')
ax2.set_title("Feature Importance with Sample Coverage");

ax3 = fig.add_subplot(2, 2, 3)
plot_importance(model, importance_type='gain', ax=ax3, color='red')
ax3.set_title("Feature Importance with Split Mean Gain");
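The three importance types differ only in what they aggregate over the splits of the trained trees: `weight` counts how often a feature is used to split, `gain` averages the loss reduction of those splits, and `cover` averages, roughly, how many samples those splits affect. The sketch below reproduces that bookkeeping on a hypothetical list of splits. The split values are invented, and this is not how XGBoost stores its trees internally.

```python
from collections import defaultdict

# Hypothetical splits from a trained ensemble: (feature, gain, cover).
splits = [
    ("Eggs", 4.0, 30.0),
    ("Oilcrops", 1.0, 12.0),
    ("Eggs", 2.0, 10.0),
    ("Meat", 3.0, 20.0),
]

weight = defaultdict(int)       # number of splits per feature
gain_sum = defaultdict(float)   # running total of gain per feature
cover_sum = defaultdict(float)  # running total of cover per feature

for feature, gain, cover in splits:
    weight[feature] += 1
    gain_sum[feature] += gain
    cover_sum[feature] += cover

# 'gain' and 'cover' importances are averages over the splits of each feature.
avg_gain = {f: gain_sum[f] / weight[f] for f in weight}
avg_cover = {f: cover_sum[f] / weight[f] for f in weight}
print(dict(weight), avg_gain, avg_cover)
```

In this toy example Eggs and Meat tie on average gain but differ on weight, which is why the three plots can rank the same features differently.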


## LIME

In [None]:
xgb_array = XGBClassifier(objective='binary:logistic', random_state=33, n_jobs=-1)
xgb_array.fit(X_train.values, y_train)

In [None]:
predictions = xgb_array.predict_proba(X_test.values)

In [None]:
exp = LimeTabularExplainer(X_test.values, feature_names=features, discretize_continuous=True, class_names=['Less likely', 'More likely'])

In [None]:
condition = 0
print('Reference:', y_test.iloc[condition])
print('Predicted:', predictions[condition])
exp.explain_instance(X_test.iloc[condition].values, xgb_array.predict_proba).show_in_notebook()

INTERPRETATION -

The output of LIME is a list of explanations, reflecting the contribution of each feature to the prediction of a data sample.
There are 3 parts -
1. The leftmost part gives the prediction probabilities for both classes (More likely to recover: 0.74, Less likely to recover: 0.26).
2. The middle part shows the most important features and their weights, i.e. how much each one pushes the prediction towards one class or the other. For example, if we remove the features Oilcrops and Eggs from the dataset, we expect the classifier to predict the outcome with probability 0.74 - 0.14 - 0.13 = 0.47.
3. The rightmost part lists the actual values of these features for the sample being explained.

In this explanation, Milk is the feature whose change impacts the output the most.
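The subtraction in the example can be written out as a small calculation. This sketch assumes, as LIME's local linear model does, that feature contributions simply add up. The two weights are the ones quoted above, and the helper name is hypothetical.

```python
def probability_without(predicted_probability, weights, removed):
    """Approximate the local prediction after dropping some features.

    Under an additive local model, removing a feature subtracts its weight.
    """
    return predicted_probability - sum(weights[name] for name in removed)

# Local weights quoted in the interpretation above.
local_weights = {"Oilcrops": 0.14, "Eggs": 0.13}

estimate = probability_without(0.74, local_weights, ["Oilcrops", "Eggs"])
print(round(estimate, 2))
```

The result is only as good as the local linear approximation; it says nothing about how the model behaves far from the explained sample.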

In [None]:
# class_names expects one label per class, not the raw target column
explainer = LimeTabularExplainer(X_test.values, feature_names=features, class_names=['Less likely', 'More likely'])
condition = 0
exp = explainer.explain_instance(X_test.iloc[condition].values, xgb_array.predict_proba)
exp.as_pyplot_figure()

Features in green have a positive weight and push the prediction towards the explained class; features in red have a negative weight and push it away.

In [None]:
condition=1
print('Reference:', y_test.iloc[condition])
print('Predicted:', predictions[condition])
explainer.explain_instance(X_test.iloc[condition].values, xgb_array.predict_proba).show_in_notebook()

## PERMUTATION IMPORTANCE

In [None]:
pip install eli5

In [None]:
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.tree import DecisionTreeClassifier

In [None]:
model_new = DecisionTreeClassifier(random_state=0)
model_new = model_new.fit(X_train, y_train)

In [None]:
#calculate the importance of features in model by shuffling
perm = PermutationImportance(model_new, random_state=0).fit(X_train, y_train)

In [None]:
# show the weights (mean drop in score ± spread over the shuffles) for every feature
eli5.show_weights(perm, feature_names=features, top=30)
# top=30 is larger than len(features) (23), so every feature is listed
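The idea behind permutation importance fits in a few lines: shuffle one column, re-score the model, and record how much the score drops. The sketch below is a from-scratch illustration on a hypothetical rule-based model and invented data. It is not eli5's implementation, and all function names are assumptions made for this example.

```python
import random

def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_features, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            shuffled = [row[:j] + [value] + row[j + 1:] for row, value in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return baseline, importances

# Hypothetical model that looks only at feature 0 and ignores feature 1.
def predict(row):
    return 1 if row[0] > 0.5 else 0

rows = [[0.1, 5.0], [0.9, 1.0], [0.2, 3.0], [0.8, 4.0], [0.3, 2.0], [0.7, 6.0]]
labels = [0, 1, 0, 1, 0, 1]

baseline, importances = permutation_importance(predict, rows, labels, n_features=2)
print(baseline, importances)
```

Shuffling the ignored feature never changes a prediction, so its importance is exactly zero. Scoring on held-out data rather than on the training set is a common choice when the goal is to see what the model relies on to generalise.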

## PARTIAL DEPENDENCE PLOTS

In [None]:
!pip install pdpbox
from pdpbox import pdp, get_dataset, info_plots

In [None]:
def plot_pdp(model, df, feature, cluster_flag=False, nb_clusters=None, lines_flag=False):
    
    # Create the data that we will plot
    pdp_goals = pdp.pdp_isolate(model=model, dataset=df, model_features=df.columns.tolist(), feature=feature)

    # plot it
    pdp.pdp_plot(pdp_goals, feature, cluster=cluster_flag, n_cluster_centers=nb_clusters, plot_lines=lines_flag)
    plt.show()

In [None]:
# plot the PD univariate plot
plot_pdp(model, X_train, 'Oilcrops')

In [None]:
# plot the PD univariate plot
plot_pdp(model, X_train, 'Eggs')
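A partial dependence curve is an average over "what if" predictions: for each grid value, the chosen feature is overwritten in every row and the model's outputs are averaged. The sketch below shows that computation from scratch on a hypothetical model and invented rows. It is not pdpbox's implementation.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        predictions = []
        for row in rows:
            modified = list(row)             # copy so the data stays untouched
            modified[feature_index] = value  # overwrite the feature of interest
            predictions.append(predict(modified))
        curve.append(sum(predictions) / len(predictions))
    return curve

# Hypothetical model: the output rises by 2 per unit of feature 0.
def predict(row):
    return 2 * row[0] + row[1]

rows = [[0, 1], [1, 3], [2, 5]]
curve = partial_dependence(predict, rows, feature_index=0, grid=[0, 1, 2])
print(curve)
```

The slope of the curve recovers the effect of feature 0 in the toy model, while the other feature only shifts the curve up or down.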

In [None]:
features_to_plot = ['Eggs','Oilcrops']
inter1  =  pdp.pdp_interact(model=model, dataset=X_train, model_features=features, features=features_to_plot)
pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='grid')

plt.show()

## ICE PLOTS

In [None]:
plot_pdp(model, X_train, 'Eggs', cluster_flag=True, nb_clusters=24, lines_flag=True)
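ICE (individual conditional expectation) plots draw one curve per row instead of a single average, which reveals effects that cancel out in the mean. The sketch below uses a hypothetical model with an interaction between two features and invented rows. The names are illustrative, and pdpbox additionally clusters the lines when `cluster=True`, which is not reproduced here.

```python
def ice_curves(predict, rows, feature_index, grid):
    """One prediction curve per row, varying only the chosen feature."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = list(row)
            modified[feature_index] = value
            curve.append(predict(modified))
        curves.append(curve)
    return curves

# Hypothetical model with an interaction: the effect of feature 0
# depends on the sign and size of feature 1.
def predict(row):
    return row[0] * row[1]

rows = [[0, 1], [1, -1], [2, 3]]
grid = [0, 1, 2]

curves = ice_curves(predict, rows, feature_index=0, grid=grid)

# Averaging the ICE curves point by point gives the partial dependence curve.
average = [sum(point) / len(point) for point in zip(*curves)]
print(curves, average)
```

The second row's curve slopes downwards while the others slope upwards, a difference that the averaged curve alone would hide.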

## SHAP

In [None]:
pip install shap

In [None]:
import shap

In [None]:
shap.initjs()

In [None]:
explainer = shap.TreeExplainer(model)
#calculate shap values
shap_values = explainer.shap_values(X_test)

In [None]:
# label the columns so that each SHAP value can be matched to its feature
X_shap = pd.DataFrame(shap_values, columns=features)
# display shap values
X_shap.head()

In [None]:
print('Expected Value/base value: ', explainer.expected_value)
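The base value and the SHAP values are linked by an additivity property: the base value plus the sum of a row's SHAP values equals the model output for that row. The sketch below checks this on a tiny hypothetical model by computing exact Shapley values by brute force. It simplifies in two labelled ways: the base value is the prediction at a single reference point rather than an average over background data, and it enumerates all feature orderings, whereas `TreeExplainer` uses a much faster tree-specific algorithm.

```python
from itertools import permutations

def shapley_values(predict, instance, baseline):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    n = len(instance)
    totals = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)          # start from the reference point
        previous = predict(current)
        for j in order:
            current[j] = instance[j]      # switch feature j to its real value
            value = predict(current)
            totals[j] += value - previous # marginal contribution of feature j
            previous = value
    return [total / len(orders) for total in totals]

# Hypothetical model: features 0 and 1 interact, feature 2 is purely additive.
def predict(x):
    return x[0] * x[1] + x[2]

instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]   # assumption: a single reference point

phi = shapley_values(predict, instance, baseline)
base_value = predict(baseline)
print(phi, base_value + sum(phi), predict(instance))
```

The interacting features share their joint effect equally, and the additive feature keeps its full contribution. Brute force is only practical for a handful of features, since the number of orderings grows factorially.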

In [None]:
test_X=pd.DataFrame(X_test,columns=features)

In [None]:
shap.dependence_plot("Eggs", shap_values, test_X)

In [None]:
shap.summary_plot(shap_values, test_X)

In [None]:
shap.summary_plot(shap_values, test_X, plot_type="bar", color='red')

In [None]:
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[0,:], test_X.iloc[0,:])

In [None]:
shap.initjs()
shap.force_plot(explainer.expected_value, shap_values[:1000,:], test_X.iloc[:1000,:])