# Hack-Night Kick-off

### *A notebook which you can take as a starting point in your exploration of our data.*

Boilerplate code which sets Python up with the functionality we need.

In [None]:
import math
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import os
%matplotlib inline

sns.set_style('dark')

A couple of convenience variables: the version of the data we want to use.

In [None]:
tag      =  '20180420'

The directory which holds our data files.

In [None]:
plot_dir  =  '.'

Read in the table of all the foods people ate.

In [None]:
eaten  =  pd.read_csv (plot_dir + '/eaten_table_' + tag + '.csv',
                       encoding='ISO-8859-1')
eaten

The dataset is complex, so a second table is provided with descriptions of all the column names. Below is a convenience function which looks up the 'Descriptions' entry for a given column name.

In [None]:
# read in col description table

columnDescription = pd.read_csv (plot_dir + '/eaten_table_column_info_' + tag + '.csv',
                                 encoding='ISO-8859-1')

# function for handling lookups

def get_info(colName,CDs = columnDescription):
    """returns description value from Column Description table"""

    # select matching rows and the 'Descriptions' column in a single .loc call
    return CDs.loc[CDs['Names'] == colName, 'Descriptions'].values[0]

# try it for pork
get_info('Pork')
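
`get_info` raises an `IndexError` when the name is not in the table. A minimal sketch of a more forgiving lookup is shown here; the function name `get_info_or_none`, the toy table and its description strings are all made up for illustration and are not taken from the real column info file.

```python
import pandas as pd

# Hypothetical stand-in for the column description table (made-up descriptions).
toy_descriptions = pd.DataFrame({
    'Names': ['Pork', 'KCALS'],
    'Descriptions': ['Pork eaten (g)', 'Energy (kcal)'],
})

def get_info_or_none(col_name, table):
    """Return the description for col_name, or None if it is not listed."""
    matches = table.loc[table['Names'] == col_name, 'Descriptions']
    return matches.iloc[0] if len(matches) else None

print(get_info_or_none('KCALS', toy_descriptions))
print(get_info_or_none('NotAColumn', toy_descriptions))
```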

# Histograms for basic data exploration
Plot a histogram of the time people ate each food item.

In [None]:
eaten['MealTime'].hist(bins=24)
plt.show()

Plot a histogram of the day number since the person started the survey...

In [None]:
eaten['DayNo'].hist(bins=4)
plt.show()

...you can see a slight drop-off as some people didn’t complete all 4 days.
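
The histogram counts food items rather than people. If you want to check the drop-off directly, one option is to count distinct participants per day. This is only a sketch on toy data: it assumes `seriali` identifies a participant (as the grouping later in this notebook suggests), and the values are invented.

```python
import pandas as pd

# Toy stand-in for the eaten table: one row per food item eaten.
toy_eaten = pd.DataFrame({
    'seriali': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    'DayNo':   [1, 2, 3, 4, 1, 2, 3, 1, 1, 2, 3, 4],
})

# Number of distinct participants who recorded at least one item on each day.
participants_per_day = toy_eaten.groupby('DayNo')['seriali'].nunique()
print(participants_per_day)
```

On the real table the same call would be `eaten.groupby('DayNo')['seriali'].nunique()`.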

Plot a histogram of the day of the week (Monday = 1 etc).

In [None]:
eaten['DayofWeek'].hist(bins=7)
plt.show()

Plot a histogram of the number of kcal of each food item eaten.

In [None]:
fig, ax = plt.subplots()
eaten['KCALS'].hist(ax=ax)
# ax.loglog()
plt.show()

...which has a big spike at zero, due to zero-calorie items like tap water.
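
To see how much of the spike is made up of zero-calorie items, you could measure their share and then look at the non-zero values separately. A small sketch on invented numbers (not the real `KCALS` column):

```python
import pandas as pd

# Invented kcal values standing in for eaten['KCALS'].
toy_kcals = pd.Series([0, 0, 120.5, 0, 310, 45, 0, 980])

# Fraction of items with exactly zero kcal.
zero_share = (toy_kcals == 0).mean()

# The remaining items, which could be histogrammed on their own.
non_zero = toy_kcals[toy_kcals > 0]

print(zero_share, len(non_zero))
```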

Take a look at the FoodAisle column.

In [None]:
for f in np.unique(eaten['FoodAisle']):
    print(f)

Each food is assigned one of the above “aisles” (categories). We can use this to work out the average amount of food per day in each FoodAisle. We can also average the kcal of all the foods in each FoodAisle. The same goes for greenhouse gas emissions (GHGE) and nutrients.
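
As a rough illustration of this kind of per-aisle averaging, here is a sketch using a pandas `groupby` on toy data. The aisle names and numbers are invented, and the table provided with this notebook may well have been computed differently (for example per person per day), so treat this as an idea rather than the actual recipe.

```python
import pandas as pd

# Toy stand-in for the eaten table; values are invented.
toy_eaten = pd.DataFrame({
    'FoodAisle': ['Dairy', 'Dairy', 'Meat', 'Meat', 'Drinks'],
    'KCALS':     [100.0, 200.0, 300.0, 500.0, 0.0],
    'CO2e':      [50.0, 150.0, 900.0, 1100.0, 10.0],
})

# Mean kcal and CO2e of the items in each aisle.
aisle_means = toy_eaten.groupby('FoodAisle')[['KCALS', 'CO2e']].mean()
print(aisle_means)
```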

Read in the table containing the average per FoodAisle...

In [None]:
eaten_table_aisle  =  pd.read_csv (plot_dir + '/eaten_table_aisle_' + tag +'.csv' ,
                         encoding='ISO-8859-1')
eaten_table_aisle

... and the extra column information e.g. RDAs.

In [None]:
eaten_table_aisle_column_info  =  pd.read_csv (plot_dir + '/eaten_table_aisle_column_info_' + tag +'.csv',
                                               encoding='ISO-8859-1')
eaten_table_aisle_column_info

## Plot the amount of KCals consumed per food aisle

In [None]:
eaten_table_aisle.columns

In [None]:
colname = 'KCALS'
# sort table
eaten_table_aisle = eaten_table_aisle.sort_values(colname,ascending=False)

# plot
ax = eaten_table_aisle[colname].plot(kind='bar',color='blue')
ax.set_ylabel(colname)

# label
ax.set_xticklabels(np.array(eaten_table_aisle['FoodAisle'],dtype=str))
plt.show()

# Stack plots

We've already been thinking about ways to visualize the data. Our current best effort is the `stack_plot` function in the `stack_plotting` module (the file `stack_plotting.py`, which you will have pulled with this repository). On a first reading, skip the details of that file and jump to the quick usage example below. Once you have seen the whole notebook you may want to develop better visualizations of the data, and that file could be a useful jumping-off point; it also describes the other options available in our plotting function.

Here we start by bringing the module into the notebook.

In [None]:
from stack_plotting import stack_plot

A quick and simple example of how to use the above function.

In [None]:
stack_plot ([["Heading 1", "A", "B", "C"],
             ["Row 1", 2.3, 3.4, 0.02],
             ["Row 2", 4.3, 9.3, 0.07],
             ["Row 3", 2.7, 8, 0.02],
             ["Row 4", 6.8, 2, 0.03],
             ["Row 5", 3.1, 4.2, 0.07],
             ["Row 6", 0.3, 4.2, 0.04],
             ["Row 7", 3.77, 4.2, 0.06],
             ["Row 8", 9.3, 9.2, 0.04],
             ["Row 9", 8.2, 8.2, 0.01]])

Now to work.  Stack up the GHGE and kcal for each FoodAisle.

In [None]:
#  Assemble three columns.
a = eaten_table_aisle['CO2e'].tolist()
a.insert(0,'CO2e')

b = eaten_table_aisle['FoodAisle'].tolist()
b.insert(0,'Food aisle')

c = eaten_table_aisle['KCALS'].tolist()
c.insert(0,'KCALS')

#  Make the plot nice and big.
plt.rcParams['figure.figsize'] = [12,8]

#  Do the plot.
stack_plot (list (zip (a,b,c)))

#  Put the plot size back, so we don't upset the rest of this notebook.
plt.rcParams['figure.figsize'] = [6,4]

# CO2e by age?

I was interested to understand if different age categories had different environmental impacts.

I averaged over number of individuals in the cohort, and calculated this per day, in order to compensate for any age cohort-specific effect of missing days.
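
The idea can be sketched on toy data: first total the CO2e for each participant on each day, then average those daily totals within each age category. The numbers, the two age bins and their labels are invented for illustration; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Toy stand-in for the eaten table: one row per food item eaten.
toy = pd.DataFrame({
    'seriali': [1, 1, 1, 2, 2, 3],
    'Age':     [20, 20, 20, 60, 60, 25],
    'DayNo':   [1, 1, 2, 1, 1, 1],
    'CO2e':    [1000.0, 2000.0, 5000.0, 3000.0, 1000.0, 2000.0],
})
toy['age_cat'] = pd.cut(toy['Age'], bins=[0, 40, 80], labels=['<=40', '>40'])

# Step 1: total CO2e per participant per day.
# observed=True keeps only the combinations that actually occur.
daily_totals = toy.groupby(['seriali', 'age_cat', 'DayNo'],
                           observed=True)['CO2e'].sum()

# Step 2: mean of those daily totals within each age category.
mean_daily = daily_totals.groupby(level='age_cat', observed=True).mean()
print(mean_daily)
```

Because each participant-day is one data point, a participant with a missing day simply contributes fewer points instead of dragging the average down.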

In [None]:
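eaten['age_cat'] = pd.cut(eaten['Age'], 5)

# group by participant and day, and sum co2e of food intake
# (observed=True keeps only participant/age/day combinations that occur,
#  so empty combinations do not pull the mean towards zero)
by_participant = eaten.groupby(['seriali','age_cat','DayNo'], observed=True)['CO2e'].sum()
day_mean_by_age_cat = by_participant.groupby(level='age_cat', observed=True).mean()
day_mean_by_age_cat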

In [None]:
ax = day_mean_by_age_cat.plot(kind='bar',color='blue')
ax.set_ylim((4000,4500))
ax.set_ylabel('Daily emission from consumed food (g CO2e)')
ax.set_xlabel('Age category')
plt.show()

# Non-Linear Clustering Example

I wanted to know which foods were nutritionally similar, so I decided to represent this using a locally linear embedding (LLE), a non-linear technique for representing high-dimensional data in a 2D space.
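
The cells in this section use scikit-learn's `LocallyLinearEmbedding`. If you have not met it before, this is a minimal sketch of the call on random toy data (not the food tables). The sample count, feature count and `random_state` are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Toy data: 60 samples with 5 features each.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))

# Embed into 2 dimensions, describing each point by its 7 nearest neighbours.
embedding = LocallyLinearEmbedding(n_neighbors=7, n_components=2, random_state=0)
Y_toy = embedding.fit_transform(X_toy)

print(Y_toy.shape)
```

Each row of `Y_toy` is the 2D position of the corresponding row of `X_toy`, which is what gets passed to `plt.scatter`.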


In [None]:
eaten_table_aisle['FoodAisle'].index


In [None]:
# first retrieve appropriate features

nutritional_vals = ['ACAR', 'BCAR', 'BCRYPT', 'BIOT', 'CA',
       'CHO', 'CHOL', 'CL', 'CMON', 'CN3', 'CN6', 'CU', 'ENGFIB', 'FAT', 'FE',
       'FOLT', 'FRUCT', 'GLUC', 'HFE', 'I', 'K', 'LACT', 'MALT','KCALS',
       'MG', 'MN', 'NA', 'NCF', 'NHFE', 'NIACEQU', 'NMILK', 'OSUG',
       'P', 'PANTO', 'PROT', 'RET', 'RIBO', 'SATFA', 'SE', 'STAR', 'SUCR',
       'THIA', 'TOTCAR', 'TOTNIT', 'TOTSUG', 'TRANS', 'VITA', 'VITB12',
       'VITB6', 'VITC', 'VITD', 'VITE', 'WATER', 'ZN']

# one label per row of the aisle table
label = eaten_table_aisle['FoodAisle'].values
X = eaten_table_aisle[nutritional_vals]



In [None]:
from sklearn.manifold import LocallyLinearEmbedding
# recent scikit-learn versions require these as keyword arguments
Y = LocallyLinearEmbedding(n_neighbors=7, n_components=2).fit_transform(X)

# plot
plt.figure(figsize = (12,12))
plt.scatter(Y[:,0],Y[:,1],
            c=eaten_table_aisle['CO2e'],
            cmap='RdYlGn_r',
            s=eaten_table_aisle['CO2e'])
for i, lab in enumerate(label):
    plt.text(Y[i,0],Y[i,1],lab)
plt.show()

Let's apply the same thing to the whole food dataset:


In [None]:
foods  =  pd.read_csv (plot_dir + '/foods_table_' + tag + '.csv',
                         encoding='ISO-8859-1')

# get rid of na rows for now
foods = foods.dropna(axis=0,how='any')

In [None]:
import sklearn.preprocessing as pre
label = foods['FoodName'].values

# using our list of desired features, extract 2d (obs by feature) array and scale
X = pre.RobustScaler().fit_transform(foods[nutritional_vals])

Because this is a big dataset, and probably very difficult to cluster, I'm going to set the number of neighbours to 40 and the number of components (the dimensionality of the output array) to 4. This is all a bit arbitrary, but seems to work...

In [None]:
from sklearn import manifold

# we'd expect about 30 classes, so use a somewhat larger neighbourhood (40)
Y = manifold.LocallyLinearEmbedding(n_neighbors=40,n_components=4,max_iter=10000).fit_transform(X)


In [None]:
from scipy.spatial.distance import cdist

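def find_nearest_foods(X,tabPosition,foodsTable):
    """Print the four foods nearest to row tabPosition of the embedding X."""
    names = np.array(foodsTable['FoodName'])
    co2e = np.array(foodsTable['CO2e'])
    # distances from the chosen food to every food (rows are positional)
    vec = cdist(X[[tabPosition]],X)[0]
    # sort by distance, then drop the chosen food itself from the ranking
    orderedVec = np.argsort(vec)
    orderedVec = orderedVec[orderedVec != tabPosition]
    oNs = names[orderedVec]
    oCo2e = co2e[orderedVec]
    print('The nearest foodstuffs to {} (CO2e={:.2f}) are:\n\
        \n1) {}\n   (CO2e={:.2f})\
        \n2) {}\n   (CO2e={:.2f})\
        \n3) {}\n   (CO2e={:.2f})\
        \n4) {}\n   (CO2e={:.2f})\
        \n ...'.format(names[tabPosition],co2e[tabPosition],
                       oNs[0],oCo2e[0],
                       oNs[1],oCo2e[1],
                       oNs[2],oCo2e[2],
                       oNs[3],oCo2e[3]))
    return None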

So sometimes this works great:

In [None]:
find_nearest_foods(Y,37,foods)

In [None]:
find_nearest_foods(Y,1500,foods)


And quite often it's completely wrong:

In [None]:
find_nearest_foods(Y,700,foods)

So this is clearly wrong, but imagine how great it would be if it was right...