## Using Term Frequency-Inverse Document Frequency to Classify Recipes
### Abstract
The TF-IDF algorithm is used to reweight the dataset so that common ingredients, which have a high document frequency, are given lower priority than rarer ingredients. The weighting formula is shown below.

\begin{equation*}
w_{i,j} = tf_{i,j} \cdot \log\left( \frac{N}{df_{i}} \right)
\end{equation*}

Here w<sub>i,j</sub> is the weight of term i in document j, and tf<sub>i,j</sub> is the number of occurrences of term i in document j. N is the total number of documents, and df<sub>i</sub> is the number of documents that contain term i. Since each ingredient appears at most once in each recipe, tf<sub>i,j</sub> is either 0 or 1, so the weight of an ingredient that is present in a recipe reduces to log(N/df<sub>i</sub>).
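As a small worked example of the weighting above, the sketch below computes log(N/df<sub>i</sub>) for three toy recipes. The recipes and the helper name `idf_weights` are invented for illustration and are not part of the dataset; the natural logarithm is assumed, matching the `np.log` call used later in the notebook.

```python
import math
from collections import Counter

def idf_weights(recipes):
    """Return {ingredient: log(N / df)}, where df counts the recipes containing it."""
    n_documents = len(recipes)
    # set(recipe) ensures an ingredient is counted at most once per recipe
    document_frequency = Counter(
        ingredient for recipe in recipes for ingredient in set(recipe)
    )
    return {
        ingredient: math.log(n_documents / df)
        for ingredient, df in document_frequency.items()
    }

# Hypothetical toy data: three recipes, "salt" appears in all of them
toy_recipes = [
    ["salt", "olive oil", "feta"],
    ["salt", "butter"],
    ["salt", "olive oil", "soy sauce"],
]
weights = idf_weights(toy_recipes)
```

Here `salt` appears in every toy recipe, so its weight is log(3/3) = 0, while `feta` appears in only one recipe and receives the largest weight, log(3).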

In [9]:
import pandas as pd
import json
import numpy as np

with open("data/train.json") as json_file:
    training_data = json.load(json_file)
    
    
    #convert data dictionary to dataframe, with columns as ingredients 
    rows = []
    cuisine_rows = []
    
    for data_point in training_data:
        data_row = {"id": data_point["id"]}
        cuisine_row = {"id": data_point["id"],"Cuisine": data_point["cuisine"]}
        
        ingredients = data_point["ingredients"]
        
        for ingredient in ingredients:
            data_row[ingredient] = 1
        
        rows.append(data_row)
        cuisine_rows.append(cuisine_row)
            
    id_with_ingredients = pd.DataFrame(rows)
    id_with_ingredients.fillna(0, inplace=True)
    id_with_ingredients.set_index("id", inplace = True)
    
    id_with_cuisine = pd.DataFrame(cuisine_rows)
    id_with_cuisine.set_index("id", inplace = True)

In [10]:
#we need a function returning a dataframe of weighted values for each word in each document
#The input should be a dataframe with terms present in the recipe showing 1 in the respective column
#The output is a dataframe with weighted terms for each ingredient

def generate_tf_idf(input_df):
    #Getting matrix of values for ingredients against recipes
    value_matrix = input_df.values
    output_matrix = np.zeros(value_matrix.shape)
    number_of_documents = value_matrix.shape[0]
    
    #loop through ingredients
    for i in range(value_matrix.shape[1]):
        document_frequency = np.sum(value_matrix[:,i])
        
        #Use tf-idf formula to calculate weighted term, looping through each recipe 
        #Since there is only at most one example of an ingredient, we can just multiply by log(N/df_i)
        
        output_matrix[:,i] =  value_matrix[:,i] * np.log(number_of_documents/document_frequency)
    
    output_df = pd.DataFrame(data=output_matrix,
          index=input_df.index.values,
          columns=input_df.columns.values)
    
    return output_df

tf_idf_df = generate_tf_idf(id_with_ingredients)
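The column-by-column loop above could also be written without an explicit loop. The following is a minimal sketch, not the notebook's own implementation: the function name `generate_tf_idf_vectorised` and the toy frame are hypothetical, and it assumes every ingredient column contains at least one 1 (otherwise the division yields infinity and the product yields NaN).

```python
import numpy as np
import pandas as pd

def generate_tf_idf_vectorised(input_df):
    """Weight a 0/1 recipe-by-ingredient frame by log(N / df) for each column."""
    number_of_documents = input_df.shape[0]
    # Column sums of a 0/1 frame give the number of recipes containing each ingredient
    document_frequency = input_df.sum(axis=0)
    idf = np.log(number_of_documents / document_frequency)
    # Multiplying a DataFrame by a Series aligns the Series index with the columns
    return input_df * idf

# Hypothetical toy input: three recipes, two ingredients
toy = pd.DataFrame(
    {"salt": [1, 1, 1], "feta": [1, 0, 0]},
    index=[101, 102, 103],
)
toy_weighted = generate_tf_idf_vectorised(toy)
```

Because pandas aligns the weights with the column labels, the result keeps the same index and columns as the input, which is the same shape of output that `generate_tf_idf` returns.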

## Dimensionality Reduction
Currently, the number of distinct ingredients (one column per ingredient) is too large to visualise or analyse efficiently. For that reason, the PCA (Principal Component Analysis) implementation from the scikit-learn library is used to reduce the data to three dimensions for visualisation purposes.
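When projecting onto only three components, it can be useful to check how much of the original variance is retained. The sketch below shows one way to do this with the `explained_variance_ratio_` attribute of scikit-learn's `PCA`. It runs on synthetic random data as a stand-in for the TF-IDF matrix, so the printed numbers say nothing about the recipe dataset itself; the variable names are illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the TF-IDF matrix: 200 "recipes" by 50 "ingredients"
rng = np.random.default_rng(0)
synthetic_data = rng.random((200, 50))

pca_check = PCA(n_components=3)
pca_check.fit(synthetic_data)

# Fraction of the total variance captured by each of the three components
retained = pca_check.explained_variance_ratio_
print(retained, retained.sum())
```

Applying the same check to the fitted `pca` object in the next cell would indicate how faithfully the three-dimensional plot represents the full ingredient space.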

In [11]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)

#data to transform:
unreduced_data = tf_idf_df.values

reduced_data = pca.fit_transform(unreduced_data)
principal_df = pd.DataFrame(data = reduced_data,
                           index = id_with_ingredients.index.values,
                           columns = ['Principal Component 1', 'Principal Component 2', 'Principal Component 3'],
                          )

## Visualising the data

Since the data now has three dimensions, a 3D scatter plot showing how the cuisines cluster can be generated, as shown below.

In [12]:
id_with_cuisine.head()

| id | Cuisine |
|---|---|
| 10259 | greek |
| 25693 | southern_us |
| 20130 | filipino |
| 22213 | indian |
| 13162 | indian |


In [13]:
# build a dataframe combining the principal components with the cuisine label of each recipe

reduced_df = pd.concat([principal_df, id_with_cuisine], axis=1)
reduced_df.head()

| id | Principal Component 1 | Principal Component 2 | Principal Component 3 | Cuisine |
|---|---|---|---|---|
| 10259 | -0.564699 | -0.620127 | -0.503048 | greek |
| 25693 | 0.803503 | -0.331466 | 0.59691 | southern_us |
| 20130 | -0.038378 | 0.600927 | 0.665032 | filipino |
| 22213 | 0.281238 | 0.31341 | 0.059372 | indian |
| 13162 | -1.724282 | -0.445946 | 3.634373 | indian |


In [20]:
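# plotly.plotly was moved to the separate chart_studio package in plotly 4;
# on older plotly versions, use "import plotly.plotly as py" instead
import chart_studio.plotly as py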
import plotly.graph_objs as go
from random import randint 

# compute the list of cuisines once and assign each a random RGB colour
cuisines = reduced_df["Cuisine"].unique()
colours = ['rgb({},{},{})'.format(randint(0,255),randint(0,255),randint(0,255)) for _ in cuisines]

data = []

for cuisine, colour in zip(cuisines, colours):
    # select the rows belonging to this cuisine once, then pick each component
    subset = reduced_df[reduced_df['Cuisine'] == cuisine]

    x = subset['Principal Component 1']
    y = subset['Principal Component 2']
    z = subset['Principal Component 3']
    
    trace = dict(
        name = cuisine,
        x = x, y = y, z = z,
        type = "scatter3d",    
        mode = 'markers',
        marker = dict( size=3, color=colour, line=dict(width=0) ) )
    data.append( trace )
    
# the layout is the same for every trace, so it is built once, outside the loop
layout = dict(
    width=800,
    height=550,
    autosize=False,
    title='Recipe dataset',
    scene=dict(
        xaxis=dict(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        ),
        yaxis=dict(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        ),
        zaxis=dict(
            gridcolor='rgb(255, 255, 255)',
            zerolinecolor='rgb(255, 255, 255)',
            showbackground=True,
            backgroundcolor='rgb(230, 230,230)'
        ),
        aspectratio = dict( x=1, y=1, z=1 ),
        aspectmode = 'manual'        
    ),
)

fig = dict(data=data, layout=layout)

py.iplot(fig, filename='recipe_visualisation', validate=False, world_readable = True)