## Using Term Frequency - Inverse Document Frequency to Classify Recipes
### Abstract
The TF-IDF algorithm is used to reweight the dataset so that common ingredients, which have a high document frequency, are given lower priority than rarer ingredients. The formal equation is shown below.

\begin{equation*}
w_{i,j} = tf_{i,j} \cdot \log\left( \frac{N}{df_{i}} \right)
\end{equation*}

Here w<sub>i,j</sub> is the weight of word i in document j, and tf<sub>i,j</sub> is the number of occurrences of word i in document j. N is the total number of documents, and df<sub>i</sub> is the number of documents that contain word i. Since each ingredient appears at most once in each recipe, the term frequency in this formula is 1 for every ingredient present in a recipe, so the weight reduces to log(N / df<sub>i</sub>).
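As a rough illustration of the formula, the sketch below computes the weight for two ingredients in a made-up collection of four recipes. The counts and the names `n_recipes` and `document_frequency` are hypothetical and are not taken from the dataset.

```python
import math

# Hypothetical toy numbers, not taken from the dataset.
n_recipes = 4
document_frequency = {"salt": 4, "saffron": 1}

# tf is 1 for any ingredient present in a recipe, so w = log(N / df).
weights = {
    name: math.log(n_recipes / df)
    for name, df in document_frequency.items()
}
print(weights)
```

An ingredient that appears in every recipe gets a weight of zero, while a rare ingredient gets a larger weight, which is the behaviour described above.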

In [43]:
import pandas as pd
import json
import numpy as np

with open("data/train.json") as json_file:
    training_data = json.load(json_file)
    
    
    #convert data dictionary to dataframe, with columns as ingredients 
    rows = []
    
    for data_point in training_data:
        data_row = {"id": data_point["id"]}
        
        ingredients = data_point["ingredients"]
        
        for ingredient in ingredients:
            data_row[ingredient] = 1
        
        rows.append(data_row)
            
    id_with_ingredients = pd.DataFrame(rows)
    id_with_ingredients.fillna(0, inplace=True)
    id_with_ingredients.set_index("id", inplace = True)
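To make the resulting layout concrete, here is a minimal sketch of the same conversion applied to two invented recipes. The records in `toy_data` are hypothetical. They only mirror the `"id"` and `"ingredients"` keys that the cell above reads from `train.json`.

```python
import pandas as pd

# Hypothetical records mirroring the keys read from train.json.
toy_data = [
    {"id": 1, "ingredients": ["salt", "sugar"]},
    {"id": 2, "ingredients": ["salt", "saffron"]},
]

rows = []
for data_point in toy_data:
    data_row = {"id": data_point["id"]}
    # Mark each ingredient present in this recipe with 1.
    for ingredient in data_point["ingredients"]:
        data_row[ingredient] = 1
    rows.append(data_row)

# Ingredients missing from a recipe come out as NaN, so fill them with 0.
toy_df = pd.DataFrame(rows).fillna(0).set_index("id")
print(toy_df)
```

Each row is a recipe and each column an ingredient, with 1 where the ingredient is present and 0 elsewhere.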

In [69]:
#we need a function returning a dataframe of weighted values for each word in each document
#The input should be a dataframe with terms present in the recipe showing 1 in the respective column
#The output is a dataframe with weighted terms for each ingredient

id_with_ingredients.head()
def generate_tf_idf(input_df):
    #Getting matrix of values for ingredients against recipes
    value_matrix = input_df.values
    output_matrix = np.zeros(value_matrix.shape)
    number_of_documents = value_matrix.shape[0]
    
    #loop through ingredients
    for i in range(value_matrix.shape[1]):
        document_frequency = np.sum(value_matrix[:,i])
        
        #Use tf-idf formula to calculate weighted term, looping through each recipe 
        #Since there is only at most one example of an ingredient, we can just multiply by log(N/df_i)
        
        output_matrix[:,i] =  value_matrix[:,i] * np.log(number_of_documents/document_frequency)
    
    output_df = pd.DataFrame(data=output_matrix,
          index=input_df.index.values,
          columns=input_df.columns.values)
    
    return output_df

tf_idf_df = generate_tf_idf(id_with_ingredients)
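The column loop above can probably be expressed without an explicit loop. The sketch below is one possible vectorized counterpart, checked on a small made-up table. The function name `tf_idf_vectorized` and the `toy` data are hypothetical, and the sketch assumes every ingredient column contains at least one 1, so the document frequency is never zero.

```python
import numpy as np
import pandas as pd

def tf_idf_vectorized(input_df):
    """Hypothetical vectorized counterpart: indicator * log(N / df)."""
    number_of_documents = input_df.shape[0]
    # Column sums give the number of recipes containing each ingredient.
    document_frequency = input_df.sum(axis=0)
    idf = np.log(number_of_documents / document_frequency)
    # Scale every ingredient column by its idf value.
    return input_df.mul(idf, axis=1)

# Hypothetical toy input: salt is in all four recipes, saffron in one.
toy = pd.DataFrame(
    {"salt": [1, 1, 1, 1], "saffron": [0, 0, 1, 0]},
    index=pd.Index([10, 11, 12, 13], name="id"),
)
toy_weights = tf_idf_vectorized(toy)
print(toy_weights)
```

On this toy input the `salt` column is weighted to zero everywhere, and only the recipe containing `saffron` receives a non-zero weight.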