# Ingredient Prioritisation & Feature Extraction
This notebook is intended to supersede three notebooks: *prioritise_ingredients*, *ingredients_store* and *save_ingredient_words*.
At creation it contains every unique code section from those notebooks, plus additions.

The aim of this notebook is to explore which features can be derived from the training data set, and how they affect classification accuracy.

The most accurate classifier found in benchmarking is included at the end of this notebook, so that any new feature set can be compared against the accuracy achieved with the simple `CountVectorizer` approach.

### Section 1: save_ingredient_words

In [1]:
# work with ingredients to identify most common words /good candidates for stemming and classifying
import re
import pandas as pd
import numpy as np
import pickle


with open('./Dataset/unique_ingredients.txt', 'r') as file:
    content = file.readlines()

ingredientwords = {}
print(('We are starting with %d unique ingredients') % len(content))
#print(content)

We are starting with 6714 unique ingredients


In [2]:
#ingredienttowords={}
#for ingredient in content:
#    #print(ingredient)
#    words = re.findall(r'\w+', ingredient)
#    [x.lower() for x in words]
#    #print(words)
#    ingredienttowords.setdefault(ingredient.strip(),[]).append(words)

In [None]:
##do not run this unless needed! writing to workingingredients
#inputingredientwords = ingredienttowords
#output = open('./ProcessedData/workingingredients.pkl', 'wb')
#pickle.dump(inputingredientwords, output)
#output.close()

In [4]:
pkl_file = open('./ProcessedData/workingingredients.pkl', 'rb')
workingingredients = pickle.load(pkl_file)
pkl_file.close()

ingredientslist=[]
ingredientsphrase=[]
for key, value in workingingredients.items():
    ingredientsphrase.append(str(key))
    ingredientslist.append(value)

print("Each of the %d unique ingredients broken into separate words" % len(ingredientsphrase))
df = pd.DataFrame({'Ingredient': ingredientsphrase, 'Ingredient list of words':ingredientslist})
df

Each of the 6714 unique ingredients broken into separate words


Unnamed: 0,Ingredient,Ingredient list of words
0,leg quarters,"[[leg, quarters]]"
1,loin pork roast,"[[loin, pork, roast]]"
2,verjuice,[[verjuice]]
3,cardamon,[[cardamon]]
4,rosewater,[[rosewater]]
5,venison roast,"[[venison, roast]]"
6,"chop green chilies, undrain","[[chop, green, chilies, undrain]]"
7,light butter,"[[light, butter]]"
8,kidney,[[kidney]]
9,pork tenderloin,"[[pork, tenderloin]]"


**TODO: insert the word2vec loop here.**

### Section 2: ingredients_store

In [7]:
countwordsappearing = {}
#count how many times the ingredient word appears across the corpus
for key, value in workingingredients.items():
    ingredientswords = value
    for wordlist in ingredientswords:
        #making every ingredient word lower case, all occurrences added together
        for word in wordlist:
            word = word.lower()
            countwordsappearing[word] = countwordsappearing.get(word, 0) + 1

In [None]:
##do not run this unless needed! writing to countedingredientwords
#output = open('./ProcessedData/countedingredientwords.pkl', 'wb')
#pickle.dump(countwordsappearing, output)
#output.close()

In [8]:
ingredientword=[]
occurrences=[]
for key, value in countwordsappearing.items():
    ingredientword.append(str(key))
    occurrences.append(value)

#df = pd.DataFrame(countwordsappearing.items(), 
 #               columns=['Ingredient word', 'Occurrences']).sort_values(by=['Occurrences'],ascending=False)

print("Counting word occurrences")
df = pd.DataFrame({'Ingredient': ingredientword, 'Occurrences':occurrences}).sort_values(
    by=['Occurrences'],ascending=False)
df

Counting word occurrences


Unnamed: 0,Ingredient,Occurrences
124,cheese,203
50,sauce,198
401,chicken,195
79,fat,183
78,low,149
98,mix,122
58,cream,116
333,sodium,112
167,rice,110
65,dried,100


## Important note
I have included the code that makes up the bulk of this file as is, for completeness. However, I think it should be modified to count word occurrences across the *recipes* rather than across the *unique ingredients*. At present we flag words like madras for removal because of their low occurrence count, even though madras appears in 36 recipes and is very likely a strong indicator for one of our cuisines.
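
A minimal sketch of the recipe-level count suggested above, assuming each recipe is a list of ingredient phrases (as in the `ingredients` column of `train.json`). The toy `recipes` list and the function name `recipe_word_counts` are illustrative and not part of the original notebooks.

```python
import re
from collections import Counter

# Toy stand-in for trainData['ingredients']; the real column would be passed instead.
recipes = [
    ["madras curry powder", "basmati rice", "chicken"],
    ["madras curry paste", "coconut milk"],
    ["cheddar cheese", "chicken", "low fat sour cream"],
]

def recipe_word_counts(recipes):
    """Count, for each word, the number of recipes in which it appears."""
    counts = Counter()
    for ingredients in recipes:
        # Collect the distinct words of one recipe so a word counts once per recipe.
        words = set()
        for ingredient in ingredients:
            words.update(re.findall(r"\w+", ingredient.lower()))
        counts.update(words)
    return counts

recipe_counts = recipe_word_counts(recipes)
print(recipe_counts["madras"])  # 2
print(recipe_counts["rice"])    # 1
```

Thresholds such as the 73-occurrence cut-off used below would need to be re-tuned for counts taken at this level, since they are on a different scale from the unique-ingredient counts.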

In [9]:
#tag ingredient words using nltk and decide whether to add them to stopwords
import pickle
import pandas as pd
import nltk
#make sure these library items are downloaded
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
#nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import wordnet as wn


# read python dict back from the file
pkl_file = open('./ProcessedData/countedingredientwords.pkl', 'rb')
countedingredients = pickle.load(pkl_file)
pkl_file.close()
#print(countedingredients)

labelledwords = {}
rarewords = []
#initialize with some manual terms
mystopwords = ["Oven","it","Cuisine"]
regularwords = []

for key,value in countedingredients.items():
    # assumes words that appear over a certain threshold to be insignificant, here 73 times
    if value >=73:
        mystopwords.append(str(key))
        #labelledwords[key.lower()]=(value,"commonstopwords")
        labelledwords.setdefault(key,(value,"commonstopwords"))
        labelledwords[key]=(labelledwords[key],"commonstopwords")
    # marking all unique words that appear only once to be significant/rare
    elif value<2:
        rarewords.append(str(key))
        labelledwords[key]=(value,"rare")
    elif value >=2 and value <73:
        regularwords.append(str(key))
        labelledwords[key]=(value,"regular")

with open('./Dataset/unique_cuisines.txt', 'r') as file:
    cuisinescontent = file.readlines()

cuisines = []
for cuisine in cuisinescontent:
    cuisines.append(str(cuisine).lower().strip())

#processing the rare and medium words to see whether they are verbs, numbers or adjectives. do not stop country words
#beware that italian is a cuisine word that is in very common words as a stop word
countryrelatedwords = ["portuguese","belgian","english"]

#only add adjectives to stop words if they are not a cuisine word 
for word in rarewords:
    processed = pos_tag([word])
    #doing some rare word tagging - split out the tag identifying the word type
    tag = ", ".join(t for w, t in processed)
    #print(word,tag)
    if str(tag)=="CD" or str(tag)=="VBN" or str(tag)=="JJ":
        #print("this tagged word is a candidate for stopwords")
        if word in cuisines:
            #print("This rare word is a cuisine, I refuse to add it to stop words!")
            labelledwords[word]=(labelledwords[word],"cuisine")    
        elif word in countryrelatedwords:
            #print("This rare word is a country word, I refuse to add it to stop words!")
            labelledwords[word]=(labelledwords[word],"country")
        else:
            #print("This rare word is not a cuisine!")
            #print(word)
            #print("I found a number or verb or adjective which is not a cuisine")
            mystopwords.append(word)
            #try:
            labelledwords[word]=(labelledwords[word],"taggedstopwordsrare")
    else:
        #print("this rare word is not tagged in a relevant way")
        labelledwords[word]=(labelledwords[word],"nottagged")
        
for word in regularwords:
    processed = pos_tag([word])
    #split out the tag identifying the word type
    #print("doing some regular word tagging")
    tag = ", ".join(t for w, t in processed)
    #print(tag)
    if str(tag)=="CD" or str(tag)=="VBN" or str(tag)=="JJ":
        #print("this tagged word is a candidate for stopwords")
        if word in cuisines:
            #print("This regularly appearing word is a cuisine, will not add to stop words!")
            labelledwords[word]=(labelledwords[word],"cuisine")
        elif word in countryrelatedwords:
            #print("This regular word is a country word, I refuse to add it to stop words!")
            labelledwords[word]=(labelledwords[word],"country")
        else:
            #print("The regular word is not a cuisine!")
            #print("I found a number or verb or adjective which is not a cuisine")
            labelledwords[word]=(labelledwords[word],"taggedstopwordsregular")
            mystopwords.append(word)
    else:
        #print("this regular word is not tagged in a relevant way")
        labelledwords[word]=(labelledwords[word],"nottagged")

#print("checking labelled stopwords")
#for key,value in labelledwords.items():
 #   print("key",key)
  #  print("value",value) 

##do not run this unless needed! writing to labelledwords
#output = open('./ProcessedData/labelledwords.pkl', 'wb')
#pickle.dump(labelledwords, output)
#output.close()

ingredientword=[]
occurrences=[]
category=[]
status=[]

for key, value in labelledwords.items():
    ingredientword.append(str(key))
    occurrencescategory = value[0]
    occurrences.append(occurrencescategory[0])
    category.append(occurrencescategory[1])
    status.append(value[1])

print("All counted ingredient words labelled according to whether they should be added to stop words")
 
df = pd.DataFrame({'ingredientword':ingredientword,'Occurrences':occurrences,'Category':category,'Status':status}).sort_values(by=['Occurrences'],ascending=False)

total =0
for itemprint in ['commonstopwords','country','taggedstopwordsrare','nottagged','cuisine','taggedstopwordsregular']:
    toprint = df.Status.value_counts()[itemprint]
    print(toprint, itemprint)
    total = total+toprint
    
print(total," unique ingredient words have been classified")

df

All counted ingredient words labelled according to whether they should be added to stop words
34 commonstopwords
3 country
91 taggedstopwordsrare
2798 nottagged
10 cuisine
134 taggedstopwordsregular
3070  unique ingredient words have been classified


Unnamed: 0,Category,Occurrences,Status,ingredientword
124,commonstopwords,203,commonstopwords,cheese
50,commonstopwords,198,commonstopwords,sauce
401,commonstopwords,195,commonstopwords,chicken
79,commonstopwords,183,commonstopwords,fat
78,commonstopwords,149,commonstopwords,low
98,commonstopwords,122,commonstopwords,mix
58,commonstopwords,116,commonstopwords,cream
333,commonstopwords,112,commonstopwords,sodium
167,commonstopwords,110,commonstopwords,rice
65,commonstopwords,100,commonstopwords,dried


### Section 3: prioritise_ingredients

In [8]:
#remove stop words from the ingredients
import pickle
import pandas as pd

# read python dict back from the file
pkl_file = open('./ProcessedData/labelledwords.pkl', 'rb')
labelledwords = pickle.load(pkl_file)
pkl_file.close()

#print("here is the labelled words")
#for key,value in labelledwords.items():
 #   print(key)
  #  print(value)

# read python dict back from the file
pkl_file = open('./ProcessedData/workingingredients.pkl', 'rb')
workingingredients = pickle.load(pkl_file)
pkl_file.close()

#print("here are the working ingredients")
#for ingredientword, details in workingingredients.items():
 #   print(ingredientword)
  #  print(details)

In [9]:
#print("here are the ingredients")
countme=0
ingredientslist=[]
ingredientsphrase=[]
for ingredientword, wordlist in workingingredients.items():
    for item,metadata in labelledwords.items():
        if item in wordlist[0]:
            if metadata[1] in ['commonstopwords','taggedstopwordsrare','taggedstopwordsregular']:
                #print("This word needs to be removed from the ingredient:", item)
                wordlist[0].remove(str(item))
                workingingredients[ingredientword]=workingingredients[ingredientword],wordlist[0]
                countme=countme+1
            else:
                #print("I'm not removing this word")
                workingingredients[ingredientword]=workingingredients[ingredientword],wordlist[0]
    
print("I removed stopwords from each ingredient, here is the end result:")
print("I removed %d words from the ingredients" % countme)

I removed stopwords from each ingredient, here is the end result:
I removed 4265 words from the ingredients
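
The nested loop above mutates each word list in place and re-wraps the dictionary entry on every pass, which is why some entries end up as nested tuples. One possible alternative, sketched under the assumption that the two dictionaries have the shapes shown earlier, builds the stop-word set once and stores an `(original, kept)` pair per ingredient. The miniature dictionaries and their counts are hypothetical stand-ins for the pickled data, and `strip_stopwords` is an illustrative name.

```python
# Hypothetical miniature versions of the two pickled dictionaries.
workingingredients = {
    "loin pork roast": [["loin", "pork", "roast"]],
    "pork tenderloin": [["pork", "tenderloin"]],
    "cheese": [["cheese"]],
}
labelledwords = {
    "pork": ((80, "commonstopwords"), "commonstopwords"),
    "cheese": ((203, "commonstopwords"), "commonstopwords"),
    "loin": ((5, "regular"), "nottagged"),
    "roast": ((30, "regular"), "nottagged"),
    "tenderloin": ((12, "regular"), "nottagged"),
}

STOP_STATUSES = {"commonstopwords", "taggedstopwordsrare", "taggedstopwordsregular"}

def strip_stopwords(workingingredients, labelledwords):
    """Return {ingredient: (original_words, kept_words)} and the number of words removed."""
    stopwords = {w for w, (_, status) in labelledwords.items() if status in STOP_STATUSES}
    cleaned = {}
    removed = 0
    for ingredient, nested in workingingredients.items():
        original = nested[0]
        # Build a new list instead of removing from the original in place.
        kept = [w for w in original if w.lower() not in stopwords]
        removed += len(original) - len(kept)
        cleaned[ingredient] = (original, kept)
    return cleaned, removed

cleaned, removed = strip_stopwords(workingingredients, labelledwords)
print(cleaned["loin pork roast"])  # (['loin', 'pork', 'roast'], ['loin', 'roast'])
print(removed)                     # 3
```

Because the original list is left untouched, the before and after columns of a summary table can be read directly from each pair, and an ingredient to disregard is simply one whose kept list is empty.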


In [10]:
# debugging check: relies on `wordlist` left over from the last iteration of the loop in the previous cell
for item,metadata in labelledwords.items():
    if item in wordlist[0]:
        if metadata[1] in ['commonstopwords','taggedstopwordsrare','taggedstopwordsregular']:
            print("This word needs to be removed from the ingredient:", item)
    

In [12]:
ingredientsphrase=[]
ingredientslist=[]
modifiedingredientslist=[]
empty=0
for key, value in workingingredients.items():
    ingredientsphrase.append(key)
    ingredientslist.append(value[0][0])
    try:
        modifiedingredientslist.append(value[1])
        if not value[1]: 
            empty=empty+1
    except IndexError:
        modifiedingredientslist.append([])
        empty=empty+1
    
print("I have found %d ingredients to disregard altogether" % empty)
df = pd.DataFrame({'Ingredient': ingredientsphrase, 'Ingredient list of words':ingredientslist,
                   'Modified list of words':modifiedingredientslist})
df

I have found 506 ingredients to disregard altogether


Unnamed: 0,Ingredient,Ingredient list of words,Modified list of words
0,leg quarters,"[[leg, quarters]]","[leg, quarters]"
1,loin pork roast,"([[loin, roast]], [loin, roast])","[loin, roast]"
2,verjuice,[verjuice],[verjuice]
3,cardamon,[cardamon],[cardamon]
4,rosewater,[rosewater],[rosewater]
5,venison roast,"[[venison, roast]]","[venison, roast]"
6,"chop green chilies, undrain","(([['chop', 'chilies', 'undrain']], [chop, chi...","[chop, chilies, undrain]"
7,light butter,"[[light, butter]]","[light, butter]"
8,kidney,[kidney],[kidney]
9,pork tenderloin,[[tenderloin]],[tenderloin]


### Section 4: Classification accuracy testing

Benchmarks are as follows. To test a feature extraction pipeline, comment out the four lines that generate `X_train_proc` and `X_test_proc`, replace them with your own feature extraction pipeline, and rerun.

**Headline accuracy: 78.37%**

Cuisine-specific results, ordered by F-score. Row names follow `enc.classes_`, the alphabetical class order that `LabelEncoder` assigns and that `classification_report` uses for the encoded labels.

                    Precision   Recall    F-score   Support
         mexican       0.90      0.93      0.91      1331
          indian       0.87      0.87      0.87       605
         italian       0.79      0.90      0.84      1550
         chinese       0.80      0.84      0.82       529
        moroccan       0.86      0.76      0.80       152
        jamaican       0.91      0.70      0.79       117
          korean       0.82      0.76      0.79       165
            thai       0.78      0.77      0.78       292
    cajun_creole       0.82      0.71      0.76       327
        japanese       0.82      0.69      0.75       276
     southern_us       0.68      0.81      0.74       828
           greek       0.78      0.69      0.73       229
        filipino       0.77      0.61      0.68       171
      vietnamese       0.76      0.55      0.64       163
          french       0.59      0.63      0.61       508
           irish       0.68      0.52      0.59       121
       brazilian       0.73      0.48      0.58       102
         spanish       0.69      0.48      0.57       201
         russian       0.60      0.38      0.46       104
         british       0.62      0.37      0.46       184
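
As one illustration of what a replacement feature extraction step might look like, the sketch below tokenises the `;`-joined `all_ingredients` string into one token per ingredient phrase, after dropping stop words. The function name `ingredient_tokens` and the stop list are hypothetical. A function of this shape could be supplied through the `tokenizer` argument of `CountVectorizer`, though that wiring is not tested here.

```python
import re

# Hypothetical stop list; in this notebook it could be built from mystopwords instead.
STOPWORDS = {"fresh", "chopped", "ground"}

def ingredient_tokens(all_ingredients, stopwords=STOPWORDS):
    """Split a ';'-joined recipe string into one token per ingredient phrase."""
    tokens = []
    for phrase in all_ingredients.split(";"):
        # Lower-case, split into words and drop stop words within the phrase.
        words = [w for w in re.findall(r"\w+", phrase.lower()) if w not in stopwords]
        if words:
            # Re-join the remaining words so the phrase stays a single feature.
            tokens.append("_".join(words))
    return tokens

print(ingredient_tokens("romaine lettuce;Black Olives;fresh ground pepper"))
# ['romaine_lettuce', 'black_olives', 'pepper']
```

Keeping each phrase as a single token trades vocabulary size for specificity, so comparing it against the word-level baseline on the same train-test split would show whether that trade is worthwhile.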

In [2]:
import csv, os, pandas as pd, pathlib, pprint, json, numpy as np
import bagOfWords as bow
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Make sure the training and test JSON files exist

data_directory = os.path.join(os.getcwd(), "Dataset")
train_data_file_path = os.path.join(data_directory, "train.json")
test_data_file_path = os.path.join(data_directory, "test.json")
    
if(not pathlib.Path(train_data_file_path).is_file()):
    raise Exception("Missing train.json file in " + data_directory)

    
if(not pathlib.Path(test_data_file_path).is_file()):
    raise Exception("Missing test.json file in " + data_directory)
    
# Read JSON training data

with open(train_data_file_path, 'r') as f:
     trainData = pd.read_json(f)
f.closed

with open(test_data_file_path, 'r') as f:
     testData = pd.read_json(f)
f.closed

True

In [3]:
# may also want to replace part of this code
trainData['all_ingredients'] = trainData['ingredients'].map(";".join)
cuisines = trainData['cuisine'].value_counts().index

# 80/20 train-test split with random seed 37
X_train, X_test, y_train, y_test = train_test_split(trainData['all_ingredients']
                                                    , trainData['cuisine'], test_size=0.2, random_state=37)

enc = LabelEncoder()
y_train_proc = enc.fit_transform(y_train.values)
y_test_proc = enc.transform(y_test.values)

In [6]:
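# example stop words; substitute mystopwords from Section 2 to apply the full list
stopwords = ['romaine','black']
trainData['mod_ingredients'] = trainData['all_ingredients']
for word in stopwords:
    # literal substring replacement, so 'black' is also stripped from inside longer words
    trainData['mod_ingredients'] = trainData['mod_ingredients'].str.replace(word,'', regex=False)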
    
trainData.head()

Unnamed: 0,cuisine,id,ingredients,all_ingredients,mod_ingredients
0,greek,10259,"[romaine lettuce, black olives, grape tomatoes...",romaine lettuce;black olives;grape tomatoes;ga...,lettuce; olives;grape tomatoes;garlic;pepper;...
1,southern_us,25693,"[plain flour, ground pepper, salt, tomatoes, g...",plain flour;ground pepper;salt;tomatoes;ground...,plain flour;ground pepper;salt;tomatoes;ground...
2,filipino,20130,"[eggs, pepper, salt, mayonaise, cooking oil, g...",eggs;pepper;salt;mayonaise;cooking oil;green c...,eggs;pepper;salt;mayonaise;cooking oil;green c...
3,indian,22213,"[water, vegetable oil, wheat, salt]",water;vegetable oil;wheat;salt,water;vegetable oil;wheat;salt
4,indian,13162,"[black pepper, shallots, cornflour, cayenne pe...",black pepper;shallots;cornflour;cayenne pepper...,pepper;shallots;cornflour;cayenne pepper;onio...


In [15]:
# this is the part that should be commented out and replaced with feature extraction pipeline
cv = CountVectorizer()
X_train_proc = cv.fit_transform(X_train.values)
X_test_proc = cv.transform(X_test.values)
X_train_proc.shape

(31819, 2918)

In [16]:
from sklearn.linear_model import LogisticRegression

logistic = LogisticRegression()
logistic.fit(X_train_proc, y_train_proc)

logistic.score(X_test_proc, y_test_proc)

0.78365807668133247

In [21]:
log_pred = logistic.predict(X_test_proc)
# LabelEncoder numbers the classes alphabetically, so the row names must come from enc.classes_;
# the frequency-ordered `cuisines` index would attach each row to the wrong cuisine
print(classification_report(y_test_proc, log_pred, target_names=enc.classes_))

              precision    recall  f1-score   support

   brazilian       0.73      0.48      0.58       102
     british       0.62      0.37      0.46       184
cajun_creole       0.82      0.71      0.76       327
     chinese       0.80      0.84      0.82       529
    filipino       0.77      0.61      0.68       171
      french       0.59      0.63      0.61       508
       greek       0.78      0.69      0.73       229
      indian       0.87      0.87      0.87       605
       irish       0.68      0.52      0.59       121
     italian       0.79      0.90      0.84      1550
    jamaican       0.91      0.70      0.79       117
    japanese       0.82      0.69      0.75       276
      korean       0.82      0.76      0.79       165
     mexican       0.90      0.93      0.91      1331
    moroccan       0.86      0.76      0.80       152
     russian       0.60      0.38      0.46       104
 southern_us       0.68      0.81      0.74       828
     spanish       0.69    