# Matching FooDB to ASA24 Ingredient Descriptions
## Step 5: Matching ASA24 to FooDB
### Part 2: Ingredient Description Dependency Parsing for missing code matches

__Required Input Files__

  - *asa_descripcleaned_codematched.csv* - Output from 04_FooDB_FullMatch_Part1, all ASA24 ingredient descriptions from FL100

__Information__  
This script runs a natural language processing algorithm on ASA24 ingredient descriptions from the FL100 study. 

    1) Apply nlp to each row and examine parts of speech, tags, and dependencies for ingredient tokens.
    2) Parse dependencies and add them as columns to the food description dataframe.


__Output__
  
  - *asa_foodb_descrip_dependencies.csv* - Input data but with dependency token columns added. 

In [22]:
#Load modules
import os
import pandas as pd
import numpy as np
import spacy
import en_core_web_md
nlp = en_core_web_md.load()

In [None]:
#Ensure working directory is the project folder
mapping = os.getcwd()
mapping

'/Users/stephanie.wilson/Desktop/SYNC/Scripts/FooDB_FNDDS'

In [4]:
#Load the code-matched ASA24 ingredient descriptions (output of Part 1), including rows still missing FooDB matches
asa = pd.read_csv('data/asa_descripcleaned_codematched.csv')

### 1) Apply nlp to each row, and examine POS

In [5]:
#Apply nlp function on each food description that will eventually be searched against FooDB
asa_nlp = asa['Ingredient_description'].apply(lambda x: nlp(x))

In [None]:
#To see all attributes available on a parsed description, run dir() on the first row
dir(asa_nlp[0])


 

In [13]:
# Examine parts of speech for one example food description (row 25)
for tok in asa_nlp[25]:
    pos = tok.pos_ #coarse-grain POS
    tag = tok.tag_ #fine-grain POS
    dep = tok.dep_ #word dependency
    print(
        'Token:', tok.text,
        '\nPart of Speech:', pos,
        '\nTag:', tag, ",", spacy.explain(tag),
        '\nDependency:', dep, ",", spacy.explain(dep),
        "\n"
    )

Token: frozen 
Part of Speech: ADJ 
Tag: JJ , adjective (English), other noun-modifier (Chinese) 
Dependency: amod , adjectival modifier 

Token: novelties 
Part of Speech: NOUN 
Tag: NNS , noun, plural 
Dependency: compound , compound 

Token: juice 
Part of Speech: NOUN 
Tag: NN , noun, singular or mass 
Dependency: compound , compound 

Token: type 
Part of Speech: NOUN 
Tag: NN , noun, singular or mass 
Dependency: compound , compound 

Token: orange 
Part of Speech: NOUN 
Tag: NN , noun, singular or mass 
Dependency: ROOT , root 



### 2) Dependency Parsing of Ingredient Descriptions

Create functions to pull out specific dependencies

In [14]:
# Shared helper: return every token in a parsed doc whose dependency label matches
def get_dep(doc, label):
    return [tok for tok in doc if tok.dep_ == label]

# get roots
def get_root(doc):
    return get_dep(doc, 'ROOT')

# get compound
def get_compound(doc):
    return get_dep(doc, 'compound')

# get nominal subjects
def get_nsubj(doc):
    return get_dep(doc, 'nsubj')

# get adjectival modifier
def get_amod(doc):
    return get_dep(doc, 'amod')

# get noun modifier
def get_nmod(doc):
    return get_dep(doc, 'nmod')

# get noun phrase as adverbial modifier
def get_npadvmod(doc):
    return get_dep(doc, 'npadvmod')

# get nominal subject (passive)
def get_nsubjpass(doc):
    return get_dep(doc, 'nsubjpass')


Examining the dependency output showed that tokens labeled 'compound' described the food accurately, albeit in general terms. However, not all ingredient descriptions had a compound token, so a priority order was needed to decide which dependency to fall back on when describing a food.

The following order was found, by manual inspection, to give the most accurate descriptions.  
  - compound > nsubjpass > nmod > nsubj > amod > npadvmod > ROOT
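
The notebook does not show how this order is consumed downstream, so the following is only a sketch of one way it might be applied: walk the labels in priority order and take the first one that has any tokens. The function name `first_available` and the dictionary layout are assumptions for illustration, and plain strings stand in for spaCy tokens.

```python
# Priority order copied from the list above
PRIORITY = ['compound', 'nsubjpass', 'nmod', 'nsubj', 'amod', 'npadvmod', 'ROOT']

def first_available(deps, order=PRIORITY):
    """Return (label, tokens) for the highest-priority non-empty dependency list."""
    for label in order:
        tokens = deps.get(label, [])
        if tokens:
            return label, tokens
    return None, []

# row_a mirrors the "frozen novelties juice type orange" parse shown earlier;
# row_b is a made-up row with no compound tokens
row_a = {'compound': ['novelties', 'juice', 'type'], 'amod': ['frozen'], 'ROOT': ['orange']}
row_b = {'compound': [], 'amod': ['frozen'], 'ROOT': ['peas']}

print(first_available(row_a))
print(first_available(row_b))
```

For `row_a` the compound tokens win; for `row_b` the search falls through to `amod` because no higher-priority label has tokens.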

In [15]:
#Apply get_compound to all ingredient descriptions
asa_nlp_dep = pd.DataFrame(asa_nlp.apply(lambda x: get_compound(x)))
asa_nlp_dep = asa_nlp_dep.rename(columns = {'Ingredient_description':'compound'}) 

In [17]:
# Ensure order as listed above
# Compound must be first
asa_nlp_dep['nsubjpass'] = asa_nlp.apply(lambda x: get_nsubjpass(x))
asa_nlp_dep['nmod'] = asa_nlp.apply(lambda x: get_nmod(x))
asa_nlp_dep['nsubj'] = asa_nlp.apply(lambda x: get_nsubj(x))
asa_nlp_dep['amod'] = asa_nlp.apply(lambda x: get_amod(x))
asa_nlp_dep['npadvmod'] = asa_nlp.apply(lambda x: get_npadvmod(x))
asa_nlp_dep['ROOT'] = asa_nlp.apply(lambda x: get_root(x))

In [18]:
#Ensure correct order
list(asa_nlp_dep.columns)

['compound', 'nsubjpass', 'nmod', 'nsubj', 'amod', 'npadvmod', 'ROOT']

In [19]:
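# Re-attach the description as a merge key, then add the dependency columns to the ASA24 table
# Note: merging on the description assumes each Ingredient_description is unique; duplicates would multiply rows
asa_nlp_dep['Ingredient_description'] = asa['Ingredient_description']
asa_updated = pd.merge(asa, asa_nlp_dep, on = 'Ingredient_description', how = 'left')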
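
Because the dependency dataframe shares its row index with `asa`, a key-based merge is only safe when every description is unique. The toy example below, with made-up values rather than real spaCy output, sketches what happens when a description repeats and shows an index-aligned `join` as one possible alternative. The names `asa_toy` and `dep_toy` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for asa and asa_nlp_dep; 'orange juice' appears twice
asa_toy = pd.DataFrame({
    'Ingredient_code': [1, 2, 3],
    'Ingredient_description': ['orange juice', 'orange juice', 'whole milk'],
})
dep_toy = pd.DataFrame({
    'compound': [['a'], ['b'], ['c']],  # placeholder lists, not parser output
    'Ingredient_description': asa_toy['Ingredient_description'],
})

# Key-based merge: each repeated description matches every copy on the other side
merged = pd.merge(asa_toy, dep_toy, on='Ingredient_description', how='left')

# Index-aligned join: one output row per input row
joined = asa_toy.join(dep_toy.drop(columns='Ingredient_description'))

print(len(merged), len(joined))
```

If the descriptions in the real file are already unique, both approaches give the same table, so comparing `len(asa_updated)` with `len(asa)` is a quick check.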

In [23]:
list(asa_updated.columns)

['Ingredient_code',
 'Ingredient_description',
 'orig_food_id',
 'orig_food_common_name',
 'food_V2_ID',
 'compound',
 'nsubjpass',
 'nmod',
 'nsubj',
 'amod',
 'npadvmod',
 'ROOT']

In [24]:
# Export the resulting description file with dependencies
asa_updated.to_csv('data/asa_foodb_descrip_dependencies.csv', index = False, header = True)
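
The dependency columns hold lists of spaCy tokens, which `to_csv` writes using their string form, so they come back as plain strings when the file is read again. If a later step needs the individual words, one option is to store delimited token text instead. This is a minimal sketch: `FakeToken`, `tokens_to_text`, and the `;` separator are assumptions for illustration and are not part of the original script.

```python
from dataclasses import dataclass

@dataclass
class FakeToken:
    """Hypothetical stand-in for a spaCy Token; only the .text attribute is modelled."""
    text: str

def tokens_to_text(tokens, sep=';'):
    """Join the text of each token into one delimited string ('' for an empty list)."""
    return sep.join(tok.text for tok in tokens)

compound = [FakeToken('novelties'), FakeToken('juice'), FakeToken('type')]
print(tokens_to_text(compound))
print(repr(tokens_to_text([])))
```

With the real data, a helper like this could be applied to each dependency column before the export, for example `asa_updated['compound'].apply(tokens_to_text)`.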