# Database Matching: FooDB to ASA24 Ingredient Descriptions
Stephanie Wilson, January 2023

## Step 2: Cleaning FooDB and ASA Descriptions

__Required Input Files__

  - **Content_updated.csv** - Output from 02_FooDB_FoodBCleaning
  - **ingred_recode_remapped10202022.csv** - Output from 03_ingredientize_code_remap.rmd script
  - **ingredientized_asa_10-2022.csv** - Output from 04_ingredientize_merge.rmd script

__Information__  
This script prepares the food descriptions in FooDB's Content.csv and in the ASA dietary data for downstream text similarity comparisons. Specifically, it:

  1) Creates a new list of distinct food descriptions in FooDB
  2) Resolves word cases
  3) Removes certain punctuation
  4) Identifies candidates for stop word removal and removes them
  5) Lemmatizes the text descriptions
  6) Exports the cleaned food descriptions from FooDB and ASA
      - Output: Food_V2_descripcleaned.csv
      - Output: asa_descripcleaned.csv
      - Output: remap_descripcleaned.csv
        
__Output__
  - Food_V2_descripcleaned.csv
  - asa_descripcleaned.csv
  - remap_descripcleaned.csv

In [1]:
#Load modules
import os
import pandas as pd
import numpy as np
import re
import nltk
wn = nltk.WordNetLemmatizer()

In [2]:
#Ensure working directory is the project folder
mapping = os.getcwd()
mapping

'/Users/stephanie.wilson/Desktop/SYNC/Scripts/FooDB_Polyphenol_Quantification'

In [3]:
#Load data
remap = pd.read_csv('Ingredientize/data/ingred_recode_remapped10202022.csv')
asa = pd.read_csv('Ingredientize/data/ingredientized_asa_10-2022.csv')
Content = pd.read_csv('FooDB/Content_updated.csv.bz2', compression='bz2', low_memory=False)

###  1) Create a new list of distinct food descriptions in FooDB.

Recall that the original FooDB Food.csv contains 992 unique food descriptions. Content.csv has _many_ more food descriptions than Food.csv. Content.csv lists each compound and its amount for every food description, so the same description appears on many rows. We need to keep one row per unique food name. Luckily, we already have a unique identifier for each of these food names (the food_V2_ID column) from 01_FooDB_FNDDS_FooDBCleaning.ipynb.

  - Note: Food.csv has 992 descriptions. Food_updated.csv has 993, with the addition of Code 554 from 01_FoodBCleaning.
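
As a minimal sketch of the deduplication idea, the toy frame below stands in for Content.csv (the rows are invented for illustration; only the column names come from the real data). It shows that `drop_duplicates` keeps the first row seen for each description, so the other columns carry the metadata of that first occurrence only.

```python
import pandas as pd

# Toy stand-in for Content.csv: one row per (food, compound) pair.
# The values are illustrative, not taken from FooDB.
content = pd.DataFrame({
    "orig_food_common_name": ["Kiwi", "Kiwi", "Onion", "Onion", "Chives"],
    "food_id": [4, 4, 6, 6, 9],
    "source_type": ["Nutrient", "Compound", "Nutrient", "Nutrient", "Compound"],
})

# Count distinct descriptions, then keep the first row for each one.
n_unique = content["orig_food_common_name"].nunique()
unique_foods = content.drop_duplicates(subset="orig_food_common_name")

print(n_unique, "distinct descriptions;", len(unique_foods), "rows kept")
```

Because only the first row survives, columns such as source_type describe a single compound record rather than the food as a whole.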


In [4]:
# Create a copy of the Content Frame
Food_V2 = Content.copy()

In [5]:
#Drop unneeded columns
Food_V2.drop(columns=['creator_id', 'updater_id', 'orig_citation',
                      'created_at', 'updated_at', 'orig_method',
                      'orig_unit_expression', 'citation_type',
                      'orig_content', 'standard_content',
                      'orig_min', 'orig_max', 'orig_unit'], inplace=True)

In [6]:
# How many distinct food descriptions are there in FooDB?
print(Food_V2['orig_food_common_name'].value_counts().shape[0], ' Unique food entries in FooDB Content.csv')

9913  Unique food entries in FooDB Content.csv


In [7]:
# Drop duplicate food names
Food_V2 = Food_V2.drop_duplicates(subset = 'orig_food_common_name')
Food_V2

Unnamed: 0,id,source_id,source_type,food_id,orig_food_id,orig_food_common_name,orig_food_scientific_name,orig_food_part,orig_source_id,orig_source_name,citation,preparation_type,export,food_V2_ID
0,1,1,Nutrient,4,29,Kiwi,Actinidia chinensis PLANCHON [Actinidiaceae],Fruit,FAT,FAT,DUKE,raw,0,1
1,2,1,Nutrient,6,53,Onion,Allium cepa L. [Liliaceae],Bulb,FAT,FAT,DUKE,raw,0,2
3,4,1,Nutrient,9,55,Chives,Allium schoenoprasum L. [Liliaceae],Leaf,FAT,FAT,DUKE,raw,0,3
4,5,1,Nutrient,11,70,Cashew,Anacardium occidentale L. [Anacardiaceae],Fruit,FAT,FAT,DUKE,other,0,4
7,8,1,Nutrient,12,74,Pineapple,Ananas comosus (L.) MERR. [Bromeliaceae],Fruit,FAT,FAT,DUKE,raw,0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1845066,2222270,453,Compound,1021,,Herbal tea,Camellia sinensis,,C00001510,Theophylline,KNAPSACK,other,1,9909
1855742,2249584,30233,Compound,994,,Red onion,,,,,MANUAL,,1,9910
1867905,2309141,125040,Compound,944,,linseed oil,,,,,PHYTOHUB,,1,9911
1867908,2309144,125040,Compound,946,,Soybean oil,,,,,PHYTOHUB,,1,9912


### 2) Resolve Word Cases

In [8]:
#Isolate food description columns & convert into series
asa_descrip = asa['Ingredient_description'].squeeze()
Food_V2_descrip = Food_V2['orig_food_common_name'].squeeze()
remap_descrip = remap['Ingredient_description_y'].squeeze()

In [9]:
#Convert series to lowercase
#Add c designation to specify food descriptions are now cleaned
asa_descrip_c = asa_descrip.str.lower()
Food_V2_descrip_c = Food_V2_descrip.str.lower()
remap_descrip_c = remap_descrip.str.lower()

In [10]:
#Confirm that data is lowercase
asa_descrip_c

0        onions, spring or scallions (includes tops and...
1        beef, ground, 75% lean meat / 25% fat, patty, ...
2        beef, ground, 80% lean meat / 20% fat, patty, ...
3        beef, ground, 85% lean meat / 15% fat, patty, ...
4        beef, ground, 90% lean meat / 10% fat, patty, ...
                               ...                        
47809                                       butter, salted
47810    margarine-like, vegetable oil spread, 60% fat,...
47811    margarine-like, vegetable oil spread, 60% fat,...
47812    margarine, regular, 80% fat, composite, stick,...
47813    margarine, regular, 80% fat, composite, stick,...
Name: Ingredient_description, Length: 47814, dtype: object

### 3) Remove certain punctuation

In [11]:
# Create a list of punctuation to remove and their respective replacements
punctuation = {',': '', '-': ' ', '(': '', ')': '', ':': ' ', ';' : ' ', '%': ''} 

In [12]:
# Remove punctuation in Food_V2 descriptions
for x, y in punctuation.items():
    # regex=False treats '(' and ')' as literal characters rather than regex syntax
    Food_V2_descrip_c = Food_V2_descrip_c.str.replace(x, y, regex=False)

In [13]:
# Remove punctuation in ASA descriptions
for x, y in punctuation.items():
    # regex=False treats '(' and ')' as literal characters rather than regex syntax
    asa_descrip_c = asa_descrip_c.str.replace(x, y, regex=False)

In [14]:
# Remove punctuation in remapped ingredient descriptions
for x, y in punctuation.items():
    # regex=False treats '(' and ')' as literal characters rather than regex syntax
    remap_descrip_c = remap_descrip_c.str.replace(x, y, regex=False)
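
Because every replacement above maps a single character, the same cleaning can be sketched as one pass with a translation table, which sidesteps regex escaping entirely. The helper name `strip_punctuation` and the example string are assumptions for illustration, not part of the original script.

```python
# Same replacements as the `punctuation` dictionary, as a translation table.
PUNCT_TABLE = str.maketrans({
    ",": "", "-": " ", "(": "", ")": "",
    ":": " ", ";": " ", "%": "",
})


def strip_punctuation(text):
    """Apply every single-character replacement in one pass (hypothetical helper)."""
    return text.translate(PUNCT_TABLE)


example = "beef, ground, 75% lean meat / 25% fat (patty)"
cleaned = strip_punctuation(example)
print(cleaned)
```

On a pandas Series of strings, the same table could be passed to `Series.str.translate(PUNCT_TABLE)`, so the three loops above would each reduce to a single call.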

__Remove Numbers__  
Numbers are important for identifying foods here (e.g., 5% vs. 20% fat ground beef), so numbers are not removed.

### 4) Identify candidates for stop word removal and remove them

In [15]:
#Collapse each series into a single space-separated string of words
Food_V2_string = Food_V2_descrip_c.str.cat(sep = ' ')
asa_string = asa_descrip_c.str.cat(sep = ' ')

In [16]:
#Confirm column is collapsed
asa_string[0:50]

'onions spring or scallions includes tops and bulb '

The stop word removal used for Food_V2 can also be applied to Content.csv.

In [17]:
#Split the string into a list of words
asa_string = asa_string.split()
Food_V2_string = Food_V2_string.split()

In [18]:
#Count word occurrences in FooDB, which helps identify candidates for stop words
Food_V2_counts = pd.Series(Food_V2_string).value_counts(dropna=False)
Food_V2_counts.to_frame('COUNTS').head(20)

Unnamed: 0,COUNTS
and,1648
fat,1635
raw,1489
with,1393
cooked,1338
to,1149
lean,1073
beef,1023
separable,1011
soup,726


In [19]:
#Count word occurrences in ASA, which helps identify candidates for stop words
asa_counts = pd.Series(asa_string).value_counts(dropna=False)
asa_counts.to_frame('COUNTS').head(40)

Unnamed: 0,COUNTS
salt,8845
oil,8801
with,8492
raw,8319
or,7318
salad,5996
fat,5707
cooking,5470
and,5186
vitamin,5145


Stop words, chosen from frequently occurring words and database-specific language (*) in ASA and FooDB:
  - and
  - to
  - in
  - or
  - as
  - food*
  - foods*
  - distribution*

In [20]:
# Create a dictionary of stop words and their respective replacements
# Stop words are padded with spaces so they are not removed from inside another word (e.g., 'or' in 'pork')
stopwords = {' and ': ' ', ' to ': ' ', ' in ': ' ', ' or ': ' ', ' as ': ' ',
             ' food ': ' ', ' foods ': ' ', ' distribution ': ' '}

In [21]:
# Remove stop words in Food_V2 descriptions
for x, y in stopwords.items():
    Food_V2_descrip_c = Food_V2_descrip_c.str.replace(x, y)

In [22]:
# Remove stop words in ASA descriptions
for x, y in stopwords.items():
    asa_descrip_c = asa_descrip_c.str.replace(x, y)

In [23]:
# Remove stop words in remapped ingredient descriptions
for x, y in stopwords.items():
    remap_descrip_c = remap_descrip_c.str.replace(x, y)
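
One limitation of space-padded patterns is that a stop word at the very start or end of a description has no space on one side and is therefore left in place. A possible alternative, sketched here under the assumption that whole-word matching is the intent, uses regex word boundaries. The names `STOP_PATTERN` and `remove_stop_words` are hypothetical.

```python
import re

STOP_WORDS = ["and", "to", "in", "or", "as", "food", "foods", "distribution"]

# \b matches at word boundaries, including the start and end of the string,
# so 'or' is matched as a whole word but not inside 'pork'.
STOP_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in STOP_WORDS) + r")\b"
)


def remove_stop_words(text):
    """Drop whole-word stop words and collapse leftover whitespace."""
    without_stop_words = STOP_PATTERN.sub(" ", text)
    return " ".join(without_stop_words.split())


print(remove_stop_words("onions spring or scallions includes tops and bulb"))
print(remove_stop_words("baby foods"))
```

This variant changes the results slightly compared with the loops above (for example, it trims a trailing "foods"), so it would need to be checked against the existing cleaned outputs before being swapped in.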

In [24]:
#Found an additional replacement while comparing strings
#Replace w/o to without in FooDB, not present in asa
Food_V2_descrip_c = Food_V2_descrip_c.str.replace('w/o', 'without')

### 5) Lemmatize text descriptions

In [25]:
#Replace original food descriptions with the cleaned descriptions
asa['Ingredient_description'] = asa_descrip_c
Food_V2['orig_food_common_name'] = Food_V2_descrip_c
remap['Ingredient_description_y'] = remap_descrip_c

In [26]:
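# Create lemmatize function
def lemmatize_text(text):
    # Split on runs of non-word characters and drop the empty tokens
    # left behind by leading, trailing, or repeated delimiters
    tokens = [token for token in re.split(r'\W+', text) if token]
    # Lemmatize each token; the set keeps only the distinct lemmas
    return {wn.lemmatize(token) for token in tokens}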

In [27]:
asa['Ingredient_descrip_lemmatized'] = asa['Ingredient_description'].apply(lambda x: lemmatize_text(x.lower()))
Food_V2['orig_food_common_name_lemmatized'] = Food_V2['orig_food_common_name'].apply(lambda x: lemmatize_text(x.lower()))
remap['Ingredient_descrip_y_lemmatized'] = remap['Ingredient_description_y'].apply(lambda x: lemmatize_text(x.lower()))
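
The lemmatized columns hold sets of tokens, which lend themselves to set-overlap comparisons. The downstream similarity method is not shown in this script, so the Jaccard index below is only an assumed example of how two such sets could be compared. The token sets are invented.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    union = a | b
    if not union:
        # Two empty sets share nothing measurable; treat as no similarity.
        return 0.0
    return len(a & b) / len(union)


# Invented token sets in the shape returned by lemmatize_text.
asa_tokens = {"onion", "spring", "scallion", "includes", "top", "bulb"}
foodb_tokens = {"onion", "spring", "raw"}

score = jaccard(asa_tokens, foodb_tokens)
print(round(score, 3))
```

Because sets discard word order and repeats, "beef ground" and "ground beef" compare as identical, which suits ingredient descriptions that list the same terms in different orders.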

### 6) Export cleaned food descriptions from FooDB and ASA

In [28]:
#Export Cleaned Food Descriptions
Food_V2.to_csv('FooDB/Food_V2_descripcleaned.csv', index=False, header=True)
asa.to_csv('data/asa_descripcleaned.csv', index=False, header=True)
remap.to_csv('data/remap_descripcleaned.csv', index=False, header=True)
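
One thing to keep in mind when reloading these files: a CSV stores each lemmatized set as its text representation, so the column comes back as strings rather than sets. The sketch below reproduces this with the standard library only (an in-memory buffer and an invented row) and restores the set with `ast.literal_eval`.

```python
import ast
import csv
import io

# Invented row in the shape of the exported data.
rows = [{"description": "onions spring scallions",
         "lemmatized": {"onion", "spring", "scallion"}}]

# Write to an in-memory CSV; the set is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["description", "lemmatized"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the value is now a plain string such as "{'onion', ...}".
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw_value = loaded[0]["lemmatized"]

# literal_eval safely parses the string back into a Python set.
restored = ast.literal_eval(raw_value)
print(type(raw_value).__name__, type(restored).__name__)
```

With pandas, the same conversion could be applied at load time by passing `converters={'Ingredient_descrip_lemmatized': ast.literal_eval}` to `pd.read_csv`.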