## Cosine Similarity Analysis ##

The following function performs a cosine similarity analysis on a subset of the FDC Global Branded Food Products 
    Database (GBFPD), assuming the database is saved locally as 'branded_food.csv'. The cosine_similarity_analysis function takes 
    three arguments (all strings):

* **filename**, which corresponds to the file path of the GBFPD (assumed to be downloaded to your computer). In this example, **filename** is set to "branded_food.csv", assuming that the dataset is in the same directory as this file.
* **subset**, which refers to the branded food category to be queried in step 2. **subset** is set to "Rice" for this demonstration of the analysis.
* **label_color**, which refers to the color used for the points in the graph. This can be set to a hex code (e.g., #0000FF) or a named color (visit https://matplotlib.org/stable/gallery/color/named_colors.html for a list of named colors). For this file, **label_color** is set to "blue".

This function assumes that the dataset used is the USDA Global Branded Food Products Database (GBFPD), which can be downloaded from https://fdc.nal.usda.gov/download-datasets.html as "branded_food.csv", and that "branded_food.csv" is saved on your computer at the path **filename**. It also assumes that **subset** is a valid value of the column "branded_food_category".

You can check the domain of the column "branded_food_category" with the following lines of code (given that 'branded_food.csv' is in the same folder as this notebook and pandas is imported as pd):

**df = pd.read_csv("branded_food.csv", low_memory = False)** 

**y = "Some string"** (replace "Some string" with a string of your choice, checking if that string is a valid value of branded_food_category)

**print(y in df['branded_food_category'].unique())**
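
The same check can be done without loading the whole file into memory. The following is a minimal sketch, assuming the CSV has a "branded_food_category" column; the helper name `is_valid_category` is hypothetical and not part of this notebook's analysis code.

```python
import pandas as pd

def is_valid_category(filename, subset):
    """Return True if subset appears in the branded_food_category column.

    Hypothetical helper: reads only the one column needed for the check.
    """
    categories = pd.read_csv(filename, usecols=["branded_food_category"])
    return subset in set(categories["branded_food_category"].dropna())

# Example call (assumes the file exists at this path):
# is_valid_category("branded_food.csv", "Rice")
```

Reading a single column with `usecols` keeps the check fast on a large file.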



This function involves the following 16 steps:
1. Import Food Data Central Food Products Database
2. Get subset matching branded_food_category == **subset**
3. Calculate and get Readability Scores for FoodData Central ingredient lists
4. Generate Matrix of pairwise differences for Flesch Reading Ease
5. Generate Matrix of pairwise differences for Dale-Chall Index
6. Convert FDC IDs from float to string
7. Create list of all possible FDC ID pairs (repeating)
8. Create and fit CountVectorizer model to ingredient lists 
9. Transform CountVectorizer model to array
10. Calculate cosine similarity of ingredient lists
11. Create Dataframe with readability differences and cosine similarities
12. Prepare dataframe for plot generation and Summary Statistics
13. Get Scatter Plot- Dale Chall diff vs Cosine Similarity
14. Get Scatter Plot- Dale Chall diff vs Flesch diff
15. Get Scatter Plot- Flesch diff vs Cosine Similarity
16. Get Pearson Correlations via scipy.stats.pearsonr()


In [None]:
#Packages Used
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import readability
import time
import numbers
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import pearsonr
import itertools
from itertools import combinations,chain,product,permutations
from batch_test import batchCosineSimilarity
import sys

In [None]:
#Step 1: Import Food Data Central Food Products Database 
FoodDataCentral = pd.read_csv("branded_food.csv", low_memory = False)
subset = "Rice"
label_color = "blue"

In [None]:
#Generate list of branded food categories (Optional, can be used to determine branded food category for analysis, if applicable)
grouped_counts = pd.DataFrame(FoodDataCentral.groupby(['branded_food_category'])['branded_food_category'].count())
grouped_counts.columns = ["bfc_count"]
grouped_counts.sort_values(by = "bfc_count", ascending = False)[50:70]

In [None]:
#Step 2: Get Database subset by branded food category ('Rice' in this case) and remove rows with empty values or only non-words 
#FoodDataCentral = FoodDataCentral.query("branded_food_category == 'Rice'")
if isinstance(subset, numbers.Number):
    #A numeric subset draws a random sample of that many rows instead of a single category
    delim_subset = "Random"
    FoodDataCentral = FoodDataCentral.sample(n=subset, random_state=11)
else:
    FDC_query = "".join(("branded_food_category == '",subset,"'")) #Note: subset must be in the domain of 'branded_food_category'; otherwise the query returns no rows and later steps fail.
    delim_subset = subset.replace(" ","-")
    FoodDataCentral = FoodDataCentral.query(FDC_query)

#Remove rows whose ingredient list is missing or contains only placeholder characters
FoodDataCentral = FoodDataCentral.dropna(subset = ['ingredients'])
FoodDataCentral = FoodDataCentral[FoodDataCentral["ingredients"] != "---"]
FoodDataCentral = FoodDataCentral[FoodDataCentral["ingredients"] != ","]
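
Building the query string by hand breaks if a category name ever contains a single quote. A boolean mask avoids the quoting problem altogether. This is only a sketch: the helper name `filter_by_category` and the toy rows are made up for illustration.

```python
import pandas as pd

def filter_by_category(df, subset):
    """Return rows whose branded_food_category equals subset (hypothetical helper)."""
    mask = df["branded_food_category"] == subset
    return df[mask]

# Made-up rows, including a category name with an apostrophe
toy = pd.DataFrame({
    "branded_food_category": ["Rice", "Candy", "Rice", "Kids' Snacks"],
    "ingredients": ["rice", "sugar", "rice, salt", "oats"],
})
rice_rows = filter_by_category(toy, "Rice")
quoted_rows = filter_by_category(toy, "Kids' Snacks")
```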

In [None]:
#Step 3: Calculate and get Readability Scores for FoodData Central ingredient lists
readability_scores = []
for index, row in FoodDataCentral.iterrows():

    #Get ingredient list
    ingredients = row["ingredients"]

    # Set ingredient lists with incalculable readability to NA (these rows are removed by dropna() below)
    if pd.isna(ingredients) or ingredients in ["---",","]:
        curr_record = (row['fdc_id'], row['gtin_upc']) + (pd.NA,) * 7
        readability_scores.append(curr_record)

    else:
        # readability.getmeasures() expects text with spaces as the delimiter for words and \n as the delimiter for sentences.
        # The function then calculates and returns a set of readability measures. In this case,
        # we are getting specific measures from the set (Flesch Reading Ease and Dale-Chall Readability)

        #Prepare data for readability measurement: convert to lowercase (required to calculate syllables
        #in readability.getmeasures()), tokenize, and rejoin the tokens with spaces
        token_ing = word_tokenize(ingredients.lower())
        ingredients = " ".join(token_ing)

        #Get readability measures
        measures = readability.getmeasures(ingredients)
        curr_record = (row['fdc_id'], row['gtin_upc'], row['branded_food_category'],row['ingredients'],
                       measures['readability grades']['Kincaid'],
                       measures['readability grades']['FleschReadingEase'],
                       measures['readability grades']['DaleChallIndex'],
                       measures['sentence info']['words'],
                       measures['sentence info']['complex_words_dc'])
        readability_scores.append(curr_record)

#Gather data in Pandas DataFrame
readScores_FDC = pd.DataFrame(data = readability_scores, columns = ["fdc_id", "gtin_upc","branded_food_category","ingredients",
                                                                    "Kincaid_Score","FleschReadingEase","DaleChallIndex",
                                                                    "num_words","complex_words_dc"])
#Remove NAs and save dataset to CSV
readScores_FDC = readScores_FDC.dropna()
readScores_FDC.to_csv("".join((delim_subset,"_FoodData_Central_Readability.csv")), sep=",")
FoodDataCentral = readScores_FDC
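
For reference, the two scores used in the later steps follow well-known published formulas. The sketch below assumes the word, sentence, syllable, and difficult-word counts are already available. The function names are illustrative and not part of the readability package, whose own implementation may differ in detail.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease formula (higher means easier to read)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def dale_chall_index(words, sentences, difficult_words):
    """Standard Dale-Chall formula (higher means harder to read)."""
    pct_difficult = 100.0 * difficult_words / words
    score = 0.1579 * pct_difficult + 0.0496 * (words / sentences)
    if pct_difficult > 5.0:
        score += 3.6365  # adjustment applied when over 5% of words are difficult
    return score

# Made-up counts for a single one-sentence ingredient list
fre = flesch_reading_ease(words=10, sentences=1, syllables=15)
dci = dale_chall_index(words=10, sentences=1, difficult_words=2)
```

Because an ingredient list is usually scored as a single sentence, the words-per-sentence term grows with the length of the list.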

In [None]:
#Step 4: Generate Matrix of pairwise differences for Flesch Reading Ease, as a 1-d list
difference_matrix_fl = [[abs(y - x) for x in FoodDataCentral["FleschReadingEase"]] for y in FoodDataCentral["FleschReadingEase"]]
difference_matrix_fl = list(chain(*difference_matrix_fl))

In [None]:
#Step 5: Generate Matrix of pairwise differences for Dale Chall Readability, as a 1-d list
difference_matrix_dc = [[abs(y - x) for x in FoodDataCentral["DaleChallIndex"]] for y in FoodDataCentral["DaleChallIndex"]]
difference_matrix_dc = list(chain(*difference_matrix_dc))

In [None]:
#Step 6: Convert FDC IDs from float to string
FoodDataCentral["fdc_id"] = FoodDataCentral["fdc_id"].astype("str")

In [None]:
#Get FDC indices for referencing matrices (Optional)
#fdc_indices = dict(enumerate(FoodDataCentral["fdc_id"]))
#print(fdc_indices)

In [None]:
#Step 7: Create list of FDC ID pairs (w/repeats)
fdcID_pairs = list(itertools.product(FoodDataCentral["fdc_id"],repeat=2))
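
Step 11 assigns the flattened difference lists to the ID pairs by position, so both must be in the same order. The small check below, using made-up IDs and scores, shows that the flattened nested comprehension from Steps 4 and 5 lines up with `itertools.product(..., repeat=2)`.

```python
import itertools
from itertools import chain

ids = ["101", "102", "103"]      # made-up FDC IDs
scores = [70.0, 55.5, 62.0]      # made-up readability scores

# Same construction as Steps 4 and 5: outer loop over y, inner loop over x
diffs = [[abs(y - x) for x in scores] for y in scores]
flat_diffs = list(chain(*diffs))

# Same construction as Step 7
pairs = list(itertools.product(ids, repeat=2))

# Check that every flattened difference belongs to the pair at the same position
score_of = dict(zip(ids, scores))
aligned = all(
    d == abs(score_of[a] - score_of[b])
    for (a, b), d in zip(pairs, flat_diffs)
)
print(len(pairs), aligned)  # 9 True
```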


In [None]:
#Step 8: Create and fit word count vectorizer model to ingredient lists 
documents = list(FoodDataCentral['ingredients'].values)  #Gather list of ingredient lists from dataset
count_vectorizer = CountVectorizer(stop_words='english') #Create Count Vectorizer Model (the documents are passed to fit(), not to the constructor)
count_vectorizer.fit(documents)  #Fit model to ingredient list

In [None]:
#Step 9: Transform model to array
documents_1 = list(FoodDataCentral['ingredients'].values) 
vectors = count_vectorizer.transform(documents_1).toarray()
np.save("".join((delim_subset,"_IngredientList_Vectors.npy")),vectors)

In [None]:
#Step 10: Calculate cosine similarity of ingredient lists (Skip to Step 10-11 (Alternate) if available memory is a concern)
cos_sim = cosine_similarity(vectors)
cos_sim_flat = list(cos_sim)
cos_sim_flat = list(chain(*cos_sim_flat))
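
Cosine similarity is the dot product of two count vectors divided by the product of their lengths. The plain-Python sketch below works through one pair by hand. It uses a simplified tokenizer (words of two or more characters, no stop-word removal), so it is an illustration of the formula rather than a drop-in replacement for the scikit-learn calls above.

```python
import math
import re
from collections import Counter

def count_words(text):
    """Count words of two or more characters (simplified tokenizer, an assumption)."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))

def cosine(counts_a, counts_b):
    """Cosine similarity between two word-count dictionaries."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

a = count_words("rice, water, salt")
b = count_words("rice, salt, sugar")
# Two shared words, three words in each list: 2 / (sqrt(3) * sqrt(3)) = 2/3
similarity = cosine(a, b)
```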

In [None]:
#Step 11: Create Dataframe with readability differences and cosine similarities
analysis_set = pd.DataFrame(fdcID_pairs, columns = ["fdc_id_1","fdc_id_2"])
analysis_set["DaleChallDiff"] = difference_matrix_dc
analysis_set["Cosine_similarity"] = cos_sim_flat
analysis_set["Flesch_diff"] = difference_matrix_fl
analysis_set["Subset"] = subset

In [None]:
#Step 10-11 (Alternate): In case of low available memory to allocate, the calculations can be done on smaller chunks of the dataset, then combined:

#Create Dataframe for containing all values of Dale Chall difference, Flesch Reading Ease difference, and cosine similarity

analysis_set = pd.DataFrame(fdcID_pairs, columns = ["fdc_id_1","fdc_id_2"])
print(analysis_set["fdc_id_1"].head(5))
analysis_set["DaleChallDiff"] = difference_matrix_dc
analysis_set["Flesch_diff"] = difference_matrix_fl
analysis_set["Cosine_similarity"] = np.nan #Needed as a placeholder for coalescing after

indices = list(FoodDataCentral["fdc_id"])

#Perform Cosine similarity in batches
vectors_df = batchCosineSimilarity(vectors,indices,delim_subset,3) 
vectors_df = pd.read_csv("".join((delim_subset,"_CosSim_total.csv")), chunksize=100000)

chunks = 1
for df in vectors_df:
    print("Chunk " + str(chunks))
    chunks += 1
    df = df.drop(columns='Unnamed: 0')

    df["fdc_id_1"] = df["fdc_id_1"].astype("str")
    df["fdc_id_2"] = df["fdc_id_2"].astype("str")
    analysis_set = analysis_set.merge(df, on=["fdc_id_1","fdc_id_2"], how="left",suffixes = ('', '_x'))
    analysis_set["Cosine_similarity"] = analysis_set["Cosine_similarity"].combine_first(analysis_set["Cosine_similarity_x"])
    analysis_set = analysis_set.drop(columns="Cosine_similarity_x")
    analysis_set = analysis_set.drop_duplicates()
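
A dense n-by-n matrix of 64-bit floats needs 8n² bytes, which is why batching helps for large subsets. The batchCosineSimilarity helper comes from the local batch_test module, whose source is not shown here. The sketch below is therefore not that implementation, only a rough illustration of the general idea with a hypothetical function name: normalize the rows once, then compute the similarities a few rows at a time.

```python
import numpy as np

def cosine_similarity_in_batches(vectors, batch_size):
    """Yield (start_row, block) pairs, where block holds the cosine
    similarities of rows start_row..start_row+batch_size-1 against all rows."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = vectors / norms
    for start in range(0, unit.shape[0], batch_size):
        yield start, unit[start:start + batch_size] @ unit.T

# Made-up count vectors for three ingredient lists
toy = np.array([[1, 1, 0, 1], [1, 1, 1, 0], [0, 0, 2, 0]])
blocks = [block for _, block in cosine_similarity_in_batches(toy, batch_size=2)]
full = np.vstack(blocks)  # stacked here only to compare with the unbatched result
```

In practice each block would be written to disk or merged into the results before the next one is computed, so the full matrix never has to sit in memory at once.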

In [None]:
#Step 12: Prepare dataframe for plot generation and Summary Statistics
analysis_set.dropna(inplace=True)
analysis_set.replace([np.inf, -np.inf], np.nan, inplace=True)
analysis_set.dropna(inplace=True)
analysis_set.to_csv("".join((delim_subset,"_Readability_and_CosineSimilarity_scores.csv")))

In [None]:
#Step 13: Get Scatter Plot- Dale Chall diff vs Cosine Similarity
plt.scatter(analysis_set["DaleChallDiff"], analysis_set["Cosine_similarity"], c = label_color)
plt.xlabel("Dale-Chall Index Pairwise difference")
plt.ylabel("Cosine Similarity")
plt.title("Difference In Dale-Chall Index vs Cosine Similarity (" + delim_subset + ")")
plt.savefig("".join((delim_subset,"_DaleChall_vs_CosineSimilarity.png")))
plt.show()

In [None]:
#Step 14: Get Scatter Plot- Dale Chall diff vs Flesch diff
plt.scatter(analysis_set["DaleChallDiff"], analysis_set["Flesch_diff"], c = label_color)
plt.xlabel("Dale-Chall Pairwise difference")
plt.ylabel("Flesch Pairwise Difference")
plt.title("Difference In Dale-Chall Index vs Difference in Flesch Reading Ease (" + delim_subset + ")")
plt.savefig("".join((delim_subset,"_DaleChall_vs_Flesch.png")))
plt.show()

In [None]:
#Step 15: Get Scatter Plot- Flesch diff vs Cosine Similarity
plt.scatter(analysis_set["Flesch_diff"], analysis_set["Cosine_similarity"], c = label_color)
plt.xlabel("Flesch Pairwise difference")
plt.ylabel("Cosine Similarity")
plt.title("Difference In Flesch Reading Ease vs Cosine Similarity (" + delim_subset + ")")
plt.savefig("".join((delim_subset,"_Flesch_vs_CosineSimilarity.png")))
plt.show()

In [None]:
#Step 16: Get Pearson Correlations (codes: DC = Dale-Chall, FL = Flesch Reading Ease, CS = Cosine similarity)

##pearsonr() returns a tuple with two values (from left to right)- Pearson Correlation Coefficient and p-value
DC_CS_pearson = pearsonr(analysis_set["DaleChallDiff"], analysis_set["Cosine_similarity"])
FL_CS_pearson = pearsonr(analysis_set["Flesch_diff"], analysis_set["Cosine_similarity"])
DC_FL_pearson = pearsonr(analysis_set["DaleChallDiff"], analysis_set["Flesch_diff"])
Pearson_corrs = [DC_CS_pearson, FL_CS_pearson, DC_FL_pearson]
print("Dale-Chall vs Cosine Similarity:", DC_CS_pearson)
print("Flesch vs Cosine Similarity:", FL_CS_pearson)
print("Dale-Chall vs Flesch:", DC_FL_pearson)
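
For readers unfamiliar with the statistic, the Pearson correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal plain-Python sketch of the coefficient alone (without the p-value that pearsonr() also returns) is given below, using made-up numbers and a hypothetical function name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfectly linear, increasing
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # perfectly linear, decreasing
```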