## Cosine Similarity Analysis ##

This notebook walks step by step through what the cosine_similarity_analysis(filename,subset) function does. Because the process is spread across multiple cells, the function itself is never explicitly defined. Its arguments are as follows:

* **filename** is set to "branded_food.csv", the branded food dataset from FoodData Central's Food Products Database.
* **subset** is set to "Rice", which selects the rice food products within the branded food dataset.

This function assumes that the dataset is FDC's Food Products Database and that it is saved on your computer at the path **filename**. It also assumes that **subset** is a valid value of the column "branded_food_category".

You can list the values that occur in the column "branded_food_category" with the following lines of code (given that pandas is imported as pd and 'branded_food.csv' is in the same folder as this notebook):

**x = pd.read_csv("branded_food.csv", low_memory = False)** 

**print(x['branded_food_category'].unique())**

This function involves the following 16 steps:

1. Import FoodData Central Food Products Database
2. Get subset matching branded_food_category == **subset**
3. Calculate and get Readability Scores for FoodData Central ingredient lists
4. Generate matrix of pairwise differences for Flesch Reading Ease
5. Generate matrix of pairwise differences for Dale-Chall Index
6. Convert FDC IDs from float to string
7. Create list of all possible FDC ID pairs (with repeats)
8. Create and fit CountVectorizer model to ingredient lists
9. Transform ingredient lists into an array of count vectors
10. Calculate cosine similarity of ingredient lists
11. Create Dataframe with readability differences and cosine similarities
12. Prepare dataframe for plot generation and Summary Statistics
13. Get Scatter Plot: Dale-Chall diff vs Cosine Similarity
14. Get Scatter Plot: Dale-Chall diff vs Flesch diff
15. Get Scatter Plot: Flesch diff vs Cosine Similarity
16. Get Pearson Correlations via scipy.stats.pearsonr()


In [None]:
#Packages Used
import itertools
from itertools import chain

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import readability
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
#Step 1: Import Food Data Central Food Products Database 
FoodDataCentral = pd.read_csv("branded_food.csv", low_memory = False)

In [None]:
#Generate list of branded food categories (Optional, can be used to determine branded food category for analysis)
grouped_counts = pd.DataFrame(FoodDataCentral.groupby(['branded_food_category'])['branded_food_category'].count())
grouped_counts.columns = ["bfc_count"]
grouped_counts.sort_values(by = "bfc_count", ascending = False)[50:70]

In [None]:
#Step 2: Get Database subset by branded food category ('Rice' in this case) and remove rows with empty values or only non-words 
FoodDataCentral = FoodDataCentral.query("branded_food_category == 'Rice'")
FoodDataCentral = FoodDataCentral.dropna(subset = ['ingredients'])
FoodDataCentral = FoodDataCentral[FoodDataCentral["ingredients"] != "---"]
FoodDataCentral = FoodDataCentral[FoodDataCentral["ingredients"] != ","]
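
The filters above remove two specific placeholder strings ("---" and ","). A more general way to drop entries that contain only non-words might look like the sketch below. The helper name `has_words` and the rule "at least one ASCII letter" are assumptions made for illustration; they are not part of the original function.

```python
import re

def has_words(text) -> bool:
    """Return True if text is a string containing at least one ASCII letter."""
    # Assumption: an ingredient list with no letters at all carries no words.
    return isinstance(text, str) and re.search(r"[A-Za-z]", text) is not None

# Hypothetical ingredient entries, including the placeholders filtered above.
samples = ["RICE, WATER", "---", ",", "", None, "100% rice"]
kept = [s for s in samples if has_words(s)]
print(kept)
```

On the real dataframe, the same helper could serve as a boolean mask, for example `FoodDataCentral[FoodDataCentral["ingredients"].apply(has_words)]`.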

In [None]:
#Step 3: Calculate and get Readability Scores for FoodData Central ingredient lists
readability_scores = []
for index, row in FoodDataCentral.iterrows():
    if pd.isna(row["ingredients"]) or row["ingredients"] in ["---"]:
        #Defensive fallback (Step 2 already drops these rows): keep the identifiers and leave every score missing
        curr_record = (row['fdc_id'], row['gtin_upc'], row['branded_food_category'],
                       pd.NA, pd.NA, pd.NA, pd.NA, pd.NA)
    else:
        #Call readability.getmeasures() once per ingredient list (the raw string is passed as-is)
        #and pull the individual measures out of the nested dictionary it returns
        measures = readability.getmeasures(row["ingredients"])
        grades = measures['readability grades']
        sentence_info = measures['sentence info']
        curr_record = (row['fdc_id'], row['gtin_upc'], row['branded_food_category'],
                       grades['Kincaid'], grades['FleschReadingEase'], grades['DaleChallIndex'],
                       sentence_info['words'], sentence_info['complex_words_dc'])
    readability_scores.append(curr_record)

readScores_FDC = pd.DataFrame(data = readability_scores, columns = ["fdc_id", "gtin_upc","branded_food_category",
                                                                    "Kincaid_Score","FleschReadingEase","DaleChallIndex",
                                                                    "num_words","complex_words_dc"])

readScores_FDC.to_csv("FoodData_Central_Readability.csv", sep=",")

#Copy the scores back by position: readScores_FDC has a fresh 0..n-1 index while FoodDataCentral
#keeps its original row labels, so assigning the Series directly would misalign and produce NaN
for value in readScores_FDC.columns.values:
    FoodDataCentral[value] = readScores_FDC[value].values
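
For reference, the two scores compared in the later steps are commonly defined as below. These are the standard published formulas; the `readability` package may differ in details such as syllable counting or its list of familiar words, so treat this as background rather than a description of its implementation. The "difficult words" count presumably corresponds to the `complex_words_dc` measure collected above.

```latex
\begin{align}
\text{FleschReadingEase} &= 206.835
  - 1.015\,\frac{\text{words}}{\text{sentences}}
  - 84.6\,\frac{\text{syllables}}{\text{words}} \\[4pt]
\text{DaleChall} &= 0.1579\left(100\cdot\frac{\text{difficult words}}{\text{words}}\right)
  + 0.0496\,\frac{\text{words}}{\text{sentences}}
\end{align}
```

In the usual Dale-Chall definition, 3.6365 is added to the score when more than 5% of the words are difficult. A higher Flesch Reading Ease means easier text, while a higher Dale-Chall score means harder text.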
    

In [None]:
#Step 4: Generate Matrix of pairwise differences for Flesch Reading Ease
difference_matrix_fl = [[abs(y - x) for x in readScores_FDC["FleschReadingEase"]] for y in readScores_FDC["FleschReadingEase"]]
difference_matrix_fl = list(chain(*difference_matrix_fl))

In [None]:
#Step 5: Create Difference Matrix- Dale-Chall Index
difference_matrix_dc = [[abs(y - x) for x in readScores_FDC["DaleChallIndex"]] for y in readScores_FDC["DaleChallIndex"]]
difference_matrix_dc = list(chain(*difference_matrix_dc))
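
The nested list comprehensions in Steps 4 and 5 build an n x n matrix one element at a time. A vectorized alternative using NumPy broadcasting is sketched below; the scores are made-up values for illustration, and this is only one possible way to get the same flattened result.

```python
import numpy as np

# Hypothetical readability scores for four products (illustrative values only).
scores = np.array([62.5, 48.0, 71.25, 48.0])

# Broadcasting a column vector against a row vector yields the full n x n matrix
# of absolute differences: diff[i, j] = |scores[i] - scores[j]|.
diff = np.abs(scores[:, None] - scores[None, :])

# Row-major flattening gives the same order as the nested list comprehension
# followed by chain(*...).
flat = diff.ravel()

# Reference result computed the same way as in the cells above.
nested = [abs(y - x) for y in scores for x in scores]
print(np.array_equal(flat, np.array(nested)))
```

For a few thousand products the broadcast version is typically much faster, although the n x n matrix still has to fit in memory.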

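In [None]:
#Build the FDC ID pair for every cell of the difference matrices (Optional sanity check)
fdc_id_matrix = [[(y,x) for x in readScores_FDC["fdc_id"]] for y in readScores_FDC["fdc_id"]]
fdc_id_matrix = list(chain(*fdc_id_matrix))

In [None]:
#Confirm that these pairs are in the same order as the itertools.product pairs built in Step 7
print(fdc_id_matrix == list(itertools.product(readScores_FDC["fdc_id"],repeat=2)))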

In [None]:
#Step 6: Convert FDC IDs from float to string
FoodDataCentral["fdc_id"] = FoodDataCentral["fdc_id"].astype("str")

In [None]:
#Get FDC indices for referencing matrices (Optional)
fdc_indices = dict(enumerate(FoodDataCentral["fdc_id"]))

In [None]:
#Step 7: Create list of FDC ID pairs (w/repeats)
fdcID_pairs = list(itertools.product(readScores_FDC["fdc_id"],repeat=2))
print(fdcID_pairs[1:10])
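
The analysis relies on the ID pairs from `itertools.product` lining up, position by position, with the flattened difference matrices. The toy example below checks that alignment on three made-up IDs and scores; the names `ids`, `scores`, and `lookup` are hypothetical and exist only for this sketch.

```python
import itertools
from itertools import chain

ids = ["A", "B", "C"]        # hypothetical FDC IDs
scores = [5.0, 7.5, 6.0]     # hypothetical readability scores, same order as ids

# All ordered pairs, including self-pairs such as ("A", "A").
pairs = list(itertools.product(ids, repeat=2))

# Difference matrix flattened row by row, as in Steps 4 and 5.
diff_matrix = [[abs(y - x) for x in scores] for y in scores]
diff_flat = list(chain(*diff_matrix))

# Each flattened difference should belong to the pair at the same position.
lookup = dict(zip(ids, scores))
aligned = all(
    diff == abs(lookup[first] - lookup[second])
    for (first, second), diff in zip(pairs, diff_flat)
)
print(aligned)
```

Because every ordered pair is kept, each unordered pair appears twice and each product is also paired with itself (difference 0).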

In [None]:
#Step 8: Create and fit CountVectorizer model to ingredient lists 

#Gather list of ingredient lists from dataset
documents = list(FoodDataCentral['ingredients'].values) 

#Create Count Vectorizer Model (the documents are passed to fit(), not to the constructor)
count_vectorizer = CountVectorizer(stop_words='english')

#Fit model to ingredient list
count_vectorizer.fit(documents)

In [None]:
#Step 9: Transform the ingredient lists into an array of count vectors (one row per product)
vectors = count_vectorizer.transform(documents).toarray()
np.save("Vectors_Batch_3_test",vectors)

In [None]:
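#Step 10: Calculate cosine similarity of ingredient lists
cos_sim = cosine_similarity(vectors)
#Flatten row by row so the values line up with fdcID_pairs
cos_sim_flat = cos_sim.flatten()
#Rescale by the L2 norm of the flattened vector. Note: after this the values are no longer raw
#cosine similarities in [0, 1]; Pearson correlations are unaffected by the rescaling
cos_sim_flat = cos_sim_flat / np.linalg.norm(cos_sim_flat)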
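
To make the quantity in Step 10 concrete, here is a hand-rolled cosine similarity on word counts using only the standard library. It is a simplified sketch, not scikit-learn's implementation: the tokenization (lowercase, split on commas and whitespace) is an assumption and no stop words are removed, so the numbers will not match `CountVectorizer` exactly.

```python
import math
from collections import Counter

def cosine(a: str, b: str) -> float:
    """Cosine similarity of two texts based on raw word counts."""
    # Simplified tokenization (assumption): lowercase, split on commas and whitespace.
    count_a = Counter(a.lower().replace(",", " ").split())
    count_b = Counter(b.lower().replace(",", " ").split())
    # Dot product over shared words, divided by the product of vector lengths.
    dot = sum(count_a[w] * count_b[w] for w in count_a)
    norm_a = math.sqrt(sum(c * c for c in count_a.values()))
    norm_b = math.sqrt(sum(c * c for c in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical ingredient lists.
rice_a = "long grain rice, water, salt"
rice_b = "brown rice, water"
other = "sugar, cocoa butter"

sim_ab = cosine(rice_a, rice_b)   # two shared words out of 5 and 3
sim_ao = cosine(rice_a, other)    # no shared words
sim_aa = cosine(rice_a, rice_a)   # identical lists
print(round(sim_ab, 3), sim_ao, round(sim_aa, 3))
```

Lists that share no words score 0, identical lists score 1 (up to floating-point rounding), and partial overlap falls in between.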

In [None]:
#Step 11: Create Dataframe with readability differences and cosine similarities
analysis_set = pd.DataFrame(fdcID_pairs, columns = ["fdc_id_1","fdc_id_2"])
analysis_set["DaleChallDiff"] = difference_matrix_dc
analysis_set["Cosine_similarity"] = cos_sim_flat
analysis_set["Flesch_diff"] = difference_matrix_fl

In [None]:
#Step 12: Prepare dataframe for plot generation and Summary Statistics
#Treat infinite values as missing, then drop every row with a missing value
analysis_set.replace([np.inf, -np.inf], np.nan, inplace=True)
analysis_set.dropna(inplace=True)
analysis_set.to_csv("Readability_and_CosineSimilarity_scores.csv")

In [None]:
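#Step 13: Get Scatter Plot- Dale Chall diff vs Cosine Similarity
plt.scatter(analysis_set["DaleChallDiff"], analysis_set["Cosine_similarity"])
plt.xlabel("Dale-Chall difference")
plt.ylabel("Cosine similarity")
#Save before plt.show(): once the figure has been shown, a later savefig typically writes an empty image
plt.savefig("DaleChall_vs_CosineSimilarity.png")
plt.show()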

In [None]:
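#Step 14: Get Scatter Plot- Dale Chall diff vs Flesch diff
plt.scatter(analysis_set["DaleChallDiff"], analysis_set["Flesch_diff"])
plt.xlabel("Dale-Chall difference")
plt.ylabel("Flesch Reading Ease difference")
#Save before plt.show(): once the figure has been shown, a later savefig typically writes an empty image
plt.savefig("DaleChall_vs_Flesch.png")
plt.show()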

In [None]:
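#Step 15: Get Scatter Plot- Flesch diff vs Cosine Similarity
plt.scatter(analysis_set["Flesch_diff"], analysis_set["Cosine_similarity"])
plt.xlabel("Flesch Reading Ease difference")
plt.ylabel("Cosine similarity")
#Save before plt.show(): once the figure has been shown, a later savefig typically writes an empty image
plt.savefig("Flesch_vs_CosineSimilarity.png")
plt.show()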

In [None]:
#Step 16: Get Pearson Correlations
print(pearsonr(analysis_set["DaleChallDiff"], analysis_set["Cosine_similarity"]))
print(pearsonr(analysis_set["Flesch_diff"], analysis_set["Cosine_similarity"]))
print(pearsonr(analysis_set["DaleChallDiff"], analysis_set["Flesch_diff"]))
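
Step 10 divides the flattened cosine similarities by their L2 norm before they reach this step. The sketch below, using a hand-written Pearson coefficient and made-up numbers, illustrates why that rescaling does not change the correlations: Pearson's r is invariant to multiplying a variable by a positive constant. The function `pearson_r` is a hypothetical stand-in for illustration, not `scipy.stats.pearsonr`, and it does not compute a p-value.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical readability differences and cosine similarities.
diffs = [0.0, 1.5, 3.0, 0.5, 2.0]
sims = [1.0, 0.6, 0.2, 0.9, 0.4]

# Rescale the similarities by their L2 norm, as Step 10 does.
norm = math.sqrt(sum(s * s for s in sims))
scaled = [s / norm for s in sims]

r_raw = pearson_r(diffs, sims)
r_scaled = pearson_r(diffs, scaled)
print(math.isclose(r_raw, r_scaled))
```

Note that `scipy.stats.pearsonr` returns both the coefficient and a p-value. Because the analysis set contains every ordered pair, including self-pairs, the observations are not independent, so the p-values are best read with caution.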