# Homework10

Exercises with text processing and NLP modeling

## Goals

- Understand similarities and differences between the processes of working with text, images and tabular data
- Practice with different methods of encoding and modeling text data
- See different methods for extracting information or patterns from text datasets

### Setup

Run the following 2 cells to import all necessary libraries and helpers for this homework.

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/text_utils.py

In [1]:
import matplotlib.cm as cm
import matplotlib.pyplot as plt

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from data_utils import display_silhouette_plots, object_from_json_url
from text_utils import get_top_words

You can tell it's gonna be a good homework from the number of imports.
# 🙃

## Have protein, need seasoning

Let's create a model to help us season our foods. In the end, what we want is a model that receives a short list of ingredients and returns a list of seasonings or complementary ingredients for our original ingredients list.

In order to do that we need a dataset of recipes. We'll load that into a text dataset where each recipe is a document and the ingredients are our document *tokens*.

Let's take a look at the recipe dataset and become familiar with the data and how it's organized.

We'll load our recipes and do a bit of exploratory data analysis to look for patterns first to see if this kind of modeling makes any sense.

### Load Data

Here's our dataset. Let's load it into an object for inspection:

In [5]:
DATAPATH = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/refs/heads/main/datasets/text/recipes"
recipes_obj = object_from_json_url(f"{DATAPATH}/recipes_min16.json")

### Look at Data

How's the data organized?

How many recipes do we have?

Do all recipes have the same number of ingredients?

Anything else stand out about the data?

In [6]:
# TODO: Look at Data here
print(recipes_obj[:5])
# TODO: How many recipes
print(len(recipes_obj))
# TODO: How many ingredients do the shortest and longest recipes have?
rec_len = [len(recipe["ingredients"]) for recipe in recipes_obj]
shortest = min(rec_len)
longest = max(rec_len)
print(f"Shortest recipe has {shortest} ingredients")
print(f"Longest recipe has {longest} ingredients")

[{'id': 18009, 'ingredients': ['raisins', 'baking powder', 'egg', 'sugar', 'milk', 'flour']}, {'id': 35687, 'ingredients': ['parmesan cheese', 'salt', 'cornmeal', 'black pepper', 'sausage', 'olive oil', 'leeks', 'water']}, {'id': 38527, 'ingredients': ['salt', 'corn starch', 'butter', 'lemon juice', 'baking powder', 'heavy cream', 'peaches', 'sugar', 'flour']}, {'id': 41217, 'ingredients': ['corn starch', 'orange juice', 'rice', 'ginger', 'vinegar', 'vegetable oil', 'garlic', 'sriracha', 'sesame seeds', 'chicken broth', 'soy sauce', 'egg', 'onion', 'white pepper', 'orange zest', 'sugar']}, {'id': 42969, 'ingredients': ['cilantro', 'rice', 'ginger', 'garlic', 'yogurt', 'curry powder', 'onion', 'cumin']}]
5015
Shortest recipe has 5 ingredients
Longest recipe has 27 ingredients


### Create Input Features

Our dataset doesn't really have to be a `DataFrame` here. It can, but it doesn't have to be.

Each recipe right now is described as a list of ingredients, but what we really want is a list of *sentences*, where each *sentence* is a Python `string` with all of the ingredients for a given recipe.

Instead of:<br>```["salt", "baking soda", "water", "mushroom"]```,

we want:<br>```"salt baking soda water mushroom"```

The `join()` function might help.

Another thing to consider is whether we want to do anything special about multi-word ingredients, like *baking soda*.

Do we want to let our vectorizer (spoiler) split that into two tokens, or do we want to guarantee that *baking* and *soda* always stay together? 
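
One possible way to keep multi-word ingredients together is to replace the spaces inside each ingredient with underscores before joining. This is only a sketch of that idea; the helper name `recipe_to_sentence` is made up for illustration.

```python
def recipe_to_sentence(ingredient_list):
    """Join ingredients into one string, keeping multi-word ingredients as single tokens."""
    # "baking soda" -> "baking_soda", so a word tokenizer sees one token
    tokens = [ingredient.strip().replace(" ", "_") for ingredient in ingredient_list]
    return " ".join(tokens)

example = ["salt", "baking soda", "water", "mushroom"]
print(recipe_to_sentence(example))
```

With scikit-learn's default token pattern, underscores count as word characters, so `baking_soda` should stay a single token.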

In [56]:
# TODO: turn list of objects into list of strings
for recipe in recipes_obj:
   recipe["ingredients_list"] = " ".join(recipe["ingredients"])


print(recipes_obj[1]["ingredients_list"])

ingredients = [recipe["ingredients_list"] for recipe in recipes_obj]
ingredients[1]

parmesan cheese salt cornmeal black pepper sausage olive oil leeks water


'parmesan cheese salt cornmeal black pepper sausage olive oil leeks water'

### Encode Data

The fun part.

Let's vectorize our list of ingredient strings into a sparse document matrix using `CountVectorizer` or `TfidfVectorizer`.

The resulting matrix will have one row for each recipe, and the columns will encode the ingredients.
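
To make the shape of that matrix concrete, here is a tiny hand-built version using plain Python lists. It is a dense, simplified stand-in for what `CountVectorizer` produces (the real result is sparse, and `TfidfVectorizer` additionally reweights the counts); the toy recipes and the `toy_` names are assumptions for illustration.

```python
toy_recipes = [
    "salt flour egg",
    "salt sugar egg milk",
    "flour sugar",
]

# vocabulary: every distinct token, sorted (one column per token)
toy_vocab = sorted({token for recipe in toy_recipes for token in recipe.split()})

# one row per recipe, one column per vocabulary token, values are token counts
toy_matrix = [
    [recipe.split().count(token) for token in toy_vocab]
    for recipe in toy_recipes
]

print(toy_vocab)
for row in toy_matrix:
    print(row)
```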

In [87]:
# TODO: Vectorize ingredients from our recipe list
vec = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=20_000 , ngram_range=(1, 2))
recipes_vct = vec.fit_transform(ingredients)

# TODO: How many words are in our vocabulary?
vocab = vec.get_feature_names_out()
len(vocab)

2588

In [None]:
recipes_vct[1]

### Cluster Data

Now that we have our recipes/documents vectorized we can study them a little bit, and look for patterns.

What happens if we cluster our recipes ? What do the cluster centers represent ?

When might this be useful ?

In [None]:
# TODO: cluster recipes
cluster_km = KMeans(n_clusters=7, random_state=1010)
cluster_km.fit(recipes_vct)
ing_kmeans = cluster_km.predict(recipes_vct)

In [None]:
from sklearn.metrics import silhouette_score

num_clusters = list(range(2,15))

# collect the K-means score (negative sum of squared distances) for each clustering size
score_scores = []

# fit K-means for each number of clusters and record its score
for n in num_clusters:
  mm = KMeans(n_clusters=n, random_state=1010)
  mm.fit(recipes_vct)
  score_scores.append(mm.score(recipes_vct))


# plot scores as function of number of clusters
plt.plot(num_clusters, score_scores, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Negative Sum of Squared Distances (K-means score)")
plt.title("K-means Clustering")
#plt.ylim([-4300, -3900])
#plt.xlim([5, 10])
plt.show()


Here we can see that the number of clusters we need is greater than 5 and less than 8 or 9, so I will just use k = 7. I settled on this after I saw that two centers had the same word.

### Cluster Centers

Use the `get_top_words()` function to decode the `cluster_centers` back into ingredients.
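
Decoding a center boils down to sorting its weights and looking up the matching vocabulary entries. The sketch below shows that idea on made-up numbers; it is not the actual implementation of `get_top_words()` (whose source is not shown here), and `top_terms`, the toy vocabulary and the toy centers are all hypothetical.

```python
def top_terms(center, vocabulary, n=3):
    """Return the n vocabulary terms with the largest weights in one cluster center."""
    ranked = sorted(range(len(center)), key=lambda i: center[i], reverse=True)
    return [vocabulary[i] for i in ranked[:n]]

toy_vocab = ["butter", "flour", "garlic", "soy sauce", "sugar"]
toy_centers = [
    [0.40, 0.55, 0.00, 0.00, 0.60],  # hypothetical "baking" cluster
    [0.05, 0.00, 0.70, 0.65, 0.10],  # hypothetical "stir-fry" cluster
]

for center in toy_centers:
    print(top_terms(center, toy_vocab))
```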

In [None]:
# TODO: Look at cluster centers
clusters_centers = cluster_km.cluster_centers_
print(clusters_centers)

get_top_words(cluster_km.cluster_centers_, vocab, 6)

### Interpretation

<span style="color:hotpink">
What do these cluster centers represent ?<br>
Is there anything interesting about recipe cluster centers ?<br>
</span>

<span style="color:lightgreen;">
We are using TF-IDF, so each word is weighted by its importance, and the vectorizer returns a sparse matrix of those importance values. We pass those values to the K-Means clustering algorithm and create 7 clusters. The center of each cluster is the average of all the points in that cluster, so the words with the largest values in a center are the most important words for that cluster. <br>
The interesting thing for me is that the clusters group together ingredients that could actually be cooked together.
<br>
It is also interesting that the n-gram range captured ingredients that belong together, like olive oil, while still keeping oil and olive as separate tokens.

</span>

### Plot Clusters

Let's plot our clusters to see if we have to adjust any of the clustering parameters.

Since we can't plot in $500$ dimensions, we should use `PCA` to look at our clusters in $2D$ and $3D$.

In [None]:
# TODO: PCA to reduce the dimensions of our feature space
pca = PCA(n_components=3)
rec_df = pca.fit_transform(recipes_vct.toarray())

# TODO: plot clusters 
plt.figure(figsize=(20, 20))
plt.title("Clusters of Recipes")
plt.scatter(rec_df[:, 0], rec_df[:, 1], c=ing_kmeans, alpha=0.6)
plt.show()

### Plot Silhouette Plots

We can also check the quality of our clustering by looking at the silhouette plots that we get from calling:<br>
`display_silhouette_plots(vectors, clusters)`.
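
The silhouette value behind those plots can be worked out by hand on a tiny example. This is a rough sketch of the standard definition, not the code inside `display_silhouette_plots` (whose source is not shown here); the function name `silhouette_values` and the toy points are made up, and the sketch assumes there are at least two clusters.

```python
import math

def silhouette_values(points, labels):
    """Silhouette coefficient s = (b - a) / max(a, b) for every point."""
    values = []
    for i, (p, label) in enumerate(zip(points, labels)):
        # distances from point i to every other point, grouped by cluster label
        dists_by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists_by_cluster.setdefault(other, []).append(math.dist(p, q))
        own = dists_by_cluster.pop(label, [])
        if not own:
            values.append(0.0)  # a point alone in its cluster gets 0
            continue
        a = sum(own) / len(own)  # mean distance inside its own cluster
        b = min(sum(d) / len(d) for d in dists_by_cluster.values())  # nearest other cluster
        values.append((b - a) / max(a, b))
    return values

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good_labels = [0, 0, 1, 1]
bad_labels = [0, 1, 0, 1]
print(silhouette_values(points, good_labels))
print(silhouette_values(points, bad_labels))
```

Values close to 1 suggest a point sits well inside its cluster, values near 0 suggest overlapping clusters, and negative values suggest the point may belong to a different cluster.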

In [None]:

from sklearn.metrics import silhouette_score

display_silhouette_plots(recipes_vct, ing_kmeans)

score = silhouette_score(recipes_vct, ing_kmeans)
print(score)

### Interpretation

<span style="color:hotpink">
How many clusters did you end up with ?<br>
How do they look ?<br>
</span>

<span style="color:lightgreen;">
I tried many different values, from 3 to 20. <br>

The more clusters, the worse the results. Two clusters is also not useful. 
The best number was 8, with the best score and plots; otherwise the error was bigger across the components. This was after I changed the vectorizer. The CountVectorizer was not as good: its best number of clusters was 3, which did not convince me. I changed to the TfidfVectorizer and it was somewhat better with 7 clusters, which makes more sense since we are talking about food and ingredients that can be very diverse. 
</span>

## Recipe Completion

Ok. On to the main event.

Let's create some recipes.

We'll do this using a technique similar to what is used for movie/product recommendations. Given an initial set of ingredients, we'll look at recipes that have similar ingredients and "recommend" additional ingredients.

We already have all of the recipes in our dataset encoded as `tf-idf` vectors. The rest of our algorithm will be something like:
1. Start with an initial set of ingredients
2. Encode ingredients
3. Find a set of recipes that are similar to our list of ingredients
4. Find common ingredients that are in the similar recipes, but not in our list of ingredients
5. Pick representative ingredient to add to recipe
6. Repeat

Let's start.

### 1. Initial list of ingredients

This is just a string with ingredients:

In [None]:
recipe_seed_str = "flour"  # feel free to change this

### 2. Encode ingredients

Transform the string into a `tf-idf` vector:

In [None]:
# TODO: transform string into sparse vector
recipe_seed_vct = vec.transform([recipe_seed_str])
recipe_seed_vct

### 3. Find similar recipes

The meat of the algorithm. No pun intended.

In order to find similar recipes, we'll first calculate the distance between our current list of ingredients and all recipes in our dataset.

We can start with euclidean distance and later try other kinds, but the overall processing will be the same:

1. Start with an empty list to store distances
2. Loop over the `tf-idf` recipe vectors and for each vector:
   1. Subtract the ingredient list
   2. Square the difference (to square a sparse matrix `A`, use `A.multiply(A)`)
   3. Sum the terms of the result
   4. Take the square root of the sum
   5. Append to distance list
3. Find the indices of the smallest distances (this operation is called `argsort` and will give us the indices of the recipes that are most similar to our list of ingredients)
4. Check the recipes to see if they are indeed similar (`inverse_transform()` the vectors at the indices calculated above)
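
The distance steps above can be tried on small dense lists before moving to sparse matrices. This is a sketch with made-up numbers; the three columns are assumed to stand for flour, sugar and garlic, and `toy_recipes` / `toy_seed` are hypothetical names.

```python
import math

def argsort(values, reverse=False):
    """Indices that would sort values."""
    return sorted(range(len(values)), key=values.__getitem__, reverse=reverse)

# toy dense vectors; columns could stand for [flour, sugar, garlic]
toy_recipes = [
    [0.9, 0.4, 0.0],
    [0.0, 0.1, 0.9],
    [0.7, 0.7, 0.0],
]
toy_seed = [1.0, 0.0, 0.0]

toy_distances = []
for recipe in toy_recipes:
    squared = [(s - r) ** 2 for s, r in zip(toy_seed, recipe)]  # subtract, then square
    toy_distances.append(math.sqrt(sum(squared)))               # sum, then square root

# indices of the recipes, from most to least similar to the seed
print(argsort(toy_distances))
```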

In [13]:
# argsort a list (get sequence of indices that would sort the list)
# https://stackoverflow.com/a/3382369
def argsort(L, reverse=False):
  return sorted(range(len(L)), key=L.__getitem__, reverse=reverse)

In [89]:
# TODO: list to keep distances
recipe_dists = []

# TODO: loop over vectors and append euclidean distances to list
# (the loop variable is not called `vec`, because `vec` holds the fitted vectorizer)
for recipe_vct in recipes_vct:
    diff = recipe_seed_vct - recipe_vct
    sq_diff = diff.multiply(diff)
    sum_sq = sq_diff.sum()
    dist = (sum_sq)**0.5
    recipe_dists.append(dist)

print(len(recipe_dists))

# TODO: argsort list of distances to find indices of similar recipes
idxs = argsort(recipe_dists)

# TODO: check first 4 recipes
print(idxs[:4])
for i in idxs[:4]:
    print(ingredients[i])


5015
[677, 4405, 69, 1846]
salt flour yogurt oil ghee
salt cake flour baking soda buttermilk flour
salt pistachios butter flour cake flour cherries sugar vanilla
salt butter bread flour egg yeast sugar flour


WOW!! This is amazing!

### 4. Find ingredients to recommend

We have a way to get a set of similar recipes with similar ingredients, and now want to find a *meaningful*, or *representative*, ingredient to add to our ingredients list.

Let's consider ingredients in the $16$ most similar recipes. What we are trying to do is find an ingredient that is in a lot of these recipes, but not yet in our list of ingredients.

There are many possible ways of doing this. We could count the number of times different ingredients show up in these $16$ recipes using Python dictionaries and/or sets, but what we're trying to do here is very similar to what a `TfidfVectorizer` does: calculate relative importance of terms in a series of documents.

Let's re-encode these $16$ recipes using their own separate `TfidfVectorizer`, then sum the importance of each ingredient and look at ingredients with the highest importance scores.

We could re-use the vectors/scores from the original `TfidfVectorizer`, but they're gonna be influenced by the relative frequencies of all of the ingredients that showed up in all of the recipes. Using a separate vectorizer is a little bit more precise.
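
A small example may help show why the set of documents matters. The sketch below computes an inverse document frequency by hand, assuming scikit-learn's default smoothed formula, for the same term in a full toy corpus and in a smaller set of neighbours. The corpora and the name `smoothed_idf` are made up for illustration.

```python
import math

def smoothed_idf(term, documents):
    """idf = ln((1 + n) / (1 + df)) + 1, where df counts documents containing term."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1

toy_corpus = [
    "salt flour egg",
    "salt garlic onion",
    "salt soy sauce garlic",
    "flour sugar egg",
]
toy_neighbours = ["salt flour egg", "flour sugar egg"]  # hypothetical similar recipes

for term in ["egg", "salt"]:
    full = round(smoothed_idf(term, toy_corpus), 3)
    local = round(smoothed_idf(term, toy_neighbours), 3)
    print(term, full, local)
```

The same term gets a different weight depending on which documents surround it, which is the reason for fitting a separate vectorizer on the similar recipes.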

The steps we need to take are:

1. Separate the $16$ recipes most similar to our list of ingredients
   1. We have lots of representations of our recipes, but `recipes` (list of strings) might be the easiest one to use here
2. Create a new `TfidfVectorizer` and encode the $16$ recipes
3. Sum the resulting vectors to get overall importance scores for each ingredient/token
4. Convert resulting vector to a list using `A.tolist()[0]`
5. `argsort` the importance scores to get sequence of ingredient indices ordered from most to least important
6. Find the most important ingredient that isn't on the ingredient list
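
For comparison, the simpler counting approach mentioned earlier could look roughly like this. It is a sketch that works on lists of ingredients rather than `tf-idf` vectors; `most_common_new_ingredient` and the toy data are hypothetical.

```python
from collections import Counter

def most_common_new_ingredient(similar_recipes, current_ingredients):
    """Most frequent ingredient across similar_recipes that is not already in use."""
    counts = Counter()
    for recipe in similar_recipes:
        counts.update(set(recipe))  # count each ingredient once per recipe
    for ingredient, _ in counts.most_common():
        if ingredient not in current_ingredients:
            return ingredient
    return None  # every ingredient is already on the list

toy_similar = [
    ["salt", "flour", "butter"],
    ["flour", "butter", "sugar"],
    ["flour", "butter", "egg"],
]
print(most_common_new_ingredient(toy_similar, {"flour"}))
```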

In [90]:
# TODO: Get 16 most similar recipes
sxten_recpies = []
for i in idxs[:16]:
    sxten_recpies.append(ingredients[i])
print("First", sxten_recpies)

# TODO: Encode the 16 recipes
sxten_vec = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=10_000)
sxten_df = sxten_vec.fit_transform(sxten_recpies)

# TODO: Sum the recipe vectors by column to get ingredient importance scores
ing_score = sxten_df.sum(axis=0)
print("this is ing Score",ing_score)

# TODO: Convert sparse vector to regular list with A.tolist()[0]
vec_list = ing_score.tolist()[0]
print("Regular List",vec_list)

# TODO: argsort the importance scores
sxten_im = argsort(vec_list, reverse=True)  # reverse=True puts the highest scores first (for distances we wanted the smallest first)
print("Indices", sxten_im)

# TODO: Find most important ingredient not yet on the list of ingredients
vocab = sxten_vec.get_feature_names_out()
print(vocab)

seed_ingredients = recipe_seed_str.split()  # compare whole words, not substrings
for idx in sxten_im:
    ingredient = vocab[idx]
    if ingredient not in seed_ingredients:
        print("not in the list", ingredient)
    else: 
        print("in the list", ingredient)

First ['salt flour yogurt oil ghee', 'salt cake flour baking soda buttermilk flour', 'salt pistachios butter flour cake flour cherries sugar vanilla', 'salt butter bread flour egg yeast sugar flour', 'salt wheat flour honey olive oil yeast water flour', 'kimchi salt pork canola oil egg rice flour scallions flour', 'lemon butter egg apples rice flour sugar flour', 'salt butter bread flour egg yeast sugar milk flour', 'wheat flour salt cumin seed flour water ghee', 'wheat flour salt honey black pepper warm water tomato sauce basil corn flour oregano olive oil yeast rosemary flour', 'cream cheese salt butter corn flour egg vanilla sugar milk flour', 'wheat flour salt rice ginger cinnamon peanut oil nutmeg paprika flour', 'butter egg tea sugar flour', 'butter vanilla sugar milk flour', 'salt butter black pepper crab flour', 'salt vegetables egg baking powder shrimp rice flour cold water oil flour']
this is ing Score [[5.11735017 4.31274261 5.46509298 4.41377885]]
Regular List [5.1173501678

### 5. Add ingredient to recipe

This is simply adding a word to `recipe_seed_str`

In [None]:
# TODO: add the first important ingredient to list of ingredients
for idx in sxten_im:
    ingredient = vocab[idx]
    if ingredient not in recipe_seed_str.split():
        recipe_seed_str = f'{recipe_seed_str} {ingredient}'
        break  # stop after adding the first ingredient that is new
recipe_seed_str

### 6. Repeat (Optional)

Now we can repeat this process until we get an empty list of important ingredients: 
1. Encode current recipe
2. Find similar recipes
3. Find important ingredients
4. Add important ingredient

Might be helpful to define a couple of functions, like `find_similar_recipes()` and `find_important_ingredients()`...

Only do this step if you're really curious about experimenting with generating unconventional ingredient lists. It's not going to be graded.

In [104]:
# TODO: Create find_similar_recipes(recipes, seed, vectorizer)
def find_similar_recipes(recipes, seed, vectorizer):
    ## Vectorizing (note: this refits the vectorizer on every call)
    recipes_vct = vectorizer.fit_transform(recipes)
    seed_vec = vectorizer.transform([seed])
    
    ## Distance from the seed to each vector
    distance = []
    for vector in recipes_vct:
        difference = seed_vec - vector
        difference_squared = difference.multiply(difference)
        squared_sum = difference_squared.sum()
        sum_square_root = (squared_sum)**0.5
        distance.append(sum_square_root)
    
    ## Sort recipe indices from nearest to farthest from the seed vector
    indices = argsort(distance)

    ## Keep the 15 most similar recipes
    similar_recipes = []
    for index in indices[:15]:
        recipe = recipes[index]
        similar_recipes.append(recipe)

    return similar_recipes

# TODO: Create find_important_ingredients(recipes)
def find_important_ingredients(recipes):
    ## Encoding
    vectorizer = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=30_000, ngram_range=(1,2))
    recipes_vec = vectorizer.fit_transform(recipes)

    ## Score for each ingredient in the sparse matrix
    ingredients_score = recipes_vec.sum(axis=0)

    ## Sparse score vector to regular list
    regular_list = ingredients_score.tolist()[0]

    ## Sort ingredient indices from most to least important
    indicis = argsort(regular_list, reverse=True)

    ## The actual words
    vocab = vectorizer.get_feature_names_out()
    
    top_ingredients = []
    for index in indicis[:10]:
        ingredient = vocab[index]
        top_ingredients.append(ingredient)

    return top_ingredients

In [107]:
all_recipes = ingredients 

In [108]:
vectorizer = TfidfVectorizer(stop_words="english", min_df=5, max_df=0.75, max_features=30_000 , ngram_range=(1, 2))

In [109]:
seed = "flour"

In [110]:
# TODO: Create recipe by repeating calls to find_similar_recipes() and find_important_ingredients()
seed_set = set(seed.split())

for _ in range(25):
    recipes = find_similar_recipes(all_recipes, seed, vectorizer) ## 15 most similar recipes
    top_ingredient = find_important_ingredients(recipes) ## top 10 ingredients

    added = False 
    for ingredient in top_ingredient:
        if ingredient not in seed_set:    
            seed += f' {ingredient}'
            seed_set.add(ingredient)
            print(f'{ingredient} was added to your recipe')
            added = True
            break
    if not added: 
        print("No new ingredients found")
        break

print(f'MY FINAL RECIPE: {seed}')

butter was added to your recipe
sugar was added to your recipe
sugar flour was added to your recipe
salt was added to your recipe
egg was added to your recipe
salt egg was added to your recipe
vanilla was added to your recipe
egg vanilla was added to your recipe
milk was added to your recipe
sugar milk was added to your recipe
vanilla sugar was added to your recipe
milk flour was added to your recipe
salt butter was added to your recipe
No new ingredients found
MY FINAL RECIPE: flour butter sugar sugar flour salt egg salt egg vanilla egg vanilla milk sugar milk vanilla sugar milk flour salt butter


Not bad! We will have a very sweet sugar and vanilla cake, maybe with a bit of salt!