# Exploratory Data Analysis of Epicurious Scrape in a JSON file

This is an idealized workflow for Aaron Chen in looking at data science problems. It likely isn't the best path, nor has he rigidly applied or stuck to this ideal, but he wishes that he worked this way more frequently.

## Purpose: Work through some exploratory data analysis of the Epicurious scrape on stream. Try to write some functions to help process the data.

### Author: Aaron Chen


---

### If needed, run shell commands here

In [2]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.2.0/en_core_web_sm-3.2.0-py3-none-any.whl (13.9 MB)
You should consider upgrading via the '/home/awchen/Repos/Projects/MeaLeon/.venv/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


---

## External Resources

List out references or documentation that has helped you with this notebook

### Code
Regex Checker: https://regex101.com/

#### Scikit-learn
1. https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation-lda
2. 

### Data

For this notebook, the data is stored in the repo base folder/data/raw

### Process

Are there steps or tutorials you are following? Those are things I try to list in Process

___

## Import necessary libraries

In [16]:
from datetime import datetime
# import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import spacy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
# from spacy.lemmatizer import Lemmatizer
from tqdm import tqdm
from typing import Any

---

## Define helper functions

My workflow is to try things with code cells, then when the code cells get messy and repetitive, to convert into helper functions that can be called.

When the helper functions are getting used a lot, it is usually better to convert them to scripts or classes that can be called/instantiated

In [None]:
# def remove_empties(deficiency_text: list) -> list:
#     """Remove empty strings from a list of strings.
#
#     list.remove('') only drops the first match and raises a ValueError when the
#     list contains no empty string, so the list is filtered instead."""
#
#     filtered = [text for text in deficiency_text if text != '']
#
#     return filtered

In [None]:
# def lemmatizer(doc):
#     # This takes in a doc of tokens from the NER and lemmatizes them.
#     # Pronouns (like "I" and "you") get lemmatized to '-PRON-', so I'm removing those.
#     doc = [token.lemma_ for token in doc if token.lemma_ != '-PRON-']
#     doc = u' '.join(doc)
#     return nlp.make_doc(doc)

In [None]:
# def remove_stopwords(doc):
#     # This will remove stopwords and punctuation.
#     # Use token.text to return strings, which we'll need for Gensim.
#     doc = [token for token in doc if token.is_stop != True and token.is_punct != True]
#     return doc

In [None]:
# nlp.add_pipe(lemmatizer,name='lemmatizer',after='ner')
# nlp.add_pipe(remove_stopwords, name="stopwords", last=True)

### Import local script

I have started grouping local script imports with the library imports, placing them at the bottom of the list.

In [4]:
import project_path

import src.dataframe_preprocessor as dfpp

---

## Define global variables 
### Remember to refactor these out, not ideal

In [5]:
data_path = "../../data/recipes-en-201706/epicurious-recipes_m2.json"

---

## Running Commentary

1. I used numbered lists to keep track of things I noticed

### To Do

1. Try to determine consistency of nested data structures
   1. Is the photoData or number of things inside photoData the same from record to record
   2. What about for tag?

Data wasn't fully consistent but logic in helper function helped handle nulls

2. How to handle nulls?
   1. Author: filled in with "Missing Author Name"
   2. Tag: filled in with "Missing Cuisine"
3. ~~Convert pubDate to actual timestamp~~  
4. ~~Convert ScrapeDate to actual timestamp~~
   1. This was ignored as the datestamp was not useful (generally within minutes of the origin of UNIX time)
   
**5. Append new columns for relevant nested structures and unfold them**

6. Determine actual types of `ingredients` and `prepSteps`
7. Continue working through test example of single recipe to feed into spaCy and then sklearn.feature_extraction.text stack
8. Will need to remove numbers, punctuation
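
A note on item 4: the raw `dateCrawled` column holds integers such as 1498549640. One possible explanation for the dates landing near the origin of UNIX time (an assumption, not something checked against the scraper) is that these integers are seconds since the epoch but were parsed in a finer unit such as nanoseconds. A minimal sketch using only the standard library:

```python
from datetime import datetime, timezone

# A value taken from the raw dateCrawled column
raw_date_crawled = 1498549640

# Interpreted as seconds since the UNIX epoch
as_seconds = datetime.fromtimestamp(raw_date_crawled, tz=timezone.utc)

# The same integer interpreted as nanoseconds since the epoch (assumed failure mode)
as_nanoseconds = datetime.fromtimestamp(raw_date_crawled / 1e9, tz=timezone.utc)

print(as_seconds.isoformat())      # 2017-06-27T07:47:20+00:00
print(as_nanoseconds.year)         # 1970
```

If the seconds interpretation holds, `pd.to_datetime(raw_data['dateCrawled'], unit='s')` would be the pandas equivalent. The June 2017 result lines up with the `recipes-en-201706` folder name, which makes this reading plausible but not proven.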

---

## Importing and viewing the data as a dataframe

In [47]:
repo = pd.read_json(path_or_buf=data_path) # type:ignore

dfpp.preprocess_dataframe(df=repo) # type:ignore
print(repo.shape)
repo.head(10) # type:ignore

(34756, 14)


Unnamed: 0,id,dek,hed,aggregateRating,ingredients,prepSteps,reviewsCount,willMakeAgainPct,cuisine_name,photo_filename,photo_credit,author_name,date_published,recipe_url
0,54a2b6b019925f464b373351,How does fried chicken achieve No. 1 status? B...,Pickle-Brined Fried Chicken,3.11,"[1 tablespoons yellow mustard seeds, 1 tablesp...",[Toast mustard and coriander seeds in a dry me...,7,100,Missing Cuisine,51247610_fried-chicken_1x1.jpg,Michael Graydon and Nikole Herriott,Missing Author Name,2014-08-19 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
1,54a408a019925f464b3733bc,Spinaci all'Ebraica,Spinach Jewish Style,3.22,"[3 pounds small-leaved bulk spinach, Salt, 1/2...",[Remove the stems and roots from the spinach. ...,5,80,Italian,EP_12162015_placeholders_rustic.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Edda Servi Machlin,2008-09-09 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
2,54a408a26529d92b2c003631,"This majestic, moist, and richly spiced honey ...",New Year’s Honey Cake,3.62,"[3 1/2 cups all-purpose flour, 1 tablespoon ba...",[I like this cake best baked in a 9-inch angel...,105,88,Jewish,EP_09022015_honeycake-2.jpg,"Photo by Chelsea Kyle, Food Styling by Anna St...",Marcy Goldman,2008-09-10 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
3,54a408a66529d92b2c003638,The idea for this sandwich came to me when my ...,The B.L.A.Bagel with Lox and Avocado,4.0,"[1 small ripe avocado, preferably Hass (see No...","[A short time before serving, mash avocado and...",7,100,Jewish,EP_12162015_placeholders_casual.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Faye Levy,2008-09-08 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
4,54a408a719925f464b3733cc,"In 1930, Simon Agranat, the chief justice of t...",Shakshuka a la Doktor Shakshuka,2.71,"[2 pounds fresh tomatoes, unpeeled and cut in ...","[1. Place the tomatoes, garlic, salt, paprika,...",7,83,Jewish,EP_12162015_placeholders_formal.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Joan Nathan,2008-09-09 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
5,54a408a919925f464b3733d3,Although Nelly Custis omitted sugar in her rec...,Rice Pancakes,0.0,"[1 1/2 cups cooked rice, 2 cups heavy cream, 2...","[1. Combine the rice, cream, and butter. Add t...",0,0,Missing Cuisine,EP_12162015_placeholders_formal.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Stephen A. McLeod,2012-02-17 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
6,54a408aa19925f464b3733d6,Editor's note: This recipe is adapted with per...,Jack-O'-Lantern,1.0,"[2 tablespoons shortening, 2 tablespoons flour...",[1. Preheat the oven to 350°F. Lightly grease ...,1,0,Missing Cuisine,350068.jpg,Jennifer Newberry Mead,Matthew Mead,2008-09-09 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
7,54a408ab19925f464b3733da,Editor's note: This recipe is reprinted with p...,Seven-Minute Frosting,3.53,"[1 1/2 cups sugar, 1/3 cup cold water, 2 egg w...","[1. Combine the sugar, water, egg whites, and ...",8,75,Missing Cuisine,EP_12162015_placeholders_bright.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Matthew Mead,2008-09-09 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
8,54a408ac19925f464b3733de,Editor's note: This recipe is reprinted with p...,Creamy White Frosting,2.0,"[1 cup vegetable shortening, 1 1/2 teaspoons v...","[1. With a mixer on medium speed, beat togethe...",5,0,Missing Cuisine,EP_12162015_placeholders_casual.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Matthew Mead,2008-09-09 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
9,54a408ac6529d92b2c003653,Editor's note: This recipe is reprinted with p...,Host of Ghosts,3.17,"[One purchased 9-inch angel food cake, 2 recip...",[1. Place the cake on the cake plate. Reserve ...,12,100,Missing Cuisine,350067.jpg,Jennifer Newberry Mead,Matthew Mead,2008-09-09 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
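
`dfpp.preprocess_dataframe` is a local helper whose source is not shown in this notebook. As a rough sketch of the kind of unfolding and null filling it appears to perform, the example below assumes the raw `author` field is a list of `{'name': ...}` dicts and the raw `tag` field is a dict with `category` and `name` keys. The function names and the rule "only a `cuisine` tag yields a cuisine name" are hypothetical, chosen to match the placeholder values seen in the output above.

```python
from typing import Any, Dict, List

MISSING_AUTHOR = "Missing Author Name"
MISSING_CUISINE = "Missing Cuisine"


def extract_author_name(author: Any) -> str:
    """Return the first author's name, or a placeholder if the list is empty or malformed."""
    if isinstance(author, list) and author and isinstance(author[0], dict):
        return author[0].get("name", MISSING_AUTHOR)
    return MISSING_AUTHOR


def extract_cuisine_name(tag: Any) -> str:
    """Return the tag name only when the tag describes a cuisine (assumed rule)."""
    if isinstance(tag, dict) and tag.get("category") == "cuisine":
        return tag.get("name", MISSING_CUISINE)
    return MISSING_CUISINE


# Hypothetical records shaped like the raw JSON
records: List[Dict[str, Any]] = [
    {"author": [{"name": "James Beard"}], "tag": {"category": "ingredient", "name": "Pork"}},
    {"author": [], "tag": {"category": "cuisine", "name": "Irish"}},
    {"author": [], "tag": None},
]

unfolded = [
    {
        "author_name": extract_author_name(record["author"]),
        "cuisine_name": extract_cuisine_name(record["tag"]),
    }
    for record in records
]

for row in unfolded:
    print(row)
```

Keeping the placeholder strings as named constants makes it easy to filter those rows out again later.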


In [7]:
repo_greds = repo['ingredients']
repo_greds

0        [1 tablespoons yellow mustard seeds, 1 tablesp...
1        [3 pounds small-leaved bulk spinach, Salt, 1/2...
2        [3 1/2 cups all-purpose flour, 1 tablespoon ba...
3        [1 small ripe avocado, preferably Hass (see No...
4        [2 pounds fresh tomatoes, unpeeled and cut in ...
                               ...                        
34751    [1 tablespoon unsalted butter, at room tempera...
34752    [8 tablespoons (1 stick) salted butter, at roo...
34753    [3 tablespoons unsalted butter, plus more for ...
34754    [Coarse salt, 2 lime wedges, 2 ounces tomato j...
34755    [1 bottle (375 ml) sour beer, such as Almanac ...
Name: ingredients, Length: 34756, dtype: object

In [8]:
all_recipes_list = repo_greds.str.join(" ")
all_recipes_list

0        1 tablespoons yellow mustard seeds 1 tablespoo...
1        3 pounds small-leaved bulk spinach Salt 1/2 cu...
2        3 1/2 cups all-purpose flour 1 tablespoon baki...
3        1 small ripe avocado, preferably Hass (see Not...
4        2 pounds fresh tomatoes, unpeeled and cut in q...
                               ...                        
34751    1 tablespoon unsalted butter, at room temperat...
34752    8 tablespoons (1 stick) salted butter, at room...
34753    3 tablespoons unsalted butter, plus more for g...
34754    Coarse salt 2 lime wedges 2 ounces tomato juic...
34755    1 bottle (375 ml) sour beer, such as Almanac C...
Name: ingredients, Length: 34756, dtype: object

In [12]:
nlp = spacy.load("en_core_web_sm")
first_rec:str = all_recipes_list[0] 
print(type(first_rec))
# first_rec will be Text

<class 'str'>


In [13]:
doc_first_rec = nlp(first_rec)
for token in doc_first_rec:
    # Skip number-like tokens such as "1" or "1/2"
    if not token.like_num:
        print(token.text, token.pos_, token.ent_type_, token.lemma_, token.is_digit)

tablespoons NOUN  tablespoon False
yellow ADJ  yellow False
mustard NOUN  mustard False
seeds NOUN  seed False
tablespoons NOUN  tablespoon False
brown ADJ  brown False
mustard NOUN  mustard False
seeds NOUN  seed False
teaspoons NOUN  teaspoon False
coriander NOUN  coriander False
seeds NOUN  seed False
cup NOUN QUANTITY cup False
apple NOUN  apple False
cider NOUN  cider False
vinegar NOUN  vinegar False
cup NOUN  cup False
kosher ADJ  kosher False
salt NOUN  salt False
cup NOUN  cup False
sugar NOUN  sugar False
cup NOUN  cup False
chopped VERB  chop False
fresh ADJ  fresh False
dill NOUN  dill False
skinless NOUN  skinless False
, PUNCT  , False
boneless NOUN  boneless False
chicken NOUN  chicken False
thighs NOUN  thigh False
( PUNCT  ( False
about ADV QUANTITY about False
pounds NOUN QUANTITY pound False
) PUNCT  ) False
, PUNCT  , False
halved VERB  halve False
, PUNCT  , False
quartered VERB  quarter False
if SCONJ  if False
large ADJ  large False
Vegetable ADJ  vegetable False

In [14]:
type(STOP_WORDS)

set

In [18]:
def custom_preprocessor(recipe_ingreds: str) -> list:
    """This function replaces the default sklearn CountVectorizer preprocessor with a spaCy pass.

    sklearn CountVectorizer's default preprocessor only performs accent removal and lowercasing.
    Supplying a custom preprocessor overrides that stage, so this function does neither.

    Args:
        recipe_ingreds: a string representing the ingredients used in a recipe

    Returns:
        A list of spaCy tokens to be passed on to the tokenizer (custom_lemmatizer)
    """
    preprocessed = list(nlp(recipe_ingreds))

    return preprocessed

In [20]:
def custom_lemmatizer(ingredients: list) -> list:
    """This takes in the spaCy tokens produced by custom_preprocessor and returns their lemmas.

    Only alphabetic tokens are kept, which removes punctuation and numbers, and
    single-character lemmas are dropped.
    The '-PRON-' check is a holdover from spaCy v2, which lemmatized pronouns
    (like "I" and "you") to '-PRON-'. spaCy v3 no longer does this, so the check is harmless.

    Args:
        ingredients: list of spaCy tokens for one document

    Returns:
        A list of lemma strings
    """
    lemmas = [token.lemma_ for token in ingredients if (token.is_alpha and token.lemma_ != "-PRON-" and len(token.lemma_) > 1)]
    return lemmas

In [22]:
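# A callable preprocessor overrides the strip_accents and lowercase stage, so those two arguments have no effect here
cv = CountVectorizer(strip_accents='unicode', lowercase=True, preprocessor=custom_preprocessor, tokenizer=custom_lemmatizer, stop_words='english', ngram_range=(1,4))

# The slice [0:4] selects four recipes (indices 0 to 3), not five
first_four_recipe_repo = all_recipes_list[0:4]
repo_transformed = cv.fit_transform(first_four_recipe_repo)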

In [23]:
cv.get_feature_names_out()

array(['Dash', 'Dash nutmeg', 'Hass', 'Hass note', 'Hass note teaspoon',
       'Hass note teaspoon fresh', 'Honey', 'Honey flaky',
       'Honey flaky sea', 'Honey flaky sea salt', 'Kosher', 'Kosher salt',
       'Kosher salt Honey', 'Kosher salt Honey flaky', 'Maldon',
       'Maldon toast', 'Maldon toast benne', 'Maldon toast benne sesame',
       'Salt', 'Salt cup', 'Salt cup dark', 'Salt cup dark seedless',
       'Salt freshly', 'Salt freshly ground', 'Salt freshly ground black',
       'allspice', 'allspice cup', 'allspice cup vegetable',
       'allspice cup vegetable oil', 'almond', 'almond optional', 'apple',
       'apple cider', 'apple cider vinegar', 'apple cider vinegar cup',
       'avocado', 'avocado preferably', 'avocado preferably Hass',
       'avocado preferably Hass note', 'bagel', 'bagel slice',
       'bagel slice strip', 'bagel slice strip lox', 'bake',
       'bake powder', 'bake powder teaspoon', 'bake powder teaspoon bake',
       'bake soda', 'bake soda teas

1. The n-grams show that flattening each recipe's list of ingredients into one long string is not a great idea: it produces n-grams such as "buttermilk cup purpose", which carry little meaning. The original repo['ingredients'] stores each recipe's ingredients as a list of strings for each record. Flattening probably works fine for collecting the overall vocabulary, but it doesn't help if we want each 'ingredient' in the list to be a 'sentence' in a 'document'. For example, 'apple cider vinegar' is useful, but 'apple cider vinegar cup' has picked up the first word of the next ingredient.


2. Need to add cooking specific stopwords:
   1. Units (tablespoon, dash, cup)
   2. Colors alone likely not helpful, but used in n-grams probably helpful
   3. Adjectives?


3. Lemma vs word itself
   1. Lemma might shrink the overall vocabulary, but we see problems with things like "baking soda" becoming "bake soda" which doesn't make as much sense to a user
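
For the cooking-specific stop words in item 2: spaCy's `STOP_WORDS` is a plain Python set, so it can be extended with a set union. The sketch below uses a small stand-in set so that it runs without spaCy; the unit list and the function name are hypothetical and would need to be tuned against the real vocabulary.

```python
from typing import List

# Hypothetical unit words; extend after inspecting the real vocabulary
COOKING_UNITS = {"tablespoon", "teaspoon", "cup", "dash", "pound", "ounce"}

# Stand-in for spaCy's STOP_WORDS so the sketch runs without spaCy installed
BASE_STOP_WORDS = {"a", "the", "or", "if", "for", "as", "such", "about"}

ALL_STOP_WORDS = BASE_STOP_WORDS | COOKING_UNITS


def drop_stop_words(lemmas: List[str]) -> List[str]:
    """Remove stop words, comparing case-insensitively so 'Dash' matches 'dash'."""
    return [lemma for lemma in lemmas if lemma.lower() not in ALL_STOP_WORDS]


lemmas = ["cup", "apple", "cider", "vinegar", "Dash", "nutmeg"]
print(drop_stop_words(lemmas))  # ['apple', 'cider', 'vinegar', 'nutmeg']
```

Color words are deliberately left out of the set, since the note above suggests they are still useful inside n-grams such as "brown mustard seed".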

Each recipe is a list of ingredients (strings).

Each ingredient is kind of like a sentence, with the commas between list elements acting as the sentence boundaries.

This implies that each recipe is kind of like a document, since a document is a collection of sentences.

We need to figure out a way to extract n-grams of length 1 to 4 from each ingredient separately, across the entire list of recipes (the entire list of recipes is the corpus).
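
To make the boundary problem concrete, here is a minimal pure-Python sketch that compares n-grams built from a flattened recipe with n-grams built one ingredient at a time. The helper `word_ngrams` and the two cleaned-up ingredient strings are illustrative assumptions, not part of the notebook's pipeline.

```python
from typing import List


def word_ngrams(tokens: List[str], n_min: int = 1, n_max: int = 4) -> List[str]:
    """Return all contiguous n-grams of length n_min..n_max, joined by spaces."""
    return [
        " ".join(tokens[start:start + n])
        for n in range(n_min, n_max + 1)
        for start in range(len(tokens) - n + 1)
    ]


# Hypothetical, already-cleaned ingredients (quantities removed) from one recipe
recipe = ["cup apple cider vinegar", "cup kosher salt"]

# Flattened: the recipe is one long string, so n-grams cross ingredient boundaries
flattened = word_ngrams(" ".join(recipe).split())

# Per ingredient: each ingredient is treated as its own "sentence"
per_ingredient = [
    gram for ingredient in recipe for gram in word_ngrams(ingredient.split())
]

print("apple cider vinegar cup" in flattened)       # True
print("apple cider vinegar cup" in per_ingredient)  # False
print("apple cider vinegar" in per_ingredient)      # True
```

Passing each ingredient to the vectorizer as its own document gives the same boundary behavior, at the cost of losing the grouping of ingredients into recipes.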

In [34]:
# repo['ingredients'].tolist()
single_rep = repo['ingredients'].tolist()[0]
print(single_rep)

['1 tablespoons yellow mustard seeds', '1 tablespoons brown mustard seeds', '1 1/2 teaspoons coriander seeds', '1 cup apple cider vinegar', '2/3 cup kosher salt', '1/3 cup sugar', '1/4 cup chopped fresh dill', '8 skinless, boneless chicken thighs (about 3 pounds), halved, quartered if large', 'Vegetable oil (for frying; about 10 cups)', '2 cups buttermilk', '2 cups all-purpose flour', 'Kosher salt', 'Honey, flaky sea salt (such as Maldon), toasted benne or sesame seeds, hot sauce (for serving)', 'A deep-fry thermometer']


In [35]:
single_rep[0]

'1 tablespoons yellow mustard seeds'

In [39]:
for recipe in repo['ingredients'].tolist():
    for ingred in recipe:
        print(recipe, ingred)

In [32]:
recipe_megalist = [ingred for recipe in repo['ingredients'].tolist() for ingred in recipe]

TypeError: 'float' object is not iterable

In [37]:
limited = [ingred for recipe in repo['ingredients'].tolist()[0:3] for ingred in recipe[0:3]]
print(limited)

['1 tablespoons yellow mustard seeds', '1 tablespoons brown mustard seeds', '1 1/2 teaspoons coriander seeds', '3 pounds small-leaved bulk spinach', 'Salt', '1/2 cup dark seedless raisins', '3 1/2 cups all-purpose flour', '1 tablespoon baking powder', '1 teaspoon baking soda']


In [31]:
recipe_megalist

[['1 tablespoons yellow mustard seeds',
  '1 tablespoons brown mustard seeds',
  '1 1/2 teaspoons coriander seeds',
  '1 cup apple cider vinegar',
  '2/3 cup kosher salt',
  '1/3 cup sugar',
  '1/4 cup chopped fresh dill',
  '8 skinless, boneless chicken thighs (about 3 pounds), halved, quartered if large',
  'Vegetable oil (for frying; about 10 cups)',
  '2 cups buttermilk',
  '2 cups all-purpose flour',
  'Kosher salt',
  'Honey, flaky sea salt (such as Maldon), toasted benne or sesame seeds, hot sauce (for serving)',
  'A deep-fry thermometer'],
 ['3 pounds small-leaved bulk spinach',
  'Salt',
  '1/2 cup dark seedless raisins',
  '1 cup lukewarm water',
  '6 tablespoons olive oil',
  '1/2 small onion, minced',
  '1/4 cup pignoli (pine nuts)',
  'Freshly ground black pepper',
  'Dash nutmeg'],
 ['3 1/2 cups all-purpose flour',
  '1 tablespoon baking powder',
  '1 teaspoon baking soda',
  '1/2 teaspoon salt',
  '4 teaspoons ground cinnamon',
  '1/2 teaspoon ground cloves',
  '1/2 tea

In [40]:
repo['ingredients'].isna().sum()

100

In [41]:
repo[repo['ingredients'].isna()]

Unnamed: 0,id,dek,hed,aggregateRating,ingredients,prepSteps,reviewsCount,willMakeAgainPct,cuisine_name,photo_filename,photo_credit,author_name,date_published,recipe_url
757,54a40dbe19925f464b37426b,,Opening a Fresh Coconut,3.00,,[Heat the oven to 375°. Pierce the two or thre...,2,50,Missing Cuisine,EP_12162015_placeholders_rustic.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Missing Author Name,2004-08-20 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
1443,54a412f919925f464b375208,,Cracking and Grating Coconut,3.00,,[Extracting the meat from a coconut is not as ...,1,0,Missing Cuisine,EP_12162015_placeholders_rustic.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Missing Author Name,2004-08-20 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
1498,54a413496529d92b2c00552a,,To Carve a Rib Roast,2.33,,"[For all its grandeur, a standing rib roast is...",4,100,Missing Cuisine,EP_12162015_placeholders_formal.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Missing Author Name,2004-08-20 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
1560,54a413dc19925f464b375434,,To Quick-Roast and Peel Chilies or Peppers,2.00,,"[Broiler method:, Lay chilies or peppers, on t...",3,50,Missing Cuisine,EP_12162015_placeholders_formal.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Missing Author Name,2004-08-20 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
1562,54a413de6529d92b2c00567c,,Roast Smoked Loin of Pork,1.00,,[Smoked pork loin is a great delicacy. When pu...,0,0,Missing Cuisine,EP_12162015_placeholders_casual.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",James Beard,2004-08-20 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32394,54a47f5b6529d92b2c02cf42,Traditionally tortillas are heated a few at a ...,To Warm Tortillas,2.00,,[Wrap stacks of 8 tortillas in foil and chill ...,2,100,Missing Cuisine,EP_12162015_placeholders_casual.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Missing Author Name,2004-08-20 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
32579,5509b22b3f2f10b2690881ef,This California twist on the corned beef and c...,Suzanne Goin's Corned Beef and Cabbage with Pa...,0.00,,[],0,0,Irish,EP_12162015_placeholders_bright.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Missing Author Name,2015-03-11 00:00:00+00:00,https://www.epicurious.com/suzanne-goin-s-corn...
32784,555ba627644d45515b7586f5,"Making tomato water sounds fancy, but all you'...","Seared Scallops with Tomato Water, Lime, and Mint",0.00,,[],0,0,Missing Cuisine,EP_12162015_placeholders_rustic.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...","Roberta's in Brooklyn chef, Carlo Mirarchi",2014-07-10 17:08:48+00:00,https://www.epicurious.com/recipes/seared-scal...
33156,560311e408929a1609a0c8f7,,reserve this recipe id for future use,0.00,,[],0,0,Missing Cuisine,EP_12162015_placeholders_bright.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Missing Author Name,2014-04-21 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...


In [43]:
missing_ingreds_indices = repo[repo['ingredients'].isna()].index.tolist()

In [44]:
raw_data = pd.read_json(path_or_buf=data_path)
missing_from_raw_data = raw_data.loc[missing_ingreds_indices]
missing_from_raw_data

Unnamed: 0,id,dek,hed,pubDate,author,type,url,photoData,tag,aggregateRating,ingredients,prepSteps,reviewsCount,willMakeAgainPct,dateCrawled
757,54a40dbe19925f464b37426b,,Opening a Fresh Coconut,2004-08-20T04:00:00.000Z,[],recipe,/recipes/food/views/opening-a-fresh-coconut-10...,"{'id': '56746182accb4c9831e45e0a', 'filename':...","{'category': 'ingredient', 'name': 'Coconut', ...",3.00,,[Heat the oven to 375°. Pierce the two or thre...,2,50,1498549640
1443,54a412f919925f464b375208,,Cracking and Grating Coconut,2004-08-20T04:00:00.000Z,[],recipe,/recipes/food/views/cracking-and-grating-cocon...,"{'id': '56746182accb4c9831e45e0a', 'filename':...","{'category': 'occasion', 'name': 'Winter', 'ur...",3.00,,[Extracting the meat from a coconut is not as ...,1,0,1498549396
1498,54a413496529d92b2c00552a,,To Carve a Rib Roast,2004-08-20T04:00:00.000Z,[],recipe,/recipes/food/views/to-carve-a-rib-roast-15825,"{'id': '56746183b47c050a284a4e15', 'filename':...","{'category': 'ingredient', 'name': 'Beef', 'ur...",2.33,,"[For all its grandeur, a standing rib roast is...",4,100,1498549540
1560,54a413dc19925f464b375434,,To Quick-Roast and Peel Chilies or Peppers,2004-08-20T04:00:00.000Z,[],recipe,/recipes/food/views/to-quick-roast-and-peel-ch...,"{'id': '56746183b47c050a284a4e15', 'filename':...","{'category': 'ingredient', 'name': 'Pepper', '...",2.00,,"[Broiler method:, Lay chilies or peppers, on t...",3,50,1498549536
1562,54a413de6529d92b2c00567c,,Roast Smoked Loin of Pork,2004-08-20T04:00:00.000Z,[{'name': 'James Beard'}],recipe,/recipes/food/views/roast-smoked-loin-of-pork-...,"{'id': '5674617e47d1a28026045e4f', 'filename':...","{'category': 'ingredient', 'name': 'Pork', 'ur...",1.00,,[Smoked pork loin is a great delicacy. When pu...,0,0,1498549536
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32394,54a47f5b6529d92b2c02cf42,Traditionally tortillas are heated a few at a ...,To Warm Tortillas,2004-08-20T04:00:00.000Z,[],recipe,/recipes/food/views/to-warm-tortillas-14142,"{'id': '5674617e47d1a28026045e4f', 'filename':...","{'category': 'technique', 'name': 'Bake', 'url...",2.00,,[Wrap stacks of 8 tortillas in foil and chill ...,2,100,1498548444
32579,5509b22b3f2f10b2690881ef,This California twist on the corned beef and c...,Suzanne Goin's Corned Beef and Cabbage with Pa...,2015-03-11T00:00:00.000Z,[],recipe,/suzanne-goin-s-corned-beef-and-cabbage-with-p...,"{'id': '5674617eb47c050a284a4e11', 'filename':...","{'category': 'cuisine', 'name': 'Irish', 'url'...",0.00,,[],0,0,1498546948
32784,555ba627644d45515b7586f5,"Making tomato water sounds fancy, but all you'...","Seared Scallops with Tomato Water, Lime, and Mint",2014-07-10T17:08:48.000Z,"[{'name': 'Roberta's in Brooklyn chef, Carlo M...",recipe,/recipes/seared-scallops-with-tomato-water-lim...,"{'id': '56746182accb4c9831e45e0a', 'filename':...","{'category': 'ingredient', 'name': 'Scallop', ...",0.00,,[],0,0,1498547038
33156,560311e408929a1609a0c8f7,,reserve this recipe id for future use,2014-04-21T04:00:00.000Z,[],recipe,/recipes/food/views/reserve-this-recipe-id-for...,"{'id': '5674617eb47c050a284a4e11', 'filename':...","{'category': 'source', 'name': 'Bon Appétit', ...",0.00,,[],0,0,1498547066


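These recipes are all missing their ingredients, and that's why they return NaN! The value is not even an empty list. Something may have gone wrong with the scraper, although several of these entries (for example "To Carve a Rib Roast") are technique notes that might never have had an ingredient list. Either way, we have to filter these out to do NLP.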
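
The earlier `TypeError: 'float' object is not iterable` follows from this: pandas represents a missing value as NaN, which is a float, and a float cannot be looped over. A minimal sketch of a guard in plain Python, using a hypothetical three-recipe list in place of the real column:

```python
from typing import Any, List


def flatten_ingredients(recipes: List[Any]) -> List[str]:
    """Flatten per-recipe ingredient lists, skipping any recipe whose value is not a list."""
    return [
        ingredient
        for recipe in recipes
        if isinstance(recipe, list)
        for ingredient in recipe
    ]


# Hypothetical column values: two ingredient lists and one missing entry
column_values = [["1/3 cup sugar", "Kosher salt"], float("nan"), ["2 cups buttermilk"]]

print(type(column_values[1]).__name__)     # float
print(flatten_ingredients(column_values))  # ['1/3 cup sugar', 'Kosher salt', '2 cups buttermilk']
```

Dropping the rows from the dataframe first has the same effect on the flattened list and also keeps the remaining columns aligned.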

In [46]:
filtered_repo = repo.dropna(subset=['ingredients'])
filtered_repo

Unnamed: 0,id,dek,hed,aggregateRating,ingredients,prepSteps,reviewsCount,willMakeAgainPct,cuisine_name,photo_filename,photo_credit,author_name,date_published,recipe_url
0,54a2b6b019925f464b373351,How does fried chicken achieve No. 1 status? B...,Pickle-Brined Fried Chicken,3.11,"[1 tablespoons yellow mustard seeds, 1 tablesp...",[Toast mustard and coriander seeds in a dry me...,7,100,Missing Cuisine,51247610_fried-chicken_1x1.jpg,Michael Graydon and Nikole Herriott,Missing Author Name,2014-08-19 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
1,54a408a019925f464b3733bc,Spinaci all'Ebraica,Spinach Jewish Style,3.22,"[3 pounds small-leaved bulk spinach, Salt, 1/2...",[Remove the stems and roots from the spinach. ...,5,80,Italian,EP_12162015_placeholders_rustic.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Edda Servi Machlin,2008-09-09 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
2,54a408a26529d92b2c003631,"This majestic, moist, and richly spiced honey ...",New Year’s Honey Cake,3.62,"[3 1/2 cups all-purpose flour, 1 tablespoon ba...",[I like this cake best baked in a 9-inch angel...,105,88,Jewish,EP_09022015_honeycake-2.jpg,"Photo by Chelsea Kyle, Food Styling by Anna St...",Marcy Goldman,2008-09-10 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
3,54a408a66529d92b2c003638,The idea for this sandwich came to me when my ...,The B.L.A.Bagel with Lox and Avocado,4.00,"[1 small ripe avocado, preferably Hass (see No...","[A short time before serving, mash avocado and...",7,100,Jewish,EP_12162015_placeholders_casual.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Faye Levy,2008-09-08 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
4,54a408a719925f464b3733cc,"In 1930, Simon Agranat, the chief justice of t...",Shakshuka a la Doktor Shakshuka,2.71,"[2 pounds fresh tomatoes, unpeeled and cut in ...","[1. Place the tomatoes, garlic, salt, paprika,...",7,83,Jewish,EP_12162015_placeholders_formal.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Joan Nathan,2008-09-09 04:00:00+00:00,https://www.epicurious.com/recipes/food/views/...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34751,59541a31bff3052847ae2107,Buttering the bread before you waffle it ensur...,Waffled Ham and Cheese Melt with Maple Butter,0.00,"[1 tablespoon unsalted butter, at room tempera...","[Preheat the waffle iron on low., Spread a thi...",0,0,Missing Cuisine,waffle-ham-and-cheese-melt-062817.jpg,"Photo by Maes Studio, Inc.",Daniel Shumski,2017-06-29 14:59:01.368000+00:00,https://www.epicurious.com/recipes/food/views/...
34752,5954233ad52ca90dc28200e7,"Spread this easy compound butter on waffles, p...",Maple Butter,0.00,"[8 tablespoons (1 stick) salted butter, at roo...",[Combine the ingredients in a medium-size bowl...,0,0,Missing Cuisine,EP_12162015_placeholders_bright.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Daniel Shumski,2017-06-01 14:57:00+00:00,https://www.epicurious.com/recipes/food/views/...
34753,595424c2109c972493636f83,Leftover mac and cheese is not exactly one of ...,Waffled Macaroni and Cheese,0.00,"[3 tablespoons unsalted butter, plus more for ...",[Preheat the oven to 375°F. Butter a 9x5-inch ...,0,0,Missing Cuisine,waffle-mac-n-cheese-062816.jpg,"Photo by Maes Studio, Inc.",Daniel Shumski,2017-06-29 14:54:24.234000+00:00,https://www.epicurious.com/recipes/food/views/...
34754,5956638625dc3d1d829b7166,A classic Mexican beer cocktail you can sip al...,Classic Michelada,0.00,"[Coarse salt, 2 lime wedges, 2 ounces tomato j...",[Place about 1/4 cup salt on a small plate. Ru...,0,0,Missing Cuisine,Classic Michelada 07292017.jpg,,Kat Odell,2017-06-15 16:41:00+00:00,https://www.epicurious.com/recipes/food/views/...


In [48]:
recipe_megalist = [ingred for recipe in filtered_repo['ingredients'].tolist() for ingred in recipe]

In [49]:
recipe_megalist

['1 tablespoons yellow mustard seeds',
 '1 tablespoons brown mustard seeds',
 '1 1/2 teaspoons coriander seeds',
 '1 cup apple cider vinegar',
 '2/3 cup kosher salt',
 '1/3 cup sugar',
 '1/4 cup chopped fresh dill',
 '8 skinless, boneless chicken thighs (about 3 pounds), halved, quartered if large',
 'Vegetable oil (for frying; about 10 cups)',
 '2 cups buttermilk',
 '2 cups all-purpose flour',
 'Kosher salt',
 'Honey, flaky sea salt (such as Maldon), toasted benne or sesame seeds, hot sauce (for serving)',
 'A deep-fry thermometer',
 '3 pounds small-leaved bulk spinach',
 'Salt',
 '1/2 cup dark seedless raisins',
 '1 cup lukewarm water',
 '6 tablespoons olive oil',
 '1/2 small onion, minced',
 '1/4 cup pignoli (pine nuts)',
 'Freshly ground black pepper',
 'Dash nutmeg',
 '3 1/2 cups all-purpose flour',
 '1 tablespoon baking powder',
 '1 teaspoon baking soda',
 '1/2 teaspoon salt',
 '4 teaspoons ground cinnamon',
 '1/2 teaspoon ground cloves',
 '1/2 teaspoon ground allspice',
 '1 cup 

In [50]:
len(recipe_megalist)

341271

In [51]:
filtered_repo_transformed = cv.fit_transform(recipe_megalist)

In [52]:
cv.get_feature_names_out()

array(['Aalborg', 'Aalborg Linie', 'Aalborg Linie desire', ...,
       'ﬁnely chop dill', 'ﬁnely chop dill parsley', 'ﬂour'], dtype=object)

In [53]:
cv.get_feature_names_out().shape

(420290,)
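
The vocabulary above ends with 'ﬁnely' and 'ﬂour', spelled with single-character ligatures, and starts with capitalized entries such as 'Aalborg'. Both suggest the text never passed through Unicode normalization or lowercasing, which inflates the vocabulary with near-duplicates. A minimal sketch of a normalization step (the function name is hypothetical) that could be applied to each ingredient string before it reaches spaCy:

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Fold compatibility characters such as ligatures, strip accents, and lowercase."""
    # NFKD splits ligatures into plain letters and accents into base letter + combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.lower()


samples = [
    "\ufb01nely",  # 'finely' written with the U+FB01 ligature
    "\ufb02our",   # 'flour' written with the U+FB02 ligature
    "Kosher",
    "Appétit",
]

for raw in samples:
    print(repr(raw), "->", normalize_text(raw))
```

Whether to strip accents at all is a judgment call for recipe text, since it also changes words like "jalapeño".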