# Exploratory Data Analysis of Epicurious Scrape in a JSON file

This notebook follows Aaron Chen's idealized workflow for approaching data science problems. It is likely not the best path, and he has not applied it rigidly, but he wishes he worked this way more often.

## Purpose: Work through some exploratory data analysis of the Epicurious scrape on stream. Try to write some functions to help process the data.

### Author: Aaron Chen

---

### If needed, run shell commands here

In [None]:
#!python -m spacy download en_core_web_sm

---

## External Resources

List references or documentation that have helped you with this notebook.

### Code
Regex Checker: https://regex101.com/

#### Scikit-learn
1. https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation-lda


### Data

For this notebook, the data is stored in the repo base folder/data/raw

### Process

Are there steps or tutorials you are following? Those are things I try to list in Process

---

## Import necessary libraries

In [None]:
# | hide
import project_path
from datetime import datetime
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

import spacy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
from tqdm import tqdm
from typing import Dict
import unicodedata

# import local scripts
import src.dataframe_preprocessor as dfpp

---

## Define helper functions

My workflow is to try things in code cells and then, once the cells get messy and repetitive, convert them into helper functions that can be called.

When the helper functions are used a lot, it is usually better to convert them into scripts or classes that can be called or instantiated.

In [None]:
# | export
def preprocess_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """This function takes in a pandas DataFrame from pd.read_json and performs some preprocessing by unpacking the nested dictionaries and creating new columns with the simplified structures. It will then drop the original columns that would no longer be needed.

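    Args:
        df: DataFrame as returned by pd.read_json. It is modified in place.

    Returns:
        pd.DataFrame: the same DataFrame, with the flattened columns added and the nested originals dropped.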
    """

    def null_filler(to_check: Dict[str, str], key_target: str) -> str:
        """This function takes in a dictionary that is currently fed in with a lambda function and then performs column specific preprocessing.

        Args:
            to_check: dict
            key_target: str

        Returns:
            str
        """

        # Only look in the following keys, if the input isn't one of these, it should be recognized as an improper key
        valid_keys = ["name", "filename", "credit"]

        # This dictionary converts the input keys into substrings that can be used in f-strings to fill in missing values in the record
        translation_keys = {
            "name": "Cuisine",
            "filename": "Photo",
            "credit": "Photo Credit",
        }

        if key_target not in valid_keys:
            # this logic makes sure we are only looking at valid keys
            return (
                "Improper key target: can only pick from 'name', 'filename', 'credit'."
            )

        else:
            if pd.isna(to_check):
                # This checks whether the dictionary exists at all; if it does not, return the "Missing ..." placeholder
                return f"Missing {translation_keys[key_target]}"
            else:
                if key_target == "name" and (to_check["category"] != "cuisine"):
                    # This logic checks for the cuisine, if the cuisine is not there (and instead has 'ingredient', 'type', 'item', 'equipment', 'meal'), mark as missing
                    return f"Missing {translation_keys[key_target]}"
                else:
                    # Otherwise, there should be no issue with returning
                    return to_check[key_target]

    # Dive into the tag column and extract the cuisine label into a new column, filling gaps with "Missing Cuisine"
    df["cuisine_name"] = df["tag"].apply(
        lambda x: null_filler(to_check=x, key_target="name")
    )
    # df["cuisine_name"] = df["tag"].apply(lambda x: x['name'] if not pd.isna(x) and x['category'] == 'cuisine' else 'Cuisine Missing')

    # this lambda function goes into the photo data column and extracts just the filename from the dictionary
    df["photo_filename"] = df["photoData"].apply(
        lambda x: null_filler(to_check=x, key_target="filename")
    )
    # df["photo_filename"] = df['photoData'].apply(lambda x: x['filename'] if not pd.isna(x) else 'Missing photo')

    # This lambda function goes into the photo data column and extracts just the photo credit from the dictionary
    df["photo_credit"] = df["photoData"].apply(
        lambda x: null_filler(to_check=x, key_target="credit")
    )
    # df["photo_credit"] = df['photoData'].apply(lambda x: x['credit'] if not pd.isna(x) else 'Missing credit')

    # for the above, maybe they can be refactored to one function where the arguments are a column name, dictionary key name, the substring return

    # This lambda function goes into the author column and extracts the author name, or fills with "Missing Author Name"
    df["author_name"] = df["author"].apply(
        lambda x: x[0]["name"] if x else "Missing Author Name"
    )

    # Create a new column with the pubDate values converted to datetime objects.
    # infer_datetime_format is deprecated as of pandas 2.0 and has no effect there, so it is omitted.
    df["date_published"] = pd.to_datetime(df["pubDate"])

    # drop some original columns to clean up the dataframe
    df.drop(
        labels=["tag", "photoData", "author", "type", "dateCrawled", "pubDate"],
        axis=1,
        inplace=True,
    )

    return df
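
The refactor suggested in the comment above (one function that takes a column name, a dictionary key, and the fill string) might look roughly like the sketch below. `extract_nested_field` and its parameters are hypothetical names, not part of `src.dataframe_preprocessor`, and the toy records only imitate the shape of `tag` and `photoData`.

```python
from typing import Any, Optional

import pandas as pd


def extract_nested_field(
    df: pd.DataFrame,
    column: str,
    key: str,
    fill: str,
    required: Optional[dict] = None,
) -> pd.Series:
    """Pull `key` out of the dicts stored in `column`, using `fill` for gaps.

    `required` optionally lists key/value pairs the dict must also contain,
    e.g. {"category": "cuisine"} for the tag column.
    """

    def pick(record: Any) -> str:
        # Missing cells (None/NaN) are not dicts, so they get the fill string
        if not isinstance(record, dict):
            return fill
        # Records that fail the extra condition also get the fill string
        if required and any(record.get(k) != v for k, v in required.items()):
            return fill
        return record.get(key, fill)

    return df[column].apply(pick)


# Toy data imitating the nested structure (assumed, not the real scrape)
toy = pd.DataFrame(
    {
        "tag": [
            {"category": "cuisine", "name": "Italian"},
            {"category": "ingredient", "name": "Chicken"},
            None,
        ],
        "photoData": [
            {"filename": "a.jpg", "credit": "Someone"},
            None,
            {"filename": "c.jpg"},
        ],
    }
)

toy["cuisine_name"] = extract_nested_field(
    toy, "tag", "name", "Missing Cuisine", required={"category": "cuisine"}
)
toy["photo_credit"] = extract_nested_field(
    toy, "photoData", "credit", "Missing Photo Credit"
)
print(toy[["cuisine_name", "photo_credit"]])
```

One difference from `null_filler`: this sketch also falls back to the fill string when the key itself is absent from the dictionary, instead of raising a `KeyError`.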

---

## Define global variables 

**Remember to refactor these out, not ideal**

In [None]:
# | hide
data_path = "../data/recipes-en-201706/epicurious-recipes_m2.json"

---

## Running Commentary

1. I used numbered lists to keep track of things I noticed

### To Do

1. Try to determine consistency of nested data structures
   1. Is the photoData or number of things inside photoData the same from record to record
   2. What about for tag?

The data wasn't fully consistent, but the logic in the helper function handles the nulls.

2. How to handle nulls?
   1. Author: filled in with "Missing Author Name"
   2. Tag: filled in with "Missing Cuisine"
3. ~~Convert pubDate to actual timestamp~~  
4. ~~Convert ScrapeDate to actual timestamp~~
   1. This was ignored as the datestamp was not useful (generally within minutes of the origin of UNIX time)
   
**5. Append new columns for relevant nested structures and unfold them**

6. Determine actual types of `ingredients` and `prepSteps`
7. Continue working through test example of single recipe to feed into spaCy and then sklearn.feature_extraction.text stack
8. Will need to remove numbers, punctuation
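
For the first To Do item (whether the nested structures are consistent from record to record), one approach is to count the distinct key sets that show up in a column. The sketch below is only an illustration: `key_signature` is a hypothetical helper, and the sample records are invented stand-ins for `photoData`.

```python
import pandas as pd


def key_signature(record) -> str:
    """Return the sorted keys of a dict as one string, or a marker for non-dict cells."""
    if isinstance(record, dict):
        return ", ".join(sorted(record))
    return "<missing>"


# Invented stand-in for the photoData column
photo_data = pd.Series(
    [
        {"filename": "a.jpg", "credit": "Someone", "id": "1"},
        {"filename": "b.jpg", "credit": "Someone else", "id": "2"},
        {"filename": "c.jpg", "id": "3"},
        None,
    ]
)

# One row per distinct key set, with the number of records that share it
signature_counts = photo_data.map(key_signature).value_counts()
print(signature_counts)
```

If every record shares one signature, the column is structurally consistent; extra rows point at the records that need special handling.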

---

## Importing and viewing the data as a dataframe

In [None]:
data_path = "../data/recipes-en-201706/epicurious-recipes_m2.json"

epic_dataframe = pd.read_json(data_path, typ="frame")

dfpp.preprocess_dataframe(df=epic_dataframe)
epic_dataframe.head(10)

| id | dek | hed | aggregateRating | ingredients | prepSteps | reviewsCount | willMakeAgainPct | cuisine_name | photo_filename | photo_credit | author_name | date_published | recipe_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 54a2b6b019925f464b373351 | How does fried chicken achieve No. 1 status? B... | Pickle-Brined Fried Chicken | 3.11 | [1 tablespoons yellow mustard seeds, 1 tablesp... | [Toast mustard and coriander seeds in a dry me... | 7 | 100 | Missing Cuisine | 51247610_fried-chicken_1x1.jpg | Michael Graydon and Nikole Herriott | Missing Author Name | 2014-08-19 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |
| 54a408a019925f464b3733bc | Spinaci all'Ebraica | Spinach Jewish Style | 3.22 | [3 pounds small-leaved bulk spinach, Salt, 1/2... | [Remove the stems and roots from the spinach. ... | 5 | 80 | Italian | EP_12162015_placeholders_rustic.jpg | Photo by Chelsea Kyle, Prop Styling by Anna St... | Edda Servi Machlin | 2008-09-09 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |
| 54a408a26529d92b2c003631 | This majestic, moist, and richly spiced honey ... | New Year’s Honey Cake | 3.62 | [3 1/2 cups all-purpose flour, 1 tablespoon ba... | [I like this cake best baked in a 9-inch angel... | 105 | 88 | Kosher | EP_09022015_honeycake-2.jpg | Photo by Chelsea Kyle, Food Styling by Anna St... | Marcy Goldman | 2008-09-10 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |
| 54a408a66529d92b2c003638 | The idea for this sandwich came to me when my ... | The B.L.A.Bagel with Lox and Avocado | 4.0 | [1 small ripe avocado, preferably Hass (see No... | [A short time before serving, mash avocado and... | 7 | 100 | Kosher | EP_12162015_placeholders_casual.jpg | Photo by Chelsea Kyle, Prop Styling by Rhoda B... | Faye Levy | 2008-09-08 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |
| 54a408a719925f464b3733cc | In 1930, Simon Agranat, the chief justice of t... | Shakshuka a la Doktor Shakshuka | 2.71 | [2 pounds fresh tomatoes, unpeeled and cut in ... | [1. Place the tomatoes, garlic, salt, paprika,... | 7 | 83 | Kosher | EP_12162015_placeholders_formal.jpg | Photo by Chelsea Kyle, Prop Styling by Rhoda B... | Joan Nathan | 2008-09-09 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |
| 54a408a919925f464b3733d3 | Although Nelly Custis omitted sugar in her rec... | Rice Pancakes | 0.0 | [1 1/2 cups cooked rice, 2 cups heavy cream, 2... | [1. Combine the rice, cream, and butter. Add t... | 0 | 0 | Missing Cuisine | EP_12162015_placeholders_formal.jpg | Photo by Chelsea Kyle, Prop Styling by Rhoda B... | Stephen A. McLeod | 2012-02-17 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |
| 54a408aa19925f464b3733d6 | Editor's note: This recipe is adapted with per... | Jack-O'-Lantern | 1.0 | [2 tablespoons shortening, 2 tablespoons flour... | [1. Preheat the oven to 350°F. Lightly grease ... | 1 | 0 | Missing Cuisine | 350068.jpg | Jennifer Newberry Mead | Matthew Mead | 2008-09-09 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |
| 54a408ab19925f464b3733da | Editor's note: This recipe is reprinted with p... | Seven-Minute Frosting | 3.53 | [1 1/2 cups sugar, 1/3 cup cold water, 2 egg w... | [1. Combine the sugar, water, egg whites, and ... | 8 | 75 | Missing Cuisine | EP_12162015_placeholders_bright.jpg | Photo by Chelsea Kyle, Prop Styling by Anna St... | Matthew Mead | 2008-09-09 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |
| 54a408ac19925f464b3733de | Editor's note: This recipe is reprinted with p... | Creamy White Frosting | 2.0 | [1 cup vegetable shortening, 1 1/2 teaspoons v... | [1. With a mixer on medium speed, beat togethe... | 5 | 0 | Missing Cuisine | EP_12162015_placeholders_casual.jpg | Photo by Chelsea Kyle, Prop Styling by Rhoda B... | Matthew Mead | 2008-09-09 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |
| 54a408ac6529d92b2c003653 | Editor's note: This recipe is reprinted with p... | Host of Ghosts | 3.17 | [One purchased 9-inch angel food cake, 2 recip... | [1. Place the cake on the cake plate. Reserve ... | 12 | 100 | Missing Cuisine | 350067.jpg | Jennifer Newberry Mead | Matthew Mead | 2008-09-09 04:00:00+00:00 | https://www.epicurious.com/recipes/food/views/... |


### Throw this into CountVectorizer

In [None]:
all_recipes_list = epic_dataframe["ingredients"].str.join(" ")
# .apply(" ".join).str.lower()
# print(type(all_recipes_list))
all_recipes_list
# for i in range(0,5):
#     print((i, all_recipes_list[i]))

id
54a2b6b019925f464b373351    1 tablespoons yellow mustard seeds 1 tablespoo...
54a408a019925f464b3733bc    3 pounds small-leaved bulk spinach Salt 1/2 cu...
54a408a26529d92b2c003631    3 1/2 cups all-purpose flour 1 tablespoon baki...
54a408a66529d92b2c003638    1 small ripe avocado, preferably Hass (see Not...
54a408a719925f464b3733cc    2 pounds fresh tomatoes, unpeeled and cut in q...
                                                  ...                        
59541a31bff3052847ae2107    1 tablespoon unsalted butter, at room temperat...
5954233ad52ca90dc28200e7    8 tablespoons (1 stick) salted butter, at room...
595424c2109c972493636f83    3 tablespoons unsalted butter, plus more for g...
5956638625dc3d1d829b7166    Coarse salt 2 lime wedges 2 ounces tomato juic...
59566daa25dc3d1d829b7169    1 bottle (375 ml) sour beer, such as Almanac C...
Name: ingredients, Length: 34656, dtype: object

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
# The Series is indexed by recipe id, so use .iloc for positional access
first_rec = all_recipes_list.iloc[0]
print(first_rec)

1 tablespoons yellow mustard seeds 1 tablespoons brown mustard seeds 1 1/2 teaspoons coriander seeds 1 cup apple cider vinegar 2/3 cup kosher salt 1/3 cup sugar 1/4 cup chopped fresh dill 8 skinless, boneless chicken thighs (about 3 pounds), halved, quartered if large Vegetable oil (for frying; about 10 cups) 2 cups buttermilk 2 cups all-purpose flour Kosher salt Honey, flaky sea salt (such as Maldon), toasted benne or sesame seeds, hot sauce (for serving) A deep-fry thermometer
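
The printed recipe shows why To Do item 8 (removing numbers and punctuation) matters. A minimal regex-based sketch of that cleanup is below; `strip_numbers_and_punctuation` is a hypothetical helper, and keeping hyphens (so that "all-purpose" and "deep-fry" survive) is an assumption about what is wanted.

```python
import re


def strip_numbers_and_punctuation(text: str) -> str:
    """Lowercase the text and drop quantities and punctuation, keeping hyphens."""
    text = text.lower()
    # Drop quantities such as "1", "10" or "2/3"
    text = re.sub(r"\d+(?:/\d+)?", " ", text)
    # Replace anything that is not a word character, whitespace or hyphen
    text = re.sub(r"[^\w\s-]", " ", text)
    # Collapse the runs of whitespace left behind
    return re.sub(r"\s+", " ", text).strip()


sample = "8 skinless, boneless chicken thighs (about 3 pounds), halved"
cleaned = strip_numbers_and_punctuation(sample)
print(cleaned)  # skinless boneless chicken thighs about pounds halved
```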


In [None]:
doc_first_rec = nlp(first_rec)
for token in doc_first_rec:
    # Skip number-like tokens such as "1" or "1/2"
    if not token.like_num:
        print(token.text, token.pos_, token.ent_type_, token.lemma_, token.is_digit)

tablespoons NOUN  tablespoon False
yellow ADJ  yellow False
mustard NOUN  mustard False
seeds VERB  seed False
tablespoons NOUN  tablespoon False
brown ADJ  brown False
mustard NOUN  mustard False
seeds VERB  seed False
teaspoons NOUN  teaspoon False
coriander NOUN  coriander False
seeds VERB  seed False
cup NOUN QUANTITY cup False
apple NOUN  apple False
cider NOUN  cider False
vinegar NOUN  vinegar False
cup NOUN QUANTITY cup False
kosher ADJ  kosher False
salt NOUN  salt False
cup NOUN QUANTITY cup False
sugar NOUN  sugar False
cup NOUN QUANTITY cup False
chopped VERB  chop False
fresh ADJ  fresh False
dill NOUN  dill False
skinless NOUN  skinless False
, PUNCT  , False
boneless ADJ  boneless False
chicken NOUN  chicken False
thighs NOUN  thigh False
( PUNCT  ( False
about ADV QUANTITY about False
pounds NOUN QUANTITY pound False
) PUNCT  ) False
, PUNCT  , False
halved VERB  halve False
, PUNCT  , False
quartered VERB  quarter False
if SCONJ  if False
large ADJ  large False
Vegetab
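
The loop above filters only on `like_num`, so punctuation tokens still come through. A small sketch of filtering on several token attributes at once is below. It uses `spacy.blank("en")`, a tokenizer-only pipeline, purely as an assumption so that the example runs without downloading `en_core_web_sm`; that pipeline provides no POS tags, entities, or model-based lemmas. The sample string is a shortened stand-in for a recipe.

```python
import spacy

# Tokenizer-only English pipeline (assumption: avoids the model download)
nlp_blank = spacy.blank("en")

sample = "1 tablespoons yellow mustard seeds, 1/2 cup sugar (about 3 pounds)"

# Keep tokens that are not number-like, punctuation or whitespace
kept_tokens = [
    token.text.lower()
    for token in nlp_blank(sample)
    if not (token.like_num or token.is_punct or token.is_space)
]
print(kept_tokens)
```

With the full `en_core_web_sm` pipeline, `token.lemma_` could be collected instead of `token.text` to fold "seeds" and "seed" together.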

In [None]:
epic_dataframe["recipe_url"].iloc[0]

'https://www.epicurious.com/recipes/food/views/pickle-brined-fried-chicken-51247610'

In [None]:
# this is looking for accented characters inside text, which may not be necessary
# for word in test_rec_list:
#     print(unicodedata.normalize("NFKD", word))
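
If accented characters do turn out to matter, note that NFKD normalization by itself only decomposes them into a base letter plus a combining mark; the marks still have to be dropped. A minimal sketch is below, where `strip_accents` is a hypothetical helper and the sample phrase is invented rather than taken from the data.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose accented characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("crème fraîche"))  # creme fraiche
```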