# Exploratory Data Analysis of Epicurious Scrape in a JSON file

This is Aaron Chen's idealized workflow for looking at data science problems. It likely isn't the best path, and he hasn't rigidly applied or stuck to this ideal, but he wishes that he worked this way more frequently.

## Purpose: Work through some exploratory data analysis of the Epicurious scrape on stream. Try to write some functions to help process the data.

### Author: Aaron Chen


---

### If needed, run shell commands here

In [None]:
# !python -m spacy download en_core_web_sm

---

## External Resources

List out references or documentation that have helped you with this notebook

### Code
Regex Checker: https://regex101.com/

#### Scikit-learn
1. https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation-lda
2. 

### Data

For this notebook, the data is stored in the repo base folder/data/raw

### Process

Are there steps or tutorials you are following? Those are things I try to list in Process

___

## Import necessary libraries

In [1]:
from datetime import datetime
import matplotlib.pyplot as plt
# import numpy as np
import pandas as pd
import pyLDAvis
import pyLDAvis.sklearn
from sklearn.base import TransformerMixin
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import spacy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
# from spacy.lemmatizer import Lemmatizer
from tqdm import tqdm
from typing import Any



---

## Define helper functions

My workflow is to try things with code cells, then when the code cells get messy and repetitive, to convert into helper functions that can be called.

When the helper functions are getting used a lot, it is usually better to convert them to scripts or classes that can be called/instantiated

In [2]:
def custom_lemmatizer(ingredients: list) -> list:
    """This takes in the spaCy tokens produced by custom_preprocessor and lemmatizes them.

    Pronouns (like "I" and "you") get lemmatized to '-PRON-', so I'm removing those.
    Verbs are skipped as well.
    Keeping only alphabetic tokens removes punctuation and numbers.
    Lemmas that are a single character long are dropped.

    Args:
        ingredients: iterable of spaCy tokens for one ingredient string

    Returns:
        A list of lemma strings
    """
    lemmas = [
        token.lemma_
        for token in ingredients
        if token.is_alpha and token.pos_ not in ["PRON", "VERB"] and len(token.lemma_) > 1
    ]
    return lemmas
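
To make the filter condition concrete without loading a spaCy model, here is a small sketch that applies the same test to hand-built stand-in tokens. `FakeToken` and `keep_lemmas` are hypothetical names, and the part-of-speech tags are assigned by hand; a real spaCy model may tag these words differently.

```python
from typing import NamedTuple


class FakeToken(NamedTuple):
    """Hypothetical stand-in exposing the spaCy Token attributes the filter reads."""
    lemma_: str
    pos_: str
    is_alpha: bool


def keep_lemmas(tokens):
    # Same condition as custom_lemmatizer: alphabetic, not a pronoun or verb,
    # and a lemma longer than one character.
    return [
        t.lemma_
        for t in tokens
        if t.is_alpha and t.pos_ not in ["PRON", "VERB"] and len(t.lemma_) > 1
    ]


# Hand-assigned tags for "2 cups chopped onions" (an assumption, not model output).
tokens = [
    FakeToken("2", "NUM", False),
    FakeToken("cup", "NOUN", True),
    FakeToken("chop", "VERB", True),
    FakeToken("onion", "NOUN", True),
]
print(keep_lemmas(tokens))  # ['cup', 'onion']
```

The number is dropped by `is_alpha` and the prep word is dropped only if the model tags it as a verb, which is why prep terms tagged as adjectives or nouns can still get through.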

In [3]:
def custom_preprocessor(recipe_ingreds: str) -> list:
    """This function replaces the default sklearn CountVectorizer preprocessor to use spaCy.

    sklearn CountVectorizer's default preprocessor only performs accent removal and lowercasing.
    Passing this function overrides that stage, so the vectorizer's strip_accents and lowercase
    settings are bypassed. It relies on the global `nlp` model, which must be loaded before the
    vectorizer is fit.

    Args:
        recipe_ingreds: a string from a recipe representing the ingredients used in the recipe

    Returns:
        A list of spaCy tokens to be passed to the tokenizer (custom_lemmatizer)
    """
    preprocessed = [token for token in nlp(recipe_ingreds)]

    return preprocessed

In [4]:
def plot_top_words(model, feature_names, n_top_words, title):
    """Plot the top words of each topic as horizontal bar charts.

    The 2x5 grid holds at most 10 topics, so a model with more components
    needs a larger grid.
    """
    fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        # Indices of the n_top_words largest weights, in descending order
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 30})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)
    # The figure title only needs to be set once, after the loop
    fig.suptitle(title, fontsize=40)

### Import local script

I have started grouping local scripts in with the library imports, but putting them at the bottom of the list.

In [5]:
import project_path

import src.dataframe_preprocessor as dfpp

pyLDAvis.enable_notebook()

---

## Define global variables 
### Remember to refactor these out, not ideal

In [6]:
data_path = "../../data/recipes-en-201706/epicurious-recipes_m2.json"
food_stopwords_path = "../../food_stopwords.csv"

---

## Running Commentary

1. I used numbered lists to keep track of things I noticed

### To Do

1. Try to determine consistency of nested data structures
   1. Is the photoData or number of things inside photoData the same from record to record
   2. What about for tag?

Data wasn't fully consistent but logic in helper function helped handle nulls

2. How to handle nulls?
   1. Author: filled in with "Missing Author"
   2. Tag: filled in with "Missing Cuisine"
3. ~~Convert pubDate to actual timestamp~~  
4. ~~Convert ScrapeDate to actual timestamp~~
   1. This was ignored as the datestamp was not useful (generally within minutes of the origin of UNIX time)
   
**5. Append new columns for relevant nested structures and unfold them**

6. Determine actual types of `ingredients` and `prepSteps`
7. Continue working through test example of single recipe to feed into spaCy and then sklearn.feature_extraction.text stack
8. Will need to remove numbers, punctuation
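
The null handling and the timestamp check from the list above can be sketched in plain Python. The key names `author` and `tag`, the helper names, and the one-hour tolerance are assumptions for illustration; the notebook's actual logic lives in `src.dataframe_preprocessor` and is not shown here.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def fill_missing(record: dict) -> dict:
    """Return a copy of the record with the placeholder values from the list above."""
    filled = dict(record)
    if not filled.get("author"):
        filled["author"] = "Missing Author"
    if not filled.get("tag"):
        filled["tag"] = "Missing Cuisine"
    return filled


def near_epoch(ts: datetime, tolerance: timedelta = timedelta(hours=1)) -> bool:
    """True when a timestamp sits within `tolerance` of the origin of UNIX time."""
    return abs(ts - EPOCH) <= tolerance


# Hypothetical records, not rows from the scrape.
record = fill_missing({"author": None, "tag": None})
print(record)  # {'author': 'Missing Author', 'tag': 'Missing Cuisine'}

print(near_epoch(datetime(1970, 1, 1, 0, 12, tzinfo=timezone.utc)))  # True
print(near_epoch(datetime(2017, 6, 1, tzinfo=timezone.utc)))         # False
```

A check like `near_epoch` is one way to confirm that a date column carries no information before deciding to ignore it.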

---

## Importing and viewing the data as a dataframe

In [7]:
repo = pd.read_json(path_or_buf=data_path) # type:ignore

dfpp.preprocess_dataframe(df=repo) # type:ignore
print(repo.shape)
repo.head(10) # type:ignore

# Flatten to one entry per ingredient line across all recipes
recipe_megalist = [ingred for recipe in repo['ingredients'].tolist() for ingred in recipe]


nlp = spacy.load("en_core_web_sm")

# Named via a stream redeem (with a free pun): flushtrated is the set of all stopwords to exclude, so named because we're throwing these words down the drain
# NOTE: iterating over a DataFrame yields its column labels, so this collects the header row of the CSV

flushtrated = {x for x in pd.read_csv(food_stopwords_path)}
flushtrated = flushtrated.union(STOP_WORDS)
flushtrated_list = list(flushtrated)

(34656, 14)




In [None]:
cv = CountVectorizer(strip_accents='unicode', lowercase=True, preprocessor=custom_preprocessor, tokenizer=custom_lemmatizer, stop_words=flushtrated_list, ngram_range=(1,4), min_df=10)
repo_transformed = cv.fit_transform(tqdm(recipe_megalist))
cv.get_feature_names_out().shape

In [None]:
repo_transformed.shape

We could filter out the adjectives in the lemmatization step, because spaCy allows filtering by part of speech, but that might also exclude them from the n-grams. Let's try augmenting the stopwords and excluding colors that way instead.
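
As a rough illustration of how stop words interact with n-grams, the sketch below removes stop words from the token list first and only then joins the surviving tokens into n-grams. To my understanding this is also the order `CountVectorizer` uses for word n-grams, but treat that as an assumption; `word_ngrams` is a hypothetical pure-Python helper, not sklearn code.

```python
def word_ngrams(tokens, stop_words, ngram_range=(1, 2)):
    """Drop stop words, then join the remaining tokens into n-grams."""
    kept = [t for t in tokens if t not in stop_words]
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for i in range(len(kept) - n + 1):
            grams.append(" ".join(kept[i : i + n]))
    return grams


tokens = ["red", "bell", "pepper"]
print(word_ngrams(tokens, stop_words=set()))
# ['red', 'bell', 'pepper', 'red bell', 'bell pepper']
print(word_ngrams(tokens, stop_words={"red"}))
# ['bell', 'pepper', 'bell pepper']
```

If that ordering holds, a color listed as a stop word disappears from the n-grams as well, much as it would if it were filtered out during lemmatization.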

In [None]:
lda_20 = LatentDirichletAllocation(n_components=20, n_jobs=-1, verbose=100, random_state=200)
lda_20_repo_transformed = lda_20.fit_transform(repo_transformed)

In [None]:
pyLDAvis.sklearn.prepare(lda_20, repo_transformed, cv)

## Manual Topic Labeling Based on LDA
1. alliums, allium prep, chocolate and rosemary
2. oil, cheese, game meats
3. peppers and parsley

Based on these three topics, I think it is better to train a new model, since these topics don't seem to carry much information
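
For manual labeling it can help to print the top terms per topic as text alongside the interactive view. This is a minimal sketch with a hypothetical helper `top_terms` and invented toy weights; with a fitted model the rows would come from `model.components_` and the names from `get_feature_names_out()`.

```python
def top_terms(components, feature_names, n_top=3):
    """Return the n_top highest-weighted terms for each topic row."""
    result = []
    for weights in components:
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        result.append([feature_names[i] for i in ranked[:n_top]])
    return result


# Toy weights invented for illustration; not output from the models above.
feature_names = ["onion", "garlic", "chocolate", "olive oil", "parmesan"]
components = [
    [5.0, 4.0, 0.5, 0.2, 0.1],
    [0.3, 0.6, 0.1, 6.0, 3.5],
]
for idx, terms in enumerate(top_terms(components, feature_names), start=1):
    print(f"Topic {idx}: {', '.join(terms)}")
# Topic 1: onion, garlic, chocolate
# Topic 2: olive oil, parmesan, garlic
```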

In [None]:
cv_auto_stopwords_085 = CountVectorizer(strip_accents='unicode', lowercase=True, preprocessor=custom_preprocessor, tokenizer=custom_lemmatizer, stop_words=None, ngram_range=(1,4), max_df=0.85)

repo_transformed_auto_stopwords_085 = cv_auto_stopwords_085.fit_transform(tqdm(recipe_megalist))
cv_auto_stopwords_085.get_feature_names_out().shape
lda_20_auto_stopwords_085 = LatentDirichletAllocation(n_components=20, n_jobs=-1, verbose=100, random_state=200)
lda_20_repo_transformed_auto_stopwords_085 = lda_20_auto_stopwords_085.fit_transform(repo_transformed_auto_stopwords_085)
pyLDAvis.sklearn.prepare(lda_20_auto_stopwords_085, repo_transformed_auto_stopwords_085, cv_auto_stopwords_085)

This LDA is probably less useful

1. Has a lot of prep and units of measurement
2. 
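
A possible reason prep words and units survive is that `max_df=0.85` only drops terms that occur in more than 85% of the documents, and each document here is a single ingredient line. The sketch below reproduces that rule in plain Python on invented toy data; `prune_by_max_df` is a hypothetical helper, not part of sklearn.

```python
from collections import Counter


def prune_by_max_df(docs, max_df=0.85):
    """Drop terms whose document frequency exceeds max_df * number of documents."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    too_common = {t for t, df in doc_freq.items() if df > max_df * n_docs}
    pruned = [[t for t in doc if t not in too_common] for doc in docs]
    return pruned, too_common


# Toy token lists, invented for illustration.
docs = [
    ["cup", "chop", "onion"],
    ["cup", "chop", "garlic"],
    ["cup", "chop", "parsley"],
    ["cup", "flour"],
]
pruned, removed = prune_by_max_df(docs)
print(removed)  # {'cup'}
print(pruned)   # [['chop', 'onion'], ['chop', 'garlic'], ['chop', 'parsley'], ['flour']]
```

Here "chop" appears in 75% of the documents and is kept, which suggests that a frequency cutoff alone may not remove common prep terms unless they are nearly universal.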

In [8]:
additional_to_exclude = {'red', 'green', 'black', 'yellow', 'white', 'inch', 'mince', 'chop', 'fry', 'trim', 'flat', 'beat', 'brown', 'golden', 'balsamic', 'halve', 'blue', 'divide', 'trim', 'unbleache', 'granulate'}
flushtrated_augment = flushtrated.union(additional_to_exclude)
flushtrated_augment = list(flushtrated_augment)

cv_stopwords_aug = CountVectorizer(strip_accents='unicode', lowercase=True, preprocessor=custom_preprocessor, tokenizer=custom_lemmatizer, stop_words=flushtrated_augment, ngram_range=(1,4), min_df=10)

repo_transformed_stopwords_aug = cv_stopwords_aug.fit_transform(tqdm(recipe_megalist))
cv_stopwords_aug.get_feature_names_out().shape
lda_20_stopwords_aug = LatentDirichletAllocation(n_components=20, n_jobs=-1, verbose=100, random_state=200)
lda_20_repo_transformed_aug = lda_20_stopwords_aug.fit_transform(repo_transformed_stopwords_aug)
pyLDAvis.sklearn.prepare(lda_20_stopwords_aug, repo_transformed_stopwords_aug, cv_stopwords_aug)

100%|█████████████████████████████████████████████████████████| 341271/341271 [14:16<00:00, 398.27it/s]




These topics/word groupings also don't seem to make much sense, so let's run the counts through TF-IDF and see what happens, even though LDA is formulated over raw word counts and its authors don't like doing this.
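
To see what changes when counts are replaced by TF-IDF weights, here is a pure-Python sketch of the weighting. It uses a smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 row normalization, which I believe matches the `TfidfTransformer` defaults; the helper name `tfidf_rows` and the toy counts are assumptions for illustration.

```python
import math


def tfidf_rows(counts):
    """Apply smoothed-idf TF-IDF and L2 row normalization to a list-of-lists count matrix."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of rows in which each term occurs
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    rows = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        rows.append([v / norm for v in weighted])
    return rows


# Toy term counts for two documents over three terms, invented for illustration.
counts = [
    [2, 1, 0],
    [1, 0, 1],
]
rows = tfidf_rows(counts)
for row in rows:
    print([round(v, 3) for v in row])
```

The output rows are fractional weights with unit length rather than integer counts, which is the mismatch with LDA's count-based formulation mentioned above.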

In [9]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer()

repo_tfidf_stopwords_aug = tfidf.fit_transform(repo_transformed_stopwords_aug)
tfidf_lda_20_stopwords_aug = LatentDirichletAllocation(n_components=20, n_jobs=-1, verbose=100, random_state=200)
tfidf_lda_20_repo_transformed_aug = tfidf_lda_20_stopwords_aug.fit_transform(repo_tfidf_stopwords_aug)
pyLDAvis.sklearn.prepare(tfidf_lda_20_stopwords_aug, repo_tfidf_stopwords_aug, cv_stopwords_aug)



pyLDAvis calls a deprecated function inside CountVectorizer, which is incompatible with TF-IDF. Can we find an alternate version?
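
One possible route, assuming the generic `pyLDAvis.prepare` entry point is available in the installed version, is to compute its inputs directly instead of going through the sklearn wrapper. The sketch below builds those inputs from plain nested lists; `pyldavis_inputs` is a hypothetical helper and the toy numbers are invented. With a fitted model, the rows would come from `components_`, the output of `transform`, and the document-term matrix, most likely via NumPy or sparse sums rather than list comprehensions.

```python
def pyldavis_inputs(components, doc_topic, doc_term_counts, vocab):
    """Build keyword arguments for pyLDAvis.prepare from plain nested lists."""
    # Each topic row is normalized to sum to 1 (a term distribution per topic)
    topic_term_dists = [[w / sum(row) for w in row] for row in components]
    # Each document row is normalized to sum to 1 (a topic distribution per document)
    doc_topic_dists = [[w / sum(row) for w in row] for row in doc_topic]
    doc_lengths = [sum(row) for row in doc_term_counts]
    term_frequency = [sum(col) for col in zip(*doc_term_counts)]
    return {
        "topic_term_dists": topic_term_dists,
        "doc_topic_dists": doc_topic_dists,
        "doc_lengths": doc_lengths,
        "vocab": list(vocab),
        "term_frequency": term_frequency,
    }


# Toy values, invented for illustration.
vocab = ["onion", "garlic", "oil"]
components = [[2.0, 1.0, 1.0], [1.0, 1.0, 2.0]]
doc_topic = [[0.9, 0.1], [0.2, 0.8]]
doc_term_counts = [[2, 1, 0], [0, 1, 1]]

inputs = pyldavis_inputs(components, doc_topic, doc_term_counts, vocab)
print(inputs["doc_lengths"])     # [3, 2]
print(inputs["term_frequency"])  # [2, 2, 1]
# These could then be passed on as pyLDAvis.prepare(**inputs); not run here.
```

Because the vocabulary is passed in explicitly, this path does not need to call any method on the vectorizer, and the document lengths and term frequencies can be taken from the raw counts even when the model was fit on TF-IDF weights.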