# Exploratory Data Analysis of Epicurious Scrape in a JSON file

This is Aaron Chen's idealized workflow for approaching data science problems. It likely isn't the best path, and he hasn't rigidly applied or stuck to it, but he wishes that he worked this way more frequently.

## Purpose: Work through some exploratory data analysis of the Epicurious scrape on stream. Try to write some functions to help process the data.

### Author: Aaron Chen


---

### If needed, run shell commands here

In [None]:
# !python -m spacy download en_core_web_sm
# !python -c "import tkinter"

---

## External Resources

List out references or documentation that has helped you with this notebook

### Code
Regex Checker: https://regex101.com/

#### Scikit-learn
1. https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation-lda

### Data

For this notebook, the data is stored in the repo base folder/data/raw

### Process

Are there steps or tutorials you are following? Those are things I try to list in Process

___

## Import necessary libraries

In [1]:
from datetime import datetime
# from examples import utils
from joblib import dump, load
import matplotlib.pyplot as plt
from matplotlib import ticker  # needed by plot_3d and add_2d_scatter below
from openTSNE import TSNE
import pandas as pd
from sklearn import tree
from sklearn.base import TransformerMixin
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
# from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import spacy
# from tkinter import N  # unused in this notebook; looks like an accidental auto-import
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
from tqdm import tqdm
from typing import Any
import umap



---

## Define helper functions

My workflow is to try things with code cells, then when the code cells get messy and repetitive, to convert into helper functions that can be called.

When the helper functions are getting used a lot, it is usually better to convert them to scripts or classes that can be called or instantiated.

In [None]:
def plot_top_words(model, feature_names, n_top_words, title):
    fig, axes = plt.subplots(2, 5, figsize=(30, 15), sharex=True)
    axes = axes.flatten()
    for topic_idx, topic in enumerate(model.components_):
        top_features_ind = topic.argsort()[: -n_top_words - 1 : -1]
        top_features = [feature_names[i] for i in top_features_ind]
        weights = topic[top_features_ind]

        ax = axes[topic_idx]
        ax.barh(top_features, weights, height=0.7)
        ax.set_title(f"Topic {topic_idx +1}", fontdict={"fontsize": 30})
        ax.invert_yaxis()
        ax.tick_params(axis="both", which="major", labelsize=20)
        for i in "top right left".split():
            ax.spines[i].set_visible(False)

    # Set the figure title once, after all topic panels are drawn
    fig.suptitle(title, fontsize=40)
    plt.show()

In [None]:
def concat_matrices_to_df(df, vectorized_ingred_matrix, cv):
    """Append a vectorized ingredient matrix to the recipe dataframe as new columns.

    The matrix can come from either CountVectorizer or TfidfTransformer; the
    combined dataframe is what gets used for classification.

    Args:
        df: preprocessed dataframe from preprocess_dataframe
        vectorized_ingred_matrix: sparse CSR matrix with one row per row of df
            (e.g. from transforming the joined ingredients of each recipe)
        cv: fitted CountVectorizer whose vocabulary supplies the column names

    Returns:
        A pandas dataframe with the vectorized_ingred_matrix appended as columns to df
    """
    repo_tfidf_df = pd.DataFrame(vectorized_ingred_matrix.toarray(), columns=cv.get_feature_names_out(), index=df.index)
    return pd.concat([df, repo_tfidf_df], axis=1)

In [None]:
def plot_3d(points, points_color, title):
    x, y, z = points.T

    fig, ax = plt.subplots(
        figsize=(6, 6),
        facecolor="white",
        tight_layout=True,
        subplot_kw={"projection": "3d"},
    )
    fig.suptitle(title, size=16)
    col = ax.scatter(x, y, z, c=points_color, s=50, alpha=0.8)
    ax.view_init(azim=-60, elev=9)
    ax.xaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.zaxis.set_major_locator(ticker.MultipleLocator(1))

    fig.colorbar(col, ax=ax, orientation="horizontal", shrink=0.6, aspect=60, pad=0.01)
    plt.show()

In [None]:
def add_2d_scatter(ax, points, points_color, title=None):
    x, y = points.T
    ax.scatter(x, y, c=points_color, s=50, alpha=0.8)
    ax.set_title(title)
    ax.xaxis.set_major_formatter(ticker.NullFormatter())
    ax.yaxis.set_major_formatter(ticker.NullFormatter())

In [None]:
def plot_2d(points, points_color, title):
    fig, ax = plt.subplots(figsize=(3, 3), facecolor="white", constrained_layout=True)
    fig.suptitle(title, size=16)
    add_2d_scatter(ax, points, points_color)
    plt.show()

### Import local script

I started grouping these with the library imports, but I keep them at the bottom of the list.

In [2]:
import project_path

import src.dataframe_preprocessor as dfpp
import src.nlp_processor as nlp_proc

---

## Define global variables 
### Remember to refactor these out, not ideal

In [3]:
data_path = "../../data/recipes-en-201706/epicurious-recipes_m2.json"
food_stopwords_path = "../../food_stopwords.csv"

---

## Running Commentary

1. I used numbered lists to keep track of things I noticed

### To Do

1. Try to determine consistency of nested data structures
   1. Is the photoData or number of things inside photoData the same from record to record
   2. What about for tag?

Data wasn't fully consistent but logic in helper function helped handle nulls

2. How to handle nulls?
   1. Author: filled in with "Missing Author"
   2. Tag: filled in with "Missing Cuisine"
3. ~~Convert pubDate to actual timestamp~~  
4. ~~Convert ScrapeDate to actual timestamp~~
   1. This was ignored as the datestamp was not useful (generally within minutes of the origin of UNIX time)
   
**5. Append new columns for relevant nested structures and unfold them**

6. Determine actual types of `ingredients` and `prepSteps`
7. Continue working through test example of single recipe to feed into spaCy and then sklearn.feature_extraction.text stack
8. Will need to remove numbers, punctuation
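
The null handling and clean-up items above can be sketched in plain Python. This is a minimal illustration, not the logic in `src.dataframe_preprocessor`: the `fill_missing` and `strip_numbers_and_punctuation` helpers and the flat record layout are assumptions made up for the example, and only the two placeholder strings come from the list above.

```python
import re
import string


def fill_missing(record):
    """Return a copy of a recipe dict with placeholders for absent fields."""
    filled = dict(record)
    if not filled.get("author"):
        filled["author"] = "Missing Author"
    if not filled.get("tag"):
        filled["tag"] = "Missing Cuisine"
    return filled


def strip_numbers_and_punctuation(text):
    """Drop digits and ASCII punctuation, then collapse repeated whitespace."""
    no_digits = re.sub(r"\d+", " ", text)
    punct_to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return " ".join(no_digits.translate(punct_to_space).split())


# Hypothetical record: author is null and the tag key is absent entirely
example_record = {"author": None, "ingredients": ["1/2 cup (1 stick) butter, softened"]}
cleaned = fill_missing(example_record)
cleaned_ingredient = strip_numbers_and_punctuation(cleaned["ingredients"][0])
print(cleaned["author"], "|", cleaned["tag"], "|", cleaned_ingredient)
```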

---

## Importing and viewing the data as a dataframe

In [None]:
sheeeeeesh = pd.read_json(path_or_buf=data_path) # type:ignore
# pd.read_json(data_path, typ='frame') # type:ignore

letsgoooo = dfpp.preprocess_dataframe(df=sheeeeeesh) # type:ignore

recipe_megalist = [ingred for recipe in letsgoooo['ingredients'].tolist() for ingred in recipe]

nlp = spacy.load("en_core_web_sm")

# this is a redeem for variable naming mixed with a free pun-ish me daddy, flushtrated will be the list of all stopword to exclude so named because we're throwing these words down the drain

flushtrated = {x for x in pd.read_csv(food_stopwords_path)}
additional_to_exclude = {'red', 'green', 'black', 'yellow', 'white', 'inch', 'mince', 'chop', 'fry', 'trim', 'flat', 'beat', 'brown', 'golden', 'balsamic', 'halve', 'blue', 'divide', 'trim', 'unbleache', 'granulate', 'Frank', 'alternative', 'american', 'annie', 'asian', 'balance', 'band', 'barrel', 'bay', 'bayou', 'beam', 'beard', 'bell', 'betty', 'bird', 'blast', 'bob', 'bone', 'breyers', 'calore', 'carb', 'card', 'chachere', 'change', 'circle', 'coffee', 'coil', 'country', 'cow', 'crack', 'cracker', 'crocker', 'crystal', 'dean', 'degree', 'deluxe', 'direction', 'duncan', 'earth', 'eggland', 'ener', 'envelope', 'eye', 'fantastic', 'far', 'fat', 'feather', 'flake', 'foot', 'fourth', 'frank', 'french', 'fusion', 'genoa', 'genovese', 'germain', 'giada', 'gold', 'granule', 'greek', 'hamburger', 'helper', 'herbe', 'hines', 'hodgson', 'hunt', 'instruction', 'interval', 'italianstyle', 'jim', 'jimmy', 'kellogg', 'lagrille', 'lake', 'land', 'laurentiis', 'lawry', 'lipton', 'litre', 'll', 'maid', 'malt', 'mate', 'mayer', 'meal', 'medal', 'medallion', 'member', 'mexicanstyle', 'monte', 'mori', 'nest', 'nu', 'oounce', 'oscar', 'ox', 'paso', 'pasta', 'patty', 'petal', 'pinche', 'preserve', 'quartere', 'ranch', 'ranchstyle', 'rasher', 'redhot', 'resemble', 'rice', 'ro', 'roni', 'scissor', 'scrap', 'secret', 'semicircle', 'shard', 'shear', 'sixth', 'sliver', 'smucker', 'snicker', 'source', 'spot', 'state', 'strand', 'sun', 'supreme', 'tablepoon', 'tail', 'target', 'tm', 'tong', 'toothpick', 'triangle', 'trimming', 'tweezer', 'valley', 'vay', 'wise', 'wishbone', 'wrapper', 'yoplait', 'ziploc'}

flushtrated = flushtrated.union(STOP_WORDS)
flushtrated = flushtrated.union(additional_to_exclude)
flushtrated_list = list(flushtrated)
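
One thing worth checking in the cell above: iterating over a DataFrame yields its column labels, not its cell values. The sketch below uses an in-memory CSV with a hypothetical `stopword` column (the real layout of `food_stopwords.csv` is not shown in this notebook) to show the difference.

```python
import io

import pandas as pd

# Hypothetical layout: a header row followed by one stop word per line
fake_csv = io.StringIO("stopword\ncup\ntablespoon\nounce\n")
stopword_df = pd.read_csv(fake_csv)

from_iteration = {x for x in stopword_df}    # column labels only
from_column = set(stopword_df["stopword"])   # the actual cell values

print(from_iteration)
print(sorted(from_column))
```

If the real file keeps all of its stop words in the header row, the set comprehension already picks them up; if it stores one per row, reading the column explicitly is probably what is wanted.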

In [None]:
custom_nlp_proc = nlp_proc.NLP_Processor("en_core_web_sm")

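# NOTE: when a custom preprocessor and tokenizer are supplied, scikit-learn ignores
# strip_accents, lowercase, and token_pattern, so those three arguments have no effect here.
cv = CountVectorizer(strip_accents='unicode', 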
                        lowercase=True, 
                        preprocessor=custom_nlp_proc.custom_preprocessor, 
                        tokenizer=custom_nlp_proc.custom_lemmatizer, 
                        stop_words=flushtrated_list, 
                        token_pattern=r"(?u)\b[a-zA-Z]{2,}\b", 
                        ngram_range=(1,4), 
                        min_df=10
                        )

cv.fit(tqdm(recipe_megalist))

temp = letsgoooo["ingredients"].apply(" ".join).str.lower()

repo_transformed = cv.transform(tqdm(temp))

cv.get_feature_names_out().shape

In [None]:
tfidf = TfidfTransformer()

repo_tfidf = tfidf.fit_transform(repo_transformed)

repo_tfidf.shape

In [None]:
recipes_with_cv = concat_matrices_to_df(letsgoooo, repo_tfidf, cv)

We could filter out adjectives in the lemmatization step, since spaCy supports filtering on part-of-speech tags, but that might also remove them from the n-grams. Instead, let's try augmenting the stop words and excluding colors that way.

In [None]:
filtered_df = recipes_with_cv.drop(['dek', 'hed', 'aggregateRating', 'ingredients', 'prepSteps',
       'reviewsCount', 'willMakeAgainPct', 'photo_filename',
       'photo_credit', 'author_name', 'date_published', 'recipe_url'], axis=1)

filtered_df.head()

In [None]:
reduced_df = filtered_df[filtered_df['cuisine_name'] != 'Missing Cuisine']
y = reduced_df['cuisine_name']
X = reduced_df.drop(['id', 'cuisine_name'], axis=1)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=240, stratify=y)

In [None]:
rfc_clf = RandomForestClassifier(max_depth=50, random_state=572, class_weight="balanced", n_jobs=-1)

rfc_clf.fit(X_train, y_train)
print(rfc_clf.score(X_test, y_test))

In [4]:
joblib_basepath = '../../joblib/2022.08.23/'

cv_path = joblib_basepath + 'countvec.joblib'
tfidf_path = joblib_basepath + 'tfidf.joblib'
full_df_path = joblib_basepath + 'recipes_with_cv.joblib'
reduced_df_path = joblib_basepath + 'reduced_df.joblib'
rfc_path = joblib_basepath + 'rfc_clf.joblib'

In [None]:
dump(cv, cv_path)
dump(tfidf, tfidf_path)
dump(recipes_with_cv, full_df_path)
dump(reduced_df, reduced_df_path)
dump(rfc_clf, rfc_path)

In [5]:
cv = load(cv_path)
tfidf = load(tfidf_path)
recipes_with_cv = load(full_df_path)
reduced_df = load(reduced_df_path)
rfc_clf = load(rfc_path)

scikit-learn exports decision trees in DOT format, which ETE does not support yet. DOT support is planned for ETE milestone 4, which was about 50% complete as of this stream: https://github.com/etetoolkit/ete/issues/361
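
For reference, a minimal sketch of getting a DOT string out of scikit-learn with `sklearn.tree.export_graphviz`. It fits a small forest on synthetic data from `make_classification` (not the recipe features) and exports the first tree; `toy_X`, `toy_y`, and the hyperparameters are placeholders chosen for the example.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, not the recipe features
toy_X, toy_y = make_classification(n_samples=60, n_features=5, random_state=0)

toy_forest = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0)
toy_forest.fit(toy_X, toy_y)

# A fitted forest exposes its individual trees in estimators_;
# out_file=None makes export_graphviz return the DOT source as a string
dot_source = tree.export_graphviz(toy_forest.estimators_[0], out_file=None)
print(dot_source[:60])
```

The same call should work on any tree in `rfc_clf.estimators_`, though a tree grown to `max_depth=50` on a few thousand features would give a very large DOT file.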

In [6]:
# reduced_df = filtered_df[filtered_df['cuisine_name'] != 'Missing Cuisine']
y = reduced_df['cuisine_name']
X = reduced_df.drop(['id', 'cuisine_name'], axis=1)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=240, stratify=y)

In [8]:
X_train.shape

(11124, 3351)

In [9]:
we_were_talking_about_variable_name = TruncatedSVD(n_components=100, n_iter=15, random_state=268)
we_were_talking_about_variable_name_svd = we_were_talking_about_variable_name.fit_transform(X_train)

In [None]:
t_sne = TSNE(n_components=2, verbose=20, random_state=144, n_jobs=-1)

# openTSNE's TSNE is driven through fit(), which returns the embedding as an ndarray subclass
vis_t_sne = t_sne.fit(we_were_talking_about_variable_name_svd)

In [None]:
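# `colors` and `sizes` were never defined; as a stand-in, color by integer-coded cuisine and use a fixed size
colors = y_train.astype("category").cat.codes
plt.figure(figsize=(16, 10))
plt.scatter(x=vis_t_sne[:, 0], y=vis_t_sne[:, 1], c=colors, s=10, alpha=0.3, cmap="viridis")
plt.colorbar()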

In [None]:
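# `sheeeeeesh` is not reloaded in this session; `recipes_with_cv` is, and it carries cuisine_name
recipes_with_cv['cuisine_name']

In [None]:
recipes_with_cv['cuisine_name'][49]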