# Exploratory Data Analysis of Epicurious Scrape in a JSON file

This is Aaron Chen's idealized workflow for approaching data science problems. It likely isn't the best path, and he has not rigidly applied or stuck to it, but he wishes he worked this way more often.

## Purpose: Work through some exploratory data analysis of the Epicurious scrape on stream. Try to write some functions to help process the data.

### Author: Aaron Chen


---

### If needed, run shell commands here

In [None]:
#!python -m spacy download en_core_web_sm

---

## External Resources

List out references or documentation that has helped you with this notebook

### Code
Regex Checker: https://regex101.com/

#### Scikit-learn
1. https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation-lda

### Data

For this notebook, the data is stored in the repo base folder/data/raw

### Process

Are there steps or tutorials you are following? Those are things I try to list in Process

___

## Import necessary libraries

In [1]:
import project_path
# from collections import defaultdict 
from datetime import datetime
# import dvc.api
# import gensim
# import gensim.corpora as corpora
# from gensim.models import CoherenceModel, Phrases
# from gensim.utils import simple_preprocess
# import logging
# import mlflow
# import mlflow.sklearn
# import mlflow.spacy
import numpy as np
import pandas as pd
# from pprint import pprint
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

import spacy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
# from spacy.lemmatizer import Lemmatizer
from tqdm import tqdm
# from typing import List
import unicodedata

import src.dataframe_preprocessor as dfpp


---

## Define helper functions

My workflow is to try things in code cells, then, when the cells get messy and repetitive, convert them into helper functions that can be called.

When the helper functions get used a lot, it is usually better to convert them into scripts or classes that can be called or instantiated.

In [None]:
def preprocess_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """ This function takes in a pandas DataFrame from pd.read_json and performs some preprocessing by unpacking the nested dictionaries and creating new columns with the simplified structures. It will then drop the original columns that would no longer be needed.

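    Args:
        df: pd.DataFrame as returned by pd.read_json on the Epicurious scrape (modified in place)
    
    Returns:
        pd.DataFrame with the flattened columns added and the original nested columns dropped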
    """

    def null_filler(to_check: dict[str, str], key_target: str) -> str:
        """ This function takes in a dictionary that is currently fed in with a lambda function and then performs column specific preprocessing.
        
        Args:
            to_check: dict
            key_target: str
            
        Returns:
            str
        """

        # Only look in the following keys, if the input isn't one of these, it should be recognized as an improper key
        valid_keys = ['name', 'filename', 'credit']

        # This dictionary converts the input keys into substrings that can be used in f-strings to fill in missing values in the record
        translation_keys = {
                            'name': "Cuisine"
                            , 'filename': "Photo"
                            , 'credit': "Photo Credit"
                            }

        if key_target not in valid_keys:
            # this logic makes sure we are only looking at valid keys
            return "Improper key target: can only pick from 'name', 'filename', 'credit'."

        else:
            if pd.isna(to_check):
                # this logic checks whether the dictionary exists at all; if it does not, return the Missing placeholder
                return f'Missing {translation_keys[key_target]}'
            else:
                if key_target == 'name' and (to_check['category'] != 'cuisine'):
                    # This logic checks for the cuisine, if the cuisine is not there (and instead has 'ingredient', 'type', 'item', 'equipment', 'meal'), mark as missing
                    return f'Missing {translation_keys[key_target]}'
                else:
                    # Otherwise, there should be no issue with returning 
                    return to_check[key_target]
                

    # Dive into the tag column and extract the cuisine label. Put into new column or fills with "missing data"
    df["cuisine_name"] = df["tag"].apply(lambda x: null_filler(to_check=x, key_target='name'))
    # df["cuisine_name"] = df["tag"].apply(lambda x: x['name'] if not pd.isna(x) and x['category'] == 'cuisine' else 'Cuisine Missing')

    # this lambda function goes into the photo data column and extracts just the filename from the dictionary
    df["photo_filename"] = df["photoData"].apply(lambda x: null_filler(to_check=x, key_target='filename'))
    # df["photo_filename"] = df['photoData'].apply(lambda x: x['filename'] if not pd.isna(x) else 'Missing photo')

    # This lambda function goes into the photo data column and extracts just the photo credit from the dictionary   
    df["photo_credit"] = df["photoData"].apply(lambda x: null_filler(to_check=x, key_target='credit'))
    # df["photo_credit"] = df['photoData'].apply(lambda x: x['credit'] if not pd.isna(x) else 'Missing credit')

    # for the above, maybe they can be refactored to one function where the arguments are a column name, dictionary key name, the substring return 

    # this lambda function goes into the author column and extracts the author name or fills with "Missing Author Name"
    df["author_name"] = df["author"].apply(lambda x: x[0]['name'] if x else 'Missing Author Name')

    # Convert the pubDate column to datetime objects and store them in a new column
    df['date_published'] = pd.to_datetime(df['pubDate'])
    
    # drop some original columns to clean up the dataframe
    df.drop(labels=["tag", 'photoData', "author", "type", 'dateCrawled', 'pubDate'], axis=1, inplace=True)

    return df
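
The refactor floated in the comment above (one function that takes a column name, a dictionary key, and the fill string) could look roughly like this. This is only a sketch: `extract_nested_field` and the toy records are hypothetical and not part of the repo, and the cuisine category check is left out for brevity.

```python
import pandas as pd


def extract_nested_field(df, column, key, missing_label):
    """Pull `key` out of the dict stored in each cell of `column`.

    Cells that are not dicts (for example NaN or None) or that lack `key`
    are replaced with `missing_label`.
    """
    def pick(cell):
        if isinstance(cell, dict) and key in cell:
            return cell[key]
        return missing_label

    return df[column].apply(pick)


# Toy records standing in for the real scrape (made-up values).
toy = pd.DataFrame({
    "photoData": [{"filename": "soup.jpg", "credit": "A. Photographer"}, None],
})
toy["photo_filename"] = extract_nested_field(toy, "photoData", "filename", "Missing Photo")
toy["photo_credit"] = extract_nested_field(toy, "photoData", "credit", "Missing Photo Credit")
print(toy[["photo_filename", "photo_credit"]])
```

Checking `isinstance(cell, dict)` instead of `pd.isna` keeps the helper safe for cells that hold lists or other non-scalar values.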

In [None]:
import this

In [None]:
import antigravity

In [None]:
# def remove_empties(deficiency_text: List) -> List:
#     """This function takes in a list of strings and removes empty strings from the list. The function is needed 
#     because if the list does not contain empty strings, the default remove() function returns None and an Error."""
    
#     filtered = list(filter(lambda x: x != '', deficiency_text))

#     return filtered

In [None]:
# def lemmatizer(doc):
#     # This takes in a doc of tokens from the NER and lemmatizes them. 
#     # Pronouns (like "I" and "you" get lemmatized to '-PRON-', so I'm removing those.
#     doc = [token.lemma_ for token in doc if token.lemma_ != '-PRON-']
#     doc = u' '.join(doc)
#     return nlp.make_doc(doc)

In [None]:
# def remove_stopwords(doc):
#     # This will remove stopwords and punctuation.
#     # Use token.text to return strings, which we'll need for Gensim.
#     doc = [token for token in doc if token.is_stop != True and token.is_punct != True]
#     return doc

In [None]:
# nlp.add_pipe(lemmatizer,name='lemmatizer',after='ner')
# nlp.add_pipe(remove_stopwords, name="stopwords", last=True)

### Import local script

I have started grouping these with the library imports, but placing them at the bottom of the list.

In [None]:
# import project_path
# import src.nhsn_vac_df_builder as nhsn_vac

---

## Define global variables 
### Remember to refactor these out, not ideal

In [None]:
data_path = "../../data/recipes-en-201706/epicurious-recipes_m2.json"

---

## Running Commentary

1. I used numbered lists to keep track of things I noticed

### To Do

1. Try to determine consistency of nested data structures
   1. Is photoData, or the number of things inside photoData, the same from record to record?
   2. What about for tag?

Data wasn't fully consistent but logic in helper function helped handle nulls

2. How to handle nulls?
   1. Author: filled in with "Missing Author Name"
   2. Tag: filled in with "Missing Cuisine"
3. ~~Convert pubDate to actual timestamp~~  
4. ~~Convert ScrapeDate to actual timestamp~~
   1. This was ignored as the datestamp was not useful (generally within minutes of the origin of UNIX time)
   
**5. Append new columns for relevant nested structures and unfold them**

6. Determine actual types of `ingredients` and `prepSteps`
7. Continue working through test example of single recipe to feed into spaCy and then sklearn.feature_extraction.text stack
8. Will need to remove numbers, punctuation
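
For item 8, a minimal sketch of stripping numbers and punctuation with regular expressions. The function name and the sample ingredient line are made up for illustration, and the real cleaning might instead happen inside spaCy or the vectorizer.

```python
import re


def strip_numbers_and_punctuation(text):
    """Remove digits and punctuation, collapse whitespace, and lowercase."""
    no_digits = re.sub(r"\d+", " ", text)          # drop runs of digits
    no_punct = re.sub(r"[^\w\s]", " ", no_digits)  # drop anything that is not a word char or space
    return re.sub(r"\s+", " ", no_punct).strip().lower()


print(strip_numbers_and_punctuation("1 1/2 cups all-purpose flour, sifted"))
# cups all purpose flour sifted
```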

---

## Importing and viewing the data as a dataframe

In [None]:
epic_dataframe = pd.read_json(path_or_buf=data_path)

In [None]:
epic_dataframe.head()

In [None]:
epic_dataframe['type'].value_counts()

In [None]:
epic_dataframe.shape

In [None]:
epic_dataframe.describe()

In [None]:
epic_dataframe.info()

In [None]:
epic_dataframe['aggregateRating'].value_counts()

Columns:

    Index

    ID: string

    dek: appears to be description of the recipe, string

    hed: Appears to be title, string

    pubDate: appears to be publication date, may need to reformat to datetime objects

    author: appears that each record contains an array (list), inside each list is a dictionary with 'name' as the key and author name as the value. Notably, not a unique identifier for the value. Because the data is technically nested, may need to extract and transform and add columns to dataframe

    type: string, but all the values are exactly the same and they are all in the category of "recipe". Drop column

    url: Appears to be a long string leading to where the recipe can be found on Epicurious's website

    photoData: nested structure, inside each record is a dictionary

    tag: each record contains a dictionary. may need to extract and transform and add columns to dataframe

    aggregateRating: float, let's say it's out of 4.0

    ingredients: appears to be a list, does look like a list of strings

    prepSteps: appears to be a list, does look like a list of strings

    reviewsCount: int

    willMakeAgainPct: integer	
    
    dateCrawled: appears to be a unix timestamp

Let's take a look at the possibly problematic columns and see if the data structures make sense or how we can approach transforming them into new columns for the dataset
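
One quick way to confirm the "appears to be a list" guesses is to count the Python type of every cell in a column. This is a sketch on a toy frame with made-up values, and `type_counts` is a hypothetical helper, not something from the repo.

```python
import pandas as pd

# Toy stand-in for the real dataframe (made-up values).
toy = pd.DataFrame({
    "ingredients": [["1 cup flour", "2 eggs"], ["salt"], None],
    "prepSteps": [["Mix.", "Bake."], ["Season."], ["Serve."]],
})


def type_counts(series):
    """Count the Python type name of every cell in a column."""
    return series.map(lambda cell: type(cell).__name__).value_counts().to_dict()


for column in ["ingredients", "prepSteps"]:
    print(column, type_counts(toy[column]))
```

Any type other than `list` in the output points at records that need special handling.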

In [None]:
epic_dataframe.loc[0]

In [None]:
epic_dataframe.loc[0]["photoData"]

It looks like photoData contains:
    1. photo ID, string
    2. photo filename, string
    3. photo caption, string
    4. photo credit, string
    5. promoTitle, string
    6. title, string
       1. caption, promoTitle, and title could be all the same
    7. orientation, string
    8. restrictCropping: boolean

Of these, maybe we should keep:
- id => photoID
- filename => photoFilename
- caption => photoCaption
- credit => photoCredit
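
To check whether every photoData record really carries the same keys (To Do item 1), something like the following could be run on the column. This is a sketch on made-up records, and `key_signature_counts` is a hypothetical helper.

```python
from collections import Counter

import pandas as pd


def key_signature_counts(series):
    """Count how often each set of dictionary keys occurs in a column of dicts."""
    signatures = Counter()
    for cell in series:
        if isinstance(cell, dict):
            signatures[tuple(sorted(cell))] += 1
        else:
            # NaN, None, or anything else that is not a dictionary
            signatures[("<not a dict>",)] += 1
    return signatures


# Toy records standing in for epic_dataframe["photoData"] (made-up values).
toy_photo = pd.Series([
    {"id": "1", "filename": "a.jpg", "credit": "X"},
    {"id": "2", "filename": "b.jpg"},
    None,
])
for keys, count in key_signature_counts(toy_photo).most_common():
    print(count, keys)
```

More than one signature in the output means the nested structure is not consistent from record to record. The same helper would work for the tag column.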


In [None]:
epic_dataframe.loc[0]["tag"]

In [None]:
epic_dataframe.loc[100]["tag"]

In [None]:
epic_dataframe.loc[10]["tag"]

In [None]:
epic_dataframe.loc[1]["tag"]

In [None]:
epic_dataframe.loc[1]["ingredients"]

In [None]:
type(epic_dataframe.loc[1]["ingredients"])

In [None]:
epic_dataframe.loc[1]["dek"]

### Let's skip ahead and throw this into CountVectorizer

In [None]:
all_recipes_list = epic_dataframe['ingredients'].str.join(" ")
# .apply(" ".join).str.lower()
#print(type(all_recipes_list))
all_recipes_list
# for i in range(0,5):
#     print((i, all_recipes_list[i]))
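
As a rough picture of where this is heading, here is CountVectorizer on two made-up ingredient strings of the same shape as the entries in all_recipes_list. This is a sketch with default settings, not the final configuration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up, space-joined ingredient strings.
toy_recipes = [
    "2 cups flour 1 cup sugar",
    "1 cup sugar 3 eggs",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(toy_recipes)

# The default token pattern keeps only tokens of two or more characters,
# so single-digit quantities never make it into the vocabulary.
print(sorted(vectorizer.vocabulary_))
print(counts.toarray())
```

Note that "cup" and "cups" end up as separate features, which is one reason to lemmatize with spaCy first.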

In [None]:
nlp = spacy.load("en_core_web_sm")

In [None]:
first_rec = all_recipes_list[0]
print(first_rec)

In [None]:
doc_first_rec = nlp(first_rec)
for token in doc_first_rec:
    # skip tokens that look like numbers
    if not token.like_num:
        print(token.text, token.pos_, token.ent_type_, token.lemma_, token.is_digit)

In [None]:
epic_dataframe['tag'][2]

In [None]:
epic_dataframe['cuisine_name'] = epic_dataframe['tag'].apply(lambda x: x['name'] if not pd.isna(x) and x['category'] == 'cuisine' else 'Cuisine Missing')

In [None]:
epic_dataframe['cuisine_name'].head()

In [None]:
epic_dataframe['cuisine_name'][0]

In [None]:
print(epic_dataframe.shape)
print(epic_dataframe[epic_dataframe['cuisine_name'] != 'Cuisine Missing'].shape)

In [None]:
epic_dataframe[epic_dataframe['cuisine_name'] == 'Cuisine Missing'] 

In [None]:
# this lambda function goes into the photo data column and extracts just the filename from the dictionary
epic_dataframe["photo_filename"] = epic_dataframe['photoData'].apply(lambda x: x['filename'] if not pd.isna(x) 
 else 'Missing photo')

# This lambda function goes into the photo data column and extracts just the photo credit from the dictionary 
epic_dataframe["photo_credit"] = epic_dataframe['photoData'].apply(lambda x: x['credit'] if not pd.isna(x) 
 else 'Missing credit')

In [None]:
epic_dataframe.head()

In [None]:
# First attempt at extracting the author name into a new column.
# pd.isna() returns an array when handed a list, so this check breaks on empty or multi-author lists.
epic_dataframe["author_name"] = epic_dataframe['author'].apply(lambda x: x[0]['name'] if not pd.isna(x) 
 else 'Missing author name')

In [None]:
first_five_epic = epic_dataframe.head()

In [None]:
first_five_epic

In [None]:
first_five_epic.iloc[0]["author"][0]

In [None]:
first_five_epic.iloc[1]["author"][0]

In [None]:
first_five_epic["author"].apply(lambda x: x[0]['name'] if x else 'Missing author name')

This lambda function works well enough! It goes into the author column and extracts the first author's name as long as the record isn't an empty list. It can be refactored into a helper function, but first we need to apply it to the whole dataset.
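
A sketch of what that helper might look like. `first_author_name` is a hypothetical name and the toy records are made up. Unlike the lambda, it also tolerates cells that are not lists at all.

```python
import pandas as pd


def first_author_name(authors, missing_label="Missing author name"):
    """Return the first author's name, or a placeholder when the list is empty or missing."""
    if isinstance(authors, list) and authors:
        return authors[0].get("name", missing_label)
    return missing_label


# Toy stand-in for epic_dataframe["author"] (made-up values).
toy_authors = pd.Series([[{"name": "Jane Doe"}], [], None], dtype=object)
print(toy_authors.apply(first_author_name).tolist())
# ['Jane Doe', 'Missing author name', 'Missing author name']
```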

In [None]:
epic_dataframe["author_name"] = epic_dataframe["author"].apply(lambda x: x[0]['name'] if x else 'Missing author name')

In [None]:
epic_dataframe

In [None]:
epic_dataframe2 = preprocess_dataframe(pd.read_json(path_or_buf=data_path))

epic_dataframe2.head()

## Let's add a feature to fix the datetimes

In [None]:
test_pubdate_array = epic_dataframe2['pubDate'][0:5]
test_pubdate_array

In [None]:
print(type(test_pubdate_array[0]))

In [None]:
test_pubdate_array[0][:10]

In [None]:
epic_dataframe2['publication_date'] = epic_dataframe2['pubDate'].apply(lambda x: datetime.strptime(x[:10], "%Y-%m-%d"))

In [None]:
epic_dataframe2['publication_date']

In [None]:
epic_dataframe2['publication_date_todt'] = pd.to_datetime(epic_dataframe2['pubDate'])

In [None]:
epic_dataframe2['publication_date_todt']

The apply with a lambda function is no longer needed because to_datetime successfully parsed the odd string.

In [None]:
epic_dataframe2['date_scraped'] = pd.to_datetime(epic_dataframe2['dateCrawled'])
epic_dataframe2['date_scraped']

In [None]:
epic_dataframe2['date_scraped'].describe()

Based on the timestamps, it seems like we can drop the crawled/scraped column because the values don't really make sense and would not help
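
One possible explanation for the odd values, offered as a guess rather than a finding: pd.to_datetime reads bare integers as nanoseconds since the epoch unless `unit` is given, so epoch seconds or milliseconds land within moments of 1970-01-01. The values below are made up, and whether dateCrawled is really in milliseconds is an assumption to check against the raw data.

```python
import pandas as pd

# Made-up epoch timestamps, assumed here to be in milliseconds.
crawled = pd.Series([1498000000000, 1498000600000])

as_nanoseconds = pd.to_datetime(crawled)              # default: integers are nanoseconds
as_milliseconds = pd.to_datetime(crawled, unit="ms")  # explicit unit

print(as_nanoseconds.dt.year.tolist())
print(as_milliseconds.dt.year.tolist())
```

If the explicit unit produces dates around the time of the scrape, the column may be worth keeping after all.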

To Do after break:
- Refactor datetime processing into functions
- Consider deploying as a pd.pipe()
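
A minimal sketch of what the pipe idea could look like. The two step functions and the toy frame are hypothetical, and the date strings are made up rather than copied from the scrape.

```python
import pandas as pd


def add_published_date(df, source_col="pubDate", new_col="date_published"):
    """Return a copy of df with source_col parsed into a datetime column."""
    out = df.copy()
    out[new_col] = pd.to_datetime(out[source_col])
    return out


def drop_columns(df, columns):
    """Return df without the listed columns, ignoring any that are absent."""
    return df.drop(columns=[c for c in columns if c in df.columns])


# Toy stand-in for the raw dataframe (made-up values).
toy = pd.DataFrame({
    "pubDate": ["2015-03-01T04:00:00.000Z", "2016-11-20T05:00:00.000Z"],
    "type": ["recipe", "recipe"],
})

cleaned = (
    toy
    .pipe(add_published_date)
    .pipe(drop_columns, columns=["pubDate", "type"])
)
print(cleaned.dtypes)
```

Each step returns a new dataframe instead of mutating its input, which makes the chain easier to rerun from a notebook.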


In [None]:
epic_dataframe['tag']

In [None]:
epic_dataframe = pd.read_json(path_or_buf=data_path)
preprocess_dataframe(epic_dataframe)
epic_dataframe.head(10)

In [None]:
epic_dataframe.describe()

In [None]:
epic_dataframe.info()

In [4]:
data_path = "../../data/recipes-en-201706/epicurious-recipes_m2.json"

epic_dataframe = pd.read_json(data_path, typ='frame')

dfpp.preprocess_dataframe(df=epic_dataframe)
epic_dataframe.head(10)

NameError: name 'Series' is not defined

In [None]:
# this is looking for accented characters inside text, which may not be necessary
# test_rec_list was never defined above; splitting first_rec into words is an assumption
for word in first_rec.split():
    print(unicodedata.normalize("NFKD", word))
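
If accents do turn out to matter, one common approach is to decompose with NFKD and then drop the combining marks. This is a sketch with a made-up ingredient string, and `strip_accents` is a hypothetical helper.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("crème fraîche, jalapeño"))
# creme fraiche, jalapeno
```

Normalizing alone only splits each accented character into a base letter plus a mark. The filter is what actually removes the accent.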

In [None]:
cv = CountVectorizer(input="content")  # "content" (the default) expects raw strings such as those in all_recipes_list