# Exploratory Data Analysis of Epicurious Scrape in a JSON file

This is Aaron Chen's idealized workflow for approaching data science problems. It likely isn't the best path, and he has not applied it rigidly, but he wishes he worked this way more often.

## Purpose: Work through some exploratory data analysis of the Epicurious scrape on stream. Try to write some functions to help process the data.

### Author: Aaron Chen

---

### If needed, run shell commands here

In [None]:
#!python -m spacy download en_core_web_sm

---

## External Resources

List references or documentation that have helped you with this notebook.

### Code
Regex Checker: https://regex101.com/

#### Scikit-learn
1. https://scikit-learn.org/stable/modules/decomposition.html#latent-dirichlet-allocation-lda


### Data

For this notebook, the data is stored in the repo base folder/data/raw

### Process

Are there steps or tutorials you are following? Those are things I try to list in Process

---

## Import necessary libraries

In [None]:
# | hide
import project_path
from datetime import datetime
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

import spacy
import en_core_web_sm
from spacy.lang.en.stop_words import STOP_WORDS
from tqdm import tqdm
from typing import Dict
import unicodedata

# import local scripts
import src.dataframe_preprocessor as dfpp



---

## Define helper functions

My workflow is to try things with code cells, then when the code cells get messy and repetitive, to convert into helper functions that can be called.

When the helper functions get used a lot, it is usually better to convert them to scripts or classes that can be called or instantiated.

---

## Define global variables 

**Remember to refactor these out, not ideal**

In [None]:
# | hide
data_path = "../data/recipes-en-201706/epicurious-recipes_m2.json"

---

## Running Commentary

1. I used numbered lists to keep track of things I noticed

### To Do

1. Try to determine consistency of nested data structures
   1. Is the photoData or number of things inside photoData the same from record to record
   2. What about for tag?

The data wasn't fully consistent, but logic in the helper functions handles the nulls.

2. How to handle nulls?
   1. Author: filled in with "Missing author name"
   2. Tag: filled in with "Cuisine Missing"
3. ~~Convert pubDate to actual timestamp~~  
4. ~~Convert ScrapeDate to actual timestamp~~
   1. This was ignored as the datestamp was not useful (generally within minutes of the origin of UNIX time)
   
**5. Append new columns for relevant nested structures and unfold them**

6. Determine actual types of `ingredients` and `prepSteps`
7. Continue working through test example of single recipe to feed into spaCy and then sklearn.feature_extraction.text stack
8. Will need to remove numbers, punctuation

---

## Importing and viewing the data as a dataframe

In [None]:
epic_dataframe = pd.read_json(path_or_buf=data_path)

In [None]:
epic_dataframe.head()

Unnamed: 0,id,dek,hed,pubDate,author,type,url,photoData,tag,aggregateRating,ingredients,prepSteps,reviewsCount,willMakeAgainPct,dateCrawled
0,54a2b6b019925f464b373351,How does fried chicken achieve No. 1 status? B...,Pickle-Brined Fried Chicken,2014-08-19T04:00:00.000Z,[],recipe,/recipes/food/views/pickle-brined-fried-chicke...,"{'id': '54a2b64a6529d92b2c003409', 'filename':...","{'category': 'ingredient', 'name': 'Chicken', ...",3.11,"[1 tablespoons yellow mustard seeds, 1 tablesp...",[Toast mustard and coriander seeds in a dry me...,7,100,1498547035
1,54a408a019925f464b3733bc,Spinaci all'Ebraica,Spinach Jewish Style,2008-09-09T04:00:00.000Z,[{'name': 'Edda Servi Machlin'}],recipe,/recipes/food/views/spinach-jewish-style-350152,"{'id': '56746182accb4c9831e45e0a', 'filename':...","{'category': 'cuisine', 'name': 'Italian', 'ur...",3.22,"[3 pounds small-leaved bulk spinach, Salt, 1/2...",[Remove the stems and roots from the spinach. ...,5,80,1498547740
2,54a408a26529d92b2c003631,"This majestic, moist, and richly spiced honey ...",New Year’s Honey Cake,2008-09-10T04:00:00.000Z,[{'name': 'Marcy Goldman'}],recipe,/recipes/food/views/majestic-and-moist-new-yea...,"{'id': '55e85ba4cf90d6663f728014', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",3.62,"[3 1/2 cups all-purpose flour, 1 tablespoon ba...",[I like this cake best baked in a 9-inch angel...,105,88,1498547738
3,54a408a66529d92b2c003638,The idea for this sandwich came to me when my ...,The B.L.A.Bagel with Lox and Avocado,2008-09-08T04:00:00.000Z,[{'name': 'Faye Levy'}],recipe,/recipes/food/views/the-b-l-a-bagel-with-lox-a...,"{'id': '5674617e47d1a28026045e4f', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",4.0,"[1 small ripe avocado, preferably Hass (see No...","[A short time before serving, mash avocado and...",7,100,1498547740
4,54a408a719925f464b3733cc,"In 1930, Simon Agranat, the chief justice of t...",Shakshuka a la Doktor Shakshuka,2008-09-09T04:00:00.000Z,[{'name': 'Joan Nathan'}],recipe,/recipes/food/views/shakshuka-a-la-doktor-shak...,"{'id': '56746183b47c050a284a4e15', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",2.71,"[2 pounds fresh tomatoes, unpeeled and cut in ...","[1. Place the tomatoes, garlic, salt, paprika,...",7,83,1498547740


In [None]:
epic_dataframe["type"].value_counts()

recipe    34756
Name: type, dtype: int64

In [None]:
epic_dataframe.shape

(34756, 15)

In [None]:
epic_dataframe.describe()

Unnamed: 0,aggregateRating,reviewsCount,willMakeAgainPct,dateCrawled
count,34756.0,34756.0,34756.0,34756.0
mean,2.937041,19.467257,75.435579,1498551000.0
std,1.079268,36.121799,31.064887,22763.13
min,0.0,0.0,0.0,1498546000.0
25%,2.86,3.0,70.0,1498548000.0
50%,3.27,9.0,87.0,1498548000.0
75%,3.58,22.0,98.0,1498549000.0
max,4.0,1586.0,100.0,1498869000.0


In [None]:
epic_dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34756 entries, 0 to 34755
Data columns (total 15 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                34756 non-null  object 
 1   dek               34756 non-null  object 
 2   hed               34756 non-null  object 
 3   pubDate           34756 non-null  object 
 4   author            34756 non-null  object 
 5   type              34756 non-null  object 
 6   url               34756 non-null  object 
 7   photoData         34756 non-null  object 
 8   tag               34655 non-null  object 
 9   aggregateRating   34756 non-null  float64
 10  ingredients       34656 non-null  object 
 11  prepSteps         34756 non-null  object 
 12  reviewsCount      34756 non-null  int64  
 13  willMakeAgainPct  34756 non-null  int64  
 14  dateCrawled       34756 non-null  int64  
dtypes: float64(1), int64(3), object(11)
memory usage: 4.0+ MB


In [None]:
epic_dataframe["aggregateRating"].value_counts()

0.00    3316
3.00    3113
4.00    2161
3.50    1286
3.33     894
        ... 
1.38       1
1.62       1
1.48       1
2.51       1
1.55       1
Name: aggregateRating, Length: 234, dtype: int64

Columns:

- Index
- id: string
- dek: appears to be a description of the recipe, string
- hed: appears to be the title, string
- pubDate: appears to be the publication date, may need to reformat to datetime objects
- author: each record appears to contain an array (list); inside each list is a dictionary with 'name' as the key and the author's name as the value. Notably, the value is not a unique identifier. Because the data is technically nested, may need to extract, transform, and add columns to the dataframe
- type: string, but every value is exactly "recipe". Drop column
- url: appears to be a long string leading to where the recipe can be found on Epicurious's website
- photoData: nested structure, each record contains a dictionary
- tag: each record contains a dictionary. May need to extract, transform, and add columns to the dataframe
- aggregateRating: float, let's say it's out of 4.0
- ingredients: appears to be a list of strings
- prepSteps: appears to be a list of strings
- reviewsCount: int
- willMakeAgainPct: int
- dateCrawled: appears to be a UNIX timestamp

Let's look at the possibly problematic columns to see whether the data structures make sense and how we might transform them into new columns for the dataset.
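
One way to check the nested columns systematically, rather than eyeballing single records, is to count the Python types stored in each object column. The following is a minimal sketch on a toy frame that mimics the shapes seen above; the toy data and the `python_type_counts` helper name are assumptions for illustration and are not part of the scrape or of `src.dataframe_preprocessor`.

```python
import pandas as pd


def python_type_counts(df: pd.DataFrame, column: str) -> pd.Series:
    """Count the Python type names of the values stored in one column."""
    return df[column].map(lambda value: type(value).__name__).value_counts()


# Toy records shaped like the scrape: lists, dicts, and a missing tag
toy = pd.DataFrame(
    {
        "author": [[], [{"name": "Edda Servi Machlin"}]],
        "tag": [{"category": "cuisine", "name": "Italian"}, None],
    }
)

author_types = python_type_counts(toy, "author")
tag_types = python_type_counts(toy, "tag")
print(author_types)
print(tag_types)
```

A column that reports more than one type name (for example both `dict` and `NoneType`) is a candidate for the null handling listed in the To Do section.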

In [None]:
epic_dataframe.loc[0]

id                                           54a2b6b019925f464b373351
dek                 How does fried chicken achieve No. 1 status? B...
hed                                       Pickle-Brined Fried Chicken
pubDate                                      2014-08-19T04:00:00.000Z
author                                                             []
type                                                           recipe
url                 /recipes/food/views/pickle-brined-fried-chicke...
photoData           {'id': '54a2b64a6529d92b2c003409', 'filename':...
tag                 {'category': 'ingredient', 'name': 'Chicken', ...
aggregateRating                                                  3.11
ingredients         [1 tablespoons yellow mustard seeds, 1 tablesp...
prepSteps           [Toast mustard and coriander seeds in a dry me...
reviewsCount                                                        7
willMakeAgainPct                                                  100
dateCrawled                                                1498547035
Name: 0, dtype: object

In [None]:
epic_dataframe.loc[0]["photoData"]

{'id': '54a2b64a6529d92b2c003409',
 'filename': '51247610_fried-chicken_1x1.jpg',
 'caption': 'Pickle-Brined Fried Chicken',
 'credit': 'Michael Graydon and Nikole Herriott',
 'promoTitle': 'Pickle-Brined Fried Chicken',
 'title': 'Pickle-Brined Fried Chicken',
 'orientation': 'landscape',
 'restrictCropping': False}

It looks like photoData contains:

1. photo ID, string
2. photo filename, string
3. photo caption, string
4. photo credit, string
5. promoTitle, string
6. title, string
   1. caption, promoTitle, and title could all be the same
7. orientation, string
8. restrictCropping, boolean

Of these, maybe we should keep:

- id => photoID
- filename => photoFilename
- caption => photoCaption
- credit => photoCredit
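
The renaming proposed above could be sketched as a small helper that unfolds the kept `photoData` keys into flat columns. This is an illustration under assumptions: `PHOTO_FIELDS` and `unfold_photo_data` are hypothetical names, and the toy frame stands in for `epic_dataframe`.

```python
import pandas as pd

# Hypothetical mapping from photoData keys to flat column names
PHOTO_FIELDS = {
    "id": "photoID",
    "filename": "photoFilename",
    "caption": "photoCaption",
    "credit": "photoCredit",
}


def unfold_photo_data(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with one new column per kept photoData field."""
    out = df.copy()
    for key, new_name in PHOTO_FIELDS.items():
        # Absent keys and non-dict records (e.g. nulls) both map to None
        out[new_name] = out["photoData"].map(
            lambda photo, key=key: photo.get(key) if isinstance(photo, dict) else None
        )
    return out


toy = pd.DataFrame(
    {
        "photoData": [
            {
                "id": "54a2b64a6529d92b2c003409",
                "filename": "51247610_fried-chicken_1x1.jpg",
                "caption": "Pickle-Brined Fried Chicken",
                "credit": "Michael Graydon and Nikole Herriott",
            },
            None,  # stands in for a record without photo data
        ]
    }
)
unfolded = unfold_photo_data(toy)
print(unfolded[["photoID", "photoFilename"]])
```

Using `.get()` instead of indexing means a record that lacks one of the keys yields a null rather than raising a `KeyError`.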


In [None]:
epic_dataframe.loc[0]["tag"]

{'category': 'ingredient',
 'name': 'Chicken',
 'url': '',
 'photosBadgeAltText': '',
 'photosBadgeFileName': '',
 'photosBadgeID': '',
 'photosBadgeRelatedUri': ''}

In [None]:
epic_dataframe.loc[100]["tag"]

{'category': 'ingredient',
 'name': 'Champagne',
 'url': '',
 'photosBadgeAltText': '',
 'photosBadgeFileName': '',
 'photosBadgeID': '',
 'photosBadgeRelatedUri': ''}

In [None]:
epic_dataframe.loc[10]["tag"]

{'category': 'type',
 'name': 'Cake',
 'url': '',
 'photosBadgeAltText': '',
 'photosBadgeFileName': '',
 'photosBadgeID': '',
 'photosBadgeRelatedUri': ''}

In [None]:
epic_dataframe.loc[1]["tag"]

{'category': 'cuisine',
 'name': 'Italian',
 'url': '',
 'photosBadgeAltText': '',
 'photosBadgeFileName': '',
 'photosBadgeID': '',
 'photosBadgeRelatedUri': ''}
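
The records above suggest that `tag` holds a single dictionary whose `category` varies (ingredient, type, cuisine). Rather than sampling rows one at a time, a tally of the categories shows how many records carry each kind of tag. A minimal sketch on toy data (the "no tag" and "no category" labels are assumptions chosen for this example):

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "tag": [
            {"category": "ingredient", "name": "Chicken"},
            {"category": "cuisine", "name": "Italian"},
            {"category": "cuisine", "name": "Jewish"},
            {"category": "type", "name": "Cake"},
            None,  # stands in for a record with a null tag
        ]
    }
)

# Pull out the category, labelling records that have no tag dictionary
tag_category = toy["tag"].map(
    lambda tag: tag.get("category", "no category") if isinstance(tag, dict) else "no tag"
)
category_counts = tag_category.value_counts()
print(category_counts)
```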

In [None]:
epic_dataframe.loc[1]["ingredients"]

['3 pounds small-leaved bulk spinach',
 'Salt',
 '1/2 cup dark seedless raisins',
 '1 cup lukewarm water',
 '6 tablespoons olive oil',
 '1/2 small onion, minced',
 '1/4 cup pignoli (pine nuts)',
 'Freshly ground black pepper',
 'Dash nutmeg']

In [None]:
type(epic_dataframe.loc[1]["ingredients"])

list

Set this aside for count vectorization and natural language processing later, and continue with feature exploration.
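
Before any vectorization, the To Do list calls for removing numbers and punctuation. The following is a rough sketch of that cleaning step using only the standard library; `clean_ingredient` is a hypothetical helper, and dropping every non-ASCII-letter character is a simplifying assumption that may be too aggressive for some ingredient names.

```python
import re
import unicodedata


def clean_ingredient(text: str) -> str:
    """Lowercase one ingredient string and drop accents, digits, and punctuation."""
    # Decompose accented characters and drop the combining marks
    normalized = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in normalized if not unicodedata.combining(ch))
    # Replace anything that is not an ASCII letter or whitespace with a space
    letters_only = re.sub(r"[^A-Za-z\s]", " ", without_accents)
    # Collapse repeated whitespace and lowercase
    return re.sub(r"\s+", " ", letters_only).strip().lower()


ingredients = [
    "3 pounds small-leaved bulk spinach",
    "1/2 cup dark seedless raisins",
    "1/4 cup pignoli (pine nuts)",
]
cleaned = [clean_ingredient(item) for item in ingredients]
print(cleaned)
```

Units such as "cup" and "pounds" survive this step; removing them would need a stop-word list or the spaCy pipeline mentioned in the To Do section.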

In [None]:
epic_dataframe.loc[1]["dek"]

"Spinaci all'Ebraica"

In [None]:
epic_dataframe["tag"][2]

{'category': 'cuisine',
 'name': 'Jewish',
 'url': '',
 'photosBadgeAltText': '',
 'photosBadgeFileName': '',
 'photosBadgeID': '',
 'photosBadgeRelatedUri': ''}

In [None]:
epic_dataframe["cuisine_name"] = epic_dataframe["tag"].apply(
    lambda x: x["name"]
    if not pd.isna(x) and x["category"] == "cuisine"
    else "Cuisine Missing"
)

In [None]:
epic_dataframe["cuisine_name"].head()

0    Cuisine Missing
1            Italian
2             Jewish
3             Jewish
4             Jewish
Name: cuisine_name, dtype: object

In [None]:
epic_dataframe["cuisine_name"][0]

'Cuisine Missing'

In [None]:
print(epic_dataframe.shape)
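# NOTE: the fill value above is "Cuisine Missing", so comparing against "Missing" filters nothing out
print(epic_dataframe[epic_dataframe["cuisine_name"] != "Missing"].shape)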

(34756, 16)
(34756, 16)
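
Both shapes above are identical because the comparison string "Missing" never matches the fill value "Cuisine Missing". A minimal sketch of the intended comparison on toy data, keeping the fill value in one constant so the two cannot drift apart (the constant name `MISSING_CUISINE` is an assumption for this example):

```python
import pandas as pd

MISSING_CUISINE = "Cuisine Missing"  # same fill value as the lambda above

toy = pd.DataFrame(
    {"cuisine_name": ["Cuisine Missing", "Italian", "Jewish", "Jewish", "Cuisine Missing"]}
)

has_cuisine = toy["cuisine_name"] != MISSING_CUISINE
print(toy.shape)               # all records
print(toy[has_cuisine].shape)  # only records with a cuisine tag
```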


In [None]:
epic_dataframe[epic_dataframe["cuisine_name"] == "Cuisine Missing"]

Unnamed: 0,id,dek,hed,pubDate,author,type,url,photoData,tag,aggregateRating,ingredients,prepSteps,reviewsCount,willMakeAgainPct,dateCrawled,cuisine_name
0,54a2b6b019925f464b373351,How does fried chicken achieve No. 1 status? B...,Pickle-Brined Fried Chicken,2014-08-19T04:00:00.000Z,[],recipe,/recipes/food/views/pickle-brined-fried-chicke...,"{'id': '54a2b64a6529d92b2c003409', 'filename':...","{'category': 'ingredient', 'name': 'Chicken', ...",3.11,"[1 tablespoons yellow mustard seeds, 1 tablesp...",[Toast mustard and coriander seeds in a dry me...,7,100,1498547035,Cuisine Missing
5,54a408a919925f464b3733d3,Although Nelly Custis omitted sugar in her rec...,Rice Pancakes,2012-02-17T04:00:00.000Z,[{'name': 'Stephen A. McLeod'}],recipe,/recipes/food/views/rice-pancakes-394729,"{'id': '56746183b47c050a284a4e15', 'filename':...","{'category': 'ingredient', 'name': 'Milk/Cream...",0.00,"[1 1/2 cups cooked rice, 2 cups heavy cream, 2...","[1. Combine the rice, cream, and butter. Add t...",0,0,1498547293,Cuisine Missing
6,54a408aa19925f464b3733d6,Editor's note: This recipe is adapted with per...,Jack-O'-Lantern,2008-09-09T04:00:00.000Z,[{'name': 'Matthew Mead'}],recipe,/recipes/food/views/jack-o-lantern-350068,"{'id': '560d7907f9a841923089d7da', 'filename':...","{'category': 'type', 'name': 'Cake', 'url': ''...",1.00,"[2 tablespoons shortening, 2 tablespoons flour...",[1. Preheat the oven to 350°F. Lightly grease ...,1,0,1498547740,Cuisine Missing
7,54a408ab19925f464b3733da,Editor's note: This recipe is reprinted with p...,Seven-Minute Frosting,2008-09-09T04:00:00.000Z,[{'name': 'Matthew Mead'}],recipe,/recipes/food/views/seven-minute-frosting-350069,"{'id': '5674617eb47c050a284a4e11', 'filename':...","{'category': 'equipment', 'name': 'Mixer', 'ur...",3.53,"[1 1/2 cups sugar, 1/3 cup cold water, 2 egg w...","[1. Combine the sugar, water, egg whites, and ...",8,75,1498547740,Cuisine Missing
8,54a408ac19925f464b3733de,Editor's note: This recipe is reprinted with p...,Creamy White Frosting,2008-09-09T04:00:00.000Z,[{'name': 'Matthew Mead'}],recipe,/recipes/food/views/creamy-white-frosting-350079,"{'id': '5674617e47d1a28026045e4f', 'filename':...","{'category': 'equipment', 'name': 'Mixer', 'ur...",2.00,"[1 cup vegetable shortening, 1 1/2 teaspoons v...","[1. With a mixer on medium speed, beat togethe...",5,0,1498547740,Cuisine Missing
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34751,59541a31bff3052847ae2107,Buttering the bread before you waffle it ensur...,Waffled Ham and Cheese Melt with Maple Butter,2017-06-29T14:59:01.368Z,[{'name': 'Daniel Shumski'}],recipe,/recipes/food/views/waffled-ham-and-cheese-mel...,"{'id': '595420c2d52ca90dc28200e6', 'filename':...","{'category': 'tag', 'name': 'Small Plates', 'u...",0.00,"[1 tablespoon unsalted butter, at room tempera...","[Preheat the waffle iron on low., Spread a thi...",0,0,1498857706,Cuisine Missing
34752,5954233ad52ca90dc28200e7,"Spread this easy compound butter on waffles, p...",Maple Butter,2017-06-01T14:57:00.000Z,[{'name': 'Daniel Shumski'}],recipe,/recipes/food/views/maple-butter,"{'id': '5674617eb47c050a284a4e11', 'filename':...","{'category': 'meal', 'name': 'Breakfast', 'url...",0.00,"[8 tablespoons (1 stick) salted butter, at roo...",[Combine the ingredients in a medium-size bowl...,0,0,1498857726,Cuisine Missing
34753,595424c2109c972493636f83,Leftover mac and cheese is not exactly one of ...,Waffled Macaroni and Cheese,2017-06-29T14:54:24.234Z,[{'name': 'Daniel Shumski'}],recipe,/recipes/food/views/waffled-macaroni-and-cheese,"{'id': '5954259cd83a053f4d33bc74', 'filename':...","{'category': 'tag', 'name': 'Small Plates', 'u...",0.00,"[3 tablespoons unsalted butter, plus more for ...",[Preheat the oven to 375°F. Butter a 9x5-inch ...,0,0,1498857706,Cuisine Missing
34754,5956638625dc3d1d829b7166,A classic Mexican beer cocktail you can sip al...,Classic Michelada,2017-06-15T16:41:00.000Z,[{'name': 'Kat Odell'}],recipe,/recipes/food/views/classic-michelada,"{'id': '595553ce0eda38330f0a0a63', 'filename':...","{'category': 'ingredient', 'name': 'Beer', 'ur...",0.00,"[Coarse salt, 2 lime wedges, 2 ounces tomato j...",[Place about 1/4 cup salt on a small plate. Ru...,0,0,1498857714,Cuisine Missing


In [None]:
# this lambda function goes into the photo data column and extracts just the filename from the dictionary
epic_dataframe["photo_filename"] = epic_dataframe["photoData"].apply(
    lambda x: x["filename"] if not pd.isna(x) else "Missing photo"
)

# This lambda function goes into the photo data column and extracts just the photo credit from the dictionary
epic_dataframe["photo_credit"] = epic_dataframe["photoData"].apply(
    lambda x: x["credit"] if not pd.isna(x) else "Missing credit"
)

In [None]:
epic_dataframe.head()

Unnamed: 0,id,dek,hed,pubDate,author,type,url,photoData,tag,aggregateRating,ingredients,prepSteps,reviewsCount,willMakeAgainPct,dateCrawled,cuisine_name,photo_filename,photo_credit
0,54a2b6b019925f464b373351,How does fried chicken achieve No. 1 status? B...,Pickle-Brined Fried Chicken,2014-08-19T04:00:00.000Z,[],recipe,/recipes/food/views/pickle-brined-fried-chicke...,"{'id': '54a2b64a6529d92b2c003409', 'filename':...","{'category': 'ingredient', 'name': 'Chicken', ...",3.11,"[1 tablespoons yellow mustard seeds, 1 tablesp...",[Toast mustard and coriander seeds in a dry me...,7,100,1498547035,Cuisine Missing,51247610_fried-chicken_1x1.jpg,Michael Graydon and Nikole Herriott
1,54a408a019925f464b3733bc,Spinaci all'Ebraica,Spinach Jewish Style,2008-09-09T04:00:00.000Z,[{'name': 'Edda Servi Machlin'}],recipe,/recipes/food/views/spinach-jewish-style-350152,"{'id': '56746182accb4c9831e45e0a', 'filename':...","{'category': 'cuisine', 'name': 'Italian', 'ur...",3.22,"[3 pounds small-leaved bulk spinach, Salt, 1/2...",[Remove the stems and roots from the spinach. ...,5,80,1498547740,Italian,EP_12162015_placeholders_rustic.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St..."
2,54a408a26529d92b2c003631,"This majestic, moist, and richly spiced honey ...",New Year’s Honey Cake,2008-09-10T04:00:00.000Z,[{'name': 'Marcy Goldman'}],recipe,/recipes/food/views/majestic-and-moist-new-yea...,"{'id': '55e85ba4cf90d6663f728014', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",3.62,"[3 1/2 cups all-purpose flour, 1 tablespoon ba...",[I like this cake best baked in a 9-inch angel...,105,88,1498547738,Jewish,EP_09022015_honeycake-2.jpg,"Photo by Chelsea Kyle, Food Styling by Anna St..."
3,54a408a66529d92b2c003638,The idea for this sandwich came to me when my ...,The B.L.A.Bagel with Lox and Avocado,2008-09-08T04:00:00.000Z,[{'name': 'Faye Levy'}],recipe,/recipes/food/views/the-b-l-a-bagel-with-lox-a...,"{'id': '5674617e47d1a28026045e4f', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",4.0,"[1 small ripe avocado, preferably Hass (see No...","[A short time before serving, mash avocado and...",7,100,1498547740,Jewish,EP_12162015_placeholders_casual.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B..."
4,54a408a719925f464b3733cc,"In 1930, Simon Agranat, the chief justice of t...",Shakshuka a la Doktor Shakshuka,2008-09-09T04:00:00.000Z,[{'name': 'Joan Nathan'}],recipe,/recipes/food/views/shakshuka-a-la-doktor-shak...,"{'id': '56746183b47c050a284a4e15', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",2.71,"[2 pounds fresh tomatoes, unpeeled and cut in ...","[1. Place the tomatoes, garlic, salt, paprika,...",7,83,1498547740,Jewish,EP_12162015_placeholders_formal.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B..."


In [None]:
# This lambda pulls the first author's name out of the list and stores it in a new column
epic_dataframe["author_name"] = epic_dataframe["author"].apply(
    lambda x: x[0]["name"] if not pd.isna(x) else "Missing author name"
)

IndexError: list index out of range

This lambda did not work. What happened?

In [None]:
first_five_epic = epic_dataframe.head()

In [None]:
first_five_epic

Unnamed: 0,id,dek,hed,pubDate,author,type,url,photoData,tag,aggregateRating,ingredients,prepSteps,reviewsCount,willMakeAgainPct,dateCrawled,cuisine_name,photo_filename,photo_credit
0,54a2b6b019925f464b373351,How does fried chicken achieve No. 1 status? B...,Pickle-Brined Fried Chicken,2014-08-19T04:00:00.000Z,[],recipe,/recipes/food/views/pickle-brined-fried-chicke...,"{'id': '54a2b64a6529d92b2c003409', 'filename':...","{'category': 'ingredient', 'name': 'Chicken', ...",3.11,"[1 tablespoons yellow mustard seeds, 1 tablesp...",[Toast mustard and coriander seeds in a dry me...,7,100,1498547035,Cuisine Missing,51247610_fried-chicken_1x1.jpg,Michael Graydon and Nikole Herriott
1,54a408a019925f464b3733bc,Spinaci all'Ebraica,Spinach Jewish Style,2008-09-09T04:00:00.000Z,[{'name': 'Edda Servi Machlin'}],recipe,/recipes/food/views/spinach-jewish-style-350152,"{'id': '56746182accb4c9831e45e0a', 'filename':...","{'category': 'cuisine', 'name': 'Italian', 'ur...",3.22,"[3 pounds small-leaved bulk spinach, Salt, 1/2...",[Remove the stems and roots from the spinach. ...,5,80,1498547740,Italian,EP_12162015_placeholders_rustic.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St..."
2,54a408a26529d92b2c003631,"This majestic, moist, and richly spiced honey ...",New Year’s Honey Cake,2008-09-10T04:00:00.000Z,[{'name': 'Marcy Goldman'}],recipe,/recipes/food/views/majestic-and-moist-new-yea...,"{'id': '55e85ba4cf90d6663f728014', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",3.62,"[3 1/2 cups all-purpose flour, 1 tablespoon ba...",[I like this cake best baked in a 9-inch angel...,105,88,1498547738,Jewish,EP_09022015_honeycake-2.jpg,"Photo by Chelsea Kyle, Food Styling by Anna St..."
3,54a408a66529d92b2c003638,The idea for this sandwich came to me when my ...,The B.L.A.Bagel with Lox and Avocado,2008-09-08T04:00:00.000Z,[{'name': 'Faye Levy'}],recipe,/recipes/food/views/the-b-l-a-bagel-with-lox-a...,"{'id': '5674617e47d1a28026045e4f', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",4.0,"[1 small ripe avocado, preferably Hass (see No...","[A short time before serving, mash avocado and...",7,100,1498547740,Jewish,EP_12162015_placeholders_casual.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B..."
4,54a408a719925f464b3733cc,"In 1930, Simon Agranat, the chief justice of t...",Shakshuka a la Doktor Shakshuka,2008-09-09T04:00:00.000Z,[{'name': 'Joan Nathan'}],recipe,/recipes/food/views/shakshuka-a-la-doktor-shak...,"{'id': '56746183b47c050a284a4e15', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",2.71,"[2 pounds fresh tomatoes, unpeeled and cut in ...","[1. Place the tomatoes, garlic, salt, paprika,...",7,83,1498547740,Jewish,EP_12162015_placeholders_formal.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B..."


In [None]:
first_five_epic.iloc[0]["author"][0]

IndexError: list index out of range

In [None]:
first_five_epic.iloc[1]["author"][0]

In [None]:
first_five_epic["author"].apply(lambda x: x[0]["name"] if x else "Missing author name")

0    Missing author name
1     Edda Servi Machlin
2          Marcy Goldman
3              Faye Levy
4            Joan Nathan
Name: author, dtype: object

This lambda now works well enough! It goes into the author column and extracts the author's name as long as the record isn't an empty list. This can be refactored into a helper function, but first it needs to be applied to the whole dataset.
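
As a sketch of that refactor, the lambda could become a named helper. Note that `pd.isna` returns an element-wise array when handed a list rather than a single boolean, so it is not a reliable guard against an empty list; the version below checks the type and length explicitly. The helper name `extract_author_name` and the `MISSING_AUTHOR` constant are assumptions for illustration, not functions from `src.dataframe_preprocessor`.

```python
from typing import Any

MISSING_AUTHOR = "Missing author name"


def extract_author_name(authors: Any) -> str:
    """Return the first author's name, or a placeholder when none is available."""
    # Empty lists, None, and NaN (a float, not a list) all fall through to the placeholder
    if isinstance(authors, list) and authors:
        first = authors[0]
        if isinstance(first, dict) and first.get("name"):
            return first["name"]
    return MISSING_AUTHOR


print(extract_author_name([]))
print(extract_author_name([{"name": "Edda Servi Machlin"}]))
print(extract_author_name(float("nan")))
```

It would be applied the same way as the lambda, for example `epic_dataframe["author"].apply(extract_author_name)`.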

In [None]:
epic_dataframe["author_name"] = epic_dataframe["author"].apply(
    lambda x: x[0]["name"] if x else "Missing author name"
)

In [None]:
epic_dataframe

Unnamed: 0,id,dek,hed,pubDate,author,type,url,photoData,tag,aggregateRating,ingredients,prepSteps,reviewsCount,willMakeAgainPct,dateCrawled,cuisine_name,photo_filename,photo_credit,author_name
0,54a2b6b019925f464b373351,How does fried chicken achieve No. 1 status? B...,Pickle-Brined Fried Chicken,2014-08-19T04:00:00.000Z,[],recipe,/recipes/food/views/pickle-brined-fried-chicke...,"{'id': '54a2b64a6529d92b2c003409', 'filename':...","{'category': 'ingredient', 'name': 'Chicken', ...",3.11,"[1 tablespoons yellow mustard seeds, 1 tablesp...",[Toast mustard and coriander seeds in a dry me...,7,100,1498547035,Cuisine Missing,51247610_fried-chicken_1x1.jpg,Michael Graydon and Nikole Herriott,Missing author name
1,54a408a019925f464b3733bc,Spinaci all'Ebraica,Spinach Jewish Style,2008-09-09T04:00:00.000Z,[{'name': 'Edda Servi Machlin'}],recipe,/recipes/food/views/spinach-jewish-style-350152,"{'id': '56746182accb4c9831e45e0a', 'filename':...","{'category': 'cuisine', 'name': 'Italian', 'ur...",3.22,"[3 pounds small-leaved bulk spinach, Salt, 1/2...",[Remove the stems and roots from the spinach. ...,5,80,1498547740,Italian,EP_12162015_placeholders_rustic.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Edda Servi Machlin
2,54a408a26529d92b2c003631,"This majestic, moist, and richly spiced honey ...",New Year’s Honey Cake,2008-09-10T04:00:00.000Z,[{'name': 'Marcy Goldman'}],recipe,/recipes/food/views/majestic-and-moist-new-yea...,"{'id': '55e85ba4cf90d6663f728014', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",3.62,"[3 1/2 cups all-purpose flour, 1 tablespoon ba...",[I like this cake best baked in a 9-inch angel...,105,88,1498547738,Jewish,EP_09022015_honeycake-2.jpg,"Photo by Chelsea Kyle, Food Styling by Anna St...",Marcy Goldman
3,54a408a66529d92b2c003638,The idea for this sandwich came to me when my ...,The B.L.A.Bagel with Lox and Avocado,2008-09-08T04:00:00.000Z,[{'name': 'Faye Levy'}],recipe,/recipes/food/views/the-b-l-a-bagel-with-lox-a...,"{'id': '5674617e47d1a28026045e4f', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",4.00,"[1 small ripe avocado, preferably Hass (see No...","[A short time before serving, mash avocado and...",7,100,1498547740,Jewish,EP_12162015_placeholders_casual.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Faye Levy
4,54a408a719925f464b3733cc,"In 1930, Simon Agranat, the chief justice of t...",Shakshuka a la Doktor Shakshuka,2008-09-09T04:00:00.000Z,[{'name': 'Joan Nathan'}],recipe,/recipes/food/views/shakshuka-a-la-doktor-shak...,"{'id': '56746183b47c050a284a4e15', 'filename':...","{'category': 'cuisine', 'name': 'Jewish', 'url...",2.71,"[2 pounds fresh tomatoes, unpeeled and cut in ...","[1. Place the tomatoes, garlic, salt, paprika,...",7,83,1498547740,Jewish,EP_12162015_placeholders_formal.jpg,"Photo by Chelsea Kyle, Prop Styling by Rhoda B...",Joan Nathan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34751,59541a31bff3052847ae2107,Buttering the bread before you waffle it ensur...,Waffled Ham and Cheese Melt with Maple Butter,2017-06-29T14:59:01.368Z,[{'name': 'Daniel Shumski'}],recipe,/recipes/food/views/waffled-ham-and-cheese-mel...,"{'id': '595420c2d52ca90dc28200e6', 'filename':...","{'category': 'tag', 'name': 'Small Plates', 'u...",0.00,"[1 tablespoon unsalted butter, at room tempera...","[Preheat the waffle iron on low., Spread a thi...",0,0,1498857706,Cuisine Missing,waffle-ham-and-cheese-melt-062817.jpg,"Photo by Maes Studio, Inc.",Daniel Shumski
34752,5954233ad52ca90dc28200e7,"Spread this easy compound butter on waffles, p...",Maple Butter,2017-06-01T14:57:00.000Z,[{'name': 'Daniel Shumski'}],recipe,/recipes/food/views/maple-butter,"{'id': '5674617eb47c050a284a4e11', 'filename':...","{'category': 'meal', 'name': 'Breakfast', 'url...",0.00,"[8 tablespoons (1 stick) salted butter, at roo...",[Combine the ingredients in a medium-size bowl...,0,0,1498857726,Cuisine Missing,EP_12162015_placeholders_bright.jpg,"Photo by Chelsea Kyle, Prop Styling by Anna St...",Daniel Shumski
34753,595424c2109c972493636f83,Leftover mac and cheese is not exactly one of ...,Waffled Macaroni and Cheese,2017-06-29T14:54:24.234Z,[{'name': 'Daniel Shumski'}],recipe,/recipes/food/views/waffled-macaroni-and-cheese,"{'id': '5954259cd83a053f4d33bc74', 'filename':...","{'category': 'tag', 'name': 'Small Plates', 'u...",0.00,"[3 tablespoons unsalted butter, plus more for ...",[Preheat the oven to 375°F. Butter a 9x5-inch ...,0,0,1498857706,Cuisine Missing,waffle-mac-n-cheese-062816.jpg,"Photo by Maes Studio, Inc.",Daniel Shumski
34754,5956638625dc3d1d829b7166,A classic Mexican beer cocktail you can sip al...,Classic Michelada,2017-06-15T16:41:00.000Z,[{'name': 'Kat Odell'}],recipe,/recipes/food/views/classic-michelada,"{'id': '595553ce0eda38330f0a0a63', 'filename':...","{'category': 'ingredient', 'name': 'Beer', 'ur...",0.00,"[Coarse salt, 2 lime wedges, 2 ounces tomato j...",[Place about 1/4 cup salt on a small plate. Ru...,0,0,1498857714,Cuisine Missing,Classic Michelada 07292017.jpg,,Kat Odell


## Let's add a feature to fix the datetimes

In [None]:
test_pubdate_array = epic_dataframe["pubDate"][0:5]
test_pubdate_array

0   2014-08-19 04:00:00+00:00
1   2008-09-09 04:00:00+00:00
2   2008-09-10 04:00:00+00:00
3   2008-09-08 04:00:00+00:00
4   2008-09-09 04:00:00+00:00
Name: date_published, dtype: datetime64[ns, UTC]

In [None]:
print(type(test_pubdate_array[0][:10]))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


In [None]:
test_pubdate_array[0][:10]

Timestamp('2014-08-19 04:00:00+0000', tz='UTC')

In [None]:
epic_dataframe["publication_date"] = epic_dataframe["date_published"].apply(
    lambda x: datetime.strptime(x[:10], "%Y-%m-%d")
)

TypeError: 'Timestamp' object is not subscriptable

In [None]:
epic_dataframe["publication_date"]

In [None]:
epic_dataframe["publication_date_todt"] = pd.to_datetime(
    epic_dataframe["pubDate"], infer_datetime_format=True
)

In [None]:
epic_dataframe["publication_date_todt"]

The apply with a lambda function is no longer needed because `to_datetime` successfully resolved the odd string.
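
A minimal sketch of that conversion on two `pubDate` strings from the table above. The `utc=True` argument is an addition not used in the cells above, and `infer_datetime_format` is left out because newer pandas releases deprecate it.

```python
import pandas as pd

pub_dates = pd.Series(
    ["2014-08-19T04:00:00.000Z", "2008-09-09T04:00:00.000Z"], name="pubDate"
)

# utc=True makes the result timezone-aware regardless of how the offset is written
parsed = pd.to_datetime(pub_dates, utc=True)
print(parsed)
print(parsed.dt.date)  # plain calendar dates, if the time of day is not needed
```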

In [None]:
epic_dataframe["date_scraped"] = pd.to_datetime(
    epic_dataframe["dateCrawled"], infer_datetime_format=True
)
epic_dataframe["date_scraped"]

In [None]:
epic_dataframe["date_scraped"].describe()

In [None]:
epic_dataframe["tag"]

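As parsed here, the crawl timestamps all land just after the UNIX epoch, most likely because `to_datetime` reads bare integers as nanoseconds rather than seconds. The values don't make sense in that form and would not help the analysis, so it seems like we can drop the crawled/scraped column.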
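
For reference, `pd.to_datetime` accepts a `unit` argument for integer input. The following is a minimal sketch comparing the default with `unit="s"` on two `dateCrawled` values from the table above; treating them as UNIX seconds is an assumption based on the column notes earlier.

```python
import pandas as pd

crawled = pd.Series([1498547035, 1498547740], name="dateCrawled")

# Without a unit, integers are read as nanoseconds since 1970-01-01
as_nanoseconds = pd.to_datetime(crawled)
# With unit="s", the same integers are read as seconds since 1970-01-01
as_seconds = pd.to_datetime(crawled, unit="s")

print(as_nanoseconds.iloc[0])  # 1970-01-01 00:00:01.498547035
print(as_seconds.iloc[0])      # 2017-06-27 07:03:55
```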

### Next Steps
- Refactor datetime processing into functions
- Consider chaining those functions with `DataFrame.pipe()`
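
A rough sketch of what those next steps might look like. The function names, the output column names, and the choice of `unit="s"` for `dateCrawled` are all assumptions for illustration; each step copies its input so the original frame is left untouched.

```python
import pandas as pd


def add_publication_date(df: pd.DataFrame) -> pd.DataFrame:
    """Parse the ISO 8601 pubDate strings into a timezone-aware column."""
    out = df.copy()
    out["publication_date"] = pd.to_datetime(out["pubDate"], utc=True)
    return out


def add_date_scraped(df: pd.DataFrame) -> pd.DataFrame:
    """Parse dateCrawled, assumed here to hold UNIX seconds."""
    out = df.copy()
    out["date_scraped"] = pd.to_datetime(out["dateCrawled"], unit="s", utc=True)
    return out


toy = pd.DataFrame(
    {"pubDate": ["2014-08-19T04:00:00.000Z"], "dateCrawled": [1498547035]}
)

# Each .pipe() call passes the frame returned by the previous step
processed = toy.pipe(add_publication_date).pipe(add_date_scraped)
print(processed.dtypes)
```

Keeping each transformation in its own function makes the steps easy to test individually and to move into a script later.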


### Checking how to get the correct path inside the project_path script

In [None]:
import os
import sys

module_path = os.path.abspath(os.path.join(os.pardir, os.pardir))
if module_path not in sys.path:
    sys.path.append(module_path)

In [None]:
sys.path

In [None]:
module_path

In [None]:
os.pardir
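
The cells above could be wrapped in a small function, which is roughly what a path script might contain. This is a sketch using `pathlib`, not the contents of the actual `project_path` module; `add_parent_to_sys_path` and its `levels_up` parameter are hypothetical names, and `resolve()` follows symlinks, which `os.path.abspath` does not.

```python
import sys
from pathlib import Path


def add_parent_to_sys_path(levels_up: int = 2) -> str:
    """Append the directory `levels_up` levels above the working directory to sys.path."""
    target = Path.cwd().resolve()
    for _ in range(levels_up):
        # .parent of the filesystem root is the root itself, so this never raises
        target = target.parent
    module_path = str(target)
    if module_path not in sys.path:
        sys.path.append(module_path)
    return module_path


module_path = add_parent_to_sys_path(levels_up=2)
print(module_path)
```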