# Mission: To Build the Most Complete Collection of Recipes

HealthyDonald's is a new fast-food chain making waves across Asia, Europe, and the US.

They aim to gain more insights into the unique food cultures around the globe. To achieve this, they set out to build the most complete collection of recipes from all over the world.

[They started with this dataset](https://github.com/fictivekin/openrecipes) — one of the largest recipe collections available on GitHub.

**Now, it's time to tidy up this somewhat chaotic dataset.**

This Jupyter notebook shows the exact steps I took to clean up the data. The code is as I first wrote it, with no polishing afterwards.

What could improve this cleanup even more? Perhaps using an ML model to discern, based on the descriptions, whether a dish is best suited for breakfast, dinner, or lunch.

*For now, enjoy the stream of consciousness...*

![Data cleaning](cleaning_messy_data.png)

In [1]:
import numpy as np
import pandas as pd

In [2]:
# This is the only edit I made

import warnings
warnings.filterwarnings("ignore")

In [5]:
with open(r'your_path') as f:
    data = (line.strip() for line in f)
    data_json = "[{0}]".format(','.join(data))
recipes = pd.read_json(data_json)
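
Since the file holds one JSON object per line, pandas can probably read it directly. A minimal sketch, assuming the file is newline-delimited JSON; the in-memory `sample` buffer and the name `sample_recipes` are made up here so the example runs without the real file:

```python
import io

import pandas as pd

# Hypothetical two-line stand-in for the real file (one JSON object per line).
sample = io.StringIO(
    '{"name": "Drop Biscuits and Sausage Gravy", "source": "thepioneerwoman"}\n'
    '{"name": "Hot Roast Beef Sandwiches", "source": "thepioneerwoman"}\n'
)

# lines=True tells read_json to parse each line as a separate record.
sample_recipes = pd.read_json(sample, lines=True)
print(sample_recipes.shape)
```

With the real file, the path would go where `sample` is, which would replace the manual string joining.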

## Trying to understand what I'm dealing with

In [6]:
recipes.head(3)

Unnamed: 0,_id,name,ingredients,url,image,ts,cookTime,source,recipeYield,datePublished,prepTime,description,totalTime,creator,recipeCategory,dateModified,recipeInstructions
0,{'$oid': '5160756b96cc62079cc2db15'},Drop Biscuits and Sausage Gravy,Biscuits\n3 cups All-purpose Flour\n2 Tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276011104},PT30M,thepioneerwoman,12.0,2013-03-11,PT10M,"Late Saturday afternoon, after Marlboro Man ha...",,,,,
1,{'$oid': '5160756d96cc62079cc2db16'},Hot Roast Beef Sandwiches,12 whole Dinner Rolls Or Small Sandwich Buns (...,http://thepioneerwoman.com/cooking/2013/03/hot...,http://static.thepioneerwoman.com/cooking/file...,{'$date': 1365276013902},PT20M,thepioneerwoman,12.0,2013-03-13,PT20M,"When I was growing up, I participated in my Ep...",,,,,
2,{'$oid': '5160756f96cc6207a37ff777'},Morrocan Carrot and Chickpea Salad,Dressing:\n1 tablespoon cumin seeds\n1/3 cup /...,http://www.101cookbooks.com/archives/moroccan-...,http://www.101cookbooks.com/mt-static/images/f...,{'$date': 1365276015332},,101cookbooks,,2013-01-07,PT15M,A beauty of a carrot salad - tricked out with ...,,,,,


In [7]:
recipes.name = recipes.name.str.lower()
recipes.ingredients = recipes.ingredients.str.lower()

In [8]:
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173278 entries, 0 to 173277
Data columns (total 17 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   _id                 173278 non-null  object
 1   name                173278 non-null  object
 2   ingredients         173278 non-null  object
 3   url                 173278 non-null  object
 4   image               158278 non-null  object
 5   ts                  173278 non-null  object
 6   cookTime            117936 non-null  object
 7   source              173278 non-null  object
 8   recipeYield         165628 non-null  object
 9   datePublished       78110 non-null   object
 10  prepTime            130186 non-null  object
 11  description         158068 non-null  object
 12  totalTime           1570 non-null    object
 13  creator             395 non-null     object
 14  recipeCategory      388 non-null     object
 15  dateModified        161 non-null     object
 16  re

## Clean the '_id' column

In [9]:
recipes['_id'] = recipes['_id'].astype(str)
recipes['_id'] = recipes['_id'].str[10:-2]
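
The fixed slice works because every stringified value has the same shape. A sketch of an alternative, assuming each `_id` value is still a dict with a `$oid` key (as the `head()` output suggests); `sample_ids` and `clean_ids` are made-up names:

```python
import pandas as pd

# Hypothetical stand-in for recipes['_id'] before any conversion.
sample_ids = pd.Series([
    {'$oid': '5160756b96cc62079cc2db15'},
    {'$oid': '5160756d96cc62079cc2db16'},
])

# Pull the value out of each dict instead of slicing its string form.
clean_ids = sample_ids.apply(lambda d: d['$oid'])
print(clean_ids.tolist())
```

This does not depend on the exact character positions in the string representation.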

In [11]:
recipes.drop(['totalTime', 'creator', 'recipeCategory',
              'dateModified', 'recipeInstructions'],
              inplace=True, axis=1)

In [12]:
recipes.isnull().sum()

_id                  0
name                 0
ingredients          0
url                  0
image            15000
ts                   0
cookTime         55342
source               0
recipeYield       7650
datePublished    95168
prepTime         43092
description      15210
dtype: int64

## Time stamp

In [13]:
recipes['ts'] = pd.to_datetime(recipes['ts'].apply(lambda x: x['$date']), unit='ms')

In [14]:
recipes['day'] = recipes['ts'].dt.day
recipes['month'] = recipes['ts'].dt.month
recipes['year'] = recipes['ts'].dt.year
recipes['time_stamp'] = pd.to_datetime(recipes[['year', 'month', 'day']])
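
Splitting the timestamp into day, month, and year and reassembling it truncates each value to midnight. A sketch of a shorter route using `Series.dt.normalize()`, which should give the same dates; `sample_ts` is a made-up stand-in with two values taken from the `head()` output:

```python
import pandas as pd

# Hypothetical stand-in for recipes['ts'] (epoch milliseconds inside a dict).
sample_ts = pd.Series([{'$date': 1365276011104}, {'$date': 1365276015332}])

parsed = pd.to_datetime(sample_ts.apply(lambda x: x['$date']), unit='ms')

# normalize() sets the time part to 00:00:00 and keeps the datetime dtype.
dates_only = parsed.dt.normalize()
print(dates_only.iloc[0])
```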

In [15]:
recipes.drop(['year', 'month', 'day', 'ts', 'datePublished'], inplace=True,
             axis=1)
recipes.head(2)

Unnamed: 0,_id,name,ingredients,url,image,cookTime,source,recipeYield,prepTime,description,time_stamp
0,5160756b96cc62079cc2db15,drop biscuits and sausage gravy,biscuits\n3 cups all-purpose flour\n2 tablespo...,http://thepioneerwoman.com/cooking/2013/03/dro...,http://static.thepioneerwoman.com/cooking/file...,PT30M,thepioneerwoman,12,PT10M,"Late Saturday afternoon, after Marlboro Man ha...",2013-04-06
1,5160756d96cc62079cc2db16,hot roast beef sandwiches,12 whole dinner rolls or small sandwich buns (...,http://thepioneerwoman.com/cooking/2013/03/hot...,http://static.thepioneerwoman.com/cooking/file...,PT20M,thepioneerwoman,12,PT20M,"When I was growing up, I participated in my Ep...",2013-04-06


In [16]:
recipes.drop('recipeYield', inplace=True, axis=1)

## Prep time

In [17]:
len_prep = recipes['prepTime'].str.len()

In [18]:
len_prep.max()

17.0

In [19]:
recipes.loc[recipes['prepTime'].str.len() == 17,
            :]

Unnamed: 0,_id,name,ingredients,url,image,cookTime,source,prepTime,description,time_stamp
2387,51607e5096cc6208e46ae3bb,filled meringue coffee cake,,http://delishhh.com/2011/03/27/filled-meringue...,http://farm6.static.flickr.com/5062/5564739251...,,delishhh,2 hour 30 minutes,It is that time of the month again; here is th...,2013-04-06


In [20]:
# I noticed that some outliers (very long prepTime values)
# use a free-text format such as '2 hour 30 minutes'

is_lower_prep = recipes['prepTime'].str.islower()

In [21]:
is_lower_prep.isnull().sum()

43092

In [22]:
is_lower_prep.fillna(False, inplace=True)

In [23]:
is_lower_prep.value_counts()

prepTime
False    173277
True          1
Name: count, dtype: int64

In [24]:
# since there's just one outlier I drop it

recipes.drop(index=2387, inplace=True)
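
Checking for lowercase text catches this one free-text value. A sketch of a more direct check, assuming valid entries look like `PT<hours>H<minutes>M` with either part optional; the `sample_prep` values and variable names are illustrative:

```python
import pandas as pd

# Hypothetical stand-in for recipes['prepTime'], including a missing value.
sample_prep = pd.Series(['PT10M', None, '2 hour 30 minutes', 'PT1H40M'])

# na=False makes missing values count as "no match" instead of NaN.
looks_iso = sample_prep.str.fullmatch(r'PT(\d+H)?(\d+M)?', na=False)

# Flag only values that are present but do not follow the expected pattern.
odd_format = sample_prep.notna() & ~looks_iso
print(sample_prep[odd_format])
```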

In [25]:
del is_lower_prep

In [26]:
# I notice that the most common format is PT13M

len_5 = (recipes['prepTime'].str.len() < 6) & (recipes['prepTime'].str.endswith('M'))
recipes_5 = pd.DataFrame(recipes.loc[len_5,'prepTime'])

In [27]:
recipes_5.info()

<class 'pandas.core.frame.DataFrame'>
Index: 116539 entries, 0 to 173277
Data columns (total 1 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   prepTime  116539 non-null  object
dtypes: object(1)
memory usage: 1.8+ MB


In [28]:
recipes_5.tail()

Unnamed: 0,prepTime
173270,PT30M
173271,PT5M
173272,PT40M
173273,PT10M
173277,PT10M


In [29]:
recipes_5['prep_t_min'] = recipes_5['prepTime'].str[2:-1]
recipes_5['prep_t_min'] = recipes_5['prep_t_min'].astype(int)

In [31]:
# I do the same for prep times given in whole hours,
# with no minutes part

len_5_h = (recipes['prepTime'].str.len() < 6) & (recipes['prepTime'].str.endswith('H'))
recipes_5_h = pd.DataFrame(recipes.loc[len_5_h,'prepTime'])

In [32]:
recipes_5_h.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5963 entries, 12 to 173274
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   prepTime  5963 non-null   object
dtypes: object(1)
memory usage: 93.2+ KB


In [33]:
recipes_5_h.tail()

Unnamed: 0,prepTime
173222,PT3H
173232,PT3H
173263,PT3H
173269,PT1H
173274,PT8H


In [34]:
recipes_5_h['prep_t_min'] = recipes_5_h['prepTime'].str[2:-1]
recipes_5_h['prep_t_min'] = recipes_5_h['prep_t_min'].astype(int) * 60

In [35]:
len_8 = (recipes['prepTime'].str.len() >= 6) & (recipes['prepTime'].str.endswith('M'))
recipes_8 = pd.DataFrame(recipes.loc[len_8,'prepTime'])

In [36]:
recipes_8.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5495 entries, 20 to 173268
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   prepTime  5495 non-null   object
dtypes: object(1)
memory usage: 85.9+ KB


In [37]:
recipes_8['prepTime'] = recipes_8['prepTime'].str[2:]

In [38]:
recipes_8.head(2)

Unnamed: 0,prepTime
20,1H40M
41,1500M


In [39]:
find_h = pd.DataFrame(recipes_8['prepTime'].str.find('H'))

In [40]:
find_h.value_counts()

prepTime
 1          5470
-1            19
 2             6
Name: count, dtype: int64

In [41]:
recipes_8 = pd.merge(recipes_8, find_h, left_index=True, right_index=True)

In [42]:
recipes_8.head(1)

Unnamed: 0,prepTime_x,prepTime_y
20,1H40M,1


In [43]:
recipes_8.rename(columns={'prepTime_x': 'prept', 'prepTime_y': 'h_ind'}, inplace=True)

In [44]:
no_h = recipes_8[recipes_8['h_ind'] == -1].copy()

In [45]:
recipes_8.drop(no_h.index, inplace=True)

In [46]:
recipes_8['h_ind'].value_counts()

h_ind
1    5470
2       6
Name: count, dtype: int64

In [47]:
no_h.head()

Unnamed: 0,prept,h_ind
41,1500M,-1
2141,150M,-1
2811,150M,-1
86719,180M,-1
86721,180M,-1


In [48]:
no_h['h_ind'].value_counts()

h_ind
-1    19
Name: count, dtype: int64

In [49]:
no_h['time_stamp'] = no_h['prept'].str[:-1]
no_h.loc[:, 'time_stamp'] = no_h['time_stamp'].astype(int)

In [50]:
recipes_8['time_stamp'] = recipes_8.apply(lambda r: r['prept'][:r['h_ind']], axis=1)
recipes_8['time_stamp'] = recipes_8['time_stamp'].astype(int).multiply(60)

In [51]:
recipes_8.head()

Unnamed: 0,prept,h_ind,time_stamp
20,1H40M,1,60
433,1H30M,1,60
528,1H30M,1,60
592,1H30M,1,60
608,1H30M,1,60


In [52]:
minutes_r8 = recipes_8.apply(lambda r: r['prept'][(r['h_ind']+1):-1], axis=1).astype(int)

recipes_8['time_stamp'] = recipes_8['time_stamp'].add(minutes_r8)

In [53]:
recipes_8.head()

Unnamed: 0,prept,h_ind,time_stamp
20,1H40M,1,100
433,1H30M,1,90
528,1H30M,1,90
592,1H30M,1,90
608,1H30M,1,90


In [54]:
recipes_5.head(2)

Unnamed: 0,prepTime,prep_t_min
0,PT10M,10
1,PT20M,20


In [55]:
recipes_5_h.head(2)

Unnamed: 0,prepTime,prep_t_min
12,PT2H,120
75,PT3H,180


In [56]:
recipes_5['indx'] = recipes_5.index
recipes_5_h['indx'] = recipes_5_h.index

In [57]:
# Now I merge the prep times (prep_t_min) from the different
# dataframes I created

time_prep = pd.merge(recipes_5, recipes_5_h,
                     how='outer')
time_prep.tail()

Unnamed: 0,prepTime,prep_t_min,indx
122497,PT3H,180,173222
122498,PT3H,180,173232
122499,PT3H,180,173263
122500,PT1H,60,173269
122501,PT8H,480,173274


In [58]:
time_prep.isnull().sum()

prepTime      0
prep_t_min    0
indx          0
dtype: int64

In [59]:
time_prep.shape[0] == recipes_5.shape[0] + recipes_5_h.shape[0]

True

In [60]:
recipes_8.rename(columns={'time_stamp':'prep_t_min'}, inplace=True)
recipes_8.head()

Unnamed: 0,prept,h_ind,prep_t_min
20,1H40M,1,100
433,1H30M,1,90
528,1H30M,1,90
592,1H30M,1,90
608,1H30M,1,90


In [61]:
no_h.head(1)

Unnamed: 0,prept,h_ind,time_stamp
41,1500M,-1,1500


In [62]:
no_h.rename(columns={'time_stamp':'prep_t_min'}, inplace=True)
no_h.head(1)

Unnamed: 0,prept,h_ind,prep_t_min
41,1500M,-1,1500


In [63]:
recipes_8['indx'] = recipes_8.index
no_h['indx'] = no_h.index

In [64]:
time_prep2 = pd.merge(no_h, recipes_8, how='outer')
time_prep2.drop(['h_ind','prept'], axis=1, inplace=True)
time_prep2.shape[0] == no_h.shape[0] + recipes_8.shape[0]

True

In [65]:
time_prep2.head(1)

Unnamed: 0,prep_t_min,indx
0,1500,41


In [66]:
time_prep_final = pd.merge(time_prep, time_prep2, on='indx', how='outer')
time_prep_final.shape[0] == time_prep.shape[0] + time_prep2.shape[0]

True

In [67]:
time_prep_final.tail(3)

Unnamed: 0,prepTime,prep_t_min_x,indx,prep_t_min_y
127994,,,173088,140
127995,,,173170,130
127996,,,173268,75


In [68]:
time_prep_final['prep_t_min_x'].fillna(0, inplace=True)
time_prep_final['prep_t_min_y'].fillna(0, inplace=True)
time_prep_final['prep_t_min'] = time_prep_final['prep_t_min_x'] + time_prep_final['prep_t_min_y']
time_prep_final.drop(['prepTime', 'prep_t_min_x', 'prep_t_min_y'], axis=1, inplace=True)

In [69]:
# now I merge everything back to the original dataframe
recipes2 = pd.merge(recipes, time_prep_final,
                    left_on=recipes.index, right_on='indx',
                    how='outer')

In [70]:
recipes2.loc[2768:2770, :]

Unnamed: 0,_id,name,ingredients,url,image,cookTime,source,prepTime,description,time_stamp,indx,prep_t_min
2768,5160804496cc6208dde659ff,breakfast tacos with kale-cilantro chimichurri...,"sauce:\n1 cup kale, packed\n½ cup cilantro, lo...",http://naturallyella.com/2012/04/02/breakfast-...,http://cdn.naturallyella.com/files/2012/04/IMG...,PT30M,naturallyella,PT15M,A great way to bring tacos in to breakfast.,2013-04-06,2769,15.0
2769,5160804596cc6208c17937ed,pasta with purple sprouting broccoli,1kg/2¼lb purple sprouting broccoli\n1 medium s...,http://www.bbc.co.uk/food/recipes/pastawithpur...,http://ichef.bbci.co.uk/food/ic/food_16x9_448/...,PT30M,bbcfood,PT30M,Purple sprouting broccoli can be a cheap and c...,2013-04-06,2770,30.0
2770,5160804896cc6208c17937ee,yorkshire curd tart,250g/9oz plain flour\n100g/3½oz caster sugar\n...,http://www.bbc.co.uk/food/recipes/yorkshire_cu...,,PT1H,bbcfood,PT1H,"An oldie, but a goodie. Lemon and nutmeg reall...",2013-04-06,2771,60.0
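
In the output above, the row labels are one less than `indx`, apparently because row 2387 was dropped earlier and the merge returned a fresh index. A sketch of a way to attach the minutes while keeping the original labels, assuming the partial results are Series indexed by those labels; all names and values here are made up:

```python
import pandas as pd

# Hypothetical frame whose label 2 was dropped, like row 2387 in the notebook.
frame = pd.DataFrame({'name': ['a', 'b', 'c', 'd']}, index=[0, 1, 3, 4])

# Partial results, each indexed by the original row labels.
minutes_part = pd.Series([10, 20], index=[0, 3])
hours_part = pd.Series([120], index=[4])

# Column assignment aligns on the index, so unmatched rows become NaN.
frame['prep_t_min'] = pd.concat([minutes_part, hours_part])
print(frame)
```

Because nothing is merged, the frame keeps its labels and no helper `indx` column is needed.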


In [71]:
recipes2.drop(['prepTime', 'indx'], axis=1, inplace=True)
recipes2['prep_t_min'] = recipes2['prep_t_min'].replace(0, np.nan)

In [72]:
del recipes_5
del recipes_5_h
del recipes_8
del no_h
del time_prep_final
del time_prep
del time_prep2
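
The prep-time steps above split the data by format and stitch the pieces back together. A sketch of a single-pass alternative using a regular expression, assuming every usable value has the form `PT<hours>H<minutes>M` with either part optional; `iso_duration_to_minutes` is a hypothetical helper, not something the notebook defines:

```python
import pandas as pd


def iso_duration_to_minutes(durations: pd.Series) -> pd.Series:
    """Convert strings like 'PT1H40M' to minutes; anything else becomes NaN."""
    parts = durations.str.extract(
        r'^PT(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?$'
    ).astype(float)
    total = parts['hours'].fillna(0) * 60 + parts['minutes'].fillna(0)
    # Keep a result only where at least one of the two parts was found.
    return total.where(parts.notna().any(axis=1))


sample = pd.Series(['PT10M', 'PT2H', 'PT1H40M', 'PT1500M',
                    None, 'PT', '2 hour 30 minutes'])
prep_minutes = iso_duration_to_minutes(sample)
print(prep_minutes)
```

The same helper could be applied to `cookTime`, since it follows the same format.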

## Cook time

In [73]:
len_ct = recipes2['cookTime'].str.len()

In [74]:
len_ct.max()

9.0

In [75]:
recipes2.loc[recipes2['cookTime'].str.len() == 9, 'cookTime']

2214      PT216H15M
5545      PT120H30M
5729      PT216H30M
166153    PT144H20M
Name: cookTime, dtype: object

In [76]:
is_low = recipes2['cookTime'].str.islower()
is_low.fillna(False, inplace=True)
recipes2.loc[is_low, 'cookTime']

Series([], Name: cookTime, dtype: object)

In [77]:
pt_in = recipes2['cookTime'].str.startswith('PT')
pt_in.fillna(False, inplace=True)
cook_time = pd.DataFrame(recipes2.loc[pt_in, 'cookTime'])
cook_time.tail(2)

Unnamed: 0,cookTime
173273,PT5M
173276,PT25M


In [78]:
cook_time['cookTime'] = cook_time['cookTime'].str.replace('PT', '')
cook_time.tail(2)

Unnamed: 0,cookTime
173273,5M
173276,25M


In [79]:
cook_time['where_m'] = cook_time['cookTime'].str.find('M')
cook_time['where_m'].replace(-1, np.nan, inplace=True)
cook_time['where_h'] = cook_time['cookTime'].str.find('H')
cook_time['where_h'].replace(-1, np.nan, inplace=True)
cook_time.tail(2)

Unnamed: 0,cookTime,where_m,where_h
173273,5M,1.0,
173276,25M,2.0,


In [80]:
only_h = cook_time[cook_time['where_m'].isnull() & cook_time['where_h'].notnull()]

In [81]:
only_h.tail(2)

Unnamed: 0,cookTime,where_m,where_h
173252,4H,,1.0
173259,4H,,1.0


In [82]:
only_h = only_h.copy()
only_h.loc[:, 'cook_t_min'] = only_h.loc[:, 'cookTime'].str[:-1].astype(int) * 60
only_h.head(2)

Unnamed: 0,cookTime,where_m,where_h,cook_t_min
10,3H,,1.0,180
42,1H,,1.0,60


In [83]:
only_h.drop(['cookTime', 'where_m', 'where_h'], axis=1, inplace=True)

In [84]:
only_h.tail(2)

Unnamed: 0,cook_t_min
173252,240
173259,240


In [85]:
only_m = cook_time[cook_time['where_m'].notnull() & cook_time['where_h'].isnull()]
only_m = only_m.copy()
only_m.loc[:, 'cook_t_min'] = only_m.loc[:, 'cookTime'].str[:-1].astype(int)
only_m.drop(['cookTime', 'where_m', 'where_h'], axis=1, inplace=True)
only_m.tail(2)

Unnamed: 0,cook_t_min
173273,5
173276,25


In [86]:
m_and_h = cook_time[cook_time['where_m'].notnull() & cook_time['where_h'].notnull()]
m_and_h = m_and_h.copy()
m_and_h.tail(2)

Unnamed: 0,cookTime,where_m,where_h
173135,2H30M,4.0,1.0
173200,1H10M,4.0,1.0


In [87]:
proof1 = m_and_h['where_m'] > m_and_h['where_h']
proof1.value_counts()

True    6569
Name: count, dtype: int64

In [88]:
m_and_h['where_h'] = m_and_h['where_h'].astype(int)
m_and_h['where_m'] = m_and_h['where_m'].astype(int)
m_and_h['cook_t_min'] = m_and_h.apply(lambda r: r['cookTime'][:r['where_h']], axis=1)
m_and_h['cook_t_min'] = m_and_h['cook_t_min'].astype(int) * 60
m_and_h.tail(2)

Unnamed: 0,cookTime,where_m,where_h,cook_t_min
173135,2H30M,4,1,120
173200,1H10M,4,1,60


In [89]:
m_and_h['cook_t_min2'] = m_and_h.apply(lambda r: r['cookTime'][(r['where_h']+1):-1], axis=1)
m_and_h['cook_t_min2'] = m_and_h['cook_t_min2'].astype(int)

In [90]:
m_and_h['cook_t_min'] = m_and_h['cook_t_min'] + m_and_h['cook_t_min2']
m_and_h.drop(['cookTime', 'where_m', 'where_h', 'cook_t_min2'],
             axis=1, inplace=True)
m_and_h.tail(4)

Unnamed: 0,cook_t_min
173065,75
173095,125
173135,150
173200,70


In [91]:
only_h['ind'] = only_h.index
only_m['ind'] = only_m.index

In [92]:
cook_time_final = pd.merge(only_h, only_m, on='ind', how='outer')
only_h.shape[0] + only_m.shape[0] == cook_time_final.shape[0]

True

In [93]:
cook_time_final.tail(2)

Unnamed: 0,cook_t_min_x,ind,cook_t_min_y
100640,,173273,5.0
100641,,173276,25.0


In [94]:
cook_time_final['cook_t_min_x'].fillna(0, inplace=True)
cook_time_final['cook_t_min_y'].fillna(0, inplace=True)
cook_time_final['cook_t_min'] = cook_time_final['cook_t_min_x'] + cook_time_final['cook_t_min_y']
cook_time_final.drop(['cook_t_min_x', 'cook_t_min_y'], axis=1, inplace=True)
cook_time_final.tail(2)

Unnamed: 0,ind,cook_t_min
100640,173273,5.0
100641,173276,25.0


In [95]:
m_and_h['ind'] = m_and_h.index
cook_time_final2 = pd.merge(cook_time_final, m_and_h, on='ind', how='outer')
m_and_h.shape[0] + cook_time_final.shape[0] == cook_time_final2.shape[0]

True

In [96]:
cook_time_final2.tail(3)

Unnamed: 0,ind,cook_t_min_x,cook_t_min_y
107208,173095,,125.0
107209,173135,,150.0
107210,173200,,70.0


In [97]:
cook_time_final2['cook_t_min_x'].fillna(0, inplace=True)
cook_time_final2['cook_t_min_y'].fillna(0, inplace=True)
cook_time_final2['cook_t_min'] = cook_time_final2['cook_t_min_y'] + cook_time_final2['cook_t_min_x']
cook_time_final2.drop(['cook_t_min_x', 'cook_t_min_y'], axis=1, inplace=True)
cook_time_final2.tail(3)

Unnamed: 0,ind,cook_t_min
107208,173095,125.0
107209,173135,150.0
107210,173200,70.0


In [98]:
recipes2['ind'] = recipes2.index

In [101]:
# Merge everything back

recipes3 = pd.merge(recipes2, cook_time_final2,
                    on='ind', how='outer')

In [102]:
recipes3.shape[0] == recipes2.shape[0]

True

In [103]:
recipes3.tail(2)

Unnamed: 0,_id,name,ingredients,url,image,cookTime,source,description,time_stamp,prep_t_min,ind,cook_t_min
173275,551f29b696cc62227991d465,the ultimate queso bean dip,two 16 ounce cans old el paso refried beans\n4...,http://picky-palate.com/2015/04/03/the-ultimat...,http://picky-palate.com/wp-content/uploads/201...,,pickypalate,,2015-04-04,,173275,
173276,551f29c696cc6222a4e0c0e4,maple-sweetened banana muffins,⅓ cup melted coconut oil or extra-virgin olive...,http://cookieandkate.com/2015/healthy-banana-m...,http://cookieandkate.com/images/2015/04/mashed...,PT25M,cookieandkate,"These whole wheat, maple-sweetened banana muff...",2015-04-04,10.0,173276,25.0


In [104]:
recipes3.drop(['cookTime', 'ind'], axis=1, inplace=True)
del only_h
del only_m
del m_and_h
del recipes2

In [105]:
recipes3['prep_t_min'].fillna(0, inplace=True)
recipes3['cook_t_min'].fillna(0, inplace=True)
recipes3['time_tot_min'] = recipes3['cook_t_min'] + recipes3['prep_t_min']
recipes3['prep_t_min'].replace(0, np.nan, inplace=True)
recipes3['cook_t_min'].replace(0, np.nan, inplace=True)
recipes3['time_tot_min'].replace(0, np.nan, inplace=True)
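
The fill-with-zero, add, and restore-NaN sequence treats a missing time as zero when summing. A sketch of one way to get a similar total in a single step with `min_count`; the `times` frame is made-up sample data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one complete row, one partial row, one empty row.
times = pd.DataFrame({
    'prep_t_min': [10.0, np.nan, np.nan],
    'cook_t_min': [25.0, 60.0, np.nan],
})

# min_count=1 returns NaN only when every value in the row is missing.
times['time_tot_min'] = times[['prep_t_min', 'cook_t_min']].sum(
    axis=1, min_count=1
)
print(times)
```

One difference from the cell above: a genuine total of zero stays zero here instead of being turned into NaN.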

In [106]:
mmax = recipes3['time_tot_min'].max()
recipes3.loc[recipes3['time_tot_min']==mmax, :]

Unnamed: 0,_id,name,ingredients,url,image,source,description,time_stamp,prep_t_min,cook_t_min,time_tot_min
60897,5162421096cc620d2615ea3f,homemade vanilla extract,1-½ cup 1-½ cup\n3 whole 3 whole\n1 whole 1 whole,http://tastykitchen.com/recipes/homemade-ingre...,http://static.tastykitchen.com/recipes/files/2...,tastykitchen,Make your own delicious vanilla extract!,2013-04-08,5.0,60480.0,60485.0


In [107]:
recipes.loc[recipes['name']=='homemade vanilla extract', :]

Unnamed: 0,_id,name,ingredients,url,image,cookTime,source,prepTime,description,time_stamp
22212,51611ddc96cc620d26155321,homemade vanilla extract,6 whole 6 whole\n2 cups 2 cups,http://tastykitchen.com/recipes/desserts/homem...,http://static.tastykitchen.com/recipes/files/2...,PT,tastykitchen,PT5M,Very easy to do!,2013-04-07
26272,51613d1596cc620d261562fd,homemade vanilla extract,"5 whole 5 whole\n8 ounces, fluid 8 ounces, fluid",http://tastykitchen.com/recipes/homemade-ingre...,http://static.tastykitchen.com/recipes/files/2...,PT,tastykitchen,PT5M,Avoid any preservatives and additives in this ...,2013-04-07
32354,51616a8596cc620d26157abf,homemade vanilla extract,1 bottle 1 bottle\n3 whole 3 whole\n10 jars 10...,http://tastykitchen.com/recipes/canning/homema...,http://static.tastykitchen.com/recipes/files/2...,PT5M,tastykitchen,PT5M,This delightful homemade vanilla extract is ea...,2013-04-07
52546,5161f91096cc620d2615c99f,homemade vanilla extract,½ pints ½ pints\n4 whole 4 whole\n½ pints ½ pints,http://tastykitchen.com/recipes/homemade-ingre...,http://static.tastykitchen.com/recipes/files/2...,PT,tastykitchen,PT10M,Easy to make homemade vanilla extract. I’ve br...,2013-04-07
60898,5162421096cc620d2615ea3f,homemade vanilla extract,1-½ cup 1-½ cup\n3 whole 3 whole\n1 whole 1 whole,http://tastykitchen.com/recipes/homemade-ingre...,http://static.tastykitchen.com/recipes/files/2...,PT1008H,tastykitchen,PT5M,Make your own delicious vanilla extract!,2013-04-08
61377,516247a596cc620d2615ec1e,homemade vanilla extract,12 whole 12 whole\n1 bottle 1 bottle,http://tastykitchen.com/recipes/homemade-ingre...,http://static.tastykitchen.com/recipes/files/2...,PT,tastykitchen,PT5M,Nothing tastes better and saves money like mak...,2013-04-08
108785,516c47e196cc62548fd2be0c,homemade vanilla extract,"10 vanilla beans, split lengthwise\n1 liter vodka",http://allrecipes.com/Recipe/Homemade-Vanilla-...,http://images.media-allrecipes.com/userphotos/...,,allrecipes,PT5M,"""Homemade vanilla extract! What could be bette...",2013-04-15


## Ingredients

In [108]:
recipes3['ingredients'].replace(['\n', '\r', '\r\n'], ' ', regex=True, inplace=True)
recipes3['ingredients'].replace('  ', ' ', regex=True, inplace=True)
recipes3.tail(2)

Unnamed: 0,_id,name,ingredients,url,image,source,description,time_stamp,prep_t_min,cook_t_min,time_tot_min
173275,551f29b696cc62227991d465,the ultimate queso bean dip,two 16 ounce cans old el paso refried beans 4 ...,http://picky-palate.com/2015/04/03/the-ultimat...,http://picky-palate.com/wp-content/uploads/201...,pickypalate,,2015-04-04,,,
173276,551f29c696cc6222a4e0c0e4,maple-sweetened banana muffins,⅓ cup melted coconut oil or extra-virgin olive...,http://cookieandkate.com/2015/healthy-banana-m...,http://cookieandkate.com/images/2015/04/mashed...,cookieandkate,"These whole wheat, maple-sweetened banana muff...",2015-04-04,10.0,25.0,35.0
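
Replacing two spaces with one runs a single pass, so longer runs of spaces may survive. A sketch that collapses any run of whitespace, line breaks included, in one step; the sample string is illustrative:

```python
import pandas as pd

# Hypothetical ingredient text with mixed line breaks and repeated spaces.
sample_ingredients = pd.Series(
    ['biscuits\n3 cups all-purpose flour\r\n\r\n2 tablespoons   baking powder ']
)

# \s+ matches one or more whitespace characters of any kind.
tidy = sample_ingredients.str.replace(r'\s+', ' ', regex=True).str.strip()
print(tidy.iloc[0])
```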


## Description

In [110]:
# back up everything I've done so far

recipes3.to_csv(r'your_path')

In [112]:
recipes4 = pd.read_csv(r'your_path', index_col=0)

In [113]:
recipes4.tail(3)

Unnamed: 0,_id,name,ingredients,url,image,source,description,time_stamp,prep_t_min,cook_t_min,time_tot_min
173274,551c86b796cc626b1ab4d901,the best homemade taco seasoning,1/4 cup ground cumin 1/4 cup kosher salt 2 tab...,http://picky-palate.com/2015/04/01/the-best-ho...,http://picky-palate.com/wp-content/uploads/201...,pickypalate,,2015-04-02,,,
173275,551f29b696cc62227991d465,the ultimate queso bean dip,two 16 ounce cans old el paso refried beans 4 ...,http://picky-palate.com/2015/04/03/the-ultimat...,http://picky-palate.com/wp-content/uploads/201...,pickypalate,,2015-04-04,,,
173276,551f29c696cc6222a4e0c0e4,maple-sweetened banana muffins,⅓ cup melted coconut oil or extra-virgin olive...,http://cookieandkate.com/2015/healthy-banana-m...,http://cookieandkate.com/images/2015/04/mashed...,cookieandkate,"These whole wheat, maple-sweetened banana muff...",2015-04-04,10.0,25.0,35.0


In [114]:
breakfast_in = recipes4['description'].str.contains('breakfast', case=False)
breakfast_in.value_counts()

description
False    154456
True       3524
Name: count, dtype: int64

In [115]:
dinner_in = recipes4['description'].str.contains('dinner', case=False)
dinner_in.value_counts()

description
False    152769
True       5211
Name: count, dtype: int64

In [116]:
lunch_in = recipes4['description'].str.contains('lunch', case=False)
lunch_in.value_counts()

description
False    155629
True       2351
Name: count, dtype: int64

In [117]:
print(breakfast_in.isnull().sum())
print(dinner_in.isnull().sum())
print(lunch_in.isnull().sum())

15297
15297
15297


In [118]:
breakfast_in.fillna(False, inplace=True)
dinner_in.fillna(False, inplace=True)
lunch_in.fillna(False, inplace=True)

In [119]:
breakfast_in = breakfast_in.astype(int)
dinner_in = dinner_in.astype(int)
lunch_in = lunch_in.astype(int)

In [120]:
recipes4['breakfast'] = breakfast_in
recipes4['dinner'] = dinner_in
recipes4['lunch'] = lunch_in
recipes4.drop('description', axis=1, inplace=True)
recipes4.tail(3)

Unnamed: 0,_id,name,ingredients,url,image,source,time_stamp,prep_t_min,cook_t_min,time_tot_min,breakfast,dinner,lunch
173274,551c86b796cc626b1ab4d901,the best homemade taco seasoning,1/4 cup ground cumin 1/4 cup kosher salt 2 tab...,http://picky-palate.com/2015/04/01/the-best-ho...,http://picky-palate.com/wp-content/uploads/201...,pickypalate,2015-04-02,,,,0,0,0
173275,551f29b696cc62227991d465,the ultimate queso bean dip,two 16 ounce cans old el paso refried beans 4 ...,http://picky-palate.com/2015/04/03/the-ultimat...,http://picky-palate.com/wp-content/uploads/201...,pickypalate,2015-04-04,,,,0,0,0
173276,551f29c696cc6222a4e0c0e4,maple-sweetened banana muffins,⅓ cup melted coconut oil or extra-virgin olive...,http://cookieandkate.com/2015/healthy-banana-m...,http://cookieandkate.com/images/2015/04/mashed...,cookieandkate,2015-04-04,10.0,25.0,35.0,0,0,0
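
The three keyword flags follow the same pattern, so they can be built in a loop. A minimal sketch, assuming the same keywords; passing `na=False` to `str.contains` handles missing descriptions without a separate `fillna`. The `meals` frame is made-up sample data:

```python
import pandas as pd

# Hypothetical descriptions, including a missing one.
meals = pd.DataFrame({'description': [
    'A great way to bring tacos in to breakfast.',
    None,
    'Perfect for Sunday DINNER or a packed lunch.',
]})

for keyword in ['breakfast', 'dinner', 'lunch']:
    # Case-insensitive match; missing descriptions count as 0.
    meals[keyword] = (
        meals['description']
        .str.contains(keyword, case=False, na=False)
        .astype(int)
    )
print(meals[['breakfast', 'dinner', 'lunch']])
```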


In [121]:
recipes4.info()

<class 'pandas.core.frame.DataFrame'>
Index: 173277 entries, 0 to 173276
Data columns (total 13 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   _id           173277 non-null  object 
 1   name          173277 non-null  object 
 2   ingredients   173276 non-null  object 
 3   url           173276 non-null  object 
 4   image         158277 non-null  object 
 5   source        173277 non-null  object 
 6   time_stamp    173277 non-null  object 
 7   prep_t_min    123736 non-null  float64
 8   cook_t_min    106918 non-null  float64
 9   time_tot_min  127413 non-null  float64
 10  breakfast     173277 non-null  int32  
 11  dinner        173277 non-null  int32  
 12  lunch         173277 non-null  int32  
dtypes: float64(3), int32(3), object(7)
memory usage: 16.5+ MB
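
The `info()` output lists `time_stamp` as `object` because the CSV round trip stores dates as text. A sketch of restoring the dtype on reload, assuming the column holds plain `YYYY-MM-DD` strings; `sample_csv` is a made-up stand-in for the saved file:

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV written by to_csv (index in first column).
sample_csv = io.StringIO(
    ',name,time_stamp\n'
    '0,drop biscuits and sausage gravy,2013-04-06\n'
    '1,maple-sweetened banana muffins,2015-04-04\n'
)

# parse_dates converts the named column to datetime while reading.
reloaded = pd.read_csv(sample_csv, index_col=0, parse_dates=['time_stamp'])
print(reloaded.dtypes)
```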
