# <center> Allrecipes analysis - part 1: preparing data </center>

Data was collected with a web scraper available in the following GitHub repository: XXX. All data was saved as a JSON file. Additionally, a list of herbs was compiled from the Encyclopedia Britannica and other online sources. The scraped data covers the following fields:

- name: recipe name
- prep: preparation time
- cook: cooking time
- additional: additional time
- total: total time to prepare the recipe
- servings: number of servings from recipe
- yield: number of servings with appropriate measure unit
- 5 stars: number of 5-star reviews received by the recipe
- 4 stars: number of 4-star reviews received by the recipe
- 3 stars: number of 3-star reviews received by the recipe
- 2 stars: number of 2-star reviews received by the recipe
- 1 stars: number of 1-star reviews received by the recipe
- nutrition: list of nutrients with values (stored in a single string)
- ingrediences: list of all ingredients used in the recipe (stored in a single string)

The objective is to prepare data for further exploratory analysis.

The first step is to import all the modules that we are going to use throughout this process.

In [1]:
import pandas as pd
import numpy as np
import random

The next step is to import data from the JSON file. We can do it with the following line of code:

In [2]:
allrecipes_df = pd.read_json('data.json', orient = 'index')

Let's perform some basic data exploration. First of all, we need to check what kind of data we are dealing with. To do that, we can print the columns of the DataFrame, the data type of each column, and the first few rows for illustration.

In [3]:
print(allrecipes_df.columns)

Index(['name', 'prep:', 'cook:', 'additional:', 'total:', 'Servings:',
       'Yield:', '5 stars', '4 stars', '3 stars', '2 stars', '1 stars',
       'nutrition', 'ingrediences'],
      dtype='object')


In [4]:
allrecipes_df.dtypes

name             object
prep:            object
cook:            object
additional:      object
total:           object
Servings:         int64
Yield:           object
5 stars         float64
4 stars         float64
3 stars         float64
2 stars         float64
1 stars         float64
nutrition        object
ingrediences     object
dtype: object

In [5]:
allrecipes_df.head()

Unnamed: 0,name,prep:,cook:,additional:,total:,Servings:,Yield:,5 stars,4 stars,3 stars,2 stars,1 stars,nutrition,ingrediences
1,Juicy Roasted Chicken,10 mins,1 hr 15 mins,15 mins,1 hr 40 mins,6,6 servings,3179.0,538.0,147.0,47.0,42.0,423 calories; protein 30.9g; carbohydrates 1.2...,"1 (3 pound) whole chicken, giblets removed,sal..."
2,Microwave Corn on the Cob,,5 mins,,5 mins,1,1 serving,382.0,102.0,29.0,5.0,6.0,123 calories; protein 4.6g; carbohydrates 27.2...,"1 ear corn, husked and cleaned"
3,French Toast I,5 mins,15 mins,,20 mins,3,6 slices french toast,1337.0,473.0,87.0,31.0,29.0,240 calories; protein 10.6g; carbohydrates 33....,"6 thick slices bread,2 eggs,⅔ cup milk,¼ teas..."
4,The Best Banana Pudding,25 mins,,,25 mins,20,20 servings,830.0,117.0,30.0,15.0,15.0,329 calories; protein 4.2g; carbohydrates 56.9...,1 (5 ounce) package instant vanilla pudding mi...
5,Simple Macaroni and Cheese,10 mins,20 mins,,30 mins,4,4 servings,545.0,188.0,51.0,26.0,31.0,630 calories; protein 26.5g; carbohydrates 55g...,"1 (8 ounce) box elbow macaroni,¼ cup butter,¼ ..."


We can also compute summary statistics per column with the DataFrame describe() method. By default it omits all non-numerical columns.

In [6]:
allrecipes_df.describe()

Unnamed: 0,Servings:,5 stars,4 stars,3 stars,2 stars,1 stars
count,1000.0,980.0,980.0,980.0,980.0,980.0
mean,9.782,859.734694,202.681633,52.773469,19.970408,17.45102
std,11.847736,1467.753844,282.735392,79.691308,32.956996,35.097481
min,1.0,1.0,0.0,0.0,0.0,0.0
25%,4.0,133.75,35.0,8.0,3.0,2.0
50%,6.0,381.5,102.5,26.0,9.0,7.0
75%,12.0,934.75,242.25,63.0,23.0,18.0
max,192.0,14902.0,2581.0,810.0,362.0,606.0


Finally let us check for any missing data both in the whole dataset (at least one cell in a given row is missing) and per column (count of missing values per column):

In [7]:
rows_with_missing_values = allrecipes_df.isnull().any(axis = 1).sum() # in the whole dataset
print(f'Number of rows with missing values: {rows_with_missing_values}')

Number of rows with missing values: 769


In [8]:
allrecipes_df.isnull().sum(axis = 0) # count per column

name              0
prep:            52
cook:           183
additional:     717
total:           49
Servings:         0
Yield:            0
5 stars          20
4 stars          20
3 stars          20
2 stars          20
1 stars          20
nutrition         1
ingrediences      0
dtype: int64

It looks like the 'additional' column has the most missing values (717 of the 1000 rows).
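The two missing-value checks above can also be combined into one small summary table. The following is a minimal sketch on a made-up frame (`demo` and its values are hypothetical, not taken from the scraped data); it relies only on `isna()`, `sum()`, `mean()` and `sort_values()`.

```python
import pandas as pd

# Hypothetical miniature of the recipe data, used only for illustration.
demo = pd.DataFrame({
    'prep': ['10 mins', None, '5 mins', None],
    'servings': [6, 1, 3, 20],
})

is_missing = demo.isna()
missing_summary = pd.DataFrame({
    'missing': is_missing.sum(),   # count of missing cells per column
    'share': is_missing.mean(),    # fraction of rows that are missing
}).sort_values('missing', ascending=False)

print(missing_summary)
```

Sorting by the count puts the most incomplete columns first, which makes it easier to decide where exception handling will be needed later.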

Now that we've learned a little about the data we are working with, it's time to fix the formatting of the column names. We should strip surrounding whitespace, remove the trailing colon and convert all names to lower case. We can do it with the following line of code:

In [9]:
allrecipes_df.rename(columns = lambda x: x.lower().strip().strip(':'), inplace = True)
print(allrecipes_df.columns)

Index(['name', 'prep', 'cook', 'additional', 'total', 'servings', 'yield',
       '5 stars', '4 stars', '3 stars', '2 stars', '1 stars', 'nutrition',
       'ingrediences'],
      dtype='object')


With the column names fixed, it's time to create some new columns with valuable data and to clean up the existing ones so that they are ready for further analysis. Let's start by creating a column with the overall score of a recipe, computed as the average star rating weighted by the number of reviews at each star level.

In [10]:
allrecipes_df['recipe_score'] = (allrecipes_df['1 stars'] * 1
                                + allrecipes_df['2 stars'] * 2
                                + allrecipes_df['3 stars'] * 3
                                + allrecipes_df['4 stars'] * 4
                                + allrecipes_df['5 stars'] * 5) / (allrecipes_df['1 stars'] 
                                                                   + allrecipes_df['2 stars'] 
                                                                   + allrecipes_df['3 stars'] 
                                                                   + allrecipes_df['4 stars'] 
                                                                   + allrecipes_df['5 stars'])

We should also create a column containing the total number of reviews that a recipe received. Without it, it would be difficult to assess whether the overall score is representative.

In [11]:
allrecipes_df['number_of_reviews'] = (allrecipes_df['1 stars']
                                     + allrecipes_df['2 stars']
                                     + allrecipes_df['3 stars']
                                     + allrecipes_df['4 stars']
                                     + allrecipes_df['5 stars'])
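The score and the review count can also be computed from a single mapping of star columns to weights. This is only a sketch on a hypothetical three-recipe frame (`demo`, `star_weights` and the values are assumptions made for illustration); the column names follow the dataset.

```python
import pandas as pd

# Hypothetical data: the last recipe has no reviews at all.
demo = pd.DataFrame({
    '1 stars': [0, 1, 0],
    '2 stars': [0, 1, 0],
    '3 stars': [0, 1, 0],
    '4 stars': [0, 1, 0],
    '5 stars': [10, 1, 0],
})

# Weight of each column: '1 stars' -> 1, ..., '5 stars' -> 5.
star_weights = pd.Series({f'{n} stars': n for n in range(1, 6)})

number_of_reviews = demo[star_weights.index].sum(axis=1)
# Multiplying a DataFrame by a Series aligns the Series index with the columns.
weighted_sum = (demo[star_weights.index] * star_weights).sum(axis=1)
# A recipe with zero reviews yields NaN instead of raising an error.
recipe_score = weighted_sum / number_of_reviews

print(recipe_score)
```

Note that `sum(axis=1)` skips NaN by default, whereas a chain of `+` operations propagates it, so the two approaches differ for rows where star counts are missing.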

We've learned that the 'nutrition' column contains information on the nutritional components of the prepared meal, but it would be nice to have each of them in a separate column. First we should check how exactly this information is stored and whether any data is missing. If missing values occur, we might need some exception handling when separating the nutrients.

In [12]:
allrecipes_df['nutrition'].head(10)

1     423 calories; protein 30.9g; carbohydrates 1.2...
2     123 calories; protein 4.6g; carbohydrates 27.2...
3     240 calories; protein 10.6g; carbohydrates 33....
4     329 calories; protein 4.2g; carbohydrates 56.9...
5     630 calories; protein 26.5g; carbohydrates 55g...
6     507 calories; protein 33.1g; carbohydrates 8.7...
7     170 calories; protein 4.8g; carbohydrates 28.1...
8     247 calories; protein 6.8g; carbohydrates 33.5...
9     333 calories; protein 9.8g; carbohydrates 30.8...
10    252 calories; protein 4.5g; carbohydrates 29.7...
Name: nutrition, dtype: object

In [13]:
allrecipes_df[allrecipes_df['nutrition'].isnull()]   

Unnamed: 0,name,prep,cook,additional,total,servings,yield,5 stars,4 stars,3 stars,2 stars,1 stars,nutrition,ingrediences,recipe_score,number_of_reviews
368,Campbell's® Tuna Noodle Casserole,10 mins,35 mins,,45 mins,8,8 servings,252.0,147.0,47.0,10.0,3.0,,2 (10.75 ounce) cans Campbell's® Condensed Cr...,4.383442,459.0


It seems that we have one recipe without any information on nutrients. Despite that, it is time to break the 'nutrition' column down into separate columns. We can start by creating one column for each nutrient found in the 'nutrition' column and populating it with the default null value.

In [14]:
allrecipes_df['calories'] = np.nan
allrecipes_df['protein'] = np.nan
allrecipes_df['carbohydrates'] = np.nan
allrecipes_df['fat'] = np.nan
allrecipes_df['cholesterol'] = np.nan
allrecipes_df['sodium'] = np.nan

Here we iterate over each row of the 'nutrition' column and split the string into a list of tokens. Since the value associated with each nutrient directly follows its name in the list, we can populate the nutrient columns with simple logic: if a nutrient name is found in the list, its amount is the next token. The only exception is calories, where the value comes before the name. Rows with a missing 'nutrition' value are skipped.

In [15]:
for index, nutrition_string in allrecipes_df['nutrition'].items():  # iteritems() was removed in pandas 2.0
    try:
        nutrition_list = nutrition_string.split()
        nutrition_list = [value.strip().strip(';') for value in nutrition_list]
        for i in range(len(nutrition_list)):
            
            if nutrition_list[i] == 'calories':
                allrecipes_df.loc[index, 'calories'] = nutrition_list[i-1]
                
            if nutrition_list[i] == 'protein':
                allrecipes_df.loc[index, 'protein'] = nutrition_list[i+1]
                    
            if nutrition_list[i] == 'carbohydrates':
                allrecipes_df.loc[index, 'carbohydrates'] = nutrition_list[i+1]
                
            if nutrition_list[i] == 'fat':
                allrecipes_df.loc[index, 'fat'] = nutrition_list[i+1]
                
            if nutrition_list[i] == 'cholesterol':
                allrecipes_df.loc[index, 'cholesterol'] = nutrition_list[i+1]
                
            if nutrition_list[i] == 'sodium':
                allrecipes_df.loc[index, 'sodium'] = nutrition_list[i+1]
    except AttributeError:  # missing values are NaN (a float), which has no split()
        continue
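The same "name followed by value" logic can be expressed with regular expressions and `Series.str.extract`, which avoids the explicit loop. The sketch below is an alternative, not the approach used in this notebook; the two strings are copied from the sample outputs shown in this notebook, and `parsed` is a hypothetical name.

```python
import pandas as pd

nutrition = pd.Series([
    '554 calories; protein 31.8g; carbohydrates 17.1g; fat 39.6g; cholesterol 91.6mg; sodium 768.3mg.',
    '369 calories; protein 0.4g; carbohydrates 49.2g; fat 6.7g; sodium 30.8mg.',
    None,  # a recipe without nutrition information
])

number = r'(\d+(?:\.\d+)?)'  # integer or decimal, captured without the unit

parsed = pd.DataFrame(index=nutrition.index)
# Calories are the exception: the value precedes the name.
parsed['calories'] = nutrition.str.extract(number + r'\s+calories', expand=False)
for nutrient in ['protein', 'carbohydrates', 'fat', 'cholesterol', 'sodium']:
    parsed[nutrient] = nutrition.str.extract(rf'{nutrient}\s+{number}', expand=False)

# Units and the trailing period were never captured, so a plain cast is enough.
parsed = parsed.astype(float)
print(parsed)
```

Because the capture group excludes the unit, this variant needs no separate unit-stripping step, and a nutrient that is absent from the string simply stays NaN.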

We should compare a random sample of rows against the original strings to check that the nutrient data was extracted accurately.

In [16]:
random_index = random.sample(range(0, 1000), 5)

for index in random_index:
    print('\nRecipe number ' + str(index+1) + ':')
    print(allrecipes_df['nutrition'].iloc[index].strip('Full Nutrition'))
    print(allrecipes_df.iloc[index, 16:])


Recipe number 407:
369 calories; protein 0.4g; carbohydrates 49.2g; fat 6.7g; sodium 30.8mg.

calories             369
protein             0.4g
carbohydrates      49.2g
fat                 6.7g
cholesterol          NaN
sodium           30.8mg.
Name: 407, dtype: object

Recipe number 914:
554 calories; protein 31.8g; carbohydrates 17.1g; fat 39.6g; cholesterol 91.6mg; sodium 768.3mg.

calories              554
protein             31.8g
carbohydrates       17.1g
fat                 39.6g
cholesterol        91.6mg
sodium           768.3mg.
Name: 914, dtype: object

Recipe number 962:
265 calories; protein 4.9g; carbohydrates 35g; fat 11.9g; cholesterol 2mg; sodium 166.6mg.

calories              265
protein              4.9g
carbohydrates         35g
fat                 11.9g
cholesterol           2mg
sodium           166.6mg.
Name: 962, dtype: object

Recipe number 553:
385 calories; protein 18.1g; carbohydrates 37g; fat 18g; cholesterol 44.8mg; sodium 1244.6mg.

calories               

It seems that everything worked fine. Now let's remove the unnecessary trailing measurement unit from each value and change the data type of the nutrient columns from string to float. We will include the unit in the name of the column instead.

In [17]:
allrecipes_df['protein'] = allrecipes_df['protein'].str.rstrip('g').apply(float)
allrecipes_df['carbohydrates'] = allrecipes_df['carbohydrates'].str.rstrip('g').apply(float)
allrecipes_df['fat'] = allrecipes_df['fat'].str.rstrip('g').apply(float)
allrecipes_df['cholesterol'] = allrecipes_df['cholesterol'].str.rstrip('mg').apply(float)
allrecipes_df['sodium'] = allrecipes_df['sodium'].str.rstrip('mg.').apply(float)
allrecipes_df['calories'] = allrecipes_df['calories'].apply(float)

In [18]:
new_nutrient_names = {'protein' : 'protein [g]',
                      'carbohydrates' : 'carbohydrates [g]',
                      'fat' : 'fat [g]',
                      'cholesterol' : 'cholesterol [mg]',
                      'sodium' : 'sodium [mg]'}

allrecipes_df.rename(columns = new_nutrient_names, inplace = True)

Since we now have a column containing information on the caloric value of a recipe, we can divide it by the number of servings and generate some data on the calories per portion.

In [19]:
allrecipes_df['calories_per_serving'] = np.ceil(allrecipes_df['calories'] / allrecipes_df['servings'])

Let's see what our data looks like at this point.

In [20]:
allrecipes_df.head()

Unnamed: 0,name,prep,cook,additional,total,servings,yield,5 stars,4 stars,3 stars,...,ingrediences,recipe_score,number_of_reviews,calories,protein [g],carbohydrates [g],fat [g],cholesterol [mg],sodium [mg],calories_per_serving
1,Juicy Roasted Chicken,10 mins,1 hr 15 mins,15 mins,1 hr 40 mins,6,6 servings,3179.0,538.0,147.0,...,"1 (3 pound) whole chicken, giblets removed,sal...",4.711358,3953.0,423.0,30.9,1.2,32.1,97.0,661.9,71.0
2,Microwave Corn on the Cob,,5 mins,,5 mins,1,1 serving,382.0,102.0,29.0,...,"1 ear corn, husked and cleaned",4.620229,524.0,123.0,4.6,27.2,1.7,,21.5,123.0
3,French Toast I,5 mins,15 mins,,20 mins,3,6 slices french toast,1337.0,473.0,87.0,...,"6 thick slices bread,2 eggs,⅔ cup milk,¼ teas...",4.562596,1957.0,240.0,10.6,33.6,6.4,128.3,477.7,80.0
4,The Best Banana Pudding,25 mins,,,25 mins,20,20 servings,830.0,117.0,30.0,...,1 (5 ounce) package instant vanilla pudding mi...,4.71996,1007.0,329.0,4.2,56.9,9.6,8.6,205.2,17.0
5,Simple Macaroni and Cheese,10 mins,20 mins,,30 mins,4,4 servings,545.0,188.0,51.0,...,"1 (8 ounce) box elbow macaroni,¼ cup butter,¼ ...",4.414982,841.0,630.0,26.5,55.0,33.6,99.6,777.0,158.0


Looks great, but there is one more thing that we have to deal with before we can move on. As you can see, the time data is still stored as strings, which does not allow any reliable quantitative analysis. We should convert the 'prep', 'cook', 'additional' and 'total' columns to a unified measure of time: minutes. To achieve that with the least effort we will define an appropriate function, convert_to_mins().

However, before we can do that we need to learn more about the time-related columns in our dataset. To design the function's logic we should check which combinations of time units occur, alone or together (for example, whether there are cells where time is given in hours, in hours and minutes, etc.). To do that we will build a set from all values of the time-related columns with the digits stripped from each string. We will create a new dataframe, df_time, and perform all the necessary operations on it.

In [21]:
df_time = allrecipes_df.loc[:, 'prep':'total'].copy()  # copy, so that edits do not act on a slice of allrecipes_df

list_of_measures = list(df_time.loc[:, 'prep':'total'].values.T.ravel())

list_of_measures = [x for x in list_of_measures if type(x) == str]

list_of_measures = [''.join(x for x in i if not x.isdigit()) for i in list_of_measures]

set(list_of_measures)

{' day', ' days', ' hr', ' hr  mins', ' hrs', ' hrs  mins', ' min', ' mins'}

So the possibilities are: 'day', 'days', 'hr', 'hr mins', 'hrs', 'hrs mins', 'min' and 'mins' (ignoring the whitespace left behind where the digits were removed). Now we need to apply the appropriate logic in our convert_to_mins() function. We will split the string containing the time information into a list of values, and then extract and convert the time into minutes depending on the combination of time units.

In [22]:
def convert_to_mins(x):
    x = x.split()
    if len(x) == 4:
        # hours and minutes, e.g. '1 hr 15 mins'
        x = [int(d) for d in x if d.isdigit()]
        return (x[0] * 60 + x[1])
    elif (len(x) == 2) and ('hr' in x or 'hrs' in x):
        return int(x[0]) * 60
    elif (len(x) == 2) and ('day' in x or 'days' in x):
        return int(x[0]) * 1440  # 1 day = 24 * 60 minutes
    else:
        return int(x[0])

In the function we convert all hours and days to minutes and return every value as an integer. If the only time unit used in the cell is minutes, we just take the number and convert it to an integer without changing its value. We can now apply our function to the df_time columns. However, since the split() method cannot be applied to np.nan values (and we have those in our dataset), we have to map the function to string values only.

In [23]:
df_time.prep = df_time.prep.map(lambda x: convert_to_mins(x) if type(x) == str else x)
df_time.cook = df_time.cook.map(lambda x: convert_to_mins(x) if type(x) == str else x)    
df_time.additional = df_time.additional.map(lambda x: convert_to_mins(x) if type(x) == str else x)
df_time.total = df_time.total.map(lambda x: convert_to_mins(x) if type(x) == str else x)

It would be nice to check that there are no mistakes in our calculations. We will create a new Series, calc_verification, that sums the 'prep', 'cook' and 'additional' columns of df_time, and compare it with the 'total' column. However, we cannot use the pandas equals() method straight away: the row-wise sum treats missing values as 0, whereas a missing 'total' stays NaN, so the two would not match. We therefore fill all cells containing NaN with 0 first.

In [24]:
df_time.fillna(0, inplace = True)

calc_verification = df_time.loc[:, 'prep':'additional'].sum(axis = 1, skipna = True)

df_time['total'].equals(calc_verification)

False

So the Series are not equal. We must have made a mistake somewhere. To determine which rows of df_time['total'] and calc_verification do not match, we will create a numpy array, diff, filled with 1 where the values in a row are the same and 0 where they differ. Next, we will extract only the records where the values don't match by indexing df_time with the diff array.

In [25]:
diff = np.where( df_time['total'] == calc_verification , 1, 0)

df_time[diff == 0]

Unnamed: 0,prep,cook,additional,total
159,20.0,10.0,1440.0,1440.0
288,20.0,10.0,1440.0,1440.0
818,10.0,15.0,4320.0,4320.0
844,5.0,0.0,2880.0,2880.0


So the problematic records are in rows 159, 288, 818 and 844. It seems that whenever days were used as the unit of time, the 'prep' and 'cook' values were not added to the 'total' column. So we haven't made a mistake after all; there were some inconsistencies (or rather simplifications) in the original data. We could leave it as it is, but since only 4 records do not match, we can amend those values manually.

In [26]:
df_time.at[159, 'total'] = 1470.00
df_time.at[288, 'total'] = 1470.00
df_time.at[818, 'total'] = 4345.00
df_time.at[844, 'total'] = 2885.00

Let's once again check whether all the values are the same.

In [27]:
df_time['total'].equals(calc_verification)

True
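The unit-by-unit conversion can also be written with a single regular expression that handles any combination of days, hours and minutes. The following is a minimal sketch, not the function used in this notebook; `to_minutes` and `MINUTES_PER_UNIT` are hypothetical names, and one day is taken as 24 * 60 = 1440 minutes.

```python
import re
import numpy as np

# Minutes in each unit; plural forms are handled by the regex below.
MINUTES_PER_UNIT = {'day': 24 * 60, 'hr': 60, 'min': 1}

def to_minutes(text):
    """Convert strings such as '1 hr 15 mins' or '2 days' to minutes."""
    if not isinstance(text, str):
        return np.nan  # keep missing values missing
    total = 0
    # Each match is an (amount, unit) pair, e.g. ('1', 'hr'), ('15', 'min').
    for amount, unit in re.findall(r'(\d+)\s*(day|hr|min)s?', text):
        total += int(amount) * MINUTES_PER_UNIT[unit]
    return total

print(to_minutes('1 hr 15 mins'))  # 75
print(to_minutes('2 days'))        # 2880
```

Since missing values are returned as NaN, such a function could be passed to `Series.map` directly, without a type check in a lambda.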

Great! Looks like it worked all right. Now we should replace the columns in the original dataframe, allrecipes_df, with the columns from df_time. But first let's convert all 0s back to np.nan so that missing times are not mistaken for zero-minute durations later on.

In [28]:
df_time.replace(0, np.nan, inplace=True)

allrecipes_df.loc[:, 'prep':'total'] = df_time

We can see what our data looks like now.

In [29]:
allrecipes_df.head()

Unnamed: 0,name,prep,cook,additional,total,servings,yield,5 stars,4 stars,3 stars,...,ingrediences,recipe_score,number_of_reviews,calories,protein [g],carbohydrates [g],fat [g],cholesterol [mg],sodium [mg],calories_per_serving
1,Juicy Roasted Chicken,10.0,75.0,15.0,100.0,6,6 servings,3179.0,538.0,147.0,...,"1 (3 pound) whole chicken, giblets removed,sal...",4.711358,3953.0,423.0,30.9,1.2,32.1,97.0,661.9,71.0
2,Microwave Corn on the Cob,,5.0,,5.0,1,1 serving,382.0,102.0,29.0,...,"1 ear corn, husked and cleaned",4.620229,524.0,123.0,4.6,27.2,1.7,,21.5,123.0
3,French Toast I,5.0,15.0,,20.0,3,6 slices french toast,1337.0,473.0,87.0,...,"6 thick slices bread,2 eggs,⅔ cup milk,¼ teas...",4.562596,1957.0,240.0,10.6,33.6,6.4,128.3,477.7,80.0
4,The Best Banana Pudding,25.0,,,25.0,20,20 servings,830.0,117.0,30.0,...,1 (5 ounce) package instant vanilla pudding mi...,4.71996,1007.0,329.0,4.2,56.9,9.6,8.6,205.2,17.0
5,Simple Macaroni and Cheese,10.0,20.0,,30.0,4,4 servings,545.0,188.0,51.0,...,"1 (8 ounce) box elbow macaroni,¼ cup butter,¼ ...",4.414982,841.0,630.0,26.5,55.0,33.6,99.6,777.0,158.0


By now we have worked a bit with the data and got to know it better. There is no doubt that we could get some insights based on this data alone. However, to illustrate how it could be enriched even further, I decided to compile a list of herbs from an external source, the Encyclopedia Britannica. In the following steps we will create a number of columns (one per herb) that indicate whether a particular herb appears in a recipe. These steps could be used with virtually any category of ingredients, but in this case I've decided to limit the analysis to herbs only. Let's begin by importing the list of herbs from an external file. We can do some basic formatting on the way as well.

In [30]:
with open('herbs_britannica.txt', 'r') as file:
    file_content = file.read()
    herbs = file_content.split('\n')
    herbs = [item.lower().strip() for item in herbs]
    herbs = list(set(herbs))
    herbs = [item for item in herbs if item]  # drop empty lines; remove('') would fail if there were none

We now have a well-formatted list of herbs. We can create a function that checks whether a herb is listed in the 'ingrediences' column of our DataFrame. Depending on whether it appears, the function assigns the value 1 (the herb is present) or 0 (it is not) in the newly created herb column. Note that this is a plain, case-sensitive substring search.

In [31]:
def create_ingredient_column(ingredient_name):
    ingredient = str(ingredient_name)
    allrecipes_df[ingredient] = np.where(allrecipes_df['ingrediences'].str.find(ingredient) != -1, 1, 0)

Now we can iterate over the herbs list and pass each herb to the function as the ingredient_name argument.

In [35]:
for herb in herbs:
    create_ingredient_column(herb)
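A substring search also matches herb names that are embedded in longer words, for example 'sage' inside 'sausage'. The sketch below shows one possible refinement using word boundaries and case-insensitive matching; it is not what the notebook does, `contains_ingredient` is a hypothetical helper, and the ingredient strings are made up for illustration.

```python
import re
import pandas as pd

# Made-up ingredient strings in the comma-separated style of the dataset.
ingredients = pd.Series([
    '1 pound pork sausage,2 cloves garlic',
    '1 teaspoon dried sage,1 pinch ground cloves',
    '1 cup Basil leaves',
])

def contains_ingredient(series, name):
    # \b anchors the match to word boundaries; s? tolerates a simple plural.
    pattern = rf'\b{re.escape(name)}s?\b'
    matches = series.str.contains(pattern, case=False, regex=True, na=False)
    return matches.astype(int)

print(contains_ingredient(ingredients, 'sage').tolist())   # [0, 1, 0]
print(contains_ingredient(ingredients, 'clove').tolist())  # [1, 1, 0]
```

Word boundaries do not resolve every ambiguity: 'clove' still matches garlic cloves as well as the spice, so counts for such names should be read with care.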

Now we will check which columns are of any value to us. We can count the herb columns that contain only 0s, only 1s, or both 0s and 1s.

In [36]:
count_of_herbs_columns = {}

for column in allrecipes_df.columns:
    if column in herbs:
        value_counts = np.array2string(allrecipes_df[column].value_counts().index.values)
        if value_counts not in count_of_herbs_columns.keys():
            count_of_herbs_columns[value_counts] = 1
        else:
            count_of_herbs_columns[value_counts] += 1

print (count_of_herbs_columns)

{'[0]': 39, '[0 1]': 45, '[1 0]': 1}


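There are 39 herbs that do not appear in any recipe, and no herb appears in every recipe. The keys '[0 1]' and '[1 0]' both describe columns containing both values; value_counts() lists the more frequent value first, so a single herb is present in more recipes than it is absent from. That leaves 46 herb-related columns that are of value to us. We can now drop the columns where all values are 0.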

In [39]:
nunique = allrecipes_df[herbs].nunique()  # look at herb columns only, so no other constant column is dropped
columns_to_drop = nunique[nunique == 1].index
allrecipes_df = allrecipes_df.drop(columns_to_drop, axis=1)

We can check how many times the remaining herbs occur in our recipes.

In [42]:
ingr_sum_dict = {}

for column in allrecipes_df.columns:
    if column in herbs:
        ingr_sum_dict[column] = allrecipes_df[column].sum()

print(ingr_sum_dict)

{'coriander': 3, 'mint': 3, 'dill': 19, 'parsley': 68, 'black pepper': 325, 'sage': 26, 'cayenne pepper': 69, 'celery': 61, 'wasabi': 1, 'plantain': 1, 'sesame': 28, 'lemon grass': 1, 'oregano': 68, 'marjoram': 5, 'tarragon': 5, 'curry': 7, 'savory': 1, 'fennel': 2, 'ginger': 37, 'nutmeg': 16, 'celery seed': 6, 'mustard': 62, 'garlic': 346, 'vanilla': 124, 'mace': 1, 'rosemary': 23, 'pepper': 509, 'bay leaf': 7, 'chives': 5, 'brown mustard': 2, 'anise': 3, 'dandelion': 1, 'poppy seed': 5, 'thyme': 41, 'allspice': 3, 'horseradish': 2, 'clove': 182, 'paprika': 73, 'turmeric': 2, 'star anise': 1, 'cumin': 45, 'basil': 54, 'cilantro': 36, 'chili pepper': 1, 'cinnamon': 66, 'cardamom': 3}


There are some herbs that were used only in a limited number of recipes. Let's see which herbs were used in at least 20 recipes.

In [45]:
filtered_ingr_sum_dict =  {key: value for (key, value) in ingr_sum_dict.items() if value >= 20}
print(filtered_ingr_sum_dict)

{'parsley': 68, 'black pepper': 325, 'sage': 26, 'cayenne pepper': 69, 'celery': 61, 'sesame': 28, 'oregano': 68, 'ginger': 37, 'mustard': 62, 'garlic': 346, 'vanilla': 124, 'rosemary': 23, 'pepper': 509, 'thyme': 41, 'clove': 182, 'paprika': 73, 'cumin': 45, 'basil': 54, 'cilantro': 36, 'cinnamon': 66}


Since we would like to perform some worthwhile analysis in the future, we should get rid of the columns for herbs that were used rarely. As above, we will set the threshold at 20 uses.

In [46]:
for column in allrecipes_df.columns:    
    if (column in herbs) and (column not in list(filtered_ingr_sum_dict.keys())):
        allrecipes_df.drop(labels = column, axis = 1, inplace = True)
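Counting and thresholding can also be done in one pass, without the intermediate dictionaries. This is a minimal sketch on a hypothetical frame (`demo`, `herb_columns` and `min_uses` are assumptions for illustration; the notebook uses the herbs list and a threshold of 20).

```python
import pandas as pd

# Hypothetical data: 'garlic' is common, 'wasabi' is rare.
demo = pd.DataFrame({
    'name': ['a', 'b', 'c', 'd'],
    'garlic': [1, 1, 1, 0],
    'wasabi': [0, 0, 1, 0],
})
herb_columns = ['garlic', 'wasabi']  # stands in for the herbs list
min_uses = 2                         # the notebook uses 20

usage = demo[herb_columns].sum()         # number of recipes per herb
rare = usage[usage < min_uses].index     # herbs below the threshold
filtered = demo.drop(columns=rare)

print(list(filtered.columns))  # ['name', 'garlic']
```

Selecting only the herb columns before summing keeps non-herb columns out of the comparison entirely.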

We can now see the final result of our data transformation. Let's print the first and last 5 rows of the dataframe.

In [48]:
allrecipes_df.head()

Unnamed: 0,name,prep,cook,additional,total,servings,yield,5 stars,4 stars,3 stars,...,vanilla,rosemary,pepper,thyme,clove,paprika,cumin,basil,cilantro,cinnamon
1,Juicy Roasted Chicken,10.0,75.0,15.0,100.0,6,6 servings,3179.0,538.0,147.0,...,0,0,1,0,0,0,0,0,0,0
2,Microwave Corn on the Cob,,5.0,,5.0,1,1 serving,382.0,102.0,29.0,...,0,0,0,0,0,0,0,0,0,0
3,French Toast I,5.0,15.0,,20.0,3,6 slices french toast,1337.0,473.0,87.0,...,1,0,0,0,0,0,0,0,0,1
4,The Best Banana Pudding,25.0,,,25.0,20,20 servings,830.0,117.0,30.0,...,1,0,0,0,0,0,0,0,0,0
5,Simple Macaroni and Cheese,10.0,20.0,,30.0,4,4 servings,545.0,188.0,51.0,...,0,0,1,0,0,0,0,0,0,0


In [49]:
allrecipes_df.tail()

Unnamed: 0,name,prep,cook,additional,total,servings,yield,5 stars,4 stars,3 stars,...,vanilla,rosemary,pepper,thyme,clove,paprika,cumin,basil,cilantro,cinnamon
996,Biscotti,15.0,25.0,,40.0,42,3 to 4 dozen,866.0,163.0,40.0,...,0,0,0,0,0,0,0,0,0,0
997,Potato Salad,20.0,10.0,360.0,390.0,20,20 servings,168.0,56.0,8.0,...,0,0,1,0,0,0,0,0,0,0
998,Top Ramen® Salad,15.0,10.0,30.0,55.0,6,6 servings,23.0,13.0,3.0,...,0,0,0,0,0,0,0,0,0,0
999,Meatball Nirvana,20.0,20.0,,40.0,4,12 meatballs,3327.0,774.0,200.0,...,0,0,1,0,0,0,0,0,0,0
1000,Air-Fryer Roasted Veggies,20.0,10.0,,30.0,4,2 cups,9.0,1.0,2.0,...,0,0,1,0,0,0,0,0,0,0


Great! It seems that we are done!