# Search Engine

This lab is about starting from scratch, and hence having to structure the code yourself. A specification for the program you are to implement is given below - do pay attention as misunderstandings may cost you marks! Being precise is a necessary skill for a programmer. To give an executive summary, you are to code a search engine for recipes. A data set has been provided. The search engine is to be pretty basic, returning all recipes that contain all of the provided keywords. However, the user can choose from a number of orderings depending on their food preferences, which you need to support.

## Marking and submission

These lab exercises are marked, and contribute to your final grade. This lab exercise has 20 marks to earn, equivalent to 12% of your final grade.

Please submit your completed workbook to the auto marker before 2021-11-14 20:00 GMT. The workbook you submit must be an .ipynb file, which is saved into the directory you're running Jupyter; alternatively you can download it from the menu above using `File -> Download As -> Notebook (.ipynb)`. Remember to save your work regularly (`Save and checkpoint` in the `File` menu, the icon of a floppy disk, or `Ctrl-S`). It is wise to verify it runs to completion with _Restart & Run All_ before submission.

You must comply with the universities plagiarism guidelines: http://www.bath.ac.uk/library/help/infoguides/plagiarism.html

## Specification

The system must provide a function ``search``, with the following specification:
```
def search(query, ordering = 'normal', count = 10):
  ...
```

It `print`s out the results of the search, subject to the following rules:
1. It selects from the set of all recipes that contain __all__ of the words in the query (the positions/order of the words in the recipe are to be ignored).
2. It orders them based on the provided ordering (a string, meaning defined below).
3. It `print`s the top `count` matches only, preserving the order from best to worst. Must `print` just their title, one per line. Must be less than `count` if the search is specific enough that less than `count` recipes match.

As an aside, do not worry about memory usage. If duplicating some data can make your code faster/neater then feel free.



### Data set

A file, `recipes.json` is provided, containing 17K recipes. It can be parsed into a Python data structure using the [`json`](https://docs.python.org/3/library/json.html) module. It is a list, where each recipe is a dictionary containing various keys:
* `title` : Name of recipe; you can assume these are unique
* `categories` : A list of tags assigned to the recipe
* `ingredients` : What is in it, as a list
* `directions` : List of steps to make the recipe
* `rating` : A rating, out of 5, of how good it is
* `calories` : How many calories it has
* `protein` : How much protein is in it
* `fat` : How much fat is in it

Note that the data set was obtained via web scrapping and hence is noisy - every key except for `title` is missing from at least one recipe. Your code will need to cope with this.

You will probably want to explore the data before starting, so you have an idea of what your code has to deal with.

Data set came from https://www.kaggle.com/hugodarwood/epirecipes/version/2 though note it has been cleaned it up, by deleting duplicates and removing the really dodgy entries.



### Search

The search should check the following parts of the recipe (see data set description below):
* `title`
* `categories`
* `ingredients`
* `directions`

For instance, given the query "banana cheese" you would expect "Banana Layer Cake with Cream Cheese Frosting" in the results. Note that case is to be ignored ("banana" matches "Banana") and the words __do not__ have to be next to one another, in the same order as the search query or even in the same part of the recipe ("cheese" could appear in the title and "banana" in the ingredients). However, all words in the search query __must__ appear somewhere.



### Tokenisation

This is the term for breaking a sentence into each individual word (token). Traditionally done using regular expressions, and Python does have the `re` module, but there is no need to do that here (`re` can be quite fiddly). For matching words your tokenisation must follow the following steps:
1. Convert all punctuation and digits into spaces. For punctuation use the set in [`string.punctuation`](https://docs.python.org/3/library/string.html#string.punctuation), for digits [`string.digits`](https://docs.python.org/3/library/string.html#string.digits). You may find [`translate()`](https://docs.python.org/3/library/stdtypes.html#str.translate) interesting!
2. [`split()`](https://docs.python.org/3/library/stdtypes.html#str.split) to extract individual tokens.
3. Ignore any token that is less than $3$ characters long.
4. Make tokens lowercase.

When matching words for search (above) or ordering (below) it's only a match if you match an entire token. There are many scenarios where this simple approach will fail, but it's good enough for this exercise. The auto marker will be checking the above is followed! When doing a search the code should ignore terms in the search string that fail the above requirements.



### Ordering

There are three ordering modes to select from, each indicated by passing a string to the `search` function:
* `normal` - Based simply on the number of times the search terms appear in the recipe. A score is calculated and the order is highest to lowest. The score sums the following terms (repeated words are counted multiple times, i.e. "cheese cheese cheese" is $3$ matches to "cheese"):
    * $8 \times$ Number of times a query word appears in the title
    * $4 \times$ Number of times a query word appears in the categories
    * $2 \times$ Number of times a query word appears in the ingredients
    * $1 \times$ Number of times a query word appears in the directions
    * The `rating` of the recipe (if not available assume $0$)

* `simple` - Tries to minimise the complexity of the recipe, for someone who is in a rush. Orders to minimise the number of ingredients multiplied by the numbers of steps in the directions.

* `healthy` - Order from lowest to highest by this cost function:
$$\frac{|\texttt{calories} - 510n|}{510} + 2\frac{|\texttt{protein} - 18n|}{18} + 4\frac{|\texttt{fat} - 150n|}{150}$$
Where $n \in \mathbb{N}^+$ is selected to minimise the cost ($n$ is a positive integer and $n=0$ is not allowed). This can be understood in terms of the numbers $510$, $18$ and $150$ being a third of the recommended daily intake (three meals per day) for an average person, and $n$ being the number of whole meals the person gets out of cooking/making the recipe. So this tries to select recipes that neatly divide into a set of meals that are the right amount to consume for a healthy, balanced diet. Try not to overthink the optimisation of $n$, as it's really quite simple to do!

To clarify the use of the ordering string, to get something healthy that contains cheese you might call `search('cheese', 'healthy')`. In the case of a recipe that is missing a key in its dictionary the rules are different for each search mode:
* `normal` - Consider a missing entry in the recipe (e.g. no `ingredients` are provided) to simply mean that entry can't match any search words (because it has none!), but the item is still eligible for inclusion in the results, assuming it can match the search with a different entry.
* `simple` - If a recipe is missing either `ingredients` or `directions` it is dropped from such a search result. Because the data is messy if either of these lists is of length $1$ it should be assumed that the list extraction has failed and the recipe is to also be dropped from the search results.
* `healthy` - If any of `calories`, `protein` or `fat` is missing the recipe should be dropped from the result.



### Extra

You may find the [_inverted index_](https://en.wikipedia.org/wiki/Inverted_index) interesting. It's a data structure used by search engines. For each word a user may search for this contains a list of all documents (recipes) that contain the word. This may take a little effort to understand, but the resulting code will be both faster and neater.

## Advice

* Don't just start coding: Make a plan and work out what you intend to do.
* Think about structure, as messy code leads to mistakes.
* Plan your data structures. Don't be afraid to use sheets of paper if that works for you!
* Don't duplicate code, put it in a function/method instead.

* Divide and conquer. Break the system into parts that can implemented with minimal dependency on the rest. Functions or OOP are both suitable for doing this.
* Test early. Verify the individual parts work before trying to combine them. Factor this into the order you implement the parts of the system - don't implement something you are going to struggle to test before implementing, and verifying, other parts.
* Do not try and do a 'big bang', where you get it all working at once. Instead, get it working with features missing, then add those features in, one at a time.

* Keep things as simple as possible. Avoid long functions/methods.
* Include comments, as a form of planning and for your own sanity!
* Regularly reset the kernel and rerun the entire workbook. It is very easy to break something but not notice, because the correct version remains in memory.

## Marks
* __20 marks__: For many different _unit tests_ that will be run on `search`. They will cover all of the details in the above specification!
    * One test checks that it's faster than $0.1$ seconds on average (on the computer which runs the auto marker, which is quite fast) to do a search, so try not to be too inefficient (it ignores any time your notebook spends building data structures to be used by `search`). Note that the validation implementation comes in at $0.001$ seconds per search (after $5.5$ seconds of preparation), so this is very achievable!
    * You may want to look into Python's `set()` object, as it is useful for _one_ of the possible ways to implement this.
    * There will be sorting. The [Sorting how to](https://docs.python.org/3/howto/sorting.html) may help.
    * The auto marker does give some feedback, and you can run it as many times as you want. Don't be afraid to test an incomplete or semi-broken version of your code if stuck or unsure. It may help!
    * The validation implementation is 104 lines of code split over 5 cells (including white space for clarity and comments). Coded by someone who probably has much more experience than you, so you shouldn't aim to match this, but it's a good clue: If you find yourself at 500 lines you may want to stop and think some more! (line count does not include testing code, which is about the same amount again)

In [1]:
import pandas as pd
import json
import string
import numpy as np
#loading/reading data from provided dataset...

with open('recipes.json','r') as f:
    data = json.load(f)
df = pd.DataFrame(data)
df
#df.iloc[:,:].isnull().sum().sum() ........ 11122 total NaN values
#df.iloc[:,2:4].isnull().sum().sum() ..... 27 NaN values
#df.iloc[:,4:8].isnull().sum().sum() ...... 11065 NaN values
#df.iloc[:,:2].isnull().sum().sum() ....... 30 NaN values


Unnamed: 0,title,categories,ingredients,directions,rating,calories,protein,fat
0,"""Adult"" Pimiento Cheese","[Cheese, Vegetable, No-Cook, Vegetarian, Quick...","[2 or 3 large garlic cloves, a 2-ounce jar dic...",[Force garlic through a garlic press into a la...,3.125,,,
1,"""Blanketed"" Eggplant","[Tomato, Vegetable, Appetizer, Side, Vegetaria...","[8 small Japanese eggplants, peeled, 16 large ...",[Place eggplants on double thickness of paper ...,3.750,1386.0,9.0,133.0
2,"""Bloody Mary"" Tomato Toast with Celery and Hor...","[Condiment/Spread, Tomato, Appetizer, Kid-Frie...","[1 lemon, zested, juiced, 1/2 shallot, finely ...","[Combine lemon zest, lemon juice, shallot, tom...",5.000,189.0,2.0,16.0
3,"""Brown on Blonde"" Blondies","[Cookbook Critic, Dessert, Brownie, Nut, Walnu...","[170 grams unsalted butter, 100 grams walnuts,...",[Crank the oven to 350ºF. Coat the insides of ...,0.000,321.0,5.0,18.0
4,"""California Roll"" Salad","[Salad, Ginger, Rice, Vegetable, Side, Vegetar...","[1 1/2 cups long-grain rice, 1/4 cup plus 3 ta...",[Into a large saucepan of salted boiling water...,4.375,369.0,9.0,13.0
...,...,...,...,...,...,...,...,...
17729,Zucchini-Pecan Cake with Cream Cheese Frosting,"[Cake, Mixer, Egg, Dessert, Bake, Kentucky Der...","[Nonstick vegetable oil spray, 1 1/2 cups all ...",[Position rack in center of oven and preheat t...,4.375,453.0,6.0,31.0
17730,"Zucchini-Wrapped Halibut ""Scallops""","[Fish, Low Sodium, Dinner, Seafood, Halibut, S...","[1 teaspoon smoked paprika, 1 teaspoon ground ...","[To prep, mix the ingredients for the two rubs...",4.375,,,
17731,"Zucchini-Wrapped Red Snapper with Tomato, Cumi...","[Fish, Tomato, Vegetable, Bake, Orange, Snappe...","[2 teaspoons cumin seeds, four 6- to 7-ounce r...","[Preheat oven to 450°F., In a small dry heavy ...",3.750,259.0,14.0,15.0
17732,Zuni Rolls with Raspberry Chipotle Sauce,"[Sandwich, Berry, Cheese, Dairy, Poultry, turk...",[1 cup fresh raspberries or frozen unsweetened...,[In a small saucepan combine sauce ingredients...,4.375,,,


In [2]:
#tokenisation (attempt #3...-_-) aka removing punctuation and numbers from our input; tokens to be made lowercase!
#revert to code from attempt #1...but converting lists to strings to solve previous error!

def tokenisation(input):
    
    #1.Convert all punctuation and digits into spaces. For punctuation use the set in string.punctuation, for digits string.digits. You may find translate() interesting!
    
    #--------------------------------------------------------------------------------------------------

    tokenised = str(input).translate(str.maketrans(string.punctuation+string.digits, ' '*len(string.punctuation+string.digits))) #removes digits and punctuation
    
    #--------------------------------------------------------------------------------------------------

    #tokenised = tokenised.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))) #removes punctuation...used above in shortened code
    #token = tokenised.lower().split() #makes token lowercase...used below
    
    #2.split() to extract individual tokens; #3.Ignore any token that is less than  3  characters long: 4.Make tokens lowercase
    
    #--------------------------------------------------------------------------------------------------

    tokens = ' '.join(i for i in tokenised.split() if len(i) >= 3).lower().split() #converts lists to strings c:
    
    #2.split() to extract individual tokens; 4.Make tokens lowercase.
    #tokens = tokens.lower().split()     ...used above to shorten code


    return tokens

#tokenisation(data) returns list of individual tokens according to specified instructions
#finally! the output i wanted.....-_-   

In [3]:
#after creating inverted index...very difficult to extract values for number of ingredients and number of steps using:
#for i in range(df.shape[0]):
   #     df['number of ingredients'] = len(df.iloc[i,df.columns[2]]))
        #df['number of directions'] = len(df.iloc[i,df.columns[3]]) .....due to NaN values present
#TypeError: object of type 'float' has no len()...(NaN values are of type float!)
#after many attempts creating a dictionary which fills NaN values, is able to extract said values

empty_dict_names={df.columns[i]:list() for i in range(4)}
empty_dict_nums={df.columns[i]:0 for i in range(4,8)}

def new_dict(data) :
    for col_name, value in empty_dict_names.items() :
        if col_name not in data :
            for i in range(4):
                if col_name == df.columns[i] :data[col_name] = list()

    for col_num, value in empty_dict_nums.items() :
        if col_num not in data :
            for i in range(4,8):
                if col_num == df.columns[i] :df.loc[:,df.columns[i]] = df[df.columns[i]].fillna(0) #fills NaNs with zeroes
                else :data[col_num] = 0
    return data



In [4]:
#we can now use the extracted words to start building the search function...
#using the inverted indexing approach...we must first start our search by finding the corresponding tokens in our dataset!
#dont forget to account for the weights of each search term!:
    #8×  Number of times a query word appears in the title
    #4×  Number of times a query word appears in the categories
    #2×  Number of times a query word appears in the ingredients
    #1×  Number of times a query word appears in the directions

#attempt @ invert indexing...

def invert_index(data):
    
    inverted_index = list()
    for i in range(df.shape[0]):
        inverted_dict = {}
        inverted_dict[df.columns[2]] = len(new_dict(data[i])[df.columns[2]])
        inverted_dict[df.columns[3]] = len(new_dict(data[i])[df.columns[3]])
        #inverted_dict[i]['ingredients'] = len(data[i][df.columns[2])
        #inverted_dict[i]['directions'] = len(data[i][df.columns[3])
        for j in data[i]:
            inverted_dict[df.columns[0]] = df.iloc[i][df.columns[0]]
            if j == df.columns[0]:                 #title
                for k in tokenisation(df.iloc[i][j]):
                    if k not in inverted_dict: inverted_dict[k]=8
                    else: inverted_dict[k]+=8
            elif j == df.columns[1]:              #categories
                for k in tokenisation(df.iloc[i][j]):
                    if k not in inverted_dict: inverted_dict[k]=4
                    else: inverted_dict[k]+=4
            elif j == df.columns[2]:              #ingredients
                for k in tokenisation(df.iloc[i][j]):
                    if k not in inverted_dict: inverted_dict[k]=2
                    else: inverted_dict[k]+=2
            elif j == df.columns[3]:              #directions
                for k in tokenisation(df.iloc[i][j]):
                    if k not in inverted_dict: inverted_dict[k]=1
                    else: inverted_dict[k]+=1
                
            else:
#our output is still missing some info...ratings etc are nowhere to be found!
#let's add them :3
                inverted_dict[j]=tokenisation(data[i][j])
                for l in range(4,8):
                    inverted_dict[df.columns[l]] = df.iloc[i][df.columns[l]]
                    #inverted_dict[df.columns[5]] = df.iloc[i][df.columns[5]]
                    #inverted_dict[df.columns[6]] = df.iloc[i][df.columns[6]]
                    #inverted_dict[df.columns[7]] = df.iloc[i][df.columns[7]]
                inverted_dict['index'] = i #will become useful later!
        inverted_index.append(inverted_dict)
    return inverted_index

invert_index(data)
#len(inverted_index)


[{'ingredients': 7,
  'directions': 2,
  'title': '"Adult" Pimiento Cheese ',
  'adult': 8,
  'pimiento': 8,
  'cheese': 13,
  'vegetable': 4,
  'cook': 4,
  'vegetarian': 4,
  'quick': 4,
  'easy': 4,
  'cheddar': 7,
  'hot': 4,
  'pepper': 5,
  'winter': 4,
  'gourmet': 4,
  'alabama': 4,
  'large': 3,
  'garlic': 4,
  'cloves': 2,
  'ounce': 2,
  'jar': 3,
  'diced': 2,
  'pimientos': 3,
  'cups': 2,
  'coarsely': 2,
  'grated': 2,
  'sharp': 2,
  'preferably': 2,
  'english': 2,
  'canadian': 2,
  'vermont': 2,
  'about': 2,
  'ounces': 2,
  'cup': 2,
  'mayonnaise': 3,
  'crackers': 2,
  'toasted': 2,
  'baguette': 2,
  'slices': 2,
  'crudités': 2,
  'force': 1,
  'through': 1,
  'press': 1,
  'into': 1,
  'bowl': 1,
  'and': 4,
  'stir': 2,
  'with': 3,
  'liquid': 1,
  'add': 1,
  'toss': 1,
  'mixture': 1,
  'combine': 1,
  'well': 1,
  'taste': 1,
  'season': 1,
  'freshly': 1,
  'ground': 1,
  'black': 1,
  'spread': 3,
  'may': 1,
  'made': 1,
  'day': 1,
  'ahead': 1,
  'c

In [5]:
#we can now reference the inverted index function in order to create the corresponding 'normal', 'simple' and 'healthy' functions!

def normal_model(query, recipe) :
    
    #query=tokenisation(query)
    
    normal_dict = {}

    for name in recipe :
        rating = 0
        rating += inverted_index[name][df.columns[4]] 
        for token in query :
            rating += inverted_index[name][token]
        normal_dict[name] = rating
    sorted_recipes = sorted(normal_dict,key=normal_dict.get,reverse=True)
    return sorted_recipes

def simple_model(recipe) :
    
    simple_dict = {}
    
    for name in recipe :
        if inverted_index[name][df.columns[2]] == 1 or inverted_index[name][df.columns[3]] == 1 :pass
        elif inverted_index[name][df.columns[2]] == 0 or inverted_index[name][df.columns[3]] == 0 : pass
        else : score = inverted_index[name][df.columns[2]] * inverted_index[name][df.columns[3]]
        simple_dict[name] = score
    sorted_recipes = sorted(simple_dict,key=simple_dict.get)
    return sorted_recipes

def healthy_model(recipe) :
    
    healthy_dict = {}
    
    for name in recipe :
        if inverted_index[name][df.columns[5]] == 0 or inverted_index[name][df.columns[6]] == 0 or inverted_index[name][df.columns[7]] == 0  :pass
        else :
            for n in range(2,6):
                initial_cost = (abs(inverted_index[name][df.columns[5]] - 510)/510) + (2 * abs(inverted_index[name][df.columns[6]] - 18)/18) + (4*abs(inverted_index[name][df.columns[7]] - 150)/150)
                new_cost = (abs(inverted_index[name][df.columns[5]] - 510*n)/510) + (2 * abs(inverted_index[name][df.columns[6]] - 18*n)/18) + (4*abs(inverted_index[name][df.columns[7]] - 150*n)/150)
                if new_cost < initial_cost:
                    initial_cost = new_cost
            healthy_dict[name] = initial_cost
    sorted_recipes = sorted(healthy_dict,key=healthy_dict.get)
    return sorted_recipes

In [6]:
def print_search(sorted_recipes, count) :
    sorted_list = list()
    for recipe in sorted_recipes :
        sorted_list.append(inverted_index[recipe][df.columns[0]])
    print("\n".join(sorted_list[:count]))

In [7]:
inverted_index = invert_index(data)
def preprocessing_and_indexing(query):
    recipe_index = list()
    recipe_index=(inverted_index[i]['index'] for i in range(df.shape[0]) if all(j in inverted_index[i] for j in tokenisation(query)))
    return recipe_index


In [8]:
#inverted_index = invert_index(data)
def search(query, ordering = 'normal', count = 10) :
    try:
        
        query=tokenisation(query)
        recipe_index = preprocessing_and_indexing(query)
        #recipe_index=(inverted_index[i]['index'] for i in range(df.shape[0]) if all(j in inverted_index[i] for j in preprocessing(query)))
            
            
        if ordering == 'normal' :print_search(normal_model(query, recipe_index), count)
        elif ordering == 'simple' :print_search(simple_model(recipe_index),count)
        elif ordering == 'healthy' :print_search(healthy_model(recipe_index),count)  
            
    except:pass

In [9]:
from timeit import default_timer as timer
from datetime import timedelta

search('bread', 'normal', 20)

start = timer()
end = timer()
print(timedelta(seconds=end-start))


Sticky Date and Almond Bread Pudding with Amaretto Zabaglione 
Dry Corn Bread for Bread Pudding 
Ham and Fresh Peach Chutney on Corn Bread 
Prune, Apple, and Chestnut Bread Pudding 
Apple and Maple Bread Pudding 
Sofrito Grilled Bread 
Banana Bread 
Pumpkin Bread Pudding with Spicy Caramel Apple Sauce 
Toasted Bread with Burrata and Arugula 
Rosemary and Cranberry Soda Bread 
Pumpkin Bread Puddings Brûlée 
Chocolate Bread Pudding with Walnuts and Chocolate Chips 
Rustic Bread Stuffing with Red Mustard Greens, Currants, and Pine Nuts 
Whole Wheat Bread with Raisins and Walnuts 
Apple Raisin Bread Pudding 
Banana Bread Pudding with Rum Sauce 
Blue Cheese Bread 
Bread Pudding with Dried Apricots, Dried Cherries and Caramel Sauce 
Bread Pudding with Warm Bourbon Sauce 
Corn Bread, Green Chili and Pine Nut Stuffing 
0:00:00.000016


In [10]:
#extra code from failed attempts...
#or just code used for visualisation purposes...

#--------------------------------------------------------------------------------------------------

#our output is still missing some info...ratings etc are nowhere to be found!
#let's add them :3

#rating_dict = {}
#for i in range(df.shape[0]):
    #rating = df.iloc[i][df.columns[4]]     #rating column
    #df.loc[:,df.columns[4]] = df[df.columns[4]].fillna(0)     #accounting for missing values
    #rating_dict[df.iloc[i][df.columns[0]]] = rating
#rating_dict

#df.loc[:,'calories'] = df['calories'].fillna(0)
#df.loc[:,'protein'] = df['protein'].fillna(0)
#df.loc[:,'fat'] = df['fat'].fillna(0)

#--------------------------------------------------------------------------------------------------
#def token_search(dict):
    
    #col_dict={df.columns[0]:8, df.columns[1]:4, df.columns[2]:2, df.columns[3]:1}
    
    #a=list()
    
    #for i,j in dict.items():
        #if i in df.columns[:4]:
            #searched_tokens=tokenisation(str(j))
            #a+=[searched_tokens]*col_dict[i]
    #searched=[token for sub in a for token in sub]
    
    #return searched

#--------------------------------------------------------------------------------------------------
  
#tokenisation (attempt #1) aka removing punctuation and numbers from our input; tokens to be made lowercase!

#def tokenisation(input):
    
    #1.Convert all punctuation and digits into spaces. For punctuation use the set in string.punctuation, for digits string.digits. You may find translate() interesting!
    #tokenised = input.translate(str.maketrans(string.digits, ' '*len(string.digits))) #removes digits
    #tokenised = tokenised.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))) #removes punctuation
    
    #2.split() to extract individual tokens; 4.Make tokens lowercase.
    #token = tokenised.lower().split(' ') #makes token lowercase
    
    #3.Ignore any token that is less than  3  characters long.
    #tokens = list()
    #for i in token:
        #if len(i) >= 3:
            #tokens.append(i)

    #return tokens

#tokenisation(recipes)
#AttributeError: 'list' object has no attribute 'translate'
#current tokenisation function does not work on lists...sensible approach would be to extend its functionability or alter input data into a dictionary 
#retry start approach by considering importing data as pandas dataframe for ease of use of indexing...
    
#--------------------------------------------------------------------------------------------------

#df = pd.DataFrame(recipes)

#if 'title' in df.columns:
    # print(1)
#for items in df.columns[1:4]:
    #delist=' '.join([str(word) for word in df.iloc[i][items]])
    #print(delist)
#df

#--------------------------------------------------------------------------------------------------

#tokenisation attempt #2...this time using created dataframe!

#def tokenisation(df):
    
    #tokens={}
    
    #for i in range(len(df)):
        #j = list()
        
        #1.Convert all punctuation and digits into spaces. For punctuation use the set in string.punctuation, for digits string.digits. You may find translate() interesting!
        #tokenised_title = df.iloc[i][df.columns[0]]
        #tokenised_title = tokenised_title.translate(str.maketrans(string.digits, ' '*len(string.digits))) #removes digits
        #tokenised_title = tokenised_title.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))) #removes punctuation
        
            
        #2.split() to extract individual tokens; 4.Make tokens lowercase.
        #tokenised_title = tokenised_title.lower().split(' ').replace('-',' ') #makes token lowercase
        
        #for k in numpy.unique(tokenised_title):
            #if len(k) > 2 and k not in j:
                #j.append(k)
        
        #for items in df.columns[1:4]:
            #delist=' '.join([str(word) for word in df.iloc[i][items]])
            
            #tokenised_items = delist
            #tokenised_items = tokenised_items.translate(str.maketrans(string.digits, ' '*len(string.digits))) #removes digits
            #tokenised_items = tokenised_items.translate(str.maketrans(string.punctuation, ' '*len(string.punctuation))) #removes punctuation
            #tokenised_items = tokenised_items.lower().split(' ') #makes token lowercase
            
            
            #for k in numpy.unique(tokenised_items):
                #if len(k) > 2 and k not in j:
                    #j.append(k)
        #for k in j:
            #if k not in tokens:
                #tokens[k] = [df.iloc[i][df.columns[0]]]
            #elif df.iloc[i][df.columns[0]] not in tokens[k]:
                #tokens[k].append(df.iloc[i][df.columns[0]])
    #return tokens

#--------------------------------------------------------------------------------------------------

#for i in range(df.shape[0]):
  #  print(data[i][df.columns[4]] )
    
#range(df.shape[0])

#index for i in len(df.index)

#print (df.index[:])

#for column in df.columns[:4]:
    #print(df.loc[:,'score'])
#for i in range(len(df)):
    #print(len(df.loc[i,df.columns[2]]))
    
#for i in range(df.shape[0]):
    #print(len(df.loc[i,df.columns[2]]))
    #df.loc[:,df.columns[i]] = df[df.columns[i]].fillna(0)
    #print(df.loc[i,df.columns[2]])
    #df.iloc[:,4:8] = df.iloc[:,4:8].fillna(int(0))
    #df.iloc[:,4:8]
#for column in df.columns[:4]:
    #print(df.loc[:,'score'])
    
#len(df.loc[0,df.columns[2]])
#len(df.iloc[1,2])