# Search Engine

At a high level we are going to code a search engine for recipes - a data set has been provided. The search engine is to be pretty basic, returning all recipes that contain all of the provided keywords. However, the user can choose from a number of orderings depending on their food preferences, which you need to support.

## Specification

The system provides `print`s out the results of the search, subject to the following rules:
1. It selects from the set of all articles that contain all of the words in the query (the positions/order of the words in the recipe are to be ignored).
2. It orders them based on the provided ordering (a string, meaning defined below).
3. It `print`s the top 10 only, preserving the order (just their title, one per line).

As an aside, do not worry about memory usage. If duplicating some data can make your code faster/neater then feel free.

### Search

The search should check the following parts of the recipe (see data set description below):
* `title`
* `categories`
* `ingredients`
* `directions`

For instance, given the query "banana cheese" would return "Banana Layer Cake with Cream Cheese Frosting" in the results. Note that case is to be ignored ("banana" matches "Banana") and the words __do not__ have to be next to one another, or in the same order as the search query.

### Ordering

There are three ordering modes to select from, each indicated by passing a string to the `search` function:
* `normal` - Based simply on the number of times the search terms appear in the recipe. A score is calculated and the order is highest to lowest. The score sums the following terms:
    * $8 \times$ Number of times a query word appears in the title
    * $4 \times$ Number of times a query word appears in the categories
    * $2 \times$ Number of times a query word appears in the ingredients
    * $1 \times$ Number of times a query word appears in the directions
    * The `rating` of the recipe (if not available assume $0$).

* `simple` - Tries to minimise the complexity of the recipe, for someone who is in a rush. Orders to minimise the number of ingredients multiplied by the numbers of steps in the instructions.

* `healthy` - Order from lowest to highest by this cost function:
$$\frac{|\texttt{calories} - 510n|}{510} + 2\frac{|\texttt{protein} - 18n|}{18} + 4\frac{|\texttt{fat} - 150n|}{150}$$
Where $n \in \mathbb{N}^+$ is selected to minimise the cost ($n$ is a positive integer and $n=0$ is not allowed). This can be understood in terms of the numbers $510$, $18$ and $150$ being a third of the recommended daily intake (three meals per day) for an average person, and $n$ being the number of whole meals the person gets out of cooking/making the recipe. So this tries to select recipes that neatly divide into a set of meals that are the right amount to consume for a healthy, balanced diet.

To clarify the use of the string, to get something healthy that contains cheese you might call `search('cheese', 'healthy')`. In the case of a recipe that is missing information required for the ordering in the case of `normal` it should simply be treated as having no matches. But in the case of `simple` and `healthy` that recipe should be dropped entirely, and not returned from such searches.

### Data Set

A file, `recipes.json` is used, containing 20K recipes. It can be parsed into a Python data structure using the `json` module.
* `title` : Name of recipe.
* `categories` : A list of tags assigned to the recipe.
* `ingredients` : What is in it, as a list. Includes quantities.
* `directions` : List of steps to make the recipe.
* `rating` : A rating, out of 5, of how good it is.
* `calories` : How many calories it has.
* `protein` : How much protein is in it.
* `fat` : How much fat is in it.

Note that the data set was obtained via web scrapping and hence is noisy - every key in the dictionary provided for each recipe is missing at least once. Your code will need to cope with this. Ignore any recipe that has no title.

You will probably want to explore the data before starting, so you have an idea of what your code has to deal with.

Data set origin: https://www.kaggle.com/hugodarwood/epirecipes/version/2


In [1]:
import json
import re
import numpy as np
rawData = json.load(open('recipes.json', 'r'))

Process and clean data

In [2]:
#remove duplicates

data = []
for recipe in rawData:
    if recipe not in data: 
        data.append(recipe)


#loop through list and remove all entries with no title also assign an outlier of 100000 value for all missing values. 
#outlier values are set because missing values will cause errors in the calculations.
i = 0
for diction in data:
    if 'title' in diction:
        if diction['title'] == '':
            del data[i]
    else:
        del data[i]
    if 'fat' in diction:
        if diction['fat'] == ''or diction['fat']==None:
            #if no value then assign outlier
            diction['fat']=10000000.0
        elif (type(diction['fat']) != int) and (type(diction['fat']) != float):
            diction['fat']=10000000.0
    if 'calories' in diction:
        if diction['calories'] == '' or diction['calories']==None:
            diction['calories']=10000000.0
        elif (type(diction['calories']) != int) and (type(diction['calories']) != float):
            diction['calories']=10000000.0
    if 'protein' in diction:
        if diction['protein'] == '' or diction['protein']==None:
            diction['protein']=10000000.0
        elif (type(diction['protein']) != int) and (type(diction['protein']) != float):
            diction['protein']=10000000.0
    if 'sodium' in diction:  
        if diction['sodium'] == '' or diction['sodium']==None:
            diction['sodium']=10000000.0
        elif (type(diction['sodium']) != int) and (type(diction['sodium']) != float):
            diction['sodium']=10000000.0
    i+= 1


# create a list which identifies all unique words for directions, cateories, ingredients and title. Then make all data lower case and remove punctuation using regex
# while looping through calulate the number of ingrediants and len of directions and to the dictionary. will be used to calculate simple search
#use set to create unique set of words

dictCount = 0
uniqueDict = {}
uniqueList = []
for diction in data:
    direcWords = []
    categWords = []
    ingreWords = []
    directionLen = 0
    ingreLength = ''
    title = ''
    fat = 10000000.0
    calories = 10000000.0
    protein = 10000000.0
    rating = 10000000.0
    if 'directions' in diction:
        #make lower case
        diction['directions'] = [v.lower() for v in diction['directions']]
        #remove punctuation
        diction['directions'] = [re.sub(r'[^\w\s]','',v) for v in diction['directions']]
        directionLen = len(diction['directions'])
        for direcList in diction['directions']:
            direcWords.append(direcList.split())
        direcWords = [item for sublist in direcWords for item in sublist]
        #use set to create unique set of words
        direcWords = list(set(direcWords))
    else:
        directionLen = 10000000.0
    if 'categories' in diction:
        diction['categories'] = [v.lower() for v in diction['categories']]
        diction['categories'] = [re.sub(r'[^\w\s]','',v) for v in diction['categories']]
        for categList in diction['categories']:
            categWords.append(categList.split())
        categWords = [item for sublist in categWords for item in sublist]
        categWords = list(set(categWords))

    if 'ingredients' in diction:   
        diction['ingredients'] = [v.lower() for v in diction['ingredients']]
        diction['ingredients'] = [re.sub(r'[^\w\s]','',v) for v in diction['ingredients']]
        for ingreList in diction['ingredients']:
            ingreWords.append(ingreList.split())
        ingreWords = [item for sublist in ingreWords for item in sublist]
        ingreWords = list(set(ingreWords))
        ingreLength = len(ingreWords)
    else:
        ingreLength = 10000000.0
    if 'desc' in diction:
        if isinstance(diction['desc'],str):
            diction['desc'] = diction['desc'].lower()
            diction['desc'] = re.sub(r'[^\w\s]','',diction['desc'])
    if 'title' in diction:
        if isinstance(diction['title'],str):
            diction['title'] = diction['title'].lower()
            diction['title'] = re.sub(r'[^\w\s]','',diction['title'])
            title = diction['title']
    if 'fat' in diction:
         fat = diction['fat']
    if 'calories' in diction:
         calories = diction['calories']
    if 'protein' in diction:
         protein = diction['protein']
    if 'rating' in diction:
         rating = diction['rating']
    else:
        rating = 0



    #create a dictionary for containing unique words, numerical data, ingredient lenth and direction length for each recipe
    uniqueDict = {'title':title, 'directions':direcWords, 'categories':categWords,'ingredients':categWords, 'ingredientLen':ingreLength, 'fat':fat,'calories':calories,'protein':protein,'rating':rating,'directionLen':directionLen}
    uniqueList.append(uniqueDict)



In [3]:
uniqueList[:10]

[{'title': 'lentil apple and turkey wrap ',
  'directions': ['end',
   'place',
   'drain',
   'until',
   'stock',
   'heat',
   'thyme',
   'to',
   '3',
   'tortillas',
   'you',
   'boil',
   'then',
   'simmer',
   '1inch',
   '1',
   'spread',
   'clean',
   'water',
   'turkey',
   'several',
   'top',
   'wrap',
   'of',
   'apple',
   '2',
   'are',
   'work',
   'tender',
   'before',
   'border',
   '30',
   'away',
   'depending',
   'pepper',
   'dry',
   'assemble',
   'lentil',
   'about',
   'center',
   'if',
   'right',
   'the',
   'bring',
   'up',
   'with',
   'bowl',
   'lettuce',
   'using',
   'saucepan',
   'nearest',
   'from',
   'mixture',
   'lavash',
   'left',
   'they',
   'low',
   'some',
   'out',
   'rolling',
   'surface',
   'season',
   'discard',
   'a',
   'in',
   'salt',
   'as',
   'fold',
   'lemon',
   'tomato',
   'carrot',
   'celery',
   'add',
   'sheet',
   'reduce',
   'cool',
   'roll',
   'lentils',
   'minutes',
   'slices',
   'm

In [4]:
#function to convert search string into list of lowercase words and remove punctuation

def ProcessSearchQuer(searchStr):
    searchStrList = []
    searchStrList = searchStr.split()
    searchStrList = [v.lower() for v in searchStrList]
    searchStrList = [re.sub(r'[^\w\s]','',v) for v in searchStrList]
    return searchStrList


In [5]:
#function to search for all recipes which contain the search query
#also calculates the normal score of the matching recipes
def searchResults(searchStrList):
    
    searchResultsList = []
    #loop through recipes
    for dictionary in uniqueList:
        normalScore = 0
        for word in searchStrList:
            #loop through words in query
            #setting score variables for each word
            matchword = 0
            normalWord = 0
            titleScore = 0
            categoriesScore = 0
            directionScore = 0
            ingrediantsScore = 0
            #if (word in dictionary['title']) or (word in dictionary['categories']) or (word in dictionary['directions']) or (word in dictionary['ingredients']):
            #check if word is in any of the search columns 
            if (word in dictionary['title']):
                matchword = 1
            else:
                for direcList in dictionary['directions']:
                    if word in direcList:
                        matchword = 1
                for ingreList in dictionary['ingredients']:
                    if word in ingreList:
                        matchword = 1
                for categList in dictionary['categories']:
                    if word in categList:
                        matchword = 1
            #if word matched calculate score assigned to the word
            if matchword == 1:
                if (word in dictionary['title']):
                    #calculate title score based on counting words in title key
                    titleScore = dictionary['title'].count(word)*8
                for categList in dictionary['categories']:
                    categoriesScore += categList.count(word)*4
                    #calculate categories score by looping through categories list and counting amount of times words match
                for ingreList in dictionary['ingredients']:
                    ingrediantsScore += ingreList.count(word)*2
                for direcList in dictionary['directions']:
                    directionScore += direcList.count(word)
                #calculate the normal score associated with the word
                normalWord = titleScore + categoriesScore + directionScore + ingrediantsScore
                #add scores of all words in the query
                normalScore = normalScore + normalWord
            else:
                #if word doesnt match break the loop as query not fulfilled
                matchword = 0
                break
            
        #dictionary['simpleScore'] = ingrediantsScore
            
        if matchword==1:
            #if query matches then record normalscore and append recipe to return list
            dictionary['normalScore'] = normalScore + dictionary['rating']
            searchResultsList.append(dictionary)
        
            
    return searchResultsList


In [6]:
#simple search function. load all results from searchResults and sort by simple score

def simplesearch (resultsDict):
    #create np array of title, ingredientlength and directionlength
    title = np.array([results['title'] for results in resultsDict])
    ingredientLen = np.array([results['ingredientLen'] for results in resultsDict])
    directionLen = np.array([results['directionLen'] for results in resultsDict])
    #calculate simple score
    simpleScore = ingredientLen * directionLen
    #sort scores based on simplescore ascending
    simpleSortedScore = title[simpleScore.argsort()]
    #return top if more than 10 results
    if len(simpleSortedScore)>10:
        return simpleSortedScore[0:10]
    else:
        return simpleSortedScore

#normal search function. load all results from searchResults and sort by normal score
def normalscore(resultsDict):
    #create np array of title and normal score
    title = np.array([results['title'] for results in resultsDict])
    normalScore = np.array([results['normalScore'] for results in resultsDict])
    #sort scores based on normal score descending
    normalSortedScore = title[normalScore.argsort()[::-1]]
    #return top if more than 10 results
    if len(normalSortedScore)<10:
        return normalSortedScore
    else:
        return normalSortedScore[0:10]
    


In [7]:
#health array returns the integer n values for protein, calories and fats which make the cost function 0
def healthArray(resultsDict):
    calories = np.array([result['calories'] for result in resultsDict])
    #divide array by corresponding scaler for calories = 510. This produces the associated minumum n for calories
    ncal = (calories/510).astype(int)
    protein = np.array([result['protein'] for result in resultsDict])
    nprot = (protein/18).astype(int)
    fat = np.array([result['fat'] for result in resultsDict])
    nfat = (fat/150).astype(int)
    
    #return the n values for each recipe
    return ncal, nprot, nfat
 

#cost function calculates the best n to minimise the cost function
def Cost(nList, cal, prot, fat):
    #set an arbitory mincost
    mincost =100000
    for n in nList:
        #calculat cost function for all 6 ns
        cost = abs(cal-510*n)/510 + (2*abs(prot-18*n))/18 + (4*abs(fat-150*n))/150
    if cost < mincost:
        mincost = cost
    #return mincost from all ns    
    return mincost

#mimimumN returns sorted recipes based on health optimisation. pass in all n values calculated in healthyArray
def minimumN(nCal, nProt, nFat, resultsDict):
    costArr = np.array([])
    #need to calulate ceiling of the n values because astype provides a floor
    nCalCeil = nCal + 1
    nProtCeil = nProt + 1
    nFatCeil = nFat + 1
    #create arrays
    title = np.array([result['title'] for result in resultsDict])
    calories = np.array([result['calories'] for result in resultsDict])
    protein = np.array([result['protein'] for result in resultsDict])
    fat = np.array([result['fat'] for result in resultsDict])
    #use cost funcion to calculate the minimum cost
    #loop through all matching recipes
    for i in range(0,len(resultsDict)):
        #n values to test
        nList = [nCal[i],nProt[i],nFat[i],nCalCeil[i],nProtCeil[i],nFatCeil[i]]
        cali = calories[i]
        proti = protein[i]
        fati = fat[i]
        #create array of miminum cost values for each recipe
        costArr = np.append(costArr,Cost(nList,cali,proti,fati))
    
    #sort by cost values
    if len(title[costArr.argsort()])<10:
        return title[costArr.argsort()]
    else:
        return title[costArr.argsort()][0:10]
        


In [8]:
#search function which returns the top 10 recipes for each search
def search(query, ordering = 'normal'):
    preocessedQuery = ProcessSearchQuer(query)
    resultsDict = searchResults(preocessedQuery)
    if ordering == 'simple':
        return simplesearch (resultsDict)
    elif ordering == 'normal':
        return  normalscore(resultsDict)
    elif ordering == 'healthy':
        ncal, nprot, nfat = healthArray(resultsDict)
        return minimumN(ncal, nprot, nfat, resultsDict)

Run some searches and see the results

In [9]:
print(search('chocolate, peanut', 'healthy'))

['cream puffs with vanilla ice cream and chocolate sauce '
 'hazelnut chocolate and strawberry torte '
 'chocolate hazelnut cake with praline chocolate crunch '
 'chocolate mousse pie ' 'lemonpistachio crunch cake '
 'white chocolate cheesecake with cinnamon and lemon '
 'fudgy peanut brownies ' 'chocolate sauce '
 'hot chocolate baked french toast ' 'crispy peanut butter snack cake ']


In [10]:
print(search('tomatoes, onions', 'normal'))

['bulgur and blackeyed pea salad with tomatoes onions and pomegranate dressing '
 'grilled green bean salad with red onions and tomatoes '
 'fish in foil with sweet onions tomatoes and mojo verde '
 'sea bass with tomatoes and onions '
 'orzo with tomatoes feta and green onions '
 'sauted scallops with cherry tomatoes green onions and parsley '
 'pork chops with golden onions and wilted tomatoes '
 'succotash of fresh corn lima beans tomatoes and onions '
 'pierogies with tomatoes browned onions and dill '
 'fall salad of corn cherry tomatoes and ovenroasted green onions ']


In [11]:
print(search('bean, potatoes, cheese', 'simple'))

['surprise salad ' 'vegetable enchiladas ' 'sausage stew '
 'tuscan vegetable soup with white beans and parmesan '
 'trenette with pesto potatoes and green beans ' 'soupe au pistou '
 'vegetable soup with basil and garlic '
 'quick sweet potato mushroom and black bean burrito '
 'farmers market salad with spiced goat cheese rounds '
 'provençal vegetable soup soupe au pistou ']
