# Search Engine

Below is a mini search engine that searches a JSON file of recipes. It was originally part of a class project. The work was a real challenge, but getting everything organized was quite satisfying in the end.

## Specification

The system provides a function ``search``, with the following specification:
```
def search(query, ordering = 'normal'):
  ...
```

It prints out the results of the search, subject to the following rules:
1. It selects from the set of all recipes that contain all of the words in the query (the positions/order of the words in the recipe are to be ignored).
2. It orders them based on the provided ordering (a string, meaning defined below).
3. It prints the top 10 only, preserving the order (just their title, one per line).


### Search

The search checks the following parts of the recipe (see data set description below):
* `title`
* `categories`
* `ingredients`
* `directions`

For instance, given the query "banana cheese" you can expect "Banana Layer Cake with Cream Cheese Frosting" in the results. Note that searches are not case-sensitive ("banana" matches "Banana") and the words __do not__ have to be next to one another, or in the same order as the search query.
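As a minimal sketch of the matching rule, assuming it is read as "every query word appears somewhere in the four searched fields taken together": the helper name `matches` and the constant `FIELDS` are hypothetical, and the example recipe is made up apart from the title quoted above.

```python
# Hypothetical helper illustrating the matching rule; not the notebook's own code.
FIELDS = ("title", "categories", "ingredients", "directions")

def matches(recipe, query):
    """Return True if every query word occurs somewhere in the searched fields."""
    parts = []
    for field in FIELDS:
        value = recipe.get(field)
        if value is None:
            continue                      # missing fields simply contribute no text
        if isinstance(value, list):
            parts.extend(str(item) for item in value)
        else:
            parts.append(str(value))
    text = " ".join(parts).lower()
    # Substring matching, so "banana" also matches "bananas".
    return all(word.lower() in text for word in query.split())

recipe = {"title": "Banana Layer Cake with Cream Cheese Frosting"}
print(matches(recipe, "cheese BANANA"))   # True
print(matches(recipe, "banana walnut"))   # False
```

Lower-casing both sides gives the case-insensitive behaviour, and testing each word independently means word order and adjacency play no role.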

### Ordering

There are three ordering modes to select from, each indicated by passing a string to the `search` function:
* `normal` - Based simply on the number of times the search terms appear in the recipe. A score is calculated and the order is highest to lowest. The score sums the following terms:
    * $8 \times$ Number of times a query word appears in the title
    * $4 \times$ Number of times a query word appears in the categories
    * $2 \times$ Number of times a query word appears in the ingredients
    * $1 \times$ Number of times a query word appears in the directions
    * The `rating` of the recipe (if not available assume $0$).

* `simple` - Tries to minimise the complexity of the recipe, for someone who is in a rush. Orders from lowest to highest by the number of ingredients multiplied by the number of steps in the directions.

* `healthy` - Order from lowest to highest by this cost function:
$$\frac{|\texttt{calories} - 510n|}{510} + 2\frac{|\texttt{protein} - 18n|}{18} + 4\frac{|\texttt{fat} - 150n|}{150}$$
Where $n \in \mathbb{N}^+$ is selected to minimise the cost ($n$ is a positive integer and $n=0$ is not allowed). This can be understood in terms of the numbers $510$, $18$ and $150$ being a third of the recommended daily intake (three meals per day) for an average person, and $n$ being the number of whole meals the person gets out of cooking/making the recipe. So this tries to select recipes that neatly divide into a set of meals that are the right amount to consume for a healthy, balanced diet.
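The `healthy` cost can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name `healthy_cost` is hypothetical, and the search over $n$ relies on the cost being convex and piecewise linear in $n$, so that only the floor and ceiling of each ratio (clamped to at least 1) need to be tried.

```python
def healthy_cost(calories, protein, fat):
    """Return (cost, n) for the healthy ordering, minimised over positive integers n."""
    # (value, per-meal target, weight) for each term of the formula.
    targets = ((calories, 510, 1), (protein, 18, 2), (fat, 150, 4))

    def cost(n):
        return sum(w * abs(v - t * n) / t for v, t, w in targets)

    # The kinks of the cost are at value / target, so the best integer n
    # is the floor or ceiling of one of those ratios, and never below 1.
    candidates = set()
    for v, t, _ in targets:
        k = int(v // t)
        candidates.update((max(1, k), max(1, k + 1)))

    best_n = min(candidates, key=cost)
    return cost(best_n), best_n

# A recipe that is exactly two balanced meals has zero cost at n = 2.
print(healthy_cost(1020, 36, 300))  # (0.0, 2)
```

A recipe smaller than a single meal still gets $n = 1$, since $n = 0$ is not allowed.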

To clarify the use of the string: to get something healthy that contains cheese you might call `search('cheese', 'healthy')`. If a recipe is missing information required for the ordering, the handling depends on the mode. Under `normal` the missing part should simply be treated as having no matches. Under `simple` and `healthy` the recipe should be dropped entirely, and not returned from such searches.
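The two policies for missing data can be illustrated with a short sketch. The names `normal_score`, `simple_cost` and `WEIGHTS` are hypothetical, the sample recipe is made up, and treating an empty list the same as a missing key is an assumption rather than something the specification states.

```python
import re

# Weights from the `normal` ordering above.
WEIGHTS = {"title": 8, "categories": 4, "ingredients": 2, "directions": 1}

def normal_score(recipe, query):
    """Weighted count of query-word occurrences plus the rating."""
    score = 0
    for field, weight in WEIGHTS.items():
        # A missing field becomes an empty string, i.e. no matches.
        text = str(recipe.get(field) or "")
        for word in query.split():
            hits = re.findall(re.escape(word), text, re.IGNORECASE)
            score += weight * len(hits)
    # A missing (or null) rating counts as 0.
    return score + (recipe.get("rating") or 0)

def simple_cost(recipe):
    """Ingredients times steps, or None when the recipe should be dropped."""
    ingredients = recipe.get("ingredients")
    directions = recipe.get("directions")
    if not ingredients or not directions:
        return None
    return len(ingredients) * len(directions)

recipe = {"title": "Cheese Toast", "ingredients": ["bread", "cheese"], "rating": 4.5}
print(normal_score(recipe, "cheese"))  # 8 + 2 + 4.5 = 14.5
print(simple_cost(recipe))             # None: no directions, so the recipe is dropped
```

Returning `None` from `simple_cost` lets the caller filter the recipe out before sorting, whereas `normal_score` always produces a number.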

### Data Set

A file, `recipes.json`, is provided, containing 20K recipes. It is a list, where each recipe is a dictionary containing many keys. The key-value pairs are:


* `title` : Name of recipe.
* `categories` : A list of tags assigned to the recipe.
* `ingredients` : What is in it, as a list. Includes quantities.
* `directions` : List of steps to make the recipe.
* `rating` : A rating, out of 5, of how good it is.
* `calories` : How many calories it has.
* `protein` : How much protein is in it.
* `fat` : How much fat is in it.

Note that the data set was obtained via web scraping and hence is noisy - every key in the dictionary provided for each recipe is missing at least once.
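One way to see how noisy the data is would be to count, per key, how many recipes lack it. The sketch below uses a tiny made-up sample instead of `recipes.json`; the helper `count_missing` is hypothetical, and counting a null value as missing is an assumption.

```python
from collections import Counter

KEYS = ("title", "categories", "ingredients", "directions",
        "rating", "calories", "protein", "fat")

def count_missing(recipes):
    """Count, per key, the recipes where the key is absent or null."""
    missing = Counter()
    for recipe in recipes:
        for key in KEYS:
            if recipe.get(key) is None:
                missing[key] += 1
    return missing

# Made-up sample standing in for the list loaded from the JSON file.
sample = [
    {"title": "Toast", "rating": 4.0},
    {"title": "Soup", "calories": None, "fat": 3.0},
    {},
]
print(count_missing(sample)["title"])     # 1
print(count_missing(sample)["calories"])  # 3
```

Running the same function on the loaded list of recipes would show which keys need defensive handling in the search code.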

Data set origin: https://www.kaggle.com/hugodarwood/epirecipes/version/2


### The import block

In [None]:
import json
import re         
import pandas as pd

### The rest 

In [None]:
class Wrapper_Class(): 
    """A class that holds the parsed JSON data (a list of recipe dicts) and provides accessors."""
    def __init__(self, json_file):         
        self.json_file = json_file
        
        
    def get_valid_indices(self, string):       
        """Returns indices that match the search string. Look at specifications for the search criteria if confused."""
        tokens = string.split()
        list_of_indices = []
                
        for i in range(len(self.json_file)): 
            recipe = self.json_file[i]

            A = self.__search_dict(tokens, recipe, 'title')

            B = any([ 
            self.__search_dict(tokens, recipe, 'categories'),
            self.__search_dict(tokens, recipe, 'ingredients'),
            self.__search_dict(tokens, recipe, 'directions'),
            ])

            # a recipe is valid if at least one of the four fields contains ALL of the tokens. 
            if A or B: 
                list_of_indices.append(i)   


        return(list_of_indices)
   
        

    def __search_dict(self, tokens, json_dict, key_name): 
        """searches a given recipes' key value to see if it contains ALL of the tokens. 
        Only called from get_valid_indices()"""
        
        # This code is a bit hard to follow, admittedly.  
        try: 
            for token in tokens:
                
                #converting to string as some values in the dict are actually lists. Very tricky. 
                key_value = str(json_dict[key_name])      
                
                # match is truthy iff the token is anywhere in the key value. 
                # re.escape makes tokens containing regex metacharacters match literally.
                match = re.search(re.escape(token), key_value, re.IGNORECASE) 
                
                if not match:                  
                    return(False)
            
  
        except KeyError:     # if the key doesn't exist, return false.                               
            return(False)    
        
        else: 
            return(True)     # return true iff All the tokens match at least once and all the keys exists. 
    
        
    def get_nutrition_information(self, indices):                     
        """Gets [calories, protein, fat] for each index. The entry is None if any of the three is missing."""
        
        
        return_list = []
        for indice in indices:
            recipe = self.json_file[indice]
            calories = 0
            protein = 0
            fat = 0

            try:     
                calories = recipe['calories']
                protein = recipe['protein']
                fat = recipe['fat']
            except KeyError:
                # If any key is missing, mark the recipe by setting calories to None.
                # Calories is arbitrary, it could be any of the three.
                calories = None    
                
                
            # If a key is missing or a value is null, append None.
            if calories is None or protein is None or fat is None:
                return_list.append(None)
            else: 
                # The order must match __calc_health_utility(calories, protein, fat).
                return_list.append([calories, protein, fat])
                
        return(return_list)
    
    def get_recipes(self, indices):
        """gets all recipes for a list of indices. Returns a list."""

        new_list = []
        for index in indices: 
            new_list.append(self.json_file[index])
            
        return(new_list)
    
    def get_frequencys(self, indices, query):
        '''Gets frequencies for how many times the search terms appear in each of the given recipes.'''
        
        
        # This method is somewhat hard to understand. I apologize. 
        
        tokens = query.split()
        
        dict_keys = ['title', 'categories', 'ingredients', 'directions']   # The four categories to search.  
        
        # List of lists containing counts for each category as well as the rating.
        master_frequency_list = []                                        
        
        for i in indices:
            count_list = []
            json_dict = self.json_file[i]
            
            for key_name in dict_keys: 
                try: 
                    results = 0               # results is the number of occurrences for a given dict value. 
                    for token in tokens:
                        
                        # converting the dict value to a string as some values are actually lists.
                        key_value = str(json_dict[key_name])         
                        # re.escape makes tokens containing regex metacharacters match literally.
                        match = re.findall(re.escape(token), key_value, re.IGNORECASE)
                        
                        # len(match) is equal to the number of occurrences. 
                        results += len(match)   
                    count_list.append(results)
                except KeyError:                                       
                    results = 0      # if the key doesn't exist, number of occurrences = zero. 
                    count_list.append(results)
                



            # adding the rating value to the end of the frequency list      
            try:
                if json_dict['rating'] is None:
                    count_list.append(0)
                else: 
                    count_list.append(json_dict['rating'])  
                    

            except KeyError: 
                count_list.append(0)       # if the key doesn't exist, count it as zero, just like with other dicts
            
            master_frequency_list.append(count_list)

        return(master_frequency_list)
    
    
    def get_number_of_steps_and_ingredients(self, indices):
        """Returns a list of [number of steps, number of ingredients] pairs, one per index.
        A missing key counts as 0 so that the caller can drop the recipe."""
        dict_keys = ['directions', 'ingredients'] 
        return_list = []
        for i in indices:
            pair_list = []          # number of steps (first element) and number of ingredients (second).
            current_recipe = self.json_file[i]
            for key in dict_keys: 
                # .get() avoids a KeyError when the key is missing; a null value also counts as empty.
                pair_list.append(len(current_recipe.get(key) or []))
            
            return_list.append(pair_list)
            
        return(return_list)
                
            
        
            
            
    def get_recipe_title(self, i):
        """Returns the recipe title, or a placeholder if the recipe has no title. 
        get_valid_indices() can match on the other fields alone, so a title is not guaranteed."""
        
        recipe = self.json_file[i]
        return(recipe.get('title', '<untitled>'))              
    
    
    


In [None]:
### Convert this to an object???

class Search_Object: 
    """A class that holds a Wrapper_Class object and implements the necessary search algorithms."""
    
    def __init__(self, json_file_name): 
        with open(json_file_name, 'r') as json_data:  # Path to the recipes file, e.g. 'recipes.json'. 
            recipes = json.load(json_data)
        
        self.json_object = Wrapper_Class(recipes)
        
        
    def search(self, query, ordering = 'normal'):
        """The main search function. ordering is one of 'normal', 'simple' or 'healthy'."""
        print('searching for: %s ...' % query)
        print('\n')

        # list of recipe indices that meet the search criteria. 
        list_of_indices = self.json_object.get_valid_indices(query)    

        if ordering == 'normal':
            recipe_freqs = self.json_object.get_frequencys(list_of_indices, query)   
            recipe_utility = []   # list of utility for each recipe. 

            for recipe_freq in recipe_freqs: 
                recipe_utility.append(self.__calc_normal_utility(recipe_freq[0], recipe_freq[1], recipe_freq[2], recipe_freq[3], 
                                                     recipe_freq[4]))

            self.__print_best_ten(list_of_indices, recipe_utility)

        elif ordering == 'healthy':
            list_of_nutrition_information = self.json_object.get_nutrition_information(list_of_indices)

            # list of utility for each recipe. 
            health_utility_list = [] 
            for i in range(len(list_of_nutrition_information)):
                if list_of_nutrition_information[i] is not None:       # Ignoring None values, which indicate missing info.     
                    health_utility_list.append(self.__calc_health_utility(list_of_nutrition_information[i][0], 
                                                              list_of_nutrition_information[i][1], 
                                                              list_of_nutrition_information[i][2] ))
                else: 
                    list_of_indices[i] = -1 

            # getting rid of all recipe indices that have missing nutrition information, marked by a -1. 
            list_of_indices = list(filter(lambda x: x != -1, list_of_indices))      
            self.__print_best_ten(list_of_indices, health_utility_list)



        elif ordering == 'simple':
            recipe_freqs = self.json_object.get_number_of_steps_and_ingredients(list_of_indices) 
            simple_utility_list = [] # list of utility


            for i in range(len(recipe_freqs)):
                if recipe_freqs[i][0] == 0 or  recipe_freqs[i][1] == 0: 
                    list_of_indices[i] = -1

                else:  
                    simple_utility_list.append(self.__calc_simple_utility(recipe_freqs[i][0], recipe_freqs[i][1]))

            # getting rid of all recipe indices that have no ingredients or no directions, marked by a -1. 
            list_of_indices = list(filter(lambda x: x != -1, list_of_indices))                        
            self.__print_best_ten(list_of_indices, simple_utility_list)



        else: 
            print('invalid ordering.') # In case ordering is invalid. 


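    def __calc_health_utility(self, calories = 0, protein = 0, fat = 0):
        """Computes the health cost of a recipe, minimised over the number of meals n."""
        # n is a positive integer. Represents the number of whole meals the person gets out of cooking/making the recipe.
        # The cost is convex and piecewise linear in n, so the best n is the floor or ceiling
        # of one of the ratios value/target, and never less than 1.
        candidates = set()
        for value, target in [(calories, 510), (protein, 18), (fat, 150)]:
            whole_meals = int(value // target)
            candidates.add(max(1, whole_meals))
            candidates.add(max(1, whole_meals + 1))

        def cost(n):
            return (abs((calories - (510*n))/510)
                    + (2 * abs((protein - (18*n))/18))
                    + (4 * abs((fat - (150*n))/150)))

        health_value = min(cost(n) for n in candidates)

        return(health_value)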

    def __calc_normal_utility(self, title_freq, categories_freq, ingredients_freq , directions_freq, rating = 0):
        """Computes utility of a recipe by various frequencies"""

        normal_utility = 0     
        normal_utility +=  8 * title_freq
        normal_utility +=  4 * categories_freq
        normal_utility +=  2 * ingredients_freq
        normal_utility +=  1 * directions_freq
        normal_utility += rating 
        
        
        # We negate normal_utility because print_best_ten() sorts in ascending order and normal is the only ordering
        # where a bigger score should be higher in the list. Negating (rather than taking the reciprocal) also
        # avoids a ZeroDivisionError when the score is 0. Kind of a kludge, but it works. 
        normal_utility = -normal_utility

        return(normal_utility)


    def __calc_simple_utility(self, number_of_ingredients, number_of_steps):    
        """Computes utility of a recipe by number of ingredients and steps."""
        simplicity = number_of_ingredients * number_of_steps
        
        return(simplicity) 


    def __print_best_ten(self, list_of_indices, list_of_utility):
        """prints the best ten recipes by utility along with the index. """

        # assert statement to make sure these lists correspond to one another. 
        assert len(list_of_indices) == len(list_of_utility)

        frame = pd.DataFrame.from_dict({'Json_Indices': list_of_indices, 'Utility': list_of_utility} )

        # Sort based on utility
        frame.sort_values('Utility', inplace= True, ascending= True)

        # Slicing with iloc already copes with frames that have fewer than ten rows.
        name_indices = frame.iloc[0:10, 0]

        if name_indices.empty: 
            print('Nothing matched your search terms. Sorry.')

        else: 
            for i in name_indices: 
                print(self.json_object.get_recipe_title(i), "---", i)         # printing the top ten. 
                
                
    def get_recipe_by_indice(self, indice): 
        """Helper function that returns the dict for a given recipe. Used for further lookup."""
        return(self.json_object.json_file[indice])
        
        
        
        

### Final driver code here

In [None]:
# I recommend using this cell for testing search()

searcher = Search_Object('recipes.json')

searcher.search("onions", ordering = "normal")

# Let's look at the fourth one down:  
searcher.get_recipe_by_indice(7415)['ingredients']