# Assignment group 3: Probabilistic modeling and prediction

#### Kiana Montazeri

## Module B _(39 pts)_ Exploring conditional probability and prediction
In this section we experiment again with recipe data, which can be obtained from Kaggle:

- https://www.kaggle.com/kaggle/recipe-ingredients-dataset

As usual, the data are also in the assignment's directory:

- `./data/train.json`

In [61]:
#Libraries in use:
from pprint import pprint
from collections import Counter
from collections import defaultdict
import json

__B1.__ _(3 pts)_ To start, load the json data and call the resulting object `recipes`.

In [62]:
with open('data/train.json', 'r') as f:
    recipes = json.load(f)
pprint(recipes[:2])
print(len(recipes))
# recipes is a list of dictionaries with cuisine names, ids, and ingredients; it has 39774 recipes

[{'cuisine': 'greek',
  'id': 10259,
  'ingredients': ['romaine lettuce',
                  'black olives',
                  'grape tomatoes',
                  'garlic',
                  'pepper',
                  'purple onion',
                  'seasoning',
                  'garbanzo beans',
                  'feta cheese crumbles']},
 {'cuisine': 'southern_us',
  'id': 25693,
  'ingredients': ['plain flour',
                  'ground pepper',
                  'salt',
                  'tomatoes',
                  'ground black pepper',
                  'thyme',
                  'eggs',
                  'green tomatoes',
                  'yellow corn meal',
                  'milk',
                  'vegetable oil']}]
39774


__B2.__ _(2 pts)_ Instead of building out these data as a network represented by an adjacency matrix, we'll still benefit from the latent network structure by storing it as an adjacency list. In particular, symmetrically store the number of shared recipes between pairs of ingredients:

- ```
    {
        Ingredient: {
            CoIngredient: NumSharedRecipes
        }
    }
```

To do this, use a default dictionary of counters:

- `defaultdict(lambda : Counter())`

for the co-ingredient data. 

Note: you are only responsible for initializing the data structure in this part.

In [63]:
# Keep only the ingredient lists of the recipes
CUISINE = [list(recipe['ingredients']) for recipe in recipes]
# Collect every distinct ingredient; a set avoids duplicates
ingredients = set()
for eachCUISINE in CUISINE:
    for eachelement in eachCUISINE:
        ingredients.add(eachelement)

In [64]:
CoIngredients = defaultdict(lambda : Counter())
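
As a small illustration of how this structure behaves, the sketch below uses made-up ingredient names and a hypothetical variable `toy_co` (neither comes from the dataset). A `defaultdict` of `Counter`s creates an empty counter the first time an ingredient is touched, and an unseen co-ingredient simply reads as 0:

```python
from collections import Counter, defaultdict

# Toy structure for illustration only; `toy_co` is a hypothetical name.
toy_co = defaultdict(Counter)  # behaves like defaultdict(lambda : Counter())

# Symmetric update for one recipe shared by 'basil' and 'tomato'.
toy_co['basil']['tomato'] += 1
toy_co['tomato']['basil'] += 1

print(toy_co['basil']['tomato'])     # count stored for the pair
print(toy_co['basil']['saffron'])    # unseen pair: a Counter reports 0
print('saffron' in toy_co['basil'])  # reading a missing Counter key does not insert it
```

This is why no explicit initialization per ingredient is needed before counting begins.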

__B3.__ _(5 pts)_ Now, count up the co-ingredient frequencies between all pairs of ingredients.

In [66]:
# For each ingredient, scan all recipes and tally the other ingredients in those that contain it
for each_ing in ingredients:
    counts = Counter()
    for recipe in CUISINE:
        if each_ing in recipe:
            for co_ing in recipe:
                if co_ing != each_ing:
                    counts[co_ing] += 1
    CoIngredients[each_ing] = counts
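
The loop above scans every recipe once per ingredient. A possible alternative, sketched here on toy recipes rather than `train.json`, visits each recipe only once and updates the counters for every ordered pair it contains. The names `toy_recipes` and `toy_co` are hypothetical, and the `set()` call is an added assumption that an ingredient listed twice in one recipe should count once:

```python
from collections import Counter, defaultdict
from itertools import permutations

# Toy recipes for illustration; these are not taken from train.json.
toy_recipes = [
    ['pesto', 'salt', 'olive oil'],
    ['pesto', 'salt'],
    ['salt', 'garlic'],
]

toy_co = defaultdict(Counter)
for recipe in toy_recipes:
    # set() guards against an ingredient listed twice in one recipe
    for a, b in permutations(set(recipe), 2):
        # ordered pairs, so both (a, b) and (b, a) are counted
        toy_co[a][b] += 1

print(toy_co['pesto'])
```

The work is then proportional to the number of ingredient pairs inside each recipe, not to the number of distinct ingredients times the number of recipes.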

In [67]:
pprint(CoIngredients['pesto'])

Counter({'salt': 47,
         'olive oil': 37,
         'grated parmesan cheese': 30,
         'pepper': 22,
         'garlic': 17,
         'ground black pepper': 16,
         'onions': 16,
         'zucchini': 14,
         'mozzarella cheese': 14,
         'water': 11,
         'tomatoes': 10,
         'extra-virgin olive oil': 10,
         'black pepper': 10,
         'red bell pepper': 9,
         'garlic cloves': 9,
         'fresh basil': 9,
         'parmesan cheese': 9,
         'butter': 8,
         'roasted red peppers': 8,
         'goat cheese': 7,
         'fresh lemon juice': 7,
         'fresh basil leaves': 7,
         'chees fresh mozzarella': 7,
         'dry white wine': 7,
         'ricotta cheese': 7,
         'plum tomatoes': 6,
         'part-skim mozzarella cheese': 6,
         'sun-dried tomatoes': 6,
         'minced garlic': 6,
         'carrots': 6,
         'green beans': 6,
         'eggs': 6,
         'kosher salt': 6,
         'lasagna noodles': 6,
     

__B4.__ _(2 pts)_ In the response box below answer the following questions:
- Why didn't we choose to construct an adjacency matrix from our co-ingredients data?
- Why was this _adjacency list_ a more efficient choice for computing the co-ingredient frequencies?

<font color=blue>An adjacency matrix needs one cell for every pair of ingredients, so with thousands of distinct ingredients it would hold millions of entries, almost all of them zero, because most pairs never share a recipe. The adjacency list stores only the pairs that actually co-occur, so it uses far less memory, and counting only touches ingredients that appear together in a recipe. It also keeps each ingredient's name as the key, so counts can be looked up directly by name without maintaining a separate name-to-index mapping.</font>
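
One way to see the difference between the two representations is to compare how many cells a dense matrix would need with how many pairs an adjacency list actually stores. The sketch below does this for three made-up recipes that share no ingredients; all names in it are hypothetical:

```python
from collections import Counter, defaultdict
from itertools import permutations

# Toy recipes for illustration; these are not taken from train.json.
toy_recipes = [
    ['feta', 'olive oil', 'oregano'],
    ['soy sauce', 'ginger', 'garlic'],
    ['flour', 'eggs', 'milk'],
]

toy_co = defaultdict(Counter)
for recipe in toy_recipes:
    for a, b in permutations(recipe, 2):
        toy_co[a][b] += 1

n = len(toy_co)                                       # distinct ingredients
matrix_cells = n * n                                  # cells in a dense adjacency matrix
stored_pairs = sum(len(c) for c in toy_co.values())   # entries in the adjacency list
print(n, matrix_cells, stored_pairs)
```

Even in this tiny case most matrix cells would be zero, and the gap widens as the number of distinct ingredients grows.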

__B5.__ _(5 pts)_ Write a function that finds the probability that a recipe contains a specific ingredient:

$$P(\text{a recipe contains ingredient } A)$$

This should be computed as:

$$\frac{\text{number of times ingredient }A\text{ is used in any recipe}}{\text{number of recipes in the dataset}}$$

When complete, exhibit that this function works by finding the probability that a recipe contains `"feta cheese"`.

In [68]:
def prob_recipes_one(ingredient, CUISINE):
    # Fraction of recipes whose ingredient list contains `ingredient`
    ing_used = sum(1 for rec in CUISINE if ingredient in rec)
    return ing_used/len(CUISINE)

In [69]:
fetaCheese_P = prob_recipes_one("feta cheese", CUISINE)

In [70]:
fetaCheese_P

0.0067380700960426405

__B6.__ _(5 pts)_ Now, write a function that finds the probability that a recipe contains two specific ingredients:

$$P(\text{a recipe contains ingredients } A \text{ and } B)$$

This should be computed as:

$$\frac{\text{number of times both ingredients were used in any recipe}}{\text{number of recipes in the dataset}}$$

Exhibit that this function works by finding the probability that a recipe contains both `"feta cheese"` and `"romaine lettuce"`.

In [71]:
def prob_recipes_multiple(ingredients, CUISINE):
    # Fraction of recipes that contain every ingredient in `ingredients`
    ing_used = sum(1 for rec in CUISINE if all(x in rec for x in ingredients))
    return ing_used/len(CUISINE)

In [72]:
P_FetaAndRomaine = prob_recipes_multiple(["feta cheese", "romaine lettuce"], CUISINE)

In [73]:
P_FetaAndRomaine

0.0003017046311660884

__B7.__ _(5 pts)_ Next, write a function that finds the probability that a recipe contains one ingredient, given we assume the presence of another. This is the conditional probability:

$$P(\text{a recipe contains ingredient } A\mid\text{ it is a recipe that we know contains ingredient } B)$$

which can be computed as a quotient from the definition of conditional probability (the same identity that underlies Bayes' rule):

$$
P(\text{a recipe contains ingredient } A\mid\text{it is a recipe that we know contains ingredient } B)=
\frac{P(\text{a recipe contains ingredients } A \text{ and } B)}{P(\text{a recipe contains ingredient } B)}
$$

i.e., using the output of our previous two functions. Exhibit that this function works by finding the probability that a recipe contains `"feta cheese"`, given we know it contains `"romaine lettuce"`.

In [74]:
def conditional_prob_recipe(ingredient, presentIng, CUISINE):
    # P(ingredient | presentIng) = P(ingredient and presentIng) / P(presentIng)
    p_denom = prob_recipes_one(presentIng, CUISINE)
    p_num = prob_recipes_multiple([ingredient, presentIng], CUISINE)
    return p_num/p_denom

In [75]:
conditional_prob_recipe("feta cheese", "romaine lettuce", CUISINE)

0.044444444444444446
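
This output can be cross-checked by hand. With the dataset size of 39774 printed in B1, the joint probability from B6 corresponds to 12 recipes containing both ingredients, and the conditional probability then implies 270 recipes containing romaine lettuce. Both counts are inferred from the printed values; the notebook does not print them. The dataset size cancels in the quotient:

$$
P(\text{feta cheese}\mid\text{romaine lettuce})
=\frac{12/39774}{270/39774}
=\frac{12}{270}
\approx 0.0444
$$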

__B8.__ _(7 pts)_ Finally, write a function that finds all conditional probabilities for a given conditioning ingredient. The co-ingredients and their likelihoods should be returned in a counter.

In [76]:
def co_ingredients_prob(presentIng, CUISINE):
    # Uses the global CoIngredients built in B3 to enumerate the co-ingredients
    p_denom = prob_recipes_one(presentIng, CUISINE)
    probability = Counter()
    for eachkey in CoIngredients[presentIng]:
        probability[eachkey] = prob_recipes_multiple([eachkey, presentIng], CUISINE)/p_denom
    return probability

In [77]:
co_ingredients_prob("romaine lettuce", CUISINE)

Counter({'black olives': 0.025925925925925925,
         'grape tomatoes': 0.05925925925925926,
         'garlic': 0.17777777777777778,
         'pepper': 0.15555555555555556,
         'purple onion': 0.2740740740740741,
         'seasoning': 0.011111111111111112,
         'garbanzo beans': 0.018518518518518517,
         'feta cheese crumbles': 0.07407407407407407,
         'sesame seeds': 0.011111111111111112,
         'gingerroot': 0.003703703703703704,
         'soy sauce': 0.05925925925925926,
         'sesame oil': 0.05925925925925926,
         'cooked white rice': 0.014814814814814815,
         'sugar': 0.14814814814814814,
         'mirin': 0.003703703703703704,
         'garlic cloves': 0.1962962962962963,
         'rib eye steaks': 0.003703703703703704,
         'black pepper': 0.09259259259259259,
         'hot bean paste': 0.003703703703703704,
         'onions': 0.1037037037037037,
         'tostada shells': 0.011111111111111112,
         'jalapeno chilies': 0.09259259259259

In [78]:
co_ingredients_prob("romaine lettuce", CUISINE)['feta cheese']

0.044444444444444446
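
Because the dataset size cancels in the quotient, the same conditional probabilities can in principle be read off co-occurrence counts directly, without rescanning every recipe for each co-ingredient. Below is a minimal sketch on toy recipes. The data and the names `toy_co`, `toy_count`, and `toy_conditionals` are hypothetical and not part of the assignment:

```python
from collections import Counter, defaultdict
from itertools import permutations

# Toy recipes for illustration; these are not taken from train.json.
toy_recipes = [
    ['romaine lettuce', 'feta cheese', 'purple onion'],
    ['romaine lettuce', 'purple onion'],
    ['romaine lettuce', 'garlic'],
    ['feta cheese', 'olive oil'],
]

toy_co = defaultdict(Counter)  # co-occurrence counts per ingredient
toy_count = Counter()          # number of recipes containing each ingredient
for recipe in toy_recipes:
    unique = set(recipe)       # count each ingredient once per recipe
    toy_count.update(unique)
    for a, b in permutations(unique, 2):
        toy_co[a][b] += 1

def toy_conditionals(present):
    """Return P(co-ingredient | present) for every co-ingredient of `present`."""
    n_present = toy_count[present]
    return Counter({co: k / n_present for co, k in toy_co[present].items()})

probs = toy_conditionals('romaine lettuce')
print(probs.most_common())
```

Each value is a shared-recipe count divided by the number of recipes containing the conditioning ingredient, which is the B7 quotient with the common denominator removed.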

__B9.__ _(3 pts)_ Using your function from __B8__, find the co-ingredient likelihoods for `"feta cheese"`. Print out the top 10 most likely co-ingredients by using the `.most_common(10)` method and interpret the ingredients. 

When this is complete, answer the following questions in the response box below.
- Are they related?
- Do you interpret these as fitting into a common cuisine of recipes? 

<font color=blue>Yes, they are related. Olive oil, purple onion, tomatoes, dried oregano, and garlic are typical of Greek and other Mediterranean dishes, such as salads that combine feta cheese with olive oil. Salt, pepper, and onions are generic and appear across most cuisines, so they say less about the cuisine. Overall, the top co-ingredients fit a common family of closely related recipes.</font>

In [79]:
co_ingredients_prob("feta cheese", CUISINE).most_common(10)

[('olive oil', 0.5186567164179104),
 ('salt', 0.42537313432835827),
 ('purple onion', 0.22388059701492538),
 ('tomatoes', 0.1902985074626866),
 ('garlic cloves', 0.1902985074626866),
 ('dried oregano', 0.1902985074626866),
 ('garlic', 0.1791044776119403),
 ('pepper', 0.17537313432835822),
 ('extra-virgin olive oil', 0.17164179104477612),
 ('onions', 0.16417910447761194)]

__B10.__ _(2 pts)_ Finally, discuss the numeric values output in your execution of __B9__ in the response box below. In particular, consider how these outputs are probabilities but don't add up to 1. Why is this the case?
\[Hint: think about the sum rule and the mutual exclusivity of events.\]

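<font color=blue>Each value is a separate conditional probability, P(A | feta cheese), for a different ingredient A. The sum rule only guarantees that probabilities add up to 1 over mutually exclusive and exhaustive events, and these events are not mutually exclusive: a single recipe with feta cheese can contain olive oil, salt, and tomatoes at the same time. What does add up to 1 is P(A | feta cheese) + P(not A | feta cheese) for any one ingredient A. Summed over all co-ingredients, the values instead give the average number of other ingredients in a recipe that contains feta cheese.</font>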
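
A toy check on made-up recipes shows why such conditional probabilities need not sum to 1. Because one recipe can contain several co-ingredients at once, the events overlap, and the total equals the average number of other ingredients per recipe that contains the conditioning ingredient. All names in this sketch are hypothetical:

```python
# Toy recipes for illustration; these are not taken from train.json.
toy_recipes = [
    ['feta cheese', 'olive oil', 'salt'],
    ['feta cheese', 'olive oil'],
    ['salt', 'garlic'],
]

# Recipes that contain the conditioning ingredient
with_feta = [set(r) for r in toy_recipes if 'feta cheese' in r]
others = set().union(*with_feta) - {'feta cheese'}

# P(ing | feta cheese) for each co-ingredient
cond = {ing: sum(ing in r for r in with_feta) / len(with_feta) for ing in others}

total = sum(cond.values())
avg_co = sum(len(r) - 1 for r in with_feta) / len(with_feta)
print(cond, total, avg_co)
```

For a single ingredient, by contrast, the probability of its presence and the probability of its absence given feta cheese are mutually exclusive and do sum to 1.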