## Data Collection - Web Crawling

To start our project, we need to collect three types of data:
* First, we need a dataset of users' food preferences. Because food rating data is difficult to come by, we can instead use point-of-purchase grocery store data and rely on implicit feedback (i.e., assume that customers who bought an item liked it).

* Second, we need a repository of diverse recipes. Although a number of recipe datasets are available online, none that I found has all the required attributes. Because no suitable dataset is available, this data will need to be collected from a website via web scraping.

* Third, we need a database of grocery prices for pricing our recipes. Because datasets with prices are few and far between, and quickly become outdated, we will need to manually collect grocery pricing data from a store website.
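
The implicit-feedback assumption in the first bullet can be sketched as follows. This is a minimal illustration on made-up purchase records, not the project's pipeline; the column names `user_id` and `product` and the toy rows are assumptions.

```python
import pandas as pd

# Toy point-of-purchase records (hypothetical data for illustration only).
purchases = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2, 3],
    "product": ["milk", "eggs", "milk", "bread", "milk", "eggs"],
})

# Implicit feedback: any purchase counts as a "like" (1); no purchase is 0.
# crosstab counts purchases per (user, product); clip caps repeat buys at 1.
implicit_matrix = pd.crosstab(purchases["user_id"], purchases["product"]).clip(upper=1)

print(implicit_matrix)
```

Repeat purchases are collapsed to a single 1 here; keeping the raw counts instead would be a reasonable alternative if purchase frequency should carry weight.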

In [None]:
!pip install python-crfsuite
!pip install selenium

In [None]:
!pip install python-utils
!pip install data

In [None]:
!pip install Utils

In [None]:
!pip install scrapy

In [1]:
%load_ext autoreload
%autoreload 2

In [4]:
import argparse
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import pandas as pd
import pickle
import pycrfsuite
import random
import re
import sys
import warnings
from os.path import dirname, realpath, sep, pardir
nltk.download('averaged_perceptron_tagger')
warnings.filterwarnings('ignore')

from selenium import webdriver
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)
# sys.path.append(dirname(pardir + sep + "src"))  # the part that ruined my mood

from src.d00_utils import utils
from src.d01_data import clean_data
from src.d01_data.web_scraping import sr_scraping, marianos_insta_scraping  # first: recipe website, second: grocery store
from src.d02_features.feature_creation import nyt_ingredients_crf_feature_creation
from src.d02_features.feature_creation import instacart_prod_crf_feature_creation
from src.d03_models.crf_model_recipes import crf_model_recipe_tagger
from src.d03_models.crf_model_baskets import crf_basket_feature_creation, crf_basket_dataset_creation
from src.d03_models.app_functions import *

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [5]:
sr_scraping()  # for web scraping

<selenium.webdriver.chrome.webdriver.WebDriver (session="cf30983741e0ca99e233721214a79932")>

None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/1/ fetching nxt pge
[]
None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/2/ fetching nxt pge
[]
None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/3/ fetching nxt pge
[]
None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/4/ fetching nxt pge
[]
None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/5/ fetching nxt pge
[]
None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/6/ fetching nxt pge
[]
None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/7/ fetching nxt pge
[]
None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/8/ fetching nxt pge
[]
None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/9/ fetching nxt pge
[]
None
https://www.simplyrecipes.com/gluten-free-recipes-5091257/page/10/ fetching nxt pge
[]


KeyError: 'links'

In [13]:
recipes_sr_orig = utils.read_multiple_csv_and_concat('simply_recipes/simply_recipes/*')
# recipes_sr_orig = pd.DataFrame({["0","title","preptime","cooktime","yield","filedunder"]})
# recipes_sr_orig.describe()
# recipes_sr_orig[1]
# recipes_sr_orig.drop(columns='Unnamed: 0', inplace=True)
# recipes_sr_orig.describe()
recipes_sr_orig

['simply_recipes/simply_recipes\\IFDcsv.csv']


Unnamed: 0,Srno,RecipeName,TranslatedRecipeName,Ingredients,TranslatedIngredients,PrepTimeInMins,CookTimeInMins,TotalTimeInMins,Servings,Cuisine,Course,Diet,Instructions,TranslatedInstructions,URL
0,1,Masala Karela Recipe,Masala Karela Recipe,"6 Karela (Bitter Gourd/ Pavakkai) - deseeded,S...","6 Karela (Bitter Gourd/ Pavakkai) - deseeded,S...",15,30,45,6,Indian,Side Dish,Diabetic Friendly,"To begin making the Masala Karela Recipe,de-se...","To begin making the Masala Karela Recipe,de-se...",https://www.archanaskitchen.com/masala-karela-...
1,2,टमाटर पुलियोगरे रेसिपी - Spicy Tomato Rice (Re...,Spicy Tomato Rice (Recipe),"2-1/2 कप चावल - पका ले,3 टमाटर,3 छोटा चमच्च बी...","2-1 / 2 cups rice - cooked, 3 tomatoes, 3 teas...",5,10,15,3,South Indian Recipes,Main Course,Vegetarian,टमाटर पुलियोगरे बनाने के लिए सबसे पहले टमाटर क...,"To make tomato puliogere, first cut the tomato...",http://www.archanaskitchen.com/spicy-tomato-ri...
2,3,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,Ragi Semiya Upma Recipe - Ragi Millet Vermicel...,"1-1/2 cups Rice Vermicelli Noodles (Thin),1 On...","1-1/2 cups Rice Vermicelli Noodles (Thin),1 On...",20,30,50,4,South Indian Recipes,South Indian Breakfast,High Protein Vegetarian,"To begin making the Ragi Vermicelli Recipe, fi...","To begin making the Ragi Vermicelli Recipe, fi...",http://www.archanaskitchen.com/ragi-vermicelli...
3,4,Gongura Chicken Curry Recipe - Andhra Style Go...,Gongura Chicken Curry Recipe - Andhra Style Go...,"500 grams Chicken,2 Onion - chopped,1 Tomato -...","500 grams Chicken,2 Onion - chopped,1 Tomato -...",15,30,45,4,Andhra,Lunch,Non Vegeterian,To begin making Gongura Chicken Curry Recipe f...,To begin making Gongura Chicken Curry Recipe f...,http://www.archanaskitchen.com/gongura-chicken...
4,5,आंध्रा स्टाइल आलम पचड़ी रेसिपी - Adrak Chutney ...,Andhra Style Alam Pachadi Recipe - Adrak Chutn...,"1 बड़ा चमच्च चना दाल,1 बड़ा चमच्च सफ़ेद उरद दाल,2...","1 tablespoon chana dal, 1 tablespoon white ura...",10,20,30,4,Andhra,South Indian Breakfast,Vegetarian,आंध्रा स्टाइल आलम पचड़ी बनाने के लिए सबसे पहले ...,"To make Andhra Style Alam Pachadi, first heat ...",https://www.archanaskitchen.com/andhra-style-a...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6866,14073,गोअन मशरुम जकुटी रेसिपी - Goan Mushroom Xacuti...,Goan Mushroom Xacuti Recipe,"20 बटन मशरुम,2 प्याज - काट ले,1 टमाटर - बारीक ...","20 बटन मशरुम,2 प्याज - काट ले,1 टमाटर - बारीक ...",15,45,60,4,Goan Recipes,Lunch,Vegetarian,गोअन मशरुम जकुटी रेसिपी बनाने के लिए सबसे पहले...,गोअन मशरुम जकुटी रेसिपी बनाने के लिए सबसे पहले...,https://www.archanaskitchen.com/goan-mushroom-...
6867,14107,शकरकंदी और मेथी का पराठा रेसिपी - Sweet Potato...,Sweet Potato & Methi Stuffed Paratha Recipe,"1 बड़ा चम्मच तेल,1 कप गेहूं का आटा,नमक - स्वाद ...","1 बड़ा चम्मच तेल,1 कप गेहूं का आटा,नमक - स्वाद ...",30,60,90,4,North Indian Recipes,North Indian Breakfast,Diabetic Friendly,शकरकंदी और मेथी का पराठा रेसिपी बनाने के लिए स...,शकरकंदी और मेथी का पराठा रेसिपी बनाने के लिए स...,https://www.archanaskitchen.com/sweet-potato-m...
6868,14165,Ullikadala Pulusu Recipe | Spring Onion Curry,Ullikadala Pulusu Recipe | Spring Onion Curry,150 grams Spring Onion (Bulb & Greens) - chopp...,150 grams Spring Onion (Bulb & Greens) - chopp...,5,10,15,2,Andhra,Side Dish,Vegetarian,To begin making Ullikadala Pulusu Recipe | Spr...,To begin making Ullikadala Pulusu Recipe | Spr...,https://www.archanaskitchen.com/ullikadala-pul...
6869,14167,Kashmiri Style Kokur Yakhni Recipe-Chicken Coo...,Kashmiri Style Kokur Yakhni Recipe-Chicken Coo...,"1 kg Chicken - medium pieces,1/2 cup Mustard o...","1 kg Chicken - medium pieces,1/2 cup Mustard o...",30,45,75,4,Kashmiri,Lunch,Non Vegeterian,To begin making the Kashmiri Kokur Yakhni reci...,To begin making the Kashmiri Kokur Yakhni reci...,http://www.archanaskitchen.com/kashmiri-kokur-...


In [11]:
recipes_sr_inter = clean_data.intermediate_clean_recipes_sr(recipes_sr_orig)

AttributeError: 'DataFrame' object has no attribute 'title'
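
The `AttributeError` above is consistent with the loaded file not having a `title` column: the table printed earlier has columns such as `RecipeName` and `Ingredients` instead. A minimal sketch of a check that could run before the cleaning step is shown below. `missing_columns` is a hypothetical helper, and `title` is the only requirement confirmed by the error message; the full list of columns the cleaning function needs is not shown in this notebook.

```python
def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Column names copied from the DataFrame output shown earlier in this notebook.
loaded_columns = ["Srno", "RecipeName", "TranslatedRecipeName", "Ingredients",
                  "TranslatedIngredients", "PrepTimeInMins", "CookTimeInMins",
                  "TotalTimeInMins", "Servings", "Cuisine", "Course", "Diet",
                  "Instructions", "TranslatedInstructions", "URL"]

# 'title' is the only requirement confirmed by the error message.
missing = missing_columns(loaded_columns, ["title"])
if missing:
    print("Cannot clean this file; missing columns:", missing)
```

With a DataFrame, `recipes_sr_orig.columns` can be passed as the first argument.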

In [None]:
recipes_sr_inter.head(2)

In [None]:
marianos_insta_scraping()

In [None]:
grocery_prices_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/grocery_prices_marianos/prod_aile*')

In [None]:
grocery_prices_orig.head(2)

In [None]:
grocery_prices_inter = clean_data.intermediate_clean_marianos_prices(grocery_prices_orig)

In [None]:
grocery_prices_inter.head(2)

In [None]:
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')
order = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')
order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')

In [None]:
instacart_baskets = clean_data.combine_instacart_kaggle_datasets(aisles, departments, order, 
                                                                 order_products__prior, products)
instacart_baskets.head()

In [None]:
instacart_baskets.info()

In [None]:
pd.DataFrame(instacart_baskets.groupby('user_id')['order_id']\
             .nunique()).sort_values('order_id', ascending=False)\
             .head(5)

In [None]:
recipes_sr_inter.head(2)

In [None]:
recipes_sr_inter.info()

In [None]:
print('Number of unique recipes: ', len(recipes_sr_inter))

Data Source: https://github.com/nytimes/ingredient-phrase-tagger

In [None]:
nyt_ing = pd.read_csv('../../data/01_raw/nyt-ingredients-snapshot-2015.csv')
nyt_ing.drop(columns=['index'], inplace=True)
print('Number of hand-labeled ingredients: ', len(nyt_ing))
nyt_ing.head()

In [None]:
nyt_ing.fillna("missing", inplace=True)

In [None]:
X, y = nyt_ingredients_crf_feature_creation(nyt_ing)
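
`pycrfsuite` expects each training example as a sequence of per-token features paired with a sequence of labels. The internals of `nyt_ingredients_crf_feature_creation` are not shown in this notebook, so the following is only an illustrative sketch of what such features might look like. The function name `token_features`, the chosen features, and the label names are all assumptions.

```python
def token_features(tokens, i):
    """Build a feature list for the token at position i (illustrative only)."""
    token = tokens[i]
    features = [
        "bias",
        "token.lower=" + token.lower(),
        # Treat simple fractions such as "1/2" as numeric.
        "token.isdigit=%s" % token.replace("/", "").isdigit(),
        "token.length=%d" % len(token),
    ]
    # Neighbouring tokens give the CRF context about word order.
    if i > 0:
        features.append("prev.lower=" + tokens[i - 1].lower())
    else:
        features.append("BOS")  # beginning of sequence
    if i < len(tokens) - 1:
        features.append("next.lower=" + tokens[i + 1].lower())
    else:
        features.append("EOS")  # end of sequence
    return features


tokens = ["2", "cups", "chopped", "onion"]
xseq = [token_features(tokens, i) for i in range(len(tokens))]
yseq = ["QTY", "UNIT", "COMMENT", "NAME"]  # hypothetical label set

for token, feats, label in zip(tokens, xseq, yseq):
    print(token, label, feats)
```

Each `xseq`/`yseq` pair has the same length, which is the shape that `trainer.append(xseq, yseq)` consumes in the training cells.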

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_nyt_initial_model.model')
# let's read back in our model 
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_nyt_initial_model.model')

In [None]:
# Kernel keeps dying when I try to tag things with the tagger
y_pred = [tagger.tag(xseq) for xseq in X_test]

In [None]:
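mlb = MultiLabelBinarizer()
# Fit once on all labels so that y_true and y_pred share the same columns
mlb.fit(list(y_test) + list(y_pred))
print(classification_report(y_true=mlb.transform(y_test), y_pred=mlb.transform(y_pred)))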
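
`MultiLabelBinarizer` treats each sequence as an unordered set of labels, so the report above measures whether a label appears anywhere in a sequence rather than whether each token was tagged correctly. As a complementary check, a token-level score could be computed by comparing the sequences position by position. Below is a minimal sketch on made-up labels; `token_accuracy` is a hypothetical helper, not part of the project.

```python
def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted label matches the true label."""
    correct = 0
    total = 0
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_label, pred_label in zip(true_seq, pred_seq):
            correct += int(true_label == pred_label)
            total += 1
    return correct / total if total else 0.0


# Toy sequences (hypothetical labels for illustration).
y_true_demo = [["QTY", "UNIT", "NAME"], ["QTY", "NAME"]]
y_pred_demo = [["QTY", "NAME", "UNIT"], ["QTY", "NAME"]]

print(token_accuracy(y_true_demo, y_pred_demo))
```

In this toy case the first predicted sequence contains the right set of labels in the wrong order. A set-based comparison would count it as fully correct, while the token-level score is 3 out of 5.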

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X, y):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/0/crf_ing_final.model')

In [None]:
recipe_ing_dict, recipe_links_dict, recipe_tags_dict = crf_model_recipe_tagger(recipes_sr_inter)

In [None]:
recipe_ing_dict

In [None]:
instacart_prod_train = pd.read_csv('../../data/01_raw/instacart_product_train.csv')

In [None]:
instacart_prod_train.head()

In [None]:
X, y = instacart_prod_crf_feature_creation(instacart_prod_train)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_ingredients_initial.model')
tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_ingredients_initial.model')

In [None]:
labels = [tagger.tag(xseq) for xseq in X_test]

In [None]:
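mlb = MultiLabelBinarizer()
# Fit once on all labels so that y_true and y_pred share the same columns
mlb.fit(list(y_test) + list(labels))
print(classification_report(y_true=mlb.transform(y_test), y_pred=mlb.transform(labels)))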

In [None]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X, y):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('../../data/04_models/crf_instacart_products_final.model')

```python
# To run this step yourself, add the code below to a cell. Otherwise, read in
# the final file provided (see the next cell).
X, token_sr, products_list = crf_basket_feature_creation(instacart_baskets)

tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_instacart_products_final.model')

labels = [tagger.tag(xseq) for xseq in X]

instacart_baskets_update = crf_basket_dataset_creation(token_sr, labels, products_list, instacart_baskets)
```

In [None]:
instacart_baskets_update = pd.read_csv('../../data/05_model_output/baskets_newprodlist_2.csv')

In [None]:
instacart_baskets_update

In [None]:
# From a cursory look at the dataset, a number of entries marked as food are not food.
# Let's remove them so that they don't skew our results.
non_food_labels = ['1', '100', '11', '118', '2', '24', '3', '3 cheese', '30', '328',
                   '4', '5', '50', '6', '6 cheese', '60', '7', '70', '8', '85',
                   '9', '95', '97', '98', 'a', 'a garlic butter sauce', 'nan']
# Comparing with `!= np.nan` is always True, so missing values are dropped with notna() instead
mask = (~instacart_baskets_update['new_prod_list'].isin(non_food_labels)
        & instacart_baskets_update['new_prod_list'].notna())

instacart_baskets_filtered = instacart_baskets_update[mask]

In [None]:
print('Number of products after running names through the CRF model: ', instacart_baskets_filtered.new_prod_list.nunique())
print('Number of products in the original list: ', instacart_baskets_filtered.product_name.nunique())
print('Number of unique users: ', instacart_baskets_filtered.user_id.nunique())

In [None]:
instacart_users_lst = list(instacart_baskets_filtered.user_id.unique())
len(instacart_users_lst)

In [None]:
random_usrids_100k = random.sample(instacart_users_lst, 100000)
mask = instacart_baskets_filtered['user_id'].isin(random_usrids_100k)
baskets_100k = instacart_baskets_filtered.loc[mask]
print('Number of User IDs: ', baskets_100k.user_id.nunique())

In [None]:
baskets_100k

In [None]:
baskets_complete = baskets_100k.drop(columns=['product_name', 'user_id'])
baskets_complete.head()

The DataFrame can be converted to matrix format as follows:
```python
basket_matrix_usr = baskets_complete.groupby(['order_id', 'new_prod_list'])['all_ones']\
                    .sum().unstack().reset_index().fillna(0)\
                    .set_index('order_id')
```

Run ```similarities_model.py``` (located in the ```src/d02_features``` folder) from the command line to get the final similarity matrix.
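
The contents of `similarities_model.py` are not reproduced in this notebook. As a rough illustration of how an item-item similarity matrix can be derived from a basket matrix like the one above, here is a minimal sketch that assumes cosine similarity. The actual script may use a different measure, and the toy basket data is made up.

```python
import numpy as np
import pandas as pd

# Toy order-by-product matrix (1 = product was in the order); hypothetical data.
basket_matrix = pd.DataFrame(
    [[1, 1, 0],
     [1, 1, 0],
     [0, 1, 1]],
    index=["order_1", "order_2", "order_3"],
    columns=["potato", "onion", "milk"],
)

# Cosine similarity between product columns: dot product of two columns
# divided by the product of their Euclidean norms.
# Columns with no purchases (zero norm) would need separate handling.
values = basket_matrix.to_numpy(dtype=float)
norms = np.linalg.norm(values, axis=0)
similarity = (values.T @ values) / np.outer(norms, norms)

item_similarity = pd.DataFrame(
    similarity, index=basket_matrix.columns, columns=basket_matrix.columns
)

# Most similar products to 'potato' (the item itself ranks first).
print(item_similarity.loc["potato"].nlargest(2))
```

Products that frequently appear in the same orders end up with a similarity close to 1, while products that never co-occur score 0.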

In [None]:
data_matrix = pd.read_csv('../../data/05_model_output/data_matrix_sim.csv')
data_matrix.set_index('Unnamed: 0', inplace=True)

In [None]:
print(data_matrix.loc['potato'].nlargest(11))

In [None]:
print('Choose your meal by entering 1, 2, or 3')
# print('\n')
meal_input = input("Breakfast: Input 1 || Lunch: Input 2 || Dinner: Input 3: ")
print('\n')
print('Choose your dietary preference by entering 1 or 2: ')
# print('\n')
dietary_preference_input = input("Vegetarian: Input 1 || Omnivore: Input 2: ")
# print('\n')
print('Type in 3 foods you already like')
item1 = input("Item 1: ")
item2 = input("Item 2: ")
item3 = input("Item 3: ")
print('\n')
print('Searching for five recipe recommendations based on both your inputs and similar foods.')

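# Map the menu choice to a meal tag; unrecognized input falls back to 'Dinner'.
# Assumption: the recipe tags include a 'Lunch' tag matching the menu option above.
meal_options = {"1": 'Breakfast', "2": 'Lunch', "3": 'Dinner'}
meal = meal_options.get(meal_input, 'Dinner')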
    
if dietary_preference_input == "1":
    dietary_preference = 'Vegetarian'
else:
    dietary_preference = None
shopping_basket = [item1, item2, item3]
recipe_recommendations_app(shopping_basket, recipe_ing_dict, recipe_tags_dict, meal, dietary_preference, recipe_links_dict)
