## Data Collection - Web Crawling

To start the project, we need to collect three types of data:
* First, we need a dataset of users' food preferences. Because food rating data is difficult to come by, we can instead use point-of-purchase grocery store data and rely on implicit feedback (i.e., assume that customers who bought an item liked it).

* Second, we need a repository of diverse recipes. Although a number of recipe datasets are available online, none that I found has all the required attributes, so this data will need to be collected from a website via web scraping.

* Third, we need a database of grocery prices for pricing our recipes. Because datasets with prices are few and far between, and quickly become outdated, we will need to manually collect grocery pricing data from a store website.
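
To make the implicit-feedback assumption in the first bullet concrete, here is a minimal sketch. The purchase records are invented for illustration and are not taken from the Instacart data. Every purchased item counts as a positive signal, and items a user never bought are treated as unknown rather than disliked.

```python
from collections import defaultdict

# Toy purchase log of (user_id, product) pairs; values are made up for illustration.
purchases = [
    (1, "eggs"), (1, "bread"), (1, "eggs"),
    (2, "tomato"), (2, "bread"),
]

def implicit_feedback(purchase_log):
    """Map each user to the set of products they bought at least once."""
    liked = defaultdict(set)
    for user_id, product in purchase_log:
        liked[user_id].add(product)  # repeat purchases collapse into one signal
    return dict(liked)

feedback = implicit_feedback(purchases)
print(feedback[1])
```

Collapsing repeat purchases into a single signal is one possible design choice; purchase counts could also be kept as a confidence weight.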

In [None]:
!pip install python-crfsuite
!pip install selenium

In [None]:
!pip install python-utils
!pip install data

In [None]:
!pip install Utils

In [None]:
!pip install scrapy

In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
import argparse
import matplotlib.pyplot as plt
import nltk
import numpy as np
import os
import pandas as pd
import pickle
import pycrfsuite
import random
import re
import sys
import warnings
from os.path import dirname, realpath, sep, pardir
nltk.download('averaged_perceptron_tagger')
warnings.filterwarnings('ignore')

from selenium import webdriver
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import MultiLabelBinarizer

src_dir = os.path.join(os.getcwd(), '..', '..', 'src')
sys.path.append(src_dir)
# sys.path.append(dirname(pardir + sep + "src"))  # this is the line that caused trouble

from src.d00_utils import utils
from src.d01_data import clean_data
from src.d01_data.web_scraping import sr_scraping, marianos_insta_scraping  # first: recipe website, second: grocery store
from src.d02_features.feature_creation import nyt_ingredients_crf_feature_creation
from src.d02_features.feature_creation import instacart_prod_crf_feature_creation
from src.d03_models.crf_model_recipes import crf_model_recipe_tagger
from src.d03_models.crf_model_baskets import crf_basket_feature_creation, crf_basket_dataset_creation
from src.d03_models.app_functions import *

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [None]:
sr_scraping()  # runs the recipe web scraper

In [2]:
recipes_sr_orig = utils.read_multiple_csv_and_concat('simply_recipes/simply_recipes/*')
# recipes_sr_orig = pd.DataFrame({["0","title","preptime","cooktime","yield","filedunder"]})
# recipes_sr_orig.describe()
# recipes_sr_orig[1]
recipes_sr_orig.drop(columns='Unnamed: 0', inplace=True)
# recipes_sr_orig.describe()
recipes_sr_orig

['simply_recipes/simply_recipes\\simply_recipes_1.csv', 'simply_recipes/simply_recipes\\simply_recipes_2.csv', 'simply_recipes/simply_recipes\\simply_recipes_3.csv']


Unnamed: 0,title,prep_time,cook_time,recipe_yield,tags,ingredients,entire_card,byline,link_food
0,['Grilled Cheese BLT'],"['Prep time:', ' ', '10 minutes']","['Cook time:', ' ', '10 minutes']","['Yield:', ' ', '4 sandwiches']","['Filed under:', ' ', 'Dinner', 'Lunch', 'Sand...","['\n ', 'Ingredients', ...","['\n\n ', '\n ...","['by ', ' ', 'Aaron Hutcherson', 'August 2...","['<link rel=""canonical"" href=""https://www.simp..."
1,['Pulled Pork Sandwich'],"['Prep time:', ' ', '10 minutes']","['Cook time:', ' ', '2 hours, 45 minutes']","['Yield:', ' ', 'Serves 6 to 8']","['Filed under:', ' ', 'Dinner', 'Sandwich', 'B...","['\n ', 'Ingredients', ...","['\n\n ', '\n ...","['by ', ' ', 'Elise Bauer', 'Updated Augus...","['<link rel=""canonical"" href=""https://www.simp..."
2,['How to Make Bacon in the Oven'],"['Prep time:', ' ', '5 minutes']","['Cook time:', ' ', '20 minutes']","['Yield:', ' ', '12 strips']","['Filed under:', ' ', 'Tips', 'Breakfast and B...","['\n ', 'Ingredients', ...","['\n\n ', '\n ...","['by ', ' ', 'Nick Evans', 'August 25, 2019']","['<link rel=""canonical"" href=""https://www.simp..."
3,['Sausage Stuffed Zucchini'],"['Prep time:', ' ', '15 minutes']","['Cook time:', ' ', '1 hour']","['Yield:', ' ', 'Serves 4']","['Filed under:', ' ', 'Dinner', 'Favorite Summ...","['\n ', 'Ingredients', ...","['\n\n ', '\n ...","['by ', ' ', 'Elise Bauer', 'Updated Augus...","['<link rel=""canonical"" href=""https://www.simp..."
4,['The Best Dry Rub for Ribs'],"['Prep time:', ' ', '5 minutes']","['Yield:', ' ', '1 1/2 cups']",[],"['Filed under:', ' ', 'Favorite Fall', 'Favori...","['\n ', 'Ingredients', ...","['\n\n ', '\n ...","['by ', ' ', 'Irvin Lin', 'July 28, 2019']","['<link rel=""canonical"" href=""https://www.simp..."
...,...,...,...,...,...,...,...,...,...
1747,['Asparagus Risotto'],"['Prep time:', ' ', '10 minutes']","['Cook time:', ' ', '35 minutes']","['Yield:', ' ', 'Serves 2-3 as a main course, ...","['Filed under:', ' ', 'Dinner', 'Side Dish', '...","['\n ', 'Ingredients', ...","['\n\n ', '\n ...","['by ', ' ', 'Elise Bauer']","['<link rel=""canonical"" href=""https://www.simp..."
1748,['Butternut Squash Risotto'],"['Prep time:', ' ', '10 minutes']","['Cook time:', ' ', '40 minutes']","['Yield:', ' ', 'Serves 4 to 6']","['Filed under:', ' ', 'Side Dish', 'Gluten-Fre...","['\n ', 'Ingredients', ...","['\n\n ', '\n ...","['by ', ' ', 'Elise Bauer']","['<link rel=""canonical"" href=""https://www.simp..."
1749,['Rice Pilaf'],"['Prep time:', ' ', '5 minutes']","['Cook time:', ' ', '25 minutes']","['Yield:', ' ', 'Serves 6 to 8']","['Filed under:', ' ', 'Side Dish', 'Gluten-Fre...","['\n ', 'Ingredients', ...","['\n\n ', '\n ...","['by ', ' ', 'Elise Bauer']","['<link rel=""canonical"" href=""https://www.simp..."
1750,"['Rice with Carrot, Lemon, Onion and Mint']","['Prep time:', ' ', '10 minutes']","['Cook time:', ' ', '40 minutes']","['Yield:', ' ', 'Serves 6.']","['Filed under:', ' ', 'Side Dish', 'Quick and ...","['\n ', 'Ingredients', ...","['\n\n ', '\n ...","['by ', ' ', 'Elise Bauer']","['<link rel=""canonical"" href=""https://www.simp..."
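
The helper `utils.read_multiple_csv_and_concat` is not shown in this notebook. Judging by its name and the list of file paths it prints, it presumably globs a pattern and concatenates the matching CSV files. The sketch below illustrates that idea with the standard library only; the function name `read_multiple_csv` and the demo files are hypothetical, and a pandas version would typically combine `pd.read_csv` with `pd.concat`.

```python
import csv
import glob
import os
import tempfile

def read_multiple_csv(pattern):
    """Read every CSV matching a glob pattern and return all rows as one list of dicts."""
    rows = []
    for path in sorted(glob.glob(pattern)):  # sorted for a deterministic order
        with open(path, newline="", encoding="utf-8") as handle:
            rows.extend(csv.DictReader(handle))
    return rows

# Demo with two throwaway files (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    for i, title in enumerate(["Grilled Cheese BLT", "Rice Pilaf"], start=1):
        file_path = os.path.join(tmp, f"simply_recipes_{i}.csv")
        with open(file_path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["title"])
            writer.writerow([title])
    combined = read_multiple_csv(os.path.join(tmp, "simply_recipes_*.csv"))

print([row["title"] for row in combined])
```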


In [3]:
recipes_sr_inter = clean_data.intermediate_clean_recipes_sr(recipes_sr_orig)

In [4]:
recipes_sr_inter.head(2)

Unnamed: 0,title,prep_time,cook_time,tags,ingredients,recipe_yield,byline,link_food
0,Grilled Cheese BLT,10 minutes,10 minutes,"['Dinner', 'Lunch', 'Sandwich', 'Favorite Summ...","[8 slices sourdough bread, 4 tablespoon unsalt...",4 sandwiches,Aaron Hutcherson,https://www.simplyrecipes.com/recipes/grilled_...
1,Pulled Pork Sandwich,10 minutes,"2 hours, 45 minutes","['Dinner', 'Sandwich', 'Budget', 'Comfort Food...","[For the sauce:, 1 large onion, chopped, 6 gar...",Serves 6 to 8,Elise Bauer,https://www.simplyrecipes.com/recipes/pulled_p...
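
The raw scrape stores most fields as stringified lists such as `"['Prep time:', ' ', '10 minutes']"`, while the cleaned frame holds plain values such as `10 minutes`. The body of `clean_data.intermediate_clean_recipes_sr` is not shown here, so the following is only a sketch of one way such a field could be unpacked; the helper name `last_text_item` is hypothetical.

```python
import ast

def last_text_item(raw_cell):
    """Parse a stringified list and return its last non-blank item, or None."""
    try:
        items = ast.literal_eval(raw_cell)
    except (ValueError, SyntaxError, TypeError):
        return None  # not a parsable literal (e.g. a missing value)
    if not isinstance(items, list):
        return None
    cleaned = [str(item).strip() for item in items if str(item).strip()]
    return cleaned[-1] if cleaned else None

print(last_text_item("['Prep time:', ' ', '10 minutes']"))  # 10 minutes
print(last_text_item("[]"))                                 # None
```

`ast.literal_eval` is used instead of `eval` because it only accepts Python literals, which is safer for text scraped from the web.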


In [5]:
marianos_insta_scraping()

WebDriverException: Message: chrome not reachable
  (Session info: chrome=104.0.5112.80)


In [None]:
grocery_prices_orig = utils.read_multiple_csv_and_concat('../../data/01_raw/grocery_prices_marianos/prod_aile*')

In [None]:
grocery_prices_orig.head(2)

In [None]:
grocery_prices_inter = clean_data.intermediate_clean_marianos_prices(grocery_prices_orig)

In [None]:
grocery_prices_inter.head(2)

In [None]:
aisles = pd.read_csv('../../data/01_raw/instacart_2017_05_01/aisles.csv')
departments = pd.read_csv('../../data/01_raw/instacart_2017_05_01/departments.csv')
order = pd.read_csv('../../data/01_raw/instacart_2017_05_01/orders.csv')
order_products__prior = pd.read_csv('../../data/01_raw/instacart_2017_05_01/order_products__prior.csv')
products = pd.read_csv('../../data/01_raw/instacart_2017_05_01/products.csv')

In [None]:
instacart_baskets = clean_data.combine_instacart_kaggle_datasets(aisles, departments, order, 
                                                                 order_products__prior, products)
instacart_baskets.head()

In [None]:
instacart_baskets.info()

In [None]:
pd.DataFrame(instacart_baskets.groupby('user_id')['order_id']\
             .nunique()).sort_values('order_id', ascending=False)\
             .head(5)

In [6]:
recipes_sr_inter.head(2)

Unnamed: 0,title,prep_time,cook_time,tags,ingredients,recipe_yield,byline,link_food
0,Grilled Cheese BLT,10 minutes,10 minutes,"['Dinner', 'Lunch', 'Sandwich', 'Favorite Summ...","[8 slices sourdough bread, 4 tablespoon unsalt...",4 sandwiches,Aaron Hutcherson,https://www.simplyrecipes.com/recipes/grilled_...
1,Pulled Pork Sandwich,10 minutes,"2 hours, 45 minutes","['Dinner', 'Sandwich', 'Budget', 'Comfort Food...","[For the sauce:, 1 large onion, chopped, 6 gar...",Serves 6 to 8,Elise Bauer,https://www.simplyrecipes.com/recipes/pulled_p...


In [7]:
recipes_sr_inter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1752 entries, 0 to 1751
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   title         1738 non-null   object
 1   prep_time     1583 non-null   object
 2   cook_time     1408 non-null   object
 3   tags          1752 non-null   object
 4   ingredients   1752 non-null   object
 5   recipe_yield  1242 non-null   object
 6   byline        1752 non-null   object
 7   link_food     1752 non-null   object
dtypes: object(8)
memory usage: 109.6+ KB


In [8]:
print('Number of unique recipes: ', len(recipes_sr_inter))

Number of unique recipes:  1752


Data Source: https://github.com/nytimes/ingredient-phrase-tagger

In [9]:
nyt_ing = pd.read_csv('nyt-ingredients-snapshot-2015.csv')
nyt_ing.drop(columns=['index'], inplace=True)
print('Number of Handlabeled Ingredients: ', len(nyt_ing))
nyt_ing.head()

Number of Handlabeled Ingredients:  179207


Unnamed: 0,input,name,qty,range_end,unit,comment
0,1 1/4 cups cooked and pureed fresh butternut s...,butternut squash,1.25,0.0,cup,"cooked and pureed fresh, or 1 10-ounce package..."
1,1 cup peeled and cooked fresh chestnuts (about...,chestnuts,1.0,0.0,cup,"peeled and cooked fresh (about 20), or 1 cup c..."
2,"1 medium-size onion, peeled and chopped",onion,1.0,0.0,,"medium-size, peeled and chopped"
3,"2 stalks celery, chopped coarse",celery,2.0,0.0,stalk,chopped coarse
4,1 1/2 tablespoons vegetable oil,vegetable oil,1.5,0.0,tablespoon,


In [10]:
nyt_ing.fillna("missing", inplace=True)

In [11]:
X, y = nyt_ingredients_crf_feature_creation(nyt_ing)
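
The feature-creation function is imported from `src` and its body is not shown. As a rough illustration of what CRF features for an ingredient phrase can look like, the sketch below builds one feature dict per token. The feature names and the whitespace tokenization are assumptions of mine, not the features used in this project. python-crfsuite accepts each sequence as a list of such dicts.

```python
def token_features(tokens, i):
    """Build a feature dict for the token at position i (illustrative feature set)."""
    token = tokens[i]
    features = {
        "bias": 1.0,
        "lower": token.lower(),
        # treat "2", "1.5" and "1/2" as numeric quantities
        "is_digit": token.replace("/", "").replace(".", "").isdigit(),
        "is_title": token.istitle(),
    }
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sequence
    return features

def phrase_to_features(phrase):
    """Split a phrase on whitespace and featurize every token."""
    tokens = phrase.split()
    return [token_features(tokens, i) for i in range(len(tokens))]

xseq = phrase_to_features("2 stalks celery, chopped coarse")
print(xseq[0])
```

Including the neighbouring tokens gives the model local context, which helps separate, for example, a unit that follows a quantity from a word that belongs to the ingredient name.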

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [13]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('04_models/crf_nyt_initial_model.model')
# let's read back in our model 
tagger = pycrfsuite.Tagger()
tagger.open('04_models/crf_nyt_initial_model.model')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 40293
Seconds required: 0.545

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 200
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 727854.175096
Feature norm: 1.000000
Error norm: 293556.966761
Active features: 39606
Line search trials: 1
Line search step: 0.000003
Seconds required for this iteration: 0.461

***** Iteration #2 *****
Loss: 247137.281503
Feature norm: 4.934997
Error norm: 152176.348445
Active features: 38261
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.236

***** Iteration #3 *****
Loss: 226950.542434
Feature norm: 5.428817
Error norm: 184513.388000
Active features: 39142
Line search trials: 2
Line search step: 0.500000
Seconds requir

***** Iteration #39 *****
Loss: 29201.186559
Feature norm: 70.026198
Error norm: 5104.517303
Active features: 26031
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.242

***** Iteration #40 *****
Loss: 28506.347917
Feature norm: 73.702489
Error norm: 5777.563801
Active features: 25919
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.235

***** Iteration #41 *****
Loss: 27908.488551
Feature norm: 77.331039
Error norm: 3413.650905
Active features: 25721
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.236

***** Iteration #42 *****
Loss: 27278.588751
Feature norm: 82.043475
Error norm: 4948.054714
Active features: 25305
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.238

***** Iteration #43 *****
Loss: 26797.477456
Feature norm: 85.994593
Error norm: 3443.722562
Active features: 25445
Line search trials: 1
Line search step: 1.000000

***** Iteration #79 *****
Loss: 21531.868486
Feature norm: 159.910126
Error norm: 1298.007272
Active features: 20540
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.253

***** Iteration #80 *****
Loss: 21496.009728
Feature norm: 160.507643
Error norm: 1677.188202
Active features: 20442
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.237

***** Iteration #81 *****
Loss: 21468.014565
Feature norm: 161.032862
Error norm: 1682.066839
Active features: 20434
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.221

***** Iteration #82 *****
Loss: 21440.683550
Feature norm: 161.440348
Error norm: 1575.235239
Active features: 20413
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.226

***** Iteration #83 *****
Loss: 21416.097942
Feature norm: 161.901007
Error norm: 1448.154710
Active features: 20368
Line search trials: 1
Line search step: 1.0

***** Iteration #119 *****
Loss: 20947.740385
Feature norm: 171.856236
Error norm: 1102.267797
Active features: 19752
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.223

***** Iteration #120 *****
Loss: 20945.197714
Feature norm: 172.037293
Error norm: 1949.762979
Active features: 19753
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.228

***** Iteration #121 *****
Loss: 20934.916244
Feature norm: 172.212196
Error norm: 1114.851831
Active features: 19763
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.223

***** Iteration #122 *****
Loss: 20932.406043
Feature norm: 172.395585
Error norm: 1887.023817
Active features: 19756
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.227

***** Iteration #123 *****
Loss: 20922.185933
Feature norm: 172.547898
Error norm: 970.855897
Active features: 19753
Line search trials: 1
Line search step:

***** Iteration #158 *****
Loss: 20725.348185
Feature norm: 177.247148
Error norm: 1570.919517
Active features: 19694
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.226

***** Iteration #159 *****
Loss: 20718.247708
Feature norm: 177.350429
Error norm: 840.617894
Active features: 19704
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.225

***** Iteration #160 *****
Loss: 20714.210015
Feature norm: 177.454467
Error norm: 1292.470980
Active features: 19711
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.226

***** Iteration #161 *****
Loss: 20708.079665
Feature norm: 177.559485
Error norm: 975.045948
Active features: 19702
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.224

***** Iteration #162 *****
Loss: 20705.068669
Feature norm: 177.691186
Error norm: 1659.120747
Active features: 19690
Line search trials: 1
Line search step: 

***** Iteration #197 *****
Loss: 20570.905269
Feature norm: 181.212528
Error norm: 898.013088
Active features: 19567
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.225

***** Iteration #198 *****
Loss: 20568.614970
Feature norm: 181.279755
Error norm: 1130.084436
Active features: 19565
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.223

***** Iteration #199 *****
Loss: 20565.084087
Feature norm: 181.350209
Error norm: 858.516984
Active features: 19580
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.225

***** Iteration #200 *****
Loss: 20562.910686
Feature norm: 181.424018
Error norm: 1100.797543
Active features: 19564
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.225

L-BFGS terminated with the maximum number of iterations
Total seconds required for training: 46.491

Storing the model
Number of active features: 19564 (40293

<contextlib.closing at 0x1fc9a4c0070>

In [14]:
# The kernel keeps dying when I try to tag sequences with the tagger
y_pred = [tagger.tag(xseq) for xseq in X_test]

In [15]:
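# Fit the binarizer once on all labels so that y_true and y_pred share the same columns
mlb = MultiLabelBinarizer()
mlb.fit(list(y_test) + list(y_pred))
print(classification_report(y_true=mlb.transform(y_test), y_pred=mlb.transform(y_pred)))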

              precision    recall  f1-score   support

           0       0.95      0.98      0.97     19694
           1       1.00      1.00      1.00     35753
           2       1.00      1.00      1.00     35842
           3       1.00      1.00      1.00     24577

   micro avg       0.99      1.00      0.99    115866
   macro avg       0.99      0.99      0.99    115866
weighted avg       0.99      1.00      0.99    115866
 samples avg       0.99      1.00      0.99    115866
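
Note that `MultiLabelBinarizer` turns each sequence into a set of labels, so the report above measures whether a label appears anywhere in a sequence, not whether each individual token was tagged correctly. A token-level view can be obtained by comparing the sequences position by position. The sketch below does this with the standard library; the label names in the demo are invented and may not match the labels produced by the feature-creation function.

```python
from collections import Counter

def token_level_scores(y_true, y_pred):
    """Per-label precision and recall computed over individual tokens."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true_seq, pred_seq in zip(y_true, y_pred):
        for true_label, pred_label in zip(true_seq, pred_seq):
            if true_label == pred_label:
                tp[true_label] += 1
            else:
                fp[pred_label] += 1  # predicted label was wrong
                fn[true_label] += 1  # true label was missed
    scores = {}
    for label in sorted(set(tp) | set(fp) | set(fn)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        scores[label] = {
            "precision": tp[label] / predicted if predicted else 0.0,
            "recall": tp[label] / actual if actual else 0.0,
        }
    return scores

# Toy sequences with hypothetical label names
y_true_demo = [["QTY", "UNIT", "NAME"], ["QTY", "NAME", "COMMENT"]]
y_pred_demo = [["QTY", "UNIT", "NAME"], ["QTY", "NAME", "NAME"]]
scores = token_level_scores(y_true_demo, y_pred_demo)
print(scores["NAME"])
```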



In [31]:
trainer = pycrfsuite.Trainer(verbose=True)

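# Submit training data to the trainer.
# NOTE: X and y must hold the NYT ingredient features here. The log below reports
# 4719 features, the same as the Instacart product model later in this notebook,
# so this cell appears to have run after X and y were reassigned.
for xseq, yseq in zip(X, y):
    trainer.append(xseq, yseq)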

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('04_models/crf_ing_final.model')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 4719
Seconds required: 0.008

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 200
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 2746.490798
Feature norm: 1.000000
Error norm: 1104.380669
Active features: 4506
Line search trials: 1
Line search step: 0.000414
Seconds required for this iteration: 0.002

***** Iteration #2 *****
Loss: 2369.340357
Feature norm: 1.378173
Error norm: 646.726482
Active features: 4439
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #3 *****
Loss: 2054.669438
Feature norm: 2.107407
Error norm: 504.188817
Active features: 4489
Line search trials: 1
Line search step: 1.000000
Seconds required for this iterat

***** Iteration #161 *****
Loss: 231.673503
Feature norm: 51.373196
Error norm: 2.510484
Active features: 1354
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.002

***** Iteration #162 *****
Loss: 231.665725
Feature norm: 51.371460
Error norm: 2.364115
Active features: 1354
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #163 *****
Loss: 231.658170
Feature norm: 51.378704
Error norm: 2.675686
Active features: 1353
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #164 *****
Loss: 231.648942
Feature norm: 51.375001
Error norm: 2.280568
Active features: 1350
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #165 *****
Loss: 231.642275
Feature norm: 51.380169
Error norm: 2.612104
Active features: 1349
Line search trials: 1
Line search step: 1.000000
Seconds required for thi

In [18]:
recipe_ing_dict, recipe_links_dict, recipe_tags_dict = crf_model_recipe_tagger(recipes_sr_inter)

FileNotFoundError: [Errno 2] No such file or directory: 'crf_ing_final.model'
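
The model above was saved to `04_models/crf_ing_final.model`, while the error reports a lookup of `crf_ing_final.model` with no directory, which suggests the tagger is searching the current working directory. I cannot see inside `crf_model_recipe_tagger`, so this is a guess. A small helper like the hypothetical `resolve_model_path` below can make such a mismatch fail early with a clearer message.

```python
import tempfile
from pathlib import Path

def resolve_model_path(filename, model_dir="04_models"):
    """Return the path to a saved model, raising a descriptive error if it is missing."""
    path = Path(model_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected model at '{path}'. Train and save it first, or adjust "
            f"model_dir (current working directory: {Path.cwd()})."
        )
    return str(path)

# Demo in a temporary directory with an empty placeholder file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "crf_demo.model").touch()
    found = resolve_model_path("crf_demo.model", model_dir=tmp)
    try:
        resolve_model_path("missing.model", model_dir=tmp)
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(found, missing_raised)
```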

In [19]:
recipe_ing_dict

NameError: name 'recipe_ing_dict' is not defined

In [21]:
instacart_prod_train = pd.read_csv('instacart_product_train.csv')

In [22]:
instacart_prod_train.head()

Unnamed: 0,products,pre_description,food,post_description
0,Organic Egg Whites,Organic,Egg Whites,
1,Michigan Organic Kale,Michigan Organic,Kale,
2,Garlic Powder,,Garlic Powder,
3,Coconut Butter,,Coconut Butter,
4,Natural Sweetener,,natural sweetener,


In [23]:
X, y = instacart_prod_crf_feature_creation(instacart_prod_train)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [24]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('04_models/crf_ingredients_initial.model')
tagger = pycrfsuite.Tagger()
tagger.open('04_models/crf_ingredients_initial.model')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 4068
Seconds required: 0.007

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 200
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 2136.451080
Feature norm: 1.000000
Error norm: 876.821285
Active features: 3869
Line search trials: 1
Line search step: 0.000510
Seconds required for this iteration: 0.001

***** Iteration #2 *****
Loss: 1813.951183
Feature norm: 1.427471
Error norm: 521.776769
Active features: 3817
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #3 *****
Loss: 1536.153087
Feature norm: 2.248731
Error norm: 395.625705
Active features: 3857
Line search trials: 1
Line search step: 1.000000
Seconds required for this iterati

***** Iteration #175 *****
Loss: 185.371504
Feature norm: 45.531160
Error norm: 1.013069
Active features: 1114
Line search trials: 2
Line search step: 0.500000
Seconds required for this iteration: 0.001

***** Iteration #176 *****
Loss: 185.367933
Feature norm: 45.527142
Error norm: 0.705533
Active features: 1114
Line search trials: 2
Line search step: 0.500000
Seconds required for this iteration: 0.002

***** Iteration #177 *****
Loss: 185.365158
Feature norm: 45.532402
Error norm: 0.901887
Active features: 1115
Line search trials: 2
Line search step: 0.500000
Seconds required for this iteration: 0.001

***** Iteration #178 *****
Loss: 185.361662
Feature norm: 45.530387
Error norm: 0.638585
Active features: 1115
Line search trials: 2
Line search step: 0.500000
Seconds required for this iteration: 0.002

***** Iteration #179 *****
Loss: 185.359493
Feature norm: 45.534627
Error norm: 1.075739
Active features: 1115
Line search trials: 2
Line search step: 0.500000
Seconds required for thi

<contextlib.closing at 0x1fca91c63d0>

In [25]:
labels = [tagger.tag(xseq) for xseq in X_test]

In [26]:
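# Fit the binarizer once on all labels so that y_true and y_pred share the same columns
mlb = MultiLabelBinarizer()
mlb.fit(list(y_test) + list(labels))
print(classification_report(y_true=mlb.transform(y_test), y_pred=mlb.transform(labels)))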

              precision    recall  f1-score   support

           0       1.00      1.00      1.00       194
           1       0.74      0.55      0.63        47
           2       0.93      0.98      0.95       172

   micro avg       0.95      0.94      0.94       413
   macro avg       0.89      0.85      0.86       413
weighted avg       0.94      0.94      0.94       413
 samples avg       0.95      0.95      0.94       413



In [28]:
trainer = pycrfsuite.Trainer(verbose=True)

# Submit training data to the trainer
for xseq, yseq in zip(X, y):
    trainer.append(xseq, yseq)

# Set the parameters of the model
trainer.set_params({
    # coefficient for L1 penalty
    'c1': 0.1,

    # coefficient for L2 penalty
    'c2': 0.01,  

    # maximum number of iterations
    'max_iterations': 200,

    # whether to include transitions that
    # are possible, but not observed
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('04_models/crf_instacart_products_final.model')

Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 4719
Seconds required: 0.007

L-BFGS optimization
c1: 0.100000
c2: 0.010000
num_memories: 6
max_iterations: 200
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

***** Iteration #1 *****
Loss: 2746.490798
Feature norm: 1.000000
Error norm: 1104.380669
Active features: 4506
Line search trials: 1
Line search step: 0.000414
Seconds required for this iteration: 0.002

***** Iteration #2 *****
Loss: 2369.340357
Feature norm: 1.378173
Error norm: 646.726482
Active features: 4439
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #3 *****
Loss: 2054.669438
Feature norm: 2.107407
Error norm: 504.188817
Active features: 4489
Line search trials: 1
Line search step: 1.000000
Seconds required for this iterat

***** Iteration #168 *****
Loss: 231.618064
Feature norm: 51.373619
Error norm: 2.364111
Active features: 1348
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #169 *****
Loss: 231.611289
Feature norm: 51.381440
Error norm: 2.554260
Active features: 1347
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #170 *****
Loss: 231.602869
Feature norm: 51.376782
Error norm: 2.419340
Active features: 1346
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #171 *****
Loss: 231.595944
Feature norm: 51.384496
Error norm: 2.437415
Active features: 1346
Line search trials: 1
Line search step: 1.000000
Seconds required for this iteration: 0.001

***** Iteration #172 *****
Loss: 231.587451
Feature norm: 51.378354
Error norm: 2.349536
Active features: 1346
Line search trials: 1
Line search step: 1.000000
Seconds required for thi

```python
# To run this step yourself, add this code to a cell. Otherwise, read in the
# precomputed output file instead.
X, token_sr, products_list = crf_basket_feature_creation(instacart_baskets)

tagger = pycrfsuite.Tagger()
tagger.open('../../data/04_models/crf_instacart_products_final.model')

labels = [tagger.tag(xseq) for xseq in X]

instacart_baskets_update = crf_basket_dataset_creation(token_sr, labels, products_list, instacart_baskets)
```

In [29]:
# instacart_baskets_update = pd.read_csv('../../data/05_model_output/baskets_newprodlist_2.csv')

FileNotFoundError: [Errno 2] No such file or directory: '../../data/05_model_output/baskets_newprodlist_2.csv'

In [None]:
# instacart_baskets_update

In [None]:
# A cursory look at the dataset shows a number of entries tagged as food that are not food.
# Remove them so that they don't distort our results. Missing values are dropped with
# notna(), because a comparison such as `!= np.nan` is always True and filters nothing.
# not_food = ['1', '100', '11', '118', '2', '24', '3', '3 cheese', '30', '328', '4', '5',
#             '50', '6', '6 cheese', '60', '7', '70', '8', '85', '9', '95', '97', '98',
#             'a', 'a garlic butter sauce', 'nan']
# mask = (instacart_baskets_update['new_prod_list'].notna()
#         & ~instacart_baskets_update['new_prod_list'].isin(not_food))

# instacart_baskets_filtered = instacart_baskets_update[mask]

In [None]:
print('Number of products after running names through the CRF model: ', instacart_baskets_filtered.new_prod_list.nunique())
print('Number of products in the original list: ', instacart_baskets_filtered.product_name.nunique())
print('Number of unique users: ', instacart_baskets_filtered.user_id.nunique())

In [None]:
# instacart_users_lst = list(instacart_baskets_filtered.user_id.unique())
# len(instacart_users_lst)

In [None]:
# random_usrids_100k = random.sample(instacart_users_lst, 100000)
# mask = instacart_baskets_filtered['user_id'].isin(random_usrids_100k)
# baskets_100k = instacart_baskets_filtered.loc[mask]
# print('Number of User IDs: ', baskets_100k.user_id.nunique())

In [None]:
# baskets_100k

In [None]:
# baskets_complete = baskets_100k.drop(columns=['product_name', 'user_id'])
# baskets_complete.head()

The following converts the dataframe into matrix format (one row per order, one column per product):
```python
basket_matrix_usr = baskets_complete.groupby(['order_id', 'new_prod_list'])['all_ones']\
                    .sum().unstack().reset_index().fillna(0)\
                    .set_index('order_id')
```

Run ```similarities_model.py``` (located in the ```src/d02_features``` folder) from the command line to produce the final similarity matrix. 
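
The script itself is not reproduced in this notebook, so the exact similarity measure it uses is not confirmed here. As an illustration of the general idea, the sketch below computes item-to-item cosine similarity from binary basket membership, which is a common choice for implicit-feedback data. The function name and the toy baskets are hypothetical.

```python
import math
from collections import defaultdict

def item_cosine_similarity(baskets):
    """Cosine similarity between products, based on which orders contain them."""
    orders_by_item = defaultdict(set)
    for order_id, items in baskets.items():
        for item in items:
            orders_by_item[item].add(order_id)
    sims = {}
    for a, orders_a in orders_by_item.items():
        for b, orders_b in orders_by_item.items():
            shared = len(orders_a & orders_b)
            # for 0/1 vectors, cosine = |A and B| / sqrt(|A| * |B|)
            sims[(a, b)] = shared / math.sqrt(len(orders_a) * len(orders_b))
    return sims

# Toy baskets keyed by order id (made-up data)
baskets = {
    1: {"potato", "onion"},
    2: {"potato", "onion", "butter"},
    3: {"butter", "bread"},
}
sims = item_cosine_similarity(baskets)
print(round(sims[("potato", "onion")], 3))   # 1.0
print(round(sims[("potato", "butter")], 3))  # 0.5
```

This all-pairs loop is quadratic in the number of products, so it is only suitable for small examples; a full product catalogue would need a sparse matrix implementation.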

In [None]:
data_matrix = pd.read_csv('../../data/05_model_output/data_matrix_sim.csv')
data_matrix.set_index('Unnamed: 0', inplace=True)

In [None]:
print(data_matrix.loc['potato'].nlargest(11))

In [30]:
print('Choose your meal by inputting 1, 2 or 3')
# print('\n')
meal_input = input("Breakfast: Input 1 || Lunch: Input 2 || Dinner: Input 3: ")
print('\n')
print('Choose your dietary preferences by inputting either 1 or 2: ')
# print('\n')
dietary_preference_input = input("Vegetarian: Input 1 || Omnivore: Input 2: ")
# print('\n')
print('Type in 3 foods you already like')
item1 = input("Item 1: ")
item2 = input("Item 2: ")
item3 = input("Item 3: ")
print('\n')
print('Searching for five recipe recommendations based on both your inputs and similar foods.')

# Map the menu choice to a recipe tag; any other input falls back to Dinner.
# 'Lunch' appears among the scraped tags, but support for it in
# recipe_recommendations_app is assumed, not verified.
if meal_input == "1":
    meal = 'Breakfast'
elif meal_input == "2":
    meal = 'Lunch'
else: 
    meal = 'Dinner'
    
if dietary_preference_input == "1":
    dietary_preference = 'Vegetarian'
else:
    dietary_preference = None
shopping_basket = [item1, item2, item3]
recipe_recommendations_app(shopping_basket, recipe_ing_dict, recipe_tags_dict, meal, dietary_preference, recipe_links_dict)


Choose your meal by inputting 1, 2 or 3
Breakfast: Input 1 || Lunch: Input 2 || Dinner: Input 3: bread


Choose your dietary preferences by inputting either 1 or 2: 
Vegetarian: Input 1 || Omnivore: Input 2: omnivore
Type in 3 foods you already like
Item 1: eggs
Item 2: tomato
Item 3: zucchini


Searching for five recipe recommendations based on both your inputs and similar foods.


NameError: name 'recipe_ing_dict' is not defined