<a href="https://colab.research.google.com/github/JackyChen2T2/REScipe/blob/main/REScipe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# for loading google drive files
from google.colab import drive
drive.mount('/content/drive')

# for data cleaning
import pandas as pd
import numpy as np

# for tokenizing English words and applying GloVe representation
import spacy
!python -m spacy download en

In [17]:
# flags
XLSX_PATH = '/content/drive/My Drive/Colab Notebooks/Project_REScipe/data/total_data.xlsx'      # path to the xlsx file (in Colab's file browser, right-click the file and choose "Copy path")
XLSX_COLUMN = ['rand_int_0', 'name', 'url', 'rand_int_1', 'size', 'ingredient', 'recipe', 'rand_int_2']
ALL_DATA = False

===================== A =====================

This section loads the Excel file (.xlsx) into a pandas DataFrame.


In [4]:
# load the xlsx file from Google Drive to a Pandas Dataframe
xlsx_total_data = pd.read_excel(XLSX_PATH, header=None, names=XLSX_COLUMN)

print(xlsx_total_data)

       rand_int_0  ... rand_int_2
0               0  ...     190001
1               1  ...     190004
2               2  ...     190015
3               3  ...     190032
4               4  ...     190060
...           ...  ...        ...
40333           7  ...     149842
40334           8  ...     149844
40335           9  ...     149846
40336          10  ...     149855
40337          11  ...     149896

[40338 rows x 8 columns]
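
The default printout above elides the middle columns (the `...`). A minimal sketch for viewing every column, using a one-row stand-in DataFrame with the same column names. The row values are made up for illustration and are not taken from total_data.xlsx; in the notebook, `xlsx_total_data` would take the place of `demo`.

```python
import pandas as pd

XLSX_COLUMN = ['rand_int_0', 'name', 'url', 'rand_int_1', 'size', 'ingredient', 'recipe', 'rand_int_2']

# hypothetical stand-in row; not real data from the spreadsheet
demo = pd.DataFrame(
    [[0, 'Example Salad', 'http://example.com', 1, '4 servings', 'lettuce', 'Toss and serve.', 190001]],
    columns=XLSX_COLUMN,
)

# temporarily lift the column limit so that no column is replaced by '...'
with pd.option_context('display.max_columns', None, 'display.width', 200):
    full_view = repr(demo.head())

print(full_view)
```

The options are restored when the `with` block exits, so later cells keep the compact default display.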


===================== B =====================

This section saves the recipe names (column 'name') from the pandas DataFrame to a CSV file in the same Google Drive folder.

This file (total_name.csv) is a temporary lookup table for debugging topic modeling with LDA.

In [5]:
# extract the 'name' column from the pandas DataFrame
# .copy() makes csv_name an independent DataFrame, so adding a column below does not raise SettingWithCopyWarning
csv_name = xlsx_total_data[['name']].copy()
# add an additional index column, in case tokenization or GloVe representation removes certain recipes
csv_name['index'] = csv_name.index

csv_path = '/'.join(XLSX_PATH.split(sep='/')[:-1]) + '/total_name.csv'
csv_name.to_csv(csv_path, index=False)

print(csv_name)


                                             name  index
0                             Winter Endive Salad      0
1                        Stuffed Artichoke Hearts      1
2                      Slow Cooker Moscow Chicken      2
3      Slow Cooker Creole Black Beans and Sausage      3
4              Champagne Sorbet with Berry Medley      4
...                                           ...    ...
40333                           Hawaiian Iced Tea  40333
40334                    Belgian Endive au Gratin  40334
40335                             Hot Cider Punch  40335
40336    Hash Brown Casserole for the Slow Cooker  40336
40337                       Samhain Pumpkin Bread  40337

[40338 rows x 2 columns]


===================== C =====================

This section loads total_name.csv, tokenizes the recipe names, and saves the result to a new CSV file in the same folder.

This file (token_name.csv) is used for topic modeling with LDA and, optionally, for GloVe representation.
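
One thing to keep in mind when reading token_name.csv back: a column of Python lists is written to CSV as the string form of each list, so a later cell has to parse it again. A minimal sketch, assuming a two-row stand-in for the tokenized table (the tokens are illustrative) and an in-memory buffer in place of Google Drive:

```python
import ast
import io

import pandas as pd

# hypothetical stand-in for the tokenized names; not real notebook output
demo = pd.DataFrame({
    'token_name': [['winter', 'endive', 'salad'], ['hot', 'cider', 'punch']],
    'index': [0, 40335],
})

# write to an in-memory buffer instead of a file on Google Drive
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# converters parses each 'token_name' cell back into a list while reading
restored = pd.read_csv(buffer, converters={'token_name': ast.literal_eval})
print(restored['token_name'][0])
```

Without the converter, each cell would come back as a plain string such as `"['winter', 'endive', 'salad']"`.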

In [58]:
csv_path = '/'.join(XLSX_PATH.split(sep='/')[:-1]) + '/total_name.csv'
csv_name = pd.read_csv(csv_path)

# load spaCy English nlp model
nlp = spacy.load('en')

# tokenize, remove non-ascii chars, lemmatize, remove stop words (safe actions)
list_token_name = []
for i, row in csv_name.iterrows():
  token_name = []
  # tokenize and lemmatize
  for word in nlp(row['name']):
    # remove non-ascii chars (a no-op for strings that are already pure ascii)
    text = word.text.encode('ascii', 'ignore').decode()
    lemma = word.lemma_.lower().encode('ascii', 'ignore').decode()
    # remove stop words and tokens of two characters or fewer
    if len(text) > 2 and len(lemma) > 2 and not word.is_stop:
      token_name.append(lemma)
  # keep the original index so removed recipes can be traced back to total_name.csv
  if token_name:
    list_token_name.append([token_name, row['index']])
  else:
    print(row['index'], '\t removed by tokenization.')
  # progress report every 1000 rows
  if i % 1000 == 0:
    print('Tokenizing:', i, '~', i+1000)

csv_token_name = pd.DataFrame(list_token_name, columns=['token_name', 'index'])
print(csv_token_name)

csv_path = '/'.join(XLSX_PATH.split(sep='/')[:-1]) + '/token_name.csv'
csv_token_name.to_csv(csv_path, index=False)

108	SwansonÂ	-->	swanson
147	SoufflÃ	-->	souffl
175	French'sÂ	-->	french's
175	Chickenâ„¢	-->	chicken
177	French'sÂ	-->	french's
179	French'sÂ	-->	french's
223	KraftÂ	-->	kraft
225	KraftÂ	-->	kraft
227	KraftÂ	-->	kraft
234	PHILLYÂ	-->	philly
264	KraftÂ	-->	kraft
266	KraftÂ	-->	kraft
318	RaguÂ	-->	ragu
319	RaguÂ	-->	ragu
320	RaguÂ	-->	ragu
323	RaguÂ	-->	ragu
345	LACTAIDÂ	-->	lactaid
347	LACTAIDÂ	-->	lactaid
399	DOVEÂ	-->	dove
401	DOVEÂ	-->	dove
729	WonderÂ	-->	wonder
746	SmokiesÂ	-->	smokies
747	SmokiesÂ	-->	smokies
825	Treatsâ„¢	-->	treats
826	Treatsâ„¢	-->	treats
827	Treatsâ„¢	-->	treats
962	JifÂ	-->	jif
993	GhirardelliÂ	-->	ghirardelli
994	GhirardelliÂ	-->	ghirardelli
995	GhirardelliÂ	-->	ghirardelli
996	GhirardelliÂ	-->	ghirardelli
997	GhirardelliÂ	-->	ghirardelli
998	GhirardelliÂ	-->	ghirardelli
999	GhirardelliÂ	-->	ghirardelli
1000	GhirardelliÂ	-->	ghirardelli
1001	GhirardelliÂ	-->	ghirardelli
1002	GhirardelliÂ	-->	ghirardelli
1003	GhirardelliÂ	-->	ghirardelli
1004	GhirardelliÂ	--
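
The log above shows the side effect of the ASCII filter: every non-ASCII character is dropped, so an accented name loses the letter itself (`souffl`). A small sketch of that step in isolation, plus a hypothetical alternative (`fold_to_ascii`, not part of the notebook) that decomposes accented letters first so the base letter survives:

```python
import unicodedata

def strip_non_ascii(text):
    # same operation as in the tokenization loop: drop every non-ASCII character
    return text.encode('ascii', 'ignore').decode()

def fold_to_ascii(text):
    # hypothetical alternative: NFD splits an accented letter into base letter + combining mark,
    # and the ASCII encode then drops only the mark
    return unicodedata.normalize('NFD', text).encode('ascii', 'ignore').decode()

print(strip_non_ascii('Soufflé'))  # Souffl
print(fold_to_ascii('Soufflé'))    # Souffle
print(fold_to_ascii('Kraft®'))     # Kraft
```

Folding only helps when the names are decoded correctly. Strings such as `SoufflÃ` in the log suggest the source text may already be mis-decoded, in which case the encoding would have to be repaired before folding.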

In [None]:
# remove overly frequent/infrequent words (risky actions, require visualization)
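
One possible shape for this step is a document-frequency filter. The sketch below is an assumption and not the notebook's implementation: the function name `filter_by_document_frequency`, the thresholds `min_docs` and `max_doc_ratio`, and the demo tokens are all hypothetical, and the thresholds would need to be chosen after inspecting the actual frequency distribution.

```python
from collections import Counter

def filter_by_document_frequency(token_lists, min_docs=2, max_doc_ratio=0.5):
    # count in how many names each token appears (set() so repeats within one name count once)
    doc_freq = Counter(token for tokens in token_lists for token in set(tokens))
    n_docs = len(token_lists)
    # keep tokens that are neither too rare nor too common
    keep = {t for t, n in doc_freq.items() if n >= min_docs and n / n_docs <= max_doc_ratio}
    return [[t for t in tokens if t in keep] for tokens in token_lists]

# hypothetical tokenized names
demo = [
    ['recipe', 'chicken', 'salad'],
    ['recipe', 'chicken', 'soup'],
    ['recipe', 'cider', 'punch'],
    ['recipe', 'pumpkin', 'bread'],
]
filtered = filter_by_document_frequency(demo, min_docs=2, max_doc_ratio=0.5)
print(filtered)  # [['chicken'], ['chicken'], [], []]
```

Names can end up with no tokens at all after filtering, so they would need the same handling as names removed by tokenization.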