# Word Embeddings (Non-semantic)

**Count-based Vector Space Models (Non-semantic):**

- One-hot encoding
- Bag-of-Words model (BoW)
- N-gram Language Models
- Co-Occurrence Counts/Vectors
- TF-IDF

In [None]:
!pip install keras

In [None]:
!pip install tensorflow

In [None]:
!pip install altair

In [30]:
import pandas as pd
import numpy as np
import nltk 
import altair as alt

import sklearn
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from scipy.linalg import svd

from keras.utils import to_categorical

import warnings
warnings.filterwarnings('ignore')

In [14]:
print(sklearn.__version__)

1.0.1


In [15]:
corpus = ['''Plant-based cuisine continues to trend across the country (and around the world), with many operators jumping aboard the veggie train. Planta is certainly aboard, though similar to a number of other meat-free restaurants, Planta prefers to avoid the "vword" (vegan), is careful to not alienate meat eaters, and instead describes itself as a plant-forward concept. Unlike some plant-based concepts that simply swap out burgers or steaks with tofu or other meat substitutes like tempeh, Planta takes a different approach in showcasing the versatility of vegetables in creative dishes, making them the star of the show ("crab" cakes made from hearts of palm, eggplant lasagna). Planta has succeeded in attracting both omnivores and herbivores, catering successfully to those following the flexitarian trend. The Star says Planta "proves plants are posh," serving up unique renditions of classic foods like the 18-Carrot Dog that features smoky carrots dressed to the Twitch ballpark mustard and pickle spears. Chef David Lee's menu also leverages vegetables to showcase flavours from across the globe – take inspiration from Kimchi Spring Rolls or Chickpea Curry. The original Planta restaurant, located in Toronto's Yorkville neighbourhood, opened in late 2016. The next year, the plant-based concept opened a fast-casual offshoot called Planta Burger. This year, Planta made the leap to the U.S., recently opening a restaurant in Miami's South Beach.''',
          '''JuiceBrothers (which has locations in the Netherlands and two in the U.S.inNew York) is a convenience-driven café known for its bottled cold-pressed juices, cleanses, and shots, as well as health-forward bowls that can be packaged for later consumption. In both cases, ingredients are displayed front and centre on the packaging to show off on-trend clean labels – according to our data, 41% of U.S.consumers believe it’s important that clean labels be placed on the front of packaging instead of the back. Across the menu, the chain’s products are organic and vegan. Juice flavours range from Matcha Moringa Mylk (combines cashew milk, agave, lucuma, matcha powder, moringa powder, vanilla, cardamom, cinnamon, and nutmeg) to ColdBrew Latte (made simply with cold brew coffee and vanilla almond milk). Common breakfast dishes like granola also get a healthy spin from buckwheat, an ancient grain that’s gluten-free, contains all eight essential amino acids and has grown more than 10% on U.S.menus in the past year alone (Datassential MenuTrends ). Last year JuiceBrothers partnered with the Brooklyn -based ice cream chain Van Leeuwen to offer vegan ice cream at the newest JuiceBrothers location in Amsterdam for a limited time. The alternative ice cream was made using coconut milk, cashew milk, coconut oil, cane sugar, cocoa butter, and carob bean, and flavours included turmeric as well as trendy spirulina with pieces of matcha cake.''',
          '''Your guide to Chicago’s hottest new restaurants ahead of the National Restaurant Association show. White Oak Tavern Sink | Swim Boeufhaus 2 MAY 2015 DINE AROUND: CHICAGO Last year Datassential took you to the on-trend neighborhood of Logan Square ahead of the 2014 National Restaurant Association Show, and now we are once again taking you on an immersion tour of Chicago. But this year we are taking you to the city's newest restaurants opened in 2014 or 2015. These are the cutting-edge restaurant concepts taking advantage of the hottest trends. Chicago has long been known as a centre of culinary innovation, with chefs like Grant Achatz and the late Homaro Cantu pushing the envelope and redefining the American plate. And this issue is full of restaurants that take chances. At Intro they are taking a chance with a new chef every few months, allowing them full reign over the menu. At Duck Inn, they are taking a chance that Chicago's working-class Bridgeport neighbourhood will embrace an adventurous, fine dining menu filled with dishes like potted foie gras with guava jelly or rice cake fingers with kimchi sauce as a bar snack. At Baker Miller they are taking a chance that customers will embrace the house-milled grains and flours they use in their baked goods, giving them control over the flavour and texture, even if it comes at a higher cost. And you'll find a variety of adventurous dishes and ingredients in this issue, inviting you to take a chance on something new – chicken hearts at Momotaro, makgeolli (Korean rice wine) at Parachute, corned pig tongue at Tete Charcuterie, or a whole pig's head for two at Charlatan. You can take a chance on a new cuisine, like the Piemontese -focused Osteria Langhe or the Basque-inspired Salero and mfk, or a new concept, like the beer cafe Beermiscuous. Or go for broke with some of the city's over-the-top creations -- the 5 lb. bacon bomb at Kaiser Tiger, or the $195.00 porterhouse at RPM Steak. 
Of course, Chicago is also home to the type of comforting, crowd-pleasing restaurants that the Midwest is known for – hearty chicken paprikash at Bohemian House, filling Tex -Mex at updated diner Dove's Luncheonette, inventive pizzas at Parlor Pizza Bar, old-school Italian at Formento's, or a whole roasted chicken from River Roast. We also bring you extensive consumer data on the concepts found in this issue and the dishes found on Chicago restaurant menus. And we have included a handy fast casual directory in the back: a curated list of some of the notable fast-casual concepts found in Chicago, one of which may just be the next big thing. Just in time for this month's 2015 National Restaurant Association Show, this special edition of Datassential's Dine Around takes you on a trendspotting immersion tour of Chicago, highlighting the newest restaurants, concepts, and trends in the city. If you are interested in learning more about how you can put food trends to work for you, ask us about meeting a Datassential representative while you are in town.''']

---

## One-hot encoding

In [160]:
# Reuse the three documents in `corpus` as our samples.
# A list (rather than a set) keeps the documents in a fixed order.
samples = corpus

# Dictionary mapping each word (key) to its integer index (value)
token_index = {}

# Counter tracking how many key-value pairs are in token_index
counter = 0

# The outer loop iterates over the three texts in samples; the inner loop
# splits each text on whitespace and iterates over the resulting words.
for sample in samples:
  for considered_word in sample.split():
    if considered_word not in token_index:

      # The word is not yet in token_index, so add it.
      # Indices start at 1, leaving index 0 unused.
      token_index.update({considered_word : counter + 1})

      # Update the counter
      counter = counter + 1

In [237]:
print(token_index)

{'Plant-based': 1, 'cuisine': 2, 'continues': 3, 'to': 4, 'trend': 5, 'across': 6, 'the': 7, 'country': 8, '(and': 9, 'around': 10, 'world),': 11, 'with': 12, 'many': 13, 'operators': 14, 'jumping': 15, 'aboard': 16, 'veggie': 17, 'train.': 18, 'Planta': 19, 'is': 20, 'certainly': 21, 'aboard,': 22, 'though': 23, 'similar': 24, 'a': 25, 'number': 26, 'of': 27, 'other': 28, 'meat-free': 29, 'restaurants,': 30, 'prefers': 31, 'avoid': 32, '"vword"': 33, '(vegan),': 34, 'careful': 35, 'not': 36, 'alienate': 37, 'meat': 38, 'eaters,': 39, 'and': 40, 'instead': 41, 'describes': 42, 'itself': 43, 'as': 44, 'plant-forward': 45, 'concept.': 46, 'Unlike': 47, 'some': 48, 'plant-based': 49, 'concepts': 50, 'that': 51, 'simply': 52, 'swap': 53, 'out': 54, 'burgers': 55, 'or': 56, 'steaks': 57, 'tofu': 58, 'substitutes': 59, 'like': 60, 'tempeh,': 61, 'takes': 62, 'different': 63, 'approach': 64, 'in': 65, 'showcasing': 66, 'versatility': 67, 'vegetables': 68, 'creative': 69, 'dishes,': 70, 'makin

Next, we create a tensor whose elements are all 0.

In [230]:
# Maximum number of words considered per sample
max_length = 10000
# Create a 3-D tensor named results, initialized to 0, with shape
# (number of samples, max_length, vocabulary size + 1)
results  = np.zeros(shape = (len(samples),
                            max_length,
                            max(token_index.values()) + 1))

In [163]:
# Fill in the one-hot vector for each word.
# Iterate over the samples together with their position i
for i, sample in enumerate(samples):

  # Iterate over the words of the sample together with their position j
  for j, considered_word in enumerate(sample.split()):

    # Look up the index of considered_word in token_index
    index = token_index.get(considered_word)

    # In the zero tensor results, set the element at position [i, j, index] to 1
    results[i, j, index] = 1.

In [239]:
list(enumerate(sample.split()[:25])) 

[(0, 'Your'),
 (1, 'guide'),
 (2, 'to'),
 (3, 'Chicago’s'),
 (4, 'hottest'),
 (5, 'new'),
 (6, 'restaurants'),
 (7, 'ahead'),
 (8, 'of'),
 (9, 'the'),
 (10, 'National'),
 (11, 'Restaurant'),
 (12, 'Association'),
 (13, 'show.'),
 (14, 'White'),
 (15, 'Oak'),
 (16, 'Tavern'),
 (17, 'Sink'),
 (18, '|'),
 (19, 'Swim'),
 (20, 'Boeufhaus'),
 (21, '2'),
 (22, 'MAY'),
 (23, '2015'),
 (24, 'DINE')]

In [240]:
for j, considered_word in list(enumerate(sample.split()[:25])):
  print(j, considered_word)

0 Your
1 guide
2 to
3 Chicago’s
4 hottest
5 new
6 restaurants
7 ahead
8 of
9 the
10 National
11 Restaurant
12 Association
13 show.
14 White
15 Oak
16 Tavern
17 Sink
18 |
19 Swim
20 Boeufhaus
21 2
22 MAY
23 2015
24 DINE


In [166]:
results

array([[[0., 1., 0., ..., 0., 0., 0.],
        [0., 0., 1., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])
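The tensor above is too large to inspect by eye, so the same procedure can be sketched on a toy sentence. This is a minimal illustration under stated assumptions: the sentence and the names `tokens`, `toy_index` and `one_hot` are hypothetical and not part of the notebook.

```python
import numpy as np

# Toy sentence (hypothetical, not taken from the corpus above)
tokens = "the cat sat on the mat".split()

# Build the vocabulary: each distinct word gets the next free index, starting at 1
# (index 0 stays unused, mirroring the token_index convention above)
toy_index = {}
for word in tokens:
    if word not in toy_index:
        toy_index[word] = len(toy_index) + 1

# One row per token position, one column per index (including the unused column 0)
one_hot = np.zeros((len(tokens), len(toy_index) + 1))
for position, word in enumerate(tokens):
    one_hot[position, toy_index[word]] = 1.0

print(toy_index)
print(one_hot)
```

Each row contains exactly one 1, and the two occurrences of "the" (rows 0 and 4) share the same vector.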

### One Hot Encode with `scikit-learn`

In this example, we use the encoders from the scikit-learn library: the `LabelEncoder` for creating an integer encoding of labels and the `OneHotEncoder` for creating a one-hot encoding of the integer-encoded values. Note that each label here is an entire document from `corpus`, not an individual word.

By default, the `OneHotEncoder` class returns a more efficient sparse encoding. This may not be suitable for some applications, such as use with the `Keras` deep learning library. In this case, we disable the sparse return type by setting the `sparse=False` argument (from scikit-learn 1.2 onwards this argument is named `sparse_output`).

In [167]:
values = np.array(corpus)

# Integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

# Binary encode
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

# Invert first example
inverted = label_encoder.inverse_transform([np.argmax(onehot_encoded[0, :])])
print(inverted)

[1 0 2]
[[0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
['Plant-based cuisine continues to trend across the country (and around the world), with many operators jumping aboard the veggie train. Planta is certainly aboard, though similar to a number of other meat-free restaurants, Planta prefers to avoid the "vword" (vegan), is careful to not alienate meat eaters, and instead describes itself as a plant-forward concept. Unlike some plant-based concepts that simply swap out burgers or steaks with tofu or other meat substitutes like tempeh, Planta takes a different approach in showcasing the versatility of vegetables in creative dishes, making them the star of the show ("crab" cakes made from hearts of palm, eggplant lasagna). Planta has succeeded in attracting both omnivores and herbivores, catering successfully to those following the flexitarian trend. The Star says Planta "proves plants are posh," serving up unique renditions of classic foods like the 18-Carrot Dog that features smoky carrots dres
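The cell above encodes whole documents as labels. To see the same encoder applied to individual words, here is a small sketch under stated assumptions: the token list and variable names are hypothetical, and the default sparse output is converted with `.toarray()` so the code does not depend on the name of the sparse-related argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy tokens; OneHotEncoder expects a 2-D array (one column per feature)
toy_tokens = np.array(["plant", "burger", "plant", "vegan"]).reshape(-1, 1)

encoder = OneHotEncoder()                    # default output is a SciPy sparse matrix
encoded = encoder.fit_transform(toy_tokens)
dense = encoded.toarray()                    # convert to a dense NumPy array

print(encoder.categories_[0])                # learned categories, in sorted order
print(dense)

# Map the one-hot rows back to the original tokens
recovered = encoder.inverse_transform(dense)
print(recovered.ravel())
```

Converting with `.toarray()` is convenient for small examples; for a large vocabulary the sparse form is usually the more memory-friendly choice.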

---

## Bag-of-Words model (BoW)
> `CountVectorizer`

In [168]:
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(corpus)

In [169]:
print(vectorizer.get_feature_names_out())

['00' '10' '18' '195' '2014' '2015' '2016' '41' 'aboard' 'according'
 'achatz' 'acids' 'advantage' 'adventurous' 'agave' 'ahead' 'alienate'
 'allowing' 'almond' 'alternative' 'american' 'amino' 'amsterdam'
 'ancient' 'approach' 'ask' 'association' 'attracting' 'avoid' 'bacon'
 'baked' 'baker' 'ballpark' 'bar' 'based' 'basque' 'beach' 'bean' 'beer'
 'beermiscuous' 'believe' 'big' 'boeufhaus' 'bohemian' 'bomb' 'bottled'
 'bowls' 'breakfast' 'brew' 'bridgeport' 'bring' 'broke' 'brooklyn'
 'buckwheat' 'burger' 'burgers' 'butter' 'cafe' 'café' 'cake' 'cakes'
 'called' 'cane' 'cantu' 'cardamom' 'careful' 'carob' 'carrot' 'carrots'
 'cases' 'cashew' 'casual' 'catering' 'centre' 'certainly' 'chain'
 'chance' 'chances' 'charcuterie' 'charlatan' 'chef' 'chefs' 'chicago'
 'chicken' 'chickpea' 'cinnamon' 'city' 'class' 'classic' 'clean'
 'cleanses' 'cocoa' 'coconut' 'coffee' 'cold' 'coldbrew' 'combines'
 'comes' 'comforting' 'common' 'concept' 'concepts' 'consumer' 'consumers'
 'consumption' 'cont

In [170]:
counts = pd.DataFrame(X.toarray(),
                      columns=vectorizer.get_feature_names_out())
counts



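|  | 00 | 10 | 18 | 195 | 2014 | 2015 | 2016 | 41 | aboard | according | ... | versatility | vword | white | wine | work | working | world | year | york | yorkville |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 1 | 0 |
| 2 | 1 | 0 | 0 | 1 | 2 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 2 | 0 | 0 |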


In [171]:
counts.T.sort_values(by=0, ascending=False).head(25)

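|  | 0 | 1 | 2 |
|---|---|---|---|
| planta | 8 | 0 | 0 |
| plant | 4 | 0 | 0 |
| based | 3 | 1 | 0 |
| meat | 3 | 0 | 0 |
| star | 2 | 0 | 0 |
| trend | 2 | 1 | 1 |
| opened | 2 | 0 | 1 |
| restaurant | 2 | 0 | 5 |
| concept | 2 | 0 | 1 |
| vegetables | 2 | 0 | 0 |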


In [172]:
print(X.toarray())

[[0 0 1 ... 2 0 1]
 [0 1 0 ... 2 1 0]
 [1 0 0 ... 2 0 0]]
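To make the counts above less opaque, the core of a bag-of-words model can be sketched by hand with `collections.Counter`. This is only an approximation of what `CountVectorizer` does: it skips lowercasing, punctuation handling and stop-word removal, and the toy documents and names are hypothetical.

```python
from collections import Counter

# Hypothetical toy documents, already lowercased and free of punctuation
toy_docs = ["plant burger plant", "vegan burger"]

# Shared vocabulary across all documents, sorted alphabetically
toy_vocab = sorted({word for doc in toy_docs for word in doc.split()})

# One count vector per document, with entries in vocabulary order
bow = []
for doc in toy_docs:
    word_counts = Counter(doc.split())
    bow.append([word_counts[word] for word in toy_vocab])

print(toy_vocab)  # ['burger', 'plant', 'vegan']
print(bow)        # [[1, 2, 0], [1, 0, 1]]
```

Word order inside each document is discarded; only the per-document counts survive.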


---

## N-gram Language Models

> `CountVectorizer`

`analyzer`: {'word', 'char', 'char_wb'} or callable, default='word'

Whether the features should be made of word n-grams or character n-grams. The option **_'char_wb'_** creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with a space.
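The difference between `'char'` and `'char_wb'` is easiest to see on a tiny string. The sketch below uses `CountVectorizer.build_analyzer()`, which returns the function the vectorizer applies to each document; the input text and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

text = "ab cd"  # hypothetical two-word input

# Character bigrams over the whole string (may span the space between words)
char_analyzer = CountVectorizer(analyzer="char", ngram_range=(2, 2)).build_analyzer()

# Character bigrams built word by word, each word padded with a space on both sides
char_wb_analyzer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 2)).build_analyzer()

char_ngrams = char_analyzer(text)
char_wb_ngrams = char_wb_analyzer(text)

print(char_ngrams)
print(char_wb_ngrams)
```

With `'char_wb'`, every word contributes a leading and a trailing padded bigram (such as `' a'` and `'d '`), and no n-gram ever spans two words.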

In [173]:
n_gram_vectorizer = CountVectorizer(stop_words='english', ngram_range=(2, 3), analyzer='word')
X2 = n_gram_vectorizer.fit_transform(corpus)

In [242]:
print(n_gram_vectorizer.get_feature_names_out()[:100].tolist())

['00 porterhouse', '00 porterhouse rpm', '10 menus', '10 menus past', '18 carrot', '18 carrot dog', '195 00', '195 00 porterhouse', '2014 2015', '2014 2015 cutting', '2014 national', '2014 national restaurant', '2015 cutting', '2015 cutting edge', '2015 dine', '2015 dine chicago', '2015 national', '2015 national restaurant', '2016 year', '2016 year plant', '41 consumers', '41 consumers believe', 'aboard similar', 'aboard similar number', 'aboard veggie', 'aboard veggie train', 'according data', 'according data 41', 'achatz late', 'achatz late homaro', 'acids grown', 'acids grown 10', 'advantage hottest', 'advantage hottest trends', 'adventurous dishes', 'adventurous dishes ingredients', 'adventurous fine', 'adventurous fine dining', 'agave lucuma', 'agave lucuma matcha', 'ahead 2014', 'ahead 2014 national', 'ahead national', 'ahead national restaurant', 'alienate meat', 'alienate meat eaters', 'allowing reign', 'allowing reign menu', 'almond milk', 'almond milk common', 'alternative ic

In [175]:
counts = pd.DataFrame(X2.toarray(),
                      columns=n_gram_vectorizer.get_feature_names_out())
counts

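|  | 00 porterhouse | 00 porterhouse rpm | 10 menus | 10 menus past | 18 carrot | 18 carrot dog | 195 00 | 195 00 porterhouse | 2014 2015 | 2014 2015 cutting | ... | year plant | year plant based | year planta | year planta leap | year taking | year taking city | york convenience | york convenience driven | yorkville neighbourhood | yorkville neighbourhood opened |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| 2 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |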


In [176]:
counts.T.sort_values(by=0, ascending=False).head(25)

|  | 0 | 1 | 2 |
|---|---|---|---|
| plant based | 3 | 0 | 0 |
| cuisine continues trend | 1 | 0 | 0 |
| planta succeeded attracting | 1 | 0 | 0 |
| planta takes | 1 | 0 | 0 |
| planta takes different | 1 | 0 | 0 |
| plants posh | 1 | 0 | 0 |
| plants posh serving | 1 | 0 | 0 |
| curry original planta | 1 | 0 | 0 |
| curry original | 1 | 0 | 0 |
| posh serving | 1 | 0 | 0 |


In [177]:
print(X2.toarray())

[[0 0 0 ... 0 1 1]
 [0 0 1 ... 1 0 0]
 [1 1 0 ... 0 0 0]]
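Word n-grams themselves are simple to build by hand. The following sketch assumes the tokens have already been lowercased and filtered; this matters because `CountVectorizer` removes stop words before forming n-grams, which is why features such as 'aboard similar number' appear above. The helper `word_ngrams` and the toy tokens are hypothetical.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical tokens, assumed to be lowercased with stop words already removed
toy_tokens = "plant based concept opened".split()

bigrams = word_ngrams(toy_tokens, 2)
trigrams = word_ngrams(toy_tokens, 3)

print(bigrams)   # ['plant based', 'based concept', 'concept opened']
print(trigrams)  # ['plant based concept', 'based concept opened']
```

Setting `ngram_range=(2, 3)` corresponds to collecting both of these lists as features.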


---

## Co-Occurrence Counts/Vectors

Creating <mark>vocabulary</mark>

In [17]:
vocab = list(set((" ".join(corpus)).split()))

In [19]:
text_data = []

for i in corpus:
    text_data.append(i.split())

Find the number of words in the shortest document; the window size must be smaller than this.

In [20]:
def MinWords():
    numWords = [len(sentence.split()) for sentence in corpus]
    return min(numWords)

In [21]:
MinWords()

220

In [22]:
def win_size():
    # Keep asking until the window is smaller than the shortest document
    while True:
        size = int(input('Enter window size: '))
        if size < MinWords():
            return size
        print('Operation not possible, select a smaller window size')

In [23]:
window_size = win_size()

Enter window size:  5


In [61]:
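words = []

for sentence in corpus:
    # Use split() as in the vocabulary above so the tokens match its entries
    sentence = sentence.split()
    for k in range(len(sentence) - window_size + 1):
        # Pair the word at position k with each of the next window_size words
        for m in range(k + 1, k + window_size + 1):
            if m <= len(sentence) - 1:
                words.append([sentence[k], sentence[m]])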

In [62]:
len(words)

4727

In [63]:
# Add every pair in reverse order as well, so the resulting matrix is symmetric
words_1 = [x[::-1] for x in words]
words.extend(words_1)

In [64]:
len(words)

9454

Creating a matrix filled with zeros

In [40]:
a = np.zeros((len(vocab), len(vocab)))

Creating <mark>Co-occurrence matrix</mark>

In [41]:
df = pd.DataFrame(a, index = vocab, columns = vocab)

In [65]:
for word in words:
    df.at[word[0], word[1]] += 1
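The full matrix is large and mostly zeros, so a toy version may help in checking the logic. This is a simplified variant rather than the notebook's exact loop: every token acts as a centre word, the mirrored pair is added in the same step, and the sentence, window and names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy sentence and window size
toy_tokens = "plant burger vegan plant".split()
toy_vocab = sorted(set(toy_tokens))
toy_window = 1  # only the immediate right-hand neighbour counts

# Square matrix of zeros, labelled by vocabulary on both axes
cooc = pd.DataFrame(np.zeros((len(toy_vocab), len(toy_vocab))),
                    index=toy_vocab, columns=toy_vocab)

for k, centre in enumerate(toy_tokens):
    # Look at up to toy_window tokens to the right of the centre word
    for neighbour in toy_tokens[k + 1:k + 1 + toy_window]:
        cooc.at[centre, neighbour] += 1  # centre followed by neighbour
        cooc.at[neighbour, centre] += 1  # mirrored pair keeps the matrix symmetric

print(cooc)
```

Each of the three adjacent pairs contributes two entries, so the matrix sums to 6 and equals its own transpose.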

In [67]:
df

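|  | concept, | health-forward | careful | bacon | Tex | classic | they | pickle | food | inspiration | ... | whole | prefers | some | centre | according | with | been | certainly | U.S., | broke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| concept, | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| health-forward | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| careful | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| bacon | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Tex | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| with | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 |
| been | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| certainly | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| U.S., | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |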


### Decomposing the co-occurrence matrix

In [75]:
# define a matrix
A = np.array(df)
print('Matrix A is: \n')
print(A)

Matrix A is: 

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


<mark>**Singular Value Decomposition (SVD)**</mark>

In [73]:
U, s, VT = svd(A)
print('Matrix U is: \n')
print(U)

Matrix U is: 

[[-0.01818625  0.02594168  0.01000554 ... -0.03512367  0.05528056
  -0.0467806 ]
 [-0.00633137  0.01814249 -0.00632434 ...  0.00363192  0.00685477
  -0.00081261]
 [-0.01279474 -0.01434892  0.01992181 ...  0.0073149  -0.08159316
   0.07990918]
 ...
 [-0.01668339  0.00591405  0.02317581 ... -0.02266221 -0.01665334
  -0.00829192]
 [-0.02737716 -0.0078941   0.03773913 ... -0.00250764  0.02064424
  -0.06777025]
 [-0.01642932 -0.00641956  0.00385237 ... -0.00400586 -0.04822946
   0.00550941]]


**Diagonal Matrix**

In [76]:
Sigma = np.diag(s)
print('Matrix Sigma is: \n')
print(Sigma)

Matrix Sigma is: 

[[1.51778217e+02 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 5.25071407e+01 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 4.63123890e+01 ... 0.00000000e+00
  0.00000000e+00 0.00000000e+00]
 ...
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 3.28517278e-02
  0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  2.88142799e-02 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 1.54562001e-02]]


In [77]:
U.shape,Sigma.shape,VT.shape

((557, 557), (557, 557), (557, 557))
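A common next step with these three factors is to check that they reproduce the original matrix and then keep only the leading singular values to obtain short word vectors. The sketch below illustrates this on a toy matrix; it uses `np.linalg.svd` in place of `scipy.linalg.svd` (both return `U`, the singular values, and the transposed right factor), and the matrix, the choice `k = 2` and all names are hypothetical.

```python
import numpy as np

# Hypothetical symmetric co-occurrence matrix for a three-word vocabulary
toy_A = np.array([[0., 2., 1.],
                  [2., 0., 1.],
                  [1., 1., 0.]])

U_toy, s_toy, VT_toy = np.linalg.svd(toy_A)

# Multiplying the three factors back together recovers the original matrix
reconstructed = U_toy @ np.diag(s_toy) @ VT_toy
print(np.allclose(reconstructed, toy_A))

# Keep only the k largest singular values to get k-dimensional word vectors
k = 2
word_vectors = U_toy[:, :k] * s_toy[:k]
print(word_vectors.shape)
```

The singular values come back sorted from largest to smallest, so slicing the first `k` columns keeps the directions that capture the most variation in the counts.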

---

## TF-IDF

Initialize `TfidfVectorizer` with the desired parameters (here the default smoothing and normalization).

> If `input='content'`, the input is expected to be a sequence of items that can be of type string or byte.

In [214]:
tfidf_vectorizer = TfidfVectorizer(input='content', stop_words='english')

Run TfidfVectorizer on our corpus

In [215]:
tfidf_vector = tfidf_vectorizer.fit_transform(corpus)
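To see what the default smoothing and normalization amount to, the weights can be recomputed by hand on a toy corpus and compared with the library's output. The sketch assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each row); the toy documents and variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus
toy_docs = ["plant burger plant", "vegan burger", "vegan menu"]

# Raw term counts, one row per document
toy_counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = toy_counts.shape[0]

# Document frequency: number of documents that contain each term
doc_freq = (toy_counts > 0).sum(axis=0)

# Smoothed inverse document frequency
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale every row to unit Euclidean length
raw = toy_counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)

# Compare against the library's result with default parameters
reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))
```

Terms that occur in fewer documents receive a larger IDF, which is why a word found in a single document can outweigh a more frequent word that is spread across the corpus.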

Make a DataFrame out of the resulting TF-IDF matrix, setting the feature names (words) as columns and the documents as rows.

In [216]:
tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=corpus, columns=tfidf_vectorizer.get_feature_names_out())



Add a row for document frequency, i.e. the number of documents in which each word appears.

In [217]:
tfidf_df.loc['Document Frequency'] = (tfidf_df > 0).sum()

In [218]:
tfidf_slice = tfidf_df[['restaurant', 'menu', 'plant', 'vegan', 'cuisine', 'burger', 'kimchi','bar', 
                        'year', 'trend']]
tfidf_slice.sort_index().round(decimals=2)

Unnamed: 0,restaurant,menu,plant,vegan,cuisine,burger,kimchi,bar,year,trend
Document Frequency,2.0,3.0,1.0,2.0,2.0,1.0,2.0,1.0,3.0,3.0
"JuiceBrothers (which has locations in the Netherlands and two in the U.S.inNew York) is a convenience-driven café known for its bottled cold-pressed juices, cleanses, and shots, as well as health-forward bowls that can be packaged for later consumption. In both cases, ingredients are displayed front and centre on the packaging to show off on-trend clean labels – according to our data, 41% of U.S.consumers believe it’s important that clean labels be placed on the front of packaging instead of the back. Across the menu, the chain’s products are organic and vegan. Juice flavours range from Matcha Moringa Mylk (combines cashew milk, agave, lucuma, matcha powder, moringa powder, vanilla, cardamom, cinnamon, and nutmeg) to ColdBrew Latte (made simply with cold brew coffee and vanilla almond milk). Common breakfast dishes like granola also get a healthy spin from buckwheat, an ancient grain that’s gluten-free, contains all eight essential amino acids and has grown more than 10% on U.S.menus in the past year alone (Datassential MenuTrends ). Last year JuiceBrothers partnered with the Brooklyn -based ice cream chain Van Leeuwen to offer vegan ice cream at the newest JuiceBrothers location in Amsterdam for a limited time. The alternative ice cream was made using coconut milk, cashew milk, coconut oil, cane sugar, cocoa butter, and carob bean, and flavours included turmeric as well as trendy spirulina with pieces of matcha cake.",0.0,0.04,0.0,0.11,0.0,0.0,0.0,0.0,0.09,0.04
"Plant-based cuisine continues to trend across the country (and around the world), with many operators jumping aboard the veggie train. Planta is certainly aboard, though similar to a number of other meat-free restaurants, Planta prefers to avoid the ""vword"" (vegan), is careful to not alienate meat eaters, and instead describes itself as a plant-forward concept. Unlike some plant-based concepts that simply swap out burgers or steaks with tofu or other meat substitutes like tempeh, Planta takes a different approach in showcasing the versatility of vegetables in creative dishes, making them the star of the show (""crab"" cakes made from hearts of palm, eggplant lasagna). Planta has succeeded in attracting both omnivores and herbivores, catering successfully to those following the flexitarian trend. The Star says Planta ""proves plants are posh,"" serving up unique renditions of classic foods like the 18-Carrot Dog that features smoky carrots dressed to the Twitch ballpark mustard and pickle spears. Chef David Lee's menu also leverages vegetables to showcase flavours from across the globe – take inspiration from Kimchi Spring Rolls or Chickpea Curry. The original Planta restaurant, located in Toronto's Yorkville neighbourhood, opened in late 2016. The next year, the plant-based concept opened a fast-casual offshoot called Planta Burger. This year, Planta made the leap to the U.S., recently opening a restaurant in Miami's South Beach.",0.1,0.04,0.27,0.05,0.05,0.07,0.05,0.0,0.08,0.08
"Your guide to Chicago’s hottest new restaurants ahead of the National Restaurant Association show. White Oak Tavern Sink | Swim Boeufhaus 2 MAY 2015 DINE AROUND: CHICAGO Last year Datassential took you to the on-trend neighborhood of Logan Square ahead of the 2014 National Restaurant Association Show, and now we are once again taking you on an immersion tour of Chicago. But this year we are taking you to the city's newest restaurants opened in 2014 or 2015. These are the cutting-edge restaurant concepts taking advantage of the hottest trends. Chicago has long been known as a centre of culinary innovation, with chefs like Grant Achatz and the late Homaro Cantu pushing the envelope and redefining the American plate. And this issue is full of restaurants that take chances. At Intro they are taking a chance with a new chef every few months, allowing them full reign over the menu. At Duck Inn, they are taking a chance that Chicago's working-class Bridgeport neighbourhood will embrace an adventurous, fine dining menu filled with dishes like potted foie gras with guava jelly or rice cake fingers with kimchi sauce as a bar snack. At Baker Miller they are taking a chance that customers will embrace the house-milled grains and flours they use in their baked goods, giving them control over the flavour and texture, even if it comes at a higher cost. And you'll find a variety of adventurous dishes and ingredients in this issue, inviting you to take a chance on something new – chicken hearts at Momotaro, makgeolli (Korean rice wine) at Parachute, corned pig tongue at Tete Charcuterie, or a whole pig's head for two at Charlatan. You can take a chance on a new cuisine, like the Piemontese -focused Osteria Langhe or the Basque-inspired Salero and mfk, or a new concept, like the beer cafe Beermiscuous. Or go for broke with some of the city's over-the-top creations -- the 5 lb. bacon bomb at Kaiser Tiger, or the $195.00 porterhouse at RPM Steak. 
Of course, Chicago is also home to the type of comforting, crowd-pleasing restaurants that the Midwest is known for – hearty chicken paprikash at Bohemian House, filling Tex -Mex at updated diner Dove's Luncheonette, inventive pizzas at Parlor Pizza Bar, old-school Italian at Formento's, or a whole roasted chicken from River Roast. We also bring you extensive consumer data on the concepts found in this issue and the dishes found on Chicago restaurant menus. And we have included a handy fast casual directory in the back: a curated list of some of the notable fast-casual concepts found in Chicago, one of which may just be the next big thing. Just in time for this month's 2015 National Restaurant Association Show, this special edition of Datassential's Dine Around takes you on a trendspotting immersion tour of Chicago, highlighting the newest restaurants, concepts, and trends in the city. If you are interested in learning more about how you can put food trends to work for you, ask us about meeting a Datassential representative while you are in town.",0.17,0.05,0.0,0.0,0.03,0.0,0.03,0.09,0.05,0.03


Let’s drop the `'Document Frequency'` row, since we were only using it for illustration purposes.

In [219]:
tfidf_df = tfidf_df.drop('Document Frequency', errors='ignore')

Let’s reorganize the DataFrame so that the words are in rows rather than columns.

In [220]:
tfidf_df.stack().reset_index()

Unnamed: 0,level_0,level_1,0
0,Plant-based cuisine continues to trend across ...,00,0.000000
1,Plant-based cuisine continues to trend across ...,10,0.000000
2,Plant-based cuisine continues to trend across ...,18,0.068063
3,Plant-based cuisine continues to trend across ...,195,0.000000
4,Plant-based cuisine continues to trend across ...,2014,0.000000
...,...,...,...
1195,Your guide to Chicago’s hottest new restaurant...,working,0.044379
1196,Your guide to Chicago’s hottest new restaurant...,world,0.000000
1197,Your guide to Chicago’s hottest new restaurant...,year,0.052422
1198,Your guide to Chicago’s hottest new restaurant...,york,0.000000


In [221]:
tfidf_df = tfidf_df.stack().reset_index()

In [222]:
# stack().reset_index() produces exactly three columns: level_0, level_1 and 0
tfidf_df = tfidf_df.rename(columns={0: 'tfidf', 'level_0': 'document', 'level_1': 'term'})
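
The wide-to-long reshaping above can be easier to follow on a tiny example. The sketch below uses a made-up document-term matrix (the names `toy_wide`, `toy_long` and the scores are illustrative assumptions, not values from the corpus) and applies the same `stack()`, `reset_index()` and `rename()` steps.

```python
import pandas as pd

# Toy document-term matrix: rows are documents, columns are terms (made-up scores)
toy_wide = pd.DataFrame(
    [[0.5, 0.0, 0.2],
     [0.0, 0.7, 0.1]],
    index=['doc_a', 'doc_b'],
    columns=['milk', 'meat', 'year'],
)

# stack() moves the column labels into an inner index level;
# reset_index() then turns both index levels into ordinary columns
toy_long = toy_wide.stack().reset_index()

# The unnamed levels come out as 'level_0', 'level_1' and 0, so give them readable names
toy_long = toy_long.rename(columns={'level_0': 'document', 'level_1': 'term', 0: 'tfidf'})
print(toy_long)
```

Each (document, term) pair now occupies its own row, which is the shape that sorting, grouping and Altair all expect.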

To find <mark>the 10 words</mark> with the highest TF-IDF score in each document, we sort by document and by TF-IDF score (descending), then group by document and take the first 10 rows of each group.

In [223]:
tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)

Unnamed: 0,document,term,tfidf
653,JuiceBrothers (which has locations in the Neth...,milk,0.292436
514,JuiceBrothers (which has locations in the Neth...,cream,0.219327
591,JuiceBrothers (which has locations in the Neth...,ice,0.219327
610,JuiceBrothers (which has locations in the Neth...,juicebrothers,0.219327
643,JuiceBrothers (which has locations in the Neth...,matcha,0.219327
470,JuiceBrothers (which has locations in the Neth...,cashew,0.146218
475,JuiceBrothers (which has locations in the Neth...,chain,0.146218
489,JuiceBrothers (which has locations in the Neth...,clean,0.146218
492,JuiceBrothers (which has locations in the Neth...,coconut,0.146218
494,JuiceBrothers (which has locations in the Neth...,cold,0.146218


In [224]:
top_tfidf = tfidf_df.sort_values(by=['document','tfidf'], ascending=[True,False]).groupby(['document']).head(10)
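
As a small illustration of the sort-then-group pattern used above, the sketch below keeps the top 2 terms per document in a made-up long-format table (`toy_long`, `toy_top2` and the scores are assumptions for illustration only).

```python
import pandas as pd

# Toy long-format table with one row per (document, term) pair (made-up scores)
toy_long = pd.DataFrame({
    'document': ['doc_a', 'doc_a', 'doc_a', 'doc_b', 'doc_b', 'doc_b'],
    'term':     ['milk', 'meat', 'year', 'milk', 'meat', 'year'],
    'tfidf':    [0.5, 0.0, 0.2, 0.0, 0.7, 0.1],
})

# Sort documents alphabetically and scores from high to low,
# then keep the first 2 rows of every document group
toy_top2 = (
    toy_long
    .sort_values(by=['document', 'tfidf'], ascending=[True, False])
    .groupby('document')
    .head(2)
)
print(toy_top2)
```

Because `head()` simply takes the first rows of each group, the result depends on the sort having been applied first. Terms with identical scores are kept in whatever order the sort leaves them.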

### Visualize TF-IDF

Let’s make a <mark>heatmap</mark> that shows the highest TF-IDF scoring words for each document, and let’s put a red dot next to four terms of interest: _'chicken', 'meat', 'matcha', 'milk'_

In [226]:
# Terms in this list will get a red dot in the visualization
term_list = ['chicken', 'meat', 'matcha', 'milk']

# adding a little randomness to break ties in term ranking
top_tfidf_plusRand = top_tfidf.copy()
top_tfidf_plusRand['tfidf'] = top_tfidf_plusRand['tfidf'] + np.random.rand(top_tfidf.shape[0])*0.0001

# base for all visualizations, with rank calculation
base = alt.Chart(top_tfidf_plusRand).encode(
    x = 'rank:O',
    y = 'document:N'
).transform_window(
    rank = "rank()",
    sort = [alt.SortField("tfidf", order="descending")],
    groupby = ["document"],
)

# heatmap specification
heatmap = base.mark_rect().encode(
    color = 'tfidf:Q'
)

# red circle over terms in above list
circle = base.mark_circle(size=100).encode(
    color = alt.condition(
        alt.FieldOneOfPredicate(field='term', oneOf=term_list),
        alt.value('red'),
        alt.value('#FFFFFF00')        
    )
)

# text labels, white for darker heatmap colors
text = base.mark_text(baseline='middle').encode(
    text = 'term:N',
    color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)

# display the three superimposed visualizations
(heatmap + circle + text).properties(width = 600)
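
The small random offset added before plotting matters because tied scores would otherwise receive the same rank and land in the same heatmap cell. The sketch below reproduces the idea in pandas, assuming that the chart's `rank()` window operation treats ties the way pandas' `method='min'` does (tied values share a rank and the following rank is skipped). The data frame `toy` and its scores are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy top-terms table with one deliberate tie (0.22 appears twice)
toy = pd.DataFrame({
    'document': ['doc_a'] * 4,
    'term': ['milk', 'cream', 'ice', 'cold'],
    'tfidf': [0.29, 0.22, 0.22, 0.15],
})

# Without jitter: the two tied terms share rank 2 and rank 3 is skipped
toy['rank_tied'] = (
    toy.groupby('document')['tfidf']
    .rank(method='min', ascending=False)
    .astype(int)
)

# With a tiny random offset every term gets its own rank;
# the offset is far smaller than the gaps between distinct scores
rng = np.random.default_rng(0)
toy['tfidf_jittered'] = toy['tfidf'] + rng.random(len(toy)) * 0.0001
toy['rank_jittered'] = (
    toy.groupby('document')['tfidf_jittered']
    .rank(method='min', ascending=False)
    .astype(int)
)
print(toy[['term', 'tfidf', 'rank_tied', 'rank_jittered']])
```

The order of the two tied terms after jittering is arbitrary, which is acceptable here because their scores are identical anyway.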