# NLP

Natural language processing (NLP) is a branch of data science concerned with systematically analyzing, understanding, and deriving information from text data in an efficient manner.

The most frequently used terms in this notebook:

1. Tokenization: the process of converting a text into tokens

2. Tokens: the words or entities present in the text

3. Text object: a sentence, a phrase, a word, or an article

## Text Preprocessing

Since text is the most unstructured form of all available data, it contains various types of noise and is not readily analyzable without pre-processing.

Pre-processing comprises three steps:

1. Noise Removal

2. Lexicon Normalization

3. Object Standardization

### Noise Removal

***Any piece of text that is not relevant to the context of the data and the end output can be considered noise.***

For example, common words such as is, am, the, of, in, etc.

Other examples: URLs, #, media links, and punctuation.
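
The idea can be sketched in plain Python. This is only a minimal illustration, not the notebook's own method: the stopword list, the URL pattern, and the function name `remove_noise` are assumptions made for this example.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; real projects use a fuller one.
STOPWORDS = {"is", "am", "the", "of", "in", "a", "an", "and", "to"}

# Assumed URL shape: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def remove_noise(text):
    """Return the lowercased words of `text` without URLs, punctuation, or stopwords."""
    text = URL_PATTERN.sub(" ", text)    # drop URLs before punctuation is stripped
    text = text.translate(PUNCT_TABLE)   # drop punctuation, including '#'
    words = text.lower().split()
    return [w for w in words if w not in STOPWORDS]


cleaned = remove_noise("The pizza in #Phoenix is great, see https://example.com/menu !")
print(cleaned)
```

Removing URLs first matters here: once punctuation is stripped, a URL no longer looks like a URL and would leak into the word list.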


### Lexicon Normalization

Another type of textual noise comes from the multiple representations of a single word.

For example, “play”, “player”, “played”, “plays” and “playing” are variations of the word “play”. Although their meanings differ, they are contextually similar. This step converts all the variants of a word into their normalized form (also known as the lemma).

In [1]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /home/rockstar/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
from nltk.stem.wordnet import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('boys'))

boy


In [3]:
from nltk.stem.porter import PorterStemmer 
stem = PorterStemmer()

word = "multiplying" 


print('\n\nStemming\n\n')
print(stem.stem(word))



Stemming


multipli


In [4]:
import spacy
nlp = spacy.load('en')

In [5]:
doc = nlp("I am an NLP Engineer.")

## Tokenizing

In [6]:
for t in doc:
    print(t)

I
am
an
NLP
Engineer
.


In [7]:
doc = nlp("I don't smoke neither do I drink.")

for token in doc:
    print(token)

I
do
n't
smoke
neither
do
I
drink
.


spaCy splits contractions: "don't" becomes the two tokens "do" and "n't".
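
To see what such a split involves, here is a rough regex-based sketch. It is not how spaCy's tokenizer is implemented; the pattern and the name `tokenize` are assumptions for illustration, and only the "n't" contraction is handled.

```python
import re

# Alternatives are tried left to right:
#   n't          the contraction suffix itself
#   \w+(?=n't)   a word that stops right before "n't" ("do" in "don't")
#   \w+          any other word
#   [^\w\s]      a single punctuation character
TOKEN_PATTERN = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")


def tokenize(text):
    """Split `text` into word, contraction, and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


tokens = tokenize("I don't smoke neither do I drink.")
print(tokens)
```

On this sentence the sketch produces the same nine tokens as the spaCy output above, but it makes no attempt to cover other contractions or typographic apostrophes.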

## Text Preprocessing

In [8]:
print("Token \t\tLemma \t\tStopword")
print("-"*40)
var = 1
print(f"The value of Variable is {var}")

Token 		Lemma 		Stopword
----------------------------------------
The value of Variable is 1


In [9]:
print("Token \t\tLemma \t\tStopword")
print("-"*40)
for token in doc:
    print(f"{token}\t\t{token.lemma_}\t\t{token.is_stop}")

Token 		Lemma 		Stopword
----------------------------------------
I		-PRON-		True
do		do		True
n't		not		True
smoke		smoke		False
neither		neither		True
do		do		True
I		-PRON-		True
drink		drink		False
.		.		False


## Pattern Matching

My name is Gopal Singh.

Gopal Singh studies in 5th sem.

In [10]:
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [11]:
matcher.add("GOPAL", None, nlp("Gopal Singh"))

In [12]:
doc = nlp("My name is Gopal Singh.")

for i in doc:
    print(i)

My
name
is
Gopal
Singh
.


In [13]:
matches = matcher(doc)

In [14]:
print(matches)

[(11304736281928062578, 3, 5)]


In [15]:
match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id])

GOPAL


In [16]:
doc[start:end]

Gopal Singh

A girl was wearing red skirt with the yellow fabric on the top.

In [17]:
color_pattern = [nlp(text) for text in ('red', 'green', 'blue')]
prodcut_pattern = [nlp(text) for text in ('coat', 'bag', 'belt')]
material_pattern = [nlp(text) for text in ('silk', 'yellow fabric')]

In [18]:
type(color_pattern[0])

spacy.tokens.doc.Doc

In [19]:
arr = [1, 2, 3]
print(*arr)

1 2 3


In [20]:
matcher = PhraseMatcher(nlp.vocab)

matcher.add("COLOR", None, *color_pattern)
matcher.add("PRODUCT", None, *prodcut_pattern)
matcher.add("MATERIAL", None, *material_pattern)

In [21]:
doc = nlp("A girl was wearing red skirt with the silk red on the top.")

In [22]:
mathches = matcher(doc)

In [23]:
print(mathches)

[(10780996287991076397, 4, 5), (2214131371939929767, 8, 9), (10780996287991076397, 9, 10)]


In [24]:
match_id, start, end = mathches[0]
print(nlp.vocab.strings[match_id])

COLOR


In [25]:
doc[start:end].text

'red'

In [26]:
mathches

[(10780996287991076397, 4, 5),
 (2214131371939929767, 8, 9),
 (10780996287991076397, 9, 10)]

In [27]:
match_id, start, end = mathches[1]
print(nlp.vocab.strings[match_id])

MATERIAL


In [28]:
doc[start:end].text

'silk'

In [29]:
for match_id, start, end in mathches:
    rule_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(rule_id, span.text)

COLOR red
MATERIAL silk
COLOR red
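
The `(match_id, start, end)` tuples can be easier to read next to a plain-Python version of the same idea. The sketch below is only an approximation and does not reflect how `PhraseMatcher` works internally: `find_phrases` is a hypothetical helper, tokens come from a simple whitespace split, and matching is case-insensitive.

```python
def find_phrases(tokens, patterns):
    """Return (label, start, end) for every phrase occurrence; `end` is exclusive."""
    lowered = [t.lower() for t in tokens]
    matches = []
    for label, phrases in patterns.items():
        for phrase in phrases:
            words = phrase.lower().split()
            n = len(words)
            # Slide a window of n tokens over the text and compare it to the phrase.
            for start in range(len(lowered) - n + 1):
                if lowered[start:start + n] == words:
                    matches.append((label, start, start + n))
    # Report matches in the order they occur in the text.
    return sorted(matches, key=lambda m: (m[1], m[2]))


patterns = {
    "COLOR": ["red", "green", "blue"],
    "PRODUCT": ["coat", "bag", "belt"],
    "MATERIAL": ["silk", "yellow fabric"],
}
tokens = "A girl was wearing red skirt with the silk red on the top .".split()

found = find_phrases(tokens, patterns)
for label, start, end in found:
    print(label, " ".join(tokens[start:end]))
```

As with spaCy spans, `start` and `end` are token offsets, so `tokens[start:end]` recovers the matched phrase.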


## Yelp Reviews

In [30]:
import pandas as pd

df = pd.read_json('nlp-data/restaurant.json')
df.head()

| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 109 | lDJIaF4eYRF4F7g6Zb9euw | lb0QUR5bc4O-Am4hNq9ZGg | r5PLDU-4mSbde5XekTXSCA | 4 | 2 | 0 | 0 | I used to work food service and my manager at ... | 2013-01-27 17:54:54 |
| 1013 | vvIzf3pr8lTqE_AOsxmgaA | MAmijW4ooUzujkufYYLMeQ | r5PLDU-4mSbde5XekTXSCA | 4 | 0 | 0 | 0 | We have been trying Eggplant sandwiches all ov... | 2015-04-15 04:50:56 |
| 1204 | UF-JqzMczZ8vvp_4tPK3bQ | slfi6gf_qEYTXy90Sw93sg | r5PLDU-4mSbde5XekTXSCA | 5 | 1 | 0 | 0 | Amazing Steak and Cheese... Better than any Ph... | 2011-03-20 00:57:45 |
| 1251 | geUJGrKhXynxDC2uvERsLw | N_-UepOzAsuDQwOUtfRFGw | r5PLDU-4mSbde5XekTXSCA | 1 | 0 | 0 | 0 | Although I have been going to DeFalco's for ye... | 2018-07-17 01:48:23 |
| 1354 | aPctXPeZW3kDq36TRm-CqA | 139hD7gkZVzSvSzDPwhNNw | r5PLDU-4mSbde5XekTXSCA | 2 | 0 | 0 | 0 | Highs: Ambience, value, pizza and deserts. Thi... | 2018-01-21 10:52:58 |


In [31]:
import spacy
from spacy.matcher import PhraseMatcher

text = df["text"].values[14]

In [32]:
nlp = spacy.blank('en')

In [33]:
review_doc = nlp(text)

In [34]:
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

In [35]:
menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
        "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
        "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones",  "Calzone",
        "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
        "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
        "Mozzarella Caprese",  "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
        "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
        "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",  
         "Prosciutto", "Salami"]

menu_pattern = [nlp(item) for item in menu]

In [36]:
matcher.add("MENU", None, *menu_pattern)

In [37]:
matches = matcher(review_doc)

In [38]:
print(matches)

[(8291075388056826051, 2, 3), (8291075388056826051, 16, 17), (8291075388056826051, 58, 59)]


In [39]:
for m_id, start, end in matches:
    print(nlp.vocab.strings[m_id], review_doc[start:end])

MENU Purista
MENU prosciutto
MENU meatball


## Matching on the whole dataset

In [40]:
from collections import defaultdict


def value():
    return "Not available"

dd = {
    "a":[1, 2, 2, 3],
    "b":[11, 22, 221, 33],
    "c":[331, 1212, 121211, 11]
}

data = pd.DataFrame(dd, index=[1, 2, 3, 4])
data

| | a | b | c |
|---|---|---|---|
| 1 | 1 | 11 | 331 |
| 2 | 2 | 22 | 1212 |
| 3 | 2 | 221 | 121211 |
| 4 | 3 | 33 | 11 |


In [41]:
d = defaultdict(list)

d["a"] = [1, 2, 3]
d["b"] = [3, 4, 4]

In [42]:
d

defaultdict(list, {'a': [1, 2, 3], 'b': [3, 4, 4]})

In [43]:
d["a"]

[1, 2, 3]

In [44]:
d["b"]

[3, 4, 4]

In [45]:
d["c"].append([1, 2])

In [46]:
d

defaultdict(list, {'a': [1, 2, 3], 'b': [3, 4, 4], 'c': [[1, 2]]})

In [47]:
data

| | a | b | c |
|---|---|---|---|
| 1 | 1 | 11 | 331 |
| 2 | 2 | 22 | 1212 |
| 3 | 2 | 221 | 121211 |
| 4 | 3 | 33 | 11 |


In [48]:
for i, d in data.iterrows():
    print(d["c"])

331
1212
121211
11


In [49]:
from collections import defaultdict

# If key is not found then return empty list
ratings = defaultdict(list)

for i, review in df.iterrows():
    doc = nlp(review["text"])
    matches = matcher(doc)
    
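    # Intended to keep unique menu items. Note that a set of Span objects is keyed
    # by position in the doc, not by text, so a dish mentioned several times in one
    # review still appears once per mention.
    items = set([doc[start:end] for m_id, start, end in matches])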
    print(items)
    
    for item in items:
        ratings[item.text.lower()].append(review.stars)
        print(ratings)
    

{Chicken Parmigiana}
defaultdict(<class 'list'>, {'chicken parmigiana': [4]})
{Eggplant, eggplant, eggplant, eggplant}
defaultdict(<class 'list'>, {'chicken parmigiana': [4], 'eggplant': [4]})
defaultdict(<class 'list'>, {'chicken parmigiana': [4], 'eggplant': [4, 4]})
defaultdict(<class 'list'>, {'chicken parmigiana': [4], 'eggplant': [4, 4, 4]})
defaultdict(<class 'list'>, {'chicken parmigiana': [4], 'eggplant': [4, 4, 4, 4]})
{Steak and Cheese, pizza}
defaultdict(<class 'list'>, {'chicken parmigiana': [4], 'eggplant': [4, 4, 4, 4], 'steak and cheese': [5]})
defaultdict(<class 'list'>, {'chicken parmigiana': [4], 'eggplant': [4, 4, 4, 4], 'steak and cheese': [5], 'pizza': [5]})
{meatball, meatball}
defaultdict(<class 'list'>, {'chicken parmigiana': [4], 'eggplant': [4, 4, 4, 4], 'steak and cheese': [5], 'pizza': [5], 'meatball': [1]})
defaultdict(<class 'list'>, {'chicken parmigiana': [4], 'eggplant': [4, 4, 4, 4], 'steak and cheese': [5], 'pizza': [5], 'meatball': [1, 1]})
[Output truncated: the loop keeps printing the growing `ratings` dictionary after every matched item, until the notebook server stops sending output with the message "IOPub data rate exceeded".]



In [50]:
ratings

defaultdict(list,
            {'chicken parmigiana': [4,
              5,
              4,
              4,
              5,
              5,
              5,
              5,
              5,
              4,
              4,
              4,
              3,
              4,
              5,
              5,
              4,
              5],
             'eggplant': [4,
              4,
              4,
              4,
              3,
              1,
              1,
              1,
              1,
              5,
              4,
              4,
              4,
              3,
              3,
              3,
              3,
              3,
              4,
              4,
              3,
              4,
              5,
              5,
              4,
              5,
              5,
              5,
              3,
              3,
              5,
              5,
              5,
              4,
              4,
              5,
              4,
            

In [51]:
a = [2, 2, 4, 4, 5]
sum(a)/len(a)

3.4

In [52]:
menu_card = {}
for item, rating in ratings.items():
    menu_card[item] = round(sum(rating)/len(rating), 2)

In [53]:
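# Invert the mapping (average rating -> item) so items can be looked up by rating.
# Items that share the same average overwrite each other here.
menu_card_2 = {}
for item, rating in ratings.items():
    menu_card_2[round(sum(rating)/len(rating), 2)] = item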

In [54]:
min_rating = min(menu_card_2.keys())
menu_card_2[min_rating]

'chicken cutlet'
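
A compact variant of the aggregation above is sketched here on made-up data. The `reviews` list is hypothetical and stands in for the star rating and matched menu items of each review. Two choices differ from the cells above and are assumptions of this sketch: mentions are deduplicated on their lowercased text, so each review counts once per dish, and the averages stay keyed by item name, so dishes with equal averages do not overwrite each other.

```python
from collections import defaultdict

# Hypothetical toy data: (stars, menu items matched in the review text).
reviews = [
    (4, ["Eggplant", "eggplant", "eggplant"]),
    (5, ["Steak and Cheese", "pizza"]),
    (1, ["meatball", "meatball"]),
    (3, ["Pizza", "meatball"]),
]

ratings = defaultdict(list)
for stars, mentions in reviews:
    # A set of lowercased strings removes repeated mentions within one review.
    for item in {m.lower() for m in mentions}:
        ratings[item].append(stars)

# Average rating per item, rounded to two decimals.
menu_card = {item: round(sum(r) / len(r), 2) for item, r in ratings.items()}

# The key with the smallest average is the worst-rated item.
worst_item = min(menu_card, key=menu_card.get)
print(worst_item, menu_card[worst_item])
```

Using `min(menu_card, key=menu_card.get)` avoids building a second dictionary keyed by rating.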

## Text Classification

In [55]:
import pandas as pd

spam = pd.read_csv('nlp-data/spam.csv')
spam.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


## Bag of Words

ML models can only learn from numeric data, so we need to convert the text into numbers. The question is how?


#### The answer is Bag Of Words.


Example sentence: Tea is healthy but Tea makes you insomniac.

Vocabulary: [Tea, is, healthy, but, makes, you, insomniac]

Counts: [2, 1, 1, 1, 1, 1, 1]

##### Term frequencies.
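
The counts for the example sentence can be reproduced with the standard library. This is a minimal sketch that assumes lowercasing and a whitespace split; real pipelines use a proper tokenizer and a vocabulary built from the whole corpus.

```python
from collections import Counter

sentence = "Tea is healthy but Tea makes you insomniac."

# Lowercase, drop the final period, and split on whitespace.
tokens = sentence.lower().rstrip(".").split()

# Vocabulary: the unique words, in the order they first appear.
vocabulary = list(dict.fromkeys(tokens))

# Term frequencies: how often each vocabulary word occurs in the sentence.
counts = Counter(tokens)
vector = [counts[word] for word in vocabulary]

print(vocabulary)
print(vector)
```

The vector has one entry per vocabulary word, so its length always equals the vocabulary size.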

In [56]:
import spacy

nlp = spacy.blank('en')

In [60]:
classifier = nlp.create_pipe("textcat", config={
                "exclusive_classes": True,
                "architecture": "bow"})

In [61]:
nlp.add_pipe(classifier)

In [62]:
classifier.add_label("ham")
classifier.add_label("spam")

1

## Training the model

In [63]:
train_texts = spam["text"].values

In [64]:
labels = [{"cats": {"ham":label == "ham", "spam":label=="spam"}} for label in spam["label"]]

In [65]:
labels

[{'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': False, 'spam': True}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham': True, 'spam': False}},
 {'cats': {'ham'

In [66]:
prepared_data = list(zip(train_texts, labels))

In [67]:
prepared_data[:2]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}})]

In [68]:
from spacy.util import minibatch

spacy.util.fix_random_seed(1)

#training
optimizer = nlp.begin_training()

#Creating batches
batches = minibatch(prepared_data, size=8)

In [69]:
for batch in batches:
    # Split each batch into parallel tuples of texts and label dicts
    text, label = zip(*batch)
    # Update the model's parameters
    nlp.update(text, label, sgd=optimizer)
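
What batching and `zip(*batch)` do can be shown without spaCy. The generator below, `simple_minibatch`, is a hypothetical stand-in written for this sketch, not spaCy's `minibatch`, and the five training examples are made up.

```python
from itertools import islice


def simple_minibatch(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Made-up (text, annotation) pairs in the same shape as prepared_data.
examples = [
    ("Ok lar...", {"cats": {"ham": True, "spam": False}}),
    ("Free entry now", {"cats": {"ham": False, "spam": True}}),
    ("See you soon", {"cats": {"ham": True, "spam": False}}),
    ("Win a prize", {"cats": {"ham": False, "spam": True}}),
    ("Call me later", {"cats": {"ham": True, "spam": False}}),
]

batch_sizes = []
for batch in simple_minibatch(examples, size=2):
    # zip(*batch) turns a list of pairs into one tuple of texts and one of labels.
    texts, labels = zip(*batch)
    batch_sizes.append(len(texts))
    print(texts)
```

The last batch is smaller than `size` whenever the number of examples is not a multiple of it.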

In [70]:
import random

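optimizer = nlp.begin_training()
# `losses` is created once, so nlp.update keeps adding to the same total
# across batches and epochs; the value printed below is cumulative.
losses = {}

for epoch in range(10):
    random.shuffle(prepared_data)
    batches = minibatch(prepared_data, size=16)

    for batch in batches:
        # Split each batch into parallel tuples of texts and label dicts
        text, label = zip(*batch)
        # Update the model's parameters
        nlp.update(text, label, sgd=optimizer, losses=losses)
        print(losses)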
    

{'textcat': 4.993294714950025e-05}
{'textcat': 0.0004141824319958687}
{'textcat': 0.0004879304033238441}
{'textcat': 0.0005448188458103687}
{'textcat': 0.0006072485048207454}
[... many similar lines omitted; one line is printed per batch and the cumulative loss rises steadily ...]

{'textcat': 0.1448859153310318}
{'textcat': 0.14488605253735987}
{'textcat': 0.14488698220454044}
{'textcat': 0.14488874542384256}
{'textcat': 0.1449002774956094}
{'textcat': 0.14490151661574746}
{'textcat': 0.14490180113411189}
{'textcat': 0.14490191018126186}
{'textcat': 0.144903136723884}
{'textcat': 0.1449041532589499}
{'textcat': 0.14490420200233078}
{'textcat': 0.14490594143072943}
{'textcat': 0.14490730505216565}
{'textcat': 0.14490749332687702}
{'textcat': 0.14490844590468388}
{'textcat': 0.1449090068364507}
{'textcat': 0.14490936548374922}
{'textcat': 0.14490950401371894}
{'textcat': 0.14491177945850442}
{'textcat': 0.14491192348253268}
{'textcat': 0.14491734398000577}
{'textcat': 0.14491995838187055}
{'textcat': 0.14492024380501167}
{'textcat': 0.14492065002243582}
{'textcat': 0.14492084475644962}
{'textcat': 0.14494348851698025}
{'textcat': 0.14501386597370303}
{'textcat': 0.14501591094252664}
{'textcat': 0.14502210085432754}
{'textcat': 0.1450229004818988}
{'textcat': 0.145

{'textcat': 0.14693299907234536}
{'textcat': 0.14693584216028643}
{'textcat': 0.1469361104621898}
{'textcat': 0.14693669746752125}
{'textcat': 0.14693821524978734}
{'textcat': 0.1469417895553251}
{'textcat': 0.14724838394458306}
{'textcat': 0.14724879676491032}
{'textcat': 0.14725170845944735}
{'textcat': 0.14725262880709256}
{'textcat': 0.14726105089901598}
{'textcat': 0.147262228453525}
{'textcat': 0.14726361519488496}
{'textcat': 0.14726548350618618}
{'textcat': 0.14728543569718244}
{'textcat': 0.147289553000161}
{'textcat': 0.1472901544380365}
{'textcat': 0.14729786379261345}
{'textcat': 0.14729836952470254}
{'textcat': 0.14730670975118887}
{'textcat': 0.14730738950764266}
{'textcat': 0.14731435972997264}
{'textcat': 0.1473430335295065}
{'textcat': 0.147366581618809}
{'textcat': 0.14736670257145779}
{'textcat': 0.1473667048407663}
{'textcat': 0.1473667480995744}
{'textcat': 0.14736695369521446}
{'textcat': 0.14736835571256113}
{'textcat': 0.14736960396152488}
{'textcat': 0.14737389

{'textcat': 0.14940005499365694}
{'textcat': 0.14940016318673455}
{'textcat': 0.14940743089796027}
{'textcat': 0.1494140474337169}
{'textcat': 0.14976502245776513}
{'textcat': 0.14976591703601483}
{'textcat': 0.14976635119590154}
{'textcat': 0.14976755623283933}
{'textcat': 0.14976831880265107}
{'textcat': 0.14977640896447864}
{'textcat': 0.14977775399663895}
{'textcat': 0.1498201273802604}
{'textcat': 0.1498203149798567}
{'textcat': 0.1498203922497947}
{'textcat': 0.1498370920568146}
{'textcat': 0.14985996550295622}
{'textcat': 0.1498613950922032}
{'textcat': 0.14986158847780584}
{'textcat': 0.1498617302217593}
{'textcat': 0.14987665905432945}
{'textcat': 0.14988642618068693}
{'textcat': 0.14989194993945998}
{'textcat': 0.14989236478878465}
{'textcat': 0.14990641097129864}
{'textcat': 0.14990652479326028}
{'textcat': 0.14990801554193323}
{'textcat': 0.14990816973971133}
{'textcat': 0.14990828703783987}
{'textcat': 0.1499109232762177}
{'textcat': 0.14991095669793109}
{'textcat': 0.1499

In [71]:
texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA",
         "Free Free Free Free"]
docs = [nlp.tokenizer(text) for text in texts]

# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)

[[9.9991047e-01 8.9551839e-05]
 [1.9246925e-02 9.8075312e-01]
 [1.4866841e-01 8.5133159e-01]]


In [72]:
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])


['ham', 'spam', 'spam']
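
The label lookup above takes, for each row of `scores`, the index of the largest value and maps it to a label name. Here is a minimal pure-Python sketch of the same idea, using the scores printed above. The label order `("ham", "spam")` is an assumption inferred from the output; in the notebook it comes from `textcat.labels`.

```python
# Scores copied from the output above: one row per text, one column per label.
scores = [
    [9.9991047e-01, 8.9551839e-05],
    [1.9246925e-02, 9.8075312e-01],
    [1.4866841e-01, 8.5133159e-01],
]

# Assumed label order (hypothetical here; the notebook reads it from textcat.labels).
labels = ("ham", "spam")

def argmax(row):
    """Return the index of the largest value in a sequence."""
    return max(range(len(row)), key=row.__getitem__)

# Pick the highest-scoring label for each text.
predicted = [labels[argmax(row)] for row in scores]
print(predicted)
```

Each row sums to roughly 1, so the larger entry can be read as the model's confidence in the chosen label.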


## Word Embedding


In summary, the idea behind Word2Vec is simple: we assume that the meaning of a word can be inferred from the company it keeps. This is analogous to the saying, “show me your friends, and I’ll tell you who you are.”

If two words have very similar neighbors (that is, the contexts in which they are used are about the same), then they are probably similar in meaning, or at least related. For example, the words shocked, appalled, and astonished are usually used in similar contexts.

Word embeddings (also called word vectors) represent each word numerically in such a way that the vector corresponds to how that word is used or what it means. 
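
To make "the vector corresponds to how the word is used" concrete, here is a small sketch with made-up 3-dimensional vectors. The values are hypothetical and chosen only for illustration; real embeddings are learned from text and have many more dimensions. Closeness is measured with cosine similarity, which this notebook defines in a later cell.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for this example only.
toy_vectors = {
    "shocked":    [0.9, 0.8, 0.1],
    "astonished": [0.8, 0.9, 0.2],
    "teapot":     [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_related = cosine(toy_vectors["shocked"], toy_vectors["astonished"])
sim_unrelated = cosine(toy_vectors["shocked"], toy_vectors["teapot"])

print(f"shocked vs astonished: {sim_related:.3f}")
print(f"shocked vs teapot:     {sim_unrelated:.3f}")
```

Words used in similar contexts get vectors pointing in similar directions, so their similarity is high; an unrelated word scores noticeably lower.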

In [73]:
# !pip3 install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz
# !python -m spacy download en_core_web_lg

In [74]:
import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

In [75]:
text = "These vectors can be used as features for machine learning models."

vectors = np.array([token.vector for token in nlp(text)])

In [76]:
vectors.shape

(12, 300)

In [77]:
import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('nlp-data/spam.csv')


doc_vectors = np.array([nlp(text).vector for text in spam.text])
    
doc_vectors.shape

(5572, 300)
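
`nlp(text).vector` collapses a whole document into a single vector. By default spaCy computes it as the average of the token vectors, which is why every message ends up with the same 300 dimensions regardless of its length. Here is a minimal sketch of that averaging, using made-up 3-dimensional token vectors (the numbers are hypothetical):

```python
# Hypothetical token vectors for a three-token document (one row per token).
token_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

def mean_vector(rows):
    """Average equal-length vectors component by component."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]

# One vector with the same dimensionality as each token vector.
doc_vector = mean_vector(token_vectors)
print(doc_vector)
```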

In [78]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label,
                                                    test_size=0.1, random_state=1)

In [79]:
from sklearn.svm import LinearSVC

svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%")

Accuracy: 97.312%


In [80]:
def cosine_similarity(a, b):
    return a.dot(b)/np.sqrt(a.dot(a) * b.dot(b))

In [81]:
a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a, b)

0.7030031
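
The `cosine_similarity` helper above computes the cosine of the angle $\theta$ between two vectors $a$ and $b$. Written out, with $\lVert a \rVert = \sqrt{a \cdot a}$:

$$
\cos\theta = \frac{a \cdot b}{\sqrt{(a \cdot a)\,(b \cdot b)}} = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
$$

The value ranges from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). It depends only on direction, not on vector length.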

In [82]:
review_data = pd.read_csv('nlp-data/yelp_ratings.csv')
review_data.head()

Unnamed: 0,text,stars,sentiment
0,Total bill for this horrible service? Over $8G...,1.0,0
1,I *adore* Travis at the Hard Rock's new Kelly ...,5.0,1
2,I have to say that this office really has it t...,5.0,1
3,Went in for a lunch. Steak sandwich was delici...,5.0,1
4,Today was my second out of three sessions I ha...,1.0,0


In [83]:
reviews = review_data[:100]
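# We just want the vectors, so we can turn off the other models in the pipeline.
# Called without names, disable_pipes() disables nothing, so pass every pipe name.
with nlp.disable_pipes(*nlp.pipe_names):
    vectors = np.array([nlp(review.text).vector for idx, review in reviews.iterrows()])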
    
vectors.shape

(100, 300)

In [84]:
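# Load precomputed document vectors for the whole dataset
# (this replaces the 100-review sample computed in the previous cell)
vectors = np.load('nlp-data/review_vectors.npy')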

In [85]:
vectors

array([[-0.20143504,  0.1837154 , -0.01327053, ..., -0.05922916,
         0.01440009,  0.09077955],
       [-0.02590548,  0.1517007 , -0.11389936, ..., -0.04916738,
         0.03085417,  0.07205424],
       [-0.07666641,  0.19274631, -0.14321738, ..., -0.04575825,
         0.0689992 ,  0.09280958],
       ...,
       [-0.03841371,  0.16862842, -0.24175283, ..., -0.10739233,
         0.14741549,  0.12238124],
       [-0.01221176,  0.11620302, -0.09448893, ..., -0.06332556,
         0.02805696,  0.13142744],
       [ 0.01070178,  0.1630349 , -0.06763948, ..., -0.08762769,
         0.00377347,  0.15404755]], dtype=float32)

In [86]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(vectors, review_data.sentiment, 
                                                    test_size=0.1, random_state=1)

# Create the LinearSVC model
model = LinearSVC(random_state=1, dual=False)
# Fit the model
model.fit(X_train, y_train)
# Report accuracy on the held-out test set
print(f'Model test accuracy: {model.score(X_test, y_test)*100:.3f}%')

Model test accuracy: 93.847%


In [87]:
review = """I absolutely love this place. The 360 degree glass windows with the 
Yerba buena garden view, tea pots all around and the smell of fresh tea everywhere 
transports you to what feels like a different zen zone within the city. I know 
the price is slightly more compared to the normal American size, however the food 
is very wholesome, the tea selection is incredible and I know service can be hit 
or miss often but it was on point during our most recent visit. Definitely recommend!

I would especially recommend the butternut squash gyoza."""

def cosine_similarity(a, b):
    return np.dot(a, b)/np.sqrt(a.dot(a)*b.dot(b))

review_vec = nlp(review).vector


vec_mean = vectors.mean(axis=0)
# Subtract the mean from the vectors
centered = vectors-vec_mean

# Calculate similarities for each document in the dataset
# Make sure to subtract the mean from the review vector
sims = np.array([cosine_similarity(review_vec - vec_mean, vec) for vec in centered])

# Get the index for the most similar document
most_similar = sims.argmax()

In [88]:
most_similar

5930
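
Subtracting the mean before comparing matters because document vectors tend to share a large common component. That component pushes every cosine similarity toward 1 and makes documents hard to tell apart. Here is a small sketch with made-up 2-dimensional vectors (hypothetical values, chosen only to make the effect visible):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Hypothetical document vectors that all share an offset of roughly (10, 10).
docs = [
    [11.0, 10.0],
    [10.0, 11.0],
    [9.0, 10.0],
]
query = [11.0, 10.2]

# Component-wise mean of the document vectors.
mean = [sum(column) / len(docs) for column in zip(*docs)]

def center(v):
    """Subtract the dataset mean from a vector."""
    return [x - m for x, m in zip(v, mean)]

raw = [cosine(query, d) for d in docs]
centered = [cosine(center(query), center(d)) for d in docs]

print("raw:     ", [round(s, 3) for s in raw])
print("centered:", [round(s, 3) for s in centered])
```

The raw similarities are all close to 1, while the centered ones spread out and single out the first document as the closest match. The same mean is subtracted from the query and from the dataset so that both sit in the same centered space.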

In [89]:
print(review_data.iloc[most_similar].text)

After purchasing my final christmas gifts at the Urban Tea Merchant in Vancouver, I was surprised to hear about Teopia at the new outdoor mall at Don Mills and Lawrence when I went back home to Toronto for Christmas.
Across from the outdoor skating rink and perfect to sit by the ledge to people watch, the location was prime for tea connesieurs... or people who are just freezing cold in need of a drinK!
Like any gourmet tea shop, there were large tins of tea leaves on the walls, and although the tea menu seemed interesting enough, you can get any specialty tea as your drink. We didn't know what to get... so the lady suggested the Goji Berries... it smelled so succulent and juicy... instantly SOLD! I got it into a tea latte and watched the tea steep while the milk was steamed, and surprisingly, with the click of a button, all the water from the tea can be instantly drained into the cup (see photo).. very fascinating!

The tea was aromatic and tasty, not over powering. The price was also 

## Named Entity Recognition (NER)

In [90]:
from spacy import displacy

review = review_data.iloc[most_similar].text

doc = nlp(review)

In [97]:
displacy.render(doc, style="ent", jupyter=True)

In [98]:
spacy.explain("GPE")

'Countries, cities, states'

In [99]:
spacy.explain("LOC")

'Non-GPE locations, mountain ranges, bodies of water'