# NLP with spaCy

I suggest that you read the documentation for yourself. The best way to learn how to do something in Python is usually to read the documentation and Google your question.

[Here is the spaCy documentation](https://spacy.io/usage/spacy-101).

The data we'll be working with today is a set of all the Jeopardy questions and their matched answers, along with some information about the category, value, etc. We're mostly interested in the questions, and that's what we'll use spaCy for.

In [28]:
import spacy
import pandas as pd
from pathlib import Path
import time
nlp = spacy.load('en_core_web_sm')

In [2]:
jpdy_path = Path.cwd() / 'data' / 'JEOPARDY_CSV.csv'
jpdy = pd.read_csv(jpdy_path)
jpdy.head()

Unnamed: 0,Show Number,Air Date,Round,Category,Value,Question,Answer
0,4680,2004-12-31,Jeopardy!,HISTORY,$200,"For the last 8 years of his life, Galileo was ...",Copernicus
1,4680,2004-12-31,Jeopardy!,ESPN's TOP 10 ALL-TIME ATHLETES,$200,No. 2: 1912 Olympian; football star at Carlisl...,Jim Thorpe
2,4680,2004-12-31,Jeopardy!,EVERYBODY TALKS ABOUT IT...,$200,The city of Yuma in this state has a record av...,Arizona
3,4680,2004-12-31,Jeopardy!,THE COMPANY LINE,$200,"In 1963, live on ""The Art Linkletter Show"", th...",McDonald's
4,4680,2004-12-31,Jeopardy!,EPITAPHS & TRIBUTES,$200,"Signer of the Dec. of Indep., framer of the Co...",John Adams


In [3]:
print(jpdy.columns)

Index(['Show Number', ' Air Date', ' Round', ' Category', ' Value',
       ' Question', ' Answer'],
      dtype='object')


In [4]:
jpdy.columns = [c.strip() for c in jpdy.columns]

In [5]:
print(len(jpdy.index))

216930


I'm going to cut jpdy to a tenth of its size, because my laptop doesn't have enough memory or processing power to run nlp() on every single question in this dataset.

In [6]:
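# .copy() makes the subset an independent DataFrame, so overwriting or
# adding columns later doesn't trigger pandas' SettingWithCopyWarning
jpdy = jpdy.iloc[:len(jpdy.index) // 10].copy()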

In [7]:
print(len(jpdy.index))

21693


In [8]:
start = time.time()
jpdy['Question'] = [nlp(q) for q in jpdy['Question']]
end = time.time()
print(f'Elapsed time: {end - start} seconds')
print(jpdy['Question'][:5])

Elapsed time: 148.8704149723053 seconds
0    (For, the, last, 8, years, of, his, life, ,, G...
1    (No, ., 2, :, 1912, Olympian, ;, football, sta...
2    (The, city, of, Yuma, in, this, state, has, a,...
3    (In, 1963, ,, live, on, ", The, Art, Linklette...
4    (Signer, of, the, Dec., of, Indep, ., ,, frame...
Name: Question, dtype: object


Notice the elapsed time above: roughly 149 seconds, or about two and a half minutes, even after cutting the data to one tenth of its original size.
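One possible way to reduce that time, sketched here under assumptions rather than measured on this dataset, is spaCy's `nlp.pipe`, which streams texts through the pipeline in batches instead of making one `nlp()` call per question. In the sketch below, `spacy.blank("en")` (a tokenizer-only pipeline) stands in for `en_core_web_sm` so the snippet runs without a model download, and the sample questions and `batch_size` are made up for illustration.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) stands in for
# en_core_web_sm so this sketch runs without downloading a model.
nlp = spacy.blank("en")

# Hypothetical stand-ins for the texts in jpdy['Question'].
questions = [
    "This planet is known as the Red Planet",
    "He wrote the novel Oliver Twist",
    "This is the largest ocean on Earth",
]

# nlp.pipe yields one Doc per input text, processing the texts in
# batches; the batch size used here is arbitrary.
docs = list(nlp.pipe(questions, batch_size=2))

for doc in docs:
    print(len(doc), [token.text for token in doc][:3])
```

With the real model, the same idea would look like `jpdy['Question'] = list(nlp.pipe(jpdy['Question']))`. How much time it saves depends on the model and the machine, so it is worth timing both versions.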

## Attributes of spaCy tokens

In [15]:
first_question = jpdy.iloc[0]['Question']
print(first_question, '\n')

for word in first_question:
    print(word.text, word.pos_, word.tag_, word.shape_, word.is_alpha, word.is_punct, word.is_stop)

[8, years, life, ,, Galileo, house, arrest, espousing, man, theory] 

8 NUM CD d False False False
years NOUN NNS xxxx True False False
life NOUN NN xxxx True False False
, PUNCT , , False True False
Galileo PROPN NNP Xxxxx True False False
house NOUN NN xxxx True False False
arrest NOUN NN xxxx True False False
espousing VERB VBG xxxx True False False
man NOUN NN xxx True False False
theory NOUN NN xxxx True False False
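The `shape_` column in the output above is the least self-explanatory attribute, so here is a rough pure-Python approximation of how such a shape can be derived. It is a sketch for intuition and not spaCy's own code: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, other characters are kept, and runs of the same shape character are cut off after four. The function name `word_shape` is hypothetical.

```python
def word_shape(text: str) -> str:
    """Approximate a spaCy-style word shape (illustrative, not spaCy's code)."""
    shape = []
    previous = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        if shape_char == previous:
            run_length += 1
        else:
            previous = shape_char
            run_length = 1
        # Keep at most four repeats of the same shape character.
        if run_length <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["8", "years", "Galileo", "man", ","]:
    print(word, word_shape(word))
```

For these five words the sketch reproduces the shapes printed above (`d`, `xxxx`, `Xxxxx`, `xxx`, `,`).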


## Document Similarity
A common task in NLP is measuring how semantically similar two documents are. spaCy supports this through the similarity() method that every parsed document has. Let's try to find the question that is most similar to our first question. Before doing that, let's take out the stop words.

In [20]:
new_question_list = []
for question in jpdy['Question']:
    # Rebuild each question without its stop words, then re-parse it
    kept_words = [token.text for token in question if not token.is_stop]
    new_question_list.append(nlp(' '.join(kept_words)))

jpdy['Question'] = new_question_list
first_question = jpdy.iloc[0]['Question']
print(first_question)


8 years life , Galileo house arrest espousing man theory


In [30]:
jpdy['Similarity to Question 1'] = [q.similarity(first_question) for q in jpdy['Question']]
jpdy = jpdy.sort_values(by = 'Similarity to Question 1', ascending = False)
print(jpdy['Question'][1:6])
# Position 0 should be the first question itself, so start at position 1
for question in jpdy['Question'][1:6]:
    print(' '.join([token.text for token in question]))


9358     (1287, storm, flooded, land, separating, North...
2536     (chapter, 52, novel, ,, boisterous, crowd, gat...
1541     (Da, ,, comrade--, fork, pierces, bird, ,, lau...
11850    (1903, Jack, London, work, dog, ,, half, St., ...
6841     ((, <, href="http://www.j, -, archive.com, /, ...
Name: Question, dtype: object
1287 storm flooded land separating North Sea & Zuiderzee , turning village major port city
chapter 52 novel , boisterous crowd gathering Fagin execution
Da , comrade-- fork pierces bird , launching jet fragrant melted butter chicken
1903 Jack London work dog , half St. Bernard , half Scotch shepherd , survives wilderness
( < href="http://www.j - archive.com / media/2007 - 12 - 26_DJ_27.jpg " target="_blank">A trainer dog stop curb Cheryl Clue Crew Seeing Eye New Jersey.</a > )    dog trained stop curbs reasons , safety & orientation , people visually impaired determine location counting
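For intuition about the scores behind this ranking: by default, spaCy estimates document similarity as the cosine similarity of the documents' averaged word vectors. The sketch below reproduces that calculation in plain Python with made-up three-dimensional vectors. The words, the vector values, and the helper names `mean_vector` and `cosine_similarity` are all hypothetical, and real models use vectors with hundreds of dimensions.

```python
import math


def mean_vector(vectors):
    """Average a list of equal-length word vectors component-wise."""
    return [sum(component) / len(vectors) for component in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy word vectors, invented for illustration only.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.1],
    "storm": [0.0, 0.2, 0.9],
    "flood": [0.1, 0.1, 0.8],
}

doc_a = mean_vector([toy_vectors["dog"], toy_vectors["puppy"]])
doc_b = mean_vector([toy_vectors["storm"], toy_vectors["flood"]])

print(cosine_similarity(doc_a, doc_a))  # a document compared with itself
print(cosine_similarity(doc_a, doc_b))  # two unrelated documents
```

Note that spaCy's small models, such as the en_core_web_sm model loaded at the top, do not ship with static word vectors, so the rankings above may be less meaningful than they would be with one of the larger models.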



In [None]:
print(jpdy.tail())

## spaCy in practice
I used spaCy to turn text into features (or variables) for a machine learning pipeline that I built at my job. First, to comply with the GDPR, I had to remove all potentially personally identifying information from the text. I was able to do that with spaCy, because I could set up rules like "if word.pos_ == 'PROPN': sentence[index] = word.shape_". Because spaCy turned my unstructured data into a more structured format, I was able to take out stop words, punctuation, and proper nouns, conduct sentiment analysis, and create more variables to use in my anomaly detection model.
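A rule like the one quoted above could be sketched as follows. This illustrates the idea and is not the pipeline from that job. To keep the snippet runnable without a model, it uses a hypothetical `FakeToken` stand-in whose attribute values were written by hand. The `redact` function is also hypothetical, and it reads only attributes that spaCy tokens expose (`text`, `pos_`, `shape_`, `is_stop`, `is_punct`), so a parsed spaCy document could be passed in instead.

```python
from collections import namedtuple

# Hypothetical stand-in that exposes the same attribute names as a
# spaCy token; the values below are hand-written, not model output.
FakeToken = namedtuple(
    "FakeToken", ["text", "pos_", "shape_", "is_stop", "is_punct"]
)


def redact(tokens):
    """Drop stop words and punctuation; mask proper nouns with their shape."""
    cleaned = []
    for token in tokens:
        if token.is_stop or token.is_punct:
            continue
        cleaned.append(token.shape_ if token.pos_ == "PROPN" else token.text)
    return cleaned


sentence = [
    FakeToken("Galileo", "PROPN", "Xxxxx", False, False),
    FakeToken("was", "AUX", "xxx", True, False),
    FakeToken("under", "ADP", "xxxx", True, False),
    FakeToken("house", "NOUN", "xxxx", False, False),
    FakeToken("arrest", "NOUN", "xxxx", False, False),
    FakeToken(".", "PUNCT", ".", False, True),
]

print(redact(sentence))  # ['Xxxxx', 'house', 'arrest']
```

Masking with the shape keeps a hint that a capitalized name was present without keeping the name itself. Whether that is enough for a given compliance requirement is a separate question that this sketch does not answer.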

The other thing about spaCy is that it takes a LONG TIME to do things. Neural networks take a lot of computation, and they're generally better suited to GPUs than CPUs, but most laptops have far more CPU power than GPU power, if they have a GPU at all. Usually you'll run spaCy on a server if you're applying it to a large dataset like this one. In my previous work I had a 16-core server, and spaCy still took quite a while to run.