# spaCy

spaCy is an open-source software library for advanced natural language processing, written in Python and Cython.
<br>
Its focus is on providing software for production use, and it excels at large-scale information extraction tasks.
<br>

spaCy provides the following key features:
<ol>
    <li>Non-destructive tokenization</li>
    <li>Named entity recognition</li>
    <li>"Alpha tokenization" support for over 25 languages</li>
    <li>Statistical models for 8 languages</li>
    <li>Pre-trained word vectors</li>
    <li>Part-of-speech tagging</li>
    <li>Labelled dependency parsing</li>
    <li>Syntax-driven sentence segmentation</li>
    <li>Text classification</li>
    <li>Built-in visualizers for syntax and named entities</li>
    <li>Deep learning integration</li>
</ol>

In [3]:
import warnings
warnings.filterwarnings('ignore')

import re
import string
from collections import defaultdict, Counter

import pandas as pd
import numpy as np

import spacy
from spacy.tokenizer import Tokenizer

from gensim.models import Word2Vec, KeyedVectors

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

In [4]:
# Load the data

train_data = pd.read_csv('../data/train.csv')
test_data = pd.read_csv('../data/test.csv')

train_data = train_data[:400000]

train_text = train_data['question_text'].values
train_labels = train_data['target'].values

test_text = test_data['question_text'].values
test_qid = test_data['qid'].values

# Load the spaCy model
spacy_model = spacy.load('en_core_web_sm')

## Data preprocessing

The first preprocessing step is to convert all the letters to lowercase.

In [5]:
# Convert to lowercase (each element is a full question, not a single token)
train_text = [question.lower() for question in train_text]
test_text = [question.lower() for question in test_text]

### Processing pipeline

When we call <b>spacy_model</b> on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. 

<img src="processing.png" alt="processing">

<br>
where <b>tagger</b> assigns part-of-speech tags, <b>parser</b> assigns dependency labels and <b>ner</b> detects and labels named entities.

### Word tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens.
<br>
The input to the tokenizer is a unicode text, and the output is a Doc object, which is a sequence of tokens.

spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition, and ease of alignment with the original string.
<br><br>
The algorithm proceeds in the following steps:
<ol>
    <li>Iterate over space-separated substrings.</li>
    <li>Check whether we have an explicitly defined rule for this substring. If we do, use it.</li>
    <li>Otherwise, try to consume a prefix.</li>
    <li>If we consumed a prefix, go back to the beginning of the loop, so that special-cases always get priority.</li>
    <li>If we didn't consume a prefix, try to consume a suffix.</li>
    <li>If we can't consume a prefix or suffix, look for "infixes", such as hyphens.</li>
    <li>Once we can't consume any more of the string, handle it as a single token.</li>
</ol>
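
The steps above can be sketched in plain Python. This is a toy illustration, not spaCy's implementation: the special cases, prefix and suffix characters, and infix pattern below are hypothetical and far smaller than the language-specific rules spaCy ships with.

```python
import re

# Hypothetical rules, for illustration only
SPECIAL_CASES = {"don't": ["do", "n't"]}
PREFIX_CHARS = "([\"'"
SUFFIX_CHARS = ")]\"'.,!?"
INFIX_RE = re.compile(r"([-~])")


def toy_tokenize(text):
    tokens = []
    for substring in text.split():              # iterate over space-separated substrings
        suffixes = []                           # suffixes are emitted after the core token
        while substring:
            if substring in SPECIAL_CASES:      # an explicit rule always wins
                tokens.extend(SPECIAL_CASES[substring])
                substring = ""
            elif substring[0] in PREFIX_CHARS:  # consume a prefix, then loop again
                tokens.append(substring[0])
                substring = substring[1:]
            elif substring[-1] in SUFFIX_CHARS: # otherwise consume a suffix
                suffixes.insert(0, substring[-1])
                substring = substring[:-1]
            else:                               # split on infixes; the rest stays whole
                tokens.extend(part for part in INFIX_RE.split(substring) if part)
                substring = ""
        tokens.extend(suffixes)
    return tokens


print(toy_tokenize("(hello-world), don't go!"))
# ['(', 'hello', '-', 'world', ')', ',', 'do', "n't", 'go', '!']
```

Checking the special cases at the top of every loop iteration is what gives them priority: after each prefix or suffix is stripped, the remainder gets another chance to match an explicit rule.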

In [6]:
tokenizer = Tokenizer(spacy_model.vocab)

tokenized_words_train = [tokenizer(sent) for sent in train_text]
tokenized_words_test = [tokenizer(sent) for sent in test_text]

In [5]:
np.save('tokenized_words_spacy_train', tokenized_words_train)
np.save('tokenized_words_spacy_test', tokenized_words_test)

In [6]:
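# allow_pickle=True is needed because the arrays hold Python objects;
# recent NumPy versions refuse to load such arrays by default
tokenized_words_train = np.load('tokenized_words_spacy_train.npy', allow_pickle=True)
tokenized_words_test = np.load('tokenized_words_spacy_test.npy', allow_pickle=True)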

In [12]:
tokenized_words_train[0:5]

[how did quebec nationalists see their province as a nation in the 1960s?,
 do you have an adopted dog, how would you encourage people to adopt and not shop?,
 why does velocity affect time? does velocity affect space geometry?,
 how did otto von guericke used the magdeburg hemispheres?,
 can i convert montra helicon d to a mountain bike by just changing the tyres?]

In [18]:
# Remove punctuation and numbers
tokenized_words_train = [[word for word in sent if word.is_alpha] for sent in tokenized_words_train]

In [25]:
tokenized_words_train[0]

[how, did, quebec, nationalists, see, their, province, as, a, nation, in, the]

In [21]:
# Remove non-ASCII characters
tokenized_words_train_flat = [item for sublist in tokenized_words_train for item in sublist]

cleaned_data = [re.sub(r'[^\x00-\x7f]', r'', word.text) for word in tokenized_words_train_flat]

In [55]:
# Remove low-frequency words
freq_words = Counter(cleaned_data)

# Keep only the words that occur more than 10 times
frequent_words = {key: value for key, value in freq_words.items() if value > 10}

filtered_data = [
    [word for word in sent if word.text in frequent_words]
    for sent in tokenized_words_train
]

In [57]:
# Remove stop words
filtered_data = [[word for word in sent if not word.is_stop] for sent in filtered_data]

In [58]:
filtered_data[:5]

[[quebec, nationalists, province, nation],
 [adopted, encourage, people, adopt],
 [velocity, affect, velocity, affect, space],
 [otto, von],
 [convert, d, mountain, bike, changing]]

In [59]:
# Lemmatization: replace each token with its base form
filtered_data = [[word.lemma_ for word in sent] for sent in filtered_data]

In [61]:
filtered_data[:5]

[['quebec', 'nationalist', 'province', 'nation'],
 ['adopt', 'encourage', 'people', 'adopt'],
 ['velocity', 'affect', 'velocity', 'affect', 'space'],
 ['otto', 'von'],
 ['convert', 'have', 'mountain', 'bike', 'change']]

## Word embedding

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers.
<br>
Word2Vec is one of the most popular techniques for learning word embeddings with a shallow neural network. It was developed in 2013 by a team at Google led by Tomas Mikolov.
<br>
It tries to place words that occur in similar contexts close together in the vector space.
<br><br>
A Word2Vec model can be trained with one of two architectures:
<ol>
    <li>Skip Gram</li>
    <li>Continuous Bag of Words (CBOW)</li>
</ol>
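
The difference between the two architectures lies in how training examples are built from a sliding context window. The sketch below only generates the training pairs and does not train a network. The function names and the `window` parameter are hypothetical and chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_pairs(tokens, window=2):
    """CBOW: predict the center word from all of its context words at once."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs


sentence = ['quebec', 'nationalist', 'province', 'nation']
print(skipgram_pairs(sentence, window=1))
print(cbow_pairs(sentence, window=1))
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position, with the whole context as input.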

In [2]:
embed_wiki = KeyedVectors.load_word2vec_format('../data/wiki-news-300d-1M.vec')

In [62]:
# Test membership on the KeyedVectors object itself (the .vocab attribute was removed in gensim 4)
X = [[embed_wiki[word] for word in sent if word in embed_wiki] for sent in filtered_data]

In [171]:
# X[0]

In [69]:
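# Get average of the vectors
# Note: np.mean without an axis argument averages over every component of every
# word vector, so each question is reduced to a single scalar feature.
X_avg = []

for vector in X:
    if len(vector) >= 1:
        X_avg.append(np.mean(vector))
    else:
        # No in-vocabulary words are left for this question
        X_avg.append(0)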

In [70]:
X_avg = np.array(X_avg)
X_avg = X_avg.reshape(-1, 1)

In [72]:
X_avg[:5]

array([[ 0.00032158],
       [-0.00506292],
       [-0.004897  ],
       [ 0.001083  ],
       [-0.00152267]])
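
Calling `np.mean` without an `axis` argument collapses all components into one number per question, which is why each row above holds a single value. A common alternative is to average component-wise, so that each question keeps a vector with the same dimensionality as the embeddings. The sketch below shows the difference on made-up 3-dimensional vectors. The helper name `mean_pool` is hypothetical, and this is not the method used for the scores in this notebook.

```python
import numpy as np


def mean_pool(word_vectors, dim):
    """Average word vectors component-wise; return zeros when there are none."""
    if len(word_vectors) == 0:
        return np.zeros(dim)
    return np.mean(word_vectors, axis=0)


# Toy 3-dimensional "embeddings" (made-up values, not taken from the wiki-news vectors)
sentence_vectors = [np.array([1.0, 2.0, 3.0]), np.array([3.0, 4.0, 5.0])]

pooled = mean_pool(sentence_vectors, dim=3)
print(pooled)                      # [2. 3. 4.]
print(np.mean(sentence_vectors))   # 3.0, a single scalar for the whole sentence
print(mean_pool([], dim=3))        # [0. 0. 0.]
```

With component-wise pooling the feature matrix would have one column per embedding dimension, so the `reshape(-1, 1)` step would no longer apply.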

In [75]:
# Cross-validation
LR = LogisticRegression()

scores = cross_val_score(
    LR, 
    X_avg, 
    train_labels, 
    cv = 5, 
    scoring = 'f1_macro'
)

In [76]:
avg_score = np.sum(scores) / len(scores)
avg_score

0.4839921463708913
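
The `f1_macro` scoring option computes the F1 score for each class separately and then takes their unweighted mean. The following pure-Python sketch shows that computation on made-up labels. The helper names are hypothetical, and the example data is not taken from the dataset used here.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up imbalanced labels: nine negatives, one positive,
# and a classifier that always predicts the majority class
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print(macro_f1(y_true, y_pred))
```

Because every class counts equally, a classifier that never predicts the minority class gets an F1 of 0 for that class, which keeps the macro average below 0.5 on imbalanced data.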

## Additional spaCy features

In [106]:
text = train_text[:100]

# Run the full pipeline on each question
tokenized_text = [spacy_model(question) for question in text]

### Part-of-speech tagging

<img src="tagging.png" alt="tagging">

In [100]:
# Part-of-speech tagging
tagged_text = [{word : word.tag_ for word in sent} for sent in tokenized_text]

In [170]:
# tagged_text

### Dependency parsing

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads.
<br>
<img src="parsing.png" alt="parsing">

#### Noun chunks

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it.

In [144]:
# Get noun chunks

sample_sent = tokenized_text[1]

text = []
root = []
root_dep = []
root_head = []

for chunk in sample_sent.noun_chunks:
    text.append(chunk.text)
    root.append(chunk.root.text)
    root_dep.append(chunk.root.dep_)
    root_head.append(chunk.root.head.text)

df = pd.DataFrame({
        'TEXT': text, 
        'ROOT.TEXT': root, 
        'ROOT.DEP': root_dep, 
        'ROOT.HEAD.TEXT': root_head
    })

df = df[['TEXT', 'ROOT.TEXT', 'ROOT.DEP', 'ROOT.HEAD.TEXT']]

In [145]:
print(sample_sent)
df

do you have an adopted dog, how would you encourage people to adopt and not shop?


             TEXT  ROOT.TEXT  ROOT.DEP  ROOT.HEAD.TEXT
0             you        you     nsubj            have
1  an adopted dog        dog      dobj            have
2             you        you     nsubj       encourage
3          people     people      dobj       encourage


<b>Text</b>: The original noun chunk text.
<br>
<b>Root text</b>: The original text of the word connecting the noun chunk to the rest of the parse.
<br>
<b>Root dep</b>: Dependency relation connecting the root to its head.
<br>
<b>Root head text</b>: The text of the root token's head.

### Named entity recognition

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

In [146]:
# Named entity recognition
sample_sent = 'Google was founded in 1998 in California'
doc = spacy_model(sample_sent)

text = []
start = []
end = []
label = []

for ent in doc.ents:
    text.append(ent.text)
    start.append(ent.start_char)
    end.append(ent.end_char)
    label.append(ent.label_)
    
df = pd.DataFrame({
    'TEXT': text, 
    'START': start, 
    'END': end, 
    'LABEL': label
})

df = df[['TEXT', 'START', 'END', 'LABEL']]

In [147]:
df

         TEXT  START  END  LABEL
0      Google      0    6    ORG
1        1998     22   26   DATE
2  California     30   40    GPE


### Similarity

In [168]:
text = 'dog cat chair'
doc = spacy_model(text)

token1 = doc[0]
token2 = doc[1]
token3 = doc[2]


print('Similarity between {0} and {1} is: {2}'.format(token1, token2, token1.similarity(token2)))
print('Similarity between {0} and {1} is: {2}'.format(token1, token3, token1.similarity(token3)))
print('Similarity between {0} and {1} is: {2}'.format(token1, token1, token1.similarity(token1)))

Similarity between dog and cat is: 0.509429395198822
Similarity between dog and chair is: 0.35649821162223816
Similarity between dog and dog is: 1.0
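
By default, spaCy's `similarity` method returns the cosine similarity between the vectors of the two objects. The sketch below computes that measure with NumPy on made-up 3-dimensional vectors. The values are hypothetical and do not reproduce the scores above.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors, purely for illustration
dog = [0.9, 0.8, 0.1]
cat = [0.8, 0.9, 0.2]
chair = [0.1, 0.2, 0.9]

print(cosine_similarity(dog, cat))
print(cosine_similarity(dog, chair))
print(cosine_similarity(dog, dog))
```

Cosine similarity depends only on the direction of the vectors, not on their length, which is why a vector compared with itself always scores 1.0.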
