# spaCy

spaCy is an open-source software library for advanced natural language processing, written in Cython.
<br>
Its focus is on production use, and it excels at large-scale information extraction tasks.
<br>

spaCy provides the following key features:
<ol>
    <li>Non-destructive tokenization</li>
    <li>Named entity recognition</li>
    <li>"Alpha tokenization" support for over 25 languages</li>
    <li>Pre-trained word vectors</li>
    <li>Part-of-speech tagging</li>
    <li>Labelled dependency parsing</li>
    <li>Syntax-driven sentence segmentation</li>
    <li>Text classification</li>
    <li>Built-in visualizers for syntax and named entities</li>
    <li>Deep learning integration</li>
</ol>

In [1]:
import warnings
warnings.filterwarnings('ignore')

import re
import string
from collections import defaultdict, Counter

import pandas as pd
import numpy as np

import spacy
from spacy.tokenizer import Tokenizer

from gensim.models import Word2Vec, KeyedVectors

import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

In [2]:
# Load the data

# train_data = pd.read_csv('../data/train.csv')
train_data = pd.read_csv('../data/small_train.csv')
test_data = pd.read_csv('../data/test.csv')

# train_data = train_data[:200000]

train_text = train_data['question_text'].values
train_labels = train_data['target'].values

test_text = test_data['question_text'].values
test_qid = test_data['qid'].values

# Load the spaCy model
spacy_model = spacy.load('en_core_web_sm')

In [None]:
# To install a spaCy model, run the following from the command line:
# python -m spacy download en_core_web_sm

## Data preprocessing

The first thing we can do with the data is convert all letters to lowercase.

In [3]:
# Convert to lowercase
train_text = [token.lower() for token in train_text]
test_text = [token.lower() for token in test_text]

### Processing pipeline

When we call <b>spacy_model</b> on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. 

<img src="processing.png" alt="processing">

<br>
where <b>tagger</b> assigns part-of-speech tags, <b>parser</b> assigns dependency labels and <b>ner</b> detects and labels named entities.
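
To make the idea of a pipeline concrete, here is a minimal pure-Python sketch. It does not use spaCy: `toy_tokenizer`, `toy_tagger`, `toy_ner` and `run_pipeline` are hypothetical names, and a plain dict stands in for the Doc object. It only illustrates that each component receives the document produced by the previous step, annotates it and passes it on.

```python
def toy_tokenizer(text):
    """Create the stand-in 'doc': a dict holding whitespace-separated tokens."""
    return {"text": text, "tokens": text.split(), "tags": [], "ents": []}

def toy_tagger(doc):
    """Hypothetical tagger: capitalized tokens become PROPN, everything else X."""
    doc["tags"] = ["PROPN" if tok[:1].isupper() else "X" for tok in doc["tokens"]]
    return doc

def toy_ner(doc):
    """Hypothetical entity recognizer that relies on the tags set by the tagger."""
    doc["ents"] = [tok for tok, tag in zip(doc["tokens"], doc["tags"]) if tag == "PROPN"]
    return doc

def run_pipeline(text, components):
    """Tokenize first, then apply every (name, component) pair in order."""
    doc = toy_tokenizer(text)
    for name, component in components:
        doc = component(doc)
    return doc

pipeline = [("tagger", toy_tagger), ("ner", toy_ner)]
doc = run_pipeline("google was founded in California", pipeline)
print(doc["tags"])
print(doc["ents"])
```

In spaCy itself, the names of the components of a loaded model can be inspected with `spacy_model.pipe_names`.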

### Word tokenization

Tokenization is the task of splitting a text into meaningful segments, called tokens.
<br>
The input to the tokenizer is a unicode text, and the output is a Doc object, which is a sequence of tokens.

spaCy introduces a novel tokenization algorithm that gives a better balance between performance, ease of definition, and ease of alignment with the original string.
<br><br>
The algorithm proceeds in the following steps:
<ol>
    <li>Iterate over space-separated substrings.</li>
    <li>Check whether we have an explicitly defined rule for this substring. If we do, use it.</li>
    <li>Otherwise, try to consume a prefix.</li>
    <li>If we consumed a prefix, go back to the beginning of the loop, so that special cases always get priority.</li>
    <li>If we didn't consume a prefix, try to consume a suffix.</li>
    <li>If we can't consume a prefix or suffix, look for "infixes", such as hyphens.</li>
    <li>Once we can't consume any more of the string, handle it as a single token.</li>
</ol>
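
The steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not spaCy's implementation: the rule sets `SPECIAL_CASES`, `PREFIXES`, `SUFFIXES` and `INFIX_RE` are tiny, made-up examples, and `toy_tokenize` and `split_infixes` are hypothetical helper names.

```python
import re

# Hypothetical, deliberately tiny rule sets (assumptions for illustration only).
SPECIAL_CASES = {"don't": ["do", "n't"], "u.s.": ["u.s."]}
PREFIXES = ("(", "[", '"', "'")
SUFFIXES = (")", "]", '"', "'", ".", ",", "!", "?")
INFIX_RE = re.compile(r"[-~]")

def split_infixes(substring):
    """Split on infix characters, keeping each infix as its own token."""
    pieces, start = [], 0
    for match in INFIX_RE.finditer(substring):
        if match.start() > start:
            pieces.append(substring[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(substring):
        pieces.append(substring[start:])
    return pieces

def toy_tokenize(text):
    tokens = []
    for substring in text.split():            # step 1: space-separated substrings
        suffixes = []                         # consumed suffixes, kept in original order
        while substring:
            if substring in SPECIAL_CASES:    # step 2: explicit rules come first
                tokens.extend(SPECIAL_CASES[substring])
                substring = ""
            elif substring[0] in PREFIXES:    # steps 3-4: consume a prefix, then loop again
                tokens.append(substring[0])
                substring = substring[1:]
            elif substring[-1] in SUFFIXES:   # step 5: consume a suffix
                suffixes.insert(0, substring[-1])
                substring = substring[:-1]
            else:                             # steps 6-7: split on infixes, or keep whole
                tokens.extend(split_infixes(substring))
                substring = ""
        tokens.extend(suffixes)
    return tokens

print(toy_tokenize("(don't go to the u.s. state-owned shop!)"))
```

Because the special-case check is repeated after every consumed prefix or suffix, `"(don't"` is first stripped of `(` and then matched against the explicit rule for `don't`.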

In [4]:
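# Note: a Tokenizer created from the vocab alone has no prefix, suffix or infix rules,
# so it splits on whitespace only. The model's full rule set is available as spacy_model.tokenizer.
tokenizer = Tokenizer(spacy_model.vocab)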

tokenized_words_train = [tokenizer(sent) for sent in train_text]
tokenized_words_test = [tokenizer(sent) for sent in test_text]

In [None]:
np.save('tokenized_words_spacy_train', tokenized_words_train)
np.save('tokenized_words_spacy_test', tokenized_words_test)

In [None]:
tokenized_words_train = np.load('tokenized_words_spacy_train.npy')
tokenized_words_test = np.load('tokenized_words_spacy_test.npy')

In [None]:
tokenized_words_train[0:5]

In [5]:
# Remove punctuation and numbers
tokenized_words_train = [[word for word in sent if word.is_alpha] for sent in tokenized_words_train]

In [None]:
tokenized_words_train[2]

In [6]:
# Remove non-ASCII characters
tokenized_words_train_flat = [item for sublist in tokenized_words_train for item in sublist]

cleaned_data = [re.sub(r'[^\x00-\x7f]', r'', word.text) for word in tokenized_words_train_flat]

In [7]:
# Remove low-frequency words
freq_words = Counter(cleaned_data)

cleaned_data = { key : value for key, value in freq_words.items() if value > 10 }

# Keep only the tokens whose text occurs more than 10 times in the training data
filtered_data = [
    [word for word in sent if word.text in cleaned_data]
    for sent in tokenized_words_train
]

In [None]:
freq_words

<b>Caution!</b> Removing the stop words might not be the best approach.

In [None]:
# Remove stop words
# filtered_data_no_stopwords = [[word for word in sent if word.is_stop == False] for sent in filtered_data]

# filtered_data = filtered_data_no_stopwords

In [8]:
# Lemmatization: replace every token with its lemma (a plain string)
filtered_data_lemma = [[word.lemma_ for word in sent] for sent in filtered_data]

filtered_data = filtered_data_lemma

In [None]:
filtered_data[:5]

## Word embedding

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers.
<br>
Word2Vec is one of the most popular techniques for learning word embeddings with a shallow neural network. It was developed in 2013 by a team at Google led by Tomas Mikolov.
<br>
It tries to make words with similar context occupy close spatial positions.
<br><br>
A Word2Vec model can be trained with one of two architectures:
<ol>
    <li>Skip Gram</li>
    <li>Continuous Bag of Words (CBOW)</li>
</ol>
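
The difference between the two techniques lies in what is predicted from what. The sketch below builds the training examples each one would see for a short sentence. It is an illustration only: `skipgram_pairs` and `cbow_examples` are hypothetical helper names, and no neural network is trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: predict each context word from the center word -> (center, context) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: predict the center word from its surrounding words -> (context, center) examples."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        examples.append((context, center))
    return examples

sentence = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

When training with gensim's `Word2Vec`, the architecture is selected with the `sg` parameter (`sg=1` for Skip Gram, `sg=0` for CBOW).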

In [9]:
embed_wiki = KeyedVectors.load_word2vec_format('../data/wiki-news-300d-1M.vec')

In [10]:
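# Look up the embedding of every known word; `word in embed_wiki` also works in gensim 4,
# where the `.vocab` attribute was removed
X = [[embed_wiki[word] for word in sent if word in embed_wiki] for sent in filtered_data]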

In [16]:
X[0]

[array([-8.120e-02,  3.960e-02, -1.820e-02, -5.860e-02,  8.300e-02,
        -2.740e-02,  2.170e-02,  1.860e-02,  4.780e-02,  1.640e-02,
        -4.920e-02, -5.420e-02, -3.350e-02, -4.930e-02,  3.570e-02,
         6.080e-02,  9.100e-03,  9.810e-02, -1.360e-01,  5.870e-02,
        -1.060e-02, -3.830e-02,  1.400e-02,  2.154e-01, -2.550e-02,
         1.940e-02,  6.100e-03, -2.200e-02, -5.250e-02, -5.330e-02,
         4.160e-02,  4.580e-02,  1.761e-01, -1.000e-02, -7.480e-02,
         6.100e-02, -4.910e-02, -2.860e-02,  8.860e-02, -6.320e-02,
         5.110e-02, -1.239e-01,  9.600e-03, -8.960e-02,  4.720e-02,
         5.700e-03, -1.206e-01,  8.110e-02, -3.650e-02, -1.055e-01,
         2.380e-02, -1.230e-01, -6.766e-01, -4.420e-02, -6.900e-03,
         2.820e-02, -9.410e-02,  2.770e-02, -1.113e-01, -1.184e-01,
        -1.700e-03, -4.110e-02,  3.690e-02,  3.240e-02, -6.030e-02,
        -1.790e-01, -2.960e-02, -3.940e-02, -1.300e-03, -3.100e-02,
        -6.640e-02, -2.900e-03, -3.000e-02,  2.6

In [47]:
# Get average of the vectors
X_avg = []

for vector in X:
    if len(vector) >= 1:
        X_avg.append(np.mean(vector, axis=0))
    else:
        X_avg.append(np.zeros(300))

In [17]:
X_avg[0]

array([ 2.26583332e-02, -2.42666658e-02,  1.51583338e-02,  1.97500009e-02,
       -1.43416673e-02,  2.72416677e-02,  6.77499920e-02,  1.30166672e-02,
        9.54999961e-03, -2.90416684e-02, -1.05916662e-02,  2.03166660e-02,
        5.16000092e-02,  2.08750013e-02,  1.45583330e-02, -1.94583330e-02,
        3.54166664e-02,  5.03416695e-02, -5.43833375e-02,  1.39249973e-02,
       -7.91499987e-02, -8.33333237e-04,  1.90249998e-02,  2.24833284e-02,
       -2.36666668e-02, -2.69249994e-02,  2.50416640e-02,  6.63583353e-02,
       -3.49999964e-03, -3.56749967e-02,  2.44499985e-02, -7.48333381e-03,
        4.66083325e-02, -2.47499999e-02,  6.45666644e-02,  2.88166646e-02,
       -5.04250042e-02,  6.69166585e-03,  6.97499886e-03, -4.32500010e-03,
        7.40833348e-03,  1.97416656e-02, -1.65999997e-02, -9.30000003e-03,
        1.13333447e-03, -8.40000156e-03, -2.00999994e-02, -5.99999621e-04,
       -6.04999997e-03, -2.09583323e-02,  1.56166665e-02,  3.38833332e-02,
       -6.66491687e-01, -

In [48]:
X_avg = np.array(X_avg)

In [13]:
# Cross-validation
LR = LogisticRegression()

scores = cross_val_score(
    LR, 
    X_avg, 
    train_labels, 
    cv = 5, 
    scoring = 'f1_macro'
)

In [18]:
avg_score = scores.mean()
avg_score

0.6377160304145425

In [40]:
X_test = [[embed_wiki[word.text] for word in sent if word.text in embed_wiki] for sent in tokenized_words_test]

In [43]:
# Get average of the vectors
X_avg_test = []

for vector in X_test:
    if len(vector) >= 1:
        X_avg_test.append(np.mean(vector, axis=0))
    else:
        X_avg_test.append(np.zeros(300))

In [49]:
LR.fit(X_avg, train_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [50]:
y_pred = LR.predict(X_avg_test)

In [55]:
df = pd.DataFrame({'qid': test_qid, 'prediction': y_pred})
df = df[['qid', 'prediction']]
df.to_csv('submission.csv', index = False)

## Additional features

In [None]:
text = train_text[:100]

# Run the full pipeline on every question (each element of `text` is a whole sentence)
tokenized_text = [spacy_model(sent) for sent in text]

### Part-of-speech tagging

<img src="tagging.png" alt="tagging">

In [None]:
# Part-of-speech tagging
tagged_text = [{word : word.tag_ for word in sent} for sent in tokenized_text]

In [None]:
tagged_text

### Dependency parsing

A dependency parser analyzes the grammatical structure of a sentence, establishing relationships between "head" words and words which modify those heads.
<br>
<img src="parsing.png" alt="parsing">
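
The head/dependent relation can be pictured as a simple data structure. The sketch below uses a hand-written parse of a four-word sentence (an assumption for illustration, not model output) in which every token stores the index of its head; `children_of` and `root_of` are hypothetical helper names.

```python
# Each entry is (text, index of head token, dependency label).
# Hand-written for illustration; the root is modelled as its own head.
parse = [
    ("She", 1, "nsubj"),
    ("ate", 1, "ROOT"),
    ("the", 3, "det"),
    ("pizza", 1, "dobj"),
]

def children_of(parse, head_index):
    """Return the texts of all tokens that directly modify the given head."""
    return [
        text
        for i, (text, head, dep) in enumerate(parse)
        if head == head_index and i != head_index
    ]

def root_of(parse):
    """Return the text of the token that is its own head."""
    for i, (text, head, dep) in enumerate(parse):
        if head == i:
            return text
    return None

for text, head, dep in parse:
    print(f"{text:<6} --{dep}--> {parse[head][0]}")

print(root_of(parse), children_of(parse, 1))
```

In spaCy, the same information is available on each token through `token.dep_`, `token.head` and `token.children`.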

#### Noun chunks

Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it.

In [None]:
# Get noun chunks

sample_sent = tokenized_text[1]

text = []
root = []
root_dep = []
root_head = []

for chunk in sample_sent.noun_chunks:
    text.append(chunk.text)
    root.append(chunk.root.text)
    root_dep.append(chunk.root.dep_)
    root_head.append(chunk.root.head.text)

df = pd.DataFrame({
        'TEXT': text, 
        'ROOT.TEXT': root, 
        'ROOT.DEP': root_dep, 
        'ROOT.HEAD.TEXT': root_head
    })

df = df[['TEXT', 'ROOT.TEXT', 'ROOT.DEP', 'ROOT.HEAD.TEXT']]

In [None]:
print(sample_sent)
df

<b>Text</b>: The original noun chunk text.
<br>
<b>Root text</b>: The original text of the word connecting the noun chunk to the rest of the parse.
<br>
<b>Root dep</b>: Dependency relation connecting the root to its head.
<br>
<b>Root head text</b>: The text of the root token's head.

### Named entity recognition

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.
<br>
The image below shows the entity types that spaCy supports.

<img src="entity_types.png" alt="entity_types" >

In [None]:
# Named entity recognition
sample_sent = 'Google was founded in 1998 in California'
doc = spacy_model(sample_sent)

text = []
start = []
end = []
label = []

for ent in doc.ents:
    text.append(ent.text)
    start.append(ent.start_char)
    end.append(ent.end_char)
    label.append(ent.label_)
    
df = pd.DataFrame({
    'TEXT': text, 
    'START': start, 
    'END': end, 
    'LABEL': label
})

df = df[['TEXT', 'START', 'END', 'LABEL']]

In [None]:
df

### Similarity

In [None]:
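# Note: small models such as en_core_web_sm do not ship with real word vectors,
# so these scores are less reliable than with a model like en_core_web_md
text = 'dog cat chair'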
doc = spacy_model(text)

token1 = doc[0]
token2 = doc[1]
token3 = doc[2]


print('Similarity between {0} and {1} is: {2}'.format(token1, token2, token1.similarity(token2)))
print('Similarity between {0} and {1} is: {2}'.format(token1, token3, token1.similarity(token3)))
print('Similarity between {0} and {1} is: {2}'.format(token1, token1, token1.similarity(token1)))
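
By default, spaCy's `similarity` is based on the cosine similarity of the underlying vectors. The following sketch shows that measure in plain Python. The three-dimensional vectors in `toy_vectors` are made-up values chosen for illustration, not real embeddings, and `cosine_similarity` is a hypothetical helper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Made-up vectors (assumption): 'dog' and 'cat' point in similar directions, 'chair' does not.
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "cat": [0.8, 0.9, 0.2],
    "chair": [0.1, 0.2, 0.9],
}

for other in ("cat", "chair", "dog"):
    score = cosine_similarity(toy_vectors["dog"], toy_vectors[other])
    print("Similarity between dog and {0} is: {1:.3f}".format(other, score))
```

A vector compared with itself always scores 1, and vectors pointing in similar directions score higher than vectors pointing in different directions.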

### spaCy word vectors

In [None]:
# Load a spaCy model that includes word vectors
spacy_model = spacy.load('en_core_web_md')

In [None]:
text = train_text[:100]

tokenized_text = [spacy_model(sent) for sent in text]

text = []
has_vector = []
vector_norm = []

for sent in tokenized_text:
    for word in sent:
        text.append(word.text)
        has_vector.append(word.has_vector)
        vector_norm.append(word.vector_norm)
    
    
df = pd.DataFrame({
        'TEXT': text, 
        'HAS VECTOR': has_vector, 
        'VECTOR NORM': vector_norm, 
    })

df = df[['TEXT', 'HAS VECTOR', 'VECTOR NORM']]

In [None]:
df.head()

In [None]:
def represent_word_spacy(word, etypes=frozenset(['MONEY', 'DATE', 'TIME',
                                                 'CARDINAL', 'PERCENT'])):
    """
    Return a normalized representation of a spaCy token.
    Args:
        word (spacy.tokens.Token): token produced by a spaCy pipeline
        etypes (set): entity types that are replaced by their lowercased label
    Returns:
        str: a placeholder such as 'url', 'date' or 'cardinal', or the lowercased lemma
    """
    # Collapse every run of digits into a single 'd' (e.g. '12/05/2018' -> 'd/d/d')
    digit_shape = re.sub(r'\d+', 'd', word.text)

    if word.like_url:
        return 'url'
    elif word.ent_type_ in etypes:
        return word.ent_type_.lower()
    elif '%' in word.lemma_:
        return 'percent'
    elif word.like_num:
        return 'cardinal'
    # Both conditions are tested explicitly: ("dd" and ":") evaluates to just ":"
    elif 'dd' in word.shape_ and ':' in word.shape_:
        return 'time'
    elif '$' in word.text:
        return re.sub(r'\d', 'd', word.text)
    # Check whether the string looks like a date; ("a" or "b" or "c") would only test "a"
    elif any(pattern in digit_shape for pattern in ('d/d/d', 'd-d-d', 'd.d.d')):
        return 'date'
    else:
        return word.lemma_.lower()

In [None]:
cleaned_data = []
temp = []

for sent in tokenized_text:
    for word in sent:
        temp.append(represent_word_spacy(word))
    cleaned_data.append(temp)
    temp = []

In [None]:
tokenized_text[0]

In [None]:
cleaned_data[0]