# **FastText for Semantic Similarity**

**FastText** is a library for learning word embeddings and for text classification. Developed by Facebook, it has shown strong results on many NLP tasks, such as **semantic similarity detection and text classification**.

This project shows how the FastText library creates vector representations that can be used to find semantic similarities between words.


The Wikipedia library for Python needs to be installed first.

In [None]:
pip install wikipedia 

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11696 sha256=ff7f7b872419eca1fea0bb42a82c957fca1ab020590ba1957268df39a7d33260
  Stored in directory: /root/.cache/pip/wheels/15/93/6d/5b2c68b8a64c7a7a04947b4ed6d89fb557dcc6bc27d1d7f3ba
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


FastText supports both the **Continuous Bag of Words (CBOW)** and **Skip-Gram** models. Here the skip-gram model is used to learn vector representations of words from Wikipedia articles on artificial intelligence, deep learning, and neural networks. I chose these closely related topics so that the corpus stays within a single domain.

## Importing required libraries

In [None]:
import wikipedia
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from keras.preprocessing.text import Tokenizer
from gensim.models.fasttext import FastText 
# FastText is imported from gensim.models.fasttext; gensim's implementation
# is used here for the word representations and semantic similarity.
import numpy as np
import matplotlib.pyplot as plt
from string import punctuation
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize
from nltk import WordPunctTokenizer
en_stop = set(nltk.corpus.stopwords.words('english'))
%matplotlib inline

import re

## Scraping Articles from Wikipedia

- To scrape a Wikipedia page, the `page` method of the `wikipedia` module is used. The title of the page to scrape is passed as a parameter to the `page` method.
- The method returns a `WikipediaPage` object, whose `content` attribute holds the page contents.

In [None]:
artificial_intelligence = wikipedia.page("Artificial Intelligence").content
deep_learning = wikipedia.page("Deep Learning").content
neural_network = wikipedia.page("Neural Network").content

# Tokenize the scraped text into sentences using the sent_tokenize method.
artificial_intelligence = sent_tokenize(artificial_intelligence)
deep_learning = sent_tokenize(deep_learning)
neural_network = sent_tokenize(neural_network)

# Join the sentences from the three articles into one list via the extend method.
artificial_intelligence.extend(deep_learning)
artificial_intelligence.extend(neural_network)

# Data Preprocessing

- Clean the text data by removing punctuation and other special characters.
- Convert the data to lower case.
- Lemmatize the words to their root form.
- Remove stop words and words shorter than 4 characters from the corpus. This length threshold was chosen arbitrarily for this test, so you may allow shorter or longer words in the corpus.

- The `preprocess_text` function performs these preprocessing tasks.

In [None]:
lemmatizer = WordNetLemmatizer()

def preprocess_text(document):
    # Replace every non-word character (punctuation, symbols) with a space
    document = re.sub(r'\W', ' ', str(document))

    # Remove single characters surrounded by whitespace
    document = re.sub(r'\s+[a-zA-Z]\s+', ' ', document)

    # Remove a single character at the start of the text
    # (the caret is left unescaped so that it anchors at the start)
    document = re.sub(r'^[a-zA-Z]\s+', ' ', document)

    # Collapse runs of whitespace into a single space
    document = re.sub(r'\s+', ' ', document)

    # Remove a prefixed 'b'
    document = re.sub(r'^b\s+', '', document)

    # Convert to lowercase
    document = document.lower()

    # Lemmatize, then drop stop words and words shorter than 4 characters
    tokens = document.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    tokens = [word for word in tokens if word not in en_stop]
    tokens = [word for word in tokens if len(word) > 3]

    return ' '.join(tokens)

In [None]:
# Check that the function performs the desired task by preprocessing a dummy sentence
sent = preprocess_text("Hello! I'm here to check if my function performs the desired task by preprocessing a dummy sentence.... ")
print(sent)

# Preprocess every non-empty sentence of the combined corpus
final_corpus = [preprocess_text(sentence) for sentence in artificial_intelligence if sentence.strip() != '']

# Split each preprocessed sentence into a list of word tokens
word_punctuation_tokenizer = nltk.WordPunctTokenizer()
word_tokenized_corpus = [word_punctuation_tokenizer.tokenize(sentence) for sentence in final_corpus]


hello check function performs desired task preprocessing dummy sentence



Here, the punctuation and stop words have been removed, and the remaining words have been lemmatized. Furthermore, words shorter than 4 characters have also been removed.


## Creating Word Representations

Preprocessing of the corpus is done. The next step is to create word representations using FastText.
Steps:
- First, define the **hyper-parameters** for our **FastText model**.
- **embedding_size, window_size, min_word, down_sampling** are the important hyperparameters for the FastText model.

In [None]:
embedding_size = 60
window_size = 40
min_word = 5
down_sampling = 1e-2

- **embedding_size**: the size of the embedding vector. In other words, each word in our corpus will be represented as a 60-dimensional vector.

- **window_size**: the number of words before and after a given word that are used as its context when learning its representation.
In the skip-gram model, the input is a word and the outputs are its context words. With a window size of 40, each input has up to 80 outputs: up to 40 words that occur before the input word and up to 40 that occur after it. The word embedding for the input word is learned from these context words.
- **min_word**: the minimum frequency a word must have in the corpus for a representation to be generated; rarer words are ignored.
- **down_sampling**: the threshold that controls how aggressively high-frequency words are randomly down-sampled during training.
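
To make the window idea concrete, here is a minimal sketch of how (input word, context word) pairs can be enumerated for a skip-gram model. The function name `skipgram_pairs` and the toy sentence are made up for illustration and are not part of gensim. Real implementations differ in details; for example, the context never extends past the end of a sentence.

```python
def skipgram_pairs(tokens, window):
    """Return (center, context) pairs using up to `window` words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        # Clip the window at the sentence boundaries
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical toy sentence, already tokenized
sentence = ["deep", "learning", "uses", "neural", "network"]
pairs = skipgram_pairs(sentence, window=2)

# Show the context words paired with the input word "uses"
for center, context in pairs:
    if center == "uses":
        print(center, "->", context)
```

With a window of 40 and short sentences like the ones in this corpus, the window is usually clipped, so most words get far fewer than 80 context words.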

## Creating FastText model for Word Representations

In [None]:
%%time
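# Note: `size` and `iter` are the gensim 3.x argument names;
# gensim 4.x renamed them to `vector_size` and `epochs`.
ft_model = FastText(word_tokenized_corpus,
                      size=embedding_size,
                      window=window_size,
                      min_count=min_word,
                      sample=down_sampling,
                      sg=1,
                      iter=100)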

CPU times: user 26.8 s, sys: 266 ms, total: 27.1 s
Wall time: 25.4 s


- **sg** parameter: defines the type of model. **sg = 1** specifies the skip-gram model, while **sg = 0**, the default, specifies the continuous bag of words (CBOW) model.

### Checking the word representation 

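Use the **wv** attribute to check the word representation for the word "Sunanda". Because the corpus was lowercased during preprocessing, "Sunanda" is not in the vocabulary; FastText can still return a vector for it, built from the word's character n-grams.
- The output is a 60-dimensional vector for the word "Sunanda".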

In [None]:
print(ft_model.wv['Sunanda'])  

[-0.14300203  0.42836142 -0.19133398 -0.62088096  0.02198491  0.04482359
  0.17954879 -1.009901    0.11191189  0.17173022 -0.8434557  -0.12193018
 -0.8853135   0.5310766  -0.1306121  -0.06348509  0.2299048  -0.8656431
 -0.89303595  0.80831206 -0.11225256 -0.5862316  -0.00986825 -0.20372969
 -0.11302911  0.71904874  0.2676304  -0.9952991   0.8046882   1.6498783
 -0.52074516 -0.27212813 -0.4511383  -0.59613365 -0.48741227 -0.6358657
 -0.37141305 -0.28175837 -0.41582265  0.38244763 -0.78257376  0.31139666
 -0.29351392 -0.21120054 -0.28536057  0.36699802 -0.7435014  -0.62926894
 -0.63642716 -0.1170398   0.11857605 -0.1451082   0.0335955   0.18470442
 -0.18027362  0.6841604  -0.60830015 -0.01248473 -0.17693934 -0.07898621]
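
FastText can produce a vector for a word it has not seen because it represents words through their character n-grams. The sketch below only shows how a word can be split into n-grams with `<` and `>` boundary markers; the helper name `char_ngrams` is made up for illustration, and the 3-to-6 range is assumed to match the library's default n-gram lengths. It does not reproduce how gensim hashes the n-grams or combines their vectors.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """List the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# n-grams of lengths 3 and 4 for a word from the corpus
print(char_ngrams("deep", min_n=3, max_n=4))

# An unseen word such as "Sunanda" still yields n-grams to look up
print(len(char_ngrams("Sunanda")))
```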


- Next, find the 5 most similar words for each of 'artificial', 'intelligence', 'machine', 'network', 'recurrent', and 'deep'.
- Any number of words can be chosen. The following script prints each specified word along with its 5 most similar words.

In [None]:
query_words = ['artificial', 'intelligence', 'machine', 'network', 'recurrent', 'deep']
semantically_similar_words = {word: [item[0] for item in ft_model.wv.most_similar([word], topn=5)]
                              for word in query_words}

for k, v in semantically_similar_words.items():
    print(k + ":" + str(v))

artificial:['intelligence', 'superintelligence', 'social', 'policy', 'moral']
intelligence:['artificial', 'superintelligence', 'intelligent', 'creating', 'turing']
machine:['described', 'argument', 'intelligent', 'ethical', 'study']
network:['neural', 'specifically', 'convolutional', 'biological', 'recurrent']
recurrent:['supervised', 'current', 'unsupervised', 'drug', 'depth']
deep:['learning', 'scale', 'specifically', 'abstract', 'depth']




- The `similarity` method returns the cosine similarity between the vectors of any two words.



In [None]:
print(ft_model.wv.similarity(w1='artificial', w2='intelligence'))

0.7221809
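
For intuition, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it with the standard library only and then ranks words by that score, which is the general idea behind a most-similar lookup. The 3-dimensional vectors and the helper names are made up for illustration; they are not taken from the trained model, and gensim's own implementation is more optimized.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional vectors, made up for illustration only
toy_vectors = {
    "artificial": [1.0, 0.0, 1.0],
    "intelligence": [1.0, 0.0, 0.5],
    "drug": [0.0, 1.0, 0.0],
}

def most_similar(word, vectors, topn=2):
    """Rank every other word by its cosine similarity to `word`."""
    scores = [(other, cosine_similarity(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("artificial", toy_vectors))
```

Vectors pointing in nearly the same direction score close to 1, while orthogonal vectors score 0.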
