# SMASAC - fastText Embedding

This notebook shows how to use [fastText](https://fasttext.cc) to generate word and tweet representations in an embedding space.

This notebook is structured as follows:

1. Preprocessing the data
2. Training the fastText embedding model
3. Querying similar words with the embedding model

In [1]:
from pathlib import Path
import fastText
import sklearn
import sklearn.metrics
import numpy as np
import re

# Configuration

Folder structure of this project:

* data: data directory
    - twitter_las_vegas_shooting : Text for training, sample of 50k tweets
    - twitter_las_vegas_shooting.preprocessed : Preprocessed training text
    - twitter_las_vegas_shooting.labels : Hashtags in training corpus
    - twitter_las_vegas_shooting.embedding : Hashtag embedding vectors
    - twitter_las_vegas_shooting.low_dim_embedding : Hashtag embedding vectors in 2D
* model: model directory


We will use `twitter_las_vegas_shooting` for training. It contains 50,000 tweets crawled during the Las Vegas mass shooting.

In [2]:
root_dir = Path("..")
data_dir = root_dir / "data"
notebook_dir = root_dir / "notebooks"
model_dir = root_dir / "model" 

# Create the model directory (and any missing parents) if it does not exist yet
model_dir.mkdir(parents=True, exist_ok=True)

In [3]:
# corpus
data_path = data_dir / "twitter_las_vegas_shooting"
# Training corpus filename
input_filename = str(data_path)
# Model filename
model_filename = str(model_dir / "twitter.bin")

# Preprocessing

Preprocess each tweet so that the model learns a cleaner representation of the language. The steps are:

* Remove hashtags
* Remove mentions
* Remove punctuation
* Remove URLs
* Convert the tweet to lowercase

In [4]:
# Preprocessing Config
preprocess_config = {
    "hashtag": True,
    "mentioned": True,
    "punctuation": True,
    "url": True,
}

# Pattern
# Raw strings keep the regex escapes (\w, \(, \)) from being read as string escapes
hashtag_pattern = r"#\w+"
mentioned_pattern = r"@\w+"
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

trans_str = "!\"$%&\'()*+,-./:;<=>?[\\]^_`{|}~" + "…"
translate_table = str.maketrans(trans_str, " " * len(trans_str))

def preprocess(s):
    s = s.lower()
    if preprocess_config["hashtag"]:
        s = re.sub(hashtag_pattern, "", s)
    if preprocess_config["mentioned"]:
        s = re.sub(mentioned_pattern, "", s)
    if preprocess_config["url"]:
        s = re.sub(url_pattern, "", s)
    if preprocess_config["punctuation"]:
        s = " ".join(s.translate(translate_table).split())
    return s


**Preprocessing Example**  
Here is an example output of preprocessing. 

In [5]:
# example of preprocessing
example_tweet = "RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn"

print("Original Tweet:")
print(example_tweet)
print()
print("Preprocessed Tweet:")
print(preprocess(example_tweet))

Original Tweet:
RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn

Preprocessed Tweet:
rt remembering keri lynn galvan from thousand oaks california


**Preprocessing corpus**

In [6]:
# Preprocessing
preprocessed_data_path = data_dir / "twitter_las_vegas_shooting.preprocessed"

with data_path.open() as f:
    lines = [l.strip() for l in f.readlines()]

with preprocessed_data_path.open("w") as f:
    for l in lines:
        f.write(preprocess(l))
        f.write("\n")

# use preprocessed data as input
input_filename = str(preprocessed_data_path)

# Training fastText embedding model

Train a 100-dimensional embedding model on the preprocessed corpus.

In [7]:
# fastText Config
embedding_model = "skipgram"
lr = 0.05
dim = 100
ws = 5
epoch = 5
minCount = 5
minCountLabel = 0
minn = 3
maxn = 6
neg = 5
wordNgrams = 1
loss = "ns"
bucket = 2000000
thread = 12
lrUpdateRate = 100
t = 1e-4
verbose = 2
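
The `minn` and `maxn` settings control the character n-grams that fastText uses as subword units: each word is wrapped in the boundary markers `<` and `>`, and every n-gram with a length between `minn` and `maxn` contributes to the word vector. The sketch below only illustrates that idea; `char_ngrams` is a hypothetical helper written for this notebook, not part of the fastText API, and it ignores details such as hashing the n-grams into `bucket` slots.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word` with lengths minn..maxn.

    Illustrative helper (not a fastText function): the word is first
    wrapped in the boundary markers "<" and ">".
    """
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n over the wrapped word
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("las"))

# Words that share many n-grams end up with related vectors
shared = set(char_ngrams("shooting")) & set(char_ngrams("shootin"))
print(sorted(shared))
```

Shared n-grams are one likely reason that truncated forms of a word can appear close to the full word in the embedding space.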

In [8]:
model = fastText.train_unsupervised(
    input=input_filename,
    model=embedding_model,
    lr=lr,
    dim=dim,
    ws=ws,
    epoch=epoch,
    minCount=minCount,
    minCountLabel=minCountLabel,
    minn=minn,
    maxn=maxn,
    neg=neg,
    wordNgrams=wordNgrams,
    loss=loss,
    bucket=bucket,
    thread=thread,
    lrUpdateRate=lrUpdateRate,
    t=t,
    verbose=verbose,
)

print("Training finished.")
print("Dimension: {}".format(model.get_dimension()))
print("Number of words: {}".format(len(model.get_words())))

# Output model to disk if needed
model.save_model(model_filename)

# Load saved model if needed
model = fastText.load_model(model_filename)

Training finished.
Dimension: 100
Number of words: 6040


# Query

**Get word vectors of corpus**

In [9]:
words = np.array(model.get_words())
word_vectors = np.array([model.get_word_vector(w) for w in words])

**Similarity of word vectors**

In the embedding space, cosine similarity can be used to measure the similarity between words.
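
To make the measure concrete, here is a minimal NumPy sketch of cosine similarity on toy vectors. The helper name `cosine_sim` and the vectors are made up for illustration. The cosine distance that scikit-learn returns for `metric="cosine"` is one minus this value, so a smaller distance means more similar words.

```python
import numpy as np


def cosine_sim(u, v):
    """Cosine of the angle between two non-zero vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a

print(cosine_sim(a, b))    # same direction
print(cosine_sim(a, c))    # orthogonal
```

Because the measure depends only on direction, the lengths of the vectors do not affect the result.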

In [10]:
# Calculate N neighbors based on cosine similarity
def calc_n_cosine_neighbor(inX, X, N):
    if inX.ndim == 1:
        inX = [inX]
    distances = sklearn.metrics.pairwise.pairwise_distances(
        X, inX, metric="cosine")
    sortedDist = distances.reshape((distances.shape[0],)).argsort()
    return sortedDist[:N], distances

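# calculate nearest neighbours based on cosine similarity
def nn(query, words=words, word_vectors=word_vectors, k=10):
    """
    query: word to look up
    words: numpy array of words
    word_vectors: numpy array of the vectors for `words`, in the same order
    k: (optional, 10 by default) number of neighbours to return
    """
    # `model` is the module-level fastText model; reading it needs no `global`
    v = model.get_word_vector(query)
    idx, _ = calc_n_cosine_neighbor(v, word_vectors, k)
    return words[idx]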

## Query nearest words

In [11]:
q = "lasvegasshooting"

neighbours = nn(q, k=20)

print("Neighbours of word \"{}\":".format(q))
for word in neighbours:
    print(word)

Neighbours of word "lasvegasshooting":
shooting
lasvegas
vegas”
las
vegas
“shooting”
rt
vega
shootin
🙏🏾
shooting”
👍
💀
</s>
buzzfeednews
❤
cc
🙏🙏🙏
shooti
866


## Get sentence vector

Use the `get_sentence_vector` API to get a vector representation of a sentence.
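
As far as I understand the fastText implementation, for an unsupervised model the sentence vector is the average of the word vectors after each one is scaled to unit L2 norm. Treat that as an assumption to check against the library source. The sketch below reproduces only that averaging step on toy vectors; `average_normalized` and the `toy` dictionary are hypothetical and do not call fastText.

```python
import numpy as np


def average_normalized(vectors):
    """Average the given vectors after scaling each to unit L2 norm.

    Zero vectors are skipped, since they cannot be normalized.
    """
    total = None
    count = 0
    for v in vectors:
        v = np.asarray(v, dtype=float)
        norm = np.linalg.norm(v)
        if norm > 0:
            unit = v / norm
            total = unit if total is None else total + unit
            count += 1
    if count == 0:
        raise ValueError("no non-zero vectors to average")
    return total / count


# Toy 2-D "word vectors" (made up for illustration)
toy = {"las": [3.0, 0.0], "vegas": [0.0, 4.0]}
print(average_normalized(toy[w] for w in "las vegas".split()))
```

Since the model was trained on preprocessed text, passing `preprocess(example_tweet)` instead of the raw tweet may give a more representative vector.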

In [12]:
example_tweet = "RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn"

tweet_vector = model.get_sentence_vector(example_tweet)
print("Tweet vector in embedding space:")
print(example_tweet)
print()
print(tweet_vector)

print()
print("Words similar this tweet")
idx, _ = calc_n_cosine_neighbor(tweet_vector, word_vectors, 20)
print([words[i] for i in idx])

Tweet vector in embedding space:
RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn

[-0.00780739  0.07937411 -0.11600392 -0.02279454  0.04590917  0.13462882
 -0.03541077 -0.02716665  0.00733223  0.08974808  0.00969621  0.01571963
  0.0755126  -0.0293568  -0.00756462  0.04680231 -0.05055553 -0.10340837
  0.00997324 -0.01222676  0.07336241  0.01554108 -0.08822611 -0.05604139
 -0.00364648 -0.03380729  0.06654789  0.07940322  0.07688745  0.01282769
  0.044875    0.0508399   0.02789218 -0.07943692 -0.02775313  0.08784501
  0.00673831 -0.03849108  0.00583337 -0.02804394 -0.03844274 -0.06221488
  0.01038171  0.04532661  0.06733818 -0.12473185 -0.01861746 -0.01122745
 -0.03541558  0.05415991 -0.08699425  0.01179982 -0.10676883  0.02461812
  0.08142392 -0.00230244 -0.09551181 -0.03706734  0.01363133 -0.01571399
  0.00781586 -0.01471497  0.08395781 -0.03696184  0.05110154  0.00028789
 -0.08268341  0.0503899