# SMASAC - Hashtag Visualization

This notebook shows how to use the [fastText](https://fasttext.cc) text embedding and [t-SNE](https://lvdmaaten.github.io/tsne) dimensionality reduction to explore and visualize the space of hashtags in a meaningful way.

This notebook is structured as follows:

1. Preprocessing the data
2. Training the fastText embedding model
3. Dimensionality reduction using t-SNE
4. Exploring and visualizing the space of hashtags

In [1]:
from pathlib import Path
import fastText
import sklearn
import sklearn.metrics
import numpy as np
import re
import mpld3
import matplotlib.pyplot as plt

# Configuration

Folder structure of this project:

* data: data directory
    - twitter_las_vegas_shooting : Text for training, sample of 50k tweets
    - twitter_las_vegas_shooting.preprocessed : Preprocessed training text
    - twitter_las_vegas_shooting.labels : Hashtags in training corpus
    - twitter_las_vegas_shooting.embedding : Hashtags embedding vectors
    - twitter_las_vegas_shooting.low_dim_embedding : Hashtags embedding vectors in 2D
* model: model directory


We will train on `twitter_las_vegas_shooting`, which contains 50,000 tweets crawled during the Las Vegas mass shooting.


In [2]:
root_dir = Path("..")
data_dir = root_dir / "data"
model_dir = root_dir / "model" 

# Create the model directory if it does not exist
if not model_dir.exists():
    model_dir.mkdir()

In [3]:
# Corpus
data_path = data_dir / "twitter_las_vegas_shooting"
# Training corpus filename
input_filename = str(data_path)
# Model filename
model_filename = str(model_dir / "twitter_hashtag.bin")

# Training Embedding Model

## Preprocessing corpus

To visualize hashtags, preprocessing must keep them in the text so that the model can learn a good representation for each hashtag.

During preprocessing, we will:
* Remove mentions (`@user`)
* Remove punctuation
* Remove URLs
* Convert the tweet to lowercase

In [4]:
# Keep hashtags in preprocessing

# Preprocessing Config
preprocess_config = {
    "hashtag": False, # we don't want to remove hashtags
    "mentioned": True,
    "punctuation": True,
    "url": True,
}

# Patterns (raw strings so that regex escapes such as \w are not treated as string escapes)
hashtag_pattern = r"#\w+"
mentioned_pattern = r"@\w+"
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

trans_str = "!\"$%&\'()*+,-./:;<=>?[\\]^_`{|}~" + "…"
translate_table = str.maketrans(trans_str, " " * len(trans_str))

def preprocess(s):
    s = s.lower()
    if preprocess_config["hashtag"]:
        s = re.sub(hashtag_pattern, "", s)
    if preprocess_config["mentioned"]:
        s = re.sub(mentioned_pattern, "", s)
    if preprocess_config["url"]:
        s = re.sub(url_pattern, "", s)
    if preprocess_config["punctuation"]:
        s = " ".join(s.translate(translate_table).split())
    return s


**Preprocessing Example**  
Here is an example output of preprocessing. 

In [5]:
# example of preprocessing
example_tweet = "RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn"

print("Original Tweet:")
print(example_tweet)
print()
print("Preprocessed Tweet:")
print(preprocess(example_tweet))

Original Tweet:
RT @TheLeadCNN: Remembering Keri Lynn Galvan, from Thousand Oaks, California. #LasVegasLost https://t.co/QuvXa6WvlE https://t.co/hDF2d3Owgn

Preprocessed Tweet:
rt remembering keri lynn galvan from thousand oaks california #lasvegaslost
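
As a quick check that hashtags survive preprocessing, the sketch below collects and counts them with the same `#\w+` pattern used above. The helper name `extract_hashtags` and the second toy tweet are made up for illustration; only the first tweet comes from the example output.

```python
import re
from collections import Counter

# Same hashtag pattern as in the preprocessing cell
HASHTAG_PATTERN = r"#\w+"

def extract_hashtags(text):
    """Return all hashtags found in an already preprocessed tweet (hypothetical helper)."""
    return re.findall(HASHTAG_PATTERN, text)

toy_tweets = [
    # preprocessed example tweet from above
    "rt remembering keri lynn galvan from thousand oaks california #lasvegaslost",
    # made-up tweet, for illustration only
    "thinking of everyone affected #lasvegaslost #toytag",
]

# Count how often each hashtag occurs across the toy tweets
hashtag_counts = Counter(
    tag for tweet in toy_tweets for tag in extract_hashtags(tweet)
)
print(hashtag_counts.most_common())
```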


**Preprocessing corpus**

In [6]:
# Preprocessing
preprocessed_data_path = data_dir / "twitter_las_vegas_shooting.preprocessed"

with data_path.open() as f:
    lines = [l.strip() for l in f.readlines()]

with preprocessed_data_path.open("w") as f:
    for l in lines:
        f.write(preprocess(l))
        f.write("\n")

# use preprocessed data as input
input_filename = str(preprocessed_data_path)

## Training fastText embedding model

Use the preprocessed corpus to train a skip-gram embedding model with 100-dimensional vectors.

In [7]:
# fastText Configuration
embedding_model = "skipgram"
lr = 0.05
dim = 100
ws = 5
epoch = 5
minCount = 5
minCountLabel = 0
minn = 3
maxn = 6
neg = 5
wordNgrams = 1
loss = "ns"
bucket = 2000000
thread = 12
lrUpdateRate = 100
t = 1e-4
verbose = 2

In [8]:
model = fastText.train_unsupervised(
    input = input_filename,
    model=embedding_model,
    lr=lr,
    dim=dim,
    ws=ws,
    epoch=epoch,
    minCount=minCount,
    minCountLabel=minCountLabel,
    minn=minn,
    maxn=maxn,
    neg=neg,
    wordNgrams=wordNgrams,
    loss=loss,
    bucket=bucket,
    thread=thread,
    lrUpdateRate=lrUpdateRate,
    t=t,
    verbose=verbose,
)

print("Training finished.")
print("Dimension: {}".format(model.get_dimension()))
print("Number of words: {}".format(len(model.get_words())))

# Output model to disk if needed
model.save_model(model_filename)

# Load saved model if needed
model = fastText.load_model(model_filename)

Training finished.
Dimension: 100
Number of words: 6366
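
The `minn` and `maxn` settings in the configuration control the character n-grams that fastText uses as subword units: each word is wrapped in `<` and `>` and split into n-grams of length `minn` to `maxn`. This is also why fastText can return a vector for a word that is not in the vocabulary. The sketch below only illustrates the idea in plain Python; `char_ngrams` is a hypothetical helper, and the real library works on bytes and hashes the n-grams into `bucket` slots, which is not reproduced here.

```python
def char_ngrams(word, minn=3, maxn=6):
    """List the character n-grams of a word, fastText style (illustrative only)."""
    wrapped = "<" + word + ">"  # boundary symbols mark word start and end
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# Small n-gram range to keep the output short
print(char_ngrams("vegas", minn=3, maxn=4))
```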


# Dimensionality reduction using t-SNE

To plot the high-dimensional word vectors, we need to reduce them to two dimensions. t-SNE is well suited to this because it tends to preserve the local neighborhood structure of the data in the 2D projection.

In [9]:
# t-SNE Configuration
N_COMPONENTS = 2  # should be 2 for 2D plot
n_components = N_COMPONENTS
perplexity = 30.0
n_iter = 5000

def sklearn_tsne(embedding):
    from sklearn.manifold import TSNE
    tsne = TSNE(perplexity=perplexity, n_components=n_components,
                n_iter=n_iter, metric="cosine")
    low_dim_embedding = tsne.fit_transform(embedding)
    return low_dim_embedding

Dimensionality reduction for hashtags:

1. Get all words from the model
2. Keep only the hashtags
3. Get the high-dimensional embedding vectors of the hashtags
4. Reduce the dimension using t-SNE


**Note:** t-SNE is time-consuming, so we load precomputed results instead.
Uncomment the code below to run the dimensionality reduction yourself.

In [10]:
# words = np.array(model.get_words())
# labels = list(filter(lambda w: w.startswith("#"), words))
# embedding = np.array([model.get_word_vector(w) for w in labels])
# low_dim_embedding = sklearn_tsne(embedding)
# label_vector = {}
# for i, label in enumerate(labels):
#     label_vector[label] = (low_dim_embedding[i, :], embedding[i, :])

Here we use the precomputed results in the data folder.

* `twitter_las_vegas_shooting.labels` : hashtags
* `twitter_las_vegas_shooting.embedding` : high-dimensional embedding vectors of hashtags
* `twitter_las_vegas_shooting.low_dim_embedding` : low-dimensional (2D) embedding vectors of hashtags
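
If you run the t-SNE step yourself, files in a compatible format could be written as sketched below. This is an assumption about the layout (one label per line, whitespace-separated floats readable by `np.loadtxt`), not necessarily how the provided files were produced. The toy labels, vectors, and `toy.*` filenames are made up.

```python
import tempfile
from pathlib import Path

import numpy as np

# Toy stand-ins for labels, embedding, and low_dim_embedding
toy_labels = ["#tag_one", "#tag_two"]
toy_embedding = np.array([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
toy_low_dim = np.array([[1.0, -2.0], [3.5, 0.25]])

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)

    # One label per line; one vector per line, in the same order as the labels
    (out_dir / "toy.labels").write_text("\n".join(toy_labels) + "\n")
    np.savetxt(str(out_dir / "toy.embedding"), toy_embedding)
    np.savetxt(str(out_dir / "toy.low_dim_embedding"), toy_low_dim)

    # Read everything back the same way the notebook loads its data
    reloaded_labels = (out_dir / "toy.labels").read_text().split()
    reloaded_embedding = np.loadtxt(str(out_dir / "toy.embedding"))
    reloaded_low_dim = np.loadtxt(str(out_dir / "toy.low_dim_embedding"))

print(reloaded_labels, reloaded_embedding.shape, reloaded_low_dim.shape)
```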


In [11]:
# t-SNE is time-consuming, so load the precomputed results instead
def load_text(filename):
    with open(filename) as f:
        lines = f.readlines()
    return [l.strip() for l in lines]

labels = load_text(str(data_dir / "twitter_las_vegas_shooting.labels"))
embedding = np.loadtxt(data_dir / "twitter_las_vegas_shooting.embedding")
low_dim_embedding = np.loadtxt(data_dir / "twitter_las_vegas_shooting.low_dim_embedding")

label_vector = {}
for i, label in enumerate(labels):
    label_vector[label] = (low_dim_embedding[i, :], embedding[i, :])

# Interactive Plot

Plot the hashtags and their low-dimensional representation on an interactive figure.

In [12]:
def calc_n_cosine_neighbor(inX, X, N):
    if inX.ndim == 1:
        inX = [inX]
    distances = sklearn.metrics.pairwise.pairwise_distances(
        X, inX, metric="cosine")
    sortedDist = distances.reshape((distances.shape[0],)).argsort()
    return sortedDist[:N], distances

def plot_interactive_scatter(low_dim_embedding, labels, inx, q, info):
    from matplotlib.patches import Circle

    fig, ax = plt.subplots()
    fig.set_size_inches(10, 10)
    plt.title(info)

    low_dim_embedding = np.concatenate([low_dim_embedding, inx])
    labels.append(q)

    # mark query
    c_x, c_y = inx[0]
    circle = Circle((c_x, c_y), 10, facecolor='none',
                    edgecolor='red', linewidth=3, alpha=0.5)
    ax.add_patch(circle)

    scatter = ax.scatter(
        low_dim_embedding[:, 0],
        low_dim_embedding[:, 1],
    )

    tooltip = mpld3.plugins.PointLabelTooltip(scatter, labels=labels)
    for i, label in enumerate(labels):
        x, y = low_dim_embedding[i, :]
        ax.text(x, y, label, alpha=0.4)
    mpld3.plugins.connect(fig, tooltip)
    return fig

def process_nonexistent_word(w):
    return model.get_word_vector(w)

def process_query(q):
    q = q.strip()

    def is_hashtag(x): return x.startswith("#")

    if is_hashtag(q) and q in label_vector:
        v = label_vector[q]
    elif (not is_hashtag(q)) and ("#" + q) in label_vector:
        q = "#" + q
        v = label_vector[q]
    else:
        v = (None, process_nonexistent_word(q))
    return q, v

def query(q):
    LOW_DIM_EMBEDDING = 0
    EMBEDDING = 1    
    N_NEIGHBOR = 400

    q, vs = process_query(q)

    inx_embedding = vs[EMBEDDING]
    inx_low_dim_embedding = vs[LOW_DIM_EMBEDDING]
    
    idx, _ = calc_n_cosine_neighbor(inx_embedding[np.newaxis, :], embedding, N_NEIGHBOR)
    
    plot_labels = [labels[i] for i in idx]

    
    # For a nonexistent word, use its closest neighbor (idx is sorted by
    # ascending distance) to approximate its position in the 2D plot
    info = q
    if inx_low_dim_embedding is None:
        inx_low_dim_embedding = low_dim_embedding[idx[0], :]
        info = "Nonexistent word: " + q
    
    return plot_interactive_scatter(low_dim_embedding[idx, :], plot_labels, inx_low_dim_embedding[np.newaxis, :], q, info)


Now we can generate an interactive plot for a query.

Change the variable `q` to any word or hashtag you want to query, and explore the hashtag space.

In [13]:
q = "lasvegas"
mpld3.display(query(q))
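
The `query` function selects the `N_NEIGHBOR` hashtags closest to the query by cosine distance before plotting them. A minimal, dependency-free sketch of that ranking on toy 2D vectors follows. The helpers `cosine_distance` and `nearest_neighbors` and the toy data are hypothetical; the notebook itself relies on `sklearn.metrics.pairwise.pairwise_distances`.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def nearest_neighbors(query_vector, vectors, n):
    """Return the indices of the n vectors closest to query_vector, nearest first."""
    order = sorted(
        range(len(vectors)),
        key=lambda i: cosine_distance(query_vector, vectors[i]),
    )
    return order[:n]

# Made-up labels and vectors, for illustration only
toy_labels = ["#a", "#b", "#c"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

idx = nearest_neighbors([1.0, 0.02], toy_vectors, 2)
print([toy_labels[i] for i in idx])
```

Cosine distance compares only the direction of the vectors, not their length, which is why it is a common choice for word embeddings.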

For more details:
[Twitter Visualization](https://github.com/guyao/twitter-visualization)

[Unsupervised Hashtag Retrieval and Visualization for Crisis Informatics](https://arxiv.org/abs/1801.05906)
