# Tokenize captions and get embeddings

### Imports

In [116]:
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

In [117]:
# on_bad_lines="skip" (pandas >= 1.3) replaces the removed error_bad_lines/warn_bad_lines flags
captions = pd.read_csv("captions.csv", on_bad_lines="skip")
print("Skipping bad lines - return to this later")
print(captions.shape)
captions.sample(10)

Skipping bad lines - return to this later
(155392, 3)


Unnamed: 0,image,above_text,below_text
140541,typical 1D,1D,ROCKS
153195,Hulk SMASH,ira,smash! not shash!
132644,Conspiracy cat,BLUETOOTH,
135206,Abo,DON'T YOU CUT MY FUCKING,gUMMO SAPLINGS CUNT
31112,Look at All the Fucks I Give,Look at all the homework,I haven't done
31954,Picard facepalm,a budale,borg te jebo
78951,Santa Claus Troll Face,U want pony?,cool story bro
79204,no memory gandalf,i have no memory,"of how to ""rip a gb"" with this gooch person."
29586,mindenki nyugodjon le a picsába,Mindenki nyugodjon le a picsába,van feltámadás
150789,Chill out slut,,can*
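
When returning to the skipped lines later, it may help to know how many there are and where they sit. pandas treats a line with too many fields as a bad line, so a rough count can be sketched with the standard-library `csv` module. The toy data below is a hypothetical stand-in for `captions.csv`, and the names `raw`, `n_expected`, and `bad_lines` are illustrative only.

```python
import csv
import io

# Hypothetical stand-in for captions.csv: the last data row has one field too many.
raw = io.StringIO(
    ",image,above_text,below_text\n"
    "0,Computer kid,y you no,work\n"
    "1,Disgusted Ginger,,ew\n"
    "2,Conspiracy cat,BLUETOOTH,teeth,extra\n"
)

reader = csv.reader(raw)
header = next(reader)
n_expected = len(header)

# Record the 1-based line number of every row with more fields than the header.
bad_lines = [reader.line_num for row in reader if len(row) > n_expected]

print(f"{len(bad_lines)} bad line(s) at line numbers {bad_lines}")
```

For the real file, `raw` would be replaced by `open("captions.csv", newline="")`. Rows with too few fields are not counted here, since pandas fills those with NA rather than skipping them.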


There seems to be a non-negligible number of captions written in Spanish.

In [118]:
captions.isna().sum()

image           13
above_text    6137
below_text    7199
dtype: int64

In [120]:
captions[captions.image.isna()]

Unnamed: 0,image,above_text,below_text
18546,,several people get up and leave as they can se...,
43899,,teacher is even later than you,
57525,,Ekki málið :),
100719,,Ert þú starfsmaður þarna eða eigandi?,
100723,,uppiskorpi!!!,
100725,,Eða bara eldisfiskur. LOL.,
100728,,Takk kærlega fyrir þetta :),
105241,,makes us strong,
105243,,makes us strong,
114690,,Nei þá nærðu í rauðvín,


NA values in the `image` column (the meme format label) often appear when the text is in a different language. I think it is safe to say that we can drop these rows. For `above_text` and `below_text`, an NA indicates that the meme did not contain text above or below the picture. We can't throw these rows out, so just replace the NAs with an empty string.

In [121]:
captions = captions[pd.notnull(captions.image)]

In [122]:
captions = captions.fillna('')

In [123]:
captions.isna().sum()

image         0
above_text    0
below_text    0
dtype: int64

### Set up vocabulary dictionary

In Dank Learning, it looks like they create a vocabulary dictionary from all words in the captions and in the labels (the meme format names). See [here](https://github.com/alpv95/MemeProject/blob/master/im2txt/MemeNote.ipynb) for their exact process.

In [124]:
all_phrases = np.append(captions.image, [captions.above_text, captions.below_text])

In [125]:
# randint's upper bound is exclusive, so len(all_phrases) already covers every valid index
rand_inds = np.random.randint(len(all_phrases), size=10)
for phrase in all_phrases[rand_inds]:
    print(phrase)

or third grade poetry
tHAT YOU DO NOT HAVE TO POST EVERY PICTURE OF YOUR DOG ON FACEBOOK
will you pretty please
Computer kid
y you no
mickey mouse
flniuydl

is DA TRUE OFFENSIVE TACTIC!
Disgusted Ginger


### Tokenize

In [126]:
tokenizer = RegexpTokenizer(r'[\w\']+')
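
To see what this pattern keeps and drops, here is a small sketch using the standard-library `re` module with the same regular expression. For this pattern it should behave like the `RegexpTokenizer` above, which collects every non-overlapping match. The helper name `tokenize` is made up for illustration.

```python
import re

# Same pattern as the tokenizer above: runs of word characters and apostrophes.
TOKEN_PATTERN = re.compile(r"[\w']+")

def tokenize(phrase):
    """Return the tokens in a phrase, dropping punctuation other than apostrophes."""
    return TOKEN_PATTERN.findall(phrase)

print(tokenize("smash! not shash!"))  # ['smash', 'not', 'shash']
print(tokenize("DON'T YOU CUT MY"))   # ["DON'T", 'YOU', 'CUT', 'MY']
print(tokenize(""))                   # []
```

Note that case is preserved, so `DON'T` and `don't` would count as different tokens unless the phrases are lowercased first, and empty captions simply yield no tokens.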

In [136]:
all_words = []
for phrase in all_phrases:
    for word in tokenizer.tokenize(phrase):
        all_words.append(word)

In [144]:
len(all_words)

1845822
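
With the tokens collected, one way to turn them into a vocabulary dictionary is sketched below. This is not necessarily the Dank Learning procedure: the lowercasing, the reserved `<unk>` id, the `min_count` parameter, and the function name `build_vocab` are all assumptions made for illustration.

```python
from collections import Counter

def build_vocab(words, min_count=1):
    """Map each word seen at least min_count times to an integer id.

    Ids follow decreasing frequency; ties keep first-seen order.
    Id 0 is reserved for a hypothetical "<unk>" out-of-vocabulary token.
    """
    counts = Counter(w.lower() for w in words)  # assumption: fold case
    vocab = {"<unk>": 0}
    for word, count in counts.most_common():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

# Toy token list standing in for all_words.
toy_words = ["y", "you", "no", "Y", "U", "NO", "you"]
vocab = build_vocab(toy_words)
print(vocab)  # {'<unk>': 0, 'y': 1, 'you': 2, 'no': 3, 'u': 4}
```

On the full data this would be called as `build_vocab(all_words)`. Raising `min_count` trims rare tokens such as typos and one-off strings, which may matter given how noisy the captions are.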