# Tweets Tokenization

The goal of this assignment is to write a tweet tokenizer. The input is a set of tweet texts, and the output is the list of tokens in each tweet. The assignment consists of four tasks.

The [data](https://drive.google.com/file/d/15x_wPAflvYQ2Xh38iNQGrqUIWLj5l5Nw/view?usp=share_link) contains 5 files, each with 44 tweets, one tweet per line. Use only one file for the manual tokenization.

Grading:
- 30 points - Tokenize tweets by hand
- 30 points - Implement 4 tokenizers
- 20 points - Stemming and Lemmatization
- 20 points - Explain sentencepiece (for masters only)


Remarks: 
- Use Python 3 or greater
- Max is 80 points for bachelors, 100 points for masters

## Tokenize tweets by hand

As a first task you need to tokenize 15 tweets by hand. This will allow you to understand the problem from a linguistic point of view. The guidelines for tweet tokenization are as follows:

- Each smiley is a separate token
- Each hashtag is an individual token. Each user reference is an individual token
- If the letters of a word are separated by spaces, the word is converted to a single token
- If a sentence ends with a word that legitimately has a full stop (abbreviations, for example), add a final full stop
- Each punctuation mark is an individual token. This includes double quotes and single quotes
- A URL is a single token

Example of output

    Input tweet
    @xfranman Old age has made N A T O!

    Tokenized tweet (separated by comma)
    @xfranman , Old , age , has , made , NATO , !
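
The spaced-word rule in the guidelines can be handled as a preprocessing step before any tokenizer runs. The following is a minimal sketch, assuming that a spaced-out word is a run of three or more single capital letters separated by single spaces; `SPACED_WORD` and `collapse_spaced_words` are hypothetical names and are not part of the assignment.

```python
import re

# Assumption: three or more single capital letters separated by single
# spaces (e.g. "N A T O") form one spelled-out word.
SPACED_WORD = re.compile(r"\b(?:[A-Z] ){2,}[A-Z]\b")

def collapse_spaced_words(text: str) -> str:
    """Join runs of spaced-out capital letters into a single word."""
    return SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)

print(collapse_spaced_words("@xfranman Old age has made N A T O!"))
```

Lowercase cases such as "helpfu l" are not covered by this pattern and would need a dictionary or manual judgment.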

## Unzipping the zipped file

In [4]:
!unzip Assignment1_data.zip -d '/content/tweets'

Archive:  Assignment1_data.zip
  inflating: /content/tweets/file5   
  inflating: /content/tweets/__MACOSX/._file5  
  inflating: /content/tweets/file4   
  inflating: /content/tweets/__MACOSX/._file4  
  inflating: /content/tweets/file3   
  inflating: /content/tweets/__MACOSX/._file3  
  inflating: /content/tweets/file2   
  inflating: /content/tweets/__MACOSX/._file2  
  inflating: /content/tweets/file1   
  inflating: /content/tweets/__MACOSX/._file1  


In [26]:
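# Function to copy one tweet file into /content/merged.
# Note: mode 'w' truncates the target on every call, so after the loop below
# only the last file remains; opening with 'a' would append instead.
def open_and_merge(filename):

  with open('/content/tweets/' + filename) as f:
      lines = f.readlines()

  with open('/content/merged', 'w') as merged:
      merged.writelines(lines)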

In [27]:
for i in range(5):
  open_and_merge('file' + str(i+1))

In [30]:
# For the 1st task using only 1st file
with open('/content/tweets/file1') as f:
      lines = f.readlines()
lines[:15]

['@anitapuspasari waduh..\n',
 '" Could journos please stop putting the word ""gate"" after everything they write... gate."\n',
 "20% More Ridiculous Sale @20x200 ends tonight! - get 20% off by entering 'RIDONK' at checkout. More info: http://bit.ly/ridonktues\n",
 "@Studio85 I have a pair of those shoes. They are comfy. Like being barefoot. Okay for running, but not on concrete, as I've discovered.\n",
 'RT @twilightus Team Carlisle is a Trending Topic- help him out RT Follow @peterfacinelli see a grown man n a bikini dance Hollywood Blvd\n',
 '@karenrubin you might have to reinstall - that happened to me a few months ago, now I use Nambu on my Mac\n',
 'Just Posted: Redneck Dragon - Part XXVIII (http://cli.gs/gWy0yT)\n',
 '" ""Paul McCartney ... went through all his education there and nobody thought he had any musical talent,"" http://tinyurl.com/nkdbdq"\n',
 '@ambienteer Yeah, pretty much how i feel about it.\n',
 '@florianseroussi Nothing really noticeable? Are you kidding?\n',
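 '@toiletooth Hours?\n',
 '" Obama,Hamas,and the Mullahs being ""helpfu l""http://www.jpost.com/servlet/Satellite?cid=1245184848467&pagename=JPost%2FJPArticle%2FPrinter"\n',
 'RT @BBHLabs 81% of twitter users are UNDER 30 + more v. interesting statistics here: http://www.sysomos.com/insidetwitter/\n',
 "@Birdingperu Great looking hummer! RTThe world's most spectacular hummingbird Marvelous Spatuletail on a feeder. http://bit.ly/aGHYZ\n",
 'attn. chas. whitman: RT @villagevoice Jonas Brothers at Rockefeller Center for the Today Show tomorrow morn—EEEEEEEEEE!\n']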


    1. Input tweet
    @anitapuspasari waduh..\n

    1. Tokenized tweet
    @anitapuspasari, waduh, .., \n

    2. Input tweet
    " Could journos please stop putting the word ""gate"" after everything they write... gate."\n
    
    2. Tokenized tweet
    ", Could, journos, please, stop, putting, the, word, "", gate, "", after, everything, they, write, ..., gate, ., ", \n

    3. Input tweet
    20% More Ridiculous Sale @20x200 ends tonight! - get 20% off by entering 'RIDONK' at checkout. More info: http://bit.ly/ridonktues\n

    3. Tokenized tweet
    20%, More, Ridiculous, Sale, @20x200, ends, tonight, !, -, get, 20%, off, by, entering, ', RIDONK, ', at, checkout, ., More, info, :, http://bit.ly/ridonktues, \n

    4. Input tweet
    @Studio85 I have a pair of those shoes. They are comfy. Like being barefoot. Okay for running, but not on concrete, as I've discovered.\n

    4. Tokenized tweet
    @Studio85, I, have, a, pair, of, those, shoes, ., They, are, comfy, ., Like, being, barefoot, ., Okay, for, running, , , but, not, on, concrete, , , as, I, ', ve, discovered, ., \n

    5. Input tweet
    RT @twilightus Team Carlisle is a Trending Topic- help him out RT Follow @peterfacinelli see a grown man n a bikini dance Hollywood Blvd\n

    5. Tokenized tweet
    RT, @twilightus, Team, Carlisle, is, a, Trending, Topic, -, help, him, out, RT, Follow, @peterfacinelli, see, a, grown, man, n, a, bikini, dance, Hollywood, Blvd, \n

    6. Input tweet
    @karenrubin you might have to reinstall - that happened to me a few months ago, now I use Nambu on my Mac\n

    6. Tokenized tweet
    @karenrubin, you, might, have, to, reinstall, -, that, happened, to, me, a, few, months, ago, , , now, I, use, Nambu, on, my, Mac, \n

    7. Input tweet
    Just Posted: Redneck Dragon - Part XXVIII (http://cli.gs/gWy0yT)\n

    7. Tokenized tweet
    Just, Posted, :, Redneck, Dragon, -, Part, XXVIII, (, http://cli.gs/gWy0yT, ), \n

    8. Input tweet
    " ""Paul McCartney ... went through all his education there and nobody thought he had any musical talent,"" http://tinyurl.com/nkdbdq"\n

    8. Tokenized tweet
    ", "", Paul, McCartney, ..., went, through, all, his, education, there, and, nobody, thought, he, had, any, musical, talent, , , "", http://tinyurl.com/nkdbdq, ", \n

    9. Input tweet
    @ambienteer Yeah, pretty much how i feel about it.\n

    9. Tokenized tweet
    @ambienteer, Yeah, , , pretty, much, how, i, feel, about, it, ., \n

    10. Input tweet
    @florianseroussi Nothing really noticeable? Are you kidding?\n

    10. Tokenized tweet
    @florianseroussi, Nothing, really, noticeable, ?, Are, you, kidding, ?, \n

    11. Input tweet
    @toiletooth Hours?\n

    11. Tokenized tweet
    @toiletooth, Hours, ?, \n

    12. Input tweet
    " Obama,Hamas,and the Mullahs being ""helpfu l""http://www.jpost.com/servlet/Satellite?cid=1245184848467&pagename=JPost%2FJPArticle%2FPrinter"\n

    12. Tokenized tweet
    ", Obama, , , Hamas, , , and, the, Mullahs, being, "", helpful, "", http://www.jpost.com/servlet/Satellite?cid=1245184848467&pagename=JPost%2FJPArticle%2FPrinter, ", \n

    13. Input tweet
    RT @BBHLabs 81% of twitter users are UNDER 30 + more v. interesting statistics here: http://www.sysomos.com/insidetwitter/\n

    13. Tokenized tweet
    RT, @BBHLabs, 81%, of, twitter, users, are, UNDER, 30, +, more, v, ., interesting, statistics, here, :, http://www.sysomos.com/insidetwitter/, \n

    14. Input tweet
    @Birdingperu Great looking hummer! RTThe world's most spectacular hummingbird Marvelous Spatuletail on a feeder. http://bit.ly/aGHYZ\n

    14. Tokenized tweet
    @Birdingperu, Great, looking, hummer, !, RTThe, world, ', s, most, spectacular, hummingbird, Marvelous, Spatuletail, on, a, feeder, ., http://bit.ly/aGHYZ, \n

    15. Input tweet
    attn. chas. whitman: RT @villagevoice Jonas Brothers at Rockefeller Center for the Today Show tomorrow morn—EEEEEEEEEE!\n

    15. Tokenized tweet
    attn, ., chas, ., whitman, :, RT, @villagevoice, Jonas, Brothers, at, Rockefeller, Center, for, the, Today, Show, tomorrow, morn, —, EEEEEEEEEE, !, \n


## Implement 4 tokenizers

Your task is to implement 4 different tokenizers that take a list of tweets on a topic and output the tokenization of each tweet:

- White Space Tokenization
- Sentencepiece
- Tokenizing text using regular expressions
- NLTK TweetTokenizer

For tokenizing text using regular expressions, use the rules from task 1: combine them into a regular expression and create a tokenizer.

## Loading merged tweets

In [54]:
# Saving all the tweets in tweets
with open('/content/merged') as f:
      tweets = f.readlines()

## 1. Whitespace tokenizer

In [55]:
# nltk whitespace tokenizer
from nltk.tokenize import WhitespaceTokenizer

# Function for whitespace tokenization
def white_space_tokenizer(text: str) -> list:
    tk = WhitespaceTokenizer()
    return tk.tokenize(text)

### The following example uses one file; all the files can be used as well

In [56]:
tokens = []
for tweet in tweets:
   tokens.append(white_space_tokenizer(tweet))
tokens[1]

['Being',
 'A',
 'Work',
 'At',
 'Home',
 'Mom',
 '(WAHM)',
 'Is',
 'A',
 '24/7',
 'Job',
 '»',
 'Messing',
 'With',
 'My',
 'Mind',
 'http://bit.ly/17rLra']

## 2. SentencePiece tokenizer

In [18]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece
Successfully installed sentencepiece-0.1.97


In [39]:
import sentencepiece as spm

# Train a SentencePiece model on the tweets
spm.SentencePieceTrainer.Train(input='/content/merged',
                               model_prefix='sp_tweets',
                               vocab_size=520,
                               pad_id=0,                
                               unk_id=1,
                               bos_id=2,
                               eos_id=3
                               )

In [52]:
# Load the trained model once instead of on every call
sp = spm.SentencePieceProcessor(model_file='/content/sp_tweets.model')

# Function for the SentencePiece tokenizer
def sentencepiece_wrapper(text: str) -> list:
    # Encode the text to piece ids, then map each id back to its piece string
    encoded_input = sp.Encode(text)
    return [sp.IdToPiece(piece_id) for piece_id in encoded_input]

In [53]:
sp_tokens = []
for tweet in tweets:
   sp_tokens.append(sentencepiece_wrapper(tweet))
sp_tokens[1]

['▁Be',
 'ing',
 '▁A',
 '▁W',
 'or',
 'k',
 '▁A',
 't',
 '▁H',
 'ome',
 '▁Mo',
 'm',
 '▁(',
 'W',
 'A',
 'H',
 'M',
 ')',
 '▁Is',
 '▁A',
 '▁2',
 '4',
 '/',
 '7',
 '▁J',
 'o',
 'b',
 '▁',
 '»',
 '▁M',
 'ess',
 'ing',
 '▁W',
 'i',
 'th',
 '▁M',
 'y',
 '▁M',
 'in',
 'd',
 '▁h',
 't',
 'tp',
 '://',
 'bi',
 't',
 '.',
 'ly',
 '/',
 '17',
 'r',
 'L',
 'ra']
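
In the output above, the `▁` character (U+2581) stands in for a space, which means the original text can be rebuilt from the pieces alone. A minimal sketch of that reconstruction in plain Python follows; `pieces_to_text` is a hypothetical helper written for illustration, and the library provides its own decoding that should be preferred in practice.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" marker used in place of a space

def pieces_to_text(pieces):
    """Rebuild the text from a list of pieces that use the ▁ marker."""
    # Concatenate the pieces, turn markers back into spaces, and drop the
    # space that comes from the marker on the very first piece.
    return "".join(pieces).replace(WORD_BOUNDARY, " ").lstrip(" ")

pieces = ["▁Be", "ing", "▁A", "▁W", "or", "k", "▁A", "t", "▁H", "ome"]
print(pieces_to_text(pieces))
```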

## 3. Regex tokenization

In [81]:
import re

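# Function for tokenization using regex.
# The negated class [^\S...] matches whitespace only, so this pattern splits on
# runs of whitespace and leaves punctuation attached to the words.
def re_tokenizer(text: str) -> list:
    token = re.split(r'[^\S\"\'\-\:\,\;\.\!\?]+', text)
    return token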

In [82]:
re_tokens = []
for tweet in tweets:
   re_tokens.append(re_tokenizer(tweet))
re_tokens[1]

['Being',
 'A',
 'Work',
 'At',
 'Home',
 'Mom',
 '(WAHM)',
 'Is',
 'A',
 '24/7',
 'Job',
 '»',
 'Messing',
 'With',
 'My',
 'Mind',
 'http://bit.ly/17rLra',
 '']
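
The result above is the same as whitespace splitting, with an extra empty string produced by the trailing newline. A tokenizer closer to the task 1 rules can be sketched with `re.findall` and a pattern whose alternatives go from most to least specific. This is only one possible reading of the rules: `TOKEN_PATTERN` and `regex_tokenize` are hypothetical names, the smiley pattern covers just a few common forms, only `http`/`https` URLs are recognized, and spaced-out words such as "N A T O" are not handled here.

```python
import re

# Alternatives are tried left to right, so specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
      https?://[^\s"]*[^\s".,;:!?)]   # URL, without trailing punctuation
    | [@#]\w+                         # user reference or hashtag
    | [:;=]-?[)(DPp/\\|](?!\w)        # a few common smileys such as :) ;-) :D
    | \.{2,}                          # run of two or more dots
    | \w+                             # word or number
    | [^\w\s]                         # any other single punctuation mark
    """,
    re.VERBOSE,
)

def regex_tokenize(text: str) -> list:
    """Return the tokens of `text`; whitespace is dropped."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("Just Posted: Redneck Dragon - Part XXVIII (http://cli.gs/gWy0yT)"))
```

Unlike the manual tokenization, this sketch drops the newline and splits `20%` into `20` and `%`, so its output will not match the hand-made answers token for token.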

## 4. NLTK tweet tokenizer

In [83]:
import nltk
from nltk.tokenize import TweetTokenizer

# Function for nltk tweet tokenization
def nltk_tweet_tokenizer(text: str) -> list:
    tk = TweetTokenizer()
    return tk.tokenize(text)

In [85]:
nltk_tokens = []
for tweet in tweets:
   nltk_tokens.append(nltk_tweet_tokenizer(tweet))
nltk_tokens[1]

['Being',
 'A',
 'Work',
 'At',
 'Home',
 'Mom',
 '(',
 'WAHM',
 ')',
 'Is',
 'A',
 '24/7',
 'Job',
 '»',
 'Messing',
 'With',
 'My',
 'Mind',
 'http://bit.ly/17rLra']

Run your implementations on the data. Compare the results, decide which one is better. List the advantages of the best tokenizer.



*   Which tokenizer is best depends on the use case and domain. For supervised classification of tweets (sentiment, disaster detection, and similar tasks), the NLTK, regex, and whitespace tokenizers would all work.
*   SentencePiece is the most broadly useful because it splits words into subword pieces, which helps with rare or unseen words and with cross-lingual or multilingual applications.
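
The comparison can also be backed by a simple number, for example the average count of tokens each tokenizer produces per tweet. The sketch below assumes a hypothetical helper `compare_tokenizers` and uses two stand-in tokenizers so that it runs without extra packages; the notebook's own tokenizer functions could be passed in the same way.

```python
import re
from statistics import mean

def compare_tokenizers(tweets, tokenizers):
    """Return the average number of tokens per tweet for each tokenizer."""
    return {
        name: mean(len(tokenize(tweet)) for tweet in tweets)
        for name, tokenize in tokenizers.items()
    }

# Stand-ins for illustration only, not the tokenizers defined in the notebook.
sample_tweets = ["Mom (WAHM) Is A 24/7 Job", "@toiletooth Hours?"]
stand_ins = {
    "whitespace": str.split,
    "word_or_punct": re.compile(r"\w+|[^\w\s]").findall,
}
print(compare_tokenizers(sample_tweets, stand_ins))
```

A higher average means finer-grained tokens, which is neither good nor bad by itself; it only makes the difference between the tokenizers visible.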



## Stemming and Lemmatization

Your task is to write two functions: stem and lemmatize. Input is a text, so you need to tokenize it first.

## The NLTK TweetTokenizer is used to tokenize the text for this task

In [99]:
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import TweetTokenizer

snow_stemmer = SnowballStemmer(language='english')

def stem(text: str) -> list:
    # The input is raw text, so tokenize it first, then stem each token
    tokens = TweetTokenizer().tokenize(text)
    return [snow_stemmer.stem(token) for token in tokens]

## Testing on the tweets

In [100]:
stem_words = []
for tweet in tweets:
  stem_words.append(stem(tweet))
stem_words[1]

['be',
 'a',
 'work',
 'at',
 'home',
 'mom',
 '(',
 'wahm',
 ')',
 'is',
 'a',
 '24/7',
 'job',
 '»',
 'mess',
 'with',
 'my',
 'mind',
 'http://bit.ly/17rlra']

## Lemmatization

In [90]:
!python -m spacy download en_core_web_md 

2023-02-14 13:36:47.136014: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-14 13:36:47.136107: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-02-14 13:36:48.392833: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting en-core-web-md==3.4.1
  Downloading https://github.com/explosion/spacy-models/releases/download

In [96]:
import spacy

nlp = spacy.load('en_core_web_md')

# Function for lemmatization using spacy
def lemmatize(text: str) -> list:
    lemma_tokens = []
    tokens = nlp(text)
    for token in tokens:
      lemma_tokens.append(token.lemma_)
    return lemma_tokens

In [98]:
lemma_words = []
for tweet in tweets:
  lemma_words.append(lemmatize(tweet))
lemma_words[1]

['be',
 'a',
 'work',
 'at',
 'home',
 'Mom',
 '(',
 'WAHM',
 ')',
 'be',
 'A',
 '24/7',
 'job',
 '»',
 'mess',
 'with',
 'my',
 'mind',
 'http://bit.ly/17rlra',
 '\n']
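
The two outputs differ in a characteristic way: the stemmer lowercases every token and leaves `is` as it is, while the lemmatizer maps `Is` to `be`. The toy sketch below illustrates the underlying idea, rule-based suffix stripping versus dictionary lookup. `SUFFIXES`, `LEMMAS`, `toy_stem`, and `toy_lemmatize` are hypothetical and far simpler than what Snowball or spaCy actually do.

```python
# Toy stand-ins only: real stemmers and lemmatizers use far richer rules
# and vocabularies than the hypothetical SUFFIXES and LEMMAS below.
SUFFIXES = ("ing", "ed", "s")
LEMMAS = {"is": "be", "was": "be", "being": "be", "mice": "mouse"}

def toy_stem(word: str) -> str:
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Look the word up in a small dictionary of known lemmas."""
    return LEMMAS.get(word.lower(), word)

for word in ["Messing", "Is", "mice"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```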

## Explain sentencepiece (for masters only)

For this task you will use the sentencepiece text tokenizer. Read how it works and write an explanation of at least 10 sentences of how the tokenizer works.

...

## Resources

1. [Regular Expressions 1](https://realpython.com/regex-python/)
2. [Regular Expressions 2](https://realpython.com/regex-python-part-2/)
3. [Spacy Lemmatizer](https://spacy.io/api/lemmatizer)
4. [NLTK Stem](https://www.nltk.org/howto/stem.html)
5. [SentencePiece](https://github.com/google/sentencepiece)
6. [sentencepiece tokenizer](https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15)