# Tweets Tokenization

The goal of this assignment is to write a tweet tokenizer. The input is a set of tweets, and the output is the list of tokens in each tweet. The assignment consists of four tasks.

The [data](https://drive.google.com/file/d/15x_wPAflvYQ2Xh38iNQGrqUIWLj5l5Nw/view?usp=share_link) contains 5 files, each with 44 tweets, one tweet per line. Use only one file for the manual tokenization.

Grading:
- 30 points - Tokenize tweets by hand
- 30 points - Implement 4 tokenizers
- 20 points - Stemming and Lemmatization
- 20 points - Explain sentencepiece (for masters only)


Remarks: 
- Use Python 3 or greater
- Max is 80 points for bachelors, 100 points for masters

## Tokenize tweets by hand

As a first task you need to tokenize 15 tweets by hand. This will allow you to understand the problem from a linguistic point of view. The guidelines for tweet tokenization are as follows:

- Each smiley is a separate token
- Each hashtag is an individual token. Each user reference is an individual token
- If the letters of a word are separated by spaces (as in "N A T O"), join them into a single token
- If a sentence ends with a word that legitimately contains a full stop (an abbreviation, for example), keep that word intact and add a separate final full stop token
- Every punctuation mark is an individual token, including double quotes and single quotes
- A URL is a single token
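
The rule about words written with spaces between their letters can be applied as a preprocessing step before tokenizing. The sketch below is one possible way to do it, not part of the assignment: the helper name `join_spaced_letters` is hypothetical, and it assumes that only runs of three or more single capital letters should be joined, so it would not repair a case such as "helpfu l".

```python
import re

# Assumption: a "spaced word" is a run of three or more single capital
# letters separated by single spaces, for example "N A T O".
SPACED_WORD = re.compile(r"\b(?:[A-Z] ){2,}[A-Z]\b")

def join_spaced_letters(text: str) -> str:
    """Remove the spaces inside every spaced-out run of capital letters."""
    return SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)

print(join_spaced_letters("@xfranman Old age has made N A T O!"))
# -> @xfranman Old age has made NATO!
```

Requiring at least three letters keeps ordinary text such as "I A" untouched, at the cost of missing two-letter cases.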

Example of output

    Input tweet
    @xfranman Old age has made N A T O!

    Tokenized tweet (separated by comma)
    @xfranman , Old , age , has , made , NATO , !

    1. Input tweet
    @anitapuspasari waduh..
    1. Tokenized tweet
    @anitapuspasari, waduh, ..

    2. Input tweet
    " Could journos please stop putting the word ""gate"" after everything they write... gate."
    2. Tokenized tweet
    " , Could , journos , please , stop , putting , the , word , ", " , gate , ", " , after , everything , they , write , ... , gate , . , "
    
    3. Input tweet
    20% More Ridiculous Sale @20x200 ends tonight! - get 20% off by entering 'RIDONK' at checkout. More info: http://bit.ly/ridonktues

    3. Tokenized tweet
    20% , More , Ridiculous , Sale , @20x200 , ends , tonight , ! , - , get , 20% , off , by , entering , ', RIDONK, ' , at , checkout , . , More , info , : , http://bit.ly/ridonktues

    4. Input tweet
    @Studio85 I have a pair of those shoes. They are comfy. Like being barefoot. Okay for running, but not on concrete, as I've discovered.

    4. Tokenized tweet
    @Studio85, I, have, a, pair, of, those, shoes, ., They, are, comfy, ., Like, being, barefoot, ., Okay, for, running, ,, but, not, on, concrete, ,, as, I've, discovered, .
    
    5. Input tweet
    RT @twilightus Team Carlisle is a Trending Topic- help him out RT Follow @peterfacinelli see a grown man n a bikini dance Hollywood Blvd

    5. Tokenized tweet
    RT, @twilightus, Team, Carlisle, is, a, Trending, Topic, -, help, him, out, RT, Follow, @peterfacinelli, see, a, grown, man, n, a, bikini, dance, Hollywood, Blvd

    6. Input tweet
    @karenrubin you might have to reinstall - that happened to me a few months ago, now I use Nambu on my Mac

    6. Tokenized tweet
    @karenrubin, you, might, have, to, reinstall, -, that, happened, to, me, a, few, months, ago, ,, now, I, use, Nambu, on, my, Mac
    
    7. Input tweet
    Just Posted: Redneck Dragon - Part XXVIII (http://cli.gs/gWy0yT)

    7. Tokenized tweet
    Just, Posted , :, Redneck, Dragon, -, Part, XXVIII, (, http://cli.gs/gWy0yT, )

    8. Input tweet
    " ""Paul McCartney ... went through all his education there and nobody thought he had any musical talent,"" http://tinyurl.com/nkdbdq"

    8. Tokenized tweet
    " , "" , Paul, McCartney, ..., went, through, all, his, education, there, and, nobody, thought, he, had, any, musical, talent, , "" , http://tinyurl.com/nkdbdq, ""
    
    9. Input tweet
    @ambienteer Yeah, pretty much how i feel about it.

    9. Tokenized tweet
    @ambienteer, Yeah, ,, pretty, much, how, i, feel, about, it, .

    10. Input tweet
    @florianseroussi Nothing really noticeable? Are you kidding?

    10. Tokenized tweet
    @florianseroussi, Nothing, really, noticeable, ?, Are, you, kidding, ?
    
    11. Input tweet
    @toiletooth Hours?

    11. Tokenized tweet
    @toiletooth, Hours, ?

    12. Input tweet
    " Obama,Hamas,and the Mullahs being ""helpfu l""http://www.jpost.com/servlet/Satellite?cid=1245184848467&pagename=JPost%2FJPArticle%2FPrinter"

    12. Tokenized tweet
    " , Obama, ,, Hamas, ,, and, the, Mullahs, being, "" , helpful , "" , http://www.jpost.com/servlet/Satellite?cid=1245184848467&pagename=JPost%2FJPArticle%2FPrinter , "
    
    13. Input tweet
    RT @BBHLabs 81% of twitter users are UNDER 30 + more v. interesting statistics here: http://www.sysomos.com/insidetwitter/

    13. Tokenized tweet
    RT, @BBHLabs, 81%, of, twitter, users, are, UNDER, 30, +, more, v., interesting, statistics, here, :, http://www.sysomos.com/insidetwitter/

    14. Input tweet
    @Birdingperu Great looking hummer! RTThe world's most spectacular hummingbird Marvelous Spatuletail on a feeder. http://bit.ly/aGHYZ

    14. Tokenized tweet
    @Birdingperu, Great, looking, hummer, !, RTThe, world's, most, spectacular, hummingbird, Marvelous, Spatuletail, on, a, feeder, ., http://bit.ly/aGHYZ

    15. Input tweet
    attn. chas. whitman: RT @villagevoice Jonas Brothers at Rockefeller Center for the Today Show tomorrow morn—EEEEEEEEEE!

    15. Tokenized tweet
    attn., chas., whitman, :, RT, @villagevoice, Jonas, Brothers, at, Rockefeller, Center, for, the, Today, Show, tomorrow, morn, — , EEEEEEEEEE , !


## Implement 4 tokenizers

Your task is to implement four tokenizers, each of which takes a list of tweets and outputs the tokens of each tweet:

- White Space Tokenization
- Sentencepiece
- Tokenizing text using regular expressions
- NLTK TweetTokenizer

For the regular-expression tokenizer, use the rules from task 1: combine them into a single regular expression and build a tokenizer around it.

In [1]:
from typing import List

def white_space_tokenizer(text: str) -> List[str]:
    return text.split()

In [2]:
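# Read all tweets into one list (each line keeps its trailing newline)
data = []
for i in range(1, 6):
    with open(f"Assignment1_data/file{i}", "r") as f:
        data.extend(f.readlines())

# Write the combined file once, in "w" mode, so that re-running the cell
# does not append duplicate tweets to the SentencePiece training data
with open("Assignment1_data/all_tweets", "w") as output:
    output.writelines(data)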
        

In [3]:
import sentencepiece as spm

# Train the SentencePiece model on the input text
spm.SentencePieceTrainer.Train('--input=Assignment1_data/all_tweets --model_prefix=tweet --vocab_size=500')

sentencepiece_trainer.cc(177) LOG(INFO) Running command: --input=Assignment1_data/all_tweets --model_prefix=tweet --vocab_size=500
sentencepiece_trainer.cc(77) LOG(INFO) Starts training with : 
trainer_spec {
  input: Assignment1_data/all_tweets
  input_format: 
  model_prefix: tweet
  model_type: UNIGRAM
  vocab_size: 500
  self_test_sample_size: 0
  character_coverage: 0.9995
  input_sentence_size: 0
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  split_digits: 0
  treat_whitespace_as_suffix: 0
  allow_whitespace_only_pieces: 0
  required_chars: 
  byte_fallback: 0
  vocabulary_output_piece_score: 1
  train_extremely_large_corpus: 0
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 0
  bos_id: 1
  eos_id: 2
  pad_id: -1
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: 

In [4]:
import sentencepiece as spm

# Load the model trained in the previous cell once, not on every call
sp = spm.SentencePieceProcessor()
sp.Load("tweet.model")

def sentencepiece_wrapper(text: str) -> List[str]:
    # Split the text into subword pieces
    return sp.EncodeAsPieces(text)

In [5]:
import re

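def re_tokenizer(text: str) -> List[str]:
    smiley = r':\)|;\)|:-\)'      # matches :) ;) and :-)
    hashtag = r'#\w+'
    user_reference = r'@\w+'
    # The '.' is unescaped, so this matches a word plus any one following
    # character except a newline (a space, a quote, a full stop, ...)
    full_stop = r'\w+.'
    # For a URL, full_stop is tried first and matches 'http:', so this
    # alternative does not capture the URL as one token
    url = r'https?://[^\s]+'
    punc = r'\W'                  # any non-word character, whitespace included
    # Alternatives are tried from left to right at each position
    pattern = f"{smiley}|{hashtag}|{user_reference}|{full_stop}|{url}|{punc}"
    return re.findall(pattern, text)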
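
In the pattern above, the `.` in `\w+.` is unescaped and `\W` also matches whitespace, which is why the recorded output below contains tokens such as `'Could '` and `' '`. The following is a hedged sketch of how the task 1 rules might be combined instead. The name `re_tokenizer_sketch`, the list of smileys, and the choice to keep contractions together are assumptions made for illustration, and the sketch does not reproduce every decision in the hand tokenization (it splits "20%" into two tokens and does not join spaced letters).

```python
import re
from typing import List

# Alternatives are tried left to right, so the most specific rules come first.
TOKEN_PATTERN = re.compile(
    r"""
      https?://\S*[^\s.,;:!?)\]"']   # URL, without trailing punctuation
    | [@#]\w+                        # user reference or hashtag
    | [:;=]-?[()DPp]                 # a few common smileys (illustrative list)
    | (?:[A-Za-z]\.){2,}             # dotted abbreviations such as U.S.
    | \w+(?:'\w+)*                   # words; contractions like I've stay whole
    | \.{2,}                         # a run of dots stays together
    | [^\w\s]                        # any other punctuation mark on its own
    """,
    re.VERBOSE,
)

def re_tokenizer_sketch(text: str) -> List[str]:
    """Return the tokens of one tweet; whitespace is never a token."""
    return TOKEN_PATTERN.findall(text)

print(re_tokenizer_sketch("@anitapuspasari waduh..\n"))
# -> ['@anitapuspasari', 'waduh', '..']
print(re_tokenizer_sketch("Just Posted: Redneck Dragon - Part XXVIII (http://cli.gs/gWy0yT)"))
# -> ['Just', 'Posted', ':', 'Redneck', 'Dragon', '-', 'Part', 'XXVIII', '(', 'http://cli.gs/gWy0yT', ')']
```

Putting the URL rule first prevents the word rule from consuming `http`, and excluding whitespace from the punctuation rule keeps spaces and newlines out of the result.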

In [6]:
import nltk
from nltk.tokenize import TweetTokenizer

def nltk_tweet_tokenizer(text: str) -> List[str]:
    tknzr = TweetTokenizer()
    return tknzr.tokenize(text)

Run your implementations on the data. Compare the results and decide which tokenizer is best. List the advantages of the best tokenizer.

In [7]:
for i, tweet in enumerate(data[:10], 1):
    print(f"Tweet {i}:")
    print("White space tokenizer:")
    print(white_space_tokenizer(tweet))
    print()
    print("Sentencepiece:")
    print(sentencepiece_wrapper(tweet))
    print()
    print("Regular expression tokenizer:")
    print(re_tokenizer(tweet))
    print()
    print("NLTK TweetTokenizer:")
    print(nltk_tweet_tokenizer(tweet))
    print("-"*30)
    print()
    

Tweet 1:
White space tokenizer:
['@anitapuspasari', 'waduh..']

Sentencepiece:
['▁@', 'an', 'ita', 'p', 'u', 's', 'p', 'a', 's', 'ar', 'i', '▁w', 'ad', 'u', 'h', '..']

Regular expression tokenizer:
['@anitapuspasari', ' ', 'waduh.', '.', '\n']

NLTK TweetTokenizer:
['@anitapuspasari', 'waduh', '..']
------------------------------

Tweet 2:
White space tokenizer:
['"', 'Could', 'journos', 'please', 'stop', 'putting', 'the', 'word', '""gate""', 'after', 'everything', 'they', 'write...', 'gate."']

Sentencepiece:
['▁"', '▁C', 'ould', '▁', 'j', 'our', 'no', 's', '▁pleas', 'e', '▁sto', 'p', '▁put', 't', 'ing', '▁the', '▁w', 'o', 'rd', '▁""', 'ga', 'te', '""', '▁a', 'f', 'ter', '▁', 'e', 'very', 'th', 'ing', '▁they', '▁w', 'r', 'ite', '...', '▁g', 'at', 'e', '.', '"']

Regular expression tokenizer:
['"', ' ', 'Could ', 'journos ', 'please ', 'stop ', 'putting ', 'the ', 'word ', '"', '"', 'gate"', '"', ' ', 'after ', 'everything ', 'they ', 'write.', '.', '.', ' ', 'gate.', '"', '\n']

NLTK TweetTokenizer:
[output truncated]

NLTK TweetTokenizer appears to be the best of the four tokenizers for tweets. Its advantages:

1. Handles emojis and emoticons: The TweetTokenizer is designed for social media text and keeps emoticons such as ":-)" together as single tokens.

2. Handles hashtags and mentions: Hashtags and user references are each kept as one token, as with '@anitapuspasari' in the output above.

3. Preserves contractions: Contractions such as "don't" and "can't" stay as single tokens, which helps preserve the meaning of the text.

4. Supports different languages: Its rules are regular expressions rather than a model trained on English, so it can also be applied to tweets in other languages that separate words with whitespace.
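
The comparison above is made by inspection. One way to make it measurable, sketched here under the assumption that the hand tokenizations from the first task serve as the reference, is to score each tokenizer by how many of its tokens agree with the reference. The function name `token_f1` is hypothetical, and comparing tokens as an unordered multiset is a simplification.

```python
from collections import Counter
from typing import List

def token_f1(predicted: List[str], gold: List[str]) -> float:
    """F1 score over the multiset of tokens (token order is ignored)."""
    overlap = sum((Counter(predicted) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hand tokenization of tweet 11 from the first task
gold = ["@toiletooth", "Hours", "?"]

# Whitespace splitting leaves "Hours?" glued together, so only one token agrees
print(token_f1("@toiletooth Hours?".split(), gold))
# An exact match scores 1.0
print(token_f1(["@toiletooth", "Hours", "?"], gold))
```

Averaging this score over the 15 hand-tokenized tweets would give one number per tokenizer to support the ranking.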

## Stemming and Lemmatization

Your task is to write two functions: stem and lemmatize. Input is a text, so you need to tokenize it first.

In [8]:
from typing import List
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

def stem(text: str) -> List[str]:
    stemmer = SnowballStemmer("english")
    tokens = word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in tokens]
    return stemmed_words

In [9]:
import spacy

nlp = spacy.load("en_core_web_sm")

def lemmatize(text: str) -> List[str]:
    doc = nlp(text)
    lemmas = [token.lemma_ for token in doc]
    return lemmas
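
The two functions differ in kind: a stemmer strips suffixes by rule and may return a non-word, while a lemmatizer maps each word to its dictionary form. The toy sketch below illustrates only that difference. It is neither the Snowball algorithm nor spaCy's lemmatizer, and the suffix list and lookup table are made up for the example.

```python
# Toy rules, not the Snowball algorithm: strip the first matching suffix.
TOY_SUFFIXES = ("ing", "ed", "s")

def toy_stem(word: str) -> str:
    w = word.lower()
    for suffix in TOY_SUFFIXES:
        # Keep at least three characters so that short words are left alone
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

# Toy lookup table standing in for a real lemmatizer's vocabulary and rules.
TOY_LEMMAS = {"putting": "put", "ends": "end", "wrote": "write"}

def toy_lemmatize(word: str) -> str:
    w = word.lower()
    return TOY_LEMMAS.get(w, w)

for word in ["everything", "putting", "ends", "wrote"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# everything everyth everything
# putting putt put
# ends end end
# wrote wrote write
```

The same contrast is visible in the recorded output for tweet 2, where the stemmer returns 'everyth' and the lemmatizer returns 'everything'.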

In [10]:
for i, tweet in enumerate(data[:10], 1):
    print(f"Tweet {i}:")
    print(tweet)
    print()
    stemmed_tokens = stem(tweet)
    print("Stemmed Tokens:", stemmed_tokens)
    print()
    lemmatized_tokens = lemmatize(tweet)
    print("Lemmatized Tokens:", lemmatized_tokens)
    print("-"*30)
    print()

Tweet 1:
@anitapuspasari waduh..


Stemmed Tokens: ['@', 'anitapuspasari', 'waduh', '..']

Lemmatized Tokens: ['@anitapuspasari', 'waduh', '..', '\n']
------------------------------

Tweet 2:
" Could journos please stop putting the word ""gate"" after everything they write... gate."


Stemmed Tokens: ['``', 'could', 'journo', 'pleas', 'stop', 'put', 'the', 'word', '``', "''", 'gate', "''", "''", 'after', 'everyth', 'they', 'write', '...', 'gate', '.', "''"]

Lemmatized Tokens: ['"', 'could', 'journo', 'please', 'stop', 'put', 'the', 'word', '"', '"', 'gate', '"', '"', 'after', 'everything', 'they', 'write', '...', 'gate', '.', '"', '\n']
------------------------------

Tweet 3:
20% More Ridiculous Sale @20x200 ends tonight! - get 20% off by entering 'RIDONK' at checkout. More info: http://bit.ly/ridonktues


Stemmed Tokens: ['20', '%', 'more', 'ridicul', 'sale', '@', '20x200', 'end', 'tonight', '!', '-', 'get', '20', '%', 'off', 'by', 'enter', 'ridonk', "'", 'at', 'checkout', '.', 'mor

## Explain sentencepiece (for masters only)

For this task you will use the SentencePiece text tokenizer. Read how it works and write an explanation of at least 10 sentences describing how the tokenizer works.

In [13]:
for i, tweet in enumerate(data[:10], 1):
    print(f"Tweet {i}:")
    print(tweet)
    print()
    print(sentencepiece_wrapper(tweet))
    print("-"*50)
    print()

Tweet 1:
@anitapuspasari waduh..


['▁@', 'an', 'ita', 'p', 'u', 's', 'p', 'a', 's', 'ar', 'i', '▁w', 'ad', 'u', 'h', '..']
--------------------------------------------------

Tweet 2:
" Could journos please stop putting the word ""gate"" after everything they write... gate."


['▁"', '▁C', 'ould', '▁', 'j', 'our', 'no', 's', '▁pleas', 'e', '▁sto', 'p', '▁put', 't', 'ing', '▁the', '▁w', 'o', 'rd', '▁""', 'ga', 'te', '""', '▁a', 'f', 'ter', '▁', 'e', 'very', 'th', 'ing', '▁they', '▁w', 'r', 'ite', '...', '▁g', 'at', 'e', '.', '"']
--------------------------------------------------

Tweet 3:
20% More Ridiculous Sale @20x200 ends tonight! - get 20% off by entering 'RIDONK' at checkout. More info: http://bit.ly/ridonktues


['▁', '20', '%', '▁Mor', 'e', '▁', 'R', 'id', 'i', 'c', 'u', 'l', 'ous', '▁S', 'al', 'e', '▁@', '20', 'x', '20', '0', '▁', 'end', 's', '▁tonight', '!', '▁-', '▁get', '▁', '20', '%', '▁of', 'f', '▁b', 'y', '▁', 'ent', 'er', 'ing', '▁', "'", 'R', 'I', 'D', 'ON', 'K', "'",

SentencePiece is a text tokenizer library that learns, without supervision, how to segment raw text into subword pieces. It is designed for NLP pipelines and is especially useful for languages with no clear boundary between words, because it does not require the text to be split into words first. Its purpose is to provide a fast, language-independent tokenization step with a fixed vocabulary.

1. SentencePiece is trained in an unsupervised way: it needs only raw sentences, not annotated data.
2. It treats the input as a plain sequence of Unicode characters and encodes whitespace as the visible symbol "▁", which is why pieces such as '▁the' appear in the output above.
3. Because whitespace is kept as an ordinary symbol, the (normalized) text can be restored by joining the pieces and turning "▁" back into spaces, and the same procedure applies to different languages and scripts.
4. It implements two subword algorithms, byte-pair encoding (BPE) and the unigram language model; the training log above shows that the unigram model was used here.
5. The vocabulary size is chosen by the user before training (500 in this notebook), so the model keeps only the pieces that are most useful for describing the training text.
6. With the unigram model, training starts from a large set of candidate pieces and repeatedly removes the least useful ones until the requested vocabulary size is reached.
7. Each remaining piece has a probability, and a text is tokenized by choosing the segmentation whose pieces have the highest combined probability.
8. Frequent strings therefore tend to become single pieces, while rare words are broken into several shorter pieces, as with '@anitapuspasari' above.
9. Character-level and word-level models are also available, so the same library covers a range of granularities.
10. The trained model is saved to a file (here tweet.model) and can be loaded later, so the same tokenization can be reused across experiments; a fixed subword vocabulary also largely avoids out-of-vocabulary words, which is one reason SentencePiece is widely used in language modeling and machine translation.
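
The training log reports `model_type: UNIGRAM`. The sketch below illustrates how a unigram model can pick a segmentation: dynamic programming (Viterbi) finds the split whose pieces have the highest total log-probability. It is an illustration under stated assumptions, not SentencePiece's implementation: the function name `viterbi_segment` is hypothetical, the piece scores are made up, and normalization and unknown-character handling are left out.

```python
import math
from typing import Dict, List

WS = "\u2581"  # the visible whitespace symbol used in SentencePiece pieces

def viterbi_segment(text: str, logp: Dict[str, float]) -> List[str]:
    """Most probable segmentation of text under a unigram model over pieces."""
    s = WS + text.replace(" ", WS)   # mark every word start, including the first
    n = len(s)
    best = [-math.inf] * (n + 1)     # best[i]: best log-probability of s[:i]
    back = [0] * (n + 1)             # back[i]: where the last piece of s[:i] starts
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = s[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    pieces = []
    i = n
    while i > 0:                     # follow the back-pointers from the end
        pieces.append(s[back[i]:i])
        i = back[i]
    return pieces[::-1]

# Made-up piece scores, for illustration only
scores = {WS + "the": 0.2, WS: 0.1, WS + "w": 0.1, "ite": 0.1, "e": 0.1,
          "t": 0.05, "h": 0.05, "w": 0.05, "r": 0.05, "i": 0.05}
logp = {piece: math.log(p) for piece, p in scores.items()}

print(viterbi_segment("the write", logp))
# -> ['▁the', '▁w', 'r', 'ite']
```

With these scores the frequent string "the" stays whole, while "write", which is not in the toy vocabulary, is assembled from shorter pieces.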

## Resources

1. [Regular Expressions 1](https://realpython.com/regex-python/)
2. [Regular Expressions 2](https://realpython.com/regex-python-part-2/)
3. [spaCy Lemmatizer](https://spacy.io/api/lemmatizer)
4. [NLTK Stem](https://www.nltk.org/howto/stem.html)
5. [SentencePiece](https://github.com/google/sentencepiece)
6. [SentencePiece tokenizer](https://towardsdatascience.com/sentencepiece-tokenizer-demystified-d0a3aac19b15)