# Preprocessing

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from nltk.corpus import twitter_samples

## Exploring the Data

The sample dataset from NLTK is split into positive and negative tweets. It contains exactly 5000 positive and 5000 negative tweets.

In [None]:
postive_tweets = twitter_samples.strings("positive_tweets.json")
negative_tweets = twitter_samples.strings("negative_tweets.json")

print("Number of positive tweets: ", len(postive_tweets))
print("Number of negative tweets: ", len(negative_tweets))
print("Total number of tweets: ", len(postive_tweets) + len(negative_tweets))

First, we want to get an understanding of what the data looks like.

When you scroll through the samples, you will notice a couple of things that differentiate tweets from ordinary text, for example:
- usernames, so-called `handles`, e.g. `@Lambd2ja`
- hashtags, e.g. `#FollowFriday`
- emojis and smileys, e.g. 💞 or `:)`
- URLs, e.g. `https://t.co/smyYriipxI`
- slang words
- etc.
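
To get a feeling for how often these features occur, a few simple regular expressions can help. The following is a minimal sketch and not part of the assignment: the patterns, the `find_features` helper, and the sample tweet are assumptions made for illustration, and real tweets contain edge cases that these patterns miss.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
PATTERNS = {
    "handle": re.compile(r"@\w+"),
    "hashtag": re.compile(r"#\w+"),
    "url": re.compile(r"https?://\S+"),
}


def find_features(text: str) -> dict[str, list[str]]:
    """Return every match of each pattern in the given text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


# Made-up tweet assembled from the examples in the list above.
sample = "#FollowFriday @Lambd2ja thanks :) https://t.co/smyYriipxI"
features = find_features(sample)
print(features)
```

Applying such a helper to a handful of tweets from each class is a quick way to see which features dominate.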

Make yourself familiar with both the positive and negative tweets!

In [None]:
postive_tweets

In [None]:
negative_tweets

## Tweet Preprocessing

We will be using the `htwgnlp` Python package to preprocess the data.
It contains a `preprocessing` module with a `TweetProcessor` class.
The boilerplate code for the class is given, as well as some unit tests that describe the desired behavior.

Your job will be to implement the `TweetProcessor` class, which is located in `src/htwgnlp/preprocessing.py`.
The task is completed successfully if all tests for the first assignment pass.
You can run the tests using the following command:

```bash
make assignment_1
```

> As you can check in the `Makefile`, this will execute `pytest tests/htwgnlp/test_preprocessing.py` under the hood.

Let's assume we have the following requirements for the preprocessing pipeline of our tweets:

- remove URLs as they are usually shortened and don't add much information to the tweet
- remove hashtag symbols `#` but preserve the word of the hashtag since it gives valuable information about the content of the tweet
- remove English stopwords
- remove standard punctuation, but keep emojis like `:)`
- Twitter handles like `@stuartthull` should be removed completely
- after preprocessing, the tweet is expected to be in a tokenized and stemmed form, i.e. a list of words
- for tokenization, you should use [NLTK's `TweetTokenizer`](https://www.nltk.org/api/nltk.tokenize.casual.html#nltk.tokenize.casual.TweetTokenizer)
- for stemming, you should use [NLTK's `PorterStemmer`](https://www.nltk.org/api/nltk.stem.porter.html)
- tweets should be lowercased, and any run of a repeated character should be shortened to at most 3 characters, e.g. `looooove` should be transformed to `looove`
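
The rule for repeated characters can be illustrated with a single regular expression. This is only a sketch of the rule itself, and `shorten_repeats` is a hypothetical helper name; in the assignment, the `reduce_len` option of NLTK's `TweetTokenizer` is meant to take care of this.

```python
import re


def shorten_repeats(text: str) -> str:
    """Cap any run of the same character at three occurrences."""
    # (.)\1{2,} matches a character followed by two or more copies of itself.
    return re.sub(r"(.)\1{2,}", r"\1\1\1", text)


print(shorten_repeats("I looooove it!!!!!"))  # I looove it!!!
```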

For more implementation details, please refer to the [docstrings](https://realpython.com/documenting-python-code/#documenting-your-python-code-base-using-docstrings) of the `htwgnlp.preprocessing.TweetProcessor` class.


In [None]:
from htwgnlp.preprocessing import TweetProcessor

The following code shows the intended usage of the `TweetProcessor` class.

In [None]:
# instantiate a TweetProcessor object
processor = TweetProcessor()

# we use a selected tweet as an example
i = 2277
postive_tweets[i]

Each processing step described above is encapsulated in a separate method of the `TweetProcessor` class and can be called on its own, for example the `remove_urls(tweet: str)` method.

If your implementation works correctly, the URL `https://t.co/3tfYom0N1i` is removed when you execute the following line:

In [None]:
tweet = processor.remove_urls(postive_tweets[i])
tweet
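
One possible way to approach URL removal is a regular expression substitution. The sketch below is an assumption, not the reference solution: `strip_urls` is a hypothetical standalone function, the example text is made up around the URL mentioned above, and the pattern only covers URLs that start with `http://` or `https://`.

```python
import re


def strip_urls(text: str) -> str:
    """Remove http(s) URLs; URLs without a scheme are not detected."""
    return re.sub(r"https?://\S+", "", text)


# Made-up text around the URL from the example tweet.
example = "Sunny morning :) https://t.co/3tfYom0N1i"
print(repr(strip_urls(example)))
```

Note that the whitespace in front of the URL is left in place here. Whether it needs to be stripped depends on what the unit tests and docstrings expect.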

The `remove_hashtags(tweet: str)` method should transform `#sunflowers #favourites #happy #Friday` to `sunflowers favourites happy Friday`.

> Note that lowercasing comes later in the process.

In [None]:
tweet = processor.remove_hashtags(tweet)
tweet
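
Since only the `#` symbol has to go while the word stays, a plain string replacement may already be enough. The following sketch uses a hypothetical function name and is one possible approach rather than the expected implementation.

```python
def strip_hashtag_symbols(text: str) -> str:
    """Drop every '#' character but keep the word that follows it."""
    return text.replace("#", "")


print(strip_hashtag_symbols("#sunflowers #favourites #happy #Friday"))
# sunflowers favourites happy Friday
```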

The tokenization step should return lowercased tokens, shorten repeated character sequences, and remove Twitter handles.

> For this step, make sure to read the docs of [NLTK's `TweetTokenizer`](https://www.nltk.org/api/nltk.tokenize.casual.html#nltk.tokenize.casual.TweetTokenizer)

The expected output is a list of tokens. Specifically for our example, at this point, it should be: `['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']`

In [None]:
tweet = processor.tokenize(tweet)
print(tweet)
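
The three behaviours expected from this step correspond to constructor options of NLTK's `TweetTokenizer`. The snippet below is a small sketch of those options on a made-up sentence; how the tokenizer is wired into `TweetProcessor` is left to your implementation.

```python
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer(
    preserve_case=False,  # lowercase tokens (emoticons are left untouched)
    strip_handles=True,  # drop @username handles
    reduce_len=True,  # cap repeated characters at three
)

# Made-up sentence, not taken from the dataset.
tokens = tokenizer.tokenize("@stuartthull I looooove sunny Friday mornings :)")
print(tokens)
```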

After removing stopwords, the tweet should be: `['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']`

In [None]:
tweet = processor.remove_stopwords(tweet)
print(tweet)
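
Stopword removal boils down to filtering the token list against a set of words. The sketch below uses a tiny hand-picked set (`DEMO_STOPWORDS`, an assumption for illustration) instead of NLTK's much longer English stopword list, so that it runs without downloading any corpus.

```python
# Tiny hand-picked stopword set, only for illustration.
DEMO_STOPWORDS = {"my", "on", "a", "off"}

tokens = [
    "my", "beautiful", "sunflowers", "on", "a", "sunny", "friday", "morning",
    "off", ":)", "sunflowers", "favourites", "happy", "friday", "off", "…",
]

# Keep only the tokens that are not in the stopword set.
filtered = [token for token in tokens if token not in DEMO_STOPWORDS]
print(filtered)
```

Using a `set` keeps the membership check fast, which matters once the filter runs over all 10000 tweets.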

Removing punctuation makes no difference for our example: `['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']`

> Note that the requirement is to remove only common punctuation and to keep emojis like `:)`. One could argue that the trailing `…` should be removed as well, but for this pipeline, let's keep it simple.

In [None]:
tweet = processor.remove_punctuation(tweet)
print(tweet)
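
A simple way to think about this step is to drop every token that is a single ASCII punctuation character. The sketch below assumes Python's `string.punctuation` as the definition of "common punctuation"; `drop_punctuation` and the demo tokens are hypothetical.

```python
import string

# Single ASCII punctuation characters such as "!" or ",".
PUNCTUATION = set(string.punctuation)


def drop_punctuation(tokens: list[str]) -> list[str]:
    """Remove tokens that consist of exactly one punctuation character."""
    return [token for token in tokens if token not in PUNCTUATION]


demo = ["happy", ",", "friday", "!", ":)", "…"]
print(drop_punctuation(demo))  # ['happy', 'friday', ':)', '…']
```

Multi-character tokens like `:)` and the non-ASCII `…` are not members of that set, which is why both survive.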

Finally, the last step is stemming. 
After applying the Porter Stemmer, the tweet should look like this: `['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']`

In [None]:
tweet = processor.stem(tweet)
print(tweet)
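
To see the stemmer in isolation, it can be applied directly to a few of the words from the example. This is a short sketch of NLTK's `PorterStemmer` on a hand-copied word list, independent of the `TweetProcessor` class.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Words copied from the example tweet above.
words = ["beautiful", "sunflowers", "sunny", "morning", "favourites", "happy", ":)"]
stems = [stemmer.stem(word) for word in words]
print(stems)
```

The stems are not necessarily dictionary words (`beauti`, `happi`). What matters is that different forms of a word are mapped to the same string.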

The `process_tweet(tweet: str)` method is a shortcut for all of the above.

After a successful run of the pipeline, the processed tweet should look like this:

```txt
['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']
```

In [None]:
print(f"{'Tweet:':<20}{postive_tweets[i]}")
print(f"{'Processed tweet:':<20}{processor.process_tweet(postive_tweets[i])}")

When your tests run successfully, this notebook should also deliver the expected output.

Congratulations! 🥳🚀 You just completed your first assignment!