# Cleaning and filtering tweets

The `extract_documents` method is quite large, but it's convenient to keep all our transformations in a single pipeline.
To add another transformation, create a lambda and insert it at the right point in the pipe.

Note that this compositional approach lets us do convenient things, like filtering out empty
tweets more than once, so that other transformations don't blow up or need extra logic.
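
As a minimal sketch of adding a step, the example below runs a shortened pipe over plain strings. The `run_pipeline` helper and the `lowercase` step are hypothetical names introduced for illustration; they are not part of `extract_documents`.

```python
def run_pipeline(items, transformations):
    """Apply each transformation in order; each takes a list and returns a new list."""
    for transformation in transformations:
        items = transformation(items)
    return items

tokenize = lambda tweets: [tweet.strip().split() for tweet in tweets]
filter_empty = lambda tweets: [tweet for tweet in tweets if len(tweet) > 0]
rejoin = lambda tweets: [' '.join(tokens) for tokens in tweets]

# Hypothetical new step: lowercase every token
lowercase = lambda tweets: [[token.lower() for token in tokens] for tokens in tweets]

example_documents = run_pipeline(
    ["Hello  World", "   ", "Second TWEET"],
    [tokenize, filter_empty, lowercase, rejoin],  # new step goes before rejoin
)
print(example_documents)
# => ['hello world', 'second tweet']
```

Because `lowercase` works on token lists, it has to sit between `tokenize` and `rejoin`; a step that works on raw strings would go before `tokenize` instead.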
    
### TODO

1. other things we may want to exclude
    - emoji tokens
    - tweets that are too short
2. ...
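
The two exclusions listed above could be sketched as extra steps that operate on token lists. The names `is_emoji_token`, `strip_emoji_tokens` and `make_filter_short`, and the `min_tokens` threshold, are all assumptions. The emoji check is a rough heuristic based on Unicode general categories, not a complete emoji detector (it would also drop tokens such as a lone `©`).

```python
import unicodedata

def is_emoji_token(token):
    """Heuristic: the token has at least one 'Symbol, other' character and
    otherwise only modifiers, combining marks or format characters."""
    categories = [unicodedata.category(ch) for ch in token]
    return 'So' in categories and all(c in ('So', 'Sk', 'Mn', 'Cf') for c in categories)

strip_emoji_tokens = lambda tweets: [
    [token for token in tokens if not is_emoji_token(token)] for tokens in tweets
]

def make_filter_short(min_tokens):
    """Build a step that drops tweets with fewer than min_tokens tokens."""
    return lambda tweets: [tokens for tokens in tweets if len(tokens) >= min_tokens]

filter_short = make_filter_short(3)  # hypothetical threshold

sample = [['what', 'a', 'day', '\U0001F600'], ['ok', '\U0001F600']]
cleaned = filter_short(strip_emoji_tokens(sample))
print(cleaned)
# => [['what', 'a', 'day']]
```

Both steps would fit after `tokenize` and before `rejoin`, with the length filter placed after the stripping steps so that it counts only the tokens that survive.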


### Notes

Looks like we might need to make sure we're using the right API endpoint, mode, version, etc., so that we get the full version of the tweets, with `full_text`/`text` longer than 140 chars. In this sample, `text` is truncated and `full_text` is `None`:

```
t = tweets[4]
print(t.text)
# => RT @Boeufblogginon: @margokingston1 @cunningham_cch @Bowenchris @quaedvliegs Dutton is one nasty piece of work. Could you imagine what he w…

print(t.full_text)
# => None
```
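
Until the fetch side is sorted out, one defensive option is to prefer `full_text` when it is populated and fall back to `text` otherwise. The sketch below uses `SimpleNamespace` objects as stand-ins for tweets, and `best_text` is a hypothetical helper. It cannot recover truncated text; it only picks the most complete field available. (The v1.1 REST API documents a `tweet_mode=extended` request parameter for untruncated text; whether that is the missing setting here is an assumption.)

```python
from types import SimpleNamespace

def best_text(tweet):
    """Return full_text when it is present and non-empty, otherwise text."""
    return getattr(tweet, 'full_text', None) or tweet.text

# Stand-in tweets, not real API objects
truncated = SimpleNamespace(text="RT @someone: a truncated tweet…", full_text=None)
extended = SimpleNamespace(text="short…", full_text="the complete, untruncated tweet")

print(best_text(truncated))
print(best_text(extended))
```

With a helper like this, the `extract_text` step would not need to change again once full-length tweets start arriving.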

In [163]:
def extract_documents(tweets):
    extract_text = lambda tweets: [tweet.text for tweet in tweets]
    convert_whitespace_chars = lambda tweets: [tweet.replace('\n', ' ').replace('\r', ' ').replace('\t', ' ') for tweet in tweets]
    # collapse whitespace runs of any length; a single replace('  ', ' ') only halves them
    squash_whitespace = lambda tweets: [' '.join(tweet.split()) for tweet in tweets]
    tokenize = lambda tweets: [tweet.strip().split() for tweet in tweets]
    strip_links = lambda tweets: [[token for token in tokens if "http" not in token] for tokens in tweets]
    # compare with != rather than `is not`: identity checks against string literals are unreliable
    strip_mentions = lambda tweets: [[token for token in tokens if token[0] != '@'] for tokens in tweets]
    strip_hashtags = lambda tweets: [[token for token in tokens if token[0] != '#'] for tokens in tweets]
    filter_empty = lambda tweets: [tweet for tweet in tweets if len(tweet) > 0]
    filter_retweets = lambda tweets: [tweet for tweet in tweets if tweet[0] != 'RT']
    rejoin = lambda tweets: [' '.join(tokens) for tokens in tweets]

    documents = tweets
    
    for transformation in [
        extract_text,
        convert_whitespace_chars,
        squash_whitespace,
        tokenize,
        strip_links,
        strip_mentions,
        strip_hashtags,
        filter_empty,
        filter_retweets,
        filter_empty,
        rejoin,
    ]:
        documents = transformation(documents)

    return documents

In [164]:
# demonstration: applying our pipeline to locally stored data
import pickle
import twitter

term = "libspill"
cache_filename = f"cached_tweets_{term}.pkl"
# load the cached tweets; unpickling needs the classes they were pickled with to be importable
with open(cache_filename, 'rb') as f:
    tweets = pickle.load(f)
documents = extract_documents(tweets)


In [151]:
# Note: we could use this approach for streaming, but probably we will just read all of them from the file
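def append_tweets(documents, tweets):
    new_documents = extract_documents(tweets)

    documents.extend(new_documents)

    # deduplicate in place, keeping first-seen order; rebinding `documents` to a new
    # list would only change the local name and leave the caller's list untouched
    documents[:] = list(dict.fromkeys(documents))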

# eg:
#documents = []
#append_tweets(documents, tweets)
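
If streaming does become necessary, one way to keep deduplication cheap is to track already-seen documents in a set, so each new batch is checked only against that set and the full list is never rebuilt. This is a sketch under assumptions: `DocumentStore` is a hypothetical class, and `extract` stands for any function that turns a batch of tweets into a list of strings (for example `extract_documents`).

```python
class DocumentStore:
    """Accumulates unique documents across batches, preserving first-seen order."""

    def __init__(self, extract):
        self.extract = extract      # function: batch of tweets -> list of strings
        self.documents = []
        self._seen = set()

    def append_tweets(self, tweets):
        """Add the documents from one batch; return how many were new."""
        added = 0
        for document in self.extract(tweets):
            if document not in self._seen:
                self._seen.add(document)
                self.documents.append(document)
                added += 1
        return added

# Stand-in extractor that just strips whitespace from plain strings
store = DocumentStore(extract=lambda batch: [text.strip() for text in batch])
store.append_tweets(["a", "b "])
store.append_tweets(["b", "c"])
print(store.documents)
# => ['a', 'b', 'c']
```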