# Processing Tweets

Here the relevant tweets are tokenized and lemmatized using the Frog language processing suite, published by Van den Bosch, A., Busser, G.J., Daelemans, W., and Canisius, S. in 2007. The detailed documentation can be found at https://languagemachines.github.io/frog/

The LaMachine distribution was used here, running in a virtual machine that is accessed through a Vagrant client. Frog requires a Linux environment, but the general analysis was run in a Windows environment, hence the need for a virtual machine. This means that variables such as "path" will depend on your system environment and on how the Frog suite is installed.
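
One way to keep the machine-specific directory out of the processing loop is sketched below. It is only an illustration: the environment variable `TWEET_DATA_DIR` and both helper names are hypothetical and not part of Frog or LaMachine. The default directory is the one used in the script further down, and the file-name check mirrors the "##_string.json" convention mentioned there.

```python
import os

# Directory used by the script below; adjust it to your own installation.
DEFAULT_DATA_DIR = "/vagrant/Processed_2020/"


def resolve_data_dir(env=None):
    """Return the data directory, preferring the hypothetical TWEET_DATA_DIR variable."""
    env = os.environ if env is None else env
    return env.get("TWEET_DATA_DIR", DEFAULT_DATA_DIR)


def files_for_month(filenames, month):
    """Keep only file names whose prefix before the first '_' equals the month, sorted by name."""
    return sorted(name for name in filenames if name.split("_")[0] == month)
```

In the notebook, `files_for_month(listdir(path), month)` could stand in for the `listdir` loop combined with the month check.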

This process is computationally expensive because Frog analyzes each tweet rather thoroughly: processes such as morphological analysis and named-entity recognition are also run as part of lemmatizing the tweets. This script saves three processed versions of the tweets:
<ol>
<li> A simply tokenized version of the tweet </li>
    
<li> A lemmatized version of the tweet</li>
    
<li> A full token-by-token breakdown provided by Frog (a list of dictionaries, each containing information for one token)</li>
</ol>
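
To make the three versions concrete, the sketch below derives them from a hand-written stand-in for Frog's output, so it runs without Frog installed. The sample tokens are made up for illustration. Only the `text` and `lemma` keys are taken from the script; the `pos` key and its simplified values are assumptions, and `build_versions` is a hypothetical helper name.

```python
# Made-up stand-in for what Frog returns for one short tweet:
# a list of dictionaries, one per token.
sample_frog_output = [
    {"text": "De", "lemma": "de", "pos": "LID"},
    {"text": "katten", "lemma": "kat", "pos": "N"},
    {"text": "slapen", "lemma": "slapen", "pos": "WW"},
]


def build_versions(tokens):
    """Return the three versions stored per tweet, using the same keys as the script."""
    return {
        "tokenized": [token["text"] for token in tokens],
        "lemmatized": [token["lemma"] for token in tokens],
        "full_frog": tokens,
    }


record = build_versions(sample_frog_output)
print(record["tokenized"])
print(record["lemmatized"])
```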

In [None]:
import json
import frog
from os import listdir

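#Initialize Frog once, with the dependency parser and named-entity recognition switched on.
#Note that this rebinds the name "frog" from the module to the Frog instance.
frog = frog.Frog(frog.FrogOptions(parser=True, ner=True))

#This path should correspond to where tweets are stored on your machine.
#The monthly data files need to be named the same way the Preprocessing script saves them ("##_string.json")
path = '/vagrant/Processed_2020/'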
for month in ['08', '09']:  #controls which months are processed; add e.g. '10', '11', '12', '01', '02' as needed
    for file in listdir(path):
        if file.split('_')[0] == month:   
            with open(path + file, 'r') as infile:
                data = json.loads(infile.read())
            
            print('Month ' + month + ':')
            print('Loaded ' + str(len(data.keys())) + ' tweets')
               
            counter = 0
                        
            for identifier in data.keys():
                if 'joined_text' in data[identifier]: 
                    tweet_raw = data[identifier]['joined_text']
                elif data[identifier]['truncated'] is True:
                    tweet_raw = data[identifier]['extended_tweet']['full_text']
                else:
                    tweet_raw = data[identifier]['text']

                tweet_proc = frog.process(tweet_raw) 
                
                #Here you can customize what information gets added to the tweets
                data[identifier]['lemmatized'] = [token['lemma'] for token in tweet_proc]
                data[identifier]['tokenized'] = [token['text'] for token in tweet_proc]
                data[identifier]['full_frog'] = tweet_proc
                
                #This counter simply keeps track of the process
                counter += 1
                if counter % 2000 == 0:
                    print('processed ' + str(counter) + ' tweets')
            
            #This determines the directory processed tweets are saved into
            with open('/vagrant/' + month + '_processed.json', 'w') as outfile:
                json.dump(data, outfile)
                
            print('Saved data for month ' + month)  
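
The choice of which text field to process can also be written as a small standalone function, which makes it easy to check without running Frog. This is a sketch rather than the code used above: `choose_tweet_text` is a hypothetical name, and unlike the loop above it does not fail when the `truncated` or `extended_tweet` key is missing.

```python
def choose_tweet_text(tweet):
    """Pick the most complete text available for one tweet dictionary."""
    # Text joined during preprocessing takes priority.
    if "joined_text" in tweet:
        return tweet["joined_text"]
    # Truncated tweets carry their full text in 'extended_tweet'.
    if tweet.get("truncated") is True and "extended_tweet" in tweet:
        return tweet["extended_tweet"]["full_text"]
    # Otherwise fall back to the regular text field.
    return tweet["text"]


example = {
    "truncated": True,
    "text": "Een lange tweet die...",
    "extended_tweet": {"full_text": "Een lange tweet die niet is afgekapt"},
}
print(choose_tweet_text(example))
```

Inside the loop this would be called as `tweet_raw = choose_tweet_text(data[identifier])`.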



## Alternative - stemming

An alternative is to stem the tokens of a tweet rather than lemmatize them, which can be done using NLTK. By my estimation the results were not as good as those of lemmatization, but the process is much simpler and much faster to execute. The script stems the `tokenized` field of each tweet, so it expects files in which that field is already present.

In [None]:
import json
from nltk.stem import SnowballStemmer
from os import listdir

#This path should correspond to where tweets are stored on your machine; it is left empty and must be set before running.
#The monthly data files need to be named the same way the Preprocessing script saves them ("##_string.json")
path = ''

for file in listdir(path):
    with open(path + file, 'r') as infile:
        data = json.loads(infile.read())

    print('Month ' + file.split('_')[0] + ':')
    print('Loaded ' + str(len(data.keys())) + ' tweets')

    counter = 0
    stemmer = SnowballStemmer("dutch")

    for identifier in data.keys():
        tweet_raw = data[identifier]['tokenized']
        tweet_proc = [stemmer.stem(word) for word in tweet_raw]

        data[identifier]['snowball'] = tweet_proc
        
        
        #This counter simply keeps track of the process
        counter += 1
        if counter % 10000 == 0:
            print('processed ' + str(counter) + ' tweets')

    with open(path + file.split('_')[0] + '_processed.json', 'w') as outfile:
        json.dump(data, outfile)

    print('Saved data for month ' + file.split('_')[0])


    
