# Notebook to translate tweets into English

This notebook translates Russian and Ukrainian tweets into English and adds a new column to each dataframe for the translated text.

### Setup and load data

In [None]:
import pandas as pd
from tqdm import tqdm

# Just pandas display options
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)  # None means no truncation; -1 is deprecated in newer pandas

In [None]:
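# These variables are only used to build the names of the files to load.
# NOTE: start_date and end_date are used below but are not defined in this notebook;
# set them to match the raw file names before running this cell.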
keywords_ru = 'макдональдс россия'
keywords_uk = 'макдональдс росія'
num_tweets = 5000 # per day

tweets_ru_df = pd.read_csv('data/tweets_raw_ru_df_' + keywords_ru + str(num_tweets) + 'dailytweets_' + start_date + '_to_' + end_date + '.csv')
tweets_uk_df = pd.read_csv('data/tweets_raw_uk_df_' + keywords_uk + str(num_tweets) + 'dailytweets_' + start_date + '_to_' + end_date + '.csv')

### Translating tweets into English

There are sentiment analysis tools for Russian and Ukrainian (e.g. https://github.com/bureaucratic-labs/dostoevsky, https://github.com/skupriienko/Ukrainian-Sentiment-Analysis). These may well be better than translating the tweets into English and applying an English sentiment analysis tool, since they can pick up on the subtleties and nuances of each language, whereas the translator often mistranslates or translates poorly. Nevertheless, for a first pass it is simplest to translate all tweets into English and analyse them the same way.

Using native sentiment analysis tools may also complicate comparisons because they are designed differently, e.g. the Ukrainian sentiment scores are either -1 or +1, whereas the English scores can be -4, -3, -2, -1, 0, 1, 2, 3 or 4.
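If the native tools are tried later, one way to put their scores side by side is to rescale each onto a common range. The snippet below is only a minimal sketch: `rescale_score` is a hypothetical helper, not part of any of the tools above, and the score ranges are the ones quoted in the previous paragraph.

```python
def rescale_score(score, low, high):
    """Linearly map a score from the range [low, high] onto [-1, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return 2 * (score - low) / (high - low) - 1


# An English-style score on the -4..4 scale
print(rescale_score(2, -4, 4))   # 0.5
# A Ukrainian-style score is already -1 or +1, so it is unchanged
print(rescale_score(-1, -1, 1))  # -1.0
```

Rescaling only aligns the ranges. A binary -1/+1 score still carries less detail than a nine-point one, so any comparison remains approximate.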

Who knows; if we do it both ways, we may discover sentiment that depends on the *language* itself rather than on the pure word content (see https://github.com/text-machine-lab/rusentiment/blob/master/Guidelines/guidelines_%5BRU%5D.md).

In [5]:
import time

from googletrans import Translator
from IPython.display import clear_output

translator = Translator()

#translator.raise_Exception = True

# Translate text into English. This adds a new 'Text Translated' column to the dataframe
def translate_text(tweets_lang_df, lang):
    # Iterate over the dataframe that was passed in, not a global one
    for idx, tweet in tqdm(tweets_lang_df.iterrows()):
        #print(idx)
        text = tweet['Text']
        text_translated = translator.translate(text, src=lang, dest='en').text
        tweets_lang_df.loc[idx, 'Text Translated'] = text_translated

        # Important to bypass google restrictions!
        # This requires a bit of experimentation to get the best runtime without being stopped by google
        #if idx >0 and idx%100 == 0:
        #    time.sleep(120)
        time.sleep(1.5)

    return tweets_lang_df

tweets_ru_df = translate_text(tweets_ru_df, 'ru')
# Translate the Ukrainian dataframe (not the Russian one again)
tweets_uk_df = translate_text(tweets_uk_df, 'uk')
    

303it [12:25,  2.46s/it]
303it [13:34,  2.69s/it]
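Since the run time is dominated by the pause between requests, one possible refinement is to translate each distinct text only once and to retry failed requests with a growing delay. This is a sketch under assumptions rather than what the cell above does: `translate_unique`, `translate_fn`, `max_retries` and `pause` are hypothetical names, the texts are assumed to be plain strings, and `translate_fn` stands for any callable that takes a string and returns its translation.

```python
import time


def translate_unique(texts, translate_fn, max_retries=3, pause=1.5, sleep=time.sleep):
    """Return a dict mapping each distinct text to its translation.

    translate_fn is a hypothetical callable (str -> str). Texts that still
    fail after max_retries attempts are left out of the result.
    """
    cache = {}
    for text in texts:
        if text in cache:
            continue  # duplicate text (e.g. a retweet): reuse the earlier result
        for attempt in range(max_retries):
            try:
                translated = translate_fn(text)
            except Exception:
                # Wait longer after each failure: 2x, 4x, 8x the normal pause
                sleep(pause * 2 ** (attempt + 1))
            else:
                cache[text] = translated
                sleep(pause)  # normal pause between successful requests
                break
    return cache
```

With the objects from the cell above, this could be called as `cache = translate_unique(tweets_ru_df['Text'], lambda t: translator.translate(t, src='ru').text)` followed by `tweets_ru_df['Text Translated'] = tweets_ru_df['Text'].map(cache)`; any text that never translated ends up as NaN in the new column.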


### Save dataframes

In [None]:
tweets_ru_df.to_csv('data/tweets_trans_ru_df_' + keywords_ru + '_ru_df' + str(num_tweets) + 'dailytweets_' + start_date + '_to_' + end_date + '.csv')
tweets_uk_df.to_csv('data/tweets_trans_uk_df_' + keywords_uk + '_uk_df' + str(num_tweets) + 'dailytweets_' + start_date + '_to_' + end_date + '.csv')