This notebook collects Google Trends data.
- It takes the countries from the UNHCR refugees dataset
- Gets the languages associated with each country
- Gets the two-letter language codes associated with each language
- Translates words from English to those languages

#### Read in relevant dictionaries, dataframes, modules

In [71]:
import pytrends
from country_abbrev import *
from country_language import *
from pytrends.request import TrendReq
import pandas as pd
import itertools
import googletrans

# get the list of all unique countries:
countries = pd.Series(pd.read_csv('../../data/data.csv', engine="pyarrow").Country_o.unique()).to_frame(name='country')

# list of all unique languages:
unique_languages = pd.Series(list(set(itertools.chain.from_iterable(country_language_dict.values()))), name='language').str.lower()

# list of language codes from googletrans
langcodes = pd.DataFrame.from_dict(googletrans.LANGCODES, orient='index')

Merge the list of languages with the googletrans language codes:

In [72]:
refugee_lang = unique_languages.to_frame().merge(langcodes, left_on='language', right_index=True, how='left')
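
A left merge keeps every language from the refugee list and leaves the code missing wherever the name has no exact match. The sketch below illustrates this on toy data. The names `toy_languages` and `toy_codes` are made-up stand-ins for `unique_languages` and `googletrans.LANGCODES`, and the column name `code` is an assumption (the notebook leaves that column named `0`).

```python
import pandas as pd

# Toy stand-ins (assumptions, not the real data)
toy_languages = pd.Series(['spanish', 'slovene', 'french'], name='language')
toy_codes = pd.Series(
    {'spanish': 'es', 'french': 'fr', 'slovenian': 'sl'}, name='code'
).to_frame()

# indicator=True adds a '_merge' column that records where each row matched
merged = toy_languages.to_frame().merge(
    toy_codes, left_on='language', right_index=True, how='left', indicator=True
)

# Rows marked 'left_only' found no code in the lookup table
n_unmatched = int((merged['_merge'] == 'left_only').sum())
unmatched_names = merged.loc[merged['code'].isna(), 'language'].tolist()
```

Counting the `left_only` rows is one way to check how many languages were lost in the merge before dropping them.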

Of the approximately 190 languages, about 110 have no code associated with the exact name we provide. This could be due to insufficient data cleaning (names that do not match exactly), but because these appear to be less commonly used languages, we skip them for now.
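
If some of the misses are only naming differences (a name like `slovene` may be listed under a different spelling in the code table), a small alias table applied before the lookup could recover them. This is a minimal sketch, not part of the original workflow: `LANGUAGE_ALIASES`, `toy_langcodes`, and `attach_codes` are hypothetical names, and the alias entries are illustrative rather than verified against `googletrans.LANGCODES`.

```python
import pandas as pd

# Hypothetical alias table: spelling in the refugee data -> spelling in the code list
LANGUAGE_ALIASES = {'slovene': 'slovenian', 'farsi': 'persian'}

# Toy subset standing in for googletrans.LANGCODES (assumption)
toy_langcodes = {'slovenian': 'sl', 'persian': 'fa', 'spanish': 'es'}

def attach_codes(languages, aliases, codes):
    """Return a frame with the original name, the normalized name, and the code."""
    out = languages.to_frame(name='language')
    out['lookup_name'] = out['language'].replace(aliases)
    # Names still absent from `codes` end up as NaN
    out['code'] = out['lookup_name'].map(codes)
    return out

langs = pd.Series(['slovene', 'farsi', 'spanish', 'tok pisin'])
result = attach_codes(langs, LANGUAGE_ALIASES, toy_langcodes)
```

Keeping the original name in its own column means the normalization can be audited later.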

In [73]:
refugee_lang[refugee_lang[0].isna()].sample(10)

              language    0
75             slovene  NaN
122               sami  NaN
162            kirundi  NaN
45               bassa  NaN
167          tok pisin  NaN
182              forro  NaN
26              tongan  NaN
146  taiwanese hokkien  NaN
121             berber  NaN
61             ndebele  NaN


In [74]:
refugee_lang.dropna(inplace=True)


Set up translator(s)

In [122]:
from deep_translator import GoogleTranslator
translator = GoogleTranslator(source='en', target='en')  # placeholder target; overwritten per call in translate_keywords_slow

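def translate_keywords_slow(translator, series, lang):
    """Translate '+'-joined keywords term by term with deep_translator."""
    translator.target = lang
    # one row per individual term; the original row label is kept in the index
    series = series.str.split('+').explode()
    series_translated = translator.translate_batch(series.values.tolist())
    # re-join the translated terms that belong to the same original keyword
    series_translated = pd.Series(index=series.index.tolist(), data=series_translated, name = series.name).to_frame().groupby(series.index)[series.name].agg(list).apply(lambda x: '+'.join(x))
    return series_translated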
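
The split, explode, and re-join steps can be checked without any network access. The sketch below isolates that pattern and swaps the translator for a stand-in that only upper-cases each term. `apply_per_term` and `fake_translate` are hypothetical names introduced for illustration.

```python
import pandas as pd

def apply_per_term(series, translate_terms):
    """Split 'a+b' keywords, transform each term, and re-join per original row."""
    exploded = series.str.split('+').explode()
    translated = pd.Series(
        translate_terms(exploded.tolist()),
        index=exploded.index,
        name=series.name,
    )
    # Terms that came from the same keyword share an index label
    return translated.groupby(level=0).agg(lambda terms: '+'.join(terms))

# Stand-in "translator" (assumption): upper-cases instead of translating
def fake_translate(terms):
    return [t.upper() for t in terms]

keywords = pd.Series(['visa+permit', 'agent'], name='list')
rejoined = apply_per_term(keywords, fake_translate)
```

Because the exploded rows keep their original index labels, grouping on the index restores one row per input keyword.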

In [154]:
import requests

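def translate_keywords_fast(series, lang):
    """Translate '+'-joined keywords with a single HTTP request."""
    # one row per individual term; the original row label is kept in the index
    series = series.str.split('+').explode()
    url = "https://translate.googleapis.com/translate_a/single"
    params = {
        "client": "gtx",
        "sl": "auto",
        "tl": lang,
        "dt": "t",
        "q": "\n".join(series.tolist())
    }
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    # assumes the response holds one translated segment per input line
    series_translated = [r[0].strip('\n').lower() for r in response.json()[0]]
    # re-join the translated terms that belong to the same original keyword
    series_translated = pd.Series(index=series.index.tolist(), data=series_translated, name = series.name).to_frame().groupby(series.index)[series.name].agg(list).apply(lambda x: '+'.join(x))
    return series_translated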
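
Because all terms are joined into one query string, a long keyword list produces a long request URL. If that becomes a problem, the terms could be sent in several smaller requests. The sketch below shows only the grouping step. `chunk_terms` is a hypothetical helper, and `max_chars=1000` is an arbitrary illustrative value, not a documented limit of the endpoint.

```python
def chunk_terms(terms, max_chars=1000):
    """Group terms so each newline-joined chunk stays within max_chars."""
    chunks, current, current_len = [], [], 0
    for term in terms:
        # +1 accounts for the newline separator between terms
        added = len(term) + (1 if current else 0)
        if current and current_len + added > max_chars:
            chunks.append(current)
            current, current_len = [], 0
            added = len(term)
        current.append(term)
        current_len += added
    if current:
        chunks.append(current)
    return chunks

# Small limit chosen only to make the grouping visible
demo_chunks = chunk_terms(['asylum', 'visa', 'work permit', 'border'], max_chars=12)
```

Each chunk can then be joined with `"\n"` and sent as its own request, with the translated lists concatenated in order.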

Read in list of words from the paper:

In [152]:
boss_words = pd.read_csv('boss_words.csv')['list']

In [155]:
translate_keywords_fast(boss_words, 'es')

0                               asesores+asesores
1                                          agente
2                                 extraterrestres
3      solicitante+solicitantes+solicitud+aplicar
4                                            cita
                          ...                    
187                                     bienestar
188                                   aflicciones
189                               visa de trabajo
190                                        obrero
191                                 empeoramiento
Name: list, Length: 192, dtype: object