## Getting Trends Data
This notebook requests Google Trends data for topics, namely relevant terms such as destination cities and destination countries. The topic IDs have already been collected.

In [7]:
#%pip install pytrends
#pip install igraph
%pip install country_converter

Collecting country_converter
  Downloading country_converter-1.0.0-py3-none-any.whl (44 kB)
Installing collected packages: country_converter
Successfully installed country_converter-1.0.0
Note: you may need to restart the kernel to use updated packages.


The functions below are how we make requests to Google Trends to return trends on keywords.

The approach is the result of a bit of trial and error and a bit of help from [this post](https://stackoverflow.com/a/67199394/10006534).

It requests one term at a time from a list of keywords. If an error occurs (which happens often because Google Trends is rate-limited), it sleeps for a minute and repeats the request. If 20 requests in a row result in an error, it stops and returns the trends collected so far for that country, even if the keyword list is not complete.
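The retry behaviour can be sketched in isolation. This is a minimal illustration rather than the notebook's own code: `call_with_retries`, `flaky`, and the `wait_seconds` parameter are hypothetical names, and the wait is set to zero so the example runs instantly.

```python
import time

def call_with_retries(func, max_errors=20, wait_seconds=60):
    """Call func() until it succeeds or fails max_errors times in a row.

    Returns (result, True) on success and (None, False) after giving up.
    """
    error_count = 0
    while error_count < max_errors:
        try:
            return func(), True
        except Exception:
            error_count += 1
            time.sleep(wait_seconds)  # wait before retrying
    return None, False

# Hypothetical stand-in for a rate-limited request: fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

result, succeeded = call_with_retries(flaky, max_errors=20, wait_seconds=0)
print(result, succeeded, attempts["n"])
```

Returning a success flag alongside the result lets the caller decide whether to skip the term or stop entirely once the error limit is reached.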

In [1]:
import pandas as pd
from pytrends.request import TrendReq
import numpy as np
import time
from tqdm import tqdm

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/112.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    # 'Accept-Encoding': 'gzip, deflate, br',
    'Referer': 'https://trends.google.com/',
    'Alt-Used': 'trends.google.com',
    'Connection': 'keep-alive',
    # 'Cookie': '...',  # browser session cookie omitted; do not commit real cookies
    'Upgrade-Insecure-Requests': '1',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    # Requests doesn't support trailers
    # 'TE': 'trailers',
}

def pytrends_request(word_list, country, pytrends):
    
    pytrends.build_payload(kw_list=word_list, geo=country, timeframe='2005-01-01 2022-12-31')
    trends = pytrends.interest_over_time()
    if 'isPartial' in trends.columns:
        trends.drop('isPartial', axis=1, inplace=True)
    # print(word_list)
    return trends

def get_trends_data(country, keywords):
    pytrends = TrendReq(hl='en-US', tz=360, requests_args={'headers': headers})
    trends_df = pd.DataFrame()
    error_count = 0

    for keyword in keywords:
        while True:
            try:
                trends_df = pd.concat([trends_df, pytrends_request([keyword], country, pytrends)], axis=1)
                error_count = 0  # Reset error count if successful request
                break  # Exit the while loop if successful
            except Exception:  # a bare except would also swallow KeyboardInterrupt
                error_count += 1
                # print('Got an error. Trying again in 60 seconds.')
                time.sleep(60)

                if error_count == 20:
                    print('Reached maximum error count. Exiting loop.')
                    return trends_df  # Return the trends_df even if not complete

                continue

    return trends_df

### Semantic Link Topic Trends

In [6]:
semantic_topic_ids = pd.read_csv('topic_ids/semantic_topic_ids.csv')
countries = pd.read_csv('../../data/clean/unhcr.csv', engine='pyarrow').drop_duplicates('iso_o').Country_o
import country_converter as coco
iso2_countries = coco.convert(countries, to='iso2')

In [None]:
country_trends_list = []
for iso2country in tqdm(iso2_countries):
    a_country_trends = get_trends_data(iso2country, semantic_topic_ids.topic_id)
    a_country_trends['country'] = iso2country
    country_trends_list.append(a_country_trends)

 66%|██████▋   | 130/196 [15:39:10<22:20:28, 1218.61s/it]

In [None]:
semantic_dict = semantic_topic_ids[['keyword','topic_id']].set_index('topic_id')['keyword'].to_dict()

semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(country_trends_list):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_dict, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)

semantic_trends_df.to_csv('data/semantic_topic_trends.csv')

## Semantic Link Keyword Trends

In [15]:
semantic_topic_ids = pd.read_csv('topic_ids/semantic_topic_ids.csv')
countries = pd.read_csv('../../data/data.csv', engine='pyarrow').drop_duplicates('iso_o').Country_o
import country_converter as coco
iso2_countries = coco.convert(countries, to='iso2')

In [20]:
country_trends_list = []
for iso2country in tqdm(iso2_countries):
    a_country_trends = get_trends_data(iso2country, semantic_topic_ids.keyword)
    a_country_trends['country'] = iso2country
    country_trends_list.append(a_country_trends)

100%|██████████| 13/13 [05:55<00:00, 27.34s/it]


In [22]:
semantic_dict = semantic_topic_ids[['keyword','topic_id']].set_index('topic_id')['keyword'].to_dict()
semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(country_trends_list):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_dict, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)



In [24]:
semantic_trends_df.to_csv('data/semantic_keywords_trends_EN_partial_3.csv')

## Semantic Links - Keywords - Original Language

In [2]:
import pandas as pd

from country_abbrev import *
from country_language import *
from pytrends.request import TrendReq

import pycountry
import itertools

from googletrans import LANGCODES
import swifter

import trends_helpers
import numpy as np

In [3]:
semantic_topic_ids = pd.read_csv('topic_ids/semantic_topic_ids.csv')
countries = pd.read_csv('../../data/data.csv', engine='pyarrow').drop_duplicates('iso_o').Country_o
import country_converter as coco
iso2_countries = coco.convert(countries, to='iso2')

In [4]:
# list of all unique languages:
unique_languages = pd.Series(list(set(list(itertools.chain(*country_language_dict.values())))), name='language')

# list of language codes from googletrans
langcodes = pd.DataFrame.from_dict(LANGCODES, orient='index', columns=['code'])
langcodes.index = langcodes.index.str.capitalize()

refugee_lang = unique_languages.to_frame().merge(langcodes, left_on='language', right_index=True, how='left')

refugee_lang.dropna(inplace=True)

refugee_lang_not_en = refugee_lang[refugee_lang['code'] != 'en']

In [5]:
translated_keywords = refugee_lang_not_en['code'].swifter.apply(lambda x: trends_helpers.translate_keywords_list(lst = semantic_topic_ids.keyword, lang= x))

Pandas Apply:   0%|          | 0/83 [00:00<?, ?it/s]

In [6]:
columns = list(refugee_lang_not_en['code'])

df = pd.concat([pd.DataFrame(sublist, columns=[col]) for sublist, col in zip(translated_keywords, columns)], axis=1)

df['en']=semantic_topic_ids.keyword

df.head()

Unnamed: 0,bn,fi,ur,ru,fr,uz,ar,ms,fa,ro,...,tr,hy,sd,no,he,ceb,sw,lt,th,en
0,পাসপোর্ট,passi,پاسپورٹ,заграничный пасспорт,passeport,pasport,جواز سفر,pasport,گذرنامه,pașaport,...,pasaport,անձնագիր,پاسپورٽ,pass,דַרכּוֹן,passport,pasipoti,pasas,หนังสือเดินทาง,passport
1,অভিবাসন,maahanmuutto,امیگریشن,иммиграция,immigration,immigratsiya,الهجرة,imigresen,مهاجرت,imigrare,...,göçmenlik,ներգաղթ,اميگريشن,innvandring,עלייה,imigrasyon,uhamiaji,imigracija,การตรวจคนเข้าเมือง,Immigration
2,ভ্রমণ ভিসা,matkaviisumi,سفری ویزا,туристическая виза,visa de voyage,sayohat vizasi,تأشيرة السفر,visa perjalanan,ویزای مسافرتی,viza de calatorie,...,seyahat vizesi,ճամփորդական վիզա,سفر ويزا,reisevisum,ויזת נסיעות,travel visa,visa ya kusafiri,kelionės viza,วีซ่าท่องเที่ยว,Travel Visa
3,উদ্বাস্তু,pakolainen,پناہ گزین,беженец,réfugié,qochoq,لاجئ,pelarian,پناهنده,refugiat,...,mülteci,փախստական,پناهگير,flyktning,פָּלִיט,kagiw,mkimbizi,pabėgėlis,ผู้ลี้ภัย,Refugee
4,দ্বন্দ্ব,konflikti,تنازعہ,конфликт,conflit,mojaro,صراع,konflik,تعارض,conflict,...,anlaşmazlık,կոնֆլիկտ,تڪرار,konflikt,סְתִירָה,panagbangi,migogoro,konfliktas,ขัดแย้ง,Conflict


In [20]:
df

Unnamed: 0,bn,fi,ur,ru,fr,uz,ar,ms,fa,ro,...,tr,hy,sd,no,he,ceb,sw,lt,th,en
0,পাসপোর্ট,passi,پاسپورٹ,заграничный пасспорт,passeport,pasport,جواز سفر,pasport,گذرنامه,pașaport,...,pasaport,անձնագիր,پاسپورٽ,pass,דַרכּוֹן,passport,pasipoti,pasas,หนังสือเดินทาง,passport
1,অভিবাসন,maahanmuutto,امیگریشن,иммиграция,immigration,immigratsiya,الهجرة,imigresen,مهاجرت,imigrare,...,göçmenlik,ներգաղթ,اميگريشن,innvandring,עלייה,imigrasyon,uhamiaji,imigracija,การตรวจคนเข้าเมือง,Immigration
2,ভ্রমণ ভিসা,matkaviisumi,سفری ویزا,туристическая виза,visa de voyage,sayohat vizasi,تأشيرة السفر,visa perjalanan,ویزای مسافرتی,viza de calatorie,...,seyahat vizesi,ճամփորդական վիզա,سفر ويزا,reisevisum,ויזת נסיעות,travel visa,visa ya kusafiri,kelionės viza,วีซ่าท่องเที่ยว,Travel Visa
3,উদ্বাস্তু,pakolainen,پناہ گزین,беженец,réfugié,qochoq,لاجئ,pelarian,پناهنده,refugiat,...,mülteci,փախստական,پناهگير,flyktning,פָּלִיט,kagiw,mkimbizi,pabėgėlis,ผู้ลี้ภัย,Refugee
4,দ্বন্দ্ব,konflikti,تنازعہ,конфликт,conflit,mojaro,صراع,konflik,تعارض,conflict,...,anlaşmazlık,կոնֆլիկտ,تڪرار,konflikt,סְתִירָה,panagbangi,migogoro,konfliktas,ขัดแย้ง,Conflict
5,যুদ্ধ,sota,جنگ,война,guerre,urush,حرب,perang,جنگ,război,...,savaş,պատերազմ,جنگ,krig,מִלחָמָה,gubat,vita,karas,สงคราม,War
6,হিংসা,väkivalta,تشدد,насилие,violence,zo'ravonlik,عنف,keganasan,خشونت,violenţă,...,şiddet,բռնություն,تشدد,vold,אַלִימוּת,kapintasan,vurugu,smurtas,ความรุนแรง,Violence
7,সংকট,kriisi,بحران,кризис,crise,inqiroz,مصيبة,krisis,بحران,criză,...,kriz,ճգնաժամ,بحران,krise,מַשׁבֵּר,krisis,mgogoro,krizė,วิกฤติ,Crisis
8,মিলিশিয়া,miliisi,ملیشیا,милиция,milice,militsiya,ميليشيا,tentera,شبه نظامی,miliţie,...,milis,միլիցիա,مليشيا,milis,מִילִיצִיָה,milisya,wanamgambo,milicija,อาสาสมัคร,Militia
9,গণহত্যা,kansanmurha,نسل کشی,геноцид,génocide,genotsid,إبادة جماعية,pembunuhan beramai-ramai,قتل عام,genocid,...,soykırım,ցեղասպանություն,نسل ڪشي,folkemord,רֶצַח עַם,genocide,mauaji ya kimbari,genocido,การฆ่าล้างเผ่าพันธุ์,Genocide


In [7]:
# I'll only keep two original languages per country

country_language_dict_2 = {}

for key, values in country_language_dict.items():
    country_language_dict_2[key] = values[:2] 

max_length = max(map(len, country_language_dict_2.values()))

data_padded = {key: arr + [np.nan] * (max_length - len(arr)) for key, arr in country_language_dict_2.items()}

langs = pd.DataFrame(data_padded)
langs = langs.T
langs = langs.reset_index()

langs = langs.rename(columns={'index': 'Country', 0:'lang1', 1:'lang2', 2:'lang3'})

# Apply the function to the 'Country' column
langs['ISO2'] = langs['Country'].apply(trends_helpers.get_iso2_country_code)

langs = langs.drop(columns=["Country"])
langs_long = pd.melt(langs, id_vars=['ISO2'], var_name='numlang', value_name='lang')
langs_long = langs_long.dropna()

langs_long = pd.merge(langs_long, refugee_lang_not_en, left_on="lang", right_on="language")

In [8]:
# langs_long = langs_long[102:].reset_index()

In [10]:
country_trends_list = [] 

for i, country in tqdm(enumerate(langs_long["ISO2"])):
    # language code of the keywords to search for in this country
    languagecode = langs_long["code"][i]
    a_country_trends = get_trends_data(country, df[languagecode])
    a_country_trends['country'] = country
    country_trends_list.append(a_country_trends)

84it [1:41:35, 72.56s/it] 


In [12]:
import os
os.getcwd()

"h:\\Mi unidad\\BSE - DSDM\\Master's project\\refugees\\notebooks\\trends"

In [13]:
# Update column names to english, but having a choice to 
# keep track of what the language of search was (wide==True)

def update_column_names(data_list, original_lang_code, wide: bool):
    # Map each translated keyword back to its English equivalent
    mapping_dict = dict(zip(df[original_lang_code], df['en']))
    # In wide mode, keep the search language as a column-name suffix
    suffix = f"_{original_lang_code}" if wide else ""

    new_data_list = []
    for data in data_list:
        new_data = data.copy()  # Make a copy of the DataFrame
        new_data.columns = [
            f"{mapping_dict[column]}{suffix}" if column in mapping_dict else column
            for column in new_data.columns
        ]
        new_data_list.append(new_data)  # Add the modified DataFrame to the new list

    return new_data_list


In [14]:
# Each DataFrame is renamed using only the language it was searched in,
# so a keyword spelled the same in two languages is not mislabelled.

# Short version

updated_data_list = [
    update_column_names([data], lang, wide=False)[0]
    for data, lang in zip(country_trends_list, langs_long["code"])
]

# Detailed version

updated_data_list_detailed = [
    update_column_names([data], lang, wide=True)[0]
    for data, lang in zip(country_trends_list, langs_long["code"])
]


In [15]:
# Save short
semantic_dict = semantic_topic_ids[['keyword','topic_id']].set_index('topic_id')['keyword'].to_dict()
semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(updated_data_list):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_dict, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)



In [17]:
semantic_trends_df.to_csv("data/semantic_keywords_OL_2.csv", index=False)

In [18]:
# Save detailed

semantic_trends_df = pd.DataFrame()
for idx, a_country_semantic_trends in enumerate(updated_data_list_detailed):
    a_country = a_country_semantic_trends.copy()
    if a_country.index.name == 'date':
        a_country.reset_index(inplace=True)
    if 'index' in a_country.columns.values:
        a_country.drop('index',axis=1, inplace=True)
    a_country = a_country.loc[:, ~a_country.columns.duplicated()]
    # a_country.set_index(['date','country'], inplace=True)
    a_country.rename(columns=semantic_dict, inplace=True)
    semantic_trends_df = pd.concat([semantic_trends_df, a_country], axis=0, ignore_index=True)

In [19]:
semantic_trends_df.to_csv("data/semantic_keywords_OL_2_detailed.csv", index=False)

In [32]:
# # Concatenating partial results

# df_1 = pd.read_csv("data/semantic_keywords_trends_EN_partial.csv")
# df_2 = pd.read_csv("data/semantic_keywords_trends_EN_partial.csv")
# df_3 = pd.read_csv("data/semantic_keywords_trends_EN_partial.csv")

# df = df_1.append(df_2, ignore_index=True).append(df_3, ignore_index=True)
# df = df.drop(columns=["Unnamed: 0"])
# df.to_csv("data/semantic_keywords_trends_EN.csv")


## Neighboring Countries

In [84]:
import pandas as pd
import igraph as ig
import country_converter as coco

# convert unhcr data to network format. To produce the unhcr.csv file, you will need to:
# # drag and drop the data.csv file from geraldine into the data/raw/ folder
# # open the clean_data.ipynb notebook in data/
# # run the section that cleans the unhcr data, which outputs unhcr.csv into data/clean/
unhcr = pd.read_csv('../../data/clean/unhcr.csv', engine='pyarrow').groupby(['iso_o','iso_d']).agg({'newarrival':'sum','contig':'first','Country_o':'first','Country_d':'first', 'island_o':'first'}).reset_index()

df_network = unhcr[unhcr.contig == 1]

graph = ig.Graph.TupleList(df_network[['Country_o','Country_d']].itertuples(index=False), directed=False)

# add island countries 
islands = unhcr.drop_duplicates('Country_o').sort_values('Country_o').Country_o[~unhcr.groupby('Country_o')['contig'].any().values].values

for i in islands:
    v = graph.add_vertex()
    # Set the name or other properties of the added vertex if needed
    v['name'] = i

graph.vs['name'] = coco.convert(graph.vs['name'], to='iso2')

In [113]:
# get country topic ids
country_topic_ids = pd.read_csv('topic_ids/country_topic_ids.csv')
country_topic_ids['iso2'] = coco.convert(country_topic_ids.search, to='iso2')
country_topic_dict = country_topic_ids[['topic_title', 'topic_mid']].set_index('topic_mid')['topic_title'].to_dict()

Get neighboring countries of order 1:

In [101]:
# list of countries
iso2_countries = coco.convert(unhcr.Country_o.unique(), to='iso2')

country_trends_list = []
# resume after 'KG', skipping the countries that come before it in the list
last_index = iso2_countries.index('KG') + 1
for iso2country in tqdm(iso2_countries[last_index:]):
    # get neighbors of country
    neighboring_countries = graph.vs[graph.neighborhood(iso2country, order=1)]['name'][1:]

    order1_countries = country_topic_ids[country_topic_ids.iso2.isin(neighboring_countries)]

    a_country_trends = get_trends_data(iso2country, order1_countries.topic_mid)
    a_country_trends['country_o'] = iso2country
    # a_country_trends['country_d','city_d'] = order1_countries[['search_keyword','topic_title']]
    country_trends_list.append(a_country_trends)

100%|██████████| 105/105 [06:31<00:00,  3.73s/it]


In [102]:
# combine into a single dataframe
countries_trends_df = pd.DataFrame()
for _, a_country_trends in enumerate(country_trends_list):
    countries_trends_df = pd.concat([countries_trends_df, a_country_trends], axis=0)

# Before writing to a csv, make sure that the output makes sense, I think there should be a lot of nas and a lot of columns, one for each country/city topic. 
# The only ones that aren't na should be the neighboring countries for that country

# countries_trends_df = countries_trends_df.reset_index().rename({'index':'date'},axis=1).set_index(['country', 'date'])
countries_trends_df.to_csv('data/country_topic_trends_1.csv')

Then we can gather trends for countries of order 2, excluding order 1 

(I think we can skip this part for now. There needs to be some distance-based filter as well that omits faraway countries; looking at Afghanistan, for example, yields too many countries/cities.)

This could be obtained by merging countries with the UNHCR distance measurements between countries, and omitting countries above a certain threshold.
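As a rough sketch of that idea, the snippet below computes the order-2 ring (order 2 minus order 1) with a plain breadth-first search and then drops countries beyond a distance threshold. It is a stand-in for the igraph calls used in this notebook, not a replacement: the `neighborhood` helper, the toy graph, the distances, and the `max_km` threshold are all hypothetical.

```python
from collections import deque

def neighborhood(adjacency, start, order):
    """Return all vertices within `order` hops of `start` (including start)."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == order:
            continue  # do not expand beyond the requested order
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    return set(seen)

# Toy contiguity graph (hypothetical, not the UNHCR data)
adjacency = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}

# Hypothetical distances in km from A to each other country
distance_from_a = {"B": 500, "C": 1200, "D": 4000, "E": 3000}

# Countries exactly two hops away from A
order2_only = neighborhood(adjacency, "A", 2) - neighborhood(adjacency, "A", 1)

max_km = 2000  # illustrative threshold
nearby = sorted(c for c in order2_only if distance_from_a[c] <= max_km)
print(nearby)
```

With real data, the distance lookup would come from the distance column of the UNHCR table rather than a hand-written dictionary.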

In [121]:
neighboring_countries_order2 = list(set(graph.vs[graph.neighborhood('AF', order=2)]['name']) - set(graph.vs[graph.neighborhood('AF', order=1)]['name']))

# too many countries for order 2.
country_topic_ids[country_topic_ids.iso2.isin(neighboring_countries_order2)]

Unnamed: 0,search,topic_title,topic_type,topic_mid,iso2
6,Armenia,Armenia,Country in Asia,/m/0jgx,AM
10,Azerbaijan,Azerbaijan,Country,/m/0jhd,AZ
19,Bhutan,Bhutan,Country in South Asia,/m/07bxhl,BT
73,Hong Kong SAR,Hong Kong,Special administrative regions of China,/m/03h64,HK
76,India,India,Country in South Asia,/m/03rk0,IN
79,Iraq,Iraq,Country in the Middle East,/m/0d05q4,IQ
86,Kazakhstan,Kazakhstan,Country in Central Asia,/m/047lj,KZ
92,Kyrgyz Republic,Kyrgyzstan,Country in Central Asia,/m/0jt3tjf,KG
93,Lao P.D.R.,Laos,Country in Asia,/m/04hhv,LA
101,Macao SAR,Macao,Special administrative regions of China,/m/04thp,MO


We can also gather trends for countries that are relevant but are not directly connected (e.g., the Dominican Republic for Venezuela).

- For South American and Latin American countries, let's say we add Spain, Chile, Argentina, the USA, and the Dominican Republic.
- For African countries and the Middle East, include the six or so most receptive countries in Europe (Germany, France, Great Britain, Sweden, Austria, Hungary, Italy, or similar). In Carramia et al they also added likely countries that people in Africa would have to pass through to get to Europe, which could be interesting to consider.
- For China and India, we can add the US.

There is room to refine this. Not sure how broad/specific this should be.
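One possible way to encode the proposal above is a lookup from region to extra destination countries, merged with the contiguous neighbors. This is only a sketch: the region labels, the `EXTRA_DESTINATIONS` table, and the `destinations_for` helper are hypothetical, and the ISO2 codes simply follow the countries named in the list.

```python
# Hypothetical region labels; ISO2 codes follow the countries named in the list.
EXTRA_DESTINATIONS = {
    "latin_america": ["ES", "CL", "AR", "US", "DO"],
    "africa_middle_east": ["DE", "FR", "GB", "SE", "AT", "HU", "IT"],
    "china_india": ["US"],
}

def destinations_for(origin_iso2, region, neighbors):
    """Combine contiguous neighbors with the extra destinations for a region."""
    combined = list(neighbors)
    for code in EXTRA_DESTINATIONS.get(region, []):
        # skip duplicates and the origin country itself
        if code != origin_iso2 and code not in combined:
            combined.append(code)
    return combined

# Venezuela with its neighbors Colombia, Brazil, and Guyana
print(destinations_for("VE", "latin_america", ["CO", "BR", "GY"]))
```

The resulting list could then be used to filter `country_topic_ids` in the same way as the order-1 neighbors.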

## Neighboring Cities

In [122]:
# get city topic ids
city_topic_ids = pd.read_csv('topic_ids/city_topic_id.csv')
city_topic_ids['iso2'] = coco.convert(city_topic_ids.search_country, to='iso2')

First we can gather all the cities of neighboring countries of order 1

In [123]:
# list of countries
iso2_countries = coco.convert(unhcr.Country_o.unique(), to='iso2')

city_trends_list = []
for iso2country in tqdm(iso2_countries):
    # get neighbors of country
    neighboring_countries = graph.vs[graph.neighborhood(iso2country, order=1)]['name'][1:]

    order1_cities = city_topic_ids[city_topic_ids.iso2.isin(neighboring_countries)]

    a_country_trends = get_trends_data(iso2country, order1_cities.topic_mid)
    a_country_trends['country_o'] = iso2country
    # a_country_trends['country_d','city_d'] = order1_cities[['search_keyword','topic_title']]
    city_trends_list.append(a_country_trends)

100%|██████████| 196/196 [2:30:52<00:00, 46.19s/it]   


In [148]:
city_topic_ids['citycountry'] = city_topic_ids.topic_title + ', ' + city_topic_ids.search_country
city_dict = city_topic_ids.set_index('topic_mid')['citycountry'].to_dict()

In [153]:
city_trends_df = pd.DataFrame()
for idx, city_trends in enumerate(city_trends_list):
    city_trends = city_trends.loc[:, ~city_trends.columns.duplicated()].copy()
    city_trends.rename(columns=city_dict, inplace=True)
    city_trends_df = pd.concat([city_trends_df, city_trends], axis=0)
#semantic_trends_df = semantic_trends_df.reset_index().rename({'index':'date'},axis=1).set_index(['country', 'date'])

# Before writing to a csv, make sure that the output makes sense, I think there should be a lot of nas and a lot of columns, one for each country/city topic. 
# The only ones that aren't na should be the neighboring countries for that country

city_trends_df.to_csv('data/city_topic_trends_1.csv')

Then we can gather trends for cities in neighboring countries of order 2, excluding order 1.

(I think we can skip this part for now. There needs to be some distance-based filter as well that omits faraway countries; looking at Afghanistan, for example, yields too many countries/cities.)

This could be obtained by using the location coordinates from geonames-all-cities-with-a-population-1000.csv.
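A coordinate-based filter could look roughly like the sketch below, which uses the haversine formula to keep only cities within a given distance of an origin point. The coordinates are approximate and typed in by hand for illustration, and `haversine_km` and `max_km` are hypothetical names; with the geonames file, the latitude and longitude would be read from its coordinate columns instead.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    earth_radius_km = 6371.0
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * earth_radius_km * asin(sqrt(a))

# Approximate coordinates, for illustration only
kabul = (34.53, 69.17)
cities = {"Delhi": (28.61, 77.21), "Moscow": (55.76, 37.62)}

max_km = 1500  # illustrative threshold
nearby = [
    name
    for name, (lat, lon) in cities.items()
    if haversine_km(kabul[0], kabul[1], lat, lon) <= max_km
]
print(nearby)
```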

In [62]:
neighboring_countries = list(set(graph.vs[graph.neighborhood('AF', order=2)]['name']) - set(graph.vs[graph.neighborhood('AF', order=1)]['name']))
order2_cities = city_topic_ids[city_topic_ids.iso2.isin(neighboring_countries)]

Unnamed: 0,search_country,search_keyword,topic_title,topic_type,topic_mid,iso2
3,India,Mumbai,Mumbai,City in India,/m/04vmp,IN
7,India,Delhi,Delhi,City in India,/m/09f07,IN
8,Russian Federation,Moscow,Moscow,Capital of Russia,/m/04swd,RU
12,Viet Nam,Ho Chi Minh City,Ho Chi Minh City,City in Vietnam,/m/0hn4h,VN
16,Viet Nam,Hanoi,Hanoi,Capital of Vietnam,/m/0fnff,VN
20,Iraq,Baghdad,Baghdad,Capital of Iraq,/m/01fqm,IQ
27,Russian Federation,Saint Petersburg,Saint Petersburg,City in Russia,/m/06pr6,RU
41,Turkey,Ankara,Ankara,Capital of Turkey,/m/0jyw,TR
67,Kazakhstan,Almaty,Almaty,City in Kazakhstan,/m/0151s1,KZ
86,Nepal,Kathmandu,Kathmandu,Capital of Nepal,/m/04cx5,NP


In [None]:
# list of countries
iso2_countries = coco.convert(unhcr.Country_o.unique(), to='iso2')

city_trends_list = []
for iso2country in tqdm(iso2_countries):
    # get neighbors of country of order 2
    neighboring_countries = list(set(graph.vs[graph.neighborhood(iso2country, order=2)]['name']) - set(graph.vs[graph.neighborhood(iso2country, order=1)]['name']))
    order2_cities = city_topic_ids[city_topic_ids.iso2.isin(neighboring_countries)]

    # use the order-2 cities computed above (not the order-1 cities from the earlier cell)
    a_country_trends = get_trends_data(iso2country, order2_cities.topic_mid)
    a_country_trends['country_o'] = iso2country
    # a_country_trends['country_d','city_d'] = order2_cities[['search_keyword','topic_title']]
    city_trends_list.append(a_country_trends)