# Getting the Geocodes For The Tweets

The purpose of this notebook is to create the mapping dictionaries used to recover the pickled tweets and to map each tweet to a specific country. The tweets were recovered with their location when the user provided one. If a tweet had no location, the location on the user's profile was scraped instead. However, not all users provide this information on their page, so all tweets with neither piece of information were dropped. The remaining locations were then mapped to countries.

To map the tweets to their locations we used, in order:
- An automatic check of whether a country name or a capital name is contained in the string. This relies on data from https://mledoze.github.io/countries/ and https://datahub.io/core/country-codes. The first links country ISO codes to country names in multiple languages, covering both the official and the common names of a country. The second links country ISO codes to country names in Arabic, Chinese, English, Spanish, French and Russian.
- A city-to-country mapper built from https://github.com/lutangar/cities.json, from which we removed duplicate cities.
- A city-to-country mapper extracted from http://www.geonames.org/export/ and http://download.geonames.org/export/dump/. The issue with this dataframe is that duplicate cities were not handled: later entries progressively overwrote earlier ones. Its advantage is that it is more extensive than the previous mapper, containing a larger number of cities as well as alternative spellings and names in different languages. Ideally, when several cities share a name, the choice should be based on the population of the cities.
- If none of the above yielded a result, we queried an API based on http://www.geonames.org/export/ (see http://geocoder.readthedocs.io/results.html), which returns the most probable location for the user-provided string. From that we can recover the ISO country code, which can be used directly in the choropleth maps. Note that we could not query the API for all the locations, as this takes around 1 second per tweet. With the number of tweets in the millions, this would not have been feasible on the entire dataset.
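
The population-based choice suggested for duplicate city names could look roughly like the sketch below. This is an illustration only, not the code used to build the mappings: the function name `build_city_mapping` is hypothetical, and the rows and population figures are placeholders rather than GeoNames data.

```python
def build_city_mapping(rows):
    """Map each city name to the country code of its most populous namesake.

    `rows` is an iterable of (city_name, country_code, population) tuples.
    """
    best = {}  # city name -> (population, country code)
    for name, country, population in rows:
        key = name.lower()
        if key not in best or population > best[key][0]:
            best[key] = (population, country)
    return {name: country for name, (population, country) in best.items()}


# Hypothetical rows; the populations are illustrative placeholders only.
rows = [
    ("Oxford", "AU", 1_000),
    ("Oxford", "GB", 150_000),
    ("Oxford", "US", 20_000),
    ("Lausanne", "CH", 140_000),
]

city_to_country = build_city_mapping(rows)
print(city_to_country["oxford"])    # GB
print(city_to_country["lausanne"])  # CH
```

With this rule the result no longer depends on the order in which duplicate rows appear, unlike a plain overwrite.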

All of this was done using dictionaries to speed up the identification process. Currently, for pickles containing around 2000 tweets, we require under 10 seconds of processing. 

To create the dictionaries, the given locations were set as keys, together with alternative spellings and formatted versions of each string, to maximize the chance of identifying the country. Creating the dictionaries is the time-consuming part, which is why they were pickled once the process was finished.
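
As a rough sketch of that construction step, the snippet below registers several spellings of each country under their raw, lower-cased and normalized forms, then pickles the result and reloads it. The names `normalize` and `build_country_mapping`, the sample spellings and the file name `toy_country_mapping.pickle` are all assumptions made for illustration; the real dictionaries were built in a separate notebook from the sources listed above.

```python
import os
import pickle
import tempfile
import unicodedata


def normalize(name):
    """Lower-case a name and keep only its letters (drops accents and spaces)."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if unicodedata.category(c)[0] == "L").lower()


def build_country_mapping(names_by_code):
    """Map every spelling (raw, lower-cased, normalized) to its ISO code."""
    mapping = {}
    for code, names in names_by_code.items():
        for name in names:
            for key in (name, name.lower(), normalize(name)):
                # setdefault keeps the first code registered for a key
                mapping.setdefault(key, code)
    return mapping


# Hypothetical sample; the real sources cover many more names and languages.
sample = {
    "AT": ["Austria", "Autriche", "Österreich"],
    "ES": ["Spain", "España"],
}
mapping = build_country_mapping(sample)

# Pickle once, then reload instead of rebuilding.
path = os.path.join(tempfile.gettempdir(), "toy_country_mapping.pickle")
with open(path, "wb") as f:
    pickle.dump(mapping, f)
with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(reloaded["osterreich"], reloaded["espana"])  # AT ES
```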


In [1]:
import os
import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm
import time
import unicodedata

String formatting functions. These are the same ones that were used when creating the mappings in the constructing mappings notebook.

In [2]:
#https://stackoverflow.com/questions/8694815/removing-accent-and-special-characters
def remove_accents(data):
    if data is None:
        return None
    else :
        clean = ''.join(x.lower().strip() for x in unicodedata.normalize('NFKD', data) if \
                unicodedata.category(x)[0] == 'L').lower()
        return clean

def string_formatting(string):
    string = string.replace("-", " ").replace(" ", ",").split(",")
    formatted_string = [remove_accents(x) for x in string]
    return string,formatted_string
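
A quick check of what these two helpers return can be useful when reading the lookup code further down. The definitions are repeated here in condensed form only so that the snippet runs on its own; the input strings are arbitrary examples.

```python
import unicodedata


def remove_accents(data):
    if data is None:
        return None
    # Decompose accented characters, then keep only letters (category "L*")
    return "".join(
        x.lower().strip()
        for x in unicodedata.normalize("NFKD", data)
        if unicodedata.category(x)[0] == "L"
    ).lower()


def string_formatting(string):
    # Hyphens and spaces both become separators before splitting on commas
    string = string.replace("-", " ").replace(" ", ",").split(",")
    return string, [remove_accents(x) for x in string]


print(remove_accents("España"))        # espana
print(remove_accents("Kuala Lumpur"))  # kualalumpur
words, formatted = string_formatting("New-York, USA")
print(words)      # ['New', 'York', '', 'USA']
print(formatted)  # ['new', 'york', '', 'usa']
```

Note that `remove_accents` drops everything that is not a letter, including spaces, so a multi-word name collapses into a single token.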


Load all the different mappings to speed up the geolocalization. Requires about 5GB of RAM. 

In [3]:
cwd = os.getcwd()

folders_path = os.path.join(cwd,"../../../Project Data")
full_city_mapping_files = [x for x in os.listdir(folders_path) if "full_city_mapping" in x]

# Load every pickled full city mapping
full_city_mappings = list()
for file in full_city_mapping_files:
    with open(os.path.join(folders_path,file), 'rb') as pkl_file:
        full_city_mappings.append(pickle.load(pkl_file))

# Load the country/capital mapping and the city mapping
with open("country_mapping.pickle", 'rb') as pkl_file:
    country_mapping = pickle.load(pkl_file)

with open("city_mapping.pickle", 'rb') as pkl_file:
    city_mapping = pickle.load(pkl_file)

Function used to go through the 3 main mappings loaded above and determine the geolocation of the tweet

In [4]:
dicts_map = {0:"Country/Capital", 1:"City"}

for i in range(len(full_city_mapping_files)):
    idx = i+2
    dicts_map.update({idx:"Full City Mapping"})

def location_in_string(string, do_prints = False):
    t = time.time()

    if do_prints : print(string)
        
    words,formatted_words = string_formatting(string)
    
    words = [x for x in words if len(x)>2]
    formatted_words = [x for x in formatted_words if len(x)>2]
    
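    # Contiguous word combinations, built from the raw words and from their formatted versions
    word_combinations = [" ".join(words[i:j]) for j in range(len(words)+1) for i in range(j)]
    word_combinations += [" ".join(formatted_words[i:j]) for j in range(len(formatted_words)+1) for i in range(j)]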
    
    if do_prints : print(words, formatted_words)
    if do_prints : print(word_combinations)
    
    # Test whether a country name (or one of its variants) is in the string
    # Test whether one of the capital names is in the string
    # Test whether the name of one of the mapped cities is in the string
    # All of this in the order of priority given by `mappings`
    
    mappings = [country_mapping, city_mapping] + full_city_mappings
    
    for m, mapping in enumerate(mappings):

        maps = mapping
        
        for word in word_combinations:
            if do_prints : print("Testing: ", word)
                
            if word in maps:
                if do_prints : print("Found word: ", word,time.time()-t)
                return maps[word], dicts_map[m]
            
            if remove_accents(word) in maps:
                if do_prints : print("Found word without accents: ", remove_accents(word),time.time()-t)
                return maps[remove_accents(word)], dicts_map[m]

        
    if do_prints : print("Nothing found", time.time()-t)
    
    return None, None
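
The order in which candidates are tried matters, so here is a small self-contained sketch of it. It is a simplified stand-in for the function above: `toy_lookup` and the two one-entry dictionaries are hypothetical, and the sketch only lower-cases and splits on whitespace, leaving out the length filter and the accent handling.

```python
def word_combinations(words):
    """All contiguous runs of words, in the order the lookup tries them."""
    return [" ".join(words[i:j]) for j in range(len(words) + 1) for i in range(j)]


def toy_lookup(string, mappings, labels):
    """Return (country code, label) from the first mapping containing a candidate."""
    candidates = word_combinations(string.lower().split())
    for mapping, label in zip(mappings, labels):
        for candidate in candidates:
            if candidate in mapping:
                return mapping[candidate], label
    return None, None


# Hypothetical one-entry mappings standing in for the real dictionaries.
toy_countries = {"new zealand": "NZ"}
toy_cities = {"zealand": "DK"}
mappings = [toy_countries, toy_cities]
labels = ["Country/Capital", "City"]

print(word_combinations(["hello", "new", "zealand"]))
# ['hello', 'hello new', 'new', 'hello new zealand', 'new zealand', 'zealand']
print(toy_lookup("Hello New Zealand", mappings, labels))  # ('NZ', 'Country/Capital')
print(toy_lookup("Zealand", mappings, labels))            # ('DK', 'City')
```

Because the mappings form the outer loop, every candidate is checked against the country mapping before any city mapping is consulted, which is why "Hello New Zealand" resolves to a country while "Zealand" alone falls through to a city.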

Testing the result of the different functions and mappings

In [5]:
#location_in_string("Milan / Bruxelles")
print("Nantes in :", location_in_string("Nantes"))
print("Lausanne in :", location_in_string("Lausanne"))
print("Abu Dhabi in :", location_in_string("Abu Dhabi"))
print("Shanghai in :", location_in_string("Shanghai"))
print("Beijing in :", location_in_string("Beijing"))
print("Tokyo in :", location_in_string("Tokyo"))
print("Beijing in ", location_in_string("Beijing"))
print("Cairo in ", location_in_string("Cairo"))
print("Paris in ", location_in_string("Paris"))
print("Lausanne in ", location_in_string("Lausanne"))
print("Morges in ", location_in_string("Morges"))
print("Ontario in ", location_in_string("Ontario"))
print("Oxford in ",location_in_string("Oxford"))
print("Shanghai in ", location_in_string("Shanghai"))
print("New Castle in ", location_in_string("New Castle"))
print("Edinburgh in ", location_in_string("Edinburgh"))
print("Amsterdam in ", location_in_string("Amsterdam"))
print("Brussels in ", location_in_string("Brussels"))
print("Athens in ", location_in_string("Athens"))
print("Cork in ", location_in_string("Cork"))
print("Nice in ", location_in_string("Nice"))
print("Dublin in ", location_in_string("Dublin"))
print("Kuala Lumpur in ", location_in_string("Kuala Lumpur"))
print("Madrid in ", location_in_string("Madrid"))
print("Budapest in ", location_in_string("Budapest"))
print("Zealand: ", location_in_string("Zealand"))
print("Washington : ", location_in_string("Washington"))
print("cairo : ", location_in_string("cairo"))
print("Alexandria : ", location_in_string("Alexandria"))
print("autriche : ", location_in_string("autriche"))
print("oesterreich : ", location_in_string("oesterreich"))
print("osterreich : ", location_in_string("osterreich"))
print("austria : ", location_in_string("austria"))
print("vienna : ", location_in_string("vienna"))
print("Brugges : ", location_in_string("brugges"))
print("---------------------------------------------")
print("---------------------------------------------")
print("---------------------------------------------")
print("أفغانستان hello my name is bloop : ", location_in_string("أفغانستان hello my name is bloop"))
print("أفغانستان location is going great : ", location_in_string("أفغانستان is a great place to be"))
print("España going to be fun : ", location_in_string("España going to be fun"))
print("Hello New Zealand : ", location_in_string("Hello New Zealand"))

Nantes in : ('FR', 'City')
Lausanne in : ('CH', 'City')
Abu Dhabi in : ('AE', 'Country/Capital')
Shanghai in : ('CN', 'City')
Beijing in : ('CN', 'Country/Capital')
Tokyo in : ('JP', 'Country/Capital')
Beijing in  ('CN', 'Country/Capital')
Cairo in  ('EG', 'Country/Capital')
Paris in  ('FR', 'Country/Capital')
Lausanne in  ('CH', 'City')
Morges in  ('CH', 'City')
Ontario in  ('ES', 'Full City Mapping')
Oxford in  ('AU', 'Full City Mapping')
Shanghai in  ('CN', 'City')
New Castle in  ('AL', 'Full City Mapping')
Edinburgh in  ('AU', 'Full City Mapping')
Amsterdam in  ('NL', 'Country/Capital')
Brussels in  ('BE', 'Country/Capital')
Athens in  ('GR', 'Country/Capital')
Cork in  ('IE', 'City')
Nice in  ('AL', 'Full City Mapping')
Dublin in  ('IE', 'Country/Capital')
Kuala Lumpur in  ('MY', 'Country/Capital')
Madrid in  ('ES', 'Country/Capital')
Budapest in  ('HU', 'Country/Capital')
Zealand:  ('DK', 'Full City Mapping')
Washington :  ('US', 'Country/Capital')
cairo :  ('EG', 'Country/Capita

Function used to extract the location for all elements in a pickled dataframe and store them in a new folder. This uses all the previously coded functions

In [6]:
failures = list()
def extract_geocodes_for_pickle(folder, pickle_file, do_prints = False):
    # Relies on the global `path` (root folder of the tweets) defined in the next cell
    try:
        if do_prints : print(pickle_file)
        df = pd.read_pickle(os.path.join(path, folder, pickle_file))
        if do_prints : print("Successfully loaded", pickle_file)
        # Putting the location in the correct format
        df["location"] = df["location"].apply(lambda x: ' '.join(x))
        # Dropping rows without locations (copy to avoid writing into a view)
        df = df[df['location'].map(len) > 0].copy()
        if do_prints : print("Dropped rows without locations", pickle_file)
        # Mapping the locations to countries, with a single lookup per row
        df["number"] = 1
        results = df["location"].apply(location_in_string)
        df["country"] = results.map(lambda r: r[0])
        df["source"] = results.map(lambda r: r[1])
        # Pickling the dataframe
        df.to_pickle(os.path.join(path, folder, "Geocoded", pickle_file))
        if do_prints : print(df[["location", "country"]].head(10))
        if do_prints : print(df.groupby("country").count()["text"])
    except Exception as e:
        # Record the failure and keep going with the remaining files
        print("Failure :", folder, pickle_file, repr(e))
        failures.append([folder, pickle_file])
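
If many tweets share the same location string (an assumption about the data, not something measured in this notebook), one possible optimization is to memoize the lookup so that each distinct string is resolved only once. The sketch below wraps a stand-in function with `functools.lru_cache`; `slow_lookup` and `cached_lookup` are hypothetical names, and in the notebook the wrapped function would be `location_in_string`.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the underlying lookup really runs


def slow_lookup(location):
    """Stand-in for location_in_string: returns (country code, source)."""
    calls["count"] += 1
    toy = {"paris": ("FR", "Country/Capital"), "lausanne": ("CH", "City")}
    return toy.get(location.lower(), (None, None))


@lru_cache(maxsize=None)
def cached_lookup(location):
    # Results are stored per distinct input string
    return slow_lookup(location)


locations = ["Paris", "Lausanne", "Paris", "Paris", "Nowhere"]
results = [cached_lookup(loc) for loc in locations]
countries = [country for country, source in results]

print(countries)       # ['FR', 'CH', 'FR', 'FR', None]
print(calls["count"])  # 3
```

The cache trades memory for time, which may matter given the RAM already taken by the mappings.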

Get all the folders containing the pickle files, then call the function above on each file to extract the dataframe from the pickle, map all the locations in the dataframe and save the result as a new pickled dataframe.

In [7]:
cwd = os.getcwd()
path = os.path.join(cwd, "../../../Project Data","Tweets")
# Get all the entries in the Tweets folder
folders = os.listdir(path)
# Keep only the folders excluding the checkpoints folder -> event folders
folders = [x for x in folders if os.path.isdir(os.path.join(path, x)) if "checkpoints" not in x]

do_prints = False
if do_prints: print(folders)
    
for folder in folders:
    # Get all the files in the event folder
    files = os.listdir(os.path.join(path, folder))

    # If the geocoded folder does not exist create one for the given event
    if not os.path.exists(os.path.join(path, folder, "Geocoded")):
        os.makedirs(os.path.join(path, folder, "Geocoded"))
    
    # Keep only the "Located" pickles, excluding the log file, the Geocoded folder and .DS_Store
    files = [x for x in files if "log" not in x if "Geocoded" not in x if "DS_Store" not in x if "Located" in x]
    
    if do_prints: print(files)
    
    # Go through all the different files in the folder and process them.
    for file in tqdm(files):
        extract_geocodes_for_pickle(folder, file)

  0%|          | 0/4 [00:00<?, ?it/s]
  0%|          | 0/333 [00:00<?, ?it/s]
  0%|          | 1/333 [00:02<13:46,  2.49s/it]
  ...
 47%|████▋     | 156/333 [05:45<06:32,  2.22s/it]
  ...
 98%|█████████▊| 328/333 [11:28<00:10,  2.10s/it]
[progress output abridged]