# Geocoding Tweets 

This notebook builds the mapping dictionaries used to assign each pickled tweet to a country. Tweets were recovered with their location when the user provided one. If a tweet had no location, the location listed on the user's profile was scraped instead. Not all users fill in this field, so tweets with neither piece of information were dropped. The remaining location strings were then mapped to countries.

To map the tweets to their locations we used, in order:
- An automatic check of whether a country name or a capital name was contained in the string. This relied on data from https://mledoze.github.io/countries/ and https://datahub.io/core/country-codes. The first links the country ISO codes to country names in multiple languages, covering both the official and the common names of each country. The second links the country ISO codes to country names in six languages (Arabic, Chinese, English, Spanish, French, Russian).
- A city-to-country mapper taken from https://github.com/lutangar/cities.json, from which we removed duplicate cities.
- A city-to-country mapper extracted from http://www.geonames.org/export/ and http://download.geonames.org/export/dump/. The issue with this mapper is that duplicate cities were not handled: each one progressively overwrote the previous entry with the same name. Its advantage is that it is more extensive than the previous one, containing a larger number of cities as well as alternative spellings and names in different languages. Ideally, when several cities share a name, the one with the largest population should be selected.
- If none of the above yielded a result, we queried an API based on http://www.geonames.org/export/ and http://geocoder.readthedocs.io/results.html, which returns the most probable location for the string the user entered. From that we can recover the ISO country code, which can be used directly in the choropleth maps. Note that we could not query the API for all the locations, as this takes around 1 second per tweet. Given that the number of tweets is in the millions, this would not have been feasible on the entire dataset.
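
The lookup order described above can be sketched as a simple cascade. This is only an illustration: the three toy dictionaries, `toy_api_lookup` and `locate` are hypothetical names that do not appear in the notebook, and the real mappings are far larger.

```python
from typing import Callable, Optional

# Hypothetical toy dictionaries standing in for the pickled mappings.
toy_country_mapping = {"japan": "JP", "tokyo": "JP", "france": "FR"}
toy_unique_city_mapping = {"lausanne": "CH"}
toy_geonames_city_mapping = {"alexandria": "EG"}


def toy_api_lookup(location: str) -> Optional[str]:
    """Placeholder for the slow API call; it never finds anything in this sketch."""
    return None


def locate(location: str,
           fallback: Callable[[str], Optional[str]] = toy_api_lookup) -> Optional[str]:
    """Return an ISO2 code from the first source that knows the location."""
    key = location.strip().lower()
    for mapping in (toy_country_mapping,
                    toy_unique_city_mapping,
                    toy_geonames_city_mapping):
        if key in mapping:
            return mapping[key]
    # Only reached when every dictionary lookup has failed.
    return fallback(key)


print(locate("Tokyo"))      # JP
print(locate("Lausanne"))   # CH
print(locate("Atlantis"))   # None
```

Ordering the sources from cheapest to most expensive means the slow fallback is only called for strings that no dictionary recognises.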

All of this was done using dictionaries to speed up the identification process. Currently, for pickles containing around 2000 tweets, we require under 10 seconds of processing. 

To create the dictionaries, the given locations were set as keys, together with alternative spellings and string-formatted variants, to maximize the chance of identifying the country. Creating the dictionaries is the time-consuming step, which is why they were pickled once the process was finished.
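
As a minimal sketch of the build-once, reload-later idea, the snippet below pickles a tiny dictionary and reads it back. The dictionary content and the file name `toy_mapping.pickle` are made up for illustration, and a temporary directory is used so that nothing is left on disk.

```python
import os
import pickle
import tempfile

# Hypothetical toy mapping: several spellings of one place share an ISO2 code.
toy_mapping = {"Österreich": "AT", "osterreich": "AT", "austria": "AT"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_mapping.pickle")
    # Build once and store on disk.
    with open(path, "wb") as f:
        pickle.dump(toy_mapping, f, protocol=4)
    # Later runs reload the dictionary instead of rebuilding it.
    with open(path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded == toy_mapping)
```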

In [None]:
import os
import numpy as np
import pandas as pd
import geocoder, geopy
import time
import unicodedata
import pickle
import contextlib
from tqdm import tqdm

## 0. Helper Functions

String formatting functions

In [None]:
#https://stackoverflow.com/questions/8694815/removing-accent-and-special-characters
def remove_accents(data):
    if data is None:
        return None
    else :
        clean = ''.join(x.lower().strip() for x in unicodedata.normalize('NFKD', data) if \
                unicodedata.category(x)[0] == 'L').lower()
        return clean

def string_formatting(string):
    string = string.replace("-", " ").replace(" ", ",").split(",")
    formatted_string = [remove_accents(x) for x in string]
    return string,formatted_string
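
To see why `remove_accents` strips diacritics, it may help to look at what NFKD normalization does to a single word. The sketch below uses only the standard library; the variable names are ours.

```python
import unicodedata

word = "Espa\u00f1a"  # "España", written with an escape for the precomposed ñ
decomposed = unicodedata.normalize("NFKD", word)

# Each character with its Unicode general category.
categories = [(ch, unicodedata.category(ch)) for ch in decomposed]
print(categories)

# Keep letters only (category starting with "L"), then lowercase.
letters_only = "".join(
    ch for ch in decomposed if unicodedata.category(ch)[0] == "L"
).lower()
print(letters_only)  # espana
```

NFKD splits `ñ` into a plain `n` followed by a combining tilde of category `Mn`. The letter filter drops that mark, along with spaces, digits and punctuation, which is why "Abu Dhabi" later becomes the key `abudhabi`.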


Functions used to apply transformations to elements in lists

In [None]:
def clean_sublist(x):
    return list(set(filter(None, np.hstack(x))))

def remove_accents_in_sublist(l):
    return list(map(lambda x:remove_accents(x),l))
    
def remove_accents_in_list(lists):
    return list(map(lambda x:remove_accents_in_sublist(x),lists))

def clean_and_remove_accents_in_list(lists):
    return list(map(lambda x:clean_sublist(remove_accents_in_sublist(x)),lists))

In [None]:
test_list = [['أفغانستان', 'afganistán', '阿富汗', 'афганистан', 'Kabul', 'afghanistan'], ['阿尔巴尼亚', 'албания', 'Tirana', 'ألبانيا', 'albania', 'albanie']]
clean_and_remove_accents_in_list([["édjndfu","édjndfu", "àoinidè"],["édjndfu", "àoinidè"]])
 

Convert dataframe to dictionary

In [None]:
def convert_df_to_dict(df, do_prints = False):
    
    t = time.time()
    df_list = list(map(lambda x:clean_sublist(x),df.values.tolist()))
    if do_prints : print("Converting to list :", time.time()-t)

    t = time.time()
    df_variants = clean_and_remove_accents_in_list(df_list)
    if do_prints : print("Getting variants :", time.time()-t)
    
    t = time.time()
    df_all =  list(map(lambda x: list(set(df_list[x] + df_variants[x])),range(len(df))))
    if do_prints : print("Combining Lists :", time.time()-t)
        
    t = time.time()
    keys = list(map(lambda x: [df.index[x]]*(len(df_all[x])),range(len(df_all))))
    if do_prints : print("Getting all keys :", time.time()-t)
        
    t = time.time()
    mapping = dict(zip(sum(df_all, []),sum(keys, [])))
    if do_prints : print("Converting to dict :", time.time()-t)
        
    return mapping
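
The kind of dictionary `convert_df_to_dict` produces can be illustrated on a toy dataframe. The sketch below is a simplified stand-in, not the function above: `strip_to_letters` and `toy_df_to_dict` are hypothetical names, and unlike the original it does not flatten cells that contain lists.

```python
import unicodedata

import pandas as pd


def strip_to_letters(text):
    """Same idea as remove_accents: NFKD-decompose, keep letters, lowercase."""
    return "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if unicodedata.category(ch)[0] == "L"
    ).lower()


def toy_df_to_dict(df):
    """Map every non-empty cell, and its accent-free variant, to the row index."""
    mapping = {}
    for code, row in zip(df.index, df.values.tolist()):
        for name in row:
            if not isinstance(name, str) or not name:
                continue  # skip None, NaN and empty strings
            mapping[name] = code
            variant = strip_to_letters(name)
            if variant:
                mapping[variant] = code
    return mapping


toy = pd.DataFrame(
    {"english": ["Austria", "Côte d'Ivoire"], "Capital": ["Vienna", None]},
    index=["AT", "CI"],
)
print(toy_df_to_dict(toy))
```

Each row contributes both the raw spellings and their normalized forms as keys, all pointing at the row's ISO code.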


## 1. Country and Capitals Mapping

### 1.1 Creating the Mappings

**Mapping 1**

In [None]:
# Load the country names in different languages mapping
country_codes = pd.read_csv("Mapping Files/country-codes.csv")
keep_columns = ['official_name_ar', 'official_name_cn', 'official_name_en',
                'official_name_es', 'official_name_fr', 'official_name_ru',
                'ISO3166-1-Alpha-2', 'ISO3166-1-Alpha-3', 'ISO3166-1-numeric',
                'Capital', 'Continent', 'Region Name','Sub-region Name']       
country_codes = country_codes[keep_columns]
country_codes.rename(inplace = True, index=str, columns={"official_name_ar": "arabic", "official_name_cn":"chinese", "official_name_en":"english", 
                                                         "official_name_es":"spanish", "official_name_fr":"french", "official_name_ru":"russian",
                                                         "ISO3166-1-Alpha-2":"ISO2", "ISO3166-1-Alpha-3":"ISO3", "ISO3166-1-numeric":"ISONum"})
country_codes = country_codes.iloc[1:]
languages = ["english", "french", "spanish", "chinese", "russian", "arabic"]

country_names = dict()
for lan in languages:
    country_codes[lan] = country_codes[lan].apply(lambda x: x.lower())
    country_names[lan] = dict(zip(country_codes[lan].tolist(), country_codes["ISO2"]))

country_codes.set_index("ISO2", inplace = True)


country_codes.head()

In [None]:
country_codes.loc["BE"]

In [None]:
col = languages + ["Capital"]
country_mapping1 = convert_df_to_dict(country_codes[col])

**Mapping 2**

https://raw.githubusercontent.com/mledoze/countries/master/countries.json


In [None]:
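def extract_native_name(x):
    # Common name in the first listed native language, or None if unavailable
    try:
        return x["native"][list(x["native"].keys())[0]]["common"]
    except (KeyError, IndexError, TypeError, AttributeError):
        return None

def extract_translations(x):
    # Official name in every available translation, or None if unavailable
    try:
        return [name["official"] for name in x.values()]
    except (KeyError, TypeError, AttributeError):
        return None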
    
country_df = pd.read_json("Mapping Files/countries.json")
country_df = country_df[["altSpellings", "capital", "cca2", "name", "translations"]]
country_df.rename(inplace = True, index=str, columns={"cca2": "ISO2"})
country_df.set_index("ISO2", inplace = True)
country_df["common"] = country_df["name"].apply(lambda x: x["common"])
country_df["official"] = country_df["name"].apply(lambda x: x["official"])
country_df["native"] = country_df["name"].apply(lambda x: extract_native_name(x))
country_df["common translations"] = country_df["translations"].apply(lambda x: extract_translations(x))
country_df["altSpellings"] = country_df["altSpellings"].apply(lambda x: x[1:] if len(x)>1 else [])

country_df.drop(["name","translations"], axis = 1, inplace = True)

country_mapping2 = convert_df_to_dict(country_df)

**Merging Both Country Mappings and Pickling**

In [None]:
country_mapping = {**country_mapping1, **country_mapping2}

with open("country_mapping.pickle", 'wb') as file:
    pickle.dump(country_mapping, file, protocol=4)

**Testing the Country Mappings**

Function used to test whether the name of a country is in a string. A similar version is used with the different mappings in the final method for the tweets.

In [None]:
def country_in_string(loc, do_prints = False): 
    t = time.time()
    words, formatted_words = string_formatting(loc)
    if do_prints : print(words)
    words = [x for x in words if len(x)>2]
    formatted_words = [x for x in formatted_words if len(x)>2]
    
    word_combinations = [" ".join(words[i:j]) for j in range(len(words)+1) for i in range(j)]
    # Also test the accent-free, lowercased variants
    word_combinations += [" ".join(formatted_words[i:j]) for j in range(len(formatted_words)+1) for i in range(j)]
    if do_prints : print(word_combinations)
    
    for word in word_combinations:
        if do_prints : print("Testing: ", word)
        if word in country_mapping:
            print(time.time()-t)
            return country_mapping[word]

    print(time.time()-t)
    return None
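
The core of `country_in_string` is the list of contiguous word sequences it tests against the mapping. The sketch below isolates that step; `contiguous_combinations` and the two-entry toy mapping are hypothetical and only serve to show the order in which candidates are tried.

```python
def contiguous_combinations(words):
    """All contiguous word sequences, in the order used by country_in_string."""
    return [" ".join(words[i:j])
            for j in range(len(words) + 1)
            for i in range(j)]


# Hypothetical toy mapping; the real one is much larger.
toy_country_mapping = {"new zealand": "NZ", "zealand": "DK"}

words = "hello new zealand".split()
combos = contiguous_combinations(words)
print(combos)
# ['hello', 'hello new', 'new', 'hello new zealand', 'new zealand', 'zealand']

# The first hit wins, so 'new zealand' is found before 'zealand'.
match = next(
    (toy_country_mapping[c] for c in combos if c in toy_country_mapping),
    None,
)
print(match)  # NZ
```

For a fixed end position the longer sequences come first, which is what lets a multi-word name take precedence over its last word alone.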


Verifying that the function works properly as well as the execution times

In [None]:
print(country_in_string("أفغانستان hello my name is bloop"))
print(country_in_string("أفغانستان hello my name Japan"))
print(country_in_string("España hello my name Japan"))
print(country_in_string("autriche"))
print(country_in_string("oesterreich"))
print(country_in_string("osterreich"))
print(country_in_string("austria"))
print(country_in_string("vienna"))
print(country_in_string("Hello New Zealand"))
print(country_in_string("Hello New Zealand"))
print(country_in_string("Washington"))
print(country_in_string("cairo"))

## 2. City Mapping

### Method 1 : GEODATASOURCE

Testing the mapping taken from the GEODATASOURCE-CITIES-FREE.TXT file from https://www.geodatasource.com/file-download. As a few simple tests show, the output is almost always wrong.

In [None]:
cities = pd.read_csv("Mapping Files/GEODATASOURCE-CITIES-FREE.TXT", sep = "\t")
cities.head()
city_mapping = dict(zip(cities["FULL_NAME_ND"].tolist(), cities["CC_FIPS"].tolist()))

In [None]:
print("Beijing in ", city_mapping["Beijing"])
print("Cairo in ", city_mapping["Cairo"])
print("Paris in ", city_mapping["Paris"])
print("Lausanne in ", city_mapping["Lausanne"])
print("Morges in ", city_mapping["Morges"])
print("Ontario in ", city_mapping["Ontario"])
print("Oxford in ", city_mapping["Oxford"])
print("Shanghai in ", city_mapping["Shanghai"])

### Method 2 : Cities of the world in Json, based on GeoNames Gazetteer
https://github.com/lutangar/cities.json

In [None]:
city_df = pd.read_json("Mapping Files/cities.json")
city_df.drop(["lat", "lng"], axis = 1, inplace = True)
city_df.rename(inplace = True, index=str, columns={"country": "ISO2", "name":"city"})
city_df.set_index("city", inplace = True)
city_df.head()


The issue with this mapping is that there are multiple cities with the same name in different countries. As we have no way of determining which city is the most likely, we drop those rows from the dataframe and store them in a second one. 

In [None]:
doublons = city_df.copy()
doublons["num"] = 1
doublons = doublons.groupby("city").sum()
doublons = doublons[doublons.num>1]
doublons = doublons.index.tolist()
print(len(doublons))
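
An alternative way to find the shared names, sketched here on made-up data, is to ask pandas directly which index values occur more than once. The toy dataframe is hypothetical; only `Index.duplicated` with `keep=False` is relied upon.

```python
import pandas as pd

# Hypothetical toy data with one city name shared by two countries.
toy_cities = pd.DataFrame(
    {"ISO2": ["CA", "AU", "CH", "FR"]},
    index=pd.Index(["Toronto", "Toronto", "Lausanne", "Nantes"], name="city"),
)

# True for every row whose index value occurs more than once.
is_duplicate = toy_cities.index.duplicated(keep=False)

duplicate_names = toy_cities.index[is_duplicate].unique().tolist()
unique_cities = toy_cities[~is_duplicate]

print(duplicate_names)                   # ['Toronto']
print(unique_cities["ISO2"].to_dict())   # {'Lausanne': 'CH', 'Nantes': 'FR'}
```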

Here is an example of why this mapping is problematic, especially since we cannot rely on language to determine which country a city belongs to.

In [None]:
city_df.loc["Toronto","ISO2"]

Dropping all problematic cities from the mapping and creating a dictionary from the remaining cities. 

In [None]:
reduced_city_df = city_df.drop(doublons)
city_mapping = dict(zip(reduced_city_df.index, reduced_city_df.ISO2))

alt_names = [remove_accents(x) for x in reduced_city_df.index]
city_mapping = {**dict(zip(alt_names, reduced_city_df.ISO2)), **city_mapping}

with open("city_mapping.pickle", 'wb') as file:
    pickle.dump(city_mapping, file, protocol=4)

Unfortunately, this mapping is far from complete and is missing many cities, especially after having removed the cities with identical names. However, we can quickly check a few of the cities.

In [None]:
print("Nantes in :", city_mapping[remove_accents("Nantes")])
print("Lausanne in :", city_mapping[remove_accents("Lausanne")])
print("Abu Dhabi in :", city_mapping[remove_accents("Abu Dhabi")])
print("Shanghai in :", city_mapping[remove_accents("Shanghai")])
print("Beijing in :", city_mapping[remove_accents("Beijing")])
print("Tokyo in :", city_mapping[remove_accents("Tokyo")])

### Method 3 : Using APIs - https://github.com/geopy/geopy

Multiple APIs were considered in order to map the cities which were neither in the Country/Capital mapping nor the city mapping where the duplicates were removed. 

http://geocoder.readthedocs.io/providers/GeoNames.html

https://github.com/geopy/geopy

https://github.com/dsoprea/GeonamesRdf

Unfortunately, there were multiple issues with this method. First, the results are not consistent: running the same query multiple times does not always lead to the same result. Second, the APIs limit the number of queries unless one subscribes to their services. That is why this method was not kept for the final geolocalisation method.
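
One way to reduce the number of requests, sketched here with a simulated service rather than a real API, is to cache the answer for each distinct location string. `slow_country_lookup` and its two-entry table are hypothetical stand-ins for a rate-limited geocoder. Caching does not make the answers more reliable; it only avoids repeating identical queries.

```python
from functools import lru_cache

api_calls = []  # records every simulated request, for illustration only


def slow_country_lookup(location):
    """Hypothetical stand-in for a rate-limited geocoding request."""
    api_calls.append(location)
    return {"zurich": "CH", "alexandria": "EG"}.get(location)


@lru_cache(maxsize=None)
def cached_country_lookup(location):
    """Query the (simulated) API at most once per distinct location string."""
    return slow_country_lookup(location)


for loc in ["zurich", "alexandria", "zurich", "zurich", "atlantis"]:
    cached_country_lookup(loc)

print(len(api_calls))  # 3 requests for 5 lookups
```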


In [None]:
from geopy.geocoders import Nominatim

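# Recent geopy versions require an explicit user_agent; this name is a placeholder
geolocator = Nominatim(user_agent="geocoding-tweets-notebook")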
location = geolocator.geocode("stalingrad")
print(location.raw['display_name'].split(",")[-1])    

def query_geocoder_api(loc):
    g = geocoder.google(loc)
    # g.json is None or lacks the "country" key when the query fails
    try:
        country = g.json["country"]
        return country
    except (KeyError, TypeError):
        return None
    
for i in range(10):
    print("Stalingrad", query_geocoder_api("Stalingrad"))
    print("5th Avenue", query_geocoder_api("5th Avenue"))
    print("Morges", query_geocoder_api("Morges"))
    print("Champs élysées", query_geocoder_api("Champs élysées"))
    print("Alexandria", query_geocoder_api("Alexandria"))
    print("Zurich", query_geocoder_api("Zurich"))

### Method 4 : Using the Geonames Database 
http://download.geonames.org/export/dump/

This database contains a zip file for each country, with a text file listing the different cities as well as their alternate names. The functions below are used to load and process the text files.

In [None]:
def extract_alternate_names(x):
    try:
        out = x.split(",")
        return out
    except:
        return []
        
def process_dataframe(full_filename):
    # Load the text file as a csv
    
    dtypes = [int,str, str,str,float,float,str,str,\
             str,str,str,str, str,str,int,str,\
             str,str,str]
    
    columns = ["geonameid","name", "asciiname","alternatenames",\
               "latitude","longitude","feature class","feature code",\
               "country code","cc2","admin1 code","admin2 code",\
               "admin3 code","admin4 code","population","elevation",\
               "dem","timezone","modification date"]
    
    cities = pd.read_csv(full_filename, sep = "\t", header=None, names=columns, dtype = dict(zip(columns,dtypes)))
        
    #print("Loaded")
    
    cities = cities[["name","asciiname", "alternatenames", "country code","population"]]
    
    cities["name"] = cities["name"].apply(lambda x: extract_alternate_names(x))
    cities["asciiname"] = cities["asciiname"].apply(lambda x: extract_alternate_names(x))
    cities["alternatenames"] = cities["alternatenames"].astype("object")
    cities["alternatenames"] = cities["alternatenames"].apply(lambda x: extract_alternate_names(x))
    
    #print("Processed")
    pop = cities.copy()
    
    pop.drop(["country code"], axis = 1, inplace = True)
    cities.drop(["population"], axis = 1, inplace = True)
    
    cities.set_index("country code", inplace = True)
    pop.set_index("population", inplace = True)
    
    #print("Indexed")
    
    return cities, pop

def modulo(i,l):
    return i%l

def writeline(fd_out, line):
    fd_out.write('{}\n'.format(line))

def split_large_files(file_path, file_large):
    l = 15*10**2  # lines per split file (1500)
    idx = 0
    new_files = []
    split_file_path = os.path.join(file_path, "split")
    os.makedirs(split_file_path, exist_ok=True)
    # Prefix of the split files: name of the large file without its extension
    prefix = os.path.join(split_file_path, file_large.split(".")[0])

    fd_out = None
    with open(os.path.join(file_path, file_large)) as fd_in:
        for i, line in enumerate(fd_in):
            # Start a new split file every l lines
            if not modulo(i, l):
                if fd_out is not None:
                    fd_out.close()
                file_split = '{}{}.txt'.format(prefix, idx)
                new_files.append(file_split)
                idx += 1
                fd_out = open(file_split, 'w')
            # The line already ends with a newline, so write it unchanged
            fd_out.write(line)
    if fd_out is not None:
        fd_out.close()

    return new_files

def process_text_files_and_pickle(folders_path, folder, file, not_processed):
    statinfo = os.stat(os.path.join(folders_path, folder, file))
    print(file,statinfo.st_size//10**6 )
    
    if statinfo.st_size>2*10**6:
        new_files = split_large_files(os.path.join(folders_path, folder),file)
        #print(new_files)
        #not_processed.append(os.path.join(folders_path, folder, file))
    else :
        new_files = [file]
        
    save_path = os.path.join(folders_path, folder)
    
    for file in tqdm(new_files):
        try : 
            if len(new_files) == 1:
                path = save_path
                full_filename = os.path.join(path, file)
            else:
                full_filename = file
                path = "/".join(file.split("/")[:-1])
                file = file.split("/")[-1]
                file = file.split(".")[0]
            
            cities, pop = process_dataframe(full_filename)
            #print("Done Processing")
            city_mapping2 = convert_df_to_dict(cities)
            #print("City to Dict")
            pop_mapping = convert_df_to_dict(pop)
            #print("Pop to Dict")

            with open(os.path.join(save_path, file+"_city_map.pickle"), 'wb') as pickle_file:
                pickle.dump(city_mapping2, pickle_file, protocol=4)
            #print("Pickled City")
            with open(os.path.join(save_path, file+"_pop_map.pickle"), 'wb') as pickle_file:
                pickle.dump(pop_mapping, pickle_file, protocol=4)
        
        except:
            not_processed.append(os.path.join(folders_path, folder, file))
    
    return not_processed
    

In [None]:
cwd = os.getcwd()
# Get all the files in the Cities Folder
folders_path = os.path.join(cwd,"../../../Project Data", "Cities")
folders = os.listdir(folders_path)
folders = [x for x in folders if len(x)==2]

repeat = False
do_prints = False
not_processed = list()

# Go through all the folders
for folder in tqdm(folders):
    # Extract only the country text file
    file = [x for x in os.listdir(os.path.join(folders_path, folder)) \
            if "readme" not in x if "DS_Store" not in x if "pickle" not in x\
            if "split" not in x][0]
    
    if do_prints: print(file)
        
    if not repeat: 
        pickle_files = [x for x in os.listdir(os.path.join(folders_path, folder)) \
                        if "pickle" in x]
        if len(pickle_files):
            continue
    
    not_processed = process_text_files_and_pickle(folders_path, folder, file, not_processed)


In [None]:
not_processed
#pickle_file = open("not_processed.pickle", 'wb')
#pickle.dump(not_processed, pickle_file, protocol=4)
#pickle_file.close()

Creating the dictionaries from the city map pickles
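
Because the per-country dictionaries are merged one after the other, a city name that exists in several countries ends up with whichever country was merged last. A population-aware merge is one possible remedy. The sketch below assumes that each `*_pop_map.pickle` maps a name to a population, as the code above suggests. `merge_by_population` is a hypothetical helper and the population figures are invented for illustration.

```python
def merge_by_population(merged_city, merged_pop, city_map, pop_map):
    """Merge one country's dictionaries, keeping the most populous homonym."""
    for name, country in city_map.items():
        population = pop_map.get(name, 0)
        if name not in merged_city or population > merged_pop.get(name, 0):
            merged_city[name] = country
            merged_pop[name] = population
    return merged_city, merged_pop


# Hypothetical per-country dictionaries (name -> ISO2, name -> population).
us_city = {"paris": "US", "dallas": "US"}
us_pop = {"paris": 25000, "dallas": 1300000}
fr_city = {"paris": "FR"}
fr_pop = {"paris": 2100000}

city, pop = {}, {}
for c, p in [(fr_city, fr_pop), (us_city, us_pop)]:
    merge_by_population(city, pop, c, p)

print(city)  # {'paris': 'FR', 'dallas': 'US'}
```

With this rule the result no longer depends on the order in which the country folders are processed.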

In [None]:
cwd = os.getcwd()

# Get all the files in the Cities Folder
folders_path = os.path.join(cwd,"../../../Project Data", "Cities")
folders = os.listdir(folders_path)
folders = [x for x in folders if len(x)==2]

full_city_mapping = dict()
dict_not_processed = list()

idx_mapping = 0
num_processed_files = 0
do_prints = False

for folder in tqdm(folders):
    # Extract only the country text file
    files = [x for x in os.listdir(os.path.join(folders_path, folder)) \
            if "city_map.pickle" in x]
    if do_prints : print("Processing :", files)
    
    for file in files:
        
        try : 
            file_path = os.path.join(folders_path, folder, file)
            if do_prints : print(file_path)
            with open(file_path, 'rb') as pkl_file:
                country_city_dict = pickle.load(pkl_file)
            if do_prints : print(len(country_city_dict))
            
            full_city_mapping = {**full_city_mapping, **country_city_dict}
            
            if do_prints : print(file, len(full_city_mapping), (set(list(full_city_mapping.values()))))
            
            if (num_processed_files+1)%100 == 0:
                filename = os.path.join(folders_path, "full_city_mapping_{}.pickle".format(idx_mapping))
                pkl_file = open(filename, 'wb')
                pickle.dump(full_city_mapping, pkl_file, protocol=4)
                pkl_file.close()
                full_city_mapping = dict()
                idx_mapping +=1        
            
            if do_prints : print("Finished :", file)
            num_processed_files += 1
            
        except:
            dict_not_processed.append(file)
            filename = os.path.join(folders_path, "dicts_not_processed.pickle")
            pkl_file = open(filename, 'wb')
            pickle.dump(dict_not_processed, pkl_file, protocol=4)
            pkl_file.close()
            if do_prints : print("Failed :", file)

    if do_prints : print("Pickled Pop")
        
filename = os.path.join(folders_path, "full_city_mapping_{}.pickle".format(idx_mapping))
pkl_file = open(filename, 'wb')
pickle.dump(full_city_mapping, pkl_file, protocol=4)
pkl_file.close()

if do_prints : 
    print(dict_not_processed)
    pickles = os.listdir(folders_path)
    pickles = [x for x in pickles if "full_city_mapping" in x]

    for pkl in pickles:
        # os.listdir returns bare file names, so rebuild the full path
        with open(os.path.join(folders_path, pkl), 'rb') as pkl_file:
            interm_dict = pickle.load(pkl_file)
        print(pkl, len(interm_dict), (set(list(interm_dict.values()))))