## Geocoding Tweets
This script shows how we geocoded tweets that contain a place name in the Netherlands, i.e. how we added coordinates to them. For this, we used cbsodata to retrieve the official list of place names and Nominatim from geopy, which geocodes strings against the OpenStreetMap (OSM) database.

In [None]:
# Import needed libraries
import pandas as pd
import numpy as np
import re
import geopy
from geopy.geocoders import Nominatim
import cbsodata

# Load cleaned locations from previous script
df = pd.read_csv('cleaned_geo_tweets.csv')

#### Final check for geocoding
To make sure the geolocator will recognize the names from our tweets, we compare each tweet location with an official list of places of residence (e.g. villages, cities, hamlets) from [CBS](https://opendata.cbs.nl/statline/portal.html?_la=nl&_catalog=CBS&tableId=84992NED&_theme=238). Only locations that appear in that list are passed on, so the geolocator is not asked to resolve strings that are not place names.

In [None]:
# Retrieve metadata from cbsodata
metadata = pd.DataFrame(cbsodata.get_meta('84992NED', 'DataProperties'))

# Save placenames as dataframe
places = pd.DataFrame(cbsodata.get_data('84992NED', select = 'Woonplaatsen'))

# Read the places csv file 
#places = pd.read_csv('Woonplaatsen_in_Nederland.csv',sep = ';')

# Make sure the names are in lower case to match our names 
places['Woonplaatsen'] = places['Woonplaatsen'].str.lower()

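# Collect the tweet locations that occur in the official place list.
# A set lookup replaces the nested loop; each tweet is counted once,
# even if a name appears more than once in the official list.
official_names = set(places['Woonplaatsen'])
legit_locs = [loc for loc in df['location'] if loc in official_names]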

# Create dataframe of location and count
fnl_df = pd.DataFrame(legit_locs)

# Name the column for clarity
fnl_df.columns =['Location']

# Add a count column
fnl_df['count'] = fnl_df.groupby('Location')['Location'].transform('count')

# Remove the duplicates
fnl_df.drop_duplicates(subset=['Location'], keep = 'first', inplace=True)
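
The filtering and counting steps above can also be written with pandas' vectorized operations. The following is a minimal sketch on made-up toy data, not the data used in this notebook; the names `tweets` and `official` are hypothetical stand-ins for `df` and `places['Woonplaatsen']`.

```python
import pandas as pd

# Toy stand-ins for the tweet locations and the official place list (made-up values)
tweets = pd.DataFrame({'location': ['amsterdam', 'utrecht', 'amsterdam', 'nowhere']})
official = pd.Series(['amsterdam', 'utrecht', 'hengelo', 'hengelo'])

# Keep only locations that occur in the official list; isin() tests membership,
# so a name listed twice in the official list does not inflate the counts
matched = tweets.loc[tweets['location'].isin(official), 'location']

# One row per place with the number of tweets that mention it
counts = (matched.value_counts()
                 .rename_axis('Location')
                 .reset_index(name='count'))
print(counts)
```

This yields one row per matched place, which is the same shape as `fnl_df` after the duplicates are dropped.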

#### Geocoding
Now it's time for the actual geocoding. Be aware that the geocoding step takes around 10 minutes to locate all the place names. The code was inspired by [this](https://medium.com/analytics-vidhya/exploring-twitter-data-using-python-part-iii-analyzing-the-data-e883aa340dff) tutorial. Alternatively, the 'tweets_with_location' file is also provided with the next notebook, so you don't have to run the part below.
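
Because every lookup is a network request, it helps to geocode each distinct name only once and to pause between requests; the public Nominatim service asks for at most one request per second. The sketch below illustrates that pattern with a hypothetical helper `geocode_unique` and a stub `fake_geocode` (with approximate, made-up coordinates) in place of the real service, so it runs offline. With geopy one would pass `geolocator.geocode` instead of the stub; geopy also ships a `RateLimiter` class in `geopy.extra.rate_limiter` for the same purpose.

```python
import time
from collections import namedtuple

def geocode_unique(names, geocode_fn, delay_seconds=1.0):
    """Geocode each distinct name once; return {name: (lat, lon) or None}."""
    results = {}
    for name in names:
        if name in results:
            continue  # already looked up, so no second request
        location = geocode_fn(name)  # exactly one call per distinct name
        if location is None:
            results[name] = None
        else:
            results[name] = (location.latitude, location.longitude)
        time.sleep(delay_seconds)  # pause between requests
    return results

# Hypothetical stub that mimics the attributes of a geopy result
FakeLocation = namedtuple('FakeLocation', ['latitude', 'longitude'])
calls = []

def fake_geocode(name):
    calls.append(name)
    known = {'amsterdam': FakeLocation(52.37, 4.90)}
    return known.get(name)

coords = geocode_unique(['amsterdam', 'nowhere', 'amsterdam'],
                        fake_geocode, delay_seconds=0)
print(coords)
```

Names that cannot be resolved are stored as `None`, so later steps can handle them explicitly.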

In [None]:
# Initiate the geolocator
geolocator = Nominatim(user_agent='twitter-analysis')
# Note: user_agent is a freely chosen name that identifies this application to Nominatim

# Convert locations to a list
fnl_locs = list(fnl_df.Location)

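# Geocode each place once and keep its (lat, lon) tuple, or None if nothing is found
def get_latlong(name):
    result = geolocator.geocode(name)
    return result[1] if result else None

# This step can take about 10 minutes to run!
geolocated = [[x, get_latlong(x)] for x in fnl_locs]

# Check the result (geolocated is a plain list, so slice it)
geolocated[:5]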

# Transform to lat and long (places that could not be geocoded stay empty)
geolocated = pd.DataFrame(geolocated)
geolocated.columns = ['locat','latlong']
geolocated['lat'] = geolocated.latlong.apply(lambda x: x[0] if x is not None else None)
geolocated['lon'] = geolocated.latlong.apply(lambda x: x[1] if x is not None else None)
geolocated.drop('latlong',axis=1, inplace=True)

# Procedure to merge the sentiment and spatial analysis
tweets_with_location = df.join(geolocated.set_index('locat'), on = 'location')

# Export to csv for the final notebook!
tweets_with_location.to_csv('tweets_with_location.csv', header=True, index=False)
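
`DataFrame.join` performs a left join by default, so tweets whose location was not geocoded are kept and receive missing values for `lat` and `lon`. A minimal sketch on made-up toy data (the frames `tweets` and `coords` are hypothetical stand-ins for `df` and `geolocated`) shows how to check how many rows ended up without coordinates:

```python
import pandas as pd

# Toy stand-ins for the tweets and the geocoded places (made-up values)
tweets = pd.DataFrame({'text': ['a', 'b', 'c'],
                       'location': ['amsterdam', 'nowhere', 'amsterdam']})
coords = pd.DataFrame({'locat': ['amsterdam'], 'lat': [52.37], 'lon': [4.90]})

# Left join on the location name: every tweet is kept
merged = tweets.join(coords.set_index('locat'), on='location')

# Count the tweets that did not receive coordinates
n_missing = int(merged['lat'].isna().sum())
print(merged)
print(n_missing)
```

Rows with missing coordinates can then be dropped or inspected before the spatial analysis, depending on what the next notebook expects.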