## Cleaning the locations
After tagging all tweets with a sentiment score, we want to map their locations over the Netherlands. Before we can do this, we need to clean the user locations so that they can be recognised by the geocoder of OSM: [Nominatim](https://nominatim.openstreetmap.org/ui/search.html). In this notebook, we show how we cleaned the tweet locations before geolocating. For this, we mainly used the regular expression (`re`), `cbsodata` and `pandas` modules.

In [12]:
# Importing modules
import pandas as pd
import numpy as np
import re

# Load in data from previous notebook
df = pd.read_csv('final_sentiment_tweets.csv')
df.shape

(18684, 11)

#### Cleaning for geocoding
Cleaning for geocoding consists of dropping unnecessary columns and dropping rows whose user-entered location is invalid. Since we use user profile locations and not geotagged tweets, many locations are nonsensical (e.g. 'The moon', 'Everywhere') or too coarse to be useful for our project (e.g. 'North-Holland', 'The Netherlands'). Therefore, we have to sift through the locations so that we end up with a clean list of tweets with places of residence.
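
One way to picture this sifting step is a lookup against a list of known place names. The sketch below is only an illustration on toy data, not the method used in this project: the `known_places` set and the `is_known_place` helper are hypothetical, and a real list of place names would have to come from an external source.

```python
import pandas as pd

# Hypothetical, hand-written set of valid place names (illustration only)
known_places = {"amsterdam", "rotterdam", "tiel", "nijmegen"}


def is_known_place(location: str) -> bool:
    """Return True if the normalised location is in the known set."""
    return location.strip().lower() in known_places


# Toy user locations, including the kinds of invalid input described above
sample = pd.DataFrame({"location": ["Amsterdam", "The moon", "Everywhere", " tiel "]})

# Keep only the rows whose location is a known place
valid = sample[sample["location"].map(is_known_place)]
print(valid["location"].tolist())
```

An exact lookup like this is strict: anything that is misspelled or combined with extra words is rejected, which is why the locations are normalised first.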

In [13]:
# Inspect the data
df.shape

# Replace empty strings in the 'location' column with NaN
df['location'] = df['location'].replace('', np.nan)

# Drop the rows whose location is NaN
df.dropna(subset=['location'], inplace=True)

# Inspect the data 
df.shape



(12844, 11)

#### Filtering the invalid locations
Here the invalid locations, such as provinces and regions, are filtered out. Some special cases (e.g. 's-Gravenhage, Jordaan) occurred often enough to make correcting them by hand worthwhile.

In [14]:
# Remove invalid words and locations, or replace them with the correct version.
# Note: these are substring matches, so e.g. "NL" and "the" are also removed inside longer words.
df['location'] = (df['location']
                  .str.replace("Netherlands", '', case=False)
                  .str.replace("The Netherlands", '', case=False)
                  .str.replace("Nederland", '', case=False)
                  .str.replace("NL", '', case=False)
                  .str.replace("the", '', case=False)
                  # Remove digits (regex=True is stated explicitly; newer pandas defaults to literal matching)
                  .str.replace(r"[0-9]+", '', regex=True)
                  # Remove province and region names
                  .str.replace("Drenthe", '', case=False)
                  .str.replace("Flevoland", '', case=False)
                  .str.replace("Fryslân", '', case=False)
                  .str.replace("Friesland", '', case=False)
                  .str.replace("Gelderland", '', case=False)
                  .str.replace("Groningen", '', case=False)
                  .str.replace("Limburg", '', case=False)
                  .str.replace("Noord Brabant", '', case=False)
                  .str.replace("Noord-Brabant", '', case=False)
                  .str.replace("Noord Holland", '', case=False)
                  .str.replace("Noord-Holland", '', case=False)
                  .str.replace("Overijssel", '', case=False)
                  .str.replace("Zuid Holland", '', case=False)
                  .str.replace("Zuid-Holland", '', case=False)
                  .str.replace("Zeeland", '', case=False)
                  .str.replace("Achterhoek", '', case=False)
                  .str.lower()
                  # Replace every character that is not a digit, a-z, hyphen, apostrophe, space or tab
                  .str.replace(r"[^0-9a-z' \t-]", ' ', regex=True)
                  # Collapse runs of spaces into a single space
                  .str.replace(r" +", ' ', regex=True)
                  # Hand-made corrections for frequent cases
                  .str.replace(' hague', 'den haag')
                  .str.replace(' north-holland', '')
                  .str.replace(' north holland', '')
                  .str.replace('den haag', "'s-gravenhage")
                  .str.replace(' nootdorp', '')
                  .str.replace('jordaan amsterdam', 'amsterdam')
                  .str.replace('amsterdam area', 'amsterdam')
                  .str.replace("'s heerenberg", "'s-heerenberg")
                  .str.replace("'s-heerenberg montferland", "'s-heerenberg")
                  .str.replace('den bosch', "'s-hertogenbosch"))

# Again drop the rows whose location is now empty
df['location'] = df['location'].replace('', np.nan)
df.dropna(subset=['location'], inplace=True)
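
Because the replacements in the cell above match substrings, short terms such as "NL" and "the" are also removed when they occur inside a longer name. A possible refinement, shown here as a sketch on made-up locations and not as part of the original cleaning, is to anchor such terms with regex word boundaries (`\b`). The `sample` values are assumptions chosen to show the difference.

```python
import pandas as pd

# Toy locations (assumed examples, not taken from the dataset)
sample = pd.Series(["Venlo NL", "Drenthe", "The Hague", "Utrecht NL"])

# Substring removal: 'nl' inside 'Venlo' and 'the' inside 'Drenthe' are hit too
substring = (sample
             .str.replace(r"NL|the", "", case=False, regex=True)
             .str.strip())

# Word-boundary variant: only remove 'NL' and 'the' when they are whole words
whole_word = (sample
              .str.replace(r"\b(?:NL|the)\b", "", case=False, regex=True)
              .str.strip())

print(substring.tolist())
print(whole_word.tolist())
```

With word boundaries, 'Venlo' and 'Drenthe' stay intact, so a later rule that removes the province name 'Drenthe' can still match it.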



#### Checking results and exporting

In [15]:
# Strip whitespaces
df['location'] = df['location'].str.strip()

# Check the locations (locs should be cities/villages etc)
locs = df['location'].value_counts()
locs

# Save the cleaned data for the next notebook
df.to_csv('cleaned_geo_tweets.csv', header=True, index=False)

# Check the data structure
df.head(5)

| | id | created_at | screen_name | location | text | processed_text | translation | processed_text_en | google | party | lijsttrekker |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.37e+18 | Tue Mar 16 23:59:54 +0000 2021 | Rechtsevrouwen | ergens in | Kom kom zo’n powervrouw van #D66 kan toch wel ... | kom kom n powervrouw d66 wel maar woordje verr... | come come such a power woman from d66 can sure... | come come power woman d66 surely something wor... | 0.3 | d66 | kaag |
| 3 | 1.37e+18 | Tue Mar 16 23:59:35 +0000 2021 | Rob4005 | tiel | Bij de plannen van GroenLinks groeien bomen ni... | plannen groenlinks groeien bomen niet gekapt z... | in the plans of green links, trees do not grow... | plans green links, trees grow, cut pathetic bl... | -0.8 | groenlinks | |
| 4 | 1.37e+18 | Tue Mar 16 23:59:16 +0000 2021 | KaagPremier | jeruzalem palestina | Op het Museumplein in Amsterdam verzamelen zic... | museumplein amsterdam verzamelen eerste aanhan... | the first supporters gather on the museum squa... | first supporters gather museum square amsterda... | 0.6 | | |
| 5 | 1.37e+18 | Tue Mar 16 23:59:05 +0000 2021 | RenseSijbesma | dronrijp | ga stemmen morgen mensen en stem #FVD | ga stemmen morgen mensen stem fvd | go vote tomorrow folks and vote fvd | go vote tomorrow folks vote fvd | 0.2 | fvd | |
| 6 | 1.37e+18 | Tue Mar 16 23:58:46 +0000 2021 | PascalyLilia | m xico | Langere wachttijden voor een woning geen betaa... | langere wachttijden woning betaalbare huizen i... | longer waiting times for a house no affordable... | longer waiting times house affordable houses d... | -0.6 | | |


As can be seen above, the locations have only been partly filtered: invalid entries such as 'ergens in', 'jeruzalem palestina' and 'm xico' are still present. More thorough cleaning therefore follows in the next notebook.

In [16]:
# Count of locations
locs = df['location'].value_counts()
locs

                               1180
amsterdam                      1094
rotterdam                       402
's-gravenhage                   382
nijmegen                        139
                               ... 
rosmalen 's-gravenhage            1
rheden tweets op pers titel       1
mecca -                           1
moraira                           1
veganzones espa a                 1
Name: location, Length: 3483, dtype: int64
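
The unnamed first entry in the counts above (1180 rows) is consistent with locations that were reduced to whitespace by the replacements and only became empty after the later `str.strip()` call, so the earlier drop of empty strings did not catch them. A minimal sketch on toy data, assuming that this is the cause, strips first and drops the empty values afterwards:

```python
import numpy as np
import pandas as pd

# Toy data: the second and third values are whitespace-only, the last one is empty
sample = pd.DataFrame({"location": ["amsterdam", " ", "   ", "tiel ", ""]})

# Strip first, so whitespace-only values become empty strings
sample["location"] = sample["location"].str.strip()

# Then mark the empty strings as missing and drop those rows
sample["location"] = sample["location"].replace("", np.nan)
sample = sample.dropna(subset=["location"])

print(sample["location"].tolist())
```

Applied to the notebook, this would amount to moving the `str.strip()` call ahead of the second `replace`/`dropna` step.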