### In this notebook we clean up the Twitter data with given sentiments and find the coordinates of the mentioned locations. At the end we produce some simple folium maps.



In [20]:
import pandas as pd
import json
import numpy as np
import folium
import nltk
import time
import matplotlib.pyplot as plt
from geopy.geocoders import Nominatim
import math
import re
from mpl_toolkits.basemap import Basemap


Creating a dataframe from the combined Twitter sentiment data, which includes the given sentiment field.

In [21]:
df = pd.read_csv('processed/sentiments/twitter_sentiments_combined.csv')
df.head()



Unnamed: 0,City,Sentiment,Count,Month,Language
0,Luzern,NEUTRAL,3069,april,de
1,Switzerland,POSITIVE,2246,april,de
2,Uster,POSITIVE,8,april,de
3,Winterthur,POSITIVE,83,april,de
4,Sursee,POSITIVE,3,april,de


List of words for Switzerland in different languages and forms. We will remove the records whose city field contains one of them, as we are interested in locations such as cities, not countries.

In [22]:
switzerland = ['schweiz', 'schwyz', 'سويسرا', 'สวิตเซอร์แลนด์', 'スイス','thuy si','ch',
       'suica', 'suisse', 'suiza', 'suíça', 'suïssa', 'svajcarska',
       'sveitsi', 'svizzera', 'swiss', 'switzerland',
       'zwitserland', 'isviçre', 'Швейцария']

Cleaning up the city field. 

In [23]:
df['City'] = df['City'].map(lambda x: re.sub(r'[^áâçàéèa-zA-ZÖöÜüÄä]+', ' ', x)) #Keeping only Latin characters (everything else becomes a space)
df['City'] = df['City'].map(lambda x: x.split(' ', 1)[0]) #Selecting the first word
df['City'] = df['City'].map(lambda x: x.lower()) #Converting to lowercase
df = df.loc[~df['City'].isin(switzerland)].copy() #Removing records whose city is a word representing Switzerland
df['City'] = df['City'].replace('', np.nan) #Replacing empty strings with NaN for later removal
df = df[pd.notnull(df['City'])] #Keeping only records which have a city field

#Keeping only the columns of interest; we aggregate over all 10 months of data, so the language and month fields are dropped.
df = df[['City', 'Sentiment', 'Count']]
df

Unnamed: 0,City,Sentiment,Count
0,luzern,NEUTRAL,3069
2,uster,POSITIVE,8
3,winterthur,POSITIVE,83
4,sursee,POSITIVE,3
5,saanen,NEUTRAL,74
6,baden,NEUTRAL,1398
7,meilen,NEUTRAL,8
9,waldenburg,NEUTRAL,659
10,schleitheim,NEUTRAL,2
11,willisau,NEUTRAL,17
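
The cleaning steps above can be tried out on single strings. The following is a minimal sketch, not the notebook's exact code: `clean_city` and `NON_LATIN` are hypothetical names, and unlike the cell above the helper also drops leading separators and returns `None` instead of an empty string.

```python
import re

# Same character class as in the cleaning cell: every run of other
# characters is treated as a separator.
NON_LATIN = re.compile(r'[^áâçàéèa-zA-ZÖöÜüÄä]+')

def clean_city(raw):
    """Hypothetical helper: first Latin-letter word of `raw`, lower-cased.

    Returns None when no such word exists, for example for names
    written entirely in a non-Latin script.
    """
    words = NON_LATIN.sub(' ', raw).split()
    return words[0].lower() if words else None

samples = ['Zürich, Schweiz', '  Bern 3000', 'スイス']
cleaned = [clean_city(s) for s in samples]
print(cleaned)
```

Records for which the helper returns `None` correspond to the rows that the notebook removes through the NaN replacement.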


Keeping only the records with a positive or negative sentiment and discarding the neutral ones. As neutral sentiments are very common, we believe that leaving them out gives a better idea of which cities tend towards positive or negative sentiments.

In [24]:
df = df.loc[df['Sentiment'].isin(['POSITIVE','NEGATIVE'])]
df.head()





Unnamed: 0,City,Sentiment,Count
2,uster,POSITIVE,8
3,winterthur,POSITIVE,83
4,sursee,POSITIVE,3
15,biel,POSITIVE,5
17,sarnen,POSITIVE,1


In [25]:
#df['City'] = pd.core.strings.str_strip(df['City'])
#df.head()

Grouping by city and sentiment and adding up the counts.

In [26]:
df = df.groupby(['City','Sentiment'])['Count'].sum()
df

City              Sentiment
a                 NEGATIVE         9
                  POSITIVE        36
aachen            POSITIVE         3
aadorf            NEGATIVE         2
                  POSITIVE        47
aarau             NEGATIVE       389
                  POSITIVE      1164
aarberg           NEGATIVE         1
                  POSITIVE         4
aarburg           NEGATIVE        53
                  POSITIVE        84
aargau            NEGATIVE       978
                  POSITIVE      1231
aarhus            POSITIVE         1
abadons           NEGATIVE         2
                  POSITIVE         1
aberdeen          NEGATIVE         2
                  POSITIVE         7
aberdeenshire     POSITIVE         1
about             POSITIVE         2
abrantes          POSITIVE         2
abu               NEGATIVE        81
                  POSITIVE       119
abudhabi          NEGATIVE         6
                  POSITIVE         3
abuja             NEGATIVE         7
          

Resetting the index to turn the grouped result back into a flat dataframe.

In [27]:
df = df.reset_index()
df

Unnamed: 0,City,Sentiment,Count
0,a,NEGATIVE,9
1,a,POSITIVE,36
2,aachen,POSITIVE,3
3,aadorf,NEGATIVE,2
4,aadorf,POSITIVE,47
5,aarau,NEGATIVE,389
6,aarau,POSITIVE,1164
7,aarberg,NEGATIVE,1
8,aarberg,POSITIVE,4
9,aarburg,NEGATIVE,53


Selecting for each city the record with the highest count, so that only the most frequent sentiment is left for each city.

In [29]:
a = df.loc[df.groupby("City")["Count"].idxmax()]
a.head(20)
a

Unnamed: 0,City,Sentiment,Count
1,a,POSITIVE,36
2,aachen,POSITIVE,3
4,aadorf,POSITIVE,47
6,aarau,POSITIVE,1164
8,aarberg,POSITIVE,4
10,aarburg,POSITIVE,84
12,aargau,POSITIVE,1231
13,aarhus,POSITIVE,1
14,abadons,NEGATIVE,2
17,aberdeen,POSITIVE,7


### The algorithm below takes a very long time to run on a large dataframe!
The result is saved in the file: coordinatesForGivenSentiments10m.csv

Using the GeoPy package to find the coordinates for all the locations. Cities or other names in the city field for which no coordinates are found get NaN values.

In [10]:
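#Recent geopy versions require a user_agent; the name below is a placeholder, not taken from the original run
geolocator = Nominatim(user_agent="twitter-sentiment-notebook")

def getCoordinates (row):
    #time.sleep(0.01)
    print(row['City'])
    try:
        #geocode returns None for unknown places; the unpacking then fails and we end up in the except branch
        address, (x, y) = geolocator.geocode(row['City']) #x is the latitude, y the longitude
        address = address.split(",")
        country = address[-3] #Third-from-last part of the address, treated as the country name
        return x, y, country
    except Exception:
        return np.nan, np.nan, " "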
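
Since the lookup is slow, it may help to query each distinct name only once and to pause between requests (the public Nominatim service asks for at most one request per second). The sketch below is an assumption-laden illustration, not the notebook's method: `geocode_all` and `fake_geocode` are hypothetical, and the geocoder is passed in as a plain function so the example runs without network access.

```python
import time

def geocode_all(cities, geocode, delay=1.0):
    """Look up each distinct city once.

    `geocode` is any callable mapping a name to (lat, lon); it may raise
    for unknown names. Returns {city: (lat, lon) or None}.
    """
    results = {}
    for city in cities:
        if city in results:
            continue  # already looked up, reuse the cached answer
        if results:
            time.sleep(delay)  # pause between requests, not before the first
        try:
            results[city] = geocode(city)
        except Exception:
            results[city] = None
    return results

def fake_geocode(city):
    """Stand-in for a real geocoder, knows a single city."""
    table = {'aarau': (47.392715, 8.044445)}
    if city not in table:
        raise LookupError(city)
    return table[city]

coords = geocode_all(['aarau', 'abadons', 'aarau'], fake_geocode, delay=0)
print(coords)
```

With a real geocoder, `geocode` could be a small wrapper around `geolocator.geocode` that returns the latitude and longitude of the result.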
     


Computing the coordinates for all the records and writing them to a file.

In [16]:
#a["Latitude"], a["Longitude"], a["Country"]= zip(*a.apply (lambda row: getCoordinates (row),axis=1))
#a

In [12]:
#outFile = "coordinatesForGivenSentiments10m.csv"
#a.to_csv(outFile,index=None)

Reading the file with coordinates.

In [31]:
a = pd.read_csv('coordinatesForGivenSentiments10m.csv')
a.head()

Unnamed: 0,City,Sentiment,Count,Latitude,Longitude,Country
0,.ch,POSITIVE,14,46.798562,8.231974,Suisse
1,aadorf,POSITIVE,47,47.491578,8.902953,Suisse
2,aarau,POSITIVE,1164,47.392715,8.044445,Suisse
3,aarberg,POSITIVE,4,47.044335,7.2753,Suisse
4,aarburg,POSITIVE,84,47.320642,7.89936,Suisse


### Find only the data for Switzerland

We only want the records for Switzerland, i.e. those whose country field is 'Suisse'.

In [32]:
a['Country'] = a['Country'].str.strip() #Removing surrounding whitespace via the public .str accessor
a['City'] = a['City'].str.strip()
allSuisse = a.loc[a['Country'] == 'Suisse']
allSuisse

Unnamed: 0,City,Sentiment,Count,Latitude,Longitude,Country
0,.ch,POSITIVE,14,46.798562,8.231974,Suisse
1,aadorf,POSITIVE,47,47.491578,8.902953,Suisse
2,aarau,POSITIVE,1164,47.392715,8.044445,Suisse
3,aarberg,POSITIVE,4,47.044335,7.275300,Suisse
4,aarburg,POSITIVE,84,47.320642,7.899360,Suisse
5,aargau,POSITIVE,1231,47.412396,8.194832,Suisse
6,adelboden,POSITIVE,60,46.492721,7.558762,Suisse
7,adligenswil,POSITIVE,10,47.070535,8.368244,Suisse
8,adliswil,POSITIVE,33,47.311762,8.524910,Suisse
9,aegerten,NEGATIVE,2,47.120241,7.288497,Suisse


In [20]:
#outFile = "swissTwitterSentimentsAndCoordinates10m.csv"
#allSuisse.to_csv(outFile,index=None)

In [45]:
allSuisse = pd.read_csv('swissTwitterSentimentsAndCoordinates10m.csv')
allSuisse.head()

Unnamed: 0,City,Sentiment,Count,Latitude,Longitude,Country
0,.ch,POSITIVE,14,46.798562,8.231974,Suisse
1,aadorf,POSITIVE,47,47.491578,8.902953,Suisse
2,aarau,POSITIVE,1164,47.392715,8.044445,Suisse
3,aarberg,POSITIVE,4,47.044335,7.2753,Suisse
4,aarburg,POSITIVE,84,47.320642,7.89936,Suisse


In [33]:
allSuisse['City'] = allSuisse['City'].str.title() #Capitalising the city names for display
allSuisse.head()

Unnamed: 0,City,Sentiment,Count,Latitude,Longitude,Country
0,.Ch,POSITIVE,14,46.798562,8.231974,Suisse
1,Aadorf,POSITIVE,47,47.491578,8.902953,Suisse
2,Aarau,POSITIVE,1164,47.392715,8.044445,Suisse
3,Aarberg,POSITIVE,4,47.044335,7.2753,Suisse
4,Aarburg,POSITIVE,84,47.320642,7.89936,Suisse


Function which sets the colour column value: green represents a positive sentiment and red a negative one. Neutral would be blue, and any other value falls back to black.

In [34]:
def getColour (row):
    if row['Sentiment'] == 'NEUTRAL':
        return 'blue'
    if row['Sentiment'] == 'POSITIVE':
        return 'green'
    if row['Sentiment'] == 'NEGATIVE':
        return 'red'
    return 'black'

In [35]:
allSuisse['Colour'] = allSuisse.apply(getColour, axis=1)
allSuisse.head()

Unnamed: 0,City,Sentiment,Count,Latitude,Longitude,Country,Colour
0,.Ch,POSITIVE,14,46.798562,8.231974,Suisse,green
1,Aadorf,POSITIVE,47,47.491578,8.902953,Suisse,green
2,Aarau,POSITIVE,1164,47.392715,8.044445,Suisse,green
3,Aarberg,POSITIVE,4,47.044335,7.2753,Suisse,green
4,Aarburg,POSITIVE,84,47.320642,7.89936,Suisse,green
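
The same sentiment-to-colour rule can also be written as a lookup table. This is only an alternative sketch; `SENTIMENT_COLOURS` and `colour_for` are hypothetical names that do not appear in the notebook.

```python
# Lookup table with the same colours as the function above.
SENTIMENT_COLOURS = {
    'NEUTRAL': 'blue',
    'POSITIVE': 'green',
    'NEGATIVE': 'red',
}

def colour_for(sentiment):
    """Return the marker colour for a sentiment label, black if unknown."""
    return SENTIMENT_COLOURS.get(sentiment, 'black')

colours = [colour_for(s) for s in ['POSITIVE', 'NEGATIVE', 'MIXED']]
print(colours)
```

On a dataframe, the table could be applied to the whole column at once with `allSuisse['Sentiment'].map(SENTIMENT_COLOURS).fillna('black')`, which avoids the row-wise `apply`.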


Simple folium map of Switzerland.

In [39]:
cantons_geo = r'ch-cantons.topojson.json'
canton_map = folium.Map(location=[46.8, 8.33],tiles='OpenStreetMap', zoom_start=8)
canton_map.choropleth(geo_path = cantons_geo, topojson='objects.cantons', fill_color='#3186cc')
canton_map

Plotting all the cities using their coordinates and indicating the sentiment with the marker colour.

In [37]:
for index, row in allSuisse.iterrows():
    folium.Marker(
        [row['Latitude'], row['Longitude']],
        popup=row["City"],
        icon=folium.Icon(color=row["Colour"], icon='ok-sign'),
    ).add_to(canton_map)

#canton_map.save( 'mymap.html')


In [38]:
canton_map

In [None]:
#canton_map.save( 'mapForTwitterGivenSentiments10months.html')

Map with circles around the locations, where the radius of each circle corresponds to the count of tweets with that sentiment.

In [30]:
for index, row in allSuisse.iterrows():
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=row["Count"],
        popup=row["City"],
        color=row["Colour"],
        fill_color=row["Colour"],
    ).add_to(canton_map)


In [31]:
canton_map

In [None]:
#canton_map.save( 'mapForTwitterGivenSentiments10monthsRadius.html')

As the circles for some cities, such as Zürich and Bern, cover the whole map, we divide the count by 5 to get a more readable map.

In [40]:
allSuisse['Count/5'] = allSuisse['Count'] / 5
allSuisse.head()

Unnamed: 0,City,Sentiment,Count,Latitude,Longitude,Country,Colour,Count/5
0,.Ch,POSITIVE,14,46.798562,8.231974,Suisse,green,2.8
1,Aadorf,POSITIVE,47,47.491578,8.902953,Suisse,green,9.4
2,Aarau,POSITIVE,1164,47.392715,8.044445,Suisse,green,232.8
3,Aarberg,POSITIVE,4,47.044335,7.2753,Suisse,green,0.8
4,Aarburg,POSITIVE,84,47.320642,7.89936,Suisse,green,16.8
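
Dividing by a constant still leaves a very wide range of radii (0.8 for Aarberg against 232.8 for Aarau in the table above). One possible alternative, which the notebook does not use, is a square-root scale so that the circle area rather than the radius grows with the count. The function `marker_radius` and its pixel limits are hypothetical choices for illustration.

```python
import math

def marker_radius(count, max_count, max_radius=40.0, min_radius=2.0):
    """Radius in pixels whose circle area is roughly proportional to count.

    The largest count maps to max_radius; very small counts are raised
    to min_radius so that they stay visible.
    """
    scaled = max_radius * math.sqrt(count / max_count)
    return max(min_radius, scaled)

# Counts taken from the table above, plus one quarter of the largest.
radii = {c: marker_radius(c, max_count=1164) for c in (4, 291, 1164)}
print(radii)
```

The result could be passed as the `radius` of a `folium.CircleMarker`, which is measured in pixels.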


In [42]:
for index, row in allSuisse.iterrows():
    folium.CircleMarker(
        [row['Latitude'], row['Longitude']],
        radius=row["Count/5"],
        popup=row["City"],
        color=row["Colour"],
        fill_color=row["Colour"],
    ).add_to(canton_map)


In [43]:
canton_map

On that map we see large positive circles around Zürich and Geneva and a smaller one around Bern. These are some of the most populous cities in Switzerland, which likely explains the larger number of tweets.

In [53]:
#canton_map.save( 'mapForTwitterGivenSentiments10monthsRadius.html')