To compute the location information needed for the visualizations, we use the *district* method described earlier. The idea is to find the nearest district capital for each tweet; from that district, we can also retrieve the canton in which the tweet was posted.
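The nearest-capital idea can be illustrated with a minimal, self-contained sketch. It is not the script we ran: it swaps in the haversine formula in place of geopy's Vincenty distance, and the `CAPITALS` table, its approximate coordinates and the function names are assumptions made up for the example.

```python
from math import asin, cos, radians, sin, sqrt

# Hypothetical sample: (district capital, canton) -> approximate (latitude, longitude)
CAPITALS = {
    ("Lausanne", "VD"): (46.52, 6.63),
    ("Bern", "BE"): (46.95, 7.45),
    ("Zurich", "ZH"): (47.38, 8.54),
}


def haversine_km(a, b):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def nearest_capital(point, max_km=50.0):
    """Return the (capital, canton) key nearest to point, or None if it is farther than max_km."""
    key = min(CAPITALS, key=lambda k: haversine_km(CAPITALS[k], point))
    return key if haversine_km(CAPITALS[key], point) <= max_km else None
```

A tweet posted close to Lausanne is mapped to `("Lausanne", "VD")`, while a tweet posted in Paris is farther than 50 km from every capital and yields `None`, which corresponds to the "foreign tweet" case.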

We chose to run this code on a "simple" laptop. It took a full day and a half to process the entire dataset. Since a Python script runs on a single core by default, we ran two instances of the script at the same time, each one on half of the data. Each chunk of the data saves its result in a CSV file, and we concatenated all these files once the computation was done.
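The final concatenation step could look like the sketch below. It assumes the per-chunk files are named `data_<i>.csv` and have no header row, as written by the script; the function name `merge_chunks` and the output path are hypothetical.

```python
import csv
import glob
import os
import re


def merge_chunks(input_dir, output_path):
    """Concatenate every data_<i>.csv file of input_dir into output_path, in chunk order."""
    indexed_paths = []
    for path in glob.glob(os.path.join(input_dir, "data_*.csv")):
        match = re.search(r"data_(\d+)\.csv$", path)
        if match:                                   # Ignore files that do not follow the naming scheme
            indexed_paths.append((int(match.group(1)), path))

    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for _, path in sorted(indexed_paths):       # Numeric order: data_2 comes before data_10
            with open(path, newline="", encoding="utf-8") as f:
                for row in csv.reader(f):
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Calling, for instance, `merge_chunks("data_district_canton", "out/district_canton.csv")` would produce a single file with one "tweet id, district, canton" row per tweet. Sorting on the numeric chunk index rather than on the file name keeps the rows in the original chunk order.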

Note that we decided not to use Spark because of time constraints. Indeed, it would probably have taken us at least half a day to understand how to run our code with Spark and Hadoop. But in the middle of the exam session, every second counts! Therefore, we decided to let the computer run for a day while we prepared for less fun exams.

The Python script we used is copied into the following cell.

In [None]:
# coding: utf-8


import pandas as pd
import numpy as np
import csv
import folium
import json
import pickle
from geopy.distance import vincenty     # Note: vincenty was removed in geopy 2.0 (geodesic is its replacement)


#
## First: merge the districts and canton information
#


# Load the districts information
df_districts = pd.read_table('data/ch_districts_capital.txt', sep=',')
df_districts.index = df_districts.id_ofs

# Load the cities data, containing the canton of each city
switzerland_cities = pd.read_csv("data/switzerland_cities.txt")
switzerland_cities = switzerland_cities.sort_values('Population', ascending=False)

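# Combine both datasets: for each district, look up the coordinates of its capital.
# Each entry has the form [[id_ofs, canton], "longitude,latitude"]
district_idToLocation = list()

for i,district in df_districts.iterrows():
    # The cities are sorted by decreasing population, so values[0] picks the largest city with that name
    capital_location = switzerland_cities[switzerland_cities['ASCII City Name'] == district.capital.lower()]
    district_location = str(capital_location.Longitude.values[0]) + "," + str(capital_location.Latitude.values[0])
    district_idToLocation.append([[district.id_ofs,district.canton], district_location])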





#
## Now we can deal with the tweets
#


# Schema of the raw tweets file (the second column contains the column names)
schema_rawfile = pd.read_csv("twitter-swisscom/schema_home.txt", header=None, sep=r'\s+')
data_columns = schema_rawfile[1].values


def findLocation(x):
    """
        Return the location of the tweet. The location is a string composed of the longitude and the latitude.
        The coordinates are read from the 'longitude' and 'latitude' columns of the data. If these values are null, 
        we try to retrieve the location from the 'placeLongitude' and 'placeLatitude' columns.
        If this second location is also null, we return a null location ("nan,nan").

        INPUT
                x:      A row of the tweets data

        OUTPUT
                loc1:   The "longitude,latitude" string, or "nan,nan" in case of a null location
    """
    loc1 = str(x.longitude) + "," + str(x.latitude)
    if (loc1 == "nan,nan" or loc1 == r"\N,\N"):
        loc1 = str(x.placeLongitude) + "," + str(x.placeLatitude)
    if (loc1 == r"\N,\N"):
        loc1 = "nan,nan"
    return loc1




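def findNearestCity(row):
    """
        Return the [id_ofs, canton] pair of the district capital which is nearest to the tweet location. 
        If the nearest district capital is more than 50 kilometers away, return the string "NULL".
    """

    def toLatLon(location):
        # Locations are stored as "longitude,latitude" strings, but geopy expects (latitude, longitude)
        lon, lat = location.split(",")
        return (float(lat), float(lon))

    tweet_location = toLatLon(row.Location)

    # Compute the nearest district capital
    min_city = min(district_idToLocation, key=lambda d: vincenty(toLatLon(d[1]), tweet_location).km)

    # Check that the distance is smaller than 50 km
    min_distance = vincenty(toLatLon(min_city[1]), tweet_location).km
    if min_distance > 50:
        return "NULL"               # Foreign tweet
    else:
        return min_city[0]          # All good baby !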



# Variables used in the 'for loop'
i = 0                                   # the current chunk
removedTot=0                            # Number of removed tweets with a null location
removedIds = list()                     # List of the index of the removed tweets with a null location
removedRows = 0                         # Number of removed tweets with an invalid index
removedRowsIds = list()                 # List of the index of the removed tweets with an invalid index


usecoll = ['id','text','longitude','latitude','placeLongitude','placeLatitude']    # Columns to load with the data

# Read the tweets by chunks of 10000 rows ('rU' is no longer accepted by recent Python versions, 'r' is equivalent)
reader = pd.read_table(open("twitter-swisscom/twex.tsv", 'r'), sep='\t', encoding='utf-8', escapechar="\\", na_values='N',
                       index_col=0, quoting=csv.QUOTE_NONE, header=None, names=data_columns,
                       chunksize=10000, engine='c', usecols=usecoll)
for data in reader:

    # Range of chunks handled by this instance of the script: chunks with an index above 1000 are skipped
    if (i > 1000):
        if (i%10 == 0):
            print("---------------",i)
        i += 1
        continue
    


    # Compute the location information for each tweet
    data['Location'] = data.apply(findLocation, axis=1)
    a = data.Location == "nan,nan"                                                  # If we weren't able to retrieve the location information, we can't use those tweets => remove them
    removedTot += len(data[a])
    removedIds.append(data[a].index)
    data = data[data.Location != "nan,nan"]                                         # Update data

    
    # Assert the validity of the tweets by asserting that the index is a number
    data['idx'] = data.index
    data['isIdxValid'] = data.apply(lambda row: str(row.idx).isdigit(), axis=1)
    removedRows += (data.isIdxValid == False).sum()                                 # If the index is not valid, remove those tweets as well
    removedRowsIds.append(data[data.isIdxValid == False].index)
    data = data[data.isIdxValid == True]                                            # Update data

    if (len(data) == 0):
        i += 1
        continue

    
    # Compute the closest district
    data['LocInfo'] = data.apply(findNearestCity, axis=1)
    data = data[data.LocInfo != "NULL"]                                             # Remove the "foreign" tweets

    data['District'] = data.apply(lambda r: r.LocInfo[0], axis=1)
    data['Canton'] = data.apply(lambda r: r.LocInfo[1], axis=1)
    

    # Export the triplet "tweet id"+"District"+"Canton" (the tweet id is the index)
    data_to_export = data[['District', 'Canton']]
    name = 'data_district_canton/data_'+str(i)+'.csv'
    data_to_export.to_csv(name, header=False)
    
    
    # Go to the next chunk
    if (i%10 == 0):
        print("---------------",i)
    i+=1



print("DONE")
print("RemovedTot: ",removedTot)
print("RemovedIDS: ", removedIds)


# Save the indexes of the removed tweets
with open("out/removedIds","wb") as filehandler:
    pickle.dump(removedIds,filehandler)

with open("out/removedRowsIds","wb") as filehandler:
    pickle.dump(removedRowsIds,filehandler)
