In [1]:
# ADA Project: Tweet sentiments classification

# Purpose of this notebook: 
Before diving into the huge datasets provided by Swisscom, we need to understand them and decide what we want to do with them. We therefore use this notebook on a small sample of the data to simulate the steps of our project. The final goal is a Python script that we only need to launch on the cluster to obtain the desired results.

In [2]:
import pandas as pd

# 0. Data exploration

## 0.1 Start by loading the schema of the data.

Note that the *schema.txt* file provided on the EPFL cluster was manually modified so that pandas can import it easily.

In [3]:
# Raw string so that \s is passed to the regex engine instead of being read as a string escape
schema_rawfile = pd.read_csv("twitter-swisscom/schema_home.txt", header=None, sep=r'\s+')
twitter_data_columns = schema_rawfile[1].values
print("Nbr features:",len(twitter_data_columns))
list(twitter_data_columns)

Nbr features: 20


['id',
 'userId',
 'createdAt',
 'text',
 'longitude',
 'latitude',
 'placeId',
 'inReplyTo',
 'source',
 'truncated',
 'placeLatitude',
 'placeLongitude',
 'sourceName',
 'sourceUrl',
 'userName',
 'screenName',
 'followersCount',
 'friendsCount',
 'statusesCount',
 'userLocation']

In [4]:
data_columns = schema_rawfile[1].values

## 0.2 Import sample of the data

We will not work directly with the entire datasets. Instead, we prepare our code and functions here on a small sample of the data, then create a script to be executed on the EPFL cluster.

**First step:** "Clean" the tsv file. The files we are given are in *.tsv* format: there is one tweet per line, and the features of a tweet are separated by tabs. Unfortunately, the text of some tweets contains line breaks (*\n*). If we import the file as is, Python will treat each of those line breaks *\n* as the start of a new line, and therefore of a new tweet.

However, the line breaks inside tweets are preceded by a backslash. Therefore, we run a bash command to replace every combination of *\ + \n* with a single white space.

>     sed -e :a -e '/\\$/N; s/\\\n/ /; ta' twitter-swisscom/sample.tsv  > data_clean/sample.tsv
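
For readers who prefer to stay in Python, the same cleaning step can be sketched as follows. This is only a minimal illustration on an in-memory string, not the command we actually ran; the helper name `join_escaped_newlines` and the toy input are assumptions made for this example.

```python
def join_escaped_newlines(raw_text):
    """Replace every backslash + newline pair with a single space.

    Hypothetical helper: it mimics the effect of the sed command above
    on a string that has already been read into memory.
    """
    return raw_text.replace("\\\n", " ")


# Toy input: the first tweet contains an escaped line break, the second does not.
raw = "1\tfirst line\\\nsecond line\t42\n2\tplain tweet\t7\n"
cleaned = join_escaped_newlines(raw)

# After cleaning, each tweet occupies exactly one line again.
print(cleaned.splitlines())
```

For the full datasets, reading everything into memory may not be practical, which is one reason to keep the streaming `sed` command for the cluster script.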

**Second step**: The file *data_clean/sample.tsv* has the following attributes:

    *application/octet-stream; charset=binary*

Therefore, we cannot load it *as is* into a pandas DataFrame. We will convert it to a plain *text/plain* file encoded in *utf-8*, which pandas handles without trouble.

>     cat data_clean/sample.tsv | tr -d '\0' > data_clean/sample_formatted.tsv
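
The `tr -d '\0'` command deletes every NUL byte from the file. A rough Python equivalent, shown here on a small bytes literal rather than on the real file, could look like the sketch below. The helper name `strip_nul_bytes` and the sample bytes are assumptions for illustration only.

```python
def strip_nul_bytes(raw_bytes):
    """Remove all NUL (0x00) bytes, as `tr -d '\\0'` does on a stream."""
    return raw_bytes.replace(b"\x00", b"")


# Toy input with two stray NUL bytes (made up for this example).
sample = b"776\x00\tse lo dici tu\x00\n"
clean = strip_nul_bytes(sample)

# Once the NUL bytes are gone, the content decodes as ordinary UTF-8 text.
text = clean.decode("utf-8")
print(repr(text))
```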

**Third step**: Now we can import the data to a pandas dataframe.

In [5]:
import csv
sample_data = pd.read_table(r"data_clean/sample_formatted.tsv",sep='\t', quoting=csv.QUOTE_NONE, header=None, names=data_columns, index_col=0)
sample_data.head()

id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
776522983837954049,735449229028675584,2016-09-15 20:48:01,se lo dici tu... https://t.co/x7Qm1VHBKL,\N,\N,51c0e6b24c64e54e,\N,1,,46.0027,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110,28621,Earleen.
776523000636203010,2741685639,2016-09-15 20:48:05,https://t.co/noYrTnqmg9,\N,\N,4e7c21fd2af027c6,\N,1,,46.8131,8.22414,Twitter for iPhone,http://twitter.com/#!/download/iphone,samara,letisieg,755,2037,3771,Suisse
776523045200691200,435239151,2016-09-15 20:48:15,@BesacTof @Leonid_CCCP Tu dois t'engager en si...,\N,\N,12eb9b254faf37a3,776522113859608576,5,,47.201,5.94082,Twitter for Android,http://twitter.com/download/android,lebrübrü❤,lebrubru,811,595,30191,Fontain
776523058404290560,503244217,2016-09-15 20:48:18,@Mno0or_Abyat اشوف مظاهرات على قانون العمل الج...,\N,\N,30bcd7f767b4041e,776521597515624448,1,,45.8011,6.16552,Twitter for iPhone,http://twitter.com/#!/download/iphone,عبدالله القنيص,bingnais,28433,417,12262,Shargeyah
776523058504925185,452805259,2016-09-15 20:48:18,Greek night #geneve (@ Emilios in Genève) http...,6.14414,46.1966,c3a6437e1b1a726d,\N,3,,46.2048,6.14319,foursquare,http://foursquare.com,Alkan Şenli,Alkanoli,204,172,3390,İstanbul/Burgazada


## 0.3 Data stats

In [6]:
len(sample_data)

8790

### 0.3.1 Percentage with geolocation

In [21]:
long = sample_data.longitude
long_not_null = (long != r'\N').sum()
print("Percentage of data with longitude info:", 100*(long_not_null/len(sample_data)))
print("Nbr: ", long_not_null)

Percentage of data with longitude info: 17.6905574516
Nbr:  1555


In [22]:
lat = sample_data.latitude
lat_not_null = (lat != r'\N').sum()
print("Percentage of data with latitude info:", 100*(lat_not_null/len(sample_data)))
print("Nbr: ", lat_not_null)

Percentage of data with latitude info: 17.6905574516
Nbr:  1555


In [26]:
long_not_null = long!=r'\N'
lat_not_null = lat!=r'\N'
diff = (long_not_null != lat_not_null).any()
print("Is there null long/lat with valid lat/long ? :", diff)

Is there null long/lat with valid lat/long ? : False
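
The three checks above could be condensed into a single helper. The sketch below is one possible way to do it, assuming (as in the cells above) that missing coordinates are stored as the literal string `\N`. The function name `geo_coverage` and the toy DataFrame are hypothetical and only serve the example.

```python
import pandas as pd

MISSING = r"\N"  # marker used for missing values in the dataset


def geo_coverage(df, lon_col="longitude", lat_col="latitude"):
    """Summarise how many rows carry both coordinates (hypothetical helper)."""
    has_lon = df[lon_col] != MISSING
    has_lat = df[lat_col] != MISSING
    both = has_lon & has_lat
    return {
        "count": int(both.sum()),
        "percentage": 100 * both.mean(),
        # True if some row has one coordinate but not the other
        "mismatch": bool((has_lon != has_lat).any()),
    }


# Toy data: only one of the four rows is geolocated.
toy = pd.DataFrame({
    "longitude": [r"\N", "6.14414", r"\N", r"\N"],
    "latitude": [r"\N", "46.1966", r"\N", r"\N"],
})
stats = geo_coverage(toy)
print(stats)
```

On the real sample, the call would be `geo_coverage(sample_data)`.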


These fields give the latitude and longitude of... what exactly? The user?
The question matters because we also have the fields ***placeLongitude*** and ***placeLatitude***.

From the Twitter API documentation:
> Places are specific, named locations with corresponding geo coordinates. They can be attached to Tweets by specifying a place_id when tweeting. Tweets associated with places are not necessarily issued from that location but could also potentially be about that location.

Since we want to map the sentiment of tweets to a given location, we will use the fields ***placeLongitude*** and ***placeLatitude*** for this project.
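
As a minimal sketch of how the place coordinates could be prepared for mapping, the snippet below converts both fields to numbers and drops rows without a usable location. It assumes that missing place coordinates, if any, appear as `\N` like in the other columns, which we have not verified on the full data. The helper name `place_coordinates` and the toy DataFrame are hypothetical.

```python
import pandas as pd


def place_coordinates(df):
    """Return numeric place coordinates, dropping rows that lack them.

    Hypothetical helper: values that cannot be parsed as numbers
    (for example the `\\N` marker) become NaN and are then removed.
    """
    coords = df[["placeLatitude", "placeLongitude"]].apply(
        pd.to_numeric, errors="coerce"
    )
    return coords.dropna()


# Toy data: the second row has no place coordinates.
toy = pd.DataFrame({
    "placeLatitude": ["46.0027", r"\N", 46.2048],
    "placeLongitude": ["8.96044", r"\N", 6.14319],
})
coords = place_coordinates(toy)
print(coords)
```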

In [79]:
# Count tweets per place, most frequent first.
# Series.sort() was removed from pandas; sort_values() returns a sorted copy.
xx = sample_data.groupby(['placeId']).size().sort_values(ascending=False)
xx


placeId
3acb748d0f1e9265    835
c3a6437e1b1a726d    825
04578354cff6f4b1    365
8b3e53628223753a    318
6c07f3233c333f95    251
bbf6c74e4f26f23d    208
12eb9b254faf37a3    208
51c0e6b24c64e54e    169
1ec449abab9e0ec8    125
3cb0557357370cbf    110
e38a1a641d02f8db    109
4e7c21fd2af027c6    106
bec98271815d2329     98
cd661902b07eb657     96
1d4e1f5605836450     92
3762f0997bd0c84b     90
dfbf4ce81da97883     82
5e39c02d7b0fa13a     70
094f032bf5c78dc8     67
af27c5a148d8371f     67
485374deda4cf2ca     66
068c70be7b3a4cc2     66
57d9b5e0a53e48e5     63
2d1bdaaa734bb96a     62
a1b759dc059e346f     59
ed643913fb3cab42     57
1ebde7593f067b2c     51
384df7b39fc369c8     50
ec72c229283bb562     48
2f41eda3156d6485     46
                   ... 
48950e695c18c826      1
870d500f8d9d8edd      1
8ebf55849f24fbe1      1
4455634241fb3569      1
96ea76d4e93b725a      1
44078048b6253bac      1
96e087c60293eed6      1
96957eca37d4de9e      1
4066be3eef8e635e      1
966717ca052da216      1
965e0666