In [None]:
# ADA Project: Tweet sentiments classification

# Purpose of this notebook: 
Before diving into the huge datasets provided by Swisscom, we need to understand the data and decide what we want to do with it. We therefore use this notebook with a small sample of the datasets to simulate the steps of our project. The final goal is a Python script that we only need to launch on the cluster to obtain the desired results.

In [2]:
import pandas as pd

# 0. Data exploration

## 0.1 Start by loading the schema of the data.

Note that the *schema.txt* file provided on the EPFL cluster was manually modified so that it can be imported easily with pandas.

In [3]:
# Raw string for the separator so that '\s' is not read as a Python escape sequence.
schema_rawfile = pd.read_csv("twitter-swisscom/schema_home.txt", header=None, sep=r'\s+')
twitter_data_columns = schema_rawfile[1].values
print("Nbr features:",len(twitter_data_columns))
list(twitter_data_columns)

Nbr features: 20


['id',
 'userId',
 'createdAt',
 'text',
 'longitude',
 'latitude',
 'placeId',
 'inReplyTo',
 'source',
 'truncated',
 'placeLatitude',
 'placeLongitude',
 'sourceName',
 'sourceUrl',
 'userName',
 'screenName',
 'followersCount',
 'friendsCount',
 'statusesCount',
 'userLocation']

In [4]:
data_columns = schema_rawfile[1].values

## 0.2 Import sample of the data

We will not work directly with the entire datasets. Instead, we prepare our code and functions here on a small sample of the data, and then create a script to be executed on the EPFL cluster.

**First step:** "Clean" the tsv file. The files we are given are in *.tsv* format: there is one tweet per line, and the features within a tweet are separated by tabs. Unfortunately, the text of some tweets contains line breaks (*\n*). If we import the file as is, the parser treats each such *\n* as the end of a record and therefore as the start of a new tweet.

Fortunately, the line breaks inside tweets are preceded by a backslash. We therefore run a bash command that replaces every *\ + \n* combination with a single space.

>     sed -e :a -e '/\\$/N; s/\\\n/ /; ta' twitter-swisscom/sample.tsv  > data_clean/sample.tsv
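
Since the final goal is a Python script, the same cleaning step can also be sketched in Python. This is a minimal sketch rather than the command actually used here: the function name `join_continued_lines` and the sample bytes are hypothetical, and it assumes the data fits in memory.

```python
def join_continued_lines(raw: bytes) -> bytes:
    """Replace every backslash + newline pair with a single space."""
    return raw.replace(b"\\\n", b" ")


# Hypothetical two-tweet sample: the first tweet contains an escaped line break.
sample = b"1\tfirst line\\\nsecond line\tx\n2\tplain tweet\ty\n"
cleaned = join_continued_lines(sample)

# After cleaning, each tweet sits on exactly one line.
for line in cleaned.splitlines():
    print(line)
```

Working on bytes instead of text avoids decoding problems at this stage, because the raw file is not yet valid plain text.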

**Second step**: The file *data_clean/sample.tsv* has the following attributes:

    *application/octet-stream; charset=binary*

Therefore, we cannot load it *as is* into a pandas DataFrame. We strip the NUL bytes from the file so that it becomes plain *text/plain* with *utf-8* encoding, which pandas handles without trouble.

>     cat data_clean/sample.tsv | tr -d '\0' > data_clean/sample_formatted.tsv
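
A Python counterpart of this step could look as follows. It is only a sketch under assumptions: `strip_nul_bytes` and `format_file` are hypothetical helper names, and the whole file is read into memory, which is fine for the sample but would need chunked reading for the full datasets.

```python
from pathlib import Path


def strip_nul_bytes(raw: bytes) -> bytes:
    """Remove every NUL byte from the input."""
    return raw.replace(b"\x00", b"")


def format_file(src: str, dst: str) -> None:
    """Read src as bytes, drop the NUL bytes, and write the result to dst."""
    Path(dst).write_bytes(strip_nul_bytes(Path(src).read_bytes()))


# Hypothetical record containing a stray NUL byte.
demo = b"1\tsalut\x00\ty\n"
demo_clean = strip_nul_bytes(demo)
print(demo_clean)
```

With the paths from this notebook, the call would be `format_file("data_clean/sample.tsv", "data_clean/sample_formatted.tsv")`.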

**Third step**: Now we can import the data into a pandas DataFrame.

In [149]:
import csv
# QUOTE_NONE keeps quote characters inside tweets as literal text.
sample_data = pd.read_table("data_clean/sample_formatted.tsv", sep='\t', quoting=csv.QUOTE_NONE,
                            header=None, names=data_columns, index_col=0)
sample_data.head()

id,userId,createdAt,text,longitude,latitude,placeId,inReplyTo,source,truncated,placeLatitude,placeLongitude,sourceName,sourceUrl,userName,screenName,followersCount,friendsCount,statusesCount,userLocation
776522983837954049,735449229028675584,2016-09-15 20:48:01,se lo dici tu... https://t.co/x7Qm1VHBKL,\N,\N,51c0e6b24c64e54e,\N,1,,46.0027,8.96044,Twitter for iPhone,http://twitter.com/#!/download/iphone,plvtone filiae.,hazel_chb,146,110,28621,Earleen.
776523000636203010,2741685639,2016-09-15 20:48:05,https://t.co/noYrTnqmg9,\N,\N,4e7c21fd2af027c6,\N,1,,46.8131,8.22414,Twitter for iPhone,http://twitter.com/#!/download/iphone,samara,letisieg,755,2037,3771,Suisse
776523045200691200,435239151,2016-09-15 20:48:15,@BesacTof @Leonid_CCCP Tu dois t'engager en si...,\N,\N,12eb9b254faf37a3,776522113859608576,5,,47.201,5.94082,Twitter for Android,http://twitter.com/download/android,lebrübrü❤,lebrubru,811,595,30191,Fontain
776523058404290560,503244217,2016-09-15 20:48:18,@Mno0or_Abyat اشوف مظاهرات على قانون العمل الج...,\N,\N,30bcd7f767b4041e,776521597515624448,1,,45.8011,6.16552,Twitter for iPhone,http://twitter.com/#!/download/iphone,عبدالله القنيص,bingnais,28433,417,12262,Shargeyah
776523058504925185,452805259,2016-09-15 20:48:18,Greek night #geneve (@ Emilios in Genève) http...,6.14414,46.1966,c3a6437e1b1a726d,\N,3,,46.2048,6.14319,foursquare,http://foursquare.com,Alkan Şenli,Alkanoli,204,172,3390,İstanbul/Burgazada
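
In the preview above, several fields contain the literal string `\N`. Assuming this is a placeholder for a missing value (the notebook does not state it), pandas can be asked to read it as NaN through the `na_values` parameter. The sketch below uses a hypothetical four-column subset and made-up rows instead of the real file.

```python
import csv
import io

import pandas as pd

# Hypothetical subset of the schema and two made-up rows in the same tab-separated layout.
columns = ["id", "text", "longitude", "latitude"]
raw = "1\thello\t\\N\t\\N\n2\tGreek night\t6.14414\t46.1966\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    quoting=csv.QUOTE_NONE,
    header=None,
    names=columns,
    index_col=0,
    na_values=[r"\N"],  # assumption: \N marks a missing value
)

# Missing coordinates are now NaN, so the columns are parsed as floats.
print(df.dtypes)
print(df.isnull().sum())
```

Reading `\N` as NaN lets coordinate columns be parsed as numbers instead of strings, which simplifies any later filtering on geolocated tweets.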


## 0.3 Data stats

In [151]:
len(sample_data)

8790

In [158]:
tweet_text = sample_data.text
tweet_text[tweet_text.isnull()]

id
776523375585923072    NaN
776523779556118529    NaN
776527040581406720    NaN
776529088081195008    NaN
776529152002428928    NaN
776530994543333377    NaN
776531875762466816    NaN
776532154230734848    NaN
776534662445469696    NaN
776535185550639104    NaN
776536240476880896    NaN
776536854057349120    NaN
776537026996891649    NaN
776537681077600258    NaN
776538830841253890    NaN
776543990283956224    NaN
776547011898245121    NaN
776547246355652609    NaN
776547299682054144    NaN
776550967055638528    NaN
776552272599875584    NaN
776553481729667074    NaN
776556434049888257    NaN
776560854481379329    NaN
776563963756707845    NaN
776574235678564352    NaN
776583741695139840    NaN
776593476792168448    NaN
776595512900321280    NaN
776636303203045377    NaN
                     ... 
776741317648982016    NaN
776744289107898368    NaN
776745663988105216    NaN
776745714659581952    NaN
776746721284087808    NaN
776747965339856896    NaN
776751397572382720    NaN
776752557