#### Extracting sentiments from tweets
This notebook explains how to extract the sentiments of tweets. The extraction covers the month of April and takes a while to run; the result is available as a pickle in the tweets_en_april.pkl file.

In [1]:
import pandas as pd

In [2]:
sample = pd.read_json('data/april/harvest3r_twitter_data_01-04_0.json')

What kind of data does a tweet contain?

In [3]:
sample.head()

,_id,_index,_score,_source,_type
0,1459505293000008960,merged_content_2016_04_01_to_2016_04_14,0.001139,"{'main_authoritative': True, 'metadata_score':...",content
1,1459504074000012800,merged_content_2016_04_01_to_2016_04_14,0.001139,"{'main_authoritative': True, 'metadata_score':...",content
2,1459547684000013568,merged_content_2016_04_01_to_2016_04_14,0.001139,"{'main_authoritative': True, 'metadata_score':...",content
3,1459478188000002816,merged_content_2016_04_01_to_2016_04_14,0.001139,"{'main_authoritative': True, 'metadata_score':...",content
4,1459519767000012288,merged_content_2016_04_01_to_2016_04_14,0.001139,"{'main_authoritative': True, 'metadata_score':...",content


We are mostly interested in the _source field.

In [4]:
sample.loc[0]._source

{'author_avatar_img': 'https://pbs.twimg.com/profile_images/1195500897/Icon_BBA_bigger.png',
 'author_gender': 'UNKNOWN',
 'author_handle': 'bba_allmedia',
 'author_link': 'https://twitter.com/bba_allmedia',
 'author_name': 'BBA Allmedia',
 'bucket': 1459505100099,
 'canonical': 'https://twitter.com/bba_allmedia/status/715843057392992256',
 'date_found': '2016-04-01T10:08:13Z',
 'domain': 'twitter.com',
 'hashcode': 'xxAm2o2O27-BV82_lRzXa2WTt-8',
 'index_method': 'SOURCE_TASK_COMPOSITE',
 'lang': 'de',
 'links': ['https://t.co/s4g1d8KiWe'],
 'main': 'Aktuellste Immoangebote http://tinyurl.com/cthce3f\xa0',
 'main_authoritative': True,
 'main_checksum': '01uIm3DfvaRag9mywHxDhXv3hYM',
 'main_format': 'TEXT',
 'main_length': 51,
 'metadata_score': 304,
 'permalink': 'https://twitter.com/bba_allmedia/status/715843057392992256',
 'published': '2016-04-01T10:07:39Z',
 'resource': 'https://twitter.com/bba_allmedia/status/715843057392992256',
 'sentiment': 'NEUTRAL',
 'sequence': 1459505293000

We also have the tweet IDs.

In [5]:
sample._id.head()

0    1459505293000008960
1    1459504074000012800
2    1459547684000013568
3    1459478188000002816
4    1459519767000012288
Name: _id, dtype: int64

Problem: the tweet IDs are not unique!

In [6]:
sample._id.value_counts().head()

1459523793000004096    17
1459523794000004096    11
1459559363000008960     9
1459538897000013568     7
1459501304000013568     6
Name: _id, dtype: int64
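
The duplicates can also be checked without reading through the counts. The following is a minimal sketch on a small hypothetical Series (the two ID values are copied from the outputs above, but the number of repetitions is made up for the example); it relies only on the standard pandas attributes `Series.is_unique` and `Series.duplicated`.

```python
import pandas as pd

# Hypothetical stand-in for sample._id: one ID appears three times
ids = pd.Series([1459523793000004096, 1459523793000004096,
                 1459523793000004096, 1459505293000008960], name='_id')

# True only if every value occurs exactly once
all_unique = ids.is_unique

# Number of rows that repeat an ID already seen earlier in the Series
n_repeats = int(ids.duplicated().sum())

print(all_unique, n_repeats)
```

On the real data, the same two expressions applied to `sample._id` would tell how many rows share an ID with an earlier row.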

How many tweets are there for this day?

In [7]:
len(sample._source)

25333

Create a new pre-allocated DataFrame

In [8]:
tweets = sample._source
ids = sample._id
ids.name = 'id'

# Loop over all tweets to collect every field name that occurs
columns = set()
for tweet in tweets:
    # Skip entries that are not dicts (e.g. missing values)
    if isinstance(tweet, dict):
        columns.update(tweet.keys())
columns = list(columns)

# Pre-allocate the DataFrame, otherwise filling it takes too much time
# Don't use the tweet IDs as the index while filling: they are not unique!
df = pd.DataFrame(columns=columns, index=range(len(ids)))
df.head()

,source_verified,type,permalink,author_gender,author_link,sequence,source_parsed_posts_max,source_content_length,source_handle,source_setting_update_strategy,...,source_last_updated,mentions,version,shares,source_location,source_update_interval,source_publisher_subtype,date_found,source_likes,main_format
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,


In [9]:
for i in range(len(tweets)):
    for key, value in tweets[i].items():
        # Convert lists to space-separated strings
        if isinstance(value, list):
            tweets[i][key] = ' '.join(value)

    df.loc[i] = pd.Series(tweets[i])

# Give the tweets their original IDs
df.index = ids

df.head()

id,source_verified,type,permalink,author_gender,author_link,sequence,source_parsed_posts_max,source_content_length,source_handle,source_setting_update_strategy,...,source_last_updated,mentions,version,shares,source_location,source_update_interval,source_publisher_subtype,date_found,source_likes,main_format
1459505293000008960,False,POST,https://twitter.com/bba_allmedia/status/715843...,UNKNOWN,https://twitter.com/bba_allmedia,1459505293000009032,0,270959,bba_allmedia,CYCLICAL,...,2015-12-11T13:08:19Z,,5.1.683,,Bremgarten,3600000,twitter,2016-04-01T10:08:13Z,0,TEXT
1459504074000012800,False,POST,https://twitter.com/BadZurzach/status/71583797...,UNKNOWN,https://twitter.com/BadZurzach,1459504074000012794,20,278303,BadZurzach,CYCLICAL,...,2015-12-16T23:41:18Z,,5.1.683,,Bad Zurzach,3600000,twitter,2016-04-01T09:47:54Z,0,TEXT
1459547684000013568,False,POST,https://twitter.com/AurelieL34/status/71601995...,UNKNOWN,https://twitter.com/AurelieL34,1459547684000013471,20,301219,AurelieL34,CYCLICAL,...,2015-12-12T13:24:33Z,,5.1.683,,Ginevra,3600000,twitter,2016-04-01T21:54:44Z,0,TEXT
1459478188000002816,True,POST,https://twitter.com/ALFEEL_GOOD/status/7157293...,UNKNOWN,https://twitter.com/ALFEEL_GOOD,1459478188000002741,0,298888,ALFEEL_GOOD,CYCLICAL,...,2015-09-29T22:16:46Z,ZHA_News,5.1.681,,Lucerna,3600000,twitter,2016-04-01T02:36:28Z,0,TEXT
1459519767000012288,False,POST,https://twitter.com/velo_dominik/status/715903...,MALE,https://twitter.com/velo_dominik,1459519767000012179,20,296657,velo_dominik,CYCLICAL,...,2015-12-09T11:46:26Z,,5.1.683,,Hochdorf,3600000,twitter,2016-04-01T14:09:27Z,0,TEXT
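
As an alternative to filling a pre-allocated DataFrame row by row, pandas can build the frame from a list of dicts in a single constructor call. The sketch below is not the method used in this notebook; it uses two hypothetical records (the field names come from the outputs above, the values and IDs are invented) and a hypothetical helper, `flatten_lists`, that joins list values into strings as the loop above does.

```python
import pandas as pd

# Hypothetical records standing in for sample._source
records = [
    {'lang': 'de', 'sentiment': 'NEUTRAL',
     'links': ['https://t.co/a', 'https://t.co/b']},
    {'lang': 'en', 'sentiment': 'POSITIVE', 'mentions': ['ZHA_News']},
]
# Hypothetical IDs; duplicates are allowed in a pandas index
ids = pd.Series([10, 10], name='id')


def flatten_lists(record):
    """Return a copy of the record with list values joined into strings."""
    return {key: ' '.join(map(str, value)) if isinstance(value, list) else value
            for key, value in record.items()}


# One constructor call; fields missing from a record become NaN
df = pd.DataFrame([flatten_lists(r) for r in records])
df.index = ids

print(df)
```

Because the columns are inferred from the records, no separate pass to collect the field names is needed in this variant.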


What are the users' locations for these tweets?

In [10]:
df['source_location'].value_counts().head()

Switzerland    9002
Suisse         2325
Zürich         2298
Schweiz        1813
Geneva         1538
Name: source_location, dtype: int64
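
Switzerland, Suisse and Schweiz are the English, French and German names of the same country, so they could be merged before counting. The sketch below shows one way to do this with `Series.replace`; the `country_aliases` mapping is a hypothetical example, not part of the original processing, and the five values are just the labels listed above, one occurrence each.

```python
import pandas as pd

# The five most frequent labels from the output above, once each
locations = pd.Series(['Switzerland', 'Suisse', 'Schweiz', 'Zürich', 'Geneva'])

# Hypothetical alias table: variant spelling -> canonical name
country_aliases = {'Suisse': 'Switzerland', 'Schweiz': 'Switzerland'}

# Values not listed in the mapping are left unchanged
normalized = locations.replace(country_aliases)

merged_counts = normalized.value_counts()
print(merged_counts)
```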

Which sentiments are associated with the tweets?

In [11]:
df['sentiment'].value_counts()

NEUTRAL     18708
POSITIVE     4130
NEGATIVE     1597
Name: sentiment, dtype: int64
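
The three counts add up to 24435, which is 898 fewer than the 25333 rows of the DataFrame: `value_counts` leaves out missing values by default. The sketch below rebuilds a Series with the same counts (an artificial reconstruction, assuming the remaining 898 rows simply have no sentiment label) and shows how `dropna=False` and `normalize=True` change the result.

```python
import pandas as pd

# Counts copied from the output above; None marks rows assumed to be unlabelled
sentiment = pd.Series(['NEUTRAL'] * 18708 + ['POSITIVE'] * 4130
                      + ['NEGATIVE'] * 1597 + [None] * 898)

# Include the missing labels in the count
counts = sentiment.value_counts(dropna=False)

# Fractions among the labelled tweets only (missing values are excluded)
shares = sentiment.value_counts(normalize=True)

print(counts)
print(shares.round(3))
```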

Let's extract a few tweets

In [12]:
df['main'].head(10)

id
1459505293000008960    Aktuellste Immoangebote http://tinyurl.com/cth...
1459504074000012800    Du bist 6 - 14 Jahre alt & sprichst Schweizerd...
1459547684000013568    EN DIRECT sur #Periscope : Avec awa la blg htt...
1459478188000002816    @ZHA_News https://youtu.be/UxQmxIm4q1s  Rest i...
1459519767000012288    Unterwegs rastet der Velofahrer gerne bei eine...
1459521768000012032    Come and join me at our testing Roadshow on Ap...
1459547783000011008                Bad Apple! pic.twitter.com/Xad2aVOHFd
1459511039000003840    Number crunching for the past week - 1 new unf...
1459506286000000512                   @ElisaaNunees poisson d'avril laul
1459492313000010240    Just posted a photo @ Langnau im Emmental http...
Name: main, dtype: object

Which languages are they written in?

In [13]:
df['lang'].value_counts()

en     9608
de     6656
fr     4835
es     1446
und     898
it      498
pt      223
tl      134
in      133
ht      116
pl      111
tr      101
nl       94
ja       90
ru       64
ar       52
da       47
et       45
sv       42
ko       26
fi       23
no       22
hi       22
sl       13
lv       10
hu        7
uk        5
lt        5
el        2
vi        2
is        2
bg        1
Name: lang, dtype: int64

For this first part, keep only the English tweets.

In [14]:
df_en = df[df.lang == 'en']
df_en['main'].head(10)

id
1459478188000002816    @ZHA_News https://youtu.be/UxQmxIm4q1s  Rest i...
1459521768000012032    Come and join me at our testing Roadshow on Ap...
1459547783000011008                Bad Apple! pic.twitter.com/Xad2aVOHFd
1459511039000003840    Number crunching for the past week - 1 new unf...
1459492313000010240    Just posted a photo @ Langnau im Emmental http...
1459518158000004352    Europe's New Mars Mission Bringing NASA Radios...
1459525602000004096    I liked a @YouTube video http://youtu.be/LE1WG...
1459518063000005120    NASA's Spitzer Maps Climate Patterns on a Supe...
1459498184000011264    Microsoft’s HoloLens now available to develope...
1459527636000012032    "i will leave tomorrow's problem for tomorrow'...
Name: main, dtype: object

What about their sentiments?

In [15]:
df_en['sentiment'].value_counts()

NEUTRAL     4513
POSITIVE    3538
NEGATIVE    1557
Name: sentiment, dtype: int64

In [16]:
tweets = df_en[['source_location', 'sentiment']]
tweets.head(10)

id,source_location,sentiment
1459478188000002816,Lucerna,POSITIVE
1459521768000012032,Ennetbürgen,POSITIVE
1459547783000011008,Suica,NEGATIVE
1459511039000003840,Arth,NEUTRAL
1459492313000010240,Langnau,NEUTRAL
1459518158000004352,Poschiavo,NEUTRAL
1459525602000004096,Laufen,POSITIVE
1459518063000005120,Poschiavo,POSITIVE
1459498184000011264,Poschiavo,POSITIVE
1459527636000012032,Schwyz,NEGATIVE
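
One possible next step with these two columns is to count the sentiments per location. The sketch below is only an illustration and is not taken from the original processing; it applies `pd.crosstab` to the ten rows shown above.

```python
import pandas as pd

# The ten rows shown in the output above
tweets = pd.DataFrame({
    'source_location': ['Lucerna', 'Ennetbürgen', 'Suica', 'Arth', 'Langnau',
                        'Poschiavo', 'Laufen', 'Poschiavo', 'Poschiavo', 'Schwyz'],
    'sentiment': ['POSITIVE', 'POSITIVE', 'NEGATIVE', 'NEUTRAL', 'NEUTRAL',
                  'NEUTRAL', 'POSITIVE', 'POSITIVE', 'POSITIVE', 'NEGATIVE'],
})

# Rows: locations, columns: sentiment labels, cells: number of tweets
by_location = pd.crosstab(tweets['source_location'], tweets['sentiment'])

print(by_location)
```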


Now we have to do this for the whole data set. See the twitter_extract_pandas.py file for the automated processing.
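
The steps of this notebook can be condensed into a function that is applied to every file. The sketch below is not the content of twitter_extract_pandas.py; the function names `extract_en_sentiments` and `process_files` and the glob pattern in the final comment are assumptions made for illustration, and the demo input is a tiny invented frame with the same `_id` and `_source` columns as `sample`.

```python
import glob

import pandas as pd


def extract_en_sentiments(sample):
    """Return source_location and sentiment for the English tweets of one file."""
    df = pd.DataFrame(list(sample['_source']))
    df.index = pd.Index(sample['_id'], name='id')
    # Guarantee the three columns exist even if no record carries them
    df = df.reindex(columns=['lang', 'source_location', 'sentiment'])
    return df.loc[df['lang'] == 'en', ['source_location', 'sentiment']]


def process_files(pattern):
    """Concatenate the English tweets of every JSON file matching the pattern."""
    frames = [extract_en_sentiments(pd.read_json(path))
              for path in sorted(glob.glob(pattern))]
    if not frames:
        return pd.DataFrame(columns=['source_location', 'sentiment'])
    return pd.concat(frames)


# Invented demo data with the same layout as `sample` (note the duplicate ID)
demo = pd.DataFrame({
    '_id': [1, 2, 2],
    '_source': [
        {'lang': 'en', 'source_location': 'Arth', 'sentiment': 'NEUTRAL'},
        {'lang': 'de', 'source_location': 'Zürich', 'sentiment': 'POSITIVE'},
        {'lang': 'en', 'source_location': 'Schwyz', 'sentiment': 'NEGATIVE'},
    ],
})
result = extract_en_sentiments(demo)
print(result)

# Assumed usage on the real files (the pattern is a guess at the file layout):
# process_files('data/april/*.json').to_pickle('tweets_en_april.pkl')
```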