#### Extracting sentiments from tweets
This notebook explains how to extract the sentiments attached to the tweets. It extracts them for the whole month of April, which takes time. The resulting data are also available pickled in the tweets_en_april.pkl file.

In [None]:
import pandas as pd
import twitter_extract

In [None]:
sample = pd.read_json('data/april/harvest3r_twitter_data_01-04_0.json')

What kind of data does a tweet contain?

In [None]:
sample.head()

We are mostly interested in the _source field, which holds the fields of each tweet.

In [None]:
sample.loc[0]._source

We also have the tweet IDs.

In [None]:
sample._id.head()

Problem: the tweet IDs do not seem to be unique!

In [None]:
sample._id.value_counts().head()
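
To put a number on the duplication, the following minimal sketch may help. It runs on a small invented Series that stands in for `sample._id` (the ID values are made up for illustration), and uses `is_unique` and `duplicated`, which are standard pandas.

```python
import pandas as pd

# Toy stand-in for sample._id; the values are invented for illustration.
ids = pd.Series(['a1', 'b2', 'a1', 'c3', 'a1'], name='_id')

# True only when every ID appears exactly once
all_unique = ids.is_unique

# Number of rows whose ID already appeared earlier in the Series
n_repeats = int(ids.duplicated().sum())

# Every row that shares its ID with at least one other row
shared = ids[ids.duplicated(keep=False)]

print(all_unique, n_repeats, len(shared))
```

On the real data, replacing the toy Series with `sample._id` should give the same three quantities for the file loaded above.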

How many tweets are there for this day?

In [None]:
len(sample._source)

Create a new, pre-allocated DataFrame with one column per tweet field.

In [None]:
tweets = sample._source
ids = sample._id
ids.name = 'id'

# Loop over all tweets to collect every field name that occurs
columns = set()
for tweet in tweets:
    # Guard against records that do not hold a dict of fields
    if isinstance(tweet, dict):
        columns.update(tweet.keys())
columns = list(columns)

# Pre-allocate the DataFrame, otherwise it takes too much time to fill
# Don't use the tweet IDs as the index while filling: they are not unique!
df = pd.DataFrame(columns=columns, index=range(len(ids)))
df.head()

In [None]:
for i in range(len(tweets)):
    for key, value in tweets[i].items():
        # Convert lists to space-separated strings (items may not be strings)
        if isinstance(value, list):
            tweets[i][key] = ' '.join(map(str, value))

    df.loc[i] = pd.Series(tweets[i])

# Give the tweets their original IDs
df.index = ids

df.head()
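
Filling the frame one row at a time can be slow on large files. As an alternative sketch (not what this notebook does), pandas can build the frame in one call from a list of dictionaries, using the union of their keys as columns. The records below are invented stand-ins for `sample._source`; the `tags` field and the sentiment value are hypothetical.

```python
import pandas as pd

# Toy stand-ins for sample._source and sample._id (invented values)
records = [
    {'main': 'hello world', 'lang': 'en', 'sentiment': 'POSITIVE', 'tags': ['a', 'b']},
    {'main': 'bonjour', 'lang': 'fr'},
]
ids = pd.Series(['x', 'x'], name='id')  # deliberately non-unique


def flatten(record):
    """Return a copy of record with list values joined into one string."""
    return {key: ' '.join(map(str, value)) if isinstance(value, list) else value
            for key, value in record.items()}


# One constructor call; fields missing from a record become NaN
df = pd.DataFrame([flatten(r) for r in records])

# Attach the original (possibly duplicated) IDs afterwards
df.index = pd.Index(ids)

print(df)
```

Because the records are copied by `flatten`, this variant also leaves the original dictionaries untouched.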

What are the users' locations for these tweets?

In [None]:
df['source_location'].value_counts().head()

Which sentiments are associated with the tweets?

In [None]:
df['sentiment'].value_counts()

Let's look at a few tweets.

In [None]:
df['main'].head(10)

In which languages are they written?

In [None]:
df['lang'].value_counts()

For this first part, keep only the English tweets.

In [None]:
df_en = df[df.lang == 'en']
df_en['main'].head(10)

What about their sentiments?

In [None]:
df_en['sentiment'].value_counts()
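
Raw counts are hard to compare across languages or days of different sizes. A small sketch of the same query expressed as proportions, using `value_counts(normalize=True)` on an invented frame (the sentiment labels are assumptions, not the values found in the data):

```python
import pandas as pd

# Toy frame with the two columns used above; all values are invented.
df = pd.DataFrame({
    'lang': ['en', 'en', 'fr', 'en', 'de'],
    'sentiment': ['POSITIVE', 'NEGATIVE', 'POSITIVE', 'POSITIVE', None],
})

# Same filter as in the notebook
df_en = df[df['lang'] == 'en']

# Share of each sentiment among the English tweets (missing values are ignored)
shares = df_en['sentiment'].value_counts(normalize=True)

print(shares)
```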

In [None]:
tweets = df_en[['source_location', 'sentiment']]
tweets.head(10)

Okay, now we have to do this for the whole data set! See the twitter_extract file for the automated processing.

In [None]:
df = twitter_extract.parse_month('april', '04')
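
The contents of `twitter_extract` are not shown here. Purely as an illustration of the kind of loop such a module could contain, and not its actual implementation, the sketch below repeats the steps of this notebook over several files. The function name `parse_files`, the demo file names and all values are hypothetical, and each file is assumed to be a JSON array of objects with `_id` and `_source` keys.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def parse_files(paths, lang='en'):
    """Hypothetical helper: load several files and keep tweets in one language."""
    frames = []
    for path in paths:
        # Assumption: the file is a JSON array of {"_id": ..., "_source": {...}}
        raw = pd.read_json(str(path), orient='records')
        day = pd.DataFrame(list(raw['_source']))
        day.index = pd.Index(raw['_id'], name='id')
        # Keep only the requested language and the two columns of interest
        frames.append(day.loc[day['lang'] == lang, ['source_location', 'sentiment']])
    return pd.concat(frames)


def hit(tweet_id, lang):
    """Build one invented record in the assumed file layout."""
    return {'_id': tweet_id,
            '_source': {'lang': lang, 'source_location': 'Geneva', 'sentiment': 'NEUTRAL'}}


# Demo on two tiny invented files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, 'day_01.json').write_text(json.dumps([hit('01-a', 'en')]))
    Path(tmp, 'day_02.json').write_text(json.dumps([hit('02-a', 'fr'), hit('02-b', 'en')]))
    result = parse_files(sorted(Path(tmp).glob('*.json')))

print(result)
```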

In [None]:
df.to_pickle('processed/tweets_en_april.pkl')
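
To reuse the extracted data later without re-running the extraction, the pickle can be loaded back with `pd.read_pickle`. A minimal round-trip sketch on an invented frame written to a temporary directory (the file name `tweets_demo.pkl` and the values are made up):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame shaped like the extracted data; all values are invented.
df = pd.DataFrame(
    {'source_location': ['Geneva', 'Zurich'], 'sentiment': ['NEUTRAL', 'POSITIVE']},
    index=pd.Index(['a', 'a'], name='id'),  # duplicated IDs survive the round trip
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp, 'tweets_demo.pkl')
    df.to_pickle(path)
    restored = pd.read_pickle(path)

print(restored.equals(df))
```

Note that `to_pickle` does not create missing directories, so a target folder such as `processed/` has to exist before the cell above is run.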