This notebook shows how to get started with the tweet data scraped so far.

The data resides in .txt files. Each file contains data for 1000 tweets. The scraped tweets have the following properties:
- Each tweet explicitly mentions the respective presidential candidate. No retweeting, no replying to a candidate's tweets, etc. These tweets matter because they explicitly seek to engage a presidential candidate on Twitter. They are not necessarily reactions to, or retweets of, a presidential candidate's own tweets. Given these assumptions, I believe they are comparatively rich material for topic discovery.
- Tweets that span multiple lines (that is, contain a \n character) were stored with the literal string backslashN in place of each \n, so that every tweet fits on a single line. Replace backslashN with \n before analyzing the tweets.
- Each line stores one tweet as the following fields, separated by a :::: delimiter: retweet_count, author_screen_name, author_followers_count, author_following_count, created_at, id_str, full_text
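
The storage format described above can be sketched as a small parser. This is only an illustration: `parse_tweet_line`, `FIELD_NAMES`, and the sample line are names and data made up here, and the sketch assumes every line contains all seven fields.

```python
FIELD_NAMES = [
    "retweet_count", "author_screen_name", "author_followers_count",
    "author_following_count", "created_at", "id_str", "full_text",
]


def parse_tweet_line(line):
    """Turn one stored line into a dict keyed by field name."""
    # Drop only the line terminator, and cap the number of splits so that a
    # '::::' inside the tweet text (the last field) is not split any further.
    parts = line.rstrip("\n").split("::::", len(FIELD_NAMES) - 1)
    record = dict(zip(FIELD_NAMES, parts))
    # Restore the line breaks that were stored as the literal text 'backslashN'.
    record["full_text"] = record["full_text"].replace("backslashN", "\n")
    return record


# Hypothetical line, made up for illustration (not taken from the data files).
sample = "3::::some_user::::10::::20::::2019-11-24 00:35:01::::123::::first linebackslashNsecond line\n"
tweet = parse_tweet_line(sample)
print(tweet["full_text"])
```

Capping the split count is a defensive choice for the case where a tweet happens to contain the delimiter itself; whether that ever occurs in the scraped files is not something this notebook checks.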


Let's pick one of the files containing Twitter data. I'll choose Bernie.txt as a reference.




In [3]:
with open('Bernie.txt', 'r') as fBernie:
    for line in fBernie:
        data = line.split('::::')
        for field in data:
            print(field)
        break
    

0
ghostriderr74
24
85
2019-11-24 00:35:01
1198399587649613824
@BernieSanders Not -- in my town we have medical clinic charging $85 dollars a month for a family of 4 and discounted med.  By law if something bad happens emergency rooms have to take them.



# Making a Pandas dataframe from tweets

In [59]:
# uncomment and run this if you get a pandas-related error
#!pip install pandas

In [6]:
import pandas as pd

In [7]:
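with open('Bernie.txt', 'r') as fBernie:
    # Restore real newlines, then split each line into its seven fields.
    # The newline that ends each line stays at the end of full_text.
    file_data = [line.replace('backslashN', '\n').split('::::') for line in fBernie]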
    

In [10]:
field_names = ['retweet_count', 'author_screen_name', 'author_followers_count', 'author_following_count', 'created_at', 'id_str', 'full_text']
df = pd.DataFrame(data=file_data, columns=field_names)
df.head()

| | retweet_count | author_screen_name | author_followers_count | author_following_count | created_at | id_str | full_text |
|---|---|---|---|---|---|---|---|
| 0 | 0 | ghostriderr74 | 24 | 85 | 2019-11-24 00:35:01 | 1198399587649613824 | @BernieSanders Not -- in my town we have medic... |
| 1 | 0 | michaelfrank17 | 2078 | 2202 | 2019-11-24 00:35:01 | 1198399586403942400 | This week I heard @MMFlint say that Nancy Pelo... |
| 2 | 6 | savemain_st | 46654 | 18548 | 2019-11-24 00:35:00 | 1198399585049284672 | @ninaturner @PortiaABoulger @BernieSanders @jj... |
| 3 | 0 | GeoffWaters5 | 15 | 21 | 2019-11-24 00:35:00 | 1198399582847340544 | @stro1786 @BernieSanders Is that a good thing ... |
| 4 | 0 | FLOURNOYFarrell | 3505 | 4994 | 2019-11-24 00:34:57 | 1198399569853325312 | @KelticSC @thekaraboudjan @SallyAlbright @Lewi... |


In [11]:
df.describe()


| | retweet_count | author_screen_name | author_followers_count | author_following_count | created_at | id_str | full_text |
|---|---|---|---|---|---|---|---|
| count | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| unique | 23 | 717 | 473 | 578 | 862 | 1000 | 1000 |
| top | 0 | DeplorableDavi4 | 4 | 137 | 2019-11-24 00:08:36 | 1198397131746185216 | @PublicHenemy1 @BernieSanders Berning up like ... |
| freq | 857 | 11 | 16 | 18 | 4 | 1 | 1 |


In [12]:
df.dtypes

retweet_count             object
author_screen_name        object
author_followers_count    object
author_following_count    object
created_at                object
id_str                    object
full_text                 object
dtype: object
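
Every column is stored as `object` (plain strings) because the values come straight from splitting text lines. Below is a minimal sketch of casting the count columns to numbers and `created_at` to timestamps. The two-row frame `sample` is a stand-in built here for illustration, and `count_columns` and `typed` are names introduced for this sketch; `id_str` is deliberately left as a string since it is an identifier rather than a quantity.

```python
import pandas as pd

# Stand-in frame with the same string-typed columns as `df` (illustration only).
sample = pd.DataFrame({
    "retweet_count": ["0", "6"],
    "author_followers_count": ["24", "46654"],
    "author_following_count": ["85", "18548"],
    "created_at": ["2019-11-24 00:35:01", "2019-11-24 00:35:00"],
    "id_str": ["1198399587649613824", "1198399585049284672"],
})

count_columns = ["retweet_count", "author_followers_count", "author_following_count"]

typed = sample.copy()
for column in count_columns:
    # Strings such as "46654" become integers.
    typed[column] = pd.to_numeric(typed[column])
# Timestamp strings become datetime64 values.
typed["created_at"] = pd.to_datetime(typed["created_at"])

print(typed.dtypes)
```

With numeric dtypes, `describe()` would report statistics such as mean and quartiles for the count columns instead of the count/unique/top/freq summary shown earlier.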

## Example: Mining for Potentially Useful Statistics

The Twitter API returns tweets in a seemingly random manner, so we cannot assume they are spread evenly over time. Let's see how many tweets per day we actually have.

In [35]:
df.groupby(pd.to_datetime(df['created_at']).dt.date)['id_str'].count()

created_at
2019-11-23    322
2019-11-24    678
Name: id_str, dtype: int64

Let's create a simpler dataframe that contains only a subset of the columns from the one created above.

In [49]:
# Take an explicit copy so that the column assignment below modifies df_simple only
df_simple = df.loc[:, ['created_at', 'id_str', 'full_text']].copy()

df_simple['created_at'] = pd.to_datetime(df_simple['created_at'])

df_simple.head()

| | created_at | id_str | full_text |
|---|---|---|---|
| 0 | 2019-11-24 00:35:01 | 1198399587649613824 | @BernieSanders Not -- in my town we have medic... |
| 1 | 2019-11-24 00:35:01 | 1198399586403942400 | This week I heard @MMFlint say that Nancy Pelo... |
| 2 | 2019-11-24 00:35:00 | 1198399585049284672 | @ninaturner @PortiaABoulger @BernieSanders @jj... |
| 3 | 2019-11-24 00:35:00 | 1198399582847340544 | @stro1786 @BernieSanders Is that a good thing ... |
| 4 | 2019-11-24 00:34:57 | 1198399569853325312 | @KelticSC @thekaraboudjan @SallyAlbright @Lewi... |


# Example: Group tweets by day/week/month 

You can replace x.date() with x.year, x.month, x.day, etc. to group by other criteria. Note that date() is a method call, whereas year, month, and day are plain attributes.

In [58]:
for groupid, group in df_simple.groupby(df_simple['created_at'].map(lambda x:x.date())):
    print(groupid)
    display(group.head())
    print('-------------------')

2019-11-23


| | created_at | id_str | full_text |
|---|---|---|---|
| 678 | 2019-11-23 23:59:49 | 1198390731095674880 | @yanghis_khan @EntropyXero @deltdennison @Bern... |
| 679 | 2019-11-23 23:59:47 | 1198390719062388736 | @gregbonner30 @BernieSanders Shutup Greg\n |
| 680 | 2019-11-23 23:59:44 | 1198390708463382528 | @Locke_Wiggins @problemfellow @ScottTrudell @B... |
| 681 | 2019-11-23 23:59:41 | 1198390696467558401 | @BernieSanders Back to the 50’s\n |
| 682 | 2019-11-23 23:59:41 | 1198390695511199744 | @LMplusG @BernieSanders Yes, however, Bernie, ... |


-------------------
2019-11-24


| | created_at | id_str | full_text |
|---|---|---|---|
| 0 | 2019-11-24 00:35:01 | 1198399587649613824 | @BernieSanders Not -- in my town we have medic... |
| 1 | 2019-11-24 00:35:01 | 1198399586403942400 | This week I heard @MMFlint say that Nancy Pelo... |
| 2 | 2019-11-24 00:35:00 | 1198399585049284672 | @ninaturner @PortiaABoulger @BernieSanders @jj... |
| 3 | 2019-11-24 00:35:00 | 1198399582847340544 | @stro1786 @BernieSanders Is that a good thing ... |
| 4 | 2019-11-24 00:34:57 | 1198399569853325312 | @KelticSC @thekaraboudjan @SallyAlbright @Lewi... |


-------------------


In [57]:
for groupid, group in df_simple.groupby(df_simple['created_at'].map(lambda x:x.date())):
    print(groupid)
    display(group['full_text'].head())
    print('-------------------')

2019-11-23


678    @yanghis_khan @EntropyXero @deltdennison @Bern...
679           @gregbonner30 @BernieSanders Shutup Greg\n
680    @Locke_Wiggins @problemfellow @ScottTrudell @B...
681                    @BernieSanders Back to the 50’s\n
682    @LMplusG @BernieSanders Yes, however, Bernie, ...
Name: full_text, dtype: object

-------------------
2019-11-24


0    @BernieSanders Not -- in my town we have medic...
1    This week I heard @MMFlint say that Nancy Pelo...
2    @ninaturner @PortiaABoulger @BernieSanders @jj...
3    @stro1786 @BernieSanders Is that a good thing ...
4    @KelticSC @thekaraboudjan @SallyAlbright @Lewi...
Name: full_text, dtype: object

-------------------
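
The grouping key used in the loops above can be swapped for other calendar units. A minimal sketch using the `.dt` accessor on a datetime column follows; the three-row frame `sample` is made up for illustration, and `dt.isocalendar()` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Made-up timestamps spanning two ISO weeks and two months (illustration only).
sample = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2019-11-23 23:59:49",
        "2019-11-24 00:35:01",
        "2019-12-02 08:00:00",
    ]),
    "id_str": ["1", "2", "3"],
})

# One group per calendar day.
per_day = sample.groupby(sample["created_at"].dt.date)["id_str"].count()
# One group per month number (1-12); years are not distinguished here.
per_month = sample.groupby(sample["created_at"].dt.month)["id_str"].count()
# One group per ISO week number.
per_week = sample.groupby(sample["created_at"].dt.isocalendar().week)["id_str"].count()

print(per_day, per_month, per_week, sep="\n\n")
```

If the data ever spans more than one year, grouping by month or week number alone would merge the same month or week of different years, so the year would need to be part of the key.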


# Where to go from here?

This notebook demonstrates how to group tweets by a criterion (here, by day). After grouping, we can run topic modeling on the tweets for a particular day to identify the main topics. This is essentially what we suggested in the proposal.
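
As a first step in that direction, here is a rough sketch that lists the most frequent words per day. It is plain word counting, not topic modeling, and is only meant to show where a per-day analysis would plug in. The frame `sample`, the tiny `STOPWORDS` set, and the helper `top_terms` are all made up for this illustration.

```python
import re
from collections import Counter

import pandas as pd

# Made-up tweets for illustration (not taken from the data files).
sample = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2019-11-23 23:10:00", "2019-11-23 23:50:00",
        "2019-11-24 00:10:00", "2019-11-24 00:30:00",
    ]),
    "full_text": [
        "@BernieSanders healthcare for all",
        "healthcare costs are rising",
        "@BernieSanders climate plan",
        "the climate is changing",
    ],
})

# A deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "to", "is", "for", "and", "of", "in"}


def top_terms(texts, n=3):
    """Return the n most frequent words across texts, ignoring @mentions."""
    counts = Counter()
    for text in texts:
        # Lowercase, remove @mentions, then keep alphabetic words only.
        cleaned = re.sub(r"@\w+", " ", text.lower())
        words = re.findall(r"[a-z']+", cleaned)
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [term for term, _ in counts.most_common(n)]


# Map each day to its most frequent words.
terms_by_day = {
    day: top_terms(group["full_text"])
    for day, group in sample.groupby(sample["created_at"].dt.date)
}
print(terms_by_day)
```

A proper topic model would replace `top_terms` with a fitted model per group; the grouping loop itself would stay the same.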