In [None]:
import os

def scale_input_data(scale_factor):
  # Write <file_base>.scaled.csv holding scale_factor times the original rows
  import pandas as pd
  import shutil
  file_bases = ['./input/users', './input/tweets']
  for file_base in file_bases:
    if scale_factor == 1.0:
      shutil.copyfile(file_base + '.csv', file_base + '.scaled.csv')
      continue
    df_to_scale = pd.read_csv(file_base + '.csv')
    new_num_rows = int(scale_factor * len(df_to_scale))
    if scale_factor <= 1.0:
      # Scaling down: keep only the first new_num_rows rows
      df_to_scale = df_to_scale.iloc[:new_num_rows]
    else:
      # Scaling up: keep appending copies of the leading rows until the target is reached
      while len(df_to_scale) < new_num_rows:
        rows_missing = new_num_rows - len(df_to_scale)
        df_to_scale = pd.concat([df_to_scale, df_to_scale[:rows_missing]])
    df_to_scale.to_csv(file_base + '.scaled.csv', index=False)

if 'INPUT_SCALE_FACTOR' in os.environ:
  scale_input_data(float(os.environ['INPUT_SCALE_FACTOR']))

# Russian Tweet Network and Time Series Visualization

The goal of this notebook is to clean messy data that is filled to the brim with natural language, and to explore it as a time series. Since a lot of this data sits in a contemporary political context, it's important to note how it lines up with political events during the period in question. No machine learning classification is done in this project for now: it's purely a visual exploration meant to understand the data.

I would love to know any tips and tricks people have for working with time series data in Python!

TODO:
3. Annotate TS plots with important political dates

In [1]:
import numpy as np #linear algebra
# import pandas as pd #data processing
exec(os.environ['IREWR_IMPORTS'])

# ALEX: remove plotting
# import seaborn as sns #visualization
# import matplotlib
# import matplotlib.pyplot as plt #visualization
# %matplotlib inline
# plt.style.use('bmh')

from datetime import datetime

In [2]:
users = pd.read_csv('./input/users.scaled.csv')

users.head()

Unnamed: 0,id,location,name,followers_count,statuses_count,time_zone,verified,lang,screen_name,description,created_at,favourites_count,friends_count,listed_count
0,18710816.0,near Utah Ave & Lighthouse an,Robby Delaware,304.0,11484.0,Pacific Time (US & Canada),False,en,RobbyDelaware,"I support the free movement of people, ideas a...",Wed Jan 07 04:38:02 +0000 2009,17.0,670.0,13.0
1,100345056.0,still ⬆️Block⤵️Corner⬇️street,#Ezekiel2517✨...,1053.0,31858.0,,False,en,SCOTTGOHARD,CELEBRITY TRAINER ✨#424W147th✨ #CrossfitCoach ...,Tue Dec 29 23:15:22 +0000 2009,2774.0,1055.0,35.0
2,247165706.0,"Chicago, IL",B E C K S T E R✨,650.0,6742.0,Mountain Time (US & Canada),False,en,Beckster319,Rebecca Lynn Hirschfeld Actress.Model.Writer.A...,Fri Feb 04 06:38:45 +0000 2011,7273.0,896.0,30.0
3,249538861.0,,Chris Osborne,44.0,843.0,,False,en,skatewake1994,,Wed Feb 09 07:38:44 +0000 2011,227.0,154.0,1.0
4,449689677.0,,Рамзан Кадыров,94773.0,10877.0,Moscow,False,ru,KadirovRussia,"Пародийный аккаунт. Озвучиваю то, что политика...",Thu Dec 29 11:31:09 +0000 2011,0.0,7.0,691.0


## Time Series Visualizations
This first section looks at a number of metrics over time. The first pair of plots shows the distribution of statuses per user and the distribution of followers per user; both are based on the users set. Following this, I parse the date each Twitter account was created and show two bar plots of accounts created by year and by month. Finally, I show the same information in a single time series plot using matplotlib.pyplot.plot, because seaborn.tsplot requires a lot of tinkering to get working properly. I gave up and took the easier route; I might change it to a seaborn.pointplot later on.
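
As a minimal sketch of the date-formatting step, the snippet below parses a single `created_at` string in the format that appears in the users table, using only the standard library. The names `TWITTER_FORMAT`, `sample`, and `year_month` are illustrative and do not come from the notebook.

```python
from datetime import datetime

# Format of the created_at column in the users table,
# e.g. "Wed Jan 07 04:38:02 +0000 2009"
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"

sample = "Wed Jan 07 04:38:02 +0000 2009"  # value from the first users row
created = datetime.strptime(sample, TWITTER_FORMAT)

# Year and month are the two fields the bar plots are grouped by
year_month = (created.year, created.month)
print(created.date(), year_month)
```

Because the format includes `%z`, the parsed value is timezone-aware; calling `.date()` on it, as the notebook does, discards the time and offset.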

The second section follows a similar process, but for the tweets dataset. I import the data, clean the date-time values, then extract the months and years and plot them in another time series chart using matplotlib.pyplot.plot. One important insight stands out immediately from these visualizations: the vast majority of users in this dataset created their accounts in 2013 or 2014, with a few trickling in later during 2015 and 2016, BUT the vast majority of tweets came during 2016, particularly during the fall. This timeline coincides with the post-convention general election campaigns along with a number of political events like the WikiLeaks dump of DNC emails. Even following Election Day in November of 2016, a number of tweets still came in during the transition period and shortly thereafter.
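
The claim about 2016 can be checked with a little arithmetic. The sketch below hard-codes the monthly tweet counts printed by the `timeseries.user_id` cell further down and computes the share of tweets per year; the dictionary layout and variable names are my own, and the counts exclude rows with a missing `user_id`.

```python
# Monthly tweet counts per year, copied from the timeseries.user_id output
monthly_counts = {
    2014: [12, 1, 1, 388, 342],
    2015: [1423, 1169, 1840, 1681, 1449, 1720, 1301, 874, 292, 980, 788, 3154],
    2016: [1407, 5073, 3779, 443, 1919, 1043, 7163, 11599, 25647, 27983, 21805, 19963],
    2017: [21060, 10149, 8022, 4591, 637, 474, 3863, 1345, 16],
}

yearly_totals = {year: sum(counts) for year, counts in monthly_counts.items()}
total = sum(yearly_totals.values())

# Share of all counted tweets that fall in each year
yearly_share = {year: count / total for year, count in yearly_totals.items()}
share_2016 = yearly_share[2016]

for year, share in yearly_share.items():
    print(year, yearly_totals[year], f"{share:.1%}")
```

On these counts, 2016 accounts for roughly two thirds of the tweets, with the largest single months in September through November.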

A couple of notes on the code:
1. In the first section I filtered the user data down to rows with non-NaN values, because seaborn.distplot did not drop them automatically and kept returning an error: useful to note for later analysis.
2. I used df.assign to create my new dataframe columns when working with the date-time metrics. This was an easy and logically straightforward way to create new values without triggering the dreaded "set on a copy of a slice" warning that pandas throws so often. In addition, for the date-time data it was necessary to set the index of the dataframe to the date-time values, which made it easy to group the data by month and year. The alternative would have been to create two new columns, one for months and one for years. However, I found that this made it harder to build a time series representation of the data: when I grouped by those two columns, as I originally tried, the result had a two-level index, which kept returning errors when I attempted to plot it.
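
The pattern from the second note can be sketched on a toy frame. This is not the notebook's exact code: it swaps the per-row `strptime` call for `pd.to_datetime`, and the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the users table; the last row has a missing timestamp
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "created_at": [
        "Wed Jan 07 04:38:02 +0000 2009",
        "Tue Dec 29 23:15:22 +0000 2009",
        "Fri Feb 04 06:38:45 +0000 2011",
        None,
    ],
})

# Parse the whole column at once; missing or unparseable values become NaT
parsed = pd.to_datetime(
    toy["created_at"], format="%a %b %d %H:%M:%S %z %Y", errors="coerce", utc=True
)

# assign() returns a new frame, then the parsed dates become the index
toy = toy.assign(date=parsed).dropna(subset=["date"]).set_index("date")

# Grouping on the index's year and month gives one count per (year, month)
per_month = toy.groupby([toy.index.year, toy.index.month]).size()
print(per_month)
```

Using `size()` counts rows per group directly, whereas the notebook's `count()` returns per-column non-null counts and then picks one column.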

In [3]:
# ALEX: remove plotting
# f, ax = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
# sns.distplot(users[np.isfinite(users.statuses_count)].statuses_count, ax=ax[0])
# sns.distplot(users[np.isfinite(users.followers_count)].followers_count, ax=ax[1])
# plt.show()
_ = users[np.isfinite(users.statuses_count)].statuses_count
_ = users[np.isfinite(users.followers_count)].followers_count

In [4]:
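form = '%a %b %d %H:%M:%S %z %Y'
# Parse each created_at string into a date; missing values become None
users = users.assign(date = users.created_at.map(
    lambda x: datetime.strptime(str(x), form).date() if pd.notnull(x) else None))
# Index by date so the rows can be grouped by month and year
users = users.set_index(pd.DatetimeIndex(users.date))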

In [5]:
monthseries = users.groupby(by=[users.index.month]).count()
YearSeries = users.groupby([users.index.year]).count()
# ALEX: remove plotting
# f, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))
# sns.barplot(monthseries.index, monthseries.id, ax=ax[0])
# sns.barplot(YearSeries.index, YearSeries.id, ax=ax[1])
# plt.show()
_ = monthseries.index
_ = monthseries.id
_ = YearSeries.index
_ = YearSeries.id

In [6]:
TimeSeries = users.groupby([users.index.year, users.index.month]).count()
# ALEX: remove plotting
# plt.figure(figsize=(12,6))
# TimeSeries.id.plot()
# plt.xticks(rotation=45)
# plt.ylabel('Number of New Users')
# plt.xlabel('Year, Month')
# plt.show()
TimeSeries.id

date    date
2009.0  1.0      1
        5.0      1
        11.0     1
        12.0     1
2011.0  2.0      2
        12.0     1
2012.0  1.0      1
        3.0      1
        12.0     1
2013.0  6.0     11
        7.0     13
        8.0     94
        9.0     16
        12.0     1
2014.0  2.0      1
        3.0      9
        4.0     12
        5.0     64
        6.0     49
        7.0      4
        8.0      7
        9.0      1
        10.0     4
        12.0     6
2015.0  1.0      1
        3.0     15
        6.0      1
        8.0      1
        9.0      3
        10.0     8
        11.0    13
        12.0     1
2016.0  1.0      2
        2.0      4
        3.0      1
        4.0      6
        5.0      3
        6.0      2
        7.0     18
        8.0      2
        10.0     1
Name: id, dtype: int64

In [7]:
tweets = pd.read_csv('./input/tweets.scaled.csv')

In [8]:
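form = '%Y-%m-%d %H:%M:%S'
# Parse each created_str value into a date; missing values become None
tweets = tweets.assign(date = tweets.created_str.map(
    lambda x: datetime.strptime(str(x), form).date() if pd.notnull(x) else None))
# Index by date so the rows can be grouped by month and year
tweets = tweets.set_index(pd.DatetimeIndex(tweets.date))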

In [9]:
timeseries = tweets.groupby([tweets.index.year, tweets.index.month]).count()
# ALEX: remove plotting
# plt.figure(figsize=(12,6))
# timeseries.user_id.plot()
# plt.xticks(rotation=45)
# plt.ylabel('Number of New Tweets')
# plt.xlabel('Year, Month')
# plt.show()
timeseries.user_id

date    date
2014.0  7.0        12
        8.0         1
        9.0         1
        11.0      388
        12.0      342
2015.0  1.0      1423
        2.0      1169
        3.0      1840
        4.0      1681
        5.0      1449
        6.0      1720
        7.0      1301
        8.0       874
        9.0       292
        10.0      980
        11.0      788
        12.0     3154
2016.0  1.0      1407
        2.0      5073
        3.0      3779
        4.0       443
        5.0      1919
        6.0      1043
        7.0      7163
        8.0     11599
        9.0     25647
        10.0    27983
        11.0    21805
        12.0    19963
2017.0  1.0     21060
        2.0     10149
        3.0      8022
        4.0      4591
        5.0       637
        6.0       474
        7.0      3863
        8.0      1345
        9.0        16
Name: user_id, dtype: int64

## Textual Analysis, Cleaning, and Twitter Handle Networks
This second section of the project was new ground for me. Textual analysis is something I haven't spent much time on, and this dataset was a good place to start. Thankfully, I learned that pandas has string functions that accept regular expressions. As the source code shows, I first made sure to copy the text I wanted to work with before doing anything to it. Then I extracted the Twitter handles from tweets that were merely retweets of other people; with these handles I could determine which accounts were retweeted most. After that, I replaced a number of string patterns with empty strings so that the word cloud would show the most important words: https links were removed, along with the Twitter handles, RT, amp, and co. Without removing these, they showed up in large type in the word cloud.
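
The cleaning steps can be sketched on a single string with the standard `re` module. The patterns here are tighter than the ones in the notebook (they stop at the first colon and only remove whole words), and the sample tweet is invented for illustration.

```python
import re

sample = "RT @example_user: Breaking news, welcome &amp; more https://t.co/abc123"

# Capture the handle that follows "RT @"; \w+ stops at the first colon
handle_match = re.search(r"@(\w+):", sample)
handle = handle_match.group(1) if handle_match else None

cleaned = re.sub(r"https\S*", "", sample)            # drop links
cleaned = re.sub(r"@\w+:?", "", cleaned)             # drop handles
cleaned = re.sub(r"\b(?:RT|amp|co)\b", "", cleaned)  # drop RT, amp, co as whole words
cleaned = " ".join(cleaned.split())                  # collapse leftover whitespace

# Without word boundaries, "co" is also removed from inside ordinary words
naive = re.sub(r"RT|amp|co", "", "welcome")

print(handle)
print(cleaned)
print(naive)
```

The last line shows why the word boundaries matter: a bare `RT|amp|co` pattern turns "welcome" into "welme".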

The word cloud itself was also new to me. I followed some code found elsewhere online, created one long string of text from the previously cleaned text data, then plotted the word cloud image without axes. Clearly, Donald Trump, Trump, Obama, Hillary Clinton, and Hillary are the most used words in the dataset, which follows since these were mostly politically motivated tweets.
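
A word cloud is essentially a picture of word frequencies, so the same information can be approximated as a table. The sketch below is a stand-in that uses `collections.Counter` rather than the wordcloud package; the tweets and the tiny stopword set are made up for illustration.

```python
from collections import Counter

# Invented, already-cleaned tweets standing in for the `tags` series
cleaned_tweets = [
    "Trump rally tonight",
    "Hillary and Trump debate",
    "the debate is tonight",
    "Trump wins the debate",
]

# One long string, as in the notebook, then split back into lowercase tokens
text = " ".join(cleaned_tweets)
stopwords = {"the", "is", "and", "a", "to"}  # hypothetical minimal stopword list
tokens = [word for word in text.lower().split() if word not in stopwords]

word_counts = Counter(tokens)
print(word_counts.most_common(3))
```

The words with the highest counts are the ones a word cloud would draw largest.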

After creating the word cloud, I analyzed the retweeted users. I checked whether any of these retweets were of other accounts in the dataset. Originally I found none, but I realized I was conducting the analysis incorrectly: my extracted retweet handles included both the @ symbol and the colon (:). I removed the colon and anything that followed it, leaving just the @user_name string. After this, I got the unique names from the user_key column of the tweets dataset and from the retweets already collected. I did this by counting each username with value_counts() and then taking the index of the result (value_counts().values returns the counts rather than the names, a mistake I originally made). To make sure the names lined up, I added an @ symbol to the beginning of each user_key obtained from the original dataset. From this, I obtained a list of user names that are simultaneously PART of the dataset and retweeted BY the dataset: this suggests a networking cascade effect, which I then attempted to quantify.
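
The difference between `.index` and `.values` on a `value_counts()` result, and the matching of handles, can be sketched on toy data. The two series below are invented; only the method calls mirror the notebook.

```python
import pandas as pd

# Invented data: authors of tweets, and handles pulled from "RT @x:" text
user_keys = pd.Series(["alice", "alice", "bob", "carol"])
retweeted = pd.Series(["@bob", "@dave", "@bob", None, "@alice"])

counts = retweeted.value_counts()      # missing values are dropped by default
retweeted_handles = set(counts.index)  # .index holds the names
retweet_counts = counts.values         # .values holds the counts

# Prefix dataset accounts with @ so they are comparable to the extracted handles
dataset_handles = ["@" + key for key in user_keys.value_counts().index]

# Accounts that are both in the dataset and retweeted by it
cascade = sorted(h for h in dataset_handles if h in retweeted_handles)
print(cascade)
```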

After extracting the user names from both sets, it took a lot of trial and error to quantify the overlap: pandas.Index has an intersection method that takes a second set of values, as in index1.intersection(values), which let me get a series of usernames along with the number of tweets each one has in the dataset!
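
A minimal sketch of the `Index.intersection` step, on invented data: `tweets_per_user` stands in for `tweets.user_key.value_counts()` and `cascade` for the list of overlapping accounts.

```python
import pandas as pd

# Invented tweet counts per account
tweets_per_user = pd.Series({"alice": 10, "bob": 7, "carol": 3})

# Candidate accounts; "zed" is not in the series and is simply ignored
cascade = ["bob", "alice", "zed"]

# Keep only the labels present in both, then select those rows by label
shared = tweets_per_user.index.intersection(cascade)
network = tweets_per_user.loc[shared]
print(network)
```

Selecting with the intersected index avoids the KeyError that a plain label lookup would raise for names missing from the series.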

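A couple of things stand out, along with a caveat about what the numbers measure. First, the 37,807 reported by describe() in the output below is the number of distinct retweeted handles, not the total number of retweets. Second, the ~35,000 figure is the number of tweets authored by accounts that are both in the dataset and retweeted by it, because network is built from tweets.user_key.value_counts(). That is roughly 18% of the total malicious tweets dataset, so if there was a cascade effect it likely wasn't very large by comparison.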
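
The figures in this section can be sanity-checked with a little arithmetic on the printed outputs. The sketch assumes the usual pandas semantics, where `value_counts().describe()` reports the number of distinct values as `count` and the average occurrences per value as `mean`; the variable names are my own.

```python
# Figures copied from the printed outputs of the cells below
distinct_handles = 37807        # `count` from value_counts().describe()
mean_per_handle = 3.950485      # `mean` from the same output
tweets_by_cascade_accounts = 35723
fraction_of_dataset = 0.17555852606127323

# count * mean recovers how many tweets had a handle extracted at all
extracted_total = distinct_handles * mean_per_handle

# The fraction was computed as sum / len(tweets), so len(tweets) can be backed out
implied_tweet_rows = round(tweets_by_cascade_accounts / fraction_of_dataset)

print(round(extracted_total), implied_tweet_rows)
```

Both results are approximate, since `mean` is printed to six decimal places.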

In [10]:
tags = tweets.text.copy()

# Extract the handle a retweet came from; retweets follow a "RT @XXXXX:" format.
# The greedy pattern can run past the first colon; the excess is stripped in a later cell.
retweets = tags.str.extract(r'(@.*:)', expand=True)

# Gets rid of website links
tags = tags.replace(r'https.*$','',regex=True)
# Gets rid of twitter handles
tags = tags.replace(r'@.*:','',regex=True)
# Gets rid of RT, amp, and co as whole words only, so words like "come" stay intact
tags = tags.replace(r'\b(?:RT|amp|co)\b','',regex=True)

In [11]:
# ALEX: remove plotting
# from wordcloud import WordCloud, STOPWORDS

text = ' '.join([str(x) for x in tags.values])

# ALEX: remove plotting
# wc = WordCloud(stopwords=STOPWORDS,background_color='white',max_words=200,scale=3).generate(text)
# plt.figure(figsize=(15,15))
# plt.axis('off')
# plt.imshow(wc)
# plt.show()

In [12]:
retweets = retweets.replace(':.*','',regex=True)
print(retweets[0].value_counts().describe())

count    37807.000000
mean         3.950485
std         19.409811
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max       2207.000000
Name: 0, dtype: float64


In [13]:
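# Handles that were retweeted somewhere in the dataset (NaN dropped)
retweet_counts = retweets[0].value_counts()
user_retweeted = retweet_counts.index[~pd.isnull(retweet_counts.index)]
# Accounts that authored tweets in the dataset, prefixed with @ so the names line up
retweeted_user = ['@'+x for x in tweets.user_key.value_counts().index]

# Accounts that are both part of the dataset and retweeted by it
cascade = [x for x in retweeted_user if x in user_retweeted]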

In [14]:
# easy method to remove the @ symbol again and make a clean user_key set
cascade = pd.Series(cascade).replace('@','',regex=True)
network = tweets.user_key.value_counts()
# wish I knew of this previously, kept trying to join two pd.Series which pandas isn't a fan of
network = network[network.index.intersection(cascade.values)]
network

hyddrox            6813
mrclydepratt       3263
brianaregland      3261
melanymelanin      3212
mr_clampin         3010
traceyhappymom     2990
queenofthewo       2988
heyits_toby        2909
tpartynews         1892
hollandpatrickk    1765
cassishere          741
wadeharriot         732
happkendrahappy     451
gloed_up            327
rightnpr            267
patriototus         165
camosaseko          145
holycrapchrix        92
brightandglory       76
blackmattersus       67
lagonehoe            65
dannythehappies      56
cascaseyp            52
hipppo_              52
toneporter           44
4mysquad             43
maymaymyy            42
heyheyhailey         42
claudia42kern        39
aantiracist          37
erdollum             20
abigailssilk         17
nj_blacknews         17
riogithief           10
dontshootcom         10
instotus              3
jrrbrtt               2
ssus_panther          1
gwen_garland          1
iris0_o               1
handsome_henson       1
wildharee       

In [15]:
print('Total number of tweets from accounts in the dataset that were also retweeted by it: {}'.format(network.values.sum()))
print('Fraction of total dataset: {}'.format(network.values.sum()/len(tweets)))

Total number of tweets from accounts in the dataset that were also retweeted by it: 35723
Fraction of total dataset: 0.17555852606127323


If you liked this notebook or can think of other things I might try and do with it let me know! I will likely come back to this if I come up with some other kind of ideas for it.