### Seamless Bay Area Tweet Analysis: Part 3, Pre-Processing


The goal of this project is to analyze the Twitter account of the nonprofit group Seamless Bay Area and determine what makes a tweet high-impact, as measured by engagements.

In part three we pre-process the data to prepare it for modeling.

In [1]:
# Load necessary libraries
import pandas as pd
import numpy as np
import statistics as stat
import re
from sklearn import linear_model
from sklearn.linear_model import LinearRegression

In [2]:
df = pd.read_csv("/Users/grahamsmith/Documents/SpringboardWork/Seamless_Twitter_Analysis/cleaned tweets.csv")

In [3]:
#once again, here is our data for reference
df.head()

Unnamed: 0.1,Unnamed: 0,Tweet text,time,impressions,engagements,engagement rate,retweets,replies,likes,user profile clicks,url clicks,tweet words
0,0,@MTCBATA is looking for a new Executive Direct...,2018-10-27 18:01:00,124.0,5.0,0.040323,0.0,0.0,0.0,1.0,4.0,"['@mtcbata', 'is', 'looking', 'for', 'a', 'new..."
1,1,Ultimate seamlessness. https://t.co/CdCLrg2o6a,2018-10-26 14:24:00,345.0,10.0,0.028986,0.0,0.0,1.0,0.0,9.0,"['ultimate', 'seamlessness.', 'https://t.co/cd..."
2,2,Help Nix Prop 6! Save funding for more seamles...,2018-10-26 02:28:00,994.0,19.0,0.019115,4.0,0.0,5.0,3.0,5.0,"['help', 'nix', 'prop', 'save', 'funding', 'fo..."
3,3,It doesn't have to be this way! Let's get to f...,2018-10-23 23:29:00,792.0,7.0,0.008838,2.0,0.0,4.0,1.0,0.0,"['it', ""doesn't"", 'have', 'to', 'be', 'this', ..."
4,4,And then come say hi at next month’s @SPUR_Urb...,2018-10-23 23:09:00,532.0,3.0,0.005639,0.0,0.0,1.0,2.0,0.0,"['and', 'then', 'come', 'say', 'hi', 'at', 'ne..."


We need to perform all the feature engineering that I suspect will be necessary for the modeling step. Specifically, I want to identify links/attached media, @replies (when the tweet mentions another Twitter account), calls to action, and sentiment score.

First we do some wrangling to pull the links out of every tweet.

In [4]:
#add a space to the end of every tweet so we can find links at the end of tweets
df['Tweet text'] = df['Tweet text'] + ' '

In [5]:
#find every link: "https://" followed by a run of non-whitespace characters
#(\S+ stops at the first space, so any text after a link is not captured)
df['links'] = df['Tweet text'].apply(lambda text: re.findall(r'https://\S+', text))

In [6]:
#join each tweet's links into one comma-separated string (empty if the tweet has none)
df['links'] = df['links'].apply(', '.join)

In [7]:
#double check that it looks good
df['links']

0                                https://t.co/Syf9exwPTd
1                                https://t.co/CdCLrg2o6a
2       https://t.co/qM4M7tCFVO, https://t.co/2379qGdY2D
3                                https://t.co/FczQtLbH5d
4                                                       
                              ...                       
2244                             https://t.co/3xZS0tU1xD
2245                             https://t.co/43v9okaDWZ
2246                             https://t.co/VVZZuPjmw1
2247                                                    
2248                             https://t.co/BJMSIraGwy
Name: links, Length: 2249, dtype: object
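
For modeling, these link strings would likely need to become numbers. Here is a minimal sketch of one way to do that, using toy tweets in place of `df['Tweet text']`. The names `link_count` and `has_link` are hypothetical and are not columns used elsewhere in this notebook.

```python
import pandas as pd

# Toy tweets standing in for df['Tweet text'] (assumption: not the real data)
tweets = pd.Series([
    'Ultimate seamlessness. https://t.co/CdCLrg2o6a',
    'Two links https://t.co/qM4M7tCFVO https://t.co/2379qGdY2D here',
    'No links in this one',
])

# Each element becomes a list of the links found in that tweet
found = tweets.str.findall(r'https://\S+')

# Hypothetical numeric features: how many links, and whether there is at least one
link_count = found.str.len()
has_link = (link_count > 0).astype(int)

print(pd.DataFrame({'link_count': link_count, 'has_link': has_link}))
```

Whether the count or the simple yes/no flag is more useful is something the modeling step would have to decide.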

Next we'll pull out all the mentions of other Twitter accounts (the @replies) contained within the tweets.

In [8]:
#for some reason the way I split the words previously didn't work, so I've gone over it again
replies = []
for x in range(len(df)):
    a = re.split(' ', df['Tweet text'].iloc[x])
    replies.append(a)

df['replies_sentance'] = replies

In [9]:
#find all the replies, aka sub-strings starting with @
#note: each match is kept wrapped in its own list, so every row is a list of one-item lists
df['replies'] = [
    [re.findall(r'@.*', word) for word in words if '@' in word]
    for words in df['replies_sentance']
]

In [19]:
#collect the first mention from each tweet that has one
#(loop over the values directly; using a list value as an index into df['replies'] raises a KeyError)
a = []
for x in df['replies']:
    if len(x) > 0:
        a.append(x[0])

With the mentions extracted, we now turn them into dummy variables for the models we'll be building.

In [11]:
#it's probably easiest to do this in a new dataframe, so we need to add engagements back in
ats = df['replies'].str.get_dummies()
ats['engagements'] = df['engagements']

In [30]:
#strip the square brackets from one row to see what the mentions look like as plain text
a = re.sub(r'\[', '', str(df['replies'][2245]))
b = re.sub(r'\]', '', a)
b


"'@alevin', '@anniefryman', '@Scott_Wiener', '@gillibits', '@KG_DC'"

In [12]:
ats.head()

Unnamed: 0,"[[""@AsmMarcBerman's""]]","[[""@Caltrain's""]]","[[""@DavidChiu's""]]","[[""@DavidChiu's]""]]","[[""@GavinNewsom's""], ['@CA_Trans_Agency'], ['@MTCBATA']]","[[""@MTCBATA's""]]","[[""@MetroTransitMN's""]]","[[""@MosesMaynez'""]]","[[""@RepHankJohnson's""]]","[[""@SFBayFerry's""]]",...,"[['@wgyn_'], ['@111MinnaGallery'], ['@DavidChiu']]","[['@willplancal'], ['@SFTRU'], ['@svtransitusers'], ['@chrisarvinsf'], ['@DSA_SF'], ['@_KennyUong_'], ['@graue'], ['@zachlipton'], ['@zdeutschgross'], ['@alevin'], ['@lateefahsimon'], ['@JaniceForBART'], ['@BevanDufty'], ['@GenesisCali'], ['@Urban_Habitat'], ['@transform'], ['@ylinstitute'], ['@SVILC'], ['@ChoiceinAging'], ['@TheCILOfficial'], ['@BikeEastBay'], ['@bikesv'], ['@planetacterra']]","[['@woolie'], ['@jwalshie'], ['@bayareametro.gov']]","[['@woolie'], ['@jwalshie']]","[['@xentrans'], ['@Theysaurus'], ['@SFTRU'], ['@xentrans!']]","[['@xplosneer'], ['@TransForm_Alert'], ['@MTCBATA']]",[['@yinglingfan']],[['@zigdon']],[],engagements
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,5.0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,10.0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,19.0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,7.0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.0
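
Note that the columns above are whole combinations of mentions (one column per unique list) rather than one column per account, and possessives like `@MTCBATA's` count as separate handles. If one column per account is wanted instead, a rough sketch follows. It assumes handles contain only letters, digits, and underscores, and it uses toy tweets rather than the real `df`; `mentions` and `dummies` are hypothetical names.

```python
import pandas as pd

# Toy tweets standing in for df['Tweet text'] (assumption: not the real data)
tweets = pd.Series([
    'Thanks @MTCBATA and @SFTRU!',
    "Reading @MTCBATA's new plan",
    'No mentions here',
])

# @\w+ stops at the first character that is not a letter, digit, or underscore,
# so "@MTCBATA's" and "@SFTRU!" collapse to "@MTCBATA" and "@SFTRU"
mentions = tweets.str.findall(r'@\w+')

# Join each tweet's handles with '|' and let get_dummies split on that separator,
# which yields one 0/1 column per individual account
dummies = mentions.str.join('|').str.get_dummies(sep='|')

print(dummies)
```

This gives far fewer columns than the combination-based table, though rarely mentioned accounts would still produce very sparse columns.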


Great, we'll be using that in the next notebook. For now I'll move on to the last feature I want to create: sentiment score. I'll do this by getting a list of positive and negative words, then comparing each tweet against it and assigning a score from -1 to 1 based on how many (if any) of those words it contains. The list was downloaded from Kaggle (https://www.kaggle.com/datasets/mukulkirti/positive-and-negative-word-listrar).

In [14]:
words = pd.read_excel('/Users/grahamsmith/Documents/SpringboardWork/Positive and Negative Word List.xlsx')

In [15]:
words

Unnamed: 0.1,Unnamed: 0,Negative Sense Word List,Positive Sense Word List
0,0,,
1,1,abnormal,able
2,2,abolish,abundance
3,3,abominable,accelerate
4,4,abominably,accept
...,...,...,...
4716,4716,zenana,
4717,4717,zephyr,
4718,4718,zero,
4719,4719,zol,
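
The scoring rule isn't pinned down yet, so here is one possible sketch. It assumes the score is (positive matches - negative matches) / (total matches), with 0 when a tweet matches nothing, which keeps the result between -1 and 1. The function `sentiment_score` and the toy frame `words_demo` are hypothetical; only the two column names come from the spreadsheet above.

```python
import re
import pandas as pd

def sentiment_score(text, positive_words, negative_words):
    """Assumed rule: (pos - neg) / (pos + neg), or 0.0 if no listed word appears."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(token in positive_words for token in tokens)
    neg = sum(token in negative_words for token in tokens)
    total = pos + neg
    return (pos - neg) / total if total else 0.0

# Toy stand-in for the Kaggle word list, with the same column names and a blank first row
words_demo = pd.DataFrame({
    'Negative Sense Word List': [None, 'abnormal', 'abolish'],
    'Positive Sense Word List': [None, 'able', 'abundance'],
})

# Sets make the membership checks fast; dropna removes the blank cells
positive = set(words_demo['Positive Sense Word List'].dropna().str.lower())
negative = set(words_demo['Negative Sense Word List'].dropna().str.lower())

tweets_demo = pd.Series([
    'An abundance of able riders',
    'Abolish this abnormal delay',
    'Able but abnormal',
    'See you there',
])
scores_demo = tweets_demo.apply(sentiment_score, args=(positive, negative))
print(scores_demo)
```

Dividing by the number of matched words rather than the tweet length means a single positive word scores 1.0 no matter how long the tweet is; normalizing by total word count would be a reasonable alternative.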


In [None]:
'/Users/grahamsmith/Documents/SpringboardWork/SentiWordNet/data/SentiWordNet_3.0.0.txt'