### Seamless Bay Area Tweet Analysis: Part 3, Pre-Processing


The goal of this project is to analyze the Twitter account of the nonprofit group Seamless Bay Area and determine what makes a tweet high-impact, as measured by engagements.

In part three we pre-process the data and perform the feature engineering that I suspect the modeling step will need. Specifically, I want to identify links/attached media, @replies (when the tweet references another Twitter account), calls to action, and sentiment score.

In [339]:
# Load necessary libraries
import pandas as pd
import numpy as np
import statistics as stat
import re
from sklearn import linear_model
from sklearn.linear_model import LinearRegression

In [340]:
df = pd.read_csv("/Users/grahamsmith/Documents/SpringboardWork/Seamless_Twitter_Analysis/cleaned tweets.csv")

In [341]:
#once again, here is our data for reference
df.head()

Unnamed: 0.1,Unnamed: 0,Tweet text,time,impressions,engagements,engagement rate,retweets,replies,likes,user profile clicks,url clicks,tweet words
0,0,@MTCBATA is looking for a new Executive Direct...,2018-10-27 18:01:00,124.0,5.0,0.040323,0.0,0.0,0.0,1.0,4.0,"['@mtcbata', 'is', 'looking', 'for', 'a', 'new..."
1,1,Ultimate seamlessness. https://t.co/CdCLrg2o6a,2018-10-26 14:24:00,345.0,10.0,0.028986,0.0,0.0,1.0,0.0,9.0,"['ultimate', 'seamlessness.', 'https://t.co/cd..."
2,2,Help Nix Prop 6! Save funding for more seamles...,2018-10-26 02:28:00,994.0,19.0,0.019115,4.0,0.0,5.0,3.0,5.0,"['help', 'nix', 'prop', 'save', 'funding', 'fo..."
3,3,It doesn't have to be this way! Let's get to f...,2018-10-23 23:29:00,792.0,7.0,0.008838,2.0,0.0,4.0,1.0,0.0,"['it', ""doesn't"", 'have', 'to', 'be', 'this', ..."
4,4,And then come say hi at next month’s @SPUR_Urb...,2018-10-23 23:09:00,532.0,3.0,0.005639,0.0,0.0,1.0,2.0,0.0,"['and', 'then', 'come', 'say', 'hi', 'at', 'ne..."


First we do a bunch of wrangling to get the links from every tweet.

In [342]:
#add a space to the end of every tweet so we can find links at the end of tweets
temp = []
for x in range(len(df)):
    temp.append(df['Tweet text'][x] + ' ')
df['Tweet text'] = temp

In [343]:
#find every sub-string that's "https:// + some characters + a space"
links = []
for x in range(len(df)):
    a = re.findall(r'https://.* ', df['Tweet text'].iloc[x])
    links.append(a)
df['links'] = links

In [344]:
#convert each list of matches to a string, split it on whitespace, and strip
#the list punctuation so that each item is a clean string of links
temp = []
for x in range(len(df)):
    temp.append(re.split(r'\s', str(df['links'][x])))

for x in range(len(temp)):
    #drop the last piece, which is the leftover after the trailing space
    cleaned = str(temp[x][0:-1])
    #remove the brackets and quotes left over from the str() conversions
    temp[x] = re.sub(r'[\[\]\'"]', '', cleaned)

df['links'] = temp

In [345]:
#double check that it looks good
df['links']

0                                https://t.co/Syf9exwPTd
1                                https://t.co/CdCLrg2o6a
2       https://t.co/qM4M7tCFVO, https://t.co/2379qGdY2D
3                                https://t.co/FczQtLbH5d
4                                                       
                              ...                       
2244                             https://t.co/3xZS0tU1xD
2245                             https://t.co/43v9okaDWZ
2246                             https://t.co/VVZZuPjmw1
2247                                                    
2248                             https://t.co/BJMSIraGwy
Name: links, Length: 2249, dtype: object
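
The pattern `https://.* ` is greedy: it matches from the first link to the last space in the tweet, so any words sitting between two links are swept into the match. Below is a minimal sketch of a more direct extraction, assuming links never contain whitespace. It uses `\S+` with pandas' `Series.str.findall`; the sample tweets are made up for illustration and are not from the real dataset.

```python
import pandas as pd

# Hypothetical sample tweets, not taken from the real dataset
sample = pd.DataFrame({'Tweet text': [
    'Ultimate seamlessness. https://t.co/aaa',
    'Read https://t.co/bbb and then https://t.co/ccc today',
    'No links here',
]})

# \S+ stops at the first whitespace, so each link is captured separately
# and no trailing space has to be appended to the tweet first
sample['links'] = sample['Tweet text'].str.findall(r'https://\S+')

# Join into one comma-separated string per tweet, matching the format above
sample['links'] = sample['links'].str.join(', ')
print(sample['links'].tolist())
```

With this approach the string conversion and bracket/quote stripping are not needed, because the matches stay a list until they are joined.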

Next we'll pull out all the replies to other Twitter accounts contained within the tweets.

In [346]:
#for some reason the way I split the words previously didn't work, so I've gone over it again
replies = []
for x in range(len(df)):
    a = re.split(' ', df['Tweet text'].iloc[x])
    replies.append(a)

df['replies_sentance'] = replies

In [347]:
#find all the replies, aka sub-strings starting with @
temp3 = []
for z in range(len(df)):
    temp = []
    for x in df['replies_sentance'][z]:
        temp.append(re.findall(r'@.*', x))
    temp2 = []
    for y in temp:
        if len(y) > 0:
            temp2.append(y)
    temp3.append(temp2)
df['replies'] = temp3
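
The pattern `@.*` keeps everything after the `@` in each space-separated token, including trailing punctuation, which is why names such as `@wpusanews.` and `@xentrans!` show up in the dummy columns further down. Here is a minimal sketch of a tighter pattern, assuming handles consist only of letters, digits, and underscores. The helper name `find_mentions` is hypothetical.

```python
import re

def find_mentions(text):
    """Return every @handle in text, without trailing punctuation."""
    # Assumption: handles are made of letters, digits and underscores only
    return re.findall(r'@\w+', text)

print(find_mentions("Thanks @wpusanews. and @xentrans! Ride @Caltrain's new trains"))
```

Because the pattern runs on the whole tweet, the tweet does not need to be split on spaces first. One caveat: it would also pick up the part after the `@` in an email address.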

In [348]:
#put it into a list
temp = []
for x in df['replies']:
    temp.append(list(x))

In [349]:
#I couldn't get the dummies to work further down so I converted it into a string, and then split
#it again
temp = []
for x in df['replies']:
    if len(x) > 0:
        a = re.sub(r'\[', '', str(x))
        b = re.sub(r'\]', '', a)
        c = re.sub(r"'", '', b)
        d = re.sub(r',', '', c)
        temp.append(re.split(' ', d))
    else:
        temp.append('')

In [350]:
#split each of the first 4 replies into a separate column, so that dummies can be made
rep = []
rep1 = []
rep2 = []
rep3 = []

for x in range(len(temp)):
    if len(temp[x]) == 0:
        rep.append('')
        rep1.append('')
        rep2.append('')
        rep3.append('')
    if len(temp[x]) == 1:
        rep.append(temp[x][0])
        rep1.append('')
        rep2.append('')
        rep3.append('')
    if len(temp[x]) == 2:
        rep.append(temp[x][0])
        rep1.append(temp[x][1])
        rep2.append('')
        rep3.append('')
    if len(temp[x]) == 3:
        rep.append(temp[x][0])
        rep1.append(temp[x][1])
        rep2.append(temp[x][2])
        rep3.append('')
    if len(temp[x]) >= 4:
        rep.append(temp[x][0])
        rep1.append(temp[x][1])
        rep2.append(temp[x][2])
        rep3.append(temp[x][3])

In [351]:
#build a dataframe that we'll use for the dummies
df1 = pd.DataFrame()
df1['engagements'] = df['engagements']
df1['rep'] = rep
df1['rep1'] = rep1
df1['rep2'] = rep2
df1['rep3'] = rep3

In [352]:
#double check that it looks okay with one dummy per column
df1.head()

Unnamed: 0,engagements,rep,rep1,rep2,rep3
0,5.0,@MTCBATA,,,
1,10.0,,,,
2,19.0,,,,
3,7.0,,,,
4,3.0,@SPUR_Urbanist,@icgee,,


In [353]:
#split into dummies
df1 = pd.get_dummies(df1)
df1

Unnamed: 0,engagements,rep_,"rep_""@AsmMarcBermans""","rep_""@Caltrains""","rep_""@DavidChius""","rep_""@GavinNewsoms""","rep_""@MTCBATAs""","rep_""@MetroTransitMNs""","rep_""@MosesMaynez""","rep_""@RepHankJohnsons""",...,rep3_@sfmta_muni,rep3_@skbarz,rep3_@stevepepple,rep3_@theGreaterMarin,rep3_@thecliffbar,rep3_@transpoakland,rep3_@urbenschneider,rep3_@wpusanews.,rep3_@xentrans!,rep3_@yimbyyy
0,5.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,10.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,19.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,7.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,3.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2244,13.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2245,6.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2246,2.0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2247,6.0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
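
Because the dummies are built per column, the same account gets a separate indicator for each position it can appear in (`rep_`, `rep1_`, `rep2_`, `rep3_`). Below is a sketch of an alternative, assuming we only care whether an account is mentioned at all: join each tweet's mentions with a separator and call `Series.str.get_dummies`. The sample mentions are hypothetical.

```python
import pandas as pd

# Hypothetical mentions per tweet; an empty list means no mentions
mentions = pd.Series([
    ['@MTCBATA'],
    [],
    ['@SPUR_Urbanist', '@icgee'],
    ['@icgee', '@MTCBATA'],
])

# Join with '|' (assumed never to occur in a handle), then build one 0/1
# column per account regardless of its position in the tweet
dummies = mentions.str.join('|').str.get_dummies(sep='|')
print(dummies)
```

This yields one column per account rather than up to four, and a tweet with no mentions is simply a row of zeros.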


In [354]:
#just out of curiosity, is there a linear relationship between replies and engagements?
scores = []
for reply in df1.columns[15:]:
    x = np.array(df1[str(reply)]).reshape(-1, 1)
    y = df1['engagements']
    model = LinearRegression().fit(x, y)
    scores.append(model.score(x, y))
np.max(scores)

0.024589337340945483
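
The loop fits one single-feature regression per dummy column and keeps the best R². For a simple linear regression with an intercept, R² equals the squared Pearson correlation between the feature and the target, so the same number can be checked without fitting a model. The sketch below illustrates this on synthetic data; the arrays `x` and `y` are stand-ins, not the real columns.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins: a 0/1 mention indicator and an engagement count
x = rng.integers(0, 2, size=200)
y = 5 + 3 * x + rng.normal(0, 4, size=200)

# R^2 from an ordinary least-squares line with intercept
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r2_fit = 1 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

# Squared Pearson correlation gives the same value for a single feature
r2_corr = np.corrcoef(x, y)[0, 1] ** 2

print(round(r2_fit, 6), round(r2_corr, 6))
```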

In [218]:
regr.score(X_test, Y_test)

-4.144548603770765e+25

We'll be using that in the next notebook. For now I'll move on to the last feature I want to create: sentiment score. I'll do this by taking a list of positive and negative words, comparing each tweet against it, and assigning the tweet a score from -1 to 1 based on how many (if any) of those words it contains. The list was downloaded from Kaggle (https://www.kaggle.com/datasets/mukulkirti/positive-and-negative-word-listrar).

In [12]:
words = pd.read_excel('/Users/grahamsmith/Documents/SpringboardWork/Positive and Negative Word List.xlsx')

In [13]:
words

Unnamed: 0.1,Unnamed: 0,Negative Sense Word List,Positive Sense Word List
0,0,,
1,1,abnormal,able
2,2,abolish,abundance
3,3,abominable,accelerate
4,4,abominably,accept
...,...,...,...
4716,4716,zenana,
4717,4717,zephyr,
4718,4718,zero,
4719,4719,zol,


First we find the negative words in each tweet.

In [250]:
temp = []
for x in words['Negative Sense Word List']:
    temp.append(str(x))
words['Negative Sense Words'] = temp

In [257]:
temp = []
for x in df['Tweet text']:
    temp.append([ele for ele in list(words['Negative Sense Words']) if(ele in str(x))])

In [259]:
df['Negative words'] = temp

Then we find the positive words in each tweet.

In [261]:
temp = []
for x in words['Positive Sense Word List']:
    temp.append(str(x))
words['Positive Sense Words'] = temp

In [262]:
temp = []
for x in df['Tweet text']:
    temp.append([ele for ele in list(words['Positive Sense Words']) if(ele in str(x))])

In [263]:
df['Positive words'] = temp
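
Note that `ele in str(x)` is a substring test, so "able" is found inside "available". Missing entries in the word list also become the string `'nan'` after `str()`, which would match any tweet containing those letters, for example inside "finance". Here is a minimal sketch of whole-word, case-insensitive matching, assuming that is the behavior wanted. The tiny word list and the helper `matched_words` are hypothetical.

```python
import re
import pandas as pd

# Tiny hypothetical word list with a missing value, like the padded columns above
positive = pd.Series(['able', 'accept', None])

# Drop missing entries and lower-case, then use a set for fast lookups
positive_set = set(positive.dropna().str.lower())

def matched_words(text, vocabulary):
    """Return the tokens of text that appear in vocabulary as whole words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok in vocabulary]

print(matched_words('Funding is available; we accept the finance plan', positive_set))
```

Only "accept" is returned here; "available" and "finance" no longer trigger matches on "able" or "nan".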

Sentiment score is calculated by subtracting the number of negative words in the tweet from the number of positive words, then dividing the difference by the total number of words in the tweet.

In [276]:
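temp = []
for x in range(len(df)):
    n_pos = len(df['Positive words'][x])
    n_neg = len(df['Negative words'][x])
    #count words (not characters); max() avoids dividing by zero
    n_words = max(len(df['Tweet text'][x].split()), 1)
    temp.append((n_pos - n_neg) / n_words)
df['Sentiment Score'] = temp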
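
As a worked example of the formula, here is a small sketch with hypothetical counts. The function name `sentiment_score` is an assumption for illustration. Note the parentheses around the subtraction: in Python, `a - b / c` divides only `b` by `c`.

```python
def sentiment_score(n_positive, n_negative, n_words):
    """(positive - negative) / total words; 0.0 for an empty tweet."""
    if n_words == 0:
        return 0.0
    return (n_positive - n_negative) / n_words

# A 10-word tweet with 3 positive words and 1 negative word
print(sentiment_score(3, 1, 10))  # 0.2
```

As long as every matched word is one of the tweet's own words, the score stays between -1 (all negative) and 1 (all positive).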

In the next notebook we will begin modeling.

In [None]:
df.to_csv('/Users/grahamsmith/Documents/SpringboardWork/Seamless_Twitter_Analysis/tw features.csv', date_format='%Y-%m-%d %H:%M:%S')