# **Exploratory Data Analysis and Data Cleaning**
Some questions I have:
- How many tweets are not people sharing their score?
- For people sharing their score, how much additional (non-standardized) text do they add?

**Load Data**


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('wordle_v1.csv')

print(len(df))
df.head()

487


Unnamed: 0,datetime,text,retweet_count,reply_count,like_count
0,2022-01-28 23:59:59+00:00,Wordle 223 5/6\n\n⬛⬛⬛⬛⬛\n🟨🟩⬛⬛⬛\n⬛🟩🟩⬛⬛\n⬛🟩🟩🟩🟩\n...,0,0,1
1,2022-01-28 23:59:58+00:00,Wordle 224 5/6\n\n⬜🟨⬜⬜🟨\n🟨🟩⬜🟨⬜\n⬜🟩🟩⬜🟩\n⬜🟩🟩🟩🟩\n...,0,0,0
2,2022-01-28 23:59:58+00:00,Wordle 224 4/6*\n\n⬜🟨⬜⬜🟨\n⬜🟩🟩⬜🟩\n⬜🟩🟩🟩🟩\n🟩🟩🟩🟩🟩\...,0,1,1
3,2022-01-28 23:59:58+00:00,Wordle 224 5/6\n\n⬜🟩🟩⬜⬜\n⬜🟨🟨⬜⬜\n🟨🟩🟩⬜⬜\n⬜🟩🟩⬜🟩\n...,0,0,1
4,2022-01-28 23:59:57+00:00,@Sako_506 Wordle (ES) #22 2/6\n\n⬜⬜🟩🟨🟩\n🟩🟩🟩🟩🟩\...,0,1,0


**Alternative 1: Remove the standard text of people sharing their results ("Wordle n x/y" and coloured square emojis)**

In [3]:
text = df.iloc[65,1]
print(text)

Wordle 224 2/6

🟨⬛⬛🟨⬛
🟩🟩🟩🟩🟩

first try??


In [4]:
# Strip the coloured result squares (literal, non-regex replacements)
for square in ['🟩', '⬛', '🟨', '⬜', '🟧', '🟦', '🟥', '🟪', '🟫']:
    df['text'] = df['text'].str.replace(square, '', regex=False)
# Newlines are removed outright, so text on adjacent lines gets joined together
df['text'] = df['text'].str.replace('\n', '', regex=False)
# Remove the score header; raw strings avoid invalid-escape warnings.
# Note: 'X/6' (failed games) and the hard-mode '*' suffix are not matched.
df['text'] = df['text'].str.replace(r'Wordle\s\(ES\)\s#\d+\s\d\/\d', '', regex=True)
df['text'] = df['text'].str.replace(r'Wordle\s\d+\s\d\/\d', '', regex=True)
df['text'] = df['text'].str.replace(r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*', '', regex=True)  # remove links
df['text'] = df['text'].str.replace(r'@[^\s]+', '', regex=True)  # remove Twitter handles
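
The same cleaning steps are repeated later for the second dataset, so they could be wrapped in one reusable function. The sketch below is a variant under stated assumptions, not the notebook's exact logic: the name `clean_tweet_text`, the simplified URL pattern, and the whitespace collapsing are hypothetical choices, and unlike the cell above it also strips `X/6` results and the hard-mode `*` suffix.

```python
import re

# Coloured squares used in shared Wordle grids
SQUARES = "🟩⬛🟨⬜🟧🟦🟥🟪🟫"
DELETE_SQUARES = str.maketrans("", "", SQUARES)

# Score header, e.g. "Wordle 224 2/6", "Wordle 223 X/6", "Wordle (ES) #22 2/6", "Wordle 224 4/6*"
SCORE_RE = re.compile(r"Wordle\s(?:\(ES\)\s#)?\d+\s[\dX]/\d\*?")
# Simplified link pattern (assumption: links always start with http:// or https://)
URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\S+")


def clean_tweet_text(text: str) -> str:
    """Remove the standardized parts of a Wordle tweet and tidy whitespace."""
    text = SCORE_RE.sub("", text)
    text = text.translate(DELETE_SQUARES)
    text = URL_RE.sub("", text)
    text = HANDLE_RE.sub("", text)
    # Collapse newlines and runs of spaces into single spaces, then trim,
    # so whitespace-only leftovers become empty strings
    return " ".join(text.split())


print(clean_tweet_text("Wordle 224 2/6\n\n🟨⬛⬛🟨⬛\n🟩🟩🟩🟩🟩\n\nfirst try??"))
```

With a dataframe, this could be applied as `df['text'].apply(clean_tweet_text)`. Because the function trims whitespace, rows that contain only spaces would then also be caught by an empty-string filter.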

In [5]:
import re
# Drop tweets whose text is empty after cleaning
# (whitespace-only strings are kept, as the printed output below shows)
df = df[df['text'] != '']

print(len(df))
df.head()

244


Unnamed: 0,datetime,text,retweet_count,reply_count,like_count
2,2022-01-28 23:59:58+00:00,*Phew!,0,1,1
4,2022-01-28 23:59:57+00:00,👀,0,1,0
5,2022-01-28 23:59:57+00:00,EZ,0,0,0
7,2022-01-28 23:59:56+00:00,⁦ full credit,0,0,0
8,2022-01-28 23:59:56+00:00,nice try for 1st time,0,0,0


In [6]:
for i in df['text']:
    print(i)

*Phew!
  👀
EZ
 ⁦ full credit
 nice try for 1st time
Close one
    
My first three guesses were crap. So I guessed on my fourth.😳
Brain meat on swole wole for this one
Lutang moments
Well done me!#wordle223
Wordle 223 X/6
A minute to Wordle…
 complaining about different types of posts on social media is my 13th reason…every day we stray further from God 🤦‍♂️, I agree with the wordle one though I guess
My first wordle 😇️️️️️️️️️
*
ugh.
 
I get that some of you are annoyed by Wordle, but may I present the argument that people are having fun with it and it’s harmless fun at that, and I’d much rather see green and yellow squares in my timeline than more tweets and fights about our impending doom? 😘
That was stressful
👎🏼 
 
Crappy
 Got the right one? 
 What a toughie…I almost gave up! 
gg ez
 Tough one
Forgot to post this yesterday. The fastest I’ve gotten it done.
 I knew it was *something* ending in ****, just took a while to guess right! 😂 
  You just love his Wordle scores.
first try??
W

In [7]:
#test out VADER sentiment analysis for one sentence from dataframe
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

text = df.iloc[65,1]
print(text)
analyzer = SentimentIntensityAnalyzer()
vs = analyzer.polarity_scores(text)
print(str(vs))

* 🙄
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}


In [9]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Apply VADER to the whole dataframe (reuses `analyzer` from the previous cell)
def extract_compound_polarity(text: str) -> float:
    # The compound score ranges from -1 (extremely negative) to +1 (extremely positive)
    return analyzer.polarity_scores(text)['compound']


df['compound_polarity'] = df.text.apply(extract_compound_polarity)

df.head()

Unnamed: 0,datetime,text,retweet_count,reply_count,like_count,compound_polarity
2,2022-01-28 23:59:58+00:00,*Phew!,0,1,1,0.0
4,2022-01-28 23:59:57+00:00,👀,0,1,0,0.0
5,2022-01-28 23:59:57+00:00,EZ,0,0,0,0.3612
7,2022-01-28 23:59:56+00:00,⁦ full credit,0,0,0,0.3818
8,2022-01-28 23:59:56+00:00,nice try for 1st time,0,0,0,0.4215


In [14]:
df.describe()

Unnamed: 0,retweet_count,reply_count,like_count,compound_polarity
count,244.0,244.0,244.0,244.0
mean,0.151639,0.655738,4.467213,0.082413
std,1.997312,3.142096,46.973101,0.358297
min,0.0,0.0,0.0,-0.8385
25%,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0
75%,0.0,1.0,1.0,0.3138
max,31.0,43.0,729.0,0.9684
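
The 25th and 50th percentiles of `compound_polarity` are both 0.0, so a large share of the cleaned tweets score as exactly neutral. One way to make that visible is to bucket the scores into labels. The sketch below uses the ±0.05 cut-offs suggested in the VADER documentation; the function name `label_from_compound` and the sample scores (taken from the `df.head()` output above) are illustrative only.

```python
from collections import Counter


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores of the first five cleaned tweets shown earlier
scores = [0.0, 0.0, 0.3612, 0.3818, 0.4215]
counts = Counter(label_from_compound(s) for s in scores)
print(counts)
```

On the full dataframe the same function could be used as `df['compound_polarity'].apply(label_from_compound).value_counts()`.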


**Alternative 2: Only look at tweets of people who are not sharing their scores**

In [51]:
# Reload the raw tweets (assumed step: the output below shows uncleaned text
# with handles and links intact, so this cell ran on a fresh copy of the data)
df = pd.read_csv('wordle_v1.csv')

# Drop every tweet that contains a score header or a green/yellow square
score_share = r'Wordle\s\d+\s\d\/\d|Wordle\s\(ES\)\s#\d+\s\d\/\d|🟩|🟨'
df = df[~df['text'].str.contains(score_share, regex=True)]
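
The filter in this cell also answers the first question at the top of the notebook: how many tweets are not score shares. A minimal sketch of the same test as a standalone predicate follows; the name `is_score_share` is hypothetical, and the sample tweets are copied (shortened) from the outputs in this notebook.

```python
import re

# Same criteria as the cell above: a score header (English or ES edition)
# or at least one green or yellow square
SCORE_SHARE_RE = re.compile(r"Wordle\s\d+\s\d/\d|Wordle\s\(ES\)\s#\d+\s\d/\d|🟩|🟨")


def is_score_share(text: str) -> bool:
    """Return True if the tweet looks like someone sharing a Wordle result."""
    return SCORE_SHARE_RE.search(text) is not None


tweets = [
    "Wordle 224 5/6\n\n⬜🟨⬜⬜🟨\n⬜🟩🟩🟩🟩",
    "@Sako_506 Wordle (ES) #22 2/6\n\n⬜⬜🟩🟨🟩\n🟩🟩🟩🟩🟩",
    "A minute to Wordle…",
    "@ccrichton @Ka20Ce Catherine do you have shares in wordle?",
]
n_shares = sum(is_score_share(t) for t in tweets)
print(f"{n_shares} score shares, {len(tweets) - n_shares} other tweets")
```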

In [52]:
print(len(df))
df.head()

59


Unnamed: 0,datetime,text,retweet_count,reply_count,like_count
24,2022-01-28 23:59:47+00:00,A minute to Wordle…,0,0,0
25,2022-01-28 23:59:47+00:00,@elleduncanESPN complaining about different ty...,0,0,0
37,2022-01-28 23:59:43+00:00,"I get that some of you are annoyed by Wordle, ...",0,0,11
50,2022-01-28 23:59:37+00:00,@rebeccajpowell Got the right one? https://t.c...,0,1,0
64,2022-01-28 23:59:31+00:00,@TwistNHook @atomsareenough You just love his ...,0,1,0


In [53]:
for i in df['text']:
    print(i)

A minute to Wordle…
@elleduncanESPN complaining about different types of posts on social media is my 13th reason…every day we stray further from God 🤦‍♂️, I agree with the wordle one though I guess
I get that some of you are annoyed by Wordle, but may I present the argument that people are having fun with it and it’s harmless fun at that, and I’d much rather see green and yellow squares in my timeline than more tweets and fights about our impending doom? 😘
@rebeccajpowell Got the right one? https://t.co/hJuge7APan
@TwistNHook @atomsareenough You just love his Wordle scores.
Wordle is fun! It's like mastermind but instead of four pegs it's five letters. And here I was wondering why the hell people kept tweeting green and white boxes..
@ccrichton @Ka20Ce Catherine do you have shares in wordle?
Hello friends, if you want me to include you in my daily routine of wordle encouragement hmu
I can't keep up with his mind anymore😭 am i that dumb? Okay this is a wordle and then... What does this 

**First attempt at sentiment vs. time plot**
- Use all tweets that mention Wordle. Remove the standard text of people sharing their results ("Wordle n x/y" and coloured square emojis)
- query = 'wordle place_country:GB -is:retweet -has:media -is:nullcast lang:en'
- start_time = '2022-01-25T00:00:00Z'
- end_time = '2022-01-29T00:00:00Z'

In [6]:
import pandas as pd

In [7]:
df = pd.read_csv('wordle_v2.csv')

print(len(df))
df.head()

1000


Unnamed: 0,datetime,text,retweet_count,reply_count,like_count
0,2022-01-28 23:58:50+00:00,"It’s Friday night, I have just got in from a n...",0,0,3
1,2022-01-28 23:56:55+00:00,5 minutes left before Twitter becomes very dul...,2,13,91
2,2022-01-28 23:53:17+00:00,Bored with Wordle? Try Lewdle\nhttps://t.co/Wg...,0,2,0
3,2022-01-28 23:52:43+00:00,I've 1st &amp; 2nd starters \nMouse &amp; Chai...,0,0,0
4,2022-01-28 23:51:46+00:00,Lewdle 10 3/6\n⬛⬛🟩⬛⬛\n⬛⬛🟩🟨⬛\n🟩🟩🟩🟩🟩\n\nDon't be...,0,0,0


In [8]:
#clean data
# Strip the coloured result squares (literal, non-regex replacements)
for square in ['🟩', '⬛', '🟨', '⬜', '🟧', '🟦', '🟥', '🟪', '🟫']:
    df['text'] = df['text'].str.replace(square, '', regex=False)
# Newlines are removed outright, so text on adjacent lines gets joined together
df['text'] = df['text'].str.replace('\n', '', regex=False)
# Remove the score header; raw strings avoid invalid-escape warnings.
# Note: 'X/6' (failed games) and the hard-mode '*' suffix are not matched.
df['text'] = df['text'].str.replace(r'Wordle\s\(ES\)\s#\d+\s\d\/\d', '', regex=True)
df['text'] = df['text'].str.replace(r'Wordle\s\d+\s\d\/\d', '', regex=True)
df['text'] = df['text'].str.replace(r'((http|https)\:\/\/)?[a-zA-Z0-9\.\/\?\:@\-_=#]+\.([a-zA-Z]){2,6}([a-zA-Z0-9\.\&\/\?\:@\-_=#])*', '', regex=True)  # remove links
df['text'] = df['text'].str.replace(r'@[^\s]+', '', regex=True)  # remove Twitter handles


import re
# Drop tweets whose text is empty after cleaning (whitespace-only strings are kept)
df = df[df['text'] != '']

print(len(df))
df.head()

604


Unnamed: 0,datetime,text,retweet_count,reply_count,like_count
0,2022-01-28 23:58:50+00:00,"It’s Friday night, I have just got in from a n...",0,0,3
1,2022-01-28 23:56:55+00:00,5 minutes left before Twitter becomes very dul...,2,13,91
2,2022-01-28 23:53:17+00:00,Bored with Wordle? Try,0,2,0
3,2022-01-28 23:52:43+00:00,I've 1st &amp; 2nd starters Mouse &amp;,0,0,0
4,2022-01-28 23:51:46+00:00,Lewdle 10 3/6Don't be $⋕*! Play the original W...,0,0,0


In [12]:
#get sentiment of tweet text

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

def extract_compound_polarity(text: str) -> float:
    # The compound score ranges from -1 (extremely negative) to +1 (extremely positive)
    return analyzer.polarity_scores(text)['compound']


df['compound_polarity'] = df.text.apply(extract_compound_polarity)
df['weighted_compound_polarity'] = df['compound_polarity']*df['like_count']

df.head()

Unnamed: 0,datetime,text,retweet_count,reply_count,like_count,compound_polarity,weighted_compound_polarity
0,2022-01-28 23:58:50+00:00,"It’s Friday night, I have just got in from a n...",0,0,3,0.5267,1.5801
1,2022-01-28 23:56:55+00:00,5 minutes left before Twitter becomes very dul...,2,13,91,-0.4576,-41.6416
2,2022-01-28 23:53:17+00:00,Bored with Wordle? Try,0,2,0,-0.2732,-0.0
3,2022-01-28 23:52:43+00:00,I've 1st &amp; 2nd starters Mouse &amp;,0,0,0,0.0,0.0
4,2022-01-28 23:51:46+00:00,Lewdle 10 3/6Don't be $⋕*! Play the original W...,0,0,0,0.2139,0.0


In [22]:
times = pd.to_datetime(df.datetime)


In [35]:
df['datetime']=pd.to_datetime(df.datetime)
df['datehour'] = df['datetime'].dt.strftime('%Y-%m-%d %H')

In [36]:
df.head()

Unnamed: 0,datetime,text,retweet_count,reply_count,like_count,compound_polarity,weighted_compound_polarity,datehour
0,2022-01-28 23:58:50+00:00,"It’s Friday night, I have just got in from a n...",0,0,3,0.5267,1.5801,2022-01-28 23
1,2022-01-28 23:56:55+00:00,5 minutes left before Twitter becomes very dul...,2,13,91,-0.4576,-41.6416,2022-01-28 23
2,2022-01-28 23:53:17+00:00,Bored with Wordle? Try,0,2,0,-0.2732,-0.0,2022-01-28 23
3,2022-01-28 23:52:43+00:00,I've 1st &amp; 2nd starters Mouse &amp;,0,0,0,0.0,0.0,2022-01-28 23
4,2022-01-28 23:51:46+00:00,Lewdle 10 3/6Don't be $⋕*! Play the original W...,0,0,0,0.2139,0.0,2022-01-28 23


In [43]:
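# Average only the plotted column; calling .mean() on the full frame fails
# in pandas >= 2.0 because of the non-numeric 'text' column
mean_df = df.groupby('datehour')[['weighted_compound_polarity']].mean()
mean_df.to_csv('results.csv')
# This CSV is in a format ready for the sentiment vs. time plot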
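
Averaging `compound_polarity * like_count` per hour produces values outside the [-1, 1] range (a single tweet above contributes -41.6416), and tweets with zero likes contribute nothing. One possible alternative, assuming the goal is a like-weighted average that stays on the original polarity scale, is to divide the weighted sum by the sum of the weights. The sketch below is illustrative only: the function name `hourly_weighted_polarity` and the choice of `like_count + 1` as the weight (so that unliked tweets still count) are assumptions, and the rows are the five tweets from the `df.head()` output above.

```python
from collections import defaultdict
from datetime import datetime

# (datetime, compound_polarity, like_count) for the first five cleaned tweets
rows = [
    ("2022-01-28 23:58:50+00:00", 0.5267, 3),
    ("2022-01-28 23:56:55+00:00", -0.4576, 91),
    ("2022-01-28 23:53:17+00:00", -0.2732, 0),
    ("2022-01-28 23:52:43+00:00", 0.0, 0),
    ("2022-01-28 23:51:46+00:00", 0.2139, 0),
]


def hourly_weighted_polarity(rows):
    """Return {hour: sum(polarity * weight) / sum(weight)} with weight = likes + 1."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for timestamp, polarity, likes in rows:
        # Same hourly bucket format as the 'datehour' column
        hour = datetime.fromisoformat(timestamp).strftime("%Y-%m-%d %H")
        weight = likes + 1  # assumption: every tweet counts at least once
        weighted_sum[hour] += polarity * weight
        weight_total[hour] += weight
    return {hour: weighted_sum[hour] / weight_total[hour] for hour in weighted_sum}


result = hourly_weighted_polarity(rows)
print(result)
```

Because the weights are normalized, every hourly value stays between -1 and +1, which keeps the plot comparable with the unweighted compound scores.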