# Acquire and Analyze

## Introduction

For my Acquire and Analyze project, I thought it would be interesting to analyze the sentiment of an NFL team's hashtag during the course of a game. I chose the Baltimore Ravens because they are an exciting team to watch. I started out by pulling all of the tweets that included #Ravens from Twitter during the Sunday Night Football game between the Ravens and the Patriots on 11/15/20. I quickly realized that it would be better to have two games rather than one, so I also pulled all the tweets that included #Ravens for the following week, when the Ravens' opponent was the Tennessee Titans.

Since I am only analyzing the tweets happening during the game, the sentiment of the tweets should follow the flow of the game. So when the Ravens score or make a stop on defense, the sentiment should be positive. If the team the Ravens are playing scores or the Ravens do something bad, the sentiment should be negative.

To give the reader a better grasp of what happened during each game, the scoring summary will be provided. Then the sentiment scores for each game will be plotted. We will then see how the plot of sentiment scores compares to the win probability chart that ESPN calculates in real time during the game.

In [29]:
# importing all the libraries I'll need
import nltk
import numpy as np
import datetime

from string import punctuation
from collections import Counter

from IPython.display import Image

In [7]:
from nltk.corpus import stopwords

sw = stopwords.words('english')

In [8]:
# for sentiment analysis
import random
import nltk
from nltk.sentiment import SentimentAnalyzer
from nltk.sentiment.util import *
from collections import defaultdict

In [9]:
import string
import plotly.express as px

## Cleaning Up Process

The dataset I have collected must be cleaned up. To start, I pulled the two columns of the dataset that I needed:

1) Text in the tweet
    - Tokenization
    - Normalization

2) Time of tweet
    - I used the datetime package to parse each timestamp into year, month, day, hour, minute, and second.
    - This makes the tweet times much easier to compare and sort.

First, loading in the data from the Ravens vs. Patriots game

In [11]:
#loading in the text file with the data from Twitter
#grabbing the text of the tweet and time and putting them in a list
balt =[]

with open("#Ravens_followerss.txt",'r') as infile :
    next(infile)
    for idx, line in enumerate(infile.readlines()) :
        tweet_text = line.strip().split("\t")[-1]
        #tweet_text = tweet_text.lower().split()
        #tweet_text = [s.translate(str.maketrans('', '', string.punctuation)) for s in tweet_text]
        time = line.strip().split("\t")[-2]
        time = datetime.datetime.strptime(time,"%Y-%m-%d %H:%M:%S")
        
        pair = (time, tweet_text)
        balt.append(pair)
        

And now the same process for the next game where the Ravens played the Titans. There is probably a more efficient way to do this, but I chose to stick with what works.

In [99]:
balt_vs_Titans =[]

with open("#Ravens_game2_redo.txt",'r') as infile :
    next(infile)
    for idx, line in enumerate(infile.readlines()) :
        tweet_text = line.strip().split("\t")[-1]
        #tweet_text = tweet_text.lower().split()
        #tweet_text = [s.translate(str.maketrans('', '', string.punctuation)) for s in tweet_text]
        time = line.strip().split("\t")[-2]
        time = datetime.datetime.strptime(time,"%Y-%m-%d %H:%M:%S")
        
        pairs = (time, tweet_text)
        balt_vs_Titans.append(pairs)
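
The two loading cells above differ only in the file name, so they could be folded into one function. The following is a minimal sketch, assuming both files share the layout used above (a header row, tab-separated fields, the time second to last and the tweet text last). The name `load_tweets` is hypothetical and not part of the original notebook.

```python
import datetime

def load_tweets(path, time_format="%Y-%m-%d %H:%M:%S"):
    """Return a list of (time, tweet_text) tuples from a tab-separated file."""
    tweets = []
    with open(path, "r", encoding="utf-8") as infile:
        next(infile)  # skip the header row
        for line in infile:
            fields = line.strip().split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # same layout as the cells above: time is second to last, text is last
            time = datetime.datetime.strptime(fields[-2], time_format)
            tweets.append((time, fields[-1]))
    return tweets
```

With this in place, each game would load in one line, for example `balt = load_tweets("#Ravens_followerss.txt")`.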

## Deleting Tweets not during the game

There is still one problem with the data that needs to be addressed. The Twitter pull for this data wasn't perfect, and it ended up pulling tweets from before the game and after the game. I want to only analyze the sentiment during the game, so the next step is to delete the tweets that fall outside of the time frame of the game. 

All of Twitter's data uses Coordinated Universal Time (UTC). Mountain Standard Time is seven hours behind UTC. The Sunday Night Football game between the Ravens and the Patriots started around 6:24 P.M. Mountain Standard Time on 11/15, so the data for the game starts at 1:24 A.M. UTC on 11/16. The same conversion applies to the end of the game.

The same idea applies for the next game as well
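
As a check on the conversion, here is a small sketch that turns a Mountain Standard Time string into the UTC value used for filtering. It assumes a fixed UTC-7 offset, which holds for these November games because daylight saving time had already ended. The helper name `mountain_to_utc` is hypothetical.

```python
import datetime

# Mountain Standard Time as a fixed offset of UTC-7
# (assumption: no daylight saving, true for mid-to-late November)
MST = datetime.timezone(datetime.timedelta(hours=-7), name="MST")

def mountain_to_utc(local_string, time_format="%Y-%m-%d %H:%M:%S"):
    """Parse a Mountain Standard Time string and return a naive UTC datetime."""
    local = datetime.datetime.strptime(local_string, time_format).replace(tzinfo=MST)
    # drop tzinfo so the result compares cleanly with the naive tweet timestamps
    return local.astimezone(datetime.timezone.utc).replace(tzinfo=None)

kickoff_utc = mountain_to_utc("2020-11-15 18:24:00")
print(kickoff_utc)  # 2020-11-16 01:24:00
```

Note that the evening kickoff lands on the next calendar day in UTC, which is why the start time for the Patriots game is dated 2020-11-16.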

In [13]:
#Saving the start time and end time
game_start = datetime.datetime.strptime("2020-11-16 01:24:00","%Y-%m-%d %H:%M:%S")
game_end = datetime.datetime.strptime("2020-11-16 04:20:00","%Y-%m-%d %H:%M:%S")

test_1 = datetime.datetime.strptime("2020-11-16 01:00:00","%Y-%m-%d %H:%M:%S")
test_2 = datetime.datetime.strptime("2020-11-16 03:00:00","%Y-%m-%d %H:%M:%S")
test_3 = datetime.datetime.strptime("2020-11-16 06:00:00","%Y-%m-%d %H:%M:%S")

#making sure the code works
assert(test_1 < game_start)
assert(game_start < test_2 and test_2 < game_end)
assert(game_end < test_3)

In [14]:
#creating game_tweets to hold all the tweets that happened during the game
game_tweets = []

for pair in balt :
    time = pair[0]
    if (time > game_start and time < game_end) :
        game_tweets.append(pair)

In [15]:
#ordering the list of tuples by date
#this makes it go from beginning to end (instead of end to beginning)
game_tweets = sorted(game_tweets, key=lambda x: x[0])

In [147]:
#now for the next game

#Saving the start time and end time
game_start_22nd = datetime.datetime.strptime("2020-11-22 17:55:00","%Y-%m-%d %H:%M:%S")
game_end_22nd = datetime.datetime.strptime("2020-11-22 21:22:00","%Y-%m-%d %H:%M:%S")

test_1 = datetime.datetime.strptime("2020-11-22 16:00:00","%Y-%m-%d %H:%M:%S")
test_2 = datetime.datetime.strptime("2020-11-22 20:00:00","%Y-%m-%d %H:%M:%S")
test_3 = datetime.datetime.strptime("2020-11-22 23:00:00","%Y-%m-%d %H:%M:%S")

#making sure the code works
assert(test_1 < game_start_22nd)
assert(game_start_22nd < test_2 and test_2 < game_end_22nd)
assert(game_end_22nd < test_3)

In [148]:
#creating game_tweets_22nd to hold all the tweets that happened during the game
game_tweets_22nd = []

for pairs in balt_vs_Titans :
    time = pairs[0]
    if (time > game_start_22nd and time < game_end_22nd) :
        game_tweets_22nd.append(pairs)

In [149]:
#ordering the list of tuples by date
#this makes it go from beginning to end (instead of end to beginning)
game_tweets_22nd = sorted(game_tweets_22nd, key=lambda x: x[0])
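
The filter-then-sort steps are the same for both games, so they can be sketched as a single reusable function. This is only an illustration with made-up sample tweets; `tweets_in_window` is a hypothetical name, and it keeps the same strict comparisons as the loops above.

```python
import datetime

def tweets_in_window(pairs, start, end):
    """Keep (time, text) pairs strictly between start and end, oldest first."""
    kept = [pair for pair in pairs if start < pair[0] < end]
    return sorted(kept, key=lambda pair: pair[0])

start = datetime.datetime(2020, 11, 16, 1, 24)
end = datetime.datetime(2020, 11, 16, 4, 20)

# made-up sample, deliberately out of order
sample = [
    (datetime.datetime(2020, 11, 16, 5, 0), "after the game"),
    (datetime.datetime(2020, 11, 16, 3, 0), "during the game"),
    (datetime.datetime(2020, 11, 16, 1, 0), "before the game"),
    (datetime.datetime(2020, 11, 16, 2, 0), "also during the game"),
]

in_game = tweets_in_window(sample, start, end)
print([text for _, text in in_game])  # ['also during the game', 'during the game']
```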

## Descriptive Stats

Now the data only includes the tweets that happened during the game. Let's move on to understanding the text in the tweets. The best way to do this is to look at some basic descriptive statistics of the tweets.

In [20]:
clean_tokens =[]

for pair in game_tweets:
    tweet = pair[1]
    tweet = [w for w in tweet.lower().split()]
    tweet = [w for w in tweet if w.isalpha() and w not in sw]
    clean_tokens.extend(tweet)

In [151]:
clean_tokens_22nd =[]

for pairs in game_tweets_22nd:
    tweet = pairs[1]
    tweet = [w for w in tweet.lower().split()]
    tweet = [w for w in tweet if w.isalpha() and w not in sw]
    clean_tokens_22nd.extend(tweet)
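
One thing to be aware of in the cleaning step above: `w.isalpha()` drops any token that touches punctuation, so words like `win!` and the hashtag `#ravens` itself never reach the token list. The sketch below shows one possible alternative that strips punctuation before filtering, along the lines of the commented-out `str.maketrans` code in the loading cells. It is an assumption about what a stricter cleanup could look like, not the method used for the results in this notebook, and the tiny stopword set is a stand-in for `stopwords.words('english')`.

```python
import string

# Tiny stand-in stopword list for illustration only; the notebook uses
# nltk's stopwords.words('english') instead.
DEMO_STOPWORDS = {"the", "a", "is", "and", "to"}

# translation table that deletes every ASCII punctuation character
STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def tokenize(tweet_text, stop_words=DEMO_STOPWORDS):
    """Lowercase, drop links, strip punctuation, then filter stopwords."""
    # drop links first, otherwise a stripped URL becomes one long "word"
    words = [w for w in tweet_text.lower().split() if not w.startswith("http")]
    tokens = [w.translate(STRIP_PUNCT) for w in words]
    return [w for w in tokens if w.isalpha() and w not in stop_words]

print(tokenize("What a run by Lamar!!! #Ravens"))
# ['what', 'run', 'by', 'lamar', 'ravens']
```

Switching to this version would change the token counts reported below, and it would also turn @mentions into plain tokens, so it is a trade-off rather than a strict improvement.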

In [140]:
print(f"Ravens vs. Patriots on 11/15/20 \n")
print(f"Number of tweets = {len(game_tweets)}")
print(f"Tokens = {len(clean_tokens)}")
print(f"Average tokens per tweet = {len(clean_tokens)/len(game_tweets)}")
print(f"Unique tokens = {len(set(clean_tokens))}")
print(f"Lexical diversity = {len(set(clean_tokens))/len(clean_tokens):.3f}")

#token length vector
len_of_clean_tokens = [len(w) for w in clean_tokens]
print(f"Average token length = {np.mean(len_of_clean_tokens):.2f}")

Ravens vs. Patriots on 11/15/20 

Number of tweets = 1467
Tokens = 9281
Average tokens per tweet = 6.3265167007498295
Unique tokens = 2340
Lexical diversity = 0.252
Average token length = 5.32


In [141]:
print(f"Top 20 most used tokens:")
ravens_fd = nltk.FreqDist(clean_tokens)
ravens_fd.most_common(20)

Top 20 most used tokens:


[('game', 122),
 ('lamar', 114),
 ('get', 102),
 ('defense', 86),
 ('go', 84),
 ('like', 84),
 ('play', 75),
 ('jackson', 71),
 ('ravens', 66),
 ('run', 65),
 ('ball', 61),
 ('one', 60),
 ('team', 55),
 ('first', 53),
 ('need', 53),
 ('right', 53),
 ('going', 52),
 ('drive', 51),
 ('good', 49),
 ('win', 46)]

In [153]:
print(f"Ravens vs. Titans on 11/22/20 \n")
print(f"Number of tweets = {len(game_tweets_22nd)}")
print(f"Tokens = {len(clean_tokens_22nd)}")
print(f"Average tokens per tweet = {len(clean_tokens_22nd)/len(game_tweets_22nd)}")
print(f"Unique tokens = {len(set(clean_tokens_22nd))}")
print(f"Lexical diversity = {len(set(clean_tokens_22nd))/len(clean_tokens_22nd):.3f}")

#token length vector
len_of_clean_tokens_22nd = [len(w) for w in clean_tokens_22nd]
print(f"Average token length = {np.mean(len_of_clean_tokens_22nd):.2f}")

Ravens vs. Titans on 11/22/20 

Number of tweets = 1686
Tokens = 10120
Average tokens per tweet = 6.002372479240806
Unique tokens = 2363
Lexical diversity = 0.233
Average token length = 5.26


In [154]:
print(f"Top 20 most used tokens:")

ravens_fd = nltk.FreqDist(clean_tokens_22nd)
ravens_fd.most_common(20)

Top 20 most used tokens:


[('game', 129),
 ('get', 111),
 ('lamar', 108),
 ('go', 93),
 ('henry', 86),
 ('titans', 84),
 ('defense', 82),
 ('dobbins', 80),
 ('first', 78),
 ('like', 73),
 ('play', 72),
 ('good', 71),
 ('jackson', 67),
 ('td', 65),
 ('ravens', 64),
 ('ball', 63),
 ('need', 62),
 ('one', 59),
 ('field', 59),
 ('derrick', 58)]

## Analysis of Descriptive Statistics

The main difference between the two games is the number of tweets. The Ravens vs. Titans game had about 220 more tweets (1,686 vs. 1,467) than the game against the Patriots. This was surprising because the game against the Patriots was in primetime. I would assume more people would be watching the Sunday Night Football game since it is the only game on at that time. The game against the Titans was an 11:00 AM MT game, a time slot that has a number of other games on. One reason the Ravens vs. Titans game might have had more tweets is that the game went into overtime.

There aren't really any significant differences in the other statistics.

As for the top 20 most used tokens for each game, they are fairly similar. The most popular player for the Ravens, Lamar Jackson, shows up in both. The words game, get, go, defense, first, like, play, ravens, ball, one, need, and good also show up in both. This overlap is expected, since both sets of tweets are about the same team playing football.

Derrick Henry shows up in the Ravens vs. Titans game. He is the best player on the Titans.

Dobbins also shows up in this game. He is the Ravens' rookie running back who had a breakout game.
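
The statistics compared in this section can be gathered by one helper so that both games are summarized the same way. This is a minimal sketch using a made-up token list. The name `describe` is hypothetical, and it assumes a non-empty token list and at least one tweet.

```python
from collections import Counter

def describe(tokens, n_tweets, top_n=5):
    """Summarize a token list with the same statistics printed above."""
    unique = set(tokens)
    return {
        "tweets": n_tweets,
        "tokens": len(tokens),
        "tokens_per_tweet": len(tokens) / n_tweets,
        "unique_tokens": len(unique),
        # lexical diversity = unique tokens / total tokens
        "lexical_diversity": len(unique) / len(tokens),
        "avg_token_length": sum(len(w) for w in tokens) / len(tokens),
        "top_tokens": Counter(tokens).most_common(top_n),
    }

# made-up tokens standing in for clean_tokens
demo_stats = describe(["game", "lamar", "game", "defense"], n_tweets=2, top_n=1)
print(demo_stats["lexical_diversity"])  # 0.75
print(demo_stats["top_tokens"])         # [('game', 2)]
```

In the notebook this would be called as `describe(clean_tokens, len(game_tweets))` and `describe(clean_tokens_22nd, len(game_tweets_22nd))`.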

## Moving on to Sentiment Analysis

In [25]:
#bring in the text file with sentiment scores for each word
sentiment_scores = {}

with open("tidytext_sentiments.txt",'r') as infile :
    next(infile)
    for line in infile.readlines() :
        line = line.strip().split("\t")
        if line[1] == "positive" :
            sentiment_scores[line[0]] = 1
        else :
            sentiment_scores[line[0]] = -1

In [26]:
# this is only taking tokens, can it write out idx, time, score of individual tweet?????

#trying to have a running counter of sentiment
#+1 for positive, -1 for negative

running_counter = [0] * len(clean_tokens)
current_score = 0 

for idx, word in enumerate(clean_tokens) :
    if word in sentiment_scores :
        # tokens were lowercased during cleaning, so look the word up directly
        current_score += sentiment_scores[word]
    
    running_counter[idx] = current_score
    
    #if idx > 100 :
        #break

In [27]:
#writing out idx, sentiment score counter to a text file
with open("#ravens_scores.txt",'w') as ofile :
    ofile.write("word\tscore\n")
    for idx, score in enumerate(running_counter) :
        ofile.write("\t".join([str(idx+1),str(score)]) + "\n")

Now for the next game...

In [155]:
#probably don't need this twice
#bring in the text file with sentiment scores for each word
sentiment_scores = {}

with open("tidytext_sentiments.txt",'r') as infile :
    next(infile)
    for line in infile.readlines() :
        line = line.strip().split("\t")
        if line[1] == "positive" :
            sentiment_scores[line[0]] = 1
        else :
            sentiment_scores[line[0]] = -1

In [156]:
running_counter = [0] * len(clean_tokens_22nd)
current_score = 0 

for idx, word in enumerate(clean_tokens_22nd) :
    if word in sentiment_scores :
        # tokens were lowercased during cleaning, so look the word up directly
        current_score += sentiment_scores[word]
    
    running_counter[idx] = current_score
    
    #if idx > 100 :
        #break

In [157]:
#writing out idx, sentiment score counter to a text file
with open("#ravens_vs_titans_scores.txt",'w') as ofile :
    ofile.write("word\tscore\n")
    for idx, score in enumerate(running_counter) :
        ofile.write("\t".join([str(idx+1),str(score)]) + "\n")
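
The running counter above works on the flat token list, so the time of each tweet is lost. One way to keep it is to score each tweet as a unit and carry the timestamp along. The sketch below assumes the same +1/-1 lexicon idea, but the four-word `demo_lexicon`, the sample tweets, and the name `score_tweets` are all made up for illustration.

```python
import datetime

# Hypothetical mini-lexicon for illustration; the notebook builds the real
# sentiment_scores dict from tidytext_sentiments.txt.
demo_lexicon = {"great": 1, "win": 1, "bad": -1, "terrible": -1}

def score_tweets(pairs, lexicon):
    """Return (time, tweet_score, running_total) for each (time, text) pair."""
    rows = []
    running_total = 0
    for time, text in pairs:
        words = [w for w in text.lower().split() if w.isalpha()]
        # words missing from the lexicon contribute 0
        tweet_score = sum(lexicon.get(w, 0) for w in words)
        running_total += tweet_score
        rows.append((time, tweet_score, running_total))
    return rows

# made-up tweets, already sorted by time
demo_tweets = [
    (datetime.datetime(2020, 11, 16, 1, 30), "great drive"),
    (datetime.datetime(2020, 11, 16, 1, 45), "terrible throw and a bad snap"),
    (datetime.datetime(2020, 11, 16, 2, 5), "we can still win"),
]

demo_rows = score_tweets(demo_tweets, demo_lexicon)
for row in demo_rows:
    print(row)
```

Each row could then be written out with its timestamp instead of a token index, which would let the sentiment curve share a time axis with the game clock.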

# Analysis of Sentiment

## Ravens vs. Patriots - 11/15/20

![title](img/SS_vsPats.png)

The Sunday Night Football game between the Ravens and the Patriots got off to a slow start. It took until the 2nd quarter for the Ravens to strike first. Then the game went back and forth up until halftime. After halftime, the Patriots jumped out to a 13-point lead. The Ravens scored at the end of the 3rd to bring the game closer. The 4th quarter was a defensive battle. It was lightly raining the whole game, which may have been a factor. The Ravens had one more chance at the end to go the length of the field to score and win the game. As the Ravens took the field for the last drive, the light rain turned into an absolute downpour. The Ravens eventually turned the ball over on downs, which led to them losing.

![title](img/win_probPats.png)

The plot from ESPN shows the win probability for each team in real time. When the colored line is at the middle, each team has a win probability of 50%. That is when the game is the closest. The farther the line moves from the middle, the closer the leading team's win probability gets to 100%.

![title](img/vsPats.png)

## Analysis 

The sentiment plot of #Ravens tweets during the SNF game shows the back-and-forth flow that played out. The sentiment is positive at the start even though neither team scores in the first quarter. The plot shows the up-and-down flow of the 1st half, where the teams trade scores. Around halftime, the sentiment is positive for people tweeting #Ravens. The win probability is also still in favor of the Ravens. The Patriots come out after halftime and score twice, pushing their lead to 13 points. The win probability starts to go in the Patriots' direction. The sentiment starts to drop as well, signaling negativity in the #Ravens tweets.

The win probability starts to drift closer to 50% as the game goes on. This is signaling the Ravens still have a chance. I'm guessing this is because the Ravens scored at the end of the 3rd. There is also a slight rise in sentiment. The Ravens have one last chance on their last drive, but they can't pull through. The win probability goes to 100% for the Patriots. The sentiment at the end is up and down. This must be signaling some optimism before the Ravens finally lose.

## Ravens vs. Titans - 11/22/20

![title](img/SS_vsTitans.png)

The Ravens' next opponent was the Tennessee Titans. The Ravens came out playing great this game and jumped out to a large lead going into halftime. The Titans battled back and eventually tied the game at the end of regulation, which sent the game to overtime. In OT, the Titans' running back scored the game-winning touchdown, ending the game.

![title](img/win_probTitans.png)

The win probability plot for this game is easier to read because the two teams' colors are easier to tell apart.

![title](img/vsTitans.png)

## Analysis

The sentiment plot for this game shows a positive trend throughout. There are plenty of small dips in sentiment, but no large drops like in the last game. This was very surprising because the Titans scored 14 straight points towards the end of the game. I thought there would surely be negativity in the #Ravens tweets during this time. The sentiment plot still follows the win probability plot. The Ravens were favored to win almost the whole game until the end. At the end of the sentiment plot, there is a slight drop in sentiment. This is right when the Titans start to gain control of the game and see their win probability rise.

If I extended the plot a little further to include #Ravens tweets after the game, we might see the sentiment continue to drop.

# Conclusion

The sentiment of an NFL team's hashtag definitely shows the ups and downs that teams, along with their fans, go through during the course of a game. The sentiment plot of the first game analyzed, between the Ravens and Patriots, tracked very well with the game flow and win probability plot from ESPN. The second game, between the Ravens and Titans, again showed sentiment scores rising and falling. The ups and downs weren't nearly as extreme as in the first game, making it hard to tell if sentiment scores truly tracked win probability.

I was really lucky to choose a team like the Baltimore Ravens, who had two back-to-back games that were super close.
A wider variety of games may be necessary to truly test whether sentiment scores follow win probability. The analysis of these two games suggests it is worth testing further.
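
One possible way to go beyond comparing the two plots by eye is to sum the sentiment into fixed time bins and correlate the result with win probability sampled at the same bins. The sketch below is only an illustration of that idea. The function names are hypothetical, and every number in it is made up, including the win probabilities, which are not ESPN's values.

```python
import datetime
from collections import defaultdict

def bin_scores(scored, start, minutes=5):
    """Sum per-tweet scores into fixed-width time bins counted from start."""
    bins = defaultdict(int)
    for time, score in scored:
        index = int((time - start).total_seconds() // (minutes * 60))
        bins[index] += score
    return dict(bins)

def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

game_start = datetime.datetime(2020, 11, 16, 1, 24)
minute = datetime.timedelta(minutes=1)

# made-up (time, score) pairs, not real tweet scores
demo_scored = [(game_start + 1 * minute, 1), (game_start + 3 * minute, 1),
               (game_start + 6 * minute, -1), (game_start + 11 * minute, 2)]
binned = bin_scores(demo_scored, game_start)
net_by_bin = [binned.get(i, 0) for i in range(3)]  # [2, -1, 2]

# made-up win probabilities for the same three bins, not ESPN's numbers
demo_win_prob = [0.6, 0.4, 0.7]
r = pearson(net_by_bin, demo_win_prob)
print(round(r, 2))
```

With only a handful of bins per game a correlation like this would be very noisy, so it is better suited to the larger set of games suggested above.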

# Final Thoughts

Sentiment analysis was an idea that I came into Text Mining excited to try out. Sentiment analysis can be noisy at small scales but fairly accurate at large scales. Both of the games had around 1,500 tweets, which may not be enough to be considered large scale.

Another problem with tweets is that people tend to be sarcastic when they post. It is almost impossible for a word-level sentiment score like the one used here to pick up on that.

One last problem that may have occurred is that some of the tweets include hashtags from both teams (#Ravens, #Titans). Therefore, there could have been Titans fans tweeting positive things about their team but including the other team's hashtag.
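
A rough way to gauge how much this matters would be to drop tweets that also carry the opponent's hashtag and see whether the sentiment curve changes. The sketch below uses made-up tweets and a hypothetical helper name. It would also discard Ravens fans who tagged both teams, so it is a sensitivity check rather than a fix.

```python
def drop_opponent_hashtag(pairs, opponent_tag):
    """Keep (time, text) pairs whose text does not contain the opponent's hashtag."""
    tag = opponent_tag.lower()
    kept = []
    for time, text in pairs:
        # compare whole tokens, ignoring trailing punctuation, so that
        # "#Titans!" matches but "#TitansFan" does not
        tokens = {w.rstrip(".,!?") for w in text.lower().split()}
        if tag not in tokens:
            kept.append((time, text))
    return kept

# made-up tweets; the first element stands in for the timestamp
demo_pairs = [
    (1, "Great stop #Ravens"),
    (2, "Henry is unstoppable #Titans #Ravens"),
    (3, "Tough loss. #ravens"),
]

ravens_only = drop_opponent_hashtag(demo_pairs, "#Titans")
print([time for time, _ in ravens_only])  # [1, 3]
```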

### * scoring summaries and win probability plots come from ESPN.com