## Kyle Demers
## Dataset Creation

 1) Sources of the dataset a. Where did you get the data? b. How did you get the data? c. What is the license of the data if any? e. Link to code used to create the dataset.

The dataset is created by pulling tweets after every matchweek, using the Twitter handle of each Premier League team as the search term. The tweets are pulled with a Python package called Twarc2, using [pull_tweets.py](https://github.com/Kyle-Demers08/NLP_340/blob/main/Final%20Project/python_scripts/pull_tweets.py). I have specified that each tweet should be in English and should not contain links or be a retweet. Some links do still sneak through, so we will deal with those in future preprocessing. We output all of the tweet metadata into a jsonl file, a text file with one object per line, from which the tweet text is then pulled. The jsonl file has the same name as the team's Twitter handle.
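
The filters described above can be written as a single search query. The following is a minimal sketch, assuming the handle is passed without the `@` and that the tweets are fetched with the `twarc2 search` command; `build_query` and `build_command` are hypothetical helpers for illustration and are not taken from pull_tweets.py.

```python
def build_query(handle: str) -> str:
    """Build a search query for one team handle.

    lang:en keeps English tweets, -is:retweet drops retweets and
    -has:links drops tweets that contain links.
    """
    return f"{handle} lang:en -is:retweet -has:links"


def build_command(handle: str) -> list:
    """Return a twarc2 command line that writes the results to <handle>.jsonl."""
    return ["twarc2", "search", build_query(handle), f"{handle}.jsonl"]


print(build_command("fulhamfc"))
```

The returned list could be passed to `subprocess.run` once twarc2 has been configured with API credentials.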

We then pull the text of each tweet using [pull_text.py](https://github.com/Kyle-Demers08/NLP_340/blob/main/Final%20Project/python_scripts/pull_text.py). This script reads in each line (one object) and looks for the tweet key. It then checks that there are no duplicates and prints out the list.
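
A rough sketch of this step is shown here. It assumes each line of the jsonl file is a JSON object holding a `data` list of tweet objects, each with a `text` field; that layout and the function name `pull_text` are assumptions for illustration, and the real script may look for a different key.

```python
import json


def pull_text(jsonl_path: str) -> list:
    """Return the unique tweet texts in a jsonl file, in first-seen order."""
    seen = set()
    texts = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            page = json.loads(line)
            # Assumption: each line holds a "data" list of tweet objects
            for tweet in page.get("data", []):
                text = tweet.get("text")
                if text is not None and text not in seen:
                    seen.add(text)
                    texts.append(text)
    return texts
```

Tracking seen texts in a set keeps the duplicate check fast while the list preserves the order in which tweets were first read.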

We run both of these Python scripts using [get_all_tweets_about.py](https://github.com/Kyle-Demers08/NLP_340/blob/main/Final%20Project/python_scripts/Get_all_tweets_about.py). This script initializes a list of all the teams in the Premier League. It then calls the pull_tweets script to get tweets about each team. Finally, it pulls the text from each jsonl file and stores it in a txt file named after the team's handle.

**Reiteration of the Goals**
1) Be able to predict when a Premier League manager will be sacked using sentiment analysis.

The issue with this is that I have to manually go back through the Twitter timeline to pull tweets from around the time of a sacking, because twarc2 only allows pulling tweets from the previous 7 days. I can create a small subset of those tweets and train a supervised learning model showing which tweets are common before a manager gets sacked. Another issue is that no other manager might get sacked in the next few months, so I would have to use old, self-gathered data.

2) Try to use tweets about teams to see how many points they have earned

I think this is a lot more likely to work, but I'm not sure how to use NLP for it. I might use sentiment analysis to classify the tweets as good, neutral, or bad, corresponding to a win, draw, or loss. I will be able to collect tweets from matchweeks 26 to 36 during this project. There are also match replays, which would likely add another game. Overall this is 11 games played, so it should be interesting to see how well the model can do.
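
One possible way to turn tweet sentiment into points is sketched here. It assumes every tweet has already been labeled `good`, `neutral`, or `bad` by some sentiment model, and it uses a simple majority vote per matchweek; the label names, the majority-vote rule, and the function names are all hypothetical choices rather than a settled design. Only the points values (3 for a win, 1 for a draw, 0 for a loss) come from the league rules.

```python
from collections import Counter

# League points for each result
POINTS = {"Win": 3, "Draw": 1, "Loss": 0}

# Hypothetical mapping from a tweet sentiment label to a match result
SENTIMENT_TO_RESULT = {"good": "Win", "neutral": "Draw", "bad": "Loss"}


def predict_result(sentiments: list) -> str:
    """Predict a result from the most common sentiment label among a team's tweets."""
    most_common_label = Counter(sentiments).most_common(1)[0][0]
    return SENTIMENT_TO_RESULT[most_common_label]


def predicted_points(weekly_sentiments: list) -> int:
    """Sum the points implied by the predicted result of each matchweek."""
    return sum(POINTS[predict_result(week)] for week in weekly_sentiments)
```

When two labels are tied, `Counter.most_common` returns the one that appeared first, so a real model would need an explicit tie-breaking rule.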

Looking at how many points each team has earned so far shows where each team stands before matchweek 26.

In [None]:
MW_26_points = {'bournemouth':21 , 'arsenal':60, 'villa':31, 'Brighton':35, 'Chelsea':31, 'spurs':45, 'crystal':27, 'everton':21, 'leicester':24, 'liverpool':39, 'leeds':22, 'manu':49, 'mancity':55, 'forest':25, 'newcastle':41, 'southhampton':18, 'westham':23, 'wolves':24,'fulham':39,'brentford':25}

In [2]:
#imports
import re
import pandas as pd
import os

The first thing we need to do is get our tweets out of a folder and into memory.

In [3]:
def gettweets(Matchweek) -> dict:
    '''
    Matchweek: name of folder inside the data folder
    
    Pulls all of the tweets out of the folder data/Matchweek ## and returns
    a dictionary mapping each team handle to its list of tweets
    '''
    md = {}
    for filename in os.listdir("data/" + Matchweek): #get the txt document for each team inside the folder data/ input:matchweek ##
        try:
            with open("data/" + Matchweek + '/' + filename, "r") as f: #open each txt file
                ts = f.read() #read it in to a variable as a string
        except OSError: #skips directories such as .ipynb_checkpoints (errno 21)
            continue
        key = filename[0:-4] #drop the .txt so the keys are the team handles
        #Need to create a tweet separator. The txt file is a printed list, so when we see ", or ', that is the end of a tweet
        #and the start of a new tweet.
        ts = ts.replace('",', '<BREAK HERE>')
        ts = ts.replace("',", '<BREAK HERE>')
        md[key] = ts.split("<BREAK HERE>") #split on the marker to get a list of tweets
    return md
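
Splitting on `",` and `',` also cuts any tweet that happens to contain one of those sequences, and it leaves the opening quote of each tweet in place. An alternative sketch is shown here, assuming each txt file is exactly the printed form of a Python list of strings; `parse_tweet_file` is a hypothetical helper, and `ast.literal_eval` is a different technique from the marker-based split used above.

```python
import ast


def parse_tweet_file(text: str) -> list:
    """Parse the printed form of a Python list of strings back into a list."""
    tweets = ast.literal_eval(text.strip())
    if not isinstance(tweets, list):
        raise ValueError("expected the text of a Python list")
    return [str(t) for t in tweets]


# The first tweet contains ', in the middle, which a marker-based split would cut in two
sample = str(["Great win today', said nobody", "Up the Cottagers!\nCOYW"])
print(parse_tweet_file(sample))
```

If a file is truncated or is not a valid list literal, `ast.literal_eval` raises an error, so the marker-based split remains a useful fallback.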

There are a lot of tweets in here that we aren't interested in. For example, clickbait pointing to a suspicious website won't tell us how someone feels about the team, and a tweet from someone who needs to sell tickets because something came up won't give us sentiment either. Let's remove these types of tweets. Twitter handles and newline characters aren't relevant either. While we will do more formal tokenization later, for now let's quickly get rid of them.

In [4]:
def clean(tweets) -> list:
    '''
    tweets: list of tweets
    
    Drop tweets that contain links or known spam phrases, then remove Twitter handles
    and new line characters from the tweets that are left
    '''
    #anything that looks spam related or not relevant to sentiment
    spam_markers = ['https:', 'http:', '.com', '.co', 'trying to get premier league clubs to reply',
                    'you changed my life financially', ' Sign up for free', 'selling ticket']
    new = []
    for txt in tweets:
        if any(marker in txt for marker in spam_markers): #skip the whole tweet
            continue
        txt = re.sub(r'@\w+', '', txt) #remove handles
        txt = re.sub(r'\n', '', txt) #remove \n character
        new.append(txt)
    return new

We also need a place to store these tweets. Let's create a pandas dataframe to store all of this information.

In [5]:
def make_df(team_dict) -> pd.DataFrame:
    '''
    team_dict: dictionary of all teams and their tweets
    
    Creates a dataframe of the teams and their tweets. It also initializes the Win, Draw and Loss columns to 0
    '''
    rows = []
    for key in team_dict.keys(): #for every team
        team_dict[key] = clean(team_dict[key]) #clean it
        for tweet in team_dict[key]: #for every tweet
            rows.append({'Team':key,'Tweet':tweet,'Win':0,'Draw':0,'Loss':0}) #make the row
    return pd.DataFrame(rows, columns = ['Team', 'Tweet','Win','Draw','Loss']) #build the df in one step

To finish filling out this dataframe, we need to insert the results from this matchweek in case we decide to do supervised learning in the future.

In [6]:
def insertwins(dictionary,df):
    '''
    dictionary: dictionary of results that should be updated manually before pushing
    df: dataframe of teams and tweets filled out with empty Win, Draw, Loss columns
    
    Insert the results of the matchweek into the dataframe
    '''
    for key in dictionary.keys():
        df.loc[df['Team'] == key,dictionary[key]] = 1 #use the dictionary passed in, not the global results

Now let's use our functions to get data from matchweek 27!

In [7]:
tweets = gettweets('Matchweek_27')
df = make_df(tweets)

## UPDATE EVERY WEEK!!

In [16]:
results = {'afcbournemouth':'Loss','ArsenalFC':'Win','avfc':'Win', 'BHAFC':'Win','ChelseaFC':'Win','COYS':'Loss','CPFC':'Loss','everton':'Draw','lcfc':'Loss','lfc':'Win','LUFC':'Loss','manunited':'Loss','mcfc':'Win','NFFC':'Draw','NUFC':'Loss','SouthamptonFC':'Win','WHUFC':'Loss','WolvesFC':'Win','fulhamfc':'Loss','BrentfordFC':'Win'}

In [17]:
insertwins(results,df)
df

| | Team | Tweet | Win | Draw | Loss |
|---|---|---|---|---|---|
| 0 | fulhamfc | ' Passion towards your club?? Behave yoursel... | 0 | 0 | 1 |
| 1 | fulhamfc | ' Lol . Was he just walking the streets afte... | 0 | 0 | 1 |
| 2 | fulhamfc | ' SOON | 0 | 0 | 1 |
| 3 | fulhamfc | ' We were better when he called us “Cottage’s” | 0 | 0 | 1 |
| 4 | fulhamfc | ' Who is the Batman for the soccer plays? 🤣 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... |
| 708 | SouthamptonFC | ' No unfortunately | 1 | 0 | 0 |
| 709 | SouthamptonFC | " Played a blinder for Bournemouth at St A... | 1 | 0 | 0 |
| 710 | SouthamptonFC | ' This is how u get blocked huh?😂 | 1 | 0 | 0 |
| 711 | SouthamptonFC | '4 Mar 2023: 1-0 Leicester\n\n made his 330t... | 1 | 0 | 0 |


2) Description of the dataset a. What is the size of the dataset? b. What is the format of the dataset? c. What is the structure of the dataset?

After removing a good chunk of tweets, the dataset contains a little over 700 tweets in total, which is about 35 tweets per Premier League team. This means that if a manager performs poorly for 4 weeks in a row, we will have about 140 tweets in a row that are likely unfavorable. This could be used to predict whether the manager will be sacked. If we are looking to predict how a team does over the next 11 weeks, we will have about 385 tweets per team and 7,700 tweets in total. This might be a little small, so after this week I will double the search size.

I plan on storing each of these pandas dataframes as a csv file so that the information isn't lost. The csv contains the same columns as the pandas dataframe: Team, Tweet, Win, Draw, Loss.
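
A small sketch of saving and reloading one matchweek follows. The helper names and the `csv` output folder are hypothetical, not part of the project code.

```python
import os
import pandas as pd


def save_matchweek(df: pd.DataFrame, matchweek: str, out_dir: str = "csv") -> str:
    """Write one matchweek's dataframe to <out_dir>/<matchweek>.csv and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, matchweek + ".csv")
    # index=False avoids an extra "Unnamed: 0" column when the file is read back
    df.to_csv(path, index=False)
    return path


def load_matchweek(path: str) -> pd.DataFrame:
    """Read a saved matchweek back into a dataframe."""
    return pd.read_csv(path)
```

Writing without the index keeps the csv limited to the five columns listed above.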

 3) a. What are the data models used in the dataset? b. What are the data structures used in the dataset?

To get the data from the csv files into one complete pandas dataframe, I will read all of the files into a sqlite database. From there I can simply union the tables into one complete table. The tweets and teams are stored as strings, and the wins, draws, and losses are stored as integers. Below is the number of tweets per team.
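
The union step could look roughly like the sketch here. It assumes one table per matchweek with the five columns above, uses an in-memory SQLite database, and adds a `Matchweek` column to record where each row came from; the function name, the table names, and the extra column are hypothetical. The table names are placed directly into the SQL text, so they must be trusted identifiers such as `Matchweek_27`.

```python
import sqlite3
import pandas as pd


def combine_matchweeks(frames: dict) -> pd.DataFrame:
    """Load each matchweek dataframe into SQLite and UNION ALL the tables.

    frames: maps a table name such as 'Matchweek_27' to its dataframe
    """
    conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
    try:
        selects = []
        for name, frame in frames.items():
            frame.to_sql(name, conn, index=False)
            # Record which table each row came from in a Matchweek column
            selects.append(
                f"SELECT '{name}' AS Matchweek, Team, Tweet, Win, Draw, Loss FROM \"{name}\""
            )
        query = " UNION ALL ".join(selects)
        return pd.read_sql_query(query, conn)
    finally:
        conn.close()
```

`UNION ALL` is used rather than `UNION` so that identical tweets from the same team are kept instead of being collapsed into one row. In practice each dataframe in `frames` would come from `pd.read_csv` on a saved matchweek file.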

In [22]:
for i in results.keys():
    print(i + ' has ' + str(len(df.loc[df['Team']==i])) + ' tweets in the dataframe.')

afcbournemouth has 30 tweets in the dataframe.
ArsenalFC has 30 tweets in the dataframe.
avfc has 59 tweets in the dataframe.
BHAFC has 32 tweets in the dataframe.
ChelseaFC has 53 tweets in the dataframe.
COYS has 30 tweets in the dataframe.
CPFC has 29 tweets in the dataframe.
everton has 34 tweets in the dataframe.
lcfc has 29 tweets in the dataframe.
lfc has 83 tweets in the dataframe.
LUFC has 29 tweets in the dataframe.
manunited has 31 tweets in the dataframe.
mcfc has 35 tweets in the dataframe.
NFFC has 29 tweets in the dataframe.
NUFC has 32 tweets in the dataframe.
SouthamptonFC has 30 tweets in the dataframe.
WHUFC has 30 tweets in the dataframe.
WolvesFC has 29 tweets in the dataframe.
fulhamfc has 30 tweets in the dataframe.
BrentfordFC has 29 tweets in the dataframe.
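
The same per-team counts can also be obtained in one call with pandas. This is a minimal sketch, assuming a dataframe with a `Team` column as above; `tweets_per_team` is a hypothetical helper name.

```python
import pandas as pd


def tweets_per_team(df: pd.DataFrame) -> pd.Series:
    """Count the rows for each team handle in the Team column."""
    return df["Team"].value_counts()
```

Unlike the loop over `results.keys()`, this lists teams in order of tweet count and omits any team with no tweets in the dataframe.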
