## Data Gathering

### Country squads

To answer the project questions, I first had to figure out which data to use. After some initial brainstorming, I decided on tables of squad information such as player experience, player prestige, etc. To build them, I decided to parse Wikipedia for data on each squad member over the past 5 World Cups. With this data in hand, I would then have to find a way to group the data by squad and World Cup year. To parse Wikipedia, I planned on using its API, which is meant to give access to any information on the website. That being said, Wikipedia's API is infamously poorly documented, so I knew I could end up turning to other methods (Beautiful Soup, pandas, etc.). The following image shows the table I wanted to grab from Wikipedia.

![](../images/squadTable.jpeg)

My original worry ended up being correct. I was unable to make proper use of Wikipedia's API, so I had to use a combination of pandas and Beautiful Soup to grab the tables from their respective pages. Additionally, the date of birth column wouldn't import properly, and after much research I ended up solving that specific problem with help from Stack Overflow. The end result is the following table:

```python
# import the parsed dataframe
import pandas as pd
df = pd.read_csv('../data/projectData/allSquads.csv')
df.drop('Unnamed: 0', axis=1).head()
```

|   | No. | Pos. | Player | Date of birth (age) | Caps | Club | Year | Goals |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | GK | Thomas Sørensen | 12 June 1976 | 14 | Sunderland | 2002 | |
| 1 | 2 | MF | Stig Tøfting | 14 August 1969 | 36 | Bolton Wanderers | 2002 | |
| 2 | 3 | DF | René Henriksen | 27 August 1969 | 39 | Panathinaikos | 2002 | |
| 3 | 4 | DF | Martin Laursen | 26 July 1977 | 15 | Milan | 2002 | |
| 4 | 5 | DF | Jan Heintze (c) | 17 August 1963 | 83 | PSV Eindhoven | 2002 | |
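
The scraping approach described above (Beautiful Soup to locate the table, pandas to hold it) can be sketched roughly as follows. This is not the project's actual parse code, and I am not claiming it matches the Stack Overflow fix used for the date of birth column. The inline HTML snippet, the function names `table_to_frame` and `clean_birth_date`, and the assumption that each date cell carries a hidden ISO date plus an age suffix are all illustrative.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for one squad table on a Wikipedia page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>No.</th><th>Pos.</th><th>Player</th><th>Date of birth (age)</th><th>Caps</th><th>Club</th></tr>
  <tr><td>1</td><td>GK</td><td>Thomas Sørensen</td>
      <td><span style="display:none">(1976-06-12)</span>12 June 1976 (aged 25)</td>
      <td>14</td><td>Sunderland</td></tr>
  <tr><td>2</td><td>MF</td><td>Stig Tøfting</td>
      <td><span style="display:none">(1969-08-14)</span>14 August 1969 (aged 32)</td>
      <td>36</td><td>Bolton Wanderers</td></tr>
</table>
"""


def table_to_frame(html: str, year: int) -> pd.DataFrame:
    """Turn the first wikitable in `html` into a DataFrame tagged with `year`."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    rows = table.find_all("tr")
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.get_text(" ", strip=True) for td in row.find_all("td")]
        if len(cells) == len(header):  # skip malformed or spacer rows
            records.append(dict(zip(header, cells)))
    frame = pd.DataFrame(records, columns=header)
    frame["Year"] = year
    return frame


def clean_birth_date(raw: str) -> str:
    """Keep only the human-readable date, e.g. '12 June 1976'."""
    match = re.search(r"\d{1,2} [A-Za-z]+ \d{4}", raw)
    return match.group(0) if match else raw


squad = table_to_frame(SAMPLE_HTML, year=2002)
squad["Date of birth (age)"] = squad["Date of birth (age)"].map(clean_birth_date)
squad["Caps"] = squad["Caps"].astype(int)
print(squad)
```

In a real run the HTML would come from the squad page for each World Cup rather than a string literal, and one frame per year would be concatenated into a single table.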


### Twitter sentiment

When it came to text data, I decided to search for tweets about the World Cup so I could identify the sentiment toward the event and certain teams. The angle here is that teams with more mentions and better sentiment might also perform better. It's a bit of a stretch, but I had to do it for the sake of the assignment.

To get the Twitter data, I used their API, which was way easier to use than Wikipedia's. The code for the parse is available in the project repository, but all I had to do was select a word or phrase I wanted to appear in the text (I chose FIFA World Cup), a date range for the tweets, the number of tweets I wanted back, and the fields I wanted for each tweet (e.g. text, favorited, etc.). The Twitter parse code yielded the following table:

```python
# reading in csv generated with the parse code
pd.read_csv('../data/projectData/Tweets.csv').head()
```

| Unnamed: 0 | text | favorited | favoriteCount | replyToSN | created | truncated | replyToSID | id | replyToUID | statusSource | screenName | retweetCount | isRetweet | retweeted | longitude | latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RT mk2club FIFAWorldCup fever is on Combine FI... | False | 0 | | 9/14/22 22:20 | False | | 1.57018e+18 | | `<a href="http://twitter.com/download/android" ...` | yuiwincityx07x | 1180 | True | False | | |
| 1 | Where are our FIFAWorldCup Fans?Did you know w... | False | 0 | | 9/14/22 22:20 | True | | 1.57018e+18 | | `<a href="http://twitter.com/download/iphone" r...` | themunchiemafia | 0 | False | False | | |
| 2 | Deporfans Qué opinan de las declaraciones de M... | False | 0 | | 9/14/22 22:17 | True | | 1.57017e+18 | | `<a href="http://twitter.com/download/iphone" r...` | Deportrece | 0 | False | False | | |
| 3 | RT neymarjr It is time to rock your world qata... | False | 0 | | 9/14/22 22:16 | False | | 1.57017e+18 | | `<a href="http://twitter.com/download/iphone" r...` | KruzelSvetlana | 478 | True | False | | |
| 4 | RT neymarjr It is time to rock your world qata... | False | 0 | | 9/14/22 22:11 | False | | 1.57017e+18 | | `<a href="http://twitter.com/download/android" ...` | Gudson118 | 478 | True | False | | |


### Next steps

Both datasets are in need of cleaning. For the tweets data, I'll have to remove variables that aren't needed (e.g. truncated, created, etc.) and clean the text itself, with operations such as lowercasing, removing stopwords, and vectorizing. For the player data, I will need to clean the date of birth column, check for numerical outliers, and check that the club spellings are consistent. Later on, I will group the rows by country and year to turn this dataset into one with a row for each country in each World Cup. It's possible that, once I start using the data for modeling, I'll realize that I need more of it. My guess is that the table format I intend to use will last me throughout the project, but the number of variables will likely need to change. For now the data suffices, but there might prove to be too little of it to model properly.
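
The cleaning steps listed above could look roughly like the sketch below. It is only an illustration under a few assumptions: the stopword list is a tiny hand-picked stand-in (a real run would likely use a fuller list from a library), the tweets are made-up strings standing in for the `text` column, the word counts are a very simple substitute for proper vectorizing, and the `Country` column is assumed since it doesn't appear in the squad table preview shown earlier.

```python
import re
from collections import Counter

import pandas as pd

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "is", "it", "of", "on", "our", "rt", "the", "to"}


def clean_tweet(text: str) -> list:
    """Lowercase, drop URLs and non-letters, then remove stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    # Note: this also drops accented letters, which is fine for a sketch only.
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]


# Made-up tweets standing in for the `text` column.
tweets = pd.DataFrame({"text": [
    "RT The FIFA World Cup fever is ON! https://example.com",
    "Where are our World Cup fans?",
]})
tweets["tokens"] = tweets["text"].map(clean_tweet)

# Simple bag-of-words counts as a stand-in for vectorizing.
vocab_counts = Counter(tok for toks in tweets["tokens"] for tok in toks)

# A few squad rows; `Country` is an assumed column, not one from the preview.
squads = pd.DataFrame({
    "Country": ["Denmark", "Denmark", "Denmark"],
    "Year": [2002, 2002, 2002],
    "Date of birth (age)": ["12 June 1976", "14 August 1969", "27 August 1969"],
    "Caps": [14, 36, 39],
})
# Parse the day-month-year strings into real datetimes.
squads["birth_date"] = pd.to_datetime(squads["Date of birth (age)"], format="%d %B %Y")

# Collapse player rows into one row per country and World Cup year.
team_level = (
    squads.groupby(["Country", "Year"])
    .agg(
        players=("Caps", "size"),
        mean_caps=("Caps", "mean"),
        oldest_birth=("birth_date", "min"),
    )
    .reset_index()
)
print(tweets["tokens"].tolist())
print(team_level)
```

Which team-level summaries to keep (mean caps, age spread, and so on) is a modeling decision; the three aggregates here are just placeholders to show the shape of the grouped table.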

### Closing thoughts

At the time of writing this closing thoughts section, I have completed all models (except ARM) and I now understand whether the data I first gathered was enough. As I predicted, there turned out to be less data than I would have liked. Having fewer than 200 rows to both train and test with isn't great for modeling; the more data the better. Unfortunately, the program and the assignments for this final project kept me too busy to look for more or better data, so I did what I could with what I had. The only data I managed to replace was the Twitter data. I realized that what I had at my disposal wouldn't work for the text analysis models, so I went ahead and redid the Twitter parse with different code. I didn't include that code in the earlier sections of this page because I already go into detail about it in the section on the model for which the data was used.