## Points Prediction 
- The aim of this model is to predict how many points a player will score in an upcoming gameweek.
- I want to see if it's possible to get every player's individual game data for the season, as I think training the model on a game-by-game basis will be much more accurate than training it on whole-season totals.
- For this I am going to build a separate Python script, which can be found in individual/player-data.py

#### Research & Obtaining Data
- I need to find a good way to measure this prediction, as there isn't a specific target value.
- My initial plan was to base it on points alone, but thinking about it more I don't want to do this: it would just be the total season points tally divided by the number of games played, leaving you with the same predicted points for every week.
- I would like to include fixture difficulty in some way, based on the formula I created in my fixture analyser.
- My current idea is to see if I can automate going through each player's individual FBREF page and getting their game-by-game statistics. This would give me a much more accurate model, as it would be built on individual games rather than the whole season.
- I am going to try this with all of the 24/25 season's data so far and see if I can write a Python script to go through the pages and retrieve all of the stats.
- I was able to get the first player fine, but each player has a unique ID in the URL, which is an issue for automating this process.
- I then tried to use an external API that I found here: https://natstat.com/fc
- Unfortunately this was also a dead end, as its stats were nowhere near in-depth enough for what I need to build this model.
- I eventually found a way around the 429 error (too many requests): I had to put a delay of 11 to 15 seconds between each request, which makes the traffic seem more natural so their servers don't shut me out.
- Getting this data took me about 2 to 3 hours, and then I realised I didn't have the player name in the data, which was a big part of this for me.
- So I had to run it again, making sure that every row appended to the CSV included the player's ID from the site and their name, so that I can accurately train the model.
- I will have to import goalkeepers separately, as they have a different set of statistics from other players.
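
The delay and the ID/name tagging described above can be sketched roughly as follows. This is not the actual script in individual/player-data.py: `scrape_players`, `fetch_rows` and the `(player_id, player_name, url)` tuples are hypothetical names for illustration, and the real request and parsing logic is left behind the `fetch_rows` callable.

```python
import csv
import random
import time


def scrape_players(players, fetch_rows, out_path, sleep=time.sleep,
                   min_delay=11.0, max_delay=15.0):
    """Fetch match rows for each player and append them to a CSV.

    players    -- iterable of (player_id, player_name, url) tuples
    fetch_rows -- hypothetical callable: url -> list of row lists
    """
    with open(out_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for i, (player_id, player_name, url) in enumerate(players):
            if i > 0:
                # Random 11 to 15 second pause between requests
                sleep(random.uniform(min_delay, max_delay))
            for row in fetch_rows(url):
                # Tag every row with the player's ID and name
                writer.writerow(list(row) + [player_id, player_name])
```

Passing `sleep` and `fetch_rows` in as arguments means the loop can be tried on a couple of fake players without waiting or hitting the site, which would have caught the missing name column before the multi-hour run.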

#### Pre-Processing & Cleansing Data
- There is going to be a lot of data pre-processing and cleansing here, as the scraped data is currently extremely messy.
- A checklist of what needs to be done to the data is as follows:
    1. Remove games where players didn't play
    2. Remove blank rows
    3. Add headers to the columns
    4. Filter for only Premier League games
    5. Remove unneeded columns
    6. Add a fantasy points column

In [53]:
import pandas as pd
import numpy as np
import sklearn

df = pd.read_csv('individual/game_by_game_stats.csv')

df = df.dropna()

print(df.head())

         Date  Day     Competition  ...  Match-Report  PlayerID Player-Name
0  2024-08-31  Sat  Premier League  ...  Match Report  774cf58b  Max-Aarons
2  2024-09-30  Mon  Premier League  ...  Match Report  774cf58b  Max-Aarons
4  2024-10-26  Sat  Premier League  ...  Match Report  774cf58b  Max-Aarons
5  2024-11-02  Sat  Premier League  ...  Match Report  774cf58b  Max-Aarons
6  2024-11-09  Sat  Premier League  ...  Match Report  774cf58b  Max-Aarons

[5 rows x 39 columns]


This simply reads in the CSV, drops the rows containing NaN values and prints the head to make sure it imported OK.
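
One thing worth checking at this step: `dropna()` with no arguments removes every row that has at least one NaN, which is stricter than removing blank rows. A small sketch on made-up data (the `Min` and `Gls` column names are assumptions) shows the difference and how to count the gaps per column first:

```python
import numpy as np
import pandas as pd

# Made-up rows: one complete, one fully blank, one missing a single stat
toy = pd.DataFrame({
    "Player-Name": ["Max-Aarons", None, "Max-Aarons"],
    "Min": [90, np.nan, 45],
    "Gls": [0, np.nan, np.nan],
})

# NaN count per column shows where the gaps are
print(toy.isna().sum())

any_nan = toy.dropna()           # drops rows with at least one NaN
all_nan = toy.dropna(how="all")  # drops only rows that are entirely NaN

print(len(toy), len(any_nan), len(all_nan))  # 3 1 2
```

If the real data only has NaN in rows that should go anyway, the plain `dropna()` is fine. Otherwise `how="all"` or the `subset` argument may be the safer choice.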

In [54]:
dfGK = df[df['Position'] == "GK"]

dfGK.to_csv('filtered_game_by_game_stats_GK.csv', index=False)

In [None]:
# Reload the goalkeeper rows saved in the previous cell
dfgk = pd.read_csv('filtered_game_by_game_stats_GK.csv')

# Remove games where the player was in the squad but made no appearance
df = df[df['Position'] != "On matchday squad, but did not play"]
dfgk = dfgk[dfgk['Position'] != "On matchday squad, but did not play"]

# Keep only Premier League games
df = df[df['Competition'] == "Premier League"]
dfgk = dfgk[dfgk['Competition'] == "Premier League"]

# Drop the columns the model does not need
df = df.drop(columns=['Competition', 'Matchweek', 'Date', 'Day', 'Match-Report', 'PlayerID'])
dfgk = dfgk.drop(columns=['Competition', 'Matchweek', 'Date', 'Day', 'Venue', 'Match-Report', 'PlayerID'])

df.to_csv('filtered_game_by_game_stats.csv', index=False)
dfgk.to_csv('filtered_game_by_game_stats_GK.csv', index=False)

This block does three main things: it removes games where players didn't make an appearance, it makes sure we are only working with Premier League data, and finally it drops the unneeded columns.

The next step is to add the fantasy points column.
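
A minimal sketch of what that column could look like is below. It only covers appearance, goal and assist points. The column names `FPL-Position`, `Min`, `Gls` and `Ast` are hypothetical, and the point values are my assumption of the scoring rules, so they should be checked against the official rules before being used for training.

```python
import pandas as pd

# Assumed point values; check against the official rules before use
GOAL_POINTS = {"GK": 10, "DEF": 6, "MID": 5, "FWD": 4}
ASSIST_POINTS = 3


def add_fantasy_points(frame, goal_points=GOAL_POINTS):
    """Return a copy of frame with a 'Fantasy-Points' column.

    Assumes hypothetical columns 'FPL-Position', 'Min', 'Gls', 'Ast'.
    """
    out = frame.copy()
    # Appearance points: 1 for playing, 2 for 60 minutes or more
    appearance = (out["Min"] > 0).astype(int) + (out["Min"] >= 60).astype(int)
    # Goal value depends on the player's position
    goals = out["Gls"] * out["FPL-Position"].map(goal_points)
    assists = out["Ast"] * ASSIST_POINTS
    out["Fantasy-Points"] = appearance + goals + assists
    return out


toy = pd.DataFrame({
    "FPL-Position": ["DEF", "FWD", "MID"],
    "Min": [90, 30, 0],
    "Gls": [1, 0, 0],
    "Ast": [0, 1, 0],
})
print(add_fantasy_points(toy)["Fantasy-Points"].tolist())  # [8, 4, 0]
```

Clean sheets, cards, saves and bonus points would each need their own term, and the scraped position labels would first have to be mapped onto the four fantasy positions.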

In [None]:
# Drop goalkeeper rows whose Team value contains "eng" (case-insensitive)
dfgk = dfgk[~dfgk['Team'].str.contains("eng", case=False, na=False)]
dfgk.to_csv('filtered_game_by_game_stats_GK.csv', index=False)