## Points Per Game Change Log

#### 17/02/25
- For this model I am going to use K-Nearest Neighbours (KNN).
- I am going to build an NBA player comparison model.
- I will obtain the data by scraping the per-game stats pages on Basketball Reference (for example, https://www.basketball-reference.com/leagues/NBA_2025_per_game.html).
- My aim is to give the model a player's stats and have it predict how many points per game that player will average.

In [32]:
import requests

url = 'https://www.basketball-reference.com/leagues/NBA_2015_per_game.html'

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get(url, headers=headers)

response.encoding = "utf-8"

with open("old-data.html", "w", encoding="utf-8") as file:
    file.write(response.text)

- My first step is to download the raw HTML of the page so that I can extract the table I need from it. I send browser-like headers so the request looks like it comes from a real user, because some sites block requests that they detect as automated.

- ChatGPT gave me the response.encoding line. I need it because when I saved the HTML without UTF-8 encoding, some player names were garbled, as they contain characters from other languages.
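
To illustrate the encoding problem described above, here is a minimal sketch of what happens when UTF-8 bytes are decoded with the wrong codec. The player name is only an illustrative example, and Latin-1 as the "wrong" codec is an assumption about what the response may have fallen back to, not something confirmed for this page.

```python
# Illustrative name with a non-ASCII character (an assumption, not read from the scraped file).
name = "Nikola Jokić"

# A web page arrives as bytes; here we simulate UTF-8 encoded bytes.
raw_bytes = name.encode("utf-8")

# Decoding with the wrong codec (Latin-1 is assumed here) garbles the accented letter.
garbled = raw_bytes.decode("latin-1")

# Decoding as UTF-8, which is what response.encoding = "utf-8" asks requests to do
# when building response.text, recovers the original name.
recovered = raw_bytes.decode("utf-8")

print(repr(garbled))
print(recovered)
```

The two-byte UTF-8 sequence for the accented letter is read as two separate characters under Latin-1, which is why the garbled string is one character longer than the original.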

In [None]:
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

# HTML file containing the table
html_file = "old-data.html"

# Load the HTML file with BeautifulSoup
with open(html_file, "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "lxml")  # "html.parser" also works if lxml is not installed

# Locate the <table> with id 'per_game_stats'
table = soup.find("table", {"id": "per_game_stats"})

if table:
    print("Found <table id='per_game_stats'>.")

    # Parse the table using pandas
    try:
        # Wrap the HTML in StringIO; passing a raw HTML string to read_html is deprecated
        df = pd.read_html(StringIO(str(table)))[0]
        # Save the DataFrame to a CSV file
        df.to_csv("15-nba.csv", index=False)
        print("Player stats saved to '15-nba.csv'")
    except ValueError as e:
        print(f"Error parsing the table: {e}")
else:
    print("No <table> with id 'per_game_stats' found.")


Found <table id='per_game_stats'>.
Player stats saved to '15-nba.csv'


- I am following a similar approach to the one I used to obtain data for my 4th year project, which can be seen here: https://github.com/DoyleAaron/4th-year-final-project. It is a great method of getting data, but a lot of data cleaning can be needed before the data is ready to be used in a machine learning model.

- Now I need to go back and get a few more years' worth of data so that I have a substantial amount to build my model from. I do this manually by changing the year in the URL and saving each season to a new CSV file.

- I have now gone back and obtained per-game data for all NBA players for the 2015 to 2025 seasons (11 CSV files). I now need to combine and tidy the data so that I can use it in my model.
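
The manual step of changing the URL for each year could also be scripted. The sketch below is only one possible approach: the helper names (`season_url`, `html_filename`, `download_season`), the `.html` file naming, the timeout, and the pause between requests are all assumptions for illustration, not what I actually ran.

```python
import time
from pathlib import Path

# URL pattern taken from the page used above; only the year changes.
BASE_URL = "https://www.basketball-reference.com/leagues/NBA_{year}_per_game.html"

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}


def season_url(year: int) -> str:
    """Build the per-game stats URL for one season."""
    return BASE_URL.format(year=year)


def html_filename(year: int) -> str:
    """Hypothetical naming that mirrors the CSV names, e.g. 2015 -> '15-nba.html'."""
    return f"{year % 100:02d}-nba.html"


def download_season(year: int, pause_seconds: float = 5.0) -> None:
    """Download one season's page and save it as UTF-8 (not called in this sketch)."""
    import requests  # imported here so the helpers above work without requests installed

    response = requests.get(season_url(year), headers=HEADERS, timeout=30)
    response.raise_for_status()   # stop early on a blocked or failed request
    response.encoding = "utf-8"   # same fix as in the single-season cell
    Path(html_filename(year)).write_text(response.text, encoding="utf-8")
    time.sleep(pause_seconds)     # assumed polite delay between requests


seasons = list(range(2015, 2026))  # 2015 through 2025 inclusive
urls = [season_url(year) for year in seasons]
print(urls[0])
print(len(urls), "seasons")
```

Calling `download_season(year)` for each entry in `seasons` would replace the manual edits, after which each saved file can go through the same table-extraction cell as before.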

In [40]:
csv_files = ["15-nba.csv", "16-nba.csv", "17-nba.csv", "18-nba.csv", "19-nba.csv", "20-nba.csv", "21-nba.csv", "22-nba.csv", "23-nba.csv", "24-nba.csv", "25-nba.csv"]

df = pd.concat([pd.read_csv(file) for file in csv_files], ignore_index=True)

df.to_csv("all-season-data.csv", index=False)

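# Drop rows that have a null in any column. Note: this runs after the CSV is written
# and only changes df in memory, so all-season-data.csv still contains those rows.
df = df.dropna()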

- As all of the data comes from the same website and follows the same format, I can use pd.concat, which takes the list of DataFrames read from the CSV files and combines them into one big DataFrame. I then save that as a single CSV file.
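
As a small illustration of what pd.concat does here, the sketch below combines two tiny stand-in tables. The values are made up, and the `Season` column is a hypothetical addition (the real combined file does not have it) that would make it possible to tell which season each row came from.

```python
import pandas as pd

# Two tiny stand-in season tables with the same columns (made-up values).
season_15 = pd.DataFrame({"Player": ["Player A", "Player B"], "PTS": [20.1, 11.3]})
season_16 = pd.DataFrame({"Player": ["Player A", "Player C"], "PTS": [22.4, 8.7]})

# Tag each table with its season before combining (hypothetical "Season" column).
frames = {2015: season_15, 2016: season_16}
tagged = [frame.assign(Season=year) for year, frame in frames.items()]

# ignore_index=True discards each table's own 0..n index and renumbers the rows.
combined = pd.concat(tagged, ignore_index=True)
print(combined)
```

Without `ignore_index=True`, the row labels would repeat (0, 1, 0, 1), which can cause confusing results when selecting rows by label later.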

In [2]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("all-season-data.csv")

print(df.head())

# I am just importing the data and printing the first 5 rows to ensure that the data is loaded correctly.

    Rk             Player   Age Team Pos     G    GS    MP   FG   FGA  ...  \
0  1.0  Russell Westbrook  26.0  OKC  PG  67.0  67.0  34.4  9.4  22.0  ...   
1  2.0       James Harden  25.0  HOU  SG  81.0  81.0  36.8  8.0  18.1  ...   
2  3.0       Kevin Durant  26.0  OKC  SF  27.0  27.0  33.8  8.8  17.3  ...   
3  4.0       LeBron James  30.0  CLE  SF  69.0  69.0  36.1  9.0  18.5  ...   
4  5.0      Anthony Davis  21.0  NOP  PF  68.0  68.0  36.1  9.4  17.6  ...   

   ORB  DRB   TRB  AST  STL  BLK  TOV   PF   PTS                 Awards  
0  1.9  5.4   7.3  8.6  2.1  0.2  4.4  2.7  28.1          MVP-4,AS,NBA2  
1  0.9  4.7   5.7  7.0  1.9  0.7  4.0  2.6  27.4          MVP-2,AS,NBA1  
2  0.6  6.0   6.6  4.1  0.9  0.9  2.7  1.5  25.4                     AS  
3  0.7  5.3   6.0  7.4  1.6  0.7  3.9  2.0  25.3  MVP-3,DPOY-13,AS,NBA1  
4  2.5  7.7  10.2  2.2  1.5  2.9  1.4  2.1  24.4   MVP-5,DPOY-4,AS,NBA1  

[5 rows x 31 columns]


In [3]:
X = df[["eFG%", "AST", "TRB", "STL", "BLK", "FG%"]]
y = df["PTS"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

X_train = X_train.dropna()
y_train = y_train.loc[X_train.index]
X_test = X_test.dropna()
y_test = y_test.loc[X_test.index]

- In this code block I define my X (features) and y (target) values and then split the data into training and test sets with an 80/20 split.
- I ran into issues with null values: dropping them from the overall dataset wasn't working as expected, so I followed advice I found online and dropped the rows with nulls from the training and test features after the split, then realigned y using the remaining indices.
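
A possible reason for the null-value trouble (an assumption, not verified against the real data) is that a plain `dropna()` removes a row if any column is null, including columns the model never uses, such as `Awards`. The sketch below uses a made-up toy table to show the difference between that and dropping nulls only in the columns the model needs, before splitting.

```python
import numpy as np
import pandas as pd

# Toy table with made-up gaps: one missing feature value and mostly empty Awards.
toy = pd.DataFrame({
    "AST": [8.6, 7.0, np.nan, 2.2],
    "TRB": [7.3, 5.7, 6.6, 10.2],
    "PTS": [28.1, 27.4, 25.4, 24.4],
    "Awards": ["AS", np.nan, np.nan, np.nan],
})

feature_cols = ["AST", "TRB"]
target_col = "PTS"

# Plain dropna(): a null in ANY column removes the row, so Awards wipes out most rows.
all_columns = toy.dropna()

# dropna(subset=...): only nulls in the columns the model uses remove a row.
needed_only = toy.dropna(subset=feature_cols + [target_col])

# X and y now come from the same cleaned rows, so they stay aligned for the split.
X_clean = needed_only[feature_cols]
y_clean = needed_only[target_col]

print(len(all_columns), "rows after plain dropna()")
print(len(needed_only), "rows after dropna(subset=...)")
```

Cleaning before `train_test_split` in this way would also remove the need to realign `y_train` and `y_test` by index afterwards.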

In [4]:
knn = KNeighborsRegressor(n_neighbors=1)

knn.fit(X_train, y_train)

- I am using a regressor rather than a classifier because I want to predict a continuous value (points per game) rather than a category.
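
To show what a KNN regressor does with its neighbours, here is a minimal sketch on made-up one-dimensional data. With the default uniform weights, the prediction is the average target of the k nearest training points, so `n_neighbors=1` simply copies the single closest player's value.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Made-up training data: one feature and one target per row.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 10.0, 20.0, 60.0])

# A new point to predict for; its nearest training points are 2.0, then 3.0, then 1.0.
query = np.array([[2.2]])

# k=1: the prediction is the target of the single nearest neighbour (x=2.0).
pred_k1 = KNeighborsRegressor(n_neighbors=1).fit(X_toy, y_toy).predict(query)[0]

# k=3: the prediction is the mean target of the three nearest neighbours.
pred_k3 = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy).predict(query)[0]

print("k=1:", pred_k1)
print("k=3:", pred_k3)
```

A larger k smooths predictions by averaging over more neighbours, so it may be worth trying several values of `n_neighbors` rather than only 1.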

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

# Predict points per game for the test set
y_pred = knn.predict(X_test)

# Calculate error metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)

# Print results
print("MAE:", mae)     # Lower is better
print("MSE:", mse)     # Lower is better
print("RMSE:", rmse)   # Lower is better


MAE: 3.0175712347354136
MSE: 19.67989823609227
RMSE: 4.436203132870752
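
To make the three metrics concrete, the sketch below computes them by hand with NumPy on three made-up predictions. The numbers are illustrative only and are not taken from the model.

```python
import numpy as np

# Made-up actual and predicted points per game for three players.
y_true_toy = np.array([10.0, 20.0, 30.0])
y_pred_toy = np.array([12.0, 18.0, 33.0])

# Signed errors per player: [2, -2, 3]
errors = y_pred_toy - y_true_toy

# MAE: average size of the error, ignoring direction.
toy_mae = np.mean(np.abs(errors))

# MSE: average squared error, which penalises large misses more heavily.
toy_mse = np.mean(errors ** 2)

# RMSE: square root of MSE, back in the same units as the target (points per game).
toy_rmse = np.sqrt(toy_mse)

print("MAE:", toy_mae)
print("MSE:", toy_mse)
print("RMSE:", toy_rmse)
```

Read the same way, the MAE of about 3.0 above means the model's predictions are off by roughly 3 points per game on average, and the RMSE of about 4.4 being larger than the MAE suggests that some predictions miss by considerably more than that.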
