![DataDunkers.ca Banner](https://github.com/Data-Dunkers/lessons/blob/main/images/top-banner.jpg?raw=true)

# Machine Learning - Predicting NBA Positions

## Objectives

In this notebook we will:

- use the [sklearn](https://scikit-learn.org/stable/index.html) library to create, train, understand, and test basic machine-learning models
- identify potential sources of error in datasets, and fix them before training machine-learning models
- explore ways to improve model accuracy

## Introduction

[Machine learning](https://simple.wikipedia.org/wiki/Machine_learning) (ML) is the study of how computers can use data and algorithms to learn. By analyzing large amounts of data, computers can identify patterns and make decisions with minimal human intervention.

Imagine you have a lot of information about NBA players, like how many points they score, how many rebounds they get, and how many assists they make. With machine learning, we can teach a computer to look at these stats and predict what position a player might play, like a guard, forward, or center. 

It's like giving the computer a bunch of clues and letting it figure out the answer! This can be helpful for coaches and teams to understand their players better and make smarter decisions during games.

## Import Libraries and Data

Let's begin by importing the libraries and data we'll be using in this notebook.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV

nba_player_stats = pd.read_csv('20232024nbaplayerstatsreg.csv', encoding='latin1')
nba_player_stats

We see that we have the names of NBA players with their stats for the 2023-2024 NBA season.

Let's take a look at all the different columns in our dataset. 

In [None]:
nba_player_stats.columns

Each column in our dataset holds a specific piece of information about a player. For example, `Rk` is the player's rank, `Player` is the player's name, `Pos` is their position, and so on. In machine learning, we'll use some of these columns as "features" for our model.

**Features** are individual measurable properties or characteristics of the data that help the model make predictions. In this case, the player's stats will be our features, and we'll use them to predict an NBA player's position.

## Cleaning Data

We saw multiple instances of a player named "Precious Achiuwa" in the dataset, so we will want to remove those duplicate entries.

**Data cleaning** is the process of fixing or removing incorrect, corrupted, or irrelevant data from a dataset. This ensures that the data is accurate and ready for analysis or machine learning tasks.

Let's search for `'Precious Achiuwa'` in our dataset.

In [None]:
nba_player_stats.loc[nba_player_stats['Player'] == 'Precious Achiuwa']

We want to make it so that every player in this dataset is only represented once.

In order to achieve this, we'll find any players that played for multiple teams and drop any rows that are not their total stats (`TOT`). First let's look at all of the affected players.

In [None]:
for player in nba_player_stats[nba_player_stats['Tm'] == 'TOT']['Player']:
    display(nba_player_stats[nba_player_stats['Player'] == player])

It looks like the `TOT` row is always first, so let's [drop_duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html) and keep only the first of the duplicates.

In [None]:
nba_player_stats = nba_player_stats.drop_duplicates(subset=['Player'], keep='first')
nba_player_stats
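Keeping the first duplicate works only because the `TOT` row happens to be listed first. As a minimal sketch of a version that does not depend on row order, the toy example below (hypothetical players and values, and a helper column `is_total` that is not in the original dataset) sorts the totals row to the top for each player before dropping duplicates. Note that this approach also reorders the rows alphabetically by player.

```python
import pandas as pd

# Toy data with hypothetical players; the TOT row is deliberately NOT listed first.
toy_stats = pd.DataFrame({
    'Player': ['Player A', 'Player A', 'Player A', 'Player B'],
    'Tm': ['TOR', 'NYK', 'TOT', 'BOS'],
    'PTS': [7.7, 7.6, 7.6, 12.0],
})

# Flag the season-total rows (helper column, assumed for this sketch).
toy_stats['is_total'] = toy_stats['Tm'] == 'TOT'

deduplicated = (
    toy_stats
    # Within each player, put the flagged TOT row first (True sorts before False when descending).
    .sort_values(['Player', 'is_total'], ascending=[True, False])
    # Now keeping the first row per player always keeps TOT when it exists.
    .drop_duplicates(subset=['Player'], keep='first')
    .drop(columns='is_total')
    .reset_index(drop=True)
)
print(deduplicated)
```

Players who stayed with one team have no `TOT` row, so their single row is kept unchanged.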

Perfect. Let's check that there is now only one instance of "Precious Achiuwa" in our dataset.

In [None]:
nba_player_stats.loc[nba_player_stats['Player'] == 'Precious Achiuwa']

Another potential problem for our machine-learning model is players with few statistics in the NBA. Players with few statistics often have incomplete or less representative data, which can skew the model's learning process. These outliers may introduce noise (random or unpredictable fluctuations) and reduce the overall accuracy of the model.

To solve this issue, let's keep only players with more than `6` in the `PTS` (points) column. This will help our model learn more accurate patterns from the data.

In [None]:
nba_player_stats = nba_player_stats[nba_player_stats['PTS'] > 6].reset_index(drop=True)
nba_player_stats

Perfect, one last thing we want to do for our model is to check out the possible positions.

In [None]:
nba_player_stats['Pos'].unique()

To simplify, we'll map the following as guards (**G**):

* point guards (PG)
* shooting guards (SG)
* PG-SG
* SF-SG

And map the following as forwards (**F**):

* small forwards (SF)
* power forwards (PF)
* SF-PF
* PF-C
* C-PF

In [None]:
position_mapping = {'PG': 'G', 'SG': 'G', 'PG-SG': 'G', 'SF-SG': 'G', 'PF': 'F', 'SF': 'F', 'SF-PF': 'F', 'PF-C': 'F', 'C-PF': 'F'}

nba_player_stats['Pos'] = nba_player_stats['Pos'].map(position_mapping).fillna(nba_player_stats['Pos'])
nba_player_stats = nba_player_stats.reset_index(drop=True)
nba_player_stats

Let's check to make sure we just have three possible positions.

In [None]:
nba_player_stats['Pos'].unique()
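The mapping dictionary has no entry for centers (`C`), which is why `.fillna(...)` is needed. The small sketch below, using a hypothetical list of positions rather than the real dataset, shows the two steps separately: `.map()` turns any unmapped value into `NaN`, and `.fillna()` then restores the original value.

```python
import pandas as pd

# Hypothetical positions for five players.
positions = pd.Series(['PG', 'C', 'SF-PF', 'C-PF', 'SG'])

position_mapping = {'PG': 'G', 'SG': 'G', 'PG-SG': 'G', 'SF-SG': 'G',
                    'PF': 'F', 'SF': 'F', 'SF-PF': 'F', 'PF-C': 'F', 'C-PF': 'F'}

# Step 1: values missing from the dictionary (here 'C') become NaN.
mapped = positions.map(position_mapping)

# Step 2: fill each NaN with the original value at the same index.
simplified = mapped.fillna(positions)

print(pd.DataFrame({'original': positions, 'mapped': mapped, 'simplified': simplified}))
```

Without the `.fillna()` step, every center would be left with a missing position.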

## Machine Learning

You don't have to know the specific coding details of how we create our model, but you should understand the general steps well enough to build similar models in your own projects.

### Selecting Features and Target

`features` is a list of columns that represent different statistics of the players. These are the inputs to our model.

`target` is the column we want to predict, which in this case is the player's position ('Pos').

### Splitting Data

`X` contains the feature data (player stats). By convention, machine-learning code uses a capital `X` for the features.

`y` contains the target data (player positions). By convention, a lowercase `y` is used for the target.

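We then split our data into a training set `(X_train, y_train)`, which the model learns from, and a testing set `(X_test, y_test)`, which we hold back to check how well the model predicts positions for players it has not seen. With `test_size=0.2`, 20% of the players go into the testing set.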
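To make the split concrete, here is a minimal sketch on a toy table of ten hypothetical players (the values are made up for illustration). It shows that `test_size=0.2` sets aside two of the ten rows, and that the features and targets stay lined up by index.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten hypothetical players with two made-up stats and a position.
toy_players = pd.DataFrame({
    'AST': range(10),
    'TRB': range(10, 20),
    'Pos': ['G', 'F'] * 5,
})

X_toy = toy_players[['AST', 'TRB']]
y_toy = toy_players['Pos']

# random_state fixes the shuffle so the same rows are chosen on every run.
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10
)

print(len(X_train_toy), 'training rows,', len(X_test_toy), 'testing rows')
# Features and targets are shuffled together, so their indexes still match.
print(X_test_toy.index.equals(y_test_toy.index))
```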

### Training the Model

`RandomForestClassifier` is the machine learning algorithm we selected for this particular problem. You don't need to know the specifics of how the model works, but it essentially creates a "forest" of decision trees and combines their predictions for better accuracy.

`model.fit(X_train, y_train)` then trains the model using the training data.

In [None]:
features = ['FG%', '3P', '3PA', '3P%', '2P%', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
target = 'Pos'

X = nba_player_stats[features]
y = nba_player_stats[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

model = RandomForestClassifier(random_state=10)

model.fit(X_train, y_train)
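If you are curious about what a trained forest contains, the sketch below trains one on synthetic data generated with `make_classification` (not the NBA dataset; the feature names `stat_0` to `stat_4` are placeholders). It counts the decision trees in the forest and lists which features the trees relied on most.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 200 samples, 5 features, 3 classes (a stand-in for G, F, C).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=10
)
feature_names = [f'stat_{i}' for i in range(5)]  # placeholder names

demo_model = RandomForestClassifier(random_state=10)
demo_model.fit(X_demo, y_demo)

# The "forest" is a list of individual decision trees (100 by default).
print('Number of trees:', len(demo_model.estimators_))

# Importances sum to 1; larger values mean the trees relied on that feature more.
importances = pd.Series(demo_model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```

The same two attributes, `estimators_` and `feature_importances_`, are available on the `model` trained above, where the index would be the `features` list.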

### Testing the Model and Finding our Accuracy

`model.predict(X_test)` uses the trained model to predict player positions on the test data.

In [None]:
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
accuracy

Accuracy is the fraction of predictions that were correct. An accuracy of 0.5 (50%), for example, means the model correctly identified the position of half of the players in the test data.
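An accuracy score is easier to judge next to a simple baseline. The sketch below uses hypothetical true and predicted positions for eight players (not results from our model) to compute accuracy, then compares it with the accuracy of always guessing the most common position.

```python
from collections import Counter
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted positions for eight players.
true_positions = ['G', 'G', 'F', 'C', 'F', 'G', 'F', 'F']
predicted_positions = ['G', 'F', 'F', 'F', 'F', 'G', 'G', 'F']

# 5 of the 8 predictions match, so accuracy is 5 / 8.
toy_accuracy = accuracy_score(true_positions, predicted_positions)

# Baseline: always guess the most common true position ('F' appears 4 times).
most_common_position = Counter(true_positions).most_common(1)[0][0]
baseline_predictions = [most_common_position] * len(true_positions)
baseline_accuracy = accuracy_score(true_positions, baseline_predictions)

print('Model accuracy:   ', toy_accuracy)       # 0.625
print('Baseline accuracy:', baseline_accuracy)  # 0.5
```

A model is only useful if it clearly beats a baseline like this one.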

Using the `classification_report`, we can find more details on our model's accuracy, specifically how well it predicts each individual position.

In [None]:
print(classification_report(y_test, y_pred))
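A confusion matrix is a complementary view that shows which positions get mistaken for which. This is a minimal sketch with hypothetical labels (not our model's output); with the real model you could pass `y_test` and `y_pred` instead.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted positions for eight players.
true_positions = ['G', 'G', 'F', 'C', 'F', 'G', 'F', 'F']
predicted_positions = ['G', 'F', 'F', 'F', 'F', 'G', 'G', 'F']

labels = ['C', 'F', 'G']
matrix = confusion_matrix(true_positions, predicted_positions, labels=labels)

# Rows are the true positions, columns are the predicted positions.
matrix_table = pd.DataFrame(
    matrix,
    index=[f'true {label}' for label in labels],
    columns=[f'predicted {label}' for label in labels],
)
print(matrix_table)
```

In this toy example the single center was predicted to be a forward, which is the kind of pattern that a single accuracy number hides.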

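This is okay for our first machine-learning model, but one way to potentially improve it is to tune its hyper-parameters. Hyper-parameters are settings we choose before training, such as the number of trees in the forest (`n_estimators`) or how deep each tree can grow (`max_depth`).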

Once again, you just need to know enough of what this does to implement similar methods in your own models. This code cell may take a while to run.

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

model = RandomForestClassifier(random_state=10)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print(f'Best parameters: {grid_search.best_params_}')
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Best model accuracy: {accuracy}')
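To see why the grid search can take a while, we can count how much work it does. The sketch below uses `ParameterGrid` from `sklearn` to count the combinations in the same grid, then multiplies by the `cv=5` folds used above.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Every combination of the listed values is tried: 3 * 4 * 3 * 3.
n_combinations = len(ParameterGrid(param_grid))

# Each combination is trained and scored once per cross-validation fold.
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, 'combinations x', n_folds, 'folds =', n_fits, 'model fits')
```

Adding one more value to any list multiplies the total, so it helps to keep the grid small while experimenting.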

This means if we wanted to implement these changes we would use the following code to initialize our model:

```python
best_model = RandomForestClassifier(max_depth=10, min_samples_leaf=2, min_samples_split=5, n_estimators=100, random_state=10)
```

But in our case it wouldn't make much difference to the accuracy.

## Using the Model

Let's test our model with [Pascal Siakam](https://www.nba.com/stats/player/1627783?SeasonType=Regular+Season)'s stats from the [2022-2023 regular season](https://www.basketball-reference.com/players/s/siakapa01.html).

In [None]:
# `model` was re-created (unfitted) in the grid-search cell, so train it again before predicting
model.fit(X_train, y_train)
ps_stats = pd.DataFrame({
    'FG%': [0.48], 
    '3P': [1.3], 
    '3PA': [4.0], 
    '3P%': [0.324], 
    '2P%': [0.523], 
    'FT%': [0.774], 
    'ORB': [1.8], 
    'DRB': [6.0], 
    'TRB': [7.8], 
    'AST': [5.8], 
    'STL': [0.9], 
    'BLK': [0.5], 
    'TOV': [2.4], 
    'PF': [3.2], 
    'PTS': [24.2]})
model.predict(ps_stats)[0]

Pascal Siakam played power forward for the Raptors in the 2022-2023 season, so a prediction of `F` (forward) matches his actual position.
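A single predicted label hides how confident the model was. As a rough sketch, the example below trains a small forest on a made-up table of nine players with only two stats (an assumption for illustration, not our real training data) and uses `predict_proba` to show the share of trees' votes for each position.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up training data: nine hypothetical players, two stats each.
toy_train = pd.DataFrame({
    'TRB': [3.1, 2.8, 4.0, 6.5, 7.2, 5.9, 10.4, 11.0, 9.6],
    'AST': [6.5, 7.1, 5.2, 3.0, 2.4, 3.8, 1.5, 2.0, 1.2],
    'Pos': ['G', 'G', 'G', 'F', 'F', 'F', 'C', 'C', 'C'],
})

toy_model = RandomForestClassifier(random_state=10)
toy_model.fit(toy_train[['TRB', 'AST']], toy_train['Pos'])

# One new player, using the same column names as the training data.
new_player = pd.DataFrame({'TRB': [7.8], 'AST': [5.8]})

# One probability per position, in the order given by toy_model.classes_.
probabilities = pd.Series(toy_model.predict_proba(new_player)[0], index=toy_model.classes_)
print(probabilities)
print('Predicted position:', toy_model.predict(new_player)[0])
```

With the notebook's model, `model.predict_proba(ps_stats)` would give the same kind of breakdown for Pascal Siakam.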

## Conclusion

In this notebook, we demonstrated the process of building and optimizing a machine learning model to predict NBA player positions based on their game statistics. We started with data cleaning and feature selection to ensure our dataset was ready to be used in a machine-learning model. We then trained a `RandomForestClassifier` model using our cleaned dataset, and explored improving the model's accuracy by identifying the best hyper-parameters.

In your projects, find datasets that have useful features that can be used in the context of "prediction". There are loads of different ways you can implement machine learning in Python. If you're interested in developing more machine-learning models using `sklearn`, you can find more information on [their official website](https://scikit-learn.org/stable/).

[![Data Dunkers License](https://github.com/Data-Dunkers/lessons/blob/main/images/bottom-banner.jpg?raw=true)](https://github.com/Data-Dunkers/lessons/blob/main/LICENSE.md)