![Data Dunkers Banner](https://github.com/PS43Foundation/data-dunkers/blob/main/docs/top-banner.jpg?raw=true)

<a href="https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fdata-dunkers%2Fdata-dunkers-modules&branch=main&subPath=AI/predicting-role.ipynb&depth=1" target="_parent"><img src="https://raw.githubusercontent.com/callysto/curriculum-notebooks/master/open-in-callysto-button.svg?sanitize=true" width="123" height="24" alt="Open in Callysto"/></a><a href="https://colab.research.google.com/github/data-dunkers/data-dunkers-modules/blob/main/AI/predicting-role.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg?sanitize=true" width="123" height="24" alt="Open in Colab"/></a>

# Machine Learning - Predicting NBA Positions

## Objectives

Students will be able to:

- use the [sklearn](https://scikit-learn.org/stable/index.html) library to create, train, understand, and test basic machine-learning models
- identify potential sources of error in datasets, and fix them before training machine-learning models
- understand what "accuracy" means in a machine-learning context, and find ways to improve model accuracy
- decide when to use, and when not to use, particular machine-learning models

## Introduction

[Machine learning](https://simple.wikipedia.org/wiki/Machine_learning) (ML) is the study of how computers can use data and algorithms to learn. By analyzing large amounts of data, computers can identify patterns and make decisions with minimal human intervention.

Imagine you have a lot of information about NBA players, like how many points they score, how many rebounds they get, and how many assists they make. With machine learning, we can teach a computer to look at these stats and predict what position a player might play, like a guard, forward, or center. 

It's like giving the computer a bunch of clues and letting it figure out the answer! This can be helpful for coaches and teams to understand their players better and make smarter decisions during games.

Let's begin by importing the libraries we'll be using in this notebook.

## Import Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
print('libraries imported')

## Cleaning Data

Data cleaning is the process of fixing or removing incorrect, corrupted, or irrelevant data from a dataset. This ensures that the data is accurate and ready for analysis or machine learning tasks.

The dataset we want to import is "flipped", meaning that its rows are in reverse order.

To fix this, we will use the code snippet `nba_player_stats.iloc[::-1]`, which reverses the order of the rows in the dataframe.
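To see what `iloc[::-1]` does, here is a minimal sketch on a made-up three-row table. The player names and numbers are hypothetical and are not taken from the dataset. It also shows that reversing rows is a different operation from transposing, which swaps rows and columns.

```python
import pandas as pd

# Hypothetical three-row table, used only to illustrate row reversal
toy = pd.DataFrame({"Player": ["A", "B", "C"], "PTS": [5.0, 12.5, 20.1]})

# iloc[::-1] returns the rows from last to first; the columns are unchanged
reversed_toy = toy.iloc[::-1]

print(list(reversed_toy["Player"]))   # ['C', 'B', 'A']
print(list(reversed_toy.columns))     # ['Player', 'PTS']

# Transposing (toy.T) is a different operation: it swaps rows and columns
print(toy.T.shape)                    # (2, 3)
```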

In [None]:
nba_player_stats_url = 'https://raw.githubusercontent.com/Data-Dunkers/data-dunkers-modules/main/data-dunkers/data/20232024nbaplayerstatsreg.csv'

nba_player_stats = pd.read_csv(nba_player_stats_url, delimiter=';', encoding='latin1')
nba_player_stats = nba_player_stats.iloc[::-1]
display(nba_player_stats)

We see that we have the names of NBA players along with their stats for the 2023-2024 NBA season.

Let's take a look at all the different columns in our dataset. 

In [None]:
nba_player_stats.columns

In the context of an NBA game, each column in our dataset represents specific statistics about a player's performance. For example, `Rk` is the player's rank, `Player` is the player's name, `Pos` is their position, and so on and so forth.

In machine learning, we'll use these columns as "features" for our model. **Features** are individual measurable properties or characteristics of the data that help the model make predictions. In this case, the player's stats will be our features, and we'll use them to predict an NBA player's position.

Remember how we mentioned data-cleaning earlier? In our particular case, we want to remove duplicates in our dataset for the sake of consistency. 

When viewing our dataset earlier, we saw 3 instances of a player named "Precious Achiuwa". Let's search for that particular player in our dataset.

In [None]:
check_for_duplicate = 'Precious Achiuwa'
results = nba_player_stats.loc[nba_player_stats["Player"] == check_for_duplicate]
results

We have 3 instances of "Precious Achiuwa" in our dataset. We want to make it so that every player in this dataset is only represented once.

To achieve this, let's use the `groupby()` function in pandas. If a player appears more than once in our dataset, we keep only the row with the highest points (`PTS`), which we find with the `.idxmax()` function.

In [None]:
nba_player_stats = nba_player_stats.loc[nba_player_stats.groupby('Player')['PTS'].idxmax()]

nba_player_stats = nba_player_stats.reset_index(drop=True)
display(nba_player_stats)

Perfect, let's check if there is only one instance of "Precious Achiuwa" in our dataset.

In [None]:
check_for_duplicate = 'Precious Achiuwa'
results = nba_player_stats.loc[nba_player_stats["Player"] == check_for_duplicate]
results

We've successfully eliminated this issue using data-cleaning.
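The `groupby()` plus `.idxmax()` pattern used above can be sketched on a small made-up table. The names, team codes, and numbers here are hypothetical. `idxmax()` returns, for each player, the index label of the row with the highest `PTS`, and `.loc` then selects exactly those rows.

```python
import pandas as pd

# Hypothetical table in which "Ann" appears three times
toy = pd.DataFrame({
    "Player": ["Ann", "Ann", "Ann", "Bob"],
    "Tm": ["TOT", "AAA", "BBB", "CCC"],
    "PTS": [7.6, 7.7, 7.2, 15.0],
})

# For each player, find the index label of the row with the highest PTS
best_rows = toy.groupby("Player")["PTS"].idxmax()

# Keep only those rows, then renumber the index from 0
deduped = toy.loc[best_rows].reset_index(drop=True)

print(deduped)
```

Each player now appears once, and the row that survives for "Ann" is the one with 7.7 points.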

Another potential problem for our machine-learning model is players with low statistics. These players often have incomplete or less representative data, which can skew the model's learning process. Such outliers may introduce noise (random or unpredictable fluctuations) and reduce the overall accuracy of the model.

To solve this issue, let's require that a player has more than 10 points (`PTS`). This ensures that our dataset includes only players who have a more substantial and consistent presence in games, which helps our model learn more accurate patterns and relationships from the data and leads to better predictions.
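As a minimal sketch of this kind of threshold filter, the example below keeps only rows above a cutoff using a boolean mask. The table and the variable name `min_points` are made up for illustration. Note that a player with exactly 10 points is removed, because the condition is strictly "greater than".

```python
import pandas as pd

# Hypothetical players and points, used only to illustrate the filter
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "PTS": [3.2, 10.0, 10.1, 25.4],
})

min_points = 10  # hypothetical name for the cutoff

# The comparison produces True/False per row; indexing with it keeps the True rows
kept = toy[toy["PTS"] > min_points].reset_index(drop=True)

print(list(kept["Player"]))  # ['C', 'D']
```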

In [None]:
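# Drop every player with 10 or fewer points, then renumber the rows
nba_player_stats = nba_player_stats.drop(nba_player_stats[nba_player_stats['PTS'] <= 10].index)
nba_player_stats = nba_player_stats.reset_index(drop=True)
display(nba_player_stats)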

To test that we've successfully dropped all players with 10 or fewer points, we can check whether Precious Achiuwa is still in our dataset, since we saw earlier that he had fewer than 10 points.

In [None]:
check_for_over_10 = 'Precious Achiuwa'
results = nba_player_stats.loc[nba_player_stats["Player"] == check_for_over_10]
results

Perfect. One last thing we want to do for our model is to group point guards (PG) and shooting guards (SG) together as guards, or **G**, and small forwards (SF) and power forwards (PF) together as forwards, or **F**.
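The grouping can be done with a dictionary and the pandas `map()` method. The sketch below uses a small made-up series of position labels. Any label that is not in the dictionary (here `C` for center) becomes missing after `map()`, so `fillna()` is used to put the original label back.

```python
import pandas as pd

# Hypothetical position labels, one of each kind
positions = pd.Series(["PG", "SG", "SF", "PF", "C"])

position_mapping = {"PG": "G", "SG": "G", "PF": "F", "SF": "F"}

# map() looks each label up in the dictionary; "C" is not a key, so it becomes NaN
mapped = positions.map(position_mapping)

# fillna() replaces the missing values with the original labels
grouped = mapped.fillna(positions)

print(list(grouped))  # ['G', 'G', 'F', 'F', 'C']
```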

In [None]:
position_mapping = {'PG': 'G', 'SG': 'G', 'PF': 'F', 'SF': 'F'}

nba_player_stats['Pos'] = nba_player_stats['Pos'].map(position_mapping).fillna(nba_player_stats['Pos'])
nba_player_stats = nba_player_stats.reset_index(drop=True)
display(nba_player_stats)

Now we can move on to our machine-learning model.

You don't have to know the specific coding details of creating our model, but you should understand the general steps well enough to build similar models in your own projects.

### Selecting Features and Target

```python
features = ['FG%', '3P', '3PA', '3P%', '2P%', 'FT%', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
target = 'Pos'
```

`features` is a list of columns that represent different statistics of the players. These are the inputs to our model.

`target` is the column we want to predict, which in this case is the player's position (`'Pos'`).

### Splitting Data

```python
X = nba_player_stats[features]
y = nba_player_stats[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
```

`X` contains the feature data (player stats), and is commonly denoted with an uppercase X in machine learning.

`y` contains the target data (player positions), and is commonly denoted with a lowercase y in machine learning.

We then split our data into two portions: training data (`X_train`, `y_train`) and testing data (`X_test`, `y_test`). With `test_size=0.2`, 20% of the players are held out for testing and the remaining 80% are used for training.
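To check how the split sizes work out, here is a small sketch on made-up data with 10 rows. The arrays are hypothetical stand-ins for the player stats and positions. With `test_size=0.2`, 2 rows go to the test set and 8 to the training set, and fixing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows with 2 features each, and 10 labels
X = np.arange(20).reshape(10, 2)
y = np.array(["G", "F"] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Using the same random_state again gives exactly the same split
X_train_again, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=10
)
print(np.array_equal(X_test, X_test_again))  # True
```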

### Training the Model

```python
model = RandomForestClassifier(random_state=10)
model.fit(X_train, y_train)
```

`RandomForestClassifier` is the machine learning algorithm we selected for this particular problem. You don't need to know the specifics of how the model works, but it essentially creates a "forest" of decision trees and combines their predictions for better accuracy.

`model.fit(X_train, y_train)` then trains the model using the training data.
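As a rough illustration of the "forest" idea, the sketch below trains a small random forest on synthetic data generated by `make_classification` (not the NBA dataset) and then inspects it. The choice of 25 trees is arbitrary. After fitting, `estimators_` holds the individual decision trees, and `feature_importances_` gives a score per feature that indicates how much the forest relied on it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 samples with 5 numeric features
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=10
)

# n_estimators sets how many decision trees are in the forest
model = RandomForestClassifier(n_estimators=25, random_state=10)
model.fit(X, y)

print(len(model.estimators_))        # 25, one fitted tree per estimator
print(model.feature_importances_)    # one importance score per feature
```

Inspecting `feature_importances_` on a real model can hint at which stats matter most for predicting a position.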

### Testing the Model

```python
y_pred = model.predict(X_test)
```

`model.predict(X_test)` uses the trained model to predict player positions on the test data.

### Finding our Accuracy

```python
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print(classification_report(y_test, y_pred))
```

We can find out how well our model did with an accuracy score between 0 and 1 (that is, 0% to 100%). For reference, a model with 50% accuracy correctly identifies an NBA player's position 50% of the time.

Using the `classification_report`, we can find more details on our model's accuracy, specifically in regard to how well it scores in guessing particular positions.
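Accuracy is simply the number of correct predictions divided by the total number of predictions. The sketch below works this out by hand on five made-up labels and checks the result against `accuracy_score`. The labels are hypothetical and the variable name `y_hat` is used only for this example.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true positions and model guesses for five players
y_true = ["G", "G", "F", "F", "C"]
y_hat = ["G", "F", "F", "F", "G"]

# Count the positions where the guess matches the truth
correct = sum(t == p for t, p in zip(y_true, y_hat))
manual_accuracy = correct / len(y_true)

print(correct)                          # 3
print(manual_accuracy)                  # 0.6
print(accuracy_score(y_true, y_hat))    # 0.6
```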

In [None]:
features = ['FG%', '3P', '3PA', '3P%', '2P%', 'FT%',
            'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK', 'TOV', 'PF', 'PTS']
target = 'Pos'

X = nba_player_stats[features]
y = nba_player_stats[target]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

model = RandomForestClassifier(random_state=10)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

print(classification_report(y_test, y_pred))

Looking at our model accuracy, we scored **0.63** (rounded), or about 63%.

This is okay for our first instance of a machine-learning model, but is there a way we can improve our model accuracy?

One way to potentially improve our model is to tune its hyper-parameters, the settings we choose before training (such as the number of trees in the forest).

Once again, you don't need to know the specific nuances of the code shown below. A general understanding of what it does is enough to implement similar methods in your own models.
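A grid search tries every combination of the listed hyper-parameter values, so it helps to know how many combinations there are before running it. The sketch below counts them with scikit-learn's `ParameterGrid`, using the same grid values as this notebook and assuming 5-fold cross-validation (`cv=5`).

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Every combination of one value per hyper-parameter: 3 * 4 * 3 * 3
n_combinations = len(ParameterGrid(param_grid))
print(n_combinations)  # 108

# With 5-fold cross-validation, each combination is trained 5 times
n_fits = n_combinations * 5
print(n_fits)  # 540
```

By default `GridSearchCV` also refits the best combination once on the whole training set, so larger grids quickly become slow to run.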

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

model = RandomForestClassifier(random_state=10)
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Best model accuracy: {accuracy:.2f}")

We see that the best parameters for our particular dataset are:

`Best parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 50}`

This means if we wanted to implement these changes we would use the following code to initialize our model:

```python
best_model = RandomForestClassifier(max_depth=None, min_samples_leaf=1, min_samples_split=2, n_estimators=50, random_state=10)
```

We've also gotten an approximately 5% increase in model accuracy, which is great!

## Conclusion

In this notebook, we demonstrated the process of building and optimizing a machine-learning model to predict NBA player positions based on their game statistics. We started with data cleaning and feature selection to ensure our dataset was ready to be used in a machine-learning model. We then trained a `RandomForestClassifier` model on our cleaned dataset and improved its accuracy by searching for the best hyper-parameters.

In your projects, find datasets that have useful features that can be used in the context of "prediction". There are loads of different ways you can implement machine learning in Python. If you're interested in developing more machine-learning models using `sklearn`, you can find more information on [their official website](https://scikit-learn.org/stable/).