# Week One Final Project: NBA Salary Prediction and Exploration

Where has week one gone? We have one more project for you to put a nice little bow on all of the hard work you've done so far. For this project, be persistent, be curious, and ask questions if you get stuck!

## The Project

You and your teammates will create one prediction model and *AT LEAST* three plots or charts. Everyone will present their model and their charts during the final session of the day.
* Model predictions will be ranked according to their r-squared values and we will crown a winner!
* Your plots should be driven by curiosity. Everyone will present at least one plot.

## Helper Functions

We've provided helper functions down below. If you need help remembering what they do, refer to the `airbnb_solution.ipynb` example in the `3 - Day Three` folder.

## Let's Get Started: Reading and preparing the data

You'll need to use a lot of existing libraries and packages to look at the data. The cell below imports what you need into this notebook.

In [None]:
# We'll use these packages
import pandas as pd
import numpy as np
import ast
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, accuracy_score
from sklearn.preprocessing import MultiLabelBinarizer

The cell below limits how many columns are shown at once. Some datasets are quite large, and you don't want to overload your notebook. The NBA data we are looking at shouldn't run into this issue, but it's good practice to run a cell like this when analyzing data.

In [None]:
pd.set_option('display.max_columns', 100)

Next, let's read the data from the CSV file. The dataset we will be using contains statistics and salary information for NBA players during the 2022-2023 season.

[Link to dataset source and information](https://www.kaggle.com/datasets/jamiewelsh2/nba-player-salaries-2022-23-season)

**You will be looking for features of the data that can model (or predict) any of the players' salaries.**

In [None]:
# Read in the data!
nba_data = pd.read_csv("nba_dataset.csv")

Take a look at the first five rows of the data using the `df.head()` method.

In [None]:
nba_data.head()

### Helper Function 1: Splitting up text data that differs within the column

Sometimes column data will be in text form, but you still want to use it in your analysis. The next five code cells help you to modify the data so that it can be used for this type of analysis.

For example, some positions might have higher salaries than others, but you currently can't look at individual positions because they are all in one column.

Since a model can't work with text data directly, the code below splits each position out into its own column and indicates with a `1` or `0` (shown as `True` or `False` in newer versions of pandas) whether the player played that position. This may not seem like the most natural way to view the data from a human perspective, but it is considered *'best practice'* in data science, where it is often called one-hot encoding.

The code below splits out the selected column (`Position`) into its own `dataframe`, making each value a new column. It then stores the 'dummy' dataframe in a new variable called `position_dummies`.
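
Here is a small sketch of what `pd.get_dummies()` does, run on made-up data rather than the real dataset. The player names and the `toy` and `toy_dummies` variables are assumptions for illustration only. Depending on your pandas version, the dummy columns may hold `True`/`False` instead of `1`/`0`, so the sketch converts them with `astype(int)`.

```python
import pandas as pd

# Toy data: these players and positions are made up for illustration
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Position": ["PG", "C", "PG", "SF"],
})

# One new column per distinct position value
toy_dummies = pd.get_dummies(toy["Position"])

# The new column names are the distinct positions
print(list(toy_dummies.columns))

# Convert to 1/0 so the result looks the same in every pandas version
print(toy_dummies.astype(int))
```

Each row has exactly one `1`, in the column that matches that player's position.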

In [None]:
# Helper function for splitting out text data that represents a category,
# like position or team, but not free text like a player's name
position_dummies = pd.get_dummies(nba_data["Position"])
position_dummies

This process creates new columns. The cell below stores all of these new column names in a variable called `position_dummies_columns` so that you can easily look them up later if you need to.

In [None]:
position_dummies_columns = position_dummies.columns

Now you can take a look at those column names using the code cell below.

In [None]:
position_dummies_columns

Now that you have a dataframe that you can use, combine your two separate dataframes (`nba_data` and `position_dummies`) using the `pandas.concat()` method, which will concatenate `position_dummies` onto the end of `nba_data`.

The cell below does this work, and then stores this new dataframe in a variable called `merged_data`. 
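
To see why the `axis=1` argument matters, here is a minimal sketch on made-up data. The `left` and `right` dataframes are hypothetical stand-ins for `nba_data` and `position_dummies`.

```python
import pandas as pd

# Hypothetical stand-ins for nba_data and position_dummies
left = pd.DataFrame({"Player": ["A", "B"], "Salary": [100, 200]})
right = pd.DataFrame({"C": [1, 0], "PG": [0, 1]})

# axis=1 places the dataframes side by side, matching rows by index
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.shape)  # (2, 4): same rows, more columns

# axis=0 would stack them on top of each other and fill the gaps with NaN
stacked = pd.concat([left, right], axis=0)
print(stacked.shape)  # (4, 4): more rows, mostly missing values
```

Because `axis=1` matches rows by their index, both dataframes need to share the same index for the rows to line up.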

In [None]:
# position_dummies is its own dataframe, so concatenate it onto nba_data
merged_data = pd.concat([nba_data, position_dummies], axis=1)
merged_data

Finally, since you're going to be doing the rest of the analysis using the `nba_data` variable, put this `merged_data` information into that variable by reassigning `nba_data`.

***Note: you may want to repeat this process for 'Team' if that interests you***

In [None]:
nba_data = merged_data
nba_data

### Helper Function 2: Splitting up data in dictionary form

Sometimes data will come in a dictionary form, which looks like:

```
{
    'rock': 'False',
    'rap': 'True',
    'country': 'False',
    ...
}
```
This data isn't in a great form to do numerical analysis with, so the code cell below can be useful to create individual columns based on the keys in the dictionary.

There is no data that is in dictionary form in the NBA dataset we are looking at, so it doesn't need to be run. This notebook is set up so that you can use different data in the future, so keep this helper function around.
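
As a rough sketch of the idea, the example below turns a text column holding dictionaries like the one above into one column per key. The `toy` dataframe and its `title` and `genres` columns are made up for illustration. Note that the helper cell that follows expects a different layout (a list of dictionaries that each have a `'name'` key), so treat this as a simplified illustration rather than a replacement for it.

```python
import ast
import pandas as pd

# Hypothetical data: each cell in "genres" is a dictionary stored as text
toy = pd.DataFrame({
    "title": ["song_1", "song_2"],
    "genres": [
        "{'rock': 'False', 'rap': 'True', 'country': 'False'}",
        "{'rock': 'True', 'rap': 'False', 'country': 'False'}",
    ],
})

# ast.literal_eval safely turns the text into a real Python dictionary
parsed = toy["genres"].apply(ast.literal_eval)

# Build one column per key, then turn the 'True'/'False' text into 1/0
genre_columns = pd.DataFrame(parsed.tolist(), index=toy.index)
genre_columns = (genre_columns == "True").astype(int)

# Swap the original text column for the new numeric columns
toy = pd.concat([toy.drop(columns="genres"), genre_columns], axis=1)
print(toy)
```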

In [None]:
# DON'T USE THIS IF YOU DON'T HAVE ANY DATA IN A DICTIONARY FORM

# Helper Function: Feature Engineering
# Use this to turn dictionary columns into useful features
# Set `column` below to the name of your dictionary-style column

column = "column"  # FEEL FREE TO CHANGE THIS
number_to_keep = 100

def process_col_name(col_name):
    col_name_list = ast.literal_eval(col_name)
    if not isinstance(col_name_list, list):
        return []
    return [dic['name'] for dic in col_name_list if isinstance(dic, dict) and 'name' in dic]

nba_data[f'{column}_list'] = nba_data[column].apply(process_col_name)

# Compute the frequency of each col_name member
freq = pd.Series([name for sublist in nba_data[f'{column}_list'].tolist() for name in sublist]).value_counts()

# Keep the `number_to_keep` most frequent col_name members
top_col_name = freq[:number_to_keep].index.tolist()

# Filter the lists in the column to only include top col_name members
nba_data[f'{column}_list'] = nba_data[f'{column}_list'].apply(lambda x: [i for i in x if i in top_col_name])

mlb = MultiLabelBinarizer()

binary_matrix = pd.DataFrame(mlb.fit_transform(nba_data[f'{column}_list']), columns=mlb.classes_)

# Clean the column names: keep only alphanumeric characters and underscores
binary_matrix.columns = binary_matrix.columns.str.replace('[^0-9a-zA-Z_]', '', regex=True)

# Now, concatenate the binary matrix with the original DataFrame
new_feature_names = binary_matrix.columns
nba_data = pd.concat([nba_data, binary_matrix], axis=1)

## Looking at the Data: Plotting

The two code cells below help you to look at data in different ways using the **bar** plot and **scatter** plot functionality of dataframes.

***To complete the challenge, you will need to have three (3) different plots that tell you something about the data.***

Play around with different columns to look at and compare. You will need to create at least one new code cell to add your third plot, which can be either a bar plot or a scatter plot. 
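
The bar chart helper draws two charts from the same grouping: the average of a column per group, and the number of rows in each group. Here is a minimal sketch of those two calculations on made-up data, without the plotting. The `toy` dataframe and its values are assumptions for illustration.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "Position": ["PG", "PG", "C", "C", "C"],
    "Salary": [10, 30, 5, 10, 15],
})

# Average salary for each position (what the first bar chart shows)
avg_salary = toy.groupby("Position")["Salary"].mean()

# Number of players behind each average (what the second bar chart shows)
player_count = toy.groupby("Position")["Salary"].count()

print(avg_salary)
print(player_count)
```

Looking at the counts next to the averages is useful because an average based on only a handful of players can be misleading.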

In [None]:
# Helper Function: Two Bar Chart Plots
groupby_variable = "col_1"
y_value = "col_2"

fig, axs = plt.subplots(2, 1, figsize=(12, 6))
nba_data.groupby(groupby_variable)[y_value].mean().plot(kind="bar", ax=axs[0], title=f"Average {y_value}")
nba_data.groupby(groupby_variable)[y_value].count().plot(kind="bar", ax=axs[1], title="Count for each Bucket")
fig.tight_layout()

In [None]:
# Helper Function: Scatter Plot

x_value = "col_1"
y_value = "col_2"

nba_data.plot(x=x_value, y=y_value, kind="scatter", alpha=0.2)

## Model Training: Selecting features to explain something

The code cell below will support you in training a model that predicts the salary of any player.

This is a *'regression'* model, meaning that it tries to predict a numeric value, as opposed to a *'classification'* model, which assigns each example to a category.

Basically, you are looking for what combination of the columns, which we call features, can predict how much a player in this dataset was paid. The model training part then works out how much of a weight to put on those features.
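
To make "how much of a weight to put on those features" concrete, here is a small sketch using `LinearRegression` on made-up numbers. The target is built from a known rule with no noise, so you can check that training recovers the weights. The data, the rule, and the `toy_model` name are all assumptions for illustration; real salary data will not fit this cleanly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up features: 50 rows, 2 columns, values between 0 and 10
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 2))

# Made-up rule: target = 2 * first feature + 3 * second feature + 5
y = 2 * X[:, 0] + 3 * X[:, 1] + 5

# Training works out the weight for each feature
toy_model = LinearRegression().fit(X, y)

print(toy_model.coef_)       # one weight per feature, close to 2 and 3
print(toy_model.intercept_)  # the constant term, close to 5
```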

The cell below will give a list of all of the options that you can use for features (just don't use `Salary`, since that's what you're trying to predict).

Select some of these columns to include in the `features` variable list (it can be longer than 3).

***The output of this cell will have a lot of information***  
What you are looking for is the `R**2` value under the `Validation Data Statistics`.
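
If you are curious what that number means, here is a sketch that computes R-squared by hand on made-up values. It is the same quantity that `r2_score` reports: 1 minus the ratio of the model's squared error to the squared error of always guessing the average. The `actual` and `predicted` arrays are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted values
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 18.0, 33.0, 37.0])

# Squared error of the model's predictions: 4 + 4 + 9 + 9 = 26
ss_res = np.sum((actual - predicted) ** 2)

# Squared error of always guessing the average (25): 225 + 25 + 25 + 225 = 500
ss_tot = np.sum((actual - actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(r_squared)  # 1 - 26/500, about 0.948
```

A value of 1 means perfect predictions, 0 means the model does no better than always guessing the average, and a negative value (which can happen on validation data) means it does worse.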

In [None]:
list(nba_data.columns)

In [None]:
# Helper Function: Model Training
# DO NOT USE YOUR TARGET COLUMN IN THE FEATURES
features = ["col_1", "col_2", "col_3"]

target = "Salary"  # LEAVE THIS ALONE
model_type = "linear regression"  # Options: "random forest" or "linear regression"
features_to_show = 15


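if model_type == "random forest":
    model = RandomForestRegressor()
elif model_type == "linear regression":
    model = LinearRegression()
else:
    raise ValueError('model_type must be "random forest" or "linear regression"')

shuffled_data = nba_data.sample(len(nba_data))  # Shuffle our data
split_index = int(len(shuffled_data) * 0.8)  # 80% training, 20% validation
# .copy() lets us add prediction columns later without a pandas warning
train_data = shuffled_data[:split_index].copy()
validation_data = shuffled_data[split_index:].copy()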

model.fit(train_data[features], train_data[target])

train_data[f"predicted_{target}"] = model.predict(train_data[features])
validation_data[f"predicted_{target}"] = model.predict(validation_data[features])

# How do we measure our success?
print("Training Data Statistics")
print("mean_absolute_error: ", mean_absolute_error(train_data[target], train_data[f"predicted_{target}"]))
print("mean_squared_error", mean_squared_error(train_data[target], train_data[f"predicted_{target}"]))
print("R**2", r2_score(train_data[target], train_data[f"predicted_{target}"]))
print("")

print("Validation Data Statistics")
print("mean_absolute_error: ", mean_absolute_error(validation_data[target], validation_data[f"predicted_{target}"]))
print("mean_squared_error", mean_squared_error(validation_data[target], validation_data[f"predicted_{target}"]))
print("R**2", r2_score(validation_data[target], validation_data[f"predicted_{target}"]))

if model_type == "random forest":
    importances = model.feature_importances_
    indices = np.argsort(importances)[-features_to_show:]  # sort top features

    # Create a figure and a set of subplots
    fig, ax = plt.subplots()

    # Bar plot
    ax.barh(range(len(indices)), importances[indices], color='b', align='center')
    plt.yticks(range(len(indices)), [features[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.title(f'Top {features_to_show} Feature Importances')
    plt.show()

if model_type == "linear regression":
    coefficients = model.coef_
    indices = np.argsort(np.abs(coefficients))[-features_to_show:]  # sort top features by magnitude

    # Create a figure and a set of subplots
    fig, ax = plt.subplots()

    # Bar plot
    ax.barh(range(len(indices)), coefficients[indices], color='g', align='center')
    plt.yticks(range(len(indices)), [features[i] for i in indices])
    plt.xlabel('Coefficient Value')
    plt.title(f'Top {features_to_show} Feature Coefficients in Linear Regression')
    plt.show()