# ARE YOU ALL-STAR MATERIAL?

In this notebook, we explore whether a player with a given set of stats is All-Star material, using a machine learning approach. We start by gathering and preparing our dataset, which consists of statistics for current and past NBA players. This data was collected and verified from the official [NBA website](https://www.nba.com/).

We will then proceed with data preprocessing, feature selection, and model training to predict the likelihood of a player being an All-Star. Let's dive into the exciting world of basketball analytics and machine learning!

First of all, we import the necessary dependencies.

In [6]:
import numpy as np
import pandas as pd


We then load each CSV file into a pandas DataFrame.

In [None]:
all_stars_df = pd.read_csv('players_dataset/All_Stars.csv', header=0)
atlantic_df = pd.read_csv('players_dataset/Atlantic.csv', header=0)
central_df = pd.read_csv('players_dataset/Central.csv', header=0)
northwest_df = pd.read_csv('players_dataset/Northwest.csv', header=0)
pacific_df = pd.read_csv('players_dataset/Pacific.csv', header=0)
southeast_df = pd.read_csv('players_dataset/Southeast.csv', header=0)
southwest_df = pd.read_csv('players_dataset/Southwest.csv', header=0)

<class 'str'>


Then, we clean up the data so that Python can read each column with the correct type. For example, weight is stored as a string with a unit suffix and height is stored in feet-and-inches format, so both must be converted to numbers (pounds and inches, as integers). The cell below starts with weight.

In [34]:
def convert_weight(df, weight_col="Weight"):
    df[weight_col] = df[weight_col].apply(lambda x: int(x.strip()[:-2]))
    return df

all_stars_df = convert_weight(all_stars_df)
atlantic_df = convert_weight(atlantic_df)
central_df = convert_weight(central_df)
northwest_df = convert_weight(northwest_df)
pacific_df = convert_weight(pacific_df)
southeast_df = convert_weight(southeast_df)
southwest_df = convert_weight(southwest_df)

print(all_stars_df["Height"])
print(all_stars_df["Weight"])


ValueError: invalid literal for int() with base 10: '195l'
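The `ValueError` above suggests that at least one weight value does not end in exactly two unit characters, so slicing with `[:-2]` leaves a stray letter (`195l`) that `int()` cannot parse. A more forgiving approach is to pull out the digits with a regular expression, which is the same idea the later cell applies with `str.extract`. The sketch below is one possible way to do this using only the standard library. The helper names `parse_weight_lb` and `parse_height_in` are hypothetical and not part of the original notebook, and the sample strings are illustrative.

```python
import re

def parse_weight_lb(text):
    """Return the first run of digits in a weight string such as '240lb'."""
    match = re.search(r"\d+", str(text))
    if match is None:
        raise ValueError(f"no digits found in weight: {text!r}")
    return int(match.group())

def parse_height_in(text):
    """Convert a feet-and-inches string such as 6'10" to total inches."""
    match = re.search(r"(\d+)\s*'\s*(\d+)", str(text))
    if match is None:
        raise ValueError(f"unrecognised height: {text!r}")
    feet, inches = (int(part) for part in match.groups())
    return feet * 12 + inches

# Illustrative inputs (assumed formats, not rows from the dataset)
print(parse_weight_lb(" 195lbs"))  # 195
print(parse_height_in("6'10\""))   # 82
```

With the notebook's DataFrames, these helpers could be applied column-wise, for example `df["Weight"] = df["Weight"].apply(parse_weight_lb)` and `df["Height"] = df["Height"].apply(parse_height_in)`.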

Next, we create a function for calculating the cosine similarity metric.

In [10]:
Southwest_df = pd.read_csv('players_dataset/Southwest.csv', header=0)

print(Southwest_df)

                 Name  Weight Height  PPG (Points per game)  \
0       Dwight Powell  240lb   6'10"                    1.5   
1   Spencer Dinwiddie  215lb    6'5"                   10.5   
2       Klay Thompson  220lb    6'5"                   13.9   
3          Danté Exum  214lb    6'5"                   11.7   
4       Naji Marshall  220lb    6'6"                   11.2   
5       Aaron Holiday   185lb   6'0"                    4.7   
6         Cody Zeller   240lb  6'11"                    1.8   
7         Jalen Green   186lb   6'4"                   21.2   
8        Steven Adams   265lb  6'11"                    3.7   
9        Jock Landale   255lb  6'11"                    4.1   
10       Luke Kennard  206lb    6'5"                    9.3   
11  Marvin Bagley III  235lb   6'10"                    4.7   
12       Desmond Bane  215lb    6'5"                   18.3   
13          Ja Morant  174lb    6'2"                   20.4   
14       John Konchar  210lb     6'5                   

In [21]:
import numpy as np
import pandas as pd
"""
def cosine_similarity(a, b):
    #Compute cosine similarity between two vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Load All-Star and Non-All-Star player data
all_stars_df = pd.read_csv('players_dataset/All_Stars.csv', header=0)
southwest_df = pd.read_csv('players_dataset/Southwest.csv', header=0)
southeast_df = pd.read_csv('players_dataset/Southeast.csv', header=0)
pacific_df = pd.read_csv('players_dataset/Pacific.csv', header=0)
northwest_df = pd.read_csv('players_dataset/Northwest.csv', header=0)
central_df = pd.read_csv('players_dataset/Central.csv', header=0)
atlantic_df = pd.read_csv('players_dataset/Atlantic.csv', header=0)

# Combine all non-All-Star data
divisions = [southwest_df, southeast_df, pacific_df, northwest_df, central_df, atlantic_df]
non_all_star_df = pd.concat(divisions, ignore_index=True)

# Assign labels
all_stars_df["Label"] = 1  # All-Star
non_all_star_df["Label"] = 0  # Non-All-Star

# Combine all data
data = pd.concat([all_stars_df, non_all_star_df], ignore_index=True)

# Convert weight to numeric (removing 'lb')
data["Weight"] = data["Weight"].str.extract(r'(\d+)').astype(float)

# Convert height to inches
height_split = data["Height"].str.extract(r'(?P<feet>\d+)\'(?P<inches>\d+)')
data["Height"] = height_split["feet"].astype(float) * 12 + height_split["inches"].astype(float)

# Select relevant numerical columns
features = ["Weight", "Height", "PPG (Points per game)", "RPG (Rebound per game)", "APG (Assists per game)", "PIE (Player Impact Estimate)"]
players_stats = data[features].apply(pd.to_numeric, errors='coerce').fillna(0)

# Normalize features for better similarity calculation
players_stats = (players_stats - players_stats.min()) / (players_stats.max() - players_stats.min())

# Extract labels
labels = data["Label"].to_numpy()

# k-NN Classification with predefined k values
def predict_all_star(new_player, k):
    new_player = np.array(new_player)
    dataset = players_stats.to_numpy()
    
    # Compute similarities to all players
    similarities = [(cosine_similarity(new_player, dataset[i]), labels[i], data.iloc[i]["Name"], *dataset[i]) for i in range(len(dataset))]
    
    # Sort by highest similarity (descending order)
    sorted_similarities = sorted(similarities, key=lambda x: x[0], reverse=True)
    
    # Select top-k nearest neighbors
    top_k = sorted_similarities[:k]
    
    # Count votes
    all_star_votes = sum(1 for sim in top_k if sim[1] == 1)
    not_all_star_votes = sum(1 for sim in top_k if sim[1] == 0)
    
    # Determine final classification
    prediction = "All-Star" if all_star_votes > not_all_star_votes else "Not All-Star"
    
    return prediction, top_k

def get_valid_input(prompt, convert_func=float):
    while True:
        try:
            value = convert_func(input(prompt))
            return value
        except ValueError:
            print("Invalid input. Please enter a valid value.")

print("Enter player stats:")
name = input("Name: ")
weight = get_valid_input("Weight (lbs): ")
feet = get_valid_input("Height (feet): ", int)
inches = get_valid_input("Height (inches): ", int)
height = feet * 12 + inches
ppg = get_valid_input("PPG (Points per game): ")
rpg = get_valid_input("RPG (Rebound per game): ")
apg = get_valid_input("APG (Assists per game): ")
pie = get_valid_input("PIE (Player Impact Estimate): ")

new_player = [weight, height, ppg, rpg, apg, pie]

# Predict for k values 3, 5, and 7
for k in [3, 5, 7]:
    prediction, top_k = predict_all_star(new_player, k)
    print(f"\nK = {k}")
    print(f"{'Name':<15}{'Weight':<10}{'Height':<10}{'PPG':<10}{'RPG':<10}{'APG':<10}{'PIE':<10}{'Cosine Similarity':<20}")
    for player in top_k:
        print(f"{player[2]:<15}{player[3]:<10.2f}{player[4]:<10.2f}{player[5]:<10.2f}{player[6]:<10.2f}{player[7]:<10.2f}{player[8]:<10.2f}{player[0]:<20.4f}")
    print(f"{name:<15}{weight:<10}{height:<10}{ppg:<10}{rpg:<10}{apg:<10}{pie:<10}{prediction:<15}")"""


import numpy as np
import pandas as pd

def dot(X, Y):
    """Compute the dot product of two vectors."""
    return sum(x * y for x, y in zip(X, Y))

def norm(X):
    """Compute the Euclidean norm of a vector."""
    return sum(x ** 2 for x in X) ** 0.5

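def cosine_similarity(X, Y):
    """Calculate cosine similarity between two vectors (0.0 if either is all zeros)."""
    denominator = norm(X) * norm(Y)
    # Guard against zero vectors, e.g. rows where every stat was filled with 0
    if denominator == 0:
        return 0.0
    return dot(X, Y) / denominator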

# Load All-Star and Non-All-Star player data
all_stars_df = pd.read_csv('players_dataset/All_Stars.csv', header=0)
southwest_df = pd.read_csv('players_dataset/Southwest.csv', header=0)
southeast_df = pd.read_csv('players_dataset/Southeast.csv', header=0)
pacific_df = pd.read_csv('players_dataset/Pacific.csv', header=0)
northwest_df = pd.read_csv('players_dataset/Northwest.csv', header=0)
central_df = pd.read_csv('players_dataset/Central.csv', header=0)
atlantic_df = pd.read_csv('players_dataset/Atlantic.csv', header=0)

# Combine all non-All-Star data
divisions = [southwest_df, southeast_df, pacific_df, northwest_df, central_df, atlantic_df]
non_all_star_df = pd.concat(divisions, ignore_index=True)

# Assign labels
all_stars_df["Label"] = 1  # All-Star
non_all_star_df["Label"] = 0  # Non-All-Star

# Combine all data
data = pd.concat([all_stars_df, non_all_star_df], ignore_index=True)

# Convert weight to numeric (removing 'lb')
data["Weight"] = data["Weight"].str.extract(r'(\d+)').astype(float)

# Convert height to inches
height_split = data["Height"].str.extract(r'(?P<feet>\d+)\'(?P<inches>\d+)')
data["Height"] = height_split["feet"].astype(float) * 12 + height_split["inches"].astype(float)

# Select relevant numerical columns
features = ["Weight", "Height", "PPG (Points per game)", "RPG (Rebound per game)", "APG (Assists per game)", "PIE (Player Impact Estimate)"]
players_stats_original = data[features].apply(pd.to_numeric, errors='coerce').fillna(0)

# Extract labels
labels = data["Label"].to_numpy()

# k-NN Classification with predefined k values
def predict_all_star(new_player, k):
    new_player = np.array(new_player)
    dataset = players_stats_original.to_numpy()
    
    # Compute similarities to all players
    similarities = [(cosine_similarity(new_player, dataset[i]), labels[i], data.iloc[i]["Name"], *players_stats_original.iloc[i]) for i in range(len(dataset))]
    
    # Sort by highest similarity (descending order)
    sorted_similarities = sorted(similarities, key=lambda x: x[0], reverse=True)
    
    # Select top-k nearest neighbors
    top_k = sorted_similarities[:k]
    
    # Count votes
    all_star_votes = sum(1 for sim in top_k if sim[1] == 1)
    not_all_star_votes = sum(1 for sim in top_k if sim[1] == 0)
    
    # Determine final classification
    prediction = "All-Star" if all_star_votes > not_all_star_votes else "Not All-Star"
    
    return prediction, top_k

def get_valid_input(prompt, convert_func=float):
    while True:
        try:
            value = convert_func(input(prompt))
            return value
        except ValueError:
            print("Invalid input. Please enter a valid value.")

print("Enter player stats:")
name = input("Name: ")
weight = get_valid_input("Weight (lbs): ")
feet = get_valid_input("Height (feet): ", int)
inches = get_valid_input("Height (inches): ", int)
height = feet * 12 + inches
ppg = get_valid_input("PPG (Points per game): ")
rpg = get_valid_input("RPG (Rebound per game): ")
apg = get_valid_input("APG (Assists per game): ")
pie = get_valid_input("PIE (Player Impact Estimate): ")

new_player = [weight, height, ppg, rpg, apg, pie]

# Predict for k values 7, 9, and 11
for k in [7, 9, 11]:
    prediction, top_k = predict_all_star(new_player, k)
    print(f"\nK = {k}")
    print(f"{'Name':<15}{'Weight':<10}{'Height':<10}{'PPG':<10}{'RPG':<10}{'APG':<10}{'PIE':<10}{'Cosine Similarity':<20}{'Label':<15}")
    for player in top_k:
        label = "All-Star" if player[1] == 1 else "Not All-Star"
        print(f"{player[2]:<15}{player[3]:<10.2f}{player[4]:<10.2f}{player[5]:<10.2f}{player[6]:<10.2f}{player[7]:<10.2f}{player[8]:<10.2f}{player[0]:<20.4f}{label:<15}")
    print(f"{name:<15}{weight:<10.2f}{height:<10.2f}{ppg:<10.2f}{rpg:<10.2f}{apg:<10.2f}{pie:<10.2f}{'':<20}{prediction:<15}")




Enter player stats:

K = 7
Name           Weight    Height    PPG       RPG       APG       PIE       Cosine Similarity   Label          
Stephen Curry  185.00    74.00     23.40     4.50      6.10      13.60     1.0000              All-Star       
Tyler Herro    195.00    77.00     23.90     5.50      5.50      13.70     1.0000              All-Star       
Tyler Herro    195.00    77.00     23.90     5.50      5.50      13.70     1.0000              Not All-Star   
Jalen Brunson  190.00    74.00     26.10     2.80      7.50      13.90     0.9998              All-Star       
Damian Lillard 195.00    74.00     25.80     4.70      7.50      14.70     0.9998              All-Star       
Kyrie Irving   195.00    74.00     24.60     4.80      4.80      12.80     0.9998              All-Star       
Zach LaVine    200.00    77.00     23.70     4.60      4.60      12.00     0.9998              Not All-Star   
james          185.00    74.00     23.40     4.50      6.10      13.60               
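In the table above, Tyler Herro appears twice with identical stats, once labeled All-Star and once labeled Not All-Star. This suggests that All-Stars are also present in the division files, so they receive a second row with label 0 and can cast contradictory votes. One way to avoid this is to drop every All-Star from the division data before labeling. The sketch below shows the idea on plain dictionaries so that it runs without the CSV files. The function name `drop_all_stars` is hypothetical, and matching on name alone assumes that names are unique in the dataset.

```python
def drop_all_stars(division_rows, all_star_rows, key="Name"):
    """Return the division rows whose key does not appear among the All-Stars."""
    all_star_names = {row[key] for row in all_star_rows}
    return [row for row in division_rows if row[key] not in all_star_names]

# Tiny illustrative rows; the names are taken from the neighbor table above.
all_stars = [{"Name": "Tyler Herro"}, {"Name": "Stephen Curry"}]
division = [{"Name": "Tyler Herro"}, {"Name": "Zach LaVine"}]

remaining = drop_all_stars(division, all_stars)
print([row["Name"] for row in remaining])  # ['Zach LaVine']
```

With the notebook's DataFrames, the equivalent filter would be `non_all_star_df = non_all_star_df[~non_all_star_df["Name"].isin(all_stars_df["Name"])]`, applied before the `Label` column is assigned.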

Next, we create a function for calculating the Euclidean distance metric.

In [22]:
def euclidian_distance(values):
    # Should accept 6 values (or any number of values) and return the Euclidean distance between them
    pass
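The stub above is still empty. As a rough sketch of what it might become, the Euclidean distance between two vectors $X$ and $Y$ of equal length is $d(X, Y) = \sqrt{\sum_i (x_i - y_i)^2}$. The version below assumes the function should take two vectors (a new player and an existing player) rather than a single `values` argument; that signature, and the name `euclidean_distance`, are assumptions and not the author's final design. The two sample rows are copied from the K = 7 table earlier in the notebook.

```python
import math

def euclidean_distance(X, Y):
    """Return the straight-line distance between two equal-length vectors."""
    if len(X) != len(Y):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(X, Y)))

# Feature order: [Weight, Height, PPG, RPG, APG, PIE]
curry = [185.0, 74.0, 23.4, 4.5, 6.1, 13.6]
lavine = [200.0, 77.0, 23.7, 4.6, 4.6, 12.0]

print(round(euclidean_distance(curry, lavine), 2))
```

Unlike cosine similarity, a smaller value means a closer match, so a k-NN search based on this metric would sort in ascending order. On raw stats the weight difference (15 lb here) accounts for most of the distance, so scaling the features first may be worth considering.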