# **ARE YOU ALL STARS MATERIAL?**

In this notebook, we will explore whether a player with certain stats is All-Stars material or not using a machine learning approach. We will start by gathering and preparing our dataset, which consists of comprehensive statistics of NBA players, both current and past. This data has been meticulously collected and verified from the official [NBA website](https://www.nba.com/).

We will then proceed with data preprocessing, feature selection, and model training to predict the likelihood of a player being an All-Star. Let's dive into the exciting world of basketball analytics and machine learning!

First of all, we import the necessary dependencies.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from scipy.spatial.distance import mahalanobis

We then load each CSV into a pandas DataFrame and combine the six division files into a single DataFrame of non-All-Star players.

In [2]:
all_stars_df = pd.read_csv('players_dataset/All_Stars.csv', header=0)
atlantic_df = pd.read_csv('players_dataset/Atlantic.csv', header=0)
central_df = pd.read_csv('players_dataset/Central.csv', header=0)
northwest_df = pd.read_csv('players_dataset/Northwest.csv', header=0)
pacific_df = pd.read_csv('players_dataset/Pacific.csv', header=0)
southeast_df = pd.read_csv('players_dataset/Southeast.csv', header=0)
southwest_df = pd.read_csv('players_dataset/Southwest.csv', header=0)

divisions = [southwest_df, southeast_df, pacific_df, northwest_df, central_df, atlantic_df]
non_all_star_df = pd.concat(divisions, ignore_index=True)

Assign binary labels for all-stars and non all-stars. In our case, 1 denotes all-star, 0 denotes otherwise.

In [3]:
all_stars_df["Label"] = 1  # All-Star
non_all_star_df["Label"] = 0  # Non-All-Star

all_players_df = pd.concat([all_stars_df, non_all_star_df], ignore_index=True)

Then, we clean up the data so that pandas can treat each column as numeric. For example, a player's height is stored in feet-and-inches format, so we convert it to a single number of inches. To keep the code simple, we use regular expressions.

In [4]:
# Extract the numerical value of the weight, removing the "lbs" suffix
all_players_df["Weight"] = all_players_df["Weight"].str.extract(r'(\d+)').astype(float)

# Converting the height to inches from feet and inches format
height_split = all_players_df["Height"].str.extract(r'(?P<feet>\d+)\'(?P<inches>\d+)')
all_players_df["Height"] = height_split["feet"].astype(float) * 12 + height_split["inches"].astype(float)
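
To make the conversion concrete, here is a minimal plain-Python sketch of what the two patterns capture on a single value. The helper names `height_to_inches` and `weight_to_lbs` and the sample strings are hypothetical; they assume the CSV stores values like `6'7"` and `215 lbs`, which may not match the real files exactly.

```python
import re

def height_to_inches(text):
    """Convert a feet'inches string to total inches (hypothetical helper)."""
    match = re.search(r"(\d+)'(\d+)", text)
    if match is None:
        return None  # leave unparseable input missing rather than guessing
    feet, inches = match.groups()
    return float(feet) * 12 + float(inches)

def weight_to_lbs(text):
    """Extract the first run of digits from a weight string (hypothetical helper)."""
    match = re.search(r"(\d+)", text)
    return float(match.group(1)) if match else None

print(height_to_inches("6'7\""))  # 79.0
print(weight_to_lbs("215 lbs"))   # 215.0
print(height_to_inches("6 ft"))   # None
```

The last call shows why it is worth checking for missing values after the conversion: a height that does not follow the `feet'inches` pattern yields no match instead of a number.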

## Distance Metrics
Before we build our classifiers, let's first define the different distance metrics we will use to measure similarity between data points. These metrics help determine how "close" two players are based on their stats. We'll be using 3 different metrics for comparison:
1. Cosine Similarity
2. Euclidean distance
3. Mahalanobis distance
<br>

### Cosine Similarity
Cosine similarity measures the similarity between two vectors based on the angle between them. It is useful when the magnitude of the values does not matter, only their direction (e.g., comparing player performance trends rather than raw numbers). We use numpy's extensive mathematical functions for this.

In [5]:
def cosine_similarity(v1, v2):
    """Calculate cosine similarity between two vectors. The cosine similarity is a measure of similarity between two non-zero vectors of an inner product space."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
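
As a quick illustration of the "direction, not magnitude" idea, the sketch below compares a made-up stat line with a scaled copy of itself and with a differently shaped one. The vectors are hypothetical `[PPG, RPG, APG]` values, not rows from the dataset.

```python
import numpy as np

def cosine_similarity(v1, v2):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))

a = np.array([10.0, 5.0, 2.0])  # hypothetical scorer profile
b = 2 * a                       # same profile at twice the volume
c = np.array([2.0, 5.0, 10.0])  # hypothetical playmaker profile

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
print(sim_ab)  # approximately 1.0: scaling does not change the angle
print(sim_ac)  # noticeably lower: the profiles point in different directions
```

Because `b` is a multiple of `a`, the two are treated as essentially identical, which is worth keeping in mind when raw volume (e.g. points per game) is exactly what separates All-Stars from other players.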

### Euclidean Distance
Euclidean distance measures the straight-line distance between two points in feature space. It is a common default for KNN. It weights every feature difference equally, so features with larger numeric ranges (such as weight in pounds) contribute more to the distance.

In [6]:
def euclidean_distance(v1, v2):
    """Calculate euclidean distance between two vectors. For every dimension, calculate the difference between the two vectors, square it, sum all the squared differences, and take the square root of the sum."""
    return np.sqrt(np.sum((np.array(v1) - np.array(v2))**2))
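
One consequence of weighting every feature equally is sensitivity to units. The sketch below uses three hypothetical players described as `[height_in, weight_lbs, ppg]` to show how a weight gap can outweigh a large scoring gap. The numbers are invented for illustration.

```python
import numpy as np

def euclidean_distance(v1, v2):
    """Straight-line distance between two vectors."""
    return np.sqrt(np.sum((np.array(v1) - np.array(v2)) ** 2))

p1 = np.array([78.0, 200.0, 25.0])
p2 = np.array([78.0, 230.0, 25.0])  # 30 lbs heavier, same scoring
p3 = np.array([78.0, 200.0, 5.0])   # same build, 20 fewer points per game

d_weight = euclidean_distance(p1, p2)
d_ppg = euclidean_distance(p1, p3)
print(d_weight, d_ppg)  # 30.0 20.0
```

Here the heavier player with identical scoring ends up "farther" from `p1` than the player who scores 20 fewer points per game. A common remedy, which this notebook does not apply, is to standardize each feature before computing distances.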

### Mahalanobis Distance
Mahalanobis distance accounts for correlations between variables and scales the distances accordingly. It is particularly useful when features (e.g., height and weight) are correlated. Here we use SciPy's `mahalanobis` function so that the matrix multiplication is handled for us. Our wrapper inverts the covariance matrix first, because the SciPy function expects the inverse covariance matrix.

In [7]:
def mahalanobis_distance(v1, v2, cov_matrix):
    """Calculate the Mahalanobis distance between two vectors. The Mahalanobis distance is a measure of the distance between a point and a distribution."""
    inv_cov = np.linalg.inv(cov_matrix)
    return mahalanobis(v1, v2, inv_cov)
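
For intuition, the same quantity can be written out by hand as $\sqrt{(u-v)^T S^{-1} (u-v)}$, where $S$ is the covariance matrix. The sketch below is a NumPy-only version with a made-up 2x2 covariance matrix; `mahalanobis_manual` is a hypothetical name and is not part of the notebook.

```python
import numpy as np

def mahalanobis_manual(u, v, cov):
    """sqrt((u - v)^T S^-1 (u - v)) for covariance matrix S."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    inv_cov = np.linalg.inv(cov)
    return float(np.sqrt(diff @ inv_cov @ diff))

# Hypothetical covariance: two positively correlated features
cov = np.array([[4.0, 2.0],
                [2.0, 4.0]])
origin = np.array([0.0, 0.0])

along = mahalanobis_manual([2.0, 2.0], origin, cov)     # moves with the correlation
against = mahalanobis_manual([2.0, -2.0], origin, cov)  # moves against it
plain = mahalanobis_manual([2.0, 2.0], origin, np.eye(2))

print(along, against, plain)
```

Both test points are the same Euclidean distance from the origin, but the one that follows the correlation is considered closer than the one that goes against it. With an identity covariance matrix the result reduces to the ordinary Euclidean distance.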

## K-Nearest Neighbour

Now that all data is tidy and clean, we start with training the model using K-Nearest Neighbour (KNN). KNN is a supervised machine learning algorithm used for classification and regression. It works by finding the k closest points (neighbors) to a given data point and assigning a label based on the majority vote of those neighbors.

Before applying KNN, we need to split our dataset into a **training set** and a **test set**. The training set is used to teach the model, while the test set evaluates its performance.

In [8]:
# Let x = features
# player_features stores an array of arrays of features of all players
player_features = all_players_df[["Height", "Weight", "PPG (Points per game)", "RPG (Rebound per game)", "APG (Assists per game)", "PIE (Player Impact Estimate)"]]

# Let y = labels
# player_labels stores an array of labels (1 or 0, all-stars or not) of all players
player_labels = all_players_df["Label"]

# player_names stores an array of names of all players
player_names = all_players_df["Name"]

# Perform train-test split with 90% training data and 10% testing data and keeping names to identify players
x_train, x_test, y_train, y_test, train_name, test_name = train_test_split(player_features, player_labels, player_names, test_size=0.1, random_state=42)

# Get the covariance matrix of the training data
cov_matrix = np.cov(x_train, rowvar=False)

Now, let's create a function to predict labels using KNN with different distance metrics. Here's how we implemented this manually:

1. **Initialization**: Choose the number of neighbors `k` and a distance metric (e.g., Euclidean, Cosine, Mahalanobis).

2. **Distance Calculation**: For a new data point, calculate the distance between this point and all points in the training set using the chosen distance metric.

3. **Sorting**: Sort the results so the closest neighbors come first: ascending order for distances (Euclidean, Mahalanobis), descending order for cosine similarity.

4. **Neighbor Selection**: Select the top `k` closest points (neighbors) from the sorted list.

5. **Voting**: Count the labels of the selected `k` neighbors. The label with the highest count is the predicted label for the new data point.

6. **Prediction**: Assign the predicted label to the new data point.
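
The steps above can be sketched on a toy dataset before applying them to the player data. This is a compact illustration using Euclidean distance only. The function name `knn_predict_one` and the six 2-D points are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict_one(x_train, y_train, point, k):
    """Predict the label of a single point by majority vote of its k nearest neighbours."""
    # Step 2: distance from the new point to every training point
    distances = np.sqrt(np.sum((x_train - point) ** 2, axis=1))
    # Steps 3-4: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Steps 5-6: the most frequent label among those neighbours is the prediction
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

x_train = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                    [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict_one(x_train, y_train, np.array([7.5, 8.0]), k=3))  # 1
print(knn_predict_one(x_train, y_train, np.array([1.5, 1.5]), k=3))  # 0
```

For a similarity measure such as cosine similarity, the only change is to take the `k` largest values instead of the `k` smallest.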

In [9]:
def knn_prediction(x_train, y_train, x_test, k, metric, cov_matrix=None):
    """A K-Nearest Neighbors classifier that predicts the label of the test data based on the training data and a distance metric."""
    predictions = []
    for test_point in x_test:
        distances = []
        for i, train_point in enumerate(x_train):
            if metric == 'cosine':
                distance = cosine_similarity(test_point, train_point)
            elif metric == 'euclidean':
                distance = euclidean_distance(test_point, train_point)
            elif metric == 'mahalanobis':
                distance = mahalanobis_distance(test_point, train_point, cov_matrix)
            distances.append((distance, y_train[i]))
        
        # Sort so the nearest neighbours come first: descending for cosine similarity (larger = closer), ascending for distances
        distances.sort(reverse=(metric == 'cosine'))
        k_neighbours = [label for j, label in distances[:k]]

        # Predict the label of the test point based on the majority label of the k-nearest neighbors
        prediction = 1 if k_neighbours.count(1) > k_neighbours.count(0) else 0
        predictions.append(prediction)

    return predictions

Now we run KNN with all three distance metrics and evaluate the predictions on the test set. The performance metrics are defined as follows:
1. **Accuracy**: How often the model predicts correctly. [$\frac{True\:Positives\:+\:True\:Negatives}{Number\:of\:Samples}$]
<br><br>
2. **Precision**: How many predicted positives are correct. [$\frac{True\:Positives}{True\:Positives\:+\:False\:Positives}$]
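
Both formulas can be computed from the same four confusion counts. The sketch below is one possible plain-Python version. The function names and the example labels are hypothetical, and returning `0.0` when there are no predicted positives is a convention chosen here rather than something the formula dictates.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, true negatives, false positives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, _, _ = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def precision(y_true, y_pred):
    tp, _, fp, _ = confusion_counts(y_true, y_pred)
    # Assumption: report 0.0 when the model predicts no positives at all
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

y_true = [1, 0, 0, 1, 0, 1]  # made-up ground truth
y_pred = [1, 0, 1, 0, 0, 1]  # made-up predictions

print(confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
print(accuracy(y_true, y_pred))
print(precision(y_true, y_pred))
```

With few All-Stars in the test set, a model that always predicts 0 can still score a high accuracy, which is why precision is worth reporting alongside it.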

In [14]:
def calculate_accuracy(y_test, y_pred):
    """Calculate the accuracy of the model."""
    true_positives = true_negatives = 0
    for i in range(len(y_test)):
        if y_test[i] == 1 and y_pred[i] == 1:
            true_positives += 1
        elif y_test[i] == 0 and y_pred[i] == 0:
            true_negatives += 1
    
    return (true_positives + true_negatives) / len(y_test)

def get_knn_results(x_train, y_train, x_test, y_test, k_values, cov_matrix=None):
    """Print the K-Nearest Neighbors predictions and accuracy for each k in k_values, using all three distance metrics."""
    for k in k_values:
        print(f"-----K = {k}-----")
                
        cosine_pred = knn_prediction(x_train, y_train, x_test, k, 'cosine')
        euclidean_pred = knn_prediction(x_train, y_train, x_test, k, 'euclidean')
        mahalanobis_pred = knn_prediction(x_train, y_train, x_test, k, 'mahalanobis', cov_matrix)

        cosine_accuracy = calculate_accuracy(y_test, cosine_pred)
        euclidean_accuracy = calculate_accuracy(y_test, euclidean_pred)
        mahalanobis_accuracy = calculate_accuracy(y_test, mahalanobis_pred)

        # print the test and prediction labels with names of the players in a pretty way
        for i in range(len(y_test)):
            print(f"Name: {test_name.iloc[i]}, Test Label: {y_test[i]}, Cosine Prediction: {cosine_pred[i]}, Euclidean Prediction: {euclidean_pred[i]}, Mahalanobis Prediction: {mahalanobis_pred[i]}")

        # for i in range(len(y_test)):
        #     print(f"Name: {test_name[i]}, Test Label: {y_test[i]}, Cosine Prediction: {cosine_pred[i]}, Euclidean Prediction: {euclidean_pred[i]}, Mahalanobis Prediction: {mahalanobis_pred[i]}")

        print(f"Cosine Accuracy: {cosine_accuracy}")
        print(f"Euclidean Accuracy: {euclidean_accuracy}")
        print(f"Mahalanobis Accuracy: {mahalanobis_accuracy}")
        
    return 0

# cosine_pred = knn_prediction(x_train.values, y_train.values, x_test.values, 5, 'cosine')
# euclidean_pred = knn_prediction(x_train.values, y_train.values, x_test.values, 5, 'euclidean')
# mahalanobis_pred = knn_prediction(x_train.values, y_train.values, x_test.values, 5, 'mahalanobis', cov_matrix)

# cosine_accuracy = calculate_accuracy(y_test.values, cosine_pred)
# euclidean_accuracy = calculate_accuracy(y_test.values, euclidean_pred)
# mahalanobis_accuracy = calculate_accuracy(y_test.values, mahalanobis_pred)

k_values = [3, 5, 7, 11]
get_knn_results(x_train.values, y_train.values, x_test.values, y_test.values, k_values, cov_matrix)

-----K = 3-----
Name: Donovan Mitchell, Test Label: 1, Cosine Prediction: 1, Euclidean Prediction: 0, Mahalanobis Prediction: 0
Name: Steven Adams, Test Label: 0, Cosine Prediction: 0, Euclidean Prediction: 0, Mahalanobis Prediction: 0
Name: Stanley Umude, Test Label: 0, Cosine Prediction: 0, Euclidean Prediction: 0, Mahalanobis Prediction: 0
Name: DeAndre Jordan, Test Label: 0, Cosine Prediction: 0, Euclidean Prediction: 0, Mahalanobis Prediction: 0
Name: Jalen Duren, Test Label: 0, Cosine Prediction: 0, Euclidean Prediction: 0, Mahalanobis Prediction: 0
Name: Evan Mobley, Test Label: 1, Cosine Prediction: 1, Euclidean Prediction: 0, Mahalanobis Prediction: 0
Name: Trae Young, Test Label: 1, Cosine Prediction: 0, Euclidean Prediction: 0, Mahalanobis Prediction: 0
Name: DaQuan Jeffries, Test Label: 0, Cosine Prediction: 0, Euclidean Prediction: 0, Mahalanobis Prediction: 0
Name: Joe Ingles, Test Label: 0, Cosine Prediction: 0, Euclidean Prediction: 0, Mahalanobis Prediction: 0
Name: Do

0

In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

# Sample DataFrame
data = {
    "Name": ["Player A", "Player B", "Player C", "Player D", "Player E"],
    "Weight": [200, 220, 240, 260, 280],
    "Height": [76, 78, 80, 82, 84],
    "PPG": [10, 15, 20, 25, 30],
    "All-Stars": [False, True, False, True, True]
}

df = pd.DataFrame(data)

# Extract player names, features (X), and labels (y)
player_names = df["Name"]  
X = df[["Weight", "Height", "PPG"]]  
y = df["All-Stars"]  

# Perform train-test split, keeping names
X_train, X_test, y_train, y_test, train_names, test_names = train_test_split(X, y, player_names, test_size=0.4, random_state=42)

# Convert back to DataFrames
train_df = pd.DataFrame(X_train)
train_df["Name"] = train_names
train_df["All-Stars"] = y_train

test_df = pd.DataFrame(X_test)
test_df["Name"] = test_names
test_df["All-Stars"] = y_test

print("\n--- Training Set ---")
print(train_df)

print("\n--- Test Set ---")
print(test_df)


In [None]:
import numpy as np
import pandas as pd

def dot(X, Y):
    """Compute the dot product of two vectors."""
    return sum(x * y for x, y in zip(X, Y))

def norm(X):
    """Compute the Euclidean norm of a vector."""
    return sum(x ** 2 for x in X) ** 0.5

def cosine_similarity(X, Y):
    """Calculate cosine similarity between two vectors."""
    return dot(X, Y) / (norm(X) * norm(Y))

# Load All-Star and Non-All-Star player data
all_stars_df = pd.read_csv('players_dataset/All_Stars.csv', header=0)
southwest_df = pd.read_csv('players_dataset/Southwest.csv', header=0)
southeast_df = pd.read_csv('players_dataset/Southeast.csv', header=0)
pacific_df = pd.read_csv('players_dataset/Pacific.csv', header=0)
northwest_df = pd.read_csv('players_dataset/Northwest.csv', header=0)
central_df = pd.read_csv('players_dataset/Central.csv', header=0)
atlantic_df = pd.read_csv('players_dataset/Atlantic.csv', header=0)

# Combine all non-All-Star data
divisions = [southwest_df, southeast_df, pacific_df, northwest_df, central_df, atlantic_df]
non_all_star_df = pd.concat(divisions, ignore_index=True)

# Assign labels
all_stars_df["Label"] = 1  # All-Star
non_all_star_df["Label"] = 0  # Non-All-Star

# Combine all data
data = pd.concat([all_stars_df, non_all_star_df], ignore_index=True)

# Convert weight to numeric (removing 'lb')
data["Weight"] = data["Weight"].str.extract(r'(\d+)').astype(float)

# Convert height to inches
height_split = data["Height"].str.extract(r'(?P<feet>\d+)\'(?P<inches>\d+)')
data["Height"] = height_split["feet"].astype(float) * 12 + height_split["inches"].astype(float)

# Select relevant numerical columns
features = ["Weight", "Height", "PPG (Points per game)", "RPG (Rebound per game)", "APG (Assists per game)", "PIE (Player Impact Estimate)"]
players_stats_original = data[features].apply(pd.to_numeric, errors='coerce').fillna(0)

# Extract labels
labels = data["Label"].to_numpy()

# k-NN Classification with predefined k values
def predict_all_star(new_player, k):
    new_player = np.array(new_player)
    dataset = players_stats_original.to_numpy()
    
    # Compute similarities to all players
    similarities = [(cosine_similarity(new_player, dataset[i]), labels[i], data.iloc[i]["Name"], *players_stats_original.iloc[i]) for i in range(len(dataset))]
    
    # Sort by highest similarity (descending order)
    sorted_similarities = sorted(similarities, key=lambda x: x[0], reverse=True)
    
    # Select top-k nearest neighbors
    top_k = sorted_similarities[:k]
    
    # Count votes
    all_star_votes = sum(1 for sim in top_k if sim[1] == 1)
    not_all_star_votes = sum(1 for sim in top_k if sim[1] == 0)
    
    # Determine final classification
    prediction = "All-Star" if all_star_votes > not_all_star_votes else "Not All-Star"
    
    return prediction, top_k



def get_valid_input(prompt, convert_func=float):
    while True:
        try:
            value = convert_func(input(prompt))
            return value
        except ValueError:
            print("Invalid input. Please enter a valid value.")

# print("Enter player stats:")
name = input("Name: ")
weight = get_valid_input("Weight (lbs): ")
feet = get_valid_input("Height (feet): ", int)
inches = get_valid_input("Height (inches): ", int)
height = feet * 12 + inches
ppg = get_valid_input("PPG (Points per game): ")
rpg = get_valid_input("RPG (Rebound per game): ")
apg = get_valid_input("APG (Assists per game): ")
pie = get_valid_input("PIE (Player Impact Estimate): ")

new_player = [weight, height, ppg, rpg, apg, pie]

# Predict for k values 7, 9, and 11
for k in [7, 9, 11]:
    prediction, top_k = predict_all_star(new_player, k)
    print(f"\nK = {k}")
    print(f"{'Name':<15}{'Weight':<10}{'Height':<10}{'PPG':<10}{'RPG':<10}{'APG':<10}{'PIE':<10}{'Cosine Similarity':<20}{'Label':<15}")
    for player in top_k:
        label = "All-Star" if player[1] == 1 else "Not All-Star"
        print(f"{player[2]:<15}{player[3]:<10.2f}{player[4]:<10.2f}{player[5]:<10.2f}{player[6]:<10.2f}{player[7]:<10.2f}{player[8]:<10.2f}{player[0]:<20.4f}{label:<15}")
    print(f"{name:<15}{weight:<10.2f}{height:<10.2f}{ppg:<10.2f}{rpg:<10.2f}{apg:<10.2f}{pie:<10.2f}{'':<20}{prediction:<15}")


Function for calculating the Euclidean distance between two points:

In [None]:
def euclidian_distance(arr1, arr2):
    """Return the Euclidean distance between two equal-length sequences of numbers."""
    a1 = np.array(arr1)
    a2 = np.array(arr2)
    return np.sqrt(np.sum((a1 - a2)**2))

Functions for the centroid classifier, with checks for Euclidean distance, cosine similarity, and Mahalanobis distance:

In [None]:
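# Start for centroid classifier
def centroid_classifier(new_player):
    """Classify new_player by its nearest class centroid; returns (euclidean_label, cosine_label)."""
    # Average the cleaned numeric stats per class. The Height and Weight columns
    # in all_stars_df / non_all_star_df are still raw strings, so we use players_stats_original.
    all_star_averages = players_stats_original[labels == 1].mean().to_numpy()
    non_all_star_averages = players_stats_original[labels == 0].mean().to_numpy()

    euclidean_label = centroid_check_euclidian(new_player, all_star_averages, non_all_star_averages)
    cosine_label = centroid_check_cosine(new_player, all_star_averages, non_all_star_averages)
    return euclidean_label, cosine_label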

# Compare the new point with the class centroids using Euclidean distance
def centroid_check_euclidian(new_player, all_star_avg, non_all_star_avg):
    # Calculate Euclidean distances to both centroids
    euclidian_distance_to_all_star = euclidian_distance(new_player, all_star_avg)
    euclidian_distance_to_non_all_star = euclidian_distance(new_player, non_all_star_avg)

    # Classify based on the closer centroid
    if euclidian_distance_to_all_star < euclidian_distance_to_non_all_star:
        return 1
    else:
        return 0

# Compare the new point with the class centroids using cosine similarity
def centroid_check_cosine(new_player, all_star_avg, non_all_star_avg):
    cosine_similarity_to_all_star = cosine_similarity(new_player, all_star_avg)
    cosine_similarity_to_non_all_star = cosine_similarity(new_player, non_all_star_avg)
    
    # Classify based on which similarity is larger (closer to 1)
    if cosine_similarity_to_all_star > cosine_similarity_to_non_all_star:
        return 1
    else:
        return 0


# Compare the new point with the class centroids using Mahalanobis distance
# (relies on mahalanobis_distance defined earlier, which needs a covariance matrix)
def centroid_mahalobis_distance(new_player, all_star_avg, non_all_star_avg, cov_matrix):
    mahalanobis_distance_to_all_star = mahalanobis_distance(new_player, all_star_avg, cov_matrix)
    mahalanobis_distance_to_non_all_star = mahalanobis_distance(new_player, non_all_star_avg, cov_matrix)
    
    # Classify based on the closer centroid
    if mahalanobis_distance_to_all_star < mahalanobis_distance_to_non_all_star:
        return 1
    else:
        return 0
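
As a self-contained illustration of the centroid idea, the sketch below builds one centroid per class from a handful of made-up `[PPG, RPG, APG]` rows and assigns a new point to the nearer centroid by Euclidean distance. The data and the `nearest_centroid` helper are hypothetical and only meant to show the mechanics.

```python
import numpy as np

# Hypothetical [PPG, RPG, APG] rows for each class
all_stars = np.array([[27.0, 8.0, 6.0], [25.0, 10.0, 4.0], [30.0, 5.0, 8.0]])
others = np.array([[8.0, 3.0, 2.0], [10.0, 4.0, 1.0], [6.0, 2.0, 3.0]])

# One centroid per label: the column-wise mean of that class
centroids = {1: all_stars.mean(axis=0), 0: others.mean(axis=0)}

def nearest_centroid(point, centroids):
    """Return the label whose centroid is closest to point (Euclidean distance)."""
    distances = {label: np.linalg.norm(point - c) for label, c in centroids.items()}
    return min(distances, key=distances.get)

print(nearest_centroid(np.array([24.0, 7.0, 5.0]), centroids))  # 1
print(nearest_centroid(np.array([9.0, 3.0, 2.0]), centroids))   # 0
```

Unlike KNN, this approach compares a new player against only two points, so prediction is cheap, but it can miss players whose stats are far from the average of either class.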