Pre-Analysis Data Considerations: Yes, there were missing values. To account for them, I replaced each one with a value that keeps its column consistent. For weight and height, I used the mean of the respective column. For counting stats such as assists and rebounds, a missing value meant nothing was recorded, so I inserted 0's in those positions. I also replaced string values with numerical codes chosen to reflect how many points those players would tend to score. Finally, I dropped any rows that still contained NaN values as a safety measure.
I tried to include only the features that had a linear relationship with the number of points each player scored.
Features are discussed in depth below.

In [98]:
#Project 1, Nick Hageman and Nathan Schaefer
import csv
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn import (datasets, neighbors,
                     naive_bayes,
                     model_selection as skms,
                     linear_model, dummy,
                     metrics,
                     pipeline,
                     preprocessing as skpre)

#import ipywidgets as widgets
#from ipywidgets import interact


The cell below lists the chosen features for the model. I tried to choose only features that help differentiate players by how many points they score, to improve the model's predictions. Some of these columns had missing values, so I explain how I dealt with those in the next cell.

In [99]:

#Choosing features
features = [
    "HEIGHT",
    "WEIGHT",
    "SEASON_EXP",
    "AST",
    "REB",
    "PIE",
    "DLEAGUE_FLAG",
    "ALL_STAR_APPEARANCES",
    "DRAFT_ROUND",
    "POSITION",
    "PTS"
]

features2 = [
    "HEIGHT",
    "WEIGHT",
    "SEASON_EXP",
    "AST",
    "REB",
    "PIE",
    "ALL_STAR_APPEARANCES",
    "DRAFT_ROUND",
    'POSITION',
    "DLEAGUE_FLAG"
]
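
One way to check the linearity assumption behind this feature list is to look at each column's Pearson correlation with PTS. The sketch below is only an illustration: it uses a small invented table instead of train.csv, and `toy` and `corr_with_pts` are hypothetical names. Only the column names come from the lists above.

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv (all values are invented).
toy = pd.DataFrame({
    "AST": [1.0, 2.5, 4.0, 6.5, 8.0],
    "REB": [2.0, 3.0, 5.5, 7.0, 9.5],
    "HEIGHT": [80, 75, 78, 77, 79],
    "PTS": [4.0, 7.5, 11.0, 16.0, 20.5],
})

# Pearson correlation of every candidate feature with the target,
# sorted so the most linearly related features come first.
corr_with_pts = toy.corr()["PTS"].drop("PTS").sort_values(ascending=False)
print(corr_with_pts)
```

A value near +1 or -1 suggests a roughly linear relationship with PTS, while a value near 0 suggests the feature adds little to a linear model on its own.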


For assists, rebounds, player impact estimate, and All-Star appearances, I assumed a missing value meant the player had no such contributions, so I replaced those values with 0's rather than with the column mean, which would have thrown off the model. This was not appropriate for height and weight, because a player can't have a height or weight of 0, so I filled those with the mean of the respective column to keep the data consistent. For the draft round feature, some values were the string "Undrafted", which I replaced with 4 so that undrafted players would be grouped with late-round picks. I used a similar strategy for position, assigning the highest values to the positions I expected to score the most points. Lastly, I changed the DLEAGUE_FLAG values to 1's and 0's so the model could interpret them instead of raising errors on strings. For good measure, I removed all rows that still contained NaN values. I also had to convert the feature columns to a numeric dtype; I don't fully understand why, but it fixed the errors.

In [100]:

data_train_df = pd.read_csv("train.csv")

# Take an explicit copy so the edits below don't trigger SettingWithCopyWarning
data_train_drop = data_train_df[features].copy()

# Missing counting stats mean nothing was recorded, so fill with 0
for col in ["AST", "REB", "PIE", "ALL_STAR_APPEARANCES"]:
    data_train_drop[col] = data_train_drop[col].fillna(0)

# A height/weight of 0 is impossible, so fill with the column mean
for col in ["HEIGHT", "WEIGHT"]:
    data_train_drop[col] = data_train_drop[col].fillna(data_train_drop[col].mean())

# Group undrafted players with late-round picks by coding them as round 4
data_train_drop["DRAFT_ROUND"] = data_train_drop["DRAFT_ROUND"].replace({"Undrafted": 4, "None": 4}).astype(float)

# Encode positions as numbers; missing positions become 0
data_train_drop["POSITION"] = data_train_drop["POSITION"].replace({
    "Center": 2, "Forward": 8, "Guard": 3, "Center-Forward": 7,
    "Guard-Forward": 6, "Forward-Center": 5, "Forward-Guard": 4}).astype(float)
data_train_drop["POSITION"] = data_train_drop["POSITION"].fillna(0)

#Converts Y/N strings to 1's and 0's (any other value becomes None and is dropped below)
def conversion(x):
    if x == "Y":
        return 1
    elif x == "N":
        return 0
data_train_drop["DLEAGUE_FLAG"] = data_train_drop["DLEAGUE_FLAG"].apply(conversion)

#Remove all rows with NaN values
data_train_df = data_train_drop.dropna().copy()

#Cast the feature columns to floats; convert_dtypes() does not accept a target dtype
for feature in features:
    if feature != 'PTS':
        data_train_df[feature] = data_train_df[feature].astype(float)

data_train_ft = pd.DataFrame(data_train_df[features2])
data_train_tgt = data_train_df["PTS"]
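
The cleaning steps above have to be repeated for test.csv later on. One possible way to keep the two in sync is a single helper applied to both frames. The sketch below makes several assumptions: `clean_features`, `POSITION_CODES`, `ZERO_FILL`, and `MEAN_FILL` are hypothetical names, the sample rows are invented, and height/weight are filled from means passed in by the caller (for example, means computed on the training data), which differs from this notebook, where each file uses its own means.

```python
import pandas as pd

# Codes copied from the replace() calls in the cell above.
POSITION_CODES = {"Center": 2, "Guard": 3, "Forward-Guard": 4, "Forward-Center": 5,
                  "Guard-Forward": 6, "Center-Forward": 7, "Forward": 8}
ZERO_FILL = ["AST", "REB", "PIE", "ALL_STAR_APPEARANCES"]
MEAN_FILL = ["HEIGHT", "WEIGHT"]


def clean_features(df, fill_means):
    """Return a cleaned copy of df; fill_means maps HEIGHT/WEIGHT to fill values."""
    out = df.copy()  # work on a copy so the caller's frame is untouched
    out[ZERO_FILL] = out[ZERO_FILL].fillna(0)
    for col in MEAN_FILL:
        out[col] = out[col].fillna(fill_means[col])
    # Swap the string labels for "4", then parse the whole column as numbers.
    draft = out["DRAFT_ROUND"].replace({"Undrafted": "4", "None": "4"})
    out["DRAFT_ROUND"] = pd.to_numeric(draft).astype(float)
    # Unlisted or missing positions map to NaN and then to 0.
    out["POSITION"] = out["POSITION"].map(POSITION_CODES).fillna(0).astype(float)
    out["DLEAGUE_FLAG"] = out["DLEAGUE_FLAG"].map({"Y": 1, "N": 0})
    return out


# Invented two-row sample with a mix of missing values and strings.
train = pd.DataFrame({
    "HEIGHT": [78.0, None], "WEIGHT": [210.0, 190.0],
    "AST": [3.1, None], "REB": [None, 4.0], "PIE": [0.1, None],
    "ALL_STAR_APPEARANCES": [None, 2.0],
    "DRAFT_ROUND": ["1", "Undrafted"], "POSITION": ["Guard", None],
    "DLEAGUE_FLAG": ["N", "Y"],
})
means = train[MEAN_FILL].mean()  # computed once, on the training frame
clean_train = clean_features(train, means)
print(clean_train)
```

The same `means` could then be passed when cleaning the test frame, so both files are imputed with identical values.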



Here I split the labeled data into two parts: 70% for choosing the best model and the remaining 30% for fitting the chosen model afterwards. I compared linear regression against several k-NN models with different values of k, using 5-fold cross-validation. The model with the lowest RMSE is selected as the best model. Linear regression came out on top every time.

In [101]:

#----------------------------------
# Split into training/testing data 
#----------------------------------
(data_train_plus_validation_ftrs, data_test_ftrs,
 data_train_plus_validation_tgt, data_test_tgt) = skms.train_test_split(data_train_ft,
                                                                      data_train_tgt,
                                                                      test_size=.30,
                                                                      random_state = 39)

# define dictionary of models to try
models_to_try = {'lr': linear_model.LinearRegression()}
# add k-NN models with various values of k to models_to_try
for k in range(1,16,2):
    models_to_try[f'{k}-NN'] = neighbors.KNeighborsRegressor(n_neighbors=k)

# compute rmse for each model using k-fold cross-validation
model_rmse = {}
for model_name in models_to_try:
    scores = skms.cross_val_score(models_to_try[model_name],
                                  data_train_plus_validation_ftrs,
                                  data_train_plus_validation_tgt,
                                  cv=5,
                                  scoring='neg_mean_squared_error')
    # convert score to rmse
    mean_rmse = np.sqrt(-scores.mean())
    print(f'{model_name}: {mean_rmse:.2f}')
    model_rmse[model_name] = mean_rmse

# get model with lowest error
best_model_name = min(model_rmse,key=model_rmse.get)
print(f'\nBest model: {best_model_name}; RMSE: {model_rmse[best_model_name]:.2f}')
best_model = models_to_try[best_model_name]


lr: 2.70
1-NN: 3.77
3-NN: 3.15
5-NN: 3.09
7-NN: 3.15
9-NN: 3.18
11-NN: 3.21
13-NN: 3.25
15-NN: 3.26

Best model: lr; RMSE: 2.70
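
The k-NN models above are compared on unscaled features, so columns with large ranges (such as WEIGHT) can dominate the distance calculation. A possible follow-up experiment is to standardize the features inside a pipeline before k-NN. The sketch below uses invented data, and `cv_rmse`, `raw_knn`, and `scaled_knn` are hypothetical names; whether scaling would change the ranking on the real data is not known.

```python
import numpy as np
from sklearn import neighbors, pipeline, preprocessing as skpre, model_selection as skms

rng = np.random.default_rng(0)
# Invented data: one informative small-scale column, one large-scale noise column.
small = rng.normal(size=300)
large = rng.normal(scale=1000.0, size=300)
X = np.column_stack([small, large])
y = 5.0 * small + rng.normal(scale=0.1, size=300)


def cv_rmse(model):
    """RMSE from 5-fold cross-validation, computed as in the cell above."""
    scores = skms.cross_val_score(model, X, y, cv=5,
                                  scoring='neg_mean_squared_error')
    return np.sqrt(-scores.mean())


raw_knn = neighbors.KNeighborsRegressor(n_neighbors=5)
# The scaler is refit inside each fold, so no information leaks across folds.
scaled_knn = pipeline.make_pipeline(skpre.StandardScaler(),
                                    neighbors.KNeighborsRegressor(n_neighbors=5))

rmse_raw = cv_rmse(raw_knn)
rmse_scaled = cv_rmse(scaled_knn)
print(f'raw 5-NN: {rmse_raw:.2f}; scaled 5-NN: {rmse_scaled:.2f}')
```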


Here I applied the same modifications to the test data that I applied to the training features, so that the model can make predictions on it. I then fit the model, made my predictions, and wrote them to a CSV file to submit.

In [102]:

# This just uses test.csv to set up a dataframe of the correct size
# and indices (the "id" field).
make_submission_df = pd.read_csv("test.csv")
# Take an explicit copy so the edits below don't trigger SettingWithCopyWarning
submission_ftrs = make_submission_df[features2].copy()

#Modify testing data the same way as the training data
submission_ftrs["DLEAGUE_FLAG"] = submission_ftrs["DLEAGUE_FLAG"].apply(conversion)
for col in ["AST", "REB", "PIE", "ALL_STAR_APPEARANCES"]:
    submission_ftrs[col] = submission_ftrs[col].fillna(0)
for col in ["HEIGHT", "WEIGHT"]:
    submission_ftrs[col] = submission_ftrs[col].fillna(submission_ftrs[col].mean())
submission_ftrs["DRAFT_ROUND"] = submission_ftrs["DRAFT_ROUND"].replace({"Undrafted": 4, "None": 4}).astype(float)
submission_ftrs["POSITION"] = submission_ftrs["POSITION"].replace({
    "Center": 2, "Forward": 8, "Guard": 3, "Center-Forward": 7,
    "Guard-Forward": 6, "Forward-Center": 5, "Forward-Guard": 4}).astype(float)
submission_ftrs["POSITION"] = submission_ftrs["POSITION"].fillna(0)

# drop all columns except 'id'
make_submission_df = make_submission_df[['id']]
# make sure the column of ID's that we just read in is the index column
make_submission_df = make_submission_df.set_index('id')

#predictions = np.random.rand(1350)*5
# Fit the chosen model on the 30% split that was held out of model selection
best_model.fit(data_test_ftrs, data_test_tgt)
predictions = best_model.predict(submission_ftrs)

# Here, you add your predictions to the dataframe
make_submission_df['PTS'] = predictions

# Either one of these will work
# The first one will round all floating point numbers to 2 decimals
# Makes it easier to look at.
make_submission_df.to_csv('submission.csv',sep=',', float_format='%.2f')
#make_submission_df.to_csv('submission.csv',sep=',')
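
The cell above fits the chosen model on the 30% split only and never reports an error on rows the model has not seen. A common alternative, sketched here on invented data, is to score the model once on the held-out split and then refit it on every labeled row before predicting. `X`, `y`, `test_rmse`, and `final_model` are hypothetical names, and this is not the procedure used for the submission above.

```python
import numpy as np
from sklearn import linear_model, metrics, model_selection as skms

# Synthetic stand-in for the cleaned features and PTS target (invented data).
rng = np.random.default_rng(39)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = skms.train_test_split(
    X, y, test_size=0.30, random_state=39)

# Fit the selected model on the training portion only ...
model = linear_model.LinearRegression().fit(X_train, y_train)

# ... and score it once on rows it has never seen.
test_rmse = np.sqrt(metrics.mean_squared_error(y_test, model.predict(X_test)))
print(f'hold-out RMSE: {test_rmse:.2f}')

# For the final predictions, refit on every labeled row.
final_model = linear_model.LinearRegression().fit(X, y)
```

Refitting on all labeled rows gives the final model more data than either split alone, while the hold-out RMSE remains an honest estimate because it was computed before the refit.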


