# predictiveModel.jl
Start Date: 12.11.20

## Purpose
Use the full team stats dataframe and machine learning to predict the winners and scores of the NBA's opening-week games.

## Steps
1. Load data
2. Filter game results into first week/first 2 weeks of games each year
3. Combine game results with team stats from the previous season
4. Evaluate correlation
5. Train and test models using ScikitLearn.jl and DecisionTree.jl
6. Apply the trained model for 2020

## Desired outcome
A dataframe detailing predicted game results from 12.22.20-12.27.20

## Step 1 - Load Data

In [1]:
using Pkg
#Pkg.add("CSV")
#Pkg.add("JuMP")
#Pkg.add("Lathe")
#Pkg.add("ScikitLearn")
#Pkg.add("Queryverse")
#Pkg.add("PyCall")
#Pkg.add("DataFramesMeta")
#Pkg.add("IJulia")
#Pkg.add("Plots")
#Pkg.add("Gadfly")
#Pkg.add("DataFrames")
#Pkg.add("GLM")
#Pkg.add("StatsModels")
#Pkg.add("DecisionTree")
#Pkg.add("AutoMLPipeline")
#Pkg.add("Random")
#Pkg.add("JLD2")
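# Pkg.installed() is deprecated in recent Julia versions; Pkg.status() or Pkg.dependencies() list installed packages
Pkg.installed()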

Dict{String,VersionNumber} with 18 entries:
  "CSV"            => v"0.8.2"
  "StatsModels"    => v"0.6.15"
  "JuMP"           => v"0.21.3"
  "ScikitLearn"    => v"0.6.3"
  "Lathe"          => v"0.1.3"
  "Queryverse"     => v"0.6.2"
  "PyCall"         => v"1.92.1"
  "DataFramesMeta" => v"0.6.0"
  "StatsBase"      => v"0.33.2"
  "AutoMLPipeline" => v"0.2.2"
  "DecisionTree"   => v"0.10.10"
  "Plots"          => v"1.9.1"
  "IJulia"         => v"1.23.1"
  "Feather"        => v"0.5.7"
  "Gadfly"         => v"1.3.1"
  "JLD2"           => v"0.3.1"
  "DataFrames"     => v"0.21.8"
  "GLM"            => v"1.3.11"

In [2]:
using CSV, DataFrames
fullTeamDF = CSV.read("data/fullTeamDF.csv", DataFrame)
tail(fullTeamDF)

Row,League,Y,Franchise,Team,G,FGM,FGA,FG%,3FGM
,String,Int64,String,String,Int64,Int64,Int64,Float64,Int64
1,NBA,2019,Portland Trail Blazers,POR,75,3160,6833,0.46,967
2,NBA,2019,Sacramento Kings,SAC,72,2943,6364,0.46,914
3,NBA,2019,San Antonio Spurs,SAS,71,2995,6350,0.47,760
4,NBA,2019,Toronto Raptors,TOR,72,2897,6331,0.46,995
5,NBA,2019,Utah Jazz,UTA,72,2886,6130,0.47,963
6,NBA,2019,Washington Wizards,WAS,72,2990,6544,0.46,864


In [3]:
using CSV, DataFrames
results = CSV.read("data/nbaResults.csv",  DataFrame)
deletecols!(results, :Type)
tail(results)

Row,D,M,Y,Season,League,teamWin,teamWinAbr,League_1
,Int64,Int64,Int64,String,String,String,String,String
1,13,8,2020,(2019-20),NBA,Utah Jazz,UTA,NBA
2,13,8,2020,(2019-20),NBA,Washington Wizards,WAS,NBA
3,14,8,2020,(2019-20),NBA,Indiana Pacers,IND,NBA
4,14,8,2020,(2019-20),NBA,Los Angeles Clippers,LAC,NBA
5,14,8,2020,(2019-20),NBA,Philadelphia 76ers,PHI,NBA
6,14,8,2020,(2019-20),NBA,Toronto Raptors,TOR,NBA


In [4]:
# Split off 2019's statistics to prepare for 2020 schedule
using DataFramesMeta
predStats = @where(fullTeamDF, :Y .== 2019)
fullTeamDF = @where(fullTeamDF, :Y .<2019)
tail(fullTeamDF)

Row,League,Y,Franchise,Team,G,FGM,FGA,FG%,3FGM
,String,Int64,String,String,Int64,Int64,Int64,Float64,Int64
1,NBA,2018,Portland Trail Blazers,POR,82,3470,7427,0.47,904
2,NBA,2018,Sacramento Kings,SAC,82,3541,7637,0.46,927
3,NBA,2018,San Antonio Spurs,SAS,82,3468,7248,0.48,812
4,NBA,2018,Toronto Raptors,TOR,82,3460,7305,0.47,1015
5,NBA,2018,Utah Jazz,UTA,82,3314,7082,0.47,993
6,NBA,2018,Washington Wizards,WAS,82,3456,7387,0.47,930


## Step 2 - Filter for Opening Week

Excluding 2020, the NBA opening week is usually played around the week of October 23rd. This model will predict the scores and outcomes of Opening Week 2020 using the final team statistics from the last regular season. To train the model correctly, the results data frame will be filtered to games played in October. This range may be slightly wide; however, it ensures the data includes at least the first game of each team.

* Another method could be taking the first fifteen games of each season
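
The two filtering options described above could be sketched roughly as follows. This is a minimal illustration in plain Julia, not the notebook's own code: the `games` records are hypothetical and only borrow the `D`, `M`, `Y`, `Season`, and `teamWinAbr` field names from `results`, and `first_n_per_season` is a made-up helper that assumes the rows are already in date order.

```julia
# Hypothetical game records using the same field names as `results`.
games = [
    (D = 29, M = 10, Y = 2018, Season = "(2018-19)", teamWinAbr = "BOS"),
    (D = 30, M = 10, Y = 2018, Season = "(2018-19)", teamWinAbr = "GSW"),
    (D = 2,  M = 11, Y = 2018, Season = "(2018-19)", teamWinAbr = "LAL"),
    (D = 22, M = 10, Y = 2019, Season = "(2019-20)", teamWinAbr = "TOR"),
    (D = 14, M = 8,  Y = 2020, Season = "(2019-20)", teamWinAbr = "TOR"),
]

# Option 1: keep only games played in October.
october_games = filter(g -> g.M == 10, games)

# Option 2: keep the first n games of each season (assumes rows are in date order).
function first_n_per_season(rows, n)
    seen = Dict{String,Int}()      # number of games kept so far, per season
    kept = eltype(rows)[]
    for r in rows
        k = get(seen, r.Season, 0)
        if k < n
            push!(kept, r)
            seen[r.Season] = k + 1
        end
    end
    return kept
end

first_two = first_n_per_season(games, 2)
```

With the DataFramesMeta macros already used in this notebook, option 1 would presumably look like `@where(results, :M .== 10)`, following the same pattern as the `:Y .== 2019` filter in Step 1.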

In [5]:
using DataFramesMeta
tail(results)

Row,D,M,Y,Season,League,teamWin,teamWinAbr,League_1
,Int64,Int64,Int64,String,String,String,String,String
1,13,8,2020,(2019-20),NBA,Utah Jazz,UTA,NBA
2,13,8,2020,(2019-20),NBA,Washington Wizards,WAS,NBA
3,14,8,2020,(2019-20),NBA,Indiana Pacers,IND,NBA
4,14,8,2020,(2019-20),NBA,Los Angeles Clippers,LAC,NBA
5,14,8,2020,(2019-20),NBA,Philadelphia 76ers,PHI,NBA
6,14,8,2020,(2019-20),NBA,Toronto Raptors,TOR,NBA


## Step 3 - Combine Game and Past Season Data

In [6]:
using DataFrames, DataFramesMeta
# Add a nextYr column (Y + 1) so each season's stats line up with the following season's games
insert!(fullTeamDF, 3, 0, :nextYr)
fullTeamDF = @transform(fullTeamDF, nextYr = :Y .+ 1)
fullTeamDF

Row,League,Y,nextYr,Franchise,Team,G,FGM,FGA,FG%
,String,Int64,Int64,String,String,Int64,Int64,Int64,Float64
1,NBA,1990,1991,Atlanta Hawks,ATL,82,3349,7223,0.46
2,NBA,1990,1991,Boston Celtics,BOS,82,3695,7214,0.51
3,NBA,1990,1991,Brooklyn Nets,NJN,82,3311,7459,0.44
4,NBA,1990,1991,Charlotte Hornets,CHA,82,3286,7033,0.47
5,NBA,1990,1991,Chicago Bulls,CHI,82,3632,7125,0.51
6,NBA,1990,1991,Cleveland Cavaliers,CLE,82,3259,6857,0.48
7,NBA,1990,1991,Dallas Mavericks,DAL,82,3245,6890,0.47
8,NBA,1990,1991,Denver Nuggets,DEN,82,3901,8868,0.44
9,NBA,1990,1991,Detroit Pistons,DET,82,3194,6875,0.46
10,NBA,1990,1991,Golden State Warriors,GSW,82,3566,7346,0.49


In [7]:
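# Build joined dataframes: attach prior-season team stats to each game
# Note: both joins match on :teamWin, so the winner's stats are attached twice (the second copy gets a _1 suffix);
# matching the second join on :teamLose would attach the loser's stats instead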
results_fullTeamDF = join(results, fullTeamDF, on = [:Y => :nextYr, :teamWin => :Franchise], kind = :left, makeunique = true)
results_fullTeamDF = join(results_fullTeamDF, fullTeamDF, on = [:Y => :nextYr, :teamWin => :Franchise], kind = :left, makeunique = true)
deletecols!(results_fullTeamDF, [:League, :League_1, :League_2, :League_3, :Y_1, :Y_2,:Country, :City, :coachWin, :coachLose,:teamWin, :teamLose])
results_fullTeamDF = dropmissing(results_fullTeamDF)

Row,D,M,Y,Season,teamWinAbr,teamLoseAbr,winScore,loseScore,Margin
,Int64,Int64,Int64,String,String,String,Int64,Int64,Int64
1,2,1,1991,(1990-91),ATL,LAC,120,107,13
2,2,1,1991,(1990-91),BOS,NYK,113,86,27
3,2,1,1991,(1990-91),DET,DEN,118,107,11
4,2,1,1991,(1990-91),IND,SAS,121,109,12
5,2,1,1991,(1990-91),MIL,CHA,106,91,15
6,2,1,1991,(1990-91),MIN,DAL,115,95,20
7,2,1,1991,(1990-91),PHO,CLE,105,83,22
8,2,1,1991,(1990-91),SEA,PHI,127,99,28
9,2,1,1991,(1990-91),UTA,MIA,112,104,8
10,3,1,1991,(1990-91),HOU,CHI,114,92,22
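
The join in this step pairs each game with team statistics from the season before it. A minimal sketch of that lookup in plain Julia is below. It assumes the intent is to attach prior-season stats for both the winner and the loser (the cell above matches on `:teamWin` in both joins). The `stats` values for the Hawks and Celtics are the 1990 FGM/FGA figures shown earlier; the two games and the `prior_stats` helper are hypothetical.

```julia
# Prior-season stats keyed by (franchise, season year).
stats = Dict(
    ("Atlanta Hawks", 1990)  => (FGM = 3349, FGA = 7223),
    ("Boston Celtics", 1990) => (FGM = 3695, FGA = 7214),
)

# Hypothetical games; the second one has no 1990 stats for the losing team.
games = [
    (Y = 1991, teamWin = "Atlanta Hawks", teamLose = "Boston Celtics"),
    (Y = 1991, teamWin = "Atlanta Hawks", teamLose = "Toronto Raptors"),
]

# Look up a team's stats from the season before the game (Y - 1),
# which mirrors joining on Y => nextYr with nextYr = Y + 1.
prior_stats(team, year) = get(stats, (team, year - 1), missing)

joined = map(games) do g
    (game = g, win = prior_stats(g.teamWin, g.Y), lose = prior_stats(g.teamLose, g.Y))
end

# Drop games where either side has no prior-season stats (the role dropmissing plays above).
complete = filter(j -> !ismissing(j.win) && !ismissing(j.lose), joined)
```

Keeping the winner's and loser's stats under separate names makes it easier to confirm that each game row really carries two different teams' numbers.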


## Step 4 - Build Models

### Step 4.a - Logistic Model

In [8]:
# Remove unnecessary variables: D, M, Y, etc.
logModDF = deletecols!(results_fullTeamDF, [:D, :M, :Y, :G, :G_1, :Season, :teamWinAbr, :teamLoseAbr, :Team, :Team_1])

Row,winScore,loseScore,Margin,FGM,FGA,FG%,3FGM,3FGA,3FG%,FTM
,Int64,Int64,Int64,Int64,Int64,Float64,Int64,Int64,Float64,Int64
1,120,107,13,3349,7223,0.46,271,836,0.32,2034
2,113,86,27,3695,7214,0.51,109,346,0.32,1646
3,118,107,11,3194,6875,0.46,131,440,0.3,1686
4,121,109,12,3450,6994,0.49,249,749,0.33,2010
5,106,91,15,3337,6948,0.48,257,753,0.34,1796
6,115,95,20,3265,7276,0.45,108,381,0.28,1531
7,105,83,22,3573,7199,0.5,138,432,0.32,2064
8,127,99,28,3500,7117,0.49,136,427,0.32,1608
9,112,104,8,3214,6537,0.49,148,458,0.32,1951
10,114,92,22,3403,7287,0.47,316,989,0.32,1631


In [9]:
# Train-test-split
using Random
sample = randsubseq(1:size(logModDF,1), 0.75)
train = logModDF[sample, :]
notsample = [i for i in 1:size(logModDF, 1) if isempty(searchsorted(sample,i))]
test = logModDF[notsample, :]

Row,winScore,loseScore,Margin,FGM,FGA,FG%,3FGM,3FGA,3FG%,FTM
,Int64,Int64,Int64,Int64,Int64,Float64,Int64,Int64,Float64,Int64
1,118,107,11,3194,6875,0.46,131,440,0.3,1686
2,115,95,20,3265,7276,0.45,108,381,0.28,1531
3,135,108,27,3308,6822,0.48,185,558,0.33,1654
4,110,108,2,3298,7256,0.45,270,754,0.36,1818
5,93,89,4,3409,6988,0.49,81,297,0.27,1883
6,117,112,5,3349,7223,0.46,271,836,0.32,2034
7,99,83,16,3194,6875,0.46,131,440,0.3,1686
8,88,86,2,3337,6948,0.48,257,753,0.34,1796
9,107,90,17,3409,6988,0.49,81,297,0.27,1883
10,114,111,3,3577,7369,0.49,341,904,0.38,1912
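
The split above relies on `randsubseq`, which keeps each row independently with probability 0.75, so the resulting sizes are only approximately 75/25 and change on every run. A small sketch of a reproducible version follows; the `split_indices` helper, its `seed` keyword, and the row count of 1000 are assumptions made for illustration.

```julia
using Random

# Reproducible split of row indices 1:n into roughly `frac` train and the rest test.
function split_indices(n; frac = 0.75, seed = 42)
    rng = MersenneTwister(seed)
    train_idx = randsubseq(rng, 1:n, frac)   # each index is kept with probability frac
    test_idx = setdiff(1:n, train_idx)       # every index that was not sampled
    return train_idx, test_idx
end

train_idx, test_idx = split_indices(1000)
```

The index vectors can then be used exactly like `sample` and `notsample` in the cell above, for example `logModDF[train_idx, :]`.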


In [10]:
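# Train Model
X_train = train
Y_train = train[:Margin]
# Note: deletecols returns a new dataframe and the result is discarded here,
# so X_train still contains winScore, loseScore, and Margin (see the 109-column matrix below).
# Assigning the result (X_train = deletecols(...)) would actually drop them.
deletecols(X_train, [:winScore, :loseScore, :Margin])
X_train=convert(Matrix, X_train)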

25701×109 Array{Float64,2}:
 120.0  107.0   13.0  3349.0  7223.0  0.46  …  373.0  1207.0  1763.0  8996.0
 113.0   86.0   27.0  3695.0  7214.0  0.51     534.0  1275.0  1633.0  9043.0
 121.0  109.0   12.0  3450.0  6994.0  0.49     246.0  1241.0  1838.0  8323.0
 106.0   91.0   15.0  3337.0  6948.0  0.48     304.0  1100.0  1824.0  7093.0
 105.0   83.0   22.0  3573.0  7199.0  0.5      316.0   976.0  1290.0  7333.0
 127.0   99.0   28.0  3500.0  7117.0  0.49  …  286.0   962.0  1454.0  5409.0
 112.0  104.0    8.0  3214.0  6537.0  0.49     236.0  1065.0  1262.0  7601.0
 114.0   92.0   22.0  3403.0  7287.0  0.47     383.0  1258.0  1598.0  8432.0
 108.0  104.0    4.0  3343.0  6911.0  0.48     345.0  1122.0  1421.0  8522.0
  97.0   87.0   10.0  3337.0  6948.0  0.48     304.0  1100.0  1824.0  7093.0
 131.0  113.0   18.0  3086.0  6818.0  0.45  …  334.0  1029.0  1653.0  6857.0
 118.0  108.0   10.0  3390.0  7268.0  0.47     344.0  1267.0  1699.0  8060.0
 111.0   96.0   15.0  3349.0  7223.0  0.46     3

In [11]:
using ScikitLearn
@sk_import linear_model: LogisticRegression
model = LogisticRegression(fit_intercept=true) 

PyObject LogisticRegression()

In [12]:
using ScikitLearn: fit!
fit!(model, X_train, Y_train)

PyObject LogisticRegression()

In [13]:
using ScikitLearn: predict
accuracy = sum(predict(model, X_train) .== Y_train) / length(Y_train)
println("accuracy: $accuracy") # 7% Accurate testing for the margin

accuracy: 0.0697249134274931


In [14]:
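# Test the model
X_test = test
Y_test = test[:Margin]
# Note: as in the training cell, the result of deletecols is discarded
deletecols(X_test, [:winScore, :loseScore, :Margin])
X_test=convert(Matrix, X_test)
# Note: Y_test is overwritten with winScore here, so the Margin predictions are compared against winning scores
Y_test = test[:winScore]
accuracy = sum(predict(model, X_test) .== Y_test) / length(Y_test)
println("accuracy: $accuracy") # Eek...0% Accurate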

accuracy: 0.0


In [15]:
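# Cross-validation
using ScikitLearn.CrossValidation: cross_val_score
X = logModDF
# Note: the result of deletecols is discarded, so X keeps winScore, loseScore, and Margin
deletecols(X, [:winScore, :loseScore, :Margin])
X=convert(Matrix, X)
# The target here is winScore, so each distinct score is treated as its own class
y = logModDF[:winScore]
cross_val_score(LogisticRegression(max_iter=130), X, y; cv=5)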

5-element Array{Float64,1}:
 0.03456826137689615
 0.03278688524590164
 0.0358191426893717
 0.029745251067589455
 0.0344012992765392

#### Logistic Regression results and notes

* Best cross-validation score --> ~3%
* Tried to one-hot encode the categorical variables, but it did not work
* Better output variable to look at --> `teamWinAbr`
* The data is too complex for a linear model. Additionally, the categorical variables carry information about the results. The next models to attempt are the Decision Tree and Random Forest models
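
For reference, one-hot encoding a categorical column such as the team abbreviation can be done by hand. The sketch below is plain Julia and is not the approach attempted in this notebook; `onehot` is a hypothetical helper that returns a 0/1 matrix with one column per distinct label, plus the label order.

```julia
# One-hot encode a vector of category labels into a 0/1 matrix.
function onehot(labels::AbstractVector{<:AbstractString})
    levels = sort(unique(labels))                          # one column per distinct label
    index = Dict(l => j for (j, l) in enumerate(levels))   # label -> column number
    encoded = zeros(Int, length(labels), length(levels))
    for (i, l) in enumerate(labels)
        encoded[i, index[l]] = 1
    end
    return encoded, levels
end

encoded, levels = onehot(["ATL", "BOS", "ATL", "DET"])
```

The resulting matrix could be concatenated with the numeric feature matrix using `hcat` before fitting a linear model.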

### Step 4.b - Decision Tree Model

In [16]:
#Build Dataframe
results_fullTeamDF = join(results, fullTeamDF, on = [:Y => :nextYr, :teamWin => :Franchise], kind = :left, makeunique = true)
results_fullTeamDF = join(results_fullTeamDF, fullTeamDF, on = [:Y => :nextYr, :teamWin => :Franchise], kind = :left, makeunique = true)
deletecols!(results_fullTeamDF, [:League, :League_1, :League_2, :League_3, :Y_1, :Y_2, :Country, :City, :coachWin, :coachLose, :teamWin, :teamLose, :teamLoseAbr, :D, :M, :Y, :G])
results_fullTeamDF = dropmissing(results_fullTeamDF)
decTreeDF = results_fullTeamDF
decTreeDF

Row,Season,teamWinAbr,winScore,loseScore,Margin,Team,FGM,FGA,FG%
,String,String,Int64,Int64,Int64,String,Int64,Int64,Float64
1,(1990-91),ATL,120,107,13,ATL,3349,7223,0.46
2,(1990-91),BOS,113,86,27,BOS,3695,7214,0.51
3,(1990-91),DET,118,107,11,DET,3194,6875,0.46
4,(1990-91),IND,121,109,12,IND,3450,6994,0.49
5,(1990-91),MIL,106,91,15,MIL,3337,6948,0.48
6,(1990-91),MIN,115,95,20,MIN,3265,7276,0.45
7,(1990-91),PHO,105,83,22,PHO,3573,7199,0.5
8,(1990-91),SEA,127,99,28,SEA,3500,7117,0.49
9,(1990-91),UTA,112,104,8,UTA,3214,6537,0.49
10,(1990-91),HOU,114,92,22,HOU,3403,7287,0.47


In [17]:
# Train-test-split
using Random
sample = randsubseq(1:size(decTreeDF,1), 0.75)
train = decTreeDF[sample, :]
notsample = [i for i in 1:size(decTreeDF, 1) if isempty(searchsorted(sample,i))]
test = decTreeDF[notsample, :]

Row,Season,teamWinAbr,winScore,loseScore,Margin,Team,FGM,FGA,FG%
,String,String,Int64,Int64,Int64,String,Int64,Int64,Float64
1,(1990-91),DET,118,107,11,DET,3194,6875,0.46
2,(1990-91),PHO,105,83,22,PHO,3573,7199,0.5
3,(1990-91),ORL,110,108,2,ORL,3298,7256,0.45
4,(1990-91),WAS,118,108,10,WAS,3390,7268,0.47
5,(1990-91),SAS,93,89,4,SAS,3409,6988,0.49
6,(1990-91),UTA,102,99,3,UTA,3214,6537,0.49
7,(1990-91),PHI,120,104,16,PHI,3289,6925,0.47
8,(1990-91),POR,132,108,24,POR,3577,7369,0.49
9,(1990-91),SAS,107,90,17,SAS,3409,6988,0.49
10,(1990-91),SEA,96,88,8,SEA,3500,7117,0.49


In [18]:
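# Features: columns 6:114, starting at :Team; label: column 2 (:teamWinAbr)
# Note: :Team matches :teamWinAbr in every row shown above, so the label is present among the features
X_train = convert(Array, train[:, 6:114])
y_train = convert(Array, train[:, 2])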

25713-element Array{String,1}:
 "ATL"
 "BOS"
 "IND"
 "MIL"
 "MIN"
 "SEA"
 "UTA"
 "HOU"
 "LAL"
 "MIL"
 "NYK"
 "SAC"
 "ATL"
 ⋮
 "HOU"
 "DAL"
 "MIA"
 "MIL"
 "BRO"
 "PHO"
 "BOS"
 "PHI"
 "LAC"
 "DEN"
 "GSW"
 "DAL"

In [19]:
X_test = convert(Array, test[:, 6:114])
y_test = convert(Array, test[:, 2])

8351-element Array{String,1}:
 "DET"
 "PHO"
 "ORL"
 "WAS"
 "SAS"
 "UTA"
 "PHI"
 "POR"
 "SAS"
 "SEA"
 "LAL"
 "PHO"
 "DEN"
 ⋮
 "TOR"
 "WAS"
 "NOP"
 "MIN"
 "NYK"
 "PHI"
 "CLE"
 "DAL"
 "SAC"
 "ATL"
 "DET"
 "CLE"

In [20]:
# Fit the model
using DecisionTree
model = DecisionTreeClassifier(max_depth=5)
fit!(model, X_train, y_train)

DecisionTreeClassifier
max_depth:                5
min_samples_leaf:         1
min_samples_split:        2
min_purity_increase:      0.0
pruning_purity_threshold: 1.0
n_subfeatures:            0
classes:                  ["ATL", "BOS", "BRO", "CHA", "CHI", "CLE", "DAL", "DEN", "DET", "GSW"  …  "PHI", "PHO", "POR", "SAC", "SAS", "SEA", "TOR", "UTA", "VAN", "WAS"]
root:                     Decision Tree
Leaves: 31
Depth:  5

In [21]:
# Predict
dectree_pred = DecisionTree.predict(model, X_test)

8351-element Array{String,1}:
 "DET"
 "PHO"
 "ORL"
 "WAS"
 "SAS"
 "UTA"
 "PHI"
 "POR"
 "SAS"
 "SEA"
 "LAL"
 "PHO"
 "DEN"
 ⋮
 "TOR"
 "WAS"
 "NJN"
 "MIN"
 "NYK"
 "PHI"
 "CLE"
 "DAL"
 "SAC"
 "ATL"
 "DET"
 "CLE"

In [22]:
# Compute accuracy
correct = 0
n=length(y_test)
for i in 1:n
    if y_test[i] == dectree_pred[i]
        correct = correct +1
    end
end
println(correct / n)

0.9770087414680877


#### ACCURACY OF DECISION TREE = 97.70%
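
The counting loop above computes the share of test rows where the prediction equals the true label. A more compact equivalent, sketched here with a hypothetical `accuracy` helper and four sample labels taken from the outputs above, uses broadcasting and `mean`:

```julia
using Statistics

# Fraction of positions where the prediction equals the true label.
accuracy(y_true, y_pred) = mean(y_true .== y_pred)

y_true = ["DET", "PHO", "ORL", "NOP"]
y_pred = ["DET", "PHO", "ORL", "NJN"]   # last prediction differs from the label
acc = accuracy(y_true, y_pred)
```

In the notebook this would be called as `accuracy(y_test, dectree_pred)`.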

In [23]:
# Add predictions to test dataframe
insert!(test, 2, 0, :predictedWin_DT)
test = @transform(test, predictedWin_DT = dectree_pred)

Row,Season,predictedWin_DT,teamWinAbr,winScore,loseScore,Margin,Team,FGM
,String,String,String,Int64,Int64,Int64,String,Int64
1,(1990-91),DET,DET,118,107,11,DET,3194
2,(1990-91),PHO,PHO,105,83,22,PHO,3573
3,(1990-91),ORL,ORL,110,108,2,ORL,3298
4,(1990-91),WAS,WAS,118,108,10,WAS,3390
5,(1990-91),SAS,SAS,93,89,4,SAS,3409
6,(1990-91),UTA,UTA,102,99,3,UTA,3214
7,(1990-91),PHI,PHI,120,104,16,PHI,3289
8,(1990-91),POR,POR,132,108,24,POR,3577
9,(1990-91),SAS,SAS,107,90,17,SAS,3409
10,(1990-91),SEA,SEA,96,88,8,SEA,3500


### Step 4.c - Random Forest Model


In [24]:
using DecisionTree
# Fit the model
rf = RandomForestClassifier()
fit!(rf, X_train, y_train)

RandomForestClassifier
n_trees:             10
n_subfeatures:       -1
partial_sampling:    0.7
max_depth:           -1
min_samples_leaf:    1
min_samples_split:   2
min_purity_increase: 0.0
classes:             ["ATL", "BOS", "BRO", "CHA", "CHI", "CLE", "DAL", "DEN", "DET", "GSW"  …  "PHI", "PHO", "POR", "SAC", "SAS", "SEA", "TOR", "UTA", "VAN", "WAS"]
ensemble:            Ensemble of Decision Trees
Trees:      10
Avg Leaves: 320.2
Avg Depth:  13.3

In [25]:
# Predict on the test set
rf_pred = DecisionTree.predict(rf, X_test)

8351-element Array{String,1}:
 "DET"
 "PHO"
 "ORL"
 "WAS"
 "SAS"
 "UTA"
 "PHI"
 "POR"
 "SAS"
 "SEA"
 "LAL"
 "PHO"
 "DEN"
 ⋮
 "TOR"
 "WAS"
 "NOP"
 "MIN"
 "NYK"
 "PHI"
 "CLE"
 "DAL"
 "SAC"
 "ATL"
 "DET"
 "CLE"

In [26]:
# Compute the accuracy
correct = 0
n =length(y_test)
for i in 1:n
    if y_test[i] == rf_pred[i]
        correct = correct + 1
    end
end
println(correct/n)

0.9980840617890073


#### ACCURACY OF RANDOM FOREST = 99.81%
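
Accuracy this high is worth a quick sanity check for label leakage, since the feature range `6:114` begins at the `Team` column. One possible check, sketched below with a hypothetical `match_rate` helper and the first four rows of the table in Step 4.b, measures how often a single feature column is identical to the label:

```julia
# Share of rows in which a feature column is identical to the label.
match_rate(feature, label) = count(feature .== label) / length(label)

# First four rows of decTreeDF as shown above.
teamWinAbr = ["ATL", "BOS", "DET", "IND"]   # label (column 2)
Team       = ["ATL", "BOS", "DET", "IND"]   # first feature column (column 6)
FGM        = [3349, 3695, 3194, 3450]

team_rate = match_rate(Team, teamWinAbr)
fgm_rate = match_rate(string.(FGM), teamWinAbr)
```

A rate near 1.0 for any feature suggests the model can read the answer directly from that column, in which case the accuracy says little about how it will perform on games that have not been played yet.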

In [27]:
# Add random forest prediction
insert!(test, 2, 0, :predictedWin_RF)
test = @transform(test, predictedWin_RF = rf_pred)

Row,Season,predictedWin_RF,predictedWin_DT,teamWinAbr,winScore,loseScore,Margin
,String,String,String,String,Int64,Int64,Int64
1,(1990-91),DET,DET,DET,118,107,11
2,(1990-91),PHO,PHO,PHO,105,83,22
3,(1990-91),ORL,ORL,ORL,110,108,2
4,(1990-91),WAS,WAS,WAS,118,108,10
5,(1990-91),SAS,SAS,SAS,93,89,4
6,(1990-91),UTA,UTA,UTA,102,99,3
7,(1990-91),PHI,PHI,PHI,120,104,16
8,(1990-91),POR,POR,POR,132,108,24
9,(1990-91),SAS,SAS,SAS,107,90,17
10,(1990-91),SEA,SEA,SEA,96,88,8


### Step 4.d - Build Full Prediction Dataframe

In [28]:
# Let's build a prediction for the entire set of games and test its accuracy
# Build Initial Dataframe
results_fullTeamDF = join(results, fullTeamDF, on = [:Y => :nextYr, :teamWin => :Franchise], kind = :left, makeunique = true)
results_fullTeamDF = join(results_fullTeamDF, fullTeamDF, on = [:Y => :nextYr, :teamWin => :Franchise], kind = :left, makeunique = true)
deletecols!(results_fullTeamDF, [:League, :League_1, :League_2, :League_3, :Y_1, :Y_2, :Country, :City, :coachWin, :coachLose, :teamWin, :teamLose, :teamLoseAbr, :D, :M, :Y, :G])
results_fullTeamDF = dropmissing(results_fullTeamDF)
finalDF = results_fullTeamDF
finalDF

Row,Season,teamWinAbr,winScore,loseScore,Margin,Team,FGM,FGA,FG%
,String,String,Int64,Int64,Int64,String,Int64,Int64,Float64
1,(1990-91),ATL,120,107,13,ATL,3349,7223,0.46
2,(1990-91),BOS,113,86,27,BOS,3695,7214,0.51
3,(1990-91),DET,118,107,11,DET,3194,6875,0.46
4,(1990-91),IND,121,109,12,IND,3450,6994,0.49
5,(1990-91),MIL,106,91,15,MIL,3337,6948,0.48
6,(1990-91),MIN,115,95,20,MIN,3265,7276,0.45
7,(1990-91),PHO,105,83,22,PHO,3573,7199,0.5
8,(1990-91),SEA,127,99,28,SEA,3500,7117,0.49
9,(1990-91),UTA,112,104,8,UTA,3214,6537,0.49
10,(1990-91),HOU,114,92,22,HOU,3403,7287,0.47


In [29]:
using DecisionTree
X = convert(Array, finalDF[:, 6:114])
y = convert(Array, finalDF[:, 2]) 
dectree_pred = DecisionTree.predict(model, X)
rf_pred = DecisionTree.predict(rf, X)
insert!(finalDF, 2, 0, :predictedWin_RF)
finalDF = @transform(finalDF, predictedWin_RF = rf_pred)
insert!(finalDF, 2, 0, :predictedWin_DT)
finalDF = @transform(finalDF, predictedWin_DT = dectree_pred)

Row,Season,predictedWin_DT,predictedWin_RF,teamWinAbr,winScore,loseScore,Margin
,String,String,String,String,Int64,Int64,Int64
1,(1990-91),ATL,ATL,ATL,120,107,13
2,(1990-91),BOS,BOS,BOS,113,86,27
3,(1990-91),DET,DET,DET,118,107,11
4,(1990-91),IND,IND,IND,121,109,12
5,(1990-91),MIL,MIL,MIL,106,91,15
6,(1990-91),MIN,MIN,MIN,115,95,20
7,(1990-91),PHO,PHO,PHO,105,83,22
8,(1990-91),SEA,SEA,SEA,127,99,28
9,(1990-91),UTA,UTA,UTA,112,104,8
10,(1990-91),HOU,HOU,HOU,114,92,22


In [30]:
# Test Decision Tree Accuracy
correct = 0
n =length(y)
for i in 1:n
    if y[i] == dectree_pred[i]
        correct = correct + 1
    end
end
println(correct/n) #97.75% accurate

0.9774835603569751


In [31]:
# Test Random Forest Accuracy
correct = 0
n =length(y)
for i in 1:n
    if y[i] == rf_pred[i]
        correct = correct + 1
    end
end
println(correct/n) #99.86% accurate

0.9985615312353218


In [32]:
# Extract dataframe
CSV.write("data/fullPredictiveData.csv", finalDF)

"data/fullPredictiveData.csv"

In [33]:
using JLD2
# Save both models in one call; a second @save to the same file would overwrite the first
@save "model_file.jld2" model rf

## Step 5 - Make Prediction for Opening Week

Based on the logistic regression results, we will use only the Decision Tree and Random Forest models.

In [73]:
# Opening Week Matchups
using CSV, DataFrames
openingWeek = CSV.read("data/week1Matchups.csv", DataFrame)

Row,TeamA,TeamB
,String,String
1,GSW,BRO
2,LAC,LAL
3,CHA,CLE
4,NYK,IND
5,MIA,ORL
6,WAS,PHI
7,MIL,BOS
8,NOP,TOR
9,ATL,CHI
10,OCT,HOU


In [74]:
deletecols!(predStats, :Franchise)
predStats

LoadError: ArgumentError: column name :Franchise not found in the data frame

In [75]:
# Join with predStats df from the beginning
openingWeek = join(openingWeek, predStats, on = [:TeamA => :Team], kind = :outer, makeunique = true)
openingWeek = join(openingWeek, predStats, on = [:TeamB => :Team], kind = :outer, makeunique = true)
openingWeek = dropmissing(openingWeek)
deletecols!(openingWeek, [:League_1, :Y_1, :League, :Y, :G])
rename!(openingWeek, :TeamA => :Team, :TeamB => :Team_1)

Row,Team,Team_1,FGM,FGA,FG%,3FGM,3FGA,3FG%,FTM,FTA,FT%
,String,String,Int64,Int64,Float64,Int64,Int64,Float64,Int64,Int64,Float64
1,GSW,BRO,2510,5730,0.44,678,2032,0.33,1214,1511,0.8
2,LAC,LAL,2992,6425,0.47,895,2410,0.37,1498,1894,0.79
3,CHA,CLE,2425,5586,0.43,785,2231,0.35,1052,1406,0.75
4,NYK,IND,2638,5896,0.45,631,1872,0.34,1076,1550,0.69
5,MIA,ORL,2880,6160,0.47,979,2584,0.38,1440,1840,0.78
6,WAS,PHI,2990,6544,0.46,864,2345,0.37,1394,1770,0.79
7,MIL,BOS,3160,6638,0.48,1007,2840,0.35,1336,1800,0.74
8,NOP,TOR,3065,6598,0.46,982,2656,0.37,1229,1687,0.73
9,ATL,CHI,2723,6067,0.45,805,2416,0.33,1237,1566,0.79
10,OCT,HOU,2879,6156,0.47,770,2171,0.35,1422,1787,0.8


In [76]:
# Fix final dataframe
#deletecols!(finalDF, [:Season, :predictedWin_DT, :predictedWin_RF, :winScore, :loseScore, :Margin])
insert!(openingWeek, 1, "", :teamWinAbr)
finalDF

Row,teamWinAbr,Team,FGM,FGA,FG%,3FGM,3FGA,3FG%,FTM,FTA
,String,String,Int64,Int64,Float64,Int64,Int64,Float64,Int64,Int64
1,ATL,ATL,3349,7223,0.46,271,836,0.32,2034,2544
2,BOS,BOS,3695,7214,0.51,109,346,0.32,1646,1997
3,DET,DET,3194,6875,0.46,131,440,0.3,1686,2211
4,IND,IND,3450,6994,0.49,249,749,0.33,2010,2479
5,MIL,MIL,3337,6948,0.48,257,753,0.34,1796,2241
6,MIN,MIN,3265,7276,0.45,108,381,0.28,1531,2082
7,PHO,PHO,3573,7199,0.5,138,432,0.32,2064,2680
8,SEA,SEA,3500,7117,0.49,136,427,0.32,1608,2143
9,UTA,UTA,3214,6537,0.49,148,458,0.32,1951,2472
10,HOU,HOU,3403,7287,0.47,316,989,0.32,1631,2200


In [61]:
openingWeek

Row,teamWinAbr,Team,Team_1,FGM,FGA,FG%,3FGM,3FGA,3FG%,FTM
,String,String,String,Int64,Int64,Float64,Int64,Int64,Float64,Int64
1,,GSW,BRO,2510,5730,0.44,678,2032,0.33,1214
2,,LAC,LAL,2992,6425,0.47,895,2410,0.37,1498
3,,CHA,CLE,2425,5586,0.43,785,2231,0.35,1052
4,,NYK,IND,2638,5896,0.45,631,1872,0.34,1076
5,,MIA,ORL,2880,6160,0.47,979,2584,0.38,1440
6,,WAS,PHI,2990,6544,0.46,864,2345,0.37,1394
7,,MIL,BOS,3160,6638,0.48,1007,2840,0.35,1336
8,,NOP,TOR,3065,6598,0.46,982,2656,0.37,1229
9,,ATL,CHI,2723,6067,0.45,805,2416,0.33,1237
10,,OCT,HOU,2879,6156,0.47,770,2171,0.35,1422


In [77]:
openingWeek = vcat(finalDF, openingWeek)

Row,teamWinAbr,Team,FGM,FGA,FG%,3FGM,3FGA,3FG%,FTM,FTA
,String,String,Int64,Int64,Float64,Int64,Int64,Float64,Int64,Int64
1,ATL,ATL,3349,7223,0.46,271,836,0.32,2034,2544
2,BOS,BOS,3695,7214,0.51,109,346,0.32,1646,1997
3,DET,DET,3194,6875,0.46,131,440,0.3,1686,2211
4,IND,IND,3450,6994,0.49,249,749,0.33,2010,2479
5,MIL,MIL,3337,6948,0.48,257,753,0.34,1796,2241
6,MIN,MIN,3265,7276,0.45,108,381,0.28,1531,2082
7,PHO,PHO,3573,7199,0.5,138,432,0.32,2064,2680
8,SEA,SEA,3500,7117,0.49,136,427,0.32,1608,2143
9,UTA,UTA,3214,6537,0.49,148,458,0.32,1951,2472
10,HOU,HOU,3403,7287,0.47,316,989,0.32,1631,2200


In [78]:
tail(openingWeek, 40)

Row,teamWinAbr,Team,FGM,FGA,FG%,3FGM,3FGA,3FG%,FTM,FTA
,String,String,Int64,Int64,Float64,Int64,Int64,Float64,Int64,Int64
1,,GSW,2510,5730,0.44,678,2032,0.33,1214,1511
2,,LAC,2992,6425,0.47,895,2410,0.37,1498,1894
3,,CHA,2425,5586,0.43,785,2231,0.35,1052,1406
4,,NYK,2638,5896,0.45,631,1872,0.34,1076,1550
5,,MIA,2880,6160,0.47,979,2584,0.38,1440,1840
6,,WAS,2990,6544,0.46,864,2345,0.37,1394,1770
7,,MIL,3160,6638,0.48,1007,2840,0.35,1336,1800
8,,NOP,3065,6598,0.46,982,2656,0.37,1229,1687
9,,ATL,2723,6067,0.45,805,2416,0.33,1237,1566
10,,OCT,2879,6156,0.47,770,2171,0.35,1422,1787


In [79]:
X = convert(Array, openingWeek[:, 2:110])

34104×109 Array{Any,2}:
 "ATL"  3349  7223  0.46   271   836  …  1864  729  373  1207  1763  8996
 "BOS"  3695  7214  0.51   109   346     2148  669  534  1275  1633  9043
 "DET"  3194  6875  0.46   131   440     1821  485  347  1128  1834  8169
 "IND"  3450  6994  0.49   249   749     2097  634  246  1241  1838  8323
 "MIL"  3337  6948  0.48   257   753     1836  784  304  1100  1824  7093
 "MIN"  3265  7276  0.45   108   381  …  1810  676  357   924  1714  7722
 "PHO"  3573  7199  0.5    138   432     1980  576  316   976  1290  7333
 "SEA"  3500  7117  0.49   136   427     1709  672  286   962  1454  5409
 "UTA"  3214  6537  0.49   148   458     2099  578  236  1065  1262  7601
 "HOU"  3403  7287  0.47   316   989     1808  708  383  1258  1598  8432
 "LAL"  3343  6911  0.48   226   744  …  2068  627  345  1122  1421  8522
 "MIL"  3337  6948  0.48   257   753     1836  784  304  1100  1824  7093
 "NYK"  3308  6822  0.48   185   558     2172  638  417  1341  1758  8444
 ⋮            

In [80]:
using DecisionTree, ScikitLearn
dectree_pred = DecisionTree.predict(model, X)
rf_pred = DecisionTree.predict(rf, X)

34104-element Array{String,1}:
 "ATL"
 "BOS"
 "DET"
 "IND"
 "MIL"
 "MIN"
 "PHO"
 "SEA"
 "UTA"
 "HOU"
 "LAL"
 "MIL"
 "NYK"
 ⋮
 "IND"
 "SAC"
 "HOU"
 "CHI"
 "TOR"
 "SEA"
 "ORL"
 "NYK"
 "CLE"
 "DAL"
 "SAC"
 "NOP"

In [81]:
insert!(openingWeek, 2, 0, :predictedWin_RF)
openingWeek = @transform(openingWeek, predictedWin_RF = rf_pred)
insert!(openingWeek, 2, 0, :predictedWin_DT)
openingWeek = @transform(openingWeek, predictedWin_DT = dectree_pred)

Row,teamWinAbr,predictedWin_DT,predictedWin_RF,Team,FGM,FGA,FG%,3FGM
,String,String,String,String,Int64,Int64,Float64,Int64
1,ATL,ATL,ATL,ATL,3349,7223,0.46,271
2,BOS,BOS,BOS,BOS,3695,7214,0.51,109
3,DET,DET,DET,DET,3194,6875,0.46,131
4,IND,IND,IND,IND,3450,6994,0.49,249
5,MIL,MIL,MIL,MIL,3337,6948,0.48,257
6,MIN,MIN,MIN,MIN,3265,7276,0.45,108
7,PHO,PHO,PHO,PHO,3573,7199,0.5,138
8,SEA,SEA,SEA,SEA,3500,7117,0.49,136
9,UTA,UTA,UTA,UTA,3214,6537,0.49,148
10,HOU,HOU,HOU,HOU,3403,7287,0.47,316


In [82]:
deletecols!(openingWeek, :teamWinAbr)

Row,predictedWin_DT,predictedWin_RF,Team,FGM,FGA,FG%,3FGM,3FGA,3FG%
,String,String,String,Int64,Int64,Float64,Int64,Int64,Float64
1,ATL,ATL,ATL,3349,7223,0.46,271,836,0.32
2,BOS,BOS,BOS,3695,7214,0.51,109,346,0.32
3,DET,DET,DET,3194,6875,0.46,131,440,0.3
4,IND,IND,IND,3450,6994,0.49,249,749,0.33
5,MIL,MIL,MIL,3337,6948,0.48,257,753,0.34
6,MIN,MIN,MIN,3265,7276,0.45,108,381,0.28
7,PHO,PHO,PHO,3573,7199,0.5,138,432,0.32
8,SEA,SEA,SEA,3500,7117,0.49,136,427,0.32
9,UTA,UTA,UTA,3214,6537,0.49,148,458,0.32
10,HOU,HOU,HOU,3403,7287,0.47,316,989,0.32


In [83]:
tail(openingWeek, 40)

Row,predictedWin_DT,predictedWin_RF,Team,FGM,FGA,FG%,3FGM,3FGA,3FG%
,String,String,String,Int64,Int64,Float64,Int64,Int64,Float64
1,BRO,BOS,GSW,2510,5730,0.44,678,2032,0.33
2,LAC,IND,LAC,2992,6425,0.47,895,2410,0.37
3,CLE,DAL,CHA,2425,5586,0.43,785,2231,0.35
4,LAL,NOP,NYK,2638,5896,0.45,631,1872,0.34
5,OCT,LAL,MIA,2880,6160,0.47,979,2584,0.38
6,WAS,TOR,WAS,2990,6544,0.46,864,2345,0.37
7,BOS,NOP,MIL,3160,6638,0.48,1007,2840,0.35
8,OCT,WAS,NOP,3065,6598,0.46,982,2656,0.37
9,CHI,CHI,ATL,2723,6067,0.45,805,2416,0.33
10,LAL,HOU,OCT,2879,6156,0.47,770,2171,0.35


In [84]:
predictionDF = tail(openingWeek, 40)
CSV.write("data/openingWeek2020.csv", predictionDF)

"data/openingWeek2020.csv"

# Conclusion

The models ran and made their predictions, with the decision tree and random forest scoring about 97.7% and 99.8% accuracy on the test set. However, there were some issues in the results. Some predictions name a team that is not in the matchup at all. I believe this is because the models were not given the two teams in each game as the only options before predicting. While this issue occurred for some of the games, several usable results could still be extracted.
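
One way to give the model "the two options" is to compare only the scores of the two teams in each matchup instead of taking the top class overall. The sketch below is plain Julia with made-up numbers: `pick_winner`, the `classes` list, and the `probs` vector are all hypothetical. In practice the per-class probabilities would presumably come from something like DecisionTree.jl's `predict_proba`, which this notebook does not call, so that part is an assumption.

```julia
# Pick the more likely of the two teams actually playing, given per-class scores.
function pick_winner(classes, probs, teamA, teamB)
    function score(team)
        i = findfirst(==(team), classes)
        return i === nothing ? 0.0 : probs[i]   # team codes the model never saw score zero
    end
    return score(teamA) >= score(teamB) ? teamA : teamB
end

# Hypothetical class list and probabilities for a single matchup (GSW vs BRO).
classes = ["BOS", "BRO", "GSW", "LAL"]
probs   = [0.40, 0.25, 0.30, 0.05]

unrestricted = classes[argmax(probs)]                 # top class overall, may not be in the game
winner = pick_winner(classes, probs, "GSW", "BRO")    # best of the two teams playing
```

Here the unrestricted pick is a team outside the matchup, which is the symptom described above, while the restricted pick always returns one of the two teams.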