Magic: The Gathering (MTG) was rated by eBay as the most popular card game, based on the metric of one item sold per minute. Hasbro's earnings reports from the last three quarters suggest eBay's statement has merit. It's also a personal favorite of mine that I've been playing for over two decades. That's why it's frustrating to see that so many cards have had to be banned.

As a very brief overview, MTG releases four new sets a year for its rotating format, Standard. Only cards printed in the last two years can be played in Standard, meaning the usable pool is only about 2,000 of the more than 50,000 cards that have ever been printed (the full list can be found at [https://mtgjson.com/api/v5/csv/cards.csv]). In the 27 years since the game's creation, there have been 33 bans; 14 of those came in the last two years, and two of those cards were banned around one month after they were printed.

So why does that matter? One of the main reasons is that this is an expensive hobby. Competitive players will easily spend over 1,000 dollars per year on cards for their decks, with 80 percent of that going to about 5 percent of the cards. Good cards tend to be expensive (because of supply and demand), so losing the ability to play a card you just spent 50 dollars on feels really bad. It can shake customer confidence, depress tournament turnout, and ultimately hurt the game itself. MTG's R&D team has talked extensively about wanting to ban sparingly to avoid this exact scenario.

The next question is why so many bans have happened lately. Most players and content creators (people who give commentary on the game as a whole) have said it's because cards are being made far more powerful in the hopes of selling more product. Wizards of the Coast, the company that runs MTG and a subsidiary of Hasbro, has tried for years to have its R&D balance cards, including threatening to fire the entire R&D department when 6 cards had to be banned in 2003. The recent problems led the company to create a playtest team specifically tasked with ensuring cards are fun, interesting, and exciting, with balance being a slight secondary to this. My answer to this issue is to leverage machine learning to automate as much of the R&D process as possible.

Using supervised learning, we can automate much of the card creation process for creatures. Because creatures make up the largest portion of the cards made each year, the time saved here can be used to better balance the game, potentially leading to fewer bans and a better format for all MTG players, including myself, to continue enjoying.

As a note, during the supervised learning capstone, I took a very naive approach; that is to say, I inspected the data myself to find the best ways to engineer features from it. For this one, I'm assuming I either have some level of expertise in the field (which, in this case, I do) or have a guide that explains what is in the data (similar to what can be found at [https://www.kaggle.com/lespin/house-prices-dataset]).

In [1]:
#import the necessities
import matplotlib.pyplot as plt
import dask.array as da
import dask.dataframe as dd
import statsmodels.api as sm
import xgboost as xgb
import joblib
import math
import warnings
from dask.distributed import Client, progress
from dask_ml.model_selection import train_test_split
from dask_ml.linear_model import LogisticRegression, LinearRegression
from dask_ml.xgboost import XGBClassifier, XGBRegressor
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.linear_model import LogisticRegression as logr
from sklearn.linear_model import LinearRegression as linr
from sklearn.metrics import mean_absolute_error, roc_auc_score
from sklearn.model_selection import cross_val_score, cross_validate, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from statsmodels.tools.eval_measures import mse, rmse
%matplotlib inline
warnings.filterwarnings('ignore')

In [2]:
client = Client(n_workers=4, threads_per_worker=2, memory_limit='4GB')

In [3]:
#get the data which will become our model
df = dd.read_csv('D:/Downloads/cardsutf8.txt', sep='\t', encoding='latin1', low_memory=False, dtype={'asciiName': 'object',
       'colorIndicator': 'object',
       'convertedManaCost': 'float64',
       'duelDeck': 'object',
       'edhrecRank': 'float64',
       'faceName': 'object',
       'flavorName': 'object',
       'frameEffects': 'object',
       'frameVersion': 'object',
       'loyalty': 'object',
       'mcmId': 'float64',
       'mcmMetaId': 'float64',
       'originalReleaseDate': 'object',
       'otherFaceIds': 'object',
       'promoTypes': 'object',
       'side': 'object',
       'tcgplayerProductId': 'float64',
       'watermark': 'object'})

In [4]:
cards = df.copy()

In [5]:
#check the data (specifically the columns)
cards.columns

Index(['index', 'id', 'artist', 'asciiName', 'availability', 'borderColor',
       'cardKingdomFoilId', 'cardKingdomId', 'colorIdentity', 'colorIndicator',
       'colors', 'convertedManaCost', 'duelDeck', 'edhrecRank',
       'faceConvertedManaCost', 'faceName', 'flavorName', 'flavorText',
       'frameEffects', 'frameVersion', 'hand', 'hasAlternativeDeckLimit',
       'isFullArt', 'isOnlineOnly', 'isOversized', 'isPromo', 'isReprint',
       'isReserved', 'isStarter', 'isStorySpotlight', 'isTextless',
       'isTimeshifted', 'keywords', 'layout', 'leadershipSkills', 'life',
       'loyalty', 'manaCost', 'mcmId', 'mcmMetaId', 'mtgArenaId',
       'mtgjsonV4Id', 'mtgoFoilId', 'mtgoId', 'multiverseId', 'name', 'number',
       'originalReleaseDate', 'originalText', 'originalType', 'otherFaceIds',
       'power', 'printings', 'promoTypes', 'purchaseUrls', 'rarity',
       'scryfallId', 'scryfallIllustrationId', 'scryfallOracleId', 'setCode',
       'side', 'subtypes', 'supertypes', 'tcgp

The goal of this capstone is to try to predict the power (and any further attributes if possible). To that end, there are several features in this set that are not values intrinsic to the card itself, such as cardKingdomId, edhrecRank, mcmId, etc. We will keep only features that would appear on the printed card.

In [6]:
#get the columns of interest
cards = df[['borderColor', 'colors', 'convertedManaCost', 'power',
            'rarity', 'toughness', 'type']]
#gold and silver cards are not legal for play and borderless is just a frame type
#so strip away all but black and white border
cards = cards[(cards['borderColor'] != 'borderless') & (cards['borderColor'] != 'silver')
             & (cards['borderColor'] != 'gold')]
#we're only going to use this model for creatures, as those have less variance than other
#types of cards
cards = cards[cards['type'].str.contains('Creature')]
#some of the powers aren't integers due to effects, so since those aren't easily
#predictable, we'll drop the problem cards
cards = cards[(cards['power'] != '*') & (cards['power'] != '1+*') & 
              (cards['power'] != '2+*') & (cards['power'] != '?')]
#same with toughness
cards = cards[(cards['toughness'] != '*') & (cards['toughness'] != '1+*')]
#these columns are useful but would be more useful as numbers
cards['power'] = cards['power'].astype(int)
cards['toughness'] = cards['toughness'].astype(int)
cards['convertedManaCost'] = cards['convertedManaCost'].astype(int)
#the nulls in colors relate to colorless creatures, so we'll
#fill that with C to represent those. 
cards['colors'] = cards['colors'].fillna('C')
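
The cell above drops non-integer power and toughness values by listing each problem string explicitly. A minimal sketch of a more general filter, written in plain pandas on made-up rows, keeps only values that look like whole numbers. The sample data and the `int_pattern` name are assumptions for illustration, and I have not checked that `str.fullmatch` behaves identically on the Dask dataframe used in this notebook.

```python
import pandas as pd

# Made-up rows standing in for the real card data.
cards = pd.DataFrame({
    "power":     ["2", "*", "1+*", "?", "-1", "3"],
    "toughness": ["2", "*", "3",   "4", "1",  "1+*"],
})

# Keep only rows where both values are plain (optionally negative) integers.
int_pattern = r"-?\d+"
is_plain = (cards["power"].str.fullmatch(int_pattern)
            & cards["toughness"].str.fullmatch(int_pattern))

# After filtering, the remaining strings convert cleanly to integers.
cards = cards[is_plain].astype(int)
print(cards)
```

The advantage is that a new variant such as `3+*` would be filtered out without needing another explicit comparison.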

In [7]:
#check the data for null values
cards.isna().sum(axis=0).compute()

borderColor          0
colors               0
convertedManaCost    0
power                0
rarity               0
toughness            0
type                 0
dtype: int64

In [8]:
#in pandas this is a one-liner, since a column can be attached with
#a line like the following
#cards['White'] = np.where(cards['colors'].str.contains('W'), 1, 0)
#however, the dask where and mask methods each set only one side of
#the condition, hence the separate where and mask lines. where()
#leaves a null value wherever the condition is false, regardless of
#what is passed to .str.contains, hence the need to fill the null
#values.
dummy = cards['colors']
dummy = dummy.where(dummy.str.contains('W', 0))
dummy = dummy.fillna('Not')
dummy = dummy.where(dummy == 'Not', 1)
dummy = dummy.mask(dummy == 'Not', 0)
cards = cards.assign(White=dummy)

dummy = cards['colors']
dummy = dummy.where(dummy.str.contains('U', 0))
dummy = dummy.fillna('Not')
dummy = dummy.where(dummy == 'Not', 1)
dummy = dummy.mask(dummy == 'Not', 0)
cards = cards.assign(Blue=dummy)

dummy = cards['colors']
dummy = dummy.where(dummy.str.contains('B', 0))
dummy = dummy.fillna('Not')
dummy = dummy.where(dummy == 'Not', 1)
dummy = dummy.mask(dummy == 'Not', 0)
cards = cards.assign(Black=dummy)

dummy = cards['colors']
dummy = dummy.where(dummy.str.contains('R', 0))
dummy = dummy.fillna('Not')
dummy = dummy.where(dummy == 'Not', 1)
dummy = dummy.mask(dummy == 'Not', 0)
cards = cards.assign(Red=dummy)

dummy = cards['colors']
dummy = dummy.where(dummy.str.contains('G', 0))
dummy = dummy.fillna('Not')
dummy = dummy.where(dummy == 'Not', 1)
dummy = dummy.mask(dummy == 'Not', 0)
cards = cards.assign(Green=dummy)

In [9]:
#doing the same for rarity
dummy = cards['rarity']
dummy = dummy.where(dummy == 'uncommon', 0)
dummy = dummy.mask(dummy == 'uncommon', 1)
cards = cards.assign(Uncommon=dummy)

dummy = cards['rarity']
dummy = dummy.where(dummy == 'rare', 0)
dummy = dummy.mask(dummy == 'rare', 1)
cards = cards.assign(Rare=dummy)

dummy = cards['rarity']
dummy = dummy.where(dummy == 'mythic', 0)
dummy = dummy.mask(dummy == 'mythic', 1)
cards = cards.assign(Mythic=dummy)
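
The where/fillna/mask sequences above build 0/1 indicator columns one at a time. As a rough sketch of a shorter route, a boolean comparison cast to an integer produces the same kind of column. The example below uses plain pandas on made-up rows; the `color_names` mapping is a name introduced here for illustration, and whether the same expressions work unchanged on the Dask dataframe in this notebook is an assumption I have not verified.

```python
import pandas as pd

# Made-up rows; colors are comma-separated letters, with 'C' for colorless.
cards = pd.DataFrame({
    "colors": ["W", "U,B", "C", "R,G", "W,G"],
    "rarity": ["common", "uncommon", "rare", "mythic", "uncommon"],
})

# One indicator column per color: 1 if the letter appears, else 0.
color_names = {"W": "White", "U": "Blue", "B": "Black", "R": "Red", "G": "Green"}
for letter, name in color_names.items():
    cards[name] = cards["colors"].str.contains(letter, regex=False).astype(int)

# One indicator column per rarity above common.
for level in ["uncommon", "rare", "mythic"]:
    cards[level.capitalize()] = (cards["rarity"] == level).astype(int)

print(cards)
```

Common is left without a column so that it serves as the baseline, matching the cells above.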

In [10]:
#drop the columns that were only used to pare down the data
#or for feature engineering
cards = cards.drop(['borderColor', 'colors', 'rarity', 'type'], 1)

In [11]:
cards = cards.astype(float)

In [12]:
cards.isna().sum(axis=0).compute()

convertedManaCost    0
power                0
toughness            0
White                0
Blue                 0
Black                0
Red                  0
Green                0
Uncommon             0
Rare                 0
Mythic               0
dtype: int64

In [13]:
cards.info()

<class 'dask.dataframe.core.DataFrame'>
Columns: 11 entries, convertedManaCost to Mythic
dtypes: float64(11)

In [14]:
#check correlations
cards.corr().compute()

Unnamed: 0,convertedManaCost,power,toughness,White,Blue,Black,Red,Green,Uncommon,Rare,Mythic
convertedManaCost,1.0,0.728108,0.705189,-0.026265,0.044732,0.022096,0.016329,0.018741,-0.043702,0.212924,0.197592
power,0.728108,1.0,0.735872,-0.046699,-0.031634,0.042911,0.079374,0.076063,-0.084389,0.190101,0.228385
toughness,0.705189,0.735872,1.0,0.00988,0.044838,-0.014636,-0.006189,0.066947,-0.06232,0.177054,0.23625
White,-0.026265,-0.046699,0.00988,1.0,-0.132483,-0.177911,-0.168879,-0.171393,-0.004804,0.007284,0.066827
Blue,0.044732,-0.031634,0.044838,-0.132483,1.0,-0.13271,-0.15926,-0.17167,0.004355,0.015815,0.061497
Black,0.022096,0.042911,-0.014636,-0.177911,-0.13271,1.0,-0.154893,-0.197514,-0.009559,0.006728,0.073647
Red,0.016329,0.079374,-0.006189,-0.168879,-0.15926,-0.154893,1.0,-0.180806,-0.000714,0.0233,0.061352
Green,0.018741,0.076063,0.066947,-0.171393,-0.17167,-0.197514,-0.180806,1.0,-0.020403,0.011025,0.055408
Uncommon,-0.043702,-0.084389,-0.06232,-0.004804,0.004355,-0.009559,-0.000714,-0.020403,1.0,-0.414032,-0.164373
Rare,0.212924,0.190101,0.177054,0.007284,0.015815,0.006728,0.0233,0.011025,-0.414032,1.0,-0.180396


In [15]:
cards.head()

Unnamed: 0,convertedManaCost,power,toughness,White,Blue,Black,Red,Green,Uncommon,Rare,Mythic
0,7.0,4.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,5.0,3.0,3.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,4.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
6,3.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we get into model selection and feature selection. Below, I try every ML model available in Dask's API. I don't know why I'm getting NaN scores when using Dask's ML versions, but I have left them in the code below. I also attempted GridSearchCV, but it reported that the multiclass format is not supported, which I took to mean the scoring only works with a binary target. I attempted to binarize the target into dummy columns as suggested online, but to no avail.

In [16]:
#create a feature set, target set, and split them
x = cards.drop(['power'], 1)
y = cards['power']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=357)

In [17]:
#now we start running all of the tests
model = RandomForestClassifier()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([18.80123806, 21.42898989, 20.4204669 , 21.24506283, 18.6564908 ]),
 'score_time': array([0.05086398, 0.04590273, 0.04703617, 0.06083775, 0.03091764]),
 'test_score': array([0.68392235, 0.65073438, 0.63256161, 0.66069206, 0.61563356])}

In [18]:
model = LogisticRegression()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([0.00196624, 0.00199461, 0.00198531, 0.00199652, 0.00099921]),
 'score_time': array([0., 0., 0., 0., 0.]),
 'test_score': array([nan, nan, nan, nan, nan])}
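
One way to investigate NaN scores like these: scikit-learn's `cross_validate` accepts an `error_score` parameter that defaults to NaN, so a fold whose fit raises an exception is reported as NaN with only a warning, and this notebook silences warnings at import time. Passing `error_score='raise'` surfaces the underlying exception instead. The sketch below uses a hypothetical `AlwaysFails` estimator as a stand-in for a model that cannot handle its input. It is an assumption for illustration, not a diagnosis of the Dask estimators here, and some newer scikit-learn releases raise on their own when every fit fails.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_validate


class AlwaysFails(RegressorMixin, BaseEstimator):
    """Hypothetical estimator whose fit always raises an exception."""

    def fit(self, X, y):
        raise AttributeError("simulated fit failure")

    def predict(self, X):
        return np.zeros(len(X))


# Made-up data; the values do not matter because fit never succeeds.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

# With error_score='raise', the original exception propagates
# instead of being converted into a NaN score.
surfaced = None
try:
    cross_validate(AlwaysFails(), X, y, cv=5, error_score="raise")
except Exception as exc:
    surfaced = f"{type(exc).__name__}: {exc}"

print(surfaced)
```

The near-zero fit times and zero score times in the output above would be consistent with fits failing immediately, though that is only a guess until the exception is surfaced this way.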

In [19]:
model = LinearRegression()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([0.00199533, 0.00199461, 0.00099707, 0.00299215, 0.00199485]),
 'score_time': array([0., 0., 0., 0., 0.]),
 'test_score': array([nan, nan, nan, nan, nan])}

In [20]:
model = LogisticRegression()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.values.compute(), y_train.values.compute(), cv=5)
    
scores

{'fit_time': array([2.88786387, 2.54165292, 2.81505919, 2.71933627, 2.93174815]),
 'score_time': array([0.0009973 , 0.00099707, 0.0009973 , 0.0009973 , 0.00101399]),
 'test_score': array([0.14235938, 0.11874533, 0.11774956, 0.12969878, 0.12223052])}

In [34]:
model = LinearRegression()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.values.compute(), y_train.values.compute(), cv=5)
    
scores

AttributeError: 'numpy.ndarray' object has no attribute 'chunks'

In [23]:
model = XGBClassifier()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.values.compute(), y_train.values.compute(), cv=5)
    
scores

{'fit_time': array([0.00099707, 0.00098968, 0.00099754, 0.00099754, 0.00102496]),
 'score_time': array([0., 0., 0., 0., 0.]),
 'test_score': array([nan, nan, nan, nan, nan])}

In [24]:
model = XGBRegressor()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.values.compute(), y_train.values.compute(), cv=5)
    
scores

{'fit_time': array([0.05585337, 0.00897837, 0.006984  , 0.0079813 , 0.0079813 ]),
 'score_time': array([0., 0., 0., 0., 0.]),
 'test_score': array([nan, nan, nan, nan, nan])}

In [25]:
model = DecisionTreeRegressor()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([0.01495957, 0.01496172, 0.01795459, 0.01495981, 0.01595831]),
 'score_time': array([0.0019958 , 0.00199795, 0.00199461, 0.00199461, 0.00199533]),
 'test_score': array([0.77993283, 0.74453663, 0.71480394, 0.81596902, 0.70560312])}

In [26]:
model = linr()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([0.01288795, 0.01488304, 0.03612185, 0.0366075 , 0.01169348]),
 'score_time': array([0.00299191, 0.00199437, 0.00299215, 0.00299239, 0.00177741]),
 'test_score': array([0.61882466, 0.6489018 , 0.59800838, 0.6993996 , 0.65554929])}

In [27]:
model = logr()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([2.29988551, 2.3233521 , 2.48542356, 2.64459157, 2.39515924]),
 'score_time': array([0.00199437, 0.00299168, 0.00199485, 0.00199461, 0.00199461]),
 'test_score': array([0.45943255, 0.46303211, 0.45282549, 0.45506597, 0.45307443])}

In [28]:
model = Lasso()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([0.00797939, 0.00797892, 0.0289228 , 0.01495957, 0.00897646]),
 'score_time': array([0.00099683, 0.00199485, 0.00250435, 0.00299239, 0.00199461]),
 'test_score': array([0.50924776, 0.52243214, 0.48843309, 0.52222291, 0.52320355])}

In [29]:
model = Ridge()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([0.00398922, 0.00498652, 0.00398898, 0.00456381, 0.00498676]),
 'score_time': array([0.00199485, 0.00099707, 0.00199461, 0.00198579, 0.00099778]),
 'test_score': array([0.61882848, 0.64890224, 0.5980072 , 0.6994014 , 0.65554721])}

In [30]:
model = ElasticNet()
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([0.004987  , 0.00698137, 0.00498629, 0.00598359, 0.01994705]),
 'score_time': array([0.00199461, 0.00199461, 0.00199461, 0.00199485, 0.00199413]),
 'test_score': array([0.56913133, 0.58634445, 0.5460015 , 0.60658819, 0.5875324 ])}

In [31]:
rf_model = RandomForestClassifier()

rf_params = {"max_depth": [2, 3, 4, 5, 6]}

grid_search_rf = GridSearchCV(rf_model,
                           param_grid=rf_params,
                           return_train_score=True,
                           iid=True,
                           cv=5,
                           n_jobs=-1, 
                           scoring='roc_auc')

In [33]:
with joblib.parallel_backend('dask'):
    grid_search_rf.fit(x_train.compute(), y_train.compute())

ValueError: multiclass format is not supported
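
The error above most likely comes from `scoring='roc_auc'` rather than from GridSearchCV itself: in its plain form that scorer expects a binary target, while power takes many distinct values. A minimal sketch, assuming a multiclass-capable scorer such as `'accuracy'` is acceptable, is shown below on made-up data. The features and the three-class label are stand-ins, not the card data, and `iid` is omitted because recent scikit-learn releases no longer accept it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Made-up stand-in data: two integer features and a three-class label.
rng = np.random.default_rng(0)
X = rng.integers(1, 8, size=(300, 2)).astype(float)
y = np.digitize(X.sum(axis=1), [6, 10])  # classes 0, 1, 2

search = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    param_grid={"max_depth": [2, 3, 4, 5, 6]},
    scoring="accuracy",  # works with multiclass targets
    cv=5,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```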

In [32]:
y_train = preprocessing.label_binarize(y_train, classes=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16])
with joblib.parallel_backend('dask'):
    grid_search_rf.fit(x_train.compute(), y_train.compute())

AttributeError: 'numpy.ndarray' object has no attribute 'compute'

In [33]:
x = cards.drop(['power'], 1)
y = cards['power']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=357)

In [35]:
with joblib.parallel_backend('dask'):
    for i in range(2,11):
        model = DecisionTreeRegressor(max_depth=i)
        scores = cross_val_score(model, x_train.compute(), y_train.compute(), cv=5)
        print('{}: {}'.format(i, scores.mean()))

2: 0.5016280106457216
3: 0.5939752894242563
4: 0.6500058307236097
5: 0.6755798202651473
6: 0.6882172651181988
7: 0.7014247789084215
8: 0.7158109652070305
9: 0.7261490561980972
10: 0.7338547152327358
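
The manual depth sweep above can also be written as a grid search. With a regressor and no `scoring` argument, GridSearchCV falls back to the estimator's own score, which for DecisionTreeRegressor is R-squared, so the multiclass scoring problem does not arise. The sketch below runs on made-up data with two informative features; the data and the `depth_search` name are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Made-up stand-in data: a noisy linear mix of two integer features.
rng = np.random.default_rng(1)
X = rng.integers(0, 8, size=(400, 2)).astype(float)
y = 0.5 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=0.5, size=400)

depth_search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_depth": list(range(2, 11))},
    cv=5,  # default scoring is the regressor's R-squared
)
depth_search.fit(X, y)

# Map each depth to its mean cross-validated R-squared.
depths = [int(d) for d in depth_search.cv_results_["param_max_depth"]]
mean_scores = dict(zip(depths, depth_search.cv_results_["mean_test_score"]))
for depth, score in mean_scores.items():
    print("{}: {}".format(depth, score))
```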


In [36]:
with joblib.parallel_backend('dask'):
    model = DecisionTreeRegressor(max_depth=7)
    sfm = SelectFromModel(estimator=DecisionTreeRegressor(max_depth=7).fit(x_train, y_train))
    sfm.estimator_.coef_

AttributeError: 'SelectFromModel' object has no attribute 'estimator_'

In [37]:
with joblib.parallel_backend('dask'):
    model = DecisionTreeRegressor(max_depth=7)
    sfm = SelectFromModel(estimator=DecisionTreeRegressor(max_depth=7).fit(x_train, y_train))
    sfm.feature_importances_.coef_

AttributeError: 'SelectFromModel' object has no attribute 'feature_importances_'
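
Both attribute errors above have the same root: `estimator_` only exists after the selector itself has been fitted, and the importances live on the wrapped estimator, not on the selector (a tree exposes `feature_importances_`, not `coef_`). A minimal sketch of the intended usage is shown below on made-up data; the feature layout and the `threshold=0.05` cutoff are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeRegressor

# Made-up stand-in features: two informative integer columns
# followed by two uninformative 0/1 flags.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.integers(0, 8, size=n),
    rng.integers(0, 8, size=n),
    rng.integers(0, 2, size=n),
    rng.integers(0, 2, size=n),
]).astype(float)
y = 0.5 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=0.1, size=n)

# Fit the selector; this fits a clone of the tree and stores it as estimator_.
selector = SelectFromModel(
    DecisionTreeRegressor(max_depth=7, random_state=0),
    threshold=0.05,
)
selector.fit(X, y)

importances = selector.estimator_.feature_importances_
support = selector.get_support()  # boolean mask of the kept features
print(importances, support)
```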

In [39]:
with joblib.parallel_backend('dask'):
    model = DecisionTreeRegressor(max_depth=7)
    fitted = model.fit(x_train, y_train)
    importances = fitted.feature_importances_
    for i in range(len(x_train.columns)):
        print('{}: {}'.format(x_train.columns[i], importances[i]))

convertedManaCost: 0.6121185582272138
toughness: 0.35436553618326483
White: 0.005658519314629489
Blue: 0.002372933058897025
Black: 0.0009030950352573218
Red: 0.011098229089182989
Green: 0.005778748994956304
Uncommon: 0.0006914016071593966
Rare: 0.0025863322948592176
Mythic: 0.004426646194579546


In [40]:
testset = cards.drop(['White', 'Blue', 'Black', 'Red', 'Green'], 1)

In [41]:
testset.head()

Unnamed: 0,convertedManaCost,power,toughness,Uncommon,Rare,Mythic
0,7.0,4.0,4.0,1.0,0.0,0.0
1,5.0,3.0,3.0,1.0,0.0,0.0
2,4.0,2.0,2.0,0.0,0.0,0.0
3,4.0,2.0,2.0,1.0,0.0,0.0
6,3.0,2.0,2.0,0.0,0.0,0.0


In [42]:
x = testset.drop(['power'], 1)
y = testset['power']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=357)

In [43]:
with joblib.parallel_backend('dask'):
    model = DecisionTreeRegressor(max_depth=7)
    fitted = model.fit(x_train, y_train)
    importances = fitted.feature_importances_
    for i in range(len(x_train.columns)):
        print('{}: {}'.format(x_train.columns[i], importances[i]))

convertedManaCost: 0.6234363616160599
toughness: 0.366240635601655
Uncommon: 0.002910921635538074
Rare: 0.0030904051281666614
Mythic: 0.004321676018580688


In [44]:
model = DecisionTreeRegressor(max_depth=7)
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([0.00797915, 0.00797915, 0.00695229, 0.00698066, 0.00698113]),
 'score_time': array([0.00099754, 0.00099754, 0.00099754, 0.00099802, 0.00199461]),
 'test_score': array([0.69772935, 0.68276588, 0.66316758, 0.74803006, 0.66294745])}

In [45]:
testset = testset.drop(['Uncommon', 'Rare', 'Mythic'], 1)

In [46]:
x = testset.drop(['power'], 1)
y = testset['power']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=357)

In [47]:
model = DecisionTreeRegressor(max_depth=7)
with joblib.parallel_backend('dask'):
    scores = cross_validate(model, x_train.compute(), y_train.compute(), cv=5)
    
scores

{'fit_time': array([0.00498605, 0.00498605, 0.00499153, 0.00498724, 0.00495887]),
 'score_time': array([0.00199461, 0.00099659, 0.00199842, 0.00099802, 0.00099707]),
 'test_score': array([0.6731552 , 0.68095481, 0.64744091, 0.7440943 , 0.68796194])}

In [49]:
x = cards.drop(['power'], 1)
y = cards['power']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=357)

In [50]:
with joblib.parallel_backend('dask'):
    fitted = model.fit(x_train, y_train)
    y_train_predictions = fitted.predict(x_train.values.compute())
    y_test_predictions = fitted.predict(x_test.values.compute())
    print("R-squared of the model in the training set is: {}".format(model.score(x_train, y_train)))
    print("R-squared of the model in the test set is: {}".format(model.score(x_test, y_test)))
    print("\nMean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_test_predictions)))
    print("Mean squared error of the prediction is: {:3e}".format(mse(y_test, y_test_predictions)))
    print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_test_predictions)))

R-squared of the model in the training set is: 0.7247085136930256
R-squared of the model in the test set is: 0.6917752707003756

Mean absolute error of the prediction is: 0.6712005646681725
Mean squared error of the prediction is: 9.574903e-01
Root mean squared error of the prediction is: 0.9785143116059108


In [55]:
x = testset.drop(['power'], 1)
y = testset['power']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=357)

In [48]:
with joblib.parallel_backend('dask'):
    fitted = model.fit(x_train, y_train)
    y_train_predictions = fitted.predict(x_train.values.compute())
    y_test_predictions = fitted.predict(x_test.values.compute())
    print("R-squared of the model in the training set is: {}".format(model.score(x_train, y_train)))
    print("R-squared of the model in the test set is: {}".format(model.score(x_test, y_test)))
    print("\nMean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_test_predictions)))
    print("Mean squared error of the prediction is: {:3e}".format(mse(y_test, y_test_predictions)))
    print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_test_predictions)))

R-squared of the model in the training set is: 0.7038099721881113
R-squared of the model in the test set is: 0.6830427739596211

Mean absolute error of the prediction is: 0.69200473613214
Mean squared error of the prediction is: 9.846175e-01
Root mean squared error of the prediction is: 0.9922789324039074


In [51]:
model = linr()

In [52]:
with joblib.parallel_backend('dask'):
    fitted = model.fit(x_train, y_train)
    y_train_predictions = fitted.predict(x_train.values.compute())
    y_test_predictions = fitted.predict(x_test.values.compute())
    print("R-squared of the model in the training set is: {}".format(model.score(x_train, y_train)))
    print("R-squared of the model in the test set is: {}".format(model.score(x_test, y_test)))
    print("\nMean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_test_predictions)))
    print("Mean squared error of the prediction is: {:3e}".format(mse(y_test, y_test_predictions)))
    print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_test_predictions)))

R-squared of the model in the training set is: 0.6514982607620847
R-squared of the model in the test set is: 0.618768053412623

Mean absolute error of the prediction is: 0.7746234301461632
Mean squared error of the prediction is: 1.184285e+00
Root mean squared error of the prediction is: 1.088248520424303


In [53]:
x = testset.drop(['power'], 1)
y = testset['power']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=357)

In [54]:
with joblib.parallel_backend('dask'):
    fitted = model.fit(x_train, y_train)
    y_train_predictions = fitted.predict(x_train.values.compute())
    y_test_predictions = fitted.predict(x_test.values.compute())
    print("R-squared of the model in the training set is: {}".format(model.score(x_train, y_train)))
    print("R-squared of the model in the test set is: {}".format(model.score(x_test, y_test)))
    print("\nMean absolute error of the prediction is: {}".format(mean_absolute_error(y_test, y_test_predictions)))
    print("Mean squared error of the prediction is: {:3e}".format(mse(y_test, y_test_predictions)))
    print("Root mean squared error of the prediction is: {}".format(rmse(y_test, y_test_predictions)))

R-squared of the model in the training set is: 0.6340225283197636
R-squared of the model in the test set is: 0.6049393843870137

Mean absolute error of the prediction is: 0.7817405073247994
Mean squared error of the prediction is: 1.227243e+00
Root mean squared error of the prediction is: 1.1078100739748507


In conclusion, when using a DecisionTreeRegressor with a depth of 7, only toughness and convertedManaCost are necessary. Removing all other features had a negligible impact on the metrics: the test R-squared went from about 0.69 to about 0.68. While the difference in training time is imperceptible because the dataset is so small, on future projects where runs can take hours or days the savings would be far more noticeable. Linear regressions also tend to be faster than decision trees, so if a test R-squared of about 0.60 is acceptable rather than about 0.68, a linear regression could suffice.