# Model Preparation and Processing

Now that our data is usable for our model, we can start
going through the steps to make the first set of
models.

In this notebook, we'll make preliminary and final
models for our predictions. We'll convert our data from
the structure we ended with in the previous notebook
into a form we can train our models on. Rather than
receiving raw, human-readable text, our model needs
vectorized data. This applies to several other features
as well, which we will address as they come up.

## Imports

Many of the imports from the first notebook are used
again alongside several additional packages.

The previous notebook handled the major preprocessing
steps and produced a simplified form of our original
data structures, so we'll use that data as our starting
point. Since that file is much smaller than the raw
data, it has been provided in the data/ folder in this
project's repository. It can also be recreated by
following the process outlined in
[the previous notebook](./1_Data_Prep.ipynb).

As before, required packages are defined in the yaml
file in the _code/ folder in this repository.

In [1]:
import numpy as np
import polars as pl
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline
from sklearn.ensemble import (
    RandomForestRegressor, GradientBoostingRegressor
)
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV
)
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import CountVectorizer

from joblib import (
    parallel_backend, Parallel, dump, load
)

from _code.cleaner import preprocess

from IPython.display import Image, Markdown
sns.set()

As mentioned, we'll import our previously-created data
set rather than any of the raw data sets we used
initially.

In [2]:
cards = pd.read_parquet('./data/simplified_cards.parquet')

## Data Engineering

While we have pretty much all of the data we need from
the previous notebook, we can still benefit from having
several other pieces of information that can be
extracted from the data. One thing that we want to
modify is the way our abilities are stored.

### Processing Abilities/Oracle Text

Our abilities are currently stored as one string per
card. While this is useful for reading our abilities,
there is an important caveat: a card can have multiple
abilities, each printed on its own line on the card and
separated by a `\n` in the Oracle text.

Since these abilities are mostly independent of one
another, we don't want them to be pushed together when
we convert them into vectorized versions of themselves.
A card with "Flying" and "When CARDNAME enters the
battlefield" shouldn't be read as "Flying when CARDNAME
enters the battlefield," because the two aren't part of
the same ability.

To get around this, we'll split each card's Oracle text
into an array at each `\n` and we can vectorize the
text from there.
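
As a minimal sketch of that split, here it is applied
to a made-up Oracle string rather than a row from
`cards`:

```python
# Illustrative Oracle text; not taken from the data set.
oracle_text = "Flying\nWhen CARDNAME enters the battlefield, draw a card."

# Each line of Oracle text is treated as one ability.
abilities = oracle_text.split("\n")
n_abilities = len(abilities)

print(abilities)
print(n_abilities)
```

Splitting first means that any later n-grams are built
within a single ability and never span two of them.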

Let's go ahead and look at the before and after of some
of our cards.

First, we'll generate our `processed_abilities` by
sending the Oracle text through our custom `preprocess`
function, which breaks every ability we send it into
individual words, reduces them to their root words,
removes stop words, and then puts them all back
together for us.
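
The real `preprocess` lives in `_code/cleaner.py` and
isn't reproduced here. To make the steps concrete, the
sketch below is a toy stand-in: the name
`toy_preprocess`, the tiny stop-word list, and the crude
suffix stripping are all assumptions for illustration,
and its output will not match the real function (for
example, it drops symbols such as `+1/+1` and `{t}`).

```python
import re

# Hypothetical stop-word list; a real one would be much larger.
STOP_WORDS = {"a", "an", "the", "of", "to", "when", "you", "it"}

def toy_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_preprocess(texts):
    """Lowercase, tokenize, drop stop words, stem, and rejoin.

    Newlines are preserved so abilities can still be split later.
    """
    processed = []
    for text in texts:
        lines = []
        for line in text.split("\n"):
            words = re.findall(r"[a-z]+", line.lower())
            kept = [toy_stem(w) for w in words if w not in STOP_WORDS]
            lines.append(" ".join(kept))
        processed.append("\n".join(lines))
    return processed

result = toy_preprocess(["Flying\nWhen CARDNAME enters the battlefield"])
print(result)
```

The detail that carries over to the real pipeline is
that the `\n` separators survive preprocessing, which
is what makes the later split into abilities possible.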

After that, we'll separate each of the returned strings
into a list of abilities by splitting on `\n`.

Once we finish that, we'll create a new column that
counts the number of abilities a given card has, as
cards that have more abilities are likely to be more
complex or powerful and will, by extension, hold more
value. In theory, at least.

In [3]:
display(
    cards[[
        'name','oracle_text',
        'median_normal','median_foil'
        ]].head()
)

processed_abilities = preprocess(cards['oracle_text'])
cards['abilities_list'] = [
    abilities.split('\n') 
    for abilities in processed_abilities
    ]

cards['n_abilities'] = cards['abilities_list'].map(len)

display(
    cards[[
        'name','abilities_list',
        'median_normal','median_foil',
        'n_abilities'
        ]].head()
)


Unnamed: 0,name,oracle_text,median_normal,median_foil
0,Fury Sliver,All Sliver creatures have double strike.,0.38,3.95
1,Kor Outfitter,"When CARDNAME enters the battlefield, you may ...",0.24,7.78
2,Siren Lookout,"Flying\nWhen CARDNAME enters the battlefield, ...",0.06,0.23
3,Web,Enchant creature (Target a creature as you cas...,0.64,
4,Venerable Knight,"When CARDNAME dies, put a +1/+1 counter on tar...",0.095,0.28


Unnamed: 0,name,abilities_list,median_normal,median_foil,n_abilities
0,Fury Sliver,[sliver creature double strike],0.38,3.95,1
1,Kor Outfitter,[cardname enters battlefield may attach target...,0.24,7.78,1
2,Siren Lookout,"[fly , cardname enters battlefield explores ]",0.06,0.23,2
3,Web,"[enchant creature , enchant creature get +0/...",0.64,,2
4,Venerable Knight,[cardname die put +1/+1 counter target knight ...,0.095,0.28,1


## Model Preparation

Now that we have our data ready for our modeling steps,
we can go ahead and do our splits for our training and
testing subsets. As mentioned before, we'll stratify
the split on each card's set so that every set shows up
at a similar rate in both the training and testing
data.

We'll also go ahead and get rid of the original
`prices_` columns, as we're only going to focus on the
median prices for this project.

In [5]:
X = cards.drop(columns=['prices_normal','prices_foil'])
y = cards['set']
train, test, _, _ = \
    train_test_split(
        X,y,stratify=y,
        random_state=13,
        test_size=0.2
    )

# we'll reset the indices of both sets to more easily
# translate between polars and pandas in an upcoming
# step

train.reset_index(drop=True,inplace=True)
test.reset_index(drop=True,inplace=True)

# printing out value counts to demonstrate that the
# splits are even 

display(
    train['set'].value_counts(normalize=True)[:5],
    test['set'].value_counts(normalize=True)[:5]
)

set
mb1      0.027116
plist    0.016935
clb      0.014430
sld      0.013749
j22      0.013388
Name: proportion, dtype: float64

set
mb1      0.027096
plist    0.016995
clb      0.014430
sld      0.013789
j22      0.013388
Name: proportion, dtype: float64
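
To see why the proportions line up so closely, here is
a small sketch of a stratified split on toy data. The
frame `toy` and its values are made up for
illustration; only the `train_test_split` arguments
mirror the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 cards spread unevenly across three sets.
toy = pd.DataFrame({
    "name": [f"card_{i}" for i in range(20)],
    "set": ["mb1"] * 10 + ["clb"] * 5 + ["sld"] * 5,
})

# Stratifying on "set" keeps each set's share the same in both splits.
toy_train, toy_test = train_test_split(
    toy, stratify=toy["set"], test_size=0.2, random_state=13
)

train_share = toy_train["set"].value_counts(normalize=True)
test_share = toy_test["set"].value_counts(normalize=True)
print(train_share)
print(test_share)
```

Without `stratify`, a small set could end up entirely
in one split by chance.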

### Ability Vectorization

In order to train a model on our data, we need to have
a feature/column for each word or phrase we want to
target. We'll use the `CountVectorizer` provided by
scikit-learn to do this.

In [6]:
# Our token pattern needs to be able to account for
# several non-standard things for it to be effective
# for our needs. As normal, it needs to be able to
# match words that are contiguous letters of an
# arbitrary length. However, we also need to be able to
# account for numbers in a few formats. These can be
# wrapped in curly brackets, e.g. {2}, representing 2
# colorless mana.
# It also needs to be able to account for letters
# inside of curly brackets and return them as such.
# It must also match something like +1/+1, -1/-1, or
# several other variations thereof.
# Lastly, it should also ignore any text that is inside
# of parentheses, as this is reminder text: it explains
# what an ability does but doesn't add any information
# about what a card does.

token_pattern = \
    r"([a-zA-Z]+(?:’[a-z]+)?|[+-]?\d\/[+-]?\d|\{\d\d?\}|\{.\s?.?\}|\n)|\(.+?\)"

cvec = CountVectorizer(
        token_pattern=token_pattern,
        # min_df=0.0005, # <= after the explode is run
                    # there are roughly 88,000 rows, so
                    # this would require a term to
                    # appear in about 45 of them. We'll
                    # just limit our overall features
                    # instead, since this is such a
                    # small percentage.
        max_df=0.4,
        ngram_range=(1,5),
        max_features=1500
    )
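
To check what the token pattern picks up, it can be run
directly with `re.findall`, which is how
`CountVectorizer` applies it. The sample string below
is made up for illustration.

```python
import re

# Same pattern as in the cell above.
token_pattern = \
    r"([a-zA-Z]+(?:’[a-z]+)?|[+-]?\d\/[+-]?\d|\{\d\d?\}|\{.\s?.?\}|\n)|\(.+?\)"

# Made-up ability text containing each kind of token.
sample = "{t} add {u} (reminder text) get +1/+1"

tokens = re.findall(token_pattern, sample)
print(tokens)
```

The parenthesized reminder text is matched outside the
capture group, so `re.findall` returns an empty string
for it instead of the words it contains.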

Currently, each card's abilities are stored as a list,
but the vectorizer expects one string per document
rather than a list of strings. To accomplish this,
we'll explode our data frame. This gives each card one
row per ability, so cards that have 3 or 4 abilities
will instead have 3 or 4 rows with 1 ability each.
Because of this, our data frame will be substantially
larger. Since each row is now an ability rather than a
card, we won't be able to reliably use a minimum
document frequency in our vectorizer, but that's okay.

In [10]:
display(train[['name','abilities_list']].head())
explode_train = train.explode('abilities_list')
display(explode_train[['name','abilities_list']].head())

Unnamed: 0,name,abilities_list
0,Stolen Identity,"[create token copy target artifact creature , ..."
1,Ageless Entity,[whenever gain life put many +1/+1 counter car...
2,Izzet Locket,"[{t} add {u} {r} , u r u r u r u r {t} sacrif..."
3,Inspired Charge,[creature control get +2/+1 end turn]
4,"Oros, the Avenger","[fly , whenever cardname deal combat damage p..."


Unnamed: 0,name,abilities_list
0,Stolen Identity,create token copy target artifact creature
0,Stolen Identity,cipher
1,Ageless Entity,whenever gain life put many +1/+1 counter card...
2,Izzet Locket,{t} add {u} {r}
2,Izzet Locket,u r u r u r u r {t} sacrifice cardname draw t...


Now we can perform our vectorization.

In [11]:
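# fit the vectorizer on one ability per row
explode_vec = cvec.fit_transform(
        explode_train['abilities_list']
    )
# wrap the sparse matrix in a data frame without
# converting it to a dense array
explode_vec = pd.DataFrame.sparse.from_spmatrix(
    explode_vec
)
# column indices follow the sorted order of the
# vocabulary terms, so the sorted terms line up with
# the matrix columns
explode_vec.columns = sorted((vocab := cvec.vocabulary_))
# attach each row's card id by position; .values skips
# alignment on the exploded (duplicated) index
explode_vec['uuid'] = explode_train['id'].values
explode_vec.head()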

Unnamed: 0,+1/+0,+1/+0 end,+1/+0 end turn,+1/+1,+1/+1 counter,+1/+1 counter cardname,+1/+1 counter creature,+1/+1 counter creature control,+1/+1 counter target,+1/+1 counter target creature,...,{u} {t},{u} {u},{w},{w} cardname,{w} {b},{w} {t},{w} {u},{w} {w},{x},uuid
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2831fe77-ea98-4334-a78e-01580fb002c0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2831fe77-ea98-4334-a78e-01580fb002c0
2,0,0,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,f2d9b77b-d775-4546-91c8-df438a1d8dbe
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,a32ecc71-b924-4414-94ee-c6cb0ba752e5
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,a32ecc71-b924-4414-94ee-c6cb0ba752e5


Let's take a look at the number of total entries versus
the number of unique cards in our data set.

In [12]:
explode_vec['uuid'].nunique(),explode_vec.shape[0]

(49896, 88164)

It looks like we have a little bit under 2 abilities
per card (88,164 rows across 49,896 cards, or about
1.77 each), which makes sense considering most cards
have 1 ability.
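
The average can be worked out directly from the two
numbers in the output above:

```python
# Counts taken from the output of the previous cell.
n_cards, n_rows = 49896, 88164

# Average number of ability rows per unique card.
avg_abilities = n_rows / n_cards
print(round(avg_abilities, 2))
```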

For this next step, we'll convert our Pandas data frame
into a Polars data frame. Polars handles group-by
operations much more efficiently, and we need the sum
of each term's counts for each unique card. Everything
after this step will go back to taking place in Pandas.

In [13]:
# convert pandas vectorized dataframe to polars
pl_vec = pl.from_pandas(explode_vec.astype(np.int32,errors='ignore'))
# perform group by and sum aggregation and convert back
# to pandas (newer Polars releases name this method
# `group_by` instead of `groupby`)
agged_vec = pl_vec.groupby('uuid').sum().to_pandas()

agged_vec.shape

(49896, 1501)
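
For reference, the same aggregation can be sketched in
plain Pandas on toy data. This is only meant to show
what the group-by and sum does to the exploded rows;
the notebook itself uses Polars for speed, and the
frame `toy` is made up for illustration.

```python
import pandas as pd

# Three exploded rows: card "a" has two abilities, card "b" has one.
toy = pd.DataFrame({
    "fly": [1, 0, 0],
    "cipher": [0, 1, 0],
    "uuid": ["a", "a", "b"],
})

# Sum the term counts across each card's rows.
agged = toy.groupby("uuid", as_index=False).sum()
print(agged)
```

Card "a" ends up with a single row that carries the
counts from both of its abilities.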

Great. Now we have our data frame back to one entry per
card. We'll keep this data frame to the side for now so
that we can merge it back in later. There are a couple
more vectorizations that need to take place.