# Continued Development

## Purpose

This notebook continues the project where the first two
notebooks left off. Rather than deleting or overwriting
sections of those notebooks, it works through the "Next
Steps" outlined at the end of the second notebook.

Because this notebook sits outside the scope of the
original project, it may be messier and less detailed
than the first two.

Once this exploration reaches a satisfactory point, the
project will undergo the same restructuring it went
through during its creation, including a rewrite of the
README and presentation.

## Imports

The data was already cleaned in the earlier notebooks
and serves as the jumping-off point here, so this
notebook won't recreate the initial data. Note that
newer and more accurate data will become available over
time, and the processes below may not immediately take
it into account.

In [1]:
import numpy as np
import polars as pl
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.dummy import DummyRegressor
from sklearn.pipeline import Pipeline
from sklearn.ensemble import (
    RandomForestRegressor, GradientBoostingRegressor
)
from sklearn.model_selection import (
    train_test_split, cross_val_score, GridSearchCV
)
from sklearn.linear_model import ElasticNetCV
from sklearn.metrics import mean_squared_error
from sklearn.feature_extraction.text import CountVectorizer

from joblib import parallel_backend, Parallel, dump, load

from _code.cleaner import preprocess

from IPython.display import Image, Markdown

%matplotlib inline

## Recreating Processes

Most of the steps in 2_Modeling.ipynb that lead up to
the modeling itself are still valid, so they are
combined here without detailed explanation.

### Data Importing

In [None]:
# import initial card data set
cards = pd.read_parquet('./data/simplified_cards.parquet')

### Pre-processing

In [None]:
# perform pre-processing of abilities
processed_abilities = preprocess(cards['oracle_text'])
cards['abilities_list'] = [
    abilities.split('\n') 
    for abilities in processed_abilities
    ]

# create an ability count feature
cards['n_abilities'] = cards['abilities_list'].map(len)

### Train-Test-Split

In [None]:
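# create train and test sets, stratified on the set
# each card belongs to so that every set is represented
# proportionally in both; the set labels are only used
# for stratification, so the split labels are discarded
X = cards.drop(columns=['prices_normal','prices_foil'])
y = cards['set']
train, test, _, _ = train_test_split(
    X, y,
    stratify=y,
    random_state=13,
    test_size=0.2
)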

# reset the indices of both sets so that rows line up
# by position when translating between polars and
# pandas in an upcoming step
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
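
`train_test_split` takes the stratification labels separately from the arrays being split, so the frame can also be passed on its own. A small sketch on a made-up frame (the `demo` names and data are assumptions for illustration, not project data) suggests how each set keeps its share in both partitions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 10 cards from each of two sets.
demo = pd.DataFrame({
    "id": range(20),
    "set": ["aaa"] * 10 + ["bbb"] * 10,
})

# Only the frame is split; the set labels are used purely to
# stratify, so there are no throwaway label outputs.
demo_train, demo_test = train_test_split(
    demo,
    stratify=demo["set"],
    random_state=13,
    test_size=0.2,
)

print(demo_train["set"].value_counts().to_dict())
print(demo_test["set"].value_counts().to_dict())
```

With a 20% test size, each set of 10 contributes 8 rows to the training partition and 2 to the test partition.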

### Vectorization Process

#### Vectorizing Abilities

In [None]:
# The token pattern has to handle several non-standard
# cases to be useful here:
#   - ordinary words: contiguous letters of arbitrary
#     length, optionally followed by an apostrophe
#     suffix
#   - numbers wrapped in curly brackets, e.g. {2},
#     representing 2 colorless mana
#   - letters inside curly brackets, kept together with
#     their brackets
#   - stat modifiers such as +1/+1, -1/-1, and
#     variations thereof
#   - newlines
# It also skips anything inside parentheses. That is
# reminder text: it explains what an ability does but
# adds nothing meaningful about what the card does.

token_pattern = \
    r"([a-zA-Z]+(?:’[a-z]+)?|[+-]?\d\/[+-]?\d|\{\d\d?\}|\{.\s?.?\}|\n)|\(.+?\)"

cvec = CountVectorizer(
        token_pattern=token_pattern,
        # min_df=0.0005, # <= this will mean that the
                    # minimum number of cards that it
                    # takes for an ability to show up
                    # on this list will be 46 after the
                    # explode is run, since it will be
                    # 83,000 entries long. We'll just
                    # limit our overall features since
                    # this is such a small percentage. 
        max_df=0.4,
        ngram_range=(1,5),
        max_features=1500
    )

# Exploding abilities to create a vectorized set
explode_train = train.explode('abilities_list')

explode_vec = cvec.fit_transform(
        explode_train['abilities_list']
    )
explode_vec = pd.DataFrame.sparse.from_spmatrix(
    explode_vec
)

# we save the vocab here to export for later. This is
# so we can bring in new data and make sure it's only
# being segmented into vocab we can "understand"
explode_vec.columns = sorted((vocab := cvec.vocabulary_))
explode_vec['id'] = explode_train['id'].values
explode_vec.head()

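# convert pandas vectorized dataframe to polars;
# errors='ignore' is there so the string 'id' column
# is left as-is while the counts are cast to int32
pl_vec = pl.from_pandas(explode_vec.astype(np.int32,errors='ignore'))
# perform group by and sum aggregation and convert back
# to pandas (polars 0.19+ spells this `group_by`)
agged_vec = pl_vec.groupby('id').sum().to_pandas()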
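
To make the explode, vectorize, and sum-by-id round trip concrete, here is a standard-library sketch of the same idea. It is not the notebook's implementation: `Counter` stands in for `CountVectorizer` and polars, only single tokens are counted (no n-grams, `max_df`, or `max_features`), and the sample cards and the `tokenize` helper are made up for illustration.

```python
import re
from collections import Counter, defaultdict

# Same pattern as in the cell above. With one capturing group,
# re.findall returns the group, so parenthesised reminder text
# comes back as an empty string.
token_pattern = r"([a-zA-Z]+(?:’[a-z]+)?|[+-]?\d\/[+-]?\d|\{\d\d?\}|\{.\s?.?\}|\n)|\(.+?\)"

def tokenize(ability):
    """Lowercase one ability and return its non-empty tokens."""
    return [tok for tok in re.findall(token_pattern, ability.lower()) if tok]

# Hypothetical cards standing in for the 'id' and
# 'abilities_list' columns.
sample_cards = [
    ("card-a", ["Flying",
                "{2}, {T}: Target creature gets +1/+1 (This is reminder text.)"]),
    ("card-b", ["Flying"]),
]

# "Explode": one row per (id, ability), with token counts per row.
rows = [
    (card_id, Counter(tokenize(ability)))
    for card_id, abilities in sample_cards
    for ability in abilities
]

# "Group by id and sum": add the per-ability counts back up per card.
per_card = defaultdict(Counter)
for card_id, counts in rows:
    per_card[card_id] += counts

print(tokenize(sample_cards[0][1][1]))
print(dict(per_card["card-a"]))
```

The empty strings are filtered out here only to keep the output readable.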

#### Vectorizing Types

In [None]:
card_type_cvec = CountVectorizer()

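# keep only the card types to the left of the dash in
# the type line; the subtypes to the right are dropped
type_frame = pl.from_pandas(train['type_line']).apply(
    lambda x: x.split('—')[0]
)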
type_df = pd.DataFrame.sparse.from_spmatrix(
    card_type_cvec.fit_transform(type_frame)
)
# we save the type vocab for later to export
type_df.columns = sorted((type_vocab:=card_type_cvec.vocabulary_))
type_df['id'] = train['id']

#### Vectorizing Color Identity

In [None]:
train['str_color_identity'] = \
    train['color_identity'].map(
        lambda x: ' '.join(x)
        )

color_match = CountVectorizer(
    token_pattern=r"[wubrg]"
)
color_id_df = pd.DataFrame.sparse.from_spmatrix(
    color_match.fit_transform(train['str_color_identity'])
)
color_id_df.columns = sorted(
        (color_vocab:=color_match.vocabulary_)
    )
# flag colorless cards: rows with no color tokens at all
color_id_df['c'] = (color_id_df.sum(axis=1) == 0).astype(int)
color_id_df['id'] = train['id']
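
A small sketch of the colorless flag on made-up color identities (the `demo_` names and sample data are illustrative assumptions, not part of the project). A card with an empty identity produces no color tokens, so its row sums to zero and it is flagged in the `c` column:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical color identities: white-blue, colorless, green.
demo_identities = [["W", "U"], [], ["G"]]
demo_joined = [" ".join(identity) for identity in demo_identities]

# Same single-letter token pattern as above; CountVectorizer
# lowercases the text before matching.
demo_match = CountVectorizer(token_pattern=r"[wubrg]")
demo_counts = pd.DataFrame(
    demo_match.fit_transform(demo_joined).toarray(),
    columns=sorted(demo_match.vocabulary_),
)

# A row with no color tokens at all is a colorless card.
demo_counts["c"] = (demo_counts.sum(axis=1) == 0).astype(int)
print(demo_counts)
```

Only colors that actually occur in the fitted data get a column, which is why this toy example has no `b` or `r`.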

#### Vectorizing Pseudo-Numbers et al.

In [None]:
dummy_vectorizer = CountVectorizer(
    token_pattern = r".*",
    stop_words=[''],
    lowercase=False
    )
dummy_dict = {}
dummy_vocab = {}
dummy_columns = ['rarity','power','toughness','loyalty']
for _col in dummy_columns:
    # missing values become empty strings, which the
    # vectorizer drops via stop_words=['']
    dummy_column = train[_col].apply(
        lambda x: '' if x is None else f'{_col}_{x}'
        )
    dummy_dict[_col] = pd.DataFrame.sparse.from_spmatrix(
        dummy_vectorizer.fit_transform(dummy_column)
    )
    dummy_vocab[_col] = dummy_vectorizer.vocabulary_
    dummy_dict[_col].columns = sorted(dummy_vocab[_col])
    dummy_dict[_col]['id'] = train['id']

dummies = train[['id']]
for _col in dummy_columns:
    dummies = dummies.merge(
        dummy_dict[_col],
        on='id'
    )

### Date-to-Age Conversion

In [None]:
_now = pd.Timestamp.today().floor('D')
train['card_age'] = train['released_at'].apply(lambda x: (_now - x).days)
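
Because the age is measured from the day the notebook is run, the feature drifts between runs. A minimal sketch of the same calculation in vectorized form, using a fixed reference date so the result is repeatable; the dates and the `demo_` names are made up for illustration, and it assumes the release dates are stored as datetimes:

```python
import pandas as pd

# Hypothetical release dates standing in for train['released_at'].
demo_released = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-31"]))

# A fixed reference date instead of pd.Timestamp.today(), so the
# ages are the same on every run.
demo_ref = pd.Timestamp("2020-02-01")

# Subtracting a datetime Series from a Timestamp gives timedeltas;
# .dt.days extracts the whole days.
demo_age = (demo_ref - demo_released).dt.days
print(demo_age.tolist())  # [31, 1]
```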

### Feature Reduction

In [None]:
# the purpose of each column is explained in notebook 2;
# for simplicity, it isn't repeated here
used_columns = [
    'id','cmc','promo','reprint','full_art','textless',
    'n_abilities','median_normal','median_foil','card_age'
]
train_reduced = train[used_columns].copy()

### Data Merging and Subsetting

In [None]:
train_combined = train_reduced.merge(
    agged_vec,
    on='id'
).merge(
    type_df,
    on='id'
).merge(
    color_id_df,
    on='id'
).merge(
    dummies,
    on='id'
)

train_combined = train_combined[
    (train_combined['stickers'] == 0) &
    (train_combined['conspiracy'] == 0)
].drop(columns=['stickers','conspiracy']).copy()

# create a normal and a foil subset, each keeping only
# the rows that have a price of that kind
train_norm = (
    train_combined
    .dropna(subset=['median_normal'])
    .drop(columns=['median_foil'])
    .reset_index(drop=True)
)

train_foil = (
    train_combined
    .dropna(subset=['median_foil'])
    .drop(columns=['median_normal'])
    .reset_index(drop=True)
)

norm_prices = train_norm['median_normal']
foil_prices = train_foil['median_foil']

## Modeling (Again)

### Dummy Model

We'll recreate the dummy model here to have a baseline
for comparison later on.

In [None]:
dummy = DummyRegressor(strategy='median')
norm_guess = dummy.fit(
    train_norm,norm_prices
    ).predict(train_norm)
foil_guess = dummy.fit(
    train_foil,foil_prices
    ).predict(train_foil)

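# RMSE computed as the square root of the MSE; the
# `squared=False` shortcut was deprecated in
# scikit-learn 1.4
norm_base_rmse = np.sqrt(
    mean_squared_error(norm_prices,norm_guess)
)
foil_base_rmse = np.sqrt(
    mean_squared_error(foil_prices,foil_guess)
)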
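
For reference, the baseline amounts to predicting the median price for every card and measuring the root mean squared error of that guess. A standard-library sketch with invented prices (not project data) shows the arithmetic:

```python
import math
import statistics

# Hypothetical prices: mostly cheap cards plus one expensive one.
demo_prices = [0.10, 0.25, 0.25, 1.00, 20.00]

# The median strategy predicts the same value for every card.
demo_median = statistics.median(demo_prices)

# RMSE: square root of the mean squared difference between each
# price and the constant prediction.
demo_rmse = math.sqrt(
    sum((price - demo_median) ** 2 for price in demo_prices)
    / len(demo_prices)
)

print(demo_median)          # 0.25
print(round(demo_rmse, 2))  # about 8.84
```

In this toy example the single expensive card accounts for almost all of the error, since RMSE squares each difference before averaging.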