<style>
	.title {
		text-align: center;
		font-size: 70px;
		font-weight: bold;
	}
	.subtitle {
		text-align: center;
		font-style: italic;
		font-size: 30px;
	}
	.authors {
		text-align: center;
		font-size: 20px;
		padding: 10px;
	}
	.introduction {
		font-family: Arial, sans-serif;
		font-size: 18px;
		line-height: 1.6;
		text-align: justify;
		margin: 20px;
		padding: 15px;
		background-color:rgb(37, 37, 37);
		border-left: 5px solid #4CAF50;
		color: rgb(222, 222, 222);
		max-width: 90%;
	}
</style>

<hr/>
<div class="title">ENSF 444 Final Project</div>
<div class="subtitle">[ Magic: The Gathering ] Card Color Classification</div>
<div class="authors">Luca Rios &amp; Cody Casselman</div>
<hr/>

<div class="introduction">
	<h2>Introduction</h2>
	<p>
		Magic: The Gathering (MTG) is a trading card game with around 27,000 unique cards. Most cards have at least one color, and color is a key part of the game. There are five colors: White, Blue, Black, Red, and Green. Each color has its own playstyle and mechanics; this project focuses on classifying cards by color.
	</p>
	<p>
		The objective of this project is to predict the color of an MTG card from its oracle text (the descriptive text on the card) using a machine learning model. We train and evaluate the model on a dataset of unique MTG cards, each with its oracle text and corresponding color.
	</p>
	<br/>
</div>

<style>
	.data-collection {
		font-family: Arial, sans-serif;
		font-size: 18px;
		line-height: 1.6;
		text-align: justify;
		margin: 20px;
		padding: 15px;
		background-color:rgb(37, 37, 37);
		border-left: 5px solid rgb(76, 144, 175);
		color: rgb(222, 222, 222);
		max-width: 90%;
	}
</style>

<div class="data-collection">
	<h2>Data Collection</h2>
	<p>
		The dataset used for this project was obtained from the <a href="https://scryfall.com/">Scryfall API</a>. Scryfall is a comprehensive database of Magic: The Gathering cards, and its API provides access to a wealth of information about each card, including its name, type, color identity, and oracle text. The database contains over 27,000 unique cards, making it an ideal resource for our project.
	</p>
	<p>
		The data was collected using Python and the requests library to make API calls to Scryfall. We retrieved the card data in JSON format and stored it in a CSV file for further processing. The dataset only includes the oracle text and color identity of each card, as these are the two features we will be using for our machine learning model. The oracle text is the descriptive text on the card that explains its abilities and effects, while the color identity is a list of colors associated with the card. The dataset was cleaned and preprocessed to remove any unnecessary information and ensure that it was in a suitable format for analysis.
	</p>
	<h3>Scope of the Dataset</h3>
	<p>
		Since cards can have multiple colors, we chose to exclude multi-colored cards from our dataset and use only mono-colored cards. This reduces the classification task to a single label per card. It also keeps the model focused on the core mechanics of each color, without the added complexity of multi-colored cards, which tend to blur the lines between the themes of each color.
	</p>
	<br/>
</div>
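
The collection and filtering steps described above could look roughly like the sketch below. It is not the authors' script: the helper names `cards_to_rows` and `rows_to_csv` are hypothetical, the sample records are made up rather than fetched from the API, and prepending the type line to the oracle text is an assumption inferred from the rows shown in the data preview. The field names `type_line`, `oracle_text`, and `color_identity` and the single-letter color codes follow Scryfall's card objects.

```python
import csv
import io

# Scryfall encodes colors as single letters; the names match the CSV labels.
COLOR_NAMES = {"W": "white", "U": "blue", "B": "black", "R": "red", "G": "green"}


def cards_to_rows(cards):
    """Keep mono-colored cards that have oracle text; return (color, text) rows."""
    rows = []
    for card in cards:
        identity = card.get("color_identity", [])
        text = card.get("oracle_text")
        if len(identity) != 1 or not text:
            continue  # skip colorless, multi-colored, and text-less cards
        # Assumption: the type line is prepended to the oracle text.
        full_text = f"{card.get('type_line', '')} {text}".strip()
        rows.append((COLOR_NAMES[identity[0]], full_text))
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["color", "text"])
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical records shaped like Scryfall card objects (not real API output).
sample_cards = [
    {"type_line": "Instant", "oracle_text": "Counter target spell.", "color_identity": ["U"]},
    {"type_line": "Creature", "oracle_text": "{T}: Add {G}.", "color_identity": ["G"]},
    {"type_line": "Instant", "oracle_text": "Draw a card.", "color_identity": ["U", "B"]},
    {"type_line": "Artifact", "oracle_text": "{T}: Add {C}.", "color_identity": []},
]

rows = cards_to_rows(sample_cards)
print(rows_to_csv(rows))
```

In a real run, `sample_cards` would be replaced by the list of card objects parsed from the Scryfall JSON response.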

# Step One: Data Cleaning

In [1]:
# Import necessary libraries
import pandas as pd


In [2]:
# Import data from csv
data = pd.read_csv("ml_data.csv")
# Inspect first few rows
data.head()

Unnamed: 0,color,text
0,black,Legendary Creature — Zombie Wizard When Acerer...
1,black,Legendary Creature — Insect Druid When Aatchik...
2,black,Legendary Creature — Astartes Warrior Trample ...
3,black,Creature — Beholder When Baleful Beholder ente...
4,black,Sorcery As an additional cost to cast this spe...


In [3]:
# Summarize columns and dtypes (info() prints directly and returns None)
print(data.info())
# Check for NaN values in each column
print(data.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29148 entries, 0 to 29147
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   color   29148 non-null  object
 1   text    29148 non-null  object
dtypes: object(2)
memory usage: 455.6+ KB
None
color    0
text     0
dtype: int64


In [4]:
# The check above found no NaN values; dropna() is kept as a safeguard in case the CSV changes
data = data.dropna()
print(data.isnull().sum())
data.info()

color    0
text     0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29148 entries, 0 to 29147
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   color   29148 non-null  object
 1   text    29148 non-null  object
dtypes: object(2)
memory usage: 455.6+ KB


In [5]:
# Creating our feature matrix and target vector
X = data["text"]
y = data["color"]

In [6]:
# Splitting our training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
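
The split above is unseeded, so each run produces a different partition and slightly different scores. A minimal sketch of a reproducible, stratified alternative follows. The toy frame is hypothetical and only stands in for `ml_data.csv`, and the choice of `random_state=42` is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with an imbalanced label, standing in for ml_data.csv.
toy = pd.DataFrame({
    "color": ["red"] * 20 + ["blue"] * 80,
    "text": [f"card {i}" for i in range(100)],
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["text"],
    toy["color"],
    test_size=0.1,
    stratify=toy["color"],  # keep the class ratio the same in both splits
    random_state=42,        # fix the shuffle so reruns give the same split
)

print(y_te.value_counts().to_dict())
```

Stratifying matters here because the classes are not equally sized (red has noticeably less support than the other colors in the reports below), so an unlucky random split could shift the class balance of the test set.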

# Model One: Multinomial Naive Bayes


In [7]:
# Import necessary libraries
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score

### Naive Approach

In [55]:

mnb_clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', max_df=0.3, min_df=0.0, ngram_range=(1,2))),
    ('clf', MultinomialNB())
])

mnb_clf.fit(X_train, y_train)

y_pred = mnb_clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.6332761578044597
              precision    recall  f1-score   support

       black       0.66      0.69      0.68       633
        blue       0.60      0.71      0.65       620
       green       0.62      0.63      0.63       600
         red       0.75      0.41      0.53       414
       white       0.61      0.65      0.63       648

    accuracy                           0.63      2915
   macro avg       0.65      0.62      0.62      2915
weighted avg       0.64      0.63      0.63      2915
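
The report shows that red has high precision (0.75) but low recall (0.41), meaning many red cards are being assigned to other colors. A confusion matrix shows which colors absorb them. The sketch below uses made-up labels and predictions purely to illustrate the call; the variable names `y_true_demo` and `y_pred_demo` are hypothetical.

```python
from sklearn.metrics import confusion_matrix

labels = ["black", "blue", "green", "red", "white"]

# Hypothetical labels and predictions, for illustration only.
y_true_demo = ["red", "red", "red", "black", "blue", "green", "white"]
y_pred_demo = ["red", "black", "white", "black", "blue", "green", "white"]

# Rows are true classes, columns are predicted classes, both ordered as `labels`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Read off the row for red: how the true red cards were distributed.
red_row = {label: int(count) for label, count in zip(labels, cm[labels.index("red")])}
print(red_row)
```

In the notebook, the same call would be `confusion_matrix(y_test, y_pred, labels=mnb_clf.classes_)`.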



### Determining the Best Encoder Using a Grid Search CV

In [56]:
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1,2))),
    ("clf", MultinomialNB())
])

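# Note: the "tfidf" entry swaps in a default TfidfVectorizer for every candidate,
# so the ngram_range=(1,2) set in the pipeline above is not used during the search
param_grid = [{"tfidf":[TfidfVectorizer()],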
               "tfidf__max_df":[0.5, 0.3, 0.2, 0.15],
               "tfidf__min_df":[0.0, 0.1],
               }]

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, return_train_score=True, error_score='raise')
grid.fit(X_train, y_train)


In [57]:
print("Best cross-validation accuracy: {:.2f}".format(grid.best_score_))
print("Training Score: {:.2f}".format(grid.cv_results_['mean_train_score'][grid.best_index_]))
print("Test set score: {:.2f}".format(grid.score(X_test, y_test)))
print("Best parameters: {}".format(grid.best_params_))

Best cross-validation accuracy: 0.61
Training Score: 0.66
Test set score: 0.62
Best parameters: {'tfidf': TfidfVectorizer(), 'tfidf__max_df': 0.2, 'tfidf__min_df': 0.0}
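
The best parameters list a plain `TfidfVectorizer()`. Passing an estimator under a step name in `param_grid` replaces that pipeline step, so settings made on the original step (here `ngram_range=(1,2)`) do not carry over. The sketch below demonstrates this on a small made-up corpus; the texts, labels, and the names `replaced` and `kept` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: six documents per class.
texts = [
    "draw a card", "counter target spell", "return target creature to hand",
    "draw two cards then discard", "scry two then draw", "counter target creature spell",
    "deals three damage to any target", "creature gains haste",
    "deals damage to each creature", "sacrifice a mountain",
    "target creature gains first strike", "deals two damage to target player",
]
labels = ["blue"] * 6 + ["red"] * 6

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", MultinomialNB()),
])

# Grid that replaces the whole step, as in the cell above.
replaced = GridSearchCV(
    pipe, {"tfidf": [TfidfVectorizer()], "tfidf__max_df": [0.5, 1.0]}, cv=2
)
replaced.fit(texts, labels)

# Grid that only tunes a parameter of the existing step.
kept = GridSearchCV(pipe, {"tfidf__max_df": [0.5, 1.0]}, cv=2)
kept.fit(texts, labels)

print(replaced.best_estimator_.named_steps["tfidf"].ngram_range)
print(kept.best_estimator_.named_steps["tfidf"].ngram_range)
```

To search with bigrams, either drop the `"tfidf"` entry from the grid or pass `TfidfVectorizer(ngram_range=(1,2))` in it.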


In [29]:
# Searching over alpha together with max_df and min_df; the grid values below
# override the tfidf settings passed to the pipeline constructor

alpha_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(max_df=grid.best_params_['tfidf__max_df'], min_df=grid.best_params_['tfidf__min_df'])),
    ("clf", MultinomialNB())
])

alpha_param_grid = [{"clf__alpha":[0.8, 0.9, 0.95, 1.0],
                     "tfidf__max_df":[0.2, 0.3, 0.4],
                     "tfidf__min_df":[0.0, 0.05, 0.1]
               }]

alpha_grid = GridSearchCV(alpha_pipe, param_grid=alpha_param_grid, cv=5, return_train_score=True, error_score='raise')
alpha_grid.fit(X_train, y_train)

In [30]:
print("Best cross-validation accuracy: {:.2f}".format(alpha_grid.best_score_))
print("Training Score: {:.2f}".format(alpha_grid.cv_results_['mean_train_score'][alpha_grid.best_index_]))
print("Test set score: {:.2f}".format(alpha_grid.score(X_test, y_test)))
print("Best parameters: {}".format(alpha_grid.best_params_))

Best cross-validation accuracy: 0.61
Training Score: 0.66
Test set score: 0.62
Best parameters: {'clf__alpha': 0.9, 'tfidf__max_df': 0.2, 'tfidf__min_df': 0.0}
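
The `clf__alpha` values searched above control additive smoothing in `MultinomialNB`: each per-class feature probability is estimated as (N_ci + alpha) / (N_c + alpha * n), where N_ci is the total weight of feature i in class c, N_c is the total weight of all features in class c, and n is the number of features. The sketch below checks that formula against the fitted model on a tiny made-up count matrix; the data is hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy counts: two documents (one per class), three features.
X_demo = np.array([[2, 0, 1],
                   [0, 3, 1]])
y_demo = np.array(["a", "b"])
alpha = 0.9

clf = MultinomialNB(alpha=alpha).fit(X_demo, y_demo)

# Each class has exactly one document, so the per-class counts equal the rows.
counts = X_demo.astype(float)
n_features = X_demo.shape[1]
manual = np.log((counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha * n_features))

# The fitted log-probabilities match the smoothing formula.
print(np.allclose(clf.feature_log_prob_, manual))
```

Smaller values of alpha trust the observed term weights more, while larger values pull every term toward a uniform probability. With a vocabulary this large, moving alpha between 0.8 and 1.0 changes the estimates only slightly, which is consistent with the cross-validation accuracy staying at 0.61.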


# Model Two: Support Vector Machine with Nonlinear Kernel

# Model Three: Random Forest Classifier

# Reflection And Conclusions