# Goals:

**1. Develop a natural language processor capable of predicting the COLOR of a Magic: The Gathering card based on the rules text of that card.**


In [1]:
# imports and display options

import pandas as pd
import numpy as np
import math
from math import sqrt

from sklearn.model_selection import train_test_split

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

import unicodedata
import re
import nltk
from nltk.corpus import stopwords

import prepare as p
# import explore as e
# import model as m

# None disables column-width truncation; the old value -1 has been deprecated since pandas 1.0
pd.set_option('display.max_colwidth', None)

# Acquire

* Used file from previous
* A CSV containing an up-to-date breakdown of each card printed so far was obtained from MTGJSON.com
* Each row represents a card or a version of a card
* The CSV was read into a pandas dataframe
* The original dataframe contained 51,430 rows and 73 columns
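
As a rough illustration of the acquire step, the sketch below reads a CSV into a pandas dataframe and checks its shape. The in-memory sample, its rows, and the column names `name`, `colorIdentity`, and `text` are assumptions made for the example, not a copy of the MTGJSON file.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the MTGJSON CSV (hypothetical rows and columns)
sample_csv = io.StringIO(
    "name,colorIdentity,text\n"
    "Sample Bolt,R,Deal 3 damage to any target.\n"
    "Sample Growth,G,Target creature gets +3/+3 until end of turn.\n"
    "Sample Relic,,\n"
)

# With the real file this would be pd.read_csv(<path to the downloaded CSV>)
raw = pd.read_csv(sample_csv)

# Each row is one card (or one version of a card)
print(raw.shape)
```

Empty fields, such as the missing text on the last sample row, are read as `NaN`, which is what makes it possible to later restrict the dataframe to rows containing a value for 'text'.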

# Prepare

The following steps were taken to prepare the data:

1. Restricted dataframe to relevant columns
 
2. Restricted dataframe to rows containing cards that exist in physical form

3. Restricted dataframe to rows containing a value for 'text'

4. Restricted dataframe to rows with a single 'color identity' (see data dictionary Color)

5. Dropped all columns other than text and color

6. Renamed columns

7. Changed some of the symbol representations to words to make them machine readable

8. Applied basic cleaning to the values in text: lowercased letters, converted to ASCII characters, removed non-letter characters, lemmatized the words, and removed stopwords

9. Dropped duplicate rows

10. Changed column order

11. Wrote prepared data to ‘mtgprep.csv’ for ease of access. Data consisted of 15,380 rows and two columns.

12. Split the data into train and test groups at an 80/20 split
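
Steps 7 and 8 can be sketched roughly as follows. This is not the code in `prepare.py`, which is not shown here: the `SYMBOL_WORDS` mapping, the short `STOPWORDS` set, and the `clean_text` function with its `lemmatize` parameter are all hypothetical stand-ins, and the real notebook relies on NLTK for stopwords and lemmatization.

```python
import re
import unicodedata

# Hypothetical symbol-to-word mapping (the actual mapping lives in prepare.py)
SYMBOL_WORDS = {
    "+1/+1": " plusandplus ",
    "-1/-1": " minusandminus ",
    "{T}": " tap ",
}

# Small illustrative stopword set; an assumption standing in for NLTK's list
STOPWORDS = {"a", "an", "and", "it", "of", "the", "to", "until"}


def clean_text(text, stopwords=STOPWORDS, lemmatize=lambda word: word):
    """Return a cleaned, space-separated version of one card's rules text."""
    # Replace symbols with words before punctuation is stripped
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    # Lowercase, then drop any character with no ASCII equivalent
    text = (unicodedata.normalize("NFKD", text.lower())
            .encode("ascii", "ignore")
            .decode("ascii"))
    # Turn everything that is not a letter or whitespace into a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # Lemmatize each word (identity by default), then remove stopwords
    words = [lemmatize(word) for word in text.split()]
    return " ".join(word for word in words if word not in stopwords)


cleaned = clean_text("{T}: Target creature gets -1/-1 until end of turn.")
print(cleaned)
```

With the default identity lemmatizer the word "gets" is left unchanged; passing a real lemmatizer, for example the `lemmatize` method of NLTK's `WordNetLemmatizer`, is one way to reduce it to "get".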

## Data Dictionary:

* Magic: The Gathering
    * A collectible card game developed by Wizards of the Coast Inc. In a typical game, each player combines a selection of cards from their collection into their own deck, which they use to compete against other players. Thematically, players assume the roles of powerful mages fighting for supremacy. Cards in each player’s deck represent spells and other resources at that player’s disposal. A game typically ends when all but one of the participating players are reduced to zero “life” or are otherwise eliminated from the game.

### Columns

* Color
    * Spells players can cast are divided into five different colors. Each color is thematically distinct
    * White represents order and morality
    * Blue represents cunning and technology
    * Black represents pragmatism and amorality
    * Red represents impulse and chaos
    * Green represents nature and instinct
    * Spells with more than one color were excluded from the study to give a clearer picture of sentiment values for each color
    * The color or colors of each spell were determined by the colors of the mana symbols that appear on the card, also called its color identity
    <br />   
    
    
* Text
    * The value in text represents the rules text appearing on one Magic: The Gathering card

In [2]:
# prepare data 
df = p.get_preped_data()

#create test and train groups
train, test = p.split_data(df)
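
The body of `p.split_data` is not shown in this notebook. A minimal sketch of an 80/20 split, assuming scikit-learn's `train_test_split` with stratification on color and a fixed seed (both assumptions), could look like the following. The toy dataframe and the function name `split_data_sketch` are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the prepared data: 10 placeholder cards per color
colors = ["White", "Blue", "Black", "Red", "Green"]
toy = pd.DataFrame({
    "color": [color for color in colors for _ in range(10)],
    "text": [f"placeholder text {i}" for i in range(50)],
})


def split_data_sketch(df, test_size=0.2, seed=42):
    """Return (train, test) with color proportions preserved in both groups."""
    return train_test_split(
        df,
        test_size=test_size,
        random_state=seed,       # assumption: fixed seed for reproducibility
        stratify=df["color"],    # assumption: stratify on the target label
    )


toy_train, toy_test = split_data_sketch(toy)
print(toy_train.shape, toy_test.shape)
```

Stratifying keeps the five colors at roughly equal proportions in both groups, which matters less here than in an imbalanced dataset but keeps the evaluation comparable across colors.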

# Explore

In [3]:
train.head()

|  | color | text |
| --- | --- | --- |
| 13105 | Black | flying tap sacrifice phyrexian debaser target creature get minusandminus end turn |
| 31975 | Green | end turn target land becomes creature still land |
| 31493 | Black | choose creature type target creature get minusandminus end turn permanent chosen type control |
| 48910 | White | search library plain card target opponent control land may search library additional plain card reveal card put hand shuffle library |
| 30260 | White | long spectral guardian untapped noncreature artifact shroud target spell ability |


In [4]:
labels = pd.concat([df.color.value_counts(),
                    df.color.value_counts(normalize=True)], axis=1)
labels.columns = ['n', 'percent']
labels

| color | n | percent |
| --- | --- | --- |
| Black | 3119 | 0.202796 |
| Blue | 3081 | 0.200325 |
| White | 3078 | 0.20013 |
| Red | 3075 | 0.199935 |
| Green | 3027 | 0.196814 |
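
Because the five colors are almost evenly represented, a model that always guesses the most common color would be right only about one time in five. The sketch below works that figure out from the counts in the label table; note that those counts come from the full prepared dataframe rather than the train group, and the variable names are illustrative.

```python
# Class counts copied from the label table above
label_counts = {
    "Black": 3119,
    "Blue": 3081,
    "White": 3078,
    "Red": 3075,
    "Green": 3027,
}

total = sum(label_counts.values())

# Always predicting the most frequent color gives the baseline accuracy
majority_color = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_color] / total

print(f"{majority_color}: {baseline_accuracy:.4f} of {total} cards")
```

Any classifier built on the rules text would need to beat this figure to show that the text carries information about color.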


In [30]:
all_words = p.word_soup(' '.join(df.text))
blue_words = p.word_soup(' '.join(df[df.color == 'Blue'].text))
green_words = p.word_soup(' '.join(df[df.color == 'Green'].text))
red_words = p.word_soup(' '.join(df[df.color == 'Red'].text))
white_words = p.word_soup(' '.join(df[df.color == 'White'].text))
black_words = p.word_soup(' '.join(df[df.color == 'Black'].text))

all_freq = pd.Series(all_words).value_counts()
blue_freq = pd.Series(blue_words).value_counts()
green_freq = pd.Series(green_words).value_counts()
red_freq = pd.Series(red_words).value_counts()
white_freq = pd.Series(white_words).value_counts()
black_freq = pd.Series(black_words).value_counts()

word_counts = (pd.concat([all_freq, blue_freq, green_freq, red_freq, white_freq, black_freq], axis=1, sort=True)
                # set_axis returns a new frame; the `inplace` argument is not accepted by pandas 2.0 and later
                .set_axis(['all', 'blue', 'green', 'red', 'white', 'black'], axis=1)
                .fillna(0)
                .astype(int))

word_counts.sort_values(by='all', ascending=False).head(10)


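| word | all | blue | green | red | white | black |
| --- | --- | --- | --- | --- | --- | --- |
| creature | 14749 | 2284 | 3154 | 2754 | 3270 | 3287 |
| card | 8005 | 2335 | 1475 | 1145 | 901 | 2149 |
| target | 6914 | 1593 | 1048 | 1625 | 1214 | 1434 |
| control | 5033 | 1040 | 1064 | 978 | 1203 | 748 |
| turn | 4757 | 749 | 901 | 1235 | 1061 | 811 |
| battlefield | 4148 | 803 | 975 | 648 | 838 | 884 |
| player | 3551 | 811 | 466 | 824 | 377 | 1073 |
| damage | 3398 | 197 | 424 | 1547 | 726 | 504 |
| plusandplus | 3382 | 278 | 1061 | 702 | 791 | 550 |
| end | 3268 | 438 | 616 | 894 | 662 | 658 |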
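
Raw counts are dominated by words that are common in every color, so it can help to look at each color's share of a word's occurrences instead. The sketch below does this for four rows copied from the table above; the helper names `color_shares` and `top_color` are hypothetical and not part of the notebook's modules.

```python
COLORS = ["blue", "green", "red", "white", "black"]

# Per-color counts copied from the word count table above
word_counts_sample = {
    "creature": [2284, 3154, 2754, 3270, 3287],
    "card": [2335, 1475, 1145, 901, 2149],
    "damage": [197, 424, 1547, 726, 504],
    "plusandplus": [278, 1061, 702, 791, 550],
}


def color_shares(word):
    """Return each color's fraction of all occurrences of `word`."""
    counts = word_counts_sample[word]
    total = sum(counts)
    return {color: count / total for color, count in zip(COLORS, counts)}


def top_color(word):
    """Return the color that accounts for the largest share of `word`."""
    shares = color_shares(word)
    return max(shares, key=shares.get)


for word in word_counts_sample:
    best = top_color(word)
    print(f"{word}: {best} ({color_shares(word)[best]:.1%})")
```

On these rows, "creature" is spread fairly evenly across colors, while "damage" is concentrated in red, which suggests it is the more useful word for telling colors apart.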


In [31]:
word_counts[['blue']].sort_values(by='blue', ascending=False).head(10)

| word | blue |
| --- | --- |
| card | 2335 |
| creature | 2284 |
| target | 1593 |
| control | 1040 |
| spell | 1024 |
| may | 838 |
| player | 811 |
| flying | 809 |
| library | 804 |
| battlefield | 803 |
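
One simple way to read the blue list is to ask which of its words do not also appear in the overall top ten. The sketch below makes that comparison using the two word lists exactly as they appear in the outputs above; the variable names are illustrative.

```python
# Top ten words overall, in the order shown in the word count table
top_all = ["creature", "card", "target", "control", "turn",
           "battlefield", "player", "damage", "plusandplus", "end"]

# Top ten words for blue, in the order shown in the blue table
top_blue = ["card", "creature", "target", "control", "spell",
            "may", "player", "flying", "library", "battlefield"]

# Keep blue's ordering while dropping words that are common everywhere
overall = set(top_all)
blue_only = [word for word in top_blue if word not in overall]

print(blue_only)
```

The remaining words are candidates for features that distinguish blue from the other colors, though the same check would need to be repeated for each color before drawing conclusions.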
