# Creating a Magic: The Gathering Commander card dataset

This notebook documents how I build a Magic: The Gathering dataset containing all Commander/EDH-legal cards from the game.

## 1. Data Collection

I am collecting card data using the [Scryfall API](https://scryfall.com/docs/api).  
The API offers bulk data files that contain all Magic: The Gathering cards.  
I'm interested in the **Default Cards** file, as it contains all Magic: The Gathering cards including reprints.  
The download URI for the Default Cards JSON file is part of the response to the bulk data request.
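The bulk data response is a list object whose `data` array holds one entry per bulk file, each with a `type` and a `download_uri` field. Below is a minimal sketch of picking the Default Cards entry by its `type` rather than by its position in the array. `find_download_uri` and `sample_listing` are hypothetical names used for illustration, and the URIs in the sample are placeholders, not real Scryfall URLs.

```python
def find_download_uri(bulk_listing, bulk_type="default_cards"):
    """Return the download URI of the bulk file with the given type."""
    for entry in bulk_listing["data"]:
        if entry["type"] == bulk_type:
            return entry["download_uri"]
    raise KeyError(f"No bulk data entry of type {bulk_type!r} found")


# Trimmed-down stand-in for the JSON returned by https://api.scryfall.com/bulk-data.
# Real entries carry more fields; the download URIs here are placeholders.
sample_listing = {
    "object": "list",
    "data": [
        {"type": "oracle_cards", "download_uri": "https://example.invalid/oracle-cards.json"},
        {"type": "default_cards", "download_uri": "https://example.invalid/default-cards.json"},
    ],
}

print(find_download_uri(sample_listing))
```

Looking the entry up by type keeps the download working even if the order of the entries in the response changes.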

In [1]:
import requests
from pathlib import Path

magic_card_file = Path("datasets/magic_cards.json")

# If there is no magic_cards.json file yet, download it.
if not magic_card_file.is_file():
    r = requests.get('https://api.scryfall.com/bulk-data')

    if r.status_code == 200:
        response_dict = r.json()
        # Get the download URI of the most recent Default Cards file.
        # Look the entry up by its type instead of relying on its position in the list.
        default_cards_download_uri = next(
            entry['download_uri']
            for entry in response_dict['data']
            if entry['type'] == 'default_cards'
        )
        # Download the cards and save them to a JSON file.
        cards_r = requests.get(default_cards_download_uri)
        if cards_r.status_code == 200:
            # Make sure the target directory exists before writing.
            magic_card_file.parent.mkdir(parents=True, exist_ok=True)
            with open(magic_card_file, 'wb') as f:
                f.write(cards_r.content)

In [2]:
# Imports

import pandas as pd
import numpy as np

In [3]:
df = pd.read_json("datasets/magic_cards.json")

df.head()

Unnamed: 0,object,id,oracle_id,multiverse_ids,mtgo_id,mtgo_foil_id,tcgplayer_id,cardmarket_id,name,lang,...,tcgplayer_etched_id,attraction_lights,color_indicator,life_modifier,hand_modifier,printed_type_line,printed_text,content_warning,flavor_name,variation_of
0,card,0000579f-7b35-4ed3-b44c-db2a538066fe,44623693-51d6-49ad-8cd7-140505caf02f,[109722],25527.0,25528.0,14240.0,13850.0,Fury Sliver,en,...,,,,,,,,,,
1,card,00006596-1166-4a79-8443-ca9f82e6db4e,8ae3562f-28b7-4462-96ed-be0cf7052ccc,[189637],34586.0,34587.0,33347.0,21851.0,Kor Outfitter,en,...,,,,,,,,,,
2,card,0000a54c-a511-4925-92dc-01b937f9afad,dc4e2134-f0c2-49aa-9ea3-ebf83af1445c,[],,,98659.0,,Spirit,en,...,,,,,,,,,,
3,card,0000cd57-91fe-411f-b798-646e965eec37,9f0d82ae-38bf-45d8-8cda-982b6ead1d72,[435231],65170.0,65171.0,145764.0,301766.0,Siren Lookout,en,...,,,,,,,,,,
4,card,00012bd8-ed68-4978-a22d-f450c8a6e048,5aa12aff-db3c-4be5-822b-3afdf536b33e,[1278],,,1623.0,5664.0,Web,en,...,,,,,,,,,,


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77786 entries, 0 to 77785
Data columns (total 84 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   object               77786 non-null  object        
 1   id                   77786 non-null  object        
 2   oracle_id            77773 non-null  object        
 3   multiverse_ids       77786 non-null  object        
 4   mtgo_id              39536 non-null  float64       
 5   mtgo_foil_id         24612 non-null  float64       
 6   tcgplayer_id         66157 non-null  float64       
 7   cardmarket_id        62275 non-null  float64       
 8   name                 77786 non-null  object        
 9   lang                 77786 non-null  object        
 10  released_at          77786 non-null  datetime64[ns]
 11  uri                  77786 non-null  object        
 12  scryfall_uri         77786 non-null  object        
 13  layout               77786 non-

## 2. Cleaning up data

I am cleaning up the data using the Scryfall API documentation together with some basic Magic: The Gathering knowledge.

### 2.1 Removing unnecessary attributes

The following attributes are irrelevant for this dataset.
+ object: is the same for every card in the dataset and can therefore safely be deleted
+ id: 
+ oracle_id:
+ multiverse_ids:
+ mtgo_id:
+ mtgo_foil_id:
+ tcgplayer_id:
+ cardmarket_id:
+ uri:
+ scryfall_uri:
+ highres_image:
+ image_status:
+ image_uris:
+ games:
+ foil: 
+ nonfoil:
+ finishes:
+ oversized:
+ promo:
+ set_uri:
+ set_search_uri:
+ scryfall_set_uri:
+ rulings_uri:
+ prints_search_uri:
+ collector_number:
+ digital:
+ card_back_id:
+ artist_ids:
+ illustration_id:
+ border_color:
+ frame:
+ full_art:
+ textless:
+ edhrec_rank:
+ penny_rank:
+ related_uris:
+ promo_types:
+ arena_id:
+ preview:
+ security_stamp:
+ produced_mana:
+ watermark:
+ frame_effects:
+ printed_name:
+ tcgplayer_etched_id:
+ life_modifier:
+ hand_modifier:
+ printed_type_line:
+ printed_text:
+ content_warning:
+ flavor_name:
+ variation_of:
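The claim that `object` holds the same value for every card can be checked before dropping the column. This is a minimal sketch, assuming that comparing the string form of each value is good enough; `constant_columns` is a hypothetical helper, not part of the notebook, and the sample frame only reuses a few values from the `df.head()` output above.

```python
import pandas as pd


def constant_columns(frame):
    """Return the names of columns that hold at most one distinct value."""
    constant = []
    for column in frame.columns:
        # Cast to str so list- and dict-valued columns (unhashable) can be counted too.
        if frame[column].astype(str).nunique(dropna=False) <= 1:
            constant.append(column)
    return constant


sample = pd.DataFrame({
    "object": ["card", "card", "card"],
    "name": ["Fury Sliver", "Kor Outfitter", "Spirit"],
    "lang": ["en", "en", "en"],
})

print(constant_columns(sample))
```

A column being constant does not make it irrelevant by itself. In this small sample `lang` is constant as well, so the result is only a starting point for the manual review.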

In [5]:
columns_to_drop = [
    'object',
    'id',
    'oracle_id',
    'multiverse_ids',
    'mtgo_id',
    'mtgo_foil_id',
    'tcgplayer_id',
    'cardmarket_id',
    'uri',
    'scryfall_uri',
    'highres_image',
    'image_status',
    'image_uris',
    'games',
    'foil',
    'nonfoil',
    'finishes',
    'oversized',
    'promo',
    'set_uri',
    'set_search_uri',
    'scryfall_set_uri',
    'rulings_uri',
    'prints_search_uri',
    'collector_number',
    'digital',
    'card_back_id',
    'artist_ids',
    'illustration_id',
    'border_color',
    'frame',
    'full_art',
    'textless',
    'edhrec_rank',
    'penny_rank',
    'related_uris',
    'promo_types',
    'arena_id',
    'preview',
    'security_stamp',
    'produced_mana',
    'watermark',
    'frame_effects',
    'printed_name',
    'tcgplayer_etched_id',
    'life_modifier',
    'hand_modifier',
    'printed_type_line',
    'printed_text',
    'content_warning',
    'flavor_name',
    'variation_of'
]

df = df.drop(columns=columns_to_drop)
df.shape

(77786, 32)
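The set of columns depends on which fields appear in the downloaded file, so a hard-coded drop list may name a column that a different download does not contain, and `drop` would then raise a `KeyError`. This is a minimal sketch of a more tolerant variant, assuming that skipping absent columns is acceptable; `drop_known_columns` is a hypothetical helper and the sample frame is made up from values shown earlier.

```python
import pandas as pd


def drop_known_columns(frame, columns):
    """Drop the listed columns, skipping any that are absent from the frame."""
    missing = [column for column in columns if column not in frame.columns]
    if missing:
        print(f"Skipping columns not present in the data: {missing}")
    # errors="ignore" stops pandas from raising a KeyError for absent labels.
    return frame.drop(columns=columns, errors="ignore")


sample = pd.DataFrame({
    "object": ["card", "card"],
    "name": ["Fury Sliver", "Kor Outfitter"],
    "lang": ["en", "en"],
})

trimmed = drop_known_columns(sample, ["object", "variation_of"])
print(trimmed.columns.tolist())
```

Printing the skipped names keeps the tolerant behaviour visible, so a typo in the drop list does not go unnoticed.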

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77786 entries, 0 to 77785
Data columns (total 32 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   name               77786 non-null  object        
 1   lang               77786 non-null  object        
 2   released_at        77786 non-null  datetime64[ns]
 3   layout             77786 non-null  object        
 4   mana_cost          75706 non-null  object        
 5   cmc                77773 non-null  float64       
 6   type_line          77773 non-null  object        
 7   oracle_text        75283 non-null  object        
 8   power              36359 non-null  object        
 9   toughness          36359 non-null  object        
 10  colors             75706 non-null  object        
 11  color_identity     77786 non-null  object        
 12  keywords           77786 non-null  object        
 13  legalities         77786 non-null  object        
 14  reserv

### 2.2 Taking a deeper look at the remaining attributes