# Table of Contents:
<!-- TOC -->
- [Sprint 5 Task 1 ❄️ Game of Thrones](#sprint-5-task-1--game-of-thrones)
- [Statement](#statement)
- [Exercise 1](#exercise-1)
    - [Part 1 - Simple random sample:](#part-1---simple-random-sample)
    - [Part 2 - Systematic sampling:](#part-2---systematic-sampling)
- [Exercise 2](#exercise-2)
    - [Part 1 - Stratified sampling:](#part-1---stratified-sampling)
    - [Part 2 - SMOTE:](#part-2---smote)
- [Exercise 3 - Reservoir sampling](#exercise-3---reservoir-sampling)
<!-- /TOC -->



# Sprint 5 Task 1 ❄️ Game of Thrones


## *Statement*

Learn how to perform data mining with Python.
Level 1
- Exercise 1

Get a set of sports-themed data that you like. Perform a data sampling generating a simple random sample and a systematic sample.

> **Simple random sample:**
In a simple random sample, every member of the population has an equal chance of being selected. Your sampling frame should include the whole population.
To conduct this type of sampling, you can use tools like random number generators or other techniques that are based entirely on chance.

> **Systematic sampling:**
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.

Level 2

- Exercise 2

Continue with the set of sports-themed data and generate a stratified sample and a sample using SMOTE (Synthetic Minority Oversampling Technique).

> **Stratified sampling:**
Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample.
To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristic (e.g. gender, age range, income bracket, job role).
Based on the overall proportions of the population, you calculate how many people should be sampled from each subgroup. Then you use random or systematic sampling to select a sample from each subgroup.

> **Oversampling using SMOTE:**
In SMOTE (Synthetic Minority Oversampling Technique) we synthesize elements for the minority class, in the vicinity of already existing elements.

Level 3

- Exercise 3

Continue with the set of sports data and generate a sample using the Reservoir sampling method.

> **Reservoir Sampling:**
Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items.

In [1]:
# Import Libraries
import numpy as np
import pandas as pd

In [2]:
# Update libraries (first time only)
# %pip install --upgrade numpy
# %pip install --upgrade pandas

In [3]:
# Deactivate / Activate warnings
import warnings
# warnings.filterwarnings('ignore')
warnings.filterwarnings("default")

## Exercise 1

Get a set of sports-themed data that you like. Perform a data sampling generating a simple random sample and a systematic sample.


In [4]:
df = pd.read_csv('./game-of-thrones.csv')
df

Unnamed: 0,Text,Speaker,Episode,Season,Show
0,[First scene opens with three Rangers riding t...,,e1-Winter is Coming,season-01,Game-of-Thrones
1,What d’you expect? They’re savages. One lot s...,WAYMAR ROYCE,e1-Winter is Coming,season-01,Game-of-Thrones
2,I’ve never seen wildlings do a thing like thi...,WILL,e1-Winter is Coming,season-01,Game-of-Thrones
3,How close did you get?,WAYMAR ROYCE,e1-Winter is Coming,season-01,Game-of-Thrones
4,Close as any man would.,WILL,e1-Winter is Coming,season-01,Game-of-Thrones
...,...,...,...,...,...
33193,The Queen in the North! The Queen in the Nort...,ALL,e6,season-08,Game-of-Thrones
33194,ARYA's ship,CUT TO,e6,season-08,Game-of-Thrones
33195,ARYA stands at the prow and the ship sails awa...,,e6,season-08,Game-of-Thrones
33196,Castle Black,CUT TO,e6,season-08,Game-of-Thrones


In [5]:
# Let's explore this dataset
round(df.describe(include="all"),1)

Unnamed: 0,Text,Speaker,Episode,Season,Show
count,33198,24842,33198,33198,33198
unique,31044,957,50,8,1
top,No.,TYRION,e4,season-02,Game-of-Thrones
freq,103,1543,2005,5357,33198


### **Part 1 - Simple random sample:**
In a simple random sample, every member of the population has an equal chance of being selected. Your sampling frame should include the whole population.
To conduct this type of sampling, you can use tools like random number generators or other techniques that are based entirely on chance.

In [6]:
# Simple random sample:
SampleRandom = df.sample(5,random_state=1)
SampleRandom

Unnamed: 0,Text,Speaker,Episode,Season,Show
21239,(Smiles). Lord Bolton.,SANSA,e3,season-05,Game-of-Thrones
13383,Karhold!,THEON,e6-The Climb,season-03,Game-of-Thrones
21198,"I'm the sword in the darkness, the watcher on...",OLLY,e3,season-05,Game-of-Thrones
25704,I needed fresh air. The old women stink.,DAENERYS,e4,season-06,Game-of-Thrones
6923,"ARYA sits, expressionless, trying not to be no...",,e4-Garden of Bones,season-02,Game-of-Thrones
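
The "random number generator" route mentioned above can also be written out by hand. The following is a minimal sketch, assuming a small made-up frame (`toy`) in place of the notebook's `df`; it draws distinct row positions with NumPy's generator and then selects those rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the notebook's df
toy = pd.DataFrame({"Text": [f"line {i}" for i in range(20)]})

rng = np.random.default_rng(1)  # seeded so the draw is reproducible
# Draw 5 distinct row positions; every row is equally likely to be picked
positions = rng.choice(len(toy), size=5, replace=False)
manual_sample = toy.iloc[positions]
print(manual_sample)
```

`df.sample(5, random_state=1)` does the same job in one call; the explicit version only makes the equal-chance draw visible.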


### **Part 2 - Systematic sampling:**
Systematic sampling is similar to simple random sampling, but it is usually slightly easier to conduct. Every member of the population is listed with a number, but instead of randomly generating numbers, individuals are chosen at regular intervals.

In [7]:
# Systematic sampling:
def systematic_sampling(df, step):
    
    indexes = np.arange(0,len(df),step=step)
    systematic_sample = df.iloc[indexes]
    return systematic_sample
    
SampleSystematic = systematic_sampling(df, 100)
SampleSystematic

Unnamed: 0,Text,Speaker,Episode,Season,Show
0,[First scene opens with three Rangers riding t...,,e1-Winter is Coming,season-01,Game-of-Thrones
100,You will train them yourselves. You will feed...,NED,e1-Winter is Coming,season-01,Game-of-Thrones
200,The queen has two brothers?,ROS,e1-Winter is Coming,season-01,Game-of-Thrones
300,And Lady Stark is not your mother. Making you...,TYRION,e1-Winter is Coming,season-01,Game-of-Thrones
400,[The scene shifts to Winterfell. Tyrion and th...,,e1-Winter is Coming,season-01,Game-of-Thrones
...,...,...,...,...,...
32700,They took Missandei.,DAENERYS,e4,season-08,Game-of-Thrones
32800,"TYRION walks forward, alone. He turns a corner...",,e6,season-08,Game-of-Thrones
32900,What?,JON,e6,season-08,Game-of-Thrones
33000,SAM stands.,,e6,season-08,Game-of-Thrones


In [8]:
# Let's explore this sample a little
SampleSystematic.describe()

Unnamed: 0,Text,Speaker,Episode,Season,Show
count,332,246,332,332,332
unique,331,125,50,8,1
top,No.,TYRION,e2,season-02,Game-of-Thrones
freq,2,20,20,54,332
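
The function above always starts at row 0, so the first row is selected on every run. A common variant picks the starting row at random within the first interval. This is a sketch of that variant, not the method used in this notebook; the function name `systematic_sampling_random_start` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def systematic_sampling_random_start(frame, step, seed=None):
    """Take every `step`-th row, starting from a random offset in [0, step)."""
    rng = np.random.default_rng(seed)
    start = int(rng.integers(0, step))  # random starting position
    positions = np.arange(start, len(frame), step)
    return frame.iloc[positions]

# Hypothetical toy frame with 1000 rows
toy = pd.DataFrame({"value": range(1000)})
sample = systematic_sampling_random_start(toy, step=100, seed=1)
print(sample.index.tolist())
```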


## Exercise 2
Continue with the set of sports-themed data and generate a stratified sample and a sample using SMOTE (Synthetic Minority Oversampling Technique).

##### Change of dataset:  
---------
We keep the following code as an example of stratification with WorldCups.csv. It was discarded because that dataset was too small.

In [9]:

# OLD CODE - LEFT only for INFORMATION
# Let's stratify by subpopulations of winner's continent (Europe or America)
# df.Winner.unique()

# def continent(country):
#   EuropeanCountries = ('Italy', 'Germany FR','England','France', 'Spain', 'Germany')
#   AmericanCountries = ('Uruguay', 'Brazil', 'Argentina')
#   if country in EuropeanCountries: return "Europe"
#   elif country in AmericanCountries: return "America"
#   else: return "NaN"

# df['Continent'] = df.Winner.apply(continent)
# df.Continent.value_counts()

# Now that we have the stratification (Continent), we just have to choose a random sampling of each continent, 
# based on the overall proportions of the population. (11/9)

# The problem we found is that our dataset is too small, and the number of European countries is 11, a prime number.
# Let's choose a bigger dataset

In [10]:
df.Season.describe()

count         33198
unique            8
top       season-02
freq           5357
Name: Season, dtype: object

### Part 1 - Stratified sampling:
------
>**Stratified sampling**: Stratified sampling involves dividing the population into subpopulations that may differ in important ways. It allows you to draw more precise conclusions by ensuring that every subgroup is properly represented in the sample. To use this sampling method, you divide the population into subgroups (called strata) based on the relevant characteristic (e.g. gender, age range, income bracket, job role). Based on the overall proportions of the population, you calculate how many people should be sampled from each subgroup. Then you use random or systematic sampling to select a sample from each subgroup.

In [11]:
# Let's stratify by Season (season-01 to season-08)
SeasonCounts = pd.DataFrame(df.Season.value_counts()).sort_index(axis=0) 
SeasonCounts

Unnamed: 0,Season
season-01,4479
season-02,5357
season-03,5295
season-04,4606
season-05,4145
season-06,4064
season-07,3085
season-08,2167


In [12]:
DfCounts = df.Season.count()
DfCounts

33198

In [13]:
SeasonCounts['Proportion']=SeasonCounts['Season'] / DfCounts

In [14]:
# So, we have the total count of registers in season column (33198) and each count of every season.
# Now we can design the sample taking the proportional amount of samples from each season.
SeasonCounts['Proportion'] = SeasonCounts.Season / DfCounts
SeasonCounts


Unnamed: 0,Season,Proportion
season-01,4479,0.134918
season-02,5357,0.161365
season-03,5295,0.159498
season-04,4606,0.138743
season-05,4145,0.124857
season-06,4064,0.122417
season-07,3085,0.092927
season-08,2167,0.065275


In [15]:
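# Let's now choose a stratified sample of the same size as SampleSystematic (332).
# We multiply by 331 rather than 332: each season is rounded separately, and with 332
# the rounded counts would add up to 333, while with 331 they add up to 332.
SeasonCounts['Samples'] = round(SeasonCounts.Proportion *331)
SeasonCounts.Samples.sum()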


332.0
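
Rounding each stratum independently does not guarantee that the counts add up to the target size. One way to avoid tuning the multiplier by hand is largest-remainder allocation: take the floor of each exact share, then hand the leftover units to the strata with the largest fractional parts. The sketch below assumes a hypothetical helper, `allocate_samples`, and uses the season counts shown earlier.

```python
import numpy as np
import pandas as pd

def allocate_samples(counts, total):
    """Proportional allocation that sums exactly to `total` (largest-remainder method)."""
    exact = counts / counts.sum() * total   # exact, fractional share per stratum
    base = np.floor(exact).astype(int)      # guaranteed minimum per stratum
    leftover = int(total - base.sum())      # units still to distribute
    # Strata with the largest fractional parts receive one extra unit each
    winners = (exact - base).sort_values(ascending=False).index[:leftover]
    base.loc[winners] += 1
    return base

season_counts = pd.Series({
    "season-01": 4479, "season-02": 5357, "season-03": 5295, "season-04": 4606,
    "season-05": 4145, "season-06": 4064, "season-07": 3085, "season-08": 2167,
})
allocation = allocate_samples(season_counts, 332)
print(allocation)
print(allocation.sum())
```

For these counts the allocation matches the `Samples` column obtained with the 331 multiplier.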

In [16]:
SeasonCounts.reset_index()

Unnamed: 0,index,Season,Proportion,Samples
0,season-01,4479,0.134918,45.0
1,season-02,5357,0.161365,53.0
2,season-03,5295,0.159498,53.0
3,season-04,4606,0.138743,46.0
4,season-05,4145,0.124857,41.0
5,season-06,4064,0.122417,41.0
6,season-07,3085,0.092927,31.0
7,season-08,2167,0.065275,22.0


In [17]:
SeasonArray=SeasonCounts.index.values
SeasonArray

array(['season-01', 'season-02', 'season-03', 'season-04', 'season-05',
       'season-06', 'season-07', 'season-08'], dtype=object)

In [18]:
# Now we will take a random sample of each season with the number of samples calculated in SeasonCounts

# First Season
NumSamples = int(SeasonCounts.at['season-01','Samples'])
SampleStratified = df[df.Season =='season-01'].sample(NumSamples,random_state=1) 
SampleStratified.count()


Text       45
Speaker    35
Episode    45
Season     45
Show       45
dtype: int64

In [19]:
# Rest of Seasons
for x in range(1,8):
    Season = SeasonArray[x]
    NumSamples = int(SeasonCounts.at[Season,'Samples'])
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    SampleStratified = pd.concat([SampleStratified, df[df.Season == Season].sample(NumSamples,random_state=1)])
      

In [20]:
SampleStratified.count()

Text       332
Speaker    256
Episode    332
Season     332
Show       332
dtype: int64
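
When a fixed fraction per stratum is acceptable (rather than exact hand-computed counts), pandas can do the per-group draw in one call. This is a sketch on a made-up frame, assuming pandas 1.1 or later, where `DataFrameGroupBy.sample` is available.

```python
import pandas as pd

# Hypothetical toy frame with two strata of different sizes
toy = pd.DataFrame({
    "Season": ["season-01"] * 60 + ["season-02"] * 40,
    "Text": [f"line {i}" for i in range(100)],
})

# Draw 10% of the rows from every Season group
stratified = toy.groupby("Season").sample(frac=0.1, random_state=1)
print(stratified.Season.value_counts())
```

Each season contributes in proportion to its size, which is the same idea as the loop above without the intermediate `SeasonCounts` table.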

In [21]:
# OK, we have our stratified sample. Now let's do oversampling using SMOTE

### Part 2 - SMOTE:
------- 
>**Oversampling using SMOTE**: In SMOTE (Synthetic Minority Oversampling Technique) we synthesize elements for the minority class, in the vicinity of already existing elements.
-------
**What is SMOTE?**
SMOTE, or Synthetic Minority Oversampling Technique, is an oversampling technique, but it works differently from typical oversampling.
In a classic oversampling technique, the minority data is duplicated from the minority data population. While this increases the amount of data, it does not give the machine learning model any new information or variation.

For this reason, Nitesh Chawla et al. (2002) introduced a new technique to create synthetic data for oversampling purposes in their SMOTE paper.

SMOTE works by using a k-nearest-neighbour algorithm to create synthetic data. SMOTE first chooses a random point from the minority class, then finds that point's k nearest neighbours. A synthetic point is then created between the chosen point and one randomly selected neighbour. The example below illustrates this.

![](2022-01-17-17-39-23.png)

The procedure is repeated enough times until the minority class has the same proportion as the majority class.

🗨️ QUOTE from: https://towardsdatascience.com/5-smote-techniques-for-oversampling-your-imbalance-data-b8155bdbe2b5
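
The interpolation step described above can be sketched in plain NumPy. This is an illustration of the idea for numeric features only, not the imbalanced-learn implementation; the function name `smote_like_samples` and the five minority points are made up.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))                     # random minority point
        dists = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(dists)[1:k + 1]             # position 0 is the point itself
        j = rng.choice(neighbours)                          # one random neighbour
        gap = rng.random()                                  # position along the segment
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

# Hypothetical minority-class points with two numeric features
minority_points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0], [2.5, 2.5]])
new_points = smote_like_samples(minority_points, n_new=4)
print(new_points)
```

Because every synthetic point lies between two existing minority points, it stays "in the vicinity of already existing elements" instead of duplicating them.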

**Finally used SMOTENC:**

Synthetic Minority Over-sampling Technique for Nominal and Continuous.

Unlike SMOTE, SMOTE-NC is meant for datasets containing both numerical and categorical features. However, it is not designed to work with only categorical features.

🗨️ https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTENC.html

##### What is random_state parameter?: 
-----
If you use random_state=some_number, then you can guarantee that the output of Run 1 will be equal to the output of Run 2, i.e. your split will be always the same. It doesn't matter what the actual random_state number is 42, 0, 21, ... The important thing is that everytime you use 42, you will always get the same output the first time you make the split. This is useful if you want reproducible results, for example in the documentation, so that everybody can consistently see the same numbers when they run the examples. In practice I would say, you should set the random_state to some fixed number while you test stuff, but then remove it in production if you really need a random (and not a fixed) split.

🗨️ Quote from: https://stackoverflow.com/questions/28064634/random-state-pseudo-random-number-in-scikit-learn
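
A quick way to see this with the `sample` method used throughout this notebook, shown here on a made-up frame:

```python
import pandas as pd

# Hypothetical toy frame
toy = pd.DataFrame({"value": range(100)})

run_1 = toy.sample(5, random_state=42)
run_2 = toy.sample(5, random_state=42)

# Same seed, same rows in the same order
print(run_1.equals(run_2))  # True
```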

In [22]:
## X_res, Y_res = sm.fit_resample( X , Y )
# X holds the feature columns (for example, the transaction attributes in a fraud dataset)
# Y holds the class label of each sample (for example, 0 = non-fraud, 1 = fraud)

##### Change of dataset:
------
We now realize that our current dataset, *game-of-thrones.csv*, is not well suited to SMOTE sampling, because it has no class-label column (for example a 1/0 outcome) to use as the Y set.

So we change the dataset again.

We will now use the *battles.csv* dataset, which is also GOT related.

With this dataset it makes more sense to apply machine learning prediction afterwards, to predict whether a battle is won (1) or lost (0).

In [23]:
Battle_df = pd.read_csv('./battles.csv')
Battle_df.columns

Index(['name', 'year', 'battle_number', 'attacker_king', 'defender_king',
       'attacker_1', 'attacker_2', 'attacker_3', 'attacker_4', 'defender_1',
       'defender_2', 'defender_3', 'defender_4', 'attacker_outcome',
       'battle_type', 'major_death', 'major_capture', 'attacker_size',
       'defender_size', 'attacker_commander', 'defender_commander', 'summer',
       'location', 'region', 'note'],
      dtype='object')

In [24]:
#Let's see if our dataset is unbalanced between win / loss
Battle_df.attacker_outcome.value_counts()

win     32
loss     5
Name: attacker_outcome, dtype: int64

In [25]:
# Yes, it is definitely unbalanced towards the win result.
# Now let's remove all the columns that we are not interested in:
Battle_df.drop(columns=['name','year','battle_number','attacker_1',
                        'attacker_2', 'attacker_3', 'attacker_4', 'defender_1',
                        'defender_2', 'defender_3', 'defender_4', 'battle_type','attacker_commander',
                        'defender_commander','location', 'region', 'note'],inplace=True)

In [26]:
#Pre-Processing
# Let's fill NaNs with 0 in columns 'attacker_size' and 'defender_size' (treating a missing size as 0 is plausible here).
# Needed, otherwise: ValueError: Input contains NaN
# Assigning the result back avoids relying on inplace=True through attribute access.
Battle_df['attacker_size'] = Battle_df['attacker_size'].fillna(0)
Battle_df['defender_size'] = Battle_df['defender_size'].fillna(0)
# Categorizing the remaining object-type columns is not needed
Battle_df.dtypes

attacker_king        object
defender_king        object
attacker_outcome     object
major_death         float64
major_capture       float64
attacker_size       float64
defender_size       float64
summer              float64
dtype: object

In [27]:
Battle_df.iloc[[22,23,24,29,37]]

Unnamed: 0,attacker_king,defender_king,attacker_outcome,major_death,major_capture,attacker_size,defender_size,summer
22,,,win,0.0,0.0,0.0,0.0,1.0
23,Joffrey/Tommen Baratheon,Robb Stark,win,0.0,0.0,0.0,6000.0,
24,Joffrey/Tommen Baratheon,,win,1.0,0.0,0.0,0.0,1.0
29,,,win,0.0,0.0,0.0,0.0,0.0
37,Stannis Baratheon,Joffrey/Tommen Baratheon,,,,5000.0,8000.0,0.0


In [28]:
# We delete these samples because they include NaN values and do not make sense.
Battle_df.drop([22,23,24,29,37],inplace=True) # Needed.
Battle_df

Unnamed: 0,attacker_king,defender_king,attacker_outcome,major_death,major_capture,attacker_size,defender_size,summer
0,Joffrey/Tommen Baratheon,Robb Stark,win,1.0,0.0,15000.0,4000.0,1.0
1,Joffrey/Tommen Baratheon,Robb Stark,win,1.0,0.0,0.0,120.0,1.0
2,Joffrey/Tommen Baratheon,Robb Stark,win,0.0,1.0,15000.0,10000.0,1.0
3,Robb Stark,Joffrey/Tommen Baratheon,loss,1.0,1.0,18000.0,20000.0,1.0
4,Robb Stark,Joffrey/Tommen Baratheon,win,1.0,1.0,1875.0,6000.0,1.0
5,Robb Stark,Joffrey/Tommen Baratheon,win,0.0,0.0,6000.0,12625.0,1.0
6,Joffrey/Tommen Baratheon,Robb Stark,win,0.0,0.0,0.0,0.0,1.0
7,Balon/Euron Greyjoy,Robb Stark,win,0.0,0.0,0.0,0.0,1.0
8,Balon/Euron Greyjoy,Robb Stark,win,0.0,0.0,1000.0,0.0,1.0
9,Balon/Euron Greyjoy,Robb Stark,win,0.0,0.0,264.0,0.0,1.0


In [29]:
## I tried to replace special characters (by hand), but this is not needed when SMOTENC is used.
# (maybe this was the cause of ValueError: could not convert string to float: 'Joffrey/Tommen Baratheon')
# Let's remove characters like '/' in order to clean the samples

# Battle_df.replace('Joffrey/Tommen Baratheon','joffreytommenbaratheon',inplace=True)
# Battle_df.replace('Robb Stark','robbstark',inplace=True)
# Battle_df.replace('Balon/Euron Greyjoy','baloneurongreyjoy',inplace=True)
# Battle_df.replace('Stannis Baratheon','stannisbaratheon',inplace=True)
# Battle_df.replace('Robb Stark','robbstark',inplace=True)

In [30]:
# Let's prepare X & Y for sampling and subsequent model training
X = np.array(Battle_df.loc[:, Battle_df.columns != 'attacker_outcome'])
# Y is flattened to a 1-D array of labels, the shape fit_resample expects for the target
Y = np.array(Battle_df.loc[:, Battle_df.columns == 'attacker_outcome']).ravel()


In [31]:
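# Instantiate SMOTENC algorithm
from imblearn.over_sampling import SMOTENC
# The categorical mask must describe the columns of X, so it is built from Battle_df (not df):
# True for the object-typed feature columns (attacker_king, defender_king), False for the numeric ones.
categorical_mask = (Battle_df.drop(columns='attacker_outcome').dtypes == object).to_numpy()
sm = SMOTENC(categorical_features=categorical_mask, k_neighbors=2, random_state=123)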


##### **Errors encountered:** 
*ValueError: Input contains NaN*

>This error appears if Y has NaN

*TypeError: '<' not supported between instances of 'float' and 'str'*

>This error appeared with NaN values in X. (Not a very explanatory message, right?)

*'float' object has no attribute 'lower'*

>I encountered this error and realized that the string data needs to be cleaned: lowercase the text and remove special characters (/ , ...)

*ValueError: could not convert string to float:*

>🗨️ See: https://stackoverflow.com/questions/65280842/smote-could-not-convert-string-to-float

*SMOTE initialization expects n_neighbors <= n_samples, but n_samples < n_neighbors*
>I had to add k_neighbors=2 in order to eliminate this error.

>🗨️ See: https://stackoverflow.com/questions/49395939/smote-initialisation-expects-n-neighbors-n-samples-but-n-samples-n-neighbo

##### **Solutions not working:**
*Categorize string/object columns*
>This did not solve the problems with passing categorical columns to SMOTE / SMOTENC.

*Vectorizer*
>Tried, but it did not work either.

*Using SMOTE*
>This did not work, so I switched to SMOTENC. One-hot encoding the categorical columns before applying SMOTE would probably have solved the problem.
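
The one-hot encoding idea mentioned for plain SMOTE could look like the sketch below. It only shows the encoding step with `pd.get_dummies` on a made-up miniature of the battles data, and it was not run against SMOTE in this notebook. Note that interpolating between one-hot columns produces fractional values, which is one reason SMOTENC exists.

```python
import pandas as pd

# Hypothetical miniature of the battles data
toy = pd.DataFrame({
    "attacker_king": ["Robb Stark", "Joffrey/Tommen Baratheon", "Robb Stark"],
    "attacker_size": [18000.0, 15000.0, 6000.0],
})

# Replace the string column with one indicator column per category
encoded = pd.get_dummies(toy, columns=["attacker_king"])
print(encoded.columns.tolist())

# The result is fully numeric, so it no longer triggers
# "could not convert string to float"
numeric_matrix = encoded.to_numpy(dtype=float)
```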


In [32]:
# Finally, SMOTENC sample worked!!!
X_res, Y_res = sm.fit_resample( X , Y )

In [33]:
#Let's see if our dataset is now balanced between win / loss
Y_res

array(['win', 'win', 'win', 'loss', 'win', 'win', 'win', 'win', 'win',
       'win', 'win', 'win', 'win', 'win', 'win', 'win', 'loss', 'win',
       'win', 'loss', 'win', 'loss', 'win', 'win', 'loss', 'win', 'win',
       'win', 'win', 'win', 'win', 'win', 'win', 'loss', 'loss', 'loss',
       'loss', 'loss', 'loss', 'loss', 'loss', 'loss', 'loss', 'loss',
       'loss', 'loss', 'loss', 'loss', 'loss', 'loss', 'loss', 'loss',
       'loss', 'loss', 'loss', 'loss'], dtype=object)

In [34]:
# It seems so!!!!
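
Counting the labels is more reliable than reading the array by eye. A minimal sketch on a made-up label array (the same call should work on `Y_res`):

```python
import numpy as np

# Hypothetical label array standing in for Y_res
labels = np.array(["win", "loss", "win", "loss", "win", "win"], dtype=object)

# Count how many times each class appears
classes, counts = np.unique(labels, return_counts=True)
balance = dict(zip(classes, counts))
print(balance)
```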

## Exercise 3 - Reservoir sampling
------- 
**Reservoir sampling** is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large for all n items to fit into main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items. At any point, the current state of the algorithm must permit extraction of a simple random sample without replacement of size k over the part of the population seen so far.

🗨️ QUOTE from: https://en.wikipedia.org/wiki/Reservoir_sampling

**Reservoir sampling**
>A simple random sampling strategy to produce a sample without replacement from a stream of data - that is, in one pass: O(N)

Want to sample instances - uniformly at random without replacement - from a population size of n records, where n is not known.

Figuring out n would require 2 passes. Reservoir sampling achieves this in 1 pass.
A reservoir R here is simply an array of size s. Let D be data stream of size n.

*Algorithm:*
* Store first s elements into R.
* for each element in position k = s+1 to n ,
    * accept it with probability s/k
    * if accepted, choose a random element from R to replace.

*Partial analysis:*

The base case is trivial. For step k+1, the probability that a given element i with position <= k is in R is s/k. The probability that i is replaced is the probability that the (k+1)th element is accepted multiplied by the probability that i is the one chosen for replacement, which is s/(k+1) * 1/s = 1/(k+1), so the probability that i is not replaced is k/(k+1).

So any given element's probability of lasting after k+1 rounds is: (chosen in k steps, and not removed in k steps)

= s/k * k/(k+1), which is s/(k+1).

So, when k+1 = n, any element is present with probability s/n.
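
The s/n result can be checked empirically. The sketch below implements the algorithm as listed above (keep the first s items, then accept item k with probability s/k) and counts how often each item ends up in the reservoir over many runs. The stream size, reservoir size and number of trials are arbitrary choices for illustration.

```python
import random
from collections import Counter

def reservoir(stream, s, rng):
    """Keep the first s items, then accept the k-th item with probability s/k."""
    R = []
    for k, item in enumerate(stream, start=1):
        if k <= s:
            R.append(item)
        elif rng.random() < s / k:
            R[rng.randrange(s)] = item  # replace a random element of R
    return R

rng = random.Random(0)           # seeded for reproducibility
n, s, trials = 10, 3, 20000
hits = Counter()
for _ in range(trials):
    hits.update(reservoir(range(n), s, rng))

# Each frequency should be close to s/n = 0.3
for item in range(n):
    print(item, round(hits[item] / trials, 3))
```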

>**Distributing Reservoir Sampling**
It is very simple to distribute the reservoir sampling algorithm to n nodes.

Split the data stream into n partitions, one for each node. Apply reservoir sampling with reservoir size s, the final reservoir size, on each of the partitions. Finally, aggregate each reservoir into a final reservoir sample by carrying out reservoir sampling on them.

Lets say you split data of size n into 2 nodes, where each partition is of size n/2. Sub-reservoirs R1 and R2 are each of size s.

Probability that a record will be in sub-reservoir is:
s / (n/2) = 2s/n

The Probability that a record will end up in the final reservoir given it is in a sub-reservoir is: s/(2s) = 1/2.

It follows that the probability any given record will end up in the final reservoir is:
2s/n * 1/2 = s/n

🗨️ QUOTE from: https://docs.microsoft.com/en-us/archive/blogs/spt/reservoir-sampling
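
The two-node case described above can be simulated in the same way. This sketch assumes two partitions of equal size, as in the quoted argument, and checks only the per-record inclusion probability s/n. The helper names are made up for illustration.

```python
import random
from collections import Counter

def reservoir(stream, s, rng):
    """Keep the first s items, then accept the k-th item with probability s/k."""
    R = []
    for k, item in enumerate(stream, start=1):
        if k <= s:
            R.append(item)
        elif rng.random() < s / k:
            R[rng.randrange(s)] = item
    return R

def two_node_reservoir(data, s, rng):
    """Sample each half separately, then reservoir-sample the merged sub-reservoirs."""
    half = len(data) // 2
    r1 = reservoir(data[:half], s, rng)   # node 1
    r2 = reservoir(data[half:], s, rng)   # node 2
    return reservoir(r1 + r2, s, rng)     # keeps s of the 2s candidates

rng = random.Random(0)
n, s, trials = 12, 3, 20000
data = list(range(n))
hits = Counter()
for _ in range(trials):
    hits.update(two_node_reservoir(data, s, rng))

# Each frequency should be close to s/n = 0.25
for item in data:
    print(item, round(hits[item] / trials, 3))
```

With partitions of different sizes the merge step would need to weight each sub-reservoir by its partition size, which this sketch does not do.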

Reservoir sampling is super useful when there is an endless stream of data and your goal is to grab a small sample with uniform probability.
The math behind is straightforward. Given a sample of size K with N items processed so far, the chance for any item to be selected is K/N. When the next item comes in, current sample has a chance to survive K/N*N/(N+1)=K/(N+1) while the new item has chance K/(N+1) to be selected.

**Here is example code that implements reservoir sampling:**

🗨️ https://medium.com/100-days-of-algorithms/day-33-reservoir-sampling-252062ce0baa


In [35]:
import random
def reservoir_sampling(iterator, k):
    result = []
    n = 0
    for item in iterator:
        # print(item)
        n = n + 1
        if len(result) < k:
            result.append(item)
        else:
            j = int(random.random() * n)
            if j < k:
                result[j] = item
    return result

In [37]:
Battle_Reservoir_Sampled = reservoir_sampling(Battle_df.iterrows(),10)
Battle_Reservoir_Sampled

[(31,
  attacker_king            Balon/Euron Greyjoy
  defender_king       Joffrey/Tommen Baratheon
  attacker_outcome                         win
  major_death                              0.0
  major_capture                            0.0
  attacker_size                            0.0
  defender_size                            0.0
  summer                                   0.0
  Name: 31, dtype: object),
 (35,
  attacker_king       Joffrey/Tommen Baratheon
  defender_king                     Robb Stark
  attacker_outcome                         win
  major_death                              0.0
  major_capture                            0.0
  attacker_size                         3000.0
  defender_size                            0.0
  summer                                   0.0
  Name: 35, dtype: object),
 (21,
  attacker_king                     Robb Stark
  defender_king       Joffrey/Tommen Baratheon
  attacker_outcome                        loss
  major_death                    