# Final Project
## Oscar Engelbrektson
## CS156, Prof. Sterne, Fall 2019

### 1. Problem Definition

For this final project, I set out to predict the outcome of UFC fights, i.e. which fighter is going to win. If it works well enough, I would like to be able to use it to inform betting strategy. The dataset I used was scraped from ufcstats.com, which has a page of summary statistics for every fight that has taken place since the inception of the UFC - 1997 to 2019.

![scrape_screenshot.png](attachment:scrape_screenshot.png)

Every row in the dataset represents one such fight, and each statistic presented in this screenshot was included.

### 2. Solution Specification


**Approach**

To solve this prediction problem, we want to make use of all relevant data avaiable for each fighter. We use a filtering, as opposed to a smoothing approach to model training. That is, we use all previous information, but never any more than that. We must not use any observations accrued *during or after* the fight we are trying to predict–no such information will be available in real prediction scenarios, such as UFC245. Concretely, we model p(y_t | y_1:y_t-1, x_1:x_t-1) where y is the outcome, x is a vector of explanatory variables and t = {1,2,..t-1,t,...T} denotes the timestep. 

Going from the scraped data to a dataset of this format required a lot of pre-processing. In general, my approach was to compute the cummulative statistics for each fighter. For example, the average striking accuracy across all
previous fights. 

**Data split**

Observations in this data set are not independendent, as later appearances of a fighter depend on his/her earlier fights. Furthermore, this dependence is not only between different observations of the same fighter. If I know that a fighter is facing the world champion next, that will tell me something about his chances of victory–this is explicitly accounted for in the Trueskill model, and may be implicitly captured by some of the models we fit in the next section. Consequently, it does not make sense to perform ordinary cross-validation to optimise model hyperparameters. Viewed from a different perspective, the objective is to train a model to predict the outcome of future fights given past fights. Consequently, it makes no sense to evaluate the model on its ability to predict fights given past *and future* fights. 

Rather, we perform a temporal train-test split of the data. This has the effect of ensuring that the data on which we evaluate model performance is in the future compared to the training data. We set aside the last 200 fights as a dedicated test set. The remaining data will constitute our trainset. To tune hyperparameters, we will hold out the 600 last fights from the train data, train on the remaining and then evaluate accuracy on the held-out set. We repeat this process 2 more times, with the 400 last held out, then the 200 last held out. The hyperparameter settings that maximise the average accuracy across the 3 held out sets will be chosen. This is likely an established method, but on the offchance that it is not, I would like to call it "K-Hold" cross validation.

The hope is that by varying the held out set on which performance is evaluated, albeit only partially, we can mitigate the some of the overfitting that would likely occur if we only had 1 held out set. This is the approach that most closely mirrors ordinary cross validation, without violating the temporal constraints of this problem.

Once the hyperparameters have been chosen, the model is then trained on all the training data and used to predict the test set outcomes. This result will be the final reported accuracy, used to compare and choose model.


**Evaluation metric**

We are looking to build a model that will predict the probability of a fighter winning a fight before it happens, so as to help us make bets that maximise our expected returns. The payoff of a bet is binary, if you win the bet you get your money back plus profit, but if you lose the bet you loose all the money. With respect the model, the payoff function depends only on whether a prediction was correct or incorrect–it does not distinguish between different types of error and neither should we. As such, prediction accuracy is the metric used to evaluate model performance.

### 3. Testing and Analysis

**Models and results**

The primary model I implemented was the Trueskill model. It was developed by Microsoft to model the skill of players in hopes to make good matches in online game like Halo by matching equally skilled opponents. It is assumed that every player has some time-varying skill, which is unobserved. To objective of the model is to infer the latent player skill from the observed performances–wins, losses and draws. In terms of modelling, skill s for player i at time t, N(s_ti|mu, sigma) represents our belief of the players true skill level. sigma, the uncertainty of our belief decreases as we observe more games of the player. Performance p of player i at time t, in turn is assumed to be a drawn from a normal distribution centered on s_ti with variance beta. The key modelling assumption here is that, whilst the  player skill is specific to each player, the variance of performance around that skill is specific to the game, i.e. the same for all players. I perform cross validation to to set this variance parameter, beta. This model is optimal for this type of match prediction, because it is explicitly made for inferring the skill of players.

The trueskill model only uses the win-loss-draw history of fighters. As I have access to quite detailed data above and beyond that, my idea was to use trueskill win probabilty together with other features as input in a classifier. On the rationale that access to fighter characteristics would complement the trueskill ratings and enable improved prediction accuracy. 

Unfortunately, I was unable to realize this objective. The best performing model I was able to produce was the trueskill model on its own, with test set accuracy of 58.2% (58.5% on training set). Whilst it is unclear what consitutes good prediction performance on this specific problem (due to lack of publically available work), I am confident that this number can be improved further. I firmly believe that there is predictive power contained in the that goes above and beyond that contained only in winning history, I have just not been able to unlock it. I have some more ideas for how to improve in the future (see next section).


My results are summarized below:

| Model | Hyperparameters | Training Accuracy | Test Accuracy | 
| --- | --- | --- | --- | 
| Trueskill | beta=2 | 0.585 | 0.582 | 
| SVM | kernel=RBF, C=1, gamma=0.0001 | 0.560 | 0.478 | 
| SVM, w. Trueskill as feature |kernel=RBF, C=1, gamma=1 | 0.558 | 0.502 | 

Note that performance was markedly worse on the test set as compared to the training set for both the SVMs. The trueskill model, however, maintained . This is expected, as there is no tuning of the parameters of the model per se, it is deterministic given the 

## Ideas for future improvement

Here are some ideas I came up with to late to try out, but would be very interesting to investigate:

1. It is not sufficient to model the offensive output of a fighter, we must also. Looking only at offensive output, a world-class striker like Israel Adesanya, may look mediocre. His power comes from his ability to hit and not get hit–the opponents lacking offensive output is not only indicative of his offensive skill, but also Adesanyas defensive skill. Conversely, other fighters rely on volume to overwhelm, exhaust and ultimately  defeat their opponents. We expect such fighters to get hit more often (making the opponent strike more often that normal), *regardless of the opponent*. Therefore, it is important that we model the defense of fighters also. Concretely, we could interpret every attack attempted but not landed by the opponent to have been successfully defended.

2. What if you ran TrueSkill on strikes: every strike attemt is considered a game between two players; the defender has a defensive skill and the attacker has an offensive skill. Same can be applied to takedowns etc. Naturally, the trueskill rating update rule would have to be modified, to account for the skill level being constant *within* each fight. For example, if there were 100 strikes thrown in a fight, we assume that each one is a performance drawn from the same unobserved striking skill. After having observed each of the 100 strikes, we would have to perform some expectation-maximisation procedure to update our beliefs about the fighters latent striking skill. However, in fighting specifically, there might be some problem with the independence assumption: if you get hit hard in the head, your defensive skill will likely temporarily decrease. Nevertheless, this seems doable and worth exploring. If it works it would have many benefits. Currently, **the largest short-coming of my approach, I believe, is that it does not account for the opposition against which the statistics where recorded.** For example, if you tell me that fighter A landed 4/5 takedowns in his fight against fighter B, my view of his wrestling skill, and ability to score takedowns in future fights, will be vastly different if fighter B is a world-class wrestler as compared to if he was notoriously awful at takedown defence. The trueskill model takes this into account when computing the skill. My idea then would be to compute the striking, wrestling etc offense and defence for every fighter, and use that along with the ordinary trueskill model trained only on the outcome of fights.

**Ramblings of a madman**

As a concluding note, this was a hard project for me. It has taken much longer than I foresaw. I have easily spent more time on this than the other final projects combined and I still don't feel I had enough time. I feel my theoretical approach to the problem is the right one, but I wish I had more time to try out more feature engineering and exploring other models. For example, I was thinking that computing a output/minute statistic and then pairing that with an accuracy statistic might be a promising approach.

Either way, I don't have much in ways of comparison to evaluate what constitutes good accuracy (although I suspect you can get it a lot higher than I was able to). There aren't many other publically available attempts at predicting UFC fights, and the ones that I have been able to find always do something silly (something that seems to me like it invalidated the results). For example, almost everyone I have seen attempt this problem  does a random train-test split–which does not make sense because it turns the prediction task into a smoothing problem, which is different from predicting things in the future given the past–and then gets something like 89% accuracy and calls it a day.

# References:

Data set: 
https://www.kaggle.com/rajeevw/ufcdata#raw_total_fight_data.csv

Trueskill model.
https://trueskill.org/


Bunkera, R. Thabtah, A. (2019). Machine learning framework for sport result prediction. Applied Computing and Informatics.

    Retrieved
    https://www.sciencedirect.com/science/article/pii/S2210832717301485



# 5. Appendices

Below is my jupyter notebook, in which I have done all the work. I have spent more time than I would like to admit trying massage (the approach taking in some of the below cells is more brutish) into the desired format.

## Column definitions:
1. R_ and B_ prefix signifies red and blue corner fighter stats respectively
1. _opp_ containing columns is the average of damage done by the opponent on the fighter
1. KD is number of knockdowns
1. SIG_STR is no. of significant strikes 'landed of attempted'
1. SIG_STR_pct is significant strikes percentage
1. TOTAL_STR is total strikes 'landed of attempted'
1. TD is no. of takedowns
1. TD_pct is takedown percentages
1. SUB_ATT is no. of submission attempts
1. PASS is no. times the guard was passed?
1. REV is the no. of Reversals landed
1. HEAD is no. of significant strinks to the head 'landed of attempted'
1. BODY is no. of significant strikes to the body 'landed of attempted'
1. CLINCH is no. of significant strikes in the clinch 'landed of attempted'
1. GROUND is no. of significant strikes on the ground 'landed of attempted'
1. win_by is method of win
1. last_round is last round of the fight (ex. if it was a KO in 1st, then this will be 1)
1. last_round_time is when the fight ended in the last round
1. Format is the format of the fight (3 rounds, 5 rounds etc.)
1. Referee is the name of the Ref
1. date is the date of the fight
1. location is the location in which the event took place
1. Fight_type is which weight class and whether it's a title bout or not
1. Winner is the winner of the fight
1. Stance is the stance of the fighter (orthodox, southpaw, etc.)
1. Height_cms is the height in centimeter
1. Reach_cms is the reach of the fighter (arm span) in centimeter
1. Weight_lbs is the weight of the fighter in pounds (lbs)
1. age is the age of the fighter
1. title_bout Boolean value of whether it is title fight or not
1. weight_class is which weight class the fight is in (Bantamweight, heavyweight, Women's flyweight, etc.)
1. no_of_rounds is the number of rounds the fight was scheduled for
1. current_lose_streak is the count of current concurrent losses of the fighter
1. current_win_streak is the count of current concurrent wins of the fighter
1. draw is the number of draws in the fighter's ufc career
1. wins is the number of wins in the fighter's ufc career
1. losses is the number of losses in the fighter's ufc career
1. total_rounds_fought is the average of total rounds fought by the fighter
1. total_time_fought(seconds) is the count of total time spent fighting in seconds
1. total_title_bouts is the total number of title bouts taken part in by the fighter
1. win_by_Decision_Majority is the number of wins by majority judges decision in the fighter's ufc career
1. win_by_Decision_Split is the number of wins by split judges decision in the fighter's ufc career
1. win_by_Decision_Unanimous is the number of wins by unanimous judges decision in the fighter's ufc career
1. win_by_KO/TKO is the number of wins by knockout in the fighter's ufc career
1. win_by_Submission is the number of wins by submission in the fighter's ufc career
1. win_by_TKO_Doctor_Stoppage is the number of wins by doctor stoppage in the fighter's ufc career

### Modelling the outcome of fights
"*If you know the enemy and know yourself, you need not fear the result of a hundred battles.*" - Sun Tzu

In [2]:
import pandas as pd
import numpy as np
import pandas_profiling as pp

In [300]:
fight_df = pd.read_csv("/Users/oscarengelbrektson/Downloads/ufcdata/raw_total_fight_data.csv", delimiter=";")

In [301]:
fight_df.loc[fight_df["R_fighter"]=="Israel Adesanya"]

Unnamed: 0,R_fighter,B_fighter,R_KD,B_KD,R_SIG_STR.,B_SIG_STR.,R_SIG_STR_pct,B_SIG_STR_pct,R_TOTAL_STR.,B_TOTAL_STR.,R_TD,B_TD,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,R_PASS,B_PASS,R_REV,B_REV,R_HEAD,B_HEAD,R_BODY,B_BODY,R_LEG,B_LEG,R_DISTANCE,B_DISTANCE,R_CLINCH,B_CLINCH,R_GROUND,B_GROUND,win_by,last_round,last_round_time,Format,Referee,date,location,Fight_type,Winner
186,Israel Adesanya,Anderson Silva,0,0,65 of 132,31 of 72,49%,43%,65 of 132,34 of 75,0 of 0,0 of 0,0%,0%,0,0,0,0,0,0,27 of 79,16 of 49,9 of 22,8 of 12,29 of 31,7 of 11,63 of 129,26 of 67,2 of 3,5 of 5,0 of 0,0 of 0,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Herb Dean,"February 09, 2019","Melbourne, Victoria, Australia",Middleweight Bout,Israel Adesanya
573,Israel Adesanya,Marvin Vettori,0,0,57 of 123,46 of 154,46%,29%,67 of 134,60 of 169,0 of 1,2 of 6,0%,33%,0,0,0,1,0,0,31 of 92,20 of 122,11 of 12,7 of 8,15 of 19,19 of 24,57 of 123,44 of 152,0 of 0,2 of 2,0 of 0,0 of 0,Decision - Split,3,5:00,3 Rnd (5-5-5),Herb Dean,"April 14, 2018","Glendale, Arizona, USA",Middleweight Bout,Israel Adesanya


In [302]:
pp.ProfileReport(fight_df)

  variable_stats = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)


0,1
Number of variables,41
Number of observations,5144
Total Missing (%),0.1%
Total size in memory,1.6 MiB
Average record size in memory,328.0 B

0,1
Numeric,9
Categorical,32
Boolean,0
Date,0
Text (Unique),0
Rejected,0
Unsupported,0

0,1
Distinct count,1334
Unique (%),25.9%
Missing (%),0.0%
Missing (n),0

0,1
Jim Miller,23
Donald Cerrone,22
Diego Sanchez,21
Other values (1331),5078

Value,Count,Frequency (%),Unnamed: 3
Jim Miller,23,0.4%,
Donald Cerrone,22,0.4%,
Diego Sanchez,21,0.4%,
Demian Maia,21,0.4%,
Michael Bisping,21,0.4%,
Matt Hughes,20,0.4%,
Anderson Silva,20,0.4%,
Frankie Edgar,19,0.4%,
Georges St-Pierre,19,0.4%,
Joe Lauzon,19,0.4%,

0,1
Distinct count,1774
Unique (%),34.5%
Missing (%),0.0%
Missing (n),0

0,1
Jeremy Stephens,19
Charles Oliveira,17
Nik Lentz,14
Other values (1771),5094

Value,Count,Frequency (%),Unnamed: 3
Jeremy Stephens,19,0.4%,
Charles Oliveira,17,0.3%,
Nik Lentz,14,0.3%,
Rafael Dos Anjos,13,0.3%,
Tim Boetsch,13,0.3%,
Rick Story,12,0.2%,
Chris Lytle,12,0.2%,
Kevin Lee,12,0.2%,
Gleison Tibau,12,0.2%,
Evan Dunham,12,0.2%,

0,1
Distinct count,6
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.25233
Minimum,0
Maximum,5
Zeros (%),78.3%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,1
Maximum,5
Range,5
Interquartile range,0

0,1
Standard deviation,0.52332
Coef of variation,2.0739
Kurtosis,7.364
Mean,0.25233
MAD,0.39528
Skewness,2.3604
Sum,1298
Variance,0.27386
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,4029,78.3%,
1,960,18.7%,
2,133,2.6%,
3,18,0.3%,
5,2,0.0%,
4,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,4029,78.3%,
1,960,18.7%,
2,133,2.6%,
3,18,0.3%,
4,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
1,960,18.7%,
2,133,2.6%,
3,18,0.3%,
4,2,0.0%,
5,2,0.0%,

0,1
Distinct count,5
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.1804
Minimum,0
Maximum,4
Zeros (%),84.6%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,1
Maximum,4
Range,4
Interquartile range,0

0,1
Standard deviation,0.45965
Coef of variation,2.5479
Kurtosis,10.453
Mean,0.1804
MAD,0.30526
Skewness,2.9399
Sum,928
Variance,0.21127
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,4352,84.6%,
1,680,13.2%,
2,91,1.8%,
3,18,0.3%,
4,3,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0,4352,84.6%,
1,680,13.2%,
2,91,1.8%,
3,18,0.3%,
4,3,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0,4352,84.6%,
1,680,13.2%,
2,91,1.8%,
3,18,0.3%,
4,3,0.1%,

0,1
Distinct count,3038
Unique (%),59.1%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,35
0 of 1,32
1 of 1,24
Other values (3035),5053

Value,Count,Frequency (%),Unnamed: 3
0 of 0,35,0.7%,
0 of 1,32,0.6%,
1 of 1,24,0.5%,
2 of 3,18,0.3%,
2 of 4,17,0.3%,
1 of 3,17,0.3%,
3 of 6,16,0.3%,
6 of 8,14,0.3%,
5 of 7,14,0.3%,
6 of 10,14,0.3%,

0,1
Distinct count,2903
Unique (%),56.4%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,52
0 of 1,43
1 of 2,42
Other values (2900),5007

Value,Count,Frequency (%),Unnamed: 3
0 of 0,52,1.0%,
0 of 1,43,0.8%,
1 of 2,42,0.8%,
1 of 3,34,0.7%,
0 of 2,33,0.6%,
1 of 1,29,0.6%,
0 of 3,28,0.5%,
1 of 4,26,0.5%,
2 of 4,24,0.5%,
3 of 6,22,0.4%,

0,1
Distinct count,95
Unique (%),1.8%
Missing (%),0.0%
Missing (n),0

0,1
50%,242
40%,155
42%,144
Other values (92),4603

Value,Count,Frequency (%),Unnamed: 3
50%,242,4.7%,
40%,155,3.0%,
42%,144,2.8%,
47%,137,2.7%,
33%,136,2.6%,
44%,135,2.6%,
52%,133,2.6%,
37%,130,2.5%,
41%,125,2.4%,
46%,122,2.4%,

0,1
Distinct count,90
Unique (%),1.7%
Missing (%),0.0%
Missing (n),0

0,1
50%,256
0%,199
33%,193
Other values (87),4496

Value,Count,Frequency (%),Unnamed: 3
50%,256,5.0%,
0%,199,3.9%,
33%,193,3.8%,
40%,168,3.3%,
37%,138,2.7%,
42%,133,2.6%,
44%,132,2.6%,
38%,131,2.5%,
36%,129,2.5%,
41%,125,2.4%,

0,1
Distinct count,3681
Unique (%),71.6%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,24
0 of 1,18
1 of 1,17
Other values (3678),5085

Value,Count,Frequency (%),Unnamed: 3
0 of 0,24,0.5%,
0 of 1,18,0.3%,
1 of 1,17,0.3%,
4 of 6,12,0.2%,
2 of 3,12,0.2%,
1 of 3,12,0.2%,
5 of 7,11,0.2%,
9 of 12,10,0.2%,
11 of 15,10,0.2%,
1 of 2,10,0.2%,

0,1
Distinct count,3479
Unique (%),67.6%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,34
0 of 1,32
1 of 2,27
Other values (3476),5051

Value,Count,Frequency (%),Unnamed: 3
0 of 0,34,0.7%,
0 of 1,32,0.6%,
1 of 2,27,0.5%,
1 of 3,24,0.5%,
0 of 2,21,0.4%,
1 of 1,18,0.3%,
1 of 4,18,0.3%,
0 of 3,17,0.3%,
2 of 2,16,0.3%,
3 of 6,15,0.3%,

0,1
Distinct count,157
Unique (%),3.1%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,1595
1 of 1,547
0 of 1,425
Other values (154),2577

Value,Count,Frequency (%),Unnamed: 3
0 of 0,1595,31.0%,
1 of 1,547,10.6%,
0 of 1,425,8.3%,
1 of 2,247,4.8%,
0 of 2,179,3.5%,
2 of 2,154,3.0%,
1 of 3,125,2.4%,
2 of 3,121,2.4%,
2 of 4,94,1.8%,
1 of 4,87,1.7%,

0,1
Distinct count,154
Unique (%),3.0%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,1902
0 of 1,532
1 of 1,357
Other values (151),2353

Value,Count,Frequency (%),Unnamed: 3
0 of 0,1902,37.0%,
0 of 1,532,10.3%,
1 of 1,357,6.9%,
0 of 2,239,4.6%,
1 of 2,219,4.3%,
1 of 3,139,2.7%,
0 of 3,136,2.6%,
2 of 2,95,1.8%,
1 of 4,88,1.7%,
2 of 3,79,1.5%,

0,1
Distinct count,69
Unique (%),1.3%
Missing (%),0.0%
Missing (n),0

0,1
0%,2456
100%,837
50%,426
Other values (66),1425

Value,Count,Frequency (%),Unnamed: 3
0%,2456,47.7%,
100%,837,16.3%,
50%,426,8.3%,
33%,189,3.7%,
66%,177,3.4%,
25%,113,2.2%,
40%,81,1.6%,
75%,81,1.6%,
20%,75,1.5%,
60%,64,1.2%,

0,1
Distinct count,62
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0

0,1
0%,3028
100%,535
50%,358
Other values (59),1223

Value,Count,Frequency (%),Unnamed: 3
0%,3028,58.9%,
100%,535,10.4%,
50%,358,7.0%,
33%,211,4.1%,
25%,127,2.5%,
66%,108,2.1%,
20%,78,1.5%,
16%,65,1.3%,
40%,65,1.3%,
75%,59,1.1%,

0,1
Distinct count,11
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.51089
Minimum,0
Maximum,10
Zeros (%),68.1%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,2
Maximum,10
Range,10
Interquartile range,1

0,1
Standard deviation,0.94814
Coef of variation,1.8559
Kurtosis,11.594
Mean,0.51089
MAD,0.69601
Skewness,2.7887
Sum,2628
Variance,0.89897
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,3504,68.1%,
1,1033,20.1%,
2,389,7.6%,
3,129,2.5%,
4,50,1.0%,
5,21,0.4%,
6,8,0.2%,
7,6,0.1%,
8,2,0.0%,
10,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,3504,68.1%,
1,1033,20.1%,
2,389,7.6%,
3,129,2.5%,
4,50,1.0%,

Value,Count,Frequency (%),Unnamed: 3
6,8,0.2%,
7,6,0.1%,
8,2,0.0%,
9,1,0.0%,
10,1,0.0%,

0,1
Distinct count,8
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.361
Minimum,0
Maximum,7
Zeros (%),77.6%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,2
Maximum,7
Range,7
Interquartile range,0

0,1
Standard deviation,0.8098
Coef of variation,2.2432
Kurtosis,11.269
Mean,0.361
MAD,0.56017
Skewness,2.97
Sum,1857
Variance,0.65577
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,3991,77.6%,
1,699,13.6%,
2,293,5.7%,
3,106,2.1%,
4,33,0.6%,
5,14,0.3%,
7,4,0.1%,
6,4,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0,3991,77.6%,
1,699,13.6%,
2,293,5.7%,
3,106,2.1%,
4,33,0.6%,

Value,Count,Frequency (%),Unnamed: 3
3,106,2.1%,
4,33,0.6%,
5,14,0.3%,
6,4,0.1%,
7,4,0.1%,

0,1
Distinct count,21
Unique (%),0.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.4014
Minimum,0
Maximum,26
Zeros (%),52.7%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,2
95-th percentile,6
Maximum,26
Range,26
Interquartile range,2

0,1
Standard deviation,2.3016
Coef of variation,1.6423
Kurtosis,10.856
Mean,1.4014
MAD,1.6124
Skewness,2.7203
Sum,7209
Variance,5.2973
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,2712,52.7%,
1,863,16.8%,
2,549,10.7%,
3,327,6.4%,
4,234,4.5%,
5,145,2.8%,
6,90,1.7%,
7,67,1.3%,
8,54,1.0%,
9,31,0.6%,

Value,Count,Frequency (%),Unnamed: 3
0,2712,52.7%,
1,863,16.8%,
2,549,10.7%,
3,327,6.4%,
4,234,4.5%,

Value,Count,Frequency (%),Unnamed: 3
16,2,0.0%,
17,1,0.0%,
19,2,0.0%,
20,1,0.0%,
26,1,0.0%,

0,1
Distinct count,15
Unique (%),0.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.83126
Minimum,0
Maximum,14
Zeros (%),66.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,4
Maximum,14
Range,14
Interquartile range,1

0,1
Standard deviation,1.6515
Coef of variation,1.9868
Kurtosis,11.632
Mean,0.83126
MAD,1.0972
Skewness,3.0114
Sum,4276
Variance,2.7275
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,3395,66.0%,
1,786,15.3%,
2,381,7.4%,
3,213,4.1%,
4,140,2.7%,
5,81,1.6%,
6,58,1.1%,
7,34,0.7%,
8,18,0.3%,
9,16,0.3%,

Value,Count,Frequency (%),Unnamed: 3
0,3395,66.0%,
1,786,15.3%,
2,381,7.4%,
3,213,4.1%,
4,140,2.7%,

Value,Count,Frequency (%),Unnamed: 3
10,9,0.2%,
11,5,0.1%,
12,2,0.0%,
13,3,0.1%,
14,3,0.1%,

0,1
Distinct count,6
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.14172
Minimum,0
Maximum,5
Zeros (%),88.3%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,1
Maximum,5
Range,5
Interquartile range,0

0,1
Standard deviation,0.42745
Coef of variation,3.0162
Kurtosis,18.432
Mean,0.14172
MAD,0.25016
Skewness,3.743
Sum,729
Variance,0.18271
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,4540,88.3%,
1,503,9.8%,
2,84,1.6%,
3,11,0.2%,
4,5,0.1%,
5,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,4540,88.3%,
1,503,9.8%,
2,84,1.6%,
3,11,0.2%,
4,5,0.1%,

Value,Count,Frequency (%),Unnamed: 3
1,503,9.8%,
2,84,1.6%,
3,11,0.2%,
4,5,0.1%,
5,1,0.0%,

0,1
Distinct count,4
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.13414
Minimum,0
Maximum,3
Zeros (%),89.1%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,0
95-th percentile,1
Maximum,3
Range,3
Interquartile range,0

0,1
Standard deviation,0.4168
Coef of variation,3.1073
Kurtosis,14.025
Mean,0.13414
MAD,0.23902
Skewness,3.5462
Sum,690
Variance,0.17372
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,4583,89.1%,
1,451,8.8%,
2,91,1.8%,
3,19,0.4%,

Value,Count,Frequency (%),Unnamed: 3
0,4583,89.1%,
1,451,8.8%,
2,91,1.8%,
3,19,0.4%,

Value,Count,Frequency (%),Unnamed: 3
0,4583,89.1%,
1,451,8.8%,
2,91,1.8%,
3,19,0.4%,

0,1
Distinct count,2620
Unique (%),50.9%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,59
0 of 1,50
1 of 3,24
Other values (2617),5011

Value,Count,Frequency (%),Unnamed: 3
0 of 0,59,1.1%,
0 of 1,50,1.0%,
1 of 3,24,0.5%,
2 of 5,24,0.5%,
1 of 1,24,0.5%,
1 of 2,23,0.4%,
0 of 3,21,0.4%,
0 of 2,20,0.4%,
2 of 6,19,0.4%,
2 of 4,18,0.3%,

0,1
Distinct count,2475
Unique (%),48.1%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,104
0 of 1,93
0 of 2,63
Other values (2472),4884

Value,Count,Frequency (%),Unnamed: 3
0 of 0,104,2.0%,
0 of 1,93,1.8%,
0 of 2,63,1.2%,
0 of 3,42,0.8%,
1 of 2,37,0.7%,
1 of 3,34,0.7%,
1 of 4,34,0.7%,
1 of 5,32,0.6%,
0 of 4,31,0.6%,
0 of 5,28,0.5%,

0,1
Distinct count,541
Unique (%),10.5%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,635
1 of 1,332
2 of 2,215
Other values (538),3962

Value,Count,Frequency (%),Unnamed: 3
0 of 0,635,12.3%,
1 of 1,332,6.5%,
2 of 2,215,4.2%,
1 of 2,145,2.8%,
0 of 1,139,2.7%,
2 of 3,128,2.5%,
3 of 3,126,2.4%,
4 of 4,87,1.7%,
3 of 4,84,1.6%,
3 of 5,79,1.5%,

0,1
Distinct count,508
Unique (%),9.9%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,756
1 of 1,349
2 of 2,191
Other values (505),3848

Value,Count,Frequency (%),Unnamed: 3
0 of 0,756,14.7%,
1 of 1,349,6.8%,
2 of 2,191,3.7%,
0 of 1,173,3.4%,
1 of 2,153,3.0%,
2 of 3,126,2.4%,
3 of 4,115,2.2%,
3 of 3,114,2.2%,
4 of 5,96,1.9%,
4 of 4,69,1.3%,

0,1
Distinct count,393
Unique (%),7.6%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,1117
1 of 1,502
2 of 2,342
Other values (390),3183

Value,Count,Frequency (%),Unnamed: 3
0 of 0,1117,21.7%,
1 of 1,502,9.8%,
2 of 2,342,6.6%,
3 of 3,215,4.2%,
4 of 4,173,3.4%,
2 of 3,122,2.4%,
1 of 2,121,2.4%,
0 of 1,109,2.1%,
5 of 5,103,2.0%,
3 of 4,100,1.9%,

0,1
Distinct count,355
Unique (%),6.9%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,1110
1 of 1,529
2 of 2,298
Other values (352),3207

Value,Count,Frequency (%),Unnamed: 3
0 of 0,1110,21.6%,
1 of 1,529,10.3%,
2 of 2,298,5.8%,
3 of 3,189,3.7%,
0 of 1,157,3.1%,
4 of 4,146,2.8%,
1 of 2,138,2.7%,
2 of 3,131,2.5%,
3 of 4,105,2.0%,
4 of 5,87,1.7%,

0,1
Distinct count,2382
Unique (%),46.3%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,108
0 of 1,79
1 of 1,51
Other values (2379),4906

Value,Count,Frequency (%),Unnamed: 3
0 of 0,108,2.1%,
0 of 1,79,1.5%,
1 of 1,51,1.0%,
1 of 2,51,1.0%,
1 of 3,47,0.9%,
0 of 2,46,0.9%,
2 of 3,42,0.8%,
1 of 4,36,0.7%,
2 of 4,34,0.7%,
3 of 6,32,0.6%,

0,1
Distinct count,2391
Unique (%),46.5%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,94
0 of 1,86
0 of 2,61
Other values (2388),4903

Value,Count,Frequency (%),Unnamed: 3
0 of 0,94,1.8%,
0 of 1,86,1.7%,
0 of 2,61,1.2%,
1 of 2,59,1.1%,
1 of 1,57,1.1%,
1 of 3,57,1.1%,
0 of 3,53,1.0%,
1 of 4,41,0.8%,
2 of 4,37,0.7%,
0 of 4,27,0.5%,

0,1
Distinct count,472
Unique (%),9.2%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,1167
1 of 1,329
1 of 2,180
Other values (469),3468

Value,Count,Frequency (%),Unnamed: 3
0 of 0,1167,22.7%,
1 of 1,329,6.4%,
1 of 2,180,3.5%,
2 of 2,174,3.4%,
0 of 1,157,3.1%,
3 of 3,126,2.4%,
2 of 3,125,2.4%,
3 of 4,86,1.7%,
1 of 3,81,1.6%,
2 of 4,78,1.5%,

0,1
Distinct count,469
Unique (%),9.1%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,1178
1 of 1,336
0 of 1,186
Other values (466),3444

Value,Count,Frequency (%),Unnamed: 3
0 of 0,1178,22.9%,
1 of 1,336,6.5%,
0 of 1,186,3.6%,
2 of 2,184,3.6%,
1 of 2,152,3.0%,
2 of 3,135,2.6%,
1 of 3,116,2.3%,
3 of 4,108,2.1%,
3 of 3,99,1.9%,
2 of 4,80,1.6%,

0,1
Distinct count,641
Unique (%),12.5%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,1608
1 of 1,270
2 of 2,193
Other values (638),3073

Value,Count,Frequency (%),Unnamed: 3
0 of 0,1608,31.3%,
1 of 1,270,5.2%,
2 of 2,193,3.8%,
0 of 1,116,2.3%,
1 of 2,101,2.0%,
3 of 3,96,1.9%,
2 of 3,84,1.6%,
4 of 4,71,1.4%,
3 of 4,67,1.3%,
2 of 4,64,1.2%,

0,1
Distinct count,465
Unique (%),9.0%
Missing (%),0.0%
Missing (n),0

0,1
0 of 0,2252
1 of 1,321
2 of 2,182
Other values (462),2389

Value,Count,Frequency (%),Unnamed: 3
0 of 0,2252,43.8%,
1 of 1,321,6.2%,
2 of 2,182,3.5%,
0 of 1,149,2.9%,
1 of 2,138,2.7%,
3 of 3,94,1.8%,
2 of 3,91,1.8%,
1 of 3,73,1.4%,
2 of 4,60,1.2%,
4 of 4,52,1.0%,

0,1
Distinct count,10
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
Decision - Unanimous,1737
KO/TKO,1647
Submission,1083
Other values (7),677

Value,Count,Frequency (%),Unnamed: 3
Decision - Unanimous,1737,33.8%,
KO/TKO,1647,32.0%,
Submission,1083,21.1%,
Decision - Split,486,9.4%,
TKO - Doctor's Stoppage,70,1.4%,
Decision - Majority,56,1.1%,
Overturned,35,0.7%,
DQ,15,0.3%,
Could Not Continue,13,0.3%,
Other,2,0.0%,

0,1
Distinct count,5
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.2883
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,1
Median,3
Q3,3
95-th percentile,3
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.0037
Coef of variation,0.43864
Kurtosis,-0.39115
Mean,2.2883
MAD,0.88236
Skewness,0.14972
Sum,11771
Variance,1.0075
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
3,2526,49.1%,
1,1568,30.5%,
2,865,16.8%,
5,155,3.0%,
4,30,0.6%,

Value,Count,Frequency (%),Unnamed: 3
1,1568,30.5%,
2,865,16.8%,
3,2526,49.1%,
4,30,0.6%,
5,155,3.0%,

Value,Count,Frequency (%),Unnamed: 3
1,1568,30.5%,
2,865,16.8%,
3,2526,49.1%,
4,30,0.6%,
5,155,3.0%,

0,1
Distinct count,333
Unique (%),6.5%
Missing (%),0.0%
Missing (n),0

0,1
5:00,2315
4:59,32
3:00,29
Other values (330),2768

Value,Count,Frequency (%),Unnamed: 3
5:00,2315,45.0%,
4:59,32,0.6%,
3:00,29,0.6%,
1:54,23,0.4%,
2:38,20,0.4%,
0:39,19,0.4%,
2:18,18,0.3%,
1:01,18,0.3%,
4:21,18,0.3%,
1:07,17,0.3%,

0,1
Distinct count,19
Unique (%),0.4%
Missing (%),0.0%
Missing (n),0

0,1
3 Rnd (5-5-5),4502
5 Rnd (5-5-5-5-5),423
1 Rnd + OT (12-3),79
Other values (16),140

Value,Count,Frequency (%),Unnamed: 3
3 Rnd (5-5-5),4502,87.5%,
5 Rnd (5-5-5-5-5),423,8.2%,
1 Rnd + OT (12-3),79,1.5%,
No Time Limit,37,0.7%,
3 Rnd + OT (5-5-5-5),22,0.4%,
1 Rnd (20),20,0.4%,
1 Rnd + 2OT (15-3-3),20,0.4%,
2 Rnd (5-5),11,0.2%,
1 Rnd (15),8,0.2%,
1 Rnd (10),6,0.1%,

0,1
Distinct count,191
Unique (%),3.7%
Missing (%),0.4%
Missing (n),23

0,1
Herb Dean,726
John McCarthy,634
Mario Yamasaki,391
Other values (187),3370

Value,Count,Frequency (%),Unnamed: 3
Herb Dean,726,14.1%,
John McCarthy,634,12.3%,
Mario Yamasaki,391,7.6%,
Dan Miragliotta,347,6.7%,
Marc Goddard,276,5.4%,
Yves Lavigne,259,5.0%,
Steve Mazzagatti,201,3.9%,
Leon Roberts,179,3.5%,
Keith Peterson,139,2.7%,
Josh Rosenthal,119,2.3%,

0,1
Distinct count,476
Unique (%),9.3%
Missing (%),0.0%
Missing (n),0

0,1
"November 19, 2016",25
"October 04, 2014",23
"May 31, 2014",22
Other values (473),5074

Value,Count,Frequency (%),Unnamed: 3
"November 19, 2016",25,0.5%,
"October 04, 2014",23,0.4%,
"May 31, 2014",22,0.4%,
"August 23, 2014",21,0.4%,
"June 28, 2014",21,0.4%,
"March 11, 1994",15,0.3%,
"September 22, 2018",14,0.3%,
"April 14, 2018",14,0.3%,
"December 02, 2017",13,0.3%,
"June 08, 2019",13,0.3%,

0,1
Distinct count,157
Unique (%),3.1%
Missing (%),0.0%
Missing (n),0

0,1
"Las Vegas, Nevada, USA",1216
"London, England, United Kingdom",114
"Montreal, Quebec, Canada",81
Other values (154),3733

Value,Count,Frequency (%),Unnamed: 3
"Las Vegas, Nevada, USA",1216,23.6%,
"London, England, United Kingdom",114,2.2%,
"Montreal, Quebec, Canada",81,1.6%,
"Chicago, Illinois, USA",81,1.6%,
"Atlantic City, New Jersey, USA",80,1.6%,
"Los Angeles, California, USA",79,1.5%,
"Newark, New Jersey, USA",78,1.5%,
"Denver, Colorado, USA",74,1.4%,
"Toronto, Ontario, Canada",74,1.4%,
"Anaheim, California, USA",72,1.4%,

0,1
Distinct count,112
Unique (%),2.2%
Missing (%),0.0%
Missing (n),0

0,1
Lightweight Bout,947
Welterweight Bout,915
Middleweight Bout,684
Other values (109),2598

Value,Count,Frequency (%),Unnamed: 3
Lightweight Bout,947,18.4%,
Welterweight Bout,915,17.8%,
Middleweight Bout,684,13.3%,
Heavyweight Bout,453,8.8%,
Light Heavyweight Bout,453,8.8%,
Featherweight Bout,423,8.2%,
Bantamweight Bout,360,7.0%,
Flyweight Bout,173,3.4%,
Women's Strawweight Bout,132,2.6%,
Women's Bantamweight Bout,98,1.9%,

0,1
Distinct count,1268
Unique (%),24.7%
Missing (%),1.6%
Missing (n),83

0,1
Donald Cerrone,23
Georges St-Pierre,20
Demian Maia,20
Other values (1264),4998
(Missing),83

Value,Count,Frequency (%),Unnamed: 3
Donald Cerrone,23,0.4%,
Georges St-Pierre,20,0.4%,
Demian Maia,20,0.4%,
Michael Bisping,20,0.4%,
Jim Miller,19,0.4%,
Jon Jones,18,0.3%,
Rafael Dos Anjos,18,0.3%,
Matt Hughes,18,0.3%,
Diego Sanchez,18,0.3%,
Frankie Edgar,17,0.3%,

Unnamed: 0,R_fighter,B_fighter,R_KD,B_KD,R_SIG_STR.,B_SIG_STR.,R_SIG_STR_pct,B_SIG_STR_pct,R_TOTAL_STR.,B_TOTAL_STR.,R_TD,B_TD,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,R_PASS,B_PASS,R_REV,B_REV,R_HEAD,B_HEAD,R_BODY,B_BODY,R_LEG,B_LEG,R_DISTANCE,B_DISTANCE,R_CLINCH,B_CLINCH,R_GROUND,B_GROUND,win_by,last_round,last_round_time,Format,Referee,date,location,Fight_type,Winner
0,Henry Cejudo,Marlon Moraes,0,0,90 of 171,57 of 119,52%,47%,99 of 182,59 of 121,1 of 4,0 of 2,25%,0%,1,0,1,0,0,0,73 of 150,35 of 89,13 of 16,7 of 8,4 of 5,15 of 22,45 of 118,54 of 116,19 of 23,2 of 2,26 of 30,1 of 1,KO/TKO,3,4:51,5 Rnd (5-5-5-5-5),Marc Goddard,"June 08, 2019","Chicago, Illinois, USA",UFC Bantamweight Title Bout,Henry Cejudo
1,Valentina Shevchenko,Jessica Eye,1,0,8 of 11,2 of 12,72%,16%,37 of 40,42 of 52,2 of 2,0 of 0,100%,0%,1,0,3,0,0,0,4 of 5,0 of 7,4 of 6,0 of 2,0 of 0,2 of 3,5 of 8,2 of 12,2 of 2,0 of 0,1 of 1,0 of 0,KO/TKO,2,0:26,5 Rnd (5-5-5-5-5),Robert Madrigal,"June 08, 2019","Chicago, Illinois, USA",UFC Women's Flyweight Title Bout,Valentina Shevchenko
2,Tony Ferguson,Donald Cerrone,0,0,104 of 200,68 of 185,52%,36%,104 of 200,68 of 185,0 of 0,1 of 1,0%,100%,0,0,0,0,0,0,65 of 144,43 of 152,25 of 37,15 of 23,14 of 19,10 of 10,103 of 198,68 of 184,1 of 2,0 of 1,0 of 0,0 of 0,TKO - Doctor's Stoppage,2,5:00,3 Rnd (5-5-5),Dan Miragliotta,"June 08, 2019","Chicago, Illinois, USA",Lightweight Bout,Tony Ferguson
3,Jimmie Rivera,Petr Yan,0,2,73 of 192,56 of 189,38%,29%,76 of 195,58 of 192,0 of 3,1 of 3,0%,33%,0,0,0,1,0,0,42 of 145,40 of 166,15 of 24,13 of 19,16 of 23,3 of 4,60 of 173,42 of 167,9 of 15,10 of 12,4 of 4,4 of 10,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Kevin MacDonald,"June 08, 2019","Chicago, Illinois, USA",Bantamweight Bout,Petr Yan
4,Tai Tuivasa,Blagoy Ivanov,0,1,64 of 144,73 of 123,44%,59%,66 of 146,81 of 131,0 of 0,2 of 2,0%,100%,0,0,0,0,0,0,39 of 114,65 of 114,6 of 7,7 of 8,19 of 23,1 of 1,50 of 126,62 of 111,14 of 18,5 of 6,0 of 0,6 of 6,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Dan Miragliotta,"June 08, 2019","Chicago, Illinois, USA",Heavyweight Bout,Blagoy Ivanov


In [303]:
#Fighter DF
fighter_df = pd.read_csv("/Users/oscarengelbrektson/Downloads/ufcdata/raw_fighter_details.csv", delimiter=",", 
                        header=0)
fighter_df.head()

Unnamed: 0,fighter_name,Height,Weight,Reach,Stance,DOB
0,AJ Fonseca,"5' 4""",145 lbs.,,,
1,AJ Matthews,"5' 11""",185 lbs.,,,
2,AJ McKee,"5' 10""",145 lbs.,,,
3,AJ Siscoe,"5' 7""",135 lbs.,,,
4,Aalon Cruz,"6' 0""",145 lbs.,,,


# Pre-processing

The objective is to build a model that is able to predict the outcomes of future fights. To get the data into a format where it can facilitate such a model, several pre-processing steps must be taken:

**Column transformations for fight_dataset**
  1. Separate every column of the format #landed of #attempted  into two columns, one with #landed and one with #attempted
  2. Remove Fight_type, create a column that details weight_class
  3. Convert percentages to decimal fractions
  
**Column transformations for fighter dataset**
  1. Express height in centimeters, not feet and inches.
  2. Express reach in centimeters
  3. Express weight in kilos
  4. Convert DOB to datetime
  
  
**Dealing with missing values**
  5. If reach is missing, impute reach = height. Vice versa if height is missing. If both are missing, impute median.

**Merging the data frames: Desired Structure of pre-processed dataframe**
1. Each row should represent a fight, including all relevant information needed available to predict that fight and the outcome.
2. As the goal is to predict future fights, we cannot include any information accruad *during* the fight. However, it is crucial that it contain all information on the fighter leading up to the fight. Consequently, we must modify the structure of the data such that the row representing the nth fight for a certain figher contains the cumulative statistics for that fighter across his n-1 historical fights. The same holds for the opponent.
3. We must join the dataframe with information on the fighters with the dataframe with fight statistics, such as to abide by 1. and 2.

**Feature engineering**
  1. Compute number of wins for each fighter, use it to compute TrueSkill Rating for each fighter.
  2. Use DOB to compute age. Drop DOB.
  3. Compute time since last fight.
  4. Winstreak?? Or captured by Trueskill

# Column Transformations for Fight Dataset

In [304]:
pd.set_option('display.max_columns', fight_df.shape[1])
fight_df.head()

Unnamed: 0,R_fighter,B_fighter,R_KD,B_KD,R_SIG_STR.,B_SIG_STR.,R_SIG_STR_pct,B_SIG_STR_pct,R_TOTAL_STR.,B_TOTAL_STR.,R_TD,B_TD,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,R_PASS,B_PASS,R_REV,B_REV,R_HEAD,B_HEAD,R_BODY,B_BODY,R_LEG,B_LEG,R_DISTANCE,B_DISTANCE,R_CLINCH,B_CLINCH,R_GROUND,B_GROUND,win_by,last_round,last_round_time,Format,Referee,date,location,Fight_type,Winner
0,Henry Cejudo,Marlon Moraes,0,0,90 of 171,57 of 119,52%,47%,99 of 182,59 of 121,1 of 4,0 of 2,25%,0%,1,0,1,0,0,0,73 of 150,35 of 89,13 of 16,7 of 8,4 of 5,15 of 22,45 of 118,54 of 116,19 of 23,2 of 2,26 of 30,1 of 1,KO/TKO,3,4:51,5 Rnd (5-5-5-5-5),Marc Goddard,"June 08, 2019","Chicago, Illinois, USA",UFC Bantamweight Title Bout,Henry Cejudo
1,Valentina Shevchenko,Jessica Eye,1,0,8 of 11,2 of 12,72%,16%,37 of 40,42 of 52,2 of 2,0 of 0,100%,0%,1,0,3,0,0,0,4 of 5,0 of 7,4 of 6,0 of 2,0 of 0,2 of 3,5 of 8,2 of 12,2 of 2,0 of 0,1 of 1,0 of 0,KO/TKO,2,0:26,5 Rnd (5-5-5-5-5),Robert Madrigal,"June 08, 2019","Chicago, Illinois, USA",UFC Women's Flyweight Title Bout,Valentina Shevchenko
2,Tony Ferguson,Donald Cerrone,0,0,104 of 200,68 of 185,52%,36%,104 of 200,68 of 185,0 of 0,1 of 1,0%,100%,0,0,0,0,0,0,65 of 144,43 of 152,25 of 37,15 of 23,14 of 19,10 of 10,103 of 198,68 of 184,1 of 2,0 of 1,0 of 0,0 of 0,TKO - Doctor's Stoppage,2,5:00,3 Rnd (5-5-5),Dan Miragliotta,"June 08, 2019","Chicago, Illinois, USA",Lightweight Bout,Tony Ferguson
3,Jimmie Rivera,Petr Yan,0,2,73 of 192,56 of 189,38%,29%,76 of 195,58 of 192,0 of 3,1 of 3,0%,33%,0,0,0,1,0,0,42 of 145,40 of 166,15 of 24,13 of 19,16 of 23,3 of 4,60 of 173,42 of 167,9 of 15,10 of 12,4 of 4,4 of 10,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Kevin MacDonald,"June 08, 2019","Chicago, Illinois, USA",Bantamweight Bout,Petr Yan
4,Tai Tuivasa,Blagoy Ivanov,0,1,64 of 144,73 of 123,44%,59%,66 of 146,81 of 131,0 of 0,2 of 2,0%,100%,0,0,0,0,0,0,39 of 114,65 of 114,6 of 7,7 of 8,19 of 23,1 of 1,50 of 126,62 of 111,14 of 18,5 of 6,0 of 0,6 of 6,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Dan Miragliotta,"June 08, 2019","Chicago, Illinois, USA",Heavyweight Bout,Blagoy Ivanov


In [305]:
def landed(x):
    '''
    Takes x of format #landed of #attempted
    return #landed as numeric
    '''
    if type(x) is not str:
        return x
    
    landed, of, attempted = x.split()
    
    return int(landed)


def attempted(x):
    '''
    Takes x of format #landed of #attempted
    return #attempted as numeric
    '''
    if type(x) is not str:
        return x
    
    landed, of, attempted = x.split()
    
    return int(attempted)


attempted_landed_columns = ["R_KD", "B_KD","R_SIG_STR.","B_SIG_STR.", "R_TOTAL_STR.", "B_TOTAL_STR.", 
                            "R_TD", "B_TD", "R_HEAD", "B_HEAD", "R_BODY", "B_BODY", "R_LEG", "B_LEG",
                            "R_DISTANCE","B_DISTANCE", "R_CLINCH","B_CLINCH", "R_GROUND","B_GROUND"]

#Split into attempted and landed, then drop original column
for col_name in attempted_landed_columns:
    fight_df[col_name+"_LANDED"] = fight_df[col_name].apply(landed)
    fight_df[col_name+"_ATT"] = fight_df[col_name].apply(attempted)
    
fight_df = fight_df.drop(columns=attempted_landed_columns, axis=1)

In [306]:
def percent_to_decimal(x):
    '''
    Takes x of format x%
    returns x as decimal
    '''
    if type(x) is not str:
        return x
    
    percentage = x.replace("%", "")
    
    return float(percentage)/100

percent_columns = ["R_SIG_STR_pct", "B_SIG_STR_pct", "R_TD_pct", "B_TD_pct"]
for col_name in percent_columns:
    fight_df[col_name] = fight_df[col_name].apply(percent_to_decimal)

In [307]:
#Convert date to datetime
fight_df['date'] = pd.to_datetime(fight_df['date'])

In [308]:
#Convert "Fight_type" to weight_class. This will be used to join the Fighter and the fight dataset

weight_classes = ["Strawweight", "Flyweight", "Bantamweight", "Featherweight",
                       "Lightweight", "Welterweight", "Middleweight", "Light heavyweight",
                       "Heavyweight"]

womens_weight_classes = ["Women's Strawweight", "Women's Flyweight",
                       "Women's Bantamweight", "Women's Featherweight"]

def get_weight_class(x):
    '''
    Takes a str x with fight type
    returns weight class str
    '''
    if "Women's" in x:
        for weight_class in womens_weight_classes:
            if weight_class in x:
                return weight_class
    
   
    for weight_class in weight_classes:
        if weight_class in x:
            return weight_class
    
    #Catch weight is not a real weight class but when a special weight was agreed for
    #a specific fight, in the UFC it has typically happened when one fighter failed to make
    #weight and the other fighter agreed to go ahead with the fight anyways.
    #We impute the middle weight class, middle weight.
    if "Catch Weight" in x:
        return "Middle Weight"
    
    #Open weight used to exist in the beginnings of UFC, but no longer
    #Any weight was accepted, typically meaning big fighters, we return heavyweight instead
    if "Open Weight" in x:
        return "Heavyweight"
    
    #Some of the original UFC contests where tournaments with no weight limits
    else:
        return "Heavyweight"
    
            
fight_df["weight_class"] = fight_df["Fight_type"].apply(get_weight_class)

In [309]:
#Drop Fight_type now that we have extracted the relevant info
fight_df.drop(columns=["Fight_type"], axis=1, inplace=True)

In [310]:
fight_df.head()

Unnamed: 0,R_fighter,B_fighter,R_SIG_STR_pct,B_SIG_STR_pct,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,R_PASS,B_PASS,R_REV,B_REV,win_by,last_round,last_round_time,Format,Referee,date,location,Winner,...,R_BODY_ATT,B_BODY_LANDED,B_BODY_ATT,R_LEG_LANDED,R_LEG_ATT,B_LEG_LANDED,B_LEG_ATT,R_DISTANCE_LANDED,R_DISTANCE_ATT,B_DISTANCE_LANDED,B_DISTANCE_ATT,R_CLINCH_LANDED,R_CLINCH_ATT,B_CLINCH_LANDED,B_CLINCH_ATT,R_GROUND_LANDED,R_GROUND_ATT,B_GROUND_LANDED,B_GROUND_ATT,weight_class
0,Henry Cejudo,Marlon Moraes,0.52,0.47,0.25,0.0,1,0,1,0,0,0,KO/TKO,3,4:51,5 Rnd (5-5-5-5-5),Marc Goddard,2019-06-08,"Chicago, Illinois, USA",Henry Cejudo,...,16,7,8,4,5,15,22,45,118,54,116,19,23,2,2,26,30,1,1,Bantamweight
1,Valentina Shevchenko,Jessica Eye,0.72,0.16,1.0,0.0,1,0,3,0,0,0,KO/TKO,2,0:26,5 Rnd (5-5-5-5-5),Robert Madrigal,2019-06-08,"Chicago, Illinois, USA",Valentina Shevchenko,...,6,0,2,0,0,2,3,5,8,2,12,2,2,0,0,1,1,0,0,Women's Flyweight
2,Tony Ferguson,Donald Cerrone,0.52,0.36,0.0,1.0,0,0,0,0,0,0,TKO - Doctor's Stoppage,2,5:00,3 Rnd (5-5-5),Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Tony Ferguson,...,37,15,23,14,19,10,10,103,198,68,184,1,2,0,1,0,0,0,0,Lightweight
3,Jimmie Rivera,Petr Yan,0.38,0.29,0.0,0.33,0,0,0,1,0,0,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Kevin MacDonald,2019-06-08,"Chicago, Illinois, USA",Petr Yan,...,24,13,19,16,23,3,4,60,173,42,167,9,15,10,12,4,4,4,10,Bantamweight
4,Tai Tuivasa,Blagoy Ivanov,0.44,0.59,0.0,1.0,0,0,0,0,0,0,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),Dan Miragliotta,2019-06-08,"Chicago, Illinois, USA",Blagoy Ivanov,...,7,7,8,19,23,1,1,50,126,62,111,14,18,5,6,0,0,6,6,Heavyweight


## Dealing with missing values for fight_df

For some fights, there are no observations for certain variables. In total, there are 106 missing values.

In [311]:
print("Number of missing values:", np.sum(fight_df.isnull().sum()))

Number of missing values: 106


In [312]:
print("Columns with missing values:", list(fight_df.columns[list(set(np.where(fight_df.isnull())[1]))]))

Columns with missing values: ['Referee', 'Winner']


Making a judgement call from personal experience, the referee is not predictive of what fighter is going to win. I will just drop the referee column.

In [313]:
#Drop referee column
fight_df = fight_df.drop(columns=["Referee"], axis=1)

#For which fights is the winner missing?
fight_df[["R_fighter","B_fighter"]].loc[np.where(fight_df.isnull())[0]]

Unnamed: 0,R_fighter,B_fighter
228,Andrei Arlovski,Walt Harris
331,Matt Frevola,Lando Vannata
363,Randa Markos,Marina Rodriguez
702,Khalil Rountree Jr.,Michal Oleksiejczuk
704,Marvin Vettori,Omari Akhmedov
...,...,...
4944,Tim Lajcik,Ron Waterman
4950,Jens Pulver,Alfonso Alcarez
5003,Kazushi Sakuraba,Marcus Silveira
5082,Ken Shamrock,Oleg Taktarov


Some googling reveals that the thing these fights have in common is that they are all draws. I will have to create a numerical representation of the who wins, the outcome variable I will aim to predict/classify with my model, before I can do any model fitting. I will deal with the draws then. For now, I will just impute "Draw" where winner is missing.

In [314]:
fight_df["Winner"].fillna("Draw", inplace=True)

In [315]:
#Save to csv
fight_df.to_csv(r"/Users/oscarengelbrektson/Documents/Minerva/CS156 - ML/clean_fight_df.csv")

## Computing the cummulative stats of a fighter up to each fight

In predicting the outcome of future fights, we want to draw on all available information about the fighter up until that fight, a starting step for this process is computing the cumulative stats for the fighter. To compute this, for all count variables, I will sum all observations of a certain variable made at a date that is lower than the date of te current row.

I will compute the cumulative count up to but not including the current fight for all count variables, e.g. STRIKES_ATT and STRIKES_LANDED. The percentage variables, e.g. SIG_STR_pct, will then be computed as Landed / attempted for the current row.

In [239]:
fight_df = pd.read_csv("/Users/oscarengelbrektson/Documents/Minerva/CS156 - ML/clean_fight_df.csv", header=0)
cum_df = fight_df

In [240]:
def get_cum_stats(stat, fighter, date):
    '''
    Takes a count statistic (without color prefix), a fighter and a date
    returns cumulative count of stat for fighter up to but not including date
    '''
    stats_as_blue = np.sum(fight_df["B_"+stat].loc[((fight_df["B_fighter"]==fighter) & (fight_df["date"] < date))])
    stats_as_red = np.sum(fight_df["R_"+stat].loc[((fight_df["R_fighter"]==fighter) & (fight_df["date"] < date))])
    return stats_as_blue + stats_as_red

In [241]:
stat_list = ["SUB_ATT", "PASS", "REV", "KD_LANDED", "KD_ATT", "SIG_STR._LANDED", 
             "SIG_STR._LANDED", "TOTAL_STR._LANDED", "TOTAL_STR._ATT", "TD_ATT", 
             "TD_LANDED", "HEAD_LANDED", "HEAD_ATT", "BODY_LANDED", "BODY_ATT",
             "LEG_LANDED", "LEG_ATT", "DISTANCE_LANDED", "DISTANCE_ATT", 
             "CLINCH_LANDED", "CLINCH_ATT", "GROUND_LANDED", "GROUND_ATT"]


#Traverse from bottom to top, oldest to newest
for row in range(cum_df.shape[0]):
    
    #Date of current fight, used to find previous fights
    date = cum_df["date"].loc[row]
    
    #Compute cumulative counts of each count stat for each fighter
    r_fighter = cum_df["R_fighter"].loc[row]
    b_fighter = cum_df["B_fighter"].loc[row]
    for stat in stat_list:
        cum_df["R_"+stat].loc[row] = get_cum_stats(stat, r_fighter, date)
        cum_df["B_"+stat].loc[row] = get_cum_stats(stat, b_fighter, date)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


In [242]:
cum_df.to_csv(r"/Users/oscarengelbrektson/Documents/Minerva/CS156 - ML/partial_cum_df1.csv", header=True)
cum_df.head()

Unnamed: 0.1,Unnamed: 0,R_fighter,B_fighter,R_SIG_STR_pct,B_SIG_STR_pct,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,R_PASS,B_PASS,R_REV,B_REV,win_by,last_round,last_round_time,Format,date,location,Winner,R_KD_LANDED,R_KD_ATT,B_KD_LANDED,B_KD_ATT,R_SIG_STR._LANDED,R_SIG_STR._ATT,B_SIG_STR._LANDED,B_SIG_STR._ATT,R_TOTAL_STR._LANDED,R_TOTAL_STR._ATT,B_TOTAL_STR._LANDED,B_TOTAL_STR._ATT,R_TD_LANDED,R_TD_ATT,B_TD_LANDED,B_TD_ATT,R_HEAD_LANDED,R_HEAD_ATT,B_HEAD_LANDED,B_HEAD_ATT,R_BODY_LANDED,R_BODY_ATT,B_BODY_LANDED,B_BODY_ATT,R_LEG_LANDED,R_LEG_ATT,B_LEG_LANDED,B_LEG_ATT,R_DISTANCE_LANDED,R_DISTANCE_ATT,B_DISTANCE_LANDED,B_DISTANCE_ATT,R_CLINCH_LANDED,R_CLINCH_ATT,B_CLINCH_LANDED,B_CLINCH_ATT,R_GROUND_LANDED,R_GROUND_ATT,B_GROUND_LANDED,B_GROUND_ATT,weight_class
0,0,Henry Cejudo,Marlon Moraes,0.52,0.47,0.25,0.0,1,2,12,2,0,0,KO/TKO,3,4:51,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Henry Cejudo,4,4,4,4,440,171,113,119,691,1299,118,332,19,53,1,4,239,742,56,243,164,219,30,46,37,53,27,38,265,750,103,313,110,170,0,1,65,94,10,13,Bantamweight
1,1,Valentina Shevchenko,Jessica Eye,0.72,0.16,1.0,0.0,3,7,12,8,1,0,KO/TKO,2,0:26,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Valentina Shevchenko,0,0,0,0,416,11,513,12,720,1131,696,1587,17,36,5,10,259,592,320,1120,54,84,91,146,103,135,102,123,253,617,421,1247,48,65,73,118,115,129,19,24,Women's Flyweight
2,2,Tony Ferguson,Donald Cerrone,0.52,0.36,0.0,1.0,15,11,5,29,2,3,TKO - Doctor's Stoppage,2,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Tony Ferguson,3,3,20,20,832,200,1450,185,951,1995,1629,3215,6,14,21,67,534,1451,721,2097,130,208,351,476,168,206,378,434,742,1742,1196,2627,26,43,136,209,64,80,118,171,Lightweight
3,3,Jimmie Rivera,Petr Yan,0.38,0.29,0.0,0.33,0,1,1,2,0,1,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Petr Yan,3,3,2,2,366,192,281,189,406,939,347,619,5,18,5,10,192,644,215,465,82,146,56,68,92,104,10,12,328,839,195,438,33,47,44,55,5,8,42,52,Bantamweight
4,4,Tai Tuivasa,Blagoy Ivanov,0.44,0.59,0.0,1.0,0,0,1,0,0,0,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Blagoy Ivanov,2,2,0,0,130,144,123,123,131,254,124,408,0,2,0,0,91,203,90,369,27,31,29,34,12,15,4,4,99,203,119,402,29,44,4,5,2,2,0,0,Heavyweight


### Add number of fights and wins

In [248]:
cum_df["R_wins"] = np.zeros(cum_df.shape[0])
cum_df["B_wins"] = np.zeros(cum_df.shape[0])
cum_df["R_fights"] = np.zeros(cum_df.shape[0])
cum_df["B_fights"] = np.zeros(cum_df.shape[0])

def get_wins(fighter, date):
    wins_as_red = len(np.where((fight_df["Winner"] == fighter) & (fight_df["date"]<date))[0])
    wins_as_blue = len(np.where((fight_df["Winner"] == fighter) & (fight_df["date"]<date))[0])
    return wins_as_red + wins_as_blue

def get_fights(fighter, date):
    fights_as_red = len(np.where((fight_df["R_fighter"] == fighter) & (fight_df["date"]<date))[0])
    fights_as_blue = len(np.where((fight_df["B_fighter"] == fighter) & (fight_df["date"]<date))[0])
    return fights_as_red + fights_as_blue

In [249]:
stat_list = ["wins", "fights"]

#Traverse from bottom to top, oldest to newest
for row in reversed(range(cum_df.shape[0])):
    
    #Date of current fight, used to find previous fights
    date = cum_df["date"].loc[row]
    
    #Compute cumulative counts of each count stat for each fighter
    r_fighter = cum_df["R_fighter"].loc[row]
    b_fighter = cum_df["B_fighter"].loc[row]
    
    #Get wins
    cum_df["R_wins"].loc[row] = get_wins(r_fighter, date)
    cum_df["B_wins"].loc[row] = get_wins(b_fighter, date)
    
    #Get fights
    cum_df["R_fights"].loc[row] = get_fights(r_fighter, date)
    cum_df["B_fights"].loc[row] = get_fights(b_fighter, date)

In [250]:
#Save to csv
cum_df.to_csv(r"/Users/oscarengelbrektson/Documents/Minerva/CS156 - ML/partial_cum_df2.csv", header=True)

In [359]:
cum_df = pd.read_csv("/Users/oscarengelbrektson/Documents/Minerva/CS156 - ML/partial_cum_df2.csv", header=0)
cum_df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,R_fighter,B_fighter,R_SIG_STR_pct,B_SIG_STR_pct,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,R_PASS,B_PASS,R_REV,B_REV,win_by,last_round,last_round_time,Format,date,location,Winner,R_KD_LANDED,R_KD_ATT,B_KD_LANDED,B_KD_ATT,R_SIG_STR_LANDED,R_SIG_STR_ATT,B_SIG_STR_LANDED,B_SIG_STR_ATT,R_TOTAL_STR_LANDED,R_TOTAL_STR_ATT,B_TOTAL_STR_LANDED,B_TOTAL_STR_ATT,R_TD_LANDED,R_TD_ATT,B_TD_LANDED,B_TD_ATT,R_HEAD_LANDED,R_HEAD_ATT,B_HEAD_LANDED,B_HEAD_ATT,R_BODY_LANDED,R_BODY_ATT,B_BODY_LANDED,B_BODY_ATT,R_LEG_LANDED,R_LEG_ATT,B_LEG_LANDED,B_LEG_ATT,R_DISTANCE_LANDED,R_DISTANCE_ATT,B_DISTANCE_LANDED,B_DISTANCE_ATT,R_CLINCH_LANDED,R_CLINCH_ATT,B_CLINCH_LANDED,B_CLINCH_ATT,R_GROUND_LANDED,R_GROUND_ATT,B_GROUND_LANDED,B_GROUND_ATT,weight_class,R_wins,B_wins,R_fights,B_fights,R_wins_pct,B_wins_pct,R_TOTAL_STR_pct,B_TOTAL_STR_pct,R_HEAD_pct,B_HEAD_pct,R_BODY_pct,B_BODY_pct,R_LEG_pct,B_LEG_pct,R_DISTANCE_pct,B_DISTANCE_pct,R_CLINCH_pct,B_CLINCH_pct,R_GROUND_pct,B_GROUND_pct
0,0,0,0,0,Henry Cejudo,Marlon Moraes,2.573099,0.94958,0.358491,0.25,1,2,12,2,0,0,KO/TKO,3,4:51,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Henry Cejudo,4,4,4,4,440,171,113,119,691,1299,118,332,19,53,1,4,239,742,56,243,164,219,30,46,37,53,27,38,265,750,103,313,110,170,0,1,65,94,10,13,Bantamweight,16.0,8.0,10.0,5.0,1.6,1.6,0.531948,0.355422,0.322102,0.230453,0.748858,0.652174,0.698113,0.710526,0.353333,0.329073,0.647059,0.0,0.691489,0.769231
1,1,1,1,1,Valentina Shevchenko,Jessica Eye,37.818182,42.75,0.472222,0.5,3,7,12,8,1,0,KO/TKO,2,0:26,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Valentina Shevchenko,0,0,0,0,416,11,513,12,720,1131,696,1587,17,36,5,10,259,592,320,1120,54,84,91,146,103,135,102,123,253,617,421,1247,48,65,73,118,115,129,19,24,Women's Flyweight,10.0,8.0,7.0,10.0,1.428571,0.8,0.636605,0.438563,0.4375,0.285714,0.642857,0.623288,0.762963,0.829268,0.410049,0.33761,0.738462,0.618644,0.891473,0.791667
2,2,2,2,2,Tony Ferguson,Donald Cerrone,4.16,14.183784,0.428571,0.313433,15,22,5,58,2,6,TKO - Doctor's Stoppage,2,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Tony Ferguson,3,3,36,36,832,200,2624,185,951,1995,2978,5704,6,14,42,134,534,1451,1268,3590,130,208,668,902,168,206,688,800,742,1742,2134,4566,26,43,266,412,64,80,224,314,Lightweight,28.0,46.0,15.0,31.0,1.866667,1.483871,0.476692,0.52209,0.368022,0.353203,0.625,0.740576,0.815534,0.86,0.425947,0.467367,0.604651,0.645631,0.8,0.713376
3,3,3,3,3,Jimmie Rivera,Petr Yan,1.90625,1.486772,0.277778,0.5,0,1,1,2,0,1,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Petr Yan,3,3,2,2,366,192,281,189,406,939,347,619,5,18,5,10,192,644,215,465,82,146,56,68,92,104,10,12,328,839,195,438,33,47,44,55,5,8,42,52,Bantamweight,12.0,8.0,8.0,4.0,1.5,2.0,0.432375,0.560582,0.298137,0.462366,0.561644,0.823529,0.884615,0.833333,0.390942,0.445205,0.702128,0.8,0.625,0.807692
4,4,4,4,4,Tai Tuivasa,Blagoy Ivanov,0.902778,1.0,0.0,0.0,0,0,1,0,0,0,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Blagoy Ivanov,2,2,0,0,130,144,123,123,131,254,124,408,0,2,0,0,91,203,90,369,27,31,29,34,12,15,4,4,99,203,119,402,29,44,4,5,2,2,0,0,Heavyweight,6.0,2.0,4.0,2.0,1.5,1.0,0.515748,0.303922,0.448276,0.243902,0.870968,0.852941,0.8,1.0,0.487685,0.29602,0.659091,0.8,1.0,0.0


### Compute Avg for each of the LANDED of ATT stats
Same wins /fights to get win rate

In [360]:
pd.set_option('display.max_columns', cum_df.shape[1])

#Remove the dot in the following column names
new_colnames={"R_SIG_STR._LANDED":"R_SIG_STR_LANDED",
              "R_SIG_STR._ATT":"R_SIG_STR_ATT",
              "B_SIG_STR._LANDED":"B_SIG_STR_LANDED",
              "B_SIG_STR._ATT":"B_SIG_STR_ATT",
              "R_TOTAL_STR._LANDED":"R_TOTAL_STR_LANDED",
              "R_TOTAL_STR._ATT": "R_TOTAL_STR_ATT",
              "B_TOTAL_STR._LANDED": "B_TOTAL_STR_LANDED",
              "B_TOTAL_STR._ATT": "B_TOTAL_STR_ATT"}

cum_df.rename(columns=new_colnames, inplace=True)

#Initialize the new columns
cum_df["R_wins_pct"] = np.zeros(cum_df.shape[0])
cum_df["B_wins_pct"] = np.zeros(cum_df.shape[0])

pct_list = ["SIG_STR",  "TOTAL_STR", "TOTAL_STR", "TD", 
            "HEAD", "BODY", "LEG",  "DISTANCE", "CLINCH", "GROUND"]

for stat in pct_list:
    if stat+"_pct" in cum_df.columns:
        continue
    else:
        cum_df["R_"+stat+"_pct"] = np.zeros(cum_df.shape[0])
        cum_df["B_"+stat+"_pct"] = np.zeros(cum_df.shape[0])

In [361]:
def get_pct(stat, prefix, row):
    '''
    Takes row with cumulative data, a statistic whose pct to compute, and prefix
    Returns fraction of successes
    '''
    attempted = cum_df[prefix+stat+"_ATT"].loc[row]
    landed = cum_df[prefix+stat+"_LANDED"].loc[row]
    
    if attempted == 0:
        return 0
    else:
        return landed / attempted

#For every row and pct statistic
for row in range(cum_df.shape[0]):
    for stat in pct_list:
        #Compute pct for R_fighter
        cum_df["R_"+stat+"_pct"].loc[row] = get_pct(stat, "R_", row)
        #Compute pct for B_fighter
        cum_df["B_"+stat+"_pct"].loc[row] = get_pct(stat, "B_", row)
    
    #Compute win rate for R_fighter and B_fighter
    cum_df["R_wins_pct"].loc[row] = cum_df["R_wins"].loc[row] / cum_df["R_fights"].loc[row]
    cum_df["B_wins_pct"].loc[row] = cum_df["B_wins"].loc[row] / cum_df["B_fights"].loc[row]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


In [362]:
cum_df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,Unnamed: 0.1.1,Unnamed: 0.1.1.1,R_fighter,B_fighter,R_SIG_STR_pct,B_SIG_STR_pct,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,R_PASS,B_PASS,R_REV,B_REV,win_by,last_round,last_round_time,Format,date,location,Winner,R_KD_LANDED,R_KD_ATT,B_KD_LANDED,B_KD_ATT,R_SIG_STR_LANDED,R_SIG_STR_ATT,B_SIG_STR_LANDED,B_SIG_STR_ATT,R_TOTAL_STR_LANDED,R_TOTAL_STR_ATT,B_TOTAL_STR_LANDED,B_TOTAL_STR_ATT,R_TD_LANDED,R_TD_ATT,B_TD_LANDED,B_TD_ATT,R_HEAD_LANDED,R_HEAD_ATT,B_HEAD_LANDED,B_HEAD_ATT,R_BODY_LANDED,R_BODY_ATT,B_BODY_LANDED,B_BODY_ATT,R_LEG_LANDED,R_LEG_ATT,B_LEG_LANDED,B_LEG_ATT,R_DISTANCE_LANDED,R_DISTANCE_ATT,B_DISTANCE_LANDED,B_DISTANCE_ATT,R_CLINCH_LANDED,R_CLINCH_ATT,B_CLINCH_LANDED,B_CLINCH_ATT,R_GROUND_LANDED,R_GROUND_ATT,B_GROUND_LANDED,B_GROUND_ATT,weight_class,R_wins,B_wins,R_fights,B_fights,R_wins_pct,B_wins_pct,R_TOTAL_STR_pct,B_TOTAL_STR_pct,R_HEAD_pct,B_HEAD_pct,R_BODY_pct,B_BODY_pct,R_LEG_pct,B_LEG_pct,R_DISTANCE_pct,B_DISTANCE_pct,R_CLINCH_pct,B_CLINCH_pct,R_GROUND_pct,B_GROUND_pct
0,0,0,0,0,Henry Cejudo,Marlon Moraes,2.573099,0.94958,0.358491,0.25,1,2,12,2,0,0,KO/TKO,3,4:51,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Henry Cejudo,4,4,4,4,440,171,113,119,691,1299,118,332,19,53,1,4,239,742,56,243,164,219,30,46,37,53,27,38,265,750,103,313,110,170,0,1,65,94,10,13,Bantamweight,16.0,8.0,10.0,5.0,1.6,1.6,0.531948,0.355422,0.322102,0.230453,0.748858,0.652174,0.698113,0.710526,0.353333,0.329073,0.647059,0.0,0.691489,0.769231
1,1,1,1,1,Valentina Shevchenko,Jessica Eye,37.818182,42.75,0.472222,0.5,3,7,12,8,1,0,KO/TKO,2,0:26,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Valentina Shevchenko,0,0,0,0,416,11,513,12,720,1131,696,1587,17,36,5,10,259,592,320,1120,54,84,91,146,103,135,102,123,253,617,421,1247,48,65,73,118,115,129,19,24,Women's Flyweight,10.0,8.0,7.0,10.0,1.428571,0.8,0.636605,0.438563,0.4375,0.285714,0.642857,0.623288,0.762963,0.829268,0.410049,0.33761,0.738462,0.618644,0.891473,0.791667
2,2,2,2,2,Tony Ferguson,Donald Cerrone,4.16,14.183784,0.428571,0.313433,15,22,5,58,2,6,TKO - Doctor's Stoppage,2,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Tony Ferguson,3,3,36,36,832,200,2624,185,951,1995,2978,5704,6,14,42,134,534,1451,1268,3590,130,208,668,902,168,206,688,800,742,1742,2134,4566,26,43,266,412,64,80,224,314,Lightweight,28.0,46.0,15.0,31.0,1.866667,1.483871,0.476692,0.52209,0.368022,0.353203,0.625,0.740576,0.815534,0.86,0.425947,0.467367,0.604651,0.645631,0.8,0.713376
3,3,3,3,3,Jimmie Rivera,Petr Yan,1.90625,1.486772,0.277778,0.5,0,1,1,2,0,1,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Petr Yan,3,3,2,2,366,192,281,189,406,939,347,619,5,18,5,10,192,644,215,465,82,146,56,68,92,104,10,12,328,839,195,438,33,47,44,55,5,8,42,52,Bantamweight,12.0,8.0,8.0,4.0,1.5,2.0,0.432375,0.560582,0.298137,0.462366,0.561644,0.823529,0.884615,0.833333,0.390942,0.445205,0.702128,0.8,0.625,0.807692
4,4,4,4,4,Tai Tuivasa,Blagoy Ivanov,0.902778,1.0,0.0,0.0,0,0,1,0,0,0,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Blagoy Ivanov,2,2,0,0,130,144,123,123,131,254,124,408,0,2,0,0,91,203,90,369,27,31,29,34,12,15,4,4,99,203,119,402,29,44,4,5,2,2,0,0,Heavyweight,6.0,2.0,4.0,2.0,1.5,1.0,0.515748,0.303922,0.448276,0.243902,0.870968,0.852941,0.8,1.0,0.487685,0.29602,0.659091,0.8,1.0,0.0


# Column Transformations for Fighter Dataset

In [365]:
def height_to_cm(x):
    '''
    Takes string x of format feet' inches"
    Returns length as numeric in cm
    '''
    if type(x) is not str:
        return x
    
    feet, inches = x.split("' ")
    inches = inches.replace("\"", "")
    
    return 30.48*int(feet) + 2.54*float(inches)

def reach_to_cm(x):
    '''
    Takes string x of format inches"
    Returns reach as numeric in cm
    '''
    if type(x) is not str:
        return x

    inches = x.replace("\"", "")
    
    return 2.54*float(inches)

def weight_to_kg(x):
    '''
    Takes string x of format x lbs
    Returns weight as numeric in kg
    '''
    if type(x) is not str:
        return x
    
    lbs, _ = x.split()
    
    return 0.453592*float(lbs)

fighter_df["Height"] = fighter_df["Height"].apply(height_to_cm)
fighter_df["Reach"] = fighter_df["Reach"].apply(reach_to_cm)
fighter_df["Weight"] = fighter_df["Weight"].apply(weight_to_kg)

#Convert DOB to datetime
fighter_df['DOB'] = pd.to_datetime(fighter_df['DOB'])

### Dealing with missing values in fighter_df

There are missing values in each of the columns, except fighter_name. As these all denote physical characteristics of fighters, which tend to be quite similar controlling for weight class, we just impute the median (less sensitive to outliers than mean) value for the corresponding weight class. This procedure obviously fails when it is the weight class itself that is missing. To circumvent this issue, we cross reference the name with the fight_df and get the weight class from there and then impute the upper limit for that weight class (UFC fighters typically weigh in right on weight or just below).

In [366]:
weight_class_dict =  {"Strawweight": 52, 
                      "Flyweight": 57, 
                      "Bantamweight":61, 
                      "Featherweight": 66,
                      "Lightweight": 70, 
                      "Welterweight": 77, 
                      "Middleweight": 84, 
                      "Light heavyweight": 93,
                      "Heavyweight": 120, 
                      "Women's Strawweight": 52, 
                      "Women's Flyweight": 57,
                      "Women's Bantamweight": 61, 
                      "Women's Featherweight": 66}

#Get missing weights from 
for na_idx in np.where(fighter_df["Weight"].isnull()):
    name=str(fighter_df["fighter_name"].loc[na_idx])

    #If fighter exists in fight df, get the weightclass from there        
    if name in list(cum_df["R_fighter"]):
        find = np.where((cum_df["R_fighter"]==name) | (cum_df["B_fighter"]==name))
        weight_class = cum_df["Weight_class"].loc[find[0][0]]
        
    elif name in list(cum_df["B_fighter"]):
        find = np.where(cum_df["B_fighter"]==name)
        weight_class = cum_df["Weight_class"].loc[find[0][0]]
        
    #Else, assume the fighter is Lightweight (median weight class)
    else:
        weight_class = "Lightweight"
    
    fighter_df["Weight"].loc[na_idx] = weight_class_dict[weight_class]

In [367]:
#Imputing weight class median for missing values
columns = ["Height", "Reach", "Stance", "DOB"]

for col in columns:
    if col == "Stance":
        median = "Orthodox"
    
    elif col == "DOB":
        median = fighter_df[col].quantile(.5)
        
    else:
        median = fighter_df[col].median()
        
    for na_idx in np.where(fighter_df[col].isnull()):
        fighter_df[col].loc[na_idx] = median

In [368]:
fighter_df.tail()

Unnamed: 0,fighter_name,Height,Weight,Reach,Stance,DOB
3309,Zhang Lipeng,180.34,70.30676,180.34,Southpaw,1990-03-10
3310,Zoila Frausto,162.56,61.23492,182.88,Orthodox,1984-01-15
3311,Zu Anyanwu,185.42,113.851592,195.58,Orthodox,1981-08-05
3312,Zubaira Tukhugov,172.72,65.77084,172.72,Orthodox,1991-01-15
3313,Kennedy Nzechukwu,196.0,93.0,211.0,Orthodox,1992-06-13


In [369]:
"Kennedy Nzechukwu" in str(fighter_df["fighter_name"])

True

There is one fighter in fight_df which does not exist in fighter_df - "Kennedy Nzechukwu". Becuase it is only one fighter missing, I looked him up manually and added him to fighter_df. Source: https://www.ufc.com/athlete/kennedy-nzechukwu

In [370]:
#Add "Kennedy Nzechukwu" as last entry in fighter_df
fighter_df.loc[fighter_df.shape[0]] = ["Kennedy Nzechukwu"]+[196]+[93]+[211]+["Orthodox"]+[pd.to_datetime("1992-06-13")]

In [371]:
fighter_df.tail()

Unnamed: 0,fighter_name,Height,Weight,Reach,Stance,DOB
3310,Zoila Frausto,162.56,61.23492,182.88,Orthodox,1984-01-15
3311,Zu Anyanwu,185.42,113.851592,195.58,Orthodox,1981-08-05
3312,Zubaira Tukhugov,172.72,65.77084,172.72,Orthodox,1991-01-15
3313,Kennedy Nzechukwu,196.0,93.0,211.0,Orthodox,1992-06-13
3314,Kennedy Nzechukwu,196.0,93.0,211.0,Orthodox,1992-06-13


In [378]:
type(pd.to_datetime(cum_df["R_DOB"][0]))

pandas._libs.tslibs.timestamps.Timestamp

# Join fight and Fighter Datasets

We know want to make the informations from both fighter_df and fight_df accessible in the same dataframe, i.e. for every fight we want to know each fighters physical traits. 

Because of the unique layout of the fight_df, a simple join will not cut it. A specific fighter may have been blue side in one fight and red-side in another. Consequently, we will add columns to the fight_df and then populate them on a row-by-row basis with data from the fighter_df.

In [373]:
#Create empty columns for Height, Weight, Reach, Stance, DOB
for col in fighter_df.columns[1:]:
    for prefix in ["R_", "B_"]:
        cum_df[prefix+col] = np.full(cum_df.shape[0], np.nan)
    
#For every fighter in each fight
for row in range(cum_df.shape[0]):
    for prefix in ["R_", "B_"]:
        #Find position of fighter in fighter_df
        idx = np.where(fighter_df["fighter_name"]==cum_df["R_fighter"].loc[row])
        
        #If fighter not found in fighter_df, report and continue
        if len(idx[0]) == 0:
            print(cum_df["R_fighter"].loc[row], "not in fighter_df")
            continue
            
        #Transcribe stats from fighter_df to cum_df
        for col in fighter_df.columns[1:]:
            cum_df[prefix+col].loc[row] = fighter_df[col].loc[idx[0][0]]

#### Computing Age from DOB

We have no reason to believe there to be an intrinsic predictive value in the DOB, some kind of seasonality effect or the like. We do however, have reason to believe that the age of the fighter at the time of the fight may hold value in predicting the outcome of a fight. Consequently, we use DOB and date to compute the age of each fighter at the time of the fight, and then discard DOB.

In [383]:
from dateutil import relativedelta

def year_diff(start, end):
    '''
    Takes a datetime start and datetime end, 
    returns years between start and end
    '''
    return relativedelta.relativedelta(pd.to_datetime(end), pd.to_datetime(start)).years


for prefix in ["R_", "B_"]:
    cum_df[prefix+"Age"] = np.full(cum_df.shape[0], np.nan)

for row in range(cum_df.shape[0]):
    for prefix in ["R_", "B_"]:
        cum_df[prefix+"Age"].loc[row] = year_diff(cum_df[prefix+"DOB"].loc[row], cum_df["date"].loc[row])

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


In [384]:
cum_df.drop(columns=["R_DOB", "B_DOB"], axis=1, inplace=True)

In [387]:
#Save DF
cum_df.drop(columns=["Unnamed: 0", "Unnamed: 0.1", "Unnamed: 0.1.1", "Unnamed: 0.1.1.1"], axis=1, inplace=True)
cum_df.to_csv(r"/Users/oscarengelbrektson/Documents/Minerva/CS156 - ML/pre-processed_df.csv", index=False)

In [3]:
cum_df = pd.read_csv("/Users/oscarengelbrektson/Documents/Minerva/CS156 - ML/pre-processed_df.csv")

In [3]:
pp.ProfileReport(cum_df)

  variable_stats = pd.concat(ldesc, join_axes=pd.Index([names]), axis=1)


0,1
Number of variables,93
Number of observations,5144
Total Missing (%),0.4%
Total size in memory,3.6 MiB
Average record size in memory,744.0 B

0,1
Numeric,47
Categorical,13
Boolean,0
Date,0
Text (Unique),0
Rejected,33
Unsupported,0

0,1
Distinct count,5144
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2571.5
Minimum,0
Maximum,5143
Zeros (%),0.0%

0,1
Minimum,0.0
5-th percentile,257.15
Q1,1285.8
Median,2571.5
Q3,3857.2
95-th percentile,4885.8
Maximum,5143.0
Range,5143.0
Interquartile range,2571.5

0,1
Standard deviation,1485.1
Coef of variation,0.57752
Kurtosis,-1.2
Mean,2571.5
MAD,1286
Skewness,0
Sum,13227796
Variance,2205500
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
2047,1,0.0%,
641,1,0.0%,
2612,1,0.0%,
565,1,0.0%,
4663,1,0.0%,
2616,1,0.0%,
569,1,0.0%,
4667,1,0.0%,
2620,1,0.0%,
573,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0,1,0.0%,
1,1,0.0%,
2,1,0.0%,
3,1,0.0%,
4,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
5139,1,0.0%,
5140,1,0.0%,
5141,1,0.0%,
5142,1,0.0%,
5143,1,0.0%,

0,1
Distinct count,1334
Unique (%),25.9%
Missing (%),0.0%
Missing (n),0

0,1
Jim Miller,23
Donald Cerrone,22
Michael Bisping,21
Other values (1331),5078

Value,Count,Frequency (%),Unnamed: 3
Jim Miller,23,0.4%,
Donald Cerrone,22,0.4%,
Michael Bisping,21,0.4%,
Diego Sanchez,21,0.4%,
Demian Maia,21,0.4%,
Matt Hughes,20,0.4%,
Anderson Silva,20,0.4%,
Andrei Arlovski,19,0.4%,
Frankie Edgar,19,0.4%,
BJ Penn,19,0.4%,

0,1
Distinct count,1774
Unique (%),34.5%
Missing (%),0.0%
Missing (n),0

0,1
Jeremy Stephens,19
Charles Oliveira,17
Nik Lentz,14
Other values (1771),5094

Value,Count,Frequency (%),Unnamed: 3
Jeremy Stephens,19,0.4%,
Charles Oliveira,17,0.3%,
Nik Lentz,14,0.3%,
Tim Boetsch,13,0.3%,
Rafael Dos Anjos,13,0.3%,
Rick Story,12,0.2%,
Chris Lytle,12,0.2%,
Evan Dunham,12,0.2%,
Gleison Tibau,12,0.2%,
Kevin Lee,12,0.2%,

0,1
Distinct count,3744
Unique (%),72.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,6.2953
Minimum,0
Maximum,642
Zeros (%),14.0%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.44444
Median,1.7903
Q3,4.7233
95-th percentile,23.106
Maximum,642.0
Range,642.0
Interquartile range,4.2789

0,1
Standard deviation,24.053
Coef of variation,3.8207
Kurtosis,333.33
Mean,6.2953
MAD,7.4497
Skewness,15.844
Sum,32383
Variance,578.53
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,718,14.0%,
1.0,16,0.3%,
4.0,15,0.3%,
0.3333333333333333,14,0.3%,
0.5,13,0.3%,
2.0,11,0.2%,
5.0,10,0.2%,
0.25,10,0.2%,
2.5,9,0.2%,
1.3333333333333333,8,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,718,14.0%,
0.0094339622641509,1,0.0%,
0.0095238095238095,2,0.0%,
0.010204081632653,1,0.0%,
0.0106382978723404,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
442.5,1,0.0%,
510.0,1,0.0%,
542.0,1,0.0%,
630.0,1,0.0%,
642.0,1,0.0%,

0,1
Distinct count,3131
Unique (%),60.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.1058
Minimum,0
Maximum,527
Zeros (%),26.1%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,1.0
Q3,3.3524
95-th percentile,18.882
Maximum,527.0
Range,527.0
Interquartile range,3.3524

0,1
Standard deviation,19.34
Coef of variation,3.7879
Kurtosis,226.91
Mean,5.1058
MAD,6.5845
Skewness,12.49
Sum,26264
Variance,374.04
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1342,26.1%,
1.0,23,0.4%,
2.0,19,0.4%,
1.5,11,0.2%,
3.0,9,0.2%,
0.5,9,0.2%,
11.0,9,0.2%,
0.6666666666666666,8,0.2%,
4.0,8,0.2%,
0.2,8,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1342,26.1%,
0.005181347150259,1,0.0%,
0.005813953488372,1,0.0%,
0.0066225165562913,1,0.0%,
0.0072463768115942,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
275.0,1,0.0%,
331.0,1,0.0%,
334.6666666666667,1,0.0%,
445.0,1,0.0%,
527.0,1,0.0%,

0,1
Distinct count,604
Unique (%),11.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.3698
Minimum,0
Maximum,1
Zeros (%),26.6%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.375
Q3,0.55556
95-th percentile,1.0
Maximum,1.0
Range,1.0
Interquartile range,0.55556

0,1
Standard deviation,0.29423
Coef of variation,0.79564
Kurtosis,-0.66814
Mean,0.3698
MAD,0.2415
Skewness,0.34079
Sum,1902.3
Variance,0.08657
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1370,26.6%,
1.0,314,6.1%,
0.5,302,5.9%,
0.3333333333333333,185,3.6%,
0.6666666666666666,157,3.1%,
0.25,113,2.2%,
0.75,104,2.0%,
0.4,103,2.0%,
0.42857142857142855,72,1.4%,
0.6,69,1.3%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1370,26.6%,
0.0666666666666666,1,0.0%,
0.0689655172413793,1,0.0%,
0.0714285714285714,2,0.0%,
0.0769230769230769,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.875,5,0.1%,
0.8888888888888888,3,0.1%,
0.9,4,0.1%,
0.9285714285714286,1,0.0%,
1.0,314,6.1%,

0,1
Distinct count,450
Unique (%),8.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.29882
Minimum,0
Maximum,1
Zeros (%),41.4%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.27586
Q3,0.5
95-th percentile,1.0
Maximum,1.0
Range,1.0
Interquartile range,0.5

0,1
Standard deviation,0.30765
Coef of variation,1.0295
Kurtosis,-0.55911
Mean,0.29882
MAD,0.26372
Skewness,0.66672
Sum,1537.1
Variance,0.094646
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,2129,41.4%,
1.0,310,6.0%,
0.5,293,5.7%,
0.3333333333333333,158,3.1%,
0.6666666666666666,144,2.8%,
0.25,109,2.1%,
0.4,96,1.9%,
0.6,75,1.5%,
0.75,68,1.3%,
0.2,59,1.1%,

Value,Count,Frequency (%),Unnamed: 3
0.0,2129,41.4%,
0.05,1,0.0%,
0.0588235294117647,1,0.0%,
0.074074074074074,1,0.0%,
0.0769230769230769,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.875,4,0.1%,
0.8888888888888888,6,0.1%,
0.9090909090909092,2,0.0%,
0.9166666666666666,1,0.0%,
1.0,310,6.0%,

0,1
Distinct count,36
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.0412
Minimum,0
Maximum,40
Zeros (%),40.6%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,1
Q3,4
95-th percentile,12
Maximum,40
Range,40
Interquartile range,4

0,1
Standard deviation,4.6328
Coef of variation,1.5233
Kurtosis,9.461
Mean,3.0412
MAD,3.2282
Skewness,2.6119
Sum,15644
Variance,21.462
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,2090,40.6%,
1,696,13.5%,
2,491,9.5%,
3,360,7.0%,
4,285,5.5%,
5,259,5.0%,
6,167,3.2%,
7,131,2.5%,
8,116,2.3%,
10,96,1.9%,

Value,Count,Frequency (%),Unnamed: 3
0,2090,40.6%,
1,696,13.5%,
2,491,9.5%,
3,360,7.0%,
4,285,5.5%,

Value,Count,Frequency (%),Unnamed: 3
31,1,0.0%,
34,2,0.0%,
35,2,0.0%,
39,2,0.0%,
40,2,0.0%,

0,1
Distinct count,33
Unique (%),0.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.1858
Minimum,0
Maximum,39
Zeros (%),52.1%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,3
95-th percentile,10
Maximum,39
Range,39
Interquartile range,3

0,1
Standard deviation,3.9447
Coef of variation,1.8047
Kurtosis,13.61
Mean,2.1858
MAD,2.6136
Skewness,3.1184
Sum,11244
Variance,15.561
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,2681,52.1%,
1,656,12.8%,
2,452,8.8%,
3,292,5.7%,
4,202,3.9%,
5,191,3.7%,
6,119,2.3%,
8,98,1.9%,
7,87,1.7%,
9,74,1.4%,

Value,Count,Frequency (%),Unnamed: 3
0,2681,52.1%,
1,656,12.8%,
2,452,8.8%,
3,292,5.7%,
4,202,3.9%,

Value,Count,Frequency (%),Unnamed: 3
30,1,0.0%,
31,1,0.0%,
35,2,0.0%,
36,1,0.0%,
39,2,0.0%,

0,1
Distinct count,76
Unique (%),1.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,8.0004
Minimum,0
Maximum,117
Zeros (%),28.7%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,4
Q3,11
95-th percentile,32
Maximum,117
Range,117
Interquartile range,11

0,1
Standard deviation,11.666
Coef of variation,1.4581
Kurtosis,12.393
Mean,8.0004
MAD,7.9869
Skewness,2.847
Sum,41154
Variance,136.09
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1477,28.7%,
1,426,8.3%,
3,339,6.6%,
2,307,6.0%,
4,252,4.9%,
5,233,4.5%,
8,171,3.3%,
6,168,3.3%,
7,163,3.2%,
9,149,2.9%,

Value,Count,Frequency (%),Unnamed: 3
0,1477,28.7%,
1,426,8.3%,
2,307,6.0%,
3,339,6.6%,
4,252,4.9%,

Value,Count,Frequency (%),Unnamed: 3
86,1,0.0%,
89,3,0.1%,
107,3,0.1%,
115,1,0.0%,
117,1,0.0%,

0,1
Distinct count,62
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,5.16
Minimum,0
Maximum,117
Zeros (%),41.9%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,1
Q3,7
95-th percentile,23
Maximum,117
Range,117
Interquartile range,7

0,1
Standard deviation,8.8136
Coef of variation,1.7081
Kurtosis,15.253
Mean,5.16
MAD,5.8359
Skewness,3.1573
Sum,26543
Variance,77.679
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,2154,41.9%,
1,477,9.3%,
2,304,5.9%,
3,298,5.8%,
4,233,4.5%,
5,227,4.4%,
6,153,3.0%,
7,139,2.7%,
8,130,2.5%,
10,102,2.0%,

Value,Count,Frequency (%),Unnamed: 3
0,2154,41.9%,
1,477,9.3%,
2,304,5.9%,
3,298,5.8%,
4,233,4.5%,

Value,Count,Frequency (%),Unnamed: 3
62,1,0.0%,
64,2,0.0%,
70,2,0.0%,
89,1,0.0%,
117,1,0.0%,

0,1
Distinct count,12
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.77877
Minimum,0
Maximum,13
Zeros (%),62.3%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,3
Maximum,13
Range,13
Interquartile range,1

0,1
Standard deviation,1.3583
Coef of variation,1.7441
Kurtosis,9.8637
Mean,0.77877
MAD,0.97104
Skewness,2.6367
Sum,4006
Variance,1.8449
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,3207,62.3%,
1,924,18.0%,
2,489,9.5%,
3,283,5.5%,
4,111,2.2%,
5,60,1.2%,
6,31,0.6%,
8,17,0.3%,
7,11,0.2%,
10,5,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0,3207,62.3%,
1,924,18.0%,
2,489,9.5%,
3,283,5.5%,
4,111,2.2%,

Value,Count,Frequency (%),Unnamed: 3
7,11,0.2%,
8,17,0.3%,
9,4,0.1%,
10,5,0.1%,
13,2,0.0%,

0,1
Distinct count,12
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.60653
Minimum,0
Maximum,13
Zeros (%),69.5%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,3
Maximum,13
Range,13
Interquartile range,1

0,1
Standard deviation,1.2143
Coef of variation,2.0021
Kurtosis,11.924
Mean,0.60653
MAD,0.8433
Skewness,2.9656
Sum,3120
Variance,1.4746
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,3576,69.5%,
1,799,15.5%,
2,390,7.6%,
3,194,3.8%,
4,76,1.5%,
5,54,1.0%,
6,28,0.5%,
7,13,0.3%,
8,7,0.1%,
10,3,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0,3576,69.5%,
1,799,15.5%,
2,390,7.6%,
3,194,3.8%,
4,76,1.5%,

Value,Count,Frequency (%),Unnamed: 3
7,13,0.3%,
8,7,0.1%,
9,3,0.1%,
10,3,0.1%,
13,1,0.0%,

0,1
Distinct count,10
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
Decision - Unanimous,1737
KO/TKO,1647
Submission,1083
Other values (7),677

Value,Count,Frequency (%),Unnamed: 3
Decision - Unanimous,1737,33.8%,
KO/TKO,1647,32.0%,
Submission,1083,21.1%,
Decision - Split,486,9.4%,
TKO - Doctor's Stoppage,70,1.4%,
Decision - Majority,56,1.1%,
Overturned,35,0.7%,
DQ,15,0.3%,
Could Not Continue,13,0.3%,
Other,2,0.0%,

0,1
Distinct count,5
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,2.2883
Minimum,1
Maximum,5
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,1
Median,3
Q3,3
95-th percentile,3
Maximum,5
Range,4
Interquartile range,2

0,1
Standard deviation,1.0037
Coef of variation,0.43864
Kurtosis,-0.39115
Mean,2.2883
MAD,0.88236
Skewness,0.14972
Sum,11771
Variance,1.0075
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
3,2526,49.1%,
1,1568,30.5%,
2,865,16.8%,
5,155,3.0%,
4,30,0.6%,

Value,Count,Frequency (%),Unnamed: 3
1,1568,30.5%,
2,865,16.8%,
3,2526,49.1%,
4,30,0.6%,
5,155,3.0%,

Value,Count,Frequency (%),Unnamed: 3
1,1568,30.5%,
2,865,16.8%,
3,2526,49.1%,
4,30,0.6%,
5,155,3.0%,

0,1
Distinct count,333
Unique (%),6.5%
Missing (%),0.0%
Missing (n),0

0,1
5:00,2315
4:59,32
3:00,29
Other values (330),2768

Value,Count,Frequency (%),Unnamed: 3
5:00,2315,45.0%,
4:59,32,0.6%,
3:00,29,0.6%,
1:54,23,0.4%,
2:38,20,0.4%,
0:39,19,0.4%,
4:21,18,0.3%,
1:01,18,0.3%,
2:18,18,0.3%,
1:07,17,0.3%,

0,1
Distinct count,19
Unique (%),0.4%
Missing (%),0.0%
Missing (n),0

0,1
3 Rnd (5-5-5),4502
5 Rnd (5-5-5-5-5),423
1 Rnd + OT (12-3),79
Other values (16),140

Value,Count,Frequency (%),Unnamed: 3
3 Rnd (5-5-5),4502,87.5%,
5 Rnd (5-5-5-5-5),423,8.2%,
1 Rnd + OT (12-3),79,1.5%,
No Time Limit,37,0.7%,
3 Rnd + OT (5-5-5-5),22,0.4%,
1 Rnd (20),20,0.4%,
1 Rnd + 2OT (15-3-3),20,0.4%,
2 Rnd (5-5),11,0.2%,
1 Rnd (15),8,0.2%,
1 Rnd (10),6,0.1%,

0,1
Distinct count,476
Unique (%),9.3%
Missing (%),0.0%
Missing (n),0

0,1
2016-11-19,25
2014-10-04,23
2014-05-31,22
Other values (473),5074

Value,Count,Frequency (%),Unnamed: 3
2016-11-19,25,0.5%,
2014-10-04,23,0.4%,
2014-05-31,22,0.4%,
2014-08-23,21,0.4%,
2014-06-28,21,0.4%,
1994-03-11,15,0.3%,
2018-04-14,14,0.3%,
2018-09-22,14,0.3%,
2018-07-22,13,0.3%,
2015-11-21,13,0.3%,

0,1
Distinct count,157
Unique (%),3.1%
Missing (%),0.0%
Missing (n),0

0,1
"Las Vegas, Nevada, USA",1216
"London, England, United Kingdom",114
"Chicago, Illinois, USA",81
Other values (154),3733

Value,Count,Frequency (%),Unnamed: 3
"Las Vegas, Nevada, USA",1216,23.6%,
"London, England, United Kingdom",114,2.2%,
"Chicago, Illinois, USA",81,1.6%,
"Montreal, Quebec, Canada",81,1.6%,
"Atlantic City, New Jersey, USA",80,1.6%,
"Los Angeles, California, USA",79,1.5%,
"Newark, New Jersey, USA",78,1.5%,
"Denver, Colorado, USA",74,1.4%,
"Toronto, Ontario, Canada",74,1.4%,
"Stockholm, Sweden",72,1.4%,

0,1
Distinct count,1268
Unique (%),24.7%
Missing (%),0.0%
Missing (n),0

0,1
Draw,83
Donald Cerrone,23
Demian Maia,20
Other values (1265),5018

Value,Count,Frequency (%),Unnamed: 3
Draw,83,1.6%,
Donald Cerrone,23,0.4%,
Demian Maia,20,0.4%,
Georges St-Pierre,20,0.4%,
Michael Bisping,20,0.4%,
Jim Miller,19,0.4%,
Diego Sanchez,18,0.3%,
Rafael Dos Anjos,18,0.3%,
Jon Jones,18,0.3%,
Matt Hughes,18,0.3%,

0,1
Distinct count,19
Unique (%),0.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.5632
Minimum,0
Maximum,18
Zeros (%),48.2%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,1
Q3,2
95-th percentile,7
Maximum,18
Range,18
Interquartile range,2

0,1
Standard deviation,2.4008
Coef of variation,1.5358
Kurtosis,6.9613
Mean,1.5632
MAD,1.713
Skewness,2.3583
Sum,8041
Variance,5.7638
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,2481,48.2%,
1,937,18.2%,
2,617,12.0%,
3,341,6.6%,
4,242,4.7%,
6,126,2.4%,
5,126,2.4%,
8,79,1.5%,
7,73,1.4%,
9,36,0.7%,

Value,Count,Frequency (%),Unnamed: 3
0,2481,48.2%,
1,937,18.2%,
2,617,12.0%,
3,341,6.6%,
4,242,4.7%,

Value,Count,Frequency (%),Unnamed: 3
14,5,0.1%,
15,1,0.0%,
16,2,0.0%,
17,5,0.1%,
18,2,0.0%,

0,1
Correlation,1

0,1
Distinct count,18
Unique (%),0.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.0931
Minimum,0
Maximum,36
Zeros (%),60.3%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,0
Q3,1
95-th percentile,5
Maximum,36
Range,36
Interquartile range,1

0,1
Standard deviation,2.0989
Coef of variation,1.9201
Kurtosis,26.643
Mean,1.0931
MAD,1.3476
Skewness,3.7743
Sum,5623
Variance,4.4053
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,3101,60.3%,
1,820,15.9%,
2,467,9.1%,
3,265,5.2%,
4,165,3.2%,
5,101,2.0%,
6,59,1.1%,
7,49,1.0%,
8,40,0.8%,
9,25,0.5%,

Value,Count,Frequency (%),Unnamed: 3
0,3101,60.3%,
1,820,15.9%,
2,467,9.1%,
3,265,5.2%,
4,165,3.2%,

Value,Count,Frequency (%),Unnamed: 3
13,7,0.1%,
14,5,0.1%,
17,3,0.1%,
18,5,0.1%,
36,1,0.0%,

0,1
Correlation,1

0,1
Distinct count,773
Unique (%),15.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,182.06
Minimum,0
Maximum,1627
Zeros (%),13.5%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,29.0
Median,108.0
Q3,262.0
95-th percentile,620.85
Maximum,1627.0
Range,1627.0
Interquartile range,233.0

0,1
Standard deviation,212.28
Coef of variation,1.166
Kurtosis,5.017
Mean,182.06
MAD,156.75
Skewness,1.9727
Sum,936503
Variance,45065
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,697,13.5%,
12,29,0.6%,
25,28,0.5%,
10,28,0.5%,
36,27,0.5%,
38,27,0.5%,
9,27,0.5%,
19,27,0.5%,
47,26,0.5%,
22,26,0.5%,

Value,Count,Frequency (%),Unnamed: 3
0,697,13.5%,
1,21,0.4%,
2,21,0.4%,
3,24,0.5%,
4,25,0.5%,

Value,Count,Frequency (%),Unnamed: 3
1405,1,0.0%,
1414,1,0.0%,
1533,1,0.0%,
1560,1,0.0%,
1627,1,0.0%,

0,1
Distinct count,322
Unique (%),6.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,77.073
Minimum,0
Maximum,490
Zeros (%),0.7%

0,1
Minimum,0
5-th percentile,5
Q1,26
Median,61
Q3,109
95-th percentile,206
Maximum,490
Range,490
Interquartile range,83

0,1
Standard deviation,66.137
Coef of variation,0.85812
Kurtosis,2.8629
Mean,77.073
MAD,50.92
Skewness,1.4479
Sum,396463
Variance,4374.2
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
8,66,1.3%,
28,62,1.2%,
17,62,1.2%,
4,59,1.1%,
10,59,1.1%,
1,56,1.1%,
11,55,1.1%,
7,54,1.0%,
12,52,1.0%,
9,52,1.0%,

Value,Count,Frequency (%),Unnamed: 3
0,35,0.7%,
1,56,1.1%,
2,24,0.5%,
3,49,1.0%,
4,59,1.1%,

Value,Count,Frequency (%),Unnamed: 3
441,1,0.0%,
446,1,0.0%,
454,1,0.0%,
456,1,0.0%,
490,1,0.0%,

0,1
Distinct count,653
Unique (%),12.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,126.17
Minimum,0
Maximum,2624
Zeros (%),25.8%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,60
Q3,172
95-th percentile,487
Maximum,2624
Range,2624
Interquartile range,172

0,1
Standard deviation,179.26
Coef of variation,1.4208
Kurtosis,14.102
Mean,126.17
MAD,124.71
Skewness,2.7894
Sum,649011
Variance,32135
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1326,25.8%,
33,32,0.6%,
16,31,0.6%,
32,31,0.6%,
2,30,0.6%,
11,30,0.6%,
1,29,0.6%,
38,28,0.5%,
13,28,0.5%,
4,28,0.5%,

Value,Count,Frequency (%),Unnamed: 3
0,1326,25.8%,
1,29,0.6%,
2,30,0.6%,
3,21,0.4%,
4,28,0.5%,

Value,Count,Frequency (%),Unnamed: 3
1267,1,0.0%,
1285,1,0.0%,
1312,1,0.0%,
1393,1,0.0%,
2624,1,0.0%,

0,1
Distinct count,317
Unique (%),6.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,71.416
Minimum,0
Maximum,495
Zeros (%),1.0%

0,1
Minimum,0
5-th percentile,3
Q1,21
Median,54
Q3,105
95-th percentile,197
Maximum,495
Range,495
Interquartile range,84

0,1
Standard deviation,64.64
Coef of variation,0.90511
Kurtosis,2.3684
Mean,71.416
MAD,50.587
Skewness,1.3822
Sum,367366
Variance,4178.3
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
2,93,1.8%,
3,88,1.7%,
4,79,1.5%,
6,78,1.5%,
1,72,1.4%,
7,72,1.4%,
5,69,1.3%,
10,66,1.3%,
9,62,1.2%,
16,59,1.1%,

Value,Count,Frequency (%),Unnamed: 3
0,52,1.0%,
1,72,1.4%,
2,93,1.8%,
3,88,1.7%,
4,79,1.5%,

Value,Count,Frequency (%),Unnamed: 3
377,1,0.0%,
381,1,0.0%,
401,2,0.0%,
403,1,0.0%,
495,1,0.0%,

0,1
Correlation,0.93411

0,1
Correlation,0.96943

0,1
Correlation,0.94282

0,1
Correlation,0.97676

0,1
Distinct count,74
Unique (%),1.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.4774
Minimum,0
Maximum,84
Zeros (%),26.6%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,3
Q3,10
95-th percentile,29
Maximum,84
Range,84
Interquartile range,10

0,1
Standard deviation,10.659
Coef of variation,1.4254
Kurtosis,8.1784
Mean,7.4774
MAD,7.3823
Skewness,2.5355
Sum,38464
Variance,113.61
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1370,26.6%,
1,509,9.9%,
2,373,7.3%,
3,328,6.4%,
4,254,4.9%,
6,250,4.9%,
5,244,4.7%,
7,162,3.1%,
8,160,3.1%,
9,143,2.8%,

Value,Count,Frequency (%),Unnamed: 3
0,1370,26.6%,
1,509,9.9%,
2,373,7.3%,
3,328,6.4%,
4,254,4.9%,

Value,Count,Frequency (%),Unnamed: 3
71,2,0.0%,
73,1,0.0%,
75,2,0.0%,
82,1,0.0%,
84,1,0.0%,

0,1
Correlation,0.93083

0,1
Distinct count,64
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.9374
Minimum,0
Maximum,87
Zeros (%),41.4%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,1
Q3,6
95-th percentile,22
Maximum,87
Range,87
Interquartile range,6

0,1
Standard deviation,8.5708
Coef of variation,1.7359
Kurtosis,15.558
Mean,4.9374
MAD,5.582
Skewness,3.2871
Sum,25398
Variance,73.458
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,2129,41.4%,
1,497,9.7%,
2,368,7.2%,
3,304,5.9%,
4,233,4.5%,
5,212,4.1%,
6,170,3.3%,
7,147,2.9%,
8,113,2.2%,
9,109,2.1%,

Value,Count,Frequency (%),Unnamed: 3
0,2129,41.4%,
1,497,9.7%,
2,368,7.2%,
3,304,5.9%,
4,233,4.5%,

Value,Count,Frequency (%),Unnamed: 3
69,1,0.0%,
80,1,0.0%,
82,1,0.0%,
84,2,0.0%,
87,1,0.0%,

0,1
Correlation,0.94271

0,1
Correlation,0.95468

0,1
Correlation,0.96439

0,1
Correlation,0.96123

0,1
Correlation,0.97322

0,1
Correlation,0.9117

0,1
Correlation,0.9856

0,1
Correlation,0.91697

0,1
Correlation,0.98776

0,1
Distinct count,228
Unique (%),4.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,29.439
Minimum,0
Maximum,358
Zeros (%),19.0%

0,1
Minimum,0
5-th percentile,0
Q1,2
Median,13
Q3,39
95-th percentile,116
Maximum,358
Range,358
Interquartile range,37

0,1
Standard deviation,42.219
Coef of variation,1.4341
Kurtosis,9.5079
Mean,29.439
MAD,28.977
Skewness,2.6663
Sum,151434
Variance,1782.4
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,978,19.0%,
1,225,4.4%,
2,185,3.6%,
3,175,3.4%,
4,127,2.5%,
5,124,2.4%,
7,110,2.1%,
6,107,2.1%,
8,106,2.1%,
10,104,2.0%,

Value,Count,Frequency (%),Unnamed: 3
0,978,19.0%,
1,225,4.4%,
2,185,3.6%,
3,175,3.4%,
4,127,2.5%,

Value,Count,Frequency (%),Unnamed: 3
323,1,0.0%,
324,1,0.0%,
336,1,0.0%,
337,1,0.0%,
358,1,0.0%,

0,1
Correlation,0.99334

0,1
Distinct count,194
Unique (%),3.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,19.613
Minimum,0
Maximum,688
Zeros (%),31.8%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,6
Q3,25
95-th percentile,82
Maximum,688
Range,688
Interquartile range,25

0,1
Standard deviation,34.26
Coef of variation,1.7468
Kurtosis,42.53
Mean,19.613
MAD,21.536
Skewness,4.4853
Sum,100889
Variance,1173.7
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1638,31.8%,
1,209,4.1%,
2,175,3.4%,
3,164,3.2%,
5,140,2.7%,
7,138,2.7%,
4,135,2.6%,
6,127,2.5%,
8,102,2.0%,
10,89,1.7%,

Value,Count,Frequency (%),Unnamed: 3
0,1638,31.8%,
1,209,4.1%,
2,175,3.4%,
3,164,3.2%,
4,135,2.6%,

Value,Count,Frequency (%),Unnamed: 3
310,2,0.0%,
336,1,0.0%,
340,1,0.0%,
344,1,0.0%,
688,1,0.0%,

0,1
Correlation,0.99405

0,1
Correlation,0.93848

0,1
Correlation,0.97415

0,1
Correlation,0.94553

0,1
Correlation,0.97875

0,1
Distinct count,206
Unique (%),4.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,30.905
Minimum,0
Maximum,352
Zeros (%),18.4%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,3.0
Median,16.5
Q3,44.0
95-th percentile,111.0
Maximum,352.0
Range,352.0
Interquartile range,41.0

0,1
Standard deviation,39.342
Coef of variation,1.273
Kurtosis,7.0315
Mean,30.905
MAD,28.71
Skewness,2.196
Sum,158973
Variance,1547.8
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,949,18.4%,
1,160,3.1%,
2,136,2.6%,
4,135,2.6%,
3,135,2.6%,
5,121,2.4%,
7,112,2.2%,
6,108,2.1%,
9,105,2.0%,
12,85,1.7%,

Value,Count,Frequency (%),Unnamed: 3
0,949,18.4%,
1,160,3.1%,
2,136,2.6%,
3,135,2.6%,
4,135,2.6%,

Value,Count,Frequency (%),Unnamed: 3
315,1,0.0%,
330,1,0.0%,
334,1,0.0%,
342,1,0.0%,
352,1,0.0%,

0,1
Correlation,0.98721

0,1
Distinct count,187
Unique (%),3.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,22.174
Minimum,0
Maximum,310
Zeros (%),31.0%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,8
Q3,30
95-th percentile,95
Maximum,310
Range,310
Interquartile range,30

0,1
Standard deviation,33.646
Coef of variation,1.5173
Kurtosis,7.6821
Mean,22.174
MAD,23.568
Skewness,2.4754
Sum,114065
Variance,1132.1
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1594,31.0%,
1,173,3.4%,
3,159,3.1%,
2,140,2.7%,
4,126,2.4%,
5,120,2.3%,
7,101,2.0%,
9,94,1.8%,
6,93,1.8%,
8,91,1.8%,

Value,Count,Frequency (%),Unnamed: 3
0,1594,31.0%,
1,173,3.4%,
2,140,2.7%,
3,159,3.1%,
4,126,2.4%,

Value,Count,Frequency (%),Unnamed: 3
221,1,0.0%,
228,1,0.0%,
235,1,0.0%,
266,1,0.0%,
310,1,0.0%,

0,1
Correlation,0.9889

0,1
Distinct count,241
Unique (%),4.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,36.762
Minimum,0
Maximum,441
Zeros (%),20.4%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,2.0
Median,18.0
Q3,51.0
95-th percentile,137.85
Maximum,441.0
Range,441.0
Interquartile range,49.0

0,1
Standard deviation,50.199
Coef of variation,1.3655
Kurtosis,9.0343
Mean,36.762
MAD,35.346
Skewness,2.5337
Sum,189105
Variance,2519.9
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1048,20.4%,
1,153,3.0%,
2,132,2.6%,
3,120,2.3%,
4,110,2.1%,
5,101,2.0%,
7,92,1.8%,
6,89,1.7%,
16,83,1.6%,
10,83,1.6%,

Value,Count,Frequency (%),Unnamed: 3
0,1048,20.4%,
1,153,3.0%,
2,132,2.6%,
3,120,2.3%,
4,110,2.1%,

Value,Count,Frequency (%),Unnamed: 3
362,2,0.0%,
364,1,0.0%,
383,2,0.0%,
409,1,0.0%,
441,1,0.0%,

0,1
Correlation,0.98943

0,1
Distinct count,204
Unique (%),4.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,24.529
Minimum,0
Maximum,442
Zeros (%),33.8%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,7.5
Q3,32.0
95-th percentile,106.85
Maximum,442.0
Range,442.0
Interquartile range,32.0

0,1
Standard deviation,40.223
Coef of variation,1.6398
Kurtosis,12.941
Mean,24.529
MAD,26.912
Skewness,3.0168
Sum,126177
Variance,1617.9
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0,1741,33.8%,
1,169,3.3%,
2,140,2.7%,
3,117,2.3%,
4,113,2.2%,
6,99,1.9%,
7,97,1.9%,
5,96,1.9%,
11,87,1.7%,
8,77,1.5%,

Value,Count,Frequency (%),Unnamed: 3
0,1741,33.8%,
1,169,3.3%,
2,140,2.7%,
3,117,2.3%,
4,113,2.2%,

Value,Count,Frequency (%),Unnamed: 3
310,1,0.0%,
312,2,0.0%,
314,2,0.0%,
335,1,0.0%,
442,1,0.0%,

0,1
Correlation,0.99001

0,1
Distinct count,12
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0

0,1
Heavyweight,1117
Lightweight,989
Welterweight,969
Other values (9),2069

Value,Count,Frequency (%),Unnamed: 3
Heavyweight,1117,21.7%,
Lightweight,989,19.2%,
Welterweight,969,18.8%,
Middleweight,725,14.1%,
Featherweight,442,8.6%,
Bantamweight,379,7.4%,
Flyweight,187,3.6%,
Women's Strawweight,143,2.8%,
Women's Bantamweight,111,2.2%,
Women's Flyweight,50,1.0%,

0,1
Distinct count,21
Unique (%),0.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,7.1734
Minimum,0
Maximum,40
Zeros (%),21.2%

0,1
Minimum,0
5-th percentile,0
Q1,2
Median,4
Q3,10
95-th percentile,22
Maximum,40
Range,40
Interquartile range,8

0,1
Standard deviation,7.4256
Coef of variation,1.0352
Kurtosis,1.568
Mean,7.1734
MAD,5.8292
Skewness,1.3452
Sum,36900
Variance,55.139
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1089,21.2%,
2.0,870,16.9%,
4.0,643,12.5%,
6.0,545,10.6%,
8.0,421,8.2%,
10.0,333,6.5%,
12.0,259,5.0%,
14.0,211,4.1%,
16.0,172,3.3%,
18.0,158,3.1%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1089,21.2%,
2.0,870,16.9%,
4.0,643,12.5%,
6.0,545,10.6%,
8.0,421,8.2%,

Value,Count,Frequency (%),Unnamed: 3
32.0,21,0.4%,
34.0,10,0.2%,
36.0,7,0.1%,
38.0,7,0.1%,
40.0,4,0.1%,

0,1
Distinct count,23
Unique (%),0.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,4.9487
Minimum,0
Maximum,46
Zeros (%),35.8%

0,1
Minimum,0
5-th percentile,0
Q1,0
Median,2
Q3,8
95-th percentile,18
Maximum,46
Range,46
Interquartile range,8

0,1
Standard deviation,6.4394
Coef of variation,1.3012
Kurtosis,4.0168
Mean,4.9487
MAD,4.8159
Skewness,1.8821
Sum,25456
Variance,41.466
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1839,35.8%,
2.0,920,17.9%,
4.0,604,11.7%,
6.0,459,8.9%,
8.0,328,6.4%,
10.0,214,4.2%,
12.0,201,3.9%,
14.0,140,2.7%,
16.0,114,2.2%,
18.0,101,2.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1839,35.8%,
2.0,920,17.9%,
4.0,604,11.7%,
6.0,459,8.9%,
8.0,328,6.4%,

Value,Count,Frequency (%),Unnamed: 3
36.0,1,0.0%,
38.0,3,0.1%,
42.0,1,0.0%,
44.0,1,0.0%,
46.0,1,0.0%,

0,1
Correlation,0.95943

0,1
Correlation,0.96216

0,1
Distinct count,115
Unique (%),2.2%
Missing (%),13.1%
Missing (n),673
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.2471
Minimum,0
Maximum,2
Zeros (%),8.1%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,1.0
Median,1.3333
Q3,1.6
95-th percentile,2.0
Maximum,2.0
Range,2.0
Interquartile range,0.6

0,1
Standard deviation,0.55381
Coef of variation,0.44406
Kurtosis,0.15351
Mean,1.2471
MAD,0.42323
Skewness,-0.68554
Sum,5576
Variance,0.30671
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
2.0,784,15.2%,
1.0,646,12.6%,
1.3333333333333333,458,8.9%,
0.0,416,8.1%,
1.5,297,5.8%,
1.2,185,3.6%,
0.6666666666666666,162,3.1%,
1.6,141,2.7%,
1.4285714285714286,118,2.3%,
1.1428571428571428,94,1.8%,

Value,Count,Frequency (%),Unnamed: 3
0.0,416,8.1%,
0.3333333333333333,4,0.1%,
0.4,13,0.3%,
0.5,40,0.8%,
0.5714285714285714,8,0.2%,

Value,Count,Frequency (%),Unnamed: 3
1.8461538461538465,3,0.1%,
1.8571428571428568,2,0.0%,
1.8666666666666667,2,0.0%,
1.875,1,0.0%,
2.0,784,15.2%,

0,1
Distinct count,105
Unique (%),2.0%
Missing (%),25.1%
Missing (n),1291
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.1727
Minimum,0
Maximum,2
Zeros (%),10.7%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.94118
Median,1.2308
Q3,1.6
95-th percentile,2.0
Maximum,2.0
Range,2.0
Interquartile range,0.65882

0,1
Standard deviation,0.626
Coef of variation,0.53382
Kurtosis,-0.55571
Mean,1.1727
MAD,0.49756
Skewness,-0.50232
Sum,4518.3
Variance,0.39187
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
2.0,746,14.5%,
1.0,593,11.5%,
0.0,548,10.7%,
1.3333333333333333,354,6.9%,
1.5,187,3.6%,
0.6666666666666666,177,3.4%,
1.2,161,3.1%,
1.6,122,2.4%,
0.8,77,1.5%,
1.1428571428571428,70,1.4%,

Value,Count,Frequency (%),Unnamed: 3
0.0,548,10.7%,
0.2857142857142857,1,0.0%,
0.3333333333333333,6,0.1%,
0.4,17,0.3%,
0.5,52,1.0%,

Value,Count,Frequency (%),Unnamed: 3
1.818181818181818,2,0.0%,
1.8333333333333333,3,0.1%,
1.8571428571428568,1,0.0%,
1.8823529411764703,2,0.0%,
2.0,746,14.5%,

0,1
Distinct count,3814
Unique (%),74.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.48717
Minimum,0
Maximum,1
Zeros (%),13.3%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.4274
Median,0.53191
Q3,0.625
95-th percentile,0.77674
Maximum,1.0
Range,1.0
Interquartile range,0.1976

0,1
Standard deviation,0.22593
Coef of variation,0.46375
Kurtosis,0.43253
Mean,0.48717
MAD,0.1659
Skewness,-0.98544
Sum,2506
Variance,0.051042
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,685,13.3%,
0.5,45,0.9%,
0.6,17,0.3%,
0.8,15,0.3%,
0.75,12,0.2%,
0.6666666666666666,11,0.2%,
0.5454545454545454,10,0.2%,
1.0,10,0.2%,
0.625,9,0.2%,
0.627906976744186,8,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,685,13.3%,
0.0833333333333333,1,0.0%,
0.0909090909090909,1,0.0%,
0.0952380952380952,1,0.0%,
0.1333333333333333,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.9333333333333332,1,0.0%,
0.934640522875817,1,0.0%,
0.9411764705882352,1,0.0%,
0.945945945945946,1,0.0%,
1.0,10,0.2%,

0,1
Distinct count,3175
Unique (%),61.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.41624
Minimum,0
Maximum,1
Zeros (%),25.4%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.49973
Q3,0.606
95-th percentile,0.76071
Maximum,1.0
Range,1.0
Interquartile range,0.606

0,1
Standard deviation,0.26842
Coef of variation,0.64488
Kurtosis,-1.005
Mean,0.41624
MAD,0.22476
Skewness,-0.54405
Sum,2141.1
Variance,0.07205
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1308,25.4%,
0.5,37,0.7%,
0.6,25,0.5%,
0.5714285714285714,16,0.3%,
0.625,14,0.3%,
0.75,14,0.3%,
0.3333333333333333,14,0.3%,
0.8,13,0.3%,
0.6666666666666666,12,0.2%,
0.5555555555555556,10,0.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1308,25.4%,
0.0666666666666666,1,0.0%,
0.0909090909090909,1,0.0%,
0.1111111111111111,2,0.0%,
0.1300813008130081,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.9666666666666668,1,0.0%,
0.9705882352941176,1,0.0%,
0.9761904761904762,1,0.0%,
0.9863013698630136,1,0.0%,
1.0,8,0.2%,

0,1
Distinct count,3267
Unique (%),63.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.32989
Minimum,0
Maximum,1
Zeros (%),13.8%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.27095
Median,0.34838
Q3,0.42618
95-th percentile,0.58333
Maximum,1.0
Range,1.0
Interquartile range,0.15523

0,1
Standard deviation,0.17166
Coef of variation,0.52035
Kurtosis,0.56549
Mean,0.32989
MAD,0.12564
Skewness,-0.29245
Sum,1697
Variance,0.029466
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,709,13.8%,
0.5,55,1.1%,
0.3333333333333333,44,0.9%,
0.25,31,0.6%,
0.4,27,0.5%,
0.2857142857142857,20,0.4%,
0.6666666666666666,20,0.4%,
0.42857142857142855,16,0.3%,
0.6,13,0.3%,
0.5714285714285714,13,0.3%,

Value,Count,Frequency (%),Unnamed: 3
0.0,709,13.8%,
0.0344827586206896,1,0.0%,
0.0714285714285714,1,0.0%,
0.0721649484536082,1,0.0%,
0.0786516853932584,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.8421052631578947,1,0.0%,
0.85,1,0.0%,
0.875,4,0.1%,
0.9166666666666666,1,0.0%,
1.0,10,0.2%,

0,1
Correlation,0.90464

0,1
Distinct count,1320
Unique (%),25.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.59661
Minimum,0
Maximum,1
Zeros (%),16.8%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.52941
Median,0.67974
Q3,0.79273
95-th percentile,1.0
Maximum,1.0
Range,1.0
Interquartile range,0.26331

0,1
Standard deviation,0.30134
Coef of variation,0.50508
Kurtosis,-0.058613
Mean,0.59661
MAD,0.23063
Skewness,-1.0216
Sum,3069
Variance,0.090804
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,866,16.8%,
1.0,307,6.0%,
0.6666666666666666,152,3.0%,
0.5,118,2.3%,
0.75,112,2.2%,
0.8,93,1.8%,
0.6,67,1.3%,
0.7142857142857143,59,1.1%,
0.8333333333333334,57,1.1%,
0.8571428571428571,46,0.9%,

Value,Count,Frequency (%),Unnamed: 3
0.0,866,16.8%,
0.1666666666666666,1,0.0%,
0.2,5,0.1%,
0.25,11,0.2%,
0.2608695652173913,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.9813084112149532,1,0.0%,
0.9818181818181818,1,0.0%,
0.9824561403508772,1,0.0%,
0.9827586206896552,1,0.0%,
1.0,307,6.0%,

0,1
Distinct count,1039
Unique (%),20.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.50922
Minimum,0
Maximum,1
Zeros (%),29.2%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.63985
Q3,0.77273
95-th percentile,1.0
Maximum,1.0
Range,1.0
Interquartile range,0.77273

0,1
Standard deviation,0.35257
Coef of variation,0.69237
Kurtosis,-1.2892
Mean,0.50922
MAD,0.30827
Skewness,-0.49417
Sum,2619.4
Variance,0.1243
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1501,29.2%,
1.0,315,6.1%,
0.6666666666666666,130,2.5%,
0.5,117,2.3%,
0.75,112,2.2%,
0.8,77,1.5%,
0.6,65,1.3%,
0.7142857142857143,63,1.2%,
0.8333333333333334,52,1.0%,
0.625,50,1.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1501,29.2%,
0.125,3,0.1%,
0.1666666666666666,1,0.0%,
0.2,4,0.1%,
0.2142857142857142,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.9696969696969696,2,0.0%,
0.9722222222222222,1,0.0%,
0.9824561403508772,1,0.0%,
0.9838709677419356,1,0.0%,
1.0,315,6.1%,

0,1
Distinct count,959
Unique (%),18.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.65682
Minimum,0
Maximum,1
Zeros (%),19.0%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.6
Median,0.78462
Q3,0.88249
95-th percentile,1.0
Maximum,1.0
Range,1.0
Interquartile range,0.28249

0,1
Standard deviation,0.34239
Coef of variation,0.52128
Kurtosis,-0.21593
Mean,0.65682
MAD,0.27191
Skewness,-1.1329
Sum,3378.7
Variance,0.11723
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,978,19.0%,
1.0,648,12.6%,
0.8,116,2.3%,
0.6666666666666666,114,2.2%,
0.75,105,2.0%,
0.5,102,2.0%,
0.8333333333333334,91,1.8%,
0.8571428571428571,86,1.7%,
0.875,71,1.4%,
0.7777777777777778,60,1.2%,

Value,Count,Frequency (%),Unnamed: 3
0.0,978,19.0%,
0.1818181818181818,1,0.0%,
0.2,3,0.1%,
0.25,7,0.1%,
0.28,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.9827586206896552,1,0.0%,
0.9830508474576272,1,0.0%,
0.9836065573770492,3,0.1%,
0.9863013698630136,1,0.0%,
1.0,648,12.6%,

0,1
Distinct count,739
Unique (%),14.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.55404
Minimum,0
Maximum,1
Zeros (%),31.8%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.74419
Q3,0.86785
95-th percentile,1.0
Maximum,1.0
Range,1.0
Interquartile range,0.86785

0,1
Standard deviation,0.39697
Coef of variation,0.71649
Kurtosis,-1.4636
Mean,0.55404
MAD,0.36046
Skewness,-0.52883
Sum,2850
Variance,0.15758
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1638,31.8%,
1.0,621,12.1%,
0.8,122,2.4%,
0.75,103,2.0%,
0.5,103,2.0%,
0.6666666666666666,103,2.0%,
0.8333333333333334,91,1.8%,
0.8571428571428571,86,1.7%,
0.875,65,1.3%,
0.7142857142857143,51,1.0%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1638,31.8%,
0.1428571428571428,1,0.0%,
0.1666666666666666,1,0.0%,
0.2,2,0.0%,
0.25,5,0.1%,

Value,Count,Frequency (%),Unnamed: 3
0.9807692307692308,1,0.0%,
0.9827586206896552,1,0.0%,
0.9830508474576272,1,0.0%,
0.9833333333333332,1,0.0%,
1.0,621,12.1%,

0,1
Distinct count,3107
Unique (%),60.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.31926
Minimum,0
Maximum,1
Zeros (%),14.5%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.26531
Median,0.34459
Q3,0.41341
95-th percentile,0.53232
Maximum,1.0
Range,1.0
Interquartile range,0.14811

0,1
Standard deviation,0.16678
Coef of variation,0.52239
Kurtosis,1.1319
Mean,0.31926
MAD,0.12156
Skewness,-0.26358
Sum,1642.3
Variance,0.027816
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,744,14.5%,
0.5,77,1.5%,
0.3333333333333333,71,1.4%,
0.4,35,0.7%,
0.25,32,0.6%,
0.2857142857142857,24,0.5%,
1.0,24,0.5%,
0.375,23,0.4%,
0.4545454545454545,19,0.4%,
0.42857142857142855,19,0.4%,

Value,Count,Frequency (%),Unnamed: 3
0.0,744,14.5%,
0.027027027027027,1,0.0%,
0.0555555555555555,1,0.0%,
0.0681818181818181,1,0.0%,
0.0729166666666666,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.8181818181818182,2,0.0%,
0.8333333333333334,1,0.0%,
0.8571428571428571,1,0.0%,
0.9285714285714286,1,0.0%,
1.0,24,0.5%,

0,1
Distinct count,2514
Unique (%),48.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.26885
Minimum,0
Maximum,1
Zeros (%),26.9%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.32
Q3,0.39669
95-th percentile,0.51515
Maximum,1.0
Range,1.0
Interquartile range,0.39669

0,1
Standard deviation,0.18766
Coef of variation,0.69802
Kurtosis,-0.41659
Mean,0.26885
MAD,0.15591
Skewness,-0.13906
Sum,1382.9
Variance,0.035217
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1383,26.9%,
0.5,74,1.4%,
0.3333333333333333,59,1.1%,
0.25,36,0.7%,
0.4,29,0.6%,
0.2,24,0.5%,
0.42857142857142855,23,0.4%,
0.375,21,0.4%,
0.2857142857142857,18,0.3%,
0.3636363636363637,16,0.3%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1383,26.9%,
0.0476190476190476,1,0.0%,
0.0526315789473684,1,0.0%,
0.0555555555555555,1,0.0%,
0.0625,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.8367346938775511,1,0.0%,
0.8571428571428571,2,0.0%,
0.875,1,0.0%,
0.9,1,0.0%,
1.0,15,0.3%,

0,1
Distinct count,1194
Unique (%),23.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.55225
Minimum,0
Maximum,1
Zeros (%),18.4%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.48044
Median,0.63889
Q3,0.73913
95-th percentile,0.94961
Maximum,1.0
Range,1.0
Interquartile range,0.25869

0,1
Standard deviation,0.2957
Coef of variation,0.53545
Kurtosis,-0.31823
Mean,0.55225
MAD,0.23106
Skewness,-0.87485
Sum,2840.8
Variance,0.087441
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,949,18.4%,
1.0,243,4.7%,
0.6666666666666666,187,3.6%,
0.5,161,3.1%,
0.75,116,2.3%,
0.6,86,1.7%,
0.8,61,1.2%,
0.5714285714285714,56,1.1%,
0.7142857142857143,53,1.0%,
0.7,46,0.9%,

Value,Count,Frequency (%),Unnamed: 3
0.0,949,18.4%,
0.1666666666666666,4,0.1%,
0.1875,1,0.0%,
0.2,10,0.2%,
0.2222222222222222,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.9761904761904762,1,0.0%,
0.9803921568627452,1,0.0%,
0.981132075471698,1,0.0%,
0.9824561403508772,2,0.0%,
1.0,243,4.7%,

0,1
Distinct count,977
Unique (%),19.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.46665
Minimum,0
Maximum,1
Zeros (%),31.0%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.6
Q3,0.7234
95-th percentile,0.91783
Maximum,1.0
Range,1.0
Interquartile range,0.7234

0,1
Standard deviation,0.33844
Coef of variation,0.72524
Kurtosis,-1.3537
Mean,0.46665
MAD,0.30002
Skewness,-0.38329
Sum,2400.5
Variance,0.11454
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1594,31.0%,
1.0,212,4.1%,
0.5,161,3.1%,
0.6666666666666666,155,3.0%,
0.75,125,2.4%,
0.6,78,1.5%,
0.8,65,1.3%,
0.3333333333333333,56,1.1%,
0.625,53,1.0%,
0.8333333333333334,45,0.9%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1594,31.0%,
0.1111111111111111,1,0.0%,
0.125,1,0.0%,
0.1666666666666666,2,0.0%,
0.1818181818181818,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.9782608695652174,1,0.0%,
0.98,1,0.0%,
0.9803921568627452,1,0.0%,
0.9824561403508772,1,0.0%,
1.0,212,4.1%,

0,1
Distinct count,1235
Unique (%),24.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.54159
Minimum,0
Maximum,1
Zeros (%),20.4%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.47354
Median,0.63158
Q3,0.73643
95-th percentile,1.0
Maximum,1.0
Range,1.0
Interquartile range,0.26289

0,1
Standard deviation,0.30427
Coef of variation,0.56181
Kurtosis,-0.51644
Mean,0.54159
MAD,0.24094
Skewness,-0.80749
Sum,2785.9
Variance,0.092579
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1048,20.4%,
1.0,262,5.1%,
0.5,132,2.6%,
0.6666666666666666,130,2.5%,
0.75,93,1.8%,
0.6,69,1.3%,
0.7142857142857143,59,1.1%,
0.8,58,1.1%,
0.5714285714285714,51,1.0%,
0.8333333333333334,48,0.9%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1048,20.4%,
0.1,1,0.0%,
0.1538461538461538,1,0.0%,
0.1666666666666666,1,0.0%,
0.1875,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.9523809523809524,1,0.0%,
0.9545454545454546,1,0.0%,
0.96,1,0.0%,
0.9696969696969696,1,0.0%,
1.0,262,5.1%,

0,1
Distinct count,955
Unique (%),18.6%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.45295
Minimum,0
Maximum,1
Zeros (%),33.8%

0,1
Minimum,0.0
5-th percentile,0.0
Q1,0.0
Median,0.58333
Q3,0.72
95-th percentile,1.0
Maximum,1.0
Range,1.0
Interquartile range,0.72

0,1
Standard deviation,0.34816
Coef of variation,0.76865
Kurtosis,-1.4446
Mean,0.45295
MAD,0.31327
Skewness,-0.27754
Sum,2330
Variance,0.12122
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,1741,33.8%,
1.0,277,5.4%,
0.5,139,2.7%,
0.6666666666666666,130,2.5%,
0.75,96,1.9%,
0.6,73,1.4%,
0.8,57,1.1%,
0.7142857142857143,46,0.9%,
0.5714285714285714,42,0.8%,
0.8333333333333334,40,0.8%,

Value,Count,Frequency (%),Unnamed: 3
0.0,1741,33.8%,
0.1428571428571428,1,0.0%,
0.1666666666666666,1,0.0%,
0.2,7,0.1%,
0.2142857142857142,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
0.9545454545454546,1,0.0%,
0.9583333333333334,1,0.0%,
0.96,1,0.0%,
0.9622641509433962,2,0.0%,
1.0,277,5.4%,

0,1
Distinct count,4431
Unique (%),86.1%
Missing (%),0.0%
Missing (n),0

0,1
"trueskill.Rating(mu=29.241, sigma=7.190)",438
"trueskill.Rating(mu=31.731, sigma=6.490)",59
"trueskill.Rating(mu=25.041, sigma=6.293)",45
Other values (4428),4602

Value,Count,Frequency (%),Unnamed: 3
"trueskill.Rating(mu=29.241, sigma=7.190)",438,8.5%,
"trueskill.Rating(mu=31.731, sigma=6.490)",59,1.1%,
"trueskill.Rating(mu=25.041, sigma=6.293)",45,0.9%,
"trueskill.Rating(mu=28.344, sigma=7.222)",30,0.6%,
"trueskill.Rating(mu=30.751, sigma=6.899)",30,0.6%,
"trueskill.Rating(mu=32.785, sigma=6.272)",15,0.3%,
"trueskill.Rating(mu=31.034, sigma=6.550)",12,0.2%,
"trueskill.Rating(mu=24.303, sigma=6.272)",12,0.2%,
"trueskill.Rating(mu=33.425, sigma=6.004)",11,0.2%,
"trueskill.Rating(mu=26.549, sigma=6.072)",9,0.2%,

0,1
Distinct count,4431
Unique (%),86.1%
Missing (%),0.0%
Missing (n),0

0,1
"trueskill.Rating(mu=20.759, sigma=7.190)",438
"trueskill.Rating(mu=21.656, sigma=7.222)",59
"trueskill.Rating(mu=19.249, sigma=6.899)",45
Other values (4428),4602

Value,Count,Frequency (%),Unnamed: 3
"trueskill.Rating(mu=20.759, sigma=7.190)",438,8.5%,
"trueskill.Rating(mu=21.656, sigma=7.222)",59,1.1%,
"trueskill.Rating(mu=19.249, sigma=6.899)",45,0.9%,
"trueskill.Rating(mu=24.959, sigma=6.293)",30,0.6%,
"trueskill.Rating(mu=18.269, sigma=6.490)",30,0.6%,
"trueskill.Rating(mu=25.697, sigma=6.272)",15,0.3%,
"trueskill.Rating(mu=18.966, sigma=6.550)",12,0.2%,
"trueskill.Rating(mu=17.215, sigma=6.272)",12,0.2%,
"trueskill.Rating(mu=22.207, sigma=7.274)",11,0.2%,
"trueskill.Rating(mu=23.451, sigma=6.072)",9,0.2%,

0,1
Distinct count,24
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,179.28
Minimum,152.4
Maximum,210.82
Zeros (%),0.0%

0,1
Minimum,152.4
5-th percentile,165.1
Q1,172.72
Median,180.34
Q3,185.42
95-th percentile,193.04
Maximum,210.82
Range,58.42
Interquartile range,12.7

0,1
Standard deviation,8.6388
Coef of variation,0.048187
Kurtosis,0.052046
Mean,179.28
MAD,7.0463
Skewness,-0.096264
Sum,922200
Variance,74.63
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
185.42,630,12.2%,
182.88,623,12.1%,
177.8,509,9.9%,
175.26,507,9.9%,
180.34,489,9.5%,
187.96,405,7.9%,
172.72,388,7.5%,
167.64,353,6.9%,
190.5,326,6.3%,
170.18,304,5.9%,

Value,Count,Frequency (%),Unnamed: 3
152.4,2,0.0%,
154.94,25,0.5%,
157.48000000000005,7,0.1%,
160.02,63,1.2%,
162.56,114,2.2%,

Value,Count,Frequency (%),Unnamed: 3
198.12,29,0.6%,
200.66,6,0.1%,
203.2,15,0.3%,
208.28,3,0.1%,
210.82,15,0.3%,

0,1
Correlation,1

0,1
Distinct count,60
Unique (%),1.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,78.052
Minimum,52.163
Maximum,156.49
Zeros (%),0.0%

0,1
Minimum,52.163
5-th percentile,56.699
Q1,65.771
Median,77.111
Q3,83.915
95-th percentile,113.4
Maximum,156.49
Range,104.33
Interquartile range,18.144

0,1
Standard deviation,15.948
Coef of variation,0.20432
Kurtosis,0.8289
Mean,78.052
MAD,12.13
Skewness,0.95488
Sum,401500
Variance,254.33
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
77.11064,1034,20.1%,
70.30676,891,17.3%,
83.91452,744,14.5%,
92.98635999999999,507,9.9%,
65.77083999999999,493,9.6%,
61.234919999999995,487,9.5%,
56.699,268,5.2%,
52.16308,129,2.5%,
120.20188,105,2.0%,
108.86208,77,1.5%,

Value,Count,Frequency (%),Unnamed: 3
52.16308,129,2.5%,
56.699,268,5.2%,
61.23492,487,9.5%,
65.77083999999999,493,9.6%,
70.0,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
131.54168,1,0.0%,
133.80964,1,0.0%,
136.0776,4,0.1%,
146.510216,2,0.0%,
156.48924,2,0.0%,

0,1
Correlation,1

0,1
Distinct count,25
Unique (%),0.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,183.62
Minimum,152.4
Maximum,213.36
Zeros (%),0.0%

0,1
Minimum,152.4
5-th percentile,165.1
Q1,177.8
Median,182.88
Q3,190.5
95-th percentile,200.66
Maximum,213.36
Range,60.96
Interquartile range,12.7

0,1
Standard deviation,9.9919
Coef of variation,0.054415
Kurtosis,0.15086
Mean,183.62
MAD,7.8009
Skewness,-0.059022
Sum,944550
Variance,99.837
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
182.88,738,14.3%,
177.8,499,9.7%,
187.96,489,9.5%,
185.42,453,8.8%,
190.5,430,8.4%,
180.34,424,8.2%,
193.04,371,7.2%,
172.72,278,5.4%,
195.58,254,4.9%,
175.26,205,4.0%,

Value,Count,Frequency (%),Unnamed: 3
152.4,9,0.2%,
157.48,15,0.3%,
160.02,54,1.0%,
162.56,72,1.4%,
165.1,129,2.5%,

Value,Count,Frequency (%),Unnamed: 3
205.74,22,0.4%,
208.28,18,0.3%,
210.82,6,0.1%,
211.0,1,0.0%,
213.36,32,0.6%,

0,1
Correlation,1

0,1
Distinct count,5
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
Orthodox,3941
Southpaw,1036
Switch,150
Other values (2),17

Value,Count,Frequency (%),Unnamed: 3
Orthodox,3941,76.6%,
Southpaw,1036,20.1%,
Switch,150,2.9%,
Open Stance,15,0.3%,
Sideways,2,0.0%,

0,1
Distinct count,5
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0

0,1
Orthodox,3941
Southpaw,1036
Switch,150
Other values (2),17

Value,Count,Frequency (%),Unnamed: 3
Orthodox,3941,76.6%,
Southpaw,1036,20.1%,
Switch,150,2.9%,
Open Stance,15,0.3%,
Sideways,2,0.0%,

0,1
Distinct count,37
Unique (%),0.7%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,29.241
Minimum,9
Maximum,47
Zeros (%),0.0%

0,1
Minimum,9.0
5-th percentile,23.0
Q1,26.0
Median,29.0
Q3,32.0
95-th percentile,36.85
Maximum,47.0
Range,38.0
Interquartile range,6.0

0,1
Standard deviation,4.5144
Coef of variation,0.15439
Kurtosis,1.6335
Mean,29.241
MAD,3.4942
Skewness,-0.31328
Sum,150420
Variance,20.38
Memory size,40.3 KiB

Value,Count,Frequency (%),Unnamed: 3
29.0,468,9.1%,
30.0,468,9.1%,
31.0,443,8.6%,
27.0,440,8.6%,
28.0,431,8.4%,
26.0,399,7.8%,
32.0,392,7.6%,
25.0,307,6.0%,
33.0,291,5.7%,
34.0,268,5.2%,

Value,Count,Frequency (%),Unnamed: 3
9.0,1,0.0%,
10.0,12,0.2%,
11.0,13,0.3%,
12.0,12,0.2%,
13.0,10,0.2%,

Value,Count,Frequency (%),Unnamed: 3
43.0,3,0.1%,
44.0,5,0.1%,
45.0,1,0.0%,
46.0,2,0.0%,
47.0,1,0.0%,

0,1
Correlation,1

Unnamed: 0.1,Unnamed: 0,R_fighter,B_fighter,R_SIG_STR_pct,B_SIG_STR_pct,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,R_PASS,B_PASS,R_REV,B_REV,win_by,last_round,last_round_time,Format,date,location,Winner,R_KD_LANDED,R_KD_ATT,B_KD_LANDED,B_KD_ATT,R_SIG_STR_LANDED,R_SIG_STR_ATT,B_SIG_STR_LANDED,B_SIG_STR_ATT,R_TOTAL_STR_LANDED,R_TOTAL_STR_ATT,B_TOTAL_STR_LANDED,B_TOTAL_STR_ATT,R_TD_LANDED,R_TD_ATT,B_TD_LANDED,B_TD_ATT,R_HEAD_LANDED,R_HEAD_ATT,B_HEAD_LANDED,B_HEAD_ATT,R_BODY_LANDED,R_BODY_ATT,B_BODY_LANDED,B_BODY_ATT,R_LEG_LANDED,R_LEG_ATT,B_LEG_LANDED,B_LEG_ATT,R_DISTANCE_LANDED,R_DISTANCE_ATT,B_DISTANCE_LANDED,B_DISTANCE_ATT,R_CLINCH_LANDED,R_CLINCH_ATT,B_CLINCH_LANDED,B_CLINCH_ATT,R_GROUND_LANDED,R_GROUND_ATT,B_GROUND_LANDED,B_GROUND_ATT,weight_class,R_wins,B_wins,R_fights,B_fights,R_wins_pct,B_wins_pct,R_TOTAL_STR_pct,B_TOTAL_STR_pct,R_HEAD_pct,B_HEAD_pct,R_BODY_pct,B_BODY_pct,R_LEG_pct,B_LEG_pct,R_DISTANCE_pct,B_DISTANCE_pct,R_CLINCH_pct,B_CLINCH_pct,R_GROUND_pct,B_GROUND_pct,R_skill,B_skill,R_Height,B_Height,R_Weight,B_Weight,R_Reach,B_Reach,R_Stance,B_Stance,R_Age,B_Age
0,0,Henry Cejudo,Marlon Moraes,2.573099,0.94958,0.358491,0.25,1,2,12,2,0,0,KO/TKO,3,4:51,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Henry Cejudo,4,4,4,4,440,171,113,119,691,1299,118,332,19,53,1,4,239,742,56,243,164,219,30,46,37,53,27,38,265,750,103,313,110,170,0,1,65,94,10,13,Bantamweight,16.0,8.0,10.0,5.0,1.6,1.6,0.531948,0.355422,0.322102,0.230453,0.748858,0.652174,0.698113,0.710526,0.353333,0.329073,0.647059,0.0,0.691489,0.769231,"trueskill.Rating(mu=36.945, sigma=2.696)","trueskill.Rating(mu=33.242, sigma=2.803)",162.56,162.56,61.23492,61.23492,162.56,162.56,Orthodox,Orthodox,32.0,32.0
1,1,Valentina Shevchenko,Jessica Eye,37.818182,42.75,0.472222,0.5,3,7,12,8,1,0,KO/TKO,2,0:26,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Valentina Shevchenko,0,0,0,0,416,11,513,12,720,1131,696,1587,17,36,5,10,259,592,320,1120,54,84,91,146,103,135,102,123,253,617,421,1247,48,65,73,118,115,129,19,24,Women's Flyweight,10.0,8.0,7.0,10.0,1.428571,0.8,0.636605,0.438563,0.4375,0.285714,0.642857,0.623288,0.762963,0.829268,0.410049,0.33761,0.738462,0.618644,0.891473,0.791667,"trueskill.Rating(mu=32.366, sigma=2.872)","trueskill.Rating(mu=24.637, sigma=3.000)",165.1,165.1,56.699,56.699,167.64,167.64,Southpaw,Southpaw,31.0,31.0
2,2,Tony Ferguson,Donald Cerrone,4.16,14.183784,0.428571,0.313433,15,22,5,58,2,6,TKO - Doctor's Stoppage,2,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Tony Ferguson,3,3,36,36,832,200,2624,185,951,1995,2978,5704,6,14,42,134,534,1451,1268,3590,130,208,668,902,168,206,688,800,742,1742,2134,4566,26,43,266,412,64,80,224,314,Lightweight,28.0,46.0,15.0,31.0,1.866667,1.483871,0.476692,0.52209,0.368022,0.353203,0.625,0.740576,0.815534,0.86,0.425947,0.467367,0.604651,0.645631,0.8,0.713376,"trueskill.Rating(mu=35.664, sigma=2.195)","trueskill.Rating(mu=27.729, sigma=2.441)",180.34,180.34,70.30676,70.30676,193.04,193.04,Orthodox,Orthodox,35.0,35.0
3,3,Jimmie Rivera,Petr Yan,1.90625,1.486772,0.277778,0.5,0,1,1,2,0,1,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Petr Yan,3,3,2,2,366,192,281,189,406,939,347,619,5,18,5,10,192,644,215,465,82,146,56,68,92,104,10,12,328,839,195,438,33,47,44,55,5,8,42,52,Bantamweight,12.0,8.0,8.0,4.0,1.5,2.0,0.432375,0.560582,0.298137,0.462366,0.561644,0.823529,0.884615,0.833333,0.390942,0.445205,0.702128,0.8,0.625,0.807692,"trueskill.Rating(mu=28.273, sigma=3.398)","trueskill.Rating(mu=28.499, sigma=2.322)",162.56,162.56,61.23492,61.23492,172.72,172.72,Orthodox,Orthodox,29.0,29.0
4,4,Tai Tuivasa,Blagoy Ivanov,0.902778,1.0,0.0,0.0,0,0,1,0,0,0,Decision - Unanimous,3,5:00,3 Rnd (5-5-5),2019-06-08,"Chicago, Illinois, USA",Blagoy Ivanov,2,2,0,0,130,144,123,123,131,254,124,408,0,2,0,0,91,203,90,369,27,31,29,34,12,15,4,4,99,203,119,402,29,44,4,5,2,2,0,0,Heavyweight,6.0,2.0,4.0,2.0,1.5,1.0,0.515748,0.303922,0.448276,0.243902,0.870968,0.852941,0.8,1.0,0.487685,0.29602,0.659091,0.8,1.0,0.0,"trueskill.Rating(mu=32.664, sigma=4.431)","trueskill.Rating(mu=29.864, sigma=2.253)",187.96,187.96,119.748288,119.748288,190.5,190.5,Southpaw,Southpaw,26.0,26.0


# Testing

### Temporal Train-Dev-Test split
Observations in this data set are not independendent, as later appearances of a fighter depend on his/her earlier fights. Furthermore, this dependence is not only between different observations of the same fighter. If I know that a fighter is facing the world champion next, that will tell me something about his chances of victory–this is explicitly accounted for in the Trueskill model, and may be implicitly captured by some of the models we fit in the next section. Consequently, it does not make sense to perform ordinary cross-validation to optimise model hyperparameters. Viewed from a different perspective, the objective is to train a model to predict the outcome of future fights given past fights. Consequently, it makes no sense to evaluate the model on its ability to predict fights given past *and future* fights. 

Rather, we perform a temporal train-test split of the data. This has the effect of ensuring that the data on which we evaluate model performance is in the future compared to the training data. We set aside the last 200 fights as a dedicated test set. The remaining data will constitute our trainset. To tune hyperparameters, we will hold out the 600 last fights from the train data, train on the remaining and then evaluate accuracy on the held-out set. We repeat this process 2 more times, with the 400 last held out, then the 200 last held out. The hyperparameter settings that maximise the average accuracy across the 3 held out sets will be chosen. This is likely an established method, but on the offchance that it is not, I would like to call it "K-Hold" cross validation.

The hope is that by varying the held out set on which performance is evaluated, albeit only partially, we can mitigate the some of the overfitting that would likely occur if we only had 1 held out set. This is the approach that most closely mirrors ordinary cross validation, without violating the temporal constraints of this problem.

Once the hyperparameters have been chosen, the model is then trained on all the training data and used to predict the test set outcomes. This result will be the final reported accuracy, used to compare and choose model.

### Model Evaluation Metric: Accuracy
We are looking to build a model that will predict the probability of a fighter winning a fight before it happens, so as to help us make bets that maximise our expected returns. The payoff of a bet is binary, if you win the bet you get your money back plus profit, but if you lose the bet you loose all the money. With respect the model, the payoff function depends only on whether a prediction was correct or incorrect–it does not distinguish between different types of error and neither should we. As such, prediction accuracy is the metric used to evaluate model performance.


#### Creating a numeric representation the outcome variable
To fit models, the target variable, "Winner" must be expressed as a numerical. We create a binary variable "B_won" = 1 if B_fighter won the fight, 0 otherwise. This representation disregards the possibility of draws. However, because of the low draw rate, combined with our 50% probability decision rule, it is very unlikely that this will have an effect on our winner predictions. For later iterations, where the winning probability itself is of central interest–as opposed to just predicting who is most likely to win–the possibility of draws will have to be accounted for.

In [4]:
#Initialize to 0
cum_df["B_won"] = np.zeros(cum_df.shape[0])
for row in range(cum_df.shape[0]):
    #If B_won, change to 1
    if cum_df["B_fighter"].loc[row] == cum_df["Winner"].loc[row]:
        cum_df["B_won"].loc[row] = 1

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)


In [6]:
cum_df["B_won"].describe()

count    5144.000000
mean        0.309292
std         0.462247
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: B_won, dtype: float64

In [7]:
#Train-Dev-Test Split

#There are no fighters in the test of dev set with more than one fight in that set
#This is desirable because we would otherwise have to make predictions two fights into the future
test_df = cum_df.loc[0:200]
X_test = test_df.drop(columns=["B_won"], axis=1)
y_test = test_df["B_won"]

train_df = cum_df.loc[200:]
X_train = train_df.drop(columns=["B_won"], axis=1)
y_train  = train_df["B_won"]

# Using only TrueSkill to predict fights

The Trueskill model was developed by Microsoft to model the skill of players in hopes to make good matches in online game like Halo by matching equally skilled opponents. It is assumed that every player has some time-varying skill, which is unobserved. To objective of the model is to infer the latent player skill from the observed performances–wins, losses and draws. In terms of modelling, skill s for player i at time t, N(s_ti|mu, sigma) represents our belief of the players true skill level. sigma, the uncertainty of our belief decreases as we observe more games of the player. Performance p of player i at time t, in turn is assumed to be a drawn from a normal distribution centered on s_ti with variance beta. The key modelling assumption here is that, whilst the  player skill is specific to each player, the variance of performance around that skill is specific to the game, i.e. the same for all players. Consequently, beta can be quite impactful and I will perform validation to select it.

Because the model does no "training" in the sense that it does not alter the method by which it classifies based on. That is, the function mapping is set once beta has been chosen. Therefore, it will make no difference if we train the model first on the training set and then on the test set: we might as well just create trueskill estimates for every fighter in every fight, and then evaluate performance across the three hold-out splits.

In [9]:
import trueskill
import itertools
import math

In [12]:
def get_rating(fighter, date):
    '''
    Takes a fighter name and a date
    Returns the skill of the fighter after the previous fight
    If there was no previous fight, create a new, default rating
    '''
    
    #Get the previous fight, the highest row in which the fighter appears that is lower than the current row
    previous_fight = np.where(((cum_df["R_fighter"]==fighter) | (cum_df["B_fighter"]==fighter)) & (date > cum_df["date"]))  
        
    #If this is the first fight, make a new rating
    if len(previous_fight[0]) == 0:
        return env.create_rating()
    
    #Get which side the fighter was in previous fight
    if cum_df["R_fighter"].loc[previous_fight[0][0]]==fighter:
        previous_prefix = "R_"
    else:
        previous_prefix = "B_"
        
    #Else, return skill    
    return cum_df[previous_prefix+"skill"].loc[previous_fight[0][0]]



def win_probability(r_skill, b_skill, beta):
    '''
    Takes red fighter and blue fighter trueskill ratings,
    beta represents the variance of performance around player skill
    Returns estimated probability that blue fighter will win
    '''
    delta_mu = r_skill.mu - b_skill.mu
    sum_sigma = r_skill.sigma**2 + b_skill.sigma**2
    denom = math.sqrt(2 * (beta * beta) + sum_sigma)
    ts = trueskill.global_env()
    return ts.cdf(delta_mu / denom)


In [19]:
#Source: https://trueskill.org/
score_dict = {"params": [1, 2, 3, 4, trueskill.BETA, 5, 6, 7],
            "avg_accuracy" : []}
#Implementing the validation strategy outlined above:

for beta in score_dict["params"]:
    #Create environment using current beta val
    env = trueskill.TrueSkill(draw_probability=0.019, beta=beta)
    #Reset fighter_skills
    cum_df["R_skill"] = np.full(cum_df.shape[0], np.nan)
    cum_df["B_skill"] = np.full(cum_df.shape[0], np.nan)
    
    #Iterate from old fights to new
    for row in reversed(range(cum_df.shape[0])):

        #Find winner
        if cum_df["Winner"].loc[row] ==  cum_df["R_fighter"].loc[row]:
            winner_name = cum_df["R_fighter"].loc[row]
            loser_name = cum_df["B_fighter"].loc[row]
            r_win = True
        elif cum_df["Winner"].loc[row] ==  cum_df["B_fighter"].loc[row]:
            winner_name = cum_df["B_fighter"].loc[row]
            loser_name = cum_df["R_fighter"].loc[row]
            b_win = True
        #If neither fighter won, it must have been a draw
        else:
            winner_name = cum_df["B_fighter"].loc[row]
            loser_name = cum_df["R_fighter"].loc[row]
            draw = True

        #Get skill ratings
        date = cum_df["date"].loc[row]
        winner = get_rating(winner_name, date)
        loser = get_rating(loser_name, date)

        #Update ratings given outcome of previous fight
        if r_win:
            cum_df["R_skill"].loc[row], cum_df["B_skill"].loc[row] = env.rate_1vs1(winner, loser)
        elif b_win:
            cum_df["B_skill"].loc[row], cum_df["R_skill"].loc[row] = env.rate_1vs1(winner, loser)
        else:
            cum_df["B_skill"].loc[row], cum_df["R_skill"].loc[row] = env.rate_1vs1(winner, loser, drawn=True)
    
    
    #Get win probabilities from fighter skills
    cum_df["Trueskill_bwon_pred"] = np.full(cum_df.shape[0], np.nan)
    for row in range(cum_df.shape[0]):
        r_skill = cum_df["R_skill"].loc[row]
        b_skill = cum_df["B_skill"].loc[row]
        cum_df["Trueskill_bwon_pred"].loc[row] = win_probability(b_skill, r_skill, beta)
    
    score = 0
    for split in [600, 400, 200]:
        #Generating prediction by rounding probability to nearest integer, o or 1
        #Get accuracy accross heldout set
        score += np.mean(np.round(cum_df["Trueskill_bwon_pred"].loc[200:200+split])==cum_df["B_won"].loc[200:200+split])
        print(beta, split, score)

    #Record average accuracy
    score_dict["avg_accuracy"].append(score/3)

1 600 0.5740432612312812
1 400 1.152596877191381
1 200 1.7396615538082965
2 600 0.5806988352745425
2 400 1.1717212791648168
2 200 1.753810831403623
3 600 0.5757071547420965
3 400 1.1567545362882312
3 200 1.7537694616613655
4 600 0.5757071547420965
4 400 1.1517670051161613
4 200 1.743806806111186
4.166666666666667 600 0.5790349417637272
4.166666666666667 400 1.1600823233098618
4.166666666666667 200 1.7521221243048868
5 600 0.5723793677204659
5 400 1.1459454525084958
5 200 1.7330101291254112
6 600 0.5707154742096506
6 400 1.1368002622395759
6 200 1.718889814478382
7 600 0.5707154742096506
7 400 1.134306496653541
7 200 1.7114209245142376


In [48]:
#Select the best performing beta: 2
beta = score_dict['params'][np.argmax(score_dict['avg_accuracy'])]

#Let's see how we did
score_dict['avg_accuracy']

[0.5815552137115675,
 0.5865413755259014,
 0.5837723552725294,
 0.575743315420001,
 0.5768525777605445,
 0.5774016902127042,
 0.5749065415065439,
 0.576564916299247]

In [13]:
beta=2
env = trueskill.TrueSkill(draw_probability=0.019, beta=beta)
#Reset fighter_skills
cum_df["R_skill"] = np.full(cum_df.shape[0], np.nan)
cum_df["B_skill"] = np.full(cum_df.shape[0], np.nan)

#Iterate from old fights to new
for row in reversed(range(cum_df.shape[0])):

    #Find winner
    if cum_df["Winner"].loc[row] ==  cum_df["R_fighter"].loc[row]:
        winner_name = cum_df["R_fighter"].loc[row]
        loser_name = cum_df["B_fighter"].loc[row]
        r_win = True
    elif cum_df["Winner"].loc[row] ==  cum_df["B_fighter"].loc[row]:
        winner_name = cum_df["B_fighter"].loc[row]
        loser_name = cum_df["R_fighter"].loc[row]
        b_win = True
    #If neither fighter won, it must have been a draw
    else:
        winner_name = cum_df["B_fighter"].loc[row]
        loser_name = cum_df["R_fighter"].loc[row]
        draw = True

    #Get skill ratings
    date = cum_df["date"].loc[row]
    winner = get_rating(winner_name, date)
    loser = get_rating(loser_name, date)

    #Update ratings given outcome of previous fight
    if r_win:
        cum_df["R_skill"].loc[row], cum_df["B_skill"].loc[row] = env.rate_1vs1(winner, loser)
    elif b_win:
        cum_df["B_skill"].loc[row], cum_df["R_skill"].loc[row] = env.rate_1vs1(winner, loser)
    else:
        cum_df["B_skill"].loc[row], cum_df["R_skill"].loc[row] = env.rate_1vs1(winner, loser, drawn=True)


#Get win probabilities from fighter skills
cum_df["Trueskill_bwon_pred"] = np.full(cum_df.shape[0], np.nan)
for row in range(cum_df.shape[0]):
    r_skill = cum_df["R_skill"].loc[row]
    b_skill = cum_df["B_skill"].loc[row]
    cum_df["Trueskill_bwon_pred"].loc[row] = win_probability(b_skill, r_skill, beta)

### Test set accuracy

In [21]:
#Testing performance on the 200 most recent fights in the dataset
np.mean(np.round(cum_df["Trueskill_bwon_pred"].loc[:200])==cum_df["B_won"].loc[:200])

0.582089552238806

# Let's define some feature sets

Ideally we want to find a set of features where each feature is highly correlated with the outcome, but the features are not highly correlated with eachother.

Let us start by finding the correlation between the outcome of interest, b_win, and the different statistics.
Let's print find all features with above 0.1 correlation with B_won.

Then, we show the features that are highly correlated.

In [48]:
import sys
np.set_printoptions(threshold=sys.maxsize)
idx = np.where(0.05<abs(cum_df.corr()[['B_won']].sort_values(by='B_won', ascending=False)))[0]
cum_df.corr()[['B_won']].sort_values(by='B_won', ascending=False).iloc[idx]

Unnamed: 0,B_won
B_won,1.0
Trueskill_bwon_pred,0.242658
B_SIG_STR_ATT,0.220879
B_Age,0.17245
R_Age,0.17245
R_DISTANCE_ATT,0.142943
R_DISTANCE_LANDED,0.134092
R_HEAD_ATT,0.13409
R_TOTAL_STR_ATT,0.127034
R_fights,0.125418


In [43]:
corr_matrix = cum_df.corr(method = 'pearson').abs()
show = (corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool)).stack().sort_values(ascending=False))
print(show[0:30])

R_Weight            B_Weight           1.000000
R_KD_LANDED         R_KD_ATT           1.000000
R_Age               B_Age              1.000000
R_Height            B_Height           1.000000
B_KD_LANDED         B_KD_ATT           1.000000
R_Reach             B_Reach            1.000000
B_LEG_LANDED        B_LEG_ATT          0.994047
R_LEG_LANDED        R_LEG_ATT          0.993336
B_GROUND_LANDED     B_GROUND_ATT       0.990011
R_GROUND_LANDED     R_GROUND_ATT       0.989431
B_CLINCH_LANDED     B_CLINCH_ATT       0.988905
B_BODY_LANDED       B_BODY_ATT         0.987761
R_CLINCH_LANDED     R_CLINCH_ATT       0.987209
R_BODY_LANDED       R_BODY_ATT         0.985598
B_HEAD_ATT          B_DISTANCE_ATT     0.978843
B_DISTANCE_LANDED   B_DISTANCE_ATT     0.978755
B_SIG_STR_LANDED    B_HEAD_LANDED      0.978693
B_TOTAL_STR_LANDED  B_TOTAL_STR_ATT    0.976755
R_HEAD_ATT          R_DISTANCE_ATT     0.976390
R_SIG_STR_LANDED    R_HEAD_LANDED      0.976232
B_TOTAL_STR_ATT     B_HEAD_ATT         0

In general, there is a strong correlation between the number attempted and landed for the various statistics of this form. This makes intuitive sense, the more you throw the more you land. This suggests that it might be redundant in most cases to include both statistics in a model–perhaps it is best to combine #attempted and accuracy.

Further, as speculated earlier, there is little variation in physical characteristics of fighters conditional on weight class. Weight, height, reach and age are all highly correlated between red_side and blue side fighters. Consequently, we cannot expect these statistics to hold discriminative power.

In [51]:
#Train set, everything except the 200 most recent fights
X_train = cum_df[["Trueskill_bwon_pred","B_TOTAL_STR_ATT", "R_TOTAL_STR_ATT", 
                       "B_TOTAL_STR_pct", "R_TOTAL_STR_pct",  "R_DISTANCE_pct", "B_DISTANCE_pct",
                      "R_DISTANCE_ATT", "B_DISTANCE_ATT", "B_TD_ATT", "R_TD_ATT"]].loc[200:]
y_train = cum_df["B_won"].loc[200:]

#TEst, everything except train
X_test = cum_df[["Trueskill_bwon_pred","B_TOTAL_STR_ATT", "R_TOTAL_STR_ATT", 
                       "B_TOTAL_STR_pct", "R_TOTAL_STR_pct",  "R_DISTANCE_pct", "B_DISTANCE_pct",
                      "R_DISTANCE_ATT", "B_DISTANCE_ATT", "B_TD_ATT", "R_TD_ATT"]].loc[:200]
y_test = cum_df["B_won"].loc[:200]

# Let's try using an SVM to classify which fighter will win.


All SVMs have the hyperparameter C in common. Conceptually, it specifies a tradeoff between training set classification accuracy and model smoothness. Large values of C means that the model will place larger value on correctly classifying all training data, whilst lower values of C means that keeping the decision surface smooth (i.e. not change much for small variations in X). It can thus be thought of as a regularization parameter. As such, selecting appropriate values for these hyperparameters is very important for model performance (scikit-learn.org, n.d). Cross validation is performed on a wide range of values for C, from e-8 to e8 with one order or magnitude between each value.

An SVM with a RBF kernel has, in addition to C, one additional hyperparameter, gamma, which how close away data  have to be to affect the decision boundary at a point. Higher values of gamma mean that only data closeby are taken into consideration, resulting in a more flexible decision boundary. Conversely, smaller values of gamma mean data from a wider range affect the decision boundary at each point, resulting in a smoother decision function. Consequently, gamma can be thought of as a parameter that guides the smoothness of the decision boundary. As is often the case, choosing the right value for gamma is about determining the appropriate bias-variance trade-off (for the same reasons as explained in poly kernel) (scikit-learn.org, n.d.)

We use our K-hold cv procedure to select hyperparamters. We use exponential spacing between each step of gamma, between 1e-8 and 1e4. Same as for c.

In [44]:
#Support Vector Machines
from sklearn import svm

from sklearn.metrics import accuracy_score

In [54]:
C_range = gamma_range = [10**x for x in range(-4, 4)]
parameters = {'gamma':gamma_range, 
              'C':C_range}

scores_dict = {}

#Because of our special cross validation procedure, I implement everything by hand
scores_list = []
for c in C_range:
    for gamma in gamma_range:
        score = 0
        for split in [600, 400, 200]:
            #Create svm with current parameter setting 
            svc = svm.SVC(kernel="rbf", gamma=gamma, C=c)
            
            #Fit on the training data
            svc.fit(X_train[split:], y_train[split:])
            
            #Evaluate accuracy on hold-out set
            y_pred=svc.predict(X_train[:split])
            score += accuracy_score(y_train[:split],y_pred)
        
        #Store average accuracy across the 3 hold-out sets
        scores_dict[(c, gamma)] = score/3

## Test set accuracy
We select the hyperparameter settings that gave the best cross validation accuracy, fit the model on the entire training dataset and then evaluate accuracy on the test set.

In [84]:
scores_df = pd.DataFrame.from_dict(scores_dict, orient='index', columns=["Train Accuracy"])
#Alot of parameter values have the same Training accuracy
scores_df[:40]

Unnamed: 0,Train Accuracy
"(0.0001, 0.0001)",0.557778
"(0.0001, 0.001)",0.557778
"(0.0001, 0.01)",0.557778
"(0.0001, 0.1)",0.557778
"(0.0001, 1)",0.557778
"(0.0001, 10)",0.557778
"(0.0001, 100)",0.557778
"(0.0001, 1000)",0.557778
"(0.001, 0.0001)",0.557778
"(0.001, 0.001)",0.557778


In [94]:
#We pick gamma=1, c=1, on the rationale that it is nice and even
svc = svm.SVC(kernel="rbf", gamma=1, C=1)

#Fit on the all training data
svc.fit(X_train, y_train)

#Evaluate accuracy on test set
y_pred=svc.predict(X_test)
score = accuracy_score(y_test,y_pred)
print("Test set accuracy:", score)

Test set accuracy: 0.5024875621890548


# Trying again, but without Trueskill

We fit another SVM with an RBF kernel, but this time without Trueskill as a feature and instead having win related data.

In [104]:
cum_df.head(2)

Unnamed: 0.1,Unnamed: 0,R_fighter,B_fighter,R_SIG_STR_pct,B_SIG_STR_pct,R_TD_pct,B_TD_pct,R_SUB_ATT,B_SUB_ATT,R_PASS,B_PASS,R_REV,B_REV,win_by,last_round,last_round_time,Format,date,location,Winner,R_KD_LANDED,R_KD_ATT,B_KD_LANDED,B_KD_ATT,R_SIG_STR_LANDED,R_SIG_STR_ATT,B_SIG_STR_LANDED,B_SIG_STR_ATT,R_TOTAL_STR_LANDED,R_TOTAL_STR_ATT,B_TOTAL_STR_LANDED,B_TOTAL_STR_ATT,R_TD_LANDED,R_TD_ATT,B_TD_LANDED,B_TD_ATT,R_HEAD_LANDED,R_HEAD_ATT,B_HEAD_LANDED,B_HEAD_ATT,R_BODY_LANDED,R_BODY_ATT,B_BODY_LANDED,B_BODY_ATT,R_LEG_LANDED,R_LEG_ATT,B_LEG_LANDED,B_LEG_ATT,R_DISTANCE_LANDED,R_DISTANCE_ATT,B_DISTANCE_LANDED,B_DISTANCE_ATT,R_CLINCH_LANDED,R_CLINCH_ATT,B_CLINCH_LANDED,B_CLINCH_ATT,R_GROUND_LANDED,R_GROUND_ATT,B_GROUND_LANDED,B_GROUND_ATT,weight_class,R_wins,B_wins,R_fights,B_fights,R_wins_pct,B_wins_pct,R_TOTAL_STR_pct,B_TOTAL_STR_pct,R_HEAD_pct,B_HEAD_pct,R_BODY_pct,B_BODY_pct,R_LEG_pct,B_LEG_pct,R_DISTANCE_pct,B_DISTANCE_pct,R_CLINCH_pct,B_CLINCH_pct,R_GROUND_pct,B_GROUND_pct,R_skill,B_skill,R_Height,B_Height,R_Weight,B_Weight,R_Reach,B_Reach,R_Stance,B_Stance,R_Age,B_Age,B_won,Trueskill_bwon_pred
0,0,Henry Cejudo,Marlon Moraes,2.573099,0.94958,0.358491,0.25,1,2,12,2,0,0,KO/TKO,3,4:51,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Henry Cejudo,4,4,4,4,440,171,113,119,691,1299,118,332,19,53,1,4,239,742,56,243,164,219,30,46,37,53,27,38,265,750,103,313,110,170,0,1,65,94,10,13,Bantamweight,16.0,8.0,10.0,5.0,1.6,1.6,0.531948,0.355422,0.322102,0.230453,0.748858,0.652174,0.698113,0.710526,0.353333,0.329073,0.647059,0.0,0.691489,0.769231,"trueskill.Rating(mu=38.950, sigma=3.574)","trueskill.Rating(mu=34.251, sigma=3.713)",162.56,162.56,61.23492,61.23492,162.56,162.56,Orthodox,Orthodox,32.0,32.0,0.0,0.336864
1,1,Valentina Shevchenko,Jessica Eye,37.818182,42.75,0.472222,0.5,3,7,12,8,1,0,KO/TKO,2,0:26,5 Rnd (5-5-5-5-5),2019-06-08,"Chicago, Illinois, USA",Valentina Shevchenko,0,0,0,0,416,11,513,12,720,1131,696,1587,17,36,5,10,259,592,320,1120,54,84,91,146,103,135,102,123,253,617,421,1247,48,65,73,118,115,129,19,24,Women's Flyweight,10.0,8.0,7.0,10.0,1.428571,0.8,0.636605,0.438563,0.4375,0.285714,0.642857,0.623288,0.762963,0.829268,0.410049,0.33761,0.738462,0.618644,0.891473,0.791667,"trueskill.Rating(mu=33.759, sigma=3.864)","trueskill.Rating(mu=24.323, sigma=3.959)",165.1,165.1,56.699,56.699,167.64,167.64,Southpaw,Southpaw,31.0,31.0,0.0,0.20269


In [115]:
#Train set, everything except the 200 most recent fights
X_train = cum_df[["B_TOTAL_STR_ATT", "R_TOTAL_STR_ATT", 
                       "B_TOTAL_STR_pct", "R_TOTAL_STR_pct",  "R_DISTANCE_pct", "B_DISTANCE_pct",
                      "R_DISTANCE_ATT", "B_DISTANCE_ATT", "B_TD_ATT", "R_TD_ATT", "R_Age", "B_Age",
                         "R_wins", "B_wins"]].loc[200:]
y_train = cum_df["B_won"].loc[200:]

#TEst, everything except train
X_test = cum_df[["B_TOTAL_STR_ATT", "R_TOTAL_STR_ATT", 
                       "B_TOTAL_STR_pct", "R_TOTAL_STR_pct",  "R_DISTANCE_pct", "B_DISTANCE_pct",
                      "R_DISTANCE_ATT", "B_DISTANCE_ATT", "B_TD_ATT", "R_TD_ATT", "R_Age", "B_Age",
                         "R_wins", "B_wins"]].loc[:200]
y_test = cum_df["B_won"].loc[:200]

In [106]:
C_range = gamma_range = [10**x for x in range(-4, 4)]
parameters = {'gamma':gamma_range, 
              'C':C_range}

scores_dict = {}

#Because of our special cross validation procedure, I implement everything by hand
scores_list = []
for c in C_range:
    for gamma in gamma_range:
        score = 0
        for split in [600, 400, 200]:
            #Create svm with current parameter setting 
            svc = svm.SVC(kernel="rbf", gamma=gamma, C=c)
            
            #Fit on the training data
            svc.fit(X_train[split:], y_train[split:])
            
            #Evaluate accuracy on hold-out set
            y_pred=svc.predict(X_train[:split])
            score += accuracy_score(y_train[:split],y_pred)
        
        #Store average accuracy across the 3 hold-out sets
        scores_dict[(c, gamma)] = score/3

### Test set accuracy
We select the hyperparameter settings that gave the best cross validation accuracy, fit the model on the entire training dataset and then evaluate accuracy on the test set.

In [113]:
scores_df = pd.DataFrame.from_dict(scores_dict, orient='index', columns=["Train Accuracy"])
#Alot of parameter values have the same Training accuracy

print("The best parameters are %s with a score of %0.3f"
      % (scores_df["Train Accuracy"].idxmax(), scores_df["Train Accuracy"].max()))

The best parameters are (1, 0.0001) with a score of 0.560


In [118]:
#We pick gamma=1, c=1, on the rationale that it is nice and even
svc = svm.SVC(kernel="rbf", C=1, gamma=0.0001)

#Fit on the all training data
svc.fit(X_train, y_train)

#Evaluate accuracy on test set
y_pred=svc.predict(X_test)
score = accuracy_score(y_test,y_pred)
print("Test set accuracy:", np.round(score, 3))

Test set accuracy: 0.478


# PCA followed by SVM
This time we try a different approach, we make a large feature set, apply PCA to reduce dimensionality, then fit an SVM.

We use cross validation to set the number of principal components and the hyperparameter settings for the SVM. 


In [100]:
#Train set, everything except the 200 most recent fights
X_train = cum_df[["Trueskill_bwon_pred","B_TOTAL_STR_ATT", "R_TOTAL_STR_ATT", 
                       "B_TOTAL_STR_pct", "R_TOTAL_STR_pct",  "R_DISTANCE_pct", "B_DISTANCE_pct",
                      "R_DISTANCE_ATT", "B_DISTANCE_ATT", "B_TD_ATT", "R_TD_ATT", 
                 "R_SUB_ATT", "B_SUB_ATT", "R_PASS", "B_PASS", "R_REV", "B_REV", "B_fights", 
                  "R_fights", "R_Age", "B_Age", "R_Height", "B_Height", "R_Reach", "B_Reach"]].loc[200:]

y_train = cum_df["B_won"].loc[200:]

#TEst, everything except train
X_test = cum_df[["Trueskill_bwon_pred","B_TOTAL_STR_ATT", "R_TOTAL_STR_ATT", 
                       "B_TOTAL_STR_pct", "R_TOTAL_STR_pct",  "R_DISTANCE_pct", "B_DISTANCE_pct",
                      "R_DISTANCE_ATT", "B_DISTANCE_ATT", "B_TD_ATT", "R_TD_ATT", 
                 "R_SUB_ATT", "B_SUB_ATT", "R_PASS", "B_PASS", "R_REV", "B_REV", "B_fights", 
                  "R_fights", "R_Age", "B_Age", "R_Height", "B_Height", "R_Reach", "B_Reach"]].loc[200:]
y_test = cum_df["B_won"].loc[:200]

In [101]:
from sklearn.decomposition import PCA

pca = PCA(random_state=0)
X = pca.fit_transform(X_train)
y = y_train

#Getting a sense for the by plotting the Cumulative Summation of the Explained Variance
#This informs the values tested in CV
plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of Components')
plt.ylabel('Variance (%)') #for each component
plt.title('Proportion of Total Variance Explained ')
plt.show()

KeyboardInterrupt: 

### Using cross validation to select the number of principal components to use

In [None]:
C_range = gamma_range = [10**x for x in range(-4, 4)]
component_range = 
parameters = {'gamma':gamma_range, 
              'C':C_range}

scores_dict = {}

#Because of our special cross validation procedure, I implement everything by hand
scores_list = []
for n_components in component_range:
    pca = PCA(random_state=0, n_components=n_components)
    X = pca.fit_transform(X_train)
    y = y_train
    
    for c in C_range:
        for gamma in gamma_range:
            score = 0
            for split in [600, 400, 200]:
                #Create svm with current parameter setting 
                svc = svm.SVC(kernel="rbf", gamma=gamma, C=c)

                #Fit on the training data
                svc.fit(X[split:], y_train[split:])

                #Evaluate accuracy on hold-out set
                y_pred=svc.predict(X[:split])
                score += accuracy_score(y_train[:split],y_pred)

            #Store average accuracy across the 3 hold-out sets
            scores_dict[(n_components, c, gamma)] = score/3

Abort PCA mission, it is taking too long.

Thanks for a really cool semester! Merry Christmas.