# UFC Fight Duration Analysis

Authors: Titus Bridgwood, Ioana Preoteasa

## Introduction

This notebook is the technical document detailing our analysis of UFC fight duration. We will be using some of the more advanced regression techniques and hope to demonstrate a replicable process for handling and analysing data.

We have used a dataset of historical data for the UFC from 1993 to 2019 (https://www.kaggle.com/rajeevw/ufcdata#raw_total_fight_data.csv), containing information for all individual UFC bouts during this time period. The dataset ostensibly contains two useful targets: the winner of the fight and the length of the fight. It would be nice to see whether the variables in this dataset can consistently predict wins, but winning (or not) is a binary categorical target and as such lends itself to classification rather than regression. Fight length, on the other hand, is continuous: we can take match times, convert them into seconds and see whether the many variables in this dataset are capable of making decent predictions of fight length.

While there are lots of lovely in-fight statistics such as significant strike rates, takedown percentages and much more, we will mostly be looking at information available before each fight, such as the heights and weights of the fighters, how many successive wins the fighters have had, etc. This is because we want our model to have predictive capabilities for business use cases such as deciding betting odds, and aside from the ultimate winner, the next most common betting target is the length of the fight ('I want him to go down in the 4th!'). Once our models are up and running, we will also try changing our target variable to something more nuanced, such as strike percentage.

## Dataset & Data Cleaning

There was a decent amount of cleaning for us to do with this set. Many of the columns were encoded as `object` types, especially any columns containing percentages, so we converted them from percentage strings into floats between 0 and 1.
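As an illustration of the percentage conversion, here is a minimal sketch using `pandas`. The sample data and the column names (`sig_str_pct`, `td_pct`) are made up for the example and are not necessarily those in the Kaggle file.

```python
import pandas as pd

# Hypothetical sample: the column names and values are illustrative only.
df = pd.DataFrame({
    "sig_str_pct": ["45%", "60%", "0%"],
    "td_pct": ["33%", "100%", "50%"],
})


def percent_to_fraction(series):
    """Convert strings such as '45%' into floats between 0 and 1."""
    return series.str.rstrip("%").astype(float) / 100.0


# Assumption: percentage columns can be identified by a '_pct' suffix.
pct_cols = [col for col in df.columns if col.endswith("_pct")]
df[pct_cols] = df[pct_cols].apply(percent_to_fraction)

print(df.dtypes)
```

After the conversion the columns are numeric rather than `object`, so they can be passed straight to a regression model.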

With this historical data we have to account for the fact that the rules have changed progressively throughout the history of the UFC. Round lengths of five minutes were introduced in UFC 21, and we suspect this might have an effect on estimations of match length, so we will be sure to run models both with and without matches before UFC 21.
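Converting a fight's finishing time into total seconds could look something like the sketch below. It assumes five-minute rounds throughout, which by the rule change described above would not hold for bouts before UFC 21, and it assumes the finishing round and the clock time within that round are stored in columns named `last_round` and `last_round_time`; both names and the sample rows are hypothetical.

```python
import pandas as pd

ROUND_SECONDS = 5 * 60  # assumption: every completed round lasted five minutes

# Hypothetical sample rows: finishing round and clock time within that round.
fights = pd.DataFrame({
    "last_round": [1, 3, 5],
    "last_round_time": ["2:14", "5:00", "0:47"],
})


def fight_seconds(last_round, last_round_time, round_seconds=ROUND_SECONDS):
    """Total fight length in seconds: completed rounds plus time in the final round."""
    minutes, seconds = last_round_time.split(":")
    completed = (int(last_round) - 1) * round_seconds
    return completed + int(minutes) * 60 + int(seconds)


fights["duration_s"] = [
    fight_seconds(rnd, clock)
    for rnd, clock in zip(fights["last_round"], fights["last_round_time"])
]
print(fights)
```

Passing a different `round_seconds` value is one way to handle bouts fought under other round lengths.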

In [None]:
# Importing the data we will be using. 


## Feature Selection and Regularization

Our features are for the most part quite different from one another and difficult to compare directly. For this reason we've normalised them with `sklearn`'s `normalize`.
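One detail worth checking when using `sklearn.preprocessing.normalize` is the `axis` argument: by default (`axis=1`) it rescales each row (sample) to unit norm, while `axis=0` rescales each column (feature). The sketch below shows the difference on made-up numbers; which setting this notebook used is not stated here, so treat it as an illustration only.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Made-up rows of (height in cm, weight in kg, current win streak).
X = np.array([
    [180.0, 77.0, 3.0],
    [170.0, 66.0, 0.0],
    [190.0, 93.0, 5.0],
])

row_scaled = normalize(X)          # default axis=1: each row has unit L2 norm
col_scaled = normalize(X, axis=0)  # axis=0: each column has unit L2 norm

row_norms = np.linalg.norm(row_scaled, axis=1)
col_norms = np.linalg.norm(col_scaled, axis=0)
print(row_norms, col_norms)
```

If the goal is to put features on a comparable scale, per-column scaling (`axis=0`, or a scaler such as `StandardScaler`) is usually the closer fit.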

As for feature selection, it seems intuitively likely that there will be some collinearity that we may at some point have to account for.

We will be using Lasso regression, a form of regression that penalises the size of a model's coefficients so that no single coefficient has too large an effect on the model. Lasso differs from Ridge regression in that its L1 penalty can shrink a coefficient all the way to exactly 0, whereas Ridge only shrinks coefficients towards 0 without eliminating them. To a certain extent this is a hands-off way of letting sklearn's optimization select features for us.
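The difference between the two penalties can be seen in a small sketch on synthetic data (not the UFC dataset). The data comes from `make_regression`, and the `alpha` value of 5.0 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, of which only 3 actually drive the target.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=10.0, random_state=0
)

lasso = Lasso(alpha=5.0).fit(X, y)
ridge = Ridge(alpha=5.0).fit(X, y)

# Count coefficients that were set to exactly zero by each model.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print("Lasso zeroed coefficients:", n_zero_lasso)
print("Ridge zeroed coefficients:", n_zero_ridge)
```

Lasso drops some of the uninformative features entirely, while Ridge keeps every feature with a small but non-zero weight.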

We will also be using polynomial preprocessing, which creates a feature matrix of all the polynomial combinations of the features up to a given degree. This will produce more nuanced features for our model but also runs the risk of creating a model that is overfitted to our training data and is less capable of generalizing to new and unseen data. For this reason we will be trying a number of different options for the polynomial degree and the alpha threshold we set for our Lasso model.

We will be creating a training and testing split in order to see whether our model is capable of generalising to new data rather than simply perfectly fitting the data we already have. 

In [None]:
# Creating a split between training and testing data (80% train, 20% test)
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(numer_x_df, numer_y_df, test_size=0.2, random_state=42)

## Modelling 

We will initially be trying to obtain as simple a model as possible, in order to produce a baseline which we can build upon by iteratively introducing greater complexity to the model. To begin with we will be using all of the available x variables to predict fight duration, and LASSO should then start to deweight those deemed unimportant.

We have produced a library with code that should speed this process up and make results easier to interpret. The alpha in LASSO regression controls the extent to which features are deweighted, so setting our initial alpha to 0 removes the penalty and essentially tells LASSO to perform an Ordinary Least Squares regression. With the polynomial feature order set to 1, the model is also linear in the original features (a straight line in the single-feature case).
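The relationship between a zero alpha and Ordinary Least Squares can be checked with a small sketch on synthetic data. This uses plain `sklearn` estimators rather than our library, and a very small alpha instead of exactly 0, since the `sklearn` documentation advises against `alpha=0` with the `Lasso` object for numerical reasons.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression

# Synthetic data standing in for the real feature matrix.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)

ols = LinearRegression().fit(X, y)
# A tiny alpha makes the L1 penalty negligible; tight tol for a precise fit.
near_ols = Lasso(alpha=1e-6, tol=1e-8, max_iter=100_000).fit(X, y)

# Largest absolute difference between the two sets of coefficients.
max_gap = float(np.max(np.abs(ols.coef_ - near_ols.coef_)))
print("Largest coefficient difference:", max_gap)
```

As alpha grows from here, the Lasso coefficients move away from the OLS solution and some of them reach zero.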


Great, so now we have a baseline R-squared score (although perhaps not the strongest) of `scorehere`. We can now start to introduce some complexity into the model to see if we can predict the fight duration any more convincingly. We'll start by adding second-order polynomial features, which will produce squared terms and pairwise interactions for all of the features in the dataset.

INTERPRET RESULTS HERE. From here we can manually set the alpha value for our LASSO regression so that it will start to weight the coefficients. We have started with an alpha value of 0.5 as we found it to be quite effective in our initial tests. 

We could manually plug and chug a series of different alpha values; however, sklearn has a nifty built-in, `LassoLarsIC`, that fits the Lasso coefficients along a path of alpha values and selects the alpha that minimises an information criterion (AIC or BIC). By default we run 1000 iterations, but later we will hone the estimate by trying many more iterations.
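A minimal sketch of `LassoLarsIC` on synthetic data is shown below; it picks alpha using an information criterion (AIC or BIC). The choice of `criterion="bic"` and the data from `make_regression` are assumptions for the example, not the settings used in our library.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoLarsIC

# Synthetic data: 20 features, of which only 4 are informative.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=4, noise=15.0, random_state=0
)

# max_iter caps the number of steps taken along the LARS path.
model = LassoLarsIC(criterion="bic", max_iter=1000)
model.fit(X, y)

n_selected = int(np.sum(model.coef_ != 0))
print("Selected alpha:", model.alpha_)
print("Features with non-zero coefficients:", n_selected)
```

The fitted `alpha_` attribute holds the chosen penalty, so it can be inspected or reused in a plain `Lasso` model afterwards.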

### Validation

We will be using cross-validation to keep ourselves in check. As our models are starting to get pretty complex, we will be using only a small number of folds.

In [None]:
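import numpy as np
from sklearn.model_selection import KFold, cross_val_score

# 5-fold cross-validation, shuffled with a fixed seed for reproducibility
n_splits = 5
crossval = KFold(n_splits, shuffle=True, random_state=42)
scores = cross_val_score(reg_poly, x_poly_train, y_train, scoring='r2', cv=crossval)
mean_score = np.mean(scores)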

## Refinement

## Honesty in Data Science