# CRISP-DM Methodology for UFC Fight Prediction

## 1. Business Understanding
   - Define the problem: Predict the winner of a UFC fight.
   - Objectives: Create an ML model to predict winners accurately.
   - Success criteria: Achieve prediction accuracy above 70%.
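
A 70% accuracy target is easier to interpret next to a naive baseline. The sketch below uses made-up labels rather than the real dataset and compares a model's accuracy with the accuracy of always predicting the most common winner. The names `accuracy`, `majority_baseline_accuracy`, and `SUCCESS_THRESHOLD` are illustrative and not part of the notebook.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Hypothetical labels, for illustration only (not taken from the dataset).
y_true = ["Red", "Red", "Blue", "Red", "Blue", "Red", "Red", "Blue", "Red", "Red"]
y_pred = ["Red", "Blue", "Blue", "Red", "Blue", "Red", "Red", "Red", "Red", "Red"]

SUCCESS_THRESHOLD = 0.70  # success criterion from the outline above

model_acc = accuracy(y_true, y_pred)
baseline_acc = majority_baseline_accuracy(y_true)

print(f"model accuracy:    {model_acc:.2f}")
print(f"baseline accuracy: {baseline_acc:.2f}")
print(f"meets criterion:   {model_acc > SUCCESS_THRESHOLD}")
```

If one corner wins most fights, the baseline alone can land close to the target, so the model's margin over the baseline is worth reporting alongside raw accuracy.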

## 2. Data Understanding
   - Gather data: Collect historical UFC fight data.
   - Explore data: Understand its structure, quality, and relationships.

## 3. Data Preparation
   - Preprocess data: Handle missing values, outliers, etc.
   - Feature engineering: Create or transform features for better performance.

## 4. Modeling
   - Select algorithms: Choose suitable ML algorithms.
   - Train-test split: Divide data into training and testing sets.
   - Model training: Train selected algorithms on training data.
   - Model evaluation: Assess model performance using various metrics.

## 5. Evaluation
   - Assess model performance: Compare models, tune hyperparameters.
   - Validate results: Validate model on testing data.

## 6. Deployment
   - Deploy the model: Integrate into a platform for predictions.
   - Monitor performance: Continuously monitor and update the model.

## 7. Iterative Improvement
   - Gather feedback: Collect user and stakeholder feedback.
   - Update the model: Incorporate feedback and new data.
   - Repeat the process: Iterate through the CRISP-DM process for improvements.


## CRISP-DM: 2. Data Understanding

### Import Python Libraries 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#to ignore warnings
import warnings
warnings.filterwarnings('ignore')

from sklearn.preprocessing import MinMaxScaler

###  Load the Dataset

In [None]:
pd.set_option("display.max_rows", None, "display.max_columns", None) 
data_file = 'DataUfc/data.csv'
df = pd.read_csv(data_file)

###  Exploratory Data Analysis (EDA)

In [None]:
#Preview of dataset provided by kaggle
df.head()

Every row is a compilation of info about each fighter up until that fight. 

In [None]:
#Check datatypes by using info() 
# This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
df.info(max_cols=200)

#### Summary Statistics for Numerical Columns:

In [None]:
#Calculate summary statistics for numerical columns (mean, median, standard deviation, etc.).
df.describe()

In [None]:
df.shape

### Column definitions:

```markdown
- **R_ and B_ Prefix**: Signifies red and blue corner fighter stats respectively.
- **_opp_ Columns**: Contain the average of damage done by the opponent on the fighter.
- **KD**: Number of knockdowns.
- **SIG_STR**: Number of significant strikes landed of attempted.
- **SIG_STR_pct**: Significant strikes percentage.
- **TOTAL_STR**: Total strikes landed of attempted.
- **TD**: Number of takedowns.
- **TD_pct**: Takedown percentages.
- **SUB_ATT**: Number of submission attempts.
- **PASS**: Number of times the guard was passed.
- **REV**: Number of reversals landed.
- **HEAD**: Number of significant strikes to the head landed of attempted.
- **BODY**: Number of significant strikes to the body landed of attempted.
- **CLINCH**: Number of significant strikes in the clinch landed of attempted.
- **GROUND**: Number of significant strikes on the ground landed of attempted.
- **win_by**: Method of win.
- **last_round**: Last round of the fight (e.g., if it was a KO in 1st, then this will be 1).
- **last_round_time**: When the fight ended in the last round.
- **Format**: Format of the fight (3 rounds, 5 rounds, etc.).
- **Referee**: Name of the referee.
- **Date**: Date of the fight.
- **Location**: Location in which the event took place.
- **Fight_type**: Weight class and whether it's a title bout or not.
- **Winner**: Winner of the fight.
- **Stance**: Stance of the fighter (orthodox, southpaw, etc.).
- **Height_cms**: Height in centimeters.
- **Reach_cms**: Reach of the fighter (arm span) in centimeters.
- **Weight_lbs**: Weight of the fighter in pounds (lbs).
- **Age**: Age of the fighter.
- **title_bout**: Boolean value of whether it is a title fight or not.
- **weight_class**: Weight class the fight is in (Bantamweight, Heavyweight, Women's Flyweight, etc.).
- **no_of_rounds**: Number of rounds the fight was scheduled for.
- **current_lose_streak**: Count of current consecutive losses of the fighter.
- **current_win_streak**: Count of current consecutive wins of the fighter.
- **draw**: Number of draws in the fighter's UFC career.
- **wins**: Number of wins in the fighter's UFC career.
- **losses**: Number of losses in the fighter's UFC career.
- **total_rounds_fought**: Average of total rounds fought by the fighter.
- **total_time_fought (seconds)**: Count of total time spent fighting in seconds.
- **total_title_bouts**: Total number of title bouts taken part in by the fighter.
- **win_by_Decision_Majority**: Number of wins by majority judges decision in the fighter's UFC career.
- **win_by_Decision_Split**: Number of wins by split judges decision in the fighter's UFC career.
- **win_by_Decision_Unanimous**: Number of wins by unanimous judges decision in the fighter's UFC career.
- **win_by_KO/TKO**: Number of wins by knockout in the fighter's UFC career.
- **win_by_Submission**: Number of wins by submission in the fighter's UFC career.
- **win_by_TKO_Doctor_Stoppage**: Number of wins by doctor stoppage in the fighter's UFC career.
```

In [None]:
# Count the null values in each column
df.isnull().sum()
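
Raw null counts are easier to judge as a share of the dataset. The sketch below shows one way to do that on a small made-up frame; `toy` and its values are hypothetical stand-ins, and `null_summary` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the real dataset.
toy = pd.DataFrame({
    "B_avg_KD": [0.5, np.nan, np.nan, 1.0],
    "R_avg_KD": [0.0, 0.25, np.nan, 0.5],
    "Winner": ["Red", "Blue", "Red", "Red"],
})

null_counts = toy.isnull().sum()
null_summary = pd.DataFrame({
    "null_count": null_counts,
    "null_pct": (null_counts / len(toy) * 100).round(1),
})

# Keep only the columns that actually have gaps, worst first.
null_summary = null_summary[null_summary["null_count"] > 0].sort_values(
    "null_count", ascending=False
)
print(null_summary)
```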

#### Distribution of Null Values 

In [None]:
# Collect the names of all columns that hold blue-corner stats (prefix 'B_').
B_fighters_columns = [column for column in df.columns if column.startswith('B_')]

In [None]:
# Same for the red-corner columns (prefix 'R_').
R_fighters_columns = [column for column in df.columns if column.startswith('R_')]

In [None]:
#function null_values_vis_bar_plot takes four parameters: df, columns_list, color, and title. 
#It generates a bar plot to visualize the count of null (NaN) values in columns 
# of a DataFrame specified in the columns_list.
def null_values_vis_bar_plot(df, columns_list, color, title):
    null_values = {}

    for column in columns_list:
        if df[column].isnull().sum()!=0:
            null_values[column] = df[column].isnull().sum()

    # Create bar plot of the null counts for the given columns
    plt.figure(figsize=(18,8))
    plt.bar(null_values.keys(), null_values.values(), color=color)
    plt.xlabel('Columns')
    plt.ylabel('Count of Null Values')
    plt.title(f'{title}')

    plt.xticks(rotation=90)  # Rotate x-axis tick labels vertically

    plt.show()

In [None]:
null_values_vis_bar_plot(df, B_fighters_columns, 'blue', "Null Values Visualisation for fighters in the Blue Corner")

In [None]:
null_values_vis_bar_plot(df, R_fighters_columns, 'red', "Null Values Visualisation for fighters in the Red Corner")

In [None]:
# Check null values in columns that are not tied to a fighter (names without the 'R_' or 'B_' prefix)
for column in df.columns:
        if df[column].isnull().sum()!=0 and not column.startswith('R_') and not column.startswith('B_'):
            print(f"Nan in {column}: {df[column].isnull().sum()}")

* As shown above, the blue-corner stat columns have 1427 missing rows and the red-corner ones have 712. These fighters most likely had no UFC fights before the bout in question, which is why their stats are missing. We cannot delete these rows: the dataset has 6012 rows in total, and removing about a third of them (1427 + 712) would shrink it significantly.
* The only column with missing values that is not tied to a fighter is the Referee column. We do not need it for the analysis or the model, so we simply drop it.
* Instead of deleting rows with missing values, we can impute sensible replacements. For numerical features, we fill missing values with the mean, median, or mode of the respective feature. For categorical features, we fill missing values with the most frequent category.
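
The imputation strategy described above can be sketched on a small made-up frame. The median is used here for the numerical column, which is one of the options mentioned; `toy` and `filled` are hypothetical names and the values are not from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with one numerical and one categorical column.
toy = pd.DataFrame({
    "B_age": [25.0, 31.0, np.nan, 28.0, 40.0],
    "B_Stance": ["Orthodox", None, "Southpaw", "Orthodox", "Orthodox"],
})

filled = toy.copy()
numeric_cols = filled.select_dtypes(include="number").columns
categorical_cols = filled.select_dtypes(exclude="number").columns

# Numerical features: fill gaps with each column's median.
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].median())

# Categorical features: fill gaps with each column's most frequent value.
for col in categorical_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

Working on a copy keeps the original frame available for before-and-after comparisons.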

In [None]:
# Plot histograms for selected columns before imputation
plt.figure(figsize=(60, 70))
plt.subplots_adjust(hspace=0.5)
B_fighters_NULLcolumns = []
for column in B_fighters_columns:        
    if df[column].isnull().sum()!=0:
        B_fighters_NULLcolumns.append(column)
            
for i, column in enumerate(B_fighters_NULLcolumns):
    plt.subplot(9, 10, i+1)
    sns.histplot(df[column], kde=True, color='blue', bins=30)
    plt.title(f'{column} - Before Imputation')
    plt.xlabel('')
    plt.margins(x=0)  # Remove horizontal padding around the data
plt.show()

In [None]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [None]:
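# Rebuild the list of blue-corner columns with nulls, using df because df2 is only created later in the notebook.
# IterativeImputer works on numeric data only, so non-numeric columns such as B_Stance are skipped.
numeric_columns = df.select_dtypes(include='number').columns
B_fighters_NULLcolumns = [
    column for column in B_fighters_columns
    if column in numeric_columns and df[column].isnull().any()
]

# Create a copy of the dataframe
df_imputed = df.copy()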

# Initialize the imputer
imputer = IterativeImputer(max_iter=10, random_state=0)

# Fit and transform the imputer on the selected columns
df_imputed[B_fighters_NULLcolumns] = imputer.fit_transform(df_imputed[B_fighters_NULLcolumns])

# Check if there are still any missing values
print("Number of missing values after imputation:")
print(df_imputed[B_fighters_NULLcolumns].isnull().sum())

In [None]:
# Plot histograms for selected columns after imputation
plt.figure(figsize=(60, 70))
plt.subplots_adjust(hspace=0.5)

for i, column in enumerate(B_fighters_NULLcolumns):
    plt.subplot(9, 10, i+1)
    sns.histplot(df_imputed[column], kde=True, color='orange', bins=30)  # plot the imputed copy, not the original df
    plt.title(f'{column} - After Imputation')
    plt.xlabel('')
    plt.margins(x=0)  # Remove horizontal padding around the data
plt.show()

#### Visualisation of Relationships Between Variables:

In [None]:
plt.figure(figsize=(20,20))
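# Correlate numeric columns only; recent pandas versions raise an error on text columns otherwise (numeric_only needs pandas >= 1.5)
sns.heatmap(df.corr(numeric_only=True))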

plt.show()

#### Target Variable "Winner" Distribution 

In [None]:
# Set the figure size
plt.figure(figsize=(10, 10))

#Define custom colors for each category 
colors = {'Blue': 'blue',
          'Red': 'red',
          'Draw': 'grey'}

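# Plot pie chart; take the colors in the same order as value_counts() so each wedge gets its own label's color
winner_counts = df['Winner'].value_counts()
winner_counts.plot(kind='pie', 
                   autopct='%1.1f%%', 
                   startangle=90, 
                   colors=[colors.get(label, 'green') for label in winner_counts.index])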
plt.title('Distribution of Target\n')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.legend()
plt.show()

We can see here that draws are very rare. Since "Winner" is our target, this is technically a multi-class classification problem. Removing the "Draw" rows from the target turns it into a simpler binary classification task.
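
A minimal sketch of that simplification, on made-up labels: drop the "Draw" rows and encode the remaining outcome as 0/1. The column name `red_wins` is a hypothetical choice, not something defined in the notebook.

```python
import pandas as pd

# Hypothetical outcomes, for illustration only.
toy = pd.DataFrame({"Winner": ["Red", "Blue", "Draw", "Red", "Blue", "Red"]})

# Keep only fights with a winner, then encode the target as 1 = Red, 0 = Blue.
binary = toy[toy["Winner"] != "Draw"].copy()
binary["red_wins"] = (binary["Winner"] == "Red").astype(int)

print(binary["red_wins"].value_counts())
```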

With such a wide range of features, there are several potential avenues for visualisation and analysis:

* Winner Distribution: Visualize the distribution of winners ('Winner' column). This can give insights into whether certain factors or fighter characteristics contribute to winning more often.
* Fighter Characteristics: Explore the distribution of various fighter characteristics such as height, weight, reach, age, and stance. You can visualize these distributions using histograms or box plots and compare them between winning and losing fighters.
* Fight Outcomes by Weight Class: Analyze the distribution of fight outcomes (win/loss) across different weight classes ('weight_class' column). This can reveal whether certain weight classes have higher win rates or are more competitive.
* Fight Statistics: Explore the average fight statistics (e.g., significant strikes, takedowns, knockdowns) for winning and losing fighters. Visualize these using histograms or box plots to identify differences in performance between winners and losers.
* Stance Analysis: Investigate the impact of fighter stance ('B_Stance' and 'R_Stance' columns) on fight outcomes. Compare the win rates and performance metrics of fighters with different stances.
* Title Bout Analysis: Analyze the frequency and outcomes of title bouts ('title_bout' column). Visualize the distribution of wins and losses in title bouts compared to non-title bouts.
* Correlation Analysis: Compute the correlation matrix between different numerical features in the dataset and visualize it using a heatmap. This can help identify relationships and dependencies between variables.
* Time Series Analysis: If the 'date' column contains temporal data, you can perform time series analysis to examine trends and patterns in fight outcomes or fighter characteristics over time.
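
Several of the ideas above reduce to a grouped win rate. The sketch below computes the red-corner win rate by weight class and by title bout on a made-up frame; the rows are hypothetical, and `red_win` is an illustrative helper column.

```python
import pandas as pd

# Hypothetical fights, for illustration only.
toy = pd.DataFrame({
    "weight_class": ["Lightweight"] * 4 + ["Heavyweight"] * 4,
    "title_bout": [False, False, False, True, False, False, True, True],
    "Winner": ["Red", "Blue", "Red", "Red", "Blue", "Blue", "Red", "Red"],
})

# Boolean helper column: True when the red-corner fighter won.
toy["red_win"] = toy["Winner"].eq("Red")

# The mean of a boolean column is the share of True values, i.e. the win rate.
red_win_rate_by_class = toy.groupby("weight_class")["red_win"].mean()
red_win_rate_by_title = toy.groupby("title_bout")["red_win"].mean()

print(red_win_rate_by_class)
print(red_win_rate_by_title)
```

The same pattern should carry over to the stance columns by grouping on 'R_Stance' or 'B_Stance' instead.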

## CRISP-DM: 3. Data Preparation

### Dealing with NaNs

In [None]:
# Returns the column names along with the number of NaN values in that particular column
for column in df.columns:
    if df[column].isnull().sum()!=0:
        print(f"Nan in {column}: {df[column].isnull().sum()}")


* Numerical Columns: Many numerical columns holding average fight statistics (e.g., average knockdowns, significant strike percentages, takedown percentages) have a substantial number of missing values (1427). This is consistent with the earlier observation that some fighters probably had no prior UFC fights to average over.
* Categorical Columns: Categorical columns like 'Stance' for both red and blue fighters also have missing values, albeit fewer compared to the numerical columns. Similarly, the 'Referee' column has 32 missing values.
* Physical Attributes: Columns related to physical attributes like height, reach, and weight also have missing values. For example, 'B_Reach_cms' has 891 missing values, 'R_Reach_cms' has 406, 'B_Height_cms' has 10, and 'R_Height_cms' has 4. These counts are small compared with the fight-statistics columns.
* Age: Both 'B_age' and 'R_age' columns have missing values (172 and 63 missing values respectively), indicating that the age of some fighters is not recorded in the dataset.
* Winner: It's noteworthy that the 'Winner' column has no missing values, indicating that the outcome of each fight is known.


* Referee doesn't look like an important column. Let's delete that.
* Let's see if height and reach have a correlation
* For the rest, let's fill Age and Height with the median of the column and Stance with its most frequent value.

Dropping the "Referee" column

In [None]:
df2 = df.copy()

In [None]:
df2.drop(columns=['Referee'], inplace=True)

In [None]:
df2.head()

Correlation between 'Height' and 'Reach'

In [None]:
# Set style of scatterplot
sns.set_context("notebook", font_scale=1.1)
sns.set_style("ticks")

# Create scatterplot with a fitted regression line
# (x, y and data are passed as keywords; recent seaborn versions reject them positionally)
sns.lmplot(x='R_Height_cms', # Horizontal axis
           y='R_Reach_cms', # Vertical axis
           data=df2, # Data source
           fit_reg=True # fit a regression line
           )

* We can see there is a positive correlation between height and reach, so we'll fill the missing reach values with the fighter's height.
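
The visual impression can be backed by a number. The sketch below computes the Pearson correlation on a made-up frame and then applies the height-for-reach fill; the values are hypothetical and `r` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements, for illustration only.
toy = pd.DataFrame({
    "R_Height_cms": [165.0, 170.0, 175.0, 180.0, 185.0, 190.0],
    "R_Reach_cms": [168.0, 172.0, np.nan, 185.0, 188.0, np.nan],
})

# Pearson correlation; rows where either value is missing are ignored.
r = toy["R_Height_cms"].corr(toy["R_Reach_cms"])
print(f"Pearson r between height and reach: {r:.3f}")

# Fill missing reach with the same fighter's height.
toy["R_Reach_cms"] = toy["R_Reach_cms"].fillna(toy["R_Height_cms"])
print(toy)
```

Using height as a stand-in is a simplification, since reach and height are correlated but not equal. It does keep the filled value on a plausible scale for that fighter.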

In [None]:
df2.isnull().sum()

In [None]:
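#Fills missing values in the Reach_cms column with the Height_cms value
# (assigning the result back avoids chained in-place fillna, which newer pandas versions warn about)
df2['R_Reach_cms'] = df2['R_Reach_cms'].fillna(df2['R_Height_cms'])
df2['B_Reach_cms'] = df2['B_Reach_cms'].fillna(df2['B_Height_cms'])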

Filling 'Stance' with its most frequent value and 'Age' with the median of that column.

In [None]:
df2['B_Stance'].value_counts()

In [None]:
# Fill missing stances with 'Orthodox' (chosen from the value counts above)
df2['R_Stance'] = df2['R_Stance'].fillna('Orthodox')
df2['B_Stance'] = df2['B_Stance'].fillna('Orthodox')

In [None]:
for column in df2.columns:
    if df2[column].isnull().sum()!=0:
        print(f"Nan in {column}: {df2[column].isnull().sum()}")

To see how filling missing values with the median changes the shape of the 'B/R_age' columns, we can plot the distribution of values before and after the fill.

In [None]:
# Before filling missing values (kernel density estimation)
plt.figure(figsize=(8, 5))
sns.kdeplot(df2['B_age'].dropna(), color='blue', label='Before Filling Missing Values')
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('KDE Plot of B_age Before Filling Missing Values')
plt.legend()
plt.show()


In [None]:
# Fill missing values with the median (assign the result back instead of using chained inplace=True)
df2['B_age'] = df2['B_age'].fillna(df2['B_age'].median())

In [None]:
# After filling missing values
plt.figure(figsize=(8, 5))
sns.kdeplot(df2['B_age'], color='orange', label='After Filling Missing Values')
plt.xlabel('Age')
plt.ylabel('Density')
plt.title('KDE Plot of B_age After Filling Missing Values')
plt.legend()
plt.show()
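
As a numeric complement to the KDE plots, summary statistics before and after the fill show what median imputation does to a column. The sketch below uses made-up ages; `ages`, `filled`, and `comparison` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with three missing entries, for illustration only.
ages = pd.Series([22.0, 25.0, 28.0, 31.0, 34.0, np.nan, np.nan, np.nan], name="B_age")

# Median imputation: every gap receives the same central value.
filled = ages.fillna(ages.median())

comparison = pd.DataFrame({
    "before": ages.describe(),
    "after": filled.describe(),
}).loc[["count", "mean", "std", "50%"]]
print(comparison)
```

The median stays the same while the standard deviation shrinks, because every imputed value sits at the centre of the distribution. In a KDE plot this shows up as a sharper peak around the median.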

In [None]:
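# Fill any remaining missing values in df2 with the median of each numeric column.
# numeric_only=True skips text columns, which recent pandas versions refuse to take a median of.
df2 = df2.fillna(df2.median(numeric_only=True))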

### Removing non-essential columns