 January 16, 2023


# Social Data Science Base Camp Exam 
____

In this project, I will extract and analyse data on UFC fighters. While I do not have a personal affinity for the sport, I frequently view it as a result of my boyfriend's interest. During one of his sessions of research on the UFC website, I was intrigued by its design and the comprehensive statistics provided on each fighter's performance. I also observed the presence of links to their social media profiles, adding to the website's appeal.


To meet the exam criteria and to make this notebook more organized, the structure is as follows:

**The notebook is divided into five overall sections**: 
1. Scraping data from the UFC website 
2. Extracting Twitter information 
    * Conducting word counts 
3. Merging datasets 
4. Data visualization 
5. Data analysis 
    * Linear regression
    * Linear regression with non-binary categorical variable 
    * Logistic regression 
    


Below, I will start with the first section. 

## Section 1: Scraping the UFC website 

For this task, I will be scraping data from the official website of the Ultimate Fighting Championship (UFC). The website contains a comprehensive list of all fighters in the UFC and provides detailed information on each fighter, including performance statistics (such as wins by knockout), background information (such as age), and links to their social media pages. To narrow down the list of fighters, I have applied filters based on fighting style, selecting those who specialize in MMA, jiu-jitsu, and Brazilian jiu-jitsu. This process has yielded a total count of around 300 fighters.

I began by scraping data from 336 fighters, but due to empty fields within the HTML table and/or advertisements, I removed these and ended up with a final count of 265 fighters. Using the HTML link provided by the UFC for each fighter, I extracted additional information from their profiles such as age, wins by knockout, wins by submission, significant strikes landed per minute, and arm reach. I find these variables interesting for the analysis as they might help answer questions such as the relationship between arm reach and wins by knockout, or other relationships.

At the end of section 1, I compiled all of the collected data into a data frame and saved it as a CSV file for further analysis.

Let's begin!



In [None]:
# importing requests and getting the HTML code

import requests as rq
ufc = rq.get('https://www.ufc.com/athletes/all?filters%5B0%5D=fighting_style%3A7146')

# checking that the status code is 200, ensuring that everything is fine.
ufc.status_code

In [None]:
# turning it into a beautiful soup object

from bs4 import BeautifulSoup
ufc_soup = BeautifulSoup(ufc.text, "html.parser") # ufc.text is the HTML code of the page
ufc_soup

Upon examining the webpage, I noticed that all the fighter data is organized within a table. As such, I narrowed the scope of the data extraction process to only include the information contained within this table, which I stored in an object.

In [None]:
fighter = ufc_soup.find_all('li', class_ = "l-flex__item")

In [None]:
# lets take a look
fighter

The page I am trying to scrape has a "load more" button. As a result, my code above only returned the 12 fighters shown on the first page of the website. 

In [None]:
len(fighter)

To scrape information from the other fighters, I will use the selenium package. 

In [None]:
# importing selenium (it needs to be installed beforehand)
import selenium

In [None]:
# importing webdriver_manager:

import webdriver_manager

In [None]:
#Furthermore, there are several packages I need for making this work.

import webdriver_manager
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time

# the URL to the ufc webpage:
UFC_URL = "https://www.ufc.com/athletes/all?filters%5B0%5D=fighting_style%3A7145&filters%5B1%5D=fighting_style%3A7146&filters%5B2%5D=fighting_style%3A7150"

PATIENCE_TIME = 60

LOAD_MORE_BUTTON_XPATH = "//a[@class='button']" # the xpath for the load more button. I found this 
                                                # when I inspected the element. From here, I could 
                                                # simply copy the xpath from the inspect

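# Selenium 4 expects the driver path wrapped in a Service object rather than passed positionally.
from selenium.webdriver.chrome.service import Service

# webdriver_manager downloads a matching ChromeDriver. I have chosen Chrome for convenience
# as it was already installed on my computer. Running this code opens a new Chrome window.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))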
driver.get(UFC_URL)

# There is probably a more pythonic way of doing this, but this is how I did: 
# To reach the final page, I'd have to click the "load more" button 27 times:

for n in range(27):
    try:
        loadMoreButton = driver.find_element(By.XPATH, LOAD_MORE_BUTTON_XPATH)
        time.sleep(2)
        loadMoreButton.click()
        time.sleep(5) 
    except Exception:
        # the button is missing or not clickable (e.g. everything is loaded), so skip this round
        pass
print("Complete")
time.sleep(10)


page = driver.page_source # now I got the page in the page element

soup = BeautifulSoup(page, "html.parser") #making the page into a soup object

In [None]:
# Investigating it
soup

In [None]:
# as above, I extract the table I want from the soup and see how many fighters we have now 
# - luckily, a bit more than 12 this time.

fighter_soup = soup.find_all('li', class_ = "l-flex__item")
print(len(fighter_soup))

In [None]:
# inspecting one 

fighter_soup[0]

In [None]:
# extracting the name 

fighter_soup[1].find('span', class_='c-listing-athlete__name').text #get the name

In [None]:
# extracting the nickname 

fighter_soup[1].findAll('div', class_='field__item')[1].text

In [None]:
# extracting the weight class

fighter_soup[1].findAll('div', class_='field__item')[2].text

In [None]:
# extracting the twitter

fighter_soup[1].findAll('a', class_='c-listing-athlete-flipcard__social-link')[0].get('href')

In [None]:
# extracting the profile link + adding the first part of the link to the string as this is not included in the html code

ufclink = "https://www.ufc.com"
ufclink + str(fighter_soup[1].find('a')['href'])

In [None]:
# Creating a function so that I can get all the information for each fighter at once

import numpy as np  # needed for np.nan below

def extract_fighter_info(soup): 

    name = soup.find('span', class_='c-listing-athlete__name').text
    fields = soup.find_all('div', class_='field__item')

    # Fighters without a nickname have the weight class in position 1 instead of 2
    if 'weight' in fields[1].text:
        nickname = np.nan 
    else:
        nickname = fields[1].text
    
    if ' \n\n\n\n\n' in fields[2].text:
        weight_class = fields[1].text
    else: 
        weight_class = fields[2].text
    
    # The social links come in no fixed order, so I keep the first one that mentions twitter
    twitter_final = []
    social_links = soup.find_all('a', class_='c-listing-athlete-flipcard__social-link')
    for link in social_links:
        href = link.get('href')
        if href and 'twitter' in href:
            twitter_final.append(href)
            break
    else:
        twitter_final.append(np.nan)  # no twitter link found
        
    profil_link = ufclink + str(soup.find('a')['href'])

    return name, nickname, weight_class, twitter_final, profil_link


The function that I have written contains several checks, each of which serves a specific purpose. One of the reasons for this is that I discovered during the data extraction process that not all fighters had nicknames. As a result, the weight class information was stored in the position of the nickname (i.e., field__item[2] became field__item[1]). Consequently, when I first created the data frame, some of the values for the fighters did not match the correct columns.

Furthermore, the order of the Twitter links was not consistent for all fighters. To address this issue, I had to write a loop that iterated through the information within 'c-listing-athlete-flipcard__social-link', and then only extracted the info if the string contained 'twitter'. This approach ensured that the Twitter handles were properly assigned to the correct fighters in the final data frame.

In [None]:
# testing if it works - luckily it does.
extract_fighter_info(fighter_soup[1])

Getting this information for all the fighters by iterating through the fighter_soup, using the function:

In [None]:
import pandas as pd

info = []
for i in range(len(fighter_soup)):
    try:
        temp = extract_fighter_info(fighter_soup[i])
    except: 
        temp = pd.NA
    info.append(temp)

In [None]:
# everything should now be stored in "Info"

info

In [None]:
# the first one is an NA which makes it difficult to execute the code below when we want 
# the list in a dataframe. I remove the first NA only. 

info.pop(0)

In [None]:
# double checking that all fighters still are with us

len(info)

In [None]:
# making the info to dataframe

ufc_df = pd.DataFrame(info)

In [None]:
ufc_df

In [None]:
# there are, however, extra whitespace and newline characters at the names. Cleaning it:
info1 = []

for inf in info: 
    try:
        temp = inf[0].strip()
    except TypeError: 
        temp= pd.NA
    info1.append(temp)

In [None]:
len(info1)

In [None]:
# changing the column 0 to the info1 list with cleaned names
ufc_df[0]=info1

In [None]:
ufc_df

Let's drop the rows that have all NaN values.
Some rows have only NaN values because some of the boxes in the structured HTML table contain no fighter, only empty space or an advertisement. 

In [None]:

import numpy as np

ufc_df = ufc_df.dropna(how='all')
ufc_df.head(50)

In [None]:
# after cleaning, down to 265 observations

ufc_df.shape

In [None]:
# I see that the twitter links are now contained within lists. Lets fix it.

print(type(ufc_df.loc[0, 3]))


In [None]:
ufc_df[3] = ufc_df[3].str.get(0)

In [None]:
ufc_df
# now without the lists

In [None]:
# lets rename the columns: 
ufc_df.columns = ['name', 'nickname', 'weight_class', 'twitter', 'ufc_profile']
ufc_df.head(50)

### So far so good.

Now I have some different variables in my dataframe, but mostly just categoricals and links. 
The weight class variable will constitute my non-binary categorical variable. 
But we need some more. Let's first create a binary variable, the most obvious one being a sex variable. The fighters are not filtered by sex, but we can easily derive it from how the weight classes are named: for women, they are all called "Women's X-weight". 


In [None]:
# Lets see how many women we have:

ufc_df.weight_class.str.count("Women").sum()

In [None]:
# creating the sex variable

ufc_df['gender'] = np.where(ufc_df.weight_class.str.count("Women"), 0, 1)

In [None]:
# women are now 0 and men are 1

ufc_df.head(20)

In [None]:
# counting the 1 and 0 in the gender column to make sure we get the same results as when counting the "women"
# string from the weight class

ufc_df['gender'].value_counts()

# and it matches

___ 

#### Scraping information from each fighter profile

As there is a lot of information and statistics on each fighter's profile, we can scrape this information and add it to the dataframe. 

The performance information we want from the profiles are: 

* wins by knockout (continuous) 
* wins by submission (continuous) 
* striking accuracy (stated in percentage on website but divided by 100 here) (continuous)
* significant strikes landed per min (continuous)
* Average fighting time (continuous)

___ 
First: **number of wins by knockout**

In [None]:

knockout = []

from tqdm.notebook import tqdm # using tqdm to get a feeling of the progress of the code


for link in tqdm(ufc_df['ufc_profile']):
    try:
        url1 = rq.get(link)
        soup_temp = BeautifulSoup(url1.text)
        knockout_temp = soup_temp.find("p", class_= "athlete-stats__text athlete-stats__stat-numb").text
        knockout.append(knockout_temp)
    except:
        knockout.append(np.nan)
        
# not all fighters have 1) won by knockout or 2) have the information listed. In those cases, they get NaN

In [None]:
knockout

In [None]:
ufc_df['wins_knockout'] = knockout 
ufc_df

___ 

Next: **wins by submission**

In [None]:

submission = []

for link in tqdm(ufc_df['ufc_profile']):
    try:
        url1 = rq.get(link)
        soup_temp = BeautifulSoup(url1.text)
        submission_temp = soup_temp.find_all("p", class_= "athlete-stats__text athlete-stats__stat-numb")[1].text
        submission.append(submission_temp)
    except IndexError:
        submission.append(np.nan)

# not all fighters have 1) won by submission or 2) have the information listed. In those cases, they get NaN

In [None]:
submission

In [None]:
ufc_df['wins_submission'] = submission

In [None]:
ufc_df

___ 

Next: **striking accuracy**

In [None]:
import re

striking_accuracy = []


for link in tqdm(ufc_df['ufc_profile']):
    try:
        url1 = rq.get(link)
        soup_temp = BeautifulSoup(url1.text)
        temp1 = soup_temp.find("svg", class_= "e-chart-circle").text
        temp2 = re.sub(r'\D', '' , temp1.strip()) # temp1 contains text. We are only interested in the number
        temp3 = int(str(temp2)[:2])/100 # the number is posted twice. We only want it once and it is a 2 
                                        # digit number + divide by 100 as it is stated in percentage
        striking_accuracy.append(temp3)
    except:
        striking_accuracy.append(np.nan)

In [None]:
striking_accuracy

In [None]:
# adding it to df

ufc_df['striking_accuracy'] = striking_accuracy 
ufc_df

___ 

Next: **significant strikes landed per min**

In [None]:


str_landed_min = []

for link in tqdm(ufc_df['ufc_profile']):
    try:
        url1 = rq.get(link)
        soup_temp = BeautifulSoup(url1.text)
        temp1 = soup_temp.find("div", class_= "c-stat-compare__number").text
        temp2 = re.sub(r'\n', '' , temp1.strip()) # temp1 contains \n. We are only interested in the number
        str_landed_min.append(temp2)
    except:
        str_landed_min.append(np.nan)

In [None]:
str_landed_min



For three of the fighters, the extraction was wrong, resulting in a value of 00:00. 
Looking into the profile webpages for the fighters manually, it turns out that the websites contain no values as they are not updated with fighter info. 

I will manually correct for these below after having included the list to the df.

In [None]:
# adding the list to the df.

ufc_df['sig_str_landed_min'] = str_landed_min

In [None]:
# correcting values for fighter index 17, 71 and 157
# (.loc assigns in place and avoids pandas' chained-assignment warning)

ufc_df.loc[[17, 71, 157], 'sig_str_landed_min'] = np.nan

In [None]:
ufc_df.head(20)

___ 

Next: **average fight time**

In [None]:

avg_fight_time = []

for link in tqdm(ufc_df['ufc_profile']):
    try:
        url1 = rq.get(link)
        soup_temp = BeautifulSoup(url1.text)
        temp1 = soup_temp.find_all("div", class_= "c-stat-compare__number")[7].text
        temp2 = re.sub(r'\D', '', temp1.strip())
        temp3 = int(str(temp2))/100
        avg_fight_time.append(temp3)
    except:
        avg_fight_time.append(np.nan)

In [None]:
avg_fight_time

# some of the fighter get nan value here as the website is not updated with the average fighting time

In [None]:
ufc_df['avg_fight_time'] = avg_fight_time
ufc_df.head(15)

In [None]:

max(avg_fight_time)

I was initially very confused by the fight time statistics for certain fighters. Specifically, some fighters had an average fight time that exceeded the standard duration of 15 minutes (as fights usually consist of three rounds of five minutes each). I eventually found that this is most likely because championship bouts and main events are scheduled for five rounds, which allows fights of up to 25 minutes.

When manually examining fighters with an average fight time exceeding 15 minutes, I found that the information had been accurately extracted from the website.
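
One caveat about the values themselves: the cell above strips the colon from a time such as 12:34 and divides by 100, so the stored value 12.34 reads as "12 minutes 34 seconds" rather than 12.34 decimal minutes. If the variable is later treated as a continuous measure, converting it to decimal minutes may be preferable. Below is a minimal sketch, assuming the website reports the time as MM:SS. The helper name `fight_time_to_minutes` is hypothetical and not part of the notebook above.

```python
import math


def fight_time_to_minutes(text):
    """Convert an 'MM:SS' string to decimal minutes; return NaN if it does not parse."""
    try:
        minutes, seconds = text.strip().split(":")
        return int(minutes) + int(seconds) / 60
    except (AttributeError, ValueError):
        # AttributeError: text is not a string (e.g. None or NaN)
        # ValueError: no colon, too many parts, or non-numeric parts
        return math.nan


print(fight_time_to_minutes("12:30"))  # 12.5
```

With this conversion, a fight time of 12:30 becomes 12.5 minutes instead of 12.30.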

___ 

Now I will move on from the fighter stats and extract some background info on the fighters.
More specifically, I will extract the following background information: 
* status
* age 
* reach 

___ 


First: fighter status

In [None]:
status = []

# some of the fighters dont have fighter status and the code instead extracted the fighters hometown/country.
# These had the format "hometown, country" and i thus had to tell the code that if there was a comma in the 
# text, make this a nan value instead. 

string = ','

for link in tqdm(ufc_df['ufc_profile']):
        url1 = rq.get(link)
        soup_temp = BeautifulSoup(url1.text)
        temp1 = soup_temp.find("div", class_= "c-bio__text").text
        if string not in temp1: 
            status.append(temp1)
        else:
            status.append(np.nan)

In [None]:
status

In [None]:
ufc_df['status'] = status

In [None]:
# creating a binary variable of status. 
# with this code, the default will be fighting =0 and the "not fighting for various reasons (not fighting/ 
# retired/ nan/ etc.) = 1. 

ufc_df['status_binary'] = np.where(ufc_df.status.str.count("Active"), 0, 1)

In [None]:
ufc_df.head()

___ 


Next: **age**

In [None]:
age = []


for link in tqdm(ufc_df['ufc_profile']):
        url1 = rq.get(link)
        soup_temp = BeautifulSoup(url1.text)
        temp1 = soup_temp.find("div", class_= "field field--name-age field--type-integer field--label-hidden field__item").text
        age.append(temp1)

In [None]:
age

In [None]:
ufc_df['age']= age

___ 


Next: **reach**

Not all fighters had their reach listed on their page. 
In these situations, the -2 index would instead contain the fighter's debut date. 
A debut date is a string with more than 5 characters (e.g. JUL. 30, 2022). 
I thus coded that whenever a fighter's -2 index is a string with 5 characters or fewer, it is appended to the list. Otherwise, NaN is inserted. 
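
The length rule would also accept any other short text that happens to land in that position. A stricter alternative is to check that the text actually looks like a number. Below is a minimal sketch, assuming the reach is shown as a plain number such as "72.00". Both that format and the helper name `parse_reach` are assumptions for illustration.

```python
import math
import re

# Assumption: reach appears as a plain two- or three-digit number, optionally with decimals.
REACH_PATTERN = re.compile(r"^\d{2,3}(\.\d+)?$")


def parse_reach(text):
    """Return the reach as a float, or NaN when the text is not a plain number."""
    cleaned = str(text).strip()
    if REACH_PATTERN.match(cleaned):
        return float(cleaned)
    return math.nan


print(parse_reach("72.00"))          # 72.0
print(parse_reach("Jul. 30, 2022"))  # nan
```

This variant also returns the reach as a number directly, so no later conversion from string is needed.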

In [None]:

reach = []


for link in tqdm(ufc_df['ufc_profile']):
        url1 = rq.get(link)
        soup_temp = BeautifulSoup(url1.text)
        temp1 = soup_temp.find_all("div", class_= "c-bio__text")[-2].text
        if len(temp1) <= 5:
            reach.append(temp1)
        else: 
            reach.append(np.nan)

In [None]:
reach

In [None]:
ufc_df['reach']= reach

In [None]:
ufc_df

In [None]:
ufc_df.to_csv('ufc_data.csv', index=True)

___ 

In retrospect, creating a function to get all the information might have been easier. But on the other hand, the web pages were different from fighter to fighter, and I think it would have caused more problems if I didn't go through the information bit by bit. 
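
A possible middle ground would be to keep the bit-by-bit extraction but download each profile only once, since every loop above requests the same pages again. Below is a minimal sketch of such a cache. The names `make_cached_fetcher` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for a real download (such as `lambda url: rq.get(url).text`) so that the sketch runs offline.

```python
def make_cached_fetcher(fetch):
    """Wrap a fetch(url) -> html function so each URL is downloaded only once."""
    cache = {}

    def cached_fetch(url):
        if url not in cache:
            cache[url] = fetch(url)
        return cache[url]

    return cached_fetch


# Stand-in for a real download, recording how often it is called.
calls = []


def fake_fetch(url):
    calls.append(url)
    return f"<html>{url}</html>"


get_html = make_cached_fetcher(fake_fetch)
get_html("https://www.ufc.com/athlete/example")  # hypothetical URL
get_html("https://www.ufc.com/athlete/example")  # served from the cache
print(len(calls))  # 1
```

Each extraction loop could then call `get_html(link)` instead of requesting the page again, which would cut the number of requests to one per fighter.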




___ 

## Section 2:  getting Twitter information 

To retrieve the Twitter information for each fighter, I used the column in the ufc_df that contained their respective Twitter links ('twitter'). Using the Twitter API, I extracted information such as follower count, total number of tweets, and their past 0-10 tweets. This information was subsequently stored in the twitter_df.

After retrieving this information, I conducted several wordcounts as prescribed by the exam requirements:
* a word count that's relevant to each observation in my DataFrames. 
* a word count where I find the most popular word in my text variable containing all fighters' tweets. 
    * Then I count how many times this popular word is used within each observation's tweets.


In [None]:
# Importing tweepy and getting the different tokens from my AppCred file:

import tweepy

from AppCred_Template import BEARER_TOKEN
from AppCred_Template import CONSUMER_KEY, CONSUMER_SECRET
from AppCred_Template import ACCESS_TOKEN, ACCESS_TOKEN_SECRET


In [None]:
# getting the api

api = tweepy.Client(bearer_token = BEARER_TOKEN,
                       consumer_key = CONSUMER_KEY,
                       consumer_secret = CONSUMER_SECRET,
                       access_token = ACCESS_TOKEN,
                       access_token_secret = ACCESS_TOKEN_SECRET,
                       return_type=dict,        # Return the response as a Python dictionary.
                       wait_on_rate_limit=True) # Wait once the rate limit is reached. 

Right now, the twitter column in the dataframe contains the entire link, not just the handle. Below, I will make a list consisting only of the twitter handles.
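
As an alternative to matching the link prefix character by character, the handle can be read from the path of the URL. Below is a minimal sketch, assuming every link includes the scheme (https://...). The helper name `extract_handle` and the example handle are hypothetical.

```python
from urllib.parse import urlparse


def extract_handle(link):
    """Return the Twitter handle from a profile URL, or None if there is none."""
    if not isinstance(link, str):
        return None  # missing values such as NaN are not strings
    # Assumption: the link includes the scheme, so the handle is the first path segment.
    path = urlparse(link).path.strip("/")
    handle = path.split("/")[0] if path else ""
    return handle or None


print(extract_handle("https://twitter.com/example_fighter"))  # example_fighter
print(extract_handle(float("nan")))  # None
```

Because missing links come back as None rather than as the string 'nan', they cannot be mistaken for a real handle later on.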

In [None]:
ufc_df['twitter']

In [None]:
# using regex to only get the twitter handle

import re

twitter_handle = []

for string in ufc_df['twitter']: 
    temp_string = re.sub(r'^\w\w\w\w\w\W\W\W\w\w\w\w\w\w\w\W\w\w\w\W', '', str(string))
    twitter_handle.append(temp_string)

In [None]:
twitter_handle

___

To get hold of various information on the fighters' Twitter profiles, I will look into the user field 'public_metrics'. This contains, among other things, information on followers count, following count and number of tweets. 

In [None]:
# creating an object containing everything so i wont have to ask twitter for each item and wait:

twitter_info = []

for handle in twitter_handle:   
    try:
        if len(handle) <= 3: 
            twitter_info.append(np.nan) 
        else: 
            handle_temp = api.get_user(username= handle, user_fields = ['public_metrics'])
            twitter_info.append(handle_temp)
    except: 
        twitter_info.append(np.nan)
        

I included the 'if statement' in this code because the fighters with no twitter handle had the handle string 'nan' and not a real NaN. As a result, a Twitter user with the handle 'nan', who has no connection to the UFC whatsoever, was included several times. To prevent this, any handle with 3 characters or fewer is set to NaN. 
From eyeballing the list above, no fighter has a twitter handle of 3 characters or fewer. 

In [None]:
twitter_info

In [None]:
len(twitter_info)

In [None]:
type(twitter_info)

**Extracting the number of followers:**

In [None]:

follower_count = []

for i in range(len(twitter_info)):
    try:
        follower_count_temp = twitter_info[i]['data']['public_metrics']['followers_count']
        follower_count.append(follower_count_temp)
    except: 
        follower_count.append(np.nan)

In [None]:
follower_count

In [None]:
len(follower_count)

**Extracting how many profiles each fighter is following:**

In [None]:
following_count = []

for i in range(len(twitter_info)):
    try:
        following_count_temp = twitter_info[i]['data']['public_metrics']['following_count']
        following_count.append(following_count_temp)
    except: 
        following_count.append(np.nan)

In [None]:
following_count

**Extracting the total tweet count of each fighter:**

In [None]:
tweet_count = []

for i in range(len(twitter_info)):
    try:
        tweet_count_temp = twitter_info[i]['data']['public_metrics']['tweet_count']
        tweet_count.append(tweet_count_temp)
    except: 
        tweet_count.append(np.nan)

In [None]:
tweet_count

In [None]:
# saving the extracted information in a dataframe

twitter_df = pd.DataFrame(list(zip(twitter_handle, follower_count, following_count, tweet_count)),
               columns =['twitter_handle', 'follower_count', 'following_count', 'tweet_count'])

In [None]:
twitter_df

# As visible below, I am keeping the rows for fighters that do not have twitter, resulting in NaNs in the 
# twitter rows, as I am planning on merging the UFC and twitter datasets. 

**Extracting the tweets:**

Twitter is a central way of communicating in the UFC sport, and it makes sense that many UFC fighters invest much time and produce many tweets. 
However, due to my first-level developer account, there are certain limitations to how many tweets I can get hold of and how many days back I can access them. 
Given that I also have more than 200 fighters in my dataset, I assume it is acceptable not to extract all tweets for each fighter (as we can see on the tweet_count, Twitter is heavily used) but that it is adequate only to obtain 1-10 tweets per fighter. 

I will do this below:

In [None]:
# creating an object containing the timeline for the fighters. 

fighter_timeline = []

for i in range(len(twitter_info)):
    try:
        id_temp = twitter_info[i]['data']['id']
        timeline_temp = api.get_users_tweets(id_temp)
        fighter_timeline.append(timeline_temp)
    except: 
        fighter_timeline.append(np.nan)


In [None]:
fighter_timeline

In [None]:
len(fighter_timeline)

In [None]:
# getting a feeling of how the data is structured
fighter_timeline[0]['data'][0]['text']

In [None]:

for fighter in fighter_timeline: 
    try: 
        print(len(fighter['data']))
    except TypeError: 
        pass

I see that we get at most 10 tweets per fighter, which matches the default number of tweets returned per request by get_users_tweets. I assume those with fewer tweets have simply posted fewer tweets. 

Within the fighter_timeline, there are many fighters, each with a dict of data and, under it, 0-10 strings. 
I have extracted these by using two for loops:



In [None]:

tweets_text1 = []

for fighter in fighter_timeline: 
    tweets_text = []
    try: 
        for text in range(len(fighter['data'])):
            text_temp = str(fighter['data'][text]['text'])
            #print(text_temp)
            tweets_text.append(text_temp)
    except: 
        tweets_text.append(np.nan)
    tweets_text1.append(tweets_text)
        
        

In [None]:
# tweets_text1 now has the tweets separated for each fighter. 
tweets_text1

In [None]:
len(tweets_text1) # and we still have the NaNs 

In [None]:
# adding it to the dataframe
twitter_df['tweets_text'] = tweets_text1

In [None]:

twitter_df

In [None]:
# checking that everything is still there even though it is now in pandas dataframe
twitter_df['tweets_text'][0]

**Next step: Conduct a word count relevant to each observation in my Twitter data frame** 

As fighters use Twitter to comment on fights and fighters, both in real-time and between fighting events, I imagine a lot of references to the UFC. I will count the word 'UFC' in each fighter's tweets and include it in the data frame.
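
Note that the cells below count the number of tweets that contain the string, not the number of times the word occurs. If every occurrence should count, a word-level count is one option. Below is a minimal sketch. The helper name `count_word` and the sample tweets are made up for illustration.

```python
import re


def count_word(tweets, word):
    """Count whole-word, case-insensitive occurrences of `word` across tweets."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    # Non-string entries (such as NaN placeholders) are skipped.
    return sum(len(pattern.findall(t)) for t in tweets if isinstance(t, str))


sample = ["UFC 300 was wild. Thank you UFC!", "back in the gym", float("nan")]
print(count_word(sample, "ufc"))  # 2
```

Because of the word boundaries, a hashtag such as #UFC300 would not be counted by this sketch, so the choice between the two approaches depends on what should qualify as a mention.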



___





In [None]:
# inspecting the first element of the list that we added to the dataframe

tweets_text1[0]

In [None]:
# just trying with one of the fighter tweets before moving on to iterating through all of them
count = 0

for tweets in tweets_text1[1]:
    if 'UFC' in tweets:
        count += 1
count

In [None]:
# Iterating over the tweets to count how many times each fighter mentions UFC in their past tweets

total_count = []
count = 0

for fighter in tweets_text1: 
    for tweets in fighter: 
        if tweets is np.nan: 
            pass 
                
        elif 'ufc' in tweets.lower(): # lower-casing so that 'ufc' and 'Ufc' are counted too
            count += 1
    total_count.append(count)
    count= 0 # need to state this again to "restart" the count object so it does not sum everything. 
            
total_count

In [None]:
len(total_count)

In [None]:
twitter_df['ufc_count']= total_count
twitter_df


___

**Next step: More wordcounts**

As mentioned above, I will conduct the following steps below: 

1. Finding the most popular word in the text variable in my DataFrame. This will be the tweets_text variable. 

2. Afterwards, I will create a new variable that indicates the number of times that word is used for each observation in the data. 


#### Step 1: finding the most popular word in the whole text variable: tweets_text

In [None]:
# this is the object. It is exactly equal to the column in the dataframe (see above). 
# I simply find it easier to work with in this list form, but it is the same. 

tweets_text1


To get the most popular word in the whole text variable, I find it difficult when the lists are nested. 
Below, I make a function that takes all the lists and flattens them so that I will get one long list only with strings of the text

In [None]:

def flatten(input_list):
    output_list = []
    for element in input_list:
        if type(element) == list:
            output_list.extend(flatten(element))
        else:
            output_list.append(element)
    return output_list


In [None]:
# like this

tweets_text2 = flatten(tweets_text1)
tweets_text2

The next step is to split the strings into separate words. However, I get an error as I have NaNs in my list, 
and I cannot split an element that is not a string. Thus, I have to remove the NaNs. I do this by 
iterating over the elements in my list and checking whether they can be converted to floats. When an element can 
(as NaN can), the loop moves on; when it cannot, the element is appended to a list. 
My new list of strings, containing no floats, is thus within the res object. 
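
One side effect of the float test is that it also drops tweets whose whole text can be read as a number, such as "100" or the literal text "nan". Checking the type directly avoids this. Below is a minimal sketch. The helper name `keep_strings` and the sample list are made up for illustration.

```python
def keep_strings(items):
    """Keep only the str elements, dropping NaN floats and other non-text values."""
    return [item for item in items if isinstance(item, str)]


mixed = ["great fight tonight", float("nan"), "100", "nan"]
print(keep_strings(mixed))  # ['great fight tonight', '100', 'nan']
```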

In [None]:

res = []
for element in tweets_text2:
    try:
        float(element)
    except ValueError:
        res.append(element)

Now it is possible for me to split it. Furthermore, I will make all letters into lower case so that if a word has been written multiple times and only differs in upper/lower case, it will count the same. 
I do this by using list comprehension: 

In [None]:

tweets_text3 = [word for line in res for word in line.lower().split()]

In [None]:
tweets_text3

Before finding the most popular word, I will remove stopwords by using a stop_words file I have been introduced to at my university. 

In [None]:
# opening the file

with open('stop_words.txt', 'r') as txt:
    stop_words = txt.read()

In [None]:
# the with-block has already closed the file; this just confirms it
txt.closed

In [None]:
# stop_words is a bit messy. Just cleaning it a bit below: 

stop_words

stop_words_list = stop_words.replace('\n', ' ').split(' ') 
# code to replace \n with ' ' and split it at the ' '

print(stop_words_list)

In [None]:
# removing non-word characters from the text 

tweets_words_only = [re.sub(r'\W','',element).strip() for element in tweets_text3]
tweets_words_only

In [None]:
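# iterating over the words in the list to see if the words are present within the stop_words_list. 
# if they are (or if the word is empty after cleaning), they are passed. Otherwise, they are
# appended to the list relevant_words. 
# Note: checking against the raw stop_words string would match substrings, so the list is used.

stop_words_set = set(stop_words_list)
relevant_words = []

for word in tweets_words_only: 
    if not word or word in stop_words_set: 
        pass
    else: 
        relevant_words.append(word)
        
relevant_words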

In [None]:
# AND now we can get the frequency of the most used word in the whole text variable. 

from collections import Counter

frequent_words = Counter(relevant_words)
top_four = frequent_words.most_common(4)
print(top_four)

I believe I picked out a very reasonable word in my first word count, mirrored in the word count of the entire text variable, as shown here. UFC is the most frequently used word among fighters' tweets. As I wrote above, I believe this makes sense as Twitter is heavily used within the organisation and community. 
As step two of this exam project is to create a new variable that indicates the number of times that this most popular word is used for each observation in the data, I have already created this above. To ensure that I fulfil the exam requirements, I will create a new variable that indicates the number of times the second most used word is mentioned in each fighter's tweets, that being "fight". I will do this below.

___

#### Step 2: Create a new variable that indicates the number of times the *second* most frequent word is used for each observation in the data (as I accidentally already counted the most frequent word and added this to the data frame)




In [None]:
# Same approach as above with the 'UFC' string within each fighter's tweets. 
# Iterating over the tweets to count how many of each fighter's tweets contain 'fight'.
# Note: this is a case-sensitive substring check, so 'fighter' and 'fighting' also count.

total_count1 = []

for fighter in tweets_text1: 
    count1 = 0 # reset the counter for each fighter so the counts are not summed across fighters
    for tweet in fighter: 
        # missing tweets (NaN) are not strings and are skipped, as in the first 'UFC' count
        if isinstance(tweet, str) and 'fight' in tweet:
            count1 += 1
    total_count1.append(count1)
            
total_count1

In [None]:
twitter_df['fight_count']= total_count1
twitter_df
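
The count above is a case-sensitive substring check, so it also picks up words such as "fighter". A minimal sketch of a stricter alternative, which counts tweets containing the whole word regardless of case, could look as follows. The toy tweets and the helper name `count_tweets_with` are hypothetical and not part of the notebook's data or code.

```python
import re

# Hypothetical toy data: one list of tweets per fighter; None marks a missing tweet
toy_fighters = [
    ["Great fight tonight", "Every fighter trains hard", None],
    ["Fight week! Ready to fight"],
    [None],
]

def count_tweets_with(word, tweets):
    """Number of tweets containing `word` as a whole word, ignoring case."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    return sum(1 for t in tweets if isinstance(t, str) and pattern.search(t))

toy_counts = [count_tweets_with("fight", tweets) for tweets in toy_fighters]
print(toy_counts)
```

Which variant is preferable depends on whether "fighter" and "fighting" should count as mentions of "fight".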



___ 

## Section 3: Merging datasets


I have constructed the two data frames so that they can be merged in this step. That is, I have kept the NaNs in mind in every task and kept every observation, even when a fighter does not have Twitter or lacks some background information. 
When conducting the analysis later, I intend to create copies of the data frames and remove the fighters with NaNs in the key variables for the given analysis. 
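
A column-wise `pd.concat` matches rows on their index labels, not on their position, which is why the index matters for this merge. A minimal sketch with two made-up frames (hypothetical stand-ins for ufc_df and twitter_df, assuming the rows of both are in the same fighter order):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins: `left` has had rows removed and kept its old index
left = pd.DataFrame({"name": ["A", "B", "C"]}, index=[0, 2, 5])
right = pd.DataFrame({"ufc_count": [4, 0, np.nan]})  # default index 0, 1, 2

# Without resetting, rows are matched on index labels, which creates extra rows
misaligned = pd.concat([left, right], axis=1)

# After resetting, both frames share the index 0..2 and line up row by row
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.shape, aligned.shape)
```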

In [None]:
# As the index of the ufc dataframe was not reset after having removed observations, it messed with the 
# concatenation below. Thus, I have reset it here and used drop=True so as not to add the "old" index as a column
ufc_df = ufc_df.reset_index(drop=True)

In [None]:
total_df = pd.concat([ufc_df, twitter_df], axis = 1) # axis=1 for horizontal join
total_df

In [None]:
# Overview of the columns 

for col in total_df.columns:
    print(col)

In [None]:
# getting an overview of how many variables are numeric

numerics = total_df.select_dtypes(include=np.number).columns.tolist()

print(len(numerics)) # there are 9 and their names are as below
numerics


I see that 'wins by submission', 'wins by knockout', 'significant strikes landed per min' and 'reach' are not numeric. I will change this below.

___ 

When trying to convert the values within these variables to integers, I get an error as the conversion will not accept the NaNs in the dataframe. I considered whether to make a copy of the dataframe and remove rows with NaNs whenever each variable was needed, or to simply fill the NaNs with zero. In these instances I fill them with 0, as not all fighters have (yet) won by submission or knockout. These got NaN as the UFC website did not post the metric if they had not won. 
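
The conversion problem and the two ways around it can be sketched on a made-up column (the values are hypothetical). The nullable `Int64` dtype is shown only as an alternative that keeps the values missing; it is not what is used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical column: NaN where the website listed no knockout wins
wins = pd.Series([3.0, np.nan, 7.0])

# A plain integer column cannot hold NaN, so this conversion raises an error
try:
    wins.astype("int")
except (ValueError, TypeError) as err:
    print("conversion failed:", err)

# Option used in this notebook: treat a missing value as zero wins
filled = wins.fillna(0).astype("int")

# Alternative: pandas' nullable integer dtype keeps the value missing
nullable = wins.astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```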

In [None]:
total_df['wins_knockout'] = total_df['wins_knockout'].fillna(0).astype('int')
total_df['wins_submission'] = total_df['wins_submission'].fillna(0).astype('int')

In [None]:
total_df

For striking accuracy, however, I am certain the fighters do not have an accuracy of 0. For fighters with NaN as striking accuracy, it may instead be that the webpage was not updated. Let me count how many fighters this concerns: 

In [None]:
print(total_df['striking_accuracy'].isna().sum())

As it is only three fighters, I will remove these from the dataframe for simplicity. It might be more proper to make copies of the total_df dataset and only remove these observations when the variables are needed for analysis, thus keeping them in the total dataframe and using their other variables in other analyses. 

However, for both this variable and 'reach' below, I judged that a total loss of 9 observations is not too many. Nevertheless, I recognise that the ideal approach might be to make new dataframes for each variable. 

In [None]:
# removing them
total_df = total_df.dropna(subset=['striking_accuracy'])

In [None]:
# checking if they are removed. 
print(total_df['striking_accuracy'].isna().sum())

In [None]:
# checking if the rows were dropped from the whole dataset
# from 265 to 262 are three --> they are removed. 
total_df.shape

In [None]:
# making the striking_accuracy into integer as well: 

total_df['striking_accuracy'] = total_df['striking_accuracy'].astype('int')

In [None]:
# same thing with reach as with striking accuracy:

print(total_df['reach'].isna().sum())

# there are six rows with nan

In [None]:
# removing these
total_df = total_df.dropna(subset=['reach'])

As I need the variable as integers, and the values are currently stored as text such as 80.00, 65.00, etc., I loop through each row, first converting each value to a float and then to an integer. I save these in a new list and replace the old reach column with this new list. 

In [None]:

new_reach = []

for row in total_df['reach']: 
    val = float(row)
    new_reach.append(int(val))
    
new_reach

In [None]:
# substituting old reach column with new
total_df['reach'] = new_reach

In [None]:
# final check of whether the variables really are numeric now: 

numerics = total_df.select_dtypes(include=np.number).columns.tolist()

print(len(numerics)) 
numerics

# and they are here. 

In [None]:
# saving total_df as csv file

total_df.to_csv('total_df_ufc_twitter.csv', index=True)

Now the data frames are merged into one, total_df, and the different variables that should be numeric are converted. 
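
The reach conversion above could also be done without a loop. A minimal sketch on made-up values, assuming the column holds text such as "80.00":

```python
import pandas as pd

# Hypothetical reach values stored as text, as they might come out of the scrape
reach_text = pd.Series(["80.00", "65.00", "72.50"])

# to_numeric parses the strings to floats; astype(int) then truncates the decimals
reach_int = pd.to_numeric(reach_text).astype(int)
print(reach_int.tolist())
```

Like `int(float(row))` in the loop, `astype(int)` truncates rather than rounds, so 72.50 becomes 72.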


___

## Section 4: Data visualization




As the task here is, among other things, to plot the outcome variables, I will briefly expand on the outcome variables I have in mind. 

For the linear regression, I thought it interesting to investigate wins by knockout as a function of reach - I don't know much about fighting, so I might disregard some techniques here, but I find it reasonable to investigate whether arm length has an effect on the number of knockouts.

For the logistic regression, I intend to explore the fighters' fighting status as a function of number of wins - that is, I will make a new column in the dataset further below where I aggregate the values of the two columns wins_knockout and wins_submission. 

Below, I will 1) plot the outcome variables "wins_knockout" and "status_binary" on their own. I will do this using histograms, which are suited to continuous and/or discrete data and are useful for visualising the distribution of the data.

Afterwards, I will 2) make two bivariate plots containing the outcome variable and two predictors (one predictor per plot). 

#### 1. plotting the outcome variables on their own

#### Visualizing "wins_knockout"

In [None]:
import matplotlib.pyplot as plt

     
plt.hist(total_df['wins_knockout'], ec="k") #ec="k" draws the lines between each bin. 

# adding title and labels
plt.title("Frequency of wins by knockout by UFC fighters")
plt.xlabel("Number of knockout wins")
plt.ylabel("Number of fighters")

plt.show() # to avoid clutter

As depicted in the plot, there are a few exceptional fighters with a very high number of wins by knockout, although they represent a minority. This plot gives a good overview of the range of knockout wins and the number of fighters falling within that range, but it does not give us a good sense of the distribution at the lower end (roughly 0-11). To obtain a more fine-grained impression, we could consider excluding these outliers, but for now, this plot suffices in presenting an overall view of the distribution.

#### Visualizing status_binary

In [None]:
plt.hist(total_df['status_binary'], bins=[-.5,.5,1.5], ec="k")

# adding title and labels
plt.title("UFC Fighting status frequency")
plt.xlabel("Fighting status: 0=active, 1=non-active")
plt.ylabel("Number of fighters")
plt.xticks((0,1))

plt.show()# to avoid clutter


As stated on the x-axis of the figure, a value of 0 corresponds to "active" and a value of 1 to "non-active". The majority of the fighters listed on the website are currently active, participating in UFC fights, and engaged in the sport. However, there are also some non-active fighters featured on the website. The criteria for organizing this list and determining when to remove a non-active fighter are unknown. It is possible that UFC may choose to keep high-ranking non-active fighters on the website for a certain period of time, removing them only when they are no longer of interest to the public. This is purely speculative, however, and not based on any specific information.

For now, we can see the distribution of active and non-active fighters in the data frame.

#### 2. make two bivariate plots containing the outcome variable and two predictors  (one predictor per plot)

##### First bivariate plot: relationship between wins by knockout and reach. 

In [None]:
plt.scatter(total_df['reach'], total_df['wins_knockout'])
# reach on the x axis (independent) and wins_knockout on the y axis (dependent)

# adds title
plt.title("Relationship between UFC fighters' reach and wins by knockout")
# adds x-axis label
plt.xlabel("Reach in inches")
# adds y-axis label
plt.ylabel("Number of knockout wins")

plt.show()

As the plot is self-contained with labels and titles, we can now interpret it. We can see an upwards trend, suggesting that there might be a relationship between the variables. However, the large amount of scatter indicates that if a relationship exists, it may be weak. 

##### Second bivariate plot: relationship between fighter status and number of wins 

As explained above, I will make a new variable, total_wins, which aggregates the values of the columns wins_knockout and wins_submission. This will be the independent variable. I do this before I move on to the plot.

In [None]:
# adding the two columns element-wise gives the total number of wins per fighter
total_df['total_wins'] = total_df['wins_knockout'] + total_df['wins_submission']

In [None]:
total_df

In [None]:
plt.scatter(total_df['total_wins'], total_df['status_binary'])
# total wins on the x axis (independent) and status on the y axis (dependent)

# adds title
plt.title("Bivariate relationship between number of wins and fighting status")
# adds x-axis label
plt.xlabel("Fighter's total number of wins")
# adds y-axis label
plt.ylabel("Fighter status, 0=active, 1=non-active")

plt.show()

As is typical with binary bivariate visualizations, it is difficult to extract much meaning from the plot, because all observations cluster at either 0 or 1, in line with the variable's binary format. Nevertheless, we can still get some sense of the relationship. Notably, a small number of non-active fighters have a substantial number of wins. Intuitively, this makes sense: a retired fighter may have had a long career with many fights and therefore many wins, whereas a younger or newly signed UFC fighter may not have had as many fights yet and consequently has fewer wins. On the other hand, active fighters appear to have a higher overall number of wins than the majority of non-active fighters. This suggests that fighters who do not win often may eventually stop competing at this level.
____

## Section 5: Data Analysis 

The section is divided into several analyses: 

1. Linear regression 
2. Linear regression - control (categorical)
3. Logistic regression 


In [None]:
# packages needed for this section

import statsmodels.formula.api as smf
from stargazer.stargazer import Stargazer
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols

##### Linear regression 

As explained above, I want to fit a linear regression model with the continuous outcome variable "wins_knockout" (wins by knockout) and the predictor variable "reach". 

I hypothesize a positive relationship, with wins by knockout increasing for each unit increase in reach. 

Thus, the model has "wins_knockout" $Y_i$ as outcome and "reach" $X_i$ as predictor. In other words, I assume that wins_knockout $Y_i$ is given by:

$$
Y_i = \beta_0 + \beta_1X_i + \epsilon_i
$$

where the errors $\epsilon_i$ are independent, normally distributed variables with $E(\epsilon_i)=0$ and $SD(\epsilon_i)=\sigma$.

In [None]:
# estimating the model 

est_mod = ols('wins_knockout~reach', data=total_df).fit()

In the cell below, I print the estimates of the regression coefficients $\beta_0$ and $\beta_1$:

In [None]:
est_mod.params.round(2)

In [None]:
# finding sd 
np.sqrt(est_mod.scale).round(2)

In [None]:
# shown more clearly using stargazer

s = Stargazer([est_mod])
s

We thus get the following estimated relationship between wins by knockout and reach:

$$
Y_i = -16.79 + 0.32X_i + \epsilon_i
$$

where $\epsilon_i$ are independent, normally distributed with $E(\epsilon_i)=0$ and $SD(\epsilon_i)=4.12$. 

Based on the model, it is possible to interpret the coefficients as follows: 
* $\beta_0$ is the 'intercept' or 'constant'. Its value, -16.79, is the expected value of the dependent variable, wins by knockout, when the independent variable, reach, is 0. This does not make much sense in this instance, as no fighter has a reach of 0 and no fighter can have -16.79 wins by knockout. If I had been smart before conducting this analysis, I would have centered (or standardized) reach so that 0 would correspond to the fighters' mean reach instead. This would make the intercept more meaningful. But as always, there is more clarity in hindsight. 


* $\beta_1$ is the coefficient (slope) of the independent variable, reach. In this instance, $\beta_1$ is equal to 0.32. Thus, for each one-inch increase in reach, the expected number of wins by knockout increases by 0.32. 
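
The effect of centering the predictor can be sketched on simulated data. The data below are made up (generated from the rounded estimates above) and `np.polyfit` is used instead of statsmodels only to keep the example short; this is an illustration, not a re-analysis of the UFC data.

```python
import numpy as np

# Simulated stand-in data, not the UFC data: reach in inches and knockout wins
rng = np.random.default_rng(0)
reach = rng.uniform(60, 84, size=200)
wins = -16.79 + 0.32 * reach + rng.normal(0, 4.12, size=200)

# Fit on raw reach: the intercept refers to a fighter with reach = 0
slope_raw, intercept_raw = np.polyfit(reach, wins, deg=1)

# Fit on centered reach: the intercept refers to a fighter with average reach
reach_centered = reach - reach.mean()
slope_c, intercept_c = np.polyfit(reach_centered, wins, deg=1)

print("slopes:", round(slope_raw, 3), round(slope_c, 3))
print("centered intercept:", round(intercept_c, 3), "mean wins:", round(wins.mean(), 3))
```

Centering leaves the slope unchanged and turns the intercept into the expected outcome at the mean reach, which is a value that can actually occur.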

As is visible from the stargazer table above, the coefficient for reach is significant: reach has three stars next to it, indicating a p-value of p<0.01. This means that, if the true effect were actually zero, the probability of getting an estimate at least this far from zero would be smaller than 0.01. The data therefore support my hypothesis of a positive relationship between reach and wins by knockout. 

If I consider this logically, it may be the result of an underlying effect of the different fighters' weight classes - a variable that I have not controlled for in my analysis. All fighters are classified into weight classes, which ensures that no fighter has an unfair advantage if, for example, a larger fighter were up against a smaller fighter. In such cases, the techniques of the sport may not matter as much as the sheer physical size and strength of the fighter. For instance, a smaller fighter with a **shorter reach** may need to strike their opponent multiple times to secure a knockout, and in this instance, other techniques may be more effective in lighter weight classes. On the other hand, a heavyweight fighter with a **longer** reach may only need to strike their opponent once to secure a win by knockout. Therefore, it is possible that winning by knockout is more prevalent among higher weight classes and that weight class acts as a confounding factor. I will address this issue further below in the next section by controlling for weight class in my analysis.

Returning to the model itself, this is the model that is based on the actual observed data. To assess the distribution's spread and the model's usefulness, I will employ simulation and visualization techniques. This approach will enable me to determine whether the model produces data that closely resembles the actual data in the total_df.





In [None]:
# first plotting the estimated relationship between wins_knockout (Y) and reach (X):

def est_exp_ko_win(x):
    # rounded estimates of the intercept and slope from est_mod.params
    return -16.79 + 0.32*x

sns.scatterplot(data=total_df,x='reach',y='wins_knockout')
sns.lineplot(x=[total_df['reach'].min(),total_df['reach'].max()],
             y=[est_exp_ko_win(total_df['reach'].min()),est_exp_ko_win(total_df['reach'].max())],
             color='black',linewidth=2);

The estimated intercept of -16.79 is the expected value when reach is 0. As the x-axis only covers the observed reach values (from approx. 55 inches upwards), the plot does not show the actual intercept. The plot indicates a positive relationship, but there is quite a bit of scatter. 

In [None]:
# simulating data

def sim_lin_reg_mod(beta0, beta1, sigma, xs, col_names) :
    ys = beta0 + beta1*xs + np.random.normal(0,sigma,xs.shape[0])
    sim = pd.DataFrame(zip(xs,ys),columns=col_names)
    return sim

np.random.seed(0)
ko_sims = [sim_lin_reg_mod(-16.79, 0.32, 4.12, total_df['reach'], ['reach','wins_knockout']) for i in range(0,5)]
ko_sims[0]

In [None]:
# plotting the observed data and simulated

dfs_plot = [total_df] + ko_sims

min_y = pd.concat(dfs_plot)['wins_knockout'].min()
max_y = pd.concat(dfs_plot)['wins_knockout'].max()

fig,ax = plt.subplots(2,3,figsize=(14,8))

for i in range(0,6) :
    dat = dfs_plot[i]
    a = ax.flatten()[i]
    sns.scatterplot(data=dat,x='reach',y='wins_knockout',ax=a, alpha=0.70) # alpha sets the transparency of the points
    
    sns.lineplot(x=[total_df['reach'].min(),total_df['reach'].max()],
             y=[est_exp_ko_win(total_df['reach'].min()),est_exp_ko_win(total_df['reach'].max())],
             color='black',linewidth=3,ax=a);
    
    if i==0 :
        tit = 'Observed Outcomes'
    else :
        tit = 'Simulated Outcomes'
    a.set(title=tit,ylim=[min_y-20,max_y+20])

plt.tight_layout()

In general, I believe that the estimated model generates data that looks similar to the actual data. However, there appears to be less variation in the *observed data* for fighters with a shorter reach compared to the *simulated outcomes* for fighters with a shorter reach. Additionally, it seems that there is more variation in the *observed data* for fighters with a longer reach compared to the *simulated outcomes* for fighters with a longer reach.

This suggests that there may be an issue with the assumption that the standard deviation of the errors, $SD(\epsilon_i)=\sigma$, is the same for all $i$. This assumption restricts the standard deviation from being lower when reach is shorter and from being higher when reach is longer.


Below, I will, among other things, look into this problem by simulating and estimating the errors.

__

As I am interested in calculating the errors $\epsilon_i$ according to the estimated model (where we assume these to be normally distributed), the first step is to calculate the estimated $E(Y_i)$, that is, the expected wins by knockout for each fighter given their reach. 

I will do so as follows: 


In [None]:
ko_estmod = total_df[['reach','wins_knockout']].copy()
ko_estmod['exp_y'] = est_mod.predict() 
ko_estmod

As I have the estimated outcome for each fighter, I can calculate the difference between the actual observed outcome and the estimated outcome for each fighter. These differences are the errors/residuals. 

In [None]:
ko_estmod['err'] = ko_estmod['wins_knockout'] - ko_estmod['exp_y']
ko_estmod


In [None]:
# now that the estimated errors are calculated, we can visualize these errors against the estimated wins by 
# knockout (exp_y): 


ax = sns.scatterplot(data=ko_estmod, x='exp_y', y='err')
ax.axhline(color='black',linestyle='--',linewidth=3);

In [None]:
# making scatterplots of errors that are simulated according to the estimated model. 
# I am making 8 simulations (not for any specific reason other than to see many simulations). 

np.random.seed(0)
errs_sim = [np.random.normal(0, 4.12, len(ko_estmod)) for i in range(0,8)] 
                            # 0 for the mean of the errors according to the assumptions of the model
                            # 4.12 for the standard deviation 
                            # len(ko_estmod) for n fighters in the dataset (256)
            
errs_plot = [ko_estmod['err']] + errs_sim 

fig, ax = plt.subplots(3,3,figsize=(16,12))

for i in range(0,9) :
    errs = errs_plot[i]
    a = ax.flatten()[i]
    sns.scatterplot(x=ko_estmod['exp_y'],y=errs,ax=a)
    a.axhline(color='black',linestyle='--',linewidth=3);
    
    if i==0 :
        tit = 'Estimated Errors'
    else :
        tit = 'Simulated Errors'
    a.set(title=tit,xlabel='exp_y',ylabel='err',ylim=[-25,25]) # setting the limitations of the y axis 
                                                               # so that it fits the data 

plt.tight_layout()

By comparing the estimated errors with the simulated errors, it is visible that they are not completely similar. The simulated errors exhibit greater similarity to one another, while the estimated errors display a larger spread as the expected value of y, i.e. the wins by knockout, increases. As was suggested in my interpretation of the plots above, this may indicate that the assumption of a constant standard deviation of the errors (homoscedasticity) is violated. 
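
The visual impression of a growing spread can also be checked with a simple number. A rough sketch, assuming errors and fitted values like the `err` and `exp_y` columns above but using simulated data here, compares the standard deviation of the errors in the lower and upper half of the fitted values:

```python
import numpy as np

# Simulated stand-in for the fitted values and errors: spread grows with the fitted value
rng = np.random.default_rng(1)
exp_y = rng.uniform(0, 10, size=500)
err = rng.normal(0, 1 + 0.5 * exp_y)

# Split at the median fitted value and compare the residual spread in each half
cut = np.median(exp_y)
sd_low = err[exp_y <= cut].std(ddof=1)
sd_high = err[exp_y > cut].std(ddof=1)

print(f"SD of errors, low fitted values:  {sd_low:.2f}")
print(f"SD of errors, high fitted values: {sd_high:.2f}")
```

This is only an informal check. A formal alternative would be a test such as Breusch-Pagan, which statsmodels provides as `het_breuschpagan` in `statsmodels.stats.diagnostic`.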

Aside from homoscedasticity, another key assumption of the linear regression model is normality. This can be assessed through a qq plot, which is presented below:



In [None]:
import statsmodels.api as sm
sm.qqplot(ko_estmod['err'],line='r')
plt.show()

As the qqplot shows that not all errors follow the red line, we cannot assume normally distributed errors and thus, the normality assumption is also violated. 

The violations of both the normality and the homoscedasticity assumption hurt the reliability of the inference above. We thus have to relax the model's assumptions and re-estimate the uncertainty around the coefficients. With the violated assumptions removed, we now have the following model: 

$$
Y_i = \beta_0 + \beta_1X_i + \epsilon_i
$$

where $\epsilon_i$ are independent random variables (no longer assumed to be normally distributed) with $E(\epsilon_i)=0$ and $SD(\epsilon_i)=\sigma_i$, so the standard deviation may differ between observations. 

As the errors do not have the same standard deviation (they are heteroskedastic), I will compute heteroskedasticity-robust (HC3) standard errors and 95% confidence intervals for the coefficients.



In [None]:
est_mod.get_robustcov_results(cov_type='HC3').summary().tables[1]

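The point estimates are identical to those above, which is expected, since a robust covariance estimator only changes the standard errors, confidence intervals and p-values, not the coefficients themselves. Given the violated assumptions, the inference around the coefficients should still be treated with some caution. 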
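
To illustrate why the coefficients do not change, here is a minimal sketch of the HC3 calculation done by hand with NumPy on simulated heteroskedastic data (not the UFC data). It follows the standard HC3 formula, in which each squared residual is divided by (1 - leverage) squared; it is meant as an illustration of the idea, not as a replacement for `get_robustcov_results(cov_type='HC3')`.

```python
import numpy as np

# Simulated heteroskedastic data, standing in for reach and wins_knockout
rng = np.random.default_rng(2)
x = rng.uniform(60, 84, size=300)
y = -16.79 + 0.32 * x + rng.normal(0, 0.1 * (x - 55))

X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                       # OLS point estimates
resid = y - X @ beta

# Classical standard errors: one common error variance for all observations
sigma2 = resid @ resid / (len(y) - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2 * XtX_inv))

# HC3: each squared residual is scaled by 1 / (1 - leverage)^2
leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
weights = (resid / (1 - leverage)) ** 2
meat = X.T @ (X * weights[:, None])
se_hc3 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("estimates:   ", beta.round(2))
print("classical SE:", se_classical.round(3))
print("HC3 SE:      ", se_hc3.round(3))
```

The vector `beta` is computed once and used for both sets of standard errors, which is why only the uncertainty around the estimates differs.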


___ 

### Linear regression - controlling for weight classes

As previously mentioned, I have a hypothesis that the estimated coefficient for reach is a result of an underlying effect of the different weight classes, and that weight class is the true explanatory factor for wins by knockout. To test this theory, I will run a linear regression model similar to the one above, but this time I will include weight class as a control variable. I intend not to remove weight classes based on sex.


With wins by knockout as outcome ($Y_i$), reach and weight class as predictors, and the same assumptions as above, the model is written as follows: 

$$
Y_i = \beta_0 + \beta_1 X_{i,1} + \beta_2 X_{i,2} + \beta_3 X_{i,3} + \beta_4 X_{i,4} + \beta_5 X_{i,5} + \beta_6 X_{i,6} + \beta_7 X_{i,7} + \beta_8 X_{i,8} + \beta_9 X_{i,9} + \beta_{10} X_{i,10} + \beta_{11} X_{i,11} + \beta_{12} X_{i,12} + \epsilon_i
$$

where $X_{i,12}$ is the reach of fighter $i$ and $\epsilon_i$ are independent random variables with $E(\epsilon_i)=0$ and $SD(\epsilon_i)=\sigma_i$. 

Thus, the dummy variables for the weight classes are defined as follows, with bantamweight as the reference category (all dummies equal to 0): 

- $X_{i,1} = 1$ if fighter $i$ is in featherweight, else 0
- $X_{i,2} = 1$ if fighter $i$ is in flyweight, else 0
- $X_{i,3} = 1$ if fighter $i$ is in heavyweight, else 0
- $X_{i,4} = 1$ if fighter $i$ is in light heavyweight, else 0
- $X_{i,5} = 1$ if fighter $i$ is in lightweight, else 0
- $X_{i,6} = 1$ if fighter $i$ is in middleweight, else 0
- $X_{i,7} = 1$ if fighter $i$ is in welterweight, else 0
- $X_{i,8} = 1$ if fighter $i$ is in women's bantamweight, else 0
- $X_{i,9} = 1$ if fighter $i$ is in women's featherweight, else 0
- $X_{i,10} = 1$ if fighter $i$ is in women's flyweight, else 0
- $X_{i,11} = 1$ if fighter $i$ is in women's strawweight, else 0
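
How such dummy variables arise can be sketched with pandas on a few made-up labels. The labels are hypothetical (the real spelling in total_df['weight_class'] may differ), and `pd.get_dummies` is used here only as an analogy to the treatment coding that `C()` applies in the formula, where the alphabetically first level becomes the reference.

```python
import pandas as pd

# Hypothetical labels; the real spelling in total_df['weight_class'] may differ
classes = pd.Series(["Lightweight", "Bantamweight", "Heavyweight", "Bantamweight"])

# drop_first=True drops the alphabetically first level, making it the reference
dummies = pd.get_dummies(classes, drop_first=True, dtype=int)
print(dummies)
```

A bantamweight fighter has 0 in every dummy column, so the bantamweight effect is absorbed into the intercept.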

In [None]:
# counting how many weightclasses there are
total_df['weight_class'].nunique()

In [None]:
# wanting the names of the different weight classes
print(total_df['weight_class'].unique())

In [None]:
# Using C() to dummy code the categorical variable weight class

est_mod_2 = ols('wins_knockout~C(weight_class)+reach', data=total_df).fit() 

est_mod_2.get_robustcov_results(cov_type='HC3').summary().tables[1]



First, and for my own sake, I will just list the weight classes from lowest to highest: 

* Strawweight (only women's strawweight in this dataset)
* Flyweight
* Bantamweight
* Featherweight
* Lightweight
* Welterweight
* Middleweight
* Light Heavyweight
* Heavyweight


**Interpretation**: 

I could have chosen a continuous control variable or one with fewer categories, but since weight class made logical sense to investigate, I will interpret the coefficients as follows, with focus on the control: 
* Intercept: The model predicts that bantamweight fighters with a reach of 0 have an estimated -2.3 wins by knockout. However, no fighter has a reach of 0, so the intercept is not meaningful on its own; it only becomes interpretable once a realistic reach is plugged into the model, as in the worked examples further down.


* Reach: the model predicts that, holding weight class constant, a one-unit increase in reach is associated with 0.13 more wins by knockout. 


* Betas for the weight classes: these show the difference between a given weight class and the reference category (bantamweight), holding reach constant. That is, they denote the mean difference in wins by knockout between that weight class and bantamweight if reach were the same in both groups. 



*reach*: In the first model, which only considered reach, wins by knockout increased by 0.32 for each unit increase in reach, whereas they only increase by 0.13 in this model. The coefficient has thus shrunk and is no longer significant: as the CI for reach includes zero, reach is not statistically significant at the 5% level, which suggests that its significance in the model above was driven by its association with weight class. 


The reference category is bantamweight. This means that the estimated average wins by knockout for a bantamweight fighter with a reach of $X$ is simply $\beta_0 + \beta_{12} X$. For the other categories, the coefficient of the given weight class is added to this. This is also shown below.

For all but the heaviest weight class, the coefficient is negative. The coefficients are most negative for the lighter weight classes and increase as the weight class increases. 
___

For a fighter in the women's strawweight class, the estimated average wins by knockout would be: 


$$
E(Y_i) = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 0 + \beta_5 \cdot 0 + \beta_6 \cdot 0 + \beta_7 \cdot 0 + \beta_8 \cdot 0 + \beta_9 \cdot 0 + \beta_{10} \cdot 0 + \beta_{11} \cdot 1 + \beta_{12} X_{i,12}
$$

where $X_{i,12}$ is the fighter's reach. Simplified (as all the other dummy variables equal 0 when women's strawweight equals 1), this is:


$$
E(Y_i) = \beta_0 + \beta_{11} + \beta_{12} X_{i,12}
$$


That is:

$$
-2.3 + (-3.03) + 0.13 \cdot 65 = 3.12
$$


** $X_{i,12}=65$ is chosen here as I googled that the average UFC women's strawweight fighter has a reach of 65 inches. 

___

For a fighter in the featherweight class, the estimated average wins by knockout would be: 

$$
E(Y_i) = \beta_0 + \beta_1 \cdot 1 + \beta_2 \cdot 0 + \beta_3 \cdot 0 + \beta_4 \cdot 0 + \beta_5 \cdot 0 + \beta_6 \cdot 0 + \beta_7 \cdot 0 + \beta_8 \cdot 0 + \beta_9 \cdot 0 + \beta_{10} \cdot 0 + \beta_{11} \cdot 0 + \beta_{12} X_{i,12}
$$

which, simplified (as all the other dummy variables equal 0 when featherweight equals 1), is:

$$
E(Y_i) = \beta_0 + \beta_1 + \beta_{12} X_{i,12}
$$


That is:

$$
-2.3 + (-1.78) + 0.13 \cdot 70 = 5.02
$$

** $X_{i,12}=70$ is chosen here as I googled that the average UFC featherweight fighter has a reach of 70 inches. 


___

For a fighter in the heavyweight class, the estimated average wins by knockout would be: 


$$
E(Y_i) = \beta_0 + \beta_1 \cdot 0 + \beta_2 \cdot 0 + \beta_3 \cdot 1 + \beta_4 \cdot 0 + \beta_5 \cdot 0 + \beta_6 \cdot 0 + \beta_7 \cdot 0 + \beta_8 \cdot 0 + \beta_9 \cdot 0 + \beta_{10} \cdot 0 + \beta_{11} \cdot 0 + \beta_{12} X_{i,12}
$$

which, simplified (as all the other dummy variables equal 0 when heavyweight equals 1), is:

$$
E(Y_i) = \beta_0 + \beta_3 + \beta_{12} X_{i,12}
$$


That is:

$$
-2.3 + 3.27 + 0.13 \cdot 77.5 \approx 11.05
$$

** $X_{i,12}=77.5$ is chosen here as I googled that the average UFC heavyweight fighter has a reach of 77.5 inches. 
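
The three worked examples above follow the same pattern: intercept plus the coefficient of the weight class plus 0.13 times reach. This can be written as a small helper. The function name and the dictionary keys are hypothetical, and the numbers are the rounded coefficients quoted above, so results can differ slightly from predictions made with the unrounded model.

```python
# Rounded coefficients as quoted in the worked examples above
INTERCEPT = -2.3
REACH_COEF = 0.13
CLASS_COEF = {
    "bantamweight": 0.0,          # reference category, no extra term
    "featherweight": -1.78,
    "heavyweight": 3.27,
    "womens_strawweight": -3.03,
}

def expected_ko_wins(weight_class, reach):
    """Estimated average wins by knockout for a weight class at a given reach."""
    return INTERCEPT + CLASS_COEF[weight_class] + REACH_COEF * reach

for wc, reach in [("womens_strawweight", 65), ("featherweight", 70), ("heavyweight", 77.5)]:
    print(wc, expected_ko_wins(wc, reach))
```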


___

##### BUT
From the CI, it is also evident that not all weight classes are significant. 

I will present the coefficients and their p values more clearly with the stargazer table below.

In [None]:
# calculating estimated wins by knockout for womens strawweight

-2.3 + -3.03 + 0.13*65 

In [None]:
# calculating estimated wins by knockout for featherweight
-2.3 + -1.78 + 0.13*70 

In [None]:
# calculating estimated wins by knockout for heavyweight
-2.3 + 3.27 + 0.13*77.5 

In [None]:
s_weight = Stargazer([est_mod_2])
s_weight

The significance levels become clearer using stargazer. 

Only the weight classes of featherweight, heavyweight, and women's strawweight are significant. Above, I chose to interpret and calculate the outcomes only for the significant coefficients instead of for each and every coefficient, given that I have chosen a variable with many categories. 

Even though there are significant coefficients, it should be noted that the sample is small, and due to the many different weight classes, there are few observations within each. The results are thus not very reliable: they 1) may not reflect reality accurately and 2) cannot be generalized to the remaining UFC fighters.

In [None]:
# counting the different observations in each weight class 

total_df['weight_class'].value_counts()

### Logistic regression 

As explained in the beginning of the analysis section, I will fit a logistic regression model with the binary outcome variable "status_binary" and the predictor variable "total_wins". 

I hypothesize that there *is* a relationship and have two rather opposite theories: 
1. Each additional win is associated with a higher probability of being non-active (status 1). The reasoning is that more wins suggest a longer career, which makes it more likely that the fighter is now retired and thus not fighting. 

2. Each additional win is associated with a lower probability of being non-active. The reasoning is that a fighter who never wins might be more likely to leave the UFC for not having what it takes. So for each win, the fighter is more likely to have status 0=active.

In [None]:
#defining and fitting the model: 

log_reg = smf.logit("status_binary ~ total_wins", data=total_df).fit()

In [None]:
# the confidence interval overlaps 0, so the relationship is not statistically significant

log_reg.summary().tables[1]

In [None]:
# making it even more clear by running stargazer

s1 = Stargazer([log_reg])
s1

**Interpretation:** We cannot apply the same interpretation as in linear regression because the function and relationship differ: the relationship in logistic regression follows a sigmoid curve, not a linear one, and the coefficients are expressed on the log-odds scale. Thus, besides noting the sign and the lack of significance of the coefficients, it is hard to read much else directly from the table. 
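
The sigmoid relationship can be sketched with a pair of made-up coefficients. The values of `b0` and `b1` below are hypothetical and are NOT the estimates from log_reg; the point is only to show how a linear predictor on the log-odds scale turns into a probability and an odds ratio.

```python
import numpy as np

# Hypothetical coefficients on the log-odds scale, NOT the estimates from log_reg
b0, b1 = -1.0, -0.05

def predicted_probability(total_wins):
    """Sigmoid of the linear predictor: P(status_binary = 1 | total_wins)."""
    return 1 / (1 + np.exp(-(b0 + b1 * total_wins)))

# exp(b1) is the multiplicative change in the odds per additional win
odds_ratio = np.exp(b1)

probs = predicted_probability(np.array([0, 10, 20]))
print(probs.round(3))
print(round(float(odds_ratio), 3))
```

With a negative slope, the predicted probability of being non-active falls as total wins increase, but not by a constant amount per win.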

To derive interpretability from the logistic regression model, we can use predicted probabilities. This is demonstrated below:

In [None]:
# first sorting the values according to the total_wins variable so that the plot will look reasonable. 
# I reset the index after this new sorting so as not to confuse myself below when the predicted
# probabilities are listed next to the index. 

log_df = total_df.sort_values("total_wins", ascending = False)
log_df = log_df.reset_index(drop=True)
log_df

In [None]:
# This then lists the predicted probability for fighter status for each fighters' number of wins

wins_predict = log_reg.predict(log_df['total_wins'])
wins_predict

In [None]:
# Then the predicted probabilities for fighter status are plotted on the y axis and the number of wins 
# on the x axis. This gives a clearer picture of the relationship between the variables than the 
# first scatter plot of the binary variable above. 

sns.lineplot(data=log_df, x="total_wins", y=wins_predict) 

From this plot I can see the predicted probability of being non-active for each number of total wins. I can interpret that the lower the total wins, the higher the probability of non-active fighter status. This also means that the higher the total wins, the higher the probability that the fighter is active. I cannot claim that this supports my second hypothesis, given the non-significant relationship, but the trend appears to lean that way. 