<a href="https://colab.research.google.com/github/GalaxyTab7/Data219_0-Gang_Final/blob/main/Data219_Intro_GettingData_CleaningData.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Authors: Garrett McKenzie and Marina Davies

Team Name: 0% Gang

Professor: PC

Class: Data 219

Date Last Edited: 04/24/2024


# **Analyzing Fighting Game Players and Their Performances.**




Marina loves to play fighting games; she is on the UMW Super Smash Bros. team, after all. Because of this interest, we decided to look at how certain attributes of professional fighting game players relate to their performance over a season. The analysis has two goals. The first is to find out which player attributes correlate with a higher 'DF' score, which is a performance indicator. The second is to build a model that predicts a player's 2023 'DF' score from information about their prior performances. The intended use case is to give amateur players, who can't attend large tournaments or devote large parts of their lives to the game, a tool for understanding how well they are performing relative to the pros.

# Dependencies and Imports

In [None]:
#Dependencies/Imports
import pandas as pd
import numpy as np
import plotly.express as px
import scipy.stats as stats
import re
import itertools
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import BayesianRidge
pd.options.mode.chained_assignment = None

# Getting The Data

The hardest part of this project was by far collecting the data. To do so, we wrote a web crawler to scrape a website called DashFight, which contains vital information on fighting game players (described later). Overall, the web crawler took somewhere between 45 and 55 hours to finish. As a result, the data was collected piecemeal and then stacked row-wise using pandas' concat. Challenges faced in scraping the website included:
- Loading in elements that require the user to scroll the page down before appearing in the HTML source.
- Needing to load a total of three web pages for each player because of the way DashFight stores user information for different years.
- Learning how to handle the many exceptions that arise when working with inconsistent web pages and information.

The final version of the code used for the web crawler can be seen at https://github.com/GalaxyTab7/Data219FinalProject/blob/main/Scrape_Get_Stuff_Edit.py. Notably, it will not work in Colab. For it to run, the Microsoft Edge driver needs to be added to the local device's Path variable. You may also need to install Selenium with pip.
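
The first challenge above, elements that only appear after scrolling, is commonly handled by scrolling until the page height stops growing. The following is a minimal sketch of that idea and not the crawler's actual code: the helper names `scroll_until_loaded` and `fetch_full_html`, the pause length, and the round limit are all assumptions made for illustration.

```python
import time


def scroll_until_loaded(driver, pause_seconds=1.0, max_rounds=30):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `driver` is assumed to be a Selenium WebDriver; only `execute_script`
    is used. Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give lazily loaded elements time to render
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, so stop scrolling
        last_height = new_height
    return rounds


def fetch_full_html(url):
    """Hypothetical helper: open `url` in Edge, scroll, and return the HTML.

    Needs Selenium installed and the Edge driver on PATH (local machine only).
    """
    from selenium import webdriver  # imported here so the helper above stays dependency-free

    driver = webdriver.Edge()
    try:
        driver.get(url)
        scroll_until_loaded(driver)
        return driver.page_source
    finally:
        driver.quit()
```

Calling `fetch_full_html` on a player page would return the HTML after the lazily loaded elements have had a chance to render; the `try`/`finally` makes sure the browser closes even when a page raises an exception.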

Even after this struggle, many of the data points contain NaNs. Further, only 243 of the 1205 players collected have a 2023 "DF" score.

In [None]:
#Load in each of the csv files with their respective parts of the overall dataset.
one = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Scrap(1-484)_Data_219%20-%20Copy" , encoding = "utf-8")
two = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Scrap(484-700)_Data_219" , encoding = "utf-8")
three = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Scrap(700-1397)_Data_219" , encoding = "utf-8")
four = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Scrap(1397-1564)_Data_219" , encoding = "utf-8")
five = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Scrap(1564-1693)_Data_219" , encoding = "utf-8")
six = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Scrap(1693-1968)_Data_219")
seven = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Scrap(1967%20-%203047)_Data_219" , encoding = "utf-8")
eight = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Scrap(3048-3560)_Data_219" , encoding = "utf-8")
nine = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Scrap(3560-5000)_Data_219" , encoding = "utf-8")

In [None]:
#Merge the files vertically to create the overall dataset from the parts.
data = pd.concat((one,two,three,four,five,six,seven,eight,nine) , axis = 0)
data = data.drop(columns = ["Unnamed: 0" , "NA_2022"])
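
One detail worth keeping in mind when stacking files this way: each CSV is read with its own 0..n-1 index, and concat keeps those labels unless told otherwise. The sketch below uses two made-up frames (`part_a` and `part_b` are hypothetical, not the scraped files) to show how label-based operations behave with and without `ignore_index=True`.

```python
import pandas as pd

# Two toy "scrape parts", each with its own default 0..n-1 index.
part_a = pd.DataFrame({"tag": ["a0", "a1", "a2"]})
part_b = pd.DataFrame({"tag": ["b0", "b1"]})

# Stacking without ignore_index keeps the per-file labels, so labels repeat.
stacked = pd.concat((part_a, part_b), axis=0)
print(stacked.index.tolist())          # [0, 1, 2, 0, 1]
print(len(stacked.drop([1], axis=0)))  # 3, because both rows labeled 1 are removed

# With ignore_index=True the rows are renumbered and every label is unique.
renumbered = pd.concat((part_a, part_b), axis=0, ignore_index=True)
print(renumbered.index.tolist())          # [0, 1, 2, 3, 4]
print(len(renumbered.drop([1], axis=0)))  # 4, only one row is removed
```

Any later step that drops or aligns rows by index label is affected by this choice.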

In [None]:
#Basic descriptions of the data along with how many points can be predicted and how many points can be used for training/testing.
print("Overall shape")
print(data.shape)
print("Points to train/test with")
print(len(data[~data['Df_2023'].isna()]))


Overall shape
(1205, 15)
Points to train/test with
243


In [None]:
#Take a peek at the dataset
data.head(5)

Unnamed: 0,age,name,years_playing,tag,clan,local,matches,win_rate,bio,top_character,top_character_stats,DF_2022,Global_2022,Df_2023,Global_2023
0,"Birthday : Jun 16, 1996 (27)",Shuto Moriya,,Shuton,SunSister,Japan,147.0,68%,"Bio \nBorn on June 17, 1996, Shuto Moriya aka ...",Olimar48,48 - 41,712.0,29.0,583.0,38.0
1,,Seisuke Komeda,Playing : 7+,Kome,SUSANOO GAMING 8,Japan,130.0,64%,"Bio \nSeisuke Komeda, also known as ""Kome"" is ...",Shulk2,2 - 9,,,,
2,,Takuma Hirooka,Playing : 8+,Tea,SUSANOO GAMING 8,Japan,207.0,64%,"BioTakuma ""Tea"" Hirooka is a top professional ...",Pac,53 - 43,500.0,45.0,107.0,160.0
3,,Yutaro Nagumo,Playing : 6+,Paseriman,RayRoad Gaming,Japan,48.0,58%,"Bio \nYutaro ""Paseriman"" Nagumo is a Japanese ...",Diddy,0 - 1,,,,
4,,Ishiguro Tetsuya,,Raito,,Japan,82.0,64%,Bio \nRaito is a Super Smash Bros. Ultimate pl...,,,,,,


# Cleaning The Data

Due to the many NaNs, and the fact that some of the data is still formatted as it was on the site, some data cleaning/transforming will have to be done.

Checklist for cleaning:

 **A. Drop rows where matches is NaN.**

 **B. Simplify age to a single number and fillna with average age.**

 **C. Get rid of the % for win rate, divide by 100 , and fillna with the average of the known win rates.**

 **D. Split top_character_stats into two columns, num_wins_with_top_char and num_losses_with_top_char. The stats are formatted as (first - second), where the first number is wins and the second is losses. Fillna with 0 for both.**

 **E. Get rid of any numbers in the top character column, and fill nans with unknown.**

 **F. Fill DF_2022 and Df_2023 with 0 for nans. Separate out the points that have a 2023 DF score into its own data frame.**

 **G. Fill global ranking with 500 for nans.**

 **H. Simplify years_playing to a single number and fillna with 1**

 **I. Fill bio nans with "Supercalifragilisticexpealidocious".**

 **J. Fill clan nans with none.**

 **K. Fill local nans with unknown.**

 **L. Fill names nans with unknown.**

 **M. Fill tag nans with unknown.**
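
To make the string-parsing items concrete, here is a small sketch of steps B, C, D, and H applied to two made-up rows in the same raw format as the scraped columns. The frame `toy` and its values are hypothetical, and the regular expressions assume the formats seen in the preview above ("Birthday : Jun 16, 1996 (27)", "68%", "48 - 41", "Playing : 7+").

```python
import pandas as pd

# Made-up rows that mimic the raw scraped formats.
toy = pd.DataFrame({
    "age": ["Birthday : Jun 16, 1996 (27)", None],
    "win_rate": ["68%", "64%"],
    "top_character_stats": ["48 - 41", None],
    "years_playing": [None, "Playing : 12+"],
})

# B: the age is the number in parentheses; missing ages get the average age.
ages = pd.to_numeric(toy["age"].str.extract(r"\((\d+)\)", expand=False))
toy["age"] = ages.fillna(ages.mean()).astype(int)

# C: "68%" -> 0.68
toy["win_rate"] = toy["win_rate"].str.rstrip("%").astype(float) / 100

# D: "48 - 41" -> 48 wins and 41 losses; missing stats count as "0 - 0".
wins_losses = (
    toy["top_character_stats"].fillna("0 - 0").str.split(" - ", expand=True).astype(int)
)
toy["num_wins_with_top_char"] = wins_losses[0]
toy["num_losses_with_top_char"] = wins_losses[1]

# H: "Playing : 12+" -> 12; missing values become 1.
years = pd.to_numeric(toy["years_playing"].str.extract(r"(\d+)", expand=False))
toy["years_playing"] = years.fillna(1).astype(int)

print(toy)
```

In this toy frame the second row ends up with age 27 (the mean of the known ages), 0 wins, 0 losses, and 12 years of play.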

All cleaning is done below:

In [None]:
#A
data = data.dropna(subset=["matches"])
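#Dropped because the information doesn't make sense: it says they're 3 years old but have been playing for 6 years.
#Note: drop works by index label, and the concatenated index repeats labels, so every row labeled 31 is removed.
data = data.drop([31], axis=0)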

#E
data["top_character"] = data.loc[:,"top_character"].fillna("unknown")

#F
both = data[(~data["DF_2022"].isna())&(~data["Df_2023"].isna())]
just2022 = data[(~data["DF_2022"].isna())&(data["Df_2023"].isna())]
just2023 = data[(data["DF_2022"].isna())&(~data["Df_2023"].isna())]
data[["DF_2022", "Df_2023"]] = data[["DF_2022", "Df_2023"]].fillna(0)

#G
data[["Global_2022", "Global_2023"]] = data.loc[:,["Global_2022", "Global_2023"]].fillna(500)

#I
data["bio"]=data.loc[:,"bio"].fillna("Supercalifragilisticexpealidocious")

#J
data["clan"]=data.loc[:,"clan"].fillna("none")

#K
data["local"]=data.loc[:,"local"].fillna("unknown")

#L
data["name"]=data.loc[:,"name"].fillna("unknown")

#M
data["tag"]=data.loc[:,"tag"].fillna("unknown")


In [None]:
#The below code completes tasks B, C, D, E, and H.

#B: the age is the number in parentheses, e.g. "Birthday : Jun 16, 1996 (27)"; missing ages become 31.
data["age"] = data["age"].str.extract(r"\((\d+)\)", expand=False).fillna("31")

#C: strip the percent sign (the division by 100 happens further down).
data["win_rate"] = data["win_rate"].astype(str).str.replace("%", "", regex=False)

#E: remove the digits attached to the character name, e.g. "Olimar48" becomes "Olimar".
data["top_character"] = data["top_character"].astype(str).str.replace(r"\d", "", regex=True)

#H: keep only the number from strings like "Playing : 7+"; NaNs are filled with 1 further down.
data["years_playing"] = data["years_playing"].str.extract(r"(\d+)", expand=False)

#D: split "wins - losses" into two columns; missing stats count as "0 - 0".
data["top_character_stats"] = data["top_character_stats"].fillna("0 - 0")
wins_losses = data["top_character_stats"].str.split(" - ", n=1, expand=True)
#Assign by position rather than by label, because the concatenated index repeats labels.
data["num_wins_with_top_char"] = wins_losses[0].to_numpy()
data["num_losses_with_top_char"] = wins_losses[1].to_numpy()

data["win_rate"]=(data["win_rate"].astype(float))/100
#Map shortened character names back to their full names.
full_names = {
    "R": "R.Mika",
    "Chun": "Chun-Li",
    "Diddy": "Diddy Kong",
    "Pac": "Pac-Man",
    "Wii": "Wii Fit Trainer",
    "Mr": "Mr. Game & Watch",
    "Pokemon": "Pokemon Trainer",
    "Ice": "Ice Climbers",
    "M": "M. Bison",
    "Young": "Young Link",
    "Feng": "Feng Wei",
    "F": "F.A.N.G",
    "Josie": "Josie Rizal",
    "E": "E. Honda",
}
data["top_character"] = data["top_character"].replace(full_names)
data["years_playing"] = data["years_playing"].fillna(1)

data['num_wins_with_top_char'] = data['num_wins_with_top_char'].astype(int)
data['years_playing'] = data['years_playing'].astype(int)
data['num_losses_with_top_char'] = data['num_losses_with_top_char'].astype(int)
data['age'] = data['age'].astype(int)
data['num_games_with_top_char']= data["num_wins_with_top_char"] + data["num_losses_with_top_char"]

data_with_df = data[data['Df_2023'] !=0]
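
The filter above treats 0 as a stand-in for "no 2023 score", which works as long as no player has a genuine score of exactly 0. Whether DashFight ever reports such a score is not something this notebook establishes, so the following sketch only illustrates an alternative on made-up data: record which rows were ranked before filling, then filter with that mask. The frame `toy` and the name `has_2023_score` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "tag": ["p1", "p2", "p3"],
    "Df_2023": [583.0, None, 0.0],  # p3 has a made-up genuine score of 0
})

# Record who has a score before the NaNs are overwritten.
has_2023_score = toy["Df_2023"].notna()
toy["Df_2023"] = toy["Df_2023"].fillna(0)

by_sentinel = toy[toy["Df_2023"] != 0]  # same style of filter as the cell above
by_mask = toy[has_2023_score]           # also keeps the genuine zero

print(by_sentinel["tag"].tolist())  # ['p1']
print(by_mask["tag"].tolist())      # ['p1', 'p3']
```

If no genuine zeros exist in the real data, both filters select the same rows.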


In [None]:
#Taking a peek at the cleaned data
data.iloc[16:24]

Unnamed: 0,age,name,years_playing,tag,clan,local,matches,win_rate,bio,top_character,top_character_stats,DF_2022,Global_2022,Df_2023,Global_2023,num_wins_with_top_char,num_losses_with_top_char,num_games_with_top_char
16,30,Jason Bates,15,ANTi,none,USA,23.0,0.51,BioANTi is a former well-known professional Su...,Mario,1 - 8,0.0,500.0,0.0,500.0,1,8,9
17,31,Towa Kuriyama,8,Atelier,Team Liquid,Japan,82.0,0.66,"Bio Towa ""Atelier"" Kuriyama is a Japanese Supe...",unknown,0 - 0,48.0,163.0,0.0,500.0,0,0,0
18,31,Japan,7,Compact,none,unknown,47.0,0.47,BioCompact is a Japanese Ultimate and former W...,unknown,0 - 0,0.0,500.0,0.0,500.0,0,0,0
19,31,Kaito Kawasaki,13,Shogun,none,Japan,7.0,0.35,"BioKaito ""Shogun"" Kawasaki is a Snake main wit...",Snake,2 - 3,0.0,500.0,0.0,500.0,2,3,5
20,31,Japan,1,Kie,none,unknown,79.0,0.66,BioKie (きぃ) is a Japanese Super Smash Bros. Br...,unknown,0 - 0,0.0,500.0,0.0,500.0,0,0,0
21,27,Eric Weber,12,Mr. E,none,USA,95.0,0.51,"BioEric Weber, known as Mr. E, is a Lucina mai...",Lucina,43 - 24,0.0,500.0,821.0,30.0,43,24,67
22,31,Japan,8,Bokinchan,none,unknown,7.0,0.47,Bio\nBokichan is a Japanese SSB Ultimate playe...,unknown,0 - 0,0.0,500.0,0.0,500.0,0,0,0
23,31,Japan,8,Munekin,none,unknown,20.0,0.65,Bio\nMunekin is a professional Japanese smashe...,unknown,0 - 0,0.0,500.0,0.0,500.0,0,0,0
