<a href="https://colab.research.google.com/github/GalaxyTab7/Data219_0-Gang_Final/blob/main/Data219_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis Intro

Before building and testing pipelines that predict a player's 2023 score from their attributes, some analysis will be performed on the target variable and the feature variables. This analysis should help to answer the first research question, which asks what player attributes are correlated with a higher 2023 'DF' score.

# Dependencies and Imports

In [None]:
# Dependencies/Imports
import pandas as pd
import numpy as np
import plotly.express as px
import scipy.stats as stats
import re
import itertools
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import BayesianRidge
pd.options.mode.chained_assignment = None

# Loading In Data

In [None]:
data = pd.read_csv("https://raw.githubusercontent.com/GalaxyTab7/Data219FinalProject/main/Cleaned_Data" , encoding = 'utf-8')

# Analysis

Below are the summary statistics for the target, 2023 'DF' scores. Notably, values equal to zero are excluded, as they were NaNs before the data was cleaned. Lastly, it is important to be aware of what 'DF' points are. These points are calculated by the website. The primary determinants of a player's DF points are their placements in tournaments and the competition level of those tournaments.

The histogram of 2023 'DF' scores reveals a heavily right-skewed distribution. The mean is dragged up by this skew such that the mean is 1094, while the median is only 451. Due to the skew, the median and IQR are likely better descriptors of the set. The IQR is quite large, at about 942 'DF' points. To put that into perspective, gaining 1000 'DF' points is equivalent to winning a national-level tournament, which can include thousands of players.

In [None]:
print(data[data['Df_2023'] !=0]['Df_2023'].describe())
px.histogram(data[data['Df_2023'] !=0]['Df_2023'])

count     240.000000
mean     1094.291667
std      1468.108129
min        68.000000
25%       250.500000
50%       451.000000
75%      1192.250000
max      8888.000000
Name: Df_2023, dtype: float64
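
The point about preferring the median and IQR for a skewed target can be illustrated with a small sketch. The helper name `robust_summary` and the toy series are hypothetical and are not part of the dataset; only the two quartiles used for the IQR arithmetic come from the summary printed above.

```python
import pandas as pd

def robust_summary(scores: pd.Series) -> dict:
    """Return the median, IQR, and sample skewness of a numeric series."""
    q1, q3 = scores.quantile([0.25, 0.75])
    return {
        "median": scores.median(),
        "iqr": q3 - q1,
        "skew": scores.skew(),  # a positive value indicates right skew
    }

# IQR from the quartiles printed by describe() above
iqr_2023 = 1192.25 - 250.50
print(iqr_2023)

# Hypothetical toy scores (not from the dataset) with one large outlier
toy = pd.Series([100, 200, 300, 400, 5000])
summary = robust_summary(toy)
print(summary)
print(toy.mean() > summary["median"])  # the outlier pulls the mean above the median
```

On the real data, the same helper could be called as `robust_summary(data[data['Df_2023'] != 0]['Df_2023'])`.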


Below, several relationships between the factors and the target attribute, the ‘DF’ score for 2023, are visualized. Further, some simple statistical tests are performed to make more generalized conclusions about the data. All of these tests are performed at an alpha level of 5%.

The first two scatter plots visualize how a player’s win and loss counts with their most played character relate to their ‘DF’ score for 2023. Interestingly, the two graphs closely resemble each other. Both regressions have positive slopes, indicating that more wins or more losses with a top character is associated with a higher 2023 ‘DF’ score. Additionally, both p-values are less than 5%, indicating that the relationship between the number of wins/losses with a top character and the 2023 ‘DF’ score is significant. The correlations are modest, however: r is about 0.37 for wins and 0.33 for losses, which corresponds to r^2 values of roughly 14% and 11% respectively. The relationship of “play more means a higher DF” continues in the third scatterplot, which relates the total number of matches played to the 2023 ‘DF’ score. Here, the p-value is yet again less than the alpha level, the slope is positive, and r is about 0.39 (an r^2 of roughly 15%), in line with the prior results. Overall, from these first three visualizations, we can conclude that playing more, and specifically playing more with one’s main character, is associated with a higher 2023 ‘DF’ score.

In [None]:
print("Scatterplot of 'DF' score in 2023 against the number of series wins with that player's most frequently played character.")
# Keep scored players below the 7000-point cutoff who have at least one win with their top character
mask = (data['Df_2023'] != 0) & (data['Df_2023'] < 7000) & (data['num_wins_with_top_char'] != 0)
subset = data[mask]
result = stats.linregress(x = subset['num_wins_with_top_char'] , y = subset['Df_2023'])
print(result)
px.scatter(subset, y = 'Df_2023' , x = 'num_wins_with_top_char')

Scatterplot of 'DF' score in 2023 against the number of series wins with that player's most frequently played character.
LinregressResult(slope=10.855643416190787, intercept=564.6941697464, rvalue=0.372881472014933, pvalue=1.5133676218864396e-06, stderr=2.1697527816212956, intercept_stderr=121.5823890912194)


In [None]:
print("Scatterplot of 'DF' score in 2023 against the number of series losses with that player's most frequently played character.")
# Keep scored players below the 7000-point cutoff who have at least one loss with their top character
mask = (data['Df_2023'] != 0) & (data['Df_2023'] < 7000) & (data['num_losses_with_top_char'] != 0)
subset = data[mask]
result = stats.linregress(x = subset['num_losses_with_top_char'] , y = subset['Df_2023'])
print(result)
px.scatter(subset, y = 'Df_2023' , x = 'num_losses_with_top_char')

Scatterplot of 'DF' score in 2023 against the number of series losses with that player's most frequently played character.
LinregressResult(slope=12.316064539104675, intercept=606.8572600413212, rvalue=0.32513337467970777, pvalue=3.255581342951895e-05, stderr=2.8772879499478647, intercept_stderr=125.66001571766314)


In [None]:
print("Scatterplot of 'DF' for 2023 vs total number of matches.")
result = stats.linregress(x = data[(data['Df_2023']!=0) & (data['Df_2023']<7000)]['matches'] , y = data[(data['Df_2023']!=0) & (data['Df_2023']<7000)]['Df_2023'])
print(result)
px.scatter(data[data['Df_2023'] != 0], x = 'matches' , y = 'Df_2023')

Scatterplot of 'DF' for 2023 vs total number of matches.
LinregressResult(slope=6.553233834104745, intercept=342.1257716521104, rvalue=0.3928438597136176, pvalue=3.618541322584384e-10, stderr=1.0006983347038758, intercept_stderr=126.90213624014267)
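
Because `linregress` reports `rvalue` (the correlation coefficient r) rather than r^2, it can help to compute both explicitly. The following is a minimal sketch, assuming the same filters used in the cells above; the helper name `fit_line`, its `drop_zero_x` parameter, and the toy frame are hypothetical. Squaring the three `rvalue` fields printed above gives r^2 values of roughly 0.14, 0.11, and 0.15.

```python
import pandas as pd
from scipy import stats

def fit_line(df: pd.DataFrame, x_col: str, y_col: str = "Df_2023",
             drop_zero_x: bool = False) -> dict:
    """Filter once, regress y_col on x_col, and report both r and r^2."""
    # Same filters as the cells above: drop zero (missing) targets and scores of 7000 or more
    subset = df[(df[y_col] != 0) & (df[y_col] < 7000)]
    if drop_zero_x:
        # The wins/losses cells additionally drop rows where the feature is zero
        subset = subset[subset[x_col] != 0]
    result = stats.linregress(x=subset[x_col], y=subset[y_col])
    return {
        "n": len(subset),
        "slope": result.slope,
        "r": result.rvalue,
        "r_squared": result.rvalue ** 2,
        "p_value": result.pvalue,
    }

# r^2 derived from the rvalue fields printed in the three outputs above
for name, r in [("wins", 0.372881), ("losses", 0.325133), ("matches", 0.392844)]:
    print(f"{name}: r = {r:.3f}, r^2 = {r ** 2:.3f}")

# Hypothetical toy frame (not from the dataset); the zero target is dropped
toy = pd.DataFrame({"matches": [10, 20, 30, 40, 50],
                    "Df_2023": [100, 0, 310, 390, 520]})
fit = fit_line(toy, "matches")
print(fit)
```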


These next visualizations show results that might be considered “obvious”; however, seeing and testing the relationships is still an important step of analysis. This first scatter plot visualizes how a player's 2023 “DF” score relates to their 2022 “DF” score. As expected, players who performed well in the prior year did well in 2023. The regression analysis returns a p-value of about 1.22 * 10^-10, indicating a significant relationship between the 2023 ‘DF’ score and the 2022 ‘DF’ score. The correlation coefficient r is about 0.40 (an r^2 of roughly 16%), and the slope is positive, indicating a moderate positive relationship between the two. One interesting note is that the group of players who had no 2022 ‘DF’ score had high variance in 2023 ‘DF’ scores, with a range of scores from 68 to 4000. Another possibly “obvious” relationship is the one between win rate and 2023 ‘DF’ score. The second scatterplot visualizes this relationship and shows a positive correlation between the two, which is confirmed by a positive slope. Further, this relationship can be considered significant, as the p-value is about 4.12 * 10^-9, much less than the alpha level of 5%. Yet again, the correlation is moderate, with r of about 0.37 (an r^2 of roughly 14%). Overall, one can conclude that players who performed well in the prior year, or who have a higher overall win rate, generally had higher 2023 ‘DF’ scores.

In [None]:
print("Scatterplot of 'DF' for 2023 vs 'DF' for 2022.")
result = stats.linregress(x = data[(data['Df_2023']!=0) & (data['Df_2023']<7000)]['DF_2022'] , y = data[(data['Df_2023']!=0) & (data['Df_2023']<7000)]['Df_2023'])
print(result)
px.scatter(data[data['Df_2023'] != 0], x = 'DF_2022' , y = 'Df_2023')

Scatterplot of 'DF' for 2023 vs 'DF' for 2022.
LinregressResult(slope=0.3888844394825602, intercept=678.4993324173636, rvalue=0.4024526979301359, pvalue=1.2170914770356583e-10, stderr=0.057703485616080186, intercept_stderr=89.89067407863945)


In [None]:
print("Scatterplot of 'DF' for 2023 vs the overall win rate for the player.")
result = stats.linregress(x = data[(data['Df_2023']!=0) & (data['Df_2023']<7000)]['win_rate'] , y = data[(data['Df_2023']!=0) & (data['Df_2023']<7000)]['Df_2023'])
print(result)
px.scatter(data[data['Df_2023']!=0] , x = 'win_rate' , y = 'Df_2023')

Scatterplot of 'DF' for 2023 vs the overall win rate for the player.
LinregressResult(slope=7851.714434254843, intercept=-3354.739841453125, rvalue=0.3701977380437538, pvalue=4.122629604226566e-09, stderr=1285.2586492982134, intercept_stderr=718.4503832062647)
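
Since the target is heavily right-skewed, a rank-based correlation is one possible cross-check on the linear fits above. This is only a sketch of that idea and not something the notebook runs; the helper name `rank_correlation` and the toy frame are hypothetical.

```python
import pandas as pd
from scipy import stats

def rank_correlation(df: pd.DataFrame, x_col: str, y_col: str = "Df_2023"):
    """Spearman rank correlation between x_col and y_col for scored players."""
    subset = df[df[y_col] != 0]  # zeros stand in for missing scores
    rho, p_value = stats.spearmanr(subset[x_col], subset[y_col])
    return rho, p_value

# Hypothetical toy frame (not from the dataset) with one extreme score
toy = pd.DataFrame({
    "win_rate": [0.40, 0.45, 0.50, 0.55, 0.60, 0.65],
    "Df_2023": [80, 0, 300, 150, 900, 6000],
})
rho, p_value = rank_correlation(toy, "win_rate")
print(rho, p_value)
```

Spearman's rho depends only on the ordering of the values, so a single very large score cannot dominate it the way it can dominate a least-squares fit.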


Another interesting feature of the scraped data is the bio for each player. These bios offer short descriptions of the player's career and life. Below, the 10 most frequently appearing words in the bios of players at or above the 75th percentile of 2023 'DF' scores are given. Interestingly, the years 2018, 2019, and 2020 appear very frequently for these players. The grouped boxplot, which visualizes how these words interact with the 2023 'DF' score, seems to show that players who have these words in their bios also tend to have higher 2023 'DF' scores. However, a t-test for the difference of means produces a p-value of 0.361, indicating that the mean 2023 'DF' scores of the two groups are not significantly different. This makes sense, as the apparent difference is driven largely by outliers in the group of players whose bios contain one of the top 10 most frequent words. Indeed, the medians for the two groups sit very close to each other.
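
One detail worth noting when checking whether a bio contains one of these words: `CountVectorizer` lowercases text and strips punctuation by default, whereas a plain `str.split()` does neither. The sketch below shows one way to match bios with the same tokenization, assuming scikit-learn's default token pattern. The helper name `bio_has_any` and the example bios are hypothetical; the word list is the one printed in this section.

```python
import re

# Default token pattern of scikit-learn's CountVectorizer: tokens of 2+ word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bio_has_any(bio: str, vocabulary: set) -> bool:
    """Return True if the lowercased bio contains at least one vocabulary token."""
    tokens = set(TOKEN_PATTERN.findall(bio.lower()))
    return not tokens.isdisjoint(vocabulary)

# The ten words reported in this section
top_words = {"2019", "player", "tekken", "place", "evo",
             "2018", "2020", "online", "fighter", "street"}

# Hypothetical bios (not from the dataset)
bio_a = "Won EVO, then moved to Tekken."
bio_b = "A newcomer from Lyon"

print(bio_has_any(bio_a, top_words))
print(bio_has_any(bio_b, top_words))
# A whitespace split misses "EVO," and "Tekken." because of case and punctuation
print(any(word in top_words for word in bio_a.split()))
```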

In [None]:
print("Top 10 most frequently appearing words for players with 2023 'DF' scores in at least the 75% percentile.")
countVec = CountVectorizer(stop_words='english')
countVec.fit(data[data['Df_2023'] >= data['Df_2023'].quantile(0.75)]['bio'])
x = countVec.transform(data[data['Df_2023'] >= data['Df_2023'].quantile(0.75)]['bio'])
DfVocab = pd.DataFrame(x.todense() , columns = countVec.get_feature_names_out())
serTopWords = DfVocab.sum(axis = 0).sort_values(ascending = False).iloc[0:10]
# Holds the 10 most frequent words selected above
top_words = list(serTopWords.index)
for word in serTopWords.index:
  print(word)

results =[]
for bio in data['bio']:
  has = False
  for word in bio.split():
    if word in top_words:
      has = True
  if has:
    results.append("Has Indicator")
  else:
    results.append("Doesn't Have Indicator")


print(" ")
print("This grouped boxplot shows Df_2023 scores grouped by whether the bio for the player contains \na word in the top 10 most frequent words.")
data['Indicator'] = results

group1 = data[(data['Df_2023'] != 0) & (data['Indicator'] == "Has Indicator")]['Df_2023']
group2 = data[(data['Df_2023'] != 0) & (data['Indicator'] == "Doesn't Have Indicator")]['Df_2023']
results = stats.ttest_ind(group1 , group2)
print(results)
px.box(data[data['Df_2023'] != 0],x='Df_2023',color='Indicator')


Top 10 most frequently appearing words for players with 2023 'DF' scores in at least the 75% percentile.
2019
player
tekken
place
evo
2018
2020
online
fighter
street
 
This grouped boxplot shows Df_2023 scores grouped by whether the bio for the player contains 
a word in the top 10 most frequent words.
TtestResult(statistic=0.9147365641754742, pvalue=0.3612557564732807, df=238.0)
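
Because the comparison above is sensitive to outliers, a rank-based test and a per-group median can complement the t-test. The following is a hedged sketch rather than the notebook's method: the helper name `compare_groups` and the toy frame are hypothetical, and it assumes a two-valued grouping column such as `Indicator`.

```python
import pandas as pd
from scipy import stats

def compare_groups(df: pd.DataFrame, group_col: str = "Indicator",
                   y_col: str = "Df_2023") -> dict:
    """Compare two groups with Welch's t-test and the Mann-Whitney U test."""
    scored = df[df[y_col] != 0]  # zeros stand in for missing scores
    samples = {name: g[y_col] for name, g in scored.groupby(group_col)}
    if len(samples) != 2:
        raise ValueError("expected exactly two groups")
    a, b = samples.values()
    welch = stats.ttest_ind(a, b, equal_var=False)  # does not assume equal variances
    mwu = stats.mannwhitneyu(a, b, alternative="two-sided")  # rank-based
    return {
        "welch_p": welch.pvalue,
        "mannwhitney_p": mwu.pvalue,
        "medians": scored.groupby(group_col)[y_col].median().to_dict(),
    }

# Hypothetical toy frame (not from the dataset); one group holds a large outlier
toy = pd.DataFrame({
    "Indicator": ["Has Indicator"] * 5 + ["Doesn't Have Indicator"] * 5,
    "Df_2023": [200, 300, 400, 500, 8000, 210, 0, 410, 510, 600],
})
comparison = compare_groups(toy)
print(comparison)
```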


The final three visualizations categorize discrete numeric values to examine their effect on the 2023 ‘DF’ score value.

The first analysis looks at how players' global ranks in 2022 relate to their 2023 ‘DF’ scores. From the grouped boxplot, it appears that players who ranked higher in 2022, as a group, have higher 2023 ‘DF’ scores. This idea is supported by the one-way ANOVA test, which finds significant evidence that at least one of the groups has a different mean 2023 ‘DF’ score. Notably, there is higher variance in the higher global ranking groups. This could indicate that some players reach the top 10 or top 20, then drop off or play their respective fighting game less.

The second to last visualization and analysis shows how the number of years playing relates to 2023 'DF' scores. From the boxplot, it appears that players who have been playing for a really long time, which is categorized as 10 or more years, have generally higher 2023 'DF' scores. This conclusion is supported by the ANOVA test, which finds significant evidence that at least one of the groups has a different mean 2023 'DF' score. Interestingly, the difference in 2023 'DF' scores between gamers who have played for only a year and gamers who have played for between 5 and 10 years is not very pronounced. This could suggest that playing a game for longer only begins to noticeably affect the 'DF' score after quite a while. This is in line with the popular 10,000-hour rule, which holds that achieving mastery in any given skill requires about 10,000 hours of practice.

The last visualization shows how age relates to 2023 ‘DF’ scores. The difference between the groups is less clear than in the prior plot; however, it appears that the under-25 age group has a slightly higher mean 2023 ‘DF’ score. This is consistent with the one-way ANOVA results, which show significant evidence that at least one of the groups has a different mean 2023 ‘DF’ score, although the test alone does not identify which group differs.

Overall, these three tests seem to point to relatively young players, who both ranked well globally last year and have played for a long time, having higher mean 2023 ‘DF’ scores.

In [None]:
list_results = []
for value in data['Global_2022'].values:
  if (value < 10):
    list_results.append("top 10")
  elif (value < 20):
    list_results.append("top 20")
  elif (value < 50):
    list_results.append("top 50")
  elif (value < 100):
    list_results.append("top 100")
  else:
    list_results.append("Other")

data['Cat_2022_Global'] = list_results

print("2023 'DF' score by the global ranking of the player.")
group1 = data[data['Cat_2022_Global'] == 'top 10']['Df_2023']
group2 = data[data['Cat_2022_Global'] == 'top 20']['Df_2023']
group3 = data[data['Cat_2022_Global'] == 'top 50']['Df_2023']
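# Fourth group is the 'top 100' bucket. The F statistic recorded below came from a run that passed 'top 50' twice, so re-run this cell to refresh it.
group4 = data[data['Cat_2022_Global'] == 'top 100']['Df_2023']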
results = stats.f_oneway(group1 , group2 , group3 , group4)
print(results)
px.box(data[data['Cat_2022_Global'] != "Other"] , x = 'Df_2023' , color ='Cat_2022_Global' )

2023 'DF' score by the global ranking of the player.
F_onewayResult(statistic=6.521640561240325, pvalue=0.00033337355133197685)
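
The rank bucketing above can also be written with `pd.cut`, which builds one group per bucket and so avoids listing each group by hand. This is a minimal sketch under stated assumptions: the helper name `anova_by_rank`, the `rank_bucket` column, and the toy frame are hypothetical. It also assumes right-inclusive bins, so rank 10 lands in "top 10", whereas the `value < 10` check above places rank 10 in "top 20". Ranks outside 1 to 100 are left out of the test.

```python
import pandas as pd
from scipy import stats

RANK_BINS = [0, 10, 20, 50, 100]
RANK_LABELS = ["top 10", "top 20", "top 50", "top 100"]

def anova_by_rank(df: pd.DataFrame, rank_col: str = "Global_2022",
                  y_col: str = "Df_2023"):
    """Bucket ranks with pd.cut, then run a one-way ANOVA across the buckets."""
    scored = df[df[y_col] != 0].copy()  # zeros stand in for missing scores
    # Right-inclusive bins: (0, 10], (10, 20], (20, 50], (50, 100]; other ranks become NaN
    scored["rank_bucket"] = pd.cut(scored[rank_col], bins=RANK_BINS, labels=RANK_LABELS)
    groups = [g[y_col] for _, g in scored.groupby("rank_bucket", observed=True)]
    return scored, stats.f_oneway(*groups)

# Hypothetical toy frame (not from the dataset)
toy = pd.DataFrame({
    "Global_2022": [1, 5, 10, 11, 15, 20, 30, 40, 50, 60, 80, 100, 250],
    "Df_2023": [5000, 4200, 3900, 2500, 2100, 2800, 1500, 1200, 900, 600, 700, 400, 300],
})
bucketed, anova = anova_by_rank(toy)
print(bucketed["rank_bucket"].value_counts())
print(anova)
```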


In [None]:
data['years_playing'].describe()
list_add = []
for value in data['years_playing'].values:
  if value <= 1:
    list_add.append("Not very long")
  elif value < 5:
    list_add.append("A little while")
  elif value < 10:
    list_add.append("A long time")
  else:
    list_add.append("A really long time")

data['cat_years_playing'] = list_add
group1 = data[data['cat_years_playing'] == "Not very long"]['Df_2023']
group2 = data[data['cat_years_playing'] == "A little while"]['Df_2023']
group3 = data[data['cat_years_playing'] == "A long time"]['Df_2023']
group4 = data[data['cat_years_playing'] == "A really long time"]['Df_2023']
results = stats.f_oneway(group1 , group2 , group3 , group4)
print("2023 'DF' score by the number of years a player has been playing their game for:")
print(results)
px.box(data[data['Df_2023'] != 0], x = 'Df_2023', color='cat_years_playing')

2023 'DF' score by the number of years a player has been playing their game for:
F_onewayResult(statistic=8.872041418832726, pvalue=8.258273445618453e-06)
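
One-way ANOVA compares group means, which the skew noted earlier can distort. A rank-based Kruskal-Wallis test is one possible cross-check. It is sketched here as an assumption-laden illustration rather than a step the notebook performs; the helper name `kruskal_by_group` and the toy frame are hypothetical.

```python
import pandas as pd
from scipy import stats

def kruskal_by_group(df: pd.DataFrame, group_col: str = "cat_years_playing",
                     y_col: str = "Df_2023"):
    """Kruskal-Wallis H-test across the categories of group_col."""
    scored = df[df[y_col] != 0]  # zeros stand in for missing scores
    groups = [g[y_col].to_numpy() for _, g in scored.groupby(group_col)]
    return stats.kruskal(*groups)

# Hypothetical toy frame (not from the dataset); the zero score is dropped
toy = pd.DataFrame({
    "cat_years_playing": ["Not very long"] * 5 + ["A long time"] * 4 + ["A really long time"] * 4,
    "Df_2023": [100, 150, 200, 250, 0, 300, 350, 400, 450, 1000, 2000, 3000, 8000],
})
kw = kruskal_by_group(toy)
print(kw)
```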


In [None]:
list_add = []
for value in data['age'].values:
  if value < 18:
    list_add.append("Under 18")
  elif value < 25:
    list_add.append("Under 25")
  elif value < 30:
    list_add.append("Under 30")
  elif value < 38:
    list_add.append("Under 38")
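  elif value < 45:
    list_add.append("Under 45")
  else:
    # Catch ages of 45 and over (or missing values) so list_add stays aligned with the rows of data
    list_add.append("45 and over / unknown")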

data['Cat_age'] = list_add
arg1 = data[data['Cat_age'] == "Under 18"]['Df_2023']
arg2 = data[data['Cat_age'] == "Under 25"]['Df_2023']
arg3 = data[data['Cat_age'] == "Under 30"]['Df_2023']
arg4 = data[data['Cat_age'] == "Under 38"]['Df_2023']
arg5 = data[data['Cat_age'] == "Under 45"]['Df_2023']
results = stats.f_oneway(arg1 , arg2 , arg3 , arg4 , arg5)
print("Grouped boxplots for 'DF' 2023 by the age of the player")
print(results)
px.box(data[data['Df_2023'] != 0] , x = 'Df_2023' , color = 'Cat_age')

Grouped boxplots for 'DF' 2023 by the age of the player
F_onewayResult(statistic=14.378619274429884, pvalue=1.953283383278205e-11)
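
An ANOVA result says only that at least one group mean differs, so a per-group table of counts, medians, and means can help show which group stands out and how many players each group contains. The helper name `group_summary` and the toy frame in this sketch are hypothetical.

```python
import pandas as pd

def group_summary(df: pd.DataFrame, group_col: str = "Cat_age",
                  y_col: str = "Df_2023") -> pd.DataFrame:
    """Count, median, and mean of y_col per category, excluding zero scores."""
    scored = df[df[y_col] != 0]  # zeros stand in for missing scores
    return scored.groupby(group_col)[y_col].agg(["count", "median", "mean"])

# Hypothetical toy frame (not from the dataset)
toy = pd.DataFrame({
    "Cat_age": ["Under 18", "Under 18", "Under 25", "Under 25", "Under 25",
                "Under 30", "Under 30", "Under 30"],
    "Df_2023": [300, 0, 500, 900, 4000, 250, 450, 700],
})
summary = group_summary(toy)
print(summary)
```

Comparing the `median` and `mean` columns side by side also indicates how much a group's mean is being pulled up by a few high scores.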


# Analysis Conclusions

From these visualizations and tests, the following conclusions can be made:
- Playing more series with a top character, and more matches overall, is correlated with a higher 'DF' score in the current/next year.
- A higher 'DF' score in the prior year is correlated with a higher 'DF' score in the current/next year.
- A higher win rate overall is correlated with a higher 'DF' score in the current/next year.
- Grouping the data by whether a player's bio contains one of the top ten most frequent words that appear in the bios of players at or above the 75th percentile of 2023 'DF' scores does not produce a significant difference in mean 2023 'DF' scores between the groups.
- Having played for longer is generally associated with a higher 'DF' score in the next/current year.
- Generally, being ranked higher globally in the prior year relates to a higher 'DF' score in the current/next year. However, some players may play less after reaching a high global rank, and thereby get lower 'DF' scores in the next year.
- Generally, the 18-to-24 age group seems to have slightly higher 'DF' scores.