# CSMODEL S11 | Project Phase 1
This notebook is the work of Group 4, consisting of the following members:

* CARNEY, JOHN PAUL COMPANIA
* GUERRRERO, MIGUEL ALFONSO DAVID
* REINANTE, CHRISTIAN VICTOR GO
* SALVADOR, JARYLL FRANCIS PENA

## Dataset Description
This project makes use of the [Online Gaming Anxiety Data Set](https://www.kaggle.com/datasets/divyansh22/online-gaming-anxiety-data). It contains responses gathered from a worldwide survey of gamers. Included in this survey are psychological assessments for anxiety, social phobia, and life satisfaction. It also gathered demographic and gaming-related information. Marian Sauter and Dejan Draschkow originally compiled the data.


## Importing Libraries
Before proceeding, we will import the necessary libraries which we will use to provide a general overview of the dataset.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Loading the Dataset
We then load the dataset as follows:

In [None]:
gamingAnxiety_df = pd.read_csv("GamingStudy_data.csv")
gamingAnxiety_df.head()

## Process and Implications of Data Collection
The data was gathered by means of a survey that was distributed to gamers globally. The survey had a range of inquiries commonly employed by psychologists to assess levels of anxiety, social phobia, and life satisfaction. Standardized psychological assessment instruments, including the General Anxiety Disorder Assessment (GAD), Satisfaction with Life Scale (SWL), and Social Phobia Inventory (SPIN) questionnaires, and inquiries regarding gaming habits and general demographics were included in the survey. 

Though not explicitly mentioned, it is extremely likely that this survey was conducted online, given that online surveys are commonly used when reaching a worldwide audience, especially gamers. The dataset description also includes *Reddit* as an example for the **Reference** variable, indicating the website was used as an avenue to conduct the survey as well. Assuming the data was collected as such, this presents several implications:

- **Sample Composition**: Because the data was collected through an online survey, it may over-represent individuals active in online gaming communities or gamers who primarily play online multiplayer games. As a result, those who do not regularly use the internet, are inactive in online gaming communities, or those who play single-player games exclusively may be underrepresented.

- **Voluntary Response Bias**: The data relies on self-reported responses, which can be subject to biases such as inaccurate self-assessment by the respondent or social desirability bias. Because participation was voluntary, respondents with stronger views may also have been more likely to take part in the first place.

**Each row** represents a single survey response from a gamer, and **each column** represents a variable collected in the survey. The dataset contains **13464 observations** in total, and there are **55 variables** in the dataset. We can verify this, and also check each individual variable using the info() method:

In [None]:
gamingAnxiety_df.info()

#### Demographic Information

- **S. No.:** Serial Number.  
- **Timestamp:** Time at which the participant took the questionnaire after it was launched.  
- **Gender:** Self-identified gender of the gamer taking the questionnaire.  
- **Age:** Self-reported age of the gamer taking the questionnaire.  
- **Work:** Work status of the gamer.  
- **Degree:** Highest degree attained.  
- **Birthplace:** Birthplace.  
- **Residence:** Place where the gamer currently resides.  
- **Residence_ISO3:** Current residence in ISO3 format.  
- **Birthplace_ISO3:** Birthplace in ISO3 format.
- **Accept:** Accept terms and conditions (not necessary for any analysis).  

#### Psychological Assessment

- **GAD1 to GAD7:** Responses to GAD questions 1 to 7.  
- **GADE:** Effect of gaming on work.  
- **SWL1 to SWL5:** Responses to SWL questions 1 to 5.  
- **SPIN1 to SPIN17:** Responses to SPIN questions 1 to 17.  
- **Narcissism:** Interest scale in the game (1-5).  
- **GAD_T:** GAD Total Score.  
- **SWL_T:** SWL Total Score.  
- **SPIN_T:** SPIN Total Score.  

#### Gaming Habits

- **Game:** Name of the game they play.  
- **Platform:** Mode of game playing (PC, Console, Mobile, etc.).  
- **Hours:** Number of hours in a week devoted to playing.  
- **earnings:** Earnings from the game (if any).  
- **whyplay:** Reason to play the game.  
- **League:** Respondent's current ingame rank.  
- **highestleague:** Highest rank attained.  
- **streams:** Number of online streaming sessions.


## Data Cleaning 
Next, we prepare our dataset for modeling and analysis. 

#### Pinpoint and Remove Irrelevant Variables
We start by removing the following variables:
- **League:** This column has inconsistent formatting and its value is not utilized in the study.
- **highestleague:** This column consists entirely of null values and will not be used.
- **Accept:** This column is not necessary for analysis.
- **earnings:** This column is not relevant to the study.
- **streams:** This column is not relevant to the study.
- **Residence:** We will be using Residence_ISO3 instead, as it is formatted more consistently.
- **Birthplace:** We will be using Birthplace_ISO3 instead, as it is formatted more consistently.

In [None]:
gamingAnxiety_df = gamingAnxiety_df.drop(columns=['League', 'highestleague', 'accept', 'earnings', 'streams', 'Residence', 'Birthplace'])

#### Handling Null Values 
This section focuses on the Psychological Assessment variables as well as the gaming habits. Having dropped the variables that are irrelevant to our study, we now look for variables with null values. We do this by selecting every column that contains at least one null value and counting the null-valued cells in each.

In [None]:
nullVariables = gamingAnxiety_df.columns[gamingAnxiety_df.isnull().any()].tolist()
gamingAnxiety_df[nullVariables].isnull().sum()

Most variables here have a relatively low number of null values (less than 5%). Although we could choose to drop these rows given how few they are, we will instead perform imputation to preserve our sample size and maintain the variability of our dataset. Furthermore, if the missing cells are scattered (i.e. many rows only have one or two cells missing), then we may end up dropping a deceptively high number of rows rather than just a few hundred. At worst, we may end up dropping a number of rows equal to the sum of the null counts across all columns.
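To illustrate the concern about scattered missing cells, here is a minimal sketch on a made-up five-row frame (the values are hypothetical and not taken from the survey). Each null sits in a different row, so dropping incomplete rows discards more rows than any single column's null count would suggest.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three nulls, each located in a different row
toy_df = pd.DataFrame({
    "Hours": [10.0, np.nan, 20.0, 15.0, 30.0],
    "SPIN_T": [12.0, 18.0, np.nan, 25.0, 9.0],
    "Narcissism": [1.0, 2.0, 3.0, np.nan, 5.0],
})

nulls_per_column = toy_df.isnull().sum()   # one null in every column
rows_after_dropna = len(toy_df.dropna())   # rows with no nulls at all
rows_lost = len(toy_df) - rows_after_dropna

print(nulls_per_column)
print(f"Rows lost by dropna(): {rows_lost} of {len(toy_df)}")
```

Here each column is missing only one value, yet `dropna()` removes three of the five rows.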

To start, we display numerical summaries of every column with null values in our dataframe and seek out columns with outliers.

In [None]:
summary_stats = gamingAnxiety_df[nullVariables].describe(percentiles=[.25, .50, .75, .99]).round(2)

print("Summary Statistics Before Imputation:")
print(summary_stats)

We can see that Hours has a max value of 8000, but 99% of its values exist under 70. We can further understand this through a boxplot.

In [None]:
sns.boxplot(x=gamingAnxiety_df['Hours'])
plt.title('Box Plot of Hours')
plt.show()

Because there are extreme outliers present that may skew the mean, we will opt to impute according to the median. 

In [None]:
columns_to_impute = ['Hours']

for column in columns_to_impute:
    gamingAnxiety_df[column] = gamingAnxiety_df[column].fillna(gamingAnxiety_df[column].median())
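As a rough illustration of why the median is the safer fill value here, the sketch below uses a made-up series of weekly hours with a single extreme entry, loosely mimicking the 8000-hour maximum seen above. The numbers are hypothetical.

```python
import pandas as pd

# Hypothetical weekly hours with one implausible entry
toy_hours = pd.Series([10, 15, 20, 25, 30, 8000])

# The mean is dragged upward by the single outlier; the median is not
print("mean:  ", toy_hours.mean())
print("median:", toy_hours.median())
```

Filling missing values with the mean would assign respondents a playtime far above what almost everyone reported, whereas the median stays within the typical range.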

Let's verify that we've successfully performed the imputation:

In [None]:
gamingAnxiety_df[nullVariables].isnull().sum()

Columns Narcissism and SPIN_T are numerical columns without outliers and are safe to impute according to the mean.

In [None]:
columns_to_impute = ['Narcissism', 'SPIN_T']

for column in columns_to_impute:
    gamingAnxiety_df[column] = gamingAnxiety_df[column].fillna(gamingAnxiety_df[column].mean())


Let's verify that we've successfully performed the imputation:

In [None]:
gamingAnxiety_df[nullVariables].isnull().sum()

We cannot use mean imputation for our categorical variables. We would also rather not drop them, for the same reason we do not want to drop our numerical variables. One method of imputation compatible with categorical values we can use is mode imputation. We impute according to the mode below:

In [None]:
columns_to_impute = ['GADE', 'SPIN1', 'SPIN2', 'SPIN3', 'SPIN4', 'SPIN5', 
                     'SPIN6', 'SPIN7', 'SPIN8', 'SPIN9', 'SPIN10', 
                     'SPIN11', 'SPIN12', 'SPIN13', 'SPIN14', 'SPIN15', 
                     'SPIN16', 'SPIN17', 'Work', 'Degree', 'Reference',
                     'Residence_ISO3', 'Birthplace_ISO3']

for column in columns_to_impute:
    mode_value = gamingAnxiety_df[column].mode().iloc[0]
    gamingAnxiety_df[column] = gamingAnxiety_df[column].fillna(mode_value)
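The mode imputation step can be sketched on a small made-up column as follows. The data is hypothetical; the point is only to show why `.iloc[0]` is needed, since `mode()` returns a Series that may hold several values when there is a tie.

```python
import pandas as pd

# Hypothetical categorical column with one missing entry
toy_platform = pd.Series(["PC", "Console", None, "PC"])

# mode() returns a Series (ties yield several values); iloc[0] takes the first one
mode_value = toy_platform.mode().iloc[0]
filled = toy_platform.fillna(mode_value)

print(filled.tolist())
```

One trade-off worth keeping in mind is that this approach makes the most common category slightly more common than it was in the observed responses.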

And again verify that we've successfully imputed the categorical variables we've targeted:

In [None]:
gamingAnxiety_df[nullVariables].isnull().sum()

## Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a crucial step in understanding the characteristics and underlying patterns in our dataset. In this study, we conducted a comprehensive EDA to explore the psychological measures of anxiety, life satisfaction, and social phobia among gamers worldwide. 

In [None]:
gamingAnxiety_df.head()

## I. Distribution of Key Psychological Measures
What is the distribution of anxiety, life satisfaction, and social phobia scores among gamers?

To answer this, we have to take a comprehensive look into the psychological state of the respondents and analyze the distribution of three key psychological measures: anxiety scores (GAD_T), life satisfaction scores (SWL_T), and social phobia scores (SPIN_T). First, we will construct numerical summaries and measure central tendencies, dispersion and correlation between the variables. We will then use histograms to visualize the frequency distributions of these measures, as they provide an intuitive way to see how scores are spread across different ranges and to identify common patterns and abnormalities.

### Numerical Summaries:

In [None]:
vars = ['GAD_T', 'SWL_T', 'SPIN_T']

median = gamingAnxiety_df[vars].median().round(2)

mode = gamingAnxiety_df[vars].mode().round(2).iloc[0]  

summary_stats = gamingAnxiety_df[vars].describe().round(2)

summary_stats.loc['median'] = median
summary_stats.loc['mode'] = mode

print("Numerical Summaries of GAD_T, SWL_T and SPIN_T:")
print(summary_stats)


From the numerical summaries, we can see that GAD_T has the smallest mean at 5.21 and the lowest standard deviation, while SWL_T and SPIN_T have similar means of 19.79 and 19.85 respectively but different dispersions, with standard deviations of 7.23 and 13.14. We can glean from this that most respondents have low anxiety and social phobia scores and middling life satisfaction scores. We can further confirm this through the histograms below.

### Visualization:

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.histplot(gamingAnxiety_df['GAD_T'], kde=True, bins=30)
plt.title('Distribution of Anxiety Scores (GAD_T)')
plt.xlabel('Anxiety Score')
plt.ylabel('No. of Respondents')

plt.subplot(1, 3, 2)
sns.histplot(gamingAnxiety_df['SWL_T'], kde=True, bins=30)
plt.title('Distribution of Life Satisfaction Scores (SWL_T)')
plt.xlabel('Life Satisfaction Score')
plt.ylabel('No. of Respondents')

plt.subplot(1, 3, 3)
sns.histplot(gamingAnxiety_df['SPIN_T'], kde=True, bins=30)
plt.title('Distribution of Social Phobia Scores (SPIN_T)')
plt.xlabel('Social Phobia Score')
plt.ylabel('No. of Respondents')

plt.tight_layout()
plt.show()


## Explanation : 
The histograms show that most respondents have lower anxiety scores. The distribution of life satisfaction scores shows that most respondents have medium to high life satisfaction. Lastly, the social phobia scores show that most respondents have lower scores, with fewer respondents reporting higher levels of social phobia.

Looking more closely, the GAD_T histogram is right-skewed, indicating that most respondents have lower anxiety scores, with some outliers experiencing higher levels of anxiety. Analyzing this distribution lets us identify the proportion of respondents experiencing each degree of anxiety, which can help in data gathering and usage. The SWL_T histogram is slightly left-skewed, suggesting that most respondents have moderate to high life satisfaction, i.e. they are currently fairly happy with their lives. This shows that the sample as a whole reports good overall well-being. The SPIN_T histogram is right-skewed, indicating that most respondents have lower social phobia scores. Social anxiety is therefore not a significant issue for most of the respondents, with only a small minority experiencing high levels of social phobia.
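The direction of skew described above can also be checked numerically rather than by eye. The following is a minimal sketch using pandas' `Series.skew()` on made-up scores (not the survey data): a positive value indicates a longer right tail and a negative value a longer left tail.

```python
import pandas as pd

# Hypothetical scores: most values low with a few high ones (long right tail)
right_tailed = pd.Series([0, 1, 1, 2, 2, 2, 3, 3, 4, 15, 20])
# Mirror image: most values high with a few low ones (long left tail)
left_tailed = right_tailed.max() - right_tailed

print("right-tailed skew:", round(right_tailed.skew(), 2))
print("left-tailed skew: ", round(left_tailed.skew(), 2))
```

Assuming the column names used in this notebook, the same check could be applied with `gamingAnxiety_df[['GAD_T', 'SWL_T', 'SPIN_T']].skew()`.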

## Conclusion:
According to the data, the GAD_T distribution shows that most respondents have lower anxiety scores, so we can assume that the majority of respondents do not have problems with anxiety. Likewise, the SWL_T distribution, although left-skewed, shows that the majority of respondents are quite happy with their lives. The SPIN_T distribution shows that the majority of respondents do not suffer from social phobia and only a small minority do.

## II. Gaming Hours
Here, we seek to answer the question "Do gaming hours per week correlate with anxiety, life satisfaction, and social phobia scores?"

### Numerical Summaries:

In [None]:
vars = ['Hours', 'GAD_T', 'SWL_T', 'SPIN_T']

median = gamingAnxiety_df[vars].median().round(2)

mode = gamingAnxiety_df[vars].mode().round(2).iloc[0]  

summary_stats = gamingAnxiety_df[vars].describe().round(2)

summary_stats.loc['median'] = median
summary_stats.loc['mode'] = mode

correlations = gamingAnxiety_df[['Hours', 'GAD_T', 'SWL_T', 'SPIN_T']].corr().round(2)

print("Numerical Summaries of Hours, GAD_T, SWL_T and SPIN_T:")
print(summary_stats)

print("\nCorrelations between variables:")
print(correlations)

From the numerical summaries, we can see that Hours has a median of 20, which is more representative of typical playtime than the mean because Hours has extreme outliers, which also explains its high standard deviation of 70.21. Assessing the correlation coefficients between our variables, we can see that none of them is above 0.50, suggesting at most a weak to moderate linear relationship between them. We can further understand this through scatterplots.
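Because Pearson's coefficient is sensitive to extreme values such as those in Hours, a rank-based coefficient can serve as a cross-check. The sketch below is an illustration on made-up data, not part of the original analysis. It computes Spearman's coefficient as Pearson's coefficient on the ranks.

```python
import pandas as pd

# Hypothetical data: scores rise steadily with hours, but one hours value is extreme
toy_df = pd.DataFrame({
    "Hours": [1, 2, 3, 4, 5, 8000],
    "Score": [1, 2, 3, 4, 5, 6],
})

pearson = toy_df["Hours"].corr(toy_df["Score"])
# Spearman's coefficient is Pearson's coefficient computed on the ranks
spearman = toy_df["Hours"].rank().corr(toy_df["Score"].rank())

print("Pearson: ", round(pearson, 2))
print("Spearman:", round(spearman, 2))
```

In this toy case the relationship is perfectly monotonic, so the rank-based coefficient is 1 while the Pearson coefficient is pulled down by the single outlier.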

### Visualization:

In [None]:
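# Drop extreme outliers (more than 200 hours per week). Note that this overwrites
# gamingAnxiety_df, so every later cell works on the filtered data. The copy()
# avoids chained-assignment warnings when new columns are added later.
gamingAnxiety_df = gamingAnxiety_df[gamingAnxiety_df['Hours'] <= 200].copy()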
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.scatterplot(x='Hours', y='GAD_T', data=gamingAnxiety_df)
plt.title('Gaming Hours per Week vs Anxiety Scores')
plt.xlabel('Gaming Hours per Week')
plt.ylabel('Anxiety Score')

plt.subplot(1, 3, 2)
sns.scatterplot(x='Hours', y='SWL_T', data=gamingAnxiety_df)
plt.title('Gaming Hours per Week vs Life Satisfaction Scores')
plt.xlabel('Gaming Hours per Week')
plt.ylabel('Life Satisfaction Score')

plt.subplot(1, 3, 3)
sns.scatterplot(x='Hours', y='SPIN_T', data=gamingAnxiety_df)
plt.title('Gaming Hours per Week vs Social Phobia Scores')
plt.xlabel('Gaming Hours per Week')
plt.ylabel('Social Phobia Score')

plt.tight_layout()
plt.show()

## Explanation
Scatterplots are best used when trying to visualize a relationship between two continuous variables. Here we attempt to visualize any potential relationship between gaming hours per week and each of GAD_T, SPIN_T, and SWL_T.

From visuals alone, there is no obvious trend between gaming hours per week and these variables. One thing of note, however, is that the dots tend to be concentrated towards the left side of the graph. This skewness does indicate that the dataset represents more observations with lower gaming hours per week. Besides this, there is little more that can be gathered from looking at the graph alone.

## Conclusion:
Attempting to visualize potential relationships between gaming hours and anxiety, social phobia, and life satisfaction shows no clear trend. However, the concentration of dots on the left suggests lower gaming hours overall.

## III. Demographic breakdown (Age, Gender, Nationality) of Gamers
What is the demographic breakdown (age, gender, nationality) of gamers in the survey?
For this, we can use a histogram and bar plots to see how the respondents are distributed across age, gender, and country.

### Numerical Summaries:

In [None]:
vars = ['Age', 'Gender', 'Birthplace_ISO3']

mode = gamingAnxiety_df[vars].mode().iloc[0]

summary_stats_age = gamingAnxiety_df['Age'].describe().round(2)

summary_stats_age['median'] = round(gamingAnxiety_df['Age'].median(),2)
summary_stats_age['mode'] = mode['Age']

summary_stats_gender = gamingAnxiety_df['Gender'].value_counts().to_frame(name='count')
summary_stats_birthplace = gamingAnxiety_df['Birthplace_ISO3'].value_counts().to_frame(name='count')

print("Summary Statistics for Age:")
print(summary_stats_age)

print("\nMode for Gender and Birthplace_ISO3:")
print(mode[['Gender', 'Birthplace_ISO3']])

print("\nValue Counts for Gender:")
print(summary_stats_gender)

print("\nValue Counts for Birthplace_ISO3:")
print(summary_stats_birthplace)


From the numerical summaries, we can see that most respondents are aged 18-21, male, and from the USA. Male respondents are far more numerous, with 12699 males against 713 females, and only a very small minority of respondents are listed as 'Other'. Nationality is similarly lopsided, with USA respondents outnumbering those of the second most common country, Germany, 4380 to 1376. We can further understand these demographics through the plots below.

### Visualization:

In [None]:
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
sns.histplot(gamingAnxiety_df['Age'], kde=True, bins=30)
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.subplot(1, 3, 2)
sns.countplot(x='Gender', data=gamingAnxiety_df)
plt.title('Gender Distribution')
plt.xlabel('Gender')
plt.ylabel('Count')

plt.subplot(1, 3, 3)
top_nationalities = gamingAnxiety_df['Birthplace_ISO3'].value_counts().head(10)
sns.barplot(x=top_nationalities.index, y=top_nationalities.values)
plt.title('Top 10 Nationalities')
plt.xlabel('Nationality')
plt.ylabel('Count')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

## Explanation
The first graph shows the age distribution: the majority of the respondents are in their late teens and early 20s, with fewer respondents at the younger and older extremes. The second graph shows the counts of male, female and other respondents. It is apparent that the respondents in this dataset are mostly male, with smaller proportions of female and other genders. The last graph shows the top 10 nationalities of respondents, namely USA, Germany, UK, Canada, Netherlands, France, Sweden, Poland, Brazil and Portugal.

Looking more closely, the first graph peaks in the late teens to early 20s, showing that this age group is highly represented in the sample, while the tail extends up to 60, showing that gaming is not a phase but a passion for some people. The second graph shows that the majority of the respondents are male, which is relevant for mental health research and for identifying target demographics. The last graph again shows the top 10 nationalities, with the USA first and Germany second. Seven of the top 10 countries are in Europe.


## Conclusion
The majority of gamers in the dataset are around 18-21 years old, male, and from the USA.

## IV. Distribution of Preferred Gaming Device, Game Genre, and Psychological Measures

### Numerical Summaries:

In [None]:
vars = ['Platform', 'Game', 'GAD_T', 'SWL_T', 'SPIN_T']

mode = gamingAnxiety_df[vars].mode().iloc[0]

numerical_vars = ['GAD_T', 'SWL_T', 'SPIN_T']
summary_stats_numerical = gamingAnxiety_df[numerical_vars].describe().round(2)

median = gamingAnxiety_df[numerical_vars].median().round(2)

summary_stats_numerical.loc['median'] = median
summary_stats_numerical.loc['mode'] = mode[numerical_vars]
summary_stats_numerical = summary_stats_numerical.round(2)
summary_stats_platform = gamingAnxiety_df['Platform'].value_counts().to_frame(name='count')
summary_stats_game = gamingAnxiety_df['Game'].value_counts().to_frame(name='count')

print("Summary Statistics for Numerical Variables:")
print(summary_stats_numerical)

print("\nMode for Categorical Variables:")
print(mode[['Platform', 'Game']])

print("\nValue Counts for Platform:")
print(summary_stats_platform)

print("\nValue Counts for Game:")
print(summary_stats_game)

From the numerical summaries, we can see that most respondents play on PC and play League of Legends. PC players outnumber console players, the second most common platform, 13218 to 222. Similarly, League of Legends has the highest number of respondents, with the second most common answer being 'Other', suggesting that 1020 respondents play a game that might not be listed in the survey. We can understand these statistics better through boxplots.

### Visualization:

In [None]:
# Preferred gaming device and psychological measures
plt.figure(figsize=(15, 10))

# Device vs Anxiety
plt.subplot(3, 1, 1)
sns.boxplot(x='Platform', y='GAD_T', data=gamingAnxiety_df)
plt.title('Preferred Gaming Device vs Anxiety Scores')
plt.xlabel('Gaming Device')
plt.ylabel('Anxiety Score')

# Device vs Life Satisfaction
plt.subplot(3, 1, 2)
sns.boxplot(x='Platform', y='SWL_T', data=gamingAnxiety_df)
plt.title('Preferred Gaming Device vs Life Satisfaction Scores')
plt.xlabel('Gaming Device')
plt.ylabel('Life Satisfaction Score')

# Device vs Social Phobia
plt.subplot(3, 1, 3)
sns.boxplot(x='Platform', y='SPIN_T', data=gamingAnxiety_df)
plt.title('Preferred Gaming Device vs Social Phobia Scores')
plt.xlabel('Gaming Device')
plt.ylabel('Social Phobia Score')

plt.tight_layout()
plt.show()

# Game genre and psychological measures
plt.figure(figsize=(15, 10))

# Genre vs Anxiety
plt.subplot(3, 1, 1)
sns.boxplot(x='Game', y='GAD_T', data=gamingAnxiety_df)
plt.title('Game Genre vs Anxiety Scores')
plt.xlabel('Game Genre')
plt.ylabel('Anxiety Score')
plt.xticks(rotation=45)

# Genre vs Life Satisfaction
plt.subplot(3, 1, 2)
sns.boxplot(x='Game', y='SWL_T', data=gamingAnxiety_df)
plt.title('Game Genre vs Life Satisfaction Scores')
plt.xlabel('Game Genre')
plt.ylabel('Life Satisfaction Score')
plt.xticks(rotation=45)

# Genre vs Social Phobia
plt.subplot(3, 1, 3)
sns.boxplot(x='Game', y='SPIN_T', data=gamingAnxiety_df)
plt.title('Game Genre vs Social Phobia Scores')
plt.xlabel('Game Genre')
plt.ylabel('Social Phobia Score')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()


## Explanation:
A boxplot is extremely useful here as it displays the range of the data and highlights central tendency, letting us see how different genres and platforms may affect these.

The platform graphs show that mobile users tend to have higher GAD_T, SWL_T and SPIN_T scores compared to PC and console users. Interestingly, PC players noticeably have more outliers compared to the rest, leaning towards high GAD_T and SPIN_T.

Though there are differences in the median scores, the most striking observations about the game genre graphs are the high amount of outliers presenting high GAD_T and high SPIN_T scores for games like League of Legends and Counter Strike, games known to be highly competitive. This may present an opportunity for feature engineering, classifying the games with regards to how competitive they are.
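The feature-engineering idea mentioned above could be sketched as a simple membership flag. The set of competitive titles below is an assumption made for illustration, seeded only with the two games named in the text. It would need to be curated against the actual values in the Game column.

```python
import pandas as pd

# Assumption: an illustrative set of titles treated as competitive; this list is
# hypothetical and would need to be checked against the real 'Game' values
COMPETITIVE_GAMES = {"League of Legends", "Counter Strike"}

# Hypothetical sample of game titles
toy_games = pd.Series(["League of Legends", "Skyrim", "Counter Strike", "Other"])

# True where the respondent's game is in the competitive set
is_competitive = toy_games.isin(COMPETITIVE_GAMES)
print(is_competitive.tolist())
```

A boolean column like this could then be used as a grouping variable in the boxplots or as an extra item in later modelling.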

## Conclusion:

Anxiety, social phobia, and satisfaction with life tend to be higher for mobile users. High-score outliers for social phobia and anxiety are also noticeable for competitive games.

### Research Question
After going through the Exploratory Data Analysis, our final research question is as follows:
- **How are different gaming devices and genres related to psychological statistics?**



<br><br><br><br><br>



# Data Modelling

To answer our research question, we will perform association rule mining. This is helpful as it can help us find any hidden patterns that might exist within the dataset. Before we proceed, we'll have to do some preprocessing to ensure our data is usable.


#### Preprocessing
We'll start with the games. For our list of games to be useable by our algorithms, we'll perform one-hot encoding to represent each game.

In [None]:
games_df = pd.get_dummies(gamingAnxiety_df['Game'])
games_df

We'll do the same for the platforms:

In [None]:
platform_df = pd.get_dummies(gamingAnxiety_df['Platform'])
platform_df
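To show what one-hot encoding produces, here is a minimal sketch on a made-up platform column (the four rows are hypothetical). Each distinct category becomes its own indicator column, and every row is flagged in exactly one of them.

```python
import pandas as pd

# Hypothetical platform answers for four respondents
toy_platform = pd.Series(["PC", "Console", "PC", "Smartphone / Tablet"])

# One indicator column per distinct category
toy_encoded = pd.get_dummies(toy_platform)
print(toy_encoded)

# Every respondent belongs to exactly one category
rows_with_single_flag = (toy_encoded.sum(axis=1) == 1).all()
print("Each row has exactly one flag:", rows_with_single_flag)
```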

#### Feature Engineering
If we want to do the same for our psychological statistics, we'll need to bin them first according to each score's appropriate evaluation.

First, we'll perform this for the GAD scores, and give each score its [appropriate label.](https://adaa.org/sites/default/files/GAD-7_Anxiety-updated_0.pdf) GAD scores from 0 to 4 will be labeled "Minimal Anxiety," scores from 5 to 9 "Mild Anxiety" and so on.



In [None]:
# Assign a label for each level of anxiety
def categorizeAnxiety(score):
    if score >= 0 and score <= 4:
        return 'Minimal Anxiety'
    elif score >= 5 and score <= 9:
        return 'Mild Anxiety'
    elif score >= 10 and score <= 14:
        return 'Moderate Anxiety'
    elif score >= 15 and score <= 21:
        return 'Severe Anxiety'
    else:
        return 'Invalid Score'

#Create new column with categorized Anxiety
gamingAnxiety_df['Anxiety_Level'] = gamingAnxiety_df['GAD_T'].apply(categorizeAnxiety)

#Display newly created  Column
gadt_df = gamingAnxiety_df['Anxiety_Level']
print(gadt_df)
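As an alternative sketch, the same binning can be expressed with `pd.cut`, shown here on made-up scores at the bin boundaries. This is not the method used in the notebook, and it assumes integer scores between 0 and 21; out-of-range values would become missing rather than 'Invalid Score'.

```python
import pandas as pd

# Hypothetical GAD totals chosen to sit on the edges of each bin
toy_scores = pd.Series([0, 4, 5, 9, 10, 14, 15, 21])

# Right-closed bins: (-1, 4], (4, 9], (9, 14], (14, 21]
anxiety_levels = pd.cut(
    toy_scores,
    bins=[-1, 4, 9, 14, 21],
    labels=["Minimal Anxiety", "Mild Anxiety", "Moderate Anxiety", "Severe Anxiety"],
)
print(anxiety_levels.tolist())
```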

Now that we've successfully categorized the anxiety levels, we'll use one-hot encoding to represent the data:

In [None]:
gadt_df = pd.get_dummies(gadt_df)
gadt_df

We will then perform the same binning method for both [SPIN](https://greenspacehealth.com/en-us/social-anxiety-spin/) and [SWL.](https://fetzer.org/sites/default/files/images/stories/pdf/selfmeasures/SATISFACTION-SatisfactionWithLife.pdf)

In [None]:
# Assign a label for each level of social phobia
def categorizeSocialPhobia(score):
    if score >= 0 and score <= 20:
        return 'No Social Phobia'
    elif score >= 21 and score <= 30:
        return 'Mild Social Phobia'
    elif score >= 31 and score <= 40:
        return 'Moderate Social Phobia'
    elif score >= 41 and score <= 50:
        return 'Severe Social Phobia'
    elif score >= 51 and score <= 68:
        return 'Very Severe Social Phobia'
    else:
        return 'Invalid Score'

#Create new column with categorized Social Phobia
gamingAnxiety_df['SocialPhobia'] = gamingAnxiety_df['SPIN_T'].apply(categorizeSocialPhobia)

#One-hot encode the newly created column
spint_df = gamingAnxiety_df['SocialPhobia']
spint_df = pd.get_dummies(spint_df)

In [None]:
# Assign a label for each level of life satisfaction
def categorizeLifeSatisfaction(score):
    if score >= 5 and score <= 9:
        return 'Extreme Life Dissatisfaction'
    elif score >= 10 and score <= 14:
        return 'Moderate Life Dissatisfaction'
    elif score >= 15 and score <= 19:
        return 'Slight Life Dissatisfaction'
    elif score == 20:
        return 'Neutral Life Satisfaction'
    elif score >= 21 and score <= 25:
        return 'Slight Life Satisfaction'
    elif score >= 26 and score <= 30:
        return 'Moderate Life Satisfaction'
    elif score >= 31 and score <= 35:
        return 'Extreme Life Satisfaction'
    else:
        return 'Invalid Score'

#Create new column with categorized Life Satisfaction
gamingAnxiety_df['Life Satisfaction'] = gamingAnxiety_df['SWL_T'].apply(categorizeLifeSatisfaction)

#One-hot encode the newly created column
swlt_df = gamingAnxiety_df['Life Satisfaction']
swlt_df = pd.get_dummies(swlt_df)

## Association Rule Mining

With preprocessing finished, we can proceed with data mining. To start, we will generate the frequent itemsets for our dataframes. We do this by first organizing the dataframes we want to generate frequent itemsets from.

In [None]:
#Combine Anxiety, Social Phobia, and Life Satisfaction Dataframe
psych_df = pd.concat([gadt_df, spint_df, swlt_df], axis = 1)

#Combine Psych Stats with games and platform dataframes
psychToGames = pd.concat([psych_df, games_df], axis = 1)
psychToPlatform = pd.concat([psych_df, platform_df], axis = 1)

print(psychToGames)
print(psychToPlatform)

We can finally generate our frequent itemsets. We will be using *apriori* and *association_rules* from the *mlxtend* library. We'll generate a frequent_itemsets dataframe using Psych stats as well as Game Genre stats.

In [None]:
from mlxtend.frequent_patterns import apriori, association_rules

frequent_itemsets = apriori(psychToGames, min_support = 0.2, use_colnames = True)
frequent_itemsets

With our frequent itemsets having been generated, we can finally proceed with association rule mining:

In [None]:
association_rules(frequent_itemsets, metric = "confidence", min_threshold = 0.6)[['antecedents', 'consequents', 'support', 'confidence']]
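To make the `support` and `confidence` columns concrete, the sketch below computes them by hand for a single rule on a made-up one-hot frame. The five rows are hypothetical, and this only illustrates the definitions; it does not reproduce how *mlxtend* computes them internally.

```python
import pandas as pd

# Hypothetical one-hot rows for five respondents
toy_onehot = pd.DataFrame({
    "Minimal Anxiety":   [True, True, True, False, True],
    "League of Legends": [True, True, False, True, True],
})

antecedent = toy_onehot["Minimal Anxiety"]
consequent = toy_onehot["League of Legends"]

# Support: share of rows containing the item(s)
support_antecedent = antecedent.mean()            # 4 of 5 rows
support_both = (antecedent & consequent).mean()   # 3 of 5 rows

# Confidence of the rule antecedent -> consequent
confidence = support_both / support_antecedent

print(f"support = {support_both:.2f}, confidence = {confidence:.2f}")
```

With `min_support = 0.2` and a confidence threshold of 0.6, this toy rule would pass both filters.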

We can apply the same process, this time comparing Psych stats to Platform

In [None]:
frequent_itemsets = apriori(psychToPlatform, min_support = 0.2, use_colnames = True)
association_rules(frequent_itemsets, metric = "confidence", min_threshold = 0.6)[['antecedents', 'consequents', 'support', 'confidence']]

## Statistical Inference

Upon analyzing the research question, we are presented with two independent variables, Game and Platform, and three dependent variables, GAD_T, SWL_T, and SPIN_T. However, Game and Platform are not independent of each other, as a chosen Game may only be available on certain Platforms (e.g. League of Legends is strictly a PC game) and vice versa. We can therefore perform separate one-way ANOVA tests between each of the two independent variables and the psychological measures to see if there is a significant difference in gaming-related anxiety levels across different gaming devices and genres. Our hypotheses are laid out as follows:

**Null Hypothesis (H0)**: There is no significant difference in gaming-related anxiety levels across different gaming devices and genres among various demographic groups. We fail to reject H0 when the p-value is greater than our alpha level of 0.05.

**Alternative Hypothesis (Ha)**: There is a significant difference in gaming-related anxiety levels across different gaming devices and genres among various demographic groups. We reject H0 in favor of Ha when the p-value is less than or equal to our alpha level of 0.05.

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

data = gamingAnxiety_df

data_lm = ols('GAD_T ~ C(Game)', data=data).fit()
table = sm.stats.anova_lm(data_lm)
print("ANOVA Results for GAD_T and Game:\n", table)


data_lm = ols('GAD_T ~ C(Platform)', data=data).fit()
table = sm.stats.anova_lm(data_lm)
print("\n\nANOVA Results for GAD_T and Platform:\n",table)


#### Statistical Inference Results
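* df: Degrees of Freedom (number of categories - 1 for the independent variable).
* sum_sq: Sum of squares, i.e. the variability attributed to each source (the independent variable or the residual).
* mean_sq: Mean square, i.e. sum_sq divided by df.
* F: Ratio of the mean square between categories to the residual mean square, measuring the variability in GAD_T between the categories of the independent variable relative to the variability within them.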
* PR(>F): P-value.
* a: Alpha level, typically 0.05. Threshold for our p-value.
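To make the columns listed above concrete, here is a minimal sketch that computes them by hand for three made-up groups. The scores are hypothetical and chosen for easy arithmetic; the notebook itself relies on *statsmodels* for the real computation.

```python
import numpy as np

# Hypothetical GAD_T scores for three platforms
groups = {
    "Console": np.array([4.0, 5.0, 6.0]),
    "PC": np.array([5.0, 6.0, 7.0]),
    "Smartphone / Tablet": np.array([8.0, 9.0, 10.0]),
}

all_scores = np.concatenate(list(groups.values()))
grand_mean = all_scores.mean()

# sum_sq: variability between group means, and variability within groups (residual)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups.values())
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# df: categories - 1 for the factor, observations - categories for the residual
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

# mean_sq: sum_sq divided by df
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F: ratio of the two mean squares
f_stat = ms_between / ms_within
print(f"df = {df_between}, sum_sq = {ss_between:.2f}, mean_sq = {ms_between:.2f}, F = {f_stat:.2f}")
```

A large F means the group means differ by more than the spread within each group would explain; the p-value then states how likely such an F would be if the null hypothesis were true.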

In our first one-way ANOVA test, we can see that our p-value is 0.08. This is higher than our alpha level of 0.05, indicating that we fail to reject the null hypothesis. This suggests that there is no significant difference in gaming-related anxiety levels across different games.

In our second one-way ANOVA test, we can see that our p-value is 0.01. This is much lower than our alpha level of 0.05, indicating that we can reject the null hypothesis and accept the alternative hypothesis. This suggests that there is a significant difference in gaming-related anxiety levels across different gaming devices.

In [None]:
tukey_result = pairwise_tukeyhsd(endog=gamingAnxiety_df['GAD_T'], groups=gamingAnxiety_df['Platform'], alpha=0.05)

print(tukey_result)

#### Post-Hoc Test Results
* meandiff: Difference in means between the two groups.
* p-adj: Adjusted p-values for comparisons.
* lower: Lower bound of the confidence interval for the mean difference.
* upper: Upper bound of the confidence interval for the mean difference.
* reject: Boolean indicating whether or not to reject the null hypothesis.

We can see that when comparing Console and PC in terms of GAD_T, our adjusted p-value is 0.66, which is higher than our alpha level of 0.05, so we fail to reject the null hypothesis. We can also observe a lower bound of -1.03 and an upper bound of 0.47, resulting in a confidence interval containing zero. This also supports the result that there is no statistically significant difference in gaming-related anxiety levels between the platforms Console and PC.

When comparing Console and Smartphone / Tablet, however, we observe a p-adj below our alpha level of 0.05. Our confidence interval, from 0.09 to 5.03, does not contain 0. Therefore we reject the null hypothesis and accept that there is a significant difference in gaming-related anxiety levels between the platforms Console and Smartphone / Tablet.

Lastly, when comparing PC and Smartphone / Tablet, we also observe a p-adj below our alpha level, paired with a confidence interval that does not include 0. We can therefore conclude that there is a significant difference in gaming-related anxiety levels between the platforms PC and Smartphone / Tablet.

Upon observing the meandiff between Console and Smartphone / Tablet, we get a value of 2.56. This means that Smartphone / Tablet respondents have higher GAD_T scores compared to that of Console by approximately 2.56 points. Similarly, the meandiff between PC and Smartphone / Tablet is 2.84, meaning that Smartphone / Tablet respondents have higher GAD_T scores by an average of 2.84 points over PC respondents.
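The interval reading used in the post-hoc discussion can be sketched as a small helper. The function name is hypothetical; the bounds are the ones quoted above for the Console comparisons.

```python
def interval_contains_zero(lower, upper):
    """Return True when a confidence interval spans zero."""
    return lower <= 0 <= upper

# Bounds quoted in the discussion above
console_vs_pc = interval_contains_zero(-1.03, 0.47)
console_vs_mobile = interval_contains_zero(0.09, 5.03)

print("Console vs PC contains zero:", console_vs_pc)
print("Console vs Smartphone / Tablet contains zero:", console_vs_mobile)
```

An interval that contains zero is compatible with no difference in means, which corresponds to `reject` being False in the Tukey output.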