# CSMODEL S11 | Project Phase 1
This notebook is the work of Group 4, which consists of the following members:

* CARNEY, JOHN PAUL COMPANIA
* GUERRRERO, MIGUEL ALFONSO DAVID
* REINANTE, CHRISTIAN VICTOR GO
* SALVADOR, JARYLL FRANCIS PENA

## Dataset Description
This project makes use of the [Online Gaming Anxiety Data Set](https://www.kaggle.com/datasets/divyansh22/online-gaming-anxiety-data). It contains responses gathered from a worldwide survey of gamers. Included in this survey are psychological assessments for anxiety, social phobia, and life satisfaction. It also gathered demographic and gaming-related information. Marian Sauter and Dejan Draschkow originally compiled the data.


## Importing Libraries
Before proceeding, we import the libraries needed to provide a general overview of the dataset.

In [1]:
import numpy as np
import pandas as pd

## Loading the Dataset
We then load the dataset as follows:

In [2]:
gamingAnxiety_df = pd.read_csv("GamingStudy_data.csv")
gamingAnxiety_df.head()

Unnamed: 0,S. No.,Timestamp,GAD1,GAD2,GAD3,GAD4,GAD5,GAD6,GAD7,GADE,...,Birthplace,Residence,Reference,Playstyle,accept,GAD_T,SWL_T,SPIN_T,Residence_ISO3,Birthplace_ISO3
0,1,42052.00437,0,0,0,0,1,0,0,Not difficult at all,...,USA,USA,Reddit,Singleplayer,Accept,1,23,5.0,USA,USA
1,2,42052.0068,1,2,2,2,0,1,0,Somewhat difficult,...,USA,USA,Reddit,Multiplayer - online - with strangers,Accept,8,16,33.0,USA,USA
2,3,42052.0386,0,2,2,0,0,3,1,Not difficult at all,...,Germany,Germany,Reddit,Singleplayer,Accept,8,17,31.0,DEU,DEU
3,4,42052.06804,0,0,0,0,0,0,0,Not difficult at all,...,USA,USA,Reddit,Multiplayer - online - with online acquaintanc...,Accept,0,17,11.0,USA,USA
4,5,42052.08948,2,1,2,2,2,3,2,Very difficult,...,USA,South Korea,Reddit,Multiplayer - online - with strangers,Accept,14,14,13.0,KOR,USA


## Process and Implications of Data Collection
The data was gathered through a survey distributed to gamers worldwide. The survey consisted of questions commonly used by psychologists to assess anxiety, social phobia, and life satisfaction. It included standardized psychological assessment instruments, namely the Generalized Anxiety Disorder Assessment (GAD), Satisfaction with Life Scale (SWL), and Social Phobia Inventory (SPIN) questionnaires, along with questions on gaming habits and general demographics.

Though not explicitly mentioned, it is extremely likely that this survey was conducted online, given that online surveys are commonly used when reaching a worldwide audience, especially gamers. The dataset description also includes *Reddit* as an example for the **Reference** variable, indicating the website was used as an avenue to conduct the survey as well. Assuming the data was collected as such, this presents several implications:

- **Sample Composition**: Because the data was collected through an online survey, it may over-represent individuals active in online gaming communities or gamers who primarily play online multiplayer games. As a result, those who do not regularly use the internet, are inactive in online gaming communities, or those who play single-player games exclusively may be underrepresented.

- **Self-Report and Voluntary Response Bias**: The data relies on self-reported responses, which can be subject to biases such as inaccurate self-assessment or social desirability bias. Because participation was voluntary, respondents with stronger views may also have been more likely to take part in the first place.

**Each row** represents a single survey response from a gamer, and **each column** represents a variable collected in the survey. The dataset contains **13464 observations** in total, and there are **55 variables** in the dataset. We can verify this, and also check each individual variable using the info() method:

In [3]:
gamingAnxiety_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13464 entries, 0 to 13463
Data columns (total 55 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   S. No.           13464 non-null  int64  
 1   Timestamp        13464 non-null  float64
 2   GAD1             13464 non-null  int64  
 3   GAD2             13464 non-null  int64  
 4   GAD3             13464 non-null  int64  
 5   GAD4             13464 non-null  int64  
 6   GAD5             13464 non-null  int64  
 7   GAD6             13464 non-null  int64  
 8   GAD7             13464 non-null  int64  
 9   GADE             12815 non-null  object 
 10  SWL1             13464 non-null  int64  
 11  SWL2             13464 non-null  int64  
 12  SWL3             13464 non-null  int64  
 13  SWL4             13464 non-null  int64  
 14  SWL5             13464 non-null  int64  
 15  Game             13464 non-null  object 
 16  Platform         13464 non-null  object 
 17  Hours       

#### Demographic Information

- **S. No.:** Serial Number.  
- **Timestamp:** Time at which the participant took the questionnaire after it was launched.  
- **Gender:** Self-identified gender of the gamer taking the questionnaire.  
- **Age:** Self-reported age of the gamer taking the questionnaire.  
- **Work:** Work status of the gamer.  
- **Degree:** Highest degree attained.  
- **Birthplace:** Birthplace.  
- **Residence:** Place where the gamer currently resides.  
- **Residence_ISO3:** Current residence in ISO3 format.  
- **Birthplace_ISO3:** Birthplace in ISO3 format.  
- **accept:** Accept terms and conditions (not necessary for any analysis).  

#### Psychological Assessment

- **GAD1 to GAD7:** Responses to GAD questions 1 to 7.  
- **GADE:** Effect of gaming on work.  
- **SWL1 to SWL5:** Responses to SWL questions 1 to 5.  
- **SPIN1 to SPIN17:** Responses to SPIN questions 1 to 17.  
- **Narcissism:** Interest scale in the game (1-5).  
- **GAD_T:** GAD Total Score.  
- **SWL_T:** SWL Total Score.  
- **SPIN_T:** SPIN Total Score.  
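
As a quick sanity check on the total-score columns, the sketch below assumes that GAD_T is the plain sum of GAD1 to GAD7. That assumption is consistent with the rows shown by head() above, but it is not confirmed by the dataset description. The sketch uses a small hand-built frame copied from the first three of those rows rather than the full CSV; `toy_df`, `gad_items`, `recomputed`, and `mismatches` are illustrative names.

```python
import pandas as pd

# Toy frame copied from the first three rows of the head() output above.
toy_df = pd.DataFrame({
    "GAD1": [0, 1, 0],
    "GAD2": [0, 2, 2],
    "GAD3": [0, 2, 2],
    "GAD4": [0, 2, 0],
    "GAD5": [1, 0, 0],
    "GAD6": [0, 1, 3],
    "GAD7": [0, 0, 1],
    "GAD_T": [1, 8, 8],
})

# Assumption: the total is the unweighted sum of the seven item scores.
gad_items = [f"GAD{i}" for i in range(1, 8)]
recomputed = toy_df[gad_items].sum(axis=1)

# Rows where the stored total disagrees with the recomputed one.
mismatches = toy_df[recomputed != toy_df["GAD_T"]]
print(len(mismatches))
```

If the same check were run on gamingAnxiety_df, any non-empty `mismatches` frame would flag rows worth inspecting before using the totals.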

#### Gaming Habits

- **Game:** Name of the game they play.  
- **Platform:** Mode of game playing (PC, Console, Mobile, etc.).  
- **Hours:** Number of hours in a week devoted to playing.  
- **earnings:** Earnings from the game (if any).  
- **whyplay:** Reason to play the game.  
- **League:** League.  
- **highestleague:** Highest league.  
- **streams:** - Number of online streaming sessi


## Data Cleaning (WIP)
- Focus on the psychological assessment statistics.

In [None]:
for column in gamingAnxiety_df.columns:
    unique_values = gamingAnxiety_df[column].unique()
    print(f"Unique values in {column}: {unique_values}")


Check for columns with null values:

In [20]:
nan_variables = gamingAnxiety_df.columns[gamingAnxiety_df.isnull().any()].tolist()
gamingAnxiety_df[nan_variables].isnull().sum()

GADE                 649
Hours                 30
League              1852
highestleague      13464
streams              100
SPIN1                124
SPIN2                154
SPIN3                140
SPIN4                159
SPIN5                166
SPIN6                156
SPIN7                138
SPIN8                144
SPIN9                158
SPIN10               160
SPIN11               187
SPIN12               168
SPIN13               187
SPIN14               156
SPIN15               147
SPIN16               147
SPIN17               175
Narcissism            23
Work                  38
Degree              1577
Reference             15
accept               414
SPIN_T               650
Residence_ISO3       110
Birthplace_ISO3      121
dtype: int64
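
To put these counts in proportion, here is a minimal sketch of computing the share of missing values per column. It runs on a small made-up frame (`toy_df` is hypothetical, not the survey data); the same expression could be applied to gamingAnxiety_df.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in one numeric and one categorical column.
toy_df = pd.DataFrame({
    "Hours": [10.0, np.nan, 25.0, 40.0],
    "Degree": ["Bachelor", None, None, "Master"],
    "GAD_T": [1, 8, 8, 0],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_pct = toy_df.isnull().mean() * 100

# Keep only columns with at least one missing value, largest share first.
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```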

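Test mean imputation on the numeric columns only (non-numeric columns are excluded). Only a very small percentage of the total rows is imputed: each of these columns is missing less than 5% of its values (the largest, SPIN_T, is missing 650 of 13464, about 4.8%).
Of the columns imputed in this notebook, only Degree (1577 missing, about 11.7%) exceeds 5%.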

In [33]:

# Numeric columns with missing values
columns_to_impute = ['Hours', 'streams', 'SPIN1', 'SPIN2', 'SPIN3', 'SPIN4', 'SPIN5', 
                     'SPIN6', 'SPIN7', 'SPIN8', 'SPIN9', 'SPIN10', 'SPIN11', 'SPIN12', 'SPIN13', 'SPIN14', 'SPIN15', 
                     'SPIN16', 'SPIN17', 'Narcissism', 'SPIN_T']

# Replace each missing value with the mean of its column
for column in columns_to_impute:
    gamingAnxiety_df[column] = gamingAnxiety_df[column].fillna(gamingAnxiety_df[column].mean())


In [34]:
gamingAnxiety_df[nan_variables].isnull().sum()

GADE                 649
Hours                  0
League              1852
highestleague      13464
streams                0
SPIN1                  0
SPIN2                  0
SPIN3                  0
SPIN4                  0
SPIN5                  0
SPIN6                  0
SPIN7                  0
SPIN8                  0
SPIN9                  0
SPIN10                 0
SPIN11                 0
SPIN12                 0
SPIN13                 0
SPIN14                 0
SPIN15                 0
SPIN16                 0
SPIN17                 0
Narcissism             0
Work                  38
Degree              1577
Reference             15
accept               414
SPIN_T                 0
Residence_ISO3       110
Birthplace_ISO3      121
dtype: int64
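
One thing worth checking when testing mean imputation is its effect on spread: filling gaps with the column mean leaves the mean unchanged but pulls the standard deviation down. The sketch below illustrates this on a made-up series (`hours` is hypothetical data, not taken from the survey).

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
hours = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan, 40.0])

# Mean imputation, in the same style as the cell above.
imputed = hours.fillna(hours.mean())

# pandas skips NaN by default, so the "before" statistics use observed values only.
mean_before, mean_after = hours.mean(), imputed.mean()
std_before, std_after = hours.std(), imputed.std()
print(mean_before, mean_after)
print(std_before, std_after)
```

The effect is small when few values are missing, which is the case for the columns imputed here.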

Imputation for categorical data using the mode.
Maybe try probabilistic imputation?

In [51]:
# Categorical columns with missing values
columns_to_impute = ['GADE', 'Work', 'Degree', 'Reference', 'Residence_ISO3', 'Birthplace_ISO3']

for column in columns_to_impute:
    # mode() can return several values on ties; take the first one
    mode_value = gamingAnxiety_df[column].mode().iloc[0]

    gamingAnxiety_df[column] = gamingAnxiety_df[column].fillna(mode_value)

In [52]:
gamingAnxiety_df[nan_variables].isnull().sum()

GADE                   0
Hours                  0
League              1852
highestleague      13464
streams                0
SPIN1                  0
SPIN2                  0
SPIN3                  0
SPIN4                  0
SPIN5                  0
SPIN6                  0
SPIN7                  0
SPIN8                  0
SPIN9                  0
SPIN10                 0
SPIN11                 0
SPIN12                 0
SPIN13                 0
SPIN14                 0
SPIN15                 0
SPIN16                 0
SPIN17                 0
Narcissism             0
Work                   0
Degree                 0
Reference              0
accept               414
SPIN_T                 0
Residence_ISO3         0
Birthplace_ISO3        0
dtype: int64
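
The probabilistic imputation idea raised above could look something like the following sketch. Instead of filling every gap with the mode, each missing entry is drawn at random from the observed categories, weighted by their observed frequencies. This is an illustration on made-up data, not the method used in this notebook; the helper `impute_by_frequency`, the `work` series, and the seed are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_by_frequency(series, rng):
    """Fill missing values by sampling observed categories
    in proportion to their observed frequencies."""
    freqs = series.dropna().value_counts(normalize=True)
    missing = series.isnull()
    draws = rng.choice(
        freqs.index.to_numpy(),
        size=int(missing.sum()),
        p=freqs.to_numpy(),
    )
    filled = series.copy()
    filled.loc[missing] = draws
    return filled

# A fixed seed keeps the random draws reproducible.
rng = np.random.default_rng(0)
work = pd.Series(["Student", "Employed", None, "Student", None, "Unemployed"])
work_filled = impute_by_frequency(work, rng)
print(work_filled.isnull().sum())
```

Compared with mode imputation, this approach roughly preserves the observed category proportions, at the cost of making the result depend on the random seed.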