# CSMODEL S11 | Project Phase 1
This notebook is the work of Group 4, consisting of the following members:

* CARNEY, JOHN PAUL COMPANIA
* GUERRRERO, MIGUEL ALFONSO DAVID
* REINANTE, CHRISTIAN VICTOR GO
* SALVADOR, JARYLL FRANCIS PENA

## Dataset Description
This project makes use of the [Online Gaming Anxiety Data Set](https://www.kaggle.com/datasets/divyansh22/online-gaming-anxiety-data). It contains responses gathered from a worldwide survey of gamers. Included in this survey are psychological assessments for anxiety, social phobia, and life satisfaction. It also gathered demographic and gaming-related information. Marian Sauter and Dejan Draschkow originally compiled the data.


## Importing Libraries
Before proceeding, we import the libraries we will use to provide a general overview of the dataset.

In [None]:
import numpy as np
import pandas as pd

## Loading the Dataset
We then load the dataset as follows:

In [None]:
gamingAnxiety_df = pd.read_csv("GamingStudy_data.csv")
gamingAnxiety_df.head()

## Process and Implications of Data Collection
The data was gathered through a survey distributed to gamers globally. The survey combined standardized psychological assessment instruments, namely the Generalized Anxiety Disorder assessment (GAD), the Satisfaction with Life Scale (SWL), and the Social Phobia Inventory (SPIN), with questions about gaming habits and general demographics. Psychologists commonly use these instruments to assess levels of anxiety, life satisfaction, and social phobia.

Though not explicitly mentioned, it is extremely likely that this survey was conducted online, given that online surveys are commonly used when reaching a worldwide audience, especially gamers. The dataset description also includes *Reddit* as an example for the **Reference** variable, indicating the website was used as an avenue to conduct the survey as well. Assuming the data was collected as such, this presents several implications:

- **Sample Composition**: Because the data was collected through an online survey, it may over-represent individuals active in online gaming communities or gamers who primarily play online multiplayer games. As a result, those who do not regularly use the internet, are inactive in online gaming communities, or those who play single-player games exclusively may be underrepresented.

- **Voluntary Response Bias**: Participation was voluntary, so respondents with stronger views or a particular interest in the topic may have been more likely to take part in the first place. The data also relies on self-reported responses, which can be subject to biases such as inaccurate self-assessment by the respondent or social desirability bias.

**Each row** represents a single survey response from a gamer, and **each column** represents a variable collected in the survey. The dataset contains **13464 observations** in total, and there are **55 variables** in the dataset. We can verify this, and also check each individual variable using the info() method:

In [None]:
gamingAnxiety_df.info()
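
As a complement to `info()`, the row and column counts can also be read directly from the `shape` attribute. The sketch below uses a small made-up DataFrame (`toy_df` and its values are hypothetical, not taken from the survey) so that it runs without the CSV file:

```python
import pandas as pd

# Hypothetical stand-in for the survey data; values are invented for illustration.
toy_df = pd.DataFrame({
    "Age": [19, 24, 31],
    "Hours": [10.0, 25.0, None],
    "Platform": ["PC", "PC", "Console"],
})

# shape is a (rows, columns) tuple: observations first, variables second.
n_rows, n_cols = toy_df.shape
print(f"{n_rows} observations, {n_cols} variables")

# count() gives the non-null count per column, the same figure info() lists.
non_null_counts = toy_df.count()
print(non_null_counts)
```

Applied to `gamingAnxiety_df`, the same two calls should report the observation and variable counts stated above.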

#### Demographic Information

- **S. No.:** Serial Number.  
- **Timestamp:** Time at which the participant took the questionnaire after it was launched.  
- **Gender:** Self-identified gender of the gamer taking the questionnaire.  
- **Age:** Self-reported age of the gamer taking the questionnaire.  
- **Work:** Work status of the gamer.  
- **Degree:** Highest degree attained.  
- **Birthplace:** Birthplace.  
- **Residence:** Place where the gamer currently resides.  
- **Residence_ISO3:** Current residence in ISO3 format.  
- **Birthplace_ISO3:** Birthplace in ISO3 format.  
- **Accept:** Accept terms and conditions (not necessary for any analysis).  

#### Psychological Assessment

- **GAD1 to GAD7:** Responses to GAD questions 1 to 7.  
- **GADE:** Effect of gaming on work.  
- **SWL1 to SWL5:** Responses to SWL questions 1 to 5.  
- **SPIN1 to SPIN17:** Responses to SPIN questions 1 to 17.  
- **Narcissism:** Interest scale in the game (1-5).  
- **GAD_T:** GAD Total Score.  
- **SWL_T:** SWL Total Score.  
- **SPIN_T:** SPIN Total Score.  
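
If the total scores are plain sums of their item responses, which is an assumption here and not something the dataset description confirms, a quick consistency check can flag rows where the stored total and the items disagree. A minimal sketch on made-up data (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Column names GAD1..GAD7, matching the naming used in the dataset.
item_cols = [f"GAD{i}" for i in range(1, 8)]

# Hypothetical responses; the second stored total is deliberately wrong.
toy_df = pd.DataFrame(
    [[0, 1, 2, 0, 1, 0, 0],
     [3, 3, 2, 1, 0, 0, 1]],
    columns=item_cols,
)
toy_df["GAD_T"] = [4, 11]

# Assumption: the total is the row-wise sum of the seven items.
recomputed = toy_df[item_cols].sum(axis=1)

# Rows where the stored total differs from the recomputed one.
mismatch = toy_df[recomputed != toy_df["GAD_T"]]
print(mismatch[["GAD_T"]])
```

The same pattern would apply to `SWL_T` and `SPIN_T` with their respective item columns.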

#### Gaming Habits

- **Game:** Name of the game they play.  
- **Platform:** Mode of game playing (PC, Console, Mobile, etc.).  
- **Hours:** Number of hours in a week devoted to playing.  
- **earnings:** Earnings from the game (if any).  
- **whyplay:** Reason to play the game.  
- **League:** League.  
- **highestleague:** Highest league.  
- **streams:** Number of online streaming sessions.  


## Data Cleaning 
This section focuses on the Psychological Assessment and Gaming Habits variables. We start by looking for variables with null values: we select every column that contains at least one null cell and count how many null cells each of them has.

In [None]:
nullVariables = gamingAnxiety_df.columns[gamingAnxiety_df.isnull().any()].tolist()
gamingAnxiety_df[nullVariables].isnull().sum()
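
Raw counts are easier to judge as a share of all rows. A minimal sketch of that conversion, shown on a made-up DataFrame (`toy_df` is hypothetical) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing cell in two of the three columns.
toy_df = pd.DataFrame({
    "Hours": [10.0, np.nan, 30.0, 5.0],
    "Work": ["Employed", "Student", None, "Student"],
    "Age": [19, 24, 31, 22],
})

# isnull() yields booleans, so the column mean is the fraction of missing cells.
null_pct = toy_df.isnull().mean() * 100

# Keep only columns that actually have missing values, largest share first.
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct.round(2))
```

Running the same two lines on `gamingAnxiety_df` would give the percentage of null values per variable.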

Most of these variables have relatively few null values (less than 5%). Although we could drop the affected rows given how few there are, we choose to perform imputation to preserve our sample size and keep the non-missing answers in those rows. Furthermore, if the missing cells are scattered (i.e., many rows have only one or two cells missing), we may end up dropping a deceptively high number of rows rather than just a few hundred. At worst, every null cell sits in a different row, and the number of rows dropped equals the total number of null values.
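
The worst case described above can be illustrated with a small made-up DataFrame (`toy_df` is hypothetical) in which each null cell falls in a different row:

```python
import numpy as np
import pandas as pd

# Hypothetical data: three null cells, each in a different row.
toy_df = pd.DataFrame({
    "Hours":   [10.0, np.nan, 30.0, 5.0, 12.0],
    "streams": [2.0, 4.0, np.nan, 1.0, 0.0],
    "SPIN_T":  [20.0, 15.0, 8.0, np.nan, 30.0],
})

# Total number of null cells across the whole frame.
null_cells = int(toy_df.isnull().sum().sum())

# Number of rows containing at least one null cell.
rows_with_null = int(toy_df.isnull().any(axis=1).sum())

# Rows that survive a plain dropna().
rows_left = len(toy_df.dropna())

print(null_cells, rows_with_null, rows_left)
```

Here only three cells out of fifteen are missing, yet dropping incomplete rows discards three of the five rows.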

We start with mean imputation, which we apply only to variables that are supposed to hold numerical values.

In [None]:
columns_to_impute = ['Hours', 'streams', 'Narcissism', 'SPIN_T',
                     'SPIN1', 'SPIN2', 'SPIN3', 'SPIN4', 'SPIN5', 
                     'SPIN6', 'SPIN7', 'SPIN8', 'SPIN9', 'SPIN10', 
                     'SPIN11', 'SPIN12', 'SPIN13', 'SPIN14', 'SPIN15', 
                     'SPIN16', 'SPIN17']

# Replace each missing value with the mean of the non-missing values in its column.
for column in columns_to_impute:
    gamingAnxiety_df[column] = gamingAnxiety_df[column].fillna(gamingAnxiety_df[column].mean())


Let's verify that we've successfully performed the imputation:

In [None]:
gamingAnxiety_df[nullVariables].isnull().sum()
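
One property of mean imputation is worth keeping in mind when interpreting later statistics: it leaves the column mean unchanged but pulls the spread inward, because every filled cell sits exactly at the mean. A minimal sketch on a made-up series (`hours` is hypothetical data, not taken from the survey):

```python
import numpy as np
import pandas as pd

# Hypothetical weekly hours with two missing entries.
hours = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan], name="Hours")

# Same operation as in the notebook: fill nulls with the column mean.
imputed = hours.fillna(hours.mean())

# mean() and std() skip NaN by default, so both calls are comparable.
print("mean before/after:", hours.mean(), imputed.mean())
print("std  before/after:", hours.std(), imputed.std())
```

The mean stays the same while the standard deviation drops, so measures of variability computed on the imputed columns may be slightly understated.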

We cannot use mean imputation for our categorical variables. We would also rather not drop the affected rows, for the same reason we did not drop them for our numerical variables. One imputation method that is compatible with categorical values is mode imputation, which replaces each missing value with the most frequent value in its column. We perform mode imputation below:

In [None]:
columns_to_impute = ['GADE', 'Work', 'Degree', 'Reference',
                     'Residence_ISO3', 'Birthplace_ISO3']

# Replace each missing value with the most frequent value in its column.
for column in columns_to_impute:
    mode_value = gamingAnxiety_df[column].mode().iloc[0]
    gamingAnxiety_df[column] = gamingAnxiety_df[column].fillna(mode_value)

We again verify that we've successfully imputed the categorical variables we've targeted:

In [None]:
gamingAnxiety_df[nullVariables].isnull().sum()
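
A note on the `mode().iloc[0]` pattern used for the categorical columns: `Series.mode()` returns every value tied for most frequent, in sorted order, so `.iloc[0]` picks the first of them when there is a tie. A minimal sketch on a made-up series (`work` is hypothetical data):

```python
import pandas as pd

# Hypothetical work-status column with a tie between two categories.
work = pd.Series(["Student", "Employed", None, "Employed", "Student"], name="Work")

# Both categories occur twice; mode() returns them sorted, ignoring the null.
modes = work.mode()

# Same choice as in the notebook: take the first mode.
mode_value = modes.iloc[0]
filled = work.fillna(mode_value)

print(list(modes))
print(list(filled))
```

In a tie, the filled value is therefore determined by sort order, not by any property of the data, which may matter for columns where two categories are nearly equally common.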