<a href="https://colab.research.google.com/github/DeoZD/CSMODEL_G2_MCO/blob/main/CSMODEL_G2_MCO1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **CSMODEL MCO1 Group 2**
## LaSalleGameKNB?
* TIONGCO, KYAN THOMAS    S18
* DIAMANTE, DEO ZAMIR     S19
* LICUP, EVAN GABRIEL     S19
* SARROZA, MIKAEL JENSON  S19

### GamingStudy_data.csv
The original dataset contains responses to a survey of gamers worldwide. The questionnaire is made up of several sets of items that psychologists commonly use to assess anxiety, social phobia, and low life satisfaction, and it was administered as part of a psychological study. The original data was collated by Marian Sauter and Dejan Draschkow.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import missingno as msno
import seaborn as sns
import re

# sets the theme of the charts
plt.style.use('seaborn-v0_8-darkgrid')

%matplotlib inline

url = 'https://github.com/DeoZD/CSMODEL_G2_MCO/raw/refs/heads/main/GamingStudy_data.csv'
orig_df = pd.read_csv(url, encoding='latin-1')

## Original dataframe information:

In [None]:
orig_df.info()

## Data Cleaning

In [None]:
# @title Remove `highestleague` as it's empty
pre_df = orig_df.drop(columns=['highestleague'])

In [None]:
# @title Remove `accept` as it's not indicative of what the survey data entails
pre_df = pre_df.drop(columns=['accept'])

**Note**: `accept` is meant to be the variable for Consent in survey participation, having either `Accept` or `NA` as values.

A value of `NA` means that the Consent step was skipped: unlike the main study questions, which were marked as required, the Consent item in the survey form was not.

The survey form states that the only way for a respondent's data not to be stored or shared is to leave the survey unanswered or unfinished.

Because some participants with `accept` set to `NA` still *answered the required parts of the survey through to the end*, we choose to interpret this as **"Consent by Action."**

By continuing to the next pages and answering the demographic and/or study-specific questions, the participant effectively demonstrated their willingness to participate.

The `NA` isn't treated as a "No"—just a skipped administrative step.

![accept](https://raw.githubusercontent.com/DeoZD/CSMODEL_G2_MCO/39481709c624a3431f3db90e0c1b8f9e033b5e3b/assets%20/accept.png)

In [None]:
# @title Remove other irrelevant variables (`Reference`, `Timestamp`, `S. No.`)
pre_df = pre_df.drop(columns=['Reference', 'Timestamp', 'S. No.'])

In [None]:
# @title Clean `League` categories

# Get the unique values to assess which values should stay or be removed
# pre_df['League'].unique()
# pre_df['League'].nunique()
# pre_df['League'].drop_duplicates().to_csv("unique_rank.csv", index=False)

In [None]:
def extract_rank(x):
    rank_order = ['bronze', 'silver', 'gold', 'platinum', 'diamond', 'master', 'grandmaster', 'challenger']

    if pd.isna(x) or not str(x).strip():
        return np.nan

    x = str(x).lower().strip()

    # UNRANKED (PRIORITY)
    unranked_patterns = [
        r'\b(unranked|not\s+ranked|no\s+rank|unraked|unrankt|unrank)\b',
        r'\b(placement|provisional|seeding|qualifying|not\s+placed|still\s+placing)\b',
        r'\b(not\s+applicable|n/a|na|none)\b',
        r'\b(dont\s+play|don\'t\s+play|never\s+played|havent\s+played)\s+ranked\b',
        r'\b(havent|haven\'t)\s+(?:done|played)\s+(?:ranked|placement)\b',
        r'\b(under|pre|not)\s*(?:level|30|lvl)\b',
        r'\b(aram|normal|casual)\s+only\b',
        r'\b(too\s+toxic|rank\s+anxiety|anxiousness)\b',
        r'\b0\s*(?:games|ranked)\b',
    ]

    for pattern in unranked_patterns:
        if re.search(pattern, x):
            return 'unranked'

    found_ranks = set()

    # FULL WORD RANKS
    for rank in rank_order:
        if re.search(rf'\b{rank}\b', x):
            found_ranks.add(rank)

    # LETTER-NUMBER RANKS
    ln_map = {
        'b': 'bronze',
        's': 'silver',
        'g': 'gold',
        'p': 'platinum',
        'd': 'diamond',
        'm': 'master',
        'gm': 'grandmaster',
        'ch': 'challenger'
    }

    ln_matches = re.findall(r'\b(gm|ch|[bsgpdm])\s*\d+\b', x)
    for code in ln_matches:
        found_ranks.add(ln_map[code])

    # MISSPELLINGS / VARIANTS
    variations = {
        'bronze': ['bronz', 'brnz', 'broze', 'bronce'],
        'silver': ['silv', 'slvr', 'siver', 'sivler'],
        'gold': ['gld', 'glod', 'goled', 'golden'],
        'platinum': ['plat', 'pltn', 'platin', 'platen', 'platnium', 'platium'],
        'diamond': ['diam', 'diamon', 'diamomd'],
        'master': ['mstr', 'mst', 'masters'],
        'grandmaster': ['grandm', 'gmaster'],
        'challenger': ['chall', 'challngr', 'challen'],
    }

    for rank, vars_list in variations.items():
        for var in vars_list:
            if re.search(rf'\b{var}\b', x):
                found_ranks.add(rank)
                break

    # HISTORICAL CONTEXT
    if not found_ranks:
        for rank in rank_order:
            if re.search(rf'\b(was|last\s+season|previously)\s+{rank}\b', x):
                found_ranks.add(rank)

    # RETURN HIGHEST RANK
    if found_ranks:
        return max(found_ranks, key=lambda r: rank_order.index(r))

    return np.nan

### extract_rank(x)

Function that cleans up the very messy free-text data in the `League` column.

## Logic Flow (in order)
**I. Input validation and normalization**
* If x is empty or `NaN` → return `NaN`
* Normalize text: lowercase + strip spaces
* Check for unranked first (priority), before any rank matching

**II. Uses *regex patterns* to catch:**
1. **Unranked**
  - “unranked”, “not ranked”, “no rank”
  - placement/provisional
  - casual-only players
  - never played ranked
  - rank anxiety / toxicity
  - 0 ranked games

  → Immediately returns 'unranked'

**Special Case:** An explicit "NA" / "Not applicable" answer, i.e. a declaration that League is not applicable as offered in the survey form, is mapped to 'unranked' rather than `NaN`. This keeps it distinct from a truly missing value, which `extract_rank` leaves as `NaN` (unknown).
- not applicable → returns 'unranked'

2. **Exact rank words**
  - Matches whole words: bronze, silver, gold, platinum, diamond, master, grandmaster, challenger

3. **Letter-number ranks**
  - Converts codes like:
    - B1 → bronze
    - P5 → platinum
    - GM1 → grandmaster
    - CH1 → challenger

4. **Misspellings / variants**
* Catches typos like:
  * platnium → platinum
  * glod → gold
  * siver → silver
  * bronce → bronze

5. **Historical context**
* Detects phrases like:
  * "was diamond"
  * "last season gold"

**III. Resolution rule**
* If multiple ranks found → return the highest rank using predefined order

**IV. Fallback**
* If nothing matched → return NaN
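
The two middle steps (whole-word ranks and letter-number codes) plus the "highest rank wins" rule can be tried in isolation. The sketch below is a reduced illustration, not the full `extract_rank`: the helper name `highest_rank` and the sample strings are made up for this example, and the unranked, misspelling, and historical-context steps are left out.

```python
import re

# Rank order from lowest to highest, mirroring the list used in extract_rank.
RANK_ORDER = ['bronze', 'silver', 'gold', 'platinum',
              'diamond', 'master', 'grandmaster', 'challenger']

# Letter-number shorthand, e.g. "g2" -> gold, "gm1" -> grandmaster.
CODE_TO_RANK = {'b': 'bronze', 's': 'silver', 'g': 'gold', 'p': 'platinum',
                'd': 'diamond', 'm': 'master', 'gm': 'grandmaster',
                'ch': 'challenger'}


def highest_rank(text):
    """Hypothetical helper: return the highest rank mentioned in text, or None."""
    text = text.lower().strip()
    # Whole-word rank names.
    found = {r for r in RANK_ORDER if re.search(rf'\b{r}\b', text)}
    # Letter-number codes such as "s2" or "gm1".
    codes = re.findall(r'\b(gm|ch|[bsgpdm])\s*\d+\b', text)
    found |= {CODE_TO_RANK[c] for c in codes}
    # Resolve multiple hits by position in RANK_ORDER.
    return max(found, key=RANK_ORDER.index) if found else None


samples = ["Gold 3 now, silver last season", "S2", "GM1 / D4", "i play for fun"]
for s in samples:
    print(repr(s), "->", highest_rank(s))
```

Because `gm` and `ch` come before the single-letter class in the alternation, "GM1" is read as grandmaster rather than as a stray "m1".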

In [None]:
pre_df['League_clean'] = pre_df['League'].apply(extract_rank)

In [None]:
# @title Show and compare cleaned `League` categories

# Filter rows where both 'League' and 'League_clean' are not NaN for display
filtered_df = pre_df[pre_df['League'].notna() & pre_df['League_clean'].notna()].copy()

# Get all unique clean league values to ensure all are covered
unique_clean_leagues = filtered_df['League_clean'].unique()

# Display examples for each unique clean league
for league in sorted(unique_clean_leagues):
    print(f"\nLeague_clean: {league.upper()}")
    display(filtered_df[filtered_df['League_clean'] == league][['League', 'League_clean']].head(3))
    print("--------------------------------------------------------------------------------------")


In [None]:
# @title Clean `earnings` categories

# Get the unique values to assess which values should stay or be removed
# pre_df['earnings'].unique()
# pre_df['earnings'].nunique()
# pre_df['earnings'].drop_duplicates().to_csv("unique_earnings.csv", index=False)

In [None]:
def extract_earnings(x):
    if pd.isna(x) or not str(x).strip():
        return np.nan

    x = str(x).lower().strip()

    # 1. MONETIZATION / INCOME
    monetization_keywords = [
        'earn', 'earning', 'money', 'paid', 'income', 'living', 'wage',
        'stream', 'streaming', 'youtube', 'tournament winnings',
        'betting', 'trading', 'tuition', 'side income', 'make money',
        'career', 'job', 'profitable', 'shoutcaster'
    ]
    if any(k in x for k in monetization_keywords):
        return 'Monetization'

    # 2. COMPETITIVE / PRO ASPIRATION
    competitive_keywords = [
        'competitive', 'competition', 'tournament', 'ranked',
        'climb', 'ladder', 'improve', 'improvement', 'better',
        'become', 'pro', 'professional', 'well known', 'best',
        'aspire', 'goal', 'achieve', 'rank 1'
    ]
    if any(k in x for k in competitive_keywords):
        return 'Competitive / Pro-Aspiration'

    # 3. BOOSTING (explicit)
    if 'boost' in x:  # also covers 'eloboost' and 'boosting'
        return 'Boosting'

    # 4. ESCAPISM
    escapism_keywords = [
        'escapism', 'escape', 'forget', 'real life', 'get away', 'relief',
        'fill the void', 'numb', 'suppress', 'mental', 'memories'
    ]
    if any(k in x for k in escapism_keywords):
        return 'Escapism'

    # 5. ADDICTION (explicit psychological dependence)
    addiction_keywords = [
        'addicted', 'addiction', 'can’t stop', "can't stop",
        'compulsion', 'dependent', 'hooked'
    ]
    if any(k in x for k in addiction_keywords):
        return 'Addiction'

    # 6. HABIT (automatic behavior, routine)
    habit_keywords = [
        'habit', 'routine', 'autopilot', 'used to', 'just play',
        'normally play', 'always play', 'keep playing'
    ]
    if any(k in x for k in habit_keywords):
        return 'Habit'

    # 7. BOREDOM (time-filling behavior)
    boredom_keywords = [
        'bored', 'nothing better to do', 'kill time',
        'pass time', 'spend my time', 'time somehow',
        'waste time', 'no work', 'no job'
    ]
    if any(k in x for k in boredom_keywords):
        return 'Boredom'

    # 8. FUN & SOCIAL (default hobby class)
    fun_keywords = [
        'fun', 'friends', 'social', 'hobby',
        'enjoy', 'love', 'passion', 'relax'
    ]
    if any(k in x for k in fun_keywords):
        return 'Fun & Social'

    # 9. FALLBACK
    return 'Other / Just Playing'

### extract_earnings(x)

function defined in order to clean up and recategorize the data of the `earnings` variable column

## Logic Flow (in order)
**I. Input Validation**
* If x is empty, blank, or NaN → return NaN
* Normalize text:
	* Cast to string
	* Lowercase
	* Strip spaces

**II. Priority Semantic Classification *(ordered)***
* Order matters — higher conceptual intent overrides lower ones
1. **Monetization / Income Intent *(highest priority)***
- Definition:
Any explicit intention to earn money, profit, income, or financial gain through gaming.

  Detected concepts:
  - Direct earnings:
    - “earn money”
    - “earning”
    - “income”
    - “paid”
    - “wage”
    - “living”
  - Career framing:
    - “career”
    - “job”
    - “profitable”
  - Content monetization:
    - “stream” / “streaming”
    - “youtube”
    - “shoutcaster”
  - Financial activities:
    - “betting”
    - “trading”
    - “tuition”
    - “tournament winnings”
    - “side income”
  - Statements of intent:
    - “make money”

  → Returns: `'Monetization'`
2. **Competitive / Pro-Aspiration**
- Definition:
Non-monetary ambition to improve, compete, climb, or achieve status/recognition.

  Detected concepts:
  - Competition framing:
    - “competitive”
    - “competition”
    - “tournaments”
  - Performance goals:
    - “improve”
    - “get better”
    - “climb”
    - “ranked”
    - “ladder”
  - Identity/status:
    - “become pro”
    - “professional”
    - “well known”
    - “best”
    - “rank 1”
  - Aspirational language:
    - “goal”
    - “aspire”
    - “achieve”
  
  → Returns: `'Competitive / Pro-Aspiration'`
3. **Boosting**
- Definition:
Explicit commercial exploitation of skill ranking systems.

  Detected concepts:

  - “boost” (substring match, so it also covers “boosting”, “elo boosting”, “eloboost”, and “boost accounts”)

  → Returns: `'Boosting'`
4. **Escapism**
- Definition:
Gaming used as emotional or psychological escape from reality.

  Detected concepts:
  - “escape”
  - “forget real life”
  - “get away”
  - “relief”
  - “numb”
  - “suppress feelings”
  - “mental health”
  - “fill the void”
  - “memories”

  → Returns: `'Escapism'`
5. **Addiction**
- Definition:
Explicit psychological dependence or compulsive behavior framing.

  Detected concepts:
  - “addicted”
  - “addiction”
  - “can't stop”
  - “hooked”
  - “dependent”
  - “compulsion”

  → Returns: `'Addiction'`
6. **Habit**
- Definition:
Automatic, routine, non-emotional behavior.

  Detected concepts:
  - “habit”
  - “routine”
  - “autopilot”
  - “just play”
  - “used to”
  - “always play”
  - “normally play”
  - “keep playing”

  → Returns: `'Habit'`
7. **Boredom**
- Definition:
Time-filling behavior due to lack of alternatives.

  Detected concepts:
  - “bored”
  - “nothing better to do”
  - “kill time”
  - “pass time”
  - “spend my time”
  - “time somehow”
  - “waste time”
  - “no work”
  - “no job”

  → Returns: `'Boredom'`
8. **Fun & Social**
- Definition:
Healthy recreational motivation.

  Detected concepts:
  - “fun”
  - “enjoy”
  - “love”
  - “hobby”
  - “friends”
  - “social”
  - “relax”
  - “passion”

  → Returns: `'Fun & Social'`

**III. Resolution Rule**

If multiple categories detected:
→ Return the highest-priority category based on this order:

> Monetization

> Competitive / Pro-Aspiration

> Boosting

> Escapism

> Addiction

> Habit

> Boredom

> Fun & Social

*Rationale*:
Motivation hierarchy reflects behavioral dominance, not emotional tone.

**IV. Fallback Rule**

If nothing matches:
→ Return `'Other / Just Playing'`
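
One thing to keep in mind when reading the categories: `extract_earnings` uses plain substring tests (`k in x`), so a short keyword can fire inside a longer, unrelated word. The sketch below contrasts that with a whole-word regex test. It is an illustrative alternative, not the matching used in this notebook; the helper name `contains_keyword` and the sample sentence are assumptions made for the example.

```python
import re


def contains_keyword(text, keywords):
    """Hypothetical helper: True if any keyword occurs as a whole word or phrase."""
    pattern = r'\b(?:' + '|'.join(re.escape(k) for k in keywords) + r')\b'
    return re.search(pattern, text.lower()) is not None


answer = "I want to learn new strategies"
keywords = ['earn', 'money']

# Plain substring test: 'earn' is found inside 'learn'.
substring_hit = any(k in answer.lower() for k in keywords)

# Whole-word test: 'earn' must stand alone, so 'learn' no longer matches.
whole_word_hit = contains_keyword(answer, keywords)

print(substring_hit, whole_word_hit)  # True False
```

The trade-off is that whole-word matching no longer catches inflected forms for free, so variants such as “earning” have to be listed explicitly. Rule order interacts with substrings too: an answer containing “no job” also contains “job”, so the Monetization rule claims it before the Boredom rule is reached.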

In [None]:
pre_df['earnings_clean'] = pre_df['earnings'].apply(extract_earnings)

In [None]:
# @title Show and compare cleaned `earnings` categories and drop original column

# Filter rows where both 'earnings' and 'earnings_clean' are not NaN for display
filtered_df = pre_df[pre_df['earnings'].notna() & pre_df['earnings_clean'].notna()].copy()

# Get all unique clean earnings values to ensure all are covered
unique_clean_earnings = filtered_df['earnings_clean'].unique()

# Display examples for each unique clean earnings
for earnings in sorted(unique_clean_earnings):
    print(f"\nearnings_clean: {earnings.upper()}")
    display(filtered_df[filtered_df['earnings_clean'] == earnings][['earnings', 'earnings_clean']].head(3))
    print("--------------------------------------------------------------------------------------")

pre_df = pre_df.drop(columns=['earnings'])
print("'earnings' column dropped")

In [None]:
# @title Clean and analyze `SPIN{1-17}` variable columns

spin_vars = [f'SPIN{i}' for i in range(1, 18)]

# Show counts of nulls for just these columns
null_counts = pre_df[spin_vars].isnull().sum()
print(null_counts)

In [None]:
# @title Analyze missing data patterns in `SPIN` variables

# SPIN (Social Phobia Inventory) consists of 17 items measuring social anxiety.
# This analysis checks: (1) missingness patterns, (2) completeness rates, and (3) whether people skip SPIN questions together.

# Visualize missing data correlation for SPIN variables
# (Do people who skip question A also skip question B?)
plt.figure(figsize=(12, 8))
msno.heatmap(pre_df[spin_vars])
plt.title('Missing Data Pattern for SPIN Variables')
plt.show()

# Calculate SPIN completeness statistics
spin_missing = pre_df[spin_vars].isnull().all(axis=1).sum()  # Skipped entire section
spin_complete = (~pre_df[spin_vars].isnull().any(axis=1)).sum()  # Answered all questions
spin_partial = len(pre_df) - spin_missing - spin_complete  # Answered some questions

print(f"People who skipped ALL SPIN questions: {spin_missing}")
print(f"People who answered ALL SPIN questions: {spin_complete}")
print(f"People with partial SPIN answers: {spin_partial}")

In [None]:
# @title Remove respondents with incomplete `SPIN` data

# Decision: Drop rows with missing SPIN values to ensure data quality for social phobia analysis.
# Justification: 95%+ of respondents completed all SPIN items. The high correlation (0.7-0.9)
# in missing data indicates systematic skipping (survey fatigue), making partial data unreliable.
# Using complete cases only with N > 12,000 still provides sufficient statistical power.

# Store original size for comparison
original_size = len(pre_df)

# Keep only people who answered ALL SPIN questions
spin_vars = [f'SPIN{i}' for i in range(1, 18)]
pre_df = pre_df.dropna(subset=spin_vars)

# Report cleaning results
rows_dropped = original_size - len(pre_df)
pct_dropped = (rows_dropped / original_size) * 100

print(f"Original dataset: {original_size} people")
print(f"After removing incomplete SPIN: {len(pre_df)} people")
print(f"Rows dropped: {rows_dropped} ({pct_dropped:.1f}%)")
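
To make the complete / partial / skipped split and the complete-case filter concrete, here is a minimal sketch on a toy frame. The three toy columns and their values are invented for illustration; the real notebook uses the seventeen `SPIN` columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the SPIN columns (three items instead of seventeen).
toy = pd.DataFrame({
    'SPIN1': [1, np.nan, 2, np.nan],
    'SPIN2': [0, np.nan, np.nan, 3],
    'SPIN3': [4, np.nan, 1, 2],
})
toy_vars = ['SPIN1', 'SPIN2', 'SPIN3']

is_null = toy[toy_vars].isnull()
skipped_all = is_null.all(axis=1).sum()       # every item missing
answered_all = (~is_null.any(axis=1)).sum()   # no item missing
partial = len(toy) - skipped_all - answered_all

# Complete-case filter: keep only rows with no missing value in the listed columns.
complete_cases = toy.dropna(subset=toy_vars)

print(skipped_all, answered_all, partial, len(complete_cases))
```

The number of rows kept by `dropna(subset=...)` always equals the "answered all" count, which is a quick consistency check on the cleaning step.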

In [None]:
# Check missing values for other columns
missing = pre_df.isnull().sum()
missing_pct = (missing / len(pre_df)) * 100

missing_summary = pd.DataFrame({
    'Missing_Count': missing,
    'Missing_Percentage': missing_pct
})
print(missing_summary[missing_summary['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False))

In [None]:
print(pre_df['Degree'].unique())

In [None]:
# @title Handle missing values in `Degree` variable

# Degree represents highest educational attainment with 11.6% missing (1,484 respondents).
# Missing likely indicates: (1) no degree obtained, (2) currently pursuing education, or (3) privacy preference.

# Decision: Create "No degree / Not specified" category rather than dropping rows.
pre_df['Degree'] = pre_df['Degree'].fillna('No degree / Not specified')

In [None]:
print(pre_df['streams'].dtype)
print(pre_df['streams'].describe())

In [None]:
# @title Clean `Hours` and `streams` variables with logical constraints

# Both Hours (playing) and streams (other activities) must satisfy: Hours + streams ≤ 168
# Issues found: (1) streams > 168 (impossible alone), (2) Hours + streams > 168 (impossible combined)

# Decision: Remove extreme outliers, then cap streams to ensure total ≤ 168

# Step 1: Remove obvious data errors (streams > 168 individually)
print(f"Streams > 168 hours/week: {(pre_df['streams'] > 168).sum()}")
pre_df.loc[pre_df['streams'] > 168, 'streams'] = np.nan

# Step 2: Check for impossible combinations
pre_df['total_gaming_time'] = pre_df['Hours'] + pre_df['streams'].fillna(0)
impossible = (pre_df['total_gaming_time'] > 168).sum()
print(f"Cases where Hours + streams > 168: {impossible}")

# Step 3: Cap streams to ensure Hours + streams ≤ 168
# Only adjust if Hours is valid
mask = (pre_df['Hours'].notna()) & (pre_df['streams'].notna())
over_limit = mask & (pre_df['total_gaming_time'] > 168)
pre_df.loc[over_limit, 'streams'] = (168 - pre_df.loc[over_limit, 'Hours']).clip(lower=0)

# Step 4: Fill remaining missing streams with 0
pre_df['streams'] = pre_df['streams'].fillna(0)

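# Step 5: Verify and clean up
# Note: any row whose Hours alone exceeds 168 still fails this check, because capping streams at 0
# cannot bring the total down; those rows are removed in the later `Hours` cleaning cell.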
pre_df['total_gaming_time'] = pre_df['Hours'] + pre_df['streams']
print(f"\nAfter cleaning:")
print(f"Max Hours + streams: {pre_df['total_gaming_time'].max()}")
print(f"Cases > 168: {(pre_df['total_gaming_time'] > 168).sum()}")

# Drop temporary column
pre_df = pre_df.drop(columns=['total_gaming_time'])

print(f"\nstreams summary:")
print(pre_df['streams'].describe())
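
The weekly constraint can be checked on a small example. The following sketch applies the same idea (invalidate impossible `streams` values, then cap `streams` at the hours left in the week) to invented data; the toy values and the `WEEK_HOURS` constant are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

WEEK_HOURS = 168  # 7 days * 24 hours

toy = pd.DataFrame({
    'Hours':   [20.0, 100.0, 160.0, 30.0],
    'streams': [10.0, 100.0, np.nan, 500.0],
})

# Values above 168 cannot be real weekly totals, so treat them as missing.
toy.loc[toy['streams'] > WEEK_HOURS, 'streams'] = np.nan

# Hours still available in the week after gaming, never below zero.
remaining = (WEEK_HOURS - toy['Hours']).clip(lower=0)

# Missing streams become 0; everything else is capped at the remaining hours.
toy['streams'] = np.minimum(toy['streams'].fillna(0), remaining)

print(toy)
print("Max total:", (toy['Hours'] + toy['streams']).max())
```

The cap only guarantees a total of at most 168 when `Hours` itself is at most 168, which is why rows with impossible `Hours` values need their own filter.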

In [None]:
# @title Remove rows with missing `Hours` (gaming hours per week)

# Hours represents weekly gaming time - a critical predictor variable.
# Only 28 respondents (0.2%) missing this value.

# Decision: Drop these rows.
# Justification: Hours is central to research on gaming behavior and mental health.
# Cannot reliably estimate gaming time. Minimal data loss (0.2%) makes row deletion acceptable.

pre_df = pre_df.dropna(subset=['Hours'])

In [None]:
# @title Remove rows with missing `Narcissism` scores

# Narcissism is a psychological scale measuring narcissistic personality traits.
# Only 11 respondents (0.1%) missing this value.

# Decision: Drop these rows
# Justification: Psychological scales should not be imputed (unethical to guess personality traits).
# Negligible data loss (0.1%) makes row deletion the appropriate choice.

pre_df = pre_df.dropna(subset=['Narcissism'])

In [None]:
# @title Clean `Hours` values

# Remove rows where Hours > 168 (impossible in a week).
pre_df = pre_df[pre_df['Hours'] <= 168]

print(f"Max Hours after filter: {pre_df['Hours'].max()}")

In [None]:
# @title Clean `Platform` and `whyplay`

# Inconsistent text needs to be simplified for better grouping of data

# 1. Clean Platform
def clean_platform(platform):
    if pd.isna(platform):
        return "Unknown"

    # Force to string to handle any non-text data
    platform = str(platform)

    if "Console" in platform:
        return "Console"
    if "Smartphone" in platform:
        return "Mobile"
    return platform

# 2. Clean whyplay
def clean_whyplay(text):
    if pd.isna(text):
        return "Other"

    text = str(text).lower()

    # Priority grouping based on keywords
    if "improve" in text or "improving" in text:
        return "Improving"
    if "win" in text:  # also matches "winning"
        return "Winning"
    if "relax" in text:
        return "Relaxing"
    if "fun" in text:
        return "Fun"

    return "Other"

# Apply the cleaning functions
pre_df['Platform_Clean'] = pre_df['Platform'].apply(clean_platform)
pre_df['WhyPlay_Clean'] = pre_df['whyplay'].apply(clean_whyplay)

# Visualize the standardized groups
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

sns.countplot(x='Platform_Clean', data=pre_df, ax=axes[0])
axes[0].set_title('Distribution of Gaming Platforms')

sns.countplot(y='WhyPlay_Clean', data=pre_df, order=pre_df['WhyPlay_Clean'].value_counts().index, ax=axes[1])
axes[1].set_title('Distribution of Primary Motivations')

plt.tight_layout()
plt.show()

In [None]:
# Check missing values for other columns
missing = pre_df.isnull().sum()
missing_pct = (missing / len(pre_df)) * 100

missing_summary = pd.DataFrame({
    'Missing_Count': missing,
    'Missing_Percentage': missing_pct
})
print(missing_summary[missing_summary['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False))

In [None]:
pre_df['League_clean'].unique()

In [None]:
# @title Consolidate League_clean missing values to 'unranked' category

# League_clean already contains 'unranked' values from extract_rank() function (for explicit text like "don't play ranked").
# The remaining missing values (NaN) come from blank entries in the original League field
# and from text that matched none of the patterns.

# Decision: Convert remaining NaN to 'unranked' to consolidate all players without an identifiable rank into one category.
# Justification: We assume that blank or unparseable entries, like explicit "unranked" responses,
# come from players who don't play ranked mode.

# full distribution before converting NaN to 'unranked'
print("\nLeague_clean distribution:")
print(pre_df['League_clean'].value_counts().sort_index())

n_missing = pre_df['League_clean'].isna().sum()
print(f"Converting {n_missing} NaN League_clean rows to 'unranked'...")
# Convert NaN to 'unranked'
pre_df['League_clean'] = pre_df['League_clean'].fillna('unranked')

# Show full distribution after conversion
print("\nLeague_clean distribution:")
print(pre_df['League_clean'].value_counts().sort_index())

pre_df = pre_df.drop(columns=['League'])
print("'League' column dropped")
print(f"\nFinal League_clean categories: {pre_df['League_clean'].nunique()}")
print(pre_df['League_clean'].value_counts())

In [None]:
# @title Remove rows with missing Birthplace_ISO3

# Birthplace_ISO3 represents country of birth (3-letter ISO country code).
# Only 109 respondents (0.9%) missing this value.

# Decision: Drop these rows if geographic origin is relevant to analysis.
# Justification: Geographic/cultural background may influence gaming behavior and social connectedness.
# Minimal data loss (0.9%) makes row deletion acceptable if birthplace is needed for demographic analysis.

pre_df = pre_df.dropna(subset=['Birthplace_ISO3'])

In [None]:
# @title Remove rows with missing Residence_ISO3

# Residence_ISO3 represents current country of residence (3-letter ISO country code).
# Only 101 respondents (0.8%) missing this value.

# Decision: Drop these rows if current location is relevant to analysis.
# Justification: Current residence may affect gaming culture, community access, and social factors.
# Minimal data loss (0.8%) makes row deletion acceptable if residence is needed for geographic analysis.

pre_df = pre_df.dropna(subset=['Residence_ISO3'])

In [None]:
# @title Remove rows with missing Work (employment status)

# Work represents employment status (e.g., Employed, Student, Unemployed).
# Only 31 respondents (0.2%) missing this value.

# Decision: Drop these rows.
# Justification: Employment status is important demographic context for understanding gaming patterns
# and time availability. Minimal data loss (0.2%) makes row deletion preferable to imputation.

pre_df = pre_df.dropna(subset=['Work'])

In [None]:
# Check missing values for other columns
missing = pre_df.isnull().sum()
missing_pct = (missing / len(pre_df)) * 100

missing_summary = pd.DataFrame({
    'Missing_Count': missing,
    'Missing_Percentage': missing_pct
})
print(missing_summary[missing_summary['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False))

In [None]:
# @title Clean GADE missing values using categorical imputation

# GADE contains NA (not applicable) values.
# 615 respondents (4.89%) missing this value.

# Decision: Use categorical (mode) imputation to replace the missing values
# Justification: Filling with the most common response keeps every row and leaves the modal category unchanged,
# at the cost of slightly increasing that category's share of the distribution.

# Check missing count
print(f"Missing GADE values (before): {pre_df['GADE'].isnull().sum()}")

# Calculate the mode (most common value)
gade_mode = pre_df['GADE'].mode()[0]
print(f"Most common value (Mode): {gade_mode}")

# Fill missing values with the mode
pre_df['GADE'] = pre_df['GADE'].fillna(gade_mode)

# Verify cleanup
print(f"Missing GADE values (after): {pre_df['GADE'].isnull().sum()}")
print(pre_df['GADE'].value_counts())
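
The effect of mode imputation on a categorical distribution is easy to see on a toy column. The labels `'A'` and `'B'` below are placeholders, not actual `GADE` response values.

```python
import numpy as np
import pandas as pd

# Toy categorical column: 5 x 'A', 3 x 'B', 2 missing.
answers = pd.Series(['A'] * 5 + ['B'] * 3 + [np.nan] * 2)

# Shares among observed answers only (NaN is ignored by value_counts).
share_before = answers.value_counts(normalize=True)

# mode() ignores NaN by default; take the first mode if there are ties.
mode_value = answers.mode()[0]
filled = answers.fillna(mode_value)
share_after = filled.value_counts(normalize=True)

print("Share of 'A' before:", share_before['A'])
print("Share of 'A' after: ", share_after['A'])
```

Here the modal category moves from 5 of 8 observed answers to 7 of 10 filled answers, so the larger the missing fraction, the more the mode is inflated.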

In [None]:
pre_df.info()

# Exploratory Data Analysis

In [None]:
gaming_df = pre_df
gaming_df_multiplayer = gaming_df[gaming_df["Playstyle"].str.contains("Multiplayer", case=False, na=False)]
gamingdf_use = gaming_df_multiplayer[["Hours", "SWL_T", "SPIN_T"]].copy()
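
As a quick illustration of how the multiplayer filter behaves, the sketch below runs the same `str.contains` call on a toy frame. The `Playstyle` strings are invented examples, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'Playstyle': ['Singleplayer',
                  'Multiplayer - online - with strangers',
                  'multiplayer - offline',
                  np.nan],
    'Hours': [5, 20, 12, 8],
})

# case=False makes the match case-insensitive; na=False treats missing text as "no match".
mask = toy['Playstyle'].str.contains('Multiplayer', case=False, na=False)

# .copy() gives an independent frame, so later edits do not touch the original.
multiplayer = toy.loc[mask, ['Hours']].copy()

print(mask.tolist())
print(multiplayer)
```

Without `na=False`, the missing entry would yield `NaN` in the mask instead of `False`, and boolean indexing with it would fail.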

In [None]:
gamingdf_use.info()

## Variables

1. Hours: gaming hours per week
2. SWL_T: life satisfaction score
3. SPIN_T: social anxiety score (Social Phobia Inventory)

In [None]:
gamingdf_use["Hours"].describe()

In [None]:
gamingdf_use["SPIN_T"].describe()

In [None]:
gamingdf_use["SWL_T"].describe()

In [None]:
plt.hist(gamingdf_use["Hours"], bins=30)
plt.title("Distribution of Multiplayer Gaming Hours")
plt.xlabel("Hours")
plt.ylabel("Frequency")
plt.show()

## Q1. Is multiplayer playtime associated with social anxiety (SPIN_T)?

In [None]:
plt.scatter(gamingdf_use["Hours"], gamingdf_use["SPIN_T"])
plt.xlabel("Multiplayer Hours")
plt.ylabel("Social Anxiety (SPIN_T)")
plt.show()

In [None]:
gamingdf_use[["Hours", "SPIN_T"]].corr(method="spearman")

In [None]:

gamingdf_use[["Hours", "SPIN_T"]].corr(method="pearson")
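
Both coefficients are computed because they measure different things: Pearson captures linear association, while Spearman works on ranks and captures any monotonic association. The toy data below (a made-up cubic relationship) shows how the two can differ.

```python
import numpy as np
import pandas as pd

# Toy data: y always increases with x, but along a curve rather than a straight line.
x = np.arange(1, 11)
toy = pd.DataFrame({'x': x, 'y': x ** 3})

pearson = toy.corr(method='pearson').loc['x', 'y']
spearman = toy.corr(method='spearman').loc['x', 'y']

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman is exactly 1 here because the rank order of `y` matches that of `x`, while Pearson falls short of 1 because the relationship is not a straight line. A large gap between the two on real data suggests a monotonic but non-linear relationship, or the influence of outliers.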

## Q2. Is multiplayer playtime associated with life satisfaction (SWL_T)?

In [None]:
plt.scatter(gamingdf_use["Hours"], gamingdf_use["SWL_T"])
plt.xlabel("Multiplayer Hours")
plt.ylabel("Life Satisfaction (SWL_T)")
plt.show()

In [None]:
gamingdf_use[["Hours", "SWL_T"]].corr(method="spearman")

In [None]:
gamingdf_use[["Hours", "SWL_T"]].corr(method="pearson")