# Data Analysis
##### Analyzing the Pre-Processed Binary Encoded Data

Using standard Python and pandas methods to examine and analyze the dataset.

#### Imports

In [10]:
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

import pandas as pd
import matplotlib.pyplot as plt
import pre_processing as pp

In [11]:
df = pp.preprocess_pipeline(r"Updated_Gaming_Survey_Responses.xlsx")

[INFO] Excel dataset loaded successfully. Shape: (500, 22)
[INFO] Basic cleaning applied.
[INFO] 'timestamp' column removed.
[INFO] Age column converted to binary columns -> Age_Teen, Age_Young_Adult, Age_Adult, Age_Mid_Adult.
[INFO] Location column converted to binary columns -> 'Location_India', 'Location_US', 'Location_Other'.
[INFO] Gender column cleaned.
[INFO] Gender column converted to binary columns.
[INFO] 'How often do you play video games?' converted to binary columns -> Gaming_Daily, Gaming_Weekly, Gaming_Monthly, Gaming_Rarely_Never.
[INFO] 'How many hours do you typically spend gaming in a week?' column cleaned and renamed -> 'Gaming_Hours'.
[INFO] 'Gaming_Hours' converted to binary columns.
[INFO] 'Which device do you play games on the most?(Check all that apply)' column cleaned and expanded into one-hot device columns.
[INFO] 'What genres of video games do you play? (Check all that apply)' column cleaned and expanded into one-hot genre columns.
[INFO] 'What is your favo

In [12]:
df.to_csv('data/processed_data.csv', index=False)

#### Data Loading

In [None]:
# Replace with the actual path to your CSV
df = pd.read_csv(r"data/processed_data.csv")

# Show data as boolean values
df = df.astype(bool)

print("Dataset shape:", df.shape)
print("Sample transactions:")
df.head()

Dataset shape: (500, 67)
Sample transactions:


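| | Age_Teen | Age_Young_Adult | Age_Adult | Age_Mid_Adult | Location_India | Location_US | Location_Other | Gender_Female | Gender_Male | Gaming_Daily | ... | Spend_lt100 | Spend_100-500 | Spend_500-1000 | Spend_1000plus | Reason_Fun | Reason_Stress_Relief | Reason_Skills_Competition | Reason_Socialize | Reason_Learning | Reason_Other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | ... | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 |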
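Casting with `astype(bool)` turns every non-zero value into `True`, so a stray non-binary column (for example, a saved index) would be converted silently. A minimal sketch of a check that could run before the cast is shown below; `assert_binary` and the toy frame are hypothetical names for illustration and are not part of `pre_processing`.

```python
import pandas as pd

def assert_binary(frame: pd.DataFrame) -> None:
    """Raise ValueError if any column holds a value other than 0 or 1."""
    bad_cols = [c for c in frame.columns if not frame[c].isin([0, 1]).all()]
    if bad_cols:
        raise ValueError(f"Non-binary columns: {bad_cols}")

# Hypothetical toy frame standing in for the processed CSV.
toy = pd.DataFrame({"Age_Teen": [1, 0], "Age_Adult": [0, 1]})

assert_binary(toy)            # passes: every cell is 0 or 1
toy_bool = toy.astype(bool)   # safe to cast after the check
print(toy_bool.dtypes)
```

Missing values also fail the check, because `NaN` is neither 0 nor 1.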


#### Let:
- $ I = \{ i_1, i_2, ... , i_m \} $ be the set of all items
- $ D = \{ T_1, T_2, ... , T_N \} $ be the transaction database
- $ N = | D | $ be the number of transactions
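
To make the notation concrete, the sketch below maps a tiny one-hot table onto $ I $, $ D $, and $ N $. The column names and values are made up for illustration and are not taken from the survey data.

```python
import pandas as pd

# Hypothetical toy data: 3 transactions (rows) over 4 items (columns).
toy = pd.DataFrame(
    {
        "Device_Mobile": [1, 0, 1],
        "Device_PC": [0, 1, 1],
        "Genre_FPS": [1, 1, 0],
        "Reason_Fun": [1, 0, 1],
    }
).astype(bool)

items = list(toy.columns)  # I: the set of all items
# D: each transaction T_t is the set of items whose column is True in row t
transactions = [set(toy.columns[row]) for row in toy.to_numpy()]
n_transactions = len(transactions)  # N = |D|

print(items)
print(transactions)
print(n_transactions)
```

Each row of the encoded DataFrame is one transaction, and each column is one item.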

## Analysis

### Item supports

$ \text{count}(i) = \sum_{t=1}^{N} \mathbb{1}(i \in T_t) $

$ \text{support}(i) = \frac{\text{count}(i)}{N} $

In [14]:
freq_support = pd.DataFrame({
    "Frequency": df.sum(),
    "Support": df.mean()
}).sort_values("Frequency", ascending=False)

top_10 = freq_support.head(10).reset_index()
bottom_10 = freq_support.tail(10).reset_index()

top_10.rename(columns={"index": "Top 10 Items"}, inplace=True)
bottom_10.rename(columns={"index": "Bottom 10 Items"}, inplace=True)

combined = pd.concat([top_10, bottom_10], axis=1)

combined

| | Top 10 Items | Frequency | Support | Bottom 10 Items | Frequency | Support |
|---|---|---|---|---|---|---|
| 0 | Reason_Fun | 334 | 0.668 | Favorite_Game_chess | 20 | 0.04 |
| 1 | Genre_Action/Adventure | 298 | 0.596 | Favorite_Game_clash_of_clans | 20 | 0.04 |
| 2 | Device_Mobile | 294 | 0.588 | Favorite_Game_red_dead_redemption_2 | 20 | 0.04 |
| 3 | Gender_Female | 268 | 0.536 | Favorite_Game_many | 20 | 0.04 |
| 4 | Location_India | 265 | 0.53 | Favorite_Game_fortnite | 20 | 0.04 |
| 5 | Reason_Stress_Relief | 264 | 0.528 | Favorite_Game_god_of_war_ragnarok | 20 | 0.04 |
| 6 | Game_Mode_Both | 263 | 0.526 | Reason_Other | 15 | 0.03 |
| 7 | Discovery_Social_Media | 263 | 0.526 | Gaming_Hours_0-1_hour | 12 | 0.024 |
| 8 | Genre_FPS | 238 | 0.476 | Reason_Learning | 12 | 0.024 |
| 9 | Gender_Male | 232 | 0.464 | Device_Handheld | 10 | 0.02 |
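
As a worked check of the support formula, `Reason_Fun` appears in 334 of the 500 transactions, so its support is 334 / 500 = 0.668. The sketch below shows on a small made-up frame that dividing the column sums by $ N $ gives the same values as the column means; the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 4 transactions, 2 items.
toy = pd.DataFrame(
    {"Reason_Fun": [1, 1, 0, 1], "Genre_FPS": [0, 1, 0, 0]}
).astype(bool)

n = len(toy)             # N: number of transactions
count = toy.sum()        # count(i): number of rows that contain item i
support = count / n      # support(i) = count(i) / N

# The column mean of a boolean frame is the same quantity.
max_diff = (support - toy.mean()).abs().max()
print(support)
print("max difference from df.mean():", max_diff)
```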


### Transaction lengths

$ \text{length}(T_t) = |T_t| = \sum_{j=1}^{m} X_{tj} $, where $ X_{tj} = 1 $ if item $ i_j \in T_t $ and $ 0 $ otherwise.

In [15]:
transaction_lengths = df.sum(axis=1)

basket_size = transaction_lengths.describe().to_frame(name="Transaction Length/Basket Size")

basket_size

| | Transaction Length/Basket Size |
|---|---|
| count | 500.0 |
| mean | 15.668 |
| std | 2.552412 |
| min | 12.0 |
| 25% | 14.0 |
| 50% | 15.0 |
| 75% | 17.0 |
| max | 24.0 |
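
The row-sum definition can be illustrated on a small made-up frame. The sketch below computes each transaction's length and then tabulates how often each length occurs, which complements the summary statistics; the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 3 transactions over items A, B, C.
toy = pd.DataFrame(
    {"A": [1, 0, 1], "B": [1, 1, 1], "C": [0, 0, 1]}
).astype(bool)

# Transaction length: number of True cells in each row.
lengths = toy.sum(axis=1)

# Distribution of lengths: how many transactions have each size.
length_counts = lengths.value_counts().sort_index()

print(lengths.tolist())
print(length_counts)
```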


### Sparsity/Density of the Dataset

$ \text{density} = \frac{\text{total ones}}{N \times M} $, where $ M = |I| $ is the number of items (columns).

In [16]:
total_entries = df.shape[0] * df.shape[1]
total_ones = df.values.sum()

density = (total_ones / total_entries)
print("Density:", density)

Density: 0.2338507462686567
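
Because the total number of ones equals the sum of all transaction lengths, the density is also the mean transaction length divided by the number of items. The sketch below checks this identity using only figures already reported in this notebook (shape `(500, 67)`, mean length 15.668); the variable names are illustrative.

```python
import math

n_items = 67                            # M: item columns, from the shape (500, 67)
mean_length = 15.668                    # mean transaction length from describe()
reported_density = 0.2338507462686567   # value printed by the density cell

# density = (N * mean_length) / (N * M) = mean_length / M
density_from_mean = mean_length / n_items

print(density_from_mean)
print(math.isclose(density_from_mean, reported_density, rel_tol=1e-9))
```

On average, each respondent selects roughly 23% of the available items.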


### Sample Transactions

In [17]:
transaction = df.loc[5]
# Build an explicit boolean mask. If the row holds 0/1 integers, using it
# directly as an indexer can be treated as a positional lookup, which returns
# only the first two column names over and over instead of the row's items.
items_in_transaction = transaction[transaction.astype(bool)].index.tolist()

print(items_in_transaction)


In [18]:
transaction = df.loc[255]
# Same explicit boolean mask as in the previous cell, so the result lists the
# items present in this row rather than positions 0 and 1 of the index.
items_in_transaction = transaction[transaction.astype(bool)].index.tolist()

print(items_in_transaction)
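
Since the same lookup is repeated for several rows, it could be wrapped in a small helper that works whether the frame stores booleans or 0/1 integers. This is a sketch only: `items_in_row` is a hypothetical name and the toy frame is made up for illustration.

```python
import pandas as pd

def items_in_row(frame: pd.DataFrame, row_label) -> list:
    """Return the column names whose value is truthy in the given row."""
    row = frame.loc[row_label]
    mask = row.astype(bool).to_numpy()   # explicit boolean mask
    return row.index[mask].tolist()

# Hypothetical toy data stored as 0/1 integers.
toy = pd.DataFrame({"A": [1, 0], "B": [0, 1], "C": [1, 1]})

print(items_in_row(toy, 0))
print(items_in_row(toy.astype(bool), 1))
```

Converting to a NumPy boolean array before indexing avoids any ambiguity between label-based, positional, and mask-based lookup.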
