## Reducing Size of Dataframes While Maintaining Complexity 

Running models on my current dataframes is difficult with the computational power I have available. This notebook documents my process of taking stratified samples of my two recommendations dataframes.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy import stats
import scipy as sci

In [2]:
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from scipy.sparse import csr_matrix
from mlxtend.frequent_patterns import apriori, association_rules

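# Reduce the memory usage of a DataFrame by downcasting 64-bit numeric columns.
# Note: astype('int32') silently wraps values outside the int32 range, and
# float32 keeps only about 7 significant digits.
def reduce_memory(df):
    for col in df.columns:
        if df[col].dtype == 'float64':
            df[col] = df[col].astype('float32')
        elif df[col].dtype == 'int64':
            df[col] = df[col].astype('int32')
    return df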

# Generator that yields successive row chunks of an already-loaded DataFrame.
def data_generator(df, chunksize=10000):
    for i in range(0, df.shape[0], chunksize):
        yield df.iloc[i:i+chunksize]


games = reduce_memory(pd.read_csv('../Data/games.csv'))
users = reduce_memory(pd.read_csv('../Data/users.csv'))
recommendations = reduce_memory(pd.read_csv('../Data/recommendations.csv'))
recommendations2 = reduce_memory(pd.read_csv('../Data/new_recommendations.csv'))
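
Downcasting every `int64` column to `int32` is only safe when all values fit in the 32-bit range. As a hedged alternative, the sketch below uses `pd.to_numeric(..., downcast="integer")`, which picks the smallest integer dtype that can hold every value and leaves a column alone when nothing smaller fits. The helper name `safe_reduce_memory` and the `demo` frame are made up for illustration and are not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd


def safe_reduce_memory(df):
    """Downcast 64-bit numeric columns without overflowing integers (hypothetical helper)."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == "int64":
            # Chooses the smallest signed integer dtype that holds every value;
            # the column stays int64 if no smaller dtype is large enough.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif out[col].dtype == "float64":
            # Same float cast as reduce_memory: halves memory, loses some precision.
            out[col] = out[col].astype("float32")
    return out


# Toy data (assumed values): one column fits in a small dtype, one does not.
demo = pd.DataFrame({
    "small_id": np.array([1, 2, 3], dtype="int64"),
    "big_id": np.array([1, 2, 3_000_000_000], dtype="int64"),
    "hours": np.array([0.5, 1.25, 10.0], dtype="float64"),
})

reduced = safe_reduce_memory(demo)
print(reduced.dtypes)
```

Here `big_id` contains a value above the `int32` maximum of 2,147,483,647, so it is left as `int64` rather than being wrapped to a different number.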

Stratified sampling requires every class we stratify on to appear at least twice in the recommendations dataframe. We can stratify on either users or items (games), so the next few cells determine which of the two options loses less data once the single-occurrence entries are removed.
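
The two-occurrence requirement can be reproduced on toy data. This is a minimal sketch assuming scikit-learn's `StratifiedShuffleSplit`, the splitter used later in this notebook; the arrays `X` and `y_singleton` are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Toy labels (assumed): classes 1 and 2 appear four times each, class 3 only once.
X = np.zeros((9, 1))
y_singleton = np.array([1, 1, 1, 1, 2, 2, 2, 2, 3])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)

# A class with a single member cannot be represented on both sides of the
# split, so scikit-learn raises a ValueError instead of sampling.
try:
    next(split.split(X, y_singleton))
    singleton_ok = True
except ValueError as err:
    singleton_ok = False
    print(f"Split refused: {err}")

# After dropping the single-occurrence class, the same splitter works.
mask = y_singleton != 3
train_idx, sample_idx = next(split.split(X[mask], y_singleton[mask]))
sampled_labels = sorted(y_singleton[mask][sample_idx].tolist())
print(sampled_labels)
```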

In [3]:
# Checks how many games appear in recommendations only once 

app_id_counts = recommendations['app_id'].value_counts()


single_occurrence_app_ids = app_id_counts[app_id_counts == 1].index


unique_app_id_recommendations = recommendations[recommendations['app_id'].isin(single_occurrence_app_ids)]

In [4]:
# Same as above but for the new dataframe I altered in Refining a Rating System

app_id_counts2 = recommendations2['app_id'].value_counts()


single_occurrence_app_ids2 = app_id_counts2[app_id_counts2 == 1].index


unique_app_id_recommendations2 = recommendations2[recommendations2['app_id'].isin(single_occurrence_app_ids2)]

In [5]:
# Number of games that appear only once, in each dataframe

print(len(unique_app_id_recommendations))
print(len(unique_app_id_recommendations2))

291
370


In [6]:
# Same as above, but checking user IDs instead of games

user_id_counts = recommendations['user_id'].value_counts()


single_occurrence_user_ids = user_id_counts[user_id_counts == 1].index


unique_user_id_recommendations = recommendations[recommendations['user_id'].isin(single_occurrence_user_ids)]

In [7]:
user_id_counts2 = recommendations2['user_id'].value_counts()


single_occurrence_user_ids2 = user_id_counts2[user_id_counts2 == 1].index


unique_user_id_recommendations2 = recommendations2[recommendations2['user_id'].isin(single_occurrence_user_ids2)]

In [8]:
# About 7.5 million single-review users, compared to 300-400 single-review games

print(len(unique_user_id_recommendations))
print(len(unique_user_id_recommendations2))

7573027
7452615


In [9]:
# Share of single-occurrence user and app IDs relative to the totals in users and games

print(len(unique_user_id_recommendations) / len(users)) 
print(len(unique_app_id_recommendations) / len(games)) 

0.5293578303578119
0.005720239031294229


From what we can see, 53% of the users in our users dataframe make only one recommendation in our recommendations dataframe, whereas only about 0.57% of the games in our games dataframe appear once. Removing single-occurrence entries therefore costs far less data if we stratify on games rather than users.
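
The comparison can be wrapped in a small helper that makes the fraction-to-percentage step explicit: a ratio of 0.0057 is 0.57%, not 0.0057%. This is a sketch on invented toy data; `singleton_share` and the `toy` frame are hypothetical names, not part of the notebook.

```python
import pandas as pd


def singleton_share(interactions, column, catalog_size):
    """Return (count, share) of IDs in `column` that occur exactly once (hypothetical helper)."""
    counts = interactions[column].value_counts()
    n_single = int((counts == 1).sum())
    return n_single, n_single / catalog_size


# Toy interactions (assumed): users 1 and 2 review once, game 30 is reviewed once.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 3, 4, 4],
    "app_id": [10, 10, 10, 20, 20, 30],
})

n_users, user_share = singleton_share(toy, "user_id", catalog_size=4)
n_games, game_share = singleton_share(toy, "app_id", catalog_size=4)

# The .2% format multiplies by 100, which avoids misreading a raw fraction.
print(f"single-review users: {n_users} ({user_share:.2%})")
print(f"single-review games: {n_games} ({game_share:.2%})")
```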

In [10]:
# looking at some of the titles included in the single occurrences 

filtering = unique_app_id_recommendations.merge(games,on='app_id')
filtering2 = unique_app_id_recommendations2.merge(games,on='app_id')

In [11]:
# The highest user_reviews count among these is 922, which I'd consider removable. I saved the CSV externally,
# examined it in more depth, and didn't notice any noteworthy titles.
filtering.sort_values(by='user_reviews',ascending=False) 

Unnamed: 0,app_id,helpful,funny,date,is_recommended,hours,user_id,review_id,title,date_release,win,mac,linux,rating,positive_ratio,user_reviews,price_final,price_original,discount,steam_deck
0,2111850,8,6,2022-08-29,True,0.800000,12496174,20085246,Gunfire Reborn - Visitors of Spirit Realm,2022-08-28,True,False,False,Very Positive,90,922,6.99,6.99,0.0,True
163,844890,0,0,2021-11-10,True,71.599998,4357178,38103808,World of Warships — Starter Pack: Ishizuchi,2018-06-21,True,False,False,Mostly Positive,75,657,24.99,24.99,0.0,True
76,229660,0,0,2019-05-13,True,45.700001,12808645,28996859,Sonic and All-Stars Racing Transformed: Metal ...,2013-01-31,True,False,False,Very Positive,91,86,19.99,19.99,0.0,True
160,415800,0,0,2018-12-16,True,63.799999,9558316,37987211,X Rebirth: Home of Light,2016-02-25,True,True,True,Very Positive,92,82,4.99,9.99,50.0,True
96,987730,2,0,2020-07-25,True,1.100000,764368,29414338,Class - Plague Doctor,2018-12-13,True,True,True,Positive,86,46,4.99,4.99,0.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
111,1919440,10,0,2022-06-05,True,0.500000,5421505,36667562,SPE:X,2022-05-18,True,False,False,Positive,100,10,6.99,6.99,0.0,True
112,1978580,0,0,2022-09-02,True,8.900000,769934,36675956,SGS Operation Downfall,2022-08-09,True,True,False,Mostly Positive,70,10,19.99,19.99,0.0,True
120,1055950,2,0,2021-03-20,True,10.700000,4291951,36871468,Feels,2019-11-13,True,False,False,Positive,100,10,3.99,3.99,0.0,True
206,1598220,0,2,2021-12-17,True,18.000000,8440912,39186449,Moon Observatory Iris,2021-05-01,True,True,True,Positive,100,10,19.99,19.99,0.0,True


In [12]:
# removing single-occurrence games from the full datasets

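# Filter on the single-occurrence IDs directly, so the result does not depend on the merge with games
recommendations = recommendations[~recommendations['app_id'].isin(single_occurrence_app_ids)]
recommendations2 = recommendations2[~recommendations2['app_id'].isin(single_occurrence_app_ids2)]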

In [13]:
from sklearn.model_selection import StratifiedShuffleSplit

# Stratified sampling: draw 10% of the rows, stratified on app_id
target_column = recommendations2['app_id']

split = StratifiedShuffleSplit(n_splits=1, test_size=0.10, random_state=42)

for train_idx, sample_idx in split.split(recommendations2, target_column):
    stratified_sample2 = recommendations2.iloc[sample_idx]

In [14]:
# Repeated for the original recommendations dataframe

target_column = recommendations['app_id']

split = StratifiedShuffleSplit(n_splits=1, test_size=0.10, random_state=42)

for train_idx, sample_idx in split.split(recommendations, target_column):
    stratified_sample = recommendations.iloc[sample_idx]
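
One way to sanity-check a sample like this is to compare each game's share of rows in the full dataframe with its share in the sample. The helper below is a minimal sketch with invented toy data; `max_share_drift`, `full`, `balanced`, and `skewed` are hypothetical names. In the notebook it could be called as `max_share_drift(recommendations, stratified_sample, 'app_id')`.

```python
import pandas as pd


def max_share_drift(full, sample, column):
    """Largest absolute gap between a category's share in `full` and in `sample`."""
    full_share = full[column].value_counts(normalize=True)
    sample_share = sample[column].value_counts(normalize=True)
    # Categories missing from the sample count as a share of 0.
    sample_share = sample_share.reindex(full_share.index, fill_value=0.0)
    return float((full_share - sample_share).abs().max())


# Toy data (assumed): game 10 has 60% of the rows, game 20 has 40%.
full = pd.DataFrame({"app_id": [10] * 60 + [20] * 40})

# A 10-row sample that keeps the 60/40 mix, and one that does not.
balanced = full.iloc[list(range(6)) + list(range(60, 64))]
skewed = full.iloc[list(range(9)) + [60]]

balanced_drift = max_share_drift(full, balanced, "app_id")
skewed_drift = max_share_drift(full, skewed, "app_id")
print(balanced_drift, skewed_drift)
```

A drift near zero suggests the sample kept the per-game proportions. Small nonzero values are expected on real data, because per-game row counts have to be rounded to whole rows.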

In [15]:
stratified_sample2

Unnamed: 0,app_id,helpful,funny,date,is_recommended,hours,user_id,review_id,average_hours,above_average_playtime,...,review_age,products,reviews,review_ratio,title,rating,positive_ratio,user_reviews,game_review_ratio,composite_rating
14902172,700330,0,0,2020-05-20,1,15.000000,6451963,39409765,75.000000,0,...,955,72,16,0.222222,SCP: Secret Laboratory,1.2,91,154538,0.562910,10
30661131,584400,0,0,2019-06-29,1,4.300000,5038981,17749572,18.500000,0,...,1281,311,13,0.041801,Sonic Mania,1.2,93,18473,0.576355,8
37690205,1644960,0,0,2022-05-31,0,326.299988,9126227,11002420,140.000000,1,...,214,117,12,0.102564,NBA 2K22,1.0,58,29569,0.349386,5
26963620,1816670,0,0,2022-10-09,0,2.700000,11400715,13500336,19.200001,0,...,83,182,4,0.021978,GUNDAM EVOLUTION,1.0,52,18407,0.777911,3
8598819,220200,0,0,2018-11-22,1,314.600006,2411207,8876196,148.000000,1,...,1500,123,4,0.032520,Kerbal Space Program,1.3,95,94712,0.857114,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30918699,35140,0,0,2020-04-13,1,3.900000,10561108,35457937,15.800000,0,...,992,20,2,0.100000,Batman: Arkham Asylum Game of the Year Edition,1.3,96,33409,0.714987,7
8861132,8870,0,0,2014-06-23,1,11.200000,12112479,23292282,16.600000,0,...,3113,242,2,0.008264,BioShock Infinite,1.2,93,99379,0.652764,7
3462732,271590,4,0,2021-10-23,0,695.099976,9049206,5524726,209.300003,1,...,434,553,40,0.072333,Grand Theft Auto V,1.2,86,1484122,0.108836,5
39760994,815780,26,4,2022-07-29,1,42.599998,3257175,31235948,39.700001,1,...,155,8,1,0.125000,PIPE by BMX Streets,1.2,93,5032,0.678060,7


In [16]:
stratified_sample

Unnamed: 0,app_id,helpful,funny,date,is_recommended,hours,user_id,review_id
38799055,414700,0,0,2018-09-14,True,23.000000,7357147,38799055
29964690,477870,0,0,2021-08-02,True,6.900000,2981405,29964690
36118150,976310,0,0,2022-01-07,True,75.199997,5102463,36118150
14278486,253980,0,0,2015-03-27,False,1.400000,12315031,14278486
20827726,444090,0,0,2021-08-16,True,9.200000,1244334,20827726
...,...,...,...,...,...,...,...,...
17650721,55230,2,0,2014-11-02,True,44.200001,3195856,17650721
5389297,236850,0,0,2021-01-20,True,673.599976,12195184,5389297
20599452,248820,0,0,2014-07-04,True,105.300003,12442711,20599452
14216491,1517290,0,0,2022-12-18,False,32.799999,784405,14216491


In [17]:
# saving our stratified dataframes for use in other notebooks; commented out to avoid regenerating the files unless desired

# stratified_sample.to_csv('../Data/sampled_recommendations.csv',index=False)
# stratified_sample2.to_csv('../Data/sampled_composite_recommendations.csv',index=False)