## Stratified Sampling code: Movielens

We use this code to build a stratified sample from the original `Movielens` dataset. The full dataset is too large to run the models on, so we build as large a sample as is practical and run the models on that instead. To keep the sample representative, we group the movies into strata by their number of reviews and draw a random sample within each stratum.

In [3]:
import numpy as np
import pandas as pd

In [4]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"  # make Jupyter print all outputs, not just the last one
from IPython.display import HTML  # pretty-print pandas DataFrames so they can be copied elsewhere (e.g. to ppt slides)

In [5]:
df = pd.read_parquet('cleaned/movielens_parquet')

Exporting to and importing from Parquet converted the lists in the ``review_data`` column of the Movielens data into ``numpy`` arrays. We convert them back to lists so that we can take samples with the same approach as in ``EDA Netflix.ipynb``.

In [6]:
# convert numpy arrays to lists in the 'review_data' column
df['review_data'] = df['review_data'].apply(lambda x: x.tolist() if isinstance(x, np.ndarray) else x)
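
As a quick check of what this conversion does, the sketch below applies the same lambda to a toy frame. The frame `toy` and its values are made up for illustration; they only mimic the shape the real ``review_data`` cells are assumed to have after the Parquet round trip (object arrays of review dictionaries).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are invented): each cell holds
# an object array of review dictionaries, as after the Parquet round trip.
toy = pd.DataFrame({
    "movieId": [1, 2],
    "review_data": [
        np.array([{"userId": 10, "rating": 4.0},
                  {"userId": 11, "rating": 3.5}], dtype=object),
        np.array([{"userId": 10, "rating": 5.0}], dtype=object),
    ],
})

# Same conversion as in the notebook cell: arrays become plain lists,
# anything else (e.g. NaN) is left untouched.
toy["review_data"] = toy["review_data"].apply(
    lambda x: x.tolist() if isinstance(x, np.ndarray) else x
)

# Record the Python type of every cell to confirm the conversion.
cell_types = toy["review_data"].map(type).tolist()
print(cell_types)
```

After the conversion, `isinstance(x, list)` checks such as the one used later to count reviews behave as intended.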


### Explanation of Stratified Sampling Code for Movielens Dataset

This code is designed to create a stratified sample from the Movielens dataset based on the number of reviews for each movie. Here's a breakdown of the steps involved:

1. Count the Number of Reviews per Movie:
The first step counts the reviews for each movie and stores the result in a new column called ``num_reviews``.

2. Divide the Dataset into Strata:
Next, the dataset is divided into strata based on the number of reviews. The 20th, 40th, 60th, 80th and 100th percentiles of ``num_reviews`` serve as the upper bin edges, so each stratum holds roughly a fifth of the movies as long as review counts are not heavily tied. The lowest edge is set to 0, and the five strata are labelled 'Q1' through 'Q5'.

3. Categorize Movies into Strata:
Movies are then categorized into these strata based on the number of reviews they have received. This is done by creating a new column named ``review_stratum``, which assigns each movie to a stratum according to its number of reviews.

Now, let's delve into the code to see how these steps are implemented.

In [7]:
# Step 1: Count the review dictionaries in each row of review_data (non-list values such as NaN count as 0)
df['num_reviews'] = df['review_data'].apply(lambda x: len(x) if isinstance(x, list) else 0)

# Step 2: Divide the dataset into strata based on the number of reviews for each movie
quintiles = df['num_reviews'].quantile([0, 0.20, 0.40, 0.60, 0.80, 1.0])

# Use 0 as the lowest edge. pd.cut bins are right-closed by default, so a movie with 0 reviews falls outside every bin (NaN)
stratum_boundaries = [0, quintiles[0.20], quintiles[0.40], quintiles[0.60], quintiles[0.80], quintiles[1.0]]
stratum_labels = ['Q1', 'Q2', 'Q3', 'Q4', 'Q5']

# Create a new column to categorize movies into strata based on the number of reviews
df['review_stratum'] = pd.cut(df['num_reviews'], bins=stratum_boundaries, labels=stratum_labels)
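
To make the edge behaviour of this binning concrete, here is a minimal sketch on invented review counts (the series `counts` is hypothetical, not taken from the dataset). It uses the same quantile edges and labels as the cell above and also shows `include_lowest=True`, an option of `pd.cut` that the notebook does not use, for comparison.

```python
import pandas as pd

# Invented review counts, including one movie without reviews.
counts = pd.Series([0, 1, 1, 2, 3, 5, 8, 13, 40, 200], name="num_reviews")
labels = ["Q1", "Q2", "Q3", "Q4", "Q5"]

# Same construction as in the notebook: 0 as the lowest edge,
# then the 20/40/60/80/100th percentiles as upper edges.
quintiles = counts.quantile([0.20, 0.40, 0.60, 0.80, 1.0])
edges = [0] + quintiles.tolist()

# Default behaviour: intervals are (left, right], so the value 0 is excluded.
strata = pd.cut(counts, bins=edges, labels=labels)

# Alternative (not used in the notebook): close the first interval on the left.
strata_incl = pd.cut(counts, bins=edges, labels=labels, include_lowest=True)

print(pd.DataFrame({"num_reviews": counts,
                    "default": strata,
                    "include_lowest": strata_incl}))
```

With the default settings the movie with 0 reviews gets no stratum, which is why the sampling loop later drops NaN values before iterating over the strata.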

In the following snippet we will create the sample:

1. Count the Number of Reviews per Movie in the original Dataframe:
Initially, the code counts the number of reviews per movie in the dataset and stores the count in a new column in the sample called ``num_reviews``.

2. Divide the Sample Dataset into Strata:
The dataset is divided into strata based on the number of reviews per movie. Quintiles of the review count determine the boundaries, so each stratum covers roughly a fifth of the movies.

3. Define Sample Size per Stratum:
A fixed sample size (100 movies) is defined for each stratum, so every stratum is equally represented in the final sampled dataset.

4. Apply Random Sampling within Each Stratum:
Within each stratum, movies are selected by simple random sampling without replacement. This ensures that the sampled dataset covers movies from the least reviewed to the most reviewed.

5. Create the Sampled DataFrame:
Finally, the sampled movies are collected to create the sampled DataFrame, containing essential movie information such as ID, review data, genres, year, title, review stratum, and the number of reviews.

In [8]:
import random

# Work on an explicit copy so that the original df is left untouched
sampled_df = df.copy()

# Step 1: Count the number of dictionaries per row in the review_data column
sampled_df['num_reviews'] = df['review_data'].apply(lambda x: len(x) if isinstance(x, list) else 0)

# Step 2: Divide the dataset into strata based on the number of reviews for each movie
quintiles = sampled_df['num_reviews'].quantile([0, 0.20, 0.40, 0.60, 0.80, 1.0])
# Use 0 as the lowest edge and the quintiles as the remaining edges, as in the previous cell
stratum_boundaries = [0, quintiles[0.20], quintiles[0.40], quintiles[0.60], quintiles[0.80], quintiles[1.0]]
stratum_labels = ['Q1', 'Q2', 'Q3', 'Q4','Q5']

# Create a new column to categorize movies into strata based on the number of reviews
sampled_df['review_stratum'] = pd.cut(sampled_df['num_reviews'], bins=stratum_boundaries, labels=stratum_labels)

# Step 3: Define sample size per stratum
sample_size_per_stratum = 100

# Step 4: Within each stratum, apply random sampling techniques to select movies
sampled_movies = []

# Iterate over each stratum
for stratum in sampled_df['review_stratum'].dropna().unique():  # Drop NaN values
    # Filter movies in the current stratum
    stratum_movies = sampled_df[sampled_df['review_stratum'] == stratum]
    
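    # Apply simple random sampling without replacement within the stratum;
    # cap the size so a stratum with fewer movies than requested does not raise ValueError
    n_to_sample = min(sample_size_per_stratum, len(stratum_movies))
    sampled_indices = random.sample(list(stratum_movies.index), n_to_sample)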
    sampled_movies.extend(sampled_indices)

# Step 5: Create the sampled DataFrame
sampled_df_movielens = sampled_df.loc[sampled_movies, ['movieId', 'review_data', 'genres', 'year', 'title', 'review_stratum','num_reviews']]
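
The loop above draws a different sample on every run because no seed is set. As a possible alternative, and only as a sketch, the same per-stratum sampling can be written with pandas' `DataFrameGroupBy.sample` and a fixed `random_state`. The toy frame, the sample size of 10 and the seed 42 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (invented): 200 movies with distinct, skewed review counts.
toy = pd.DataFrame({
    "movieId": np.arange(1, 201),
    "num_reviews": np.arange(1, 201) ** 2,
})

# Same stratification as in the notebook: 0 plus quintile edges.
labels = ["Q1", "Q2", "Q3", "Q4", "Q5"]
edges = [0] + toy["num_reviews"].quantile([0.20, 0.40, 0.60, 0.80, 1.0]).tolist()
toy["review_stratum"] = pd.cut(toy["num_reviews"], bins=edges, labels=labels)

# Draw the same number of movies from every stratum, reproducibly.
n_per_stratum = 10  # hypothetical size; the notebook uses 100
sample = (
    toy.groupby("review_stratum", observed=True)
       .sample(n=n_per_stratum, random_state=42)
)

print(sample["review_stratum"].value_counts())
```

Fixing the seed makes the exported sample reproducible across notebook runs, at the cost of always analysing the same subset.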

We summarize the information in the sample:

The following code extracts the user IDs from the review data, counts the unique users and the total number of reviews, and computes the average number of reviews per unique user and per movie ID. These statistics give a first picture of user engagement and of how reviews are distributed within the sample.

In [9]:
# Extract all user IDs from the 'review_data' column using list comprehension
user_ids = [review_entry.get('userId') for row in sampled_df_movielens['review_data'] for review_entry in row if review_entry.get('userId')]

# Count the number of unique users and reviews
unique_users = set(user_ids)
amount_of_reviews = len(user_ids)

# Calculate averages
avg_reviews_per_unique_user = amount_of_reviews / len(unique_users)
avg_reviews_per_movie_id = amount_of_reviews / len(sampled_df_movielens)

# Print results
print("There are {} reviews in the sampled dataframe.".format(amount_of_reviews))
print("There are {} unique users who have reviewed a movie.".format(len(unique_users)))
print("There are {} movieIds in the sampled dataset.".format(len(sampled_df_movielens)))
print("A unique user places {} reviews on average in the sampled dataset.".format(round(avg_reviews_per_unique_user)))
print("A movieId receives {} reviews on average in the sampled dataset.".format(round(avg_reviews_per_movie_id)))

There are 50381 reviews in the sampled dataframe.
There are 37841 unique users who have reviewed a movie.
There are 500 movieIds in the sampled dataset.
A unique user places 1 reviews on average in the sampled dataset.
A movieId receives 101 reviews on average in the sampled dataset.
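
Because the averages are rounded to whole numbers, some detail is lost. The short calculation below redoes the arithmetic from the counts printed above and keeps two decimals; the variable names are chosen here for illustration.

```python
# Counts copied from the printed summary above.
amount_of_reviews = 50381
unique_user_count = 37841
movie_count = 500

# Unrounded averages.
avg_per_user = amount_of_reviews / unique_user_count
avg_per_movie = amount_of_reviews / movie_count

print(f"Reviews per unique user: {avg_per_user:.2f}")
print(f"Reviews per movieId: {avg_per_movie:.2f}")
```

The per-user average is about 1.33 rather than exactly 1, so most users contribute only one or two reviews to the sample.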


### T-Tests to Compare the Means:

We perform two-sample t-tests that compare, within each stratum, the mean number of reviews of the sampled movies with that of all movies in the same stratum of the full dataset. If the sample is representative, we expect no statistically significant difference in any stratum.

In [10]:
from scipy.stats import ttest_ind

# Define the strata
strata = sampled_df_movielens['review_stratum'].unique()

# Perform t-tests for each stratum
t_statistics = {}
p_values = {}
for stratum in strata:
    # Extract the 'num_reviews' column for the current stratum
    sampled_num_reviews_stratum = sampled_df_movielens[sampled_df_movielens['review_stratum'] == stratum]['num_reviews']
    population_num_reviews_stratum = df[df['review_stratum'] == stratum]['num_reviews']
    
    # Perform the t-test
    t_statistic, p_value = ttest_ind(sampled_num_reviews_stratum, population_num_reviews_stratum)
    
    # Store the results
    t_statistics[stratum] = t_statistic
    p_values[stratum] = p_value

# Print the results
alpha = 0.05  # significance level
print("T-test Results:")
for stratum in strata:
    print(f"Stratum: {stratum}")
    print(f"T-statistic: {t_statistics[stratum]}")
    print(f"P-value: {p_values[stratum]}")
    # Note: a NaN p-value compares as False here, so undefined tests also land in the else branch
    if p_values[stratum] < alpha:
        print("The difference in means is statistically significant (reject the null hypothesis)")
    else:
        print("The difference in means is not statistically significant (fail to reject the null hypothesis)")


T-test Results:
Stratum: Q5
T-statistic: 1.2592854628851795
P-value: 0.20795623155902943
The difference in means is not statistically significant (fail to reject the null hypothesis)
Stratum: Q4
T-statistic: -0.8454720209041986
P-value: 0.3978709545561665
The difference in means is not statistically significant (fail to reject the null hypothesis)
Stratum: Q3
T-statistic: -1.2799855650550944
P-value: 0.2006020214097563
The difference in means is not statistically significant (fail to reject the null hypothesis)
Stratum: Q2
T-statistic: nan
P-value: nan
The difference in means is not statistically significant (fail to reject the null hypothesis)
Stratum: Q1
T-statistic: nan
P-value: nan
The difference in means is not statistically significant (fail to reject the null hypothesis)


  res = hypotest_fun_out(*samples, **kwds)


The t-test results show no statistically significant difference in mean review count between the sample and the full dataset for strata Q3, Q4 and Q5. Their p-values (about 0.20, 0.40 and 0.21 respectively) are above the 0.05 significance level, so we fail to reject the null hypothesis of equal means.

For strata Q1 and Q2 the t-statistic and p-value are `nan`, so the test is undefined there rather than non-significant, even though the code prints the "not statistically significant" message. The most likely cause is that every movie in each of these strata has the same number of reviews, which makes the variance zero in both groups and the t-statistic 0/0; in that case the sample mean matches the population mean exactly. Overall, we find no evidence that the sampled strata differ from the corresponding strata of the full dataset in mean review count, keeping in mind that failing to reject the null hypothesis does not prove the distributions are identical.
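
The `nan` results can be reproduced with a minimal sketch under the assumption that both groups are constant. The arrays and the helper `describe` below are hypothetical and not part of the notebook; they only illustrate how an undefined test could be reported separately.

```python
import warnings

import numpy as np
from scipy.stats import ttest_ind

# Assumed situation: every movie in the stratum has exactly 1 review,
# both in the sample and in the full dataset.
sample_counts = np.ones(100)
population_counts = np.ones(5000)

with warnings.catch_warnings():
    # SciPy may warn when the data are (nearly) identical.
    warnings.simplefilter("ignore")
    result = ttest_ind(sample_counts, population_counts)


def describe(p_value, alpha=0.05):
    """Hypothetical helper: report NaN p-values as undefined."""
    if np.isnan(p_value):
        return "undefined (t-statistic is NaN)"
    if p_value < alpha:
        return "significant"
    return "not significant"


print(result.statistic, result.pvalue, describe(result.pvalue))
```

With zero variance in both groups and equal means, the t-statistic is 0/0, so reporting the stratum as undefined is more accurate than reporting it as non-significant.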

Finally, we export the stratified sample for use in the models.

In [11]:
sampled_df_movielens.to_parquet('cleaned/strat_sample_movielens')