# Case Study of the 'TinyStories' Dataset
The 'TinyStories' dataset is a large dataset generated from AI story-request prompts and their responses. It facilitates quick fine-tuning of models that were trained on real-world data, such as LLaMA. Because the dataset is large and was generated with the help of AI, its contents may have been unintentionally skewed or biased during generation. It is therefore essential to examine the dataset and determine whether it suits our application's content-moderation objectives with regard to literacy, mental health, and creativity.

In [1]:
# importing libraries
from pyspark.sql import SparkSession
from datasets import load_dataset

In [2]:
from pyspark.sql.functions import udf, col, countDistinct
from pyspark.sql.types import StringType, ArrayType, IntegerType
from pyspark.accumulators import AccumulatorParam 

In [3]:
import pandas as pd

## Setting up the Distributed Processing Environment

In [4]:
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr-serverless connect --application-id 00fra2001bfrlm09 --language python --emr-execution-role-arn arn:aws:iam::597161074694:role/service-role/AmazonEMR-ServiceRole-20250211T131858

Waiting for EMR Serverless application state to become STARTED
Waiting for EMR Serverless application state to become STARTED
Initiating EMR Serverless connection..
Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
0,00frdaescp02ap0a,pyspark,idle,Link,Link,,✔



SparkSession available as 'spark'.


In [5]:
# connecting to the spark session
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "64g") \
    .appName('spark') \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/01 01:10:22 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Statistical Metrics

## Loading the Data

In [6]:
data = load_dataset("skeskinen/TinyStories-GPT4")

In [7]:
data

DatasetDict({
    train: Dataset({
        features: ['story', 'summary', 'source', 'prompt', 'words', 'features'],
        num_rows: 2745100
    })
})

In [8]:
train = spark.createDataFrame(data['train'])

In [16]:
num_partitions = 8  # or another appropriate number
train = spark.createDataFrame(data['train']).repartition(num_partitions)

In [17]:
# Verify the number of partitions
print(f"Number of partitions: {train.rdd.getNumPartitions()}")

25/04/01 01:48:01 WARN TaskSetManager: Stage 1 contains a task of very large size (223671 KiB). The maximum recommended task size is 1000 KiB.
[Stage 1:>                                                        (0 + 16) / 16]

Number of partitions: 8




## Mental Health
Mental-health goals may be defined in objective terms by inferring the emotions, subjects, topics, genres, or commonly used keywords related to mental health within the data. For this purpose, we can use the lists of words and narrative features that were supplied in the prompt for each row of the training data. By counting the occurrences of the values in the 'words' and 'features' columns, we can gain some insight into the mental-health objectives.

In [29]:
class WordCountAccumulator(AccumulatorParam):
    def zero(self, value):
        return {}

    def addInPlace(self, v1, v2):
        # v1 is the accumulator's current state
        # v2 is the new dictionary being added from a partition
        for word, count in v2.items():
            if word is not None and word.strip():
                v1[word] = v1.get(word, 0) + count
        return v1

In [23]:
def normalize_and_count_unique(strings):
    # Despite its name, this helper only normalizes; counting happens per partition later
    if strings is None:
        return None
    # Trim whitespace, lowercase, and drop empty or missing entries
    normalized_strings = [s.strip().lower() for s in strings if s is not None and s.strip()]
    return normalized_strings

# Spark UDF wrapper that normalizes the words/features arrays for comparison
normalize_count_udf = udf(normalize_and_count_unique, ArrayType(StringType()))  # assumes an array-of-strings column

In [48]:
# applying the accumulator across partitions of the dataset based on col type
def process_features_partition(partition_iterator, col_type):
    # Initialize local counter for this partition
    partition_features = {}
    
    # Process each row in the partition
    for row in partition_iterator:
        features = row["normalized_" + col_type]
        if features is not None:
            # Handle both string and list cases
            words = features.split() if isinstance(features, str) else features
            
            # Count words in this row
            for word in words:
                if word and word.strip():
                    word = word.strip()
                    partition_features[word] = partition_features.get(word, 0) + 1
    
    # Add the partition counts to the accumulator
    if partition_features and col_type == 'features':
        features_accumulator.add(partition_features)
    elif partition_features and col_type == 'words':
        words_accumulator.add(partition_features)
    
    # Return a one-element iterator so that .count() forces every partition to be evaluated
    return iter([1])
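
The partition-then-merge pattern used by `process_features_partition` and `WordCountAccumulator` can be checked without a cluster. The following is a minimal pure-Python sketch, not the notebook's Spark code: `count_partition`, `merge_counts`, and the toy `rows` are hypothetical stand-ins, and the sketch only illustrates that merging per-partition dictionaries yields the same totals as a single global count.

```python
from collections import Counter


def count_partition(rows):
    """Count tokens in one partition, given as a list of token lists (or None)."""
    counts = {}
    for tokens in rows:
        if tokens is None:
            continue
        for token in tokens:
            token = token.strip()
            if token:
                counts[token] = counts.get(token, 0) + 1
    return counts


def merge_counts(total, partial):
    """Fold one partition's counts into the running total, mirroring addInPlace."""
    for token, count in partial.items():
        total[token] = total.get(token, 0) + count
    return total


# Hypothetical toy rows standing in for the 'normalized_features' column
rows = [["dialogue", "twist"], ["dialogue"], None, ["badending", "dialogue"]]
partitions = [rows[:2], rows[2:]]

merged = {}
for part in partitions:
    merged = merge_counts(merged, count_partition(part))

# Reference result: count everything in one pass, ignoring partition boundaries
global_counts = Counter(token for row in rows if row for token in row)
print(merged == dict(global_counts))
```

One caveat worth keeping in mind: Spark only guarantees exactly-once accumulator updates for updates made inside actions. Updates made inside a transformation such as `mapPartitions` may be applied more than once if a task or stage is re-executed, so the totals are best read as exact only when the job ran without retries.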

### Narrative Features

In [49]:
features_accumulator = spark.sparkContext.accumulator(
    {}, WordCountAccumulator())

In [50]:
# Ensure processed as lists of words
train = train.withColumn("normalized_features", normalize_count_udf(col("features")))

In [51]:
# Ensure proper partitioning
num_partitions = 200  # Adjust based on your cluster size
train = train.repartition(num_partitions)

# Force evaluation and verify processing
col_type = "features"  # must be "features" or "words"; only these have an accumulator
total_partitions = train.rdd.mapPartitions(
    lambda partition: process_features_partition(partition, col_type)
).count()

                                                                                

In [52]:
# Get the final word counts
features_accumulator.value

{'dialogue': 1470404,
 'foreshadowing': 250789,
 'twist': 539383,
 'moralvalue': 274152,
 'conflict': 250696,
 'badending': 250300}

In [53]:
feature_occurences = pd.Series(features_accumulator.value)
feature_occurences

dialogue         1470404
foreshadowing     250789
twist             539383
moralvalue        274152
conflict          250696
badending         250300
dtype: int64

In [59]:
feature_occurences.describe()

count    6.000000e+00
mean     5.059540e+05
std      4.859297e+05
min      2.503000e+05
25%      2.507192e+05
50%      2.624705e+05
75%      4.730752e+05
max      1.470404e+06
dtype: float64
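
Raw counts are easier to interpret as a share of all stories. The sketch below is an illustrative calculation rather than part of the original notebook: it reuses the counts printed above and the `num_rows` value reported when the dataset was loaded, and it assumes each feature appears at most once per row, so that a count can be read as a number of stories.

```python
# Counts copied from the accumulator output; row total from the DatasetDict summary
feature_counts = {
    "dialogue": 1470404,
    "foreshadowing": 250789,
    "twist": 539383,
    "moralvalue": 274152,
    "conflict": 250696,
    "badending": 250300,
}
num_rows = 2745100

# Fraction of stories whose prompt requested each feature
# (assumption: a feature is listed at most once per row)
shares = {name: count / num_rows for name, count in feature_counts.items()}

for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14}{share:6.1%}")
```

Under that assumption, 'dialogue' is requested in a little over half of the prompts, 'twist' in roughly a fifth, and each remaining feature in about a tenth, which suggests the feature mix is far from uniform.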

### Words

In [54]:
words_accumulator = spark.sparkContext.accumulator(
    {}, WordCountAccumulator())

In [55]:
train = train.withColumn("normalized_words", normalize_count_udf(col("words")))

In [56]:
# Force evaluation and verify processing
col_type = "words"  # must be "features" or "words"; only these have an accumulator
total_partitions = train.rdd.mapPartitions(
    lambda partition: process_features_partition(partition, col_type)
).count()

                                                                                

In [57]:
word_occurrences = pd.Series(words_accumulator.value)
word_occurrences.head()

reply      7065
engine     2509
bald      11245
feed       7070
lawyer     2529
dtype: int64

In [58]:
word_occurrences.describe()

count     1603.000000
mean      5137.429819
std       3471.319457
min       2418.000000
25%       2563.000000
50%       2623.000000
75%       7031.000000
max      20563.000000
dtype: float64
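
The summary statistics above also allow a quick sanity check on the counting step. As a rough sketch, multiplying the number of distinct words by their mean count recovers the total number of word occurrences, which can then be divided by the number of stories. The values are copied from the outputs above; the interpretation is an inference from these statistics, not something the dataset card is quoted on here.

```python
distinct_words = 1603       # 'count' from word_occurrences.describe()
mean_count = 5137.429819    # 'mean' from word_occurrences.describe()
num_rows = 2745100          # rows in the train split

# count * mean equals the sum of all per-word counts
total_occurrences = distinct_words * mean_count
words_per_story = total_occurrences / num_rows

print(round(words_per_story, 3))
```

The ratio works out to almost exactly three, which is consistent with every prompt carrying three words and with no rows having been dropped or double-counted.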

In [60]:
spark.stop()

## Creativity
Creativity is a hard goal to define objectively because, like the aforementioned topics of literacy and mental health, it can be defined in a multitude of ways. However, many people might agree that creative work is in some way unique. Therefore, it may be possible to define the goal of creativity by measuring the level of variance in the model's responses to similar prompts.
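
One possible way to put a number on that variance is the average vocabulary distance between stories written for the same prompt. The sketch below is only an illustration under stated assumptions: it uses Jaccard distance over word sets as a stand-in for "variance", and the helper names and sample stories are hypothetical rather than taken from the notebook or the dataset.

```python
from itertools import combinations


def tokens(text):
    """Lowercase a story and return its set of alphabetic word tokens."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    return set(cleaned.split())


def jaccard_distance(a, b):
    """One minus (intersection size / union size); 0.0 means identical vocabularies."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mean_pairwise_distance(stories):
    """Average Jaccard distance over all pairs of stories in a group."""
    sets = [tokens(story) for story in stories]
    pairs = list(combinations(sets, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical groups of stories generated for the same prompt words
similar = ["The cat sat on the mat.", "The cat sat on the rug."]
varied = ["The cat sat on the mat.", "A dragon flew over the quiet village."]

print(mean_pairwise_distance(similar) < mean_pairwise_distance(varied))
```

A score near 0 would suggest near-identical responses and a score near 1 almost no shared vocabulary. Word overlap ignores plot and structure, so a measure like this is at best a coarse proxy for creativity.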

# Statistical Analysis
In this section, the statistical metrics calculated from the training data are analyzed and visualized to draw insights with regard to the stated objectives. Furthermore, because the data contains no 'literacy level' column, which makes the level of literacy hard to gauge directly, a language model is used to classify the responses.

## Literacy
WonderWords' literacy goals may be defined in objective terms by classifying whether the responses are at a lower or a higher reading level. As our application targets a youth demographic, using the categorization system found in most libraries and school systems will help show whether the training data is biased towards a specific set of reading levels.
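
As a lightweight baseline for comparing reading levels, a readability formula can be computed directly from the text. The sketch below uses the Flesch-Kincaid grade level, which is a different technique from the language-model classifier this case study describes. The syllable counter is a crude vowel-group heuristic, and the function names and sample texts are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(word) for word in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


# Hypothetical sample texts at clearly different difficulty levels
simple = "The cat sat. The dog ran. They had fun."
complex_text = ("Notwithstanding considerable institutional opposition, the committee "
                "unanimously ratified the controversial environmental legislation.")

print(flesch_kincaid_grade(simple) < flesch_kincaid_grade(complex_text))
```

Very short, simple texts can score below zero on this scale, so the grades are more useful for ranking stories against one another than as absolute school grades.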

## Mental Health

## Creativity