# Case Study of the'TinyStories' Dataset
The 'TinyStories' dataset is a large dataset that was generated using AI story request prompts and responses. It helps facilitate quick fine-tuning of models that were trained with real-world data - such as LLaMA. As this is a large dataset, and this data was generated with the help of AI, the data may have unintentionally been skewed or biased during generation. It is essential to examine the contents of this dataset and whether it is suitable to our applications content-moderation objectives with regard to literacy, mental-health, and creativity.

In [1]:
# importing libraries
from pyspark.sql import SparkSession
from datasets import load_dataset

In [23]:
from pyspark.sql.functions import udf, col, countDistinct
from pyspark.sql.types import StringType, ArrayType, IntegerType
from pyspark.accumulators import AccumulatorParam 

In [121]:
import pandas as pd

## Setting up the Distributed Processing Environment

In [2]:
%load_ext sagemaker_studio_analytics_extension.magics
%sm_analytics emr-serverless connect --application-id 00fra2001bfrlm09 --language python --emr-execution-role-arn arn:aws:iam::597161074694:role/service-role/AmazonEMR-ServiceRole-20250211T131858

Initiating EMR Serverless connection..
Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
1,00frarnisqpud90a,pyspark,idle,Link,Link,,✔


FloatProgress(value=0.0, bar_style='info', description='Progress:', layout=Layout(height='25px', width='50%'),…

SparkSession available as 'spark'.


In [3]:
# connecting to the spark session
spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "64g") \
    .appName('spark') \
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/28 23:47:04 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Statistical Metrics

## Loading the Data

In [4]:
data = load_dataset("skeskinen/TinyStories-GPT4")

In [7]:
data

DatasetDict({
    train: Dataset({
        features: ['story', 'summary', 'source', 'prompt', 'words', 'features'],
        num_rows: 2745100
    })
})

In [None]:
train = spark.createDataFrame(data['train'])

## Mental Health
As for mental-health, goals may be defined in objective terms by inferring the emotions, subjects, topics, genres, or any commonly used keywords related to mental-health within the data. For this purpose, we can use the list of words and narrative features provided within the prompts provided to the AI in the training data for each row in the dataset. By counting the occurences of the 'words' and 'features' columns used within the prompt, we can make some insights of the mental-health objectives.

In [110]:
class WordCountAccumulator(AccumulatorParam):
    def zero(self, value):
        return {}  # Initialize an empty dictionary

    def addInPlace(self, accum, value):
        # Ensure value is split into words if it's a string
        if isinstance(value, str):
            value = value.split()  # Split the string into words

        for word in value:  # Iterate through the list of words
            if word is not None and word.strip():  # Check for non-empty words
                word = word.strip()  # Remove leading/trailing whitespace
                if word in accum:
                    accum[word] += 1
                else:
                    accum[word] = 1
        return accum

    def merge(self, accum1, accum2):  # Combine multiple dictionaries
        for key, value in accum2.items():
            if key is not None and key.strip():  # Check for non-empty keys
                if key not in accum1:
                    accum1[key] = value
                else:
                    accum1[key] += value
        return accum1  

In [113]:
def normalize_and_count_unique(strings):
    if strings is None:
        return None
    normalized_strings = [s.strip().lower() for s in strings if s is not None and s.strip()]
    return normalized_strings
# the sparkerized function to normalize the words/features for comparison
normalize_count_udf = udf(normalize_and_count_unique, ArrayType(StringType())) # Assuming string type

### Narrative Features

In [112]:
features_accumulator = spark.sparkContext.accumulator(
    {}, WordCountAccumulator())

In [114]:
# Ensure processed as lists of words
train = train.withColumn("normalized_features", normalize_count_udf(col("features")))
train.foreach(lambda row: features_accumulator.add(
    {word: 1 for word in (row["normalized_features"].split() if isinstance(row["normalized_features"], str) else row["normalized_features"])
     if word is not None and word.strip()}
))

25/03/29 01:29:42 WARN TaskSetManager: Stage 27 contains a task of very large size (223671 KiB). The maximum recommended task size is 1000 KiB.
25/03/29 01:29:59 WARN TaskSetManager: Stage 28 contains a task of very large size (223671 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

In [122]:
feature_occurences = pd.Series(features_accumulator.value)
feature_occurences

dialogue         16
badending        16
twist            16
conflict         16
foreshadowing    16
moralvalue       16
dtype: int64

### Words

In [None]:
words_accumulator = spark.sparkContext.accumulator(
    {}, WordCountAccumulator())

In [21]:
train = train.withColumn("normalized_words", normalize_count_udf(col("words")))
train.foreach(lambda row: words_accumulator.add(
    {word: 1 for word in (row["normalized_words"].split() if isinstance(row["normalized_words"], str) else row["normalized_words"])
     if word is not None and word.strip()}
))

25/03/29 00:21:38 WARN TaskSetManager: Stage 1 contains a task of very large size (223671 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

In [123]:
word_occurrences = pd.Series(words_accumulator.value)
word_occurrences.head()

fight     16
grill     16
bright    16
kick      16
bridge    16
dtype: int64

## Creativity
Objectively speaking, creativity is a hard goal to define as it can be objectively defined in a multitude of way, as are the aforementioned topics of literacy and mental-health. However, many people might agree that creativity is somehow unique. Therefore, it may be possible to define the goal of creativity by understanding the level of variance in the models responses to similar prompts. 

# Statistical Analysis
In this section the statistical metrics calculated from the training data is analyzed and visualized for making insights with regards to the stated objectives. Furthermore, given the complexity in understanding the level of literacy from the data - as no 'literacy level' column is within the data - a language model is used to classify the responses.

## Literacy
WonderWords' literacy goals may be defined in objective terms by classifying whether the responses are at a lower or a higher reading level. As our application is targeting a youth demographic, utilizing the categorization system used in most libraries and school systems will help illustrate whether the training data is biased towards a specific set of reading levels.

## Mental Health

## Creativity