# Personal Information
Name: **Friso Harlaar**

StudentID: **12869384**

Email: [**friso.harlaar@student.uva.nl**](mailto:friso.harlaar@student.uva.nl)

Submitted on: **23.03.2025**

# Data Context
**I will use two main datasets in this thesis. The first contains images scraped manually from the [Aesthetics Wiki](https://aesthetics.fandom.com/wiki/Aesthetics_Wiki) and will be used to fine-tune a Vision Transformer into an aesthetics classifier. The second is a books dataset containing book metadata (title, author(s), genre, etc.) as well as each book's description, reviews, and cover image. It will be used to train a multimodal model that takes the textual description, reviews, metadata, and cover image as input and classifies the book into an aesthetic.**

# Data Description



In [1]:
# Imports
import glob
import gzip
import json
import numpy as np
import os
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.notebook import tqdm

### Data Loading

**Aesthetic images**

These images were scraped from the [Aesthetics Wiki](https://aesthetics.fandom.com/wiki/Aesthetics_Wiki). The starting point was a list of 24 aesthetics curated by [Giolo & Berghman](https://firstmonday.org/ojs/index.php/fm/article/view/12723). However, 2 of the 24 aesthetics were removed, and the Frogcore aesthetic was made a sub-aesthetic, meaning that it no longer has its own page, which made scraping it difficult.

To create more training data, the images will be flipped horizontally.
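A minimal sketch of this augmentation, assuming each image has already been loaded as a NumPy array of shape (height, width, channels). The helper name `augment_with_flip` is hypothetical and not part of the thesis code:

```python
import numpy as np

def augment_with_flip(images):
    """Return the original images followed by their horizontally flipped copies."""
    flipped = [np.fliplr(img) for img in images]  # reverses the width axis
    return list(images) + flipped

# Toy 1x3 single-channel "image" to show the effect
toy = np.array([[[1], [2], [3]]])
augmented = augment_with_flip([toy])
print(len(augmented))         # 2
print(augmented[1][0, :, 0])  # [3 2 1]
```

Flipped copies should only be added to the training split. Otherwise an image and its mirror could end up on both sides of a train/test split.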

In [2]:
# Count the scraped images per aesthetic folder
base_path = "data/aesthetic_images/"

# Get list of all aesthetic folders
aesthetic_folders = [f for f in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, f))]

# Create a list to store the counts
counts = []

# Count files in each folder and get additional statistics
for aesthetic in aesthetic_folders:
    folder_path = os.path.join(base_path, aesthetic)
    image_files = glob.glob(os.path.join(folder_path, "*"))
    
    # Calculate total size in MB
    total_size_bytes = sum(os.path.getsize(file) for file in image_files)
    total_size_mb = total_size_bytes / (1024 * 1024)
    
    counts.append({
        "aesthetic": aesthetic,
        "image_count": len(image_files),
        "total_size_mb": round(total_size_mb, 2),
        "avg_size_mb": round(total_size_mb / len(image_files), 2) if image_files else 0
    })

# Sort by image count
df_image_counts = pd.DataFrame(counts)
df_image_counts = df_image_counts.sort_values("image_count", ascending=False)
df_image_counts

| | aesthetic | image_count | total_size_mb | avg_size_mb |
|---|---|---|---|---|
| 0 | Frogcore | 182 | 35.86 | 0.2 |
| 14 | Kidcore | 75 | 28.39 | 0.38 |
| 18 | Dark_Academia | 63 | 17.28 | 0.27 |
| 16 | Fairy_Kei | 60 | 7.77 | 0.13 |
| 19 | Traumacore | 59 | 19.03 | 0.32 |
| 7 | Cottagecore | 55 | 21.11 | 0.38 |
| 8 | Ethereal | 50 | 12.76 | 0.26 |
| 6 | Vaporwave | 47 | 44.19 | 0.94 |
| 10 | Bloomcore | 40 | 11.29 | 0.28 |
| 3 | Cyberpunk | 33 | 28.3 | 0.86 |
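
The counts above are clearly unbalanced. The sketch below quantifies this using only the ten rows shown here. The counts are copied by hand from the output, and any aesthetics not shown in it are not covered, so the figures are indicative rather than final:

```python
# Image counts copied from the rows shown in the output above
image_counts = {
    "Frogcore": 182, "Kidcore": 75, "Dark_Academia": 63, "Fairy_Kei": 60,
    "Traumacore": 59, "Cottagecore": 55, "Ethereal": 50, "Vaporwave": 47,
    "Bloomcore": 40, "Cyberpunk": 33,
}

total = sum(image_counts.values())
largest = max(image_counts, key=image_counts.get)
smallest = min(image_counts, key=image_counts.get)
imbalance_ratio = image_counts[largest] / image_counts[smallest]

print(f"Total images in the shown rows: {total}")                          # 664
print(f"{largest} share: {image_counts[largest] / total:.1%}")             # 27.4%
print(f"Imbalance ratio ({largest} / {smallest}): {imbalance_ratio:.2f}")  # 5.52
```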


**Books dataset**

In [3]:
# There are multiple files in the goodreads dataset
# Here is an overview of each file:
# https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html
BOOKS_PATH = r'data/goodreads/goodreads_books/'

# All book datasets
book_files = glob.glob(os.path.join(BOOKS_PATH, "*.gz"))

print(book_files)

['data/goodreads/goodreads_books/goodreads_book_works.json.gz', 'data/goodreads/goodreads_books/goodreads_book_genres_initial.json.gz', 'data/goodreads/goodreads_books/goodreads_book_authors.json.gz', 'data/goodreads/goodreads_books/goodreads_book_series.json.gz', 'data/goodreads/goodreads_books/goodreads_books.json.gz']


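### Analysis 1: Loading the Goodreads books and checking missing data
The cells below read the main `goodreads_books.json.gz` file line by line into a DataFrame and convert empty strings to `NaN`. They then count the missing values per column, as well as the books whose cover is only the Goodreads placeholder image.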

In [4]:
MAIN_BOOKS_PATH = r'data/goodreads/goodreads_books/goodreads_books.json.gz'

def read_goodreads_data(file_path, max_rows=None, sample_size=10000, return_sample=True):
    """
    Read Goodreads JSON.GZ data into a DataFrame
    
    Parameters:
    -----------
    file_path : str
        Path to the goodreads_books.json.gz file
    max_rows : int, optional
        Maximum number of rows to read (None = read all)
    sample_size : int, optional
        Approximate number of rows to keep if return_sample=True
        (each row is kept independently with probability sample_size / total)
    return_sample : bool, default=True
        If True, return a random sample instead of the full dataset
        
    Returns:
    --------
    DataFrame containing book data
    """
    all_books = []
    total_processed = 0
    
    # For sampling
    if return_sample:
        # First pass to count total lines (needed to compute the sampling rate)
        if not max_rows:
            print("Counting total records for sampling...")
            with gzip.open(file_path, 'rt', encoding='utf-8') as f:
                total_lines = sum(1 for _ in tqdm(f))
            sampling_rate = min(1.0, sample_size / total_lines)
            print(f"Sampling rate: {sampling_rate:.4f} ({sample_size} of {total_lines:,})")
        else:
            # If max_rows is specified, use that for sampling rate calculation
            total_lines = max_rows
            sampling_rate = min(1.0, sample_size / max_rows)
    
    # Read the file
    print(f"Reading data{' (sampling)' if return_sample else ''}...")
    with gzip.open(file_path, 'rt', encoding='utf-8') as f:
        for i, line in tqdm(enumerate(f)):
            # Stop if we reached max_rows
            if max_rows and i >= max_rows:
                break
                
            # Sample if requested
            if return_sample and np.random.random() > sampling_rate:
                continue
                
            try:
                # Parse JSON line and append to list
                book = json.loads(line.strip())
                all_books.append(book)
                total_processed += 1
                
                # Print progress for large datasets
                if total_processed % 100000 == 0 and not return_sample:
                    print(f"Processed {total_processed:,} records")
                    
            except json.JSONDecodeError:
                print(f"Error parsing JSON at line {i}")
    
    print(f"Creating DataFrame with {len(all_books):,} records...")
    df = pd.DataFrame(all_books)
    
    return df

# 1. Get a sample of books (fastest)
# sample_df = read_goodreads_data(
#     MAIN_BOOKS_PATH, 
#     return_sample=True, 
#     sample_size=10000
# )
# print(f"Sample DataFrame shape: {sample_df.shape}")
# sample_df.head()

# 2. Read the first N books
# first_n_df = read_goodreads_data(
#    MAIN_BOOKS_PATH,
#    max_rows=100000,
#    return_sample=False
# )

# 3. Read all books (requires a lot of memory)
df = read_goodreads_data(
   MAIN_BOOKS_PATH,
   return_sample=False
)
df.replace('', np.nan, inplace=True)

Reading data...



Processed 100,000 records
Processed 200,000 records
Processed 300,000 records
Processed 400,000 records
Processed 500,000 records
Processed 600,000 records
Processed 700,000 records
Processed 800,000 records
Processed 900,000 records
Processed 1,000,000 records
Processed 1,100,000 records
Processed 1,200,000 records
Processed 1,300,000 records
Processed 1,400,000 records
Processed 1,500,000 records
Processed 1,600,000 records
Processed 1,700,000 records
Processed 1,800,000 records
Processed 1,900,000 records
Processed 2,000,000 records
Processed 2,100,000 records
Processed 2,200,000 records
Processed 2,300,000 records
Creating DataFrame with 2,360,655 records...
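
In sampling mode, `read_goodreads_data` keeps each row independently with probability `sample_size / total_lines`. The size of the returned sample therefore varies around `sample_size`, and a separate counting pass over the file is needed. One possible alternative, sketched here as an assumption rather than as the method used in this notebook, is reservoir sampling. It returns exactly `k` rows in a single pass. The function name `reservoir_sample` is hypothetical:

```python
import random

def reservoir_sample(lines, k, seed=None):
    """Return exactly min(k, n) items drawn uniformly from an iterable of n items."""
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k items
            reservoir.append(line)
        else:
            # Keep item i with probability k / (i + 1)
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = line
    return reservoir

# Works on any iterable of lines, shown here with a toy generator
sample = reservoir_sample((f"record {n}" for n in range(1000)), k=10, seed=0)
print(len(sample))  # 10
```

The same function could be applied to the file handle returned by `gzip.open(MAIN_BOOKS_PATH, 'rt', encoding='utf-8')`, with `json.loads` then applied only to the sampled lines.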


In [8]:
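# Inspect the available columns
display(df.columns)
# Placeholder URL that Goodreads uses when a book has no cover image
NO_IMAGE_LINK = 'https://s.gr-assets.com/assets/nophoto/book/111x148-bcc042a9c91a29c1d680899eff700a03.png'
# Number of books that only have the placeholder cover
display(df[df['image_url'] == NO_IMAGE_LINK].shape)
# Missing values per column (empty strings were converted to NaN above)
display(df.isna().sum())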


Index(['isbn', 'text_reviews_count', 'series', 'country_code', 'language_code',
       'popular_shelves', 'asin', 'is_ebook', 'average_rating', 'kindle_asin',
       'similar_books', 'description', 'format', 'link', 'authors',
       'publisher', 'num_pages', 'publication_day', 'isbn13',
       'publication_month', 'edition_information', 'publication_year', 'url',
       'image_url', 'book_id', 'ratings_count', 'work_id', 'title',
       'title_without_series'],
      dtype='object')

(981061, 29)

isbn                     983373
text_reviews_count          524
series                        0
country_code                490
language_code           1060153
popular_shelves               0
asin                    1891138
is_ebook                    490
average_rating              524
kindle_asin             1345725
similar_books                 0
description              412233
format                   646754
link                        524
authors                       0
publisher                654362
num_pages                764133
publication_day         1024429
isbn13                   780263
publication_month        882945
edition_information     2142642
publication_year         599625
url                         524
image_url                   490
book_id                       0
ratings_count               524
work_id                     524
title                         7
title_without_series          7
dtype: int64

In [15]:
# Books that have neither a real cover image nor a description
display(df[(df['image_url'] == NO_IMAGE_LINK) & (df['description'].isna())].shape)

(285692, 29)
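
Combining the three counts above by inclusion-exclusion gives a rough estimate of how many books have both a real cover and a description. The numbers are copied from the outputs above. The 490 rows with a missing `image_url` are not counted as placeholder covers here, so the true figure may be lower by up to 490:

```python
total_books = 2_360_655        # records loaded
placeholder_cover = 981_061    # image_url == NO_IMAGE_LINK
missing_description = 412_233  # description is NaN
missing_both = 285_692         # placeholder cover and no description

# |A or B| = |A| + |B| - |A and B|
missing_either = placeholder_cover + missing_description - missing_both
has_both = total_books - missing_either

print(f"Missing cover or description: {missing_either:,}")  # 1,107,602
print(f"Cover and description present: {has_both:,}")       # 1,253,053
```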
