# Personal Information
Name: **Friso Harlaar**

StudentID: **12869384**

Email: [**friso.harlaar@student.uva.nl**](mailto:friso.harlaar@student.uva.nl)

Submitted on: **23.03.2025**

# Data Context
**I will use two main datasets in this thesis. The first contains images scraped manually from the [Aesthetics Wiki](https://aesthetics.fandom.com/wiki/Aesthetics_Wiki) and will be used to fine-tune a Vision Transformer into an aesthetics classifier. The second is a books dataset that contains book metadata (title, author(s), genre, etc.) together with each book's description, reviews, and cover image. It will be used to train a multimodal model that takes the description, reviews, metadata, and cover image as input and classifies the book into an aesthetic.**

# Data Description

**Present here the results of your exploratory data analysis. Note that there is no need to have a "story line" - it is more important that you show your understanding of the data and the methods that you will be using in your experiments (i.e. your methodology).**

**As an example, you could show data, label, or group balances, skewness, and basic characterizations of the data. Information about data frequency and distributions as well as results from reduction mechanisms such as PCA could be useful. Furthermore, indicate outliers and how/why you are taking them out of your samples, if you do so.**

**The idea is, that you conduct this analysis to a) understand the data better but b) also to verify the shapes of the distributions and whether they meet the assumptions of the methods that you will attempt to use. Finally, make good use of images, diagrams, and tables to showcase what information you have extracted from your data.**

As you can see, you are in a Jupyter notebook environment here. This means that you should focus less on writing text and more on actually exploring your data. If you need to, you can use the amsmath environment in-line: $e=mc^2$ or in separate equations such as here:

\begin{equation}
    e=mc^2 \mathrm{\space where \space} e,m,c\in \mathbb{R}
\end{equation}

Furthermore, you can insert images such as your data aggregation diagrams like this:

![image](example.png)

In [1]:
# Imports
import os
import numpy as np
import pandas as pd
import glob
import matplotlib.pyplot as plt
import seaborn as sns

### Data Loading

**Aesthetic images**

In [2]:
# Load your data here
base_path = "data/aesthetic_images/"

# Get list of all aesthetic folders
aesthetic_folders = [f for f in os.listdir(base_path) if os.path.isdir(os.path.join(base_path, f))]

# Create a list to store the counts
counts = []

# Count files in each folder and get additional statistics
for aesthetic in aesthetic_folders:
    folder_path = os.path.join(base_path, aesthetic)
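    # Keep regular files only, so nested folders are not counted as images
    image_files = [f for f in glob.glob(os.path.join(folder_path, "*")) if os.path.isfile(f)]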
    
    # Calculate total size in MB
    total_size_bytes = sum(os.path.getsize(file) for file in image_files)
    total_size_mb = total_size_bytes / (1024 * 1024)
    
    counts.append({
        "aesthetic": aesthetic,
        "image_count": len(image_files),
        "total_size_mb": round(total_size_mb, 2),
        "avg_size_mb": round(total_size_mb / len(image_files), 2) if image_files else 0
    })

# Sort by image count
df_image_counts = pd.DataFrame(counts)
df_image_counts = df_image_counts.sort_values("image_count", ascending=False)
df_image_counts

| | aesthetic | image_count | total_size_mb | avg_size_mb |
|---|---|---|---|---|
| 11 | Frogcore | 182 | 35.86 | 0.2 |
| 15 | Kidcore | 75 | 28.39 | 0.38 |
| 7 | Dark_Academia | 63 | 17.28 | 0.27 |
| 9 | Fairy_Kei | 60 | 7.77 | 0.13 |
| 20 | Traumacore | 59 | 19.03 | 0.32 |
| 5 | Cottagecore | 55 | 21.11 | 0.38 |
| 8 | Ethereal | 50 | 12.76 | 0.26 |
| 22 | Vaporwave | 47 | 44.19 | 0.94 |
| 3 | Bloomcore | 40 | 11.29 | 0.28 |
| 6 | Cyberpunk | 33 | 28.3 | 0.86 |
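
The counts above point to a noticeable class imbalance, which matters when fine-tuning a classifier. The following is a minimal sketch that quantifies it. It uses only the ten rows displayed above (the row indices suggest the full table is longer), and the names `image_counts`, `imbalance_ratio`, and `class_weights` are illustrative, not part of the original notebook. The weights follow the common "balanced" heuristic, `total / (n_classes * count)`, which is one possible choice rather than the method fixed for this thesis.

```python
# Image counts copied from the ten rows displayed above (assumption: the
# full table contains more aesthetics than are shown here).
image_counts = {
    "Frogcore": 182, "Kidcore": 75, "Dark_Academia": 63, "Fairy_Kei": 60,
    "Traumacore": 59, "Cottagecore": 55, "Ethereal": 50, "Vaporwave": 47,
    "Bloomcore": 40, "Cyberpunk": 33,
}

total = sum(image_counts.values())
n_classes = len(image_counts)

# Ratio between the largest and the smallest class
imbalance_ratio = max(image_counts.values()) / min(image_counts.values())

# "Balanced" inverse-frequency weights: rarer classes get larger weights
class_weights = {
    name: total / (n_classes * count) for name, count in image_counts.items()
}

print(f"total images: {total}, imbalance ratio: {imbalance_ratio:.2f}")
for name, weight in class_weights.items():
    print(f"{name:15s} weight = {weight:.2f}")
```

Weights of this kind could be passed to a weighted loss during fine-tuning; oversampling the smaller classes would be an alternative.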


**Books dataset**

In [None]:
# Build the path with os.path.join so it also resolves on non-Windows systems
BOOKS_PATH = os.path.join("..", "goodreads", "goodreads_books")

# All book datasets
book_files = glob.glob(os.path.join(BOOKS_PATH, "*.gz"))

print(book_files)
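
Before loading the full archives, it can help to inspect a few records from each file. The sketch below assumes that every `.gz` file holds one JSON object per line (JSON Lines); the notebook does not state the file format, so this should be verified first. The helper name `peek_json_lines` is hypothetical.

```python
import gzip
import json


def peek_json_lines(path, n=5):
    """Return the first n non-empty records of a gzip-compressed JSON Lines file.

    Assumption: each line of the file is a standalone JSON object.
    """
    records = []
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if len(records) >= n:
                break
            if line.strip():  # skip blank lines
                records.append(json.loads(line))
    return records
```

For example, `peek_json_lines(book_files[0], n=1)` would return a list with the first record of the first archive, which shows which metadata fields are available without decompressing the whole file.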

### Analysis 1: 
Make sure to add some explanation of what you are doing in your code. This will help you and whoever will read this a lot in following your steps.

In [3]:
# Also don't forget to comment your code
# This way it's also easier to spot thought errors along the way

### Analysis 2: 

In [4]:
# ...

### Analysis n:

In [5]:
# ...