# Data exploration

In this notebook we take a look at the data we are going to use.

In [1]:
import json
import os
import random
from image_to_poem.data.kaggle_poems import KagglePoems

DATA_PATH = '../data/'

## Kaggle poems

The Kaggle poems data set consists of two sub data sets: "topics" and "forms". In "topics", the poems are sorted into folders by the topic they describe. In "forms", the poems are sorted into folders by the form they are written in. We will use the "topics" data set.

### Topics exploration

Below we take a look at the available topics.

In [2]:
topics = os.listdir(os.path.join(DATA_PATH, "kaggle_poems", "topics"))
print(f"There are {len(topics)} topics in the dataset.")

There are 144 topics in the dataset.


In [3]:
topic_counts = {}

for topic in topics:
    topic_counts[topic] = len(os.listdir(os.path.join(DATA_PATH, "kaggle_poems", "topics", topic)))

print(f"Each topic has between {min(topic_counts.values())} and {max(topic_counts.values())} poems.")

Each topic has between 97 and 100 poems.
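
The min/max summary hides which topics are short of 100 poems. A minimal sketch of how the per-topic counts could be inspected is shown here. The helper names `count_poems_per_topic` and `smallest_topics` are hypothetical (not part of `image_to_poem`), and the sketch assumes every poem is stored as a `.txt` file directly inside its topic folder.

```python
from pathlib import Path


def count_poems_per_topic(topics_dir):
    """Return {topic_name: number_of_txt_files} for each sub-folder.

    Assumption: one poem per .txt file, stored directly in the topic folder.
    """
    counts = {}
    for topic_dir in sorted(Path(topics_dir).iterdir()):
        if topic_dir.is_dir():
            counts[topic_dir.name] = sum(
                1 for f in topic_dir.iterdir() if f.suffix == ".txt"
            )
    return counts


def smallest_topics(counts, n=5):
    """Return the n topics with the fewest poems as (topic, count) pairs."""
    return sorted(counts.items(), key=lambda kv: (kv[1], kv[0]))[:n]


# Possible usage in this notebook (path taken from the cells above):
# counts = count_poems_per_topic("../data/kaggle_poems/topics")
# print(smallest_topics(counts))
```

Sorting by `(count, name)` keeps the result deterministic when several topics have the same number of poems.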


### Poems exploration

In [5]:
poems = KagglePoems("../data/kaggle_poems/")
poems.stats

Reading poems:  20%|██        | 2876/14331 [00:00<00:01, 6876.31it/s]

Could not read ../data/kaggle_poems/topics/chicago/ChicagoPoemsOfComptonImNotTheLeastBitAfraidToTheThoughtOfChicagoIDrinkCoolAideOrLemonadeCosaNostraByThemICouldNeverDieTheFbiLikeStandingNextToAFlyPoembyJoshuaAaronGuillory.txt
Could not read ../data/kaggle_poems/topics/chicago/ChicagoPoemsOfComptonImNotTheLeastBitAfraidToTheThoughtOfChicagoIDrinkKoolAidOrLemonadeCosaNostraByThemICouldNeverDieTheFbiLikeStandingNextToAFlyPoembyJoshuaAaronGuillory.txt


Reading poems:  71%|███████   | 10105/14331 [00:01<00:00, 6540.37it/s]

Could not read ../data/kaggle_poems/topics/racism/RacismPoemsTranslationOfRacismIsAroundMeEverywhereByFrancisDugganαªåαª«αª░αªÜαª░αª¬αª╢αª¢αº£αºƒαªåαª¢αª£αªñαª¼αªªαª¼αª╖αª«αª▓αª½αª░αª¿αª╕αª╕αªíαªùαª¿PoembyAlamSayed.txt


Reading poems: 100%|██████████| 14331/14331 [00:02<00:00, 6520.28it/s]


{'num_poems': 14331,
 'num_words': 2631184,
 'vocab_size': 174557,
 'avg_poem_length': 183.60086525713487}
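
The reported `avg_poem_length` equals `num_words / num_poems` (2631184 / 14331 ≈ 183.60). How `KagglePoems` tokenizes the text is not shown in this notebook, so the following is only a sketch of how comparable statistics could be computed, assuming plain whitespace splitting and a case-sensitive vocabulary. The function name `corpus_stats` is hypothetical.

```python
def corpus_stats(texts):
    """Compute simple corpus statistics for a list of poem strings.

    Assumption: words are whitespace-separated tokens and the vocabulary is
    case-sensitive; the real KagglePoems class may tokenize differently.
    """
    vocab = set()
    num_words = 0
    for text in texts:
        words = text.split()
        num_words += len(words)
        vocab.update(words)

    num_poems = len(texts)
    return {
        "num_poems": num_poems,
        "num_words": num_words,
        "vocab_size": len(vocab),
        # Average number of words per poem; 0.0 for an empty corpus.
        "avg_poem_length": num_words / num_poems if num_poems else 0.0,
    }
```

With this definition, punctuation attached to a word (for example `here` versus `here,`) produces separate vocabulary entries, which would inflate the vocabulary size relative to a tokenizer that strips punctuation.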

Let's take a look at some poems.

In [6]:
poems.get_example()

File index: 8458
Topic: music
Title: Young Laughters And My Music
Author: Augusta Davies Webster
------------------------------------
Young laughters, and my music! Aye till now
The voice can reach no blending minors near;
'Tis the bird's trill because the spring is here
And spring means trilling on a blossomy bough;
'Tis the spring joy that has no why or how,
But sees the sun and hopes not nor can fear--
Spring is so sweet and spring seems all the year.
Dear voice, the first-come birds but trill as thou.
Oh music of my heart, be thus for long:
Too soon the spring bird learns the later song;
Too soon a sadder sweetness slays content
Too soon! There comes new light on onward day,
There comes new perfume o'er a rosier way:
Comes not again the young spring joy that went.


### Poem stats

In [10]:
# poems = [read_poem(path) for path in paths]
# print(f"Number of poems: {len(poems)}")


poems.get_example()




Topic: river
Title: River Mates
Author: Padraic Colum
---------------------
I’LL be an otter, and I’ll let you swim
A mate beside me; we will venture down
A deep, dark river, when the sky above
Is shut of the sun; spoilers are we,
Thick-coated; no dog’s tooth can bite at our veins
With eyes and ears of poachers; deep-earthed ones
Turned hunters; let him slip past
The little vole; my teeth are on an edge
For the King-fish of the River!
I hold him up
The glittering salmon that smells of the sea;
I hold him high and whistle!
Now we go
Back to our earths; we will tear and eat
Sea-smelling salmon; you will tell the cubs
I am the Booty-bringer, I am the Lord
Of the River; the deep, dark, full and flowing River.


In [7]:
poems.stats

{'num_poems': 14331,
 'num_words': 2631184,
 'vocab_size': 174557,
 'avg_poem_length': 183.60086525713487}

In [15]:
# Backslashes are escaped explicitly; "\w" is an invalid escape sequence.
p = "data/kaggle_poems/topics\\work\\WorkPoemsA37SheWasGivenLightWorkPoembyRajaramRamachandran.txt"
p.replace("\\", "/")

'data/kaggle_poems/topics/work/WorkPoemsA37SheWasGivenLightWorkPoembyRajaramRamachandran.txt'
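
As an alternative to replacing backslashes by hand, the standard library's `pathlib.PureWindowsPath` accepts both separators and can emit a forward-slash form. This is a small sketch, not what the project code necessarily does. The helper name `to_posix` and the short example path are made up for illustration.

```python
from pathlib import PureWindowsPath


def to_posix(path_str):
    """Normalize a path that mixes "/" and "\\" separators to forward slashes.

    PureWindowsPath treats both characters as separators and never touches
    the file system, so this works on any operating system.
    """
    return PureWindowsPath(path_str).as_posix()


# Hypothetical example path with mixed separators.
print(to_posix("data/kaggle_poems/topics\\work\\poem.txt"))
```

Unlike `str.replace`, this also collapses repeated separators, so the result may differ from the manual approach for paths such as `a\\\\b`.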

## MultiM poem images

In [6]:
# Use a context manager so the file handle is closed after loading.
with open(DATA_PATH + "multim_poem.json") as f:
    image_json = json.load(f)
print("Example entry: ", image_json[0])

Example entry:  {'id': 0, 'image_url': 'https://farm2.staticflickr.com/1086/1002051357_0e9162423e.jpg', 'poem': 'what is lovely never dies\nbut passes into other loveliness\nstar-dust or sea-foam flower or winged air'}
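
Each entry holds an `id`, an `image_url`, and a `poem` whose lines are separated by `\n`. A rough sketch of how a single entry could be summarized follows. The helper name `summarize_entry` is hypothetical, and words are assumed to be whitespace-separated.

```python
def summarize_entry(entry):
    """Return line and word counts for one MultiM-style entry.

    Assumption: the entry has "id" and "poem" keys, as in the example above,
    and poem lines are separated by newline characters.
    """
    lines = entry["poem"].split("\n")
    return {
        "id": entry["id"],
        "num_lines": len(lines),
        "num_words": sum(len(line.split()) for line in lines),
    }


# The example entry printed above, copied here so the sketch is self-contained.
example = {
    "id": 0,
    "image_url": "https://farm2.staticflickr.com/1086/1002051357_0e9162423e.jpg",
    "poem": "what is lovely never dies\nbut passes into other loveliness\n"
            "star-dust or sea-foam flower or winged air",
}
print(summarize_entry(example))
```

Applied to the whole list, for example `[summarize_entry(e) for e in image_json]`, this would give a quick view of how the poem lengths here compare with the Kaggle poems.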
