# Table of contents
* [Preprocessing](#1)
    * [Lowercase everything](#11)
    * [Remove apostrophes](#12)
    * [Remove \r and \n](#13)
    * [Remove punctuation](#14)
* [Part-of-speech tagging](#2) 
* [Data Augmentation](#3)

# Preprocessing <a class="anchor" id="1"></a>

- Convert all characters to lowercase
- Remove apostrophes, carriage returns, leading and trailing newlines, and punctuation
- Add the tokens START, END and N to mark the beginning of a poem, the end of a poem, and each line break
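
The steps above can be sketched on a single string with the standard library alone. This is a minimal illustration, not the notebook's code: the function name `preprocess_poem` is hypothetical, and it mirrors the order of the pandas cells that follow without filtering out empty poems.

```python
import re


def preprocess_poem(raw: str) -> str:
    """Apply the cleaning steps, in order, to one poem string (hypothetical helper)."""
    text = raw.lower()
    # Drop straight and curly apostrophes so "i'm" becomes "im".
    text = text.replace("'", "").replace("\u2019", "")
    # Drop carriage returns, then trim leading and trailing newlines.
    text = text.replace("\r", "").strip("\n")
    # Mark the poem boundaries and the remaining line breaks.
    text = "START " + text + " END"
    text = text.replace("\n", " N ")
    # Delete every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", text)


print(preprocess_poem("\r\r\nDear Writers,\r\r\nI'm here.\r\r\n"))
# START dear writers N im here END
```

Because the text is lowercased before the tokens are added, the uppercase markers START, END and N cannot collide with ordinary words in the poem.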

In [1]:
import nltk
import pandas as pd

from nltk.tokenize import word_tokenize, sent_tokenize

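# Download the NLTK resources needed for tokenization and POS tagging.
# Note: recent NLTK releases may name these 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead.
for path, package in [
    ('tokenizers/punkt', 'punkt'),
    ('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger'),
]:
    try:
        nltk.data.find(path)
    except LookupError:
        print(f"Downloading nltk.{package}...")
        nltk.download(package)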

# Pandas display options
pd.set_option('display.max_colwidth', None)

In [2]:
df_PoetryFoundationData = pd.read_csv("../data_raw/PoetryFoundationData.csv")
df_PoetryFoundationData.drop(columns=["Title", "Poet", "Unnamed: 0"], inplace=True)
df_PoetryFoundationData.rename(columns={"Poem": "poem", "Tags": "labels"}, inplace=True)
df_PoetryFoundationData.tail(3)

Unnamed: 0,poem,labels
13851,\r\r\n\r\r\n,
13852,"\r\r\n Philosophic\r\r\nin its complex, ovoid emptiness,\r\r\na skillful pundit coined it as a sort\r\r\n of stopgap doorstop for those\r\r\n quaint equations Romans never\r\r\ndreamt of. In form completely clever\r\r\nand discrete—a mirror come unsilvered, loose watch face without the works, a hollowed globe from tip to toe\r\r\nunbroken, it evades the grappling\r\r\nhooks of mass, tilts the thin rim of no thing, remains embryonic sum, non-cogito.\r\r\n","Arts & Sciences,Philosophy"
13853,"\r\r\nDear Writers, I’m compiling the first in what I hope is a series of publications I’m calling artists among artists. The theme for issue 1 is “Faggot Dinosaur.” I hope to hear from you! Thank you and best wishes.","Relationships,Gay, Lesbian, Queer,Arts & Sciences,Poetry & Poets,Social Commentaries,Gender & Sexuality"


In [3]:
df_PoetryFoundationData.isna().sum()

poem        0
labels    955
dtype: int64

### Lowercase everything <a class="anchor" id="11"></a>

In [4]:
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.lower()

### Remove apostrophes <a class="anchor" id="12"></a>

In [5]:
# Remove apostrophes and join the left and right parts, e.g. "i'm" -> "im" (the text is already lowercase)
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.replace("'", "", regex=False)
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.replace("’", "", regex=False)

### Remove \r and \n <a class="anchor" id="13"></a>

In [6]:
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.replace("\r", "", regex=False)
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.strip("\n")
# Drop poems that are empty after cleaning; .copy() avoids a SettingWithCopyWarning on the assignments below
df_PoetryFoundationData = df_PoetryFoundationData[df_PoetryFoundationData["poem"] != ""].copy()

# Add the tokens START and END around each poem, and N for each newline
df_PoetryFoundationData['poem'] = 'START ' + df_PoetryFoundationData['poem'] + ' END'
df_PoetryFoundationData['poem'] = df_PoetryFoundationData['poem'].str.replace('\n', ' N ', regex=False)

### Remove punctuation <a class="anchor" id="14"></a>

In [7]:
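# A raw string avoids invalid-escape warnings for \w and \s.
# Punctuation is deleted rather than replaced by a space, so words joined by a dash are fused (e.g. "discretea").
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.replace(r"[^\w\s]", "", regex=True)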
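
Deleting punctuation outright fuses words that were separated only by a dash or hyphen, which is why tokens such as `discretea` and `noncogito` appear in the cleaned poems. One possible alternative, shown here as a sketch and not as what the notebook does, is to substitute a space and then collapse repeated whitespace. The helper name `strip_punctuation` is hypothetical.

```python
import re


def strip_punctuation(text: str) -> str:
    """Replace punctuation with a space, then collapse runs of whitespace."""
    spaced = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", spaced).strip()


print(strip_punctuation("and discrete\u2014a mirror, non-cogito."))
# and discrete a mirror non cogito
```

With this variant, apostrophes still need to be removed beforehand; otherwise "i'm" would become "i m" instead of "im". Collapsing whitespace is safe at this stage because line breaks have already been turned into N tokens.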

In [8]:
df_PoetryFoundationData.tail(5)

Unnamed: 0,poem,labels
13835,START dear writers im compiling the first in what i hope is a series of publications im calling artists among artists the theme for issue 1 is faggot dinosaur i hope to hear from you thank you and best wishes END,"Relationships,Gay, Lesbian, Queer,Arts & Sciences,Poetry & Poets,Social Commentaries,Gender & Sexuality"
13848,START the wise men will unlearn your name N above your head no star will flame N one weary sound will be the same N the hoarse roar of the gale N the shadows fall from your tired eyes N as your lone bedside candle dies N for here the calendar breeds nights N till stores of candles fail N what prompts this melancholy key N a long familiar melody N it sounds again so let it be N let it sound from this night N let it sound in my hour of death N as gratefulness of eyes and lips N for that which sometimes makes us lift N our gaze to the far sky N you glare in silence at the wall N your stocking gapes no gifts at all N its clear that you are now too old N to trust in good saint nick N that its too late for miracles N but suddenly lifting your eyes N to heavens light you realize N your life is a sheer gift END,"Living,Death,Growing Old,Time & Brevity,Nature,Winter,New Year"
13849,START wed like to talk with you about fear they said so N many people live in fear these days they drove up N all four of them in a small car nice boy they said N beautiful dogs they said so friendly the man ahead N of the woman the other two waiting in the drive i N was outside digging up the garden no one home i said N what are you selling anyway im not interested i N said well you have a nice day they said heres our N card theres a phone number you can call anytime N any other houses down this road anyone else live N here wed like to talk to them about living in fear END,"Living,Social Commentaries,Popular Culture"
13852,START philosophic N in its complex ovoid emptiness N a skillful pundit coined it as a sort N of stopgap doorstop for those N quaint equations romans never N dreamt of in form completely clever N and discretea mirror come unsilvered loose watch face without the works a hollowed globe from tip to toe N unbroken it evades the grappling N hooks of mass tilts the thin rim of no thing remains embryonic sum noncogito END,"Arts & Sciences,Philosophy"
13853,START dear writers im compiling the first in what i hope is a series of publications im calling artists among artists the theme for issue 1 is faggot dinosaur i hope to hear from you thank you and best wishes END,"Relationships,Gay, Lesbian, Queer,Arts & Sciences,Poetry & Poets,Social Commentaries,Gender & Sexuality"


# Part-of-speech tagging <a class="anchor" id="2"></a>

In [20]:
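# Use positional indexing: index labels are no longer contiguous after empty poems were dropped
poem_example = df_PoetryFoundationData["poem"].iloc[0]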
for sent in sent_tokenize(poem_example):
    wordtokens = word_tokenize(sent)
    print(nltk.pos_tag(wordtokens),end='\n\n')

[('START', 'NNP'), ('dog', 'VBZ'), ('bone', 'NN'), ('stapler', 'NN'), ('N', 'NNP'), ('cribbage', 'NN'), ('board', 'NN'), ('garlic', 'JJ'), ('press', 'NN'), ('N', 'NNP'), ('because', 'IN'), ('this', 'DT'), ('window', 'NN'), ('is', 'VBZ'), ('looselacks', 'JJ'), ('N', 'NNP'), ('suction', 'NN'), ('lacks', 'VBZ'), ('grip', 'JJ'), ('N', 'NNP'), ('bungee', 'NN'), ('cord', 'NN'), ('bootstrap', 'NN'), ('N', 'NNP'), ('dog', 'NN'), ('leash', 'NN'), ('leather', 'NN'), ('belt', 'VBD'), ('N', 'NNP'), ('because', 'IN'), ('this', 'DT'), ('window', 'NN'), ('had', 'VBD'), ('sash', 'VBN'), ('cords', 'NNS'), ('N', 'NNP'), ('they', 'PRP'), ('frayed', 'VBD'), ('they', 'PRP'), ('broke', 'VBD'), ('N', 'NNP'), ('feather', 'RB'), ('duster', 'RB'), ('thatch', 'NN'), ('of', 'IN'), ('straw', 'JJ'), ('empty', 'JJ'), ('N', 'NNP'), ('bottle', 'NN'), ('of', 'IN'), ('elmers', 'NNS'), ('glue', 'VBP'), ('N', 'NNP'), ('because', 'IN'), ('this', 'DT'), ('window', 'NN'), ('is', 'VBZ'), ('loudits', 'JJ'), ('hinges', 'NNS'), 
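
In the output above, the markers START and N are tagged as proper nouns (NNP), and the whole poem is tagged as a single sequence. A possible refinement, sketched here under the assumption that the markers should not reach the tagger, is to split the poem into lines on the N token and drop START and END first. The helper name `split_into_lines` is hypothetical.

```python
def split_into_lines(poem: str) -> list[list[str]]:
    """Split a preprocessed poem into lines of word tokens, dropping the markers."""
    lines, current = [], []
    for token in poem.split():
        if token in ("START", "END"):
            continue  # poem boundary markers carry no linguistic content
        if token == "N":
            # A line break: close the current line if it has any words.
            if current:
                lines.append(current)
            current = []
        else:
            current.append(token)
    if current:
        lines.append(current)
    return lines


example = "START dog bone stapler N cribbage board garlic press END"
print(split_into_lines(example))
# [['dog', 'bone', 'stapler'], ['cribbage', 'board', 'garlic', 'press']]
```

Each inner list can then be passed to `nltk.pos_tag`, which accepts a list of tokens, so that every line of the poem is tagged on its own.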

# Data Augmentation <a class="anchor" id="3"></a>

- Stemming and lemmatization
- Rephrasing the text