# Contents
* [Preprocessing](#1)
    * [Lowercase everything](#11)
    * [Remove apostrophes](#12)
    * [Remove \r and \n](#13)
    * [Remove punctuation](#14)
    * [Remove stop words](#15)
* [Part-of-speech tagging](#2) 
* [Data Augmentation](#3)

# Preprocessing <a class="anchor" id="1"></a>

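- Convert all characters to lowercase
- Remove apostrophes, carriage returns, leading and trailing newlines, and punctuation
- Remove English stop words
- Optionally add the tokens START, END and N to mark poem boundaries and line breaks (this step is currently disabled in the code)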
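
The normalization steps above can be sketched on a single string with the standard library alone. This is a minimal illustration under stated assumptions, not the pandas code used in this notebook: the helper name `preprocess_poem` and its `mark_structure` flag are hypothetical, and stop-word removal is left out to keep the sketch short.

```python
import re


def preprocess_poem(text: str, mark_structure: bool = False) -> str:
    """Normalize one poem (hypothetical helper, standard library only)."""
    text = text.lower()
    # Drop straight and typographic apostrophes: "i'm" -> "im".
    text = text.replace("'", "").replace("\u2019", "")
    # Drop carriage returns, then trim leading and trailing newlines.
    text = text.replace("\r", "").strip("\n")
    # Remove everything that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    if mark_structure:
        # Mark poem boundaries and line breaks with explicit tokens.
        # Done after lowercasing so the tokens stay uppercase.
        text = "START " + text.replace("\n", " N ") + " END"
    # Collapse any remaining runs of whitespace into single spaces.
    return " ".join(text.split())


raw = "\r\r\nDear Writers, I\u2019m here.\r\r\nBest wishes!\r\r\n"
print(preprocess_poem(raw))
print(preprocess_poem(raw, mark_structure=True))
```

Adding the tokens after lowercasing and punctuation removal keeps START, END and N distinguishable from ordinary lowercase words in the poem.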

In [11]:
import nltk
import pandas as pd

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
 
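try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    print("Downloading nltk.stopwords...")
    nltk.download('stopwords')

# Download the Punkt models used by sent_tokenize and word_tokenize
try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    print("Downloading nltk.punkt...")
    nltk.download('punkt')

# Download nltk packages for POS Tagging
try:
    nltk.data.find('taggers/averaged_perceptron_tagger')
except LookupError:
    print("Downloading nltk.averaged_perceptron_tagger...")
    nltk.download('averaged_perceptron_tagger')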

# Pandas display options
pd.set_option('display.max_colwidth', None)

In [12]:
df_PoetryFoundationData = pd.read_csv("../data_raw/PoetryFoundationData.csv")
df_PoetryFoundationData.drop(columns=["Title", "Poet", "Unnamed: 0"], inplace=True)
df_PoetryFoundationData.rename(columns={"Poem": "poem", "Tags": "labels"}, inplace=True)
df_PoetryFoundationData.tail(3)

Unnamed: 0,poem,labels
13851,\r\r\n\r\r\n,
13852,"\r\r\n Philosophic\r\r\nin its complex, ovoid emptiness,\r\r\na skillful pundit coined it as a sort\r\r\n of stopgap doorstop for those\r\r\n quaint equations Romans never\r\r\ndreamt of. In form completely clever\r\r\nand discrete—a mirror come unsilvered, loose watch face without the works, a hollowed globe from tip to toe\r\r\nunbroken, it evades the grappling\r\r\nhooks of mass, tilts the thin rim of no thing, remains embryonic sum, non-cogito.\r\r\n","Arts & Sciences,Philosophy"
13853,"\r\r\nDear Writers, I’m compiling the first in what I hope is a series of publications I’m calling artists among artists. The theme for issue 1 is “Faggot Dinosaur.” I hope to hear from you! Thank you and best wishes.","Relationships,Gay, Lesbian, Queer,Arts & Sciences,Poetry & Poets,Social Commentaries,Gender & Sexuality"


In [13]:
df_PoetryFoundationData.isna().sum()

poem        0
labels    955
dtype: int64

### Lowercase everything <a class="anchor" id="11"></a>

In [14]:
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.lower()

### Remove apostrophes <a class="anchor" id="12"></a>

In [15]:
# Remove the apostrophes and join the left and right parts, e.g. "i'm" -> "im" (the text is already lowercase)
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.replace("'", "", regex=False)
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.replace("’", "", regex=False)

### Remove \r and \n <a class="anchor" id="13"></a>

In [16]:
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.replace("\r", "", regex=False)
# Strip only leading and trailing newlines; newlines inside a poem are kept for now
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.strip("\n")
# Drop poems that are empty after cleaning
df_PoetryFoundationData = df_PoetryFoundationData[df_PoetryFoundationData["poem"] != ""]

# Disabled: adding tokens START, END and N (newline) for each poem
# df_PoetryFoundationData['poem'] = 'START ' + df_PoetryFoundationData['poem'] + ' END'
# df_PoetryFoundationData['poem'] = df_PoetryFoundationData['poem'].str.replace('\n', ' N ', regex=False)

### Remove punctuation <a class="anchor" id="14"></a>

In [17]:
# Raw string so that \w and \s reach the regex engine without invalid-escape warnings
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].str.replace(r"[^\w\s]", "", regex=True)
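
Deleting punctuation outright fuses words that were separated only by a dash: the cleaned output of this notebook contains "discretea" (two words originally joined by an em dash) and "noncogito" (originally "non-cogito"). One possible alternative, sketched here on a plain string rather than on the DataFrame, is to replace punctuation with a space and then collapse whitespace. The helper name `strip_punctuation` is hypothetical, and the sketch assumes apostrophes have already been removed as in the previous step.

```python
import re


def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces so adjacent words stay separate."""
    # Anything that is not a word character or whitespace becomes a space.
    spaced = re.sub(r"[^\w\s]", " ", text)
    # Collapse the resulting runs of whitespace into single spaces.
    return " ".join(spaced.split())


print(strip_punctuation("clever and discrete\u2014a mirror, non-cogito."))
```

The trade-off is that hyphenated compounds are split into two tokens ("non cogito") instead of being merged into one, and newlines are collapsed as well, so this variant would not fit a pipeline that still needs line breaks afterwards.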

### Remove stop words <a class="anchor" id="15"></a>

In [18]:
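# Use a separate name so the imported nltk `stopwords` module is not overwritten
# (overwriting it makes this cell fail when it is run a second time)
stop_words = set(stopwords.words('english'))
df_PoetryFoundationData["poem"] = df_PoetryFoundationData["poem"].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))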
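
Because apostrophes are removed before this step, contractions reach the filter as tokens such as "im" and "wed", and both still appear in the cleaned poems. A stop-list entry written with an apostrophe can no longer match them. One way to handle this, shown as a hedged sketch, is to strip apostrophes from the stop list as well. The small `STOP_WORDS` set is a hypothetical stand-in, not the NLTK list, and whether the NLTK list contains a given contraction depends on the installed version.

```python
from typing import Set

# Hypothetical stand-in for a stop list; this is NOT the NLTK list.
STOP_WORDS = {"i", "i'm", "we'd", "the", "to", "in"}


def normalize_stop_words(words: Set[str]) -> Set[str]:
    """Strip apostrophes so entries match apostrophe-free text."""
    return {w.replace("'", "").replace("\u2019", "") for w in words}


def remove_stop_words(text: str, stop_words: Set[str]) -> str:
    """Keep only the whitespace-separated tokens not in stop_words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)


cleaned = "im compiling the first in a series"
print(remove_stop_words(cleaned, STOP_WORDS))
print(remove_stop_words(cleaned, normalize_stop_words(STOP_WORDS)))
```

A side effect to keep in mind is that normalized entries can collide with real words: "wed" (from "we'd") is also an English verb, so this choice trades some recall of meaningful words for cleaner output.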

In [19]:
df_PoetryFoundationData.tail(5)

Unnamed: 0,poem,labels
13835,dear writers im compiling first hope series publications im calling artists among artists theme issue 1 faggot dinosaur hope hear thank best wishes,"Relationships,Gay, Lesbian, Queer,Arts & Sciences,Poetry & Poets,Social Commentaries,Gender & Sexuality"
13848,wise men unlearn name head star flame one weary sound hoarse roar gale shadows fall tired eyes lone bedside candle dies calendar breeds nights till stores candles fail prompts melancholy key long familiar melody sounds let let sound night let sound hour death gratefulness eyes lips sometimes makes us lift gaze far sky glare silence wall stocking gapes gifts clear old trust good saint nick late miracles suddenly lifting eyes heavens light realize life sheer gift,"Living,Death,Growing Old,Time & Brevity,Nature,Winter,New Year"
13849,wed like talk fear said many people live fear days drove four small car nice boy said beautiful dogs said friendly man ahead woman two waiting drive outside digging garden one home said selling anyway im interested said well nice day said heres card theres phone number call anytime houses road anyone else live wed like talk living fear,"Living,Social Commentaries,Popular Culture"
13852,philosophic complex ovoid emptiness skillful pundit coined sort stopgap doorstop quaint equations romans never dreamt form completely clever discretea mirror come unsilvered loose watch face without works hollowed globe tip toe unbroken evades grappling hooks mass tilts thin rim thing remains embryonic sum noncogito,"Arts & Sciences,Philosophy"
13853,dear writers im compiling first hope series publications im calling artists among artists theme issue 1 faggot dinosaur hope hear thank best wishes,"Relationships,Gay, Lesbian, Queer,Arts & Sciences,Poetry & Poets,Social Commentaries,Gender & Sexuality"


# Part-of-speech tagging <a class="anchor" id="2"></a>

In [20]:
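# sent_tokenize and word_tokenize need the NLTK 'punkt' models; the output
# recorded below comes from a run where they were not installed.
# .iloc[0] selects the first remaining poem by position, since rows were filtered out earlier.
poem_example = df_PoetryFoundationData["poem"].iloc[0]
for sent in sent_tokenize(poem_example):
    wordtokens = word_tokenize(sent)
    print(nltk.pos_tag(wordtokens), end='\n\n')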

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/home/cricri/nltk_data'
    - '/home/cricri/Desktop/EPITA_S8/my-env/nltk_data'
    - '/home/cricri/Desktop/EPITA_S8/my-env/share/nltk_data'
    - '/home/cricri/Desktop/EPITA_S8/my-env/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


# Data Augmentation <a class="anchor" id="3"></a>

- Stemming and lemmatization
- Rephrasing text