## Data preprocessing

1. basic
- remove unnecessary columns
- remove outlier
- change to small letters

2. experiment1 : special characters
- don't remove special characters
- remove all special characters

3. experiment2 : words
- use raw corpus 
- use only main words like stems or nouns

4. experiment3 : data augmentation
- use original data without augmentation
- augment only the target according to its standard error, leaving the text data unchanged.
- augment the target according to its standard error, and also add some noise to the text or slightly change some of its words.

5. experiment4 : train/test split
- random split
- split according to class label
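
One possible reading of the target-only augmentation in experiment3 is sketched below. It assumes that each target can be treated as the mean of a normal distribution whose standard deviation is `standard_error`; this interpretation, the function name `augment_targets`, the `copies` parameter and the fixed seed are all illustrative and not taken from the notebook.

```python
import random

def augment_targets(rows, copies=2, seed=0):
    """Return the original rows plus `copies` resampled versions of each row.

    Each row is a dict with "excerpt", "target" and "standard_error" keys.
    Assumption: target ~ Normal(target, standard_error); the text is unchanged.
    """
    rng = random.Random(seed)
    augmented = list(rows)
    for row in rows:
        for _ in range(copies):
            new_row = dict(row)
            # Draw a new target around the original one; the excerpt is copied as-is.
            new_row["target"] = rng.gauss(row["target"], row["standard_error"])
            augmented.append(new_row)
    return augmented

# Illustrative sample row (values taken from the first rows of train.csv).
rows = [
    {"excerpt": "once upon a time there were three bears",
     "target": 0.247197, "standard_error": 0.510845},
]
augmented = augment_targets(rows, copies=3)
print(len(augmented))  # 1 original + 3 resampled rows
```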

In [49]:
import pandas as pd

In [50]:
original_train = pd.read_csv("data/0_original/train.csv")

## 1. basic

**remove unnecessary columns**

In [51]:
data = original_train.drop(["url_legal", "license"], axis=1)
data.head()

|   | id | excerpt | target | standard_error |
|---|---|---|---|---|
| 0 | c12129c31 | When the young people returned to the ballroom... | -0.340259 | 0.464009 |
| 1 | 85aa80a4c | All through dinner time, Mrs. Fayre was somewh... | -0.315372 | 0.480805 |
| 2 | b69ac6792 | As Roger had predicted, the snow departed as q... | -0.580118 | 0.476676 |
| 3 | dd1000b26 | And outside before the palace a great garden w... | -1.054013 | 0.450007 |
| 4 | 37c1b32fb | Once upon a time there were Three Bears who li... | 0.247197 | 0.510845 |


**remove outlier**

In [52]:
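print(len(data))
# Drop the row whose standard_error is 0; .copy() avoids SettingWithCopyWarning on later column assignments.
data = data[data["standard_error"] != 0].copy()
print(len(data))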

2834
2833


**change to small letters**

In [53]:
def change_to_small_letter(x):
    return x.lower()

In [54]:
data["excerpt"] = data["excerpt"].apply(change_to_small_letter)
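
The same lowercasing can also be written with pandas' vectorized string methods, which avoids the Python-level helper. A minimal sketch on a small stand-in frame (the `sample` DataFrame is illustrative, not the competition data):

```python
import pandas as pd

# Stand-in for the notebook's `data` frame; only the column name is taken from it.
sample = pd.DataFrame(
    {"excerpt": ["When the young people returned", "All through dinner time, Mrs. Fayre"]}
)

# Series.str.lower() applies str.lower element-wise and returns a new Series.
sample["excerpt"] = sample["excerpt"].str.lower()
print(sample["excerpt"].tolist())
```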

In [55]:
data.head()

|   | id | excerpt | target | standard_error |
|---|---|---|---|---|
| 0 | c12129c31 | when the young people returned to the ballroom... | -0.340259 | 0.464009 |
| 1 | 85aa80a4c | all through dinner time, mrs. fayre was somewh... | -0.315372 | 0.480805 |
| 2 | b69ac6792 | as roger had predicted, the snow departed as q... | -0.580118 | 0.476676 |
| 3 | dd1000b26 | and outside before the palace a great garden w... | -1.054013 | 0.450007 |
| 4 | 37c1b32fb | once upon a time there were three bears who li... | 0.247197 | 0.510845 |


**save**

In [56]:
import os

In [57]:
if not os.path.isdir("data/1_basic"):
    os.mkdir("data/1_basic")

In [58]:
data.to_csv("data/1_basic/train.csv")
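
`os.mkdir` fails if the parent `data` directory is missing, and the `isdir` check has to be repeated before every save. A small helper built on `pathlib` could cover both cases. `ensure_dir` is a hypothetical name, and the temporary directory is used only so that the sketch does not touch the real `data/` folder.

```python
import tempfile
from pathlib import Path

def ensure_dir(path):
    """Create `path` (and any missing parents) if needed and return it as a Path."""
    directory = Path(path)
    directory.mkdir(parents=True, exist_ok=True)
    return directory

with tempfile.TemporaryDirectory() as root:
    out_dir = ensure_dir(Path(root) / "data" / "1_basic")
    ensure_dir(out_dir)  # calling it again on an existing directory is harmless
    created = out_dir.is_dir()

print(created)
```

In the notebook this could be used as `data.to_csv(ensure_dir("data/1_basic") / "train.csv")`.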

## 2. special characters

**don't remove special characters**

use data/1_basic/train.csv

**remove special characters**

In [59]:
import re
import copy

In [60]:
def remove_by_pattern(x):
    return re.sub("[^a-zA-Z0-9 ]", " ", x)

In [86]:
no_sp_data = copy.deepcopy(data)
no_sp_data["excerpt"] = no_sp_data["excerpt"].apply(remove_by_pattern)
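
Because every removed character is replaced by a space, `remove_by_pattern` leaves runs of two spaces wherever punctuation was followed by a space. This is harmless once the text goes through `nltk.word_tokenize`, but if the cleaned text is used directly the whitespace can be collapsed as well. A sketch, with `remove_and_collapse` as an illustrative name:

```python
import re

def remove_by_pattern(x):
    # Same pattern as the notebook: anything that is not a letter, digit or space becomes a space.
    return re.sub("[^a-zA-Z0-9 ]", " ", x)

def remove_and_collapse(x):
    # Additionally squeeze whitespace runs into a single space and trim both ends.
    return re.sub(r"\s+", " ", remove_by_pattern(x)).strip()

text = "all through dinner time, mrs. fayre"
print(repr(remove_by_pattern(text)))    # 'all through dinner time  mrs  fayre'
print(repr(remove_and_collapse(text)))  # 'all through dinner time mrs fayre'
```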

In [62]:
data.head()

|   | id | excerpt | target | standard_error |
|---|---|---|---|---|
| 0 | c12129c31 | when the young people returned to the ballroom... | -0.340259 | 0.464009 |
| 1 | 85aa80a4c | all through dinner time, mrs. fayre was somewh... | -0.315372 | 0.480805 |
| 2 | b69ac6792 | as roger had predicted, the snow departed as q... | -0.580118 | 0.476676 |
| 3 | dd1000b26 | and outside before the palace a great garden w... | -1.054013 | 0.450007 |
| 4 | 37c1b32fb | once upon a time there were three bears who li... | 0.247197 | 0.510845 |


**save**

In [63]:
if not os.path.isdir("data/2_no_special_cahracters"):
    os.mkdir("data/2_no_special_cahracters")

In [64]:
no_sp_data.to_csv("data/2_no_special_cahracters/train.csv")

## 3. words

In [65]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

**with special characters, split by tokenizer**

In [66]:
stem_data = copy.deepcopy(data)

In [67]:
def split_stem(x):
    return " ".join(nltk.word_tokenize(x))

In [68]:
stem_data["excerpt"] = stem_data["excerpt"].apply(split_stem)

In [69]:
stem_data.loc[0]["excerpt"]

'when the young people returned to the ballroom , it presented a decidedly changed appearance . instead of an interior scene , it was a winter landscape . the floor was covered with snow-white canvas , not laid on smoothly , but rumpled over bumps and hillocks , like a real snow field . the numerous palms and evergreens that had decorated the room , were powdered with flour and strewn with tufts of cotton , like snow . also diamond dust had been lightly sprinkled on them , and glittering crystal icicles hung from the branches . at each end of the room , on the wall , hung a beautiful bear-skin rug . these rugs were for prizes , one for the girls and one for the boys . and this was the game . the girls were gathered at one end of the room and the boys at the other , and one end was called the north pole , and the other the south pole . each player was given a small flag which they were to plant on reaching the pole . this would have been an easy matter , but each traveller was obliged t

This result looks almost the same as splitting on white space; the only difference is that special characters are split off from the words.
So, this data is not used.

**with special characters, other stemmers**

In [76]:
def get_stem_result(string, stemmer):
    stems = []

    for element in string.split(" "):
        stems.append(stemmer.stem(element))

    return " ".join(stems)

In [77]:
stemmer = nltk.stem.SnowballStemmer('english')
get_stem_result(stem_data.loc[0]["excerpt"], stemmer)

'when the young peopl return to the ballroom , it present a decid chang appear . instead of an interior scene , it was a winter landscap . the floor was cover with snow-whit canva , not laid on smooth , but rumpl over bump and hillock , like a real snow field . the numer palm and evergreen that had decor the room , were powder with flour and strewn with tuft of cotton , like snow . also diamond dust had been light sprinkl on them , and glitter crystal icicl hung from the branch . at each end of the room , on the wall , hung a beauti bear-skin rug . these rug were for prize , one for the girl and one for the boy . and this was the game . the girl were gather at one end of the room and the boy at the other , and one end was call the north pole , and the other the south pole . each player was given a small flag which they were to plant on reach the pole . this would have been an easi matter , but each travel was oblig to wear snowsho .'

In [78]:
stemmer = nltk.stem.PorterStemmer()
get_stem_result(stem_data.loc[0]["excerpt"], stemmer)

'when the young peopl return to the ballroom , it present a decidedli chang appear . instead of an interior scene , it wa a winter landscap . the floor wa cover with snow-whit canva , not laid on smoothli , but rumpl over bump and hillock , like a real snow field . the numer palm and evergreen that had decor the room , were powder with flour and strewn with tuft of cotton , like snow . also diamond dust had been lightli sprinkl on them , and glitter crystal icicl hung from the branch . at each end of the room , on the wall , hung a beauti bear-skin rug . these rug were for prize , one for the girl and one for the boy . and thi wa the game . the girl were gather at one end of the room and the boy at the other , and one end wa call the north pole , and the other the south pole . each player wa given a small flag which they were to plant on reach the pole . thi would have been an easi matter , but each travel wa oblig to wear snowsho .'

In [79]:
stemmer = nltk.stem.LancasterStemmer()
get_stem_result(stem_data.loc[0]["excerpt"], stemmer)

'when the young peopl return to the ballroom , it pres a decid chang appear . instead of an intery scen , it was a wint landscap . the flo was cov with snow-white canva , not laid on smooth , but rumpl ov bump and hillock , lik a real snow field . the num palm and evergreen that had dec the room , wer powd with flo and strewn with tuft of cotton , lik snow . also diamond dust had been light sprinkled on them , and glit cryst icic hung from the branch . at each end of the room , on the wal , hung a beauty bear-skin rug . thes rug wer for priz , on for the girl and on for the boy . and thi was the gam . the girl wer gath at on end of the room and the boy at the oth , and on end was cal the nor pol , and the oth the sou pol . each play was giv a smal flag which they wer to plant on reach the pol . thi would hav been an easy mat , but each travel was oblig to wear snowsho .'

The Lancaster stemmer over-stems: it cuts words so aggressively that some of them can no longer be understood (e.g. "interior" becomes "intery", "floor" becomes "flo").

The Porter stemmer is better than the Lancaster stemmer, but still has the same issue (e.g. "was" becomes "wa", "this" becomes "thi").

So, use SnowballStemmer in this case.

※ Special characters carry no meaning in stemmed data, so use the data that does not contain special characters.
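
The differences between the three stemmers can be checked on a handful of words from the first excerpt. The sketch below uses the same NLTK stemmer classes as the cells above; the word list is chosen here for illustration.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

stemmers = {
    "snowball": SnowballStemmer("english"),
    "porter": PorterStemmer(),
    "lancaster": LancasterStemmer(),
}

# Words taken from the first excerpt.
words = ["decidedly", "was", "this", "interior"]

# Map each word to its stem under every stemmer.
results = {
    word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
    for word in words
}

for word, stems in results.items():
    print(word, stems)
```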

In [107]:
stem_data = copy.deepcopy(no_sp_data)

In [108]:
def get_stem_result(string):
    stemmer = nltk.stem.SnowballStemmer('english')
    stems = []
    
    for element in nltk.word_tokenize(string):
        stems.append(stemmer.stem(element))

    return " ".join(stems)

In [109]:
stem_data["excerpt"] = stem_data["excerpt"].apply(get_stem_result)

In [112]:
stem_data.head()

|   | id | excerpt | target | standard_error |
|---|---|---|---|---|
| 0 | c12129c31 | when the young peopl return to the ballroom it... | -0.340259 | 0.464009 |
| 1 | 85aa80a4c | all through dinner time mrs fayr was somewhat ... | -0.315372 | 0.480805 |
| 2 | b69ac6792 | as roger had predict the snow depart as quick ... | -0.580118 | 0.476676 |
| 3 | dd1000b26 | and outsid befor the palac a great garden was ... | -1.054013 | 0.450007 |
| 4 | 37c1b32fb | onc upon a time there were three bear who live... | 0.247197 | 0.510845 |


**save**

In [91]:
if not os.path.isdir("data/3_stemming"):
    os.mkdir("data/3_stemming")

In [92]:
stem_data.to_csv("data/3_stemming/train_just_stemming.csv")

**lemmatizing**

In [98]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [103]:
lemmatizer = WordNetLemmatizer()

lemmas = []

for element in nltk.word_tokenize(no_sp_data.loc[0]["excerpt"]):
    lemmas.append(lemmatizer.lemmatize(element))

print(" ".join(lemmas))

when the young people returned to the ballroom it presented a decidedly changed appearance instead of an interior scene it wa a winter landscape the floor wa covered with snow white canvas not laid on smoothly but rumpled over bump and hillock like a real snow field the numerous palm and evergreen that had decorated the room were powdered with flour and strewn with tuft of cotton like snow also diamond dust had been lightly sprinkled on them and glittering crystal icicle hung from the branch at each end of the room on the wall hung a beautiful bear skin rug these rug were for prize one for the girl and one for the boy and this wa the game the girl were gathered at one end of the room and the boy at the other and one end wa called the north pole and the other the south pole each player wa given a small flag which they were to plant on reaching the pole this would have been an easy matter but each traveller wa obliged to wear snowshoe


In [104]:
lemmatizer = WordNetLemmatizer()

lemmas = []

for element in nltk.word_tokenize(data.loc[0]["excerpt"]):
    lemmas.append(lemmatizer.lemmatize(element))

print(" ".join(lemmas))

when the young people returned to the ballroom , it presented a decidedly changed appearance . instead of an interior scene , it wa a winter landscape . the floor wa covered with snow-white canvas , not laid on smoothly , but rumpled over bump and hillock , like a real snow field . the numerous palm and evergreen that had decorated the room , were powdered with flour and strewn with tuft of cotton , like snow . also diamond dust had been lightly sprinkled on them , and glittering crystal icicle hung from the branch . at each end of the room , on the wall , hung a beautiful bear-skin rug . these rug were for prize , one for the girl and one for the boy . and this wa the game . the girl were gathered at one end of the room and the boy at the other , and one end wa called the north pole , and the other the south pole . each player wa given a small flag which they were to plant on reaching the pole . this would have been an easy matter , but each traveller wa obliged to wear snowshoe .


In [120]:
lemmatizer = WordNetLemmatizer()

lemmas = []

for element in nltk.word_tokenize(stem_data.loc[0]["excerpt"]):
    lemmas.append(lemmatizer.lemmatize(element))

print(" ".join(lemmas))

when the young peopl return to the ballroom it present a decid chang appear instead of an interior scene it wa a winter landscap the floor wa cover with snow white canva not laid on smooth but rumpl over bump and hillock like a real snow field the numer palm and evergreen that had decor the room were powder with flour and strewn with tuft of cotton like snow also diamond dust had been light sprinkl on them and glitter crystal icicl hung from the branch at each end of the room on the wall hung a beauti bear skin rug these rug were for prize one for the girl and one for the boy and this wa the game the girl were gather at one end of the room and the boy at the other and one end wa call the north pole and the other the south pole each player wa given a small flag which they were to plant on reach the pole this would have been an easi matter but each travel wa oblig to wear snowsho


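The lemmatizer handles forms of the verb "be" incorrectly: "was" becomes "wa" in all three outputs. `WordNetLemmatizer.lemmatize` treats each token as a noun unless a POS is passed, so "was" is reduced as if it were a plural. So, don't use the lemmatizer.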

**keep only main words using part-of-speech tags**

In [122]:
from nltk.tag import pos_tag
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [123]:
pos_tag(nltk.word_tokenize(data.loc[0]["excerpt"]))

[('when', 'WRB'),
 ('the', 'DT'),
 ('young', 'JJ'),
 ('people', 'NNS'),
 ('returned', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('ballroom', 'NN'),
 (',', ','),
 ('it', 'PRP'),
 ('presented', 'VBD'),
 ('a', 'DT'),
 ('decidedly', 'RB'),
 ('changed', 'VBN'),
 ('appearance', 'NN'),
 ('.', '.'),
 ('instead', 'RB'),
 ('of', 'IN'),
 ('an', 'DT'),
 ('interior', 'JJ'),
 ('scene', 'NN'),
 (',', ','),
 ('it', 'PRP'),
 ('was', 'VBD'),
 ('a', 'DT'),
 ('winter', 'NN'),
 ('landscape', 'NN'),
 ('.', '.'),
 ('the', 'DT'),
 ('floor', 'NN'),
 ('was', 'VBD'),
 ('covered', 'VBN'),
 ('with', 'IN'),
 ('snow-white', 'JJ'),
 ('canvas', 'NN'),
 (',', ','),
 ('not', 'RB'),
 ('laid', 'VBN'),
 ('on', 'IN'),
 ('smoothly', 'NN'),
 (',', ','),
 ('but', 'CC'),
 ('rumpled', 'VBD'),
 ('over', 'IN'),
 ('bumps', 'NNS'),
 ('and', 'CC'),
 ('hillocks', 'NNS'),
 (',', ','),
 ('like', 'IN'),
 ('a', 'DT'),
 ('real', 'JJ'),
 ('snow', 'JJ'),
 ('field', 'NN'),
 ('.', '.'),
 ('the', 'DT'),
 ('numerous', 'JJ'),
 ('palms', 'NNS'),
 ('

In [125]:
pos_tag(nltk.word_tokenize(no_sp_data.loc[0]["excerpt"]))

[('when', 'WRB'),
 ('the', 'DT'),
 ('young', 'JJ'),
 ('people', 'NNS'),
 ('returned', 'VBD'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('ballroom', 'NN'),
 ('it', 'PRP'),
 ('presented', 'VBD'),
 ('a', 'DT'),
 ('decidedly', 'RB'),
 ('changed', 'VBN'),
 ('appearance', 'NN'),
 ('instead', 'RB'),
 ('of', 'IN'),
 ('an', 'DT'),
 ('interior', 'JJ'),
 ('scene', 'NN'),
 ('it', 'PRP'),
 ('was', 'VBD'),
 ('a', 'DT'),
 ('winter', 'NN'),
 ('landscape', 'NN'),
 ('the', 'DT'),
 ('floor', 'NN'),
 ('was', 'VBD'),
 ('covered', 'VBN'),
 ('with', 'IN'),
 ('snow', 'JJ'),
 ('white', 'JJ'),
 ('canvas', 'NN'),
 ('not', 'RB'),
 ('laid', 'VBN'),
 ('on', 'IN'),
 ('smoothly', 'RB'),
 ('but', 'CC'),
 ('rumpled', 'VBD'),
 ('over', 'IN'),
 ('bumps', 'NNS'),
 ('and', 'CC'),
 ('hillocks', 'NNS'),
 ('like', 'IN'),
 ('a', 'DT'),
 ('real', 'JJ'),
 ('snow', 'JJ'),
 ('field', 'NN'),
 ('the', 'DT'),
 ('numerous', 'JJ'),
 ('palms', 'NNS'),
 ('and', 'CC'),
 ('evergreens', 'VBZ'),
 ('that', 'WDT'),
 ('had', 'VBD'),
 ('decorated', 'VBN'

In [126]:
pos_tag(nltk.word_tokenize(stem_data.loc[0]["excerpt"]))

[('when', 'WRB'),
 ('the', 'DT'),
 ('young', 'JJ'),
 ('peopl', 'NN'),
 ('return', 'NN'),
 ('to', 'TO'),
 ('the', 'DT'),
 ('ballroom', 'NN'),
 ('it', 'PRP'),
 ('present', 'VBD'),
 ('a', 'DT'),
 ('decid', 'NN'),
 ('chang', 'NN'),
 ('appear', 'VBP'),
 ('instead', 'RB'),
 ('of', 'IN'),
 ('an', 'DT'),
 ('interior', 'JJ'),
 ('scene', 'NN'),
 ('it', 'PRP'),
 ('was', 'VBD'),
 ('a', 'DT'),
 ('winter', 'NN'),
 ('landscap', 'NN'),
 ('the', 'DT'),
 ('floor', 'NN'),
 ('was', 'VBD'),
 ('cover', 'RB'),
 ('with', 'IN'),
 ('snow', 'JJ'),
 ('white', 'JJ'),
 ('canva', 'NN'),
 ('not', 'RB'),
 ('laid', 'VBN'),
 ('on', 'IN'),
 ('smooth', 'NNS'),
 ('but', 'CC'),
 ('rumpl', 'NN'),
 ('over', 'IN'),
 ('bump', 'NN'),
 ('and', 'CC'),
 ('hillock', 'NN'),
 ('like', 'IN'),
 ('a', 'DT'),
 ('real', 'JJ'),
 ('snow', 'JJ'),
 ('field', 'NN'),
 ('the', 'DT'),
 ('numer', 'JJ'),
 ('palm', 'NN'),
 ('and', 'CC'),
 ('evergreen', 'NN'),
 ('that', 'WDT'),
 ('had', 'VBD'),
 ('decor', 'VBN'),
 ('the', 'DT'),
 ('room', 'NN'),
 ('we

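Sometimes a stemmed word gets a different POS tag from the original word (e.g. "returned" is tagged VBD but "return" is tagged NN; "covered" is tagged VBN but "cover" is tagged RB). So, POS tagging should be applied before stemming.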

In [130]:
stop_list = ["DT", "EX", "JJ", "JJR", "JJS", "LS", "RB", "RBR", "RBS", "UH"]

In [129]:
def pos_tagging(x):
    words = []
    word_pos = pos_tag(nltk.word_tokenize(x))
    
    for word, pos in word_pos:
        if pos not in stop_list:
            words.append(word)
            
    return " ".join(words)

In [131]:
pos_data = copy.deepcopy(no_sp_data)
pos_data["excerpt"] = pos_data["excerpt"].apply(pos_tagging)

In [132]:
pos_data.loc[0]["excerpt"]

'when people returned to ballroom it presented changed appearance of scene it was winter landscape floor was covered with canvas laid on but rumpled over bumps and hillocks like field palms and evergreens that had decorated room were powdered with flour and strewn with tufts of cotton like snow diamond dust had been sprinkled on them and glittering icicles hung from branches at end of room on wall hung bear rug rugs were for prizes one for girls and one for boys and was game girls were gathered at one end of room and boys at and one end was called pole and player was given flag which they were to plant on reaching pole would have been matter but traveller was obliged to wear snowshoes'
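
The filtering step in `pos_tagging` can be looked at separately from the tagger. The sketch below applies the same tag list to (word, tag) pairs copied from the tagger output above and uses a set for the membership test; `filter_by_pos` and `EXCLUDED_TAGS` are illustrative names.

```python
# Same Penn Treebank tags as `stop_list` in the notebook, stored as a set.
EXCLUDED_TAGS = {"DT", "EX", "JJ", "JJR", "JJS", "LS", "RB", "RBR", "RBS", "UH"}

def filter_by_pos(tagged_words, excluded_tags=EXCLUDED_TAGS):
    """Keep only the words whose POS tag is not in `excluded_tags`."""
    return " ".join(word for word, pos in tagged_words if pos not in excluded_tags)

# First sentence of the first excerpt, as tagged on the data without special characters.
tagged = [
    ("when", "WRB"), ("the", "DT"), ("young", "JJ"), ("people", "NNS"),
    ("returned", "VBD"), ("to", "TO"), ("the", "DT"), ("ballroom", "NN"),
    ("it", "PRP"), ("presented", "VBD"), ("a", "DT"), ("decidedly", "RB"),
    ("changed", "VBN"), ("appearance", "NN"),
]

print(filter_by_pos(tagged))
# when people returned to ballroom it presented changed appearance
```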

In [133]:
pos_data.to_csv("data/3_stemming/train_remove_some_pos.csv")

**nltk stop words**

In [135]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [136]:
from nltk.corpus import stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [137]:
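# Build the stop-word set once instead of reloading the list for every token.
stop_words = set(stopwords.words('english'))

def remove_stop_words(x):
    words = []
    
    for element in nltk.word_tokenize(x):
        if element not in stop_words:
            words.append(element)

    return " ".join(words)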

In [138]:
stop_data = copy.deepcopy(no_sp_data)
stop_data["excerpt"] = stop_data["excerpt"].apply(remove_stop_words)

In [140]:
stop_data.loc[0]["excerpt"]

'young people returned ballroom presented decidedly changed appearance instead interior scene winter landscape floor covered snow white canvas laid smoothly rumpled bumps hillocks like real snow field numerous palms evergreens decorated room powdered flour strewn tufts cotton like snow also diamond dust lightly sprinkled glittering crystal icicles hung branches end room wall hung beautiful bear skin rug rugs prizes one girls one boys game girls gathered one end room boys one end called north pole south pole player given small flag plant reaching pole would easy matter traveller obliged wear snowshoes'

In [141]:
stop_data.to_csv("data/3_stemming/train_remove_nltk_stop_words.csv")

Number of words

train_just_stemming > train_remove_some_pos > train_remove_nltk_stop_words
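
This ordering can be checked by counting whitespace-separated tokens in each variant. The sketch below does so on the opening sentence of the first excerpt from each file; in the notebook the same function could be applied to the whole `excerpt` column of each frame. `count_words` is an illustrative helper.

```python
def count_words(texts):
    """Total number of whitespace-separated tokens across all texts."""
    return sum(len(text.split()) for text in texts)

# Opening sentence of the first excerpt in each variant, copied from the outputs above.
variants = {
    "train_just_stemming": [
        "when the young peopl return to the ballroom it present a decid chang appear"
    ],
    "train_remove_some_pos": [
        "when people returned to ballroom it presented changed appearance"
    ],
    "train_remove_nltk_stop_words": [
        "young people returned ballroom presented decidedly changed appearance"
    ],
}

counts = {name: count_words(texts) for name, texts in variants.items()}
print(counts)
```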