# English Data PreProcessing

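## [Key Considerations]
1. Handling `dot(.)` and `apostrophe(')`
    - How should abbreviations such as 'u.s.' and 'u.s.s.r.' be handled?
    - How should possessives such as 'america's' be handled?
        1. During the initial cleaning, `dot(.)` and `apostrophe(')` are not removed
            - `dot(.)`
                - Kept so that abbreviations such as 'u.s.' and 'u.s.s.r.' survive
            - `apostrophe(')`
                - Kept so that possessives such as 'america's' survive and the 's can be split off during tokenizing.
        2. After tokenizing, special characters are removed from every token except those that must keep their `dot(.)` or `apostrophe(')`
            1. Print the tokens containing `apostrophe(')` or `dot(.)` and decide which tokens to keep
            2. Remove `apostrophe(')` and other special characters from every token except those that must keep their `apostrophe(')`
                - Every `dot(.)` is kept at this step, because dots are removed in the next step with their own exception list
            3. Remove `dot(.)` and other special characters from every token except those that must keep their `dot(.)`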
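
The post-tokenizing step above can be sketched as follows. This is a minimal illustration, not the implementation in `myModules`: the helper name `strip_symbols` is hypothetical, exact-match lookup against the exception list is an assumption, and the separate apostrophe and dot passes are collapsed into one for brevity.

```python
import re

def strip_symbols(tokens, keep):
    """Drop every non-alphanumeric character from tokens not listed in `keep`.

    Hypothetical helper: tokens are assumed to be lowercased already.
    """
    keep = set(keep)
    cleaned = []
    for token in tokens:
        if token in keep:
            cleaned.append(token)  # protected abbreviation or clitic, left untouched
        else:
            cleaned.append(re.sub(r"[^a-z0-9]", "", token))
    return cleaned

tokens = ["america", "'s", "u.s.", "'liberty", "camps.if", "."]
result = strip_symbols(tokens, keep=["'s", "u.s."])
print(result)
```

Tokens made only of symbols collapse to an empty string here, which is why a later pass that discards empty and one-character tokens is still needed.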

## 1. Module Import

In [1]:
# self defined Modules
from myModules.utils import DataLoader, merge
from myModules.preprocess.english import cleaning, remove_stopwords, tagging, dot_and_apostrophe, convert_pos, lemmatization, to_pickle, check_pos

# General Modules
import pandas as pd
import numpy as np
import warnings
from tqdm.notebook import tqdm
import pickle
import re
import glob

warnings.filterwarnings('ignore')

# NLP
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## 2. Data Loader

In [2]:
DATA_ROOT = './Data/3구간/'

PERIOD_1 = DATA_ROOT + '1시기/1시기_ST/'
PERIOD_2 = DATA_ROOT + '2시기/2시기_ST/'
PERIOD_3 = DATA_ROOT + '3시기/3시기_ST/'

RESULT_ROOT = './Result/3구간/'

RESULT_1 = RESULT_ROOT + '1시기/ST/'
RESULT_2 = RESULT_ROOT + '2시기/ST/'
RESULT_3 = RESULT_ROOT + '3시기/ST/'

In [3]:
files_list_1 = glob.glob(PERIOD_1+'*.txt')
files_list_2 = glob.glob(PERIOD_2+'*.txt')
files_list_3 = glob.glob(PERIOD_3+'*.txt')

texts_1 = DataLoader(files_list_1, mode='ST')
texts_2 = DataLoader(files_list_2, mode='ST')
texts_3 = DataLoader(files_list_3, mode='ST')

## 3. PreProcess

### 3-1. Data Cleaning

- `dot(.)` and `apostrophe(')` are not removed at this stage

In [4]:
cleaned_1 = cleaning(data=texts_1)
cleaned_2 = cleaning(data=texts_2)
cleaned_3 = cleaning(data=texts_3)
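
The source of `cleaning` is not shown in this notebook. As a rough sketch of a cleaning pass that preserves dots and apostrophes, something like the following could be used. The function name `clean_text` and the exact character whitelist are assumptions, not the behavior of `myModules.preprocess.english.cleaning`.

```python
import re

def clean_text(text):
    """Lowercase the text and keep only letters, digits, whitespace, '.' and "'"."""
    text = text.lower()
    # Replace every other character with a space so words do not fuse together.
    text = re.sub(r"[^a-z0-9\s.']", " ", text)
    # Collapse the runs of whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", text).strip()

sample = "America's policy toward the U.S.S.R. (1947) -- revisited!"
cleaned = clean_text(sample)
print(cleaned)
```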

### 3-2. Tokenizing

In [5]:
tokenized_1 = [word_tokenize(text) for text in cleaned_1]
tokenized_2 = [word_tokenize(text) for text in cleaned_2]
tokenized_3 = [word_tokenize(text) for text in cleaned_3]

#### Period 1

In [6]:
symbol = dot_and_apostrophe(data=tokenized_1)

##### Inspect tokens containing an apostrophe or a dot

In [7]:
symbol.token_with_apostrophe()
symbol.token_with_dot()

tokens with apostrophe : 
{"'ve", "o'clock", "'madam", "n't", "'m", "'are", "'liberty", "'ll", "'german", "'", "'system", "'s", "'heat", "'d", "'democracy", "'blamed", "'into", "'mvd", "'structure"}
tokens with dot : 
{'co.', 'a.m.', 'i.', 'm.', 'ph.d.', 'n.', 'col.', 't.', 'u.s.s.r.', 'jr.', 'frightened.to', 'u.s.', 'dr.', '...', 'mrs.', 'p.m.', '.', 'mr.', 'st.', 'u.n.', 'f.', 's.', 'u.', 'a.', 'e.', '..', 'camps.if', 'oct.', 'w.', 'gen.', 'p.', 'messrs.', 'v.'}
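
A listing like the one above can be produced with a set comprehension over the tokenized documents. The sketch below assumes the data is a list of token lists; `tokens_with` is a hypothetical helper, not a method of `dot_and_apostrophe`.

```python
def tokens_with(symbol, documents):
    """Return the set of distinct tokens that contain `symbol`."""
    return {token for doc in documents for token in doc if symbol in token}

docs = [
    ["mr.", "smith", "'s", "house", "."],
    ["u.s.", "policy", "was", "n't", "clear"],
]
with_dot = tokens_with(".", docs)
with_apostrophe = tokens_with("'", docs)
print(with_apostrophe)
print(with_dot)
```

Collecting a set rather than a list keeps the output short enough to review by eye when choosing the exception lists.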


##### Set the exception lists

In [8]:
apostrophe_exception = ["'ll", "'s", "'ve", "n't"]
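# NOTE: "ph.d" lacks the trailing dot, so the token 'ph.d.' is not protected and is listed among the processed tokens in the dot step.
dot_exception = ["u.s.s.r.", "dr.", "messrs.", "gen.", "u.n.", "a.m.", "st.", "u.s.", "ph.d", "jr.", "p.m.", "mrs.", "mr."]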

symbol.set_exception(apostrophe_exception=apostrophe_exception, dot_exception=dot_exception)

In [9]:
symbol.print_exception()

apostrophe exceptions : 
["'ll", "'s", "'ve", "n't"]
dot exceptions : 
['u.s.s.r.', 'dr.', 'messrs.', 'gen.', 'u.n.', 'a.m.', 'st.', 'u.s.', 'ph.d', 'jr.', 'p.m.', 'mrs.', 'mr.']


##### Handle apostrophes

In [10]:
tokenized_1_ = symbol.remove_apostrophe(data=tokenized_1)

Processed Tokens : 
{"'are", "'d", "'liberty", "'democracy", "o'clock", "'mvd", "'blamed", "'german", "'", "'madam", "'system", "'into", "'structure", "'m", "'heat"}


##### Handle dots

In [11]:
tokenized_1__ = symbol.remove_dot(data=tokenized_1_)

Processed Tokens : 
{'', 'co.', 'i.', 'm.', 'ph.d.', 'col.', 'n.', 't.', 'frightened.to', '...', '.', 'f.', 's.', 'u.', 'a.', 'e.', '..', 'camps.if', 'oct.', 'w.', 'p.', 'v.'}


##### Check for tokens that should be removed

In [12]:
symbol.check_invalid_tokens(data=tokenized_1__)

Remaining invalid Symbol : {'', 'i', 'g', 'e', 'p', 'f', 'x', 'h', 'm', 'b', 's', 'a', 'v', 'n', 'w', 'd', 'j', 'k', 'o', 't', 'y', 'r', 'u'}


##### Remove tokens of length 1 or consisting of unneeded special characters

In [13]:
tokenized_1___ = symbol.remove_invalid_tokens(data=tokenized_1__)

Removed Tokens : 
{'', 'i', 'g', 'e', 'p', 'f', 'x', 'h', 'm', 'b', 's', 'a', 'v', 'n', 'w', 'd', 'j', 'k', 'o', 't', 'y', 'r', 'u'}


##### Check whether any invalid tokens remain

In [14]:
symbol.check_invalid_tokens(data=tokenized_1___)

There is no invalid symbol
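
The check-remove-recheck cycle above can be approximated as follows. This sketch assumes that "invalid" means an empty string or a single character, which matches the printed sets but is not confirmed by the module source; `find_invalid` and `drop_invalid` are hypothetical names.

```python
def find_invalid(documents, min_length=2):
    """Return the distinct tokens shorter than `min_length` characters."""
    return {token for doc in documents for token in doc if len(token) < min_length}

def drop_invalid(documents, min_length=2):
    """Return the documents with all too-short tokens filtered out."""
    return [[token for token in doc if len(token) >= min_length] for doc in documents]

docs = [["soviet", "", "u", "policy"], ["mr.", "s", "truman"]]
before = find_invalid(docs)
filtered = drop_invalid(docs)
after = find_invalid(filtered)
print(before, filtered, after)
```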


#### Period 2

In [15]:
symbol = dot_and_apostrophe(data=tokenized_2)

##### Inspect tokens containing an apostrophe or a dot

In [16]:
symbol.token_with_apostrophe()
symbol.token_with_dot()

tokens with apostrophe : 
{"'reprisals", "'", "'for", "n't", "'m", "'s"}
tokens with dot : 
{'r.', 'tyranny.the', 'u.s.a.', 'i.', 'm.', 'n.', 't.', 'u.s.s.r.', 'g.', 'b.', 'l.', 'dr.', 'h.', 'p.m.', '.', 'mr.', 'o.', 's.', 'a.', 'e.', '..', 'w.', 'gen.', 'p.', 'c.', 'messrs.', 'v.'}


##### Set the exception lists

In [17]:
apostrophe_exception = ["'s", "n't"]
dot_exception = ["u.s.s.r.", "dr.", "messrs.", "gen.", "u.s.a.", "p.m.", "mr."]

symbol.set_exception(apostrophe_exception=apostrophe_exception, dot_exception=dot_exception)

In [18]:
symbol.print_exception()

apostrophe exceptions : 
["'s", "n't"]
dot exceptions : 
['u.s.s.r.', 'dr.', 'messrs.', 'gen.', 'u.s.a.', 'p.m.', 'mr.']


##### Handle apostrophes

In [19]:
tokenized_2_ = symbol.remove_apostrophe(data=tokenized_2)

Processed Tokens : 
{"'for", "'reprisals", "'", "'m"}


##### Handle dots

In [20]:
tokenized_2__ = symbol.remove_dot(data=tokenized_2_)

Processed Tokens : 
{'', 'r.', 'tyranny.the', 'i.', 'm.', 'n.', 't.', 'g.', 'b.', 'l.', 'h.', '.', 'o.', 's.', 'a.', 'e.', '..', 'w.', 'p.', 'c.', 'v.'}


##### Check for tokens that should be removed

In [21]:
symbol.check_invalid_tokens(data=tokenized_2__)

Remaining invalid Symbol : {'', 'i', 'g', 'e', 'p', 'f', 'c', 'h', 'm', 'b', 's', 'a', 'v', 'n', 'w', 'd', 'o', 'l', 't', 'r'}


##### Remove tokens of length 1 or consisting of unneeded special characters

In [22]:
tokenized_2___ = symbol.remove_invalid_tokens(data=tokenized_2__)

Removed Tokens : 
{'', 'i', 'g', 'e', 'p', 'f', 'c', 'h', 'm', 'b', 's', 'a', 'v', 'n', 'w', 'd', 'o', 'l', 't', 'r'}


##### Check whether any invalid tokens remain

In [23]:
symbol.check_invalid_tokens(data=tokenized_2___)

There is no invalid symbol


#### Period 3

In [24]:
symbol = dot_and_apostrophe(tokenized_3)

##### Inspect tokens containing an apostrophe or a dot

In [25]:
symbol.token_with_apostrophe()
symbol.token_with_dot()

tokens with apostrophe : 
{"'ve", "'has", "'d", "'vas", "o'clock", "'spontaneous", "'recession", "'ll", "'", "n't", "'s"}
tokens with dot : 
{'r.', 's.s.r', 'a.m.', 'i.', 'm.', 'col.', 'n.', 't.', '..................', 'g.', 'u.s.s.r.', 'jr.', 'b.', 'l.', 'dr.', '...', 'prof.', 'j.', 'mrs.', 'h.', 'maj.', 'p.m.', '.', 'mr.', 'o.', 'st.', 'd.', 'f.', 's.', 'u.', 'a.', 'u.n.r.r.a', 'e.', 'w.', 'gen.', 'p.', 'c.', 'v.'}


##### Set the exception lists

In [26]:
apostrophe_exception = ["'ll", "'s", "'ve", "n't"]
dot_exception = ["u.s.s.r.", "dr.", "s.s.r", "a.m.", "st.", "prof.", "u.n.r.r.a", "jr.", "maj.", "p.m.", "mrs.", "mr."]

symbol.set_exception(apostrophe_exception=apostrophe_exception, dot_exception=dot_exception)

In [27]:
symbol.print_exception()

apostrophe exceptions : 
["'ll", "'s", "'ve", "n't"]
dot exceptions : 
['u.s.s.r.', 'dr.', 's.s.r', 'a.m.', 'st.', 'prof.', 'u.n.r.r.a', 'jr.', 'maj.', 'p.m.', 'mrs.', 'mr.']


##### Handle apostrophes

In [28]:
tokenized_3_ = symbol.remove_apostrophe(tokenized_3)

Processed Tokens : 
{'``', "'has", "'d", "'vas", "o'clock", "'spontaneous", "'recession", "'"}


##### Handle dots

In [29]:
tokenized_3__ = symbol.remove_dot(tokenized_3_)

Processed Tokens : 
{'', 'r.', 'i.', 'm.', 'n.', 'col.', 't.', '..................', 'g.', 'b.', 'l.', '...', 'j.', 'h.', '.', 'o.', 'd.', 'f.', 's.', 'u.', 'a.', 'e.', 'w.', 'gen.', 'p.', 'c.', 'v.'}


##### Check for tokens that should be removed

In [30]:
symbol.check_invalid_tokens(tokenized_3__)

Remaining invalid Symbol : {'', 'i', 'g', 'e', 'p', 'f', 'x', 'c', 'h', 'm', 'b', 's', 'a', 'v', 'n', 'w', 'd', 'j', 'o', 'l', 't', 'r', 'u'}


##### Remove tokens of length 1 or consisting of unneeded special characters

In [31]:
tokenized_3___ = symbol.remove_invalid_tokens(tokenized_3__)

Removed Tokens : 
{'', 'i', 'g', 'e', 'p', 'f', 'x', 'c', 'h', 'm', 'b', 's', 'a', 'v', 'n', 'w', 'd', 'j', 'o', 'l', 't', 'r', 'u'}


##### Check whether any invalid tokens remain

In [32]:
symbol.check_invalid_tokens(tokenized_3___)

There is no invalid symbol


### 3-3. Remove StopWords

In [33]:
stopwords = nltk.corpus.stopwords.words('english')
new_stopwords = ['would', 'could', 'might', 'need', 'can', 'must',
    'one', 'two', 'upon', 'may', 'perhaps', 'living', 'seem', 'also', 'ii', 'ofthe',
    'much', 'therefore', "'ll", "'ve", "n't"]

wo_stopword_1 = remove_stopwords(tokenized_1___, stopwords, new_stopwords)
wo_stopword_2 = remove_stopwords(tokenized_2___, stopwords, new_stopwords)
wo_stopword_3 = remove_stopwords(tokenized_3___, stopwords, new_stopwords)
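
As a rough sketch of what a stop-word filter with a base list plus a custom list might look like: `remove_stopwords_sketch` is a hypothetical stand-in, and the real `remove_stopwords` may differ, for example in how it treats case. The short lists below stand in for the NLTK stop-word list and `new_stopwords`.

```python
def remove_stopwords_sketch(documents, *stopword_lists):
    """Drop every token that appears in any of the given stop-word lists."""
    stop = set().union(*stopword_lists)  # one set gives O(1) membership tests
    return [[token for token in doc if token not in stop] for doc in documents]

base = ["the", "of", "is"]         # stand-in for nltk's English stop words
extra = ["would", "also", "n't"]   # stand-in for the custom additions
docs = [["the", "policy", "would", "n't", "change"], ["also", "mr.", "truman"]]
kept = remove_stopwords_sketch(docs, base, extra)
print(kept)
```

Adding the clitics `'ll`, `'ve`, and `n't` to the custom list is what finally removes the tokens that were protected during the apostrophe step, while `'s` is left in place for tagging.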

### 3-4. Tagging

In [34]:
pos_table = pd.read_pickle("processed-data/pos-table.pkl")

In [35]:
tagged_1 = tagging(wo_stopword_1)
tagged_2 = tagging(wo_stopword_2)
tagged_3 = tagging(wo_stopword_3)

#### Period 1

In [36]:
pos = check_pos(tagged_1)

In [37]:
pos.pos_with_symbol()

tagged token with apostrophe : 
{"'s": {'POS'}}
tagged token with dot : 
{'u.s.s.r.': {'JJ', 'VBP'}, 'jr.': {'NN', 'VBP'}, 'p.m.': {'RB'}, 'u.s.': {'JJ'}, 'mr.': {'JJ', 'RB', 'VBP', 'NNP', 'RBS', 'NN'}, 'gen.': {'JJ', 'NN', 'VBP'}, 'dr.': {'JJ', 'VBP'}, 'st.': {'JJ', 'NN'}, 'messrs.': {'NN'}, 'a.m.': {'JJ'}, 'u.n.': {'NN'}, 'mrs.': {'NNS'}}


In [38]:
pos.pos_without_symbol()

tagged token without apostrophe : 
{'s': ['NN']}
tagged token without dot : 
{'ussr': ['NN'], 'jr': ['NN'], 'pm': ['NN'], 'us': ['PRP'], 'mr': ['NN'], 'gen': ['NN'], 'dr': ['NN'], 'st': ['NN'], 'messrs': ['NN'], 'am': ['VBP'], 'un': ['NN'], 'mrs': ['NN']}
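
The per-token tag sets printed by `pos_with_symbol` can be gathered with a dictionary of sets. The sketch below assumes the tagged data is a list of documents, each a list of `(token, tag)` pairs; `tags_by_token` is a hypothetical helper, and the sample tags are made up for illustration.

```python
from collections import defaultdict

def tags_by_token(tagged_documents, symbol):
    """Map each token containing `symbol` to the set of POS tags it received."""
    tags = defaultdict(set)
    for doc in tagged_documents:
        for token, tag in doc:
            if symbol in token:
                tags[token].add(tag)
    return dict(tags)

tagged = [
    [("mr.", "NNP"), ("truman", "NN"), ("'s", "POS")],
    [("mr.", "JJ"), ("u.s.", "JJ"), ("policy", "NN")],
]
dot_tags = tags_by_token(tagged, ".")
apostrophe_tags = tags_by_token(tagged, "'")
print(apostrophe_tags)
print(dot_tags)
```

A token that maps to more than one tag, as `mr.` does in the output above, is a sign that the tagger is guessing from context rather than recognizing the abbreviation.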


#### Period 2

In [39]:
pos = check_pos(tagged_2)

In [40]:
pos.pos_with_symbol()

tagged token with apostrophe : 
{"'s": {'POS'}}
tagged token with dot : 
{'u.s.s.r.': {'JJ'}, 'p.m.': {'JJ'}, 'mr.': {'JJ', 'FW', 'NNS', 'VBP', 'NNP', 'VB', 'RBS', 'NN', 'VBZ'}, 'u.s.a.': {'NN'}, 'gen.': {'JJ'}, 'dr.': {'NN'}, 'messrs.': {'NNS'}}


In [41]:
pos.pos_without_symbol()

tagged token without apostrophe : 
{'s': ['NN']}
tagged token without dot : 
{'ussr': ['NN'], 'pm': ['NN'], 'mr': ['NN'], 'usa': ['NN'], 'gen': ['NN'], 'dr': ['NN'], 'messrs': ['NN']}


#### Period 3

In [42]:
pos = check_pos(tagged_3)

In [43]:
pos.pos_with_symbol()

tagged token with apostrophe : 
{"'s": {'POS'}}
tagged token with dot : 
{'u.s.s.r.': {'JJ'}, 'jr.': {'NN'}, 'p.m.': {'RB', 'NN', 'VBP'}, 'mr.': {'JJ', 'RBR', 'FW', 'VBD', 'NNS', 'RB', 'VBP', 'NNP', 'VB', 'RBS', 'NN', 'VBZ'}, 's.s.r': {'NN'}, 'dr.': {'JJ', 'NN', 'VBZ', 'VBP'}, 'st.': {'JJ'}, 'maj.': {'NN'}, 'a.m.': {'RB', 'NN', 'VBD'}, 'prof.': {'NN'}, 'mrs.': {'NN'}, 'u.n.r.r.a': {'RB', 'JJ'}}


In [44]:
pos.pos_without_symbol()

tagged token without apostrophe : 
{'s': ['NN']}
tagged token without dot : 
{'ussr': ['NN'], 'jr': ['NN'], 'pm': ['NN'], 'mr': ['NN'], 'ssr': ['NN'], 'dr': ['NN'], 'st': ['NN'], 'maj': ['NN'], 'am': ['VBP'], 'prof': ['NN'], 'mrs': ['NN'], 'unrra': ['NN']}


### 3-5. Address POS of Tokens with Symbols

In [45]:
tagged_1_ = convert_pos(data=tagged_1, key=".", target_pos="NN")
tagged_2_ = convert_pos(data=tagged_2, key=".", target_pos="NN")
tagged_3_ = convert_pos(data=tagged_3, key=".", target_pos="NN")
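
Because the tagger assigns inconsistent tags to dotted abbreviations, every token containing the key is forced to a single tag. A minimal sketch of that idea, assuming `(token, tag)` pairs and returning a new structure rather than mutating the input, is shown below; `force_pos` is a hypothetical name and the real `convert_pos` may behave differently.

```python
def force_pos(tagged_documents, key, target_pos):
    """Return a copy in which every token containing `key` is tagged `target_pos`."""
    return [
        [(token, target_pos if key in token else tag) for token, tag in doc]
        for doc in tagged_documents
    ]

tagged = [[("mr.", "VBP"), ("truman", "NN"), ("'s", "POS"), ("u.s.s.r.", "JJ")]]
converted = force_pos(tagged, key=".", target_pos="NN")
print(converted)
```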

### 3-6. Lemmatization

In [46]:
lemmatizer = WordNetLemmatizer()

#### All pos

In [47]:
lemmatize = lemmatization(tagged_1, lemmatizer, pos_table, allowed_pos=['noun', 'verb', 'adjective', 'adverb'])
lemmatized_1_all = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_2, lemmatizer, pos_table, allowed_pos=['noun', 'verb', 'adjective', 'adverb'])
lemmatized_2_all = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_3, lemmatizer, pos_table, allowed_pos=['noun', 'verb', 'adjective', 'adverb'])
lemmatized_3_all = lemmatize.lemmatize()
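
The contents of `pos_table` and the internals of `lemmatization` are not shown here. One plausible reading is that Penn Treebank tags are grouped into coarse categories, filtered by `allowed_pos`, and translated to the one-letter WordNet codes. The sketch below illustrates only that selection step; both mapping tables and the name `select_for_lemmatization` are assumptions.

```python
# Assumed grouping of Penn Treebank tags by their two-letter prefix.
PENN_PREFIX_TO_CATEGORY = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}
# WordNet part-of-speech codes for the four categories.
CATEGORY_TO_WORDNET = {"noun": "n", "verb": "v", "adjective": "a", "adverb": "r"}

def select_for_lemmatization(tagged_doc, allowed_pos):
    """Keep tokens whose category is allowed, paired with their WordNet POS code."""
    selected = []
    for token, tag in tagged_doc:
        category = PENN_PREFIX_TO_CATEGORY.get(tag[:2])  # None for tags such as POS
        if category in allowed_pos:
            selected.append((token, CATEGORY_TO_WORDNET[category]))
    return selected

doc = [("soviets", "NNS"), ("'s", "POS"), ("quickly", "RB"), ("expanded", "VBD")]
nouns_only = select_for_lemmatization(doc, allowed_pos=["noun"])
all_pos = select_for_lemmatization(doc, allowed_pos=["noun", "verb", "adjective", "adverb"])
print(nouns_only)
print(all_pos)
```

Each selected pair could then be passed to `WordNetLemmatizer().lemmatize(token, pos=code)`, which requires the WordNet corpus to be available locally.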

#### Nouns

In [48]:
lemmatize = lemmatization(tagged_1, lemmatizer, pos_table, allowed_pos=['noun'])
lemmatized_1_noun = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_2, lemmatizer, pos_table, allowed_pos=['noun'])
lemmatized_2_noun = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_3, lemmatizer, pos_table, allowed_pos=['noun'])
lemmatized_3_noun = lemmatize.lemmatize()

#### Verbs

In [49]:
lemmatize = lemmatization(tagged_1, lemmatizer, pos_table, allowed_pos=['verb'])
lemmatized_1_verb = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_2, lemmatizer, pos_table, allowed_pos=['verb'])
lemmatized_2_verb = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_3, lemmatizer, pos_table, allowed_pos=['verb'])
lemmatized_3_verb = lemmatize.lemmatize()

#### Adjectives

In [50]:
lemmatize = lemmatization(tagged_1, lemmatizer, pos_table, allowed_pos=['adjective'])
lemmatized_1_adjective = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_2, lemmatizer, pos_table, allowed_pos=['adjective'])
lemmatized_2_adjective = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_3, lemmatizer, pos_table, allowed_pos=['adjective'])
lemmatized_3_adjective = lemmatize.lemmatize()

#### Adverbs

In [51]:
lemmatize = lemmatization(tagged_1, lemmatizer, pos_table, allowed_pos=['adverb'])
lemmatized_1_adverb = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_2, lemmatizer, pos_table, allowed_pos=['adverb'])
lemmatized_2_adverb = lemmatize.lemmatize()

lemmatize = lemmatization(tagged_3, lemmatizer, pos_table, allowed_pos=['adverb'])
lemmatized_3_adverb = lemmatize.lemmatize()

## 4. Save PreProcessed Data

In [52]:
SAVE_ROOT = './processed-data/'

SAVE_1 = SAVE_ROOT + 'period-1/'
SAVE_2 = SAVE_ROOT + 'period-2/'
SAVE_3 = SAVE_ROOT + 'period-3/'

### 4-1. Preprocessed data to pickle file

#### all pos

In [53]:
to_pickle(data=lemmatized_1_all, file_name="lemmatized-all", root=SAVE_1)
to_pickle(data=lemmatized_2_all, file_name="lemmatized-all", root=SAVE_2)
to_pickle(data=lemmatized_3_all, file_name="lemmatized-all", root=SAVE_3)

#### noun

In [54]:
to_pickle(data=lemmatized_1_noun, file_name="lemmatized-noun", root=SAVE_1)
to_pickle(data=lemmatized_2_noun, file_name="lemmatized-noun", root=SAVE_2)
to_pickle(data=lemmatized_3_noun, file_name="lemmatized-noun", root=SAVE_3)

#### verb

In [55]:
to_pickle(data=lemmatized_1_verb, file_name="lemmatized-verb", root=SAVE_1)
to_pickle(data=lemmatized_2_verb, file_name="lemmatized-verb", root=SAVE_2)
to_pickle(data=lemmatized_3_verb, file_name="lemmatized-verb", root=SAVE_3)

#### adjective

In [56]:
to_pickle(data=lemmatized_1_adjective, file_name="lemmatized-adjective", root=SAVE_1)
to_pickle(data=lemmatized_2_adjective, file_name="lemmatized-adjective", root=SAVE_2)
to_pickle(data=lemmatized_3_adjective, file_name="lemmatized-adjective", root=SAVE_3)

#### adverb

In [57]:
to_pickle(data=lemmatized_1_adverb, file_name="lemmatized-adverb", root=SAVE_1)
to_pickle(data=lemmatized_2_adverb, file_name="lemmatized-adverb", root=SAVE_2)
to_pickle(data=lemmatized_3_adverb, file_name="lemmatized-adverb", root=SAVE_3)

### 4-2. Tagged data to pickle file

In [58]:
to_pickle(data=tagged_1_, file_name="tagged", root=SAVE_1)
to_pickle(data=tagged_2_, file_name="tagged", root=SAVE_2)
to_pickle(data=tagged_3_, file_name="tagged", root=SAVE_3)
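
For reference, a helper with the same call shape as `to_pickle` could be written as below. This is a sketch under assumptions: the `.pkl` extension and the automatic creation of the target directory are guesses, and `save_pickle` is a hypothetical name.

```python
import os
import pickle
import tempfile

def save_pickle(data, file_name, root):
    """Write `data` to <root>/<file_name>.pkl, creating `root` if needed."""
    os.makedirs(root, exist_ok=True)
    path = os.path.join(root, file_name + ".pkl")
    with open(path, "wb") as handle:
        pickle.dump(data, handle)
    return path

# Round-trip check in a throwaway directory so no project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_pickle([["soviet", "union"]], "lemmatized-noun", os.path.join(tmp, "period-1"))
    with open(saved_path, "rb") as handle:
        loaded = pickle.load(handle)
print(loaded)
```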