# SC42x 
## Natural Language Processing

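# Part 1 : Concept Summary

> Summarize each of the following keywords in **one line**. (You may refer to the session notes.)<br/>
> **Tip : To save time, work through the problems below first, then fill in this section while long-running cells such as model training are executing.**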

**N421**
- Stopwords
- Stemming and Lemmatization
- Bag-of-Words
- TF-IDF

**N422**
- Word2Vec
- fastText

**N423**
- RNN
- LSTM, GRU
- Attention

# Part 2 : Fake/Real News Dataset

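Over the past week of studying natural language processing, you have encountered a variety of techniques.<br/>
You learned how to handle text data, how to vectorize text, how to model topics in documents, and more.<br/>
In this sprint challenge, you will review what you learned using the [Fake/Real News Dataset](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset).

__*Note:*__  Focus on getting the models to run rather than on maximizing their performance.<br/>
If you still have time after completing every problem, we recommend trying to improve accuracy.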

In [None]:
# Set the random seeds before running any code.
import numpy as np
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)

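## 2.0 Load the dataset

- Download the dataset from the Kaggle link above, then upload it to the notebook environment.<br/>
(Uploading takes quite a while, so we recommend mounting Google Drive or using the Kaggle integration instead.)

- Create a 'label' column and label Fake = 1, True = 0.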

In [None]:
from google.colab import files

files.upload()

Saving Fake.csv to Fake.csv
Saving True.csv to True.csv


In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

fake = pd.read_csv('/content/drive/My Drive/sc42x/Fake.csv') ### Fake.csv
real = pd.read_csv('/content/drive/My Drive/sc42x/True.csv') ### True.csv

In [None]:
fake['label'] = 1
real['label'] = 0

In [None]:
df = pd.concat([fake,real])
df = df.sample(frac=1).reset_index(drop=True)
df = df[['text', 'label']]
df

Unnamed: 0,text,label
0,"A little more than 25 years ago, the New York ...",1
1,WASHINGTON (Reuters) - Two prominent Republica...,0
2,(Reuters) - Electricity generator Florida Powe...,0
3,"Saturday afternoon, a group of protesters gath...",1
4,(Reuters) - The New York Times’s editorial boa...,0
...,...,...
44893,The video below is from an ABC News interview ...,1
44894,WASHINGTON (Reuters) - White supremacists and ...,0
44895,WASHINGTON (Reuters) - U.S. House Democratic L...,0
44896,Trigger warning If liberal schools and badass ...,1


In [None]:
df.shape

(44898, 2)

In [None]:
df.dtypes

text     object
label     int64
dtype: object

In [None]:
df.isna().sum()

text     0
label    0
dtype: int64

## 2.1 Find news articles similar to a given one using TF-IDF

To save time, we carry out the tasks below without any special preprocessing.

### 2.1.1 Build a document-term matrix with TfidfVectorizer
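To make the weighting concrete, here is a minimal NumPy sketch of TF-IDF on a made-up three-document corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is what scikit-learn documents as its default; it does no stop-word removal, and the corpus and variable names are illustrative only.

```python
import numpy as np

# Toy corpus (made up for illustration, not taken from the dataset).
docs = [
    "the senate passed the bill",
    "the president signed the bill",
    "aliens endorsed the president",
]

# Build a vocabulary and a raw term-count matrix (documents x terms).
vocab = sorted({word for doc in docs for word in doc.split()})
counts = np.array(
    [[doc.split().count(word) for word in vocab] for doc in docs], dtype=float
)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then scale each row to unit L2 norm.
tfidf_matrix = counts * idf
tfidf_matrix /= np.linalg.norm(tfidf_matrix, axis=1, keepdims=True)

# Show the weights of the first document.
for word, weight in zip(vocab, tfidf_matrix[0]):
    print(f"{word:10s} {weight:.3f}")
```

A word that appears in every document ("the") gets the minimum IDF of 1, while a word that appears in only one document ("senate") gets a larger IDF and therefore more weight per occurrence.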

In [None]:
# Write your answer here.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', max_features=3000)

In [None]:
dtm_tfidf = tfidf.fit_transform(df['text'])
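# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() (available since 1.0) replaces it.
dtm_tfidf = pd.DataFrame(dtm_tfidf.toarray(), columns=tfidf.get_feature_names_out())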
dtm_tfidf

Unnamed: 0,00,000,10,100,11,12,13,14,15,16,17,18,19,20,200,2000,2001,2003,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2017the,2018,2019,2020,21,21st,21wire,22,23,...,won,wonder,word,words,work,worked,worker,workers,working,works,world,worried,worry,worse,worst,worth,wouldn,wounded,write,writing,written,wrong,wrongdoing,wrote,www,xi,yeah,year,years,yemen,yes,yesterday,york,young,youth,youtube,zero,zika,zone,zuma
0,0.0,0.081838,0.000000,0.140853,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.038475,0.0,0.000000,0.0,0.0,0.048258,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.027142,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.025594,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.045424,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.019978,0.067230,0.0,0.0,0.0,0.028639,0.035694,0.0,0.000000,0.091834,0.0,0.0,0.0
1,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.051549,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.052642,0.000000,0.11273,0.0,0.066929,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.043190,0.060002,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.036661,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
2,0.0,0.000000,0.040608,0.000000,0.000000,0.047228,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.042416,0.0,0.0,0.000000,0.0,0.063633,0.0,0.000000,0.000000,0.000000,0.0,0.05468,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.034452,0.0,0.0,0.0,0.053964,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
3,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.054817,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.00000,0.0,0.000000,0.059933,0.000000,0.000000,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.078687,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.159758,0.0,0.0,0.0
4,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.060056,0.062583,0.0,0.00000,0.0,0.055604,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.056641,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.097346,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.042549,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,0.0,0.000000,0.000000,0.056090,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.064388,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.042053,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.154955,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.031822,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.063116,0.000000,0.0,0.0,0.0
44894,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.023418,0.0,0.0,0.000000,0.0,0.000000,0.0,0.035329,0.030041,0.031305,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.033758,0.000000,0.000000,0.0,0.048694,0.0,0.0,0.000000,0.000000,0.016654,0.0,0.0,0.0,0.021284,0.000000,0.0,0.000000,0.034124,0.0,0.0,0.0
44895,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0
44896,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.000000,0.0,0.000000,0.000000,0.000000,0.0,0.00000,0.0,0.000000,0.000000,0.000000,0.000000,0.00000,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.048043,0.041285,0.0,0.000000,0.0,0.0,0.056548,0.000000,0.000000,0.0,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.0,0.0,0.0


### 2.1.2 Find similar documents with the KNN algorithm

- Show the indices of the 5 documents most similar to index 42 (including 42 itself), along with the label at each index.
- Among the NN model's parameters, set `algorithm = 'kd_tree'`.
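As a rough illustration of what the neighbor search returns, the sketch below does a brute-force Euclidean search in NumPy on a small random matrix. It is not the `kd_tree` algorithm the task asks for; `nearest_neighbors` is a hypothetical helper and the matrix is a stand-in for a TF-IDF matrix with unit-length rows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a TF-IDF matrix: 6 documents, 4 terms, rows scaled to unit length.
matrix = rng.random((6, 4))
matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)


def nearest_neighbors(matrix, query_index, k):
    """Return (distances, indices) of the k rows closest to the query row."""
    diffs = matrix - matrix[query_index]
    distances = np.linalg.norm(diffs, axis=1)
    order = np.argsort(distances)[:k]
    return distances[order], order


distances, indices = nearest_neighbors(matrix, query_index=2, k=3)
print(indices)    # the query row itself comes first, at distance 0
print(distances)
```

Because the rows have unit length, the squared Euclidean distance equals `2 - 2 * cosine_similarity`, so ranking by Euclidean distance gives the same order as ranking by cosine similarity.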

In [None]:
# Write your answer here.

In [None]:
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
knn.fit(dtm_tfidf)

NearestNeighbors(algorithm='kd_tree', leaf_size=30, metric='minkowski',
                 metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                 radius=1.0)

In [None]:
distance, result = knn.kneighbors([dtm_tfidf.iloc[42].values])
result

array([[   42,  6991, 15249,  5520,   119]])

In [None]:
for i in result:
    print(df.label[i])

42       0
6991     0
15249    0
5520     1
119      1
Name: label, dtype: int64


## 2.2 Classification with Keras Embedding

### 2.2.0 Split the dataset

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.text, df.label, test_size=0.2, random_state = 42)

In [None]:
print('X_train.shape : ', X_train.shape)
print('X_test.shape : ', X_test.shape)
print('y_train.shape : ', y_train.shape)
print('y_test.shape : ', y_test.shape)

X_train.shape :  (35918,)
X_test.shape :  (8980,)
y_train.shape :  (35918,)
y_test.shape :  (8980,)


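### 2.2.1 Classify using the average of word vectors

Let's classify the texts using the average of word embedding vectors, as we did in N422.<br/>
Each instance is long and training takes a while, so we recommend keeping the number of epochs at **10 or fewer**.<br/>
Since the goal is just to get the model running, keep the embedding dimension small (50 or less).<br/>
> **Tip : You can save time by writing up section 2.2.3 while the model trains.**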
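
The idea of averaging word vectors can be sketched in NumPy as follows: look up an embedding for every token, take the mean over the sequence axis, and pass the result to a single sigmoid unit. The sizes, token ids, and random weights here are illustrative assumptions, not values from a trained model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative sizes, much smaller than a real vocabulary.
vocab_size, embed_dim = 10, 4
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# Two padded documents of length 5 (token ids; 0 is the padding id).
token_ids = np.array([
    [0, 0, 3, 7, 1],
    [2, 5, 5, 9, 4],
])

# Embedding lookup: (batch, seq_len) -> (batch, seq_len, embed_dim).
embedded = embedding_table[token_ids]

# Average over the sequence axis: (batch, seq_len, embed_dim) -> (batch, embed_dim).
doc_vectors = embedded.mean(axis=1)

# A single sigmoid unit turns each document vector into a probability.
weights = rng.normal(size=embed_dim)
bias = 0.0
probs = 1.0 / (1.0 + np.exp(-(doc_vectors @ weights + bias)))

print(embedded.shape, doc_vectors.shape, probs.shape)
```

Note that this plain mean also averages over the padding positions. Word order is discarded entirely, which is what makes the approach fast.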


In [None]:
# Write your answer here.

In [None]:
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
# len() on a string counts characters, so these are mean character lengths.
print(f'Mean length of train set: {np.mean([len(sent) for sent in X_train], dtype=int)}')
print(f'Mean length of test set: {np.mean([len(sent) for sent in X_test], dtype=int)}')

Mean length of train set: 2471
Mean length of test set: 2458
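
The articles differ in length, so their token sequences have to be brought to one common length before they can be stacked into a single matrix. The NumPy sketch below mimics what `pad_sequences` does with its default settings, as far as I understand them (pad on the left with 0, truncate from the left). `pad_pre` is a hypothetical helper written for illustration, not part of Keras.

```python
import numpy as np


def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last `maxlen` ids of long ones."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if not seq:
            continue  # an empty sequence stays all padding
        kept = seq[-maxlen:]
        padded[row, maxlen - len(kept):] = kept
    return padded


sequences = [[4, 9], [7, 1, 3, 8, 2, 6], [5]]
print(pad_pre(sequences, maxlen=4))
```

The result is a rectangular integer array with one row per document, which is the shape an `Embedding` layer expects as input.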


In [None]:
max_words = 30000
max_len = 3000
tok = Tokenizer(num_words=max_words)
tok.fit_on_texts(X_train)
sequences = tok.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences,maxlen=max_len)

In [None]:
model1 = Sequential()
model1.add(Embedding(max_words, 50, input_length=max_len))
model1.add(GlobalAveragePooling1D())
model1.add(Dense(1, activation='sigmoid'))

In [None]:
model1.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model1.fit(sequences_matrix, y_train, batch_size=256, epochs=10, validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f9c7dcc9a10>

In [None]:
test_sequences = tok.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences,maxlen=max_len)

In [None]:
acc1 = model1.evaluate(test_sequences_matrix,y_test)
print(acc1)

[0.27402135729789734, 0.943875253200531]
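
The list printed above holds the loss first and the compiled metric second. As a sketch of how those two quantities are defined, the NumPy snippet below computes binary cross-entropy and accuracy (with a 0.5 threshold) from made-up labels and sigmoid outputs; the numbers are illustrative and unrelated to the model's actual predictions.

```python
import numpy as np

# Made-up labels and sigmoid outputs, for illustration only.
y_true = np.array([1, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.4, 0.1])

# Binary cross-entropy, averaged over the examples.
eps = 1e-7
clipped = np.clip(y_prob, eps, 1 - eps)
loss = -np.mean(y_true * np.log(clipped) + (1 - y_true) * np.log(1 - clipped))

# Accuracy after thresholding the probabilities at 0.5.
accuracy = np.mean((y_prob > 0.5).astype(int) == y_true)

print([loss, accuracy])
```

A model can have high accuracy and still a noticeable loss when its correct predictions are not very confident, since the loss depends on the probabilities and not only on which side of 0.5 they fall.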


### 2.2.2 LSTM을 사용하여 텍스트 분류 수행해보기

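Let's classify the texts with an LSTM, as we did in N423.<br/>
Each instance is long, so training takes a very long time.<br/>
We recommend stacking as few layers as possible and keeping the number of epochs at **3 or fewer**.<br/>
> **Tip : You can save time by writing up section 2.2.3 while the model trains.**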
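
To show why an LSTM is slower than averaging, here is a minimal NumPy sketch of a single LSTM time step applied in a loop over a short sequence. The weights are random, the gate stacking order (input, forget, candidate, output) is a choice made for this sketch, and `lstm_step` is a hypothetical helper rather than any library's API.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b stack the four gates (i, f, g, o)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0 * units:1 * units])   # input gate
    f = sigmoid(z[1 * units:2 * units])   # forget gate
    g = np.tanh(z[2 * units:3 * units])   # candidate cell state
    o = sigmoid(z[3 * units:4 * units])   # output gate
    c_t = f * c_prev + i * g              # keep part of the old memory, add new
    h_t = o * np.tanh(c_t)                # expose part of the memory as output
    return h_t, c_t


rng = np.random.default_rng(0)
embed_dim, units, seq_len = 50, 16, 5     # illustrative sizes; seq_len kept short
W = rng.normal(scale=0.1, size=(4 * units, embed_dim))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

# Each step depends on the previous hidden and cell state, so the loop is sequential.
h, c = np.zeros(units), np.zeros(units)
for x_t in rng.normal(size=(seq_len, embed_dim)):
    h, c = lstm_step(x_t, h, c, W, U, b)

print(h.shape)
```

Because every step needs the state from the step before it, the time steps cannot be computed in parallel, which is why long inputs make training slow.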


In [None]:
# Write your answer here.

In [None]:
from tensorflow.keras.layers import LSTM

In [None]:
model2 = Sequential()
model2.add(Embedding(max_words, 50, input_length=max_len))
model2.add(LSTM(16))
model2.add(Dense(1, activation='sigmoid'))

In [None]:
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model2.fit(sequences_matrix, y_train, batch_size=256, epochs=5, validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x7f9c80343150>

In [None]:
acc2 = model2.evaluate(test_sequences_matrix,y_test)



### 2.2.3 Review what you ran above

#### a) Explain the `pad_sequences` function used when preparing the dataset for training.<br/>What does it do? Why is it needed when training a model?

*Enter your answer here*

#### b) What evaluation performance did the models in 2.2.1 and 2.2.2 achieve?<br/>What do you think are the strengths and weaknesses of each model?

*Enter your answer here*

#### c) Why use an LSTM (Long Short-Term Memory) instead of a conventional RNN (Recurrent Neural Network)?<br/>(i.e., explain the advantages of LSTM over RNN.)

*Enter your answer here*

#### d) Give **3** or more examples where an LSTM or RNN is used, and briefly explain why an LSTM or RNN is appropriate in each case.

*Enter your answer here*

#### e) List 3 or more additional keywords related to the NLP models covered in N423 and N424.<br/>(Explaining the keywords is optional.)

*Enter your answer here*