<a href="https://colab.research.google.com/github/JangJiYeon12/AI-section-chalenge/blob/main/ai_sc42x.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SC42x 
## Natural Language Processing

# Part 1: Concept Summary

> Summarize each of the following keywords in **one line**. (You may refer to the session notes.)<br/>
> **Tip: start the problems below first, then write this section while long-running cells such as model training execute, to save time.**

**N421**
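- Stopwords  
Words such as articles and prepositions that occur very often but carry little meaning for the analysis, so they are usually removed.
- Stemming and Lemmatization  
Stemming strips affixes with simple rules to get a stem, which may not be a real word; lemmatization finds the dictionary form (lemma) of the word.
- Bag-of-Words  
Represents a document by how often each word occurs, ignoring word order.
- TF-IDF  
A weight that grows with how often a word appears in a document and shrinks with how many documents contain it, so it reflects how distinctive the word is for that document.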
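
The Bag-of-Words and TF-IDF definitions above can be sketched in plain Python. This is a minimal illustration on a made-up three-document corpus using the textbook weight `tf * log(N / df)`; scikit-learn's `TfidfVectorizer` additionally smooths the idf term and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, for illustration only).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

def bag_of_words(text):
    """Count each token; word order is discarded."""
    return Counter(text.split())

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    counts = [bag_of_words(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in c.items()}
        for c in counts
    ]

bow = bag_of_words(docs[0])
weights = tf_idf(docs)
print(bow["the"])         # raw count of "the" in the first document
print(weights[0]["the"])  # discounted because "the" occurs in 2 of 3 documents
print(weights[0]["cat"])  # larger because "cat" occurs in only 1 document
```

Even though "the" occurs twice in the first document and "cat" only once, "cat" ends up with the larger weight because it is rarer across the corpus.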

**N422**
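- Word2Vec  
A shallow neural network that learns dense word vectors from surrounding context, which makes it possible to measure similarity between words.
- fastText  
A similar neural model, but it builds vectors from subwords (character n-grams), so it can also produce similarities for words that were not in the training vocabulary.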
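
To make the subword idea concrete, the sketch below extracts character n-grams the way the fastText paper describes (word wrapped in `<` and `>`, n from 3 to 6). The function name `char_ngrams` is hypothetical, and the sketch only produces the n-grams; it does not train or look up any vectors.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams

# A word seen in training and a (hypothetically) unseen word share many n-grams,
# which is why a vector can still be assembled for the unseen one.
known = set(char_ngrams("where"))
unseen = set(char_ngrams("wherever"))
shared = known & unseen
print(sorted(shared))
```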

**N423**
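- RNN  
A neural network for sequential data that carries a hidden state from one time step to the next.
- LSTM, GRU  
LSTM mitigates the vanishing gradient problem of RNNs with a gated cell state; GRU is a simplified variant of LSTM with fewer gates.
- Attention  
A mechanism that focuses more on the most relevant words when producing each output.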
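
As a rough illustration of "focusing on relevant words", here is a minimal scaled dot-product attention in plain Python. The vectors are made up, and the helper names `softmax` and `attend` are hypothetical; real models compute this with learned projections and batched tensors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Weight each value by how well its key matches the query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context

# Hypothetical 2-dimensional encoder states used as both keys and values.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 0.0], states, states)
print(weights)  # the states that align with the query get the larger weights
print(context)  # weighted sum of the values
```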

# Part 2 : Fake/Real News Dataset

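Over the past week of studying natural language processing, you have encountered a variety of techniques.<br/>
You learned how to handle text data, how to vectorize text, how to model topics in documents, and other NLP techniques.<br/>
In this sprint challenge, you will review what you learned using the [Fake/Real News Dataset](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset).

**Note: focus on getting the models to run, not on maximizing their performance.<br/>
If you still have time after completing every problem, try to improve the accuracy.**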

In [1]:
# Set the random seeds before running any code.
import numpy as np
import tensorflow as tf

np.random.seed(42)
tf.random.set_seed(42)

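## 2.0 Load the dataset

- Download the dataset from the Kaggle link above and upload it.<br/>
(Uploading directly takes quite a while, so **drive_mount** or the **Kaggle integration** is recommended.)

- Create a 'label' column and label Fake = 1, True = 0.
- Combine the two files into a single DataFrame, then shuffle the data.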

In [2]:
!pip install kaggle

In [3]:
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Avoid the permissions warning
!chmod 600 ~/.kaggle/kaggle.json

In [4]:
!kaggle datasets download -d clmentbisaillon/fake-and-real-news-dataset

fake-and-real-news-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [18]:
!unzip -qq "/content/fake-and-real-news-dataset.zip"

In [1]:
import pandas as pd

fake_df = pd.read_csv('Fake.csv')
true_df = pd.read_csv('True.csv')

In [2]:
fake_df['lable'] = 1
true_df['lable'] = 0

fake_df.head(5)

Unnamed: 0,title,text,subject,date,lable
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",1
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",1
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",1
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",1
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",1


In [4]:
fake_df.isnull().sum()

title      0
text       0
subject    0
date       0
lable      0
dtype: int64

In [5]:
true_df.isnull().sum()

title      0
text       0
subject    0
date       0
lable      0
dtype: int64

In [3]:
df = pd.concat([fake_df,true_df])

df.isnull().sum()

title      0
text       0
subject    0
date       0
lable      0
dtype: int64

In [4]:
df_suffled = df.sample(frac=1).reset_index(drop=True)

In [5]:
df_suffled.head(10)

Unnamed: 0,title,text,subject,date,lable
0,Paul Ryan Quotes Mel Gibson In A Desperate At...,Paul Ryan is learning quickly that John Boehne...,News,"February 5, 2016",1
1,Two Iraqis lead legal fight against Trump orde...,NEW YORK (Reuters) - A federal judge blocked t...,politicsNews,"January 28, 2017",0
2,McCain says 'no information' Russia sought to ...,WASHINGTON (Reuters) - There is “no informatio...,politicsNews,"December 12, 2016",0
3,SHOCKING! EVIDENCE SHOWS WHY OBAMA IS HEART OF...,There will be no peace in America until white...,politics,"Sep 8, 2015",1
4,FAMILY FEUD? Why President Trump Is Reportedly...,Is the last man standing in the West Wing abou...,politics,"Nov 2, 2017",1
5,U.S. lawmaker wounded in June shooting dischar...,WASHINGTON (Reuters) - U.S. Congressman Steve ...,politicsNews,"July 26, 2017",0
6,"MUSLIM MIGRANT Too Sick To Work, With Wife, 8 ...",This is probably not an untypical Muslim taxpa...,left-news,"May 27, 2016",1
7,Girl strapped with bomb kills five in Cameroon...,YAOUNDE (Reuters) - A girl with a bomb strappe...,worldnews,"September 13, 2017",0
8,Log Cabin Republicans Finally Grow A Pair Whe...,The idea of there being a group of people who ...,News,"October 23, 2016",1
9,"Iran says warns off U.S. U2 spy plane, drone",DUBAI (Reuters) - Iran s air defenses have for...,worldnews,"September 3, 2017",0


## 2.1 Search for news similar to a given article using TF-IDF

To save time, perform the tasks below **without any special preprocessing**.

### 2.1.1 Build a Document-Term Matrix with TfidfVectorizer

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Store a TfidfVectorizer in a variable.
tfidf = TfidfVectorizer(stop_words='english', max_features=1000)

dtm_tfidf = tfidf.fit_transform(df['text'])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
dtm_tfidf = pd.DataFrame(dtm_tfidf.todense(), columns=tfidf.get_feature_names_out())
dtm_tfidf



Unnamed: 0,000,10,100,11,12,13,14,15,16,17,...,working,world,wrong,wrote,year,years,yes,york,young,youtube
0,0.000000,0.0,0.0,0.046812,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.000000,0.000000,0.049512,0.0,0.405149,0.090893,0.0,0.000000,0.000000,0.0
1,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.000000,0.073549,0.000000,0.0,0.000000,0.000000,0.0,0.164598,0.000000,0.0
2,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.000000,0.000000,0.000000,0.0,0.021965,0.000000,0.0,0.000000,0.000000,0.0
3,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.054342,0.000000,0.000000,0.0,0.035652,0.000000,0.0,0.000000,0.000000,0.0
4,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.000000,0.066477,0.000000,0.0,0.000000,0.000000,0.0,0.000000,0.092711,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44893,0.078396,0.0,0.0,0.049753,0.050414,0.0,0.000000,0.048842,0.055866,0.0,...,0.000000,0.000000,0.000000,0.0,0.028707,0.032201,0.0,0.000000,0.000000,0.0
44894,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.000000,0.000000,0.000000,0.0,0.155352,0.000000,0.0,0.000000,0.000000,0.0
44895,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.000000,0.000000,0.000000,0.0,0.068743,0.154221,0.0,0.000000,0.000000,0.0
44896,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.000000,0.000000,0.000000,0.0,...,0.000000,0.164849,0.000000,0.0,0.064338,0.000000,0.0,0.000000,0.000000,0.0


### 2.1.2 Search for similar documents with the KNN algorithm

- Show the **indices of the 5 documents (including index 42)** most similar to **the document at index 42**, and **the label at each index**.
- Set the NN model parameter `algorithm = 'kd_tree'`.

In [7]:
from sklearn.neighbors import NearestNeighbors

# Fit the NN model on the DTM. Five nearest neighbors is also the default.
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm_tfidf)

NearestNeighbors(algorithm='kd_tree')

In [9]:
# Pass a one-row DataFrame so the feature names match those seen during fit.
nn.kneighbors(dtm_tfidf.iloc[[42]])


(array([[0.        , 0.99743405, 1.        , 1.        , 1.        ]]),
 array([[   42, 23876, 11290, 11292, 11295]]))
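
The task also asks for the label at each returned index, which the cell above does not show. The sketch below illustrates the idea on made-up data with a brute-force Euclidean search: find the nearest rows, then look up their labels by position. In the notebook itself, because `dtm_tfidf` was built from `df['text']` rather than the shuffled frame, the matching labels would presumably be read by position from `df['lable']` (an assumption about how the result would be used, not something the notebook does).

```python
import math

# Hypothetical 3-feature TF-IDF rows with their labels (1 = fake, 0 = true).
rows = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
]
labels = [1, 1, 0, 0]

def nearest(rows, query_index, k):
    """Return (index, distance) pairs for the k rows closest to rows[query_index]."""
    query = rows[query_index]
    distances = [(i, math.dist(query, row)) for i, row in enumerate(rows)]
    distances.sort(key=lambda pair: pair[1])
    return distances[:k]

neighbours = nearest(rows, query_index=0, k=2)
for index, distance in neighbours:
    # The query row itself always comes first with distance 0.
    print(index, labels[index], round(distance, 4))
```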

## 2.2 Classify with Keras Embedding

### 2.2.0 Split the dataset

- Split the data into train and test sets.

In [27]:
from sklearn.model_selection import train_test_split
import numpy as np

target = df_suffled['lable']
features = df_suffled['text']

In [36]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

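### 2.2.1 Classify using the average of word vectors

Classify the articles using the average of word embedding vectors, as in N422.<br/>
Each instance is long and training is slow, so **10 or fewer** epochs is recommended.<br/>
The goal is to get the model running, so keep the embedding dimension small (50 or less).<br/>
**Recommendation: set `max_len` higher than the average text length.**<br/>

> **Tip: you can save time by writing your answers to 2.2.3 while the model trains.**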


In [29]:
import tensorflow as tf

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.datasets import imdb

In [37]:
print(f"Train set shape : {X_train.shape}")
print(f"Test set shape : {X_test.shape}")

Train set shape : (31428,)
Test set shape : (13470,)


In [42]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train)

In [43]:
vocab_size = len(tokenizer.word_index) + 1
print(vocab_size)

118408
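
The `+ 1` in `vocab_size` exists because word indices start at 1 and index 0 is reserved for padding. The simplified sketch below mimics that behaviour in plain Python. It is not the Keras implementation: the function names are hypothetical, punctuation filtering is omitted, and ties are broken alphabetically here only to keep the result deterministic.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; indices start at 1 so that 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word with its index; unknown words are dropped."""
    return [
        [word_index[w] for w in text.lower().split() if w in word_index]
        for text in texts
    ]

train_texts = ["fake news spreads fast", "real news spreads slowly"]
word_index = fit_word_index(train_texts)
vocab_size = len(word_index) + 1  # +1 for the reserved padding index 0
print(word_index)
print(texts_to_sequences(["fake news unseen"], word_index))
print(vocab_size)
```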


In [44]:
X_encoded = tokenizer.texts_to_sequences(X_train)

# Length (in tokens) of the longest article in the training set.
max_len = max(len(sent) for sent in X_encoded)
print(max_len)

8375


In [45]:
# X_train still holds the raw strings here, so this is the mean length in characters, not tokens.
print(f'Mean length of train set: {np.mean([len(sent) for sent in X_train], dtype=int)}')

Mean length of train set: 2471


In [46]:
X_train=pad_sequences(X_encoded, maxlen=2500, padding='post')
y_train=np.array(y_train)
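
What the padding step does can be sketched in plain Python. The helper `pad_post` below is hypothetical and only mirrors the settings used above: `padding='post'` appends zeros, while the Keras default `truncating='pre'` drops tokens from the start of sequences longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad with `value` at the end; keep the last `maxlen` tokens of longer sequences."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # 'pre' truncation: the beginning is cut off
        padded.append(kept + [value] * (maxlen - len(kept)))
    return padded

encoded = [[5, 2], [7, 1, 4, 9, 3], [8]]
print(pad_post(encoded, maxlen=4))
```

Every row now has the same length, which is what lets the sequences be stacked into a single array and fed to the model in batches.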

In [48]:
embedding_matrix = np.zeros((vocab_size, 40))

print(np.shape(embedding_matrix))

(118408, 40)


In [51]:
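def get_vector(word):
    """
    Return the embedding vector for `word` if it is in the word2vec vocabulary, else None.
    """
    # NOTE: `wv` (presumably a loaded word2vec model with 40-dimensional vectors)
    # is never defined in this notebook, which is what raises the NameError below.
    if word in wv:
        return wv[word]
    else:
        return None

for word, i in tokenizer.word_index.items():
    temp = get_vector(word)
    if temp is not None:
        embedding_matrix[i] = temp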

NameError: ignored

In [52]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, Flatten

In [53]:
model = Sequential()
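# input_length must equal the padded length (2500). The original max_len (8375) does not
# match X_train and is the likely cause of the ValueError recorded below.
# Note: if embedding_matrix is still all zeros, a frozen embedding gives the model nothing to learn.
model.add(Embedding(vocab_size, 40, weights=[embedding_matrix], input_length=X_train.shape[1], trainable=False))
model.add(GlobalAveragePooling1D())  # Averages the word vectors over the sequence.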
model.add(Dense(1, activation='sigmoid'))
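
The computation this model performs can be sketched without Keras: average the embedding vectors over the sequence, then apply a logistic unit. The helper names and numbers below are made up for illustration. The last call shows why an all-zero, frozen embedding matrix is a problem: the pooled vector is zero, so the prediction depends only on the bias.

```python
import math

def average_pool(vectors):
    """Mean over time steps, one value per embedding dimension."""
    steps = len(vectors)
    return [sum(column) / steps for column in zip(*vectors)]

def predict(vectors, weights, bias):
    """Logistic output on top of the averaged embedding."""
    pooled = average_pool(vectors)
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical 2-dimensional embeddings for a 3-token document.
doc = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(average_pool(doc))
print(predict(doc, weights=[0.5, -0.5], bias=0.0))

# With zero embeddings the dense weights have no effect on the output.
zero_doc = [[0.0, 0.0]] * 3
print(predict(zero_doc, weights=[3.0, -2.0], bias=0.0))
```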

In [54]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model.fit(X_train, y_train, batch_size=64, epochs=5, validation_split=0.2)

Epoch 1/5


ValueError: ignored

### 2.2.2 Classify text with an LSTM

Classify the articles using an LSTM, as in N423.<br/>
Each instance is long, so training takes a very long time; <br/>
**stack as few layers as possible** and use **3 or fewer** epochs.<br/>

> **Tip: you can save time by writing your answers to 2.2.3 while the model trains.**
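
Before building the Keras model, it may help to see what a single LSTM step computes. The sketch below uses scalar states and hand-picked, hypothetical weights; it is not a trained model. With the forget gate near 1 and the input gate near 0, the cell state survives several steps almost unchanged, which is the mechanism that lets an LSTM keep information over long sequences.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, params):
    """One LSTM step with scalar input, hidden state and cell state."""
    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose
    c_next = f * c + i * g
    h_next = o * math.tanh(c_next)
    return h_next, c_next

# Hypothetical weights per gate: (input weight, recurrent weight, bias).
params = {
    "forget": (0.0, 0.0, 10.0),    # saturates near 1, so memory is kept
    "input": (0.0, 0.0, -10.0),    # saturates near 0, so almost nothing is written
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}

h, c = 0.0, 1.0
for x in [0.5, -0.3, 0.8, 0.1]:
    h, c = lstm_step(x, h, c, params)
print(c)  # stays close to the initial cell state of 1.0
```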


In [None]:
# Write your answer here

### 2.2.3 Review what you ran above

#### a) Explain the `pad_sequences` function used when preparing the dataset for training.<br/>What does it do? Why is it needed when training the model?

*Enter your answer here*

#### b) What evaluation performance did the models in 2.2.1 and 2.2.2 achieve?<br/>What do you think are the strengths and weaknesses of each model?

*Enter your answer here*

#### c) Why use an LSTM (Long Short-Term Memory) instead of a conventional RNN (Recurrent Neural Network)?<br/>(i.e. explain what an LSTM does better than an RNN.)

*Enter your answer here*

#### d) Give **3 or more** examples of where an LSTM or RNN is used, and briefly explain why an LSTM or RNN is appropriate in each case.

*Enter your answer here*

#### e) List 3 or more other keywords related to the NLP models covered in N424. <br/> (Explaining the keywords is optional.)

*Enter your answer here*

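# Advanced Goals: to earn 3 points, you must satisfy at least one of the conditions below
 
- In 2.1, run the similarity search with a method other than TF-IDF (`TfidfVectorizer`).<br/>
Explain how that method differs from TF-IDF. 
- Reuse the approach from 2.2, but tune the hyperparameters or change the model architecture to improve performance.<br/>**(Note: you may use GridSearch, RandomSearch and similar methods, but they take a long time, so choose the search range carefully.)**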
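
As one possible direction for the first goal, the sketch below ranks documents by cosine similarity of raw word counts instead of TF-IDF weights. It is only an illustration on a made-up corpus with hypothetical helper names. Unlike TF-IDF, raw counts give frequent and rare words equal weight per occurrence, so common words can dominate the similarity.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors stored as Counters."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(query_index, docs, k):
    """Rank documents by cosine similarity of raw word counts to docs[query_index]."""
    counts = [Counter(d.lower().split()) for d in docs]
    scored = [(i, cosine_similarity(counts[query_index], c)) for i, c in enumerate(counts)]
    scored.sort(key=lambda pair: -pair[1])
    return scored[:k]

docs = [
    "senate passes budget bill",
    "senate rejects budget bill",
    "team wins the final match",
]
print(most_similar(0, docs, k=2))
```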

In [None]:
# Write your answer here