# 1. Load Files

구글 드라이브에서 mount한 후 train과 test, sample submission 파일을 불러온다.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd
from bs4 import BeautifulSoup
from nltk.stem import WordNetLemmatizer
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
import matplotlib as plt

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


- header=0 : 파일의 첫번째 줄이 열 이름
- delimiter: 구분자
- quoting=3 : 인용구(따옴표)를 무시하게 해준다.

In [3]:
train = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/KUBIG/2024 winter basic study(NLP)/2주차/labeledTrainData.tsv', header=0, delimiter='\t', quoting=3)
test = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/KUBIG/2024 winter basic study(NLP)/2주차/testData.tsv', header=0, delimiter='\t', quoting=3)
submit = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/KUBIG/2024 winter basic study(NLP)/2주차/sampleSubmission.csv')

# 2. 간략한 데이터 탐색

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   id         25000 non-null  object
 1   sentiment  25000 non-null  int64 
 2   review     25000 non-null  object
dtypes: int64(1), object(2)
memory usage: 586.1+ KB


In [5]:
train.head(5)

Unnamed: 0,id,sentiment,review
0,"""5814_8""",1,"""With all this stuff going down at the moment ..."
1,"""2381_9""",1,"""\""The Classic War of the Worlds\"" by Timothy ..."
2,"""7759_3""",0,"""The film starts with a manager (Nicholas Bell..."
3,"""3630_4""",0,"""It must be assumed that those who praised thi..."
4,"""9495_8""",1,"""Superbly trashy and wondrously unpretentious ..."


In [6]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      25000 non-null  object
 1   review  25000 non-null  object
dtypes: object(2)
memory usage: 390.8+ KB


In [7]:
test.head(5)

Unnamed: 0,id,review
0,"""12311_10""","""Naturally in a film who's main themes are of ..."
1,"""8348_2""","""This movie is a disaster within a disaster fi..."
2,"""5828_4""","""All in all, this is a movie for kids. We saw ..."
3,"""7186_2""","""Afraid of the Dark left me with the impressio..."
4,"""12128_7""","""A very accurate depiction of small time mob l..."


데이터는 id와 review 내용으로 이루어져 있고, train의 경우 해당 문장이 긍정(1)인지 부정(0)인지 알려주는 sentiment 변수가 있다. 해당 변수가 데이터의 label이 되어 문장의 감정을 모델에게 학습시킨다.

또한 train dataset의 긍부정 label은 각각 12500개씩 동일하게 들어가 있다.

In [8]:
train['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

# 4. Text Preprocessing

## 4-1. html 태그 제거
첫번째 review를 보면 `<br>`이라는 줄바꿈 태그를 볼 수 있는데 다른 review들에도 동일하게 존재할 가능성이 있다. `BeautifulSoup`을 사용해서 분석에 방해가 되는 태그들을 지워주도록 하자.

In [9]:
train['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [10]:
from bs4 import BeautifulSoup

위 문장을 soup 객체로 바꿔준 후 `get_text()` 기능을 사용하여 text만 온전히 추출하는 작업을 진행한다. 첫번째 리뷰를 예시로 들었을 때 윗문장이 태그 제거 전, 아랫문장이 제거 이후이다.

In [11]:
example1 = BeautifulSoup(train["review"][0])
example1 = example1.get_text()
print("before deleting:", train['review'][0])
print("after deleting:", example1)

before deleting: "With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit wh

## 4-2. 정규표현식(regular expression)으로 알파벳만 남기기

각종 특수문자는 분석에 불필요한 정보이기 때문에 알파벳만 남기고 싶다면 정규표현식을 사용한다.

- `re.sub(대체할 패턴, 대체 이후, 대체할 범위)`: 알파벳을 '공백'으로 대체할 때 사용하는 기능.
- `^`: not의 의미.
- `a-z`: 소문자 전체, `A-Z`: 대문자 전체.

In [12]:
import re
letters_only = re.sub('[^a-zA-Z]', ' ', example1)

In [13]:
train['review'][0]

'"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally

In [14]:
letters_only

' With all this stuff going down at the moment with MJ i ve started listening to his music  watching the odd documentary here and there  watched The Wiz and watched Moonwalker again  Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent  Moonwalker is part biography  part feature film which i remember going to see at the cinema when it was originally released  Some of it has subtle messages about MJ s feeling towards the press and also the obvious message of drugs are bad m kay Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring  Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him The actual feature film bit when it finally starts is only on for    m

태그뿐만 아니라 '/', 온점과 같은 특수기호까지 제거된 것을 확인할 수 있음.

## 4-3. 토큰화(Tokenizing)
**토큰화(tokenizing)**는 의미를 갖는 최소 분석 단위인 토큰으로 corpus 덩어리를 쪼개어주는 작업. 영어에서는 대체로 띄어쓰기를 기준으로 토큰화를 진행함. (한국어의 경우, 형태소 분석을 진행한 후에 토큰화를 진행)

- split(): 띄어쓰기(단어) 단위로 토큰화를 진행하여 리스트로 변환
- lower(): 대문자를 소문자로 변환. 대문자와 소문자는 다른 단어로 구분되기 때문에 그대로 두면 복잡성을 만들어낼 수 있음. corpus 크기를 최대한 줄이기 위해 대소문자를 통일함.

* 무작정 소문자로 통합하는 것도 지양해야 함 !! (미국을 뜻하는 US와 우리를 뜻하는 us는 엄연히 다른 뜻)

In [15]:
lower_case = letters_only.lower()
token_words = lower_case.split()
print('토큰화 이후 생성된 토큰(단어) 개수: ', len(token_words))

토큰화 이후 생성된 토큰(단어) 개수:  437


## 4-4. 불용어(Stopword) 제거
단어들 중 무의미한 것들을 제거하는 작업이 필요하다. I, we, you, our 등 굉장히 자주 등장하지만 분석에서는 크게 중요하지 않은 단어들을 무의미하다고 볼 수 있는데 이들을 **불용어(stopword)**라고 부름.

`nltk`에서는 영어의 불용어들을 제공해줌. 예시 리뷰에서 불용어를 제거해주고 나니 토큰이 437개에서 219개로 확 줄어든 것을 볼 수 있다.

In [16]:
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
stopwords_list = stopwords.words('english')
print('nltk에 내장된 불용어 개수: ', len(stopwords_list))
print('불용어 예시: ', stopwords_list[11:21])
non_stopwords = [w for w in token_words if not w in stopwords_list]
print('불용어를 제거한 후 남은 토큰 개수: ', len(non_stopwords))

nltk에 내장된 불용어 개수:  179
불용어 예시:  ["you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself']
불용어를 제거한 후 남은 토큰 개수:  219


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 4-5. 어간 추출(Stemming)
어간 추출의 작업은 정확히 어간만 추출하는 것이 아니라 정해진 규칙에 따라 임의로 어미를 떼어놓는 작업이기에 100% 정확하게 어간만 추출할 수는 없음.

nltk의 포터 알고리즘의 규칙은 다음과 같습니다.
- ALIZE → AL
- ANCE → 제거
- ICAL → IC

예시를 들어보면, 아래와 같은 어간 추출을 합니다.

- formalize → formal
- allowance → allow
- electricical → electric

In [17]:
from nltk.stem import PorterStemmer
porter_stemmer = PorterStemmer()
stemmed_words = [porter_stemmer.stem(w) for w in non_stopwords]

## 4-6. 표제어 추출(Lemmatization)
- stemming: 단어 그 자체만을 고려. 규칙에 따라 접사 부분을 삭제하고 어간만 남김. 품사 정보가 훼손될 수 있고, 단어의 형태가 어색할 수 있음(수업교재)
- **lemmatization**: 단어의 품사 정보를 그대로 유지하지만 본래 단어의 품사 정보가 주어져야 함. 어간 추출(stemming)에 비해 단어의 형태를 적절하게 유지함.

kaggle의 다른 코드들을 참고한 결과 stemming보다 lemmatization을 하는 것이 더 나은 결과를 도출하는 것 같아서 lemmatization만 수행함.

In [18]:
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [19]:
lemmatized_words_1 = [wordnet_lemmatizer.lemmatize(w) for w in non_stopwords]
lemmatized_words = [wordnet_lemmatizer.lemmatize(w,"v") for w in lemmatized_words_1]

과거형, 진행형 등 시제가 적용된 동사에 대해 원형으로 돌려놓은 것을 확인할 수 있음.

In [20]:
difference = [w for w in lemmatized_words if not w in lemmatized_words_1]
difference

['go',
 'listen',
 'watch',
 'watch',
 'watch',
 'go',
 'release',
 'feel',
 'go',
 'bore',
 'consent',
 'bite',
 'exclude',
 'convince',
 'overhear',
 'rant',
 'supply',
 'turn',
 'come',
 'work',
 'perform',
 'close']

In [21]:
difference = [w for w in lemmatized_words_1 if not w in lemmatized_words]
difference

['going',
 'started',
 'listening',
 'watching',
 'watched',
 'watched',
 'thought',
 'going',
 'released',
 'feeling',
 'going',
 'boring',
 'consenting',
 'making',
 'made',
 'bit',
 'excluding',
 'convincing',
 'overheard',
 'ranted',
 'wanted',
 'supplying',
 'turning',
 'came',
 'filming',
 'working',
 'performing',
 'gave',
 'closed']

In [22]:
print(non_stopwords[41:60]) # none
print(stemmed_words[41:60]) # stemming
print(lemmatized_words[41:60]) # lemmatization

['released', 'subtle', 'messages', 'mj', 'feeling', 'towards', 'press', 'also', 'obvious', 'message', 'drugs', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless']
['releas', 'subtl', 'messag', 'mj', 'feel', 'toward', 'press', 'also', 'obviou', 'messag', 'drug', 'bad', 'kay', 'visual', 'impress', 'cours', 'michael', 'jackson', 'unless']
['release', 'subtle', 'message', 'mj', 'feel', 'towards', 'press', 'also', 'obvious', 'message', 'drug', 'bad', 'kay', 'visually', 'impressive', 'course', 'michael', 'jackson', 'unless']


## 4-7. 하나의 함수로 표현
앞선 4-1 ~ 4-6 과정을 하나의 함수로 통합한다.

In [23]:
stopwords_list = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()

def review_to_words(raw_review):
  except_tag = BeautifulSoup(raw_review).get_text() # html 태그 제거
  letters_only = re.sub("[^a-zA-Z]", " ", except_tag) # 알파벳만 남기기
  token_words = letters_only.lower().split() # 소문자로 통합하고 단어 단위로 토큰화
  non_stopwords = [w for w in token_words if not w in stopwords_list] # 불용어 삭제
  lemmatized_words = [wordnet_lemmatizer.lemmatize(w) for w in non_stopwords]
  lemmatized_words = [wordnet_lemmatizer.lemmatize(w,"v") for w in lemmatized_words] # Lemmatization
  words = " ".join(lemmatized_words) # 리스트에 있는 단어들을 띄어쓰기 단위로 묶어 문자열로 반환
  return words

# 5. BOW(Bag Of Words) 형태로 변환
단어들의 출현 빈도에 집중하여 텍스트를 수치화하는 표현 방식. 각 단어에 고유한 정수 인덱스를 부여한 후, 각 인덱스 위치에 단어 등장 빈도를 표현.

해당 작업을 사이킷런의 `CountVectorizer`를 수행해줌.

In [24]:
num_reviews = train['review'].size

clean_train_reviews = []
for i in range(0, num_reviews):
  if (i+1) % 5000 == 0: #실행이 잘되는지 확인하기 위해 5000개 실행될 때마다 확인문구 print
    print('Review {} of {}'.format(i+1, num_reviews))
  clean_train_reviews.append(review_to_words(train['review'][i]))

  except_tag = BeautifulSoup(raw_review).get_text() # html 태그 제거


Review 5000 of 25000
Review 10000 of 25000
Review 15000 of 25000
Review 20000 of 25000
Review 25000 of 25000


In [25]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer = 'word', # 학습 단위
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             min_df = 2, # 토큰이 나타날 최소 문서 개수
                             ngram_range = (1, 2), # 단어 묶음 개수
                             max_features = 4000) # 토큰 최대 개수(컬럼 최대 개수)

## TfidfVectorizer

TF-IDF(단어 빈도 * 문서 빈도 역수)를 학습시키는 함수.
- 단어 빈도: 특정 단어가 한 문서 내에서 출현한 빈도
- 문서 빈도: 특정 단어가 출현한 전체 문서의 개수

정관사 a, the는 영어에서 굉장히 많이 쓰이지만 출현한 빈도에 비해 그렇게 중요하지 않은 단어. 단어 빈도는 높아도 중요도가 낮은 단어들을 걸러주기 위해 단어가 출현한 문서의 빈도도 카운트해줄 필요가 있음.

In [26]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df = 4, # 토큰이 나타날 최소 문서 개수
                             analyzer = 'word', # 학습 단위
                             ngram_range = (1,2), # 단어 묶음 개수
                             max_features = 1000) # 토큰(컬럼) 최대 개수

In [27]:
train_data_features = vectorizer.fit_transform(clean_train_reviews)
train_data_features.shape

(25000, 1000)

In [28]:
# get_feature_names(): 토큰 이름들 추출
vocab = vectorizer.get_feature_names()
vocab[:20]

AttributeError: 'TfidfVectorizer' object has no attribute 'get_feature_names'

In [29]:
# get_feature_names_out()으로 이름 바뀜!
vocab = vectorizer.get_feature_names_out()
vocab[:20]

array(['ability', 'able', 'absolutely', 'accent', 'accept', 'across',
       'act', 'action', 'actor', 'actress', 'actual', 'actually',
       'adaptation', 'add', 'admit', 'adult', 'adventure', 'age', 'ago',
       'agree'], dtype=object)

In [30]:
print(vocab[:30])
print('='*100)
print(train_data_features.toarray())

['ability' 'able' 'absolutely' 'accent' 'accept' 'across' 'act' 'action'
 'actor' 'actress' 'actual' 'actually' 'adaptation' 'add' 'admit' 'adult'
 'adventure' 'age' 'ago' 'agree' 'air' 'alien' 'allow' 'almost' 'alone'
 'along' 'already' 'also' 'although' 'always']
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


In [31]:
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression

x = train_data_features
y = train['sentiment']

x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=42)

# 트리 알고리즘 3개를 사용
rf = RandomForestClassifier(n_estimators = 200,
                            n_jobs = -1,
                            random_state=42,
                            max_depth=20)

xgb = XGBClassifier(n_estimators=200,
                    max_depth=10,
                    learning_rate=0.05,
                    objective='binary:logistic')

lgbm = LGBMClassifier(n_estimators=200,
                    max_depth=10,
                    metric='binary_logloss')

In [32]:
xgb.fit(x_train, y_train)
y_pred_xgb = xgb.predict_proba(x_val)

In [33]:
lgbm.fit(x_train, y_train)
y_pred_lgbm = lgbm.predict_proba(x_val)

[LightGBM] [Info] Number of positive: 8738, number of negative: 8762
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.141777 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 197823
[LightGBM] [Info] Number of data points in the train set: 17500, number of used features: 1000
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499314 -> initscore=-0.002743
[LightGBM] [Info] Start training from score -0.002743


In [34]:
rf.fit(x_train, y_train)
y_pred_rf = rf.predict_proba(x_val)

In [37]:
print('Random Forest AUC Score :', roc_auc_score(y_val, y_pred_rf[:,1]))
print('XGBoost AUC Score :', roc_auc_score(y_val, y_pred_xgb[:,1]))
print('LGBM AUC Score :', roc_auc_score(y_val, y_pred_lgbm[:,1]))

Random Forest AUC Score : 0.9076715878903935
XGBoost AUC Score : 0.9164858292593361
LGBM AUC Score : 0.9262710316820311


# 8. test 데이터 추론

In [35]:
num_reviews = test['review'].size

clean_test_reviews = []
for i in range(0, num_reviews):
     if (i + 1) % 5000 == 0 :  #실행이 잘 되는지 확인하기 위해 5000개 실행될때마다 확인문구 print
         print('Review {} of {}'.format(i+1, num_reviews))
     clean_test_reviews.append(review_to_words(test['review'][i]))

  except_tag = BeautifulSoup(raw_review).get_text() # html 태그 제거


Review 5000 of 25000
Review 10000 of 25000
Review 15000 of 25000
Review 20000 of 25000
Review 25000 of 25000


In [36]:
test_data_features = vectorizer.fit_transform(clean_test_reviews)
test_data_features = test_data_features.toarray()

In [38]:
# 3개 알고리즘 중 원하는 것으로 predict
result = lgbm.predict(test_data_features)
submit['sentiment'] = result



In [39]:
submit.head(10)

Unnamed: 0,id,sentiment
0,12311_10,1
1,8348_2,0
2,5828_4,1
3,7186_2,1
4,12128_7,0
5,2913_8,0
6,4396_1,0
7,395_2,1
8,10616_1,0
9,9074_9,0


In [40]:
test.head(10)

Unnamed: 0,id,review
0,"""12311_10""","""Naturally in a film who's main themes are of ..."
1,"""8348_2""","""This movie is a disaster within a disaster fi..."
2,"""5828_4""","""All in all, this is a movie for kids. We saw ..."
3,"""7186_2""","""Afraid of the Dark left me with the impressio..."
4,"""12128_7""","""A very accurate depiction of small time mob l..."
5,"""2913_8""","""...as valuable as King Tut's tomb! (OK, maybe..."
6,"""4396_1""","""This has to be one of the biggest misfires ev..."
7,"""395_2""","""This is one of those movies I watched, and wo..."
8,"""10616_1""","""The worst movie i've seen in years (and i've ..."
9,"""9074_9""","""Five medical students (Kevin Bacon, David Lab..."


In [41]:
submit.to_csv("./submission_240121.csv", index=False)