### End-to-End NLP(EDA&ML) with Sentiment Analysis

해당 노트북에서는 BOW, tf-idf같은 문자 인코딩 기술이 어떻게 동작하는지 텍스트의 감성 분석을 통해 살펴본다.

사용할 데이터의 경우 감성에 대한 척도가 0, 1, 2, 3, 4로 라벨링 되어 있는 영화 리뷰 데이터이다.

<center>

|라벨링|정도|
|:--:|:--:|
|0|negative(부정)|
|1|somehow negative(약간 부정)|
|2|neutral(보통)|
|3|somehow positive(약간 긍정)|
|4|positive(긍정)|

</center>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import nltk

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk import ngrams

import string, re

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import warnings, os

필요한 설정을 간단하게 진행한다.

In [2]:
pio.renderers.default = "notebook_connected"
warnings.filterwarnings('ignore')

In [3]:
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [4]:
train = pd.read_csv("./data/sentiment/train.tsv", sep="\t")
test = pd.read_csv("./data/sentiment/test.tsv", sep="\t")

### EDA진행

In [5]:
train.shape, test.shape

((156060, 4), (66292, 3))

In [6]:
train.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


In [7]:
test.head()

Unnamed: 0,PhraseId,SentenceId,Phrase
0,156061,8545,An intermittently pleasing but mostly routine ...
1,156062,8545,An intermittently pleasing but mostly routine ...
2,156063,8545,An
3,156064,8545,intermittently pleasing but mostly routine effort
4,156065,8545,intermittently pleasing but mostly routine


In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PhraseId    156060 non-null  int64 
 1   SentenceId  156060 non-null  int64 
 2   Phrase      156060 non-null  object
 3   Sentiment   156060 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


In [9]:
train.isna().sum().sum()

0

데이터에 sentiment의 등급마다 설명을 추가해줍니다.

In [10]:
map_dict = {
    0 : "negative",
    1 : "somehow negative",
    2 : "neutral",
    3 : "somehow positive",
    4 : "positive"
}

train['sentiment_class'] = train['Sentiment'].map(map_dict)
train.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,sentiment_class
0,1,1,A series of escapades demonstrating the adage ...,1,somehow negative
1,2,1,A series of escapades demonstrating the adage ...,2,neutral
2,3,1,A series,2,neutral
3,4,1,A,2,neutral
4,5,1,series,2,neutral


`string.punctuation`을 활용하여 구두점을 제거합니다.

- string.punctuation은 아래와 같은 특수문자들을 포함합니다.

```
!"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~
```

In [11]:
def remove_puncutation(text):
    return ''.join([x for x in text if x not in string.punctuation])

In [12]:
train['Phrase'] = train['Phrase'].apply(lambda x : remove_puncutation(x))
train.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,sentiment_class
0,1,1,A series of escapades demonstrating the adage ...,1,somehow negative
1,2,1,A series of escapades demonstrating the adage ...,2,neutral
2,3,1,A series,2,neutral
3,4,1,A,2,neutral
4,5,1,series,2,neutral


2개 이하의 문자로 이루어진 단어를 제거합니다.

In [13]:
def words_with_more_than_three_chars(text):
    return ' '.join([x for x in text.split() if len(x) > 3])

In [14]:
train['Phrase'] = train['Phrase'].apply(lambda x : words_with_more_than_three_chars(x))
train.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,sentiment_class
0,1,1,series escapades demonstrating adage that what...,1,somehow negative
1,2,1,series escapades demonstrating adage that what...,2,neutral
2,3,1,series,2,neutral
3,4,1,,2,neutral
4,5,1,series,2,neutral


In [15]:
print(train.shape)
train.loc[train['Phrase']==''].shape

(156060, 5)


(2129, 5)

In [16]:
train = train.loc[train['Phrase'] != '']

In [17]:
train.shape

(153931, 5)

### 불용어 제거

nltk에 내장되어 있는 stopwords를 적용하고, 제거합니다.

In [18]:
stopword = stopwords.words('english')
train['Phrase'] = train['Phrase'].apply(lambda x : ' '.join([word for word in x.split() if word not in stopword]))
train.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,sentiment_class
0,1,1,series escapades demonstrating adage good goos...,1,somehow negative
1,2,1,series escapades demonstrating adage good goose,2,neutral
2,3,1,series,2,neutral
4,5,1,series,2,neutral
5,6,1,escapades demonstrating adage good goose,2,neutral


### 감성 카테고리 확인

In [18]:
train.groupby('Sentiment')['Sentiment'].count()

Sentiment
0     7057
1    27141
2    77773
3    32767
4     9193
Name: Sentiment, dtype: int64

### 타겟 클래스의 Distribution

In [19]:
data = train.groupby('sentiment_class')['sentiment_class'].count()

fig = px.bar(y = data, x = data.index.tolist(), title='Target Classes')
fig.update_yaxes(tickformat=',', title='count')
fig.update_xaxes(title='classes')
fig.update_layout(width=800, height=700)
fig.show()

### 타겟 클래스 간 비율 확인

In [20]:
sentiment_counts = train['sentiment_class'].value_counts().reset_index()
sentiment_counts.columns = ['sentiment_class', 'count']
sentiment_counts['percentage'] = (sentiment_counts['count'] / train.shape[0]) * 100

pie_chart = px.pie(sentiment_counts, names='sentiment_class', values='percentage', title='% Target class', labels={'percentage':'% of Total'})
pie_chart.update_traces(textinfo='percent+label')
pie_chart.update_layout(width=800, height=700)
pie_chart.show()

### 문장 길이 컬럼 추가

In [21]:
train['PhraseLength'] = train['Phrase'].apply(lambda x : len(x))
train.sort_values(by="PhraseLength", ascending=False).head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment,sentiment_class,PhraseLength
149938,149939,8167,hard imagine anyone managing steal movie only ...,4,positive,215
149939,149940,8167,hard imagine anyone managing steal movie only ...,4,positive,215
149940,149941,8167,hard imagine anyone managing steal movie only ...,3,somehow positive,215
149941,149942,8167,hard imagine anyone managing steal movie only ...,4,positive,215
54876,54877,2734,Filmmakers Dana JanklowiczMann Amir Mann area ...,3,somehow positive,213


### 각 클래스에 대한 문장의 길이 시각화

In [22]:
import plotly.graph_objects as go

negative_legnth = train[train['sentiment_class'] == 'negative']['PhraseLength']
somehow_negative_legnth = train[train['sentiment_class'] == 'somehow negative']['PhraseLength']
neutral_legnth = train[train['sentiment_class'] == 'neutral']['PhraseLength']
somehow_positive_legnth = train[train['sentiment_class'] == 'somehow positive']['PhraseLength']
positive_legnth = train[train['sentiment_class'] == 'positive']['PhraseLength']

length_dist = go.Figure()
length_dist.add_trace(go.Histogram(x=negative_legnth, histnorm = 'probability'))
length_dist.add_trace(go.Histogram(x=somehow_negative_legnth, histnorm = 'probability'))
length_dist.add_trace(go.Histogram(x=neutral_legnth, histnorm = 'probability'))
length_dist.add_trace(go.Histogram(x=somehow_positive_legnth, histnorm = 'probability'))
length_dist.add_trace(go.Histogram(x=positive_legnth, histnorm = 'probability'))
length_dist.update_layout(width=1200, height=700)
length_dist.show()

### WordCloud를 활용하여 자주 나오는 단어 시각화

In [23]:
from wordcloud import WordCloud, STOPWORDS
stopword_ = set(STOPWORDS)

word_cloud_common_words = []
for _, row in train.iterrows():
    word_cloud_common_words.append(row['Phrase'])
word_cloud_common_words

['series escapades demonstrating adage that what good goose also good gander some which occasionally amuses none which amounts much story',
 'series escapades demonstrating adage that what good goose',
 'series',
 'series',
 'escapades demonstrating adage that what good goose',
 'escapades demonstrating adage that what good goose',
 'escapades',
 'demonstrating adage that what good goose',
 'demonstrating adage',
 'demonstrating',
 'adage',
 'adage',
 'that what good goose',
 'that',
 'what good goose',
 'what',
 'good goose',
 'good goose',
 'good',
 'goose',
 'goose',
 'goose',
 'also good gander some which occasionally amuses none which amounts much story',
 'also good gander some which occasionally amuses none which amounts much story',
 'also',
 'also',
 'good gander some which occasionally amuses none which amounts much story',
 'gander some which occasionally amuses none which amounts much story',
 'gander some which occasionally amuses none which amounts much story',
 'gander',

In [24]:
wordcloud = WordCloud(width=1600, height=600, background_color='white', stopwords=stopword_, min_font_size=5).generate(''.join(word_cloud_common_words))

word_cloud_image = px.imshow(wordcloud)
word_cloud_image.update_layout(width=800, height=700)
word_cloud_image.show()

### 단어 등장 빈도 체크

In [25]:
text_list = word_cloud_common_words.copy()

total_words = ''.join(text_list)
total_words = word_tokenize(total_words)
total_words

['series',
 'escapades',
 'demonstrating',
 'adage',
 'that',
 'what',
 'good',
 'goose',
 'also',
 'good',
 'gander',
 'some',
 'which',
 'occasionally',
 'amuses',
 'none',
 'which',
 'amounts',
 'much',
 'storyseries',
 'escapades',
 'demonstrating',
 'adage',
 'that',
 'what',
 'good',
 'gooseseriesseriesescapades',
 'demonstrating',
 'adage',
 'that',
 'what',
 'good',
 'gooseescapades',
 'demonstrating',
 'adage',
 'that',
 'what',
 'good',
 'gooseescapadesdemonstrating',
 'adage',
 'that',
 'what',
 'good',
 'goosedemonstrating',
 'adagedemonstratingadageadagethat',
 'what',
 'good',
 'goosethatwhat',
 'good',
 'goosewhatgood',
 'goosegood',
 'goosegoodgoosegoosegoosealso',
 'good',
 'gander',
 'some',
 'which',
 'occasionally',
 'amuses',
 'none',
 'which',
 'amounts',
 'much',
 'storyalso',
 'good',
 'gander',
 'some',
 'which',
 'occasionally',
 'amuses',
 'none',
 'which',
 'amounts',
 'much',
 'storyalsoalsogood',
 'gander',
 'some',
 'which',
 'occasionally',
 'amuses',
 '

In [26]:
freq_words = FreqDist(total_words)
freq_words

FreqDist({'that': 9438, 'with': 6133, 'this': 3449, 'film': 3435, 'movie': 3077, 'than': 2919, 'from': 2803, 'more': 2644, 'about': 2616, 'have': 2261, ...})

In [27]:
word_frequency = FreqDist(freq_words)

In [28]:
word_frequency

FreqDist({'that': 9438, 'with': 6133, 'this': 3449, 'film': 3435, 'movie': 3077, 'than': 2919, 'from': 2803, 'more': 2644, 'about': 2616, 'have': 2261, ...})

In [29]:
len(freq_words), len(word_frequency)

(97196, 97196)

In [30]:
mode_20 = px.bar(pd.DataFrame(word_frequency, index=[0]).T.sort_values(by=[0], ascending=False).head(20))
mode_20.update_layout(width=1000, height=600)
mode_20.update_yaxes(tickformat=",")
mode_20.show()

### 부정적인 감성을 표현할 때 자주 사용된 단어

In [31]:
neg_text_list = []
for idx, row in train[train['sentiment_class']=='negative'].iterrows():
    neg_text_list.append(row['Phrase'])

neg_text_list

['would have hard time sitting through this',
 'have hard time sitting through this',
 'Aggressive selfglorification manipulative whitewash',
 'selfglorification manipulative whitewash',
 'Trouble Every plodding mess',
 'plodding mess',
 'plodding mess',
 'could hate same reason',
 'hate',
 'hate',
 'Oedekerk realization childhood dream martialarts flick proves that sometimes dreams youth should remain just that',
 'baseball movies that hard mythic',
 'Hampered paralyzed selfindulgent script that aims poetry ends sounding like satire',
 'selfindulgent script',
 'There very little sense what going here',
 'avoid',
 'almost feels movie more interested entertaining itself than amusing',
 'movie progression into rambling incoherence gives meaning phrase fatal script error',
 'movie progression into rambling incoherence',
 'gives meaning phrase fatal script error',
 'fatal script error',
 'fatal script error',
 'Tartakovsky team some freakish powers visual charm five writers slip into moder

In [32]:
neg_total_words = ' '.join(neg_text_list)
neg_total_words = word_tokenize(neg_total_words)

neg_frequency = FreqDist(neg_total_words)

In [34]:
top_20_neg = px.bar(pd.DataFrame(neg_frequency, index=[0]).T.sort_values(by=[0], ascending=False).head(20))
top_20_neg.show()

### 긍정적인 감성에 자주 사용된 단어

In [36]:
pos_text_list = []
for idx, row in train[train['sentiment_class']=='positive'].iterrows():
    pos_text_list.append(row['Phrase'])

pos_text_list

['This quiet introspective entertaining independent worth seeking',
 'quiet introspective entertaining independent',
 'entertaining',
 'worth seeking',
 'positively thrilling combination ethnography intrigue betrayal deceit murder Shakespearean tragedy juicy soap opera',
 'positively thrilling combination ethnography intrigue betrayal deceit murder',
 'thrilling',
 'comedydrama nearly epic proportions rooted sincere performance title character undergoing midlife crisis',
 'nearly epic',
 'rooted sincere performance title character undergoing midlife crisis',
 'sincere performance',
 'sincere performance',
 'sincere performance',
 'recommend Snow Dogs',
 'high hilarity',
 'performances absolute',
 'absolute',
 'absolute',
 'extravagant',
 'better',
 'this sweet modest ultimately winning story',
 'sweet modest ultimately winning story',
 'sweet modest',
 'sweet',
 'ultimately winning story',
 'dizzily gorgeous',
 'Love very much',
 'Love',
 'Best indie year',
 'Best',
 'indie year',
 'im

In [37]:
pos_total_words = ' '.join(pos_text_list)
pos_total_words = word_tokenize(pos_total_words)

pos_frequency = FreqDist(pos_total_words)

In [38]:
pos_20 = px.bar(pd.DataFrame(pos_frequency, index=[0]).T.sort_values(by=[0], ascending=False).head(20))
pos_20.show()

### 긍정적인 감성 표현에 자주 사용되는 bigram 단어

> n-gram : 단어마다 그 단어 포함 뒤로 n개만큼의 단어를 한 묶음으로 보고 이를 하나의 토큰으로 간주 하는 것.
- n=1 : unigrams
- n=2 : bigrams
- n=3 : trigrams
- n=4 : ngrams

In [None]:
text_list = []