## Natural Language Vectorization

### 1. CountVectorizer
Vectorizes a collection of documents by counting how often each word occurs (term frequency).

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
text_data = ['나는 배가 고프다', '내일 점심 뭐먹지', '내일 공부 해야겠다', '점심 먹고 공부 해야지']
cvect = CountVectorizer()

In [7]:
# Learn the vocabulary and build the document-term count matrix
output = cvect.fit_transform(text_data)
output.toarray()

array([[1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]], dtype=int64)

In [15]:
cvect.vocabulary_

{'나는': 2,
 '배가': 6,
 '고프다': 0,
 '내일': 3,
 '점심': 7,
 '뭐먹지': 5,
 '공부': 1,
 '해야겠다': 8,
 '먹고': 4,
 '해야지': 9}

In [14]:
sorted(cvect.vocabulary_)

['고프다', '공부', '나는', '내일', '먹고', '뭐먹지', '배가', '점심', '해야겠다', '해야지']

In [18]:
import pandas as pd
df = pd.DataFrame(output.toarray(), columns=sorted(cvect.vocabulary_))
display(df)

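|   | 고프다 | 공부 | 나는 | 내일 | 먹고 | 뭐먹지 | 배가 | 점심 | 해야겠다 | 해야지 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |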


In [19]:
# Vectorize the text using the already-fitted vocabulary (no re-fitting)
output2 = cvect.transform(text_data)
output2.toarray()

array([[1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 0, 1, 0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 1, 0, 1]], dtype=int64)
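
Because `transform` reuses the vocabulary learned by `fit`, words that were not seen during fitting are simply ignored. The sketch below illustrates this with a hypothetical new sentence (`new_text` is not part of the original notebook); the word '운동' is assumed to be absent from the fitted vocabulary, which holds for the four training sentences used here.

```python
from sklearn.feature_extraction.text import CountVectorizer

text_data = ['나는 배가 고프다', '내일 점심 뭐먹지', '내일 공부 해야겠다', '점심 먹고 공부 해야지']
cvect = CountVectorizer()
cvect.fit(text_data)  # learn the 10-word vocabulary

# Hypothetical unseen sentence: '운동' does not occur in text_data.
new_text = ['내일 점심 먹고 운동']
new_vec = cvect.transform(new_text).toarray()

# Only the known words (내일, 먹고, 점심) are counted; '운동' is dropped.
print(new_vec)  # [[0 0 0 1 1 0 0 1 0 0]]
```

The output still has 10 columns, one per vocabulary entry, so vectors for new text stay compatible with a model trained on the original matrix.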

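### 2. TfidfVectorizer
Uses TF-IDF values to compensate for a shortcoming of CountVectorizer, which weights words by raw count alone.<br>
First compute the term frequency (TF) of a word within a document, then compute its document frequency (DF) across all documents, invert that value to obtain the inverse document frequency (IDF), and multiply TF by IDF.<br>
A word that does not appear in a document gets 0. Otherwise, a larger value indicates that the word is a more informative feature for that document, and a smaller value indicates that it carries little information.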
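
The computation can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer`: a smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It also assumes plain whitespace tokenization, which matches the default tokenizer here only because every token has at least two characters. The helper name `tfidf_row` is illustrative.

```python
import math

docs = ['나는 배가 고프다', '내일 점심 뭐먹지', '내일 공부 해야겠다', '점심 먹고 공부 해야지']
tokenized = [d.split() for d in docs]            # whitespace tokens
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency (assumed scikit-learn default).
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    raw = [tokens.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(doc) for doc in tokenized]

# Second document ('내일 점심 뭐먹지'): the rarer word '뭐먹지' gets the larger weight.
print([round(v, 6) for v in matrix[1]])
```

In the second document, '내일' and '점심' each occur in two documents and receive about 0.526, while '뭐먹지' occurs in only one and receives about 0.668, which agrees with the `TfidfVectorizer` output for the same data.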

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
text_data = ['나는 배가 고프다', '내일 점심 뭐먹지', '내일 공부 해야겠다', '점심 먹고 공부 해야지']
tfidf = TfidfVectorizer()

In [31]:
output = tfidf.fit_transform(text_data)

In [32]:
output.toarray()

array([[0.57735027, 0.        , 0.57735027, 0.        , 0.        ,
        0.        , 0.57735027, 0.        , 0.        , 0.        ],
       [0.        , 0.        , 0.        , 0.52640543, 0.        ,
        0.66767854, 0.        , 0.52640543, 0.        , 0.        ],
       [0.        , 0.52640543, 0.        , 0.52640543, 0.        ,
        0.        , 0.        , 0.        , 0.66767854, 0.        ],
       [0.        , 0.43779123, 0.        , 0.        , 0.55528266,
        0.        , 0.        , 0.43779123, 0.        , 0.55528266]])

In [35]:
tfidf.vocabulary_

{'나는': 2,
 '배가': 6,
 '고프다': 0,
 '내일': 3,
 '점심': 7,
 '뭐먹지': 5,
 '공부': 1,
 '해야겠다': 8,
 '먹고': 4,
 '해야지': 9}

In [36]:
df2 = pd.DataFrame(output.toarray(),columns=sorted(tfidf.vocabulary_))
display(df2)

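|   | 고프다 | 공부 | 나는 | 내일 | 먹고 | 뭐먹지 | 배가 | 점심 | 해야겠다 | 해야지 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.57735 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 | 0.57735 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.526405 | 0.0 | 0.667679 | 0.0 | 0.526405 | 0.0 | 0.0 |
| 2 | 0.0 | 0.526405 | 0.0 | 0.526405 | 0.0 | 0.0 | 0.0 | 0.0 | 0.667679 | 0.0 |
| 3 | 0.0 | 0.437791 | 0.0 | 0.0 | 0.555283 | 0.0 | 0.0 | 0.437791 | 0.0 | 0.555283 |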


## Model Training

linear/logistic/pipeline

### 1. Data Preprocessing

#### Single-Turn Conversation Dataset

In [1]:
import pandas as pd
df = pd.read_excel("../data/감성분석/kor_train_data/한국어_단발성_대화_데이터셋.xlsx")

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38594 entries, 0 to 38593
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Sentence    38594 non-null  object 
 1   Emotion     38594 non-null  object 
 2   Unnamed: 2  0 non-null      float64
 3   Unnamed: 3  0 non-null      float64
 4   Unnamed: 4  0 non-null      float64
 5   공포          7 non-null      object 
 6   5468        7 non-null      float64
dtypes: float64(4), object(3)
memory usage: 2.1+ MB


In [19]:
df2 = df.copy()
df2 = df2.iloc[:,:2]
df2

|   | Sentence | Emotion |
|---|---|---|
| 0 | 언니 동생으로 부르는게 맞는 일인가요..?? | 공포 |
| 1 | 그냥 내 느낌일뿐겠지? | 공포 |
| 2 | 아직너무초기라서 그런거죠? | 공포 |
| 3 | 유치원버스 사고 낫다던데 | 공포 |
| 4 | 근데 원래이런거맞나요 | 공포 |
| ... | ... | ... |
| 38589 | 솔직히 예보 제대로 못하는 데 세금이라도 아끼게 그냥 폐지해라.. | 혐오 |
| 38590 | 재미가 없으니 망하지 | 혐오 |
| 38591 | 공장 도시락 비우생적임 아르바이트했는데 화장실가성 손도 않씯고 재료 담고 바닥 떨어... | 혐오 |
| 38592 | 코딱지 만한 나라에서 지들끼리 피터지게 싸우는 센징 클래스 ㅉㅉㅉ | 혐오 |


In [41]:
df3 = df2.copy()
df3.Sentence = df2.Sentence.str.replace("[^가-힣 ]","", regex=True)
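
The pattern `[^가-힣 ]` removes every character that is not a complete Hangul syllable or a space. A minimal sketch of its effect, using Python's `re` module on individual strings rather than the pandas column (the second input string is a made-up example):

```python
import re

# Same pattern as in the cell above: keep only Hangul syllables and spaces.
pattern = "[^가-힣 ]"

# Punctuation is stripped from a sentence taken from the dataset.
cleaned = re.sub(pattern, "", "그냥 내 느낌일뿐겠지?")
print(cleaned)  # 그냥 내 느낌일뿐겠지

# Latin letters, standalone jamo (ㅋ), and punctuation are all removed,
# but the spaces around them remain.
mixed = re.sub(pattern, "", "OK ㅋㅋ 좋아요!")
print(repr(mixed))  # '  좋아요'
```

Since spaces are preserved, removed tokens can leave leading or repeated whitespace behind. That is usually harmless for `CountVectorizer` and `TfidfVectorizer`, which tokenize on word boundaries.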

In [50]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse=False)  # scikit-learn >= 1.2 renames this parameter to sparse_output

In [60]:
df3.Emotion.shape
df3.Emotion.values.reshape(-1,1)
encoded_y = ohe.fit_transform(df3.Emotion.values.reshape(-1,1))

In [61]:
ohe.categories_

[array(['공포', '놀람', '분노', '슬픔', '중립', '행복', '혐오'], dtype=object)]

In [62]:
X_train, X_test, y_train, y_test = train_test_split(
    df.Sentence, encoded_y, stratify=df.Sentence, random_state=2022
)
X_train, X_test, y_train, y_test

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.
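
The `ValueError` is consistent with passing the sentences themselves as the `stratify` argument: nearly every sentence is unique, so each "class" has a single member. Stratifying on the label column instead is the usual remedy. The sketch below uses a small hypothetical DataFrame `toy` in place of `df3`, since the dataset file is not reproduced here. With the real data the call would presumably take `df3.Sentence`, `encoded_y`, and `stratify=df3.Emotion`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for df3: two emotion labels, four sentences each.
toy = pd.DataFrame({
    "Sentence": [f"sentence {i}" for i in range(8)],
    "Emotion": ["공포"] * 4 + ["혐오"] * 4,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy.Sentence,
    toy.Emotion,
    stratify=toy.Emotion,   # stratify on the labels, not on the sentences
    test_size=0.25,
    random_state=2022,
)

# Each label keeps the same proportion in both splits.
print(y_train.value_counts().to_dict())
print(y_test.value_counts().to_dict())
```

Stratifying on the labels keeps the emotion distribution the same in the training and test sets, which matters when some classes are much smaller than others.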