# Preparing the 20 Newsgroups Data and Extracting Features

**Loading and splitting the dataset**

In [1]:
from sklearn.datasets import fetch_20newsgroups

# Build a list of the topics to select from the 20 available
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

# Load the training set
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),  # strip the parts that hint at the label, so classification relies on the body text alone
                                      categories=categories)
# Load the test set
newsgroups_test = fetch_20newsgroups(subset='test',
                                     remove=('headers', 'footers', 'quotes'),
                                     categories=categories)

print('Train set size:', len(newsgroups_train.data))
print('Test set size:', len(newsgroups_test.data))
print('Selected categories:', newsgroups_train.target_names)
print('Train labels:', set(newsgroups_train.target))

Train set size: 2034
Test set size: 1353
Selected categories: ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']
Train labels: {0, 1, 2, 3}


In [2]:
print('Train set text samples:', newsgroups_train.data[0])
print('Train set label samples:', newsgroups_train.target[0], '\n')
print('Test set text samples:', newsgroups_test.data[0])
print('Test set label samples:', newsgroups_test.target[0])

Train set text samples: Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych
Train set label samples: 1 

Test set text samples: TRry the SKywatch project in  Arizona.
Test set label samples: 2


**Count-based feature extraction**

In [3]:
X_train = newsgroups_train.data   # training documents
y_train = newsgroups_train.target # training labels

X_test = newsgroups_test.data     # test documents
y_test = newsgroups_test.target   # test labels

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

# Keep at most 2000 terms that appear in at least 5 documents and in no more than 50% of all documents
cv = CountVectorizer(max_features=2000, min_df=5, max_df=0.5)

X_train_cv = cv.fit_transform(X_train)  # learn the vocabulary from the train set and transform it
print('Train set dimension:', X_train_cv.shape)
X_test_cv = cv.transform(X_test)    # transform the test set with the same vocabulary
print('Test set dimension:', X_test_cv.shape)

Train set dimension: (2034, 2000)
Test set dimension: (1353, 2000)


In [5]:
for word, count in zip(cv.get_feature_names_out()[:100], X_train_cv[0].toarray()[0, :100]):
    print(word, ':', count, end=', ')

00 : 0, 000 : 0, 01 : 0, 04 : 0, 05 : 0, 10 : 0, 100 : 0, 1000 : 0, 11 : 0, 12 : 0, 128 : 0, 129 : 0, 13 : 0, 130 : 0, 14 : 0, 15 : 0, 16 : 0, 17 : 0, 18 : 0, 19 : 0, 1987 : 0, 1988 : 0, 1989 : 0, 1990 : 0, 1991 : 0, 1992 : 0, 1993 : 0, 20 : 0, 200 : 0, 202 : 0, 21 : 0, 22 : 0, 23 : 0, 24 : 0, 25 : 0, 256 : 0, 26 : 0, 27 : 0, 28 : 0, 2d : 0, 30 : 0, 300 : 0, 31 : 0, 32 : 0, 33 : 0, 34 : 0, 35 : 0, 39 : 0, 3d : 0, 40 : 0, 400 : 0, 42 : 0, 45 : 0, 50 : 0, 500 : 0, 60 : 0, 600 : 0, 65 : 0, 70 : 0, 75 : 0, 80 : 0, 800 : 0, 90 : 0, 900 : 0, 91 : 0, 92 : 0, 93 : 0, 95 : 0, _the : 0, ability : 0, able : 1, abortion : 0, about : 1, above : 0, absolute : 0, absolutely : 0, ac : 0, accept : 0, acceptable : 0, accepted : 0, access : 0, according : 0, account : 0, accurate : 0, across : 0, act : 0, action : 0, actions : 0, active : 0, activities : 0, activity : 0, acts : 0, actual : 0, actually : 0, ad : 0, add : 0, added : 0, addition : 0, additional : 0, address : 0, 
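
To make the `min_df`, `max_df`, and `max_features` arguments more concrete, here is a simplified pure-Python sketch of the vocabulary filtering. It is not scikit-learn's implementation: the helper name `build_vocabulary` and the toy documents are assumptions made for illustration, and only the token pattern mirrors the documented `CountVectorizer` default.

```python
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # words of two or more characters

def build_vocabulary(docs, min_df=1, max_df=1.0, max_features=None):
    """Keep terms whose document frequency lies between min_df and max_df."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    term_freq = Counter(term for tokens in tokenized for term in tokens)

    max_docs = max_df * len(docs)  # max_df is treated as a proportion of documents
    kept = [t for t, df in doc_freq.items() if min_df <= df <= max_docs]
    if max_features is not None:
        # keep the terms that occur most often across the whole corpus
        kept = sorted(kept, key=lambda t: (-term_freq[t], t))[:max_features]
    return sorted(kept)

# Hypothetical toy corpus
docs = [
    "the shuttle reached orbit",
    "the orbit of the moon",
    "the image file format",
    "the graphics card",
]
vocab = build_vocabulary(docs, min_df=2, max_df=0.5)
top_two = build_vocabulary(docs, max_features=2)
print(vocab)    # "the" is too common, most other words are too rare
print(top_two)
```

In this toy corpus only `orbit` survives both thresholds, which is the same effect that removes very common and very rare words from the 2000-term vocabulary above.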

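# Document Classification with a Naive Bayes Classifier

**A helper that uses `coef_` to analyze the influence of each word**<br>
Note that naive Bayes classifiers no longer expose `coef_` in scikit-learn 1.1 and later, so this helper cannot be applied to them there.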

In [6]:
import numpy as np

def top10_features(classifier, vectorizer, categories):
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        # Negate the coefficients so that argsort yields descending order, then keep the first 15 indices
        top_idx = np.argsort(-classifier.coef_[i])[:15]
        print("%s: %s" % (category, ", ".join(feature_names[top_idx])), "\n")   # print the category and its 15 most influential features
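
Because naive Bayes models in newer scikit-learn releases do not expose `coef_`, one possible workaround is to fall back to `feature_log_prob_`, which, to my knowledge, holds the per-class log probabilities of each feature. The sketch below is pure Python so that it runs without a fitted model; `class_weights`, `top_features`, and the stand-in classifier objects are hypothetical names used only for illustration.

```python
from types import SimpleNamespace

def class_weights(classifier):
    """Return the per-class feature weights that the model exposes."""
    if hasattr(classifier, "coef_"):
        return classifier.coef_            # linear models
    return classifier.feature_log_prob_    # naive Bayes models

def top_features(classifier, feature_names, categories, n=15):
    """Map each category to its n highest-weighted feature names."""
    weights = class_weights(classifier)
    result = {}
    for i, category in enumerate(categories):
        row = list(weights[i])
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:n]
        result[category] = [feature_names[j] for j in order]
    return result

# Stand-in objects that only mimic the attributes of fitted models.
names = ["graphics", "orbit", "you"]
labels = ["comp.graphics", "sci.space"]
linear_like = SimpleNamespace(coef_=[[2.0, -1.0, 0.1], [-1.5, 1.8, 0.0]])
nb_like = SimpleNamespace(feature_log_prob_=[[-1.2, -3.0, -0.7], [-3.1, -1.1, -0.8]])

linear_top = top_features(linear_like, names, labels, n=2)
nb_top = top_features(nb_like, names, labels, n=2)
print(linear_top)
print(nb_top)
```

Returning a dictionary instead of printing keeps the helper easy to test. With a real fitted model, `feature_names` would come from `vectorizer.get_feature_names_out()`.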

**Classification with count vectors**

In [7]:
from sklearn.naive_bayes import MultinomialNB

NB_clf = MultinomialNB()    # create the classifier

NB_clf.fit(X_train_cv, y_train)     # train the classifier on the train set

print('Train set score: {:.3f}'.format(NB_clf.score(X_train_cv, y_train)))  # accuracy on the train set
print('Test set score: {:.3f}'.format(NB_clf.score(X_test_cv, y_test)))     # accuracy on the test set

Train set score: 0.824
Test set score: 0.732


In [8]:
print('First document and label in test data:', X_test[0], y_test[0])
print('Second document and label in test data:', X_test[1], y_test[1])

pred = NB_clf.predict(X_test_cv[:2])

print('Predicted labels:', pred)
print('Predicted categories:', newsgroups_train.target_names[pred[0]], '/', newsgroups_train.target_names[pred[1]])

First document and label in test data: TRry the SKywatch project in  Arizona. 2
Second document and label in test data: The Vatican library recently made a tour of the US.
 Can anyone help me in finding a FTP site where this collection is 
 available. 1
Predicted labels: [2 1]
Predicted categories: sci.space / comp.graphics


In [9]:
top10_features(NB_clf, cv, newsgroups_train.target_names)

# Among the top 15 words per category, only comp.graphics and sci.space contain clearly category-specific words.

alt.atheism: you, not, be, are, this, have, as, but, or, if, they, on, an, what, by 

comp.graphics: you, on, this, or, with, be, can, are, have, if, from, image, as, graphics, but 

sci.space: on, space, be, are, this, you, as, have, with, at, from, was, by, or, not 

talk.religion.misc: you, not, as, this, are, be, have, with, was, he, they, on, but, or, by 





**Classification with TF-IDF**

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Use the same arguments as for CountVectorizer
tfidf = TfidfVectorizer(max_features=2000, min_df=5, max_df=0.5)
X_train_tfidf = tfidf.fit_transform(X_train)    # fit on the train set and transform it
X_test_tfidf = tfidf.transform(X_test)      # transform the test set

NB_clf.fit(X_train_tfidf, y_train)      # retrain the classifier on the TF-IDF train set

print('Train set score: {:.3f}'.format(NB_clf.score(X_train_tfidf, y_train)))   # accuracy on the train set
print('Test set score: {:.3f}'.format(NB_clf.score(X_test_tfidf, y_test)))  # accuracy on the test set

Train set score: 0.862
Test set score: 0.741
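
For readers who want to see what the TF-IDF transformation does to a count matrix, the following is a minimal pure-Python sketch under the assumption of `TfidfVectorizer`'s default settings (smoothed IDF, L2 normalization, no sublinear term frequency). The function name `tfidf_rows` and the toy counts are hypothetical.

```python
import math

def tfidf_rows(count_rows):
    """Apply smoothed IDF weighting and L2 normalization to raw count rows."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    doc_freq = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    # smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

    weighted_rows = []
    for row in count_rows:
        weighted = [count * w for count, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # guard against empty rows
        weighted_rows.append([v / norm for v in weighted])
    return weighted_rows

# Columns: "you" (appears in every document), "orbit", "graphics"
counts = [
    [3, 1, 0],
    [2, 0, 1],
    [4, 0, 0],
]
rows = tfidf_rows(counts)
for row in rows:
    print([round(v, 3) for v in row])
```

A word that appears in every document keeps an IDF of 1, while rarer words are weighted up, so the rare word gains weight relative to the common one within each row.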


In [11]:
print('First document and label in test data:', X_test[0], y_test[0])
print('Second document and label in test data:', X_test[1], y_test[1])

pred = NB_clf.predict(X_test_tfidf[:2])    # predict with the TF-IDF features the classifier was retrained on

print('Predicted labels:', pred)
print('Predicted categories:', newsgroups_train.target_names[pred[0]], '/', newsgroups_train.target_names[pred[1]])

First document and label in test data: TRry the SKywatch project in  Arizona. 2
Second document and label in test data: The Vatican library recently made a tour of the US.
 Can anyone help me in finding a FTP site where this collection is 
 available. 1
Predicted labels: [2 1]
Predicted categories: sci.space / comp.graphics


In [12]:
top10_features(NB_clf, tfidf, newsgroups_train.target_names)

# With TF-IDF, more categories have category-specific words among their top 15 than with count vectors.

alt.atheism: you, not, are, be, this, have, as, what, they, if, do, god, but, your, or 

comp.graphics: you, on, graphics, this, have, any, can, or, with, thanks, if, be, but, there, are 

sci.space: space, on, you, be, was, this, as, they, have, are, at, would, or, if, from 

talk.religion.misc: you, not, he, are, as, this, be, god, was, they, with, have, who, jesus, your 





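**Performance difference between `CountVectorizer` and `TfidfVectorizer`**<br>
TF-IDF appears to train slightly better (test accuracy 0.741 versus 0.732), yet with both count vectors and TF-IDF the top-ranked words include fewer category-specific words than one might expect.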
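
One plausible reason that common words dominate the naive Bayes rankings is that the weights are log probabilities of a word given a class, which are high for any frequent word regardless of how specific it is. The sketch below illustrates this on hypothetical toy counts; `feature_log_prob` here is a hand-written helper with additive smoothing, not the scikit-learn attribute of the same name.

```python
import math

def feature_log_prob(class_counts, alpha=1.0):
    """Smoothed log P(word | class) for one class, as in multinomial naive Bayes."""
    total = sum(class_counts) + alpha * len(class_counts)
    return [math.log((c + alpha) / total) for c in class_counts]

def argmax(values):
    return max(range(len(values)), key=lambda j: values[j])

# Hypothetical word counts per class
words = ["you", "this", "space", "graphics"]
counts = {
    "sci.space":     [50, 40, 30, 1],
    "comp.graphics": [55, 45, 2, 25],
}
log_prob = {label: feature_log_prob(c) for label, c in counts.items()}

# Ranking by log P(word | class) puts the same common word first in both classes.
top_by_prob = {label: words[argmax(scores)] for label, scores in log_prob.items()}
print(top_by_prob)

# Ranking by the difference between the classes highlights class-specific words.
diff = [a - b for a, b in zip(log_prob["sci.space"], log_prob["comp.graphics"])]
most_specific = words[argmax(diff)]
print("most sci.space-specific word:", most_specific)
```

Comparing log probabilities across classes, rather than reading them per class, is one way to surface words such as `space` that the raw ranking buries under `you` and `this`.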

# Document Classification with Logistic Regression

**Classification with count vectors**

In [13]:
from sklearn.linear_model import LogisticRegression

# Fit logistic regression on the count vectors to compare with NB
LR_clf = LogisticRegression(max_iter=500)       # create the classifier
LR_clf.fit(X_train_cv, y_train)     # train the classifier on the train data

print('Train set score: {:.3f}'.format(LR_clf.score(X_train_cv, y_train)))  # accuracy on the train data
print('Test set score: {:.3f}'.format(LR_clf.score(X_test_cv, y_test)))     # accuracy on the test data

Train set score: 0.976
Test set score: 0.684


In [14]:
print('First document and label in test data:', X_test[0], y_test[0])
print('Second document and label in test data:', X_test[1], y_test[1])

pred = LR_clf.predict(X_test_cv[:2])

print('Predicted labels:', pred)
print('Predicted categories:', newsgroups_train.target_names[pred[0]], '/', newsgroups_train.target_names[pred[1]])

First document and label in test data: TRry the SKywatch project in  Arizona. 2
Second document and label in test data: The Vatican library recently made a tour of the US.
 Can anyone help me in finding a FTP site where this collection is 
 available. 1
Predicted labels: [2 1]
Predicted categories: sci.space / comp.graphics


In [15]:
top10_features(LR_clf, cv, newsgroups_train.target_names)

alt.atheism: bobby, religion, atheists, deletion, posting, atheism, motto, our, logic, either, post, define, example, iii, atheist 

comp.graphics: graphics, image, file, 3d, number, computer, looking, hi, files, six, ftp, using, ll, thank, site 

sci.space: space, orbit, name, nasa, spacecraft, launch, old, mars, sounds, research, solar, thought, right, down, wings 

talk.religion.misc: blood, christians, order, fbi, god, christian, story, black, objective, ignorance, city, beliefs, koresh, kent, christ 



**Classification with TF-IDF**

In [16]:
# Fit logistic regression on the TF-IDF vectors to compare with NB
LR_clf.fit(X_train_tfidf, y_train)

print('Train set score: {:.3f}'.format(LR_clf.score(X_train_tfidf, y_train)))
print('Test set score: {:.3f}'.format(LR_clf.score(X_test_tfidf, y_test)))

Train set score: 0.930
Test set score: 0.734


In [17]:
print('First document and label in test data:', X_test[0], y_test[0])
print('Second document and label in test data:', X_test[1], y_test[1])

pred = LR_clf.predict(X_test_tfidf[:2])    # predict with the TF-IDF features the classifier was retrained on

print('Predicted labels:', pred)
print('Predicted categories:', newsgroups_train.target_names[pred[0]], '/', newsgroups_train.target_names[pred[1]])

First document and label in test data: TRry the SKywatch project in  Arizona. 2
Second document and label in test data: The Vatican library recently made a tour of the US.
 Can anyone help me in finding a FTP site where this collection is 
 available. 1
Predicted labels: [2 1]
Predicted categories: sci.space / comp.graphics


In [18]:
top10_features(LR_clf, tfidf, newsgroups_train.target_names)

alt.atheism: atheism, religion, bobby, atheists, islam, islamic, deletion, atheist, motto, punishment, god, people, up, vice, sea 

comp.graphics: graphics, image, file, computer, 3d, files, hi, looking, points, code, using, format, video, screen, windows 

sci.space: space, nasa, orbit, launch, moon, spacecraft, shuttle, earth, lunar, flight, sci, solar, dc, cost, year 

talk.religion.misc: christian, christians, jesus, god, fbi, objective, order, his, he, blood, christ, children, koresh, kent, who 



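**Performance difference between `CountVectorizer` and `TfidfVectorizer`**<br>
Count vectors overfit more than TF-IDF (train 0.976 / test 0.684 versus train 0.930 / test 0.734). Both representations surface clearly category-specific words, which is a marked difference from the naive Bayes classifier.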
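
The overfitting comparison can be made explicit by computing the gap between train and test accuracy from the scores reported above. This is only arithmetic on the logged values; the dictionary layout is an assumption made for illustration.

```python
# Accuracy scores reported above for logistic regression
scores = {
    "count vectors": {"train": 0.976, "test": 0.684},
    "TF-IDF":        {"train": 0.930, "test": 0.734},
}

# A larger train-test gap suggests stronger overfitting
gaps = {name: round(s["train"] - s["test"], 3) for name, s in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.3f}")
```

The gap is 0.292 for count vectors and 0.196 for TF-IDF, which matches the observation that the count-vector model overfits more.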

# Document Classification with Ridge Regression

**Classification with count vectors**

In [19]:
from sklearn.linear_model import RidgeClassifier

ridge_clf = RidgeClassifier()   # create the ridge classifier
ridge_clf.fit(X_train_cv, y_train)   # train

print('Train set score: {:.3f}'.format(ridge_clf.score(X_train_cv, y_train)))
print('Test set score: {:.3f}'.format(ridge_clf.score(X_test_cv, y_test)))

Train set score: 0.966
Test set score: 0.498


In [20]:
top10_features(ridge_clf, cv, newsgroups_train.target_names)

alt.atheism: noted, policy, entire, element, verses, iii, stated, road, deletion, interpretations, tells, matthew, absolute, disagree, freedom 

comp.graphics: main, 01, tar, postings, requires, plotting, six, transfer, 32, plan, 33, magazine, spirit, apple, went 

sci.space: hundred, tool, behind, factor, demo, expected, fee, wall, george, maintain, 500, immoral, earlier, longer, runs 

talk.religion.misc: blood, weight, teachings, moment, 26, previous, decenso, writes, positions, black, creation, covered, smaller, ii, official 



**Classification with TF-IDF**

In [21]:
ridge_clf.fit(X_train_tfidf, y_train)

print('Train set score: {:.3f}'.format(ridge_clf.score(X_train_tfidf, y_train)))
print('Test set score: {:.3f}'.format(ridge_clf.score(X_test_tfidf, y_test)))

Train set score: 0.960
Test set score: 0.735


In [22]:
top10_features(ridge_clf, tfidf, newsgroups_train.target_names)

alt.atheism: bobby, religion, atheists, atheism, motto, punishment, satan, deletion, islamic, liar, atheist, islam, sea, tells, vice 

comp.graphics: graphics, computer, 3d, file, 42, hi, image, using, screen, looking, pov, card, code, points, sphere 

sci.space: space, orbit, spacecraft, sci, moon, funding, nasa, 23, engineering, nick, flight, sounds, name, money, launch 

talk.religion.misc: christian, blood, christians, fbi, order, objective, hudson, children, abortion, dead, jesus, christ, fire, creation, context 



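**Tuning alpha to improve performance**

As alpha decreases, the model approaches ordinary least squares: regularization weakens, model complexity rises, and overfitting becomes the risk. As alpha increases, the coefficients are shrunk more strongly toward zero, model complexity falls, and underfitting becomes the risk.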
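
The effect of alpha can be seen in the simplest possible case: one feature, no intercept, where the ridge solution has the closed form `sum(x*y) / (sum(x*x) + alpha)`. The sketch below is a simplification under those assumptions (`ridge_slope` and the toy data are hypothetical); `RidgeClassifier` itself also fits an intercept and works on many features.

```python
def ridge_slope(xs, ys, alpha):
    """Closed-form ridge solution for a single feature without an intercept."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]   # exactly y = 2x

alphas = (0.0, 1.0, 10.0, 100.0)
slopes = [ridge_slope(xs, ys, a) for a in alphas]
for alpha, slope in zip(alphas, slopes):
    print(f"alpha={alpha:>5}: slope={slope:.3f}")
```

With `alpha=0` the slope is the least-squares value of 2, and it shrinks steadily toward zero as alpha grows, so a large alpha gives a simpler and possibly underfitted model.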

In [23]:
from sklearn.model_selection import train_test_split

X_train_ridge, X_val_ridge, y_train_ridge, y_val_ridge = train_test_split(
    X_train_tfidf, y_train, test_size=0.2, random_state=42)     # split the training data again into training and validation sets

max_score = 0
max_alpha = 0

for alpha in np.arange(0.1, 10, 0.1):   # alpha from 0.1 up to (but not including) 10, in steps of 0.1
    ridge_clf = RidgeClassifier(alpha=alpha)    # create a ridge classifier with this alpha
    ridge_clf.fit(X_train_ridge, y_train_ridge)     # train
    score = ridge_clf.score(X_val_ridge, y_val_ridge)   # accuracy on the validation set

    if score > max_score:   # keep this alpha if its accuracy beats the best so far
        max_score = score
        max_alpha = alpha
print('Max alpha {:.3f} at max validation score {:.3f}'.format(max_alpha, max_score))

Max alpha 1.600 at max validation score 0.826
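
The search loop above can be factored into a small reusable helper. The sketch below is one way to do that, with two assumptions worth flagging: `select_alpha` is a hypothetical name, and `toy_validation_score` is a made-up stand-in for "fit on the training split, score on the validation split", not a result from this notebook.

```python
def select_alpha(alphas, score_fn):
    """Return (best_alpha, best_score); the first alpha wins when scores tie."""
    best_alpha, best_score = None, float("-inf")
    for alpha in alphas:
        score = score_fn(alpha)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha, best_score

# 0.1, 0.2, ..., 9.9: built from integers to avoid accumulated float error
alphas = [i / 10 for i in range(1, 100)]

# Hypothetical scoring function that peaks at alpha = 2.0
def toy_validation_score(alpha):
    return 1.0 - (alpha - 2.0) ** 2

best_alpha, best_score = select_alpha(alphas, toy_validation_score)
print(len(alphas), best_alpha, best_score)
```

Because the comparison is a strict `>`, the smallest alpha is kept when several candidates reach the same validation score, which is also how the loop in the cell above behaves.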


In [24]:
ridge_clf = RidgeClassifier(alpha=1.6)  # create the ridge classifier with the selected alpha
ridge_clf.fit(X_train_tfidf, y_train)   # train on the full training set

print('Train set score: {:.3f}'.format(ridge_clf.score(X_train_tfidf, y_train)))
print('Test set score: {:.3f}'.format(ridge_clf.score(X_test_tfidf, y_test)))

Train set score: 0.948
Test set score: 0.739


In [25]:
top10_features(ridge_clf, tfidf, newsgroups_train.target_names)

alt.atheism: bobby, religion, atheism, atheists, motto, punishment, islam, deletion, islamic, satan, atheist, sea, liar, vice, tells 

comp.graphics: graphics, computer, 3d, file, image, hi, 42, using, screen, looking, points, card, video, code, files 

sci.space: space, orbit, nasa, spacecraft, moon, sci, launch, flight, funding, idea, sounds, name, money, pat, engineering 

talk.religion.misc: christian, christians, fbi, blood, order, jesus, objective, children, christ, hudson, abortion, koresh, fire, dead, context 



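**Performance difference between `CountVectorizer` and `TfidfVectorizer`**<br>
Count vectors overfit far more severely than TF-IDF (test accuracy 0.498 versus 0.735), and TF-IDF surfaces category-specific words much better than count vectors do.<br>
Tuning alpha on the TF-IDF features gives a slightly better test score (0.739 versus 0.735); the top words become slightly more category-specific, and their ranking changes.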

# Document Classification with Lasso Regression

**Classification with count vectors**

In [26]:
# Lasso uses the same LogisticRegression class, selected through the penalty parameter

lasso_clf = LogisticRegression(penalty='l1', solver='liblinear', C=1)
lasso_clf.fit(X_train_cv, y_train)      # train on the train data

print('Train set score: {:.3f}'.format(lasso_clf.score(X_train_cv, y_train)))
print('Test set score: {:.3f}'.format(lasso_clf.score(X_test_cv, y_test)))

# Count the non-zero coefficients across all classes (4 classes x 2000 features)
print('Nonzero coefficients: {} out of {}'.format(np.sum(lasso_clf.coef_ != 0), lasso_clf.coef_.size))

Train set score: 0.943
Test set score: 0.701
Nonzero coefficients: 1580 out of 8000




In [27]:
top10_features(lasso_clf, cv, newsgroups_train.target_names)

alt.atheism: bobby, middle, risk, sea, motto, atheists, tells, punishment, vice, atheism, satan, our, policy, contradictory, islam 

comp.graphics: graphics, 3d, file, card, sphere, hi, pov, computer, looking, site, image, edges, video, number, 42 

sci.space: space, orbit, spacecraft, 23, solar, nick, launch, mars, wings, flight, engineering, aerospace, star, centaur, nasa 

talk.religion.misc: fbi, blood, authority, thou, christians, order, creation, moment, yourself, christ, sin, christian, serious, fire, writes 



**Classification with TF-IDF**

In [28]:
lasso_clf.fit(X_train_tfidf, y_train)

print('Train set score: {:.3f}'.format(lasso_clf.score(X_train_tfidf, y_train)))
print('Test set score: {:.3f}'.format(lasso_clf.score(X_test_tfidf, y_test)))

# Count the non-zero coefficients across all classes (4 classes x 2000 features)
print('Nonzero coefficients: {} out of {}'.format(np.sum(lasso_clf.coef_ != 0), lasso_clf.coef_.size))

Train set score: 0.819
Test set score: 0.724
Nonzero coefficients: 437 out of 8000


In [29]:
top10_features(lasso_clf, tfidf, newsgroups_train.target_names)

alt.atheism: bobby, atheism, atheists, islam, religion, islamic, motto, atheist, satan, vice, punishment, sea, makes, must, isn 

comp.graphics: graphics, image, 3d, file, computer, hi, video, files, looking, sphere, ftp, screen, points, using, pov 

sci.space: space, orbit, launch, nasa, spacecraft, flight, moon, dc, shuttle, solar, earth, year, lunar, cost, safety 

talk.religion.misc: fbi, christian, christians, christ, order, jesus, children, objective, context, blood, his, kent, hudson, thou, who 



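**Performance difference between `CountVectorizer` and `TfidfVectorizer`**<br>
Count vectors overfit more than TF-IDF (train 0.943 / test 0.701 versus train 0.819 / test 0.724). TF-IDF surfaces category-specific words much better than count vectors do. The number of non-zero coefficients also differs considerably: 1580 with count vectors versus 437 with TF-IDF.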
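
Why an L1 penalty leaves so many coefficients at exactly zero can be sketched with the one-coefficient shrinkage rules that hold for penalized least squares with an orthonormal design. That setting is a textbook simplification, not what the `liblinear` solver computes; `l1_shrink`, `l2_shrink`, and the strength `lam` are hypothetical names.

```python
def l1_shrink(w, lam):
    """Soft-thresholding: the L1 penalty sets small weights exactly to zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_shrink(w, lam):
    """The L2 penalty scales weights toward zero without reaching it."""
    return w / (1.0 + lam)

weights = [2.0, 0.4, -0.3, -1.5]   # hypothetical unpenalized weights
lam = 0.5
l1 = [l1_shrink(w, lam) for w in weights]
l2 = [l2_shrink(w, lam) for w in weights]

print("L1:", l1)
print("L2:", [round(w, 3) for w in l2])
print("non-zero after L1:", sum(w != 0 for w in l1), "/ after L2:", sum(w != 0 for w in l2))
```

Here L1 zeroes out the two small weights while L2 keeps all four. In `LogisticRegression`, `C` is the inverse of the regularization strength, so a smaller `C` would be expected to leave fewer non-zero coefficients.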

# Other Document Classification Methods, Including Decision Trees

In [30]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

# Decision tree
tree = DecisionTreeClassifier(random_state=7)
tree.fit(X_train_tfidf, y_train)
print('Decision Tree train set score: {:.3f}'.format(tree.score(X_train_tfidf, y_train)))
print('Decision Tree test set score: {:.3f}\n'.format(tree.score(X_test_tfidf, y_test)))

# Random forest (an ensemble of decision trees)
forest = RandomForestClassifier(random_state=7)
forest.fit(X_train_tfidf, y_train)
print('Random Forest train set score: {:.3f}'.format(forest.score(X_train_tfidf, y_train)))
print('Random Forest test set score: {:.3f}\n'.format(forest.score(X_test_tfidf, y_test)))

# Gradient boosting
gb = GradientBoostingClassifier(random_state=7)
gb.fit(X_train_tfidf, y_train)
print('Gradient Boosting train set score: {:.3f}'.format(gb.score(X_train_tfidf, y_train)))
print('Gradient Boosting test set score: {:.3f}\n'.format(gb.score(X_test_tfidf, y_test)))

Decision Tree train set score: 0.977
Decision Tree test set score: 0.536

Random Forest train set score: 0.977
Random Forest test set score: 0.685

Gradient Boosting train set score: 0.933
Gradient Boosting test set score: 0.696



In [31]:
sorted_feature_importances = sorted(zip(tfidf.get_feature_names_out(), gb.feature_importances_), key=lambda x: x[1], reverse=True)
for feature, value in sorted_feature_importances[:40]:
    print('%s: %.3f' % (feature, value), end=', ')

space: 0.126, graphics: 0.080, atheism: 0.024, thanks: 0.023, file: 0.021, orbit: 0.020, jesus: 0.018, god: 0.018, hi: 0.017, nasa: 0.015, image: 0.015, files: 0.014, christ: 0.010, moon: 0.010, bobby: 0.010, launch: 0.010, looking: 0.010, christian: 0.010, atheists: 0.009, christians: 0.009, fbi: 0.009, 3d: 0.008, you: 0.008, not: 0.008, islamic: 0.007, religion: 0.007, spacecraft: 0.007, flight: 0.007, computer: 0.007, islam: 0.007, ftp: 0.006, color: 0.006, software: 0.005, atheist: 0.005, card: 0.005, people: 0.005, koresh: 0.005, his: 0.005, kent: 0.004, sphere: 0.004, 

# Ways to Improve Performance

**Defining a tokenizer with a regular expression, stopword removal, and stemming**

In [None]:
import nltk
nltk.download('stopwords')

In [35]:
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
import re

In [36]:
cachedStopWords = stopwords.words("english")
RegTok = RegexpTokenizer(r"[\w']{3,}")   # define the tokenizer with a regular expression (raw string avoids an invalid-escape warning)
english_stops = set(stopwords.words('english'))     # load the English stopwords

In [37]:
stemmer = PorterStemmer()   # create the stemmer once and reuse it for every token

def tokenizer(text):
    tokens = RegTok.tokenize(text.lower())
    # drop stopwords and tokens shorter than three characters
    words = [word for word in tokens if (word not in english_stops) and len(word) > 2]
    # apply the Porter stemmer
    features = [stemmer.stem(word) for word in words]
    return features
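
To see what the regular expression and the stopword filter do without installing NLTK, here is a standard-library sketch. It reuses the same pattern but swaps in a tiny hypothetical stopword set (`TOY_STOPS`) for NLTK's list and skips stemming entirely, so its output will differ from the real `tokenizer`.

```python
import re

TOKEN_RE = re.compile(r"[\w']{3,}")          # same pattern as the RegexpTokenizer above
TOY_STOPS = {"that", "you", "only", "the"}   # hypothetical stand-in for NLTK's stopword list

def simple_tokenizer(text):
    """Lowercase, keep tokens of 3+ word characters, drop stopwords (no stemming)."""
    tokens = TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if tok not in TOY_STOPS]

tokens = simple_tokenizer("I've noticed that if you only save a model")
print(tokens)
```

The pattern already discards one- and two-character tokens such as `if` and `a`, so the later `len(word) > 2` check in `tokenizer` acts as a safeguard rather than doing additional filtering.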

**Building TF-IDF vectors with the new tokenizer**

In [45]:
tfidf = TfidfVectorizer(tokenizer=tokenizer)    # use the newly defined tokenizer
X_train_tfidf = tfidf.fit_transform(X_train)    # fit on the train set and transform it
X_test_tfidf = tfidf.transform(X_test)          # transform the test set

print('Train set dimension:', X_train_tfidf.shape)
print('Test set dimension:', X_test_tfidf.shape)

Train set dimension: (2034, 20085)
Test set dimension: (1353, 20085)


**Lasso regression**

In [60]:
Lasso_clf = LogisticRegression(penalty='l1', solver='liblinear', C=1)       # create the classifier
Lasso_clf.fit(X_train_tfidf, y_train)       # train the classifier on the train data

print('Train set score: {:.3f}'.format(Lasso_clf.score(X_train_tfidf, y_train)))    # accuracy on the train data
print('Test set score: {:.3f}'.format(Lasso_clf.score(X_test_tfidf, y_test)))   # accuracy on the test data

Train set score: 0.790
Test set score: 0.718


**Logistic regression**

In [46]:
LR_clf = LogisticRegression()
LR_clf.fit(X_train_tfidf, y_train)

print('Train set score: {:.3f}'.format(LR_clf.score(X_train_tfidf, y_train)))
print('Test set score: {:.3f}'.format(LR_clf.score(X_test_tfidf, y_test)))
len(LR_clf.coef_[0])

Train set score: 0.962
Test set score: 0.761


20085

**Ridge regression**

In [58]:
Ridge_clf = RidgeClassifier(alpha=2.5)
Ridge_clf.fit(X_train_tfidf, y_train)

print('Train set score: {:.3f}'.format(Ridge_clf.score(X_train_tfidf, y_train)))
print('Test set score: {:.3f}'.format(Ridge_clf.score(X_test_tfidf, y_test)))

Train set score: 0.967
Test set score: 0.769


**Naive Bayes classification**

In [48]:
NB_clf = MultinomialNB(alpha=0.01)
NB_clf.fit(X_train_tfidf, y_train)

print('Train set score: {:.3f}'.format(NB_clf.score(X_train_tfidf, y_train)))
print('Test set score: {:.3f}'.format(NB_clf.score(X_test_tfidf, y_test)))

Train set score: 0.971
Test set score: 0.793


# Problems with Count-Based Representations and N-grams