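# Problem Definition
- Use a movie review dataset to classify reviews as positive or negative
- Identify the words that appear most often in positive and negative reviews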

# Data Collection
- Download the Large Movie Review Dataset

In [1]:
# load_files reads text files from a directory, using each subfolder name as the class label
from sklearn.datasets import load_files

In [2]:
train_data_url = 'data/aclimdb/train'
test_data_url = 'data/aclimdb/test'

In [3]:
reviews_train = load_files(train_data_url , shuffle = True)
reviews_test = load_files(test_data_url , shuffle = True)
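
`load_files` expects one subfolder per class and assigns integer labels in the alphabetical order of the folder names. The sketch below builds a throwaway directory to show the layout it assumes; the folder contents and the `toy` name are invented for illustration and are not part of the real dataset.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Build a tiny stand-in for data/aclimdb/train (texts are made up).
with tempfile.TemporaryDirectory() as root:
    samples = {
        "neg": ["dull and slow", "a waste of time"],
        "pos": ["a remarkable film", "honest and moving"],
    }
    for label, texts in samples.items():
        folder = Path(root) / label
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"{i}.txt").write_text(text, encoding="utf-8")

    # File contents are read into memory here, before the directory is removed.
    toy = load_files(root, shuffle=True, random_state=0)

print(toy.target_names)    # folder names in sorted order, so neg -> 0 and pos -> 1
print(len(toy.data))       # one entry per file
print(type(toy.data[0]))   # contents are returned as bytes
```

Every subfolder becomes a class, so a directory with additional subfolders would produce additional labels.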

In [4]:
reviews_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [5]:
len(reviews_train.data)

25000

In [6]:
reviews_train.data[0]

b"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. <br /><br />Flawed but honest with a terrible honesty."

In [7]:
# 0 = negative, 1 = positive
reviews_train.target

array([1, 0, 1, ..., 0, 0, 0])
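
One quick way to see how many reviews fall into each class is to count the labels. This is a minimal sketch on a hypothetical label array; in the notebook the same call could be applied to `reviews_train.target`.

```python
import numpy as np

# Hypothetical label array standing in for reviews_train.target.
target = np.array([1, 0, 1, 1, 0, 0])

# counts[i] is the number of samples whose label is i.
counts = np.bincount(target)
print(counts)
```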

# Data Preprocessing

In [8]:
reviews_train.data[0]

b"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.<br /><br />It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see. <br /><br />Flawed but honest with a terrible honesty."

## Tag Removal
- Replace each `<br />` tag with a space

In [9]:
# The reviews are bytes objects, so both arguments to replace() must be bytes as well
reviews_train.data[0].replace(b'<br />' , b' ')

b"Zero Day leads you to think, even re-think why two boys/young men would do what they did - commit mutual suicide via slaughtering their classmates. It captures what must be beyond a bizarre mode of being for two humans who have decided to withdraw from common civility in order to define their own/mutual world via coupled destruction.  It is not a perfect movie but given what money/time the filmmaker and actors had - it is a remarkable product. In terms of explaining the motives and actions of the two young suicide/murderers it is better than 'Elephant' - in terms of being a film that gets under our 'rationalistic' skin it is a far, far better film than almost anything you are likely to see.   Flawed but honest with a terrible honesty."
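
The exact-match replace above handles the `<br />` form seen in this review. If other spellings of the tag might occur, a regular expression is one possible alternative. The pattern and the `strip_br` helper below are illustrative assumptions, not part of the original notebook.

```python
import re

# Matches <br>, <br/>, <br /> and upper-case variants (assumed pattern).
BR_TAG = re.compile(rb"<br\s*/?>", flags=re.IGNORECASE)

def strip_br(review: bytes) -> bytes:
    """Replace every line-break tag with a single space."""
    return BR_TAG.sub(b" ", review)

sample = b"Good start.<br /><br />Weak ending.<BR>Still worth it."
print(strip_br(sample))
```

Two consecutive tags leave two spaces behind, just as in the output above. This is harmless for the tokenization step, which splits on non-word characters.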

In [10]:
# Keep only the first 1,000 reviews of each split
X_train = reviews_train.data[:1000]
X_test = reviews_test.data[:1000]

In [11]:
# Labels for the same 1,000 reviews
y_train = reviews_train.target[:1000]
y_test = reviews_test.target[:1000]

In [12]:
# A list comprehension collects each result into a new list automatically.
# It is the one-line form of the loop below:
# cleaned = []
# for txt in X_train:
#     cleaned.append(txt.replace(b"<br />", b" "))
# X_train = cleaned
X_train = [ txt.replace(b"<br />", b" ")  for txt in X_train ]
X_test = [ txt.replace(b"<br />", b" ")  for txt in X_test ]

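## Tokenization
- BOW (bag of words): splits each text into word tokens and represents it as a vector of word counts, ignoring word order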
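
To make the idea concrete, here is a hand-rolled sketch of bag of words using only the standard library. The toy documents and the `to_counts` helper are made up for illustration, and the whitespace split is a simplification of what a real tokenizer does.

```python
from collections import Counter

docs = ["good film good cast", "bad film"]  # made-up toy documents

# Vocabulary: every distinct word, sorted, mapped to a column index.
all_words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: index for index, word in enumerate(all_words)}

def to_counts(doc):
    """Return one count per vocabulary word, in column order."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

vectors = [to_counts(doc) for doc in docs]
print(vocab)
print(vectors)
```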

### Example

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
testBOW = CountVectorizer()

In [15]:
text = [
    '혹시 자소서 포함된 이력서 , 자소서 양식을 미리 받아볼수 있나요?',
    '다음주월욜에 하는게 자소서가 좀 적혀있어야 한다는거죠..?',
    '강의실 하나에서 다 모여서 진행하는건가요?',
    '혜정이가 카드 스티커 모으는 거 좋아해요'        
]

In [16]:
# Build the vocabulary (tokenization)
testBOW.fit(text)

CountVectorizer()

In [17]:
# Vocabulary: maps each token to its column index
testBOW.vocabulary_

{'혹시': 21,
 '자소서': 10,
 '포함된': 16,
 '이력서': 8,
 '양식을': 7,
 '미리': 4,
 '받아볼수': 5,
 '있나요': 9,
 '다음주월욜에': 1,
 '하는게': 18,
 '자소서가': 11,
 '적혀있어야': 12,
 '한다는거죠': 19,
 '강의실': 0,
 '하나에서': 17,
 '모여서': 2,
 '진행하는건가요': 14,
 '혜정이가': 20,
 '카드': 15,
 '스티커': 6,
 '모으는': 3,
 '좋아해요': 13}
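
The vocabulary has 22 entries, and single-character words such as '좀', '다', and '거' are missing from it. That follows from the default `token_pattern` of `CountVectorizer`, which only keeps runs of two or more word characters. The sketch below applies that same pattern directly with `re` to one of the sentences; indices in `vocabulary_` are then assigned in sorted token order.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

sentence = '강의실 하나에서 다 모여서 진행하는건가요?'
tokens = re.findall(TOKEN_PATTERN, sentence)
print(tokens)  # the single-character word '다' is not matched
```

Note also that '자소서' and '자소서가' are stored as separate tokens, because the vectorizer does no morphological analysis.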

In [18]:
# Convert the texts to count vectors (vectorization)
testBOW.transform(text)

<4x22 sparse matrix of type '<class 'numpy.int64'>'
	with 22 stored elements in Compressed Sparse Row format>

In [19]:
testBOW.transform(text).toarray()

array([[0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0],
       [1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0]],
      dtype=int64)

### Applying It to the Real Data

In [21]:
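movie_bow = CountVectorizer()
# Build the vocabulary from the training reviews only
movie_bow.fit(X_train)
# Encode both splits with that same vocabulary
X_train = movie_bow.transform(X_train)
X_test = movie_bow.transform(X_test)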

In [22]:
X_train.shape

(1000, 17994)
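
Because the vocabulary is fitted on the training reviews only, the test matrix gets the same number of columns, and any word that appears only in the test reviews is dropped. The sketch below shows this on made-up byte strings; the documents and variable names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = [b"a remarkable film", b"a dull film"]  # toy stand-ins
test_docs = [b"a remarkable soundtrack"]             # 'soundtrack' is unseen

bow = CountVectorizer()
train_matrix = bow.fit_transform(train_docs)
test_matrix = bow.transform(test_docs)

print(sorted(bow.vocabulary_))             # words learned from the training docs
print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray())               # no column exists for 'soundtrack'
```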