# Data loading
- __label__1 : reviews rated 1 or 2 stars
- __label__2 : reviews rated 4 or 5 stars

In [29]:
import numpy as np
import pandas as pd

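# sep='\r' is a delimiter that should not occur inside a review, so each line stays in a single column
train_origin = pd.read_csv('train.ft.txt', sep='\r', encoding='utf-8', header=None, skiprows=0)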
train_origin.head()

                                                   0
0  __label__2 Stuning even for the non-gamer: Thi...
1  __label__2 The best soundtrack ever to anythin...
2  __label__2 Amazing!: This soundtrack is my fav...
3  __label__2 Excellent Soundtrack: I truly like ...
4  __label__2 Remember, Pull Your Jaw Off The Flo...


In [30]:
test_origin = pd.read_csv('test.ft.txt', sep='\r', encoding='utf-8', header=None, skiprows=0)
test_origin.head()

                                                   0
0  __label__2 Great CD: My lovely Pat has one of ...
1  __label__2 One of the best game music soundtra...
2  __label__1 Batteries died within a year ...: I...
3  __label__2 works fine, but Maha Energy is bett...
4  __label__2 Great for the non-audiophile: Revie...
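
The `sep='\r'` argument in the cells above serves to keep each line in one column. As a minimal sketch of an alternative, assuming every review occupies exactly one line, the file can be read with plain Python and wrapped in a DataFrame. The helper name `read_fasttext_lines` and the two-line demo file are hypothetical and only illustrate the idea.

```python
import os
import tempfile

import pandas as pd


def read_fasttext_lines(path, encoding="utf-8"):
    """Load a fastText-style file as a one-column DataFrame, one row per line."""
    with open(path, encoding=encoding) as f:
        # Strip only the trailing newline; the review text itself stays untouched.
        lines = [line.rstrip("\n") for line in f]
    return pd.DataFrame({0: lines})


# Tiny stand-in for train.ft.txt (hypothetical content, for illustration only).
demo_lines = [
    "__label__2 Great CD: commas, quotes \" and tabs\tare kept as-is",
    "__label__1 Batteries died within a year",
]
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ft.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("\n".join(demo_lines) + "\n")
    demo = read_fasttext_lines(demo_path)

print(demo.shape)
```

On the real data this would be called as `read_fasttext_lines('train.ft.txt')`; the resulting column is named `0`, so later cells that access `train[0]` would work unchanged.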


In [31]:
train = train_origin.copy()
train.info() # 3,600,000 rows in total

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3600000 entries, 0 to 3599999
Data columns (total 1 columns):
 #   Column  Dtype 
---  ------  ----- 
 0   0       object
dtypes: object(1)
memory usage: 27.5+ MB


In [32]:
test = test_origin.copy()
test.info() # 400,000 rows in total

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400000 entries, 0 to 399999
Data columns (total 1 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   0       400000 non-null  object
dtypes: object(1)
memory usage: 3.1+ MB


# Data preprocessing

## Separating the rating label from the text

In [33]:
pattern = r'^__label__(\d+)\s*(.*)$'

train[['label', 'text']] = train[0].str.extract(pattern)
train['label'] = pd.to_numeric(train['label'])
train = train[['label', 'text']]
train[:10]

   label                                               text
0      2  Stuning even for the non-gamer: This sound tra...
1      2  The best soundtrack ever to anything.: I'm rea...
2      2  Amazing!: This soundtrack is my favorite music...
3      2  Excellent Soundtrack: I truly like this soundt...
4      2  Remember, Pull Your Jaw Off The Floor After He...
5      2  an absolute masterpiece: I am quite sure any o...
6      1  Buyer beware: This is a self-published book, a...
7      2  Glorious story: I loved Whisper of the wicked ...
8      2  A FIVE STAR BOOK: I just finished reading Whis...
9      2  Whispers of the Wicked Saints: This was a easy...


In [34]:
pattern = r'^__label__(\d+)\s*(.*)$'

test[['label', 'text']] = test[0].str.extract(pattern)
test['label'] = pd.to_numeric(test['label'])
test = test[['label', 'text']]
test[:10]

   label                                               text
0      2  Great CD: My lovely Pat has one of the GREAT v...
1      2  One of the best game music soundtracks - for a...
2      1  Batteries died within a year ...: I bought thi...
3      2  works fine, but Maha Energy is better: Check o...
4      2  Great for the non-audiophile: Reviewed quite a...
5      1  DVD Player crapped out after one year: I also ...
6      1  Incorrect Disc: I love the style of this, but ...
7      1  DVD menu select problems: I cannot scroll thro...
8      2  Unique Weird Orientalia from the 1930's: Exoti...
9      1  Not an "ultimate guide": Firstly,I enjoyed the...
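
To make the behaviour of the extraction pattern concrete, here is a small sketch on hypothetical sample strings. It also shows one thing worth checking on the full data: a line that does not match the pattern yields `NaN` in both captured columns, so counting missing labels reveals malformed lines.

```python
import pandas as pd

pattern = r'^__label__(\d+)\s*(.*)$'

# Hypothetical sample rows; the last one deliberately lacks a label prefix.
raw = pd.Series([
    "__label__2 Great CD: My lovely Pat ...",
    "__label__1 Batteries died within a year ...",
    "a line without the label prefix",
])

# Group 1 captures the digits after __label__, group 2 the rest of the line.
parsed = raw.str.extract(pattern)
parsed.columns = ["label", "text"]

# Rows that did not match the pattern come back as NaN in every captured column.
unmatched = parsed["label"].isna()
print(parsed)
print("unmatched rows:", int(unmatched.sum()))
```

If any row is unmatched, `pd.to_numeric` would turn the label column into floats, so a check like this is best run before the conversion.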


## Splitting positive and negative reviews

In [35]:
train_positive = train[train['label']==2]
train_negative = train[train['label']==1]
print(train_positive[-5:])
print(train_negative[-5:])

         label                                               text
3599989      2  Amazing CD: Tyler Hitlon's CD is awesome! If y...
3599990      2  Buy this CD and you'll thank yourself!: Tyler ...
3599991      2  Tyler Rocks: there is only one word to describ...
3599992      2  AWESOME: Absolutely amazing so relieving of my...
3599999      2  Makes My Blood Run Red-White-And-Blue: I agree...
         label                                               text
3599994      1  Too simplistic: While Mr. Harrison makes some ...
3599995      1  Don't do it!!: The high chair looks great when...
3599996      1  Looks nice, low functionality: I have used thi...
3599997      1  compact, but hard to clean: We have a small ho...
3599998      1  what is it saying?: not sure what this book is...


In [36]:
test_positive = test[test['label']==2]
test_negative = test[test['label']==1]
print(test_positive[-5:])
print(test_negative[-5:])

        label                                               text
399986      2  Extremely Useful for Me and Others: As a teach...
399987      2  I like this keyboard a lot!: My keyboard at wo...
399990      2  I really love Puff Daddy and R. Kelly's songs....
399998      2  Classic Jessica Mitford: This is a compilation...
        label                                               text
399993      1  CRAP: this is not music, no matter what anyone...
399995      1  Unbelievable- In a Bad Way: We bought this Tho...
399996      1  Almost Great, Until it Broke...: My son reciev...
399997      1  Disappointed !!!: I bought this toy for my son...
399999      1  Comedy Scene, and Not Heard: This DVD will be ...
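
A quick sanity check for this split, sketched on a hypothetical toy frame: the two boolean masks should partition the data, which holds only if every label is either 1 or 2. `value_counts()` gives the class sizes at the same time.

```python
import pandas as pd

# Hypothetical toy data standing in for the train DataFrame.
toy = pd.DataFrame({"label": [2, 2, 1, 2, 1], "text": list("abcde")})

positive = toy[toy["label"] == 2]
negative = toy[toy["label"] == 1]

# Class sizes, indexed by the label value.
counts = toy["label"].value_counts()
print(counts)

# Every row should land in exactly one of the two subsets.
is_partition = len(positive) + len(negative) == len(toy)
print("split covers all rows:", is_partition)
```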


## Relabeling (1, 2 -> 0, 1)

In [37]:
# assign() returns a new DataFrame, so the filtered slices are not modified in place
train_positive = train_positive.assign(label=1)
train_negative = train_negative.assign(label=0)
print(train_positive[-5:])
print(train_negative[-5:])

         label                                               text
3599989      1  Amazing CD: Tyler Hitlon's CD is awesome! If y...
3599990      1  Buy this CD and you'll thank yourself!: Tyler ...
3599991      1  Tyler Rocks: there is only one word to describ...
3599992      1  AWESOME: Absolutely amazing so relieving of my...
3599999      1  Makes My Blood Run Red-White-And-Blue: I agree...
         label                                               text
3599994      0  Too simplistic: While Mr. Harrison makes some ...
3599995      0  Don't do it!!: The high chair looks great when...
3599996      0  Looks nice, low functionality: I have used thi...
3599997      0  compact, but hard to clean: We have a small ho...
3599998      0  what is it saying?: not sure what this book is...


In [38]:
# assign() returns a new DataFrame, so the filtered slices are not modified in place
test_positive = test_positive.assign(label=1)
test_negative = test_negative.assign(label=0)
print(test_positive[-5:])
print(test_negative[-5:])

        label                                               text
399986      1  Extremely Useful for Me and Others: As a teach...
399987      1  I like this keyboard a lot!: My keyboard at wo...
399990      1  I really love Puff Daddy and R. Kelly's songs....
399998      1  Classic Jessica Mitford: This is a compilation...
        label                                               text
399993      0  CRAP: this is not music, no matter what anyone...
399995      0  Unbelievable- In a Bad Way: We bought this Tho...
399996      0  Almost Great, Until it Broke...: My son reciev...
399997      0  Disappointed !!!: I bought this toy for my son...
399999      0  Comedy Scene, and Not Heard: This DVD will be ...
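
As an alternative sketch, the relabeling can be expressed as a single mapping on the label column, applied before the frame is split. This is not how the notebook does it, but it sidesteps the `SettingWithCopyWarning` that pandas can raise when a value is assigned to a filtered slice. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with the original fastText labels (1 = negative, 2 = positive).
toy = pd.DataFrame({"label": [2, 1, 2], "text": ["good", "bad", "fine"]})

# Map the original labels to binary sentiment: 1 -> 0 (negative), 2 -> 1 (positive).
relabeled = toy.assign(label=toy["label"].map({1: 0, 2: 1}))
print(relabeled)
```

Because `assign` returns a new DataFrame, `toy` keeps its original labels and `relabeled` can be filtered into positive and negative subsets afterwards.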


## Random sampling toward 25,000 reviews per class

In [39]:
# Positive train set : 10,000 bidet reviews + 15,000 open-data reviews
# Negative train set : 1,000 bidet reviews + 24,000 open-data reviews
train_pos_random = train_positive.sample(n=15000, random_state=2501)
train_neg_random = train_negative.sample(n=24000, random_state=2501)
print(train_pos_random[-5:])
print(train_neg_random[-5:])

         label                                               text
23770        1  Great relief from pain!: This item is super fa...
1732696      1  Great Basics for Home Groomer: This is the bes...
2746838      1  GREAT PRODUCT: I had a miss fire code indicati...
2021382      1  Christian Wunderlich and Kristin Hall=impeccab...
1026195      1  It Is Just Simply Great: This is a great produ...
         label                                               text
1467622      0  Unwatchable: I recorded this from TV, and trie...
2233041      0  Buy a stapler instead: There's a trick with a ...
904458       0  not so good: I guess it works,one problem the ...
1484738      0  Not High Polish: I returned this item because ...
2186239      0  Holes after first wash: The sheets are sub-par...


In [40]:
# Positive test set : 11,267 bidet reviews + 13,733 open-data reviews
# Negative test set : 1,828 bidet reviews + 23,172 open-data reviews
test_pos_random = test_positive.sample(n=13733, random_state=2501)
test_neg_random = test_negative.sample(n=23172, random_state=2501)
print(test_pos_random[-5:])
print(test_neg_random[-5:])

        label                                               text
383349      1  Twista is back: I do have to agree that Twista...
328407      1  Steven's 2 cents: I use these gloves for a wat...
219482      1  Excellent continuation of the Sonja Blue saga:...
275751      1  Tad does it again.: Not much to say here. If y...
192895      1  Skid Row Laureate: If you're tired of reading ...
        label                                               text
149886      0  Maybe I am missing something...: Perhaps I am ...
288117      0  Not for everyone: I really wanted to like this...
150189      0  Really STICKY cookware: Do not buy! This is th...
205816      0  Awful Mount: I have ridden with this for about...
305190      0  Avoid at all costs!!!: I have a Tripp Lite Sma...
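
The sample sizes in the two cells above follow from simple arithmetic: each class is topped up with open data until it reaches 25,000 reviews. The sketch below reproduces those numbers and wraps the sampling call in a small helper. The 25,000 target is inferred from the section heading and the comments; `TARGET_PER_CLASS`, `bidet_counts`, and `sample_class` are hypothetical names, and the toy frame is made-up data.

```python
import pandas as pd

TARGET_PER_CLASS = 25_000  # assumption: inferred from the heading and the cell comments

# Bidet-review counts taken from the comments in the sampling cells.
bidet_counts = {
    ("train", "positive"): 10_000,
    ("train", "negative"): 1_000,
    ("test", "positive"): 11_267,
    ("test", "negative"): 1_828,
}

# Open-data quota = target minus what the bidet data already provides.
open_quota = {key: TARGET_PER_CLASS - n for key, n in bidet_counts.items()}
print(open_quota)


def sample_class(df, label, n, seed=2501):
    """Draw n rows of one class without replacement, reproducibly."""
    return df[df["label"] == label].sample(n=n, random_state=seed)


# Toy demonstration with small numbers (hypothetical data).
toy = pd.DataFrame({
    "label": [1] * 6 + [0] * 4,
    "text": [f"review {i}" for i in range(10)],
})
toy_pos = sample_class(toy, label=1, n=3)
toy_neg = sample_class(toy, label=0, n=2)
print(len(toy_pos), len(toy_neg))
```

Fixing `random_state` makes the draw repeatable, which matters here because the sampled rows are later written to files that other steps depend on.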


## Merging positive and negative reviews

In [41]:
# Concatenate both classes, then shuffle the rows
train_random = pd.concat([train_pos_random, train_neg_random], ignore_index=True).sample(frac=1, random_state=2501)
train_random.head()

       label                                               text
 7238      1  Awesome Movie: The Cheetah Girls was a great m...
31289      0  Not what I expect from Dean Koontz: I wish Dea...
19468      0  Not loud enough for use in a crowd.: This prod...
28751      0  Not Again: I used to be a huge Hanson fan, so ...
10841      1  What a Blessing {even as an adult): So, I admi...


In [42]:
test_random = pd.concat([test_pos_random, test_neg_random], ignore_index=True).sample(frac=1, random_state=2501)
test_random.head()

       label                                               text
10979      1  Scary: Great show, but last season was much mo...
20643      0  It will break!: The flash flood certainly has ...
29179      0  Disappointed: I couldn't wait to get these can...
23958      0  Does not ring true!: Mineko Iwasaki paints her...
24445      0  ABSOLUTELY HORRIBLE - Don't Buy It: I ordered ...
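
The index values shown after shuffling (7238, 31289, and so on) are the row positions from before the shuffle, because `sample(frac=1)` reorders rows without renumbering them. A minimal sketch on hypothetical toy data shows the merge, the shuffle, and an optional `reset_index(drop=True)` for a clean index.

```python
import pandas as pd

# Hypothetical toy data standing in for the sampled positive and negative frames.
pos = pd.DataFrame({"label": [1, 1, 1], "text": ["p0", "p1", "p2"]})
neg = pd.DataFrame({"label": [0, 0], "text": ["n0", "n1"]})

# ignore_index=True renumbers the concatenated rows 0..n-1.
merged = pd.concat([pos, neg], ignore_index=True)

# frac=1 returns every row exactly once, in random order.
shuffled = merged.sample(frac=1, random_state=2501)

# The index still carries each row's pre-shuffle position; reset it for a clean 0..n-1 index.
shuffled_clean = shuffled.reset_index(drop=True)
print(shuffled_clean)
```

Since the files are later written with `index=False`, the reset is optional here; it mainly helps when the frame is inspected or indexed by position afterwards.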


In [43]:
train_random['text'].duplicated().sum()

0

In [44]:
test_random['text'].duplicated().sum()

0
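
Both checks return 0 here, so nothing needs to be removed. For a different random seed or sample size the count could be non-zero; the sketch below, on hypothetical toy data, shows one way to drop repeated texts while keeping the first occurrence.

```python
import pandas as pd

# Hypothetical toy data containing one repeated review text.
toy = pd.DataFrame({
    "label": [1, 0, 1, 0],
    "text": ["same", "other", "same", "third"],
})

# duplicated() flags every occurrence after the first.
n_dupes = int(toy["text"].duplicated().sum())

# Keep the first occurrence of each text and renumber the rows.
deduped = toy.drop_duplicates(subset="text", keep="first").reset_index(drop=True)
print(n_dupes, len(deduped))
```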

# Saving files

In [45]:
train_random.to_excel('open_train_data.xlsx', index=False)

In [46]:
test_random.to_excel('open_test_data.xlsx', index=False)
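
`to_excel` relies on an Excel writer engine such as openpyxl being installed. Where that is not available, CSV is a dependency-free alternative. The sketch below is not part of the original workflow: it writes a hypothetical toy frame to a temporary CSV file and reads it back to confirm that commas and quotes in the review text survive the round trip. The file name `open_train_data.csv` is an assumption chosen to mirror the Excel file names.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data with characters that need CSV quoting.
toy = pd.DataFrame({
    "label": [1, 0],
    "text": ['Has, commas and "quotes"', "Plain review text"],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "open_train_data.csv")  # hypothetical file name
    toy.to_csv(csv_path, index=False, encoding="utf-8")
    restored = pd.read_csv(csv_path, encoding="utf-8")

# Compare values rather than dtypes, which can differ between pandas versions.
round_trip_ok = (
    restored["label"].tolist() == toy["label"].tolist()
    and restored["text"].tolist() == toy["text"].tolist()
)
print("round trip ok:", round_trip_ok)
```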