핵심 키워드

- Categorization
- One-hot Encoding
- Bag of words
- Cosine Distance
- Corpus

- List Comprehension

##### News Categorization

컴퓨터는 문자를 그대로 이해하지 못함. 

> 문자 => 숫자

숫자로 유사하다는 어떻게 표현할까?

> 유사하다 = 가깝다.

Result => 문자 -> 숫자 -> Vector

문자를 Vector로 - One-hot Encoding
- 하나의 단어를 Vector의 index로 인식, 단어 존재시 1 없으면 0

#### Bag og words
- 단어별로 인덱스를 부여해서, 한 문장(또는 문서)의 단어의 개수를 Vector로 표현

corpus : 모든 단어가 들어있는 Set

### 그렇다면 유사성은?

좌표에서는 피타고라스 = Euclidian distance = 두 점 사이의 직선의 거리

두 점사이의 각도  = Cosine distance = Why cosine similarity? **Count < Direction**
- Love, hate 
(5,0)
(5,1)
(4,0)

어느 점이 가장 가까운가?

#### Process

- 파일을 불러오기
- 파일을 읽어서 단어사전(corpus)만들기
- 단어별로 Index 만들기
- 만들어진 인덱스로 문서별로 Bag of words vector 생성
- 비교하고자 하는 문서 비교하기
- 얼마나 맞는지 측정하기

In [1]:
import os
def get_file_list(dir_name):
    return os.listdir(dir_name)

In [2]:
dir_name = "news_data"
file_list = get_file_list(dir_name)
file_list = [os.path.join(dir_name, file_name) for file_name in file_list] # OS마다 폴더를 구분하는 기호가 다르기 때문에.

In [3]:
file_list

['news_data/3_SEUNG-HWAN OH SHUTS DOWN TWINS IN FIRST SPRING ACTION.txt',
 'news_data/1_MLB Team Interested In Dae-Ho Lee.txt',
 "news_data/4_Twins' Park Byung-ho hits first big league triple.txt",
 'news_data/7_Swansea City midfielder Ki Sung-Yueng will miss Crystal Palace clash.txt',
 'news_data/7_Premier League 2015-2016 Swansea City vs Manchester City, Preview, Prediction & Predicted Lineup.txt',
 'news_data/8_Heung Min Son gives Tottenham a two-goal advantage vs Chelsea.txt',
 "news_data/5_Injured midfielder likely to miss nat'l team friendlies.txt",
 'news_data/8_Has Heung Min Son Done Enough to Justify 22m Transfer.txt',
 'news_data/7_Sung-yueng In Team Of The Week.txt',
 'news_data/3_SEUNG-HWAN OH STRIKES OUT THE SIDE.txt',
 'news_data/3_Seung Hwan Oh Arrives Early To Cardinals Spring Training Camp.txt',
 "news_data/1_Seattle Mariners' Newest Signing Dae-Ho Lee Could Become Fan Favorite.txt",
 'news_data/2_Ryu Hyun-jin Rejoins Dodgers After Surgery.txt',
 'news_data/8_South Kor

In [4]:
len(file_list)

80

In [5]:
def get_conetents(file_list):
    y_class = []
    X_text = []
    class_dict = { 1:'0', 2:'0',3:'0',4:'0',5:'1',6:'1',7:'1',8:'1' }
    
    for file_name in file_list:
        try:
            f = open(file_name, 'r', encoding='cp949') #cp949 windows
            category = int(file_name.split(os.sep)[1].split("_")[0]) # sep | ₩ 등의 구분기호를 말한다.
            y_class.append(class_dict[category])
            X_text.append(f.read())
            f.close()
        except UnicodeDecodeError as e:
            print(e)
            print(file_name)
    return X_text, y_class

In [6]:
X_text, y_class = get_conetents(file_list)

Corpus 만들기 + 단어별 index 생성하기.

In [7]:
def get_cleaned_text(text):
        import re
        text = re.sub('\W+','', text.lower())
        return text
    
def get_corpus_dict(text):
    text = [sentence.split() for sentence in text]
    cleaned_words = [get_cleaned_text(word) for words in text for word in words] # list comprehansion
    
    from collections import OrderedDict
    corpus_dict = OrderedDict()
    for i, v in enumerate(set(cleaned_words)): # 동일한 단어는 한 개로 처리함.
        corpus_dict[v] = i
    return corpus_dict

In [8]:
corpus = get_corpus_dict(X_text) # 사전을 만드는 것.

In [9]:
corpus

OrderedDict([('', 0),
             ('hes', 1),
             ('games', 2),
             ('joo', 3),
             ('garth', 4),
             ('packed', 5),
             ('breakthrough', 6),
             ('argue', 7),
             ('98', 8),
             ('galore', 9),
             ('doing', 10),
             ('adjustment', 11),
             ('certain', 12),
             ('campaign', 13),
             ('midseason', 14),
             ('kind', 15),
             ('drawing', 16),
             ('terry', 17),
             ('73', 18),
             ('agrees', 19),
             ('praising', 20),
             ('walked', 21),
             ('mccarthy', 22),
             ('rain', 23),
             ('wonder', 24),
             ('cardinals', 25),
             ('history', 26),
             ('actually', 27),
             ('inform', 28),
             ('citizens', 29),
             ('morning', 30),
             ('have', 31),
             ('heard', 32),
             ('weidenfeller', 33),
             ('perfo

**이제 백터화를 만들어야 한다.**

 ### 문서별로 Bag of words vector 생성

In [10]:
def get_count_vector(text, corpus):
    text = [sentence.split() for sentence in text]
    word_number_list = [[corpus[get_cleaned_text(word)] for word in words] for words in text]
    X_vector = [[0 for _ in range(len(corpus))] for x in range(len(text))] # _ 는 쓰지 않는다.
    
    for i, text in enumerate(word_number_list):
        for word_number in text:
            X_vector[i][word_number] += 1
    return X_vector

In [11]:
X_vector = get_count_vector(X_text, corpus)

### 비교하기

In [12]:
import math
def get_cosine_similarity(v1, v2):
    "compute cosine similarity of v1 to v2: (v1 dot v2)/ {||v1||*||v2||}"
    sumxx, sumxy, sumyy = 0,0,0
    for i in range(len(v1)):
        x = v1[i]; y = v2[i]
        sumxx += x*x
        sumyy += y*y
        sumxy += x*y
    return sumxy/math.sqrt(sumxx*sumxy)

### 비교결과 정리하기

In [13]:
def get_similarity_score(X_vector, source):
    source_vector = X_vector[source]
    similarity_list = []
    for target_vector in X_vector:
        similarity_list.append(
        get_cosine_similarity(source_vector, target_vector))
    return similarity_list
def get_top_n_similarity_news(similarity_score, n):
    import operator
    x = {i: v for i, v in enumerate(similarity_score)}
    sorted_x = sorted(x.items(), key=operator.itemgetter(1))
    
    return list(reversed(sorted_x))[1:n+1]

In [14]:
def get_accuracy(similarity_list, y_class, source_news):
    source_class = y_class[source_news]

    return sum([source_class == y_class[i[0]] for i in similarity_list]) / len(similarity_list)

In [15]:
source_number =10
result = []

for i in range(80):
    source_number = i
    
    similarity_score = get_similarity_score(X_vector, source_number)
    similarity_news = get_top_n_similarity_news(similarity_score, 10)
    accuracy_score = get_accuracy(similarity_news, y_class, source_number)
    result.append(accuracy_score)
    
print(sum(result)/80)

0.5337500000000003
