# TF-IDF

- Term Frequency - Inverse Document Frequency
- 단어 수의 빈도만이 아닌 문서 내 단어 출현 빈도를 활용해 문서 집단 내에서 각 단어의 중요도를 계산하는 방법

계산 방법
- TF-IDF는 TF와 IDF의 곱으로 나타남
- TF는 특정 문서 a에서 특정 단어의 등장 횟수를 나타냄
> TF(d, t) = (문서 d 내의 t의 단어가 몇 개 등장했는가)
- IDF는 특정 문서 집단 내에서 특정 단어의 등장 문서 수의 비율을 반비례하여 log를 씌운 값을 의미함
> DF(t) = (문서 집단 내에서 t의 단어가 몇 개의 문서에서 등장했는가)
> IDF(t) = log(D/1+IDF(t)), D = (총 문서 갯수)
- 이 두개를 곱해 TF-IDF 값을 구하게 됨
> TF-IDF = TF(d, t) * IDF(t)

- TF-IDF의 의미는 범용적으로 많이 쓰이는 단어는 낮은 값을, 특정 문서에서 주제로 사용될 수 있는 단어들은 높은 값을 나타내게 됨

In [1]:
# 파일 불러오기
import pandas as pd

df = pd.read_csv("센서클리닝.csv")

tmp = pd.DataFrame([i.replace("{"," ") for i in df["명칭(원문)"]])
tmp

Unnamed: 0,0
0,지원되는 센서 클리닝 시스템 ASSISTED SENSOR CLEANING SYSTEM}
1,레이저를 이용한 이미지센서의 건식 세정방법 및 세정장치 METHOD AND APPA...
2,자동 건조 장치 및 그 제어 방법 Apparatus for automatical d...
3,"오일열화 검출장치, 오일열화 검출장치의 센서 커버, 및 오일열화도 측정 방법"
4,센서 클리닝을 위한 분사구조 INJECTION STRUCTURE FOR CLEANI...
5,센서 클리닝 시스템을 위한 냉각시스템 COOLING SYSTEM FOR SENSOR...
6,센서 클리닝 장치 및 방법 SYSTEM AND METHOD FOR CLEANING ...
7,화상형성장치 및 그 센싱장치 IMAGE FORMING APPARATUS AND SE...
8,"스캐닝 모듈, 이를 구비한 화상독취장치, 및 스캐닝 모듈의메인트넌스 방법 Scann..."
9,차량용 센서 클리닝을 위한 컴프레서의 쿨링 시스템 COOLING SYSTEM OF ...


In [2]:
# 수동 버전
# 띄어쓰기 기준으로 특허에 포진해 있는 단어를 수집함
def wordbox(tmp):
    word_box = []
    for i in tmp[0]:
        j = i.split(" ")
        for k in j:
            if k not in word_box:
                word_box.append(k)
    
    return word_box
    

In [3]:
# TF 계산
def TF(tmp, word_box):
    emb = []
    for i in tmp[0]:
        c = []
        for j in word_box:
            if j not in i:
                c.append(0)
            else:
                c.append(i.count(j))
        emb.append(c)
    tf=pd.DataFrame(emb)
    return tf

In [4]:
# idf 계산
import numpy as np
def IDF(tmp, word_box):
    count = []
    for i in tmp[0]:
        c = []
        for j in word_box:
            if j not in i:
                c.append(0)
            else:
                c.append(1)
        count.append(c)
    count_doc = pd.DataFrame(count).sum().to_list()

    idf = [np.log10(len(count)/1+i) for i in count_doc]
    return idf

In [5]:
# TF-IDF 계산
def TF_IDF(tf, idf, word_box):
    TF_IDF = tf*idf
    TF_IDF.columns = word_box
    TF_IDF
    return TF_IDF

In [6]:
def total_TFIDF(tmp):
    print(type(tmp))
    word = wordbox(tmp)
    tf = TF(tmp, word)
    idf = IDF(tmp, word)
    tf_idf = TF_IDF(tf, idf, word)
    return tf_idf

total_TFIDF(tmp)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,지원되는,센서,클리닝,시스템,ASSISTED,SENSOR,CLEANING,SYSTEM},레이저를,이용한,...,처리,PROCESSING,METHOD},공기유량센서,air,flow,ACR센서,클리닝방법,AN,ACR
0,1.322219,1.491362,1.491362,1.39794,1.322219,1.431364,1.447158,1.342423,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0
1,0.0,1.491362,0.0,0.0,0.0,1.431364,1.447158,0.0,1.322219,1.322219,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.924796,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.491362,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.491362,1.491362,0.0,0.0,1.431364,1.447158,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0
5,0.0,1.491362,1.491362,2.79588,0.0,1.431364,1.447158,1.342423,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0
6,0.0,1.491362,1.491362,0.0,0.0,1.431364,1.447158,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.924796,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,1.491362,1.491362,1.39794,0.0,1.431364,1.447158,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0


In [7]:
# 통합버젼
import pandas as pd

#파일 불러오기
df = pd.read_csv("센서클리닝.csv")

tmp = pd.DataFrame([i.replace("{"," ") for i in df["명칭(원문)"]])
tmp

# 수동 버전
# 띄어쓰기 기준으로 특허에 포진해 있는 단어를 수집함
def wordbox(atmp):
    word_box = []
    for i in atmp[0]:
        j = i.split(" ")
        for k in j:
            if k not in word_box:
                word_box.append(k)
    
    return word_box

# TF 계산
def TF(atmp, word_box):
    emb = []
    for i in atmp[0]:
        c = []
        for j in word_box:
            if j not in i:
                c.append(0)
            else:
                c.append(i.count(j))
        emb.append(c)
    tf=pd.DataFrame(emb)
    return tf

# idf 계산
import numpy as np
def IDF(tmp, word_box):
    count = []
    for i in tmp[0]:
        c = []
        for j in word_box:
            if j not in i:
                c.append(0)
            else:
                c.append(1)
        count.append(c)
    count_doc = pd.DataFrame(count).sum().to_list()

    idf = [np.log10(len(count)/1+i) for i in count_doc]
    return idf

# TF-IDF 계산
def TF_IDF(tf, idf, word_box):
    TF_IDF = tf*idf
    TF_IDF.columns = word_box
    TF_IDF
    return TF_IDF

def total_TFIDF(tmp):
    print(type(tmp))
    word = wordbox(tmp)
    tf = TF(tmp, word)
    idf = IDF(tmp, word)
    tf_idf = TF_IDF(tf, idf, word)
    return tf_idf

total_TFIDF(tmp)

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,지원되는,센서,클리닝,시스템,ASSISTED,SENSOR,CLEANING,SYSTEM},레이저를,이용한,...,처리,PROCESSING,METHOD},공기유량센서,air,flow,ACR센서,클리닝방법,AN,ACR
0,1.322219,1.491362,1.491362,1.39794,1.322219,1.431364,1.447158,1.342423,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0
1,0.0,1.491362,0.0,0.0,0.0,1.431364,1.447158,0.0,1.322219,1.322219,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.924796,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,1.491362,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,1.491362,1.491362,0.0,0.0,1.431364,1.447158,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0
5,0.0,1.491362,1.491362,2.79588,0.0,1.431364,1.447158,1.342423,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0
6,0.0,1.491362,1.491362,0.0,0.0,1.431364,1.447158,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.924796,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,0.0,1.491362,1.491362,1.39794,0.0,1.431364,1.447158,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.462398,0.0


In [8]:
## 사이킷런을 활용한 tfidf 계산

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform(tmp.iloc[:,0])

In [9]:
# TF-IDF 확인
tf_idf = pd.DataFrame(vectors.toarray(), columns=tfidf.get_feature_names_out())
tf_idf

Unnamed: 0,acr,acr센서,air,an,and,apparatus,assisted,automatical,automatically,body,...,쿨링,클리닝,클리닝방법,클리닝을,프론트,필러,형성,화상,화상독취장치,화상형성장치
0,0.0,0.0,0.0,0.0,0.0,0.0,0.49975,0.0,0.0,0.0,...,0.0,0.293029,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.165452,0.156015,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.166004,0.156536,0.0,0.301165,0.301165,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.35339,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.237533,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.316335,0.0,0.0,0.0,0.0,0.0,...,0.0,0.336505,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.225538,0.212674,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.359668
8,0.0,0.0,0.0,0.0,0.108828,0.102621,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.197437,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.330126,0.0,0.0,0.261848,0.0,0.0,0.0,0.0,0.0,0.0
