<a href="https://colab.research.google.com/github/ancestor9/2025_Spring_Data-Management/blob/main/week_05/Text_Representation_and_Embedding_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 텍스트 표현 기법과 임베딩
# **Data Representation**

## 1. Tabular data
<img src='http://jalammar.github.io/images/pandas-intro/0%20excel-to-pandas.png'>

## 2. Audio and Timeseries data
<img src= 'http://jalammar.github.io/images/numpy/numpy-audio.png'>

## 3. Image data
<img src='http://jalammar.github.io/images/numpy/numpy-grayscale-image.png'>
<img src='http://jalammar.github.io/images/numpy/numpy-color-image.png'>

## <font color='orange'>**4. Text data**
- **아래 그림을 이해하여야 한다.**
<img src='http://jalammar.github.io/images/numpy/numpy-nlp-embeddings.png'>
<img src='http://jalammar.github.io/images/numpy/numpy-nlp-bert-shape.png'>


## <font color='orange'>**목소리, 주식가격, 그림, 동영상, 언어는 모두 순서(Order)가 있다.**

## 🎯 강의 목표
- 자연어 처리에서 사용되는 텍스트 표현 방법의 역사와 원리를 이해한다.
- Bag of Words, TF-IDF, Word Embedding 기법을 실습한다.
- Word2Vec과 같은 사전 학습된 임베딩을 적용해 본다.

## 📘 이론 강의
### **1. 텍스트 표현(Representation of Text)의 필요성**
- 컴퓨터는 텍스트를 숫자로 이해해야 함
- 자연어는 비정형 데이터 → 수치화 필요

### **2. 텍스트 표현 방식**
#### 2.1 One-hot Encoding
- 각 단어를 고유 인덱스로 변환 후, 그 인덱스만 1인 벡터로 표현
- 단점: 희소성, 단어 간 의미 관계 없음

#### 2.2 Bag of Words (BoW)
- 문서별 단어 출현 빈도 벡터
- 장점: 단순하고 빠름
- 단점: 문맥 정보 손실
- 📌 구현 도구: `CountVectorizer`

#### 2.3 N-gram 모델
- 연속된 N개의 단어를 하나의 특징으로 간주
- 예: bigram("I love NLP") → ["I love", "love NLP"]
- 📌 구현 도구: `CountVectorizer(ngram_range=(n, n))`

#### 2.4 TF-IDF
- 단어의 중요도를 반영한 벡터
- TF: 문서 내 빈도 / IDF: 전체 문서에서의 희귀성
- 📌 구현 도구: `TfidfVectorizer`

#### 2.5 Word Embedding
- 의미 기반 분산 표현 (Distributed Representation)
- Word2Vec, GloVe, FastText 등
- 단어 간 유사도, 의미 추론 가능
- CBOW / Skip-gram

#### 2.6 사전학습 임베딩
- Gensim / HuggingFace Transformers 사용 가능
- 예: word2vec-google-news-300

## 💻 **실습**

#### 2.0 정수인코딩(Integer Encoding)
- 📌 구현 도구: `Tokenizer`
- 예를 들어 : 단어에 정수를 부여하는 방법 중 하나로 단어를 빈도수 순으로 정렬한 단어 집합(vocabulary)을 만들고, 빈도수가 높은 순서대로 차례로 낮은 숫자부터 정수를 부여하는 방법


In [22]:
raw_text = '''
A barber is a person. a barber is good person. a barber is huge person. he Knew A Secret! The Secret He Kept is huge secret. Huge secret. His barber kept his word. a barber kept his word. His barber kept his secret. But keeping and keeping such a huge secret to himself was driving the barber crazy. the barber went up a huge mountain.
'''

In [23]:
import nltk
nltk.download('punkt_tab') # 문장을 구분하거나 단어로 쪼갤 때 필요한 pre-trained 모델
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [24]:
# 문장 토큰화
sentences = sent_tokenize(raw_text)
sentences

['\nA barber is a person.',
 'a barber is good person.',
 'a barber is huge person.',
 'he Knew A Secret!',
 'The Secret He Kept is huge secret.',
 'Huge secret.',
 'His barber kept his word.',
 'a barber kept his word.',
 'His barber kept his secret.',
 'But keeping and keeping such a huge secret to himself was driving the barber crazy.',
 'the barber went up a huge mountain.']

In [25]:
len(sentences)

11

In [26]:
for sentence in sentences:
    print(sentence)


A barber is a person.
a barber is good person.
a barber is huge person.
he Knew A Secret!
The Secret He Kept is huge secret.
Huge secret.
His barber kept his word.
a barber kept his word.
His barber kept his secret.
But keeping and keeping such a huge secret to himself was driving the barber crazy.
the barber went up a huge mountain.


In [31]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [32]:

vocab = {}
preprocessed_sentences = []
stop_words = set(stopwords.words('english'))

for sentence in sentences:
    # 단어 토큰화
    tokenized_sentence = word_tokenize(sentence)
    result = []

    for word in tokenized_sentence:
        word = word.lower() # 모든 단어를 소문자화하여 단어의 개수를 줄인다.
        if word not in stop_words: # 단어 토큰화 된 결과에 대해서 불용어를 제거한다.
            if len(word) > 2: # 단어 길이가 2이하인 경우에 대하여 추가로 단어를 제거한다.
                result.append(word)
                if word not in vocab:
                    vocab[word] = 0
                vocab[word] += 1
    preprocessed_sentences.append(result)
print(preprocessed_sentences)

[['barber', 'person'], ['barber', 'good', 'person'], ['barber', 'huge', 'person'], ['knew', 'secret'], ['secret', 'kept', 'huge', 'secret'], ['huge', 'secret'], ['barber', 'kept', 'word'], ['barber', 'kept', 'word'], ['barber', 'kept', 'secret'], ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'], ['barber', 'went', 'huge', 'mountain']]


In [33]:
print('단어 집합 :',vocab)

단어 집합 : {'barber': 8, 'person': 3, 'good': 1, 'huge': 5, 'knew': 1, 'secret': 6, 'kept': 4, 'word': 2, 'keeping': 2, 'driving': 1, 'crazy': 1, 'went': 1, 'mountain': 1}


In [34]:
vocab_sorted = sorted(vocab.items(), key = lambda x:x[1], reverse = True)
vocab_sorted

[('barber', 8),
 ('secret', 6),
 ('huge', 5),
 ('kept', 4),
 ('person', 3),
 ('word', 2),
 ('keeping', 2),
 ('good', 1),
 ('knew', 1),
 ('driving', 1),
 ('crazy', 1),
 ('went', 1),
 ('mountain', 1)]

In [36]:
# 높은 빈도수를 가진 단어일수록 낮은 정수를 1부터 부여
word_to_index = {}
i = 0
for (word, frequency) in vocab_sorted :
    if frequency > 1 : # 빈도수가 작은 단어는 제외.
        i = i + 1
        word_to_index[word] = i

print(word_to_index)

{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5, 'word': 6, 'keeping': 7}


In [40]:
# 자연어 처리를 하다보면, 텍스트 데이터에 있는 단어를 모두 사용하기 보다는 빈도수가 가장 높은 n개의 단어만 사용하고 싶을 때
vocab_size = 5

# 인덱스가 5 초과인 단어 제거
words_frequency = [word for word, index in word_to_index.items() if index >= vocab_size + 1]

# 해당 단어에 대한 인덱스 정보를 삭제
for w in words_frequency:
    del word_to_index[w]
print(word_to_index)

{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5}


In [None]:
# 단어 집합에 존재하지 않는 단어들이 생기는 상황을 Out-Of-Vocabulary(단어 집합에 없는 단어) 'OOV 문제'
# word_to_index에 'OOV'란 단어를 새롭게 추가하고, 단어 집합에 없는 단어들은 'OOV'의 인덱스로 인코딩

In [41]:
word_to_index['OOV'] = len(word_to_index) + 1
print(word_to_index)


{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5, 'OOV': 6}


In [42]:
encoded_sentences = []
for sentence in preprocessed_sentences:
    encoded_sentence = []
    for word in sentence:
        try:
            # 단어 집합에 있는 단어라면 해당 단어의 정수를 리턴.
            encoded_sentence.append(word_to_index[word])
        except KeyError:
            # 만약 단어 집합에 없는 단어라면 'OOV'의 정수를 리턴.
            encoded_sentence.append(word_to_index['OOV'])
    encoded_sentences.append(encoded_sentence)

print(encoded_sentences)

[[1, 5], [1, 6, 5], [1, 3, 5], [6, 2], [2, 4, 3, 2], [3, 2], [1, 4, 6], [1, 4, 6], [1, 4, 2], [6, 6, 3, 2, 6, 1, 6], [1, 6, 3, 6]]


### **Counter**

In [43]:
from collections import Counter

In [44]:
preprocessed_sentences

[['barber', 'person'],
 ['barber', 'good', 'person'],
 ['barber', 'huge', 'person'],
 ['knew', 'secret'],
 ['secret', 'kept', 'huge', 'secret'],
 ['huge', 'secret'],
 ['barber', 'kept', 'word'],
 ['barber', 'kept', 'word'],
 ['barber', 'kept', 'secret'],
 ['keeping', 'keeping', 'huge', 'secret', 'driving', 'barber', 'crazy'],
 ['barber', 'went', 'huge', 'mountain']]

In [53]:
# prompt: preprocessed_sentences를 하나의 단어집합으로 만들어줘

from collections import Counter

# Flatten the list of lists into a single list of words
all_words = [word for sentence in preprocessed_sentences for word in sentence]

# Create a vocabulary using Counter
vocab = Counter(all_words)

vocab


Counter({'barber': 8,
         'person': 3,
         'good': 1,
         'huge': 5,
         'knew': 1,
         'secret': 6,
         'kept': 4,
         'word': 2,
         'keeping': 2,
         'driving': 1,
         'crazy': 1,
         'went': 1,
         'mountain': 1})

In [54]:
# words = np.hstack(preprocessed_sentences)으로도 수행 가능.
all_words = sum(preprocessed_sentences, [])
Counter(all_words)

Counter({'barber': 8,
         'person': 3,
         'good': 1,
         'huge': 5,
         'knew': 1,
         'secret': 6,
         'kept': 4,
         'word': 2,
         'keeping': 2,
         'driving': 1,
         'crazy': 1,
         'went': 1,
         'mountain': 1})

In [55]:
print(vocab["barber"]) # 'barber'라는 단어의 빈도수 출력

8


In [56]:
vocab_size = 5
vocab = vocab.most_common(vocab_size) # 등장 빈도수가 높은 상위 5개의 단어만 저장
vocab

[('barber', 8), ('secret', 6), ('huge', 5), ('kept', 4), ('person', 3)]

### <font color='red'> **퀴즈 1. 이상한 나라의 앨리스라는 소설에서 출현하는 상위 10위 단어는 무엇인가?**

In [59]:
from nltk.corpus import gutenberg
# 필요한 리소스를 다운로드
nltk.download('gutenberg')
# 구텐베르크 말뭉치에서 사용 가능한 파일 목록 출력
file_ids = gutenberg.fileids()
file_ids

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [63]:
raw_text = gutenberg.raw('austen-emma.txt')
raw_text[:1000]

"[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.\n\nShe was the youngest of the two daughters of a most affectionate,\nindulgent father; and had, in consequence of her sister's marriage,\nbeen mistress of his house from a very early period.  Her mother\nhad died too long ago for her to have more than an indistinct\nremembrance of her caresses; and her place had been supplied\nby an excellent woman as governess, who had fallen little short\nof a mother in affection.\n\nSixteen years had Miss Taylor been in Mr. Woodhouse's family,\nless as a governess than a friend, very fond of both daughters,\nbut particularly of Emma.  Between _them_ it was more the intimacy\nof sisters.  Even before Miss Taylor had ceased to hold the nominal\noffice of 

In [62]:
# prompt: raw_text에서 출현하는 단어 사우이 10개와 출현빈도를 구해줘
# Tokenize the text
tokens = word_tokenize(raw_text)

# Remove stop words and punctuation
stop_words = set(stopwords.words('english'))
words = [w.lower() for w in tokens if w.isalnum() and w.lower() not in stop_words]

# Count word frequencies
word_counts = Counter(words)

# Get the top 10 most frequent words
top_10_words = word_counts.most_common(10)

top_10_words

[('emma', 860),
 ('could', 836),
 ('would', 818),
 ('miss', 599),
 ('must', 566),
 ('harriet', 500),
 ('much', 484),
 ('said', 483),
 ('one', 447),
 ('weston', 437)]

#### 2.1 One-hot Encoding
- 📌 구현 도구: `CountVectorizer`



In [66]:
word_to_index

{'barber': 1, 'secret': 2, 'huge': 3, 'kept': 4, 'person': 5, 'OOV': 6}

In [69]:
def one_hot_encoding(word, word_to_index):
  one_hot_vector = [0]*(len(word_to_index))
  index = word_to_index[word]
  one_hot_vector[index - 1] = 1  # Subtract 1 from the index to adjust for 0-based indexing
  return one_hot_vector

In [70]:
one_hot_encoding('barber', word_to_index)

[1, 0, 0, 0, 0, 0]

In [71]:
for k, v in word_to_index.items():
    print(k, one_hot_encoding(k, word_to_index))

barber [1, 0, 0, 0, 0, 0]
secret [0, 1, 0, 0, 0, 0]
huge [0, 0, 1, 0, 0, 0]
kept [0, 0, 0, 1, 0, 0]
person [0, 0, 0, 0, 1, 0]
OOV [0, 0, 0, 0, 0, 1]


In [72]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

text = "나랑 점심 먹으러 갈래 점심 메뉴는 햄버거 갈래 갈래 햄버거 최고야"

tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
print('단어 집합 :',tokenizer.word_index)


단어 집합 : {'갈래': 1, '점심': 2, '햄버거': 3, '나랑': 4, '먹으러': 5, '메뉴는': 6, '최고야': 7}


In [78]:
# 단어 집합(vocabulary)에 있는 단어들로만 구성된 텍스트가 있다면, texts_to_sequences()를 통해서 이를 정수 시퀀스로 변환가능
sub_text = "점심 먹으러 갈래 메뉴는 햄버거 최고야"
encoded = tokenizer.texts_to_sequences([sub_text])[0]
encoded

[2, 5, 1, 6, 3, 7]

In [77]:
one_hot = to_categorical(encoded)
one_hot

array([[0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1.]])

In [98]:
import pandas as pd
df = pd.DataFrame(one_hot[:, 1:],
                  columns=[i[0] for i in tokenizer.word_index.items()])

df

Unnamed: 0,갈래,점심,햄버거,나랑,먹으러,메뉴는,최고야
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,1.0


In [99]:
# Why empty?
sub_text = "친구랑 수영장가서 불고기를 먹을 거야"
encoded = tokenizer.texts_to_sequences([sub_text])[0]
print(encoded)

[]


#### 2.2 Bag of Words (BoW)
- 📌 구현 도구: `CountVectorizer`
- 단어들의 순서는 전혀 고려하지 않고, 단어들의 출현 빈도(frequency)에만 집중하는 텍스트 데이터의 수치화 표현 방법
- Bag of Words를 직역하면 단어들의 가방

In [100]:
tokenizer.word_index

{'갈래': 1, '점심': 2, '햄버거': 3, '나랑': 4, '먹으러': 5, '메뉴는': 6, '최고야': 7}

In [109]:
sub_text = ["점심 먹으러 갈래 메뉴는 불고기가 최고야", "저녁은 양곱창이 최고야"]
encoded = tokenizer.texts_to_sequences(sub_text) # tokenizer.texts_to_sequences() 리스트를 입력인자로
encoded

[[2, 5, 1, 6, 7], [7]]

In [111]:
# https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(X.toarray())

['and' 'document' 'first' 'is' 'one' 'second' 'the' 'third' 'this']
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]


In [113]:
pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())

Unnamed: 0,and,document,first,is,one,second,the,third,this
0,0,1,1,1,0,0,1,0,1
1,0,2,0,1,0,1,1,0,1
2,1,0,0,1,1,0,1,1,1
3,0,1,1,1,0,0,1,0,1



#### 2.3 N-gram 모델
- 📌 구현 도구: `CountVectorizer(ngram_range=(n, n))`


In [116]:
vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
X2 = vectorizer2.fit_transform(corpus)
pd.DataFrame(X2.toarray(), columns=vectorizer2.get_feature_names_out())

Unnamed: 0,and this,document is,first document,is the,is this,second document,the first,the second,the third,third one,this document,this is,this the
0,0,0,1,1,0,0,1,0,0,0,0,1,0
1,0,1,0,1,0,1,0,1,0,0,1,0,0
2,1,0,0,1,0,0,0,0,1,1,0,1,0
3,0,0,1,0,1,0,1,0,0,0,0,0,1



#### 2.4 TF-IDF
- 📌 구현 도구: `TfidfVectorizer`


In [117]:
corpus

['This is the first document.',
 'This document is the second document.',
 'And this is the third one.',
 'Is this the first document?']

In [138]:
vocab = list(set(w for doc in corpus for w in doc.split()))
vocab.sort()
print(vocab)

['And', 'Is', 'This', 'document', 'document.', 'document?', 'first', 'is', 'one.', 'second', 'the', 'third', 'this']


- (1) tf(d,t) : 특정 문서 d에서의 특정 단어 t의 등장 횟수.
- (2) df(t) : 특정 단어 t가 등장한 문서의 수.
- (3) idf(t) : df(t)에 반비례하는 수.

In [148]:
N = len(corpus)

print(N)


from math import log # IDF 계산을 위해

def tf(t, d):  # Term Frequency
  return d.count(t)

def idf(t):    # Inverse Document Frequency
  df = 0
  for doc in corpus:
    df += t in doc
  return log(N/(df+1))

def tfidf(t, d):  # tf-idf
  return tf(t,d)* idf(t)

4


In [149]:
# Term Frequency 구하기

result = []

# 각 문서에 대해서 아래 연산을 반복
for i in range(N):
  result.append([])
  d = corpus[i]
  for j in range(len(vocab)):
    t = vocab[j]
    result[-1].append(tf(t, d))

tf_ = pd.DataFrame(result, columns = vocab)
tf_


Unnamed: 0,And,Is,This,document,document.,document?,first,is,one.,second,the,third,this
0,0,0,1,1,1,0,1,2,0,0,1,0,0
1,0,0,1,2,1,0,0,2,0,1,1,0,0
2,1,0,0,0,0,0,0,2,1,0,1,1,1
3,0,1,0,1,0,1,1,1,0,0,1,0,1


In [150]:
# idf 구하기

result = []
for j in range(len(vocab)):
    t = vocab[j]
    result.append(idf(t))

idf_ = pd.DataFrame(result, index=vocab, columns=["IDF"])
idf_


Unnamed: 0,IDF
And,0.693147
Is,0.693147
This,0.287682
document,0.0
document.,0.287682
document?,0.693147
first,0.287682
is,-0.223144
one.,0.693147
second,0.693147


In [151]:
result = []
for i in range(N):
  result.append([])
  d = corpus[i]
  for j in range(len(vocab)):
    t = vocab[j]
    result[-1].append(tfidf(t,d))

tfidf_ = pd.DataFrame(result, columns = vocab)
tfidf_


Unnamed: 0,And,Is,This,document,document.,document?,first,is,one.,second,the,third,this
0,0.0,0.0,0.287682,0.0,0.287682,0.0,0.287682,-0.446287,0.0,0.0,-0.223144,0.0,0.0
1,0.0,0.0,0.287682,0.0,0.287682,0.0,0.0,-0.446287,0.0,0.693147,-0.223144,0.0,0.0
2,0.693147,0.0,0.0,0.0,0.0,0.0,0.0,-0.446287,0.693147,0.0,-0.223144,0.693147,0.287682
3,0.0,0.693147,0.0,0.0,0.0,0.693147,0.287682,-0.223144,0.0,0.0,-0.223144,0.0,0.287682


In [153]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfv = TfidfVectorizer().fit(corpus)
pd.DataFrame(tfidfv.transform(corpus).toarray(), columns=tfidfv.vocabulary_)

Unnamed: 0,this,is,the,first,document,second,and,third,one
0,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085
1,0.0,0.687624,0.0,0.281089,0.0,0.538648,0.281089,0.0,0.281089
2,0.511849,0.0,0.0,0.267104,0.511849,0.0,0.267104,0.511849,0.267104
3,0.0,0.469791,0.580286,0.384085,0.0,0.0,0.384085,0.0,0.384085


### 단어가 많아지면 컬럼이 엄청많아지고 차원의 저주, 비효율적

In [160]:
raw_text = gutenberg.raw('austen-emma.txt')
text = raw_text[:10000]
text

'[Emma by Jane Austen 1816]\n\nVOLUME I\n\nCHAPTER I\n\n\nEmma Woodhouse, handsome, clever, and rich, with a comfortable home\nand happy disposition, seemed to unite some of the best blessings\nof existence; and had lived nearly twenty-one years in the world\nwith very little to distress or vex her.\n\nShe was the youngest of the two daughters of a most affectionate,\nindulgent father; and had, in consequence of her sister\'s marriage,\nbeen mistress of his house from a very early period.  Her mother\nhad died too long ago for her to have more than an indistinct\nremembrance of her caresses; and her place had been supplied\nby an excellent woman as governess, who had fallen little short\nof a mother in affection.\n\nSixteen years had Miss Taylor been in Mr. Woodhouse\'s family,\nless as a governess than a friend, very fond of both daughters,\nbut particularly of Emma.  Between _them_ it was more the intimacy\nof sisters.  Even before Miss Taylor had ceased to hold the nominal\noffice o

In [161]:
# prompt: raw_text[:1000]의 내용을 countervector, tf-idf로 변환

tokens = word_tokenize(text)
stop_words = set(stopwords.words('english'))
words = [w.lower() for w in tokens if w.isalnum() and w.lower() not in stop_words]

# CountVectorizer
vectorizer = CountVectorizer()
X_count = vectorizer.fit_transform(words)  # words를 리스트로 변환해서 사용
print(X_count.toarray())
print(vectorizer.vocabulary_)
pd.DataFrame(X_count.toarray(), columns=vectorizer.vocabulary_)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
{'emma': 134, 'jane': 235, 'austen': 36, '1816': 0, 'volume': 470, 'chapter': 63, 'woodhouse': 488, 'handsome': 193, 'clever': 75, 'rich': 373, 'comfortable': 78, 'home': 208, 'happy': 197, 'disposition': 121, 'seemed': 383, 'unite': 454, 'best': 49, 'blessings': 52, 'existence': 151, 'lived': 255, 'nearly': 302, 'years': 493, 'world': 491, 'little': 253, 'distress': 123, 'vex': 466, 'youngest': 495, 'two': 452, 'daughters': 98, 'affectionate': 11, 'indulgent': 225, 'father': 155, 'consequence': 87, 'sister': 398, 'marriage': 272, 'mistress': 291, 'house': 212, 'early': 129, 'period': 336, 'mother': 294, 'died': 109, 'long': 259, 'ago': 17, 'indistinct': 224, 'remembrance': 367, 'caresses': 59, 'place': 338, 'supplied': 421, 'excellent': 149, 'woman': 487, 'governess': 186, 'fallen': 152, 'short': 394, 'affection': 10, 'sixteen': 402, 'miss': 290, 'taylor': 429, 'fami

Unnamed: 0,emma,jane,austen,1816,volume,chapter,woodhouse,handsome,clever,rich,...,sir,beautiful,moonlight,mild,draw,back,fire,found,damp,dirt
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
796,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
797,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
798,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [162]:
# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(words) # words를 리스트로 변환해서 사용
print("\nTF-IDF Vectorizer Result:")
print(X_tfidf.toarray())
tfidf_vectorizer.vocabulary_
pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.vocabulary_)


TF-IDF Vectorizer Result:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


Unnamed: 0,emma,jane,austen,1816,volume,chapter,woodhouse,handsome,clever,rich,...,sir,beautiful,moonlight,mild,draw,back,fire,found,damp,dirt
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
796,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
797,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
798,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
