### Text Data Structure

#### Word Representation

**◆ Word vectors (word representations)**
- The most basic problem in natural language processing is how to make a computer recognize natural language.
- A computer stores natural language as binary codes (Unicode, ASCII, ...).
    Ex) 언 (U+C5B8): 1100010110111000, 어 (U+C5B4): 1100010110110100
- Such a code carries no information about the characteristics of the word itself.
- A word vector, by contrast, can be used for tasks such as classification and clustering.

**◆ One-hot encoding**

Each word is expressed as a single vector that has a 1 at one specific position and 0 everywhere else. Ex) {Thomas Jepperson made Jepperson building}

<center>

||Thomas|Jepperson|made|building|
|:---:|:---:|:---:|:---:|:---:|
|Thomas|1|0|0|0|
|Jepperson|0|1|0|0|
|made|0|0|1|0|
|Jepperson|0|1|0|0|
|building|0|0|0|1|

</center>

- To find the nth word of the sentence, look at the nth row and find the column whose value is 1, so the sentence is 100% restorable.
- Each row is a binary row vector in which exactly one entry is 1 and the rest are 0.
- The columns act as a dictionary of words (terms).
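
The table above can be reproduced with a few lines of Python. This is a minimal sketch that assumes plain whitespace tokenization; the names `vocab`, `index`, `one_hot`, and `restored` are illustrative only.

```python
# One-hot encode the example sentence (whitespace tokenization is an assumption).
sentence = "Thomas Jepperson made Jepperson building"
tokens = sentence.split()

# Vocabulary in order of first appearance, matching the column order of the table.
vocab = list(dict.fromkeys(tokens))
index = {word: i for i, word in enumerate(vocab)}

# One row per token; a single 1 marks the column of that token.
one_hot = [[1 if index[tok] == col else 0 for col in range(len(vocab))]
           for tok in tokens]

for tok, row in zip(tokens, one_hot):
    print(f"{tok:10s} {row}")

# The sentence is fully recoverable: read off the column holding the 1 in each row.
restored = " ".join(vocab[row.index(1)] for row in one_hot)
print(restored == sentence)
```

Because every row keeps its position, the word order survives, which is why the encoding is lossless.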

**◆ Disadvantages of one-hot encoding and alternatives**

- The disadvantage of one-hot encoding is that it becomes inefficient when there are many words, because the vector size grows with the vocabulary.
- Two alternatives have been proposed to overcome this and other disadvantages.

**◆ Two alternatives**
1. Frequency-based

    a. word frequency vector (bag of words)

    b. word-document matrix (TF-IDF, etc.)

    c. co-occurrence matrix: word-word matrix, document-document matrix

2. Meaning-based (topic / feature information)

    a. topic vector (semantic vector)

    b. word2vec / GloVe (the methods differ, but the solution is the same)

    c. BERT, GPT, ...
    
**◆ Word frequency vector: bag of words**
- A method that tries to capture the meaning of a sentence from its word counts alone, ignoring the order and grammar of the words.
- Unlike one-hot encoding, it keeps only the number of occurrences, so the original document is difficult to reconstruct.
- The most basic form is the binary bag-of-words vector, which records only whether each word occurs.
- Binary bag-of-words vectors are useful for document search indexing, which tells which word is used in which document.
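
As a rough illustration of the points above, the sketch below builds count and binary bag-of-words vectors and a small search index. The second document (`doc_b`) and all variable names are assumptions made up for this example.

```python
from collections import Counter

docs = {
    "doc_a": "Thomas Jepperson made Jepperson building",
    "doc_b": "Thomas made dinner",  # hypothetical second document
}
tokenized = {name: text.split() for name, text in docs.items()}

# Shared vocabulary across all documents, in sorted order.
vocab = sorted({tok for toks in tokenized.values() for tok in toks})

count_vectors = {}
binary_vectors = {}
for name, toks in tokenized.items():
    counts = Counter(toks)
    count_vectors[name] = [counts[w] for w in vocab]            # how often each word occurs
    binary_vectors[name] = [int(counts[w] > 0) for w in vocab]  # whether each word occurs

# Search index: for every word, the documents that contain it.
search_index = {w: [name for name in docs if binary_vectors[name][i]]
                for i, w in enumerate(vocab)}

print(vocab)
print(count_vectors["doc_a"], binary_vectors["doc_a"])
print(search_index["made"])
```

Word order is gone in both vectors, so "Jepperson made Thomas" would produce exactly the same result as the original sentence.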

#### Integer Encoding & Padding

**◆ Integer encoding**
- A basic first step among the techniques for converting text to numbers in natural language processing.
- A preprocessing task that maps each word to a unique integer.
- If the text contains 5,000 distinct words, each word is mapped to a unique integer from 1 to 5,000. In other words, each word is given an index, usually after sorting by word frequency.
- One common way is to build a vocabulary sorted by descending frequency and assign integers in increasing order, so the most frequent word gets the smallest integer.
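
A minimal sketch of frequency-based integer encoding follows. The toy corpus, the alphabetical tie-breaking rule, and the names `word_to_id` and `encoded` are assumptions for illustration.

```python
from collections import Counter

# Toy corpus (an assumption for this example).
sentences = [
    "the fox chases the rabbit",
    "the rabbit ate the cabbage",
    "the fox caught the rabbit",
]
tokenized = [s.split() for s in sentences]

# Count every word, then sort by descending frequency (ties broken alphabetically).
freq = Counter(tok for sent in tokenized for tok in sent)
ordered = sorted(freq, key=lambda w: (-freq[w], w))

# Assign integers starting at 1, so the most frequent word gets the smallest index.
word_to_id = {word: i for i, word in enumerate(ordered, start=1)}

encoded = [[word_to_id[tok] for tok in sent] for sent in tokenized]
print(word_to_id)
print(encoded[0])
```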

**◆ Padding**
- Sentences (or documents) can have different lengths, but a machine can bundle documents of the same length into a single matrix and process them together.
- Padding artificially equalizes the lengths of several sentences so that they can be processed in parallel.
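
Padding can be sketched as follows. The integer sequences are hypothetical, and using 0 as the padding value appended at the end is an assumption (a common convention, which is one reason word indices often start at 1).

```python
# Hypothetical integer-encoded sentences of different lengths.
encoded = [[1, 3, 7, 1, 2], [1, 2], [4, 5, 6]]

PAD = 0  # assumed padding value; it must not collide with any real word index
max_len = max(len(seq) for seq in encoded)

# Append PAD to the end of every sequence until all have the same length.
padded = [seq + [PAD] * (max_len - len(seq)) for seq in encoded]

for row in padded:
    print(row)
```

The result is a rectangular 3 x 5 matrix, so all three sentences can be processed as one batch.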

#### Term-Document Matrix

**◆ Term frequency (TF)**
- The number of times a word appears in a document.
- If a particular word appears frequently in a particular document, the word is considered closely related to that document.

```
// Example

Doc1 : the fox chases the rabbit
Doc2 : the rabbit ate the cabbage
Doc3 : the fox caught the rabbit
```

Rows are words, columns are documents (TDM: Term-Document Matrix).

<center>

|          | Doc1 | Doc2 | Doc3 |
|:--------:|:----:|:----:|:----:|
|   the    |  2   |  2   |  2   |
|   fox    |  1   |  0   |  1   |
|  rabbit  |  1   |  1   |  1   |
|  chases  |  1   |  0   |  0   |
|  caught  |  0   |  0   |  1   |
| cabbage  |  0   |  1   |  0   |
|   ate    |  0   |  1   |  0   |

</center>
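
The term-document matrix above can be computed directly from the three example documents. This is a minimal sketch assuming whitespace tokenization; the row order in `terms` is fixed by hand to match the table.

```python
from collections import Counter

docs = {
    "Doc1": "the fox chases the rabbit",
    "Doc2": "the rabbit ate the cabbage",
    "Doc3": "the fox caught the rabbit",
}
# Row order chosen to match the table.
terms = ["the", "fox", "rabbit", "chases", "caught", "cabbage", "ate"]

# Term frequency of every word in every document.
counts = {name: Counter(text.split()) for name, text in docs.items()}

# Rows are terms, columns are documents (Doc1, Doc2, Doc3).
tdm = {term: [counts[name][term] for name in docs] for term in terms}

for term, row in tdm.items():
    print(f"{term:8s} {row}")
```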

#### Word document matrix

**◆ Term frequency-inverse document frequency (TF-IDF)**
- **Zipf's Law**: The frequency of use of any word is inversely proportional to the rank of that word. (Ex: the 1st-ranked word is about 3 times as frequent as the 3rd-ranked word)
- Words that appear frequently in many documents but do not help in understanding the meaning of a document should get a low weight → IDF


**IDF**

A weight that measures the importance of a word. 

$$\mathrm{IDF} = \log_2\left(\frac{\mathrm{N}}{\mathrm{DF}}\right)$$

N is the total number of documents. DF is the document frequency (the number of documents in which the word appears). The examples in this section use the base-2 logarithm.

- The smaller the DF, the higher the importance of the word.
- The higher the IDF, the higher the importance of the word.
- Words with high TF-IDF values discriminate well between documents (important words in information retrieval).
- TF-IDF is calculated as $\text{TF-IDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$, where $t$ is the word and $d$ is the document.

<center>

|          | Doc1 | Doc2 | Doc3 | DF | N/DF | IDF=$\log_2(\mathrm{N}/\mathrm{DF})$ |
|:--------:|:----:|:----:|:----:|:--:|:----:|:------------------------------------:| 
|   the    |  2   |  2   |  2   | 3  | 3/3  |            $\log_2(3/3)$             |
|   fox    |  1   |  0   |  1   | 2  | 3/2  |            $\log_2(3/2)$             |
|  rabbit  |  1   |  1   |  1   | 3  | 3/3  |            $\log_2(3/3)$             |
|  chases  |  1   |  0   |  0   | 1  | 3/1  |            $\log_2(3/1)$             |
|  caught  |  0   |  0   |  1   | 1  | 3/1  |            $\log_2(3/1)$             |
| cabbage  |  0   |  1   |  0   | 1  | 3/1  |            $\log_2(3/1)$             |
|   ate    |  0   |  1   |  0   | 1  | 3/1  |            $\log_2(3/1)$             |

</center>
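
The DF and IDF columns of the table can be checked numerically. The sketch below hard-codes the term-document matrix from the table and uses a base-2 logarithm, as the table does; the variable names are illustrative.

```python
import math

# Term-document matrix from the table: term -> counts in Doc1, Doc2, Doc3.
tdm = {
    "the":     [2, 2, 2],
    "fox":     [1, 0, 1],
    "rabbit":  [1, 1, 1],
    "chases":  [1, 0, 0],
    "caught":  [0, 0, 1],
    "cabbage": [0, 1, 0],
    "ate":     [0, 1, 0],
}
n_docs = 3

# DF: number of documents in which the term occurs at least once.
df = {term: sum(1 for c in row if c > 0) for term, row in tdm.items()}

# IDF = log2(N / DF).
idf = {term: math.log2(n_docs / df[term]) for term in tdm}

for term in tdm:
    print(f"{term:8s} DF={df[term]}  IDF={idf[term]:.5f}")
```

Words that occur in every document ("the", "rabbit") get an IDF of 0, so they contribute nothing to TF-IDF.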

**◆ TF standardization and normalization**
- The longer a document is, the more often each word tends to occur in it, and the more likely the document is to be retrieved.
- For the same reason, the longer a document is, the more likely it is to appear similar to other documents.
- Standardization and normalization of TF are needed to compensate for these weak points.

**◆ Standardization** 

$$z = \frac{\mathrm{TF}-\mu(\mathrm{TF})}{\sigma(\mathrm{TF})}$$

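Example) Doc1

$$\mu(\mathrm{TF}) = \frac{5}{7} ~~ \left(\frac{\mathrm{total~number~of~word~occurrences~in~Doc1}}{\mathrm{number~of~distinct~words}}\right)$$
$$\sigma(\mathrm{TF}) = \sqrt{\frac{(2 - \mu)^2 + 3(1 - \mu)^2 + 3(0-\mu)^2}{6}}$$

The divisor 6 is $7 - 1$, i.e. $\sigma$ is the sample standard deviation over the 7 terms.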

<center>

|          |   Doc1    |   Doc2    |    Doc3    |
|:--------:|:---------:|:---------:|:----------:|
|   the    |  1.70084  |  1.70084  |  1.70084   |
|   fox    |  0.37796  | -0.944911 |  0.37796   |
|  rabbit  |  0.37796  |  0.37796  |  0.37796   |
|  chases  |  0.37796  | -0.944911 | -0.944911  |
|  caught  | -0.944911 | -0.944911 |  0.37796   |
| cabbage  | -0.944911 |  0.37796  | -0.944911  |
|   ate    | -0.944911 |  0.37796  | -0.944911  |

</center>
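
The Doc1 column can be verified with the standard library. This sketch assumes the sample standard deviation (division by n - 1), which is what reproduces the values in the table; `doc1_tf` and `z` are illustrative names.

```python
import statistics

# Term frequencies of Doc1 over the 7-term vocabulary.
doc1_tf = {"the": 2, "fox": 1, "rabbit": 1, "chases": 1,
           "caught": 0, "cabbage": 0, "ate": 0}

values = list(doc1_tf.values())
mu = statistics.mean(values)      # 5 / 7
sigma = statistics.stdev(values)  # sample standard deviation (divides by n - 1 = 6)

# z-score of every term frequency.
z = {term: (tf - mu) / sigma for term, tf in doc1_tf.items()}

for term, score in z.items():
    print(f"{term:8s} {score: .6f}")
```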

**◆ Normalization**

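Divide the log-scaled TF by the total number of words in the document: $(1 + \log_2 \mathrm{TF}) / n_i$ when $\mathrm{TF} > 0$, and 0 when $\mathrm{TF} = 0$.

$n_i$: total number of word occurrences in document $i$ (5 for each document in this example)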

<center>

|          |   Doc1    |   Doc2    | Doc3 |
|:--------:|:---------:|:---------:|:----:|
|   the    |    0.4    |    0.4    | 0.4  |
|   fox    |    0.2    |     0     | 0.2  |
|  rabbit  |    0.2    |    0.2    | 0.2  |
|  chases  |    0.2    |     0     |  0   |
|  caught  |     0     |     0     | 0.2  |
| cabbage  |     0     |    0.2    |  0   |
|   ate    |     0     |    0.2    |  0   |

</center>

where $0.4 = \frac{1 + \log_2{2}}{5}$ and $0.2 = \frac{1 + \log_2{1}}{5}$
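
A small sketch of this normalization is shown below. The function name `normalized_tf` is made up for the example, and returning 0 for absent terms follows the zeros in the table.

```python
import math

def normalized_tf(tf: int, n_words: int) -> float:
    """Return (1 + log2(tf)) / n_words, or 0.0 when the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log2(tf)) / n_words

doc1 = "the fox chases the rabbit".split()
n_words = len(doc1)  # 5 word occurrences in Doc1

norm_tf = {term: normalized_tf(doc1.count(term), n_words)
           for term in ["the", "fox", "caught"]}
print(norm_tf)
```

The logarithm dampens the effect of repeated words: "the" occurs twice as often as "fox" in Doc1, and its normalized value is exactly twice as large here only because $\log_2 2 = 1$.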

**◆ Normalized TF-IDF: Normalized TF times IDF**

<center>

| Normalized TF-IDF |                Doc1                |             Doc2             |  Doc3   |
|:-----------------:|:----------------------------------:|:----------------------------:|:-------:|
|        the        |    0 = $0.4 \times \log_2(3/3)$    | 0 = $0.4 \times \log_2(3/3)$ |    0    |
|        fox        | 0.11699 = $0.2 \times \log_2(3/2)$ |              0               | 0.11699 |
|      rabbit       |    0 = $0.2 \times \log_2(3/3)$    |              0               |    0    |
|      chases       | 0.31699 = $0.2 \times \log_2(3/1)$ |              0               |    0    |
|      caught       |                 0                  |              0               | 0.31699 |
|      cabbage      |                 0                  |           0.31699            |    0    |
|        ate        |                 0                  |           0.31699            |    0    |

</center>
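
The whole table can be recomputed in one pass. This is a minimal sketch that hard-codes the term-document matrix of the example and combines the normalized TF with the base-2 IDF; all names are illustrative.

```python
import math

# Term-document matrix of the example: term -> counts in Doc1, Doc2, Doc3.
tdm = {
    "the":     [2, 2, 2],
    "fox":     [1, 0, 1],
    "rabbit":  [1, 1, 1],
    "chases":  [1, 0, 0],
    "caught":  [0, 0, 1],
    "cabbage": [0, 1, 0],
    "ate":     [0, 1, 0],
}
n_docs = 3

# Total number of word occurrences per document (5 for every document here).
doc_len = [sum(row[j] for row in tdm.values()) for j in range(n_docs)]

def normalized_tf(tf, n_words):
    # (1 + log2(TF)) / n_i, and 0 for absent terms.
    return (1 + math.log2(tf)) / n_words if tf > 0 else 0.0

tfidf = {}
for term, row in tdm.items():
    df = sum(1 for c in row if c > 0)
    idf = math.log2(n_docs / df)
    tfidf[term] = [normalized_tf(tf, doc_len[j]) * idf for j, tf in enumerate(row)]

for term, row in tfidf.items():
    print(f"{term:8s}", [round(v, 5) for v in row])
```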

**◆ Disadvantages of TF-IDF**
- Two texts with similar meanings (topics) but different words are not close to each other in the TF-IDF vector space. Likewise, words with the same meaning but different spellings produce TF-IDF vectors that do not lie close together.

**The TF-IDF method is therefore difficult to use for finding documents that are similar in meaning (topic).**
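
The effect can be seen with cosine similarity. To keep the sketch short it uses raw word counts instead of full TF-IDF weights (an assumption; TF-IDF only rescales the nonzero entries, so texts without shared words still get a similarity of 0). The three texts are invented for this example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical texts: a and b share a meaning but no words; a and c share words but differ in meaning.
text_a = Counter("car drives fast".split())
text_b = Counter("automobile moves quickly".split())
text_c = Counter("car drives slowly".split())

sim_same_meaning = cosine(text_a, text_b)
sim_shared_words = cosine(text_a, text_c)
print(sim_same_meaning, sim_shared_words)
```

The pair with the same meaning scores 0, while the pair with the opposite meaning scores higher, because only exact word overlap is measured.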

#### Co-Occurrence Matrix

**◆ Co-occurrence matrix**

A method of directly counting the number of times words appear together in a particular context. The co-occurrence counts are arranged in a matrix, and the matrix is turned into numerical word vectors.

```
Ex) 
Myeong-seok and Jun-seon went to America
Myeong-seok and Sang-ho went to the library
Myeong-seok and Jun-seon like cold noodles
```

< Co-occurrence matrix: word-word matrix > → square, symmetric matrix

<center>

![wordvector](./images/wordvector.png)

</center>

→ Can be used for social network analysis (e.g., calculating the centrality of each word: degree, closeness, betweenness, eigenvector)

- TDM: Term-Document Matrix (rows are terms)

- DTM: Document-Term Matrix (rows are documents)
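
A word-word co-occurrence matrix for the three example sentences can be sketched as follows. It assumes that the context is the whole sentence, that every word is kept, and that each pair is counted once per sentence, so the counts may differ from the figure if it uses another window or drops function words.

```python
from collections import defaultdict
from itertools import combinations

sentences = [
    "Myeong-seok and Jun-seon went to America",
    "Myeong-seok and Sang-ho went to the library",
    "Myeong-seok and Jun-seon like cold noodles",
]

# cooc[(a, b)] = number of sentences in which words a and b appear together.
cooc = defaultdict(int)
for sentence in sentences:
    words = sorted(set(sentence.split()))
    for a, b in combinations(words, 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1  # keep the matrix symmetric

print(cooc[("Myeong-seok", "Jun-seon")])
print(cooc[("Jun-seon", "Sang-ho")])
```

Each row of this matrix (all counts for one word) can serve as that word's vector, and the matrix itself can be read as the adjacency matrix of a word network.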

#### Word Embedding

**◆ Word embedding**
- A one-hot vector is a sparse representation with many 0s and only a single 1.
- In contrast to a sparse representation, a dense representation has a vector size set by the user (smaller than the size of the vocabulary) rather than by the size of the vocabulary, and its entries are real values rather than just 0 and 1.
- The method of expressing words as dense vectors is called word embedding, and the result obtained in this way is called an embedding vector.
- Examples of word embedding methods include LSA, word2vec, FastText, and GloVe.
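
The relation between the sparse and the dense representation can be sketched as a table lookup. The embedding values below are random placeholders, not trained vectors, and the dimension `dim = 2` is an arbitrary choice for illustration.

```python
import random

random.seed(0)

vocab = ["Thomas", "Jepperson", "made", "building"]
dim = 2  # user-chosen embedding size, smaller than the vocabulary size

# Embedding matrix: one dense row of real values per word (random, untrained placeholders).
embedding = [[random.uniform(-1, 1) for _ in range(dim)] for _ in vocab]

def one_hot(i, size):
    return [1 if j == i else 0 for j in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix."""
    return [sum(vec[r] * mat[r][c] for r in range(len(mat))) for c in range(len(mat[0]))]

i = vocab.index("made")
sparse = one_hot(i, len(vocab))   # length 4, a single 1
dense = vec_mat(sparse, embedding)  # length 2, real values

# Multiplying by a one-hot vector just selects row i of the embedding matrix.
print(sparse, dense, dense == embedding[i])
```

In practice the rows of the embedding matrix are learned by a method such as word2vec rather than drawn at random.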

**◆ Distributed representation**
- **Local representation** is a method of expressing a word by looking only at the word itself and mapping it to a specific value.
- Distributed representation, on the other hand, is based on the distributional hypothesis: the assumption that words appearing in similar contexts have similar meanings. Vectorizing this similarity between words is the task of word embedding.
- Distributed representation methods represent a word by referring to its neighboring words.
- For example, since the words cute and lovely often appear near the word puppy, the word puppy is characterized as cute and lovely.
- word2vec is one example of a distributed representation technique.
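
The distributional hypothesis can be illustrated by describing each word through its neighbors. This is a rough count-based sketch, not word2vec; the toy phrases and the choice of the whole phrase as context are assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; each phrase is treated as one context.
corpus = ["cute puppy", "lovely puppy", "cute kitten", "lovely kitten", "fast car", "red car"]

# Represent every word by the counts of the words that appear next to it.
context = defaultdict(Counter)
for phrase in corpus:
    words = phrase.split()
    for i, word in enumerate(words):
        for j, neighbor in enumerate(words):
            if i != j:
                context[word][neighbor] += 1

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sim_puppy_kitten = cosine(context["puppy"], context["kitten"])
sim_puppy_car = cosine(context["puppy"], context["car"])
print(dict(context["puppy"]), sim_puppy_kitten, sim_puppy_car)
```

"puppy" and "kitten" never occur together, yet they come out as similar because they share the neighbors "cute" and "lovely".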

#### Topic Vector

**◆ Topic vector (semantic vector)**
- A lower-dimensional vector whose components are topic scores, obtained by dimensionality reduction from the weighted frequencies of TF-IDF vectors.
- Words of the same topic are grouped together using correlations between normalized term frequencies.
- Used for semantic search, which retrieves documents based on their meaning → generally known to be more accurate than keyword-based search.
- Can find the set of keywords that best summarizes the meaning of a given document.
- There are (1) word topic vectors, which represent the meaning of words, and (2) document topic vectors, which represent the meaning of documents.

▪ Word topic vectors: each word is represented as a topic vector made of 3 topic scores, {pet}, {animal}, {city}

<center>

![topicvector.png](./images/TopicVector.png)

</center>
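
One way to picture the two kinds of topic vectors is as a weighted sum over TF-IDF scores. Everything in the sketch below is hypothetical: the 6-word vocabulary, the TF-IDF values, and the hand-picked weights are invented for illustration and are not taken from the figure. In practice the weights are usually learned, for example with LSA.

```python
# Hypothetical TF-IDF scores of one document over a 6-word vocabulary.
tfidf = {"cat": 0.3, "dog": 0.2, "apple": 0.0, "lion": 0.1, "NYC": 0.0, "love": 0.2}

# Hypothetical hand-picked weights: how strongly each word belongs to each topic.
topic_weights = {
    "pet":    {"cat": 0.3, "dog": 0.3,  "apple": 0.0,  "lion": -0.5, "NYC": 0.1, "love": 0.2},
    "animal": {"cat": 0.1, "dog": 0.1,  "apple": -0.1, "lion": 0.5,  "NYC": 0.1, "love": -0.1},
    "city":   {"cat": 0.0, "dog": -0.1, "apple": 0.2,  "lion": -0.1, "NYC": 0.5, "love": 0.1},
}

# Document topic vector: 6 TF-IDF dimensions reduced to 3 topic scores.
doc_topic_vector = {
    topic: sum(weights[word] * tfidf[word] for word in tfidf)
    for topic, weights in topic_weights.items()
}

# Word topic vector: the scores of a single word on the 3 topics.
def word_topic_vector(word):
    return {topic: weights[word] for topic, weights in topic_weights.items()}

print(doc_topic_vector)
print(word_topic_vector("cat"))
```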