# What is feature extraction from text?

Feature extraction means finding patterns in textual data: we convert text into numerical representations so that a model can take it as input and be trained on it.

# Why do we need it?

Whatever model is being used, it is just a mathematical abstraction that stacks functions and evaluates formulas, so we need to transform text into numbers or vectors before the model can perform its computation.

# Why is it so difficult?

Encoding text is challenging for several reasons: the number of tokens and how it maps to the model's input size, how to determine sentence length, how to build the vocabulary, and so on. We also have to decide how to convert each sentence into meaningful numbers and how to establish relationships between the sentences in the data.

A clear example is One-Hot Encoding, a common transformation for tabular data. Applying it to text by creating one column per word leads to a dimensionality problem: the number of columns grows with the vocabulary, most of the values are 0, and processing becomes computationally expensive.
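To make the dimensionality problem concrete, here is a minimal sketch in plain Python. The toy corpus and the helper name `one_hot` are made up for illustration; a real vocabulary would have tens of thousands of columns rather than a handful.

```python
# Toy corpus (hypothetical, for illustration only).
corpus = [
    "I love programming in Python",
    "Python and Java are popular programming languages",
]

# Vocabulary: every distinct lowercase word gets its own column.
vocabulary = sorted({word for text in corpus for word in text.lower().split()})
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector with a single 1 at the word's column."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

# One row per token of the first sentence, one column per vocabulary word.
first_sentence = [one_hot(w) for w in corpus[0].lower().split()]

total_cells = len(first_sentence) * len(vocabulary)
zero_cells = sum(row.count(0) for row in first_sentence)
print(f"{len(first_sentence)} tokens x {len(vocabulary)} columns, "
      f"{zero_cells / total_cells:.0%} zeros")
```

Every row contains exactly one 1, so the share of zeros approaches 100% as the vocabulary grows.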

# Bag of Words

A representation based on word frequency. The BoW is a matrix in which each row represents a document, each column represents a word from the vocabulary, and each value is the frequency of that word in the respective document, as in the example below.

documents = [
    "I love programming in Python",
    "Python and Java are popular programming languages",
    "I do not love Java"
]

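Feature Names: ['and' 'are' 'do' 'in' 'java' 'languages' 'love' 'not' 'popular' 'programming' 'python']

BoW Matrix:

 [[0 0 0 1 0 0 1 0 0 1 1]

  [1 1 0 0 1 1 0 0 1 1 1]

  [0 0 1 0 1 0 1 1 0 0 0]]

This output assumes scikit-learn's `CountVectorizer` defaults, which lowercase the text and drop single-character tokens such as "I".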

This kind of approach is suitable for some classification tasks, such as sentiment analysis. Its drawback is that it does not capture semantic information.
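The loss of semantic information can be seen in a small sketch. The helper `bag_of_words` and the two sentences are hypothetical; the sketch counts words with the standard library instead of using a vectorizer.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical sentences with the same words but opposite meanings.
a = "the dog bites the man"
b = "the man bites the dog"

vocabulary = sorted(set(a.split()) | set(b.split()))
vec_a = bag_of_words(a, vocabulary)
vec_b = bag_of_words(b, vocabulary)

print(vocabulary)
print(vec_a)
print(vec_b)
print("identical:", vec_a == vec_b)
```

Both sentences map to the same vector because only the counts survive; word order is discarded.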

In [1]:
import numpy as np
import pandas as pd

In [3]:
df = pd.DataFrame({"text":["people watch lineker",
                         "lineker watch lineker",
                         "people write comment",
                          "lineker write comment"],"output":[1,1,0,0]})

df

|   | text | output |
|---|------|--------|
| 0 | people watch lineker | 1 |
| 1 | lineker watch lineker | 1 |
| 2 | people write comment | 0 |
| 3 | lineker write comment | 0 |


In [7]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

In [10]:
bow = cv.fit_transform(df['text'])
print(cv.vocabulary_)

{'people': 2, 'watch': 3, 'lineker': 1, 'write': 4, 'comment': 0}


In [11]:
bow.toarray()

array([[0, 1, 1, 1, 0],
       [0, 2, 0, 1, 0],
       [1, 0, 1, 0, 1],
       [1, 1, 0, 0, 1]])

In [12]:
print(bow[0].toarray())
print(bow[1].toarray())
print(bow[2].toarray())

[[0 1 1 1 0]]
[[0 2 0 1 0]]
[[1 0 1 0 1]]


In [14]:
cv.transform(['Aguiar watch Lineker']).toarray()

array([[0, 1, 0, 1, 0]])

## N-Grams

In [16]:
df = pd.DataFrame({"text":["people watch Lineker",
                         "Lineker watch Lineker",
                         "people write comment",
                          "Lineker write comment"],"output":[1,1,0,0]})

df

|   | text | output |
|---|------|--------|
| 0 | people watch Lineker | 1 |
| 1 | Lineker watch Lineker | 1 |
| 2 | people write comment | 0 |
| 3 | Lineker write comment | 0 |


- Bigrams

In [17]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2, 2))

In [18]:
bow = cv.fit_transform(df['text'])

In [19]:
print(cv.vocabulary_)

{'people watch': 2, 'watch lineker': 4, 'lineker watch': 0, 'people write': 3, 'write comment': 5, 'lineker write': 1}
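A bigram is a pair of adjacent tokens, so `ngram_range=(2, 2)` keeps only two-word sequences. The sketch below rebuilds the same vocabulary by hand. The helper `bigrams` is a name chosen here, and it lowercases and splits on whitespace, which matches `CountVectorizer` only for simple text like this.

```python
def bigrams(text):
    """Return the adjacent word pairs of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

texts = [
    "people watch Lineker",
    "Lineker watch Lineker",
    "people write comment",
    "Lineker write comment",
]

# Column indices follow the alphabetical order of the distinct bigrams.
distinct = sorted({gram for text in texts for gram in bigrams(text)})
vocabulary = {gram: i for i, gram in enumerate(distinct)}
print(vocabulary)
```

Bigrams keep some local word order, at the cost of a larger vocabulary. Using `ngram_range=(1, 2)` would keep the single words as well as the pairs.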


# TF-IDF

TF-IDF is a numerical statistic used to reflect the importance of a word in a document relative to a collection (corpus) of documents. Unlike the Bag of Words (BoW) model, which only considers raw word counts, TF-IDF adjusts for the fact that some words appear more frequently across all documents and thus may not be as informative.
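One common textbook formulation is given here as a reference point. Libraries differ in the details, for example in how they smooth the IDF term and whether they normalize the resulting vectors.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

Here $\mathrm{tf}(t, d)$ is the number of times term $t$ occurs in document $d$, $N$ is the number of documents in the corpus, and $\mathrm{df}(t)$ is the number of documents that contain $t$.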

It has both pros and cons:

- Pros

    1. Reduces the impact of common words – Unlike BoW, it gives lower weights to words that occur in most documents, such as "the" and "is".
    2. Keeps important words relevant – Words that appear frequently in a single document but not in others get a higher weight.
    3. Simple and effective – Works well for many text-based applications like search engines and document ranking.
    4. No need for labeled data – TF-IDF can be applied in an unsupervised manner.

- Cons

    1. Ignores word order and meaning – Doesn't capture semantics.
    2. Sparsity – Results in high-dimensional sparse matrices, which can be inefficient for large corpora.
    3. Static representation – The vocabulary and IDF weights are fixed when the vectorizer is fitted, so words that first appear in new data are ignored until it is refitted.
    4. Sensitivity to document length – Raw term counts grow with document length, so long documents can receive inflated scores unless the vectors are normalized.

It is a good choice for information retrieval and for feature extraction in classification tasks.
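As a rough illustration of the retrieval use case, the sketch below ranks documents against a query by cosine similarity of their TF-IDF vectors. The query string is hypothetical, and this is one simple way to do it, not a full search engine.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "people watch Lineker",
    "Lineker watch Lineker",
    "people write comment",
    "Lineker write comment",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)    # one TF-IDF row per document

query = "people watch"                          # hypothetical search query
query_vector = vectorizer.transform([query])    # reuse the fitted vocabulary

# Cosine similarity between the query and every document.
scores = cosine_similarity(query_vector, doc_vectors)[0]
best = int(scores.argmax())
print(best, docs[best])
```

The first document should rank highest here, since it is the only one that contains both query words.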

In [20]:
df = pd.DataFrame({"text":["people watch Lineker",
                         "Lineker watch Lineker",
                         "people write comment",
                          "Lineker write comment"],"output":[1,1,0,0]})

df

|   | text | output |
|---|------|--------|
| 0 | people watch Lineker | 1 |
| 1 | Lineker watch Lineker | 1 |
| 2 | people write comment | 0 |
| 3 | Lineker write comment | 0 |


In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfid = TfidfVectorizer()

In [23]:
arr = tfid.fit_transform(df['text']).toarray()

In [24]:
arr

array([[0.        , 0.49681612, 0.61366674, 0.61366674, 0.        ],
       [0.        , 0.8508161 , 0.        , 0.52546357, 0.        ],
       [0.57735027, 0.        , 0.57735027, 0.        , 0.57735027],
       [0.61366674, 0.49681612, 0.        , 0.        , 0.61366674]])
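These numbers can be reproduced by hand. The sketch below assumes the `TfidfVectorizer` defaults: raw counts as term frequency, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The helper name `tfidf_row` is chosen here for illustration.

```python
import math
from collections import Counter

texts = [
    "people watch Lineker",
    "Lineker watch Lineker",
    "people write comment",
    "Lineker write comment",
]
docs = [text.lower().split() for text in texts]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears.
df = {word: sum(word in doc for doc in docs) for word in vocabulary}

# Smoothed IDF, assumed to match TfidfVectorizer's default (smooth_idf=True).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_row(doc):
    """Raw count times IDF, then scaled to unit Euclidean length."""
    counts = Counter(doc)
    weights = [counts[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

for doc in docs:
    print([round(value, 8) for value in tfidf_row(doc)])
```

In the first row, "lineker" gets a lower weight than "people" and "watch" because it appears in three of the four documents, while the other two appear in only two.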

# Word2Vec