Feature Engineering: converting text into numbers, also called text representation or text vectorization.

## Important Terms

1. Corpus: all the text in the dataset, i.e. the words of every document combined
2. Vocabulary: the unique words in the corpus
3. Document: each record (row) of the dataset
4. Word: each individual token in a document
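
To make these terms concrete, here is a minimal sketch that uses the same four toy sentences as the rest of this notebook. The lowercasing and whitespace splitting are simplifying assumptions, not the exact tokenization scikit-learn applies.

```python
# Each string is one document (one row of the dataset).
documents = [
    "people watch Youtube",
    "Youtube watch Youtube",
    "people write comment",
    "Youtube write comment",
]

# Corpus: every word from every document, duplicates included.
corpus = [word.lower() for doc in documents for word in doc.split()]

# Vocabulary: the unique words in the corpus.
vocabulary = sorted(set(corpus))

print(len(documents))  # 4 documents
print(len(corpus))     # 12 words in the corpus
print(vocabulary)      # ['comment', 'people', 'watch', 'write', 'youtube']
```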

## Techniques

1. One-hot encoding
2. Bag of words
3. N-grams
4. TF-IDF
5. Custom features
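
One-hot encoding is listed here but not demonstrated later in the notebook, so the following is a rough sketch of the idea: every word becomes a vector with a single 1 at that word's vocabulary index. The helper name `one_hot_encode` and the whitespace tokenization are assumptions made for illustration.

```python
documents = ["people watch Youtube", "Youtube watch Youtube"]

# Build a word -> index mapping from the sorted vocabulary.
tokens = [doc.lower().split() for doc in documents]
vocabulary = sorted({word for doc in tokens for word in doc})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot_encode(words):
    """Return one row per word, with a single 1 at that word's vocabulary index."""
    return [
        [1 if i == word_to_index[word] else 0 for i in range(len(vocabulary))]
        for word in words
    ]

one_hot = one_hot_encode(tokens[0])
print(vocabulary)  # ['people', 'watch', 'youtube']
print(one_hot)     # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each document becomes a matrix of shape (words in document, vocabulary size), so documents of different lengths produce inputs of different shapes.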

# Bag of words

In [None]:
# Bag of words

import numpy as np
import pandas as pd

In [None]:
data = pd.DataFrame({'text':['people watch Youtube','Youtube watch Youtube','people write comment','Youtube write comment'],'output':[1,1,0,0]})

In [None]:
data

| | text | output |
|---|---|---|
| 0 | people watch Youtube | 1 |
| 1 | Youtube watch Youtube | 1 |
| 2 | people write comment | 0 |
| 3 | Youtube write comment | 0 |


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()

bag_of_word = cv.fit_transform(data['text'])

In [None]:
cv.vocabulary_

{'people': 1, 'watch': 2, 'youtube': 4, 'write': 3, 'comment': 0}

In [None]:
bag_of_word[0].toarray()

array([[0, 1, 1, 0, 1]])

In [None]:
bag_of_word[1].toarray()

array([[0, 0, 1, 0, 2]])

In [None]:
cv.transform(['people watch and write comment']).toarray()

array([[1, 1, 1, 1, 0]])

In [None]:
cv.transform(['people watch and write comment on YouTube']).toarray()


array([[1, 1, 1, 1, 1]])

CountVectorizer documentation: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
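
The `transform` calls above silently drop "and" and "on" because those words were never seen during `fit`. A hand-rolled sketch of the same counting logic makes this visible. It assumes simple lowercasing and whitespace splitting, whereas `CountVectorizer` by default uses a regular expression that also strips punctuation and single-character tokens; the function name `bag_of_words` is hypothetical.

```python
from collections import Counter

# Vocabulary as learned by CountVectorizer in this section (word -> column index).
vocabulary = {"comment": 0, "people": 1, "watch": 2, "write": 3, "youtube": 4}

def bag_of_words(text, vocabulary):
    """Count vocabulary words in text; words outside the vocabulary are skipped."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocabulary)
    for word, index in vocabulary.items():
        vector[index] = counts[word]  # Counter returns 0 for unseen words
    return vector

print(bag_of_words("Youtube watch Youtube", vocabulary))
# [0, 0, 1, 0, 2]
print(bag_of_words("people watch and write comment on YouTube", vocabulary))
# [1, 1, 1, 1, 1]
```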

# N-grams

1. Bigrams (n = 2)
2. Trigrams (n = 3)
3. N-grams (any n)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(ngram_range=(2,2))  # (min_n, max_n) = (2, 2): bigrams only

bag_of_word = cv.fit_transform(data['text'])

In [None]:
cv.vocabulary_

{'people watch': 0,
 'watch youtube': 2,
 'youtube watch': 4,
 'people write': 1,
 'write comment': 3,
 'youtube write': 5}

In [None]:
bag_of_word[0].toarray()

array([[1, 0, 1, 0, 0, 0]])

In [None]:
cv = CountVectorizer(ngram_range=(1,3))  # unigrams, bigrams and trigrams together

bag_of_word = cv.fit_transform(data['text'])

In [None]:
cv.vocabulary_

{'people': 1,
 'watch': 6,
 'youtube': 10,
 'people watch': 2,
 'watch youtube': 7,
 'people watch youtube': 3,
 'youtube watch': 11,
 'youtube watch youtube': 12,
 'write': 8,
 'comment': 0,
 'people write': 4,
 'write comment': 9,
 'people write comment': 5,
 'youtube write': 13,
 'youtube write comment': 14}
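
To see where these vocabulary entries come from, the sketch below generates word n-grams by sliding a window of each size over the tokens. The function name `word_ngrams` is hypothetical and the whitespace tokenization is a simplification of what `CountVectorizer` does internally.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams of text for every n in [n_min, n_max]."""
    tokens = text.lower().split()
    ngrams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the sentence.
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

bigrams = word_ngrams("people watch Youtube", 2, 2)
all_ngrams = word_ngrams("people watch Youtube", 1, 3)

print(bigrams)
# ['people watch', 'watch youtube']
print(all_ngrams)
# ['people', 'watch', 'youtube', 'people watch', 'watch youtube', 'people watch youtube']
```

A three-word document yields 3 + 2 + 1 = 6 n-grams for the range (1, 3), which is why the vocabulary grows from 5 entries to 15 across the four documents.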

# TF-IDF

## Formula

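1. $\mathrm{tf}(t, d) = \dfrac{\text{number of occurrences of term } t \text{ in document } d}{\text{total number of terms in document } d}$

2. $\mathrm{idf}(t) = \log\left(\dfrac{\text{total number of documents in the corpus}}{\text{number of documents containing term } t}\right)$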
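
The two formulas can be sketched directly in plain Python, with the TF-IDF weight of a term taken as the product tf × idf. This is the textbook version with a natural logarithm; the helper names `tf_score` and `idf_score` are hypothetical, and the numbers it produces are not expected to match scikit-learn's output, which applies smoothing and normalization on top.

```python
import math

documents = [
    "people watch Youtube",
    "Youtube watch Youtube",
    "people write comment",
    "Youtube write comment",
]
tokenized = [doc.lower().split() for doc in documents]

def tf_score(term, doc_tokens):
    """Occurrences of term in the document divided by the document length."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf_score(term, all_docs):
    """Log of (number of documents / number of documents containing term).

    Assumes the term occurs in at least one document; otherwise this divides by zero.
    """
    containing = sum(1 for doc_tokens in all_docs if term in doc_tokens)
    return math.log(len(all_docs) / containing)

# 'youtube' appears twice in the 3-word second document and in 3 of 4 documents.
tf_youtube = tf_score("youtube", tokenized[1])   # 2 / 3
idf_youtube = idf_score("youtube", tokenized)    # log(4 / 3)
print(tf_youtube * idf_youtube)
```

A word that appears in many documents, such as "youtube" here, gets a lower idf than a rarer word such as "comment".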

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf = TfidfVectorizer()


In [None]:
tf.fit_transform(data["text"]).toarray()

array([[0.        , 0.61366674, 0.61366674, 0.        , 0.49681612],
       [0.        , 0.        , 0.52546357, 0.        , 0.8508161 ],
       [0.57735027, 0.57735027, 0.        , 0.57735027, 0.        ],
       [0.61366674, 0.        , 0.        , 0.61366674, 0.49681612]])

In [None]:
print(tf.idf_)


[1.51082562 1.51082562 1.51082562 1.51082562 1.22314355]


In [None]:
tf.get_feature_names_out()

array(['comment', 'people', 'watch', 'write', 'youtube'], dtype=object)
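
The `tf.idf_` values above do not equal log(N / df) from the textbook formula. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` computes idf(t) = ln((1 + N) / (1 + df(t))) + 1, multiplies it by the raw term count, and then scales each row to unit Euclidean length. The sketch below reproduces the printed numbers by hand under those defaults; the document frequencies are read off the toy dataset.

```python
import math

n_documents = 4
# Number of documents containing each term, in the column order shown above.
document_frequency = {"comment": 2, "people": 2, "watch": 2, "write": 2, "youtube": 3}

# Smoothed idf, as used by scikit-learn's default settings.
smoothed_idf = {
    term: math.log((1 + n_documents) / (1 + df)) + 1
    for term, df in document_frequency.items()
}

# First document, 'people watch Youtube': raw counts per column.
counts = {"comment": 0, "people": 1, "watch": 1, "write": 0, "youtube": 1}
weights = [counts[term] * smoothed_idf[term] for term in document_frequency]

# L2 normalization: divide the row by its Euclidean length.
norm = math.sqrt(sum(w * w for w in weights))
first_row = [w / norm for w in weights]

print([round(v, 8) for v in smoothed_idf.values()])
print([round(v, 8) for v in first_row])
```

The two printed lists should agree with `tf.idf_` and with the first row of the TF-IDF matrix, up to rounding.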