In [8]:

"""
Machine learning algorithms can't work on raw text directly.

We first apply feature extraction to the raw text so that we can pass
numerical features to the machine learning algorithm.

For example, we could count the occurrences of each word to map
each text to a vector of numbers.

Two common approaches:

Count Vectorization

Term Frequency - Inverse Document Frequency (TF-IDF)


"""

# Count Vectorization

In [9]:
msg = ["Hey, lets go to the game today!",
       "Call your mom.",
       "Want to go walk your dogs?"]
msg

['Hey, lets go to the game today!',
 'Call your mom.',
 'Want to go walk your dogs?']

In [10]:
from sklearn.feature_extraction.text import CountVectorizer

# CountVectorizer counts the occurrences of all the unique words.
vector = CountVectorizer()
vector


In [11]:
# It will treat each individual unique word as a feature.

# fit the vectorizer to our data

vector.fit(msg)
vector.get_feature_names_out()


array(['call', 'dogs', 'game', 'go', 'hey', 'lets', 'mom', 'the', 'to',
       'today', 'walk', 'want', 'your'], dtype=object)
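
To see where this feature list comes from, here is a minimal pure-Python sketch of the tokenization step. It assumes CountVectorizer's default settings (lowercasing, and a token pattern that keeps runs of two or more word characters); the names `tokenize` and `vocabulary` are made up for illustration and are not part of scikit-learn.

```python
import re

msg = ["Hey, lets go to the game today!",
       "Call your mom.",
       "Want to go walk your dogs?"]

# Assumption: this mirrors CountVectorizer's default behaviour, i.e.
# lowercase the text and keep tokens of two or more word characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase a document and split it into word tokens."""
    return token_pattern.findall(doc.lower())

# The vocabulary is the sorted set of unique tokens across all documents.
vocabulary = sorted({token for doc in msg for token in tokenize(doc)})
print(vocabulary)
```

Punctuation is dropped and case is ignored, which is why "Hey," and "today!" appear as 'hey' and 'today' in the feature names.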

In [12]:
# Document Term Matrix (DTM)

# The vectorizer counts the occurrences of each unique feature (word)
# in every single document.

# Here, "document" is just another word for an individual piece of text;
# in our case, each text message is one document.

# Each row of the matrix is a document and each column is a term.

# Imagine a very large set of documents, known as a corpus:
# we are going to have a very sparse matrix, i.e. a matrix with a lot of
# zeros. It is called the document term matrix, or DTM.



# TfidfVectorizer

In [13]:
"""

An alternative to CountVectorizer is TfidfVectorizer.
It also creates a document term matrix from our messages.

However, instead of filling the DTM with token counts, it calculates the
term frequency - inverse document frequency (TF-IDF) value for each word.


Term frequency tf(t,d) :-
It is the raw count of a term in a document, i.e. the number of times
that term t occurs in document d.

However, term frequency alone is not enough for a thorough feature
analysis of the text!

Let's imagine very common terms, like "a" or "the"...

Because the term "the" is so common, term frequency will tend to
incorrectly emphasize documents which happen to use the word "the" more
frequently, without giving enough weight to more meaningful terms
such as "red" and "dogs".

Inverse document frequency idf(t) :-
An inverse document frequency factor is incorporated which diminishes
the weight of terms that occur very frequently in the document set and
increases the weight of terms that occur rarely.

It is the logarithmically scaled inverse fraction of the documents that
contain the word (obtained by dividing the total number of documents N by
the number of documents containing the term, df(t), and then taking the
logarithm of that quotient):

idf(t) = log(N / df(t))

TF-IDF = term frequency * inverse document frequency

tf-idf(t,d) = tf(t,d) * idf(t)

Note: by default, scikit-learn's TfidfVectorizer uses a smoothed variant
of this idf formula and normalizes each document vector to unit length.

"""





In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
vector = TfidfVectorizer()
vector.fit(msg)

In [16]:
vector.get_feature_names_out()

array(['call', 'dogs', 'game', 'go', 'hey', 'lets', 'mom', 'the', 'to',
       'today', 'walk', 'want', 'your'], dtype=object)

In [26]:
dtm = vector.fit_transform(msg)

# Transposed, so each entry prints as (term index, document index) weight.
print(dtm.transpose())

  (4, 0)	0.40301621080355077
  (5, 0)	0.40301621080355077
  (3, 0)	0.3065042162415877
  (8, 0)	0.3065042162415877
  (7, 0)	0.40301621080355077
  (2, 0)	0.40301621080355077
  (9, 0)	0.40301621080355077
  (0, 1)	0.6227660078332259
  (12, 1)	0.4736296010332684
  (6, 1)	0.6227660078332259
  (3, 2)	0.3494981241087058
  (8, 2)	0.3494981241087058
  (12, 2)	0.3494981241087058
  (11, 2)	0.45954803293870056
  (10, 2)	0.45954803293870056
  (1, 2)	0.45954803293870056
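
The weights above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings: raw counts for tf, the smoothed idf `ln((1 + N) / (1 + df)) + 1`, and L2 normalization of each document vector. The helpers `tokenize`, `smooth_idf` and `tfidf_row` are illustrative names, not scikit-learn functions.

```python
import math
import re
from collections import Counter

msg = ["Hey, lets go to the game today!",
       "Call your mom.",
       "Want to go walk your dogs?"]

def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"(?u)\b\w\w+\b", doc.lower())

# Term counts per document: tf(t, d).
docs = [Counter(tokenize(d)) for d in msg]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for counts in docs for term in counts)

def smooth_idf(term):
    """Smoothed idf, assumed to match TfidfVectorizer's defaults."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(counts):
    """tf * idf for one document, scaled to unit Euclidean length."""
    raw = {t: tf * smooth_idf(t) for t, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# Weights for the second message, "Call your mom."
weights = tfidf_row(docs[1])
for term in sorted(weights):
    print(term, round(weights[term], 4))
```

Under these assumptions, 'call' and 'mom' come out near 0.6228 and 'your' near 0.4736, matching the entries for document 1 in the output above. 'your' is lower because it also appears in the third message.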


In [27]:
# TF-IDF lets us weigh how important a word is to a document relative to
# the entire corpus, instead of relying only on its raw count within a
# single document.

