# Bag Of Words

#### Here we don't have numerical values; we have text that we want to transform into numeric values
* Bag of words - takes a set of sentences, finds the unique words in them, and turns each sentence into a vector over those words (a basic approach)<br>
  <b>'This is a book, This book is great' = BoW('This is a book great')</b>

In [1]:
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
class category:
    Mobile  = 'Mobile'
    Laptop = 'Laptop'
    
X = ['This is a mobile', 'I like this mobile', 'This is a great laptop', 'This is my favourite laptop']
y = [category.Mobile, category.Mobile, category.Laptop, category.Laptop]

In [3]:
cv = CountVectorizer(binary=True)
X_train = cv.fit_transform(X)

In [4]:
print(cv.get_feature_names_out())
print(X_train.toarray())

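# The columns follow the alphabetically sorted vocabulary printed on the first line.
# With binary=True each entry is 1 if the word occurs in the sentence and 0 otherwise
# (without binary=True it would be the count, e.g. 2 for a word that occurs twice).
# One-letter words such as 'a' and 'I' are dropped by the default tokenizer.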

['favourite' 'great' 'is' 'laptop' 'like' 'mobile' 'my' 'this']
[[0 0 1 0 0 1 0 1]
 [0 0 0 0 1 1 0 1]
 [0 1 1 1 0 0 0 1]
 [1 0 1 1 0 0 1 1]]
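
The matrix above can be reproduced by hand. The following is a minimal plain-Python sketch, not scikit-learn's implementation. It assumes the default `CountVectorizer` behaviour of lowercasing the text and keeping only tokens of two or more word characters; `tokenize` and `binary_bow` are illustrative names that do not come from the notebook.

```python
import re

def tokenize(text):
    # Lowercase, then keep tokens of 2+ word characters (so 'a' and 'I' are dropped)
    return re.findall(r"\b\w\w+\b", text.lower())

def binary_bow(sentences):
    # Vocabulary: sorted unique tokens across all sentences
    vocab = sorted({tok for s in sentences for tok in tokenize(s)})
    rows = []
    for s in sentences:
        present = set(tokenize(s))
        # 1 if the vocabulary word occurs in this sentence, else 0
        rows.append([1 if word in present else 0 for word in vocab])
    return vocab, rows

X = ['This is a mobile', 'I like this mobile',
     'This is a great laptop', 'This is my favourite laptop']
vocab, rows = binary_bow(X)
print(vocab)
for row in rows:
    print(row)
```

Each row has one slot per vocabulary word, so sentence length no longer matters; only which words are present does.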


In [5]:
from sklearn import svm
clf = svm.SVC(kernel='linear')
clf.fit(X_train, y)

In [6]:
X_test = cv.transform(['Gaming mobiles are cool'])
clf.predict(X_test)

array(['Mobile'], dtype='<U6')

#### There are many things we can add to this, but this is just a basic example of how bag of words works
#### The more sentences we feed in and the more cleaning we do, the better the model tends to be
#### The example above uses UNIGRAMS, which take 1 word at a time; BIGRAMS take 2 consecutive words at a time
UNIGRAM - 'i like this' --> 'i', 'like', 'this', BIGRAM - 'i like this' --> 'i like', 'like this'
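
As a small sketch of the idea, n-grams can be generated by sliding a window of size n over the token list. The helper name `ngrams` is an assumption for illustration and is not part of scikit-learn.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window into one feature
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = 'i like this'.split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(unigrams)  # ['i', 'like', 'this']
print(bigrams)   # ['i like', 'like this']
```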

In [7]:
# ngram_range=(1, 2) keeps both unigrams and bigrams as features
cv_with_ngram = CountVectorizer(binary=True, ngram_range=(1, 2))
X_train_ngram = cv_with_ngram.fit_transform(X)

In [8]:
cv_with_ngram.get_feature_names_out()

array(['favourite', 'favourite laptop', 'great', 'great laptop', 'is',
       'is great', 'is mobile', 'is my', 'laptop', 'like', 'like this',
       'mobile', 'my', 'my favourite', 'this', 'this is', 'this mobile'],
      dtype=object)

In [9]:
X_train_ngram.toarray()

array([[0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0]])

#### BIGRAMS can be useful when we want to capture the tone of a sentence
#### EX : sentence = 'this is not cool' || Unigram : 'this', 'is', 'not', 'cool', Bigram : 'this is', 'is not', 'not cool' || The bigram 'not cool' keeps the negation attached to 'cool', which changes the tone of the sentence

#### Bag of Words works well when the test data contains words the model was trained on, but if we pass in a word that was not seen during training, that word is simply ignored and the prediction may be wrong

In [10]:
X_test = cv.transform(['I like 15 inch laptops'])
clf.predict(X_test)

array(['Mobile'], dtype='<U6')
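
One way to see why this prediction comes out as 'Mobile' is to check which test words are in the training vocabulary. This is a plain-Python sketch that assumes the same tokenization as the default `CountVectorizer` (lowercase, tokens of two or more word characters); `tokenize` is an illustrative helper.

```python
import re

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters
    return re.findall(r"\b\w\w+\b", text.lower())

train = ['This is a mobile', 'I like this mobile',
         'This is a great laptop', 'This is my favourite laptop']
vocab = {tok for s in train for tok in tokenize(s)}

test_tokens = tokenize('I like 15 inch laptops')
known = [t for t in test_tokens if t in vocab]
unknown = [t for t in test_tokens if t not in vocab]
print(known)    # ['like']
print(unknown)  # ['15', 'inch', 'laptops']
```

The only word the model can use is 'like', which appeared only in a Mobile sentence. The plural 'laptops' is a different token from 'laptop', so it contributes nothing.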

# Word Vector
#### It tries to capture the semantic meaning of words in vectors

#### Let's say we have some sentences that are about mobiles:
'Calling is very good', 'It has SnapDragon', 'It is of 6-inches'<br>
Here word vectors look at which words appear together and learn to relate them to Mobile, since words like SnapDragon, calling, and 6-inches often appear in text about mobiles
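
To make the idea concrete, here is a toy sketch in plain Python. The 3-dimensional vectors are invented purely for illustration (real embeddings are learned and have many more dimensions), and `sentence_vector` and `cosine` are hypothetical helper names. It averages word vectors into a sentence vector, which is one common approach, and compares it with the vectors for 'mobile' and 'laptop'.

```python
import math

# Toy 3-dimensional embeddings; the numbers are made up for illustration only
toy_vectors = {
    'snapdragon': [0.9, 0.1, 0.0],
    'calling':    [0.8, 0.2, 0.1],
    'mobile':     [1.0, 0.0, 0.1],
    'linux':      [0.1, 0.9, 0.0],
    'laptop':     [0.0, 1.0, 0.1],
}

def sentence_vector(text):
    # Average the vectors of the words we have embeddings for
    vecs = [toy_vectors[w] for w in text.lower().split() if w in toy_vectors]
    if not vecs:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, 0.0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

doc = sentence_vector('It has SnapDragon')
sim_mobile = cosine(doc, toy_vectors['mobile'])
sim_laptop = cosine(doc, toy_vectors['laptop'])
print(sim_mobile > sim_laptop)  # True
```

Unlike bag of words, the sentence never has to contain the word 'mobile' itself; it only needs words whose vectors point in a similar direction.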

In [11]:
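import spacy
# Note: the small model does not ship static word vectors, so doc.vector comes from
# its 96-dimensional context tensors; en_core_web_md / en_core_web_lg include real word vectors
nlp = spacy.load("en_core_web_sm")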

In [12]:
print(X)

['This is a mobile', 'I like this mobile', 'This is a great laptop', 'This is my favourite laptop']


In [13]:
docs = [nlp(txt) for txt in X]
train_spacy = [x.vector for x in docs]

In [14]:
docs[0].vector.shape

(96,)

In [15]:
clf_spacy = svm.SVC(kernel = 'linear')
clf_spacy.fit(train_spacy, y)

In [16]:
X_test = ['It runs on linux']
test_docs = [nlp(txt) for txt in X_test]
test_vectors = [x.vector for x in test_docs]

clf_spacy.predict(test_vectors)

array(['Laptop'], dtype='<U6')

# Regexes
Pattern matching on strings: phone numbers, emails, etc.

In [17]:
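import re
# 'na' followed by one non-whitespace character: [^...] negates the set, \s is any whitespace
regex = re.compile(r'na[^\s]')
sentence = ['anm', 'ahnsm', 'nakdldlnm']

matches = []
for i in sentence:
    if regex.match(i):  # match() only matches at the start of the string
        matches.append(i)
print(matches)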

['nakdldlnm']
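
For the phone number and email use cases mentioned above, a hedged sketch follows. Both patterns are deliberately simplified assumptions (a `ddd-ddd-dddd` phone format and a loose email shape), not validators for real-world data, and the sample text is made up. It also shows how `re.match` (start of string only) differs from `re.search` (anywhere in the string).

```python
import re

text = 'Contact me at jane.doe@example.com or 555-123-4567 today'

# Simplified patterns for illustration only
email_pattern = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')
phone_pattern = re.compile(r'\b\d{3}-\d{3}-\d{4}\b')

emails = email_pattern.findall(text)
phones = phone_pattern.findall(text)
print(emails)  # ['jane.doe@example.com']
print(phones)  # ['555-123-4567']

# match() anchors at the start of the string; search() scans the whole string
at_start = re.match(r'na\S', 'banana')
anywhere = re.search(r'na\S', 'banana')
print(at_start)          # None
print(anywhere.group())  # nan
```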


# Stemming / Lemmatization
Techniques to normalize text<br>
reading - read, Books - book <br>
Stories - (Stori = Stemming and Story = Lemmatization)<br>
We can use these techniques when cleaning the corpus; they can help accuracy, and the model has fewer distinct words to learn
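
The difference between the two can be sketched in plain Python. This is a toy illustration only: `toy_stem` is a crude suffix stripper (not the Porter algorithm), and `toy_lemmas` is a tiny hand-written lookup table standing in for a real dictionary such as WordNet. Both names are hypothetical.

```python
def toy_stem(word):
    # Crude rule-based suffix stripping; the result need not be a real word
    word = word.lower()
    for suffix, replacement in (('ies', 'i'), ('ing', ''), ('s', '')):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)] + replacement
    return word

# Hand-written lookup table; a real lemmatizer consults a full dictionary
toy_lemmas = {'stories': 'story', 'reading': 'read', 'books': 'book'}

def toy_lemmatize(word):
    # Dictionary lookup; fall back to the lowercased word if it is unknown
    return toy_lemmas.get(word.lower(), word.lower())

for w in ['reading', 'Books', 'Stories']:
    print(w, '->', toy_stem(w), '/', toy_lemmatize(w))
```

Stemming chops suffixes by rule and may produce a non-word such as 'stori', while lemmatization maps to a dictionary form such as 'story'.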

In [18]:
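import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# One-time setup if the data is missing: nltk.download('punkt') (newer NLTK releases use 'punkt_tab'),
# nltk.download('wordnet'), nltk.download('stopwords')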

In [19]:
stemmer = PorterStemmer()
sentence = 'Driving the car'

words = word_tokenize(sentence) # similar to split(), but punctuation becomes separate tokens

stemmed_words = [stemmer.stem(w) for w in words] # stem() also lowercases by default

print(stemmed_words)
print(' '.join(stemmed_words))

['drive', 'the', 'car']
drive the car


In [20]:
lemmatizer = WordNetLemmatizer()
sentence = 'driving the car'

words = word_tokenize(sentence) # similar to split(), but punctuation becomes separate tokens

# pos='v' treats every token as a verb; with the default pos='n', 'driving' would stay unchanged
lemmatized_words = [lemmatizer.lemmatize(w, pos='v') for w in words]

print(lemmatized_words)
print(' '.join(lemmatized_words))

['drive', 'the', 'car']
drive the car


# Stopwords removal
Stopwords are the most common words in English, e.g. this, that, is, a

In [21]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english')) # a set gives fast membership checks

sentence = 'Hello my disciple, are you doing good'
words = word_tokenize(sentence)

# The stopword list is lowercase, so compare in lowercase
remove_stopwords = [w for w in words if w.lower() not in stop_words]

print(words)
print(remove_stopwords)
print(' '.join(remove_stopwords))

['Hello', 'my', 'disciple', ',', 'are', 'you', 'doing', 'good']
['Hello', 'disciple', ',', 'good']
Hello disciple , good
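
The output above still contains the comma. A minimal plain-Python sketch of filtering stopwords and punctuation together follows; `toy_stop_words` is a small hand-picked subset assumed for illustration (NLTK's English list is much longer), and `remove_stop_words` is a hypothetical helper.

```python
import re

# Small hand-picked subset; not the full NLTK list
toy_stop_words = {'this', 'that', 'is', 'a', 'my', 'are', 'you', 'doing'}

def remove_stop_words(text):
    # Split into words and single punctuation marks
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Keep alphabetic tokens that are not stopwords (case-insensitive comparison)
    return [t for t in tokens if t.isalpha() and t.lower() not in toy_stop_words]

cleaned = remove_stop_words('Hello my disciple, are you doing good')
print(cleaned)  # ['Hello', 'disciple', 'good']
```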


# TextBlob
TextBlob is a library for natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

In [22]:
from textblob import TextBlob

# When using these techniques, it is better to lower() the text first for better results
sentence = 'this is a Senetence'
tb_sentence = TextBlob(sentence)

tb_sentence.correct()

TextBlob("this is a Sentence")

Polarity lies in [-1, 1], where -1 is negative sentiment and +1 is positive sentiment<br>Subjectivity lies in [0, 1], where 0 is objective (factual) and 1 is subjective (personal opinions and judgments).
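
As a rough intuition for how a polarity score in [-1, 1] can arise, here is a toy lexicon-based sketch. It illustrates the general idea only and is not TextBlob's actual algorithm; the lexicon scores are invented, and `toy_polarity` is a hypothetical helper.

```python
# Toy polarity lexicon; the scores are made up for illustration only
toy_lexicon = {'love': 0.5, 'great': 0.8, 'trouble': -0.2, 'terrible': -1.0}

def toy_polarity(text):
    # Average the scores of the words found in the lexicon; 0.0 if none are found
    scores = [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]
    return sum(scores) / len(scores) if scores else 0.0

pos = toy_polarity('I love this great phone')
neg = toy_polarity('This is terrible')
neutral = toy_polarity('This is a phone')
print(pos > 0, neg < 0, neutral == 0.0)  # True True True
```

Because the result is an average of scores that each lie in [-1, 1], it stays in that range, and mixed sentences end up closer to 0.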

In [23]:
sent = 'I love how India tackled economic crisis, when other countries were in trouble'
TextBlob(sent).sentiment

Sentiment(polarity=0.09374999999999999, subjectivity=0.34375)

In [24]:
TextBlob(sent).tags

[('I', 'PRP'),
 ('love', 'VBP'),
 ('how', 'WRB'),
 ('India', 'NNP'),
 ('tackled', 'VBD'),
 ('economic', 'JJ'),
 ('crisis', 'NN'),
 ('when', 'WRB'),
 ('other', 'JJ'),
 ('countries', 'NNS'),
 ('were', 'VBD'),
 ('in', 'IN'),
 ('trouble', 'NN')]

In [None]:
# RNNs and attention can give a better understanding of text
# Transformer architectures - BERT, GPT (the model family behind OpenAI's ChatGPT); ELMo is an earlier LSTM-based contextual model
# spaCy transformers and HuggingFace