***Steps involved in Text Preprocessing:***<br>
1. Convert the sentences to lower case.
2. Remove the stop words and apply stemming or lemmatization to the remaining words.
3. Convert the sentences into count vectors using the Bag Of Words concept.

***When preprocessing text, always convert the sentences to a single case (lower or upper).*** Otherwise, the same word written in two different cases is treated as two different words. Example: "Hey" and "hey" would be counted as two different words, whereas converting both to lower case (or upper case) turns them into the single word hey (or HEY).

***NOTE:***<br>
Abbreviations like IN (for India) and US (for USA) should not be converted to lower case, because they would become the ordinary words "in" and "us" and change the meaning. Hence, before converting the sentences to lower case, make a separate list of those words, keep them out of the conversion, and put them back in their original form at the end.
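The note above can be sketched in plain Python. This is a minimal sketch, not a definitive implementation: the `PROTECTED_WORDS` list and the `lowercase_except` helper are hypothetical names introduced here for illustration, and the sketch skips the protected words in place rather than removing them and re-adding them afterwards.

```python
import re

# Hypothetical list of abbreviations whose upper-case form carries meaning.
PROTECTED_WORDS = {"IN", "US"}

def lowercase_except(text, protected=PROTECTED_WORDS):
    """Lower-case every word in text except exact matches of the protected words."""
    def convert(match):
        word = match.group(0)
        return word if word in protected else word.lower()
    # \w+ matches each run of letters, digits and underscores; punctuation is left untouched.
    return re.sub(r"\w+", convert, text)

print(lowercase_except("Hey, the US and IN signed a deal In Delhi."))
# hey, the US and IN signed a deal in delhi.
```

Only the exact upper-case spelling is protected, so "In" at the start of a sentence is still lowercased. A sentence written entirely in capitals would keep "IN" and "US" as well, so the list is a heuristic rather than a guarantee.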

Next, remove the stop words and apply stemming or lemmatization to the sentences.

Bag Of Words can be of two types:<br>
1. Binary BOW<br>
2. BOW

***Binary Bag Of Words:***<br>
In Binary BOW, we first count how often each word occurs across all the sentences to build the vocabulary. Then we make a table with one row per sentence and one column per word, and write 1 if the word is present in the sentence and 0 otherwise.<br><img src='binary_bow.jpg' width=500><br>
***BOW:***<br>
In BOW, the table stores how many times each word occurs in the sentence, rather than only whether it is present.<br><img src='bow.jpg' width=500><br>

Each row of the table is a vector. For example, the vector form of sentence 1 is '1,1,0'. These vectors serve as the features (independent variables); together with the output variable given for each sentence, they can be used to train an ML model.
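The difference between the two tables can be sketched in plain Python. The three preprocessed sentences below are an assumption chosen so that sentence 1 gives the vector '1,1,0' mentioned above; the helper names `count_vector` and `binary_vector` are hypothetical.

```python
from collections import Counter

# Assumed toy corpus (already lowercased, stop words removed).
sentences = ["good boy", "good girl", "boy girl good"]

# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

def count_vector(sentence, vocabulary):
    """BOW: how many times each vocabulary word occurs in the sentence."""
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

def binary_vector(sentence, vocabulary):
    """Binary BOW: 1 if the vocabulary word occurs in the sentence, else 0."""
    return [1 if count > 0 else 0 for count in count_vector(sentence, vocabulary)]

print(vocabulary)                                  # ['good', 'boy', 'girl']
print(binary_vector(sentences[0], vocabulary))     # [1, 1, 0]
print(count_vector("good good boy", vocabulary))   # [2, 1, 0]
print(binary_vector("good good boy", vocabulary))  # [1, 1, 0]
```

The two representations only differ when a word is repeated within a sentence, as in the last two lines.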

***Disadvantage of BOW:***<br>
The table gives no indication of whether the word 'BOY' or the word 'Good' carries more weight. Hence, the major disadvantage of BOW is that it cannot tell us how much impact a particular word has on the sentence.

***To overcome this issue, we use techniques like TF-IDF (Term Frequency - Inverse Document Frequency).***
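To illustrate how TF-IDF assigns different weights to words, here is a minimal sketch using the textbook definitions: term frequency as count divided by sentence length, and inverse document frequency as log(N / number of sentences containing the word). The toy sentences and the helper names `tf` and `idf` are assumptions made for illustration; library implementations often use smoothed or normalized variants, so their numbers will differ.

```python
import math

# Assumed toy corpus (already lowercased, stop words removed), split into words.
docs = [s.split() for s in ["good boy", "good girl", "boy girl good"]]
vocabulary = ["good", "boy", "girl"]

def tf(term, doc):
    """Term frequency: share of the words in doc that are term."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing term)."""
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

tfidf = [[tf(term, doc) * idf(term, docs) for term in vocabulary] for doc in docs]

for row in tfidf:
    print([round(value, 3) for value in row])
```

Because 'good' occurs in every sentence, its IDF is log(3/3) = 0 and its weight vanishes, while 'boy' and 'girl' receive non-zero weights in the sentences that contain them.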

For sentiment analysis, you can go ahead with Bag Of Words, but for a large dataset Word2Vec is usually a better choice.

# ***Bag Of Words:***

In [1]:
paragraph="Mathematical analysis formally developed in the 17th century during the Scientific Revolution, but many of its ideas can be traced back to earlier mathematicians. Early results in analysis were implicitly present in the early days of ancient Greek mathematics. For instance, an infinite geometric sum is implicit in Zeno's paradox of the dichotomy. Later, Greek mathematicians such as Eudoxus and Archimedes made more explicit, but informal, use of the concepts of limits and convergence when they used the method of exhaustion to compute the area and volume of regions and solids. The explicit use of infinitesimals appears in Archimedes' The Method of Mechanical Theorems, a work rediscovered in the 20th century. In Asia, the Chinese mathematician Liu Hui used the method of exhaustion in the 3rd century AD to find the area of a circle. Zu Chongzhi established a method that would later be called Cavalieri's principle to find the volume of a sphere in the 5th century. The Indian mathematician Bhāskara II gave examples of the derivative and used what is now known as Rolle's theorem in the 12th century."

In [2]:
import nltk

In [3]:
# Step 1: Convert the paragraph into lower case

In [27]:
paragraph =paragraph.lower()

In [28]:
# Step 2: Remove the stop words and apply lemmatization

In [29]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

In [30]:
lemmatizer=WordNetLemmatizer()

In [31]:
sentences=nltk.sent_tokenize(paragraph)

In [32]:
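stop_words = set(stopwords.words('english'))  # build the stop-word set once, not once per word
for i in range(len(sentences)):
    words = nltk.word_tokenize(sentences[i])
    # keep only the non-stop words and reduce each one to its lemma
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    sentences[i] = ' '.join(words)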

In [33]:
sentences

['mathematical analysis formally developed 17th century scientific revolution , many idea traced back earlier mathematician .',
 'early result analysis implicitly present early day ancient greek mathematics .',
 "instance , infinite geometric sum implicit zeno 's paradox dichotomy .",
 'later , greek mathematician eudoxus archimedes made explicit , informal , use concept limit convergence used method exhaustion compute area volume region solid .',
 "explicit use infinitesimal appears archimedes ' method mechanical theorem , work rediscovered 20th century .",
 'asia , chinese mathematician liu hui used method exhaustion 3rd century ad find area circle .',
 "zu chongzhi established method would later called cavalieri 's principle find volume sphere 5th century .",
 "indian mathematician bhāskara ii gave example derivative used known rolle 's theorem 12th century ."]

In [34]:
# The punctuation marks are still present. To remove them, we will use the re library
# (re stands for Regular Expression).

In [35]:
import re

In [36]:
for i in range(len(sentences)):
    # re.sub(pattern, replacement, string) replaces every match of the pattern with the replacement.
    # Inside [...], a leading '^' means 'not', so every character other than a-z, A-Z and 0-9
    # is replaced with a space. Note that this also replaces non-ASCII letters such as 'ā'.
    sentences[i] = re.sub(r'[^a-zA-Z0-9]', ' ', sentences[i])

In [37]:
sentences

['mathematical analysis formally developed 17th century scientific revolution   many idea traced back earlier mathematician  ',
 'early result analysis implicitly present early day ancient greek mathematics  ',
 'instance   infinite geometric sum implicit zeno  s paradox dichotomy  ',
 'later   greek mathematician eudoxus archimedes made explicit   informal   use concept limit convergence used method exhaustion compute area volume region solid  ',
 'explicit use infinitesimal appears archimedes   method mechanical theorem   work rediscovered 20th century  ',
 'asia   chinese mathematician liu hui used method exhaustion 3rd century ad find area circle  ',
 'zu chongzhi established method would later called cavalieri  s principle find volume sphere 5th century  ',
 'indian mathematician bh skara ii gave example derivative used known rolle  s theorem 12th century  ']
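In the output above, 'bhāskara' has been split into 'bh skara' because 'ā' lies outside a-z, and the removed punctuation has left runs of spaces behind. A possible alternative, shown here only as a sketch, is to remove everything that is neither a word character nor whitespace and then collapse the spaces. The helper name `remove_punctuation` is hypothetical, and note that `\w` also keeps underscores.

```python
import re

def remove_punctuation(sentence):
    """Replace punctuation with spaces, keep Unicode letters, collapse repeated spaces."""
    # [^\w\s] matches anything that is neither a word character nor whitespace.
    # For str patterns in Python 3, \w matches Unicode letters and digits by default.
    no_punctuation = re.sub(r"[^\w\s]", " ", sentence)
    return " ".join(no_punctuation.split())

sample = "indian mathematician bh\u0101skara ii used rolle 's theorem ."
print(remove_punctuation(sample))
# indian mathematician bhāskara ii used rolle s theorem
```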

In [38]:
# Step 3: Convert the sentences into vectors using BOW

In [39]:
from sklearn.feature_extraction.text import CountVectorizer

In [40]:
cv=CountVectorizer()

In [45]:
X=cv.fit_transform(sentences).toarray()

In [47]:
import pandas as pd
X=pd.DataFrame(X)
X

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,72,73,74,75,76,77,78,79,80,81
0,0,1,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,0,0,0,0,0,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,1,1,1,0,0,0,0
4,0,0,1,0,0,0,0,0,1,1,...,0,1,0,1,0,0,1,0,0,0
5,0,0,0,1,0,1,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
6,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,1
7,1,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,0,0,0,0,0
