##**Importing nltk library for language processing**

In [68]:
import nltk  # Natural language processing toolkit
nltk.download('punkt')  # Punkt models used by the sentence tokenizer
nltk.download('stopwords')  # Stop-word lists
nltk.download('wordnet')  # WordNet data used by the lemmatizer
import re  # Regular expressions for string searching and manipulation
from nltk.corpus import stopwords  # Stop-word corpus reader, used to remove stop words
from nltk.stem.porter import PorterStemmer  # Porter stemmer, used for stemming
from nltk.stem import WordNetLemmatizer  # WordNet lemmatizer, used for lemmatization

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [0]:
paragraph = """The Republic of India is a country in Asia. It has an area of 3,287,263 square kilometres (1,269,219 sq mi). It is at the center of South Asia. India has more than 1.2 billion (1,210,000,000) people, which is the second largest population in the world. It is the seventh largest country in the world by area and the largest country in South Asia. It is also the most populous democracy in the world.

India has seven neighbours: Pakistan in the north-west, China and Nepal in the north, Bhutan and Bangladesh in the north-east, Myanmar in the east and Sri Lanka, an island, in the south.

The capital of India is New Delhi. India is a peninsula, bound by the Indian Ocean in the south, the Arabian Sea on the west and Bay of Bengal in the east. The coastline of India is of about 7,517 km (4,671 mi) long. India has the third largest military force in the world and is also a nuclear weapon state.

India's economy became the world's fastest growing in the G20 developing nations in the last quarter of 2014, replacing the People's Republic of China. India's literacy and wealth are also rising.According to New World Wealth, India is the seventh richest country in the world with a total individual wealth of $5.6 trillion. However, it still has many social and economic issues like poverty and corruption. India is a founding member of the World Trade Organisation (WTO), and has signed the Kyoto Protocol.

India has the fourth largest number of spoken languages per country in the world, only behind Papua New Guinea, Indonesia, and Nigeria. People of many different religions live there, including the five most popular world religions: Hinduism, Buddhism, Sikhism, Islam, and Christianity. The first three religions came from the Indian subcontinent along with Jainism."""

###**Cleaning the texts**

In [0]:
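wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # Build the stop-word set once, not once per word
sentences = nltk.sent_tokenize(paragraph)  # Split the paragraph into sentences
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # Replace every non-letter character with a space
    review = review.lower()
    review = review.split()  # Split on whitespace into word tokens
    # Drop stop words and lemmatize the remaining words
    review = [wordnet.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)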
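
To make the individual cleaning steps concrete, the following sketch applies them to one sentence from the paragraph using only the standard library. The small `STOP_WORDS` set and the helper name `clean_sentence` are hypothetical stand-ins chosen for this example (the notebook uses NLTK's English stop-word list), and lemmatization is left out, so the result is not identical to what the cell above produces.

```python
import re

# Hypothetical, hand-picked stop words for illustration only;
# the notebook uses stopwords.words('english') from NLTK instead.
STOP_WORDS = {"it", "has", "an", "of"}

def clean_sentence(sentence, stop_words=STOP_WORDS):
    """Return the tokens left after the regex, lowercase, split and stop-word steps."""
    letters_only = re.sub('[^a-zA-Z]', ' ', sentence)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()  # lowercase, then split on whitespace
    return [token for token in tokens if token not in stop_words]

sentence = "It has an area of 3,287,263 square kilometres (1,269,219 sq mi)."
print(clean_sentence(sentence))
# ['area', 'square', 'kilometres', 'sq', 'mi']
```

The notebook's lemmatized result for this sentence is 'area square kilometre sq mi'; the only difference is that the lemmatizer reduces "kilometres" to "kilometre".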

###**Creating the Bag of Words model using CountVectorizer**

In [0]:
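from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)  # Keep at most the 1500 most frequent terms
X = cv.fit_transform(corpus).toarray()  # Dense count matrix: one row per sentence, one column per term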
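
As a rough illustration of what a bag-of-words matrix contains, the sketch below builds one by hand for three cleaned sentences taken from the lemmatized corpus in this notebook. It uses only the standard library and is not how `CountVectorizer` is implemented; the helper name `bag_of_words` is made up for this example, and options such as `max_features` are ignored.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, matrix) with one row per document and one column per word."""
    # Collect every distinct word across all documents and fix a column order.
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())  # word -> number of occurrences in this document
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = [
    'republic india country asia',
    'center south asia',
    'seventh largest country world area largest country south asia',
]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
for row in matrix:
    print(row)
# ['area', 'asia', 'center', 'country', 'india', 'largest', 'republic', 'seventh', 'south', 'world']
# [0, 1, 0, 1, 1, 0, 1, 0, 0, 0]
# [0, 1, 1, 0, 0, 0, 0, 0, 1, 0]
# [1, 1, 0, 2, 0, 2, 0, 1, 1, 1]
```

Each row corresponds to one sentence and each column to one vocabulary word, which is the same layout as `X` above. The third row contains the value 2 twice because "country" and "largest" each occur twice in that sentence.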

###**Let's see our tokenized sentences**

In [72]:
sentences

['The Republic of India is a country in Asia.',
 'It has an area of 3,287,263 square kilometres (1,269,219 sq mi).',
 'It is at the center of South Asia.',
 'India has more than 1.2 billion (1,210,000,000) people, which is the second largest population in the world.',
 'It is the seventh largest country in the world by area and the largest country in South Asia.',
 'It is also the most populous democracy in the world.',
 'India has seven neighbours: Pakistan in the north-west, China and Nepal in the north, Bhutan and Bangladesh in the north-east, Myanmar in the east and Sri Lanka, an island, in the south.',
 'The capital of India is New Delhi.',
 'India is a peninsula, bound by the Indian Ocean in the south, the Arabian Sea on the west and Bay of Bengal in the east.',
 'The coastline of India is of about 7,517 km (4,671 mi) long.',
 'India has the third largest military force in the world and is also a nuclear weapon state.',
 "India's economy became the world's fastest growing in the 

###**Let's see our corpus after text cleaning and lemmatization**

In [73]:
corpus

['republic india country asia',
 'area square kilometre sq mi',
 'center south asia',
 'india billion people second largest population world',
 'seventh largest country world area largest country south asia',
 'also populous democracy world',
 'india seven neighbour pakistan north west china nepal north bhutan bangladesh north east myanmar east sri lanka island south',
 'capital india new delhi',
 'india peninsula bound indian ocean south arabian sea west bay bengal east',
 'coastline india km mi long',
 'india third largest military force world also nuclear weapon state',
 'india economy became world fastest growing g developing nation last quarter replacing people republic china',
 'india literacy wealth also rising according new world wealth india seventh richest country world total individual wealth trillion',
 'however still many social economic issue like poverty corruption',
 'india founding member world trade organisation wto signed kyoto protocol',
 'india fourth largest numbe

##**Observations:-** 
###**Lemmatization**
- We first split the paragraph into sentences with `nltk.sent_tokenize`
- We then cleaned each sentence (removed non-letter characters, lowercased, dropped stop words) and lemmatized the remaining words
- After lemmatization the corpus still consists of meaningful words such as "republic", "population" and "kilometre"
- Lemmatization is therefore a reasonable choice for text pre-processing in sentiment analysis

###**Now we will also perform stemming on our paragraph**

In [0]:
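ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # Build the stop-word set once, not once per word
sentences = nltk.sent_tokenize(paragraph)  # Split the paragraph into sentences
corpus = []
for sentence in sentences:
    review = re.sub('[^a-zA-Z]', ' ', sentence)  # Replace every non-letter character with a space
    review = review.lower()
    review = review.split()  # Split on whitespace into word tokens
    # Drop stop words and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    corpus.append(review)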

###**Creating the Bag of Words model using CountVectorizer**

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)  # Keep at most the 1500 most frequent terms
X = cv.fit_transform(corpus).toarray()  # Dense count matrix built from the stemmed corpus

###**Let's see our tokenized sentences**

In [76]:
sentences

['The Republic of India is a country in Asia.',
 'It has an area of 3,287,263 square kilometres (1,269,219 sq mi).',
 'It is at the center of South Asia.',
 'India has more than 1.2 billion (1,210,000,000) people, which is the second largest population in the world.',
 'It is the seventh largest country in the world by area and the largest country in South Asia.',
 'It is also the most populous democracy in the world.',
 'India has seven neighbours: Pakistan in the north-west, China and Nepal in the north, Bhutan and Bangladesh in the north-east, Myanmar in the east and Sri Lanka, an island, in the south.',
 'The capital of India is New Delhi.',
 'India is a peninsula, bound by the Indian Ocean in the south, the Arabian Sea on the west and Bay of Bengal in the east.',
 'The coastline of India is of about 7,517 km (4,671 mi) long.',
 'India has the third largest military force in the world and is also a nuclear weapon state.',
 "India's economy became the world's fastest growing in the 

###**Let's see our corpus after text cleaning and stemming**

In [77]:
corpus

['republ india countri asia',
 'area squar kilometr sq mi',
 'center south asia',
 'india billion peopl second largest popul world',
 'seventh largest countri world area largest countri south asia',
 'also popul democraci world',
 'india seven neighbour pakistan north west china nepal north bhutan bangladesh north east myanmar east sri lanka island south',
 'capit india new delhi',
 'india peninsula bound indian ocean south arabian sea west bay bengal east',
 'coastlin india km mi long',
 'india third largest militari forc world also nuclear weapon state',
 'india economi becam world fastest grow g develop nation last quarter replac peopl republ china',
 'india literaci wealth also rise accord new world wealth india seventh richest countri world total individu wealth trillion',
 'howev still mani social econom issu like poverti corrupt',
 'india found member world trade organis wto sign kyoto protocol',
 'india fourth largest number spoken languag per countri world behind papua new gui

##**Observations:-** 
###**Stemming**
- We first split the paragraph into sentences with `nltk.sent_tokenize`
- We then cleaned each sentence and applied stemming to the remaining words
- After stemming, the corpus contains some truncated forms that are not real words, such as "republ" and "popul"
- Stemming is therefore not the preferred choice for text pre-processing in sentiment analysis
- Stemming can still be used in spam classification, where meaningful words do not carry more weight than other words
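
The difference between the two corpora can be seen by comparing them token by token. The sketch below does this for two sentences, copying the lemmatized and stemmed strings from the outputs shown above; the helper name `changed_tokens` is made up for this example, and it assumes both versions of a sentence have the same number of tokens, which holds for these two.

```python
# Cleaned sentences copied from the notebook outputs (lemmatized vs. stemmed).
lemmatized = [
    'republic india country asia',
    'india billion people second largest population world',
]
stemmed = [
    'republ india countri asia',
    'india billion peopl second largest popul world',
]

def changed_tokens(lemma_sentence, stem_sentence):
    """Return (lemma, stem) pairs for the positions where the two tokens differ."""
    pairs = zip(lemma_sentence.split(), stem_sentence.split())
    return [(lemma, stem) for lemma, stem in pairs if lemma != stem]

for lemma_sentence, stem_sentence in zip(lemmatized, stemmed):
    print(changed_tokens(lemma_sentence, stem_sentence))
# [('republic', 'republ'), ('country', 'countri')]
# [('people', 'peopl'), ('population', 'popul')]
```

In each pair the lemmatized form is a dictionary word, while the stemmed form is a truncated string.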