# Advanced Certification in AIML
## A Program by IIIT-H and TalentSprint



### Not for Grading

## Learning Objectives

At the end of the experiment, you will be able to:

*   Understand Beautiful Soup
*   Use the NLTK package

### Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files.
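
As a minimal sketch of what pulling data out of HTML looks like, the snippet below parses a small hand-written HTML string (not the Shakespeare page used later) with Python's built-in `html.parser` backend and reads the heading and paragraph text. The names `html_doc` and `soup_demo` are illustrative only.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, used only for illustration
html_doc = """
<html><body>
<h3>Act 1</h3>
<p>First line.</p>
<p>Second line.</p>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser package is needed
soup_demo = BeautifulSoup(html_doc, "html.parser")

# Access the first <h3> tag and collect the text of every <p> tag
heading = soup_demo.h3.get_text()
paragraphs = [p.get_text() for p in soup_demo.find_all("p")]

print(heading)
print(paragraphs)
```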

### NLTK 

NLTK (Natural Language Toolkit) is a Python package that provides tools for text processing tasks such as classification, tokenization, stemming, parsing, and part-of-speech (POS) tagging.

In [None]:
! wget https://cdn.iiith.talentsprint.com/aiml/Experiment_related_data/ipl.csv


### Importing required packages


In [None]:
import requests                                 # requests is a library for sending HTTP requests in Python
from bs4 import BeautifulSoup as bs             # BeautifulSoup makes it easy to scrape information from web pages
from nltk.tokenize import word_tokenize         # word_tokenize() method is to split a sentence into tokens or words
from nltk.tokenize import sent_tokenize         # sent_tokenize() is to split a document or paragraph into sentences
import string
import nltk
import matplotlib.pyplot as plt

### Extract data from HTML using Beautiful Soup

In [None]:
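# Specify the URL
url = "http://shakespeare.mit.edu/allswell/full.html"

# Request the web page and parse the returned HTML
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()                # raise an error for 4xx/5xx responses
    soup = bs(r.content, 'lxml')
except requests.RequestException as err:
    # Report the failure and re-raise so later cells do not run with an undefined `soup`
    print("Request failed:", err)
    raise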

In [None]:
text = soup.get_text()
text

### Normalizing Text


Replace each newline character '\n' in the text with a space " " and convert the text to lowercase.

Hint: use .replace() to replace '\n' and .lower() to convert the text to lowercase.

In [None]:
# YOUR CODE HERE
text = text.replace('\n'," ")
text = text.lower()
print(text)
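
Replacing each '\n' with a space can leave runs of consecutive spaces wherever the page had blank lines. The sketch below shows the effect on a made-up sample string and one optional way to collapse the extra whitespace with `split()` and `join()`. The collapsing step is an assumption added for illustration and is not part of the exercise.

```python
# Made-up sample with a blank line in the middle
sample = "All's Well\nThat Ends Well\n\nACT I"

# Same two steps as the exercise: newlines to spaces, then lowercase
normalized = sample.replace('\n', ' ').lower()

# Optional: split on any whitespace and rejoin with single spaces
collapsed = ' '.join(normalized.split())

print(repr(normalized))
print(repr(collapsed))
```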

### Tokenization


In [None]:
nltk.download('punkt')
nltk.download('punkt_tab')    # newer NLTK releases load the tokenizer data from 'punkt_tab'
nltk.download('wordnet')

Sentence Level Tokenization


In [None]:
sen_token = sent_tokenize(text)
len(sen_token)

In [None]:
sen_token[0]

Remove punctuation after sentence tokenization

In [None]:
# Build the translation table once, then strip punctuation from every sentence
table = str.maketrans('', '', string.punctuation)
sen_token = [sent.translate(table) for sent in sen_token]
print(sen_token[10])        # print the sentence at index 10 (the 11th) to verify the punctuation removal
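
How `str.maketrans('', '', string.punctuation)` works can be seen on a single made-up sentence: the third argument lists the characters to delete, so `translate` drops every ASCII punctuation mark and leaves everything else unchanged. Typographic characters such as curly quotes are not in `string.punctuation` and would survive. The names `table_demo` and `sentence` are illustrative.

```python
import string

# Third argument = characters to delete; the first two (empty) mean no replacements
table_demo = str.maketrans('', '', string.punctuation)

sentence = "Well, what's here? A king's decree!"
cleaned = sentence.translate(table_demo)

print(cleaned)
```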

Average sentence length

In [None]:
len_sent = [len(sent.split()) for sent in sen_token]     # number of words in each sentence
Avg_Sent = sum(len_sent) // len(len_sent)                # floor division: average rounded down to whole words
print("Average sentence length (in words) of the Shakespeare text is", Avg_Sent)
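
An average sentence length can be reported as a whole number (floor division, `//`) or as a fraction (true division, `/`). The sketch below compares the two on three made-up sentences; the sentences and variable names are illustrative.

```python
# Three made-up sentences with 3, 2 and 6 words
sentences = ["all is well", "that ends", "well said my good lord indeed"]

lengths = [len(s.split()) for s in sentences]

floor_avg = sum(lengths) // len(lengths)   # rounded down to a whole number
true_avg = sum(lengths) / len(lengths)     # keeps the fractional part

print(lengths)
print(floor_avg)
print(round(true_avg, 2))
```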

### Word Level Tokenization

Word-level tokenization splits the full text into individual words (tokens).

In [None]:
wtokens = word_tokenize(text)
print(wtokens)

Remove punctuation after word tokenization

In [None]:
# Remove punctuation from each token
table = str.maketrans('', '', string.punctuation)
wtokens = [w.translate(table) for w in wtokens]
wtokens = [w for w in wtokens if len(w) >= 1]
wtokens

### Stemming

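Stemming reduces each word to its stem, the part that stays the same across its inflected forms. For example, "running" and "runs" both become "run". A stem is not always a dictionary word.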

Hint: Refer to the following link for [Stemmer](https://www.nltk.org/book/ch03.html)

In [None]:
porter = nltk.PorterStemmer()
# YOUR CODE HERE: Apply stemmer for wtokens
stem = [porter.stem(i) for i in wtokens]
print(stem)
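
A short sketch of the Porter stemmer on a few hand-picked words (not taken from the play) shows related forms collapsing to one stem. `PorterStemmer` is rule-based and needs no downloaded data; `porter_demo` and `words` are illustrative names.

```python
from nltk.stem import PorterStemmer

porter_demo = PorterStemmer()

# Hand-picked related word forms
words = ["running", "runs", "connected", "connection"]
stems = [porter_demo.stem(w) for w in words]

print(stems)
```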

#### Lemmatization

Lemmatization converts each word to its dictionary form (its lemma).

Hint: Refer to the following link for [Lemmatizer](https://www.nltk.org/book/ch03.html)

In [None]:
lemma = nltk.WordNetLemmatizer()
# YOUR CODE HERE: Apply lemmatize for wtokens
lemmas = [lemma.lemmatize(i) for i in wtokens]      # list of lemmatized tokens
print(lemmas[20:40])

#### Remove Stopwords

Hint: Refer to the following link for [Stopwords](https://stackabuse.com/removing-stop-words-from-strings-in-python/)

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords  
stop_words = set(stopwords.words('english')) 

In [None]:
stop_words_removed = []
# YOUR CODE HERE: Use NLTK packages for getting the Stop words and remove stopwords from wtokens
for word in wtokens:
  if word not in stop_words:
    stop_words_removed.append(word)
stop_words_removed[:10]
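
The filtering step is a set-membership test. The sketch below uses a tiny hand-picked set, `demo_stop_words`, as a stand-in for NLTK's English list so that it runs without any download. Both the set and the tokens are made up for illustration.

```python
# Hypothetical mini stop word list, NOT the NLTK list
demo_stop_words = {"the", "is", "a", "of", "to", "and"}

tokens = ["the", "king", "is", "a", "friend", "of", "the", "countess"]

# Keep only the tokens that are not stop words
filtered = [w for w in tokens if w not in demo_stop_words]

print(filtered)
```

Storing the stop words in a `set` keeps each membership check fast, which matters when the token list is as long as a full play.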

#### Parts of Speech:


Given any sentence, each word can be classified as a noun, verb, conjunction, or another part of speech. Doing this by hand for hundreds of thousands or even millions of sentences is a large and tedious task, but it can be automated with a part-of-speech (POS) tagger.




In [None]:
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')    # resource name used by newer NLTK releases

To look up what DT, JJ, or any other tag means, use the code below.


In [None]:
nltk.download('tagsets')
nltk.help.upenn_tagset('NN')

Print the parts of speech for the first 20 wtokens using pos_tag 

Hint: Refer to the following link for [parts of speech](https://www.nltk.org/book/ch05.html)

In [None]:
# YOUR CODE HERE
pos = nltk.pos_tag(wtokens[:20])
print(pos)

Count the NN tags among the first 20 words

In [None]:
# YOUR CODE HERE
nn_words = [word for word, tag in pos if tag == 'NN']     # words tagged as singular nouns
print(len(nn_words))

Count the VBP tags among the first 20 words

In [None]:
# YOUR CODE HERE
vbp_words = [word for word, tag in pos if tag == 'VBP']   # words tagged as non-3rd-person present verbs
print(len(vbp_words))
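
Counting one tag at a time works, but `collections.Counter` can tally every tag in a single pass. The pairs below are hand-written in the `(word, tag)` shape that `nltk.pos_tag` returns; they are not real tagger output, and `tagged_demo` is an illustrative name.

```python
from collections import Counter

# Hand-written (word, tag) pairs, for illustration only
tagged_demo = [
    ("the", "DT"), ("king", "NN"), ("loves", "VBZ"), ("his", "PRP$"),
    ("loyal", "JJ"), ("friend", "NN"), ("we", "PRP"), ("know", "VBP"),
]

# Tally how often each tag occurs
tag_counts = Counter(tag for _, tag in tagged_demo)

print(tag_counts["NN"])
print(tag_counts["VBP"])
print(tag_counts["CC"])   # a tag that never occurs counts as 0
```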