# Text Data Project – NLP Pipeline using Project Gutenberg

This notebook loads the Project Gutenberg HTML edition of *Romeo and Juliet*, extracts its text with BeautifulSoup, and runs basic NLP steps with NLTK: sentence tokenization, word tokenization, and POS tagging.


## 1. Loading the Dataset (HTML File)

- Dataset Source: Project Gutenberg  
- Book Used: *Romeo and Juliet*  
- File Name: `Romeo_Juliet.html`  
- This section reads the HTML file and extracts the raw text.


In [1]:
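from bs4 import BeautifulSoup

# Local path to the downloaded Project Gutenberg HTML file (adjust for your machine)
file_path = r"C:\Users\tikai\Downloads\Romeo_Juliet\Romeo_Juliet.html"

# Read the whole file as UTF-8
with open(file_path, "r", encoding="utf-8") as f:
    html = f.read()

# Parse the HTML and drop the tags, keeping only the text
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()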


## 2. Extract Raw Text from HTML (Scraping)

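BeautifulSoup parses the HTML, and `get_text()` removes the tags so that only plain text remains for the NLP steps. The first 500 characters are printed as a check. The sample shows the page title, a run of blank lines, and the Project Gutenberg license notice, so the extracted text still contains boilerplate that is not part of the play.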


In [2]:
soup = BeautifulSoup(html, "html.parser")
text = soup.get_text()

print("Extracted text sample:")
print(text[:500])   # show first 500 characters


Extracted text sample:


Romeo and Juliet | Project Gutenberg

























The Project Gutenberg eBook of Romeo and Juliet
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the 
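
The sample above still carries blank-line runs and license text. A minimal sketch of one way to trim both, assuming the play is delimited by a `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` line (visible in the sentence tokenizer output later in this notebook) and a matching `*** END OF ...` line (an assumption, since the end of the file is not shown here). The name `strip_gutenberg_boilerplate` and the `demo` string are hypothetical, for illustration only.

```python
import re

# Marker patterns; the END marker format is assumed to mirror the START marker.
START_RE = re.compile(r"\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*?\*\*\*")
END_RE = re.compile(r"\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*?\*\*\*")


def strip_gutenberg_boilerplate(raw_text):
    """Return the text between the START and END markers, with blank-line runs collapsed.

    If a marker is missing, the corresponding end of the text is used instead.
    """
    start = START_RE.search(raw_text)
    end = END_RE.search(raw_text)
    begin = start.end() if start else 0
    stop = end.start() if end and end.start() >= begin else len(raw_text)
    body = raw_text[begin:stop]
    # Collapse runs of three or more newlines (optionally with spaces/tabs between them)
    # into a single blank line.
    body = re.sub(r"\n[ \t]*\n(?:[ \t]*\n)+", "\n\n", body)
    return body.strip()


# Small stand-in for the extracted text (not the real file contents)
demo = (
    "Romeo and Juliet | Project Gutenberg\n\n\n\n\n"
    "The Project Gutenberg eBook of Romeo and Juliet\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***\n\n"
    "THE TRAGEDY OF ROMEO AND JULIET\n\n\n\n"
    "ACT I\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***\n"
    "License text"
)
print(strip_gutenberg_boilerplate(demo))
```

In the notebook this could be applied as `text = strip_gutenberg_boilerplate(text)` before tokenization; doing so would change the sentence and token counts reported in later cells.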



## 4. NLP Tasks with NLTK

### 4.1 Download Required NLTK Resources

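Download the NLTK resources used in the following steps: the Punkt sentence tokenizer (`punkt`) and the averaged perceptron POS tagger (`averaged_perceptron_tagger`).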


In [4]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\tikai\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\tikai\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True
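
Resource names can differ between NLTK versions. Recent releases look for `punkt_tab` and `averaged_perceptron_tagger_eng` instead of the two names used above; whether that applies depends on the installed version, so treat it as an assumption about your environment. The sketch below tries each candidate name and reports which downloads succeeded. The helper `ensure_nltk_resources` and its `download` parameter are hypothetical; the parameter exists only so the helper can be exercised without network access.

```python
def ensure_nltk_resources(candidates, download=None):
    """Try to download each resource name and return {name: succeeded}.

    `download` defaults to nltk.download, which returns True or False.
    """
    if download is None:
        import nltk  # imported lazily so the helper can be tested offline
        download = nltk.download
    results = {}
    for name in candidates:
        try:
            results[name] = bool(download(name, quiet=True))
        except Exception:
            # Treat any failure (for example, no network) as "not available"
            results[name] = False
    return results


# Names used in this notebook plus the newer names (assumed, version dependent)
RESOURCES = [
    "punkt",
    "punkt_tab",
    "averaged_perceptron_tagger",
    "averaged_perceptron_tagger_eng",
]

# Usage in the notebook (needs network access on the first run):
# print(ensure_nltk_resources(RESOURCES))
```

A name that the installed NLTK index does not know is simply reported as `False`, so the same cell can be run across versions.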

### 4.2 Sentence Tokenization

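Split the extracted text into sentences with `sent_tokenize`. The header and license text are still part of `text`, so the first sentences shown below are Project Gutenberg boilerplate rather than lines from the play.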


In [5]:
from nltk import sent_tokenize

sentences = sent_tokenize(text)
print("Total sentences:", len(sentences))
print(sentences[:5])


Total sentences: 3285
['\n\nRomeo and Juliet | Project Gutenberg\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nThe Project Gutenberg eBook of Romeo and Juliet\nThis ebook is for the use of anyone anywhere in the United States and\nmost other parts of the world at no cost and with almost no restrictions\nwhatsoever.', 'You may copy it, give it away or re-use it under the terms\nof the Project Gutenberg License included with this ebook or online\nat www.gutenberg.org.', 'If you are not located in the United States,\nyou will have to check the laws of the country where you are located\nbefore using this eBook.', 'Title: Romeo and Juliet\n\nAuthor: William Shakespeare\n\nRelease date: November 1, 1998 [eBook #1513]\n                Most recently updated: September 18, 2025\nLanguage: English\nCredits: the PG Shakespeare Team, a team of about twenty Project Gutenberg volunteers\n\n*** START OF THE PROJECT GUTENBERG EBOOK ROMEO AND JULIET ***\n\nTHE TRAGEDY OF ROMEO AND JULIET\nby Willi
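
The sentences above keep the line breaks of the source file, including long `\n` runs. A minimal sketch of one way to flatten whitespace inside each sentence and drop empty entries; the helper name `normalize_sentences` is hypothetical, and the sample list reuses one sentence from the output above.

```python
def normalize_sentences(sentences):
    """Collapse all whitespace runs to single spaces and drop empty sentences."""
    cleaned = []
    for sentence in sentences:
        flat = " ".join(sentence.split())  # split() with no args splits on any whitespace
        if flat:
            cleaned.append(flat)
    return cleaned


sample = [
    "You may copy it, give it away or re-use it under the terms\n"
    "of the Project Gutenberg License included with this ebook or online\n"
    "at www.gutenberg.org.",
    "\n\n",
]
print(normalize_sentences(sample))
```

In the notebook this would be called as `normalize_sentences(sentences)`. Note that flattening line breaks also removes verse line structure, which may or may not be wanted for a play.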

### 4.3 Word Tokenization

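Break the text into individual tokens with `word_tokenize`. Punctuation marks such as `|` are emitted as separate tokens, so the total below counts tokens (words plus punctuation), not words alone.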


In [6]:
from nltk import word_tokenize

words = word_tokenize(text)
print("Total words:", len(words))
print(words[:40])


Total words: 37615
['Romeo', 'and', 'Juliet', '|', 'Project', 'Gutenberg', 'The', 'Project', 'Gutenberg', 'eBook', 'of', 'Romeo', 'and', 'Juliet', 'This', 'ebook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost']
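
Because the token list mixes words and punctuation, a word count or frequency table needs a filtering step. A minimal sketch, assuming a "word" is any token with at least one letter or digit and that case should be ignored; the helper `word_frequencies` is hypothetical, and the sample tokens are copied from the output above.

```python
from collections import Counter


def word_frequencies(tokens):
    """Count lowercased tokens that contain at least one alphanumeric character."""
    words = [tok.lower() for tok in tokens if any(ch.isalnum() for ch in tok)]
    return Counter(words)


sample_tokens = ['Romeo', 'and', 'Juliet', '|', 'Project', 'Gutenberg', 'The',
                 'Project', 'Gutenberg', 'eBook', 'of', 'Romeo', 'and', 'Juliet']
freq = word_frequencies(sample_tokens)
print(sum(freq.values()), "word tokens out of", len(sample_tokens), "tokens")
print(freq.most_common(3))
```

Calling `word_frequencies(words)` on the full token list would give the word frequencies for the whole text, license boilerplate included unless it was stripped earlier.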


### 4.4 POS (Part-of-Speech) Tagging

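Assign a Penn Treebank part-of-speech tag (noun, verb, adjective, etc.) to each token. Only the first 50 tokens are tagged here as a sample. The tagger is not perfect: in the output below, the `|` separator is tagged `NNP` (proper noun).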


In [7]:
from nltk import pos_tag

pos_tags = pos_tag(words[:50])  # tag a sample
print("POS Tags for first 50 words:")
print(pos_tags)


POS Tags for first 50 words:
[('Romeo', 'NNP'), ('and', 'CC'), ('Juliet', 'NNP'), ('|', 'NNP'), ('Project', 'NNP'), ('Gutenberg', 'NNP'), ('The', 'DT'), ('Project', 'NNP'), ('Gutenberg', 'NNP'), ('eBook', 'NN'), ('of', 'IN'), ('Romeo', 'NNP'), ('and', 'CC'), ('Juliet', 'NNP'), ('This', 'DT'), ('ebook', 'NN'), ('is', 'VBZ'), ('for', 'IN'), ('the', 'DT'), ('use', 'NN'), ('of', 'IN'), ('anyone', 'NN'), ('anywhere', 'RB'), ('in', 'IN'), ('the', 'DT'), ('United', 'NNP'), ('States', 'NNPS'), ('and', 'CC'), ('most', 'JJS'), ('other', 'JJ'), ('parts', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('world', 'NN'), ('at', 'IN'), ('no', 'DT'), ('cost', 'NN'), ('and', 'CC'), ('with', 'IN'), ('almost', 'RB'), ('no', 'DT'), ('restrictions', 'NNS'), ('whatsoever', 'RB'), ('.', '.'), ('You', 'PRP'), ('may', 'MD'), ('copy', 'VB'), ('it', 'PRP'), (',', ','), ('give', 'VB')]
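
A list of `(token, tag)` pairs is easier to read as a tag distribution. A minimal sketch using `collections.Counter`; the helper `tag_counts` is hypothetical, and the sample pairs are the first seven entries of the output above.

```python
from collections import Counter


def tag_counts(tagged_tokens):
    """Count how often each POS tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)


sample_tags = [('Romeo', 'NNP'), ('and', 'CC'), ('Juliet', 'NNP'), ('|', 'NNP'),
               ('Project', 'NNP'), ('Gutenberg', 'NNP'), ('The', 'DT')]
counts = tag_counts(sample_tags)
for tag, n in counts.most_common():
    print(tag, n)
```

In the notebook, `tag_counts(pos_tags)` would summarize the 50-token sample; mis-tagged punctuation such as `|` is counted under whatever tag it received.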


## 5. Results Summary

- Sentence tokenization completed (3,285 sentences)  
- Word tokenization completed (37,615 tokens, punctuation included)  
- POS tagging completed for the first 50 tokens  
- Output samples are displayed in each step  


## 6. Final Summary

In this notebook, I:

- Downloaded the HTML edition of *Romeo and Juliet* from Project Gutenberg.
- Loaded the HTML file and parsed it with BeautifulSoup.
- Extracted the raw text and created a dataset sample file (`text_sample.txt`).
- Performed sentence tokenization, word tokenization, and POS tagging using NLTK.
- Organized the notebook into documented sections for submission.
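
The cells above do not show how `text_sample.txt` was written. A minimal sketch of one way to do it, assuming the sample is simply the first characters of the extracted text; the helper `save_text_sample` and the `max_chars=2000` default are hypothetical choices, not taken from the notebook.

```python
from pathlib import Path


def save_text_sample(text, path="text_sample.txt", max_chars=2000):
    """Write the first `max_chars` characters of `text` to `path` as UTF-8."""
    out_path = Path(path)
    out_path.write_text(text[:max_chars], encoding="utf-8")
    return out_path


# Usage in the notebook, where `text` is the string extracted with BeautifulSoup:
# save_text_sample(text)
```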
