# **Gen-AI Bootcamp 24**
## **Natural Language Processing Course Assignment**

### **Task 1: Text Collection and Loading**


**Domain Chosen:** Top Rated TV Shows

**Dataset Link:** https://www.kaggle.com/datasets/titassaha/top-rated-tv-shows


### **Loading the Dataset**

In [9]:
# Load the dataset of top-rated TV shows

import pandas as pd

df = pd.read_csv("data_TV.csv")

print(df.shape)
df.head()  # call the method; a bare df.head only returns the bound method object

(2617, 8)


     first_air_date origin_country original_language               name  \
0        2021-09-03             US                en  The D'Amelio Show   
1        2008-01-20             US                en       Breaking Bad   
2        2021-11-06             US                en             Arcane   
3        2013-12-02             US                en     Rick and Morty   
4        2022-04-14             US                en    The Kardashians   

      popularity  vote_average  vote_count  \
0         30.104       
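Before moving on, a few quick checks help confirm what a loaded frame contains. The sketch below is illustrative only: because data_TV.csv may not be at hand, it rebuilds a hypothetical five-row `sample` frame from the first rows printed above (four of the eight columns), and the variable names are assumptions rather than part of the original notebook.

```python
import pandas as pd

# Hypothetical sample: the first five rows printed above, limited to four columns.
sample = pd.DataFrame(
    {
        "first_air_date": ["2021-09-03", "2008-01-20", "2021-11-06",
                           "2013-12-02", "2022-04-14"],
        "origin_country": ["US"] * 5,
        "original_language": ["en"] * 5,
        "name": ["The D'Amelio Show", "Breaking Bad", "Arcane",
                 "Rick and Morty", "The Kardashians"],
    }
)

# Parse the date strings so they can be sorted and compared as dates.
sample["first_air_date"] = pd.to_datetime(sample["first_air_date"])

# Basic structure: (rows, columns) and the type of each column.
print(sample.shape)
print(sample.dtypes)

# Count missing values per column and the number of shows per language.
missing_per_column = sample.isna().sum()
language_counts = sample["original_language"].value_counts()
print(missing_per_column)
print(language_counts)

# Name of the show with the earliest air date in the sample.
oldest = sample.sort_values("first_air_date").iloc[0]["name"]
print(oldest)
```

The same calls should work on the full `df` once the CSV has been read, assuming the column names match those in the printed output.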

## **Task 2: Text Preprocessing**
**Objective:** Gain hands-on experience with text preprocessing techniques.


**Corpus Chosen:** Gutenberg Corpus (austen-emma.txt)


### Tokenization

In [4]:
import spacy
from nltk.corpus import gutenberg

# Load SpaCy's English model
nlp = spacy.load("en_core_web_sm")

# Load the text from the Gutenberg corpus
gutenberg_text = gutenberg.raw('austen-emma.txt')
print(gutenberg_text[:1000])  # Print the first 1000 characters for brevity

# Process the text with SpaCy
doc = nlp(gutenberg_text)

# Split the text into sentences
sentences = list(doc.sents)
print("\nSentences:", [sentence.text for sentence in sentences[:10]])  # Print the first 10 sentences for brevity

# Split the text into words
words = [token.text for token in doc]
print("\nWords:", words[:100])  # Print the first 100 words for brevity


[Emma by Jane Austen 1816]

VOLUME I

CHAPTER I


Emma Woodhouse, handsome, clever, and rich, with a comfortable home
and happy disposition, seemed to unite some of the best blessings
of existence; and had lived nearly twenty-one years in the world
with very little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister's marriage,
been mistress of his house from a very early period.  Her mother
had died too long ago for her to have more than an indistinct
remembrance of her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse's family,
less as a governess than a friend, very fond of both daughters,
but particularly of Emma.  Between _them_ it was more the intimacy
of sisters.  Even before Miss Taylor had ceased to hold the nominal
office of governess, the mildness o
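The token lists produced from this text include newline tokens such as '\n' and '\n\n', because SpaCy keeps whitespace runs as tokens. A minimal sketch of how they could be filtered out follows. It swaps in a blank English pipeline with the rule-based `sentencizer` component instead of `en_core_web_sm`, so that it runs without a model download; the sentence boundaries it finds may therefore differ from those in the cell above, and the short `text` is a made-up excerpt.

```python
import spacy

# Assumption: a blank pipeline with a rule-based sentence splitter,
# not the en_core_web_sm model used in the notebook.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = ("Emma Woodhouse was handsome, clever, and rich.\n"
        "She was the youngest of two daughters.")
doc = nlp(text)

# Sentence tokenization: strip stray newlines from each sentence span.
sentences = [sent.text.strip() for sent in doc.sents]

# Word tokenization: drop tokens that consist only of whitespace.
words = [token.text for token in doc if not token.is_space]

print(sentences)
print(words)
```

Filtering on `token.is_space` keeps punctuation tokens; adding `and not token.is_punct` to the condition would drop those as well.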


### Stemming

In [5]:
import nltk
from nltk.corpus import gutenberg
from nltk.stem import PorterStemmer
import spacy

# Load SpaCy's English model
nlp = spacy.load("en_core_web_sm")

# Load the text from the Gutenberg corpus
gutenberg_text = gutenberg.raw('austen-emma.txt')

# Process the text with SpaCy
doc = nlp(gutenberg_text)  # Process the full text (no truncation is applied)

# Tokenize the text with SpaCy (punctuation and whitespace are kept as tokens)
tokens = [token.text for token in doc]

# Initialize the Porter Stemmer
porter_stemmer = PorterStemmer()

# Apply stemming to each word
stemmed_tokens = [porter_stemmer.stem(word) for word in tokens]

# Print the first 100 tokens and the first 100 stemmed tokens for comparison
print("Original Tokens:", tokens[:100])
print("\nStemmed Tokens:", stemmed_tokens[:100])


Original Tokens: ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', '\n\n', 'VOLUME', 'I', '\n\n', 'CHAPTER', 'I', '\n\n\n', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', '\n', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', '\n', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', '\n', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.', '\n\n', 'She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', ',', '\n', 'indulgent', 'father', ';', 'and', 'had', ',', 'in', 'consequence', 'of', 'her', 'sister', "'s", 'marriage', ',', '\n', 'been', 'mistress', 'of', 'his']

Stemmed Tokens: ['[', 'emma', 'by', 'jane', 'austen', '1816', ']', '\n\n', 'volum', 'i', '\n\n', 'chapter', 'i', '\n\n\n', 'emma', 'woodhous', ',', 'handsom', ',', 'clever', ',', 'and', '
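To make the effect of stemming easier to see than in a 100-token dump, the sketch below applies NLTK's `PorterStemmer` to a handful of words taken from the output of this notebook. The word list is a hand-picked illustration, not part of the original cell.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words from the Emma excerpt; stems are not always dictionary words.
words = ["VOLUME", "handsome", "happy", "blessings", "was", "his"]

# Map each word to its Porter stem (the stemmer lowercases by default).
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>10} -> {stem}")
```

Stems such as "happi", "wa", and "hi" show that the Porter algorithm strips suffixes by rule without checking whether the result is a real word, which is the main contrast with lemmatization.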


### Lemmatization

In [8]:
# Apply lemmatization to each word using SpaCy
lemmatized_tokens = [token.lemma_ for token in doc]
print("\nStemmed Tokens:", stemmed_tokens[:100])
print("\nLemmatized Tokens:", lemmatized_tokens[:100])


Stemmed Tokens: ['[', 'emma', 'by', 'jane', 'austen', '1816', ']', '\n\n', 'volum', 'i', '\n\n', 'chapter', 'i', '\n\n\n', 'emma', 'woodhous', ',', 'handsom', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfort', 'home', '\n', 'and', 'happi', 'disposit', ',', 'seem', 'to', 'unit', 'some', 'of', 'the', 'best', 'bless', '\n', 'of', 'exist', ';', 'and', 'had', 'live', 'nearli', 'twenti', '-', 'one', 'year', 'in', 'the', 'world', '\n', 'with', 'veri', 'littl', 'to', 'distress', 'or', 'vex', 'her', '.', '\n\n', 'she', 'wa', 'the', 'youngest', 'of', 'the', 'two', 'daughter', 'of', 'a', 'most', 'affection', ',', '\n', 'indulg', 'father', ';', 'and', 'had', ',', 'in', 'consequ', 'of', 'her', 'sister', "'s", 'marriag', ',', '\n', 'been', 'mistress', 'of', 'hi']

Lemmatized Tokens: ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', '\n\n', 'volume', 'I', '\n\n', 'chapter', 'I', '\n\n\n', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfor


### Stop Word Removal

In [7]:
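# Remove stop words and non-alphabetic tokens (punctuation, numbers, whitespace)
from spacy.lang.en.stop_words import STOP_WORDS

stop_words = STOP_WORDS
filtered_tokens = [token for token in lemmatized_tokens if token.lower() not in stop_words and token.isalpha()]
# Compare the tokens before and after filtering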
print("Original Tokens:", tokens[:100])
print("\nLemmatized Tokens:", lemmatized_tokens[:100])
print("\nFiltered Tokens (Stop Words Removed):", filtered_tokens[:100])

Original Tokens: ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', '\n\n', 'VOLUME', 'I', '\n\n', 'CHAPTER', 'I', '\n\n\n', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'and', 'rich', ',', 'with', 'a', 'comfortable', 'home', '\n', 'and', 'happy', 'disposition', ',', 'seemed', 'to', 'unite', 'some', 'of', 'the', 'best', 'blessings', '\n', 'of', 'existence', ';', 'and', 'had', 'lived', 'nearly', 'twenty', '-', 'one', 'years', 'in', 'the', 'world', '\n', 'with', 'very', 'little', 'to', 'distress', 'or', 'vex', 'her', '.', '\n\n', 'She', 'was', 'the', 'youngest', 'of', 'the', 'two', 'daughters', 'of', 'a', 'most', 'affectionate', ',', '\n', 'indulgent', 'father', ';', 'and', 'had', ',', 'in', 'consequence', 'of', 'her', 'sister', "'s", 'marriage', ',', '\n', 'been', 'mistress', 'of', 'his']

Lemmatized Tokens: ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', '\n\n', 'volume', 'I', '\n\n', 'chapter', 'I', '\n\n\n', 'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever', ',', 'a
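The filtering step can also be tried in isolation on a short list. The sketch below is a small illustration under stated assumptions: `remove_stop_words` is a hypothetical helper name, and `lemmas` is a hand-copied slice of the lemmatized output above. It imports SpaCy's English stop word set directly, so no model is needed.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Hand-copied slice of the lemmatized tokens shown above.
lemmas = ["Emma", "Woodhouse", ",", "handsome", ",", "clever", ",", "and",
          "rich", ",", "with", "a", "comfortable", "home", "\n"]


def remove_stop_words(tokens):
    """Keep alphabetic tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.isalpha() and t.lower() not in STOP_WORDS]


content_words = remove_stop_words(lemmas)
print(content_words)
```

The `isalpha()` check removes punctuation and newline tokens, while the stop word lookup removes function words such as "and", "with", and "a".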