### Q1. What is NLP and why is it important in today's AI ecosystem?
**Write your answer in 4–5 lines with at least one real-world application example.**

Natural Language Processing (NLP) is a branch of artificial intelligence that enables machines to understand, interpret, and generate human language. It bridges the gap between human communication and computer understanding. NLP is crucial in today's AI ecosystem because it powers applications like chatbots, voice assistants, language translation, and sentiment analysis. For example, Google Translate uses neural machine translation, an NLP technique, to convert text from one language to another.

### Q2. Differentiate between the following:
**Tokenization vs Sentence Segmentation**
| Concept                   | Description                                                                                                                       |
| ------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| **Tokenization**          | Breaking text into smaller units like words or phrases (tokens). Example: `"I love NLP"` → `['I', 'love', 'NLP']`.                |
| **Sentence Segmentation** | Splitting a paragraph or document into individual sentences. Example: `"I love NLP. It's fun."` → `["I love NLP.", "It's fun."]`. |
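To make the distinction concrete, here is a minimal sketch that uses only regular expressions. The helper names `split_sentences` and `split_words` are hypothetical, and the rules are deliberately naive; NLTK's `sent_tokenize` and `word_tokenize` handle abbreviations, contractions, and many other cases that this sketch ignores.

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    """Naive sentence segmentation: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(text: str) -> List[str]:
    """Naive word tokenization: words (keeping internal apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

text = "I love NLP. It's fun."
sentences = split_sentences(text)
words = split_words(text)

print(sentences)  # ['I love NLP.', "It's fun."]
print(words)      # ['I', 'love', 'NLP', '.', "It's", 'fun', '.']
```

Note that this sketch keeps `"It's"` as one token, whereas NLTK's `word_tokenize` splits contractions into separate tokens.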

**Stemming vs Lemmatization**
| Concept           | Description                                                                                                                                                                       |
| ----------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Stemming**      | Reduces words to a root form by stripping suffixes with heuristic rules, which often yields non-words. Examples: `"running"` → `"run"`, `"studies"` → `"studi"`.                  |
| **Lemmatization** | Converts words to their dictionary (lemma) form using vocabulary and part of speech. Example: `"better"` → `"good"` when the word is tagged as an adjective.                      |
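The contrast can be illustrated with a toy sketch. Both functions below are hypothetical stand-ins: `toy_stem` is a crude suffix stripper (not the Porter algorithm), and `toy_lemmatize` looks words up in a tiny hand-written dictionary where a real lemmatizer would consult a full lexicon such as WordNet.

```python
# Hypothetical, tiny lemma dictionary used only for illustration.
LEMMAS = {"better": "good", "geese": "goose", "studies": "study", "running": "run"}

def toy_stem(word: str) -> str:
    """Rule-based suffix stripping: no dictionary, so the result may be a non-word."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word: str) -> str:
    """Dictionary lookup: returns a real word, or the input unchanged if it is unknown."""
    return LEMMAS.get(word, word)

results = {w: (toy_stem(w), toy_lemmatize(w)) for w in ["running", "studies", "better", "geese"]}

for word, (stem, lemma) in results.items():
    print(f"{word:<10}{stem:<10}{lemma}")
```

The stemmer produces fragments such as `runn` and `studi` and cannot relate `better` to `good`, while the lookup returns valid words but only for entries it knows about.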


**Stopwords removal vs Punctuation removal**
| Concept                 | Description                                                                            |
| ----------------------- | -------------------------------------------------------------------------------------- |
| **Stopwords Removal**   | Removes common words that carry little meaning in analysis (e.g., "the", "is", "and"). |
| **Punctuation Removal** | Deletes punctuation marks like commas, periods, and question marks from the text.      |
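A short sketch shows that the two steps act on different things and that their order matters. The `STOPWORDS` set here is a hypothetical, deliberately tiny list; NLTK's English stopword list is much longer.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"the", "is", "and", "a", "of"}

def remove_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_stopwords(text: str) -> str:
    """Drop whitespace-separated words that appear in STOPWORDS (case-insensitive)."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

sample = "The plot is thin, and the ending is weak!"
no_punct = remove_punctuation(sample)
no_stop = remove_stopwords(sample)
both = remove_stopwords(remove_punctuation(sample))

print(no_punct)  # The plot is thin and the ending is weak
print(no_stop)   # plot thin, ending weak!
print(both)      # plot thin ending weak
```

Removing punctuation first is usually the safer order when splitting on whitespace, because a word with attached punctuation (for example `"is,"`) would otherwise not match the stopword list.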


### Q3. List any 5 real-world applications of NLP and specify what NLP task is performed in each.
**Example: Chatbot – Named Entity Recognition (NER), Sentiment Analysis – Classification**
| Application                                      | NLP Task Performed                               |
| ------------------------------------------------ | ------------------------------------------------ |
| **Chatbot (e.g., Customer Support)**             | Named Entity Recognition (NER), Intent Detection |
| **Sentiment Analysis (e.g., Product Reviews)**   | Text Classification                              |
| **Machine Translation (e.g., Google Translate)** | Sequence-to-Sequence Modeling                    |
| **Voice Assistants (e.g., Alexa, Siri)**         | Speech Recognition, Intent Recognition           |
| **Spam Detection (e.g., Email Filters)**         | Text Classification                              |


In [8]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to C:\Users\Hasti
[nltk_data]     Gohel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [12]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Hasti
[nltk_data]     Gohel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [16]:
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to C:\Users\Hasti
[nltk_data]     Gohel\AppData\Roaming\nltk_data...
[nltk_data] Downloading package omw-1.4 to C:\Users\Hasti
[nltk_data]     Gohel\AppData\Roaming\nltk_data...


True

In [18]:
import re
import string
import warnings

import nltk
import pandas as pd
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

warnings.filterwarnings("ignore")

### Q4. Load the IMDb dataset and display the first 5 rows. Check for missing values and remove them if any.

In [3]:
df = pd.read_csv("IMDB Dataset.csv")

print("First 5 rows of the dataset:")
print(df.head())

print("\nMissing values in each column:")
print(df.isnull().sum())

# Neither column contains null values, so no rows need to be dropped.

First 5 rows of the dataset:
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive

Missing values in each column:
review       0
sentiment    0
dtype: int64


### Q5. Perform the following text preprocessing steps on the “review” column:
- Convert all text to lowercase
- Remove HTML tags
- Remove punctuation
- Remove numerical digits
- Display sample output for any 2 rows

In [7]:
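def preprocess_text(text):
    # Lowercase first so later steps see uniform text
    text = text.lower()
    # Strip HTML; the space separator keeps words on either side of <br /> tags apart
    text = BeautifulSoup(text, "html.parser").get_text(separator=" ")
    # Remove ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove digits
    text = re.sub(r'\d+', '', text)
    # Collapse the extra whitespace left behind by the removals
    return ' '.join(text.split())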

df['clean_review'] = df['review'].apply(preprocess_text)

print("Original vs Cleaned Reviews (First 2 Rows):\n")
for i in range(2):
    print(f"Original: {df['review'][i]}\n")
    print(f"Cleaned : {df['clean_review'][i]}\n")
    print("="*80)

Original vs Cleaned Reviews (First 2 Rows):

Original: One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I 
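One detail worth checking when stripping tags: this review uses `<br /><br />` with no surrounding spaces, so how the tags are replaced affects the words on either side. The sketch below uses a simple regular expression in place of BeautifulSoup, and the helper names are hypothetical; the snippet is taken from the first review.

```python
import re
import string

TAG_RE = re.compile(r"<[^>]+>")

def strip_tags(text: str, replacement: str) -> str:
    """Replace every HTML tag with the given replacement string."""
    return TAG_RE.sub(replacement, text)

def drop_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character."""
    return text.translate(str.maketrans("", "", string.punctuation))

snippet = "exactly what happened with me.<br /><br />The first thing"

# Tags deleted outright: "me." and "The" end up glued together.
glued = drop_punctuation(strip_tags(snippet, ""))

# Tags replaced by a space, then runs of whitespace collapsed.
spaced = " ".join(drop_punctuation(strip_tags(snippet, " ")).split())

print(glued)   # exactly what happened with meThe first thing
print(spaced)  # exactly what happened with me The first thing
```

With BeautifulSoup, passing `separator=" "` to `get_text` should have the same effect as the second variant.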

### Q6. Perform Tokenization on the reviews:
- Sentence tokenization (first 2 reviews)
- Word tokenization (first 2 reviews)
- Print the results clearly

In [10]:
reviews = df['review'].iloc[:2]

for i, review in enumerate(reviews):
    print(f"\n--- Review {i+1} ---")
    
    sentences = sent_tokenize(review)
    print("Sentence Tokenization:")
    for sent in sentences:
        print(f"• {sent}")

    words = word_tokenize(review)
    print("\nWord Tokenization:")
    print(words)


--- Review 1 ---
Sentence Tokenization:
• One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked.
• They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO.
• Trust me, this is not a show for the faint hearted or timid.
• This show pulls no punches with regards to drugs, sex or violence.
• Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary.
• It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda.
• Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I 

### Q7. Remove Stopwords:
- Use NLTK’s stopwords
- Show how the review looks before and after stopwords removal (1 example)

In [13]:
review = df['review'][0]
tokens = word_tokenize(review)


stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
filtered_review = ' '.join(filtered_tokens)

print("🔹 Original Review:\n")
print(review)

print("\n🔹 After Stopwords Removal:\n")
print(filtered_review)

🔹 Original Review:

One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the sh

### Q8. Apply Stemming using Porter Stemmer on first 5 reviews.
- Show before and after for each review.

In [15]:
stemmer = PorterStemmer()

for i in range(5):
    original = df['review'][i]
    
    # Tokenize
    tokens = word_tokenize(original)
    
    # Apply stemming
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    stemmed_review = ' '.join(stemmed_tokens)
    
    # Show before and after
    print(f"\n🔹 Review {i+1} - Before:\n{original}\n")
    print(f"🔸 Review {i+1} - After Stemming:\n{stemmed_review}\n")
    print("="*100)


🔹 Review 1 - Before:
One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the 

### Q9. Apply Lemmatization using SpaCy or WordNetLemmatizer.
- Show differences in results compared to stemming.
- Use 5–10 common words as examples.

In [19]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["running", "better", "wolves", "eating", "acting", "studies", "flies", "happier", "fought", "geese"]

print(f"{'Word':<12}{'Stemmed':<15}{'Lemmatized':<15}")
print("-" * 42)

for word in words:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word)
    print(f"{word:<12}{stemmed:<15}{lemmatized:<15}")

Word        Stemmed        Lemmatized     
------------------------------------------
running     run            running        
better      better         better         
wolves      wolv           wolf           
eating      eat            eating         
acting      act            acting         
studies     studi          study          
flies       fli            fly            
happier     happier        happier        
fought      fought         fought         
geese       gees           goose          
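
In the table above, the lemmatizer leaves words such as "running" and "better" unchanged. This is because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is supplied. A common workaround is to map part-of-speech tags to WordNet's single-letter codes before lemmatizing. The helper below is a hypothetical sketch of that mapping, assuming Penn Treebank style tags such as those produced by `nltk.pos_tag`; it does not call NLTK itself.

```python
def penn_to_wordnet(tag: str) -> str:
    """Map a Penn Treebank POS tag to the single-letter POS code WordNet expects."""
    if tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if tag.startswith("R"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # default to noun, matching the lemmatizer's own default

mapped = {tag: penn_to_wordnet(tag) for tag in ["VBG", "JJR", "NNS", "RB", "CD"]}
print(mapped)
```

With NLTK available, the result could be passed along as `lemmatizer.lemmatize("running", pos=penn_to_wordnet("VBG"))`, which gives "run", and `lemmatizer.lemmatize("better", pos="a")` gives "good".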


### Q10. BONUS (Optional but Recommended):
- Create a function preprocess_text(text) which combines all the steps above and applies to the entire dataset. Then, display the first 5 processed reviews.

In [20]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # 1. Lowercase
    text = text.lower()
    
    # 2. Remove HTML tags (replace with a space so words around <br /> stay separate)
    text = re.sub(r'<.*?>', ' ', text)
    
    # 3. Remove punctuation and digits
    text = re.sub(r'[^\w\s]', '', text)     # Remove punctuation
    text = re.sub(r'\d+', '', text)         # Remove numbers
    
    # 4. Tokenization
    tokens = word_tokenize(text)
    
    # 5. Remove stopwords
    tokens = [word for word in tokens if word not in stop_words]
    
    # 6. Lemmatization
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # 7. Reconstruct cleaned text
    return ' '.join(tokens)

df['clean_review'] = df['review'].apply(preprocess_text)
print(df[['review', 'clean_review']].head())

                                              review  \
0  One of the other reviewers has mentioned that ...   
1  A wonderful little production. <br /><br />The...   
2  I thought this was a wonderful way to spend ti...   
3  Basically there's a family where a little boy ...   
4  Petter Mattei's "Love in the Time of Money" is...   

                                        clean_review  
0  one reviewer mentioned watching oz episode you...  
1  wonderful little production filming technique ...  
2  thought wonderful way spend time hot summer we...  
3  basically there family little boy jake think t...  
4  petter matteis love time money visually stunni...  
