# Text Preprocessing Assignment: NLTK and spaCy on Raw Review Data

## Introduction
This assignment covers text preprocessing, a core step in Natural Language Processing (NLP). You will use two popular Python libraries, NLTK and spaCy, to clean, transform, and prepare raw review data for further analysis or model building. Applying these techniques well is a fundamental skill for any NLP practitioner.

## Learning Objectives
Upon completion of this assignment, you should be able to:
- Load and inspect raw text data.
- Perform basic text cleaning operations (e.g., lowercasing, removing punctuation, numbers).
- Understand and apply different tokenization techniques using NLTK and spaCy.
- Identify and remove stop words effectively.
- Differentiate between and apply stemming and lemmatization.
- Utilize spaCy for advanced linguistic processing like Part-of-Speech (POS) tagging and Named Entity Recognition (NER).
- Create a custom preprocessing pipeline.
- Compare and contrast the functionalities of NLTK and spaCy for various preprocessing tasks.

## Dataset
For this assignment, you will use a raw review dataset. Use the dataset provided by your instructor; if none is provided, download a suitable one from a platform such as Kaggle. A good choice is a sentiment analysis dataset of product reviews (e.g., Amazon, Yelp, or IMDB reviews).

**Assumption:** For the purpose of this notebook, we will assume you have a CSV file named `reviews.csv` with a column named `text` containing the review content. If your file or column name is different, please adjust the loading code accordingly.

**If you need to download a dataset, consider searching for one of these on Kaggle:**
- "Amazon Product Reviews"
- "Yelp Dataset"
- "IMDB Movie Reviews"

In [None]:
import pandas as pd
import re
import string

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

import spacy

# Download necessary NLTK data (if not already downloaded).
# nltk.data.find raises LookupError when a resource is missing.
# Newer NLTK releases may instead ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng'; download those too if a LookupError names them.
for resource_path, resource_name in [
    ('corpora/stopwords', 'stopwords'),
    ('tokenizers/punkt', 'punkt'),
    ('corpora/wordnet', 'wordnet'),
    ('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger'),
]:
    try:
        nltk.data.find(resource_path)
    except LookupError:
        nltk.download(resource_name)

# Load spaCy model (if not already loaded/downloaded)
try:
    nlp = spacy.load('en_core_web_sm')
except OSError:
    print("Downloading 'en_core_web_sm' spaCy model...")
    spacy.cli.download('en_core_web_sm')
    nlp = spacy.load('en_core_web_sm')


# Load your dataset
try:
    df = pd.read_csv('reviews.csv')
    # Assuming the review text is in a column named 'text'
    # If your column is different, change 'text' below
    reviews = df['text'].astype(str) # Ensure it's string type
    print("Dataset loaded successfully!")
    print(f"Number of reviews: {len(reviews)}")
    print("First 5 reviews:")
    print(reviews.head())
except FileNotFoundError:
    print("Error: 'reviews.csv' not found. Please make sure the dataset is in the same directory or provide the correct path.")
    print("Falling back to a small built-in sample so the rest of the notebook can run.")
    reviews = pd.Series([
        "This product is absolutely amazing! I love it. Highly recommended! #awesome",
        "The quality was terrible. I would never buy this again. It cost $50.",
        "Service was good, but the food was a bit cold. Overall 3/5 stars. 🌟🌟🌟",
        "Great experience with NLTK and spaCy. Learned a lot today at 9 AM!",
        "I bought 2 of these. They arrived late. Contact customer support at 1-800-REV-VIEW."
    ], name='text')
    print("Using dummy data for demonstration.")
    print(reviews)

# Select a sample review for detailed demonstration if the dataset is large
if len(reviews) > 0:
    sample_review = reviews.iloc[0]
    print(f'\nSample Review for detailed steps:\n"{sample_review}"')
else:
    sample_review = "This is a sample review for demonstration purposes."
    print(f'\nNo reviews loaded, using generic sample:\n"{sample_review}"')

## Assignment Questions

### Question 1: Basic Text Cleaning
Write a function `clean_text_basic(text)` that performs the following basic cleaning steps on a given string:
1.  **Convert to Lowercase:** Convert all characters to lowercase.
2.  **Remove Punctuation:** Remove all punctuation marks (e.g., `!`, `.`, `,`, `?`, etc.).
3.  **Remove Numbers:** Remove all numerical digits.
4.  **Remove Extra Whitespace:** Replace multiple spaces with a single space and strip leading/trailing whitespace.

Apply this function to `sample_review`, store the result in `cleaned_sample_review` (later questions refer to it by that name), and print it.
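
The four steps above can be sketched with the standard library alone. This is one possible approach, not the required one: `str.translate` is just a convenient way to delete punctuation, and the sample sentence is borrowed from the dummy data in the setup cell.

```python
import re
import string

def clean_text_basic(text):
    """Lowercase, strip ASCII punctuation and digits, and normalize whitespace."""
    text = text.lower()
    # string.punctuation covers ASCII punctuation only; emoji and other
    # Unicode symbols pass through untouched.
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Remove runs of digits.
    text = re.sub(r'\d+', '', text)
    # Collapse any whitespace run to a single space and trim the ends.
    text = re.sub(r'\s+', ' ', text).strip()
    return text

cleaned = clean_text_basic(
    "The quality was terrible. I would never buy this again. It cost $50."
)
print(cleaned)
# the quality was terrible i would never buy this again it cost
```

Deleting punctuation outright glues neighbouring fragments together: `1-800-REV-VIEW` ends up as `revview`. If that matters for your data, replacing punctuation with a space instead of an empty string is a reasonable variation.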

### Question 2: Tokenization (NLTK vs. spaCy)
Tokenization is the process of breaking down text into individual words or units (tokens).

#### 2.1 NLTK Word Tokenization
Using NLTK's `word_tokenize`, tokenize the `cleaned_sample_review` (output from Q1). Print the first 20 tokens.
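
A small sketch of the call, assuming a cleaned string like the one Question 1 produces (hard-coded here so the block stands alone). It passes `preserve_line=True`, which skips the sentence-splitting step and so avoids needing the `punkt` resource; that is a portability assumption of this sketch. In the notebook, where `punkt` is downloaded in the setup cell, a plain `word_tokenize(cleaned_sample_review)` is what the question asks for, and on punctuation-free text the two should agree.

```python
from nltk.tokenize import word_tokenize

# Hard-coded stand-in for the Question 1 output.
cleaned_sample_review = "the quality was terrible i would never buy this again it cost"

# preserve_line=True: treat the input as a single sentence (no punkt lookup).
nltk_tokens = word_tokenize(cleaned_sample_review, preserve_line=True)
print(nltk_tokens[:20])
```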

#### 2.2 spaCy Tokenization
Using spaCy, tokenize the *original* `sample_review`. For each of the first 20 tokens, print its `text` and `is_punct` attributes.
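
As a rough illustration, the loop below prints `text` and `is_punct` for each token. It uses `spacy.blank("en")`, a tokenizer-only English pipeline that needs no model download; this is an assumption made to keep the sketch standalone, and in the notebook you would use the `nlp` object loaded in the setup cell. The sample text is adapted from the dummy data.

```python
import spacy

# Tokenizer-only English pipeline (no trained components, no download).
nlp_blank = spacy.blank("en")
doc = nlp_blank("The quality was terrible. It cost $50.")

# list(doc)[:20] is safe even when the text has fewer than 20 tokens.
for token in list(doc)[:20]:
    print(f"{token.text!r:12} is_punct={token.is_punct}")
```

Unlike the cleaned NLTK input in 2.1, punctuation is still present here and arrives as separate tokens that you can filter by attribute.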

#### 2.3 Comparison
Briefly discuss the differences you observe in tokenization between NLTK and spaCy based on the `sample_review`. Which one handles punctuation and special characters more intuitively for your use case (review data)?

### Question 3: Stop Word Removal
Stop words are common words (like 'the', 'is', 'a') that often carry little meaning in text analysis and are typically removed.

#### 3.1 NLTK Stop Word Removal
Using NLTK's `stopwords` corpus, remove stop words from the tokens obtained in Question 2.1 (NLTK tokenized, cleaned sample review). Print the tokens after stop word removal.

#### 3.2 spaCy Stop Word Removal
Using spaCy's built-in stop word list, remove stop words from the tokens obtained in Question 2.2 (spaCy tokenized, original sample review). Print the tokens (their `text` attribute) after stop word removal. (Hint: use the `token.is_stop` attribute.)
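
One way this could look, again assuming a tokenizer-only `spacy.blank("en")` pipeline so the sketch runs without a model download. `is_stop` is a lexical attribute, so it is available without trained components; the sample text is adapted from the dummy data.

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("The quality was terrible. I would never buy this again.")

# Keep every token that spaCy does not flag as a stop word.
# Punctuation tokens survive this filter because they are not stop words.
kept = [token.text for token in doc if not token.is_stop]
print(kept)

# The underlying stop word set is exposed on the language defaults.
print(len(nlp_blank.Defaults.stop_words))
```

Because the input is the original text, punctuation tokens remain in `kept`; Question 6 removes them in a separate step.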


#### 3.3 Discussion
Compare the stop word lists and the results between NLTK and spaCy. Are there any notable differences? Which approach do you prefer and why?

### Question 4: Stemming vs. Lemmatization
Stemming reduces words to a root form (stem) by heuristically stripping suffixes, so the result is not always a real word (e.g., "studies" becomes "studi"). Lemmatization reduces words to their dictionary form (lemma), using vocabulary and the word's part of speech.

#### 4.1 NLTK Stemming (Porter Stemmer)
Apply NLTK's `PorterStemmer` to the tokens obtained after stop word removal in Question 3.1. Print the stemmed tokens.
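
A brief sketch of the stemmer on a hand-picked token list. The list is illustrative only (it is not the actual output of Question 3.1), and "running" and "studies" were added to make the suffix stripping visible. `PorterStemmer` needs no downloaded data.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative tokens, not the real Question 3.1 output.
tokens = ["quality", "terrible", "never", "buy", "cost", "running", "studies"]
stemmed = [stemmer.stem(token) for token in tokens]

for original, stem in zip(tokens, stemmed):
    print(f"{original:10} -> {stem}")
```

Stems such as `studi` are not dictionary words, which is the main contrast to look for when you compare against the lemmatizer output.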

#### 4.2 NLTK Lemmatization (WordNetLemmatizer)
Apply NLTK's `WordNetLemmatizer` to the tokens obtained after stop word removal in Question 3.1. Remember to provide the POS tag for better lemmatization (e.g., `'v'` for verbs, `'n'` for nouns). For simplicity, you can initially assume all words are nouns if not explicitly tagged, or try to use a simple POS tagger for more accurate results. Print the lemmatized tokens.
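
If you go the POS-tagger route, the tags from `nltk.pos_tag` are Penn Treebank tags, while `WordNetLemmatizer.lemmatize` expects a single WordNet letter. A small mapping helper bridges the two. The helper name `penn_to_wordnet` is hypothetical, and the tagged pairs below are hand-written stand-ins for tagger output so that the block runs without any NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also the lemmatizer's own default

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output.
tagged = [("quality", "NN"), ("arrived", "VBD"), ("late", "RB"), ("terrible", "JJ")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet(tag))
```

In the notebook, each pair would then go through `WordNetLemmatizer().lemmatize(word, pos=penn_to_wordnet(tag))`.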

#### 4.3 spaCy Lemmatization
Apply spaCy's lemmatization to the tokens obtained after stop word removal in Question 3.2. Print the lemma of each token (using `token.lemma_`).

#### 4.4 Comparison
Compare the results of stemming and lemmatization. Provide examples from your output to illustrate the differences. When would you prefer lemmatization over stemming?

### Question 5: Advanced spaCy Features
spaCy excels at providing richer linguistic annotations.

#### 5.1 Part-of-Speech (POS) Tagging
For the *original* `sample_review`, use spaCy to perform POS tagging. Print each token along with its `text`, `pos_` (coarse-grained POS), and `tag_` (fine-grained POS) attributes.

#### 5.2 Named Entity Recognition (NER)
For the *original* `sample_review`, use spaCy to identify named entities. Print each entity's `text`, `label_` (entity type), and `start_char`, `end_char` attributes.
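
The attribute access for 5.1 and 5.2 might be sketched as below. The helper names `pos_rows` and `entity_rows` are hypothetical, and the block assumes `en_core_web_sm` is installed; if it is not, it prints a hint instead of failing. The sample sentence comes from the dummy data in the setup cell.

```python
import spacy

def pos_rows(doc):
    """Return (text, coarse POS, fine-grained tag) for every token."""
    return [(token.text, token.pos_, token.tag_) for token in doc]

def entity_rows(doc):
    """Return (text, label, start_char, end_char) for every named entity."""
    return [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = None
    print("Model missing; install it with: python -m spacy download en_core_web_sm")

if nlp is not None:
    doc = nlp("Great experience with NLTK and spaCy. Learned a lot today at 9 AM!")
    for row in pos_rows(doc):
        print(row)
    for row in entity_rows(doc):
        print(row)
```

The character offsets index into the original string, so `doc.text[start_char:end_char]` reproduces each entity's text.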

#### 5.3 Discussion
How can POS tagging and NER be beneficial in the context of analyzing customer reviews? Give specific examples.

### Question 6: Custom Preprocessing Pipeline
Create a function `preprocess_review(text)` that combines the following steps:
1.  **Basic Cleaning:** Apply the cleaning steps from Question 1.
2.  **spaCy Processing:** Process the cleaned text with spaCy.
3.  **Tokenization & Lowercasing (if not already):** Extract tokens and ensure they are lowercased (spaCy's `lemma_` can preserve capitalization, for example on proper nouns, so lowercase explicitly).
4.  **Remove Stop Words:** Remove stop words using spaCy's `is_stop` attribute.
5.  **Lemmatization:** Apply lemmatization using spaCy's `lemma_` attribute.
6.  **Remove Punctuation Tokens:** Remove tokens that are purely punctuation (use `token.is_punct`).
7.  **Filter for Alphanumeric Tokens:** Keep only tokens that are alphanumeric and not just whitespace. Note that `token.is_alpha` keeps alphabetic tokens only; use `token.text.isalnum()` or a regex check if you need to retain digits.
8.  **Join Tokens:** Join the processed tokens back into a single string, separated by spaces.

Apply this function to the `sample_review` and print the final preprocessed string.
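
The eight steps can be combined roughly as follows. This is a sketch under stated assumptions, not a reference solution: it takes the spaCy pipeline as an extra `nlp` argument (the assignment's signature is `preprocess_review(text)`, so in the notebook you would rely on the global `nlp` from the setup cell), it requires `en_core_web_sm` for lemmas, and it uses `is_alpha`, which drops any token containing digits.

```python
import re
import string
import spacy

def clean_text_basic(text):
    """Question 1 cleaning: lowercase, drop ASCII punctuation and digits, tidy spaces."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

def preprocess_review(text, nlp):
    """Clean, tokenize, drop stop words and non-alphabetic tokens, lemmatize, join."""
    doc = nlp(clean_text_basic(text))
    lemmas = [
        token.lemma_.lower()
        for token in doc
        if not token.is_stop and not token.is_punct and token.is_alpha
    ]
    return " ".join(lemmas)

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = None
    print("Model missing; install it with: python -m spacy download en_core_web_sm")

if nlp is not None:
    print(preprocess_review(
        "The quality was terrible. I would never buy this again. It cost $50.", nlp
    ))
```

Since the basic cleaning already strips punctuation and digits, the `is_punct` and `is_alpha` checks mostly act as a safety net for characters outside `string.punctuation`, such as emoji.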

### Question 7: Apply to Full Dataset and Analysis
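Apply your `preprocess_review` function (from Question 6) to every review in `reviews` (the `text` column of your dataset; note that `reviews` is a pandas Series). Collect the originals and the results in a DataFrame, with the results in a new column named `processed_text`.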

After processing, display the first 5 rows of the DataFrame showing both the original `text` and the `processed_text` columns. Briefly discuss any challenges or observations you encountered when applying the function to the full dataset.
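
A minimal sketch of the mechanics with pandas. The `preprocess_review` defined here is only a stand-in (it lowercases and normalizes whitespace) so that the block runs on its own; swap in your Question 6 function. The two sample rows come from the dummy data in the setup cell.

```python
import pandas as pd

def preprocess_review(text):
    """Stand-in for the Question 6 function: lowercase and normalize whitespace."""
    return " ".join(text.lower().split())

reviews = pd.Series([
    "This product is absolutely amazing! I love it. Highly recommended! #awesome",
    "The quality was terrible. I would never buy this again. It cost $50.",
], name='text')

# Build a DataFrame so the original and processed text sit side by side.
results = pd.DataFrame({'text': reviews})
results['processed_text'] = results['text'].apply(preprocess_review)
print(results.head())
```

On a large dataset, calling the spaCy pipeline once per row through `apply` can be slow; batching the texts with `nlp.pipe` is a common alternative worth mentioning in your discussion.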

## Submission Guidelines
- Ensure your notebook runs without errors from top to bottom.
- Save your notebook as `your_name_text_preprocessing_assignment.ipynb`.
- Clearly answer all questions and provide explanations where requested.
- Feel free to add markdown cells for additional explanations or observations.