# Laboratory: Natural Language Processing (NLP)

**Topic:** Sentiment Analysis and Named Entity Recognition (NER) using spaCy.

**Educational Objectives:** Understanding the text processing pipeline, feature extraction and term weighting (TF-IDF), interpretability of linear models, and semantic information extraction (NER).

---

## 1. Environment Setup and Libraries

Unlike older approaches based on NLTK, we will use the **spaCy** library, which offers efficient lemmatization and pre-trained language models. **Scikit-learn** will be used for vectorization (building the sparse feature matrix) and statistical modeling.



In [None]:
# Install dependencies (uncomment if necessary)
# !pip install spacy pandas scikit-learn matplotlib seaborn wordcloud
# !python -m spacy download en_core_web_sm

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import re
from wordcloud import WordCloud
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Initialize the spaCy model
# We disable the 'parser' (syntactic dependencies) and 'ner' components, which the
# lemmatization pipeline does not need, to speed up preprocessing for classification
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])


## 2. Data Engineering and Exploratory Data Analysis (EDA)

The IMDB dataset contains 50,000 movie reviews, each labeled as positive or negative. Because NLP operations (NER in particular) are computationally expensive, this laboratory works on a random sample of 2,000 reviews.

In [None]:
# Load data
df = 

# Sample 2000 rows with a random state for reproducibility
df =

# Verify class balance
print(df['sentiment'].value_counts(normalize=True))
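
A minimal sketch of the sampling and class-balance check, assuming a DataFrame with `review` and `sentiment` columns. The toy rows, the name `toy_df`, and the sample size of 4 are placeholders for illustration, not the laboratory's data.

```python
import pandas as pd

# Toy stand-in for the IMDB data (hypothetical rows, for illustration only)
toy_df = pd.DataFrame({
    "review": [f"review text {i}" for i in range(10)],
    "sentiment": ["positive", "negative"] * 5,
})

# random_state fixes the random seed, so the same rows are drawn on every run
sample_a = toy_df.sample(n=4, random_state=42)
sample_b = toy_df.sample(n=4, random_state=42)

# Share of each class in the sample; the values sum to 1.0
class_shares = sample_a["sentiment"].value_counts(normalize=True)
print(class_shares)
```

Because both calls use the same `random_state`, `sample_a` and `sample_b` contain the same rows, which is what makes the experiment reproducible.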

### 2.1 Corpus Visualization (WordCloud)
Analyzing word frequency in the raw text gives a preliminary view of noise in the data (e.g., HTML tags, common stop words).

In [None]:
text_combined = " ".join(review for review in df.review)

wordcloud = WordCloud(width=800, height=400, background_color='white', max_words=100).generate(text_combined)

plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Most Frequent Tokens in Corpus (Raw Data)')
plt.show()
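
The same frequency information can be inspected numerically. The sketch below is an illustration on two made-up reviews (the list `raw_reviews` is hypothetical); it counts raw tokens with the standard library only, so that HTML residue such as `br` becomes visible.

```python
import re
from collections import Counter

# Hypothetical raw reviews containing the kind of HTML residue found in scraped text
raw_reviews = [
    "A great film.<br /><br />The cast is great.",
    "The plot is thin.<br /><br />The ending is weak.",
]

# Lowercase and keep runs of letters; this deliberately does no cleaning
tokens = re.findall(r"[a-z]+", " ".join(raw_reviews).lower())
freq = Counter(tokens)

# The most frequent tokens are dominated by HTML residue and stop words
print(freq.most_common(5))
```

In this toy corpus the tag fragment `br` is the single most frequent token, ahead of `the` and `is`, which motivates the cleaning steps in the next section.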

## 3. Text Processing Pipeline

Effective text classification requires normalization of input data. This process includes:

1.  **Tokenization:** Splitting the character stream into tokens (words, numbers, punctuation marks).
2.  **Lemmatization:** Reducing a word to its base dictionary form (e.g., *saw* $\rightarrow$ *see*, *movies* $\rightarrow$ *movie*). Unlike stemming, which only strips suffixes, lemmatization takes the word's grammatical context (part of speech) into account.
3.  **Filtration:** Removing *stop words* (words with low information value, e.g., *the, is, at*) and punctuation.
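
The tokenization and filtration steps can be sketched without downloading a model. This example assumes spaCy v3 and uses `spacy.blank("en")`, a tokenizer-only pipeline, so it does not lemmatize; with the trained `en_core_web_sm` pipeline, `token.lemma_` would supply the lemma. The sentence and the variable names are illustrative.

```python
import spacy

# spacy.blank("en") builds a tokenizer-only English pipeline; no model download needed.
# It has no lemmatizer, so this sketch covers tokenization and filtration only.
nlp_blank = spacy.blank("en")

doc = nlp_blank("The movie is at times slow, but the cast is superb!")

# Tokenization: punctuation marks become separate tokens
all_tokens = [token.text for token in doc]

# Filtration: drop stop words, punctuation, and whitespace tokens
kept = [
    token.text.lower()
    for token in doc
    if not (token.is_stop or token.is_punct or token.is_space)
]
print(all_tokens)
print(kept)
```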



In [None]:
def preprocess_text(text):
    """
    Executes the full text cleaning pipeline using spaCy.
    Args:
        text (str): Raw review text.
    Returns:
        str: Cleaned text consisting of lemmas.
    """
    # 1. Remove HTML tags (common artifact in IMDB)
    
    # 2. Process via spaCy model
    
    # 3. Extract lemmas for tokens that are not stop words, punctuation, or whitespace

    return 

print("Starting pipeline processing (this may take a few moments)...")
# Apply function to the dataframe
df['clean_review'] = 
df[['review', 'clean_review']].head(3)
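
Step 1 of the pipeline can be sketched with the standard `re` module. The helper name `strip_html` and the regular expression are assumptions for illustration; a regex is a rough heuristic that suffices for simple tags such as `<br />`, not a full HTML parser.

```python
import re

def strip_html(text: str) -> str:
    """Replace HTML tags with a space and collapse repeated whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

cleaned = strip_html("Loved it.<br /><br />Would watch again.")
print(cleaned)  # Loved it. Would watch again.
```

Replacing each tag with a space, rather than deleting it, keeps the words on either side of a `<br />` from being glued together.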

## 4. Feature Extraction (Vectorization)

Machine learning algorithms require a numerical representation of the data. We will use the **TF-IDF** (Term Frequency-Inverse Document Frequency) method, which weights each term by how often it occurs in a document, discounted by how many documents in the corpus contain it.

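The TF-IDF weight of term $t$ in document $d$ is:
$$ \text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t) $$
where
$$ \text{idf}(t) = \log \frac{N}{|\{d \in D : t \in d\}|} $$
and $N = |D|$ is the number of documents in the corpus $D$. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed variant, $\text{idf}(t) = \ln \frac{1 + N}{1 + \text{df}(t)} + 1$, and L2-normalizes each document vector, so its values differ from this textbook formula.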
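
As a worked example of the textbook formula, the sketch below computes TF-IDF by hand on a three-document toy corpus. It assumes raw counts for $\text{tf}$ and the natural logarithm for $\log$; the corpus and function names are illustrative.

```python
import math

# Toy corpus: each document is a list of already-cleaned tokens (hypothetical data)
corpus = [
    ["good", "movie"],
    ["bad", "movie"],
    ["good", "good", "plot"],
]
N = len(corpus)

def tf(term, doc):
    """Raw count of the term in one document."""
    return doc.count(term)

def idf(term):
    """log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(tf_idf("good", corpus[2]))  # tf = 2, df = 2, so 2 * ln(3/2)
print(tf_idf("plot", corpus[2]))  # tf = 1, df = 1, so ln(3)
```

Although *good* occurs twice in the third document, *plot* receives the higher weight because it appears in only one document of the corpus.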

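To preserve some local context (e.g., negations like "not good"), we will employ **N-grams** (unigrams + bigrams). Keep in mind that spaCy's default English stop-word list includes *not*, so the filtration step in Section 3 can remove negations before they reach the vectorizer.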

In [None]:
# Vectorizer Configuration
# ngram_range=(1, 2): includes single words and word pairs
# max_features=5000: keeps only the 5000 terms with the highest frequency across the corpus
tfidf = 

# Transform text corpus to sparse matrix
X =
y = 

print(f"Feature Matrix X dimensions: {X.shape}")

## 5. Model Classification and Evaluation

We will use **Logistic Regression** as the baseline classifier. It is a linear model that performs well in high-dimensional NLP tasks and offers high result interpretability.

In [None]:
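# Split into training and test sets (80/20); pass the row indices as a third array
# so that idx_train / idx_test can map each row back to the original DataFrame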
X_train, X_test, y_train, y_test, idx_train, idx_test = 

# Train model
model = 
# Predict on test set
y_pred = 

# Classification metrics

# Confusion Matrix Visualization
cm = confusion_matrix(y_test, y_pred)
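
The split, training, and evaluation steps can be sketched on toy data as follows. This is an illustration under stated assumptions, not the laboratory solution: the corpus, `row_ids` (standing in for the DataFrame index), `stratify`, and `max_iter=1000` are all choices made for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 10 positive (1) and 10 negative (0) "reviews"
texts = ["great film"] * 10 + ["awful film"] * 10
labels = np.array([1] * 10 + [0] * 10)
row_ids = np.arange(len(texts))  # stands in for df.index

X_demo = TfidfVectorizer().fit_transform(texts)

# Passing three arrays returns six outputs: a train/test pair for each input
X_tr, X_te, y_tr, y_te, id_tr, id_te = train_test_split(
    X_demo, labels, row_ids, test_size=0.2, random_state=42, stratify=labels
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)

print(classification_report(y_te, pred))
cm_demo = confusion_matrix(y_te, pred)  # rows: true class, columns: predicted class
print(cm_demo)
```

Carrying the row indices through the split makes it possible to look up the original text of any misclassified test example later.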


### 5.1 Model Interpretability (Feature Importance)
Analyzing the weights assigned by the model to specific features (words/bigrams) allows us to understand which phrases determine sentiment.

In [None]:
feature_names = 
coefficients = 
coef_df = pd.DataFrame({'feature': feature_names, 'coef': coefficients})

# Sort features and visualize the top features for both classes
top_positive = coef_df.sort_values(by='coef', ascending=False).head(10)
top_negative = coef_df.sort_values(by='coef', ascending=True).head(10)


## 6. Named Entity Recognition (NER)

**Named Entity Recognition** is the task of identifying and classifying named entities in text into predefined categories, such as persons (PERSON), organizations (ORG), or geopolitical entities like countries and cities (GPE).

In the context of movie review analysis, NER allows for answering business questions, such as:
* Which actors or directors are discussed most frequently?
* Which studios appear in a negative context?

We will utilize the built-in `ner` component from spaCy.

In [None]:
# Reload the model with the NER component enabled
# (We previously disabled it for preprocessing efficiency)
nlp_ner = spacy.load("en_core_web_sm")

def extract_entities(texts, label_filter=None):
    """
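    Extracts named entities from a list of texts.
    Uses nlp_ner.pipe for efficient batch processing.
    Args:
        texts: Iterable of raw review texts.
        label_filter: Entity labels to keep (e.g., ['PERSON']); None keeps all labels.
    """
    entities = []
    # nlp_ner.pipe processes texts as a stream, which is significantly faster than
    # calling the model on each text in a loop; use batch_size=50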

    return entities

print("Extracting PERSON entities from the entire dataset...")
# Filter only persons (PERSON) - actors, directors, characters
people_entities = extract_entities(df['review'], label_filter=['PERSON'])

# Convert to DataFrame for easy analysis
people_df = pd.DataFrame(people_entities, columns=['Person'])
top_people = people_df['Person'].value_counts().head(15)

# Visualize the most frequently mentioned people
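
The batch-extraction pattern can be sketched as follows, assuming spaCy v3. To keep the example runnable without a model download, a blank pipeline with a rule-based `entity_ruler` stands in for the statistical `ner` component; this is a swapped-in technique for illustration only. The function name `extract_entities_demo`, its extra `nlp_pipeline` parameter, and the patterns are hypothetical.

```python
import spacy

def extract_entities_demo(nlp_pipeline, texts, label_filter=None):
    """Collect entity texts from `texts`, optionally keeping only some labels."""
    found = []
    # pipe() yields one Doc per text and processes the texts in batches
    for doc in nlp_pipeline.pipe(texts, batch_size=50):
        for ent in doc.ents:
            if label_filter is None or ent.label_ in label_filter:
                found.append(ent.text)
    return found

# Stand-in pipeline: a blank English tokenizer plus a rule-based entity ruler.
# This replaces the statistical `ner` component only so the sketch runs offline.
demo_nlp = spacy.blank("en")
ruler = demo_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Tom Hanks"},
    {"label": "ORG", "pattern": "Pixar"},
])

demo_reviews = [
    "Tom Hanks carries the whole film.",
    "Pixar delivers again, and Tom Hanks is superb.",
]
persons = extract_entities_demo(demo_nlp, demo_reviews, label_filter=["PERSON"])
print(persons)
```

With a trained pipeline such as `en_core_web_sm`, the same loop works unchanged: `doc.ents` is then filled by the statistical model instead of the hand-written patterns.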


### Task 6.1: Cross-Analysis: Sentiment vs. Entities
Extract persons (`PERSON`) separately for positive and negative reviews to determine if specific names correlate with specific sentiments.
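
One possible way to organize the counting step is sketched below using only the standard library. The input pairs and the placeholder names are made up; in the laboratory they would come from running entity extraction on the positive and negative subsets.

```python
from collections import Counter, defaultdict

# Hypothetical (sentiment, persons found in that review) pairs
extracted = [
    ("positive", ["Actor A", "Actor B"]),
    ("positive", ["Actor A"]),
    ("negative", ["Actor A"]),
    ("negative", ["Director C", "Director C"]),
]

# One Counter per sentiment class, created on first use
counts_by_sentiment = defaultdict(Counter)
for sentiment, persons_in_review in extracted:
    counts_by_sentiment[sentiment].update(persons_in_review)

for sentiment, counts in counts_by_sentiment.items():
    print(sentiment, counts.most_common(2))
```

Comparing the two counters side by side shows which names are concentrated in one class and which, like the placeholder `Actor A`, appear in both.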