### **Sentiment Analysis Project using spaCy**

In this project, I build a sentiment analysis pipeline with spaCy and evaluate it on IMDB movie reviews.


In [22]:
# Install required libraries
!pip install spacy
!pip install pandas
# The PyPI package is named scikit-learn; "sklearn" is a deprecated placeholder that fails to install
!pip install scikit-learn
!pip install datasets


In [23]:
#Importing necessary packages
import spacy
import re
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from datasets import load_dataset  # Hugging Face Datasets

## **Exploring Open-Source Data**

We will use the IMDB dataset from Hugging Face's datasets library.
It contains movie reviews labeled as **positive** or **negative**.

In [24]:
# Load IMDB Dataset
imdb_data = load_dataset("imdb")
print(imdb_data)

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})


In [25]:
# Convert to DataFrame for easier handling
df = pd.DataFrame(imdb_data['train'])
print(df.head())

                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0


## **Data Preprocessing**

In [26]:
# Text cleaning: remove every character that is not a letter or whitespace, then lowercase and trim
def clean_text(text):
  text = re.sub(r"[^a-zA-Z\s]", "", text)
  text = text.lower().strip()
  return text

In [27]:
df['cleaned_text'] = df['text'].apply(clean_text)
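
To make the effect of `clean_text` concrete, here is a small sketch on a made-up review. Raw IMDB reviews often include HTML line breaks such as `<br />`; because the regex only deletes the punctuation, the letters `br` survive and can fuse with neighbouring words. The variant `clean_text_no_html` is a hypothetical alternative, not part of the notebook, that replaces tags with a space first.

```python
import re

def clean_text(text):
    # Same logic as the notebook's clean_text
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return text.lower().strip()

def clean_text_no_html(text):
    # Hypothetical variant: replace HTML tags with a space before removing other characters
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse runs of whitespace left behind by the removals
    text = re.sub(r"\s+", " ", text)
    return text.lower().strip()

# Made-up review text, for illustration only
sample = "Great film!<br /><br />10/10, would watch again."

print(repr(clean_text(sample)))          # 'great filmbr br  would watch again'
print(repr(clean_text_no_html(sample)))  # 'great film would watch again'
```

Whether the extra step helps accuracy is not something this notebook measures; the reported results use the original `clean_text`.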

In [28]:
# Sample 60% of the training data, then split the sample into training and test sets
df_sampled = df.sample(frac=0.6, random_state=42)  # 60% of 25,000 rows = 15,000 reviews
X = df_sampled['cleaned_text']
y = df_sampled['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
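
The sizes that result from the sampling and the split can be checked with a little arithmetic. This sketch assumes the 25,000-row `train` split shown earlier and the `frac=0.6` and `test_size=0.2` values from the cell above; the variable names are illustrative.

```python
import math

n_train_split = 25000        # rows in the IMDB "train" split, per the printed DatasetDict
frac, test_size = 0.6, 0.2   # values used in the sampling and split cell

n_sampled = round(frac * n_train_split)     # rows kept by df.sample(frac=0.6)
n_test = math.ceil(test_size * n_sampled)   # train_test_split rounds the test share up
n_train = n_sampled - n_test

print(n_sampled, n_train, n_test)  # 15000 12000 3000
```

The 3,000 test rows match the total support in the classification report later in the notebook.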

### Building the Sentiment Analysis Pipeline

Since spaCy does not have a built-in sentiment analysis model, I'll train a custom text classification pipeline.


In [29]:
# Load spaCy's blank English model
nlp = spacy.blank("en")

In [30]:
# Add a TextCategorizer to the pipeline
from thinc.api import Config  # Config is defined in thinc; spacy.pipeline.textcat only re-imports it

config_string = """
[model]
@architectures = "spacy.TextCatEnsemble.v2"
[model.tok2vec]
@architectures = "spacy.Tok2Vec.v2"
[model.tok2vec.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = 64
# Add attributes with the same length as rows
attrs = ["NORM","PREFIX", "SHAPE"]
rows = [10000, 20000, 100000]
[model.tok2vec.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 64
depth = 2
window_size = 1
"""

config = Config().from_str(config_string) # Create a Config object first and then call from_str

textcat = nlp.add_pipe("textcat", config=config)

In [31]:
# Add labels for classification
textcat.add_label("positive")
textcat.add_label("negative")

1

In [32]:
# Prepare training data
train_data = [
    (text, {"cats": {"positive": bool(label), "negative": not bool(label)}})
    for text, label in zip(X_train, y_train)
]
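
The label conversion in the list comprehension above can be seen in isolation. This is a minimal sketch; `to_cats` is a hypothetical helper name, and it assumes the IMDB convention visible in the data (0 = negative, 1 = positive).

```python
def to_cats(label):
    """Turn a 0/1 IMDB label into the mutually exclusive category dict used for training."""
    is_positive = bool(label)
    return {"cats": {"positive": is_positive, "negative": not is_positive}}

print(to_cats(1))  # {'cats': {'positive': True, 'negative': False}}
print(to_cats(0))  # {'cats': {'positive': False, 'negative': True}}
```

Exactly one category is true for each review, which is consistent with a single-label classifier.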

### Train the Model

In [33]:
from spacy.training import Example
from spacy.util import minibatch

# Train the model
# initialize() replaces the deprecated begin_training() and returns the optimizer
optimizer = nlp.initialize()

for epoch in range(7):
    losses = {}
    batches = minibatch(train_data, size=32)
    for batch in batches:
        # Pair each text with its gold category annotations
        examples = [Example.from_dict(nlp.make_doc(text), annotations) for text, annotations in batch]
        nlp.update(examples, sgd=optimizer, drop=0.5, losses=losses)
    print(f"Losses at epoch {epoch}: {losses}")

Losses at epoch 0: {'textcat': 89.38241279125214}
Losses at epoch 1: {'textcat': 64.46729025989771}
Losses at epoch 2: {'textcat': 47.44303681328893}
Losses at epoch 3: {'textcat': 41.59987869672477}
Losses at epoch 4: {'textcat': 33.934350945055485}
Losses at epoch 5: {'textcat': 29.871456357417628}
Losses at epoch 6: {'textcat': 26.063095181132667}
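
The training loop walks through the data in the same order every epoch. A common variation is to reshuffle before batching. The sketch below illustrates that idea in plain Python with toy data; `iter_batches` is a hypothetical helper that stands in for spaCy's `minibatch`, and the 12,000 examples are an assumption based on the 80% training share of 15,000 sampled reviews.

```python
import random

def iter_batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Toy stand-in for train_data: (text, annotations) pairs, not real reviews
toy_data = [
    (f"review {i}", {"cats": {"positive": i % 2 == 0, "negative": i % 2 == 1}})
    for i in range(12000)
]

rng = random.Random(42)  # fixed seed so the order is reproducible
batches_per_epoch = []
for epoch in range(2):
    rng.shuffle(toy_data)                       # new example order each epoch
    batches = list(iter_batches(toy_data, 32))  # the model update would happen per batch here
    batches_per_epoch.append(len(batches))

print(batches_per_epoch)  # [375, 375]
```

With a batch size of 32, each epoch performs 375 updates. The losses printed above come from the unshuffled loop, so shuffling would likely give somewhat different numbers.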


### Evaluating the Model

In [35]:
def evaluate_model(model, texts, labels):
  predictions = []
  # model.pipe streams the texts through the pipeline in batches
  for doc in model.pipe(texts):
    # Cast to int so predictions use the same 0/1 encoding as the labels
    predictions.append(int(doc.cats["positive"] > doc.cats["negative"]))
  print(classification_report(labels, predictions))

evaluate_model(nlp, X_test.tolist(), y_test.tolist())

              precision    recall  f1-score   support

           0       0.86      0.87      0.87      1485
           1       0.88      0.86      0.87      1515

    accuracy                           0.87      3000
   macro avg       0.87      0.87      0.87      3000
weighted avg       0.87      0.87      0.87      3000
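
For readers unfamiliar with the report columns, the sketch below computes precision, recall, and F1 for one class by hand. The labels are toy values chosen for illustration, not the notebook's predictions, and `precision_recall_f1` is a hypothetical helper.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class `positive` from two label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels for illustration only
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Here 3 of the 4 predicted positives are correct (precision 0.75) and 3 of the 4 actual positives are found (recall 0.75). The `support` column in the report is simply the number of true examples of each class.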



### Prediction on New Text

In [38]:
test_text = "This movie was fantastic!"
doc = nlp(test_text)
print(f"Positive: {doc.cats['positive']}, Negative: {doc.cats['negative']}")

Positive: 0.9975942969322205, Negative: 0.002405634382739663
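
To turn the two scores into a single decision, one option is to pick the highest-scoring category. This is a minimal sketch; `top_label` is a hypothetical helper, and the dictionary reuses the scores printed above.

```python
def top_label(cats):
    """Return the highest-scoring category and its score from a doc.cats-style dict."""
    label = max(cats, key=cats.get)
    return label, cats[label]

# Scores copied from the output above
cats = {"positive": 0.9975942969322205, "negative": 0.002405634382739663}

label, score = top_label(cats)
print(f"{label} ({score:.3f})")  # positive (0.998)
```

In the notebook, `cats` would be `doc.cats`. Passing new text through `clean_text` before calling `nlp` would keep inference consistent with how the training text was prepared.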
