<a href="https://colab.research.google.com/github/AnastasiiaDm/machine-learning/blob/main/DZ_11_text_non_vector_methods/text_non_vector_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import re

from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression

In [2]:
import os
import kagglehub

# Download the latest version of the dataset and load the CSV into a DataFrame
path = kagglehub.dataset_download("lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")
df = pd.read_csv(os.path.join(path, "IMDB Dataset.csv"))
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [33]:
df['sentiment'].unique()

array(['positive', 'negative'], dtype=object)

In [34]:
df.isnull().sum()

review,0
sentiment,0


In [3]:
def clean_text(text):
    # Strip HTML tags such as <br />
    text = BeautifulSoup(text, "html.parser").get_text()
    # Keep only ASCII letters and whitespace; punctuation and digits are dropped,
    # so contractions lose their apostrophe (e.g. "aren't" becomes "arent")
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Convert to lowercase
    text = text.lower()
    return text

# Apply the cleaning function to every review
df['review'] = df['review'].apply(clean_text)
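One detail worth checking in the cleaning step: `get_text()` called without a separator concatenates the text on either side of a removed tag, so `production.<br /><br />The` can end up as the single token `productionthe` once punctuation is stripped. The sketch below is a hypothetical variant, `clean_text_spaced` (not part of the notebook), that replaces each tag with a space first. It uses a plain regex for tag removal so that it runs with the standard library only; with BeautifulSoup, `get_text(separator=" ")` should have a similar effect.

```python
import re

def clean_text_spaced(text):
    """Hypothetical variant of clean_text that keeps words apart when tags are removed."""
    # Replace every HTML tag with a space instead of deleting it outright
    text = re.sub(r"<[^>]+>", " ", text)
    # Keep only ASCII letters and whitespace
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    # Collapse runs of whitespace and lowercase the result
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()

sample = "A wonderful production.<br /><br />The acting isn't bad."
cleaned_sample = clean_text_spaced(sample)
print(cleaned_sample)
```

Switching to a variant like this would change the vocabulary slightly, so the scores reported in this notebook would need to be recomputed.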

In [6]:
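# TF-IDF over unigrams and bigrams with English stop words removed;
# the vocabulary is capped at the 1000 most frequent terms in the corpus
vectorizer = TfidfVectorizer(
    analyzer='word',
    ngram_range=(1, 2),
    stop_words='english',
    max_features=1000
)

# Note: the vectorizer is fit on the full dataset, before the train/test split,
# so test reviews also influence the vocabulary and the IDF weights
X = vectorizer.fit_transform(df['review'])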

In [7]:
vectorizer.get_feature_names_out()

array(['ability', 'able', 'absolutely', 'accent', 'act', 'acted',
       'acting', 'action', 'actor', 'actors', 'actress', 'actual',
       'actually', 'add', 'added', 'admit', 'adult', 'age', 'ago',
       'agree', 'air', 'alien', 'alive', 'amazing', 'america', 'american',
       'amusing', 'animation', 'annoying', 'apart', 'apparently',
       'appear', 'appears', 'appreciate', 'arent', 'army', 'art', 'aside',
       'ask', 'atmosphere', 'attempt', 'attempts', 'attention',
       'audience', 'audiences', 'average', 'avoid', 'away', 'awesome',
       'awful', 'baby', 'background', 'bad', 'bad movie', 'badly', 'band',
       'barely', 'based', 'basic', 'basically', 'battle', 'beautiful',
       'beauty', 'begin', 'beginning', 'begins', 'believable', 'believe',
       'ben', 'best', 'better', 'big', 'biggest', 'bit', 'bizarre',
       'black', 'blood', 'body', 'book', 'books', 'bored', 'boring',
       'bother', 'bought', 'box', 'boy', 'boys', 'brain', 'break',
       'brilliant', 'brin

In [9]:
X.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
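The dense view above shows mostly zeros because a TF-IDF matrix is sparse. A lighter way to inspect it is to look at the shape and the number of stored non-zero entries without calling `toarray()` on the whole matrix. The sketch below does this on a small hypothetical corpus (`docs`, `vec`, and `X_toy` are illustrative names, not the notebook's variables).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus; the real matrix has 50,000 rows and 1,000 columns
docs = [
    "a bad movie with bad acting",
    "a brilliant movie",
    "boring and awful",
]
vec = TfidfVectorizer(stop_words="english")
X_toy = vec.fit_transform(docs)

n_rows, n_cols = X_toy.shape
# Fraction of cells that actually hold a non-zero weight
density = X_toy.nnz / (n_rows * n_cols)
print(X_toy.shape, X_toy.nnz, round(density, 3))

# Densify a single row only when a closer look is needed
first_row = X_toy[0].toarray()
print(first_row)
```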

In [11]:
y = df['sentiment']

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

    negative       0.87      0.84      0.86      4961
    positive       0.85      0.88      0.86      5039

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000



The model reaches 86% accuracy on the 10,000-review test set, with balanced performance across positive and negative reviews:

- Precision: 0.87 (negative), 0.85 (positive). Of the reviews predicted as a given class, 85-87% actually belong to it, so false positives are relatively rare.
- Recall: 0.84 (negative), 0.88 (positive). The model finds slightly fewer of the true negative reviews than of the true positive ones.
- F1-Score: 0.86 for both classes, reflecting a balance between precision and recall.
- Macro Average: 0.86 for precision, recall, and F1-score, indicating consistent performance across both classes.
- Weighted Average: 0.86, matching the macro average because the classes are almost equally represented (4,961 negative vs. 5,039 positive).
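The metrics in the list above can all be derived from a confusion matrix. The sketch below works through the arithmetic on hypothetical counts; these are not the notebook's actual confusion matrix, which is not shown, and the helper `per_class_metrics` is an illustrative name.

```python
def per_class_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: outer key = true class, inner key = predicted class
confusion = {
    "negative": {"negative": 80, "positive": 20},
    "positive": {"negative": 10, "positive": 90},
}

metrics = {}
support = {}
for cls in confusion:
    other = "positive" if cls == "negative" else "negative"
    tp = confusion[cls][cls]        # correctly predicted as cls
    fn = confusion[cls][other]      # true cls, predicted as the other class
    fp = confusion[other][cls]      # true other class, predicted as cls
    metrics[cls] = per_class_metrics(tp, fp, fn)
    support[cls] = tp + fn          # number of true examples of cls

total = sum(support.values())
# Macro average: plain mean over classes; weighted average: mean weighted by support
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)
weighted_f1 = sum(metrics[c][2] * support[c] for c in metrics) / total

for cls, (p, r, f) in metrics.items():
    print(f"{cls}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={support[cls]}")
print(f"macro f1={macro_f1:.2f} weighted f1={weighted_f1:.2f}")
```

With equal support per class, as in this toy example, the macro and weighted averages coincide; they drift apart only when the classes are imbalanced.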

Overall, the main weakness is the slightly lower recall for negative reviews (84%): about 16% of them are misclassified as positive.
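One caveat about the setup: the TF-IDF vectorizer in this notebook is fit on all reviews before the split, so the test set contributes to the vocabulary and IDF weights. A possible alternative, sketched below under the assumption that raw cleaned text is split first, is to wrap the vectorizer and classifier in a `Pipeline` so that the vectorizer only ever sees training data. The tiny corpus and variable names are hypothetical, and the result on the real dataset may differ from the 86% reported above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny hypothetical corpus standing in for the cleaned IMDB reviews
train_texts = [
    "a wonderful and brilliant movie",
    "amazing acting and a beautiful story",
    "a boring movie with awful acting",
    "bad plot and annoying characters",
]
train_labels = ["positive", "positive", "negative", "negative"]
test_texts = ["a brilliant masterpiece", "boring and bad"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), stop_words="english", max_features=1000)),
    ("clf", LogisticRegression()),
])

# fit() learns the vocabulary and IDF weights from the training texts only
pipeline.fit(train_texts, train_labels)
predictions = pipeline.predict(test_texts)
print(list(predictions))

# Words that occur only in the test texts never enter the vocabulary
learned_vocabulary = pipeline.named_steps["tfidf"].vocabulary_
print("masterpiece" in learned_vocabulary)
```

On the real data, the same pipeline could be fit after calling `train_test_split` on `df['review']` and `df['sentiment']` instead of on the precomputed matrix `X`.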