Sprint 14 Chapter 2

## Introduction

Train a logistic regression model to determine the tonality of the reviews. Use TF-IDF vectors for lemmatized reviews as features.

The training part of the dataset is in the imdb_reviews_small_lemm_train.tsv file. The lemmatized reviews are in the review_lemm column (so you don't have to lemmatize reviews yourself), and the target is in the pos column (0 for a negative review, 1 for a positive review).

Use the trained classification model to determine the prediction results for the test sample of reviews from the imdb_reviews_small_lemm_test.tsv file. Save the predictions to the pos column. The model accuracy should be at least 0.82.

Save the table with results as a CSV file. Don't specify the file extension so that the platform accepts the file (for example, call it 'predictions'). To submit your answer, upload your CSV file to the platform; do not submit your actual code.

## Import libraries

In [152]:
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Load test dataset

In [154]:
try:
    # Attempt to read from the server path
    test = pd.read_csv('/datasets/moved_imdb_reviews_small_lemm_test.tsv', sep='\t')
except FileNotFoundError:
    # Fallback to the local path
    test = pd.read_csv('../datasets/moved_imdb_reviews_small_lemm_test.tsv', sep='\t')
    print("TSV test file loaded successfully from the local path.")
else:
    # This block runs if no exception is raised in the try block
    print("TSV test file loaded successfully from the server path.")
finally:
    # This block is always executed, no matter if an exception is thrown or not
    print("It's a great day.")

TSV test file loaded successfully from the local path.
It's a great day.
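
The try/except fallback above can also be written as a small helper that walks a list of candidate paths. This is only a sketch: read_first_existing is a hypothetical name, not part of pandas, and the demo file is created on the fly so the example runs anywhere.

```python
import os
import tempfile

import pandas as pd


def read_first_existing(paths, **read_csv_kwargs):
    """Read the first file in `paths` that exists; return (DataFrame, path)."""
    for candidate in paths:
        if os.path.isfile(candidate):
            return pd.read_csv(candidate, **read_csv_kwargs), candidate
    raise FileNotFoundError(f"None of these files exist: {list(paths)}")


# Demo on a throwaway file (hypothetical data, not the IMDB reviews).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo.tsv")
with open(demo_path, "w", encoding="utf-8") as handle:
    handle.write("review_lemm\tpos\ngood film\t1\nbad film\t0\n")

# The first candidate does not exist, so the helper falls back to the second.
demo, used_path = read_first_existing(
    [os.path.join(demo_dir, "missing.tsv"), demo_path], sep="\t"
)
print(f"Loaded {demo.shape[0]} rows from {used_path}")
```

In the notebook, the candidate list would hold the server path followed by the local path, with sep='\t' passed through to pd.read_csv.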


In [155]:
test.shape

(2220, 4)

In [156]:
display(test.head(12))

Unnamed: 0,tconst,original_title,review,review_lemm
0,tt0108999,...And the Earth Did Not Swallow Him,I rented this movie from a local library witho...,i rent this movie from a local library without...
1,tt0108999,...And the Earth Did Not Swallow Him,"The movie "". . . And The Earth Did not Swallow...",the movie and the earth do not swallow -PRON- ...
2,tt0108999,...And the Earth Did Not Swallow Him,I was very moved by the young life experiences...,i be very move by the young life experience of...
3,tt0108999,...And the Earth Did Not Swallow Him,"Recently finally available in DVD (11/11/08), ...",recently finally available in dvd severo p rez...
4,tt0063308,"Un minuto per pregare, un istante per morire",I saw this movie over 20 years ago and had rat...,i see this movie over year ago and have rather...
5,tt0063308,"Un minuto per pregare, un istante per morire",This Spaghetti Western uses three American lea...,this spaghetti western use three american lead...
6,tt0063308,"Un minuto per pregare, un istante per morire","I found this to be an underrated, quietly comp...",i find this to be an underrated quietly compel...
7,tt0063308,"Un minuto per pregare, un istante per morire","""A Minute to Pray, A Second to Die"" is a quali...",a minute to pray a second to die be a quality ...
8,tt0063308,"Un minuto per pregare, un istante per morire",The cast alone tells you this will be a notch ...,the cast alone tell -PRON- this will be a notc...
9,tt0063308,"Un minuto per pregare, un istante per morire",There are a number of reviews that comment on ...,there be a number of review that comment on th...


## Inspect test dataset

In [158]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2220 entries, 0 to 2219
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          2220 non-null   object
 1   original_title  2220 non-null   object
 2   review          2220 non-null   object
 3   review_lemm     2220 non-null   object
dtypes: object(4)
memory usage: 69.5+ KB


There is no data missing.

## Load train dataset

In [161]:
try:
    # Attempt to read from the server path
    train = pd.read_csv('/datasets/moved_imdb_reviews_small_lemm_train.tsv', sep='\t')
except FileNotFoundError:
    # Fallback to the local path
    train = pd.read_csv('../datasets/moved_imdb_reviews_small_lemm_train.tsv', sep='\t')
    print("TSV train file loaded successfully from the local path.")
else:
    # This block runs if no exception is raised in the try block
    print("TSV train file loaded successfully from the server path.")
finally:
    # This block is always executed, no matter if an exception is thrown or not
    print("It's a great day.")

TSV train file loaded successfully from the local path.
It's a great day.


In [162]:
train.shape

(2027, 5)

In [163]:
display(train.head(12))

Unnamed: 0,tconst,original_title,review,review_lemm,pos
0,tt0087803,Nineteen Eighty-Four,I saw this movie last year in Media class and ...,i see this movie last year in medium class and...,0
1,tt0087803,Nineteen Eighty-Four,"I must admit, there are few books with corresp...",i must admit there be few book with correspond...,0
2,tt0087803,Nineteen Eighty-Four,I think that the shots and lighting were very ...,i think that the shot and light be very poor w...,0
3,tt0087803,Nineteen Eighty-Four,"A few weeks ago, I read the classic George Orw...",a few week ago i read the classic george orwel...,0
4,tt0087803,Nineteen Eighty-Four,I saw this movie literally directly after fini...,i see this movie literally directly after fini...,0
5,tt0087803,Nineteen Eighty-Four,"The book is fantastic, this film is not. There...",the book be fantastic this film be not there b...,0
6,tt0087803,Nineteen Eighty-Four,'Ninteen Eighty-Four' is a film about a futuri...,' ninteen eighty four ' be a film about a futu...,0
7,tt0087803,Nineteen Eighty-Four,After hearing about George Orwell's prophetic ...,after hear about george orwell 's prophetic ma...,0
8,tt0087803,Nineteen Eighty-Four,"I have heard about this novel a long time ago,...",i have hear about this novel a long time ago m...,0
9,tt0087803,Nineteen Eighty-Four,I am a massive fan of the book and Orwell is c...,i be a massive fan of the book and orwell be c...,0


The lemmatized reviews are in the review_lemm column, and the target is in the pos column (0 for a negative review, 1 for a positive review).

## Inspect train dataset

In [167]:
value_counts = train['pos'].value_counts()
print(value_counts)

pos
1    1162
0     865
Name: count, dtype: int64
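
The counts above show a moderately imbalanced target. A quick way to put the 0.82 accuracy requirement in context is to compute the class shares and the accuracy of always predicting the majority class. The sketch below hard-codes the two counts from the output above; it describes the training set only, and the class balance of the test set is not known.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
class_counts = pd.Series({1: 1162, 0: 865}, name="count")

# Share of each class in the training set.
class_share = class_counts / class_counts.sum()

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = class_share.max()

print(class_share.round(3))
print(f"Majority-class baseline: {majority_baseline:.3f}")
```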


In [168]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2027 entries, 0 to 2026
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          2027 non-null   object
 1   original_title  2027 non-null   object
 2   review          2027 non-null   object
 3   review_lemm     2027 non-null   object
 4   pos             2027 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 79.3+ KB


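There is no missing data. The pos column holds only 0 and 1, so it is converted to boolean below. Note that a model trained on boolean labels also predicts booleans, so the saved predictions will be True/False unless they are cast back to integers.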

In [170]:
train['pos'] = train['pos'].astype(bool)

In [171]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2027 entries, 0 to 2026
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   tconst          2027 non-null   object
 1   original_title  2027 non-null   object
 2   review          2027 non-null   object
 3   review_lemm     2027 non-null   object
 4   pos             2027 non-null   bool  
dtypes: bool(1), object(4)
memory usage: 65.5+ KB


## Create corpus

In [173]:
corpus_train = train['review_lemm']
corpus_test = test['review_lemm']
target_train = train['pos']

## Vectorize text data using TF-IDF

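Convert the text reviews into numerical features. The vectorizer is fitted on the training corpus only and then applied to the test corpus, so both matrices share the same vocabulary.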

In [176]:
tfidf_vectorizer = TfidfVectorizer()

In [177]:
X_train_tfidf = tfidf_vectorizer.fit_transform(corpus_train)
X_test_tfidf = tfidf_vectorizer.transform(corpus_test)
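
The fit_transform/transform split can be seen on a toy corpus. The sentences below are made up for illustration; the point is that the vocabulary comes from the training texts only, so the test matrix has the same columns and words never seen in training contribute nothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpora, not taken from the IMDB data.
toy_train = ["good movie", "bad movie", "great film"]
toy_test = ["good film", "terrible plot"]

vectorizer = TfidfVectorizer()
toy_train_matrix = vectorizer.fit_transform(toy_train)  # learns vocabulary and IDF
toy_test_matrix = vectorizer.transform(toy_test)        # reuses them unchanged

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(toy_train_matrix.shape, toy_test_matrix.shape)

# "terrible plot" contains only unseen words, so its row is all zeros.
unseen_row_sum = float(toy_test_matrix[1].sum())
print(unseen_row_sum)
```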

## Create Logistic Regression model

In [179]:
# Create the logistic regression model (it is fitted in the next section)
model = LogisticRegression(max_iter=1000)

## Train model

Fit the model on the TF-IDF features so that it can determine the tonality of the reviews.

In [182]:
model.fit(X_train_tfidf, target_train)

The features are TF-IDF vectors of the lemmatized reviews. The predictions are saved to the pos column, and the model accuracy should be at least 0.82.

In [186]:
# Transform test data (repeats the earlier transform of corpus_test; the result is identical)
X_test_tfidf = tfidf_vectorizer.transform(test['review_lemm'])

## Predict

In [188]:
# Predict on test data
predictions = model.predict(X_test_tfidf)

## Add predictions to the test set

In [190]:
test['pos'] = predictions

## Score the model

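The model accuracy should be at least 0.82. The test file has no ground-truth pos column, so the check below compares the predictions with themselves and always returns 1.0. It does not measure the real accuracy, which is computed by the platform after submission.

In [193]:
# Sanity check only: test['pos'] was just filled with predictions,
# so this compares the predictions with themselves
accuracy = accuracy_score(test['pos'], predictions)
print(f"Model Accuracy: {accuracy}")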

Model Accuracy: 1.0
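
An accuracy estimate that means something can be obtained from the labelled training data by holding part of it out. The sketch below is one possible approach, not the method used in this notebook: holdout_accuracy is a hypothetical helper, the 80/20 split and random_state are arbitrary choices, and the toy texts exist only so the example runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def holdout_accuracy(texts, labels, test_size=0.2, random_state=0):
    """Fit TF-IDF + logistic regression on one part of the labelled data
    and return the accuracy on the held-out part."""
    texts_fit, texts_val, labels_fit, labels_val = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state, stratify=labels
    )
    # The pipeline fits the vectorizer on the fitting part only.
    pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    pipeline.fit(texts_fit, labels_fit)
    return accuracy_score(labels_val, pipeline.predict(texts_val))


# Toy data (hypothetical); in the notebook the call would be
# holdout_accuracy(train['review_lemm'], train['pos']).
toy_texts = ["great wonderful film"] * 10 + ["awful boring film"] * 10
toy_labels = [1] * 10 + [0] * 10
toy_accuracy = holdout_accuracy(toy_texts, toy_labels)
print(f"Hold-out accuracy on toy data: {toy_accuracy}")
```

A hold-out score on the training file is only an estimate; the platform's score on the test reviews may differ.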


## Save file

Save the table with results as a CSV file without a file extension.

In [196]:
import os
# Define the directory path
datasets_dir = '../datasets'

In [197]:
output_file_path = os.path.join(datasets_dir, 'predictions')

In [221]:
# Save only the pos column of the test DataFrame
output_columns = ['pos']
test[output_columns].to_csv(output_file_path, index=False, header=True)
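
Because pos was converted to boolean before training, the saved column contains True/False rather than 0/1. If the platform expects the 0/1 coding described in the task (an assumption, since the accepted format is not stated here), the predictions can be cast back to integers before saving. The sketch below uses made-up predictions and a temporary directory, then reads the file back to confirm its contents.

```python
import os
import tempfile

import pandas as pd

# Hypothetical boolean predictions standing in for model.predict(...) output.
toy_predictions = pd.Series([True, False, True], name="pos")

# Write to a file named 'predictions' with no extension, as the task requests.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "predictions")
toy_predictions.astype(int).to_frame().to_csv(out_path, index=False)

# Read the file back to check the header and the 0/1 values.
saved = pd.read_csv(out_path)
print(saved)
```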