### Import necessary libraries
This code imports the libraries needed to build a sentiment analysis model using logistic regression and TF-IDF vectorization. The notebook as a whole loads and inspects the text data, splits it into training and testing sets, creates TF-IDF features from the text, and trains a logistic regression model to classify movie reviews as positive or negative.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

### Import the Dataset
Let's load the movie review dataset from the 'imdbDataset.csv' file into a pandas DataFrame named 'df'. The expected outcome is to have the movie review data loaded into memory, ready for further processing and analysis.

In [2]:
# Load the movie review dataset
df = pd.read_csv('imdbDataset.csv')

### Inspect the Dataset

In [3]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


df.head() displays the first five rows of the DataFrame 'df', showing the 'review' column containing the movie review text and the 'sentiment' column indicating whether the review is positive or negative.

In [4]:
df['review'][3]

"Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them."

This is the review text stored at index 3 (the fourth row, since indexing starts at 0) of the 'review' column of the DataFrame 'df'. It describes the storyline and gives the reviewer's opinion of the movie.

In [5]:
df['sentiment'][3]

'negative'

This is the sentiment label stored at index 3 (the fourth row) of the 'sentiment' column of the DataFrame 'df'. Consistent with the critical tone of the review above, the label is 'negative'.

In [6]:
df.shape

(50000, 2)

The DataFrame 'df' has 50,000 rows and 2 columns. Each row is one movie review, and the two columns are the 'review' text (the input) and the 'sentiment' label (the target).

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


df.info() provides the following information about the DataFrame 'df':

    The DataFrame has a total of 50,000 entries or rows.
    There are 2 columns in the DataFrame named 'review' and 'sentiment'.
    The 'review' column has 50,000 non-null values, indicating that there are no missing values in that column.
    The 'sentiment' column also has 50,000 non-null values, indicating no missing values in that column.
    The data type of both columns is 'object', which typically represents string or text data.
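    The memory usage of the DataFrame is at least 781.4 KB; the '+' indicates a lower bound, because the contents of 'object' columns are not fully measured by default.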

In [8]:
#checking missing values
df.isnull().sum()

review       0
sentiment    0
dtype: int64

df.isnull().sum() indicates the number of missing values in each column of the DataFrame 'df':

    The 'review' column has 0 missing values.
    The 'sentiment' column also has 0 missing values.
Therefore, there are no missing values in either column of the DataFrame.
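
To make the counting behaviour of `isnull().sum()` concrete, here is a minimal sketch on a small made-up DataFrame. The name `toy_df` and its contents are assumptions for illustration only and are not part of the IMDB data.

```python
import pandas as pd

# Hypothetical toy frame (not the IMDB data) with one missing review.
toy_df = pd.DataFrame({
    "review": ["great film", None, "terrible plot"],
    "sentiment": ["positive", "negative", "negative"],
})

# isnull() marks each cell as True/False; sum() counts the True values per column.
missing_counts = toy_df.isnull().sum()
print(missing_counts)

# Dropping incomplete rows is one simple way to handle gaps before training.
clean_df = toy_df.dropna()
print(clean_df.shape)
```

In the notebook's dataset both counts are zero, so no such cleanup is needed.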

### Split the Dataset

In [9]:
# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df['review'], df['sentiment'], test_size=0.2)

This code splits the data into training and testing sets using the train_test_split function from the scikit-learn library. After the split, the data is held in the following variables:

    X_train: Training set containing the 'review' data.
    X_test: Testing set containing the 'review' data.
    y_train: Training set containing the 'sentiment' labels.
    y_test: Testing set containing the 'sentiment' labels.
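The data is split with a test size of 0.2, meaning that 20% of the data is used for testing and 80% for training the model. Because no random_state is passed, the rows are shuffled differently on each run, so the exact split (and the accuracy reported later) can vary slightly between runs.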
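
If a repeatable split is wanted, one option is to pass `random_state` and, optionally, `stratify`. The sketch below uses made-up data; the `toy_` names and the seed value 42 are assumptions, not choices made in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 reviews with perfectly balanced labels.
toy_reviews = pd.Series([f"review {i}" for i in range(10)])
toy_labels = pd.Series(["positive", "negative"] * 5)

# random_state fixes the shuffle; stratify keeps the label ratio in both parts.
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_reviews, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(len(toy_X_train), len(toy_X_test))
print(sorted(toy_y_test))
```

With a fixed seed, re-running the cell yields the same train and test rows, which makes reported metrics easier to compare.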

In [10]:
print(X_train.shape)
print(X_test.shape)

(40000,)
(10000,)


Both training and testing sets contain the 'review' data from the original DataFrame, with the training set having 40,000 samples and the testing set having 10,000 samples.

### TF-IDF vectorizer

In [11]:
# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer()

This code initializes an instance of TfidfVectorizer from the scikit-learn library. The result is a vectorizer object that will be used to transform the text data into TF-IDF features. The vectorizer computes a Term Frequency-Inverse Document Frequency (TF-IDF) weight for each word in each document: words that occur often in a review but rarely across the whole collection receive higher weights. This is a common technique in natural language processing tasks such as text classification.
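
For reference, the weighting can be written out as follows. This assumes the default TfidfVectorizer settings (smoothed IDF, raw term counts, L2 normalization); the symbols $n$ (number of documents seen during fitting), $\mathrm{df}(t)$ (number of those documents containing term $t$), $\mathrm{tf}(t, d)$ (count of $t$ in document $d$) and $v_d$ (the row vector for $d$) are notation introduced here.

```latex
\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1,
\qquad
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d)\cdot\mathrm{idf}(t),
\qquad
v_d \leftarrow \frac{v_d}{\lVert v_d \rVert_2}
```

A term that appears in every document gets the minimum IDF of 1, while rarer terms are weighted more heavily before each row is scaled to unit length.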

In [12]:
# Fit the vectorizer to the training data
vectorizer.fit(X_train)

TfidfVectorizer()

This code fits the TF-IDF vectorizer (vectorizer) on the training data (X_train). Fitting builds the vocabulary from the words present in the training set and computes the inverse document frequency of each word. Only the training data is used here, so no information from the test set leaks into the features. This step prepares the vectorizer to transform text into TF-IDF features for model training and evaluation.
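
As a rough illustration of what fitting learns, the sketch below fits a vectorizer on a two-document made-up corpus and compares the stored IDF values with a hand calculation. The corpus and the `toy_` names are assumptions; the formula used is the smoothed IDF that applies under the default settings.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical two-document corpus (not the IMDB data).
toy_corpus = ["good movie", "bad movie"]
toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_corpus)

# vocabulary_ maps each learned term to its column index (alphabetical order).
print(toy_vectorizer.vocabulary_)

# Document frequencies in column order: bad=1, good=1, movie=2.
doc_freq = np.array([1, 1, 2])
n_docs = len(toy_corpus)
manual_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# idf_ holds the values learned during fit; they should match the hand calculation.
print(np.allclose(toy_vectorizer.idf_, manual_idf))
```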

In [13]:
# Transform the training and test data
X_train_vectorized = vectorizer.transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

The first line transforms the training data (X_train) into TF-IDF features using the fitted vectorizer and stores the result in X_train_vectorized, a sparse matrix of TF-IDF values.

The second line transforms the testing data (X_test) with the same fitted vectorizer and stores the result in X_test_vectorized. Because the vectorizer is not refitted, the test matrix uses the same vocabulary and IDF weights as the training matrix; words that appear only in the test set are ignored.

Both X_train_vectorized and X_test_vectorized will have the same number of features as the TF-IDF vectorizer's vocabulary, and the number of rows will correspond to the number of samples in the training and testing sets, respectively.
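
The shape relationship described above can be checked on a tiny made-up example. The `toy_` names and sentences are assumptions for illustration; note that the default tokenizer drops one-character tokens such as "a".

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy texts (not the IMDB data).
toy_train = ["a wonderful film", "a dull film"]
toy_test = ["a wonderful surprise"]  # "surprise" never appears in toy_train

toy_vectorizer = TfidfVectorizer()
toy_vectorizer.fit(toy_train)

toy_train_matrix = toy_vectorizer.transform(toy_train)
toy_test_matrix = toy_vectorizer.transform(toy_test)

# Both matrices share the column count fixed by the training vocabulary.
print(toy_train_matrix.shape, toy_test_matrix.shape)

# The unseen word "surprise" has no column, so only "wonderful" is non-zero.
print(toy_test_matrix.nnz)
```

Calling `fit_transform` on the training data would combine the fit and the first transform into one step; the test data should still go through `transform` only.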

### Logistic regression model
Let's create a logistic regression model with the maximum number of solver iterations set to 10,000. Logistic regression is a commonly used algorithm for binary classification tasks, and a generous iteration limit gives the solver room to converge on the high-dimensional TF-IDF features. The result is a logistic regression model object ready for training and prediction.

In [14]:
# Create a logistic regression model
model = LogisticRegression(max_iter=10000)

In [15]:
# Fit the model to the training data
model.fit(X_train_vectorized, y_train)

LogisticRegression(max_iter=10000)

This code trains the logistic regression model (model) on the training features (X_train_vectorized) and their corresponding labels (y_train). During fitting, the model learns a weight for each TF-IDF feature, which it then uses to make predictions on new data.
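
One way to see what the model has learned is to look at its per-word weights. The sketch below trains on six made-up reviews; the data and the `toy_` names are assumptions and say nothing about the weights learned on the IMDB data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy training set (not the IMDB data).
toy_texts = [
    "great wonderful film", "wonderful acting", "great story",
    "terrible boring film", "boring acting", "terrible story",
]
toy_labels = ["positive"] * 3 + ["negative"] * 3

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_texts)

toy_model = LogisticRegression(max_iter=10000)
toy_model.fit(toy_features, toy_labels)

# classes_ is sorted, so positive weights in coef_[0] push toward classes_[1].
print(toy_model.classes_)

# Pair each vocabulary term with its learned weight.
toy_weights = {
    term: toy_model.coef_[0][index]
    for term, index in toy_vectorizer.vocabulary_.items()
}
for term in sorted(toy_weights, key=toy_weights.get):
    print(f"{term:10s} {toy_weights[term]:+.3f}")
```

Words that occur only in positive examples should end up with positive weights and words that occur only in negative examples with negative weights, while words shared by both classes stay near zero.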

In [16]:
# Evaluate the model on the test data
y_pred = model.predict(X_test_vectorized)

This code uses the trained logistic regression model to make predictions on the testing data (X_test_vectorized). The expected outcome is to obtain predicted labels (y_pred) for the testing data based on the learned patterns from the training data. These predicted labels can be compared with the actual labels (y_test) to evaluate the performance of the model. 

In [17]:
# Print the accuracy of the model
print('Accuracy:', np.mean(y_pred == y_test))

Accuracy: 0.9011


The output Accuracy: 0.9011 is the accuracy of the logistic regression model on the testing data. It is calculated by comparing the predicted labels (y_pred) with the actual labels (y_test) element by element and taking the mean of the matches. In this run, the model achieved an accuracy of 90.11%, meaning it correctly classified 9,011 of the 10,000 movie reviews in the testing set.
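
The same calculation can be reproduced on a handful of made-up labels, alongside scikit-learn's `accuracy_score` and `confusion_matrix`, which break the result down by class. The label arrays below are assumptions for illustration, not outputs of this notebook's model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels (not taken from the IMDB run).
toy_y_true = np.array(["positive", "negative", "positive", "negative", "positive"])
toy_y_pred = np.array(["positive", "negative", "negative", "negative", "positive"])

# Element-wise comparison gives booleans; their mean is the fraction correct.
manual_accuracy = np.mean(toy_y_pred == toy_y_true)
library_accuracy = accuracy_score(toy_y_true, toy_y_pred)
print(manual_accuracy, library_accuracy)

# Rows are true classes, columns are predicted classes, in the given label order.
toy_matrix = confusion_matrix(toy_y_true, toy_y_pred, labels=["negative", "positive"])
print(toy_matrix)
```

Here four of five predictions match, so both accuracy values are 0.8, and the confusion matrix shows that the single error is a positive review predicted as negative.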

In [19]:
# Load the new review
new_review = input('Please enter a new review: \n\n')

# Transform the new review
new_review_vectorized = vectorizer.transform([new_review])

# Predict the sentiment of the new review
sentiment = model.predict(new_review_vectorized)

# Print the predicted sentiment of the new review
print('\nThe sentiment of the new review is:', sentiment)

Please enter a new review: 

Kisi Ka Bhai Kisi Ki Jaan is a disappointing and unoriginal action-drama that fails to live up to the hype. The film is a remake of the Tamil film Veeram, and it follows a similar plot. The film stars Salman Khan as a righteous man who is forced to take the law into his own hands when his family is threatened. The film is full of clichés and predictable plot points, and the action scenes are few and far between. The acting is also subpar, with Khan giving a wooden performance. Overall, Kisi Ka Bhai Kisi Ki Jaan is a forgettable and underwhelming action-drama that is not worth your time.

The sentiment of the new review is: ['negative']


The code prompts the user to enter a new movie review. It then transforms the new review using the TF-IDF vectorizer (vectorizer) to convert it into a vector of TF-IDF features (new_review_vectorized). Finally, the logistic regression model (model) predicts the sentiment of the new review, and the predicted sentiment is printed, which in this case is 'negative'.
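
The transform-then-predict steps can also be bundled so that raw text goes in and a label comes out. The following is a minimal sketch using scikit-learn's `make_pipeline` on made-up data; the helper `predict_sentiment` and the `toy_` names are hypothetical and not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set (not the IMDB data).
toy_texts = ["wonderful film", "great acting", "boring film", "terrible acting"]
toy_labels = ["positive", "positive", "negative", "negative"]

# The pipeline fits the vectorizer and the classifier together.
sentiment_pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=10000),
)
sentiment_pipeline.fit(toy_texts, toy_labels)


def predict_sentiment(text):
    """Return the predicted label for one raw review string."""
    # predict expects a list of documents and returns an array; take element 0.
    return str(sentiment_pipeline.predict([text])[0])


print(predict_sentiment("wonderful great"))
print(predict_sentiment("boring terrible"))
```

Taking element 0 of the prediction returns a plain string such as 'negative' rather than the one-element array printed in the cell above.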