
In [19]:
from google.colab import files
uploaded = files.upload()
# Create the ~/.kaggle directory and copy the kaggle.json file into it
!mkdir ~/.kaggle
!cp kaggle.json ~/.kaggle/

# Change the permissions of the file
!chmod 600 ~/.kaggle/kaggle.json

!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Saving kaggle.json to kaggle (1).json
mkdir: cannot create directory ‘/root/.kaggle’: File exists
Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
Downloading imdb-dataset-of-50k-movie-reviews.zip to /content
 66% 17.0M/25.7M [00:00<00:00, 76.3MB/s]
100% 25.7M/25.7M [00:00<00:00, 97.4MB/s]
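
The output above shows two quirks of re-running this cell: `mkdir` reports that `/root/.kaggle` already exists, and Colab saved the upload as `kaggle (1).json`, so `cp kaggle.json` copies an earlier file rather than the fresh upload. Below is a minimal sketch of a re-runnable alternative in Python. The helper name `install_kaggle_token` and its parameters are hypothetical, and the sketch assumes that `uploaded` maps each uploaded file name to its bytes.

```python
import os
import tempfile
from pathlib import Path


def install_kaggle_token(uploaded, home):
    """Write the uploaded token to <home>/.kaggle/kaggle.json (hypothetical helper)."""
    # Take the content of the uploaded file regardless of its saved name,
    # e.g. "kaggle (1).json" when an older kaggle.json already exists.
    content = next(iter(uploaded.values()))

    kaggle_dir = Path(home) / ".kaggle"
    # exist_ok=True makes re-running the cell safe (no "File exists" error)
    kaggle_dir.mkdir(parents=True, exist_ok=True)

    token_path = kaggle_dir / "kaggle.json"
    token_path.write_bytes(content)
    # Restrict access to the owner, like `chmod 600`
    os.chmod(token_path, 0o600)
    return token_path


# Demonstration with a temporary directory and a fake token (made-up values)
fake_upload = {"kaggle (1).json": b'{"username": "demo", "key": "not-a-real-key"}'}
with tempfile.TemporaryDirectory() as tmp_home:
    install_kaggle_token(fake_upload, tmp_home)          # first run
    path = install_kaggle_token(fake_upload, tmp_home)   # second run also succeeds
    mode = oct(path.stat().st_mode & 0o777)
    saved = path.read_bytes()
print(mode)
```

In Colab this would presumably be called as `install_kaggle_token(uploaded, Path.home())` right after `files.upload()`.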


In [20]:
!unzip imdb-dataset-of-50k-movie-reviews.zip

Archive:  imdb-dataset-of-50k-movie-reviews.zip
  inflating: IMDB Dataset.csv        


# Loading And Exploring The Data

In [4]:
import pandas as pd

In [22]:
# Load the IMDb dataset into a DataFrame
df_reviews = pd.read_csv("IMDB Dataset.csv")

In [23]:
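# Keep only the first fifth of the reviews (10,000 rows) to speed up processing;
# .copy() avoids a pandas SettingWithCopyWarning when columns are modified later
df = df_reviews.iloc[:len(df_reviews) // 5].copy()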

In [24]:
# Display the first few lines
print("First few lines of the DataFrame:")
print(df.head())

First few lines of the DataFrame:
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [25]:
# Check the size of the DataFrame
print("\nDataFrame size:", df.shape)


DataFrame size: (10000, 2)


In [26]:
# Check column types
print("\nColumn types:")
print(df.dtypes)


Column types:
review       object
sentiment    object
dtype: object


In [27]:
# Check for NaN values
print("\nNaN values count:")
print(df.isna().sum())


NaN values count:
review       0
sentiment    0
dtype: int64


In [31]:
# Print the first 5 reviews and their sentiments classification
print("First 5 reviews and their sentiments:")
df[['review', 'sentiment']].head()

First 5 reviews and their sentiments:


                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


In [35]:
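# Get summary statistics for the numeric columns
# (the 'words count' column in the output comes from an earlier run of the word-count cell further down)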
df.describe()

       words count
count      10000.0
mean      231.1173
std     171.430166
min           14.0
25%          126.0
50%          172.0
75%          282.0
max         1830.0


In [36]:
# check for duplicates
df.duplicated().sum()

17

In [37]:
# Delete duplicate rows
df = df.drop_duplicates()

In [41]:
# Count the number of whitespace-separated words in a review
def count_words(text):
    return len(text.split())

# Apply the function to the review column and store the result in a new column called 'words count'
df.loc[:, 'words count'] = df['review'].apply(count_words)

# Visualize the result in the DataFrame
print(df[['review', 'words count']].head())

                                              review  words count
0  One of the other reviewers has mentioned that ...          307
1  A wonderful little production. <br /><br />The...          162
2  I thought this was a wonderful way to spend ti...          166
3  Basically there's a family where a little boy ...          138
4  Petter Mattei's "Love in the Time of Money" is...          230
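
Note that the counts above are computed on the raw text, before the `<br />` tags are removed, so tag fragments are counted as words. The following small sketch uses a made-up sentence in the style of the raw reviews and the same `<br>` regex that the preprocessing step applies later. It only illustrates the size of the effect.

```python
import re

def count_words(text):
    # Same definition as in the notebook: whitespace-separated tokens
    return len(text.split())

# Made-up sample in the style of the raw reviews
raw = "A wonderful little production. <br /><br />The filming technique"

# Replace <br> tags with a space before counting
cleaned = re.sub(r'<br\s*/*>', ' ', raw)

raw_count = count_words(raw)
clean_count = count_words(cleaned)
print(raw_count, clean_count)
```

If exact word counts matter, one option is to compute the column after the HTML tags have been stripped.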


# Preprocessing

In [42]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

In [43]:
# Initialize NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [44]:
# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def simple_preprocessing(text):
    # Convert the text to lower case
    text = text.lower()

    # Replace HTML <br> tags with a space
    text = re.sub(r'<br\s*/*>', ' ', text)

    # Remove URLs
    text = re.sub(r'http\S+', '', text)

    # Remove the # and @ symbols (the tag or handle text itself is kept)
    text = re.sub(r'[@#]', '', text)

    # Remove punctuation (note: "couldn't" becomes "couldnt")
    text = re.sub(r'[^\w\s]', '', text)

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    filtered_tokens = [word for word in tokens if word not in STOP_WORDS]

    # Return the preprocessed text as a single string
    return ' '.join(filtered_tokens)
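
To see what each cleaning step does, the sketch below traces one sentence through the same regular expressions. It is a simplified stand-in, not the function above: `DEMO_STOP_WORDS` is a tiny hand-picked subset of NLTK's English stop-word list, `str.split()` replaces `word_tokenize()`, and the name `trace_preprocessing` is hypothetical.

```python
import re

# Tiny hand-picked subset standing in for NLTK's English stop-word list
DEMO_STOP_WORDS = {"i", "my", "off", "the"}

def trace_preprocessing(text):
    """Return the intermediate result after selected cleaning steps."""
    steps = {}
    text = text.lower()
    steps["lowercase"] = text
    text = re.sub(r'<br\s*/*>', ' ', text)   # HTML <br> tags
    text = re.sub(r'http\S+', '', text)      # URLs
    text = re.sub(r'[@#]', '', text)         # @ and # symbols
    text = re.sub(r'[^\w\s]', '', text)      # punctuation
    steps["no punctuation"] = text
    # str.split() stands in for word_tokenize() in this sketch
    tokens = [w for w in text.split() if w not in DEMO_STOP_WORDS]
    steps["no stop words"] = ' '.join(tokens)
    return steps

trace = trace_preprocessing("I couldn't take my eyes off the screen!")
for name, value in trace.items():
    print(f"{name}: {value}")
```

Because punctuation is stripped before stop words are filtered, the contraction loses its apostrophe and no longer matches the stop-word entry `couldn't`, so `couldnt` stays in the text as an ordinary token.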

In [45]:
# Apply the simple_preprocessing() function in the review column
df['review'] = df['review'].apply(simple_preprocessing)

In [48]:
# Print the first 5 reviews and check that punctuation and HTML tags were removed
print("First 5 preprocessed reviews:")
df.head()

First 5 preprocessed reviews:


                                              review sentiment  words count
0  one reviewers mentioned watching 1 oz episode ...  positive          307
1  wonderful little production filming technique ...  positive          162
2  thought wonderful way spend time hot summer we...  positive          166
3  basically theres family little boy jake thinks...  negative          138
4  petter matteis love time money visually stunni...  positive          230


In [49]:
# Drop duplicated reviews
df = df.drop_duplicates('review')


In [51]:
# Check that duplicated reviews were deleted
print("After dropping duplicates, number of unique reviews:", df['review'].nunique())

After dropping duplicates, number of unique reviews: 9983


In [52]:
# Initialize the Porter stemmer once and reuse it for every review
porter = PorterStemmer()

def stemming(text):
    # Stem each token in the text
    stemmed_text = [porter.stem(word) for word in word_tokenize(text)]

    # Return the stemmed text as a single string
    return ' '.join(stemmed_text)

# Apply the stemming() function to the review column
df['review'] = df['review'].apply(stemming)

# Preparing Data To Train The Model

In [53]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Binarize the sentiment column
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

# Split the data into X and Y
X = df['review']
Y = df['sentiment']

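# Vectorize the data using TF-IDF
# (note: fitting on all reviews before the split lets the test reviews influence the vocabulary and IDF weights)
vectorizer = TfidfVectorizer()
X_vectorized = vectorizer.fit_transform(X)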

# Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(X_vectorized, Y, test_size=0.3, random_state=42)

# Print the shapes of x_train, y_train, x_test, and y_test
print("Shape of x_train:", x_train.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of x_test:", x_test.shape)
print("Shape of y_test:", y_test.shape)


Shape of x_train: (6988, 49900)
Shape of y_train: (6988,)
Shape of x_test: (2995, 49900)
Shape of y_test: (2995,)
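
In the cell above the vectorizer is fitted on all reviews before the split, so the 49,900 features include terms that occur only in the test reviews. A common alternative is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training part only. The sketch below shows this on a made-up toy corpus. The texts, labels, and variable names are assumptions for illustration, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the preprocessed reviews (made-up data)
texts = ["great film", "love movi", "wonder act", "enjoy stori",
         "bad film", "hate movi", "terribl act", "bore stori"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the test reviews stay unseen by the vectorizer
text_train, text_test, label_train, label_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels)

# The pipeline fits TF-IDF and the classifier on the training texts only
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(text_train, label_train)

# The learned vocabulary contains training words only
vocab = clf.named_steps["tfidfvectorizer"].vocabulary_
preds = clf.predict(text_test)
print(sorted(vocab))
```

A side benefit of the pipeline is that `clf.predict()` accepts raw (preprocessed) strings directly, so the same vectorizer is always applied at prediction time.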


# Machine Learning Model: Instantiating, Training, Predicting And Evaluating

In [54]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Instantiate the Logistic Regression model
model = LogisticRegression()

# Train the model
model.fit(x_train, y_train)

# Predict on the testing set
y_pred = model.predict(x_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Print confusion matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))


Accuracy: 0.8747913188647746
Confusion Matrix:
[[1266  222]
 [ 153 1354]]
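
The confusion matrix carries more information than the accuracy alone. As a worked example, the sketch below derives precision, recall, and F1 for the positive class from the four counts printed above. It assumes scikit-learn's layout (true labels in rows, predicted labels in columns, class 0 first).

```python
# Confusion matrix reported above; scikit-learn puts true labels in rows
# and predicted labels in columns, with class 0 (negative) first
tn, fp = 1266, 222
fn, tp = 153, 1354

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision_pos = tp / (tp + fp)   # share of predicted positives that are correct
recall_pos = tp / (tp + fn)      # share of actual positives that are found
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print(f"accuracy={accuracy:.4f} precision={precision_pos:.4f} "
      f"recall={recall_pos:.4f} f1={f1_pos:.4f}")
```

The counts show more false positives (222) than false negatives (153), so on this test set the model errs slightly more often toward predicting a positive sentiment.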


In [55]:
# New reviews to classify
new_reviews = [
    "I loved this movie!",
    "This movie was a bad comedy movie!"
]

# Apply preprocessing to the new reviews
# (note: stemming() is not applied here, unlike for the training data)
preprocessed_reviews = [simple_preprocessing(review) for review in new_reviews]

# Vectorize the preprocessed reviews
vectorized_reviews = vectorizer.transform(preprocessed_reviews)

# Predict sentiment using the trained model
predicted_sentiments = model.predict(vectorized_reviews)

# Map the predicted labels back to their original sentiments
predicted_sentiments = ['positive' if sentiment == 1 else 'negative' for sentiment in predicted_sentiments]

# Print the predicted sentiments for the new reviews
for review, sentiment in zip(new_reviews, predicted_sentiments):
    print(f"Review: {review} => Predicted Sentiment: {sentiment}")

Review: I loved this movie! => Predicted Sentiment: positive
Review: This movie was a bad comedy movie! => Predicted Sentiment: negative


In [56]:
# Additional phrases for prediction
additional_reviews = [
    "I couldn't take my eyes off the screen!",
    "The movie left me feeling indifferent.",
    "The special effects were impressive, but the story fell flat.",
    "I was on the edge of my seat the entire time!",
    "I wouldn't recommend this movie to anyone.",
    "The soundtrack was fantastic, but the pacing was off.",
    "I'm still thinking about this movie days later.",
    "The performances were lackluster, but the visuals were stunning.",
    "I've never been so disappointed by a film.",
    "This movie was a rollercoaster of emotions!"
]

# Apply preprocessing to the additional reviews
# (note: stemming() is not applied here, unlike for the training data)
preprocessed_reviews = [simple_preprocessing(review) for review in additional_reviews]

# Vectorize the preprocessed reviews
vectorized_reviews = vectorizer.transform(preprocessed_reviews)

# Predict sentiment using the trained model
predicted_sentiments = model.predict(vectorized_reviews)

# Map the predicted labels back to their original sentiments
predicted_sentiments = ['positive' if sentiment == 1 else 'negative' for sentiment in predicted_sentiments]

# Print the predicted sentiments for the additional reviews
for review, sentiment in zip(additional_reviews, predicted_sentiments):
    print(f"Review: {review} => Predicted Sentiment: {sentiment}")


Review: I couldn't take my eyes off the screen! => Predicted Sentiment: negative
Review: The movie left me feeling indifferent. => Predicted Sentiment: negative
Review: The special effects were impressive, but the story fell flat. => Predicted Sentiment: negative
Review: I was on the edge of my seat the entire time! => Predicted Sentiment: positive
Review: I wouldn't recommend this movie to anyone. => Predicted Sentiment: positive
Review: The soundtrack was fantastic, but the pacing was off. => Predicted Sentiment: positive
Review: I'm still thinking about this movie days later. => Predicted Sentiment: positive
Review: The performances were lackluster, but the visuals were stunning. => Predicted Sentiment: positive
Review: I've never been so disappointed by a film. => Predicted Sentiment: positive
Review: This movie was a rollercoaster of emotions! => Predicted Sentiment: positive


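The model misclassifies several of the additional reviews. The clearest cases are listed below, with the label a human reader would most likely assign in parentheses:

I couldn't take my eyes off the screen! => Predicted Sentiment: negative (expected: positive)

I wouldn't recommend this movie to anyone. => Predicted Sentiment: positive (expected: negative)

The soundtrack was fantastic, but the pacing was off. => Predicted Sentiment: positive (mixed review, no clear single label)

The performances were lackluster, but the visuals were stunning. => Predicted Sentiment: positive (mixed review, no clear single label)

I've never been so disappointed by a film. => Predicted Sentiment: positive (expected: negative)

One likely contributor is that the training reviews were stemmed, while the new reviews pass only through simple_preprocessing(). Unstemmed words such as "disappointed" do not appear in the stemmed vocabulary learned by the vectorizer, so they are ignored at prediction time. Idiomatic phrases and mixed opinions are also hard for a bag-of-words model, which scores each word independently of its context.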
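
A toy illustration of how skipping `stemming()` at prediction time hides words from the model. Everything here is a stand-in: `TOY_STEMS` is a hand-written lookup that mimics PorterStemmer on three words, `stemmed_vocabulary` is a made-up subset of what a vectorizer fitted on stemmed text might contain, and the input string is roughly what the cleaning step produces for the last review above.

```python
# Hand-written lookup standing in for PorterStemmer on a few words
TOY_STEMS = {"loved": "love", "movie": "movi", "disappointed": "disappoint"}

def toy_stem(text):
    return ' '.join(TOY_STEMS.get(word, word) for word in text.split())

# Vocabulary as it might look after fitting on stemmed text (toy subset)
stemmed_vocabulary = {"love", "movi", "disappoint", "never", "film"}

def known_tokens(text):
    # Tokens outside the vocabulary are ignored by the vectorizer at transform time
    return [word for word in text.split() if word in stemmed_vocabulary]

review = "ive never disappointed film"   # cleaned, but not stemmed

without_stemming = known_tokens(review)
with_stemming = known_tokens(toy_stem(review))
print(without_stemming)
print(with_stemming)
```

Without stemming, the most informative word of the review never reaches the classifier. In the notebook, the corresponding change would be to call `stemming(simple_preprocessing(review))` on each new review before `vectorizer.transform()`. Whether that corrects the predictions above has not been verified here.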

