<a href="https://colab.research.google.com/github/DotBion/techgb2336-dataSciBiz/blob/main/Copy_of_HW4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Movie Reviews for Text and Sentiment Analysis

The data in this homework is 20k movie reviews which have been already labelled as positive or negative.   We will apply our text toolbox to see if we can fit an effective supervised model.

The Movie Review data can be [downloaded from this link](https://drive.google.com/uc?export=download&id=1UA9CyRd8y7Wi4RKruXfItXadT3hY92bE)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import re
from sklearn.model_selection import train_test_split


In [2]:
# read in movie_reviews.csv
url = 'https://www.google.com/url?q=https%3A%2F%2Fdrive.google.com%2Fuc%3Fexport%3Ddownload%26id%3D1UA9CyRd8y7Wi4RKruXfItXadT3hY92bE'
df = pd.read_csv(url)
df.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


The data is simply a full text review, and a rating of 0(bad) or 1(good), determined by a human labeller.


We need to remove HTML tags...you can run the following code to remove them.

In [3]:
# remove html tags
df['text'] = df['text'].apply(lambda x: re.sub('<[^<]+?>', '', x))

**1) Lets do all of the things we need to do to prepare text data.  Lemmatize, tokenize, removing stopwords and punctuation.  Feel free to grab the exact code from the `T8_SOTU` notebook (specifically the function `clean_text`) and run it.   Create a new field called "clean_review" and append it to your data frame, so that you retain the original text in one feature, and have the cleaned text in another feature. This will allow us to go back and look at the original text of the review when we are evaluating the model**

In [4]:

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
# Download the missing 'punkt_tab' resource
nltk.download('punkt_tab') # This line was added to download the missing resource
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = re.sub(r'[^\w\s]', '', text) # remove punctuation
    text = re.sub(r'\s+', ' ', text) # remove extra spaces
    text = re.sub(r'http\S+', '', text) # remove URLs
    text = re.sub(r'www\S+', '', text) # remove URLs
    text = re.sub(r'[\d]', '', text) #remove numbers
    tokens = word_tokenize(text.lower()) # tokenize and lower
    tokens = [w for w in tokens if not w in stop_words] # remove stopwords
    tokens = [lemmatizer.lemmatize(w) for w in tokens] # lemmatize
    return " ".join(tokens)

df['clean_review'] = df['text'].apply(clean_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [5]:
df['clean_review']

Unnamed: 0,clean_review
0,grew b watching loving thunderbird mate school...
1,put movie dvd player sat coke chip expectation...
2,people know particular time past like feel nee...
3,even though great interest biblical movie bore...
4,im die hard dad army fan nothing ever change g...
...,...
19994,movie stuffed full stock horror movie goody ch...
19995,required watch movie work didnt pay contrary g...
19996,white noise potential one talked movie since e...
19997,five deadly venom great kungfu action movie wr...


**2) Split data into training and test.  Run a TFIDF Vectorizer on the cleaned reviews - you need to `fit` the Vectorizer to the training data, and then `transform` both the training and the test sets using the vectorizer.  Review our T8 notebooks for syntax.**

In [13]:

from sklearn.feature_extraction.text import TfidfVectorizer

# Split data into training and testing sets
# Use 'label' instead of 'rating' for the target variable
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_review'], df['label'], test_size=0.2, random_state=42
)

# Initialize and fit TF-IDF vectorizer to the training data
#vectorizer = TfidfVectorizer()

# vectorizer = TfidfVectorizer(max_features=5000, stop_words='english', ngram_range=(1, 2), min_df=5, max_df=0.7, use_idf=True, smooth_idf=True)
# the max_df is key here to avoid words like "state, government, tax, etc that appear in many speeches"
vectorizer = TfidfVectorizer(max_df=0.7,min_df=2,ngram_range=(1,2)) # these values are important!
#vectorizer = TfidfVectorizer() #gives a better auc score than the previous ones in logreg
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the test data using the fitted vectorizer
X_test_vec = vectorizer.transform(X_test)

In [14]:
print(X_train_vec.shape)

(15999, 97619)


In [15]:
print(X_test_vec.shape)

(4000, 97619)


**3) Fit your favorite classification model to the training set - you can use Logistic Regression (but be sure to regularlize!), or something more complex like XGBoost or Random Forests, or any other classification model.  Apply your model to the test set and report the AUC.**

In [16]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Initializing and training a Logistic Regression model with regularization
logreg = LogisticRegression(C=0.1, solver='liblinear') # C is the inverse of regularization strength
logreg.fit(X_train_vec, y_train)

# Making predictions on the test set
y_pred_prob = logreg.predict_proba(X_test_vec)[:, 1]

# Calculating the AUC
auc = roc_auc_score(y_test, y_pred_prob)
print(f"AUC Logistic Regression: {auc}")


AUC Logistic Regression: 0.9274721961376107


In [19]:

# Initialize and train an XGBoost classifier
import xgboost as xgb

xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=42)
xgb_model.fit(X_train_vec, y_train)

# Make predictions on the test set
y_pred_prob_xgb = xgb_model.predict_proba(X_test_vec)[:, 1]

# Calculate the AUC for XGBoost
auc_xgb = roc_auc_score(y_test, y_pred_prob_xgb)
print(f"XGBoost AUC: {auc_xgb}")


XGBoost AUC: 0.9254200955846835


**4) Find the P(label=1) for all cases in the test set.  Identify the review in the test set which has the _lowest_ probability of being a good review BUT has label=1 (good).  This is an error!  Print the complete original review text for this error.  Do you think this is actually a good or bad review?  Do you think this error is due to an error in the labelling (y_test) or a problem with the model?  Explain.**

In [23]:

# Assuming y_pred_prob contains probabilities from the better performing model here Logistic Regression
min_prob_idx = np.argmin(y_pred_prob[y_test == 1])
min_prob = y_pred_prob[y_test == 1][min_prob_idx]
error_review_index = y_test[y_test == 1].index[min_prob_idx]
error_review_original = df['text'][error_review_index]

print(f"Review with lowest probability of being good (but labeled good):")
print(f"Probability: {min_prob}")
print(f"Original review text:\n{error_review_original}")

print("\nIs this actually a good or bad review?")
print("It is a bad review. The movie is trash but the reviewer glorifies the trash movie in ways like praising every other bad movies and having fun criticizing every scene, and the reason for giving it a score of 10 is not that it is good but the worst one could ever watch!")

print("\nIs this error due to a labeling error or a problem with the model?")

print("The error is due to a potential labeling issue.")
print("The human labeler might have interpreted a slight positive aspect as a good review, while the model might be more sensitive to the subtle nuances and negative cues present in the review.")

Review with lowest probability of being good (but labeled good):
Probability: 0.1976834813712349
Original review text:
I have never seen such a movie before. I was on the edge of my seat and constantly laughing throughout the entire movie. I never thought such horrible acting existed it was all just too funny. The story behind the movie is decent but the movies scenes fail to portray them. I have never seen such a stupid movie in my life which is why it I think its worth watching. I give this movie 10 out of 10 for being the most pathetic movie ever created, this movie seems like it was solely created to become trash. I mean the scenes seem so fake and the actors act like "the camera is in front of them". You will get a kick just watching how lame this movie is, me and my friend could not stop making jokes during the movie, the darthvader guy who tries to get the girl got ran over not once but twice and the second time he got ran over it sounded like he said sh!# although he doesn't sp

**5) To get an objective view of whether this review is _really_ positive or negative, we can use a pre-defined sentiment model built off of an existing lexicon.  One such model is called Vader.  [documentation here](https://medium.com/@rslavanyageetha/vader-a-comprehensive-guide-to-sentiment-analysis-in-python-c4f1868b0d2e).  Using your incorrectly labelled review from the last probem and the Vader code below, report what the negativity score is from Vader ('neg' in the output). Does this support your conclusion about the error above?**

In [32]:


# Downloading the VADER lexicon
nltk.download('vader_lexicon')

analyzer = SentimentIntensityAnalyzer()
text = error_review_original
scores = analyzer.polarity_scores(text)
print(text)
print(scores)

print("\nNo, this does not support the conclusion above as the compunded prediction for the text is high meaning the Vader model feels this as a positive sentiment, where as it's a it's a 'so bad it’s good' kind of review !\n")

I have never seen such a movie before. I was on the edge of my seat and constantly laughing throughout the entire movie. I never thought such horrible acting existed it was all just too funny. The story behind the movie is decent but the movies scenes fail to portray them. I have never seen such a stupid movie in my life which is why it I think its worth watching. I give this movie 10 out of 10 for being the most pathetic movie ever created, this movie seems like it was solely created to become trash. I mean the scenes seem so fake and the actors act like "the camera is in front of them". You will get a kick just watching how lame this movie is, me and my friend could not stop making jokes during the movie, the darthvader guy who tries to get the girl got ran over not once but twice and the second time he got ran over it sounded like he said sh!# although he doesn't speak English lol. If you watch this movie you will think to yourself that all those other movies you didn't like you too

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


**6) `eli5` is a Python library that tries to "explain" machine learning models.  It works for simple models like logistic regression as well as more complicated, black box models like XGBoost and Random Forests.   Look up the documentation for `eli5` and use it to show the words contributing most to positive and negative scores in your model.**

In [36]:
!pip install eli5

import eli5



In [37]:
print("\nfor logreg:\n")
eli5.show_weights(logreg, vec=vectorizer)


for logreg:



Weight?,Feature
+3.002,convicted murderer
+1.780,earlier made
+1.722,ann claire
+1.637,character definitely
+1.361,jackass
+1.335,interesting would
+1.208,early afternoon
… 50041 more positive …,… 50041 more positive …
… 47559 more negative …,… 47559 more negative …
-1.208,half scary


In [38]:
print("\nfor xgboost:\n")
eli5.show_weights(xgb_model, vec=vectorizer)


for xgboost:



Weight,Feature
0.0197,waste
0.0182,worst
0.0116,bad
0.0093,boring
0.0093,awful
0.0085,stupid
0.0083,crap
0.0074,minute
0.0072,terrible
0.0071,worse


In [40]:
eli5.show_prediction(xgb_model, doc=error_review_original, vec=vectorizer)



Contribution?,Feature
3.963,Highlighted in text (sum)
0.17,great
0.079,also
0.067,see
0.066,excellent
0.057,good
0.055,fun
0.05,film
0.049,favorite
0.046,world
