# Movie Reviews for Text and Sentiment Analysis

The data in this homework consists of 20k movie reviews that have already been labelled as positive or negative. We will apply our text toolbox to see if we can fit an effective supervised model.

The Movie Review data can be [downloaded from this link](https://drive.google.com/uc?export=download&id=1UA9CyRd8y7Wi4RKruXfItXadT3hY92bE)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import re
from sklearn.model_selection import train_test_split


In [2]:
# read in movie_reviews.csv from the direct download link given above
url = 'https://drive.google.com/uc?export=download&id=1UA9CyRd8y7Wi4RKruXfItXadT3hY92bE'
df = pd.read_csv(url)
df.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


The data is simply the full text of each review and a rating of 0 (bad) or 1 (good), determined by a human labeller.


We need to remove HTML tags. You can run the following code to remove them.

In [3]:
# remove html tags
df['text'] = df['text'].apply(lambda x: re.sub('<[^<]+?>', '', x))
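
As a quick check of what this pattern does, the sketch below applies the same regular expression to a made-up string. The sample text is an assumption for illustration and is not taken from the dataset.

```python
import re

# Same pattern as above: '<', then the shortest run of non-'<' characters, then '>'
TAG_PATTERN = re.compile(r'<[^<]+?>')

# Hypothetical sample string, for illustration only
sample = "Great movie!<br /><br />I <b>loved</b> it."
cleaned = TAG_PATTERN.sub('', sample)
print(cleaned)
```

Note that replacing a tag with an empty string can fuse the words on either side of it. Substituting a single space instead is one way to keep them apart.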

**1) Let's do everything we need to prepare the text data: lemmatize, tokenize, and remove stopwords and punctuation. Feel free to grab the exact code from the `T8_SOTU` notebook (specifically the function `clean_text`) and run it. Create a new field called "clean_review" and append it to your data frame, so that you retain the original text in one feature and have the cleaned text in another. This will allow us to go back and look at the original text of the review when we are evaluating the model.**

In [5]:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the tokenizer models, stopword list, and WordNet data used below
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')  # needed by word_tokenize in recent NLTK versions

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    """Return a lowercased, lemmatized copy of `text` without punctuation or stopwords."""
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    tokens = word_tokenize(text.lower())  # lowercase and tokenize
    tokens = [w for w in tokens if w not in stop_words]  # remove stopwords
    tokens = [lemmatizer.lemmatize(w) for w in tokens]  # lemmatize
    return " ".join(tokens)

df['clean_review'] = df['text'].apply(clean_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [6]:
df['clean_review']

Unnamed: 0,clean_review
0,grew b 1965 watching loving thunderbird mate s...
1,put movie dvd player sat coke chip expectation...
2,people know particular time past like feel nee...
3,even though great interest biblical movie bore...
4,im die hard dad army fan nothing ever change g...
...,...
19994,movie stuffed full stock horror movie goody ch...
19995,required watch movie work didnt pay contrary g...
19996,white noise potential one talked movie since e...
19997,five deadly venom great kungfu action movie wr...
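
The cleaned output above still contains tokens such as `didnt`. One likely reason is the order of operations in `clean_text`: punctuation is stripped before stopwords are removed, so a contraction no longer matches its entry in the stopword list. The sketch below illustrates the effect with a tiny hypothetical stopword set and `str.split` standing in for `word_tokenize`. Both are assumptions made to keep the example dependency-free.

```python
import re

# Tiny hypothetical stopword set; the notebook uses NLTK's English list instead
stop_words = {"i", "the", "it", "didn't"}

def strip_then_filter(text):
    # Same order as clean_text: punctuation is removed first
    text = re.sub(r"[^\w\s]", "", text)
    tokens = text.lower().split()  # str.split stands in for word_tokenize
    return [w for w in tokens if w not in stop_words]

def filter_then_strip(text):
    # Alternative order: drop stopwords first, then strip punctuation from what remains
    tokens = [w for w in text.lower().split() if w not in stop_words]
    tokens = [re.sub(r"[^\w\s]", "", w) for w in tokens]
    return [w for w in tokens if w]

sample = "I didn't like the ending."
first = strip_then_filter(sample)
second = filter_then_strip(sample)
print(first)
print(second)
```

Whether negations like this should be dropped at all is a modelling choice, since they can carry sentiment. The point here is only that the order of the cleaning steps changes which tokens survive.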


**2) Split data into training and test.  Run a TFIDF Vectorizer on the cleaned reviews - you need to `fit` the Vectorizer to the training data, and then `transform` both the training and the test sets using the vectorizer.  Review our T8 notebooks for syntax.**

In [8]:

from sklearn.feature_extraction.text import TfidfVectorizer

# Split data into training and test sets; 'label' is the target variable
X_train, X_test, y_train, y_test = train_test_split(
    df['clean_review'], df['label'], test_size=0.2, random_state=42
)

# Initialize and fit TF-IDF vectorizer to the training data
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)

# Transform the test data using the fitted vectorizer
X_test_vec = vectorizer.transform(X_test)

**3) Fit your favorite classification model to the training set. You can use Logistic Regression (but be sure to regularize!), or something more complex like XGBoost or Random Forests, or any other classification model. Apply your model to the test set and report the AUC.**

In [9]:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Initialize and train a Logistic Regression model with regularization
logreg = LogisticRegression(C=0.1, solver='liblinear') # C is the inverse of regularization strength
logreg.fit(X_train_vec, y_train)

# Make predictions on the test set
y_pred_prob = logreg.predict_proba(X_test_vec)[:, 1]

# Calculate the AUC
auc = roc_auc_score(y_test, y_pred_prob)
print(f"AUC: {auc}")


AUC: 0.9278907166451156
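
For intuition about what this number means: AUC can be read as the probability that a randomly chosen positive review receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs. The function name `pairwise_auc` and the toy labels and scores are hypothetical and used only for illustration.

```python
def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = pairwise_auc(y_true, scores)
print(toy_auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

This quadratic loop is only meant to make the definition concrete. `roc_auc_score` is the practical choice on the full test set.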


In [10]:

# Initialize and train an XGBoost classifier
import xgboost as xgb

xgb_model = xgb.XGBClassifier(objective='binary:logistic', random_state=42)
xgb_model.fit(X_train_vec, y_train)

# Make predictions on the test set
y_pred_prob_xgb = xgb_model.predict_proba(X_test_vec)[:, 1]

# Calculate the AUC for XGBoost
auc_xgb = roc_auc_score(y_test, y_pred_prob_xgb)
print(f"XGBoost AUC: {auc_xgb}")


XGBoost AUC: 0.9295540481483593


**4) Find the P(label=1) for all cases in the test set.  Identify the review in the test set which has the _lowest_ probability of being a good review BUT has label=1 (good).  This is an error!  Print the complete original review text for this error.  Do you think this is actually a good or bad review?  Do you think this error is due to an error in the labelling (y_test) or a problem with the model?  Explain.**

In [11]:

# Use the probabilities from the better-performing model (XGBoost has the higher AUC here)
good_mask = (y_test == 1).to_numpy()  # test reviews labelled good
min_prob_idx = np.argmin(y_pred_prob_xgb[good_mask])  # position within the good-labelled subset
min_prob = y_pred_prob_xgb[good_mask][min_prob_idx]
error_review_index = y_test[y_test == 1].index[min_prob_idx]  # map back to the DataFrame index
error_review_original = df.loc[error_review_index, 'text']

print("Review with lowest probability of being good (but labeled good):")
print(f"Probability: {min_prob}")
print(f"Original review text:\n{error_review_original}")

print("\nIs this actually a good or bad review?")
# Subjective assessment based on reading the review
print("Based on the review, I think it is bad: the reviewer calls the movie a disappointment, full of silly clichés, and pretty insulting overall.")

print("\nIs this error due to a labeling error or a problem with the model?")
# Explanation of the likely source of the error
print("The error appears to come from the label rather than the model.")
print("The model assigns a very low probability of being good, which matches the negative tone of the text,")
print("so the human label of 1 (good) looks like a labelling mistake.")


Review with lowest probability of being good (but labeled good):
Probability: 0.0025764082092791796
Original review text:
**SPOILERS AHEAD**It is really unfortunate that a movie so well produced turns out to besuch a disappointment. I thought this was full of (silly) clichés andthat it basically tried to hard. To the (American) guys out there: how many of you spend yourtime jumping on your girlfriend's bed and making monkeysounds? To the (married) girls: how many of you have suddenlygone from prudes to nymphos overnight--but not with yourhusband? To the French: would you really ask about someonebeing "à la fac" when you know they don't speak French? Wouldn'tyou use a more common word like "université"? I lived in France for a while and I sort of do know and understandEurope (and I love it), but my (German) roommate and I found thispretty insulting overall. It looked like a movie funded by theEuropean Parliament, and it tried too hard basically. It had allsorts of differences that it tr
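
The lookup above has to translate a position in a NumPy array back to a label in the shuffled DataFrame index. One way to make that alignment explicit is to wrap the probabilities in a Series that shares the index of `y_test` and use `idxmin`. The sketch below uses hypothetical toy values rather than the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical labels with a shuffled index, as train_test_split would produce
y_test = pd.Series([1, 0, 1, 1], index=[17, 3, 42, 8])
probs = np.array([0.90, 0.20, 0.05, 0.60])  # P(label=1), in the same row order

# Attach the original index so labels and scores stay aligned
prob_series = pd.Series(probs, index=y_test.index)

# Keep only reviews labelled good, then take the index label of the smallest score
worst_idx = prob_series[y_test == 1].idxmin()
worst_prob = prob_series.loc[worst_idx]
print(worst_idx, worst_prob)
```

The resulting `worst_idx` is an index label, so it could be passed straight to `df.loc[worst_idx, 'text']` to retrieve the original review.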

**5) To get an objective view of whether this review is _really_ positive or negative, we can use a pre-defined sentiment model built on an existing lexicon. One such model is called Vader ([documentation here](https://medium.com/@rslavanyageetha/vader-a-comprehensive-guide-to-sentiment-analysis-in-python-c4f1868b0d2e)). Using your incorrectly labelled review from the last problem and the Vader code below, report the negativity score from Vader ('neg' in the output). Does this support your conclusion about the error above?**

In [13]:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon
nltk.download('vader_lexicon')

# error_review_original holds the text of the misclassified review from question 4
analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores(error_review_original)
print(scores)
neg_score = scores['neg']
print(f"\nNegativity score from Vader: {neg_score}")

print("\nDoes this support your conclusion about the error above?")
# Analysis of the VADER scores
print("Not really. The negativity score is low (0.103) and about the same as the positivity score (0.108),")
print("and the compound score (0.5735) is positive, so VADER does not flag this review as negative.")
print("A lexicon-based score can miss criticism phrased as rhetorical questions in mostly neutral wording, as in this review.")

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


{'neg': 0.103, 'neu': 0.789, 'pos': 0.108, 'compound': 0.5735}

Negativity score from Vader: 0.103

Does this support your conclusion about the error above?
Not really. The negativity score is low (0.103) and about the same as the positivity score (0.108),
and the compound score (0.5735) is positive, so VADER does not flag this review as negative.
A lexicon-based score can miss criticism phrased as rhetorical questions in mostly neutral wording, as in this review.
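
When reading VADER output, a commonly used convention is to treat a `compound` score of at least 0.05 as positive, at most -0.05 as negative, and anything in between as neutral. The sketch below applies that rule to the scores printed above. The helper `label_from_compound` and its default threshold are illustrative assumptions, not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label (assumed cut-offs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output above
scores = {'neg': 0.103, 'neu': 0.789, 'pos': 0.108, 'compound': 0.5735}
vader_label = label_from_compound(scores['compound'])
print(vader_label)
```

Under this convention the review reads as positive to VADER, even though its wording is critical. That is worth keeping in mind when treating a lexicon score as an objective check.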


In [None]:
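import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
analyzer = SentimentIntensityAnalyzer()

text = "YOUR REVIEW TEXT HERE"  # replace with the review text to score
scores = analyzer.polarity_scores(text)
print(scores)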

**6) `eli5` is a Python library that tries to "explain" machine learning models.  It works for simple models like logistic regression as well as more complicated, black box models like XGBoost and Random Forests.   Look up the documentation for `eli5` and use it to show the words contributing most to positive and negative scores in your model.**

In [14]:
!pip install eli5

import eli5

# Logistic regression: signed weight per word (positive weights push toward label=1)
eli5.show_weights(logreg, vec=vectorizer)

# XGBoost: feature importances, which are unsigned. A notebook renders only the
# last expression in a cell, so this is the table shown in the output below.
eli5.show_weights(xgb_model, vec=vectorizer)



Weight,Feature
0.0182,waste
0.0175,worst
0.0129,bad
0.0119,awful
0.0112,boring
0.0106,crap
0.0091,supposed
0.0091,wonderful
0.0080,lame
0.0078,worse
