In [1]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
np.random.seed(0)

## Movie Review Classifier 🍿📽️

In this assignment, we'll be training a model to classify movie reviews as 'good' or 'bad.'\
The data consists of 40,000 real movie reviews from IMDb.


We'll load the data as a zipped csv. \
Notice that `pd.read_csv()` can take a URL as the path argument and that we can read in a compressed file without first expanding it if we specify the `compression` format!

In [2]:
data_url = './data/movie_reviews.zip'
df = pd.read_csv(data_url, compression='zip')
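
To see the compressed-read behavior in isolation, here is a minimal sketch that writes a tiny zipped CSV to a temporary folder and reads it straight back. The file names and the two toy rows are made up for illustration and are not part of the assignment data.

```python
import os
import tempfile
import pandas as pd

# Toy data standing in for the real reviews (illustrative only)
toy = pd.DataFrame({'text': ['loved it', 'hated it'], 'label': [1, 0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, 'toy_reviews.zip')
    # Write a zip archive that contains a single CSV file
    toy.to_csv(zip_path, index=False,
               compression={'method': 'zip', 'archive_name': 'toy_reviews.csv'})
    # Read it back without unzipping first
    toy_loaded = pd.read_csv(zip_path, compression='zip')

print(toy_loaded)
```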

In [3]:
df.head()

Unnamed: 0,text,label
0,If you havent seen this movie than you need to...,1
1,but Cinderella gets my vote not only for the w...,0
2,This movie is pretty cheesy but I do give it c...,1
3,I have not seen a Van Damme flick for a while ...,1
4,This is a sleeper It defines Nicholas Cage The...,1


In [4]:
df.shape

(40000, 2)

In [5]:
df.label.unique()

array([1, 0], dtype=int64)

We see that the dataset consists of text reviews and binary labels. Intuitively, the positive class is "good" while the negative is "bad."

Here are two examples from the dataset:

In [6]:
labels = {0: 'bad', 1: 'good'}
seen = {'bad': False, 'good': False}
for i in range(df.shape[0]):
    label = df.loc[i,'label']
    if not seen[labels[label]]:
        # display/print combination used to appease Ed's strange output behavior
        display(df.loc[i, 'text'])
        print()
        display(f"label: {labels[label]}")
        print()
        seen[labels[label]] = True
    if all(seen.values()):
        break

'If you havent seen this movie than you need to It rocks and you have to watch it It is so funny and will make you laugh your guts out so you have to watch it and i saw it about a billion and a half times and still think it is funny so you have to yes i have memorized the whole movie and could quote it to you from start to finish you must see this move it is also cute because it is half a chick flick if you dont watch it then you are really missing outthis movie even has cute guys in it and that is always a bonus so in summary watch the movie now and trust me you will not be making a mistake did i mention the music is good too So you should like it if you enjoy music This is a movie that they rated correctly and it will work for anyone'




'label: good'




'but Cinderella gets my vote not only for the worst of Disneys princess movies but for the worst movie the company made during Walts lifetime The music is genuinely pretty and the story deserves to be called classic What fails in this movie are the characters particularly the title character who could only be called the heroine in the loosest sense of the term After a brief prologue the audience is introduced to Cinderella She is waking up in the morning and singing A Dream is A wish Your Heart Makes This establishes her as an idealist and thus deserving of our sympathy Unfortunately the script gives us no clue as to what she is dreaming about Freedom from her servant role The respect of her stepfamily Someone to talk to besides mice and birds In one song cut from the movie but presented in the special features section of the latest DVD Cinderella relates her wish that there could be many of her so she could do her work more efficiently You go girlfriend In short Cinderella is a very b




'label: bad'




**Some Preprocessing**

In the 2nd example, we can see some HTML tags inside the review text.

Complete the `remove_br()` function by providing its call to `re.sub()` with a regex that removes those pesky "\<br />" tags from an input string, `x`.\
Specifically, we should replace 2 consecutive occurrences of "\<br />" with a single space (can you see why?).

**Hint:** It is good practice to use a 'raw' string when writing regular expressions to ensure that special characters are treated correctly. Raw strings are prefixed with an 'r' like this: `r'this is a raw string'`

In [7]:
# please fill this code block!
# fill in the regular expression
# Define the regular expression to match two consecutive "<br />" tags and replace them with a space.
pattern = r'<br\s*/>\s*<br\s*/>'
remove_br = lambda x: re.sub(pattern, ' ', x)
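
As a quick check of why the replacement is a space rather than an empty string, here is a small sketch on a made-up review fragment. Without the space, the words on either side of the tags fuse together, and once punctuation is stripped they would become a single token.

```python
import re

# Same idea as the pattern above: two consecutive <br /> tags,
# tolerant of optional whitespace inside and between them
br_pattern = r'<br\s*/>\s*<br\s*/>'

sample = 'Great start.<br /><br />Weak ending.'  # made-up review fragment
with_space = re.sub(br_pattern, ' ', sample)
without_space = re.sub(br_pattern, '', sample)

print(with_space)     # Great start. Weak ending.
print(without_space)  # Great start.Weak ending.
```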

Use the dataframe's `apply()` method to apply `remove_br` to each review in the dataset (the train/test split comes later).

In [8]:
# please fill this code block!
# Apply the function on the 'text' column of the dataframe
df['text'] = df['text'].apply(remove_br)

And we can see that the tags have been removed!

In [9]:
df.loc[4,'text']

'This is a sleeper It defines Nicholas Cage The plot is intricate and totally absorbing The ending will blow you away See it whenever you have the opportunity'

Don't worry about any newline characters or backslashes you may see before apostrophes in the examples above. This is just a quirk of how Jupyter displays strings by default.\
We don't see these characters if we explicitly `print` the string.

example_str = df.loc[4,'text']
print(example_str)

We'll continue our preprocessing by next **removing punctuation**.\
But first, let's keep a copy of the data *with* punctuation. This will be useful at the end of the notebook when we want to display the original text of specific observations.

In [10]:
# store copy of data with punctuation
df_raw = df.copy()

The next regex we need is a bit more involved.\
**This should match any character that is neither alphanumeric nor whitespace, as well as underscores** (strangely, underscores are not covered by the first 2 conditions, because `\w` treats the underscore as a word character).

**Hints:**
- `\w` matches alphanumeric characters (and the underscore)
- `\s` matches whitespace
- `[]` can be used to denote a set of characters. ex: `r'[ab]'` will match on 'a' *or* 'b'
- `^` at the beginning of a character set denotes *negation*. ex: `r'[^0-9]'` will match any character that is not a digit
- `|` is the *logical or* operator. ex: `r'cat|dog'` will match the strings 'cat' *or* 'dog' 
- There are many helpful sites for testing regexes. [Here's a nice one](https://www.regextester.com/).

In [11]:
# please fill this code block!
# create a regex that will match the characters described above 
# [^\w\s] alone leaves underscores in place (\w matches '_'), so add them explicitly
punc_regex = r'[^\w\s]|_'
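
Because `\w` also matches the underscore, a character set built only from `\w` and `\s` will not remove underscores. The sketch below compares the two variants on a made-up string, so you can see exactly what the extra `|_` alternative changes.

```python
import re

sample = "Don't miss_this: 9/10, really!"  # made-up text

# Negated set only: punctuation goes, but the underscore survives
no_underscore_rule = re.sub(r'[^\w\s]', '', sample)
# Negated set OR a literal underscore: both are removed
with_underscore_rule = re.sub(r'[^\w\s]|_', '', sample)

print(no_underscore_rule)    # Dont miss_this 910 really
print(with_underscore_rule)  # Dont missthis 910 really
```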

Here we'll use an alternative to the `apply` approach we saw above.\
Pandas has its own set of built-in string methods which includes a version of `replace`. But unlike Python's `str.replace()` this can actually use regexes!

In [12]:
df['text'] = df.text.str.replace(punc_regex, '', regex=True) # remove punctuation

If all went well we can see that punctuation has been removed from our dataset.

In [13]:
example_str = df.loc[4,'text']
print(example_str)

This is a sleeper It defines Nicholas Cage The plot is intricate and totally absorbing The ending will blow you away See it whenever you have the opportunity


**Train/Test Split**

Rather than splitting the data directly with `train_test_split` we'll instead use it to generate indices for the train and test data.\
This may seem strange, but there is a good reason for it. These indices will later allow us to recover the original, unprocessed text from `df_raw` for any given training and test observations. 

Notice too that we are stratifying on the label. This will help ensure that good and bad reviews appear in the same proportions in both train and test.

In [20]:
# generate indices to designate train and test observations
train_idx, test_idx = train_test_split(range(df.shape[0]), test_size=0.2, random_state=0, stratify=df['label'])
# stratify=df['label'] keeps the proportion of each label the same in train and test
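
Here is a small sketch of the same index-based split on toy labels, to show what stratifying buys us. The 70/30 label mix is made up; the point is only that both index sets end up with the same class proportions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 70% positive, 30% negative (illustrative only)
toy_labels = np.array([1] * 70 + [0] * 30)

# Split the row positions rather than the data itself
toy_train_idx, toy_test_idx = train_test_split(
    range(len(toy_labels)), test_size=0.2, random_state=0, stratify=toy_labels
)

print(len(toy_train_idx), len(toy_test_idx))  # 80 20
print(toy_labels[toy_train_idx].mean())       # share of positives in train
print(toy_labels[toy_test_idx].mean())        # share of positives in test
```

Since the split is expressed as row positions, the same indices can be reused on any array that shares the original row order.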

In [21]:
# Separate the predictor from the response
x = df.text.values
y = df.label.values

In [25]:
# Create train and test sets using the generated indices
x_train = x[train_idx]
y_train = y[train_idx]
x_test = x[test_idx]
y_test = y[test_idx]

**Building the Classifier Pipeline**\
**Step 1: Vectorizer**

It's true that there are still several preprocessing steps to be done, such as converting to lowercase and tokenizing the reviews, but these can be done for us by sklearn's [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). 

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

Instantiate a `TfidfVectorizer` with parameters such that it will:
- set all reviews to lowercase
- remove English stopwords
- exclude words that occur in fewer than 1 review in 10,000
- exclude words that occur in more than 90% of reviews

**Hint:** Reading the documentation, you'll see the arguments you need are `lowercase`, `stop_words`, `min_df`, and `max_df`

In [53]:
# please fill this code block!
vec = TfidfVectorizer(lowercase=True, stop_words='english', min_df=0.0001, max_df=0.9)
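
To get a feel for `min_df` and `max_df`, here is a sketch on four made-up "reviews." One thing worth knowing: a float is read as a proportion of documents, while an integer is read as an absolute document count. The toy corpus is too small for a proportion like 0.0001 to filter anything, so this sketch uses `min_df=2` (an integer) instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Four made-up "reviews"
toy_docs = [
    'movie great acting',
    'movie terrible plot',
    'movie great soundtrack',
    'movie boring plot',
]

# Keep words found in at least 2 documents and in at most 90% of documents
toy_vec = TfidfVectorizer(lowercase=True, stop_words='english', min_df=2, max_df=0.9)
toy_vec.fit(toy_docs)

# 'movie' appears in every document, so max_df drops it;
# words seen in only one document are dropped by min_df
print(sorted(toy_vec.vocabulary_))  # ['great', 'plot']
```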

**Step 2: Classifier**

We'll use logistic regression with l2 regularization as our classifier model. The [LogisticRegressionCV](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html?highlight=logisticregressioncv#sklearn.linear_model.LogisticRegressionCV) object allows us to easily tune for the best regularization parameter.

In [54]:
from sklearn.linear_model import LogisticRegressionCV

With tens of thousands of training observations and each word in the vectorizer's vocabulary acting as a predictor, training could be slow.\
This issue is exacerbated when using cross validation, as we need to fit the model multiple times!\
We'll set our classifier CV parameters so as to help keep the training time down to around 30 seconds or so.
- l2 penalty (i.e., ridge)
- 10 iterations per fit (remember, logistic regression has no closed form solution for the betas!)
- 5-fold CV
- random state of 0 (the fitting can be stochastic)

In [55]:
# please fill this code block!
# Instantiate our Classifier
clf = LogisticRegressionCV(cv=5, penalty="l2", max_iter=10, random_state=0)

**Step 3: Pipeline**

Any text data going into our classifier will have to first be converted to numerical data by our vectorizer.\
One way to do this would be to:
1. fit the vectorizer on the training data
2. transform a dataset with the fitted vectorizer
3. pass the transformed data to the classifier

(1) only needs to be done once, but (2) & (3) would need to be done manually for train, test, and any other data we want to give the model.\
This would be tedious! Luckily, sklearn's [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html?highlight=pipeline#sklearn.pipeline.Pipeline) object allows us to connect one or more 'transformers' (such as a scaler or vectorizer) with a model.

In [56]:
from sklearn.pipeline import make_pipeline

Use [make_pipeline()](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=make_pipeline#sklearn.pipeline.make_pipeline) to connect the vectorizer, `vec`, and our classifier, `clf`, into a single pipeline.

**Hint:** You can set `verbose=True` to see the individual steps during the fit process later.

In [57]:
# please fill this code block!
# Construct the pipeline
pipe = make_pipeline(vec,clf)
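
To make the "manual steps vs. pipeline" comparison concrete, here is a sketch on a made-up four-document corpus. It swaps in a plain `LogisticRegression` (not the CV version) to keep things fast, so treat it as an illustration of the mechanics rather than of our actual model.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up training documents, labels, and new documents
toy_docs = ['great fun', 'awful mess', 'great cast', 'awful plot']
toy_y = [1, 0, 1, 0]
new_docs = ['great plot', 'awful cast']

# Manual route: fit the vectorizer, transform, then fit and predict
manual_vec = TfidfVectorizer()
manual_clf = LogisticRegression()
manual_clf.fit(manual_vec.fit_transform(toy_docs), toy_y)
manual_pred = manual_clf.predict(manual_vec.transform(new_docs))

# Pipeline route: the same two steps chained behind a single fit/predict
toy_pipe = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_pipe.fit(toy_docs, toy_y)
pipe_pred = toy_pipe.predict(new_docs)

print(np.array_equal(manual_pred, pipe_pred))  # both routes agree
print(list(toy_pipe.named_steps))              # auto-generated step names
```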

**Step 4: Fitting**

When it comes to fitting, we can treat the pipeline object as if it were the classifier object itself, and simply call `fit` on the pipeline.

In [58]:
# For the sake of time, we are fitting quickly and we may not converge
# We'll suppress those pesky warnings
from warnings import simplefilter
from sklearn.exceptions import ConvergenceWarning
# We also ignore FutureWarnings due to version issues on Ed
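# One call per category: the documented signature takes a single Warning class
simplefilter("ignore", category=ConvergenceWarning)
simplefilter("ignore", category=FutureWarning)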

In [59]:
### edTest(test_fit) ###
# Fit the model via the pipeline
pipe.fit(x_train, y_train)

We can inspect the steps of the pipeline.

In [60]:
pipe.get_params()['steps']

[('tfidfvectorizer',
  TfidfVectorizer(max_df=0.9, min_df=0.0001, stop_words='english')),
 ('logisticregressioncv',
  LogisticRegressionCV(cv=5, max_iter=10, random_state=0))]

By default they are named using the all lowercase class name of each object.\
We can use these names to access the fitted objects inside. Here we see the size of our vectorizer's vocabulary.

In [63]:
features = pipe.get_params()['tfidfvectorizer'].get_feature_names_out()  # requires scikit-learn >= 1.0
print('# of features:', len(features))

# of features: 36658


There are too many to print, but we can peek at a random sample.

In [64]:
sample_size = 40
feature_sample_idx = np.random.choice(len(features), size=sample_size, replace=False)
print(np.array(features)[feature_sample_idx])

['integrating' 'gregson' 'deal' 'homophobia' 'chemically' 'explained'
 'messedup' 'shannon' 'regal' 'brick' 'carpathian' 'revelry' 'loos' 'whit'
 'yu' 'strata' 'unlucky' 'unwanted' 'anachronism' 'interior' 'julie'
 'seasickness' 'lex' 'articulated' '10000000' 'genre' 'rebroadcast' 'auer'
 'lapsed' 'dozens' 'crichtons' '2020' 'hoffmans' 'storyteller' 'committee'
 'fellini' 'absolutly' 'sidesplitting' 'repetitiveness' 'boringly']


Similarly, we can access the fitted logistic model and see what regularization parameter was used.

In [65]:
best_C = pipe.get_params()['logisticregressioncv'].C_[0]
print(f'Best C from cross-validation: {best_C:.4f}')

Best C from cross-validation: 2.7826


**Step 5: Prediction**

Just like we did when fitting, we can treat the pipeline object as the classifier when making predictions.\
Predict on the test data to get:
1. class labels
2. probabilities of being the positive class (i.e., 'good' reviews)
3. test accuracy

In [73]:
# please fill this code block!
# Predict class labels on test data
y_pred = pipe.predict(x_test)

# Predict probabilities of the positive class on the test data
y_pred_proba = pipe.predict_proba(x_test)[:, 1]  # column 0: negative class, column 1: positive class

# Calculate test accuracy
test_acc = pipe.score(x_test, y_test)
print(f"test accuracy: {test_acc:0.3f}")

test accuracy: 0.893
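
The three quantities above are closely related. Here is a NumPy-only sketch with made-up numbers: the second column of the probability matrix is the positive-class probability, thresholding it at 0.5 gives the same labels as picking the more likely class, and accuracy is just the share of labels that match.

```python
import numpy as np

# Made-up probabilities for 4 reviews; columns are ordered as classes [0, 1]
toy_proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]])
toy_true = np.array([0, 1, 0, 0])

toy_pos_proba = toy_proba[:, 1]               # probability of class 1 ('good')
toy_pred = (toy_pos_proba > 0.5).astype(int)  # pick the more likely class
toy_acc = (toy_pred == toy_true).mean()       # share of correct predictions

print(toy_pred)  # [0 1 1 0]
print(toy_acc)   # 0.75
```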


Can you get better than 0.893 by tweaking the preprocessing, or the vectorizer and classifier parameters? Perhaps inspecting how our model makes its predictions may help us decide how we might improve the model in the future.

### Kaggle Submission Process for Movie Review Classification

In the subsequent steps, we'll process the test dataset provided on Kaggle, produce a predicted output, and generate a CSV file suitable for submission. This will allow us to evaluate our model's predictions on Kaggle. Access the competition through this [link](https://www.kaggle.com/competitions/dsaa-6100-movie-review-classification/): **DSAA 6100 Movie Review Classification**.

When participating in the competition on Kaggle, please ensure your displayed username follows the format "StudentID_Name". This will help the teaching assistants to easily identify and verify your scores. For instance, change your Kaggle display name to a format similar to "50013772_Yupeng Xie" before submitting.

In [74]:
# 1. Load the 'test_data.csv' file
test_data = pd.read_csv('data/test_data.csv')

# Extract reviews from 'test_data.csv' assuming the column name is "text"
test_reviews = test_data['text']

# 2. Predict sentiments using the trained model
y_pred_kaggle = pipe.predict(test_reviews)

# 3. Create a dataframe for Kaggle submission
# Assuming 'test_data' has a column named 'Id' for identifying each review
submission = pd.DataFrame({'Id': test_data['Id'], 'Category': y_pred_kaggle})

# 4. Save the predictions to a .csv file for submission
submission.to_csv('kaggle_submission.csv', index=False)

print("Kaggle submission file saved as 'kaggle_submission.csv'")

Kaggle submission file saved as 'kaggle_submission.csv'
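
One thing to double check: the model was trained on cleaned text, so the Kaggle reviews should probably go through the same cleaning before prediction. Whether `test_data.csv` is already cleaned is not shown here, so this is an assumption. The helper below, `clean_text`, is a hypothetical name; its second pattern should mirror whatever `punc_regex` was used for training.

```python
import re

def clean_text(text):
    """Hypothetical helper: apply the same cleaning used on the training text."""
    # Replace paired <br /> tags with a single space
    text = re.sub(r'<br\s*/>\s*<br\s*/>', ' ', text)
    # Drop punctuation (should mirror the punc_regex used for training)
    return re.sub(r'[^\w\s]', '', text)

print(clean_text("Loved it!<br /><br />A solid 8/10."))  # Loved it A solid 810
```

If the raw file does need cleaning, something like `test_data['text'].apply(clean_text)` could be passed to `pipe.predict` in place of the raw column.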


### Interpretation

Below we'll use the `eli5` library to have some fun interpreting what is driving our model's predictions on specific test observations.

- [ELI5](https://eli5.readthedocs.io/en/latest/) is a Python library for visualizing and debugging machine learning models through a unified API. It has built-in support for several ML frameworks and provides a way to explain black-box models.

In [75]:
# please fill this code block!
# Install ELI5
!pip install eli5

Collecting eli5
  Downloading eli5-0.13.0.tar.gz (216 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Collecting graphviz (from eli5)
  Downloading graphviz-0.20.1-py3-none-any.whl (47 kB)

In [84]:
# For interpretation
import eli5
# for parsing/formatting eli5's HTML output
from bs4 import BeautifulSoup
# for displaying formatted HTML output
from IPython.display import HTML

ImportError: cannot import name 'if_delegate_has_method' from 'sklearn.utils.metaestimators' (D:\Anaconda\Lib\site-packages\sklearn\utils\metaestimators.py)

Here are the words driving positive class predictions.

In [88]:
eli5.show_weights(clf, vec=vec, top=25)

Weight?,Feature
+9.181,excellent
+8.748,710
+8.561,great
+7.245,amazing
+7.041,wonderful
+6.895,best
+6.699,perfect
+6.236,favorite
+6.048,brilliant
… 18538 more positive …,… 18538 more positive …
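
If `eli5` cannot be imported in your environment (as in the `ImportError` shown earlier), a similar ranking can be read directly from a linear model's coefficients. The sketch below uses made-up arrays standing in for the fitted classifier's `coef_[0]` and the vectorizer's feature names, so the numbers are illustrative only; it also ignores the bias term that `eli5` reports.

```python
import numpy as np

# Stand-ins for the fitted model's coef_[0] and the feature names (made-up values)
toy_coefs = np.array([2.5, -3.1, 0.4, 1.7, -0.2])
toy_features = np.array(['excellent', 'awful', 'plot', 'great', 'okay'])

top_k = 2
# Indices of the largest coefficients, biggest first
top_pos_idx = np.argsort(toy_coefs)[::-1][:top_k]
# Indices of the smallest (most negative) coefficients, smallest first
top_neg_idx = np.argsort(toy_coefs)[:top_k]

print(list(zip(toy_features[top_pos_idx], toy_coefs[top_pos_idx])))
print(list(zip(toy_features[top_neg_idx], toy_coefs[top_neg_idx])))
```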


Hmm, digits like 710, 810, and 410 driving predictions seem strange. What might they represent?

We'll use the 'raw' data with punctuation when inspecting the data (See! It is coming in handy!)

In [41]:
x_train_raw = df_raw.text[train_idx].values
x_test_raw = df_raw.text[test_idx].values

In [42]:
df_raw[df.text.str.contains(' 710 ')].iloc[0].text

"The cat and mouse are involved in the usual chases when Jerry dives into a bottle of invisible ink and discovers that it makes him vanish. Instead of seizing the opportunity to go spy on a girl mouse changing room or something, he uses his new-found invisibility to torment Tom. And it's pretty funny and quite inventive despite being a somewhat one-joke cartoon. And the action never leaves the interior of the house, which is usually the trait of below average T&J shorts. Still worth a 7/10. However, I'm not sure how an invisible mouse can cast a shadow on the wall, it defies physics and the very nature of being invisible itself."

These are actually numerical ratings embedded in the reviews! Looking at the text without the punctuation made it hard for us to see this at first.

Here's a helper function used to remove some extraneous things from `eli5`'s output. We just want to see the highlighted text.\
You don't need to read through the function but it is here as a nice resource/example. 🤓

In [43]:
def eli5_html(clf, vec, observation):
    """
    helper function for nicely formatting and displaying eli5 output
    """
    # Get info on what is driving a given observation's predictions
    eli5_results = eli5.show_prediction(estimator=clf, doc=observation, vec=vec, targets=[True], target_names=['bad', 'good'])
    # Convert eli5's HTML data to BS object for parsing/formatting
    soup = BeautifulSoup(eli5_results.data, 'html.parser')
    # Remove a table we don't want
    soup.table.decompose()
    # Remove the first <p> tag with unwanted text
    soup.p.decompose()
    # Display the newly formatted HTML!
    display(HTML(str(soup)))

Now all you need to do is find the specific observations requested.\
You'll need your `y_pred_proba` values for this section to find which elements from `x_test_raw` to select.

**Hint:** [np.argsort()](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html), [np.flip()](https://numpy.org/doc/stable/reference/generated/numpy.flip.html?highlight=flip#numpy.flip), and [np.abs()](https://numpy.org/doc/stable/reference/generated/numpy.absolute.html) may be useful here. 

### What are the **5 worst** movie reviews in the test set according to your model? 🍅

In [45]:
# please fill this code block!
# Find indices of 5 worst reviews
# lowest predicted probability of being 'good' comes first
worst5 = x_test_raw[np.argsort(y_pred_proba)[:5]]

In [46]:
# Shared banner style, reused by the cells below
style = 'background-color:black;color:white;font-weight:bold;padding:4px'
for i, review in enumerate(worst5):
    display(HTML(f"<p style={style}>Bad Movie #{i+1} 🍅</p>"))
    eli5_html(clf, vec, review)

### What are the **5 best** movie reviews in the test set according to your model? 🏆

In [47]:
# please fill this code block!
# Find indices of 5 best reviews
# highest predicted probability of being 'good' comes first
best5 = x_test_raw[np.flip(np.argsort(y_pred_proba))[:5]]

In [48]:
for i, review in enumerate(best5):
    display(HTML(f"<p style={style}>Good Movie #{i+1} 🏆</p>"))
    eli5_html(clf, vec, review)

What are the **5 most 'meh'** movie reviews in the test set according to your model? 😐\
That is, which reviews are the most neutral according to your model?\
Upon reading some of these reviews you may find their sentiment to actually *not* be very ambiguous. What might be confusing our model?

In [49]:
# please fill this code block!
# Find indices of the 5 most neutral reviews
# predicted probability closest to 0.5 comes first
meh5 = x_test_raw[np.argsort(np.abs(y_pred_proba - 0.5))[:5]]

In [50]:
for i, review in enumerate(meh5):
    display(HTML(f"<p style={style}>'Meh' Movie #{i+1} 😐</p>"))
    eli5_html(clf, vec, review)

Despite some difficulties with a few of the 'meh' movies, our model is actually pretty good! In fact, it works so well you can actually use it to find _mistakes_ in the manually labeled data!\
This can be done by inspecting which training observation predictions differ the most from the provided labels.
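
A minimal sketch of that idea, using made-up labels and probabilities: rank observations by how far the predicted probability of the positive class sits from the provided label, and look at the top of the list first.

```python
import numpy as np

# Made-up labels and predicted probabilities of the positive class
toy_labels = np.array([1, 0, 1, 0, 1])
toy_proba = np.array([0.95, 0.10, 0.03, 0.60, 0.70])

# Disagreement between label and prediction: 0 means agree, 1 means opposite
disagreement = np.abs(toy_labels - toy_proba)
# Observations ordered from most to least suspicious
suspect_idx = np.flip(np.argsort(disagreement))

print(suspect_idx[:2])  # [2 3]
```

With the real model, the probabilities would come from `pipe.predict_proba(x_train)[:, 1]` and the labels from `y_train`; the resulting indices could then be used on `x_train_raw` to read the original text of the suspicious reviews.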

**Write your own review**

Finally, you can try writing a review of your own and see what your model does with it!

In [51]:
my_review = """
            your review here
            """

# Remove punctuation using your regex from earlier
my_review = re.sub(punc_regex, '', my_review)
# Remove leading & trailing whitespace
# and put into a numpy array (which the model expects)
my_review = np.array([my_review.strip()])
my_review

array(['The film captivated me from the very start The cinematography was breathtaking and the character development was nuanced and profound The score set the tone perfectly enhancing the emotional depth of each scene While the plot had a couple of predictable moments the stellar performances by the lead actors more than made up for it Its a mustwatch for any cinema lover'],
      dtype='<U367')

In [52]:
my_review_proba = pipe.predict_proba(my_review)[:,1][0]
my_review_label = pipe.predict(my_review)[0]
print('predicted class:', my_review_label)
print('predicted probability:', my_review_proba)

predicted class: 1
predicted probability: 0.7451041113242047


In [53]:
display(HTML(f"<p style={style}>My Review 🍿</p>"))
eli5_html(clf, vec, my_review[0])