In [2]:
import seaborn as sns
sns.set()

In [3]:
from static_grader import grader

# NLP: Analyzing Review Text


Unstructured data makes up the vast majority of data.  This is a basic intro to handling unstructured data.  Our objective is to be able to extract the sentiment (positive or negative) and gain insight from review text.  We will do this from Yelp review data.

## Metrics and scoring

The first two questions task you to build models, of increasing complexity, to predict the rating of a review from its text. The grader uses a test set to evaluate your model's performance against our reference solution, using the $R^2$ score. It **is** possible to receive a score greater than one, indicating that you've beaten our reference model. We compare our model's score on a test set to your score on the same test set. See how high you can go!

The final two questions ask only for the result of a calculation, and your results will be compared directly to those of a reference solution.

## Download and parse the data


To start, let's download the data set from Amazon S3:

In [4]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_review_reduced.json.gz'

The training data are a series of JSON objects, in a Gzipped file. Python supports Gzipped files natively: [`gzip.open`](https://docs.python.org/3/library/gzip.html) has the same interface as `open`, but handles `.gz` files automatically.

The built-in `json` package has a `loads` function that converts a JSON string into a Python dictionary. We could call that once for each row of the file. [`ujson`](http://docs.micropython.org/en/latest/library/ujson.html) has the same interface as the built-in `json` package, but is *substantially* faster (at the cost of non-robust handling of malformed JSON). We will use that inside a list comprehension to get a list of dictionaries:

In [5]:
import gzip
import ujson as json

with gzip.open('yelp_train_academic_dataset_review_reduced.json.gz') as f:
    data = [json.loads(line) for line in f]
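
To see the read pattern in isolation, here is a minimal sketch on a tiny made-up file. It uses the built-in `json` module instead of `ujson`, and the file name `toy_reviews.json.gz` and its two rows are invented for illustration; they are not part of the course data.

```python
import gzip
import json
import os
import tempfile

# Two hypothetical reviews standing in for the real data set
rows = [{"stars": 5, "text": "Great tacos"}, {"stars": 1, "text": "Cold fries"}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_reviews.json.gz")

    # Write one JSON object per line into a gzipped file
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

    # Read it back: gzip.open mirrors open, json.loads parses each line
    with gzip.open(path, "rt", encoding="utf-8") as f:
        toy_data = [json.loads(line) for line in f]

# Labels are kept in their own list, separate from the features
toy_stars = [row["stars"] for row in toy_data]
print(toy_stars)
```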

The scikit-learn API requires that we keep labels (in this case, the star ratings) and features in separate data structures.

In [6]:
stars = [row['stars'] for row in data]

# Questions


## Question 1: bag_of_words_model

Build a linear model predicting the star rating based on the text reviews. Apply the bag-of-words model using the [`CountVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer) to produce a feature matrix giving the counts of each word in each review.

**Hints**:
1. You will need to extract the review text from the raw input data, a list of dictionaries. You can take an approach similar to the one you took in the `ml` miniproject by first converting the data into a pandas data frame and then using [`ColumnTransformer`](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html?highlight=columntransformer#sklearn.compose.ColumnTransformer), or you can build a custom transformer to extract the text. Either way, remember that the `CountVectorizer` accepts as input to its `transform` method a 1D array of text.

1. Try choosing different values for `min_df` (minimum document frequency cutoff) and `max_df` in `CountVectorizer`. Setting `min_df` to zero admits rare words which might only appear once in the entire corpus.  This is both prone to overfitting and makes your data unmanageably large.  Don't forget to use cross-validation to select the right value.

1. Try using [`LinearRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression) or [`Ridge`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html?highlight=ridge#sklearn.linear_model.Ridge). There is also [`RidgeCV`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html?highlight=ridge#sklearn.linear_model.RidgeCV), which has built-in leave-one-out cross-validation. If the memory footprint is too big, try switching to [Stochastic Gradient Descent](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor). Don't forget to search for the optimal value of the regularization parameter. How do the regularization parameter `alpha` and the values of `min_df` and `max_df` from `CountVectorizer` change the answer?

1. You will likely pick up several hyperparameters between the vectorization step and the regularization of the predictor. While it is more strictly correct to do a grid search over all of them at once, this can take a long time. Quite often, doing a grid search over a single hyperparameter at a time can produce similar results.  Alternatively, the grid search may be done over a smaller subset of the data, as long as it is representative of the whole.

1. Finally, assemble a pipeline that will transform the data from a list of dictionaries all the way to predictions.  This will allow you to submit the model's `predict` method to the grader for scoring, as the test set used by the grader is a list of dictionaries.
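
When choosing values for `min_df` and `max_df`, it helps to remember that `CountVectorizer` reads an integer as an absolute number of documents and a float as a proportion of documents. The sketch below illustrates the difference on a four-document toy corpus; the corpus and variable names are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "good" is in 3 documents, "food" in 2, the rest in 1
docs = ["good food", "good service", "bad food", "good good place"]

# Integer min_df: keep words that appear in at least 2 documents
at_least_two = CountVectorizer(min_df=2).fit(docs)

# Float min_df: keep words that appear in at least 75% of documents (3 of 4)
at_least_75pct = CountVectorizer(min_df=0.75).fit(docs)

# Integer max_df=1: keep only words that appear in at most ONE document
at_most_one_doc = CountVectorizer(max_df=1).fit(docs)

# Float max_df=1.0: no upper limit, every word is kept
no_upper_limit = CountVectorizer(max_df=1.0).fit(docs)

for name, cv in [("min_df=2", at_least_two), ("min_df=0.75", at_least_75pct),
                 ("max_df=1", at_most_one_doc), ("max_df=1.0", no_upper_limit)]:
    print(name, sorted(cv.vocabulary_))
```

A grid that mixes floats with a bare `1` therefore searches over two different kinds of cutoff, which is easy to do by accident.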

In [8]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import TransformerMixin, BaseEstimator, RegressorMixin
from sklearn.model_selection import GridSearchCV
from sklearn.feature_extraction import FeatureHasher
import numpy as np

In [8]:
class TextDataFrame(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        df=pd.DataFrame.from_dict(X)
        return df['text'].values

from sklearn.linear_model import Ridge
#data_1=TextDataFrame().fit_transform(data)

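# Caveats for this grid (values left exactly as run, so the log below still applies):
# - an integer is read as an absolute document count, so max_df=1 keeps only
#   words found in a single review; use 1.0 to mean "all documents"
# - grid points where min_df exceeds max_df, or where pruning leaves no terms,
#   fail to fit and are scored as nan
parameters = [{
    'cv__max_df': (0.4, 0.6, 0.8, 1),
    'cv__min_df': (0, 0.2, 0.4, 0.6, 0.8),
    'ridge__alpha': (0.1, 1, 10),
}]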


bag_of_words_model = Pipeline([('text_df', TextDataFrame()), 
                         ('grid-search', GridSearchCV(Pipeline([
                                                       ('cv', CountVectorizer()),
                                                       ('ridge', Ridge())]), parameters, cv=5, verbose=60))])
bag_of_words_model.fit(data, stars)

Fitting 5 folds for each of 60 candidates, totalling 300 fits
[CV 1/5; 1/60] START cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1.............
[CV 1/5; 1/60] END cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1;, score=0.015 total time= 5.8min
[CV 2/5; 1/60] START cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1.............
[CV 2/5; 1/60] END cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1;, score=-0.087 total time= 6.0min
[CV 3/5; 1/60] START cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1.............
[CV 3/5; 1/60] END cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1;, score=-0.049 total time= 6.1min
[CV 4/5; 1/60] START cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1.............
[CV 4/5; 1/60] END cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1;, score=-0.007 total time= 6.1min
[CV 5/5; 1/60] START cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1.............
[CV 5/5; 1/60] END cv__max_df=0.4, cv__min_df=0, ridge__alpha=0.1;, score=0.067 total time= 5.9min
[CV 1/5; 2/60] START cv__max_df=0.4

[CV 1/5; 10/60] END cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=0.1;, score=nan total time=  13.8s
[CV 2/5; 10/60] START cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=0.1..........
[CV 2/5; 10/60] END cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=0.1;, score=nan total time=  13.5s
[CV 3/5; 10/60] START cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=0.1..........
[CV 3/5; 10/60] END cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=0.1;, score=nan total time=  13.7s
[CV 4/5; 10/60] START cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=0.1..........
[CV 4/5; 10/60] END cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=0.1;, score=nan total time=  13.7s
[CV 5/5; 10/60] START cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=0.1..........
[CV 5/5; 10/60] END cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=0.1;, score=nan total time=  14.0s
[CV 1/5; 11/60] START cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=1............
[CV 1/5; 11/60] END cv__max_df=0.4, cv__min_df=0.6, ridge__alpha=1;, score=nan total time=  14.

[CV 2/5; 19/60] END cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=0.1;, score=0.145 total time=  17.6s
[CV 3/5; 19/60] START cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=0.1..........
[CV 3/5; 19/60] END cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=0.1;, score=0.165 total time=  17.6s
[CV 4/5; 19/60] START cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=0.1..........
[CV 4/5; 19/60] END cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=0.1;, score=0.165 total time=  17.7s
[CV 5/5; 19/60] START cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=0.1..........
[CV 5/5; 19/60] END cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=0.1;, score=0.178 total time=  17.7s
[CV 1/5; 20/60] START cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=1............
[CV 1/5; 20/60] END cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=1;, score=0.159 total time=  17.7s
[CV 2/5; 20/60] START cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=1............
[CV 2/5; 20/60] END cv__max_df=0.6, cv__min_df=0.2, ridge__alpha=1;, score=0.145 total 

[CV 3/5; 28/60] END cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=0.1;, score=nan total time=  13.7s
[CV 4/5; 28/60] START cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=0.1..........
[CV 4/5; 28/60] END cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=0.1;, score=nan total time=  13.7s
[CV 5/5; 28/60] START cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=0.1..........
[CV 5/5; 28/60] END cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=0.1;, score=nan total time=  14.0s
[CV 1/5; 29/60] START cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=1............
[CV 1/5; 29/60] END cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=1;, score=nan total time=  14.0s
[CV 2/5; 29/60] START cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=1............
[CV 2/5; 29/60] END cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=1;, score=nan total time=  13.8s
[CV 3/5; 29/60] START cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=1............
[CV 3/5; 29/60] END cv__max_df=0.6, cv__min_df=0.8, ridge__alpha=1;, score=nan total time=  14.0s
[

[CV 4/5; 37/60] END cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=0.1;, score=0.108 total time=  16.9s
[CV 5/5; 37/60] START cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=0.1..........
[CV 5/5; 37/60] END cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=0.1;, score=0.109 total time=  16.9s
[CV 1/5; 38/60] START cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=1............
[CV 1/5; 38/60] END cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=1;, score=0.102 total time=  16.9s
[CV 2/5; 38/60] START cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=1............
[CV 2/5; 38/60] END cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=1;, score=0.096 total time=  16.9s
[CV 3/5; 38/60] START cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=1............
[CV 3/5; 38/60] END cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=1;, score=0.106 total time=  16.9s
[CV 4/5; 38/60] START cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=1............
[CV 4/5; 38/60] END cv__max_df=0.8, cv__min_df=0.4, ridge__alpha=1;, score=0.108 total time

[CV 5/5; 46/60] END cv__max_df=1, cv__min_df=0, ridge__alpha=0.1;, score=-0.096 total time=  17.0s
[CV 1/5; 47/60] START cv__max_df=1, cv__min_df=0, ridge__alpha=1................
[CV 1/5; 47/60] END cv__max_df=1, cv__min_df=0, ridge__alpha=1;, score=-0.026 total time=  16.9s
[CV 2/5; 47/60] START cv__max_df=1, cv__min_df=0, ridge__alpha=1................
[CV 2/5; 47/60] END cv__max_df=1, cv__min_df=0, ridge__alpha=1;, score=-0.022 total time=  17.1s
[CV 3/5; 47/60] START cv__max_df=1, cv__min_df=0, ridge__alpha=1................
[CV 3/5; 47/60] END cv__max_df=1, cv__min_df=0, ridge__alpha=1;, score=-0.023 total time=  17.0s
[CV 4/5; 47/60] START cv__max_df=1, cv__min_df=0, ridge__alpha=1................
[CV 4/5; 47/60] END cv__max_df=1, cv__min_df=0, ridge__alpha=1;, score=-0.029 total time=  17.0s
[CV 5/5; 47/60] START cv__max_df=1, cv__min_df=0, ridge__alpha=1................
[CV 5/5; 47/60] END cv__max_df=1, cv__min_df=0, ridge__alpha=1;, score=-0.040 total time=  16.9s
[CV 1/5; 48

[CV 1/5; 56/60] END cv__max_df=1, cv__min_df=0.6, ridge__alpha=1;, score=nan total time=  13.9s
[CV 2/5; 56/60] START cv__max_df=1, cv__min_df=0.6, ridge__alpha=1..............
[CV 2/5; 56/60] END cv__max_df=1, cv__min_df=0.6, ridge__alpha=1;, score=nan total time=  13.7s
[CV 3/5; 56/60] START cv__max_df=1, cv__min_df=0.6, ridge__alpha=1..............
[CV 3/5; 56/60] END cv__max_df=1, cv__min_df=0.6, ridge__alpha=1;, score=nan total time=  13.8s
[CV 4/5; 56/60] START cv__max_df=1, cv__min_df=0.6, ridge__alpha=1..............
[CV 4/5; 56/60] END cv__max_df=1, cv__min_df=0.6, ridge__alpha=1;, score=nan total time=  13.8s
[CV 5/5; 56/60] START cv__max_df=1, cv__min_df=0.6, ridge__alpha=1..............
[CV 5/5; 56/60] END cv__max_df=1, cv__min_df=0.6, ridge__alpha=1;, score=nan total time=  14.1s
[CV 1/5; 57/60] START cv__max_df=1, cv__min_df=0.6, ridge__alpha=10.............
[CV 1/5; 57/60] END cv__max_df=1, cv__min_df=0.6, ridge__alpha=10;, score=nan total time=  14.1s
[CV 2/5; 57/60] ST

150 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
45 fits failed with the following error:
Traceback (most recent call last):
  File "/home/jovyan/.local/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/jovyan/.local/lib/python3.10/site-packages/sklearn/pipeline.py", line 402, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "/home/jovyan/.local/lib/python3.10/site-packages/sklearn/pipeline.py", line 360, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "/home/jovyan/.local/lib/python3.10/site-packages/joblib/memory.py", line 353, in __call__
    return self.fu

In [9]:
grader.score('nlp__bag_of_words_model', bag_of_words_model.predict)

Your score: 1.0154


## Question 2: bigram_model

In a bigram model, we'll consider both single words and pairs of consecutive words that appear. This is going to be a much higher-dimensional problem so you should be careful about overfitting. You should also use a vectorizer that applies some sort of normalization, e.g., the `TfidfVectorizer` or a word count vectorizer combined with `TfidfTransformer`.

Sometimes, reducing the dimension can be useful. If you're using the `TfidfVectorizer`, you can change the `max_features` hyperparameter to reduce the size of the resulting vocabulary. For `HashingVectorizer`, you can adjust the size of the feature matrix through `n_features`.
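
As a rough illustration of these two knobs, the sketch below builds unigram-plus-bigram features for three made-up sentences and prints the resulting matrix shapes. The sentences and the specific sizes (5 and 16) are arbitrary choices for the example.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Hypothetical mini-corpus
docs = ["the pad thai was great", "great spring rolls", "the service was slow"]

# max_features keeps only the 5 most frequent unigrams/bigrams in the vocabulary
tfidf = TfidfVectorizer(ngram_range=(1, 2), max_features=5)
X_tfidf = tfidf.fit_transform(docs)

# n_features fixes the number of hash buckets; no vocabulary is stored at all
hasher = HashingVectorizer(ngram_range=(1, 2), n_features=2**4)
X_hash = hasher.transform(docs)

print(X_tfidf.shape, X_hash.shape)
```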

**A side note on multi-stage model evaluation:** When your model consists of a pipeline with several stages, it can be worthwhile to evaluate which parts of the pipeline have the greatest impact on the overall accuracy (or other metric) of the model. This allows you to focus your efforts on improving the important algorithms, and leaving the rest "good enough".

One way to accomplish this is through ceiling analysis, which can be useful when you have a training set with ground truth values at each stage. Let's say you're training a model to extract image captions from websites and return a list of names that were in the caption. Your overall accuracy at some point reaches 70%. You can try manually giving the model what you know are the correct image captions from the training set, and see how the accuracy improves (maybe up to 75%). Alternatively, giving the model the perfect name parsing for each caption increases accuracy to 90%. This indicates that the name parsing is a much more promising target for further work, and the caption extraction is a relatively smaller factor in the overall performance.

If you don't know the right answers at different stages of the pipeline, you can still evaluate how important different parts of the model are to its performance by changing or removing certain steps while keeping everything else constant. You might try this kind of analysis to determine how important stop-word removal and stemming actually are to your NLP model, and how that importance changes with parameters like the number of features.
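
One possible way to set up such a comparison is to wrap the pipeline in a small function and vary a single step. The sketch below toggles stop-word removal and compares mean cross-validated $R^2$; the texts, ratings, and the helper name `mean_cv_score` are invented for illustration, and a corpus this small says nothing about which setting is actually better.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy reviews, repeated so that 3-fold cross-validation has data
texts = ["the food was great", "the service was terrible",
         "great place and great food", "terrible food and the worst service",
         "the best meal", "the worst meal"] * 5
ratings = [5, 1, 5, 1, 5, 1] * 5

def mean_cv_score(stop_words):
    """Score the same pipeline, changing only the stop-word setting."""
    model = Pipeline([
        ("tfidf", TfidfVectorizer(stop_words=stop_words)),
        ("ridge", Ridge(alpha=1.0)),
    ])
    return cross_val_score(model, texts, ratings, cv=3).mean()

score_with_removal = mean_cv_score("english")
score_without_removal = mean_cv_score(None)
print(score_with_removal, score_without_removal)
```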

In [13]:
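# Note: the integer 1 in min_df is an absolute count (the term must appear in
# at least one document), not a proportion of 100%.
parameters = [{
    'tfidf__min_df': (0.25, 0.5, 0.75, 1),
    'tfidf__max_features': (2500, 5000, 7500, 10000),
    'ridge__alpha': (0.1, 1, 10),
}]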


bigram_model = Pipeline([('text_df', TextDataFrame()), 
                         ('grid-search', GridSearchCV(Pipeline([
                                                       ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),
                                                       ('ridge', Ridge())]), parameters, cv=5, verbose=48))])

res=bigram_model.fit(data, stars)

Fitting 5 folds for each of 48 candidates, totalling 240 fits
[CV 1/5; 1/48] START ridge__alpha=0.1, tfidf__max_features=2500, tfidf__min_df=0.25
[CV 1/5; 1/48] END ridge__alpha=0.1, tfidf__max_features=2500, tfidf__min_df=0.25;, score=0.178 total time=  55.8s
[CV 2/5; 1/48] START ridge__alpha=0.1, tfidf__max_features=2500, tfidf__min_df=0.25
[CV 2/5; 1/48] END ridge__alpha=0.1, tfidf__max_features=2500, tfidf__min_df=0.25;, score=0.165 total time=  55.7s
[CV 3/5; 1/48] START ridge__alpha=0.1, tfidf__max_features=2500, tfidf__min_df=0.25
[CV 3/5; 1/48] END ridge__alpha=0.1, tfidf__max_features=2500, tfidf__min_df=0.25;, score=0.187 total time=  55.5s
[CV 4/5; 1/48] START ridge__alpha=0.1, tfidf__max_features=2500, tfidf__min_df=0.25
[CV 4/5; 1/48] END ridge__alpha=0.1, tfidf__max_features=2500, tfidf__min_df=0.25;, score=0.190 total time=  55.5s
[CV 5/5; 1/48] START ridge__alpha=0.1, tfidf__max_features=2500, tfidf__min_df=0.25
[CV 5/5; 1/48] END ridge__alpha=0.1, tfidf__max_features=2

[CV 2/5; 9/48] END ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.25;, score=0.165 total time=  55.4s
[CV 3/5; 9/48] START ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.25
[CV 3/5; 9/48] END ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.25;, score=0.187 total time=  56.1s
[CV 4/5; 9/48] START ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.25
[CV 4/5; 9/48] END ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.25;, score=0.190 total time=  55.4s
[CV 5/5; 9/48] START ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.25
[CV 5/5; 9/48] END ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.25;, score=0.207 total time=  55.9s
[CV 1/5; 10/48] START ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.5
[CV 1/5; 10/48] END ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.5;, score=0.070 total time=  55.1s
[CV 2/5; 10/48] START ridge__alpha=0.1, tfidf__max_features=7500, tfidf__min_df=0.5
[CV 2

[CV 4/5; 17/48] END ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.25;, score=0.190 total time=  56.1s
[CV 5/5; 17/48] START ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.25
[CV 5/5; 17/48] END ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.25;, score=0.207 total time=  55.8s
[CV 1/5; 18/48] START ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.5
[CV 1/5; 18/48] END ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.5;, score=0.070 total time=  55.8s
[CV 2/5; 18/48] START ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.5
[CV 2/5; 18/48] END ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.5;, score=0.059 total time=  55.5s
[CV 3/5; 18/48] START ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.5
[CV 3/5; 18/48] END ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.5;, score=0.077 total time=  55.5s
[CV 4/5; 18/48] START ridge__alpha=1, tfidf__max_features=2500, tfidf__min_df=0.5
[CV 4/5; 18/48] END ri

[CV 1/5; 26/48] END ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.5;, score=0.070 total time=  55.7s
[CV 2/5; 26/48] START ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.5
[CV 2/5; 26/48] END ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.5;, score=0.059 total time=  55.7s
[CV 3/5; 26/48] START ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.5
[CV 3/5; 26/48] END ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.5;, score=0.077 total time=  55.3s
[CV 4/5; 26/48] START ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.5
[CV 4/5; 26/48] END ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.5;, score=0.080 total time=  55.4s
[CV 5/5; 26/48] START ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.5
[CV 5/5; 26/48] END ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.5;, score=0.075 total time=  55.8s
[CV 1/5; 27/48] START ridge__alpha=1, tfidf__max_features=7500, tfidf__min_df=0.75
[CV 1/5; 27/48] END ridg

[CV 3/5; 34/48] END ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.5;, score=0.077 total time=  55.3s
[CV 4/5; 34/48] START ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.5
[CV 4/5; 34/48] END ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.5;, score=0.080 total time=  55.2s
[CV 5/5; 34/48] START ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.5
[CV 5/5; 34/48] END ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.5;, score=0.075 total time=  55.2s
[CV 1/5; 35/48] START ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.75
[CV 1/5; 35/48] END ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.75;, score=0.027 total time=  55.3s
[CV 2/5; 35/48] START ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.75
[CV 2/5; 35/48] END ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.75;, score=0.017 total time=  54.8s
[CV 3/5; 35/48] START ridge__alpha=10, tfidf__max_features=2500, tfidf__min_df=0.75
[CV 3/5; 3

[CV 5/5; 42/48] END ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.5;, score=0.075 total time=  55.9s
[CV 1/5; 43/48] START ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.75
[CV 1/5; 43/48] END ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.75;, score=0.027 total time=  55.4s
[CV 2/5; 43/48] START ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.75
[CV 2/5; 43/48] END ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.75;, score=0.017 total time=  55.2s
[CV 3/5; 43/48] START ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.75
[CV 3/5; 43/48] END ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.75;, score=0.026 total time=  55.2s
[CV 4/5; 43/48] START ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.75
[CV 4/5; 43/48] END ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.75;, score=0.028 total time=  55.3s
[CV 5/5; 43/48] START ridge__alpha=10, tfidf__max_features=7500, tfidf__min_df=0.75
[CV 5/

In [15]:
grader.score('nlp__bigram_model', res.predict)

Your score: 1.1020


## Question 3: word_polarity

Let's consider a different approach and try to derive some insight from our analysis.  

We want to determine the most "polarizing words" in the corpus of reviews.  In other words, we want to identify words that strongly signal a review is either positive or negative.  For example, we understand that a word like "terrible" will most likely appear in negative rather than positive reviews.  

During training, the [naive Bayes model](https://scikit-learn.org/stable/modules/naive_bayes.html#) calculates probabilities such as $Pr(\textrm{terrible}\ |\ \textrm{negative}),$ the probability that the word "terrible" appears in the review text, given that the review is negative.  Using these probabilities, we can define a **polarity score** for each word $w$,

$$\textrm{polarity}(w) = \log\left(\frac{Pr(w\ |\ \textrm{positive})}{Pr(w\ |\ \textrm{negative})}\right).$$

Polarity analysis is an example where a simpler model (naive Bayes) offers more explicability than more complicated models.  Aside from this, naive Bayes models are easy to train, the training process is parallelizable, and these models lend themselves well to online learning.  Given enough training data, naive Bayes models have performed well in NLP applications such as spam filtering.  

For this problem, you are asked to determine the top 25 most positive polar words and the 25 most negative polar words.  For this analysis, you should:

1.  **Filter** the collection of reviews you were using above to **only keep** the one-star and five-star reviews. Since these are the "most polar" reviews, it should give us the most polarizing words.   
1.  Use the naive Bayes model, `MultinomialNB`.  
1.  Use TF-IDF weighting.
1.  Remove stop words.
1.  As mentioned, generate a (Python) list with most positive (25 words) and most negative (25 words) polar words.  

A naive Bayes model (after training) stores the log of the probabilities in an attribute of the model.  It is a `numpy` array of shape (number of classes, number of features).  You will need the mapping from feature indices to words to find the most polarizing words.  

In [17]:
polar_data=[]
polar_stars=[]
for data1, stars1 in zip(data, stars):
    if stars1 in [1, 5]:
        polar_data.append(data1['text'])
        polar_stars.append(stars1)

In [18]:
# We're only keeping the one and five star reviews
grader.check(len(polar_data) == 116576)

True

In [74]:
import heapq

import scipy.sparse

tf_idf=TfidfVectorizer()
data_1=tf_idf.fit_transform(np.array(polar_data))

word_map=tf_idf.vocabulary_
n_words=data_1.shape[1]  # vocabulary size; the original run hard-coded 88552 here

# Index -> word lookup, plus per-class TF-IDF totals that start at 1 so the
# ratio below never divides by zero
orderWordList=['']*n_words
posWeight=[1]*n_words
negWeight=[1]*n_words
for item in word_map:
    orderWordList[word_map[item]]=item

# Add each review's TF-IDF weights to the running total for its class
for i in range(data_1.shape[0]):
    cx = scipy.sparse.coo_matrix(data_1[i])
    if polar_stars[i]==5:
        for j,k,v in zip(cx.row, cx.col, cx.data):
            posWeight[k]+=v
    else:
        for j,k,v in zip(cx.row, cx.col, cx.data):
            negWeight[k]+=v

# Bounded min-heaps keep the 25 largest ratios (most positive words) and, by
# negating the ratio, the 25 smallest ratios (most negative words)
posHeap=[]
negHeap=[]

for i in range(n_words):
    val=posWeight[i]/negWeight[i]
    if len(posHeap)<25:
        heapq.heappush(posHeap, [val, orderWordList[i]])
    elif posHeap[0][0]<val:
        heapq.heappushpop(posHeap, [val, orderWordList[i]])

    if len(negHeap)<25:
        heapq.heappush(negHeap, [-val, orderWordList[i]])
    elif negHeap[0][0]<-val:
        heapq.heappushpop(negHeap, [-val, orderWordList[i]])

polar_words=[word for val, word in posHeap+negHeap]


In [75]:
grader.score('nlp__word_polarity', polar_words)

Your score: 0.9800


## Question 4: food_bigrams

Look over all reviews of restaurants.  You can determine which businesses are restaurants by looking in the `yelp_train_academic_dataset_business.json.gz` file from the ml project or downloaded below.

In [9]:
!aws s3 sync s3://dataincubator-course/mldata/ . --exclude '*' --include 'yelp_train_academic_dataset_business.json.gz'

In [10]:
with gzip.open('yelp_train_academic_dataset_business.json.gz') as f:
    business_data = [json.loads(line) for line in f]

Each row of this file corresponds to a single business.  The `categories` key gives a list of categories for each business; keep every business whose list contains "Restaurants".

In [11]:
restaurant_ids = []

for item in business_data:
    if 'Restaurants' in item['categories']:
        restaurant_ids.append(item['business_id'])

In [12]:
# Look at the categories to check for spelling and capitalization
grader.check(len(restaurant_ids) == 12876)

True

The "business_id" here is the same as in the review data.  Use this to extract the review text for all reviews of restaurants.

In [13]:
rest_id_set=set(restaurant_ids)
restaurant_reviews=[]
for item in data:
    if item['business_id'] in rest_id_set:
        restaurant_reviews.append(item['text'])

In [14]:
# Just reviews of restaurants
# restaurant_ids is helpful here
grader.check(len(restaurant_reviews) == 143361)

True

We want to find collocations, that is, bigrams that are "special" and appear more often than you'd expect from chance. We can think of the corpus as defining an empirical distribution over all *n*-grams.  We can find word pairs that are unlikely to occur consecutively based on the underlying probability of their words. Mathematically, if $p(w)$ is the probability of a word $w$ and $p(w_1 w_2)$ is the probability of the bigram $w_1 w_2$, then we want to look at word pairs $w_1 w_2$ where the statistic

  $$ \frac{p(w_1 w_2)}{p(w_1) p(w_2)} $$

is high.  Return the top 100 (mostly food) bigrams with this statistic with the 'right' prior factor (see below).

Estimating the probabilities is simply a matter of counting, and there are a number of approaches that will work.  One is to use one of the tokenizers to count up how many times each word and each bigram appears in each review, and then sum those up over all reviews.  You might want to know that the `CountVectorizer` has a `.get_feature_names_out()` method which gives the string associated with each column.  (Question for thought: Why doesn't the `HashingVectorizer` have a similar method?)

*Questions:* This statistic is a ratio and problematic when the denominator is small.  We can fix this by applying Bayesian smoothing to $p(w)$ (i.e. mixing the empirical distribution with the uniform distribution over the vocabulary).

1. How does changing this smoothing parameter affect the word pairs you get qualitatively?

2. We can interpret the smoothing parameter as adding a constant number of occurrences of each word to our distribution.  Does this help you determine a reasonable value for this 'prior factor'?

3. For fun: also check out [Amazon's Statistically Improbable Phrases](http://en.wikipedia.org/wiki/Statistically_Improbable_Phrases).

*Implementation note:*
As you adjust the size of the Bayesian smoothing parameter, you will notice first nonsense phrases being removed and then legitimate bigrams being removed, leaving you with only generic bigrams.  The goal is to find a value of the smoothing parameter between these two transitions.

The reference solution is not an aggressive filterer: it errs in favor of leaving apparently nonsensical words. On further consideration, many of these are actually somewhat meaningful. The smoothing parameter chosen in the reference solution is equivalent to giving each word 30 appearances prior to considering this data.  This was chosen by generating a list of bigrams for a range of smoothing parameters and seeing how many of the bigrams were shared between neighboring values.  When the shared fraction reached 95%, we judged the solution to have converged.

There are a few reviews that include the same nonsense strings multiple times.  To keep these from showing up in our results, we set `min_df=10`, to ensure that a bigram occurs in at least 10 reviews before we consider it.
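
To make the effect of the prior factor concrete, here is a minimal sketch on made-up counts. It assumes the smoothed estimate $p(w) = (c(w) + \alpha) / (N + \alpha V)$, where $c(w)$ is the word count, $N$ the total number of words, $V$ the vocabulary size, and $\alpha$ the prior factor. The function name `collocation_scores` and all counts are hypothetical, and this is not the reference solution.

```python
from collections import Counter

def collocation_scores(unigrams, bigrams, prior=0.0):
    """Return p(w1 w2) / (p(w1) p(w2)) with `prior` pseudo-counts per word."""
    n_unigrams = sum(unigrams.values())
    n_bigrams = sum(bigrams.values())
    vocab_size = len(unigrams)

    def p(word):
        # Empirical distribution mixed with a uniform one over the vocabulary
        return (unigrams[word] + prior) / (n_unigrams + prior * vocab_size)

    return {pair: (count / n_bigrams) / (p(pair[0]) * p(pair[1]))
            for pair, count in bigrams.items()}

# Hypothetical counts: a common pair, a real collocation, and a one-off typo pair
unigrams = Counter({"the": 1000, "food": 200, "pad": 20, "thai": 20,
                    "zzx": 1, "qqy": 1})
bigrams = Counter({("the", "food"): 100, ("pad", "thai"): 18, ("zzx", "qqy"): 1})

unsmoothed = collocation_scores(unigrams, bigrams, prior=0)
smoothed = collocation_scores(unigrams, bigrams, prior=30)

top_unsmoothed = max(unsmoothed, key=unsmoothed.get)
top_smoothed = max(smoothed, key=smoothed.get)
print(top_unsmoothed, top_smoothed)
```

Without smoothing, the pair of words that each occur once gets the highest score; with 30 pseudo-counts per word, the frequent and genuinely associated pair moves to the top.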

In [23]:
import heapq

import scipy.sparse

# Count unigrams and bigrams together, ignoring n-grams seen in fewer than 10 reviews
cv=CountVectorizer(ngram_range=(1, 2), min_df=10)
data_2=cv.fit_transform(np.array(restaurant_reviews))

# Map each n-gram string to its column index
bigram_list=cv.get_feature_names_out()
bigram_map={word: i for i, word in enumerate(bigram_list)}

# Corpus-wide counts; every n-gram starts at 1 rather than 0
bigramCount=[1]*len(bigram_list)

for i in range(data_2.shape[0]):
    cx = scipy.sparse.coo_matrix(data_2[i])
    for j,k,v in zip(cx.row, cx.col, cx.data):
        bigramCount[k]+=v

# Min-heap holding the 100 bigrams with the largest count ratio seen so far
minHeap=[]

for i, word in enumerate(bigram_list):
    if len(word.split())>1:
        word1, word2=word.split()
        if word1 in bigram_map and word2 in bigram_map:
            # count(w1 w2) / (count(w1) * count(w2)); the factor of 1000 only rescales
            val=bigramCount[bigram_map[word]]*1000/(bigramCount[bigram_map[word1]]*bigramCount[bigram_map[word2]])
            if len(minHeap)<100:
                heapq.heappush(minHeap, [val, word])
            elif minHeap[0][0]<val:
                heapq.heappushpop(minHeap, [val, word])

top100=[word for val, word in minHeap]

In [24]:
grader.score('nlp__food_bigrams', top100)

Your score: 1.0000


*Copyright &copy; 2022 Pragmatic Institute. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.*