## ![BTS](img/Logo-BTS.jpg)

# Session 21: Sentiment Analysis

### Juan Luis Cano Rodríguez <juan.cano@bts.tech> - Data Science Foundations (2018-12-11)

Open this notebook in Google Colaboratory: [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Juanlu001/bts-mbds-data-science-foundations/blob/master/sessions/21-Sentiment-Analysis.ipynb)

## Extra: scikit-learn helpers

There are several libraries that provide higher-level tools to evaluate and visualize scikit-learn models, for example:

* https://github.com/reiinakano/scikit-plot
* https://github.com/DistrictDataLabs/yellowbrick
* https://github.com/rasbt/mlxtend

Using the "astrological method" to rank popularity (that is, counting stars on GitHub), the top one is mlxtend. But by the "provides a way to plot the confusion matrix **with class names** automatically" metric, the best one is scikit-plot, which is the one we will use.

In [None]:
#!pip install scikit-plot

## Exercise 1: Preprocessing

1. Download the "Large Movie Review Dataset" from http://ai.stanford.edu/~amaas/data/sentiment/
2. Read all the text files from `aclImdb/train/pos/` and `aclImdb/train/neg/` into a pandas DataFrame called `train` with two columns: `review` (the text itself) and `sentiment` (`positive` or `negative`) (_Hint: Use the `glob` module_)

<table class="dataframe" border="1">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>review</th>
      <th>sentiment</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Today's sci-fi thrillers are more like Rambo i...</td>
      <td>positive</td>
    </tr>
    <tr>
      <th>1</th>
      <td>I had the pleasure of seeing this film at the ...</td>
      <td>positive</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Deliriously romantic comedy with intertwining ...</td>
      <td>positive</td>
    </tr>
    <tr>
      <th>3</th>
      <td>This movie is a fantastic movie. Everything ab...</td>
      <td>positive</td>
    </tr>
    <tr>
      <th>4</th>
      <td>The documentary begins with setting the perspe...</td>
      <td>positive</td>
    </tr>
  </tbody>
</table>
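
One possible way to approach step 2 is sketched below. The helper name `read_reviews` is hypothetical, and the sketch assumes the archive was extracted so that `aclImdb/` sits in the working directory and that the files are UTF-8 encoded.

```python
import glob
import os

import pandas as pd


def read_reviews(split_dir):
    """Read every review under split_dir/pos and split_dir/neg into a DataFrame."""
    rows = []
    for folder, label in [("pos", "positive"), ("neg", "negative")]:
        pattern = os.path.join(split_dir, folder, "*.txt")
        # sorted() keeps the row order reproducible across runs
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8") as fh:
                rows.append({"review": fh.read(), "sentiment": label})
    return pd.DataFrame(rows, columns=["review", "sentiment"])


# Usage (assumes the dataset has been downloaded and extracted):
# train = read_reviews("aclImdb/train")
# train.head()
```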

3. Do the same with the files in `aclImdb/test/pos/` and `aclImdb/test/neg/` to produce a DataFrame called `test`
4. Create a `TfidfVectorizer` with:
   * a _preprocessing_ step that removes the spurious `<br />` tags from the text,
   * a _tokenizing_ step that uses spaCy to lemmatize the words, and
   * a list of `stop_words` taken from spaCy

(_Hint:_ https://scikit-learn.org/stable/modules/feature_extraction.html#customizing-the-vectorizer-classes)
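
A minimal sketch of such a vectorizer follows; it is one option, not the reference solution. The names `strip_br`, `make_vectorizer` and `lemmatize` are hypothetical, and the sketch assumes the spaCy model `en_core_web_sm` has been installed (for instance with `python -m spacy download en_core_web_sm`). spaCy is imported inside the function so that defining the helpers does not require the model.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

# Matches <br>, <br/> and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_br(text):
    """Replace <br /> tags with a space and lowercase the text."""
    # A custom preprocessor replaces the built-in one (which lowercases),
    # so lowercasing is done explicitly here.
    return BR_TAG.sub(" ", text).lower()


def make_vectorizer(model_name="en_core_web_sm"):
    """Build a TfidfVectorizer that lemmatizes with spaCy (model name is an assumption)."""
    import spacy
    from spacy.lang.en.stop_words import STOP_WORDS

    # Parsing and entity recognition are not needed for lemmas
    nlp = spacy.load(model_name, disable=["parser", "ner"])

    def lemmatize(text):
        # Keep the lemma of every token that is not punctuation or whitespace
        return [tok.lemma_ for tok in nlp(text) if not (tok.is_punct or tok.is_space)]

    return TfidfVectorizer(
        preprocessor=strip_br,
        tokenizer=lemmatize,
        stop_words=list(STOP_WORDS),
    )
```

Running spaCy over every review is slow, so it may be worth trying the vectorizer on a small sample first. scikit-learn may also warn that the stop word list is inconsistent with the tokenizer, because the stop words themselves are not lemmatized.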

## Exercise 2: Single model

1. Apply the `TfidfVectorizer` to the train data and fit a `LogisticRegression` model
2. What is the accuracy on the test data?
3. Use scikit-plot to plot the confusion matrix
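
The overall shape of this exercise can be sketched on a handful of made-up reviews. The toy texts and labels below are invented stand-ins for the `review` and `sentiment` columns, and a default `TfidfVectorizer` is used instead of the customized one from Exercise 1 to keep the example light.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for the train and test DataFrames
train_texts = ["great film", "loved it", "wonderful acting",
               "terrible film", "hated it", "awful acting"]
train_labels = ["positive"] * 3 + ["negative"] * 3
test_texts = ["great acting", "awful film"]
test_labels = ["positive", "negative"]

# The pipeline fits the vectorizer on the training texts only,
# then reuses the learned vocabulary when predicting
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

predicted = model.predict(test_texts)
accuracy = accuracy_score(test_labels, predicted)
# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(test_labels, predicted, labels=["negative", "positive"])

print(f"accuracy = {accuracy:.2f}")
print(matrix)
```

With scikit-plot installed, `scikitplot.metrics.plot_confusion_matrix(test_labels, predicted)` should draw the same matrix with the class names on the axes.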

## Exercise 3: Cross validation

1. Concatenate `train` and `test` to produce `data`
2. Use `sklearn.model_selection.cross_val_score` with 5 splits to produce a list of accuracy scores. Are they uniform?
3. Use `scikitplot.estimators.plot_learning_curve` to plot the learning curves of our `LogisticRegression` model
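
A small sketch of the cross-validation step, again on invented toy data rather than the real reviews. Wrapping the vectorizer and the classifier in a single pipeline means the TF-IDF vocabulary is refit inside each fold, so the held-out fold does not leak into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up stand-ins for data["review"] and data["sentiment"]
positive = ["great film", "loved it", "wonderful acting", "brilliant plot", "superb cast"]
negative = ["terrible film", "hated it", "awful acting", "boring plot", "dreadful cast"]
texts = (positive + negative) * 2
labels = (["positive"] * 5 + ["negative"] * 5) * 2

model = make_pipeline(TfidfVectorizer(), LogisticRegression())

# cv=5 gives five stratified folds for a classifier; one accuracy per fold
scores = cross_val_score(model, texts, labels, cv=5, scoring="accuracy")

print(scores)
print(f"mean = {scores.mean():.2f}, std = {scores.std():.2f}")
```

A small standard deviation across the five scores suggests that they are roughly uniform. For the learning curves, the same pipeline could presumably be passed to `scikitplot.estimators.plot_learning_curve(model, texts, labels)`.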

## Exercise 4: Model selection

1. Use `GridSearchCV` to optimize the `C` hyperparameter of the `LogisticRegression`
2. Try different classifiers (like a `RandomForestClassifier`), compute their accuracy, and try optimizing their hyperparameters
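
The grid search can be sketched as follows, on made-up toy data. The step names `tfidf` and `clf` and the candidate values for `C` are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Made-up stand-ins for data["review"] and data["sentiment"]
positive = ["great film", "loved it", "wonderful acting", "brilliant plot", "superb cast"]
negative = ["terrible film", "hated it", "awful acting", "boring plot", "dreadful cast"]
texts = (positive + negative) * 2
labels = (["positive"] * 5 + ["negative"] * 5) * 2

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(f"best cross-validated accuracy = {search.best_score_:.2f}")
```

To try another classifier, the `clf` step can be swapped for, say, a `RandomForestClassifier`, with the grid changed to its own parameters such as `clf__n_estimators` or `clf__max_depth`.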