# Machine Learning with Python

In [None]:
import numpy as np
import pandas as pd

## 3.1 Text Data

In this section we will explore some techniques for making use of *unstructured* text data in supervised and unsupervised learning. The techniques introduced come from the fields of *information retrieval* (IR) and *natural language processing* (NLP).

Each data point consists of a single text, called a *document*.

The set of all documents in the analysis is called a *corpus*.


### The corpus

We will look at a set of user-contributed movie reviews retrieved from IMDb (The Internet Movie Database). Each document is the text of one review, together with a label indicating whether the review is broadly "positive" or "negative".

The data provided here are derived from the dataset available at http://ai.stanford.edu/~amaas/data/sentiment


Firstly, you will need to unpack the archived dataset `imdb.zip`.

The unpacked data contains a `train` and a `test` directory, each of which contains positive and negative examples.

`scikit-learn` can directly load the labelled corpus from this directory structure:

In [None]:
from sklearn.datasets import load_files

reviews_train = load_files("imdb/train/")
# load_files returns a Bunch: the texts are in .data and the integer labels in .target
text_train, y_train = reviews_train.data, reviews_train.target
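
To make the expected directory layout concrete, the following sketch builds a tiny made-up corpus in a temporary directory and loads it with `load_files`. The subdirectory names `neg` and `pos`, the file names and the review texts are all assumptions for illustration; the real archive may be organised differently.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy corpus: one subdirectory per class, one file per document.
toy_docs = {
    "neg": ["Dull and far too long.", "I want my two hours back."],
    "pos": ["A wonderful film.", "Great acting and a clever plot."],
}

with tempfile.TemporaryDirectory() as root:
    for label, texts in toy_docs.items():
        class_dir = Path(root) / label
        class_dir.mkdir()
        for i, text in enumerate(texts):
            (class_dir / f"{i}.txt").write_text(text, encoding="utf-8")

    # File contents are read eagerly, so the directory can be removed afterwards.
    toy = load_files(root)

# Subdirectory names become the class names; each label indexes into that list.
print(toy.target_names)
print(len(toy.data), type(toy.data[0]))
print(toy.target)
```

The documents come back shuffled, so the order of `toy.target` need not match the order in which the files were written.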


In [None]:
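from sklearn.feature_extraction.text import CountVectorizer

# Learn the vocabulary from the training documents
vect = CountVectorizer().fit(text_train)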
X_train = vect.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))
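
As a rough illustration of what `CountVectorizer` does in the cell above, here is a minimal sketch in plain Python on a made-up two-document corpus. It assumes the default settings (lowercasing, tokens of two or more word characters, vocabulary in alphabetical order). The helper names `tokenize` and `transform` are hypothetical, and the real class returns a SciPy sparse matrix rather than nested lists.

```python
import re
from collections import Counter

toy_corpus = [
    "The fool doth think he is wise,",
    "but the wise man knows himself to be a fool",
]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring CountVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", doc.lower())

# "fit": collect the vocabulary; a token's position is its column index.
vocabulary = sorted({tok for doc in toy_corpus for tok in tokenize(doc)})

def transform(docs):
    # "transform": one row per document, one count per vocabulary entry.
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

X_toy = transform(toy_corpus)
print(vocabulary)
print(X_toy)
```

Note that the single-character token "a" never enters the vocabulary, and that word order within each document is discarded.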

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vect.get_feature_names_out()
print("Number of features: {}".format(len(feature_names)))
print("First 20 features:\n{}".format(feature_names[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names[::2000]))

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
print("Mean cross-validation accuracy: {:.2f}".format(np.mean(scores)))

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))
print("Best parameters: ", grid.best_params_)

In [None]:
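# text_test and y_test are loaded in the "Cleaning up the data" section below;
# run those cells before this one
X_test = vect.transform(text_test)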
print("Test score: {:.2f}".format(grid.score(X_test, y_test)))

In [None]:
vect = CountVectorizer(min_df=5).fit(text_train)
X_train = vect.transform(text_train)
print("X_train with min_df: {}".format(repr(X_train)))
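
Setting `min_df=5` keeps only tokens that occur in at least five documents. The sketch below shows that document-frequency filter in plain Python, on a made-up corpus and with a threshold of 2 so the effect is visible. It splits on whitespace for brevity, which is simpler than the tokenisation `CountVectorizer` actually uses.

```python
from collections import Counter

toy_corpus = [
    "great film great cast",
    "great plot",
    "dull film",
]
min_df = 2  # keep tokens found in at least this many documents

# Document frequency counts each token once per document,
# however many times it occurs inside that document.
doc_freq = Counter()
for doc in toy_corpus:
    doc_freq.update(set(doc.split()))

vocabulary = sorted(tok for tok, df in doc_freq.items() if df >= min_df)
print(doc_freq)
print(vocabulary)
```

Here "great" appears three times in total but in only two documents, so its document frequency is 2.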

In [None]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
feature_names = vect.get_feature_names_out()

print("First 50 features:\n{}".format(feature_names[:50]))
print("Features 20010 to 20030:\n{}".format(feature_names[20010:20030]))
print("Every 700th feature:\n{}".format(feature_names[::700]))

In [None]:
grid = GridSearchCV(LogisticRegression(), param_grid, cv=5)
grid.fit(X_train, y_train)
print("Best cross-validation score: {:.2f}".format(grid.best_score_))

In [None]:
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[6]:\n{}".format(text_train[6]))

`text_train` is a `list` of length 25000.

The individual documents are stored as type `bytes`, i.e. immutable sequences of raw bytes. They hold the *encoded* text and must be decoded to obtain a Python `str`. See https://docs.python.org/3/library/stdtypes.html#bytes-objects

A `bytes` literal looks like a string literal with a `b` prepended.
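
A short sketch of how `bytes` and `str` differ in practice. The sample text is made up, and UTF-8 is an assumed encoding chosen for the example rather than something stated about the dataset.

```python
raw = b"Caf\xc3\xa9 scenes<br />aside, a fine film."

print(type(raw))
print(raw[:3])

# Decoding turns the bytes into text; the encoding must be known (UTF-8 assumed here).
text = raw.decode("utf-8")
print(type(text))
print(text)

# bytes methods take bytes arguments; str methods take str arguments.
cleaned_bytes = raw.replace(b"<br />", b" ")
cleaned_text = text.replace("<br />", " ")
print(cleaned_bytes.decode("utf-8") == cleaned_text)

# Mixing the two types raises a TypeError.
try:
    raw.replace("<br />", " ")
except TypeError as err:
    mixing_error = err
    print("TypeError:", err)
```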


The training dataset is balanced, with equal numbers of positive and negative reviews:

In [None]:
print("Samples per class (training): {}".format(np.bincount(y_train)))

### Cleaning up the data

Firstly, we should remove the `<br />` tags, which just represent line breaks.

In [None]:
text_train = [doc.replace(b"<br />", b" ") for doc in text_train]

We load and clean up the test data in the same way:

In [None]:
reviews_test = load_files("imdb/test/")
text_test, y_test = reviews_test.data, reviews_test.target

In [None]:
print("Number of documents in test data: {}".format(len(text_test)))
print("Samples per class (test): {}".format(np.bincount(y_test)))
text_test = [doc.replace(b"<br />", b" ") for doc in text_test]

### Exercise

The file `reviews.json` contains expert reviews for papers submitted to an international conference on computing and informatics.

*Appel, Orestes; Chiclana, Francisco; Carter, Jenny; Fujita, Hamido (2016). A hybrid approach to sentiment analysis.*

The data is held in JSON format, which is a flexible format for structured data. Here's how we can extract the reviews into a pandas DataFrame:

In [None]:
import json

# load data using Python JSON module
with open('reviews.json', 'r') as f:
    data = json.load(f)  # parse directly from the open file

reviews = pd.json_normalize(data, record_path=['review'])
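
To show what `record_path` does, here is a sketch on made-up data. The nesting (a list of papers, each holding a list of reviews) and the `id` field are assumptions for illustration and may not match the actual layout of `reviews.json`.

```python
import pandas as pd

# Hypothetical structure: a list of papers, each holding a list of reviews.
toy_data = [
    {"id": 1, "review": [
        {"lan": "es", "evaluation": "1", "text": "Buen trabajo."},
        {"lan": "en", "evaluation": "-1", "text": "Weak evaluation."},
    ]},
    {"id": 2, "review": [
        {"lan": "es", "evaluation": "0", "text": "Aceptable."},
    ]},
]

# record_path names the nested list to unpack: one output row per review.
# meta copies fields from the enclosing paper onto each of its rows.
toy_reviews = pd.json_normalize(toy_data, record_path=["review"], meta=["id"])
print(toy_reviews)
```

Each nested review becomes one row, so two papers with three reviews between them give a three-row DataFrame.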

In [None]:
reviews.head()

The `text` column contains the text of each review, whilst the `evaluation` column is a numerical score for each paper.

Prepare a training and a testing dataset containing only the Spanish-language reviews (`lan == "es"`). Later, we will use these documents to attempt a regression analysis that predicts the `evaluation` score, so you should also extract that column as the target values.

*Notes*

It's fine to use strings for the documents rather than the `bytes` datatype we saw earlier. Remember that scikit-learn can handle pandas `Series` data without needing to unpack it.

You will need the DataFrame method `query()`.

Consider any basic cleaning operations you can sensibly apply to the text.
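
As a hint for the `query()` note above, here is a sketch of the syntax on an unrelated, made-up DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre": ["drama", "comedy", "drama"],
    "rating": [7, 5, 9],
})

# String values are quoted inside the query expression.
dramas = toy.query('genre == "drama"')
print(dramas["rating"].tolist())

# Python variables can be referenced with the @ prefix.
wanted = "comedy"
comedies = toy.query("genre == @wanted")
print(comedies["rating"].tolist())
```

`query()` returns a new DataFrame that keeps the original index labels, so a call to `reset_index(drop=True)` can be useful if positional indexing is needed afterwards.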