In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


## Predicting the Genre of Books from Summaries

We'll use a set of book summaries from the [CMU Book Summaries Corpus](http://www.cs.cmu.edu/~dbamman/booksummaries.html) in this experiment.  The corpus contains 16,559 summaries and includes metadata, taken from Freebase, about the genre of each book.  Each book can have more than one genre and there are 227 genres listed in total.  To simplify the problem of genre prediction we will select a small number of target genres that occur frequently in the collection and keep only the books that carry one of these genre labels.  This will give us one genre label per book.

Your goal in this portfolio is to take this data and build predictive models that classify each book into one of the five target genres.  You will need to extract suitable features from the texts and select suitable models to classify them. You should build and evaluate at least TWO models and compare the prediction results.

You should report on each stage of your experiment as you work with the data.


## Data Preparation

The first task is to read the data. It is made available in tab-separated format but has no column headings. We can use `read_csv` to read this but we need to set the separator to `\t` (tab) and supply the column names.  The names come from the [ReadMe](data/booksummaries/README.txt) file.

In [2]:
names = ['wid', 'fid', 'title', 'author', 'date', 'genres', 'summary']

books = pd.read_csv("data/booksummaries/booksummaries.txt", sep="\t", header=None, names=names, keep_default_na=False)
books.head()

Unnamed: 0,wid,fid,title,author,date,genres,summary
0,620,/m/0hhy,Animal Farm,George Orwell,1945-08-17,"{""/m/016lj8"": ""Roman \u00e0 clef"", ""/m/06nbt"":...","Old Major, the old boar on the Manor Farm, ca..."
1,843,/m/0k36,A Clockwork Orange,Anthony Burgess,1962,"{""/m/06n90"": ""Science Fiction"", ""/m/0l67h"": ""N...","Alex, a teenager living in near-future Englan..."
2,986,/m/0ldx,The Plague,Albert Camus,1947,"{""/m/02m4t"": ""Existentialism"", ""/m/02xlf"": ""Fi...",The text of The Plague is divided into five p...
3,1756,/m/0sww,An Enquiry Concerning Human Understanding,David Hume,,,The argument of the Enquiry proceeds by a ser...
4,2080,/m/0wkt,A Fire Upon the Deep,Vernor Vinge,,"{""/m/03lrw"": ""Hard science fiction"", ""/m/06n90...",The novel posits that space around the Milky ...


We next filter the data so that only books with one of our target genre labels are kept, and we assign each text exactly one label.  A text may carry two of these labels (e.g. Science Fiction and Fantasy); because the loop below overwrites earlier matches, such a text receives whichever of its labels comes last in `target_genres`.

In [3]:
target_genres = ["Children's literature",
                 'Science Fiction',
                 'Novel',
                 'Fantasy',
                 'Mystery']

# create a Series of empty strings the same length as the list of books
genre = pd.Series(np.repeat("", books.shape[0]))
# look for each target genre and set the corresponding entries in the genre series to the genre label
for g in target_genres:
    genre[books['genres'].str.contains(g)] = g

# add this to the book dataframe and then select only those rows that have a genre label
# drop some useless columns
books['genre'] = genre
genre_books = books[genre!=''].drop(['genres', 'fid', 'wid'], axis=1)

genre_books.shape


(8954, 5)
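
The `genres` column holds a JSON dictionary mapping Freebase IDs to genre names, and `str.contains` matches substrings of that raw string, so a genre such as "Novella" would also count as "Novel". As a hedged alternative, the sketch below parses the JSON and tests for exact genre names. The three sample rows, the `/m/fake…` IDs and the `pick_genre` helper are illustrative assumptions and are not part of the notebook.

```python
import json
import pandas as pd

target_genres = ["Children's literature", 'Science Fiction', 'Novel', 'Fantasy', 'Mystery']

# made-up rows in the same format as the 'genres' column (empty string = no genre data)
sample = pd.DataFrame({
    'title': ['Book A', 'Book B', 'Book C'],
    'genres': ['{"/m/06n90": "Science Fiction", "/m/fake1": "Fantasy"}',
               '{"/m/fake2": "Novella"}',
               ''],
})

def pick_genre(raw, targets=target_genres):
    """Return one target genre for a book, or '' if no genre name matches exactly."""
    if not raw:
        return ''
    names = set(json.loads(raw).values())
    matches = [g for g in targets if g in names]
    # same tie-break as the loop above: later entries in `targets` win
    return matches[-1] if matches else ''

sample['genre'] = sample['genres'].apply(pick_genre)
print(sample[['title', 'genre']])
```

With exact matching, "Book B" stays unlabelled, whereas the substring test would have assigned it to Novel. Whether that difference matters for this corpus has not been checked here.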

In [7]:
genre_books.head(15)

Unnamed: 0,title,author,date,summary,genre
0,Animal Farm,George Orwell,1945-08-17,"Old Major, the old boar on the Manor Farm, ca...",Children's literature
1,A Clockwork Orange,Anthony Burgess,1962,"Alex, a teenager living in near-future Englan...",Novel
2,The Plague,Albert Camus,1947,The text of The Plague is divided into five p...,Novel
4,A Fire Upon the Deep,Vernor Vinge,,The novel posits that space around the Milky ...,Fantasy
6,A Wizard of Earthsea,Ursula K. Le Guin,1968,"Ged is a young boy on Gont, one of the larger...",Fantasy
8,Blade Runner 3: Replicant Night,K. W. Jeter,1996-10-01,"Living on Mars, Deckard is acting as a consul...",Science Fiction
9,Blade Runner 2: The Edge of Human,K. W. Jeter,1995-10-01,Beginning several months after the events in ...,Science Fiction
20,Crash,J. G. Ballard,1973,The story is told through the eyes of narrato...,Novel
21,Children of Dune,Frank Herbert,1976,Nine years after Emperor Paul Muad'dib walked...,Science Fiction
23,Chapterhouse Dune,Frank Herbert,1985-04,The situation is desperate for the Bene Gesse...,Science Fiction


In [4]:
# check how many books we have in each genre category
genre_books.groupby('genre').count()


genre,title,author,date,summary
Children's literature,1092,1092,1092,1092
Fantasy,2311,2311,2311,2311
Mystery,1396,1396,1396,1396
Novel,2258,2258,2258,2258
Science Fiction,1897,1897,1897,1897


## Feature Extraction

Now you take over to build a suitable model and present your results.

Firstly, you need to perform feature extraction to produce feature vectors for the predictive models.

## Label Encoding

In [8]:
from sklearn.preprocessing import LabelEncoder


In [10]:
le = LabelEncoder()
le.fit(list(genre_books['genre'].values))

LabelEncoder()

In [13]:
#creating a new column genre_number to have encoded labels 
genre_books['genre_number'] = le.transform(list(genre_books['genre']))
genre_books.head(15)

Unnamed: 0,title,author,date,summary,genre,genre_number
0,Animal Farm,George Orwell,1945-08-17,"Old Major, the old boar on the Manor Farm, ca...",Children's literature,0
1,A Clockwork Orange,Anthony Burgess,1962,"Alex, a teenager living in near-future Englan...",Novel,3
2,The Plague,Albert Camus,1947,The text of The Plague is divided into five p...,Novel,3
4,A Fire Upon the Deep,Vernor Vinge,,The novel posits that space around the Milky ...,Fantasy,1
6,A Wizard of Earthsea,Ursula K. Le Guin,1968,"Ged is a young boy on Gont, one of the larger...",Fantasy,1
8,Blade Runner 3: Replicant Night,K. W. Jeter,1996-10-01,"Living on Mars, Deckard is acting as a consul...",Science Fiction,4
9,Blade Runner 2: The Edge of Human,K. W. Jeter,1995-10-01,Beginning several months after the events in ...,Science Fiction,4
20,Crash,J. G. Ballard,1973,The story is told through the eyes of narrato...,Novel,3
21,Children of Dune,Frank Herbert,1976,Nine years after Emperor Paul Muad'dib walked...,Science Fiction,4
23,Chapterhouse Dune,Frank Herbert,1985-04,The situation is desperate for the Bene Gesse...,Science Fiction,4


In [14]:
target_names = list(le.inverse_transform([0,1,2,3,4]))
print(target_names)

["Children's literature", 'Fantasy', 'Mystery', 'Novel', 'Science Fiction']
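
`LabelEncoder` numbers the labels in sorted order, which is why Children's literature is 0 and Science Fiction is 4. The fitted mapping is also available through the `classes_` attribute, so the list of codes does not have to be typed out by hand. The short label list and the `le_demo` name below are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# a small made-up label list; the notebook uses genre_books['genre'] instead
labels = ['Novel', 'Fantasy', 'Mystery', 'Novel', "Children's literature", 'Science Fiction']

le_demo = LabelEncoder()
codes = le_demo.fit_transform(labels)

# classes_ holds the labels in sorted order; each label's position is its integer code
target_names_demo = list(le_demo.classes_)
print(target_names_demo)
print(dict(zip(target_names_demo, range(len(target_names_demo)))))
print(list(codes))
```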


## TF-IDF Vectorization

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer 

In [19]:
vectorizer = TfidfVectorizer(max_features = 5000)
x = vectorizer.fit_transform(genre_books.summary).toarray() 
print(x.shape)

(8954, 5000)


## Splitting the Data into Training and Test Sets

In [21]:
from sklearn.model_selection import train_test_split 

In [23]:
X_train, X_test, y_train, y_test = train_test_split(x, genre_books['genre_number'], test_size = 0.2, random_state = 143)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(7163, 5000)
(7163,)
(1791, 5000)
(1791,)
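
The vectorizer above was fitted on all 8,954 summaries before the split, so its vocabulary and IDF weights have already seen the test texts. One common way to avoid this, sketched here under assumptions, is to split the raw text first and wrap the vectorizer and classifier in a `Pipeline` so that fitting only touches the training portion. The toy texts, the `stratify` argument and the `max_iter` value are illustrative choices, not settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# made-up miniature corpus with two genres
texts = ["wizard dragon spell", "dragon sword magic", "magic wizard quest", "elf dragon quest",
         "robot starship planet", "alien planet laser", "starship alien robot", "laser robot orbit"]
labels = ["Fantasy"] * 4 + ["Science Fiction"] * 4

# split the raw text; stratify keeps the genre proportions equal in both parts
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, random_state=143, stratify=labels)

# the vectorizer is fitted inside the pipeline, on the training texts only
pipe = make_pipeline(TfidfVectorizer(max_features=5000), LogisticRegression(max_iter=1000))
pipe.fit(txt_train, lab_train)
print(pipe.score(txt_test, lab_test))
```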


## Model Training

Then, train two predictive models from the given data set.

## Modelling using Logistic Regression

In [26]:
from sklearn.linear_model import LogisticRegression 

In [28]:
lr = LogisticRegression() 
lr.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression()
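
The message above is scikit-learn's convergence warning: the default `lbfgs` solver stopped at its default limit of `max_iter=100` before converging, so the fitted coefficients may not be optimal. A minimal sketch of the usual remedy, raising `max_iter`, is shown below on synthetic data; the value 1000 and the `make_classification` settings are assumptions, and the scores reported in this notebook come from the default settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic stand-in for the TF-IDF features
X_demo, y_demo = make_classification(n_samples=300, n_features=20, n_informative=5,
                                     n_classes=3, random_state=143)

# allow more solver iterations than the default of 100
lr_demo = LogisticRegression(max_iter=1000)
lr_demo.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually used
print(lr_demo.n_iter_, "iterations used; limit was", lr_demo.max_iter)
```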

In [30]:
# predictions on the training and test sets
y_train_pred = lr.predict(X_train)
y_test_pred = lr.predict(X_test)

In [31]:
from sklearn.metrics import classification_report

In [32]:
print(classification_report(y_test, y_test_pred, target_names = target_names))

                       precision    recall  f1-score   support

Children's literature       0.65      0.47      0.54       242
              Fantasy       0.75      0.76      0.76       485
              Mystery       0.76      0.70      0.73       274
                Novel       0.61      0.74      0.67       427
      Science Fiction       0.74      0.73      0.73       363

             accuracy                           0.70      1791
            macro avg       0.70      0.68      0.69      1791
         weighted avg       0.70      0.70      0.70      1791



In [33]:
from sklearn.metrics import confusion_matrix 

In [34]:
print(confusion_matrix(y_test, y_test_pred))

[[113  47  16  62   4]
 [ 19 369  10  43  44]
 [ 10  16 191  51   6]
 [ 24  22  23 317  41]
 [  7  37  12  43 264]]
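
In scikit-learn's confusion matrix the rows are the true classes and the columns the predicted classes, in the order of the encoded labels. The sketch below attaches the genre names to the matrix printed above and recomputes per-class precision and recall from it; the `cm_df` name and the layout are illustrative choices.

```python
import numpy as np
import pandas as pd

target_names = ["Children's literature", 'Fantasy', 'Mystery', 'Novel', 'Science Fiction']

# values copied from the printed confusion matrix above
cm = np.array([[113,  47,  16,  62,   4],
               [ 19, 369,  10,  43,  44],
               [ 10,  16, 191,  51,   6],
               [ 24,  22,  23, 317,  41],
               [  7,  37,  12,  43, 264]])

cm_df = pd.DataFrame(cm, index=target_names, columns=target_names)
cm_df.index.name = 'true'
cm_df.columns.name = 'predicted'
print(cm_df)

# per-class recall = diagonal / row sum; precision = diagonal / column sum
recall = np.diag(cm) / cm.sum(axis=1)
precision = np.diag(cm) / cm.sum(axis=0)
print(pd.DataFrame({'precision': precision, 'recall': recall}, index=target_names).round(2))
```

Read this way, the weakest class is Children's literature: of its 242 test books, 62 are predicted as Novel and 47 as Fantasy.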


In [35]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

In [36]:
print(f"Accuracy score: {accuracy_score(y_test, y_test_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_test_pred, average='macro'):.4f}")
print(f"Recall: {recall_score(y_test, y_test_pred, average='macro'):.4f}")
print(f"F1 score: {f1_score(y_test, y_test_pred, average='macro'):.4f}")

Accuracy score: 0.7002
Precision: 0.7025
Recall: 0.6789
F1 score: 0.6861


## Inference (Prediction)  

In [37]:
# an unseen plot summary to classify
text = ['In the year 2122, the crew of a space craft called the Nostromo are awakened from their stasis sleep after the ship detects what might have been a distress signal from a small planetoid. The Nostromo is crewed by 7 astronauts on a corporately owned expedition, which is returning to Earth when the movie begins. Picking up the unexpected transmission, the crew members decide to explore the source of the signal. Warrant Officer Ripley (Sigourney Weaver) quickly realizes that the signal was not a distress signal at all, but a warning from an alien space craft, as the Nostromo lands on the stormy planet. On the alien craft, Executive Officer Kane discovers a chamber filled with unhatched eggs. While examining one egg, Kane is attacked by a small alien that attaches to his face. Dallas and Kane return the unconscious Kane to the Nostromo. Ripley invokes protocol, wanting to prevent Kane from coming back onto the ship for 24 hours, but Science Officer Ash overrides her quarantine and allows the three to re-board the ship. Kane is placed in the infirmary, where they find an alien creature attached to his face.Kane awakens and the alien appears to have died. Kane is believed to be recovered from the incident and the crew sits down to a meal. However, during the meal, he experiences severe stomach pains, and then, in one of the most iconic scenes in contemporary film, an alien erupts from his stomach, escaping into the Nostromo. The crew debates how to exterminate the creature, worrying about the safety of the ship. They decide to try to capture the alien and burn it.']
s = vectorizer.transform(text)
predicted_label = lr.predict(s)

In [38]:
print(predicted_label) 
print(le.inverse_transform(predicted_label))

[4]
['Science Fiction']


## Model Evaluation

Finally, evaluate and compare the learned predictive models.

In [None]:
print(f"Accuracy score: {accuracy_score(y_test, y_test_pred):.4f}")
print(f"Precision: {precision_score(y_test, y_test_pred, average='macro'):.4f}")
print(f"Recall: {recall_score(y_test, y_test_pred, average='macro'):.4f}")
print(f"F1 score: {f1_score(y_test, y_test_pred, average='macro'):.4f}")
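
The task asks for at least two models, while the cells so far train only logistic regression. As one possible way to set up the comparison, the sketch below fits two classifiers on the same features and collects the same metrics for each. Multinomial Naive Bayes is an assumed choice of second model, not one taken from this notebook, and the toy texts stand in for the book summaries, so the printed numbers say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# made-up training and test texts for two genres
train_texts = ["wizard dragon spell", "dragon sword magic", "magic wizard quest",
               "robot starship planet", "alien planet laser", "starship alien robot"]
train_labels = ["Fantasy"] * 3 + ["Science Fiction"] * 3
test_texts = ["a wizard finds a magic sword", "the robot lands on an alien planet"]
test_labels = ["Fantasy", "Science Fiction"]

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Multinomial Naive Bayes": MultinomialNB(),
}

results = {}
for name, clf in models.items():
    # each model gets its own vectorizer, fitted on the training texts only
    pipe = make_pipeline(TfidfVectorizer(), clf)
    pipe.fit(train_texts, train_labels)
    pred = pipe.predict(test_texts)
    results[name] = {
        "accuracy": accuracy_score(test_labels, pred),
        "macro_f1": f1_score(test_labels, pred, average="macro"),
    }
    print(name, results[name])
```

Evaluating both models on the same held-out test set with the same macro-averaged metrics keeps the comparison like for like.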