For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing<br/>
2. Create features using two different NLP methods, for example BoW vs. tf-idf.<br/>
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes.<br/>
4. Assess your models using cross-validation and determine whether one model performed better.<br/>
5. Pick one of the models and try to increase accuracy by at least 5 percentage points.
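
Steps 2 to 4 above can be sketched end to end on a toy corpus. This is a minimal illustration, not the analysis performed in this notebook: the sentences, the helper name `compare_feature_sets`, and the choice of `CountVectorizer`, `TfidfVectorizer`, and a 3-fold split are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: six short sentences per category.
texts = [
    "the rabbit ran down the hole",
    "alice followed the white rabbit",
    "the cat grinned at alice",
    "alice drank from the little bottle",
    "the queen shouted at the gardeners",
    "alice fell asleep on the bank",
    "the baronet admired his own name",
    "anne walked along the shore at lyme",
    "the captain returned from the navy",
    "lady russell advised anne with care",
    "the family removed to bath for economy",
    "anne listened to the captain in silence",
]
labels = ["alice"] * 6 + ["persuasion"] * 6


def compare_feature_sets(texts, labels, cv=3):
    """Return mean cross-validated accuracy for BoW and tf-idf features."""
    results = {}
    vectorizers = [("bow", CountVectorizer()), ("tfidf", TfidfVectorizer())]
    for name, vectorizer in vectorizers:
        # The pipeline refits the vectorizer inside every fold, so the
        # held-out sentences never influence the vocabulary or idf weights.
        model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
        scores = cross_val_score(model, texts, labels, cv=cv)
        results[name] = scores.mean()
    return results


results = compare_feature_sets(texts, labels)
print(results)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the comparison fair, because both feature sets are scored on the same folds with the same model.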

In [1]:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import nltk
from nltk.corpus import gutenberg
import re
from sklearn.model_selection import train_test_split
import spacy
from collections import Counter

In [2]:
print(gutenberg.fileids())

['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']


In [3]:
alice = gutenberg.raw('carroll-alice.txt')
persuasion = gutenberg.raw('austen-persuasion.txt')
alice[0:1000]

"[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\nCHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversation?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her.\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seeme

In [4]:
def data_cleaning(txt):
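    # Remove bracketed text such as the title line "[Alice's ... 1865]".
    pattern = r"\[.*?\]"
    text = re.sub(pattern,'',txt)
    # Remove "CHAPTER", any spaces, and one following character (the numeral "I").
    # The period after the numeral is left behind, so the cleaned text starts with ". Down".
    text = re.sub(r'CHAPTER *.', '', text)
    # Remove headings of the form "Chapter 1".
    text = re.sub(r'Chapter \d+', '', text)
    # Collapse newlines and repeated spaces into single spaces.
    text = ' '.join(text.split())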
    return text
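
The effect of the `CHAPTER *.` pattern can be checked on a short sample. The sketch below compares it with a stricter variant; `data_cleaning_strict` is a hypothetical helper, and its pattern assumes that chapter headings are written as a Roman numeral followed by a period, as in the Alice excerpt shown earlier.

```python
import re


def data_cleaning_strict(txt):
    """Variant of data_cleaning that also removes the period after the numeral."""
    text = re.sub(r"\[.*?\]", "", txt)                 # bracketed title line
    text = re.sub(r"CHAPTER\s+[IVXLC]+\.", "", text)  # e.g. "CHAPTER I." (assumed format)
    text = re.sub(r"Chapter \d+", "", text)           # e.g. "Chapter 1"
    return " ".join(text.split())                     # collapse whitespace


sample = (
    "[Alice's Adventures in Wonderland by Lewis Carroll 1865]\n\n"
    "CHAPTER I. Down the Rabbit-Hole\n\nAlice was beginning"
)

# Same substitutions as the original pattern, applied to the sample.
loose = re.sub(r"\[.*?\]", "", sample)
loose = re.sub(r"CHAPTER *.", "", loose)
loose = " ".join(loose.split())

strict = data_cleaning_strict(sample)

print(repr(loose))   # '. Down the Rabbit-Hole Alice was beginning'
print(repr(strict))  # 'Down the Rabbit-Hole Alice was beginning'
```

With the stricter pattern the stray leading period disappears, so it would no longer form a one-token sentence when the text is parsed.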

In [5]:
alice_cleaned = data_cleaning(alice[:int(len(alice)/10)])
persuasion_cleaned = data_cleaning(persuasion[:int(len(persuasion)/10)])
print(alice_cleaned[:500])

. Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?' So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be wort


In [6]:
#data parsing
nlp = spacy.load('en_core_web_sm')
alice_doc = nlp(alice_cleaned)
persuasion_doc = nlp(persuasion_cleaned)

In [7]:
#sentences
alice_sents = list(alice_doc.sents)
persuasion_sents = list(persuasion_doc.sents)
print(alice_sents[1],'\n',persuasion_sents[1])

Down the Rabbit-Hole Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?' 
 This was the page at which the favourite volume always opened: "ELLIOT OF KELLYNCH HALL.


In [8]:
alice_doc[1:100].lemma_

"down the Rabbit - Hole Alice be begin to get very tired of sit by -PRON- sister on the bank , and of have nothing to do : once or twice -PRON- have peep into the book -PRON- sister be read , but -PRON- have no picture or conversation in -PRON- , ' and what be the use of a book , ' think Alice ' without picture or conversation ? ' so -PRON- be consider in -PRON- own mind ( as well as -PRON- could , for the hot day make -PRON- feel very sleepy and stupid )"

In [9]:
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

In [10]:
print(bag_of_words(alice_doc))

['Alice', 'think', 'little', 'go', 'way', 'like', 'time', 'find', 'fall', 'come', 'say', 'wonder', 'look', 'door', 'Rabbit', 'begin', 'get', 'eat', 'oh', 'foot', 'large', 'try', 'thing', 'dear', 'shall', 'good', 'hall', 'key', 'feel', 'moment', 'happen', 'bat', 'garden', 'poor', 'use', 'suddenly', 'eye', 'hurry', 'see', 'great', 'right', 'head', 'cat', 'remember', 'table', 'cry', 'book', 'hot', 'take', 'manage', 'word', 'people', 'walk', 'ask', 'know', 'soon', 'Dinah', 'hand', 'long', 'round', 'golden', 'small', 'open', 'telescope', 'bottle', 'mark', 'fan', 'sit', 'have', 'picture', 'make', 'White', 'hear', 'rabbit', 'burn', 'hole', 'deep', 'dark', 'let', 'sort', 'nice', 'talk', 'leave', 'passage', 'low', 'lock', 'glass', 'inch', 'high', 'shut', 'wait', 'poison', 'forget', 'candle', 'reach', 'cake', 'grow', 'glove', 'sister', 'read', 'conversation', 'consider', 'mind', 'day', 'sleepy', 'stupid', 'daisy', 'trouble', 'run', 'close', 'late', 'ought', 'watch', 'start', 'stop', 'notice', 'c

In [11]:
def bow_features(sentences, common_words):
    
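    # Use a list for column labels and indexing: recent pandas versions
    # reject a set as an indexer. The set is kept for membership tests below.
    columns = list(common_words)
    
    # Scaffold the data frame and initialize counts to zero.
    df_new = pd.DataFrame(columns=columns)
    df_new['text_sentence'] = sentences[0]
    df_new['text_source'] = sentences[1]
    df_new.loc[:, columns] = 0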
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df_new['text_sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df_new.loc[i, word] += 1
    return df_new
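
Incrementing one cell at a time with `.loc` can be slow on larger corpora. A possible alternative, sketched here under simplifying assumptions, is to count each sentence once with `Counter` and build all rows before creating the data frame. The helper `count_rows` is hypothetical, and the inputs are plain lists of strings standing in for the lemmas of spaCy tokens.

```python
from collections import Counter


def count_rows(tokenized_sentences, vocabulary):
    """Return the vocabulary as a list plus one row of counts per sentence."""
    vocab = list(vocabulary)   # fixed column order
    vocab_set = set(vocab)     # fast membership tests
    rows = []
    for tokens in tokenized_sentences:
        # Count only the tokens that belong to the vocabulary.
        counts = Counter(t for t in tokens if t in vocab_set)
        # A Counter returns 0 for missing keys, so absent words need no special case.
        rows.append([counts[word] for word in vocab])
    return vocab, rows


vocab, rows = count_rows(
    [["alice", "think", "alice"], ["oh", "dear"]],
    ["alice", "dear", "think"],
)
print(vocab)  # ['alice', 'dear', 'think']
print(rows)   # [[2, 0, 1], [0, 1, 0]]
```

The resulting rows can be passed to `pd.DataFrame(rows, columns=vocab)` in a single step, which avoids the repeated per-cell writes.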

In [12]:
alice_sents_for_df = [[sent,'alice'] for sent in alice_sents]
persuasion_sents_for_df = [[sent,'persuasion'] for sent in persuasion_sents]
sentences = pd.DataFrame(alice_sents_for_df + persuasion_sents_for_df)

# Set up the bags.
alicewords = bag_of_words(alice_doc)
persuasionwords = bag_of_words(persuasion_doc)

# Combine bags to create a set of unique words.
common_words = set(alicewords + persuasionwords)

word_counts = bow_features(sentences, common_words)
word_counts.head()

Unnamed: 0,colour,earth,1806,Cheshire,have,let,sea,respect,hey,civility,...,insert,curtseying,injury,gift,present,coldness,think,force,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,(.),alice
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,"(Down, the, Rabbit, -, Hole, Alice, was, begin...",alice
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",alice
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,"(There, was, nothing, so, VERY, remarkable, in...",alice
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",alice


In [13]:
word_counts.query('Anne == 0')

Unnamed: 0,colour,earth,1806,Cheshire,have,let,sea,respect,hey,civility,...,insert,curtseying,injury,gift,present,coldness,think,force,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,(.),alice
1,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,"(Down, the, Rabbit, -, Hole, Alice, was, begin...",alice
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(So, she, was, considering, in, her, own, mind...",alice
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,"(There, was, nothing, so, VERY, remarkable, in...",alice
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",alice
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(Oh, dear, !)",alice
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(I, shall, be, late, !, ')",alice
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,"((, when, she, thought, it, over, afterwards, ...",alice
8,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(A, WATCH, OUT, OF, ITS, WAISTCOAT, -, POCKET,...",alice
9,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(,, Alice, started, to, her, feet, ,, for, it,...",alice


In [14]:
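# PERSON entities recognized by spaCy in each novel
person_alice = [entity.text for entity in list(alice_doc.ents) if entity.label_=='PERSON']
person_persuasion = [entity.text for entity in list(persuasion_doc.ents) if entity.label_=='PERSON']

# Add one count column per person name as extra features to increase accuracy.
common_names = set(person_persuasion + person_alice)

# Note: a name that is already a bag-of-words column is reset to zero here
# and recounted below from the raw token text.
for name in common_names:
    word_counts[name] = 0
    
# Entity names that span several tokens never equal a single token's text,
# so their columns stay at zero.
for i, sentence in enumerate(word_counts['text_sentence']):
    words = [token.text for token in sentence if token.text in common_names]
    for word in words:
        word_counts.loc[i, word] += 1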

In [15]:
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence','text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X, 
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=0)

In [16]:
#Logistic regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2') # l2 is the default penalty; it is stated explicitly for demonstration.
lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
y_pred = lr.predict(X_test)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

from sklearn.metrics import confusion_matrix
print("Confusion matrix for logistic regression \n",format(confusion_matrix(y_test, y_pred)))



(272, 1661) (272,)
Training set score: 0.9742647058823529

Test set score: 0.8956043956043956
Confusion matrix for logistic regression 
 [[ 36  17]
 [  2 127]]
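
The confusion matrix above can be unpacked into per-class metrics. The sketch below copies the four counts from the output and assumes the usual scikit-learn layout: rows are true labels, columns are predictions, and labels appear in sorted order (`alice`, then `persuasion`).

```python
# Counts copied from the confusion matrix printed above.
# Assumed layout: rows = true labels, columns = predicted labels,
# both in sorted label order.
matrix = [[36, 17],
          [2, 127]]
labels = ["alice", "persuasion"]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Recall: share of each true class that was predicted correctly.
recall = {
    label: matrix[i][i] / sum(matrix[i])
    for i, label in enumerate(labels)
}

# Precision: share of each predicted class that was actually correct.
precision = {
    label: matrix[i][i] / sum(row[i] for row in matrix)
    for i, label in enumerate(labels)
}

print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.8956
for label in labels:
    print(f"{label}: recall={recall[label]:.3f}, precision={precision[label]:.3f}")
```

The accuracy of 163/182 matches the test set score reported above. Under the assumed layout, recall for `alice` (36 of 53) is much lower than for `persuasion` (127 of 129), so most of the errors are Alice sentences labelled as Persuasion.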
