## Exercises
These exercises are meant to challenge you a little and make you think. Each question leaves room to experiment before you reach the answer. The solutions use code very similar to what was explored in the previous notebooks, but may require an extra logical step.

The solutions will be released after the lecture on Thursday.

Let's start by importing the required libraries.

In [1]:
import json
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

### Exercise 1

**1. a)** Print the `text_data` variable defined below, but with the word `is` removed from all text blocks.

In [2]:
text_data = [
    "This is the best app ever",
    "Why is this app so bad?",
    "Who is the user of this app meant to be?",
    "The develop of this app is a god",
]

# Write your answer here
print([x.replace(" is","") for x in text_data])

['This the best app ever', 'Why this app so bad?', 'Who the user of this app meant to be?', 'The develop of this app a god']
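The `str.replace` call above works for this data, but the pattern `" is"` would also match the start of a word such as ` island`. A minimal sketch of a stricter alternative, assuming the intent is to remove `is` only as a whole word, uses a regular expression with word boundaries. The helper `remove_word` and the second sample sentence are illustrative and not part of the original notebook.

```python
import re

def remove_word(text, word):
    """Remove `word` only where it appears as a whole word."""
    # \b marks a word boundary, so "This" and "island" are left untouched.
    pattern = r"\b" + re.escape(word) + r"\b\s*"
    return re.sub(pattern, "", text).strip()

sample = [
    "This is the best app ever",
    "The island is big",  # hypothetical extra case, not in text_data
]
cleaned = [remove_word(s, "is") for s in sample]
print(cleaned)
```

With plain `replace(" is", "")`, the second sentence would lose the start of `island` as well, which is why the word-boundary version can be safer on unseen text.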


**1. b)** Use a `CountVectorizer` to remove the stop-words from `text_data` and transform `text_data` into a document-word count matrix. Load the document-word count matrix into a DataFrame and print the head.

In [3]:
# Helper function
def create_doc_word_df(sparse_mat, feature_names):
    return(pd.DataFrame.sparse.from_spmatrix(sparse_mat, 
                        columns=feature_names))
    
# Write your answer here
count_vect = CountVectorizer(stop_words='english')
count_vect.fit(text_data)
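# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
create_doc_word_df(count_vect.transform(text_data), count_vect.get_feature_names_out()).head()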

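|   | app | bad | best | develop | god | meant | user |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |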
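To see what the vectorizer is doing, the same counts can be rebuilt by hand. This is only a sketch under stated assumptions: the `stop_words` set below is a small hand-picked list covering just the words in `text_data` (scikit-learn's built-in English list is much longer), and the regular expression imitates the default behaviour of lowercasing and keeping tokens of two or more characters.

```python
import re
from collections import Counter

text_data = [
    "This is the best app ever",
    "Why is this app so bad?",
    "Who is the user of this app meant to be?",
    "The develop of this app is a god",
]

# Assumption: hand-picked stop words, enough for this toy corpus only.
stop_words = {"this", "is", "the", "ever", "why", "so", "who", "of", "to", "be"}

def tokenize(text):
    # Lowercase, keep tokens of 2+ word characters, then drop stop words.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in stop_words]

# One Counter per document, then a sorted shared vocabulary.
doc_counts = [Counter(tokenize(doc)) for doc in text_data]
vocabulary = sorted(set().union(*doc_counts))

# Rows are documents, columns are vocabulary terms (missing terms count as 0).
count_matrix = [[counts[term] for term in vocabulary] for counts in doc_counts]

print(vocabulary)
for row in count_matrix:
    print(row)
```

Each row corresponds to one text block and each column to one remaining word, which is the layout the DataFrame displays.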


**1. c)** Create a text processing pipeline using `CountVectorizer` (with English stop-words removed) and `TfidfTransformer`, then fit the pipeline to `text_data` and print the head of the results.

In [4]:
# Write your answer here
text_pipe = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
])
text_pipe.fit(text_data);
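# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
create_doc_word_df(text_pipe.transform(text_data), text_pipe.named_steps['vect'].get_feature_names_out()).head()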

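|   | app | bad | best | develop | god | meant | user |
|---|---|---|---|---|---|---|---|
| 0 | 0.462637 | 0.0 | 0.886548 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.462637 | 0.886548 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.346182 | 0.0 | 0.0 | 0.0 | 0.0 | 0.663385 | 0.663385 |
| 3 | 0.346182 | 0.0 | 0.0 | 0.663385 | 0.663385 | 0.0 | 0.0 |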
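The first row of this output can be checked by hand. The sketch below assumes the `TfidfTransformer` defaults (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The document frequencies are read off the count matrix from **1. b)**: `app` appears in all four documents and `best` in one. The function name `smooth_idf` is illustrative.

```python
import math

n_docs = 4
doc_freq = {"app": 4, "best": 1}   # number of documents containing each term
counts = {"app": 1, "best": 1}     # term counts in the first document

def smooth_idf(df, n):
    # Smoothed inverse document frequency (assumed default behaviour).
    return math.log((1 + n) / (1 + df)) + 1

# Raw tf-idf weights for the first document.
weights = {t: c * smooth_idf(doc_freq[t], n_docs) for t, c in counts.items()}

# L2-normalise so the row has unit length.
norm = math.sqrt(sum(w * w for w in weights.values()))
tfidf = {t: w / norm for t, w in weights.items()}

print({t: round(v, 6) for t, v in tfidf.items()})
```

Because `app` occurs in every document, its IDF is the minimum value of 1, so it receives a lower weight than `best` even though both appear once in the first text block.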


### Exercise 2

**2. a)** Load the file `Data/newsgroups_train.json` into a Pandas DataFrame named `df_newsgroup_train` and print the first $10$ rows.

In [5]:
# Write your answer here
with open('Data/newsgroups_train.json','r') as in_file:
    df_newsgroup_train = pd.DataFrame(json.load(in_file))
    
df_newsgroup_train.head(10)

|   | category | document |
|---|---|---|
| 0 | rec.autos | I was wondering if anyone out there could enli... |
| 1 | comp.sys.mac.hardware | A fair number of brave souls who upgraded thei... |
| 2 | comp.sys.mac.hardware | well folks, my mac plus finally gave up the gh... |
| 3 | comp.graphics | \nDo you have Weitek's address/phone number? ... |
| 4 | sci.space | From article <C5owCB.n3p@world.std.com>, by to... |
| 5 | talk.politics.guns | \n\n\n\n\nOf course. The term must be rigidly... |
| 6 | sci.med | There were a few people who responded to my re... |
| 7 | comp.sys.ibm.pc.hardware | ... |
| 8 | comp.os.ms-windows.misc | I have win 3.0 and downloaded several icons an... |
| 9 | comp.sys.mac.hardware | \n\n\nI've had the board for over a year, and ... |


**2. b)** Using `CountVectorizer` (with English stopwords filtered out), `TfidfTransformer`, and `MultinomialNB` create a text classification pipeline named `text_pipe`, and then fit `text_pipe` to the `df_newsgroup_train` dataset.

In [6]:
# Write your answer here
text_pipe = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_pipe.fit(df_newsgroup_train.document, df_newsgroup_train.category);

**2. c, i)** Print the classes that can be output by `text_pipe`. 

*Hint*: Use the `named_steps` attribute to extract `classes_` from the `clf` step of `text_pipe`.

In [7]:
# Write your answer here
print(text_pipe.named_steps['clf'].classes_)

['alt.atheism' 'comp.graphics' 'comp.os.ms-windows.misc'
 'comp.sys.ibm.pc.hardware' 'comp.sys.mac.hardware' 'comp.windows.x'
 'misc.forsale' 'rec.autos' 'rec.motorcycles' 'rec.sport.baseball'
 'rec.sport.hockey' 'sci.crypt' 'sci.electronics' 'sci.med' 'sci.space'
 'soc.religion.christian' 'talk.politics.guns' 'talk.politics.mideast'
 'talk.politics.misc' 'talk.religion.misc']


**2. c, ii)** Create some text input examples, pass them through the text classification pipeline you just created, and print the outputs. Try to create four example inputs such that the classification pipeline outputs the categories `comp.graphics`, `rec.autos`, `rec.sport.baseball`, and `sci.space`, respectively.

In [8]:
# Write your answer here
x_test = [
    "Render on screen",
    "What a car!",
    "Hit the ball out the park",
    "To the moon!",
]
print(text_pipe.predict(x_test))

['comp.graphics' 'rec.autos' 'rec.sport.baseball' 'sci.space']


**2. c, iii)** Try to create an example text input that you think should result in a particular category, but for which the text classification pipeline outputs something unexpected.

In [9]:
# Write your answer here
x_test = [
    "This startup is a moon shot",
]
print(text_pipe.predict(x_test))

['sci.space']


**2. d)** For each category, print out the names of the $10$ features with the largest classification weights.

*Hint*: Use the variables `feature_names`, `classifier_weights`, and `classifier_classes` defined below.

In [10]:
# Variables
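# Note: get_feature_names() and MultinomialNB.coef_ were removed in recent scikit-learn
# releases; get_feature_names_out() and feature_log_prob_ are the replacements.
feature_names = text_pipe.named_steps['vect'].get_feature_names_out()
classifier_weights = text_pipe.named_steps['clf'].feature_log_prob_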
classifier_classes = text_pipe.named_steps['clf'].classes_

# Write your answer here
for i, category in enumerate(classifier_classes):
    top10 = np.argsort(-classifier_weights[i])[:10]
    print(category)
    print([feature_names[j] for j in top10])

alt.atheism
['god', 'people', 'don', 'think', 'atheism', 'religion', 'just', 'say', 'atheists', 'islam']
comp.graphics
['graphics', 'image', 'thanks', 'files', 'file', 'program', 'know', '3d', 'format', 'looking']
comp.os.ms-windows.misc
['windows', 'file', 'dos', 'files', 'use', 'drivers', 'driver', 'thanks', 'problem', 'card']
comp.sys.ibm.pc.hardware
['drive', 'scsi', 'card', 'bus', 'controller', 'ide', 'pc', 'thanks', 'disk', 'monitor']
comp.sys.mac.hardware
['mac', 'apple', 'drive', 'problem', 'thanks', 'simms', 'quadra', 'does', 'monitor', 'know']
comp.windows.x
['window', 'motif', 'server', 'widget', 'thanks', 'application', 'use', 'x11r5', 'windows', 'using']
misc.forsale
['sale', '00', 'offer', 'shipping', 'new', 'condition', 'price', 'sell', 'email', 'asking']
rec.autos
['car', 'cars', 'like', 'engine', 'just', 'dealer', 'good', 'new', 'ford', 'don']
rec.motorcycles
['bike', 'dod', 'bikes', 'ride', 'motorcycle', 'like', 'riding', 'helmet', 'just', 'don']
rec.sport.baseball
['

**2. e)** Re-attempt **2. c, iii)** using some of the words output in **2. d)**.

In [11]:
# Write your answer here

### Exercise 3

**3. a)** Load the file `Data/newsgroups_test.json` into a Pandas DataFrame named `df_newsgroup_test` and use it to evaluate the performance of `text_pipe`.

*Hint*: Use the `score` method of `text_pipe`.

In [12]:
# Write your answer here
with open('Data/newsgroups_test.json','r') as in_file:
    df_newsgroup_test = pd.DataFrame(json.load(in_file))
    
print(text_pipe.score(df_newsgroup_test.document, df_newsgroup_test.category))

0.6779075942644716
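For a classifier pipeline, `score` reports the mean accuracy: the fraction of test documents whose predicted category equals the true one. A minimal sketch of that calculation follows, using made-up labels rather than the newsgroup data. The function `accuracy` and the lists `y_true` and `y_pred` are illustrative names.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels, for illustration only.
y_true = ["rec.autos", "sci.space", "sci.med", "comp.graphics"]
y_pred = ["rec.autos", "sci.space", "sci.space", "comp.graphics"]

acc = accuracy(y_true, y_pred)
print(acc)
```

A score of about 0.68 on the test set therefore means that roughly two out of three test documents were assigned the correct newsgroup.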


**3. b)** Create a text classification pipeline named `text_pipe_wstops` that is the same as `text_pipe` (from **2. b)**), but that does not filter out stop words. Then fit `text_pipe_wstops` to `df_newsgroup_train` and evaluate `text_pipe_wstops` on `df_newsgroup_test`.

In [13]:
# Write your answer here
text_pipe_wstops = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_pipe_wstops.fit(df_newsgroup_train.document, df_newsgroup_train.category);
print(text_pipe_wstops.score(df_newsgroup_test.document, df_newsgroup_test.category))

0.6062134891131173


Did the text classification pipeline perform better with or without stopwords filtered out?

**3. c)** Set up a grid search over the values `[None, 'english']` for the `stop_words` parameter of the `vect` stage in `text_pipe_wstops`, fit the grid search to `df_newsgroup_train`, and evaluate it on `df_newsgroup_test`.

In [14]:
# Write your answer here
grid_params = {
    'vect__stop_words': [None, 'english'],
}
search = GridSearchCV(text_pipe_wstops, grid_params)
search.fit(df_newsgroup_train.document, df_newsgroup_train.category);
print(search.score(df_newsgroup_test.document, df_newsgroup_test.category))

0.6779075942644716


Can you explain what is happening?