## Question 2: Applied ML

We are going to build a classifier of news to directly assign them to 20 news categories. Note that the pipeline that you will build in this exercise could be of great help during your project if you plan to work with text!

1. Load the 20newsgroup dataset. It is, again, a classic dataset that can directly be loaded using sklearn ([link](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)).  
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), short for term frequency–inverse document frequency, is of great help when if comes to compute textual features. Indeed, it gives more importance to terms that are more specific to the considered articles (TF) but reduces the importance of terms that are very frequent in the entire corpus (IDF). Compute TF-IDF features for every article using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Then, split your dataset into a training, a testing and a validation set (10% for validation and 10% for testing). Each observation should be paired with its corresponding label (the article category).

2. Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimator "n_estimators" and the max depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.



In [184]:
# Import dependencies for part 2
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from email.parser import Parser as EmailParser
import sklearn
import pandas
import numpy

In [108]:
# Download the training dataset to `~/scikit_learn_data/20news_home` then load it to a variable
newsgroups_train = fetch_20newsgroups(subset='train')

# Dataset analysis

[Official description](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)

In [109]:
# Check the available keys
newsgroups_train.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])

The loaded data contains the following properties:

- `data`: List of 11314 strings representing the messages.
- `filenames`: Absolute path to the downloaded file containing the message (11314 items).
- `target_names`: List of the names of the 20 newsgroups:
  ```
  ['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']
  ```
- `target`: List of ids for the targets: from 0 to 19 referencing the targets defined in `target_names`.
- `DESCR`: No value (`None`)
- `description`: String describing the dataset: `'the 20 newsgroups by date dataset'`

The messsages are formatted as emails: they have a header followed by a blank line and then the body with the actual text content. 
For example, you can view the message with the id `0` below.

Here are some observations:
- The most common header seem to be `From`, `Subject`, `Organization` and `Lines` 
- Messages `754`, `8000` quote other messages
- Message `1704` seems to have an attachment (or manually pasted source code)

In [110]:
# Set the message id to view specific messages
msg_id = 0
print(f"Category: {newsgroups_train.target_names[newsgroups_train.target[msg_id]]}\n")
print(newsgroups_train.data[msg_id])

Category: rec.autos

From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Nntp-Posting-Host: rac3.wam.umd.edu
Organization: University of Maryland, College Park
Lines: 15

 I was wondering if anyone out there could enlighten me on this car I saw
the other day. It was a 2-door sports car, looked to be from the late 60s/
early 70s. It was called a Bricklin. The doors were really small. In addition,
the front bumper was separate from the rest of the body. This is 
all I know. If anyone can tellme a model name, engine specs, years
of production, where this car is made, history, or whatever info you
have on this funky looking car, please e-mail.

Thanks,
- IL
   ---- brought to you by your neighborhood Lerxst ----







# Dataframe creation

In [111]:
def read_header_or_none(headers, name):
    """
    Tries to read the header `name` from `headers`, else returns `None`.
    """
    if name in headers:
        return headers[name]
    return None

In [112]:
def create_entry(label, label_id, subject, body, from_, organization, raw):
    return {
        "label": label,
        "label_id": label_id,
        "subject": subject,
        "body": body,
        "from": from_,
        "organization": organization,
        "raw": raw
    }

In [113]:
def create_df(emails, labels, label_ids):
    email_parser = EmailParser()
    entries = []
    for email, label_id in zip(emails, label_ids):
        label = labels[label_id]
        parsed_email = email_parser.parsestr(email, headersonly=False)
        body = parsed_email.get_payload()
        headers = dict(parsed_email.items())
        subject = read_header_or_none(headers, "Subject")
        from_ = read_header_or_none(headers, "From")
        organization = read_header_or_none(headers, "Organization")
        entry = create_entry(label, label_id, subject, body, from_, organization, email)
        entries.append(entry)
    return pandas.DataFrame(entries)
    
    
    

In [114]:
news_df = create_df(newsgroups_train.data, newsgroups_train.target_names, newsgroups_train.target)

In [115]:
news_df.head(10)

Unnamed: 0,body,from,label,label_id,organization,raw,subject
0,I was wondering if anyone out there could enl...,lerxst@wam.umd.edu (where's my thing),rec.autos,7,"University of Maryland, College Park",From: lerxst@wam.umd.edu (where's my thing)\nS...,WHAT car is this!?
1,A fair number of brave souls who upgraded thei...,guykuo@carson.u.washington.edu (Guy Kuo),comp.sys.mac.hardware,4,University of Washington,From: guykuo@carson.u.washington.edu (Guy Kuo)...,SI Clock Poll - Final Call
2,"well folks, my mac plus finally gave up the gh...",twillis@ec.ecn.purdue.edu (Thomas E Willis),comp.sys.mac.hardware,4,Purdue University Engineering Computer Network,From: twillis@ec.ecn.purdue.edu (Thomas E Will...,PB questions...
3,Robert J.C. Kyanko (rob@rjck.UUCP) wrote:\n> a...,jgreen@amber (Joe Green),comp.graphics,1,Harris Computer Systems Division,From: jgreen@amber (Joe Green)\nSubject: Re: W...,Re: Weitek P9000 ?
4,"From article <C5owCB.n3p@world.std.com>, by to...",jcm@head-cfa.harvard.edu (Jonathan McDowell),sci.space,14,"Smithsonian Astrophysical Observatory, Cambrid...",From: jcm@head-cfa.harvard.edu (Jonathan McDow...,Re: Shuttle Launch Question
5,In article <1r1eu1$4t@transfer.stratus.com> cd...,dfo@vttoulu.tko.vtt.fi (Foxvog Douglas),talk.politics.guns,16,VTT,From: dfo@vttoulu.tko.vtt.fi (Foxvog Douglas)\...,Re: Rewording the Second Amendment (ideas)
6,There were a few people who responded to my re...,bmdelane@quads.uchicago.edu (brian manning del...,sci.med,13,University of Chicago,From: bmdelane@quads.uchicago.edu (brian manni...,Brain Tumor Treatment (thanks)
7,DXB132@psuvm.psu.edu writes:\n>In article <1ql...,bgrubb@dante.nmsu.edu (GRUBB),comp.sys.ibm.pc.hardware,3,"New Mexico State University, Las Cruces, NM",From: bgrubb@dante.nmsu.edu (GRUBB)\nSubject: ...,Re: IDE vs SCSI
8,I have win 3.0 and downloaded several icons an...,holmes7000@iscsvax.uni.edu,comp.os.ms-windows.misc,2,University of Northern Iowa,From: holmes7000@iscsvax.uni.edu\nSubject: WIn...,WIn 3.0 ICON HELP PLEASE!
9,jap10@po.CWRU.Edu (Joseph A. Pellettiere) writ...,kerr@ux1.cso.uiuc.edu (Stan Kerr),comp.sys.mac.hardware,4,University of Illinois at Urbana,From: kerr@ux1.cso.uiuc.edu (Stan Kerr)\nSubje...,Re: Sigma Designs Double up??


# Features computation

In [158]:
# Reload the dataset without metadata
newsgroups_train = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
# Seems to completely remove some messages (754?)

In [159]:
newsgroups_train.data[0]

'I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.'

In [160]:
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(newsgroups_train.data)

In [161]:
vectors.shape

(11314, 101631)

In [162]:
def create_permutation(n: int, rand_state = None):
    """
    Create a ranom permutation of size `n`.
    
    :param n: Size of the permutation
    :param rand_state: Random state to use to create the permutation.
    :return: A 1D numpy ndarray of size. It contains a permutation of integers from 0 to n-1
    """
    if rand_state is None:
        rand_state = numpy.random.RandomState()
    return rand_state.permutation(n)

In [163]:
def split_data(data, test_ratio, validate_ratio, rand_state = None):
    """
    Pseudo-randomly splits a numpy 2D array into a `test`, `validate` and `train` dataset.
    
    :param data: Matrix of to split
    :param test_ratio: Ratio of the data to use for the `test` dataset
    :param validate_ratio: Ratio of the data to use for the `validate` dataset
    :param rand_state: Random state used to create the permutation
    :return: A tuple (test, validate, train) where each item is a tuple (old_indexes, dataset)
    """
    from math import floor
    if rand_state is None:
        rand_state = numpy.random.RandomState()
    row_count = data.shape[0]
    test_row_count = floor(test_ratio * row_count)
    validate_row_count = floor(validate_ratio * row_count)
    permutation = create_permutation(row_count, rand_state)

    test_indexes = permutation[:test_row_count]
    validate_indexes = permutation[test_row_count:test_row_count + validate_row_count]
    train_indexes = permutation[test_row_count + validate_row_count:]
    
    test = (test_indexes, data[test_indexes])
    validate = (validate_indexes, data[validate_indexes])
    train = (train_indexes, data[train_indexes])

    return (test, validate, train)

In [169]:
(test, validate, train) = split_data(vectors, 0.1, 0.1)
test_indexes, test_vectors = test
validate_indexes, validate_vectors = validate
train_indexes, train_vectors = test
test_labels = numpy.array(newsgroups_train.target)[test_indexes]
validate_labels = numpy.array(newsgroups_train.target)[validate_indexes]
train_labels = numpy.array(newsgroups_train.target)[train_indexes]

In [170]:
train_labels

array([ 6, 10, 14, ..., 12,  4,  0])

In [198]:
classifier = RandomForestClassifier(n_estimators=100)

In [199]:
trained_classifier = classifier.fit(train_vectors, train_labels)

In [200]:
predicted_test_labels = trained_classifier.predict(test_vectors)

In [201]:
predicted_test_labels

array([ 6, 10,  7, ..., 12,  4,  0])

In [202]:
test_labels

array([ 6, 10, 14, ..., 12,  4,  0])

In [203]:
sklearn.metrics.f1_score(test_labels, predicted_test_labels, average='macro')

0.9798549086301479

In [214]:
def evaluate_parameters(n_estimators, max_depth):
    classifier = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth)
    trained_classifier = classifier.fit(train_vectors, train_labels)
    predicted_test_labels = trained_classifier.predict(test_vectors)
    score = sklearn.metrics.f1_score(test_labels, predicted_test_labels, average='macro')
    return score, trained_classifier

In [215]:
def grid_search(n_estimators_range, max_depth_range):
    result = None
    for n_estimators in n_estimators_range:
        for max_depth in max_depth_range:
            score, classifier = evaluate_parameters(n_estimators, max_depth)
            if result is None or score > result["score"]:
                result = {"score": score, "params": (n_estimators, max_depth), "classifier": classifier}
    return result


In [219]:
best = grid_search(range(10, 30), range(10, 50))

In [220]:
best

{'classifier': RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=48, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=25, n_jobs=1,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False),
 'params': (25, 48),
 'score': 0.96118732348172586}

In [221]:
predicted_validate_labels =  best["classifier"].predict(validate_vectors)

In [222]:
sklearn.metrics.f1_score(validate_labels, predicted_validate_labels, average='macro')

0.36242265934454976

:(