# Exercise 1

This exercise is based on the scikit-learn tutorial found here: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html, presented here in notebook format.

Try to follow this notebook on its own, but if you get into trouble you can find help by referring to the original tutorial.

We will go through how to do the following with scikit-learn:
1. Download the data
1. Load the data
1. Extract Features
1. Build a classifier
1. Put it all together into a scikit-learn pipeline
1. Evaluation
1. Optimization of parameters

## Initialization

In [1]:
%matplotlib inline
from sklearn import datasets

## Exercise 1.0 Download the data

As in the tutorial above, we will use the 20 newsgroups dataset. Its official description (http://people.csail.mit.edu/jrennie/20Newsgroups/) reads:

> The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of our knowledge, it was originally collected by Ken Lang, probably for his paper “Newsweeder: Learning to filter netnews,” though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.

Run the script `fetch_data.py` from the root folder, or first load its code into the workspace by running the IPython magic command `%load fetch_data.py` in the cell below.

Take a look at the structure of the files in your local filesystem. This is important for the next step. See here why: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_files.html#sklearn.datasets.load_files
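
The layout that `load_files` expects (one sub-folder per category, one file per document) can be illustrated on a throwaway directory. The following is a minimal sketch: the toy folder names and texts are made up for illustration and are not part of the 20 newsgroups download.

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical toy corpus: folder name = category, one text file per document
toy_docs = {
    "comp.graphics": ["OpenGL renders triangles quickly."],
    "sci.med": ["The doctor prescribed antibiotics."],
    "rec.autos": ["The engine needs new spark plugs."],
}

with tempfile.TemporaryDirectory() as root:
    for category, texts in toy_docs.items():
        folder = Path(root) / category
        folder.mkdir()
        for i, text in enumerate(texts):
            (folder / f"{i}.txt").write_text(text, encoding="utf-8")

    # `categories` keeps only the listed sub-folders; `encoding` decodes bytes to str
    toy = load_files(root, categories=["comp.graphics", "sci.med"],
                     encoding="utf-8", shuffle=True, random_state=42)

print(toy.target_names)
print(len(toy.data))
```

Note that the `rec.autos` folder is skipped because it is not listed in `categories`, and that the labels in `target` are indices into `target_names`.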

## Exercise 1.1 Loading the data

We will load the data using scikit-learn's native function for loading text data, which you have (hopefully) just read about.

First, set the `file_path_train` variable below to point at your data directory with the 20 news data. Use this variable together with the `categories` variable to load only a subset of the document topics.

In [2]:
file_path_train='/path/to/your_data_directory/20news-bydate-train'

# Only select a subset of the categories
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med'] 

In [3]:
twenty_train = ... # insert code here to load the data

Let's have a look at the data. Run the following cells and inspect their output

In [4]:
print(type(twenty_train))
print(twenty_train.keys())

<class 'sklearn.utils.Bunch'>
dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])


In [5]:
for key, value in twenty_train.items():
    print('{} is of type {}'.format(key,type(value)))

data is of type <class 'list'>
filenames is of type <class 'numpy.ndarray'>
target_names is of type <class 'list'>
target is of type <class 'numpy.ndarray'>
DESCR is of type <class 'NoneType'>


In [6]:
twenty_train.data[0]

b'From: clipper@mccarthy.csd.uwo.ca (Khun Yee Fung)\nSubject: Re: looking for circle algorithm faster than Bresenhams\nOrganization: Department of Computer Science, The University of Western\n\tOntario, London, Ontario, Canada\nIn-Reply-To: graeme@labtam.labtam.oz.au\'s message of Wed, 14 Apr 1993 04:49:46 GMT\n\t<1993Apr13.025240.8884@nwnexus.WA.COM>\n\t<1993Apr14.044946.12144@labtam.labtam.oz.au>\nNntp-Posting-Host: mccarthy.csd.uwo.ca\nLines: 41\n\n>>>>> On Wed, 14 Apr 1993 04:49:46 GMT, graeme@labtam.labtam.oz.au (Graeme Gill) said:\n\nGraeme> \tYes, that\'s known as "Bresenhams Run Length Slice Algorithm for\nGraeme> Incremental lines". See Fundamental Algorithms for Computer Graphics,\nGraeme> Springer-Verlag, Berlin Heidelberg 1985.\n\n> I have tried to extrapolate this to circles but I can\'t figure out\n> how to determine the length of the slices. Any ideas?\n\nGraeme> \tHmm. I don\'t think I can help you with this, but you might\nGraeme> take a look at the following:\n\nGraem

In [7]:
twenty_train.target_names

['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']

In [8]:
print(twenty_train.target.shape)
print(twenty_train.target[:5])

(2257,)
[1 0 2 2 0]


#### Small extra exercise for working with data in Python:
- count the number of samples in total and per category
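
One possible way to approach the counting is `numpy.bincount`, sketched below. The `target` array here is only a stand-in holding the first five labels printed above, so the counts are not those of the full training set.

```python
import numpy as np

target_names = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
# Stand-in for twenty_train.target (only the first five labels shown above)
target = np.array([1, 0, 2, 2, 0])

n_total = len(target)
# bincount counts how often each integer label occurs; minlength covers unseen classes
counts = np.bincount(target, minlength=len(target_names))
per_category = dict(zip(target_names, counts.tolist()))

print(n_total)
print(per_category)
```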

## Exercise 1.2 Extracting features from text

Now it is time to make use of this data in a scikit-learn friendly way.

We need to transform the text into the format required by scikit-learn and machine learning in general. Remember we want $n_{samples} \times n_{features}$ in a numpy array $X$ and another array with the targets $y$ for each sample.

A well known representation is bag of words (BoW). To transform the text strings into a BoW representation find a suitable method here http://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_extraction.text 

Run the cell below if you get errors due to encoding/decoding.

In [10]:
# Function for fixing potential issues with encoding
def fix_unicode(data):
    # Decode each raw bytes document as UTF-8, dropping undecodable bytes
    return [text.decode("utf-8", "ignore") for text in data]

# Run this line only if you get decoding errors (it expects the documents to be bytes)
twenty_train.data = fix_unicode(twenty_train.data)

In [19]:
# Insert your code here

vect_method = ...


X_train_counts = vect_method.fit_transform(twenty_train.data)
X_train_counts.shape # check dimensions

(2257, 35787)
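
To see what a bag-of-words matrix looks like, here is a minimal sketch on a made-up two-document corpus. It uses `CountVectorizer`, which is one candidate from the module linked above; the corpus is purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus, not part of the 20 newsgroups data
toy_corpus = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
X_toy = vectorizer.fit_transform(toy_corpus)  # sparse matrix, n_samples x n_features

# vocabulary_ maps each term to its column index
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(terms)
print(X_toy.toarray())
```

Each row is a document and each column a vocabulary term, so the word "the" in the second document shows up as a count of 2 in the last column.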

## Term Frequencies

BoW has some issues, as explained in the original tutorial:

> Occurrence count is a good start but there is an issue: longer documents will have higher average count values than shorter documents, even though they might talk about the same topics.
> To avoid these potential discrepancies it suffices to divide the number of occurrences of each word in a document by the total number of words in the document: these new features are called tf for Term Frequencies.
> Another refinement on top of tf is to downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus.
> This downscaling is called tf–idf for “Term Frequency times Inverse Document Frequency”.

Therefore we will quickly transform our counts into tf–idf features with the code below.

In [13]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(2257, 35787)

## Exercise 1.3 Classifier

Use the flowchart to determine a suitable classifier for the 20 news data (it contains links directly to the API of the different models):

http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

When you have determined which classifier to use, train the model and test it on the following sentences using `clf.predict(X_test)`. You will need to transform them in the same way as the training data.

In [14]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']

In [15]:
# clf = ... #your trained classifier



You should get

```python
'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics
```

## Exercise 1.4 Putting It All in a Pipeline

Scikit-learn makes it easy to work with many data processing steps from the raw data to a classifier. See http://scikit-learn.org/stable/modules/pipeline.html#pipeline

Combine the steps from exercises 1.2 and 1.3 into a pipeline. Test it on the same two sentences as above to verify that you get similar results.

In [None]:
from sklearn.pipeline import Pipeline

text_clf = Pipeline(...)

In [28]:
text_clf.fit(twenty_train.data, twenty_train.target) 

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...inear_tf=False, use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])
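
For reference, a pipeline with the step names shown in the output above (`vect`, `tfidf`, `clf`) could look like the sketch below. It is fitted on a made-up toy corpus with two hypothetical classes, so it only illustrates the mechanics; the pipeline takes raw strings and applies every step in order on both `fit` and `predict`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy training set (0 = religion, 1 = graphics)
toy_docs = [
    "god is love and faith",
    "church and prayer and god",
    "opengl renders graphics on the gpu",
    "the gpu draws pixels and graphics",
]
toy_target = [0, 0, 1, 1]

# Each step is a (name, estimator) pair; the last step is the classifier
toy_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
toy_clf.fit(toy_docs, toy_target)

toy_predicted = toy_clf.predict(["God is love", "OpenGL on the GPU is fast"])
print(toy_predicted)
```

The step names matter later: grid search addresses parameters as `<step name>__<parameter>`, for example `clf__alpha`.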

## Exercise 1.5 Evaluation

Now we will test our trained classifier on the test data. Start by loading the test data and use your pipeline from exercise 1.4 to predict its targets. Afterwards, compute the accuracy using the `metrics` module of scikit-learn.

If you did not choose the classifier `SGDClassifier`, try to compare your solution with a pipeline using that instead of your classifier.

Use the `metrics.classification_report` function to get a detailed per-class report of the results, or `metrics.confusion_matrix` to see which categories are confused with each other.

**Extra exercise:** implement a function which takes the test data and a metric (e.g. accuracy) and a pipeline as input returning the result of the metric. Bonus if you make it take a list of metrics for comparing them

In [30]:
file_path_test='/path/to/your_data_directory/20news-bydate-test'

In [31]:
# Load test
twenty_test= ...

# Uncomment below to avoid decoding errors
# twenty_test.data = fix_unicode(twenty_test.data)
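
The metric functions mentioned above all take the true labels first and the predicted labels second. Here is a minimal sketch on made-up label arrays, which stand in for `twenty_test.target` and the output of your pipeline's `predict`:

```python
import numpy as np
from sklearn import metrics

# Made-up labels: stand-ins for the true test targets and the predictions
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])

accuracy = metrics.accuracy_score(y_true, y_pred)      # fraction of correct predictions
confusion = metrics.confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted
report = metrics.classification_report(y_true, y_pred) # precision/recall/F1 per class

print(f"accuracy: {accuracy:.3f}")
print(confusion)
print(report)
```

In the confusion matrix, off-diagonal entries are the mistakes: entry `[i, j]` counts documents of true class `i` that were predicted as class `j`.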

## Exercise 1.6 Parameter tuning

Read the user guide on parameter tuning using grid search here http://scikit-learn.org/stable/modules/grid_search.html#grid-search

Use it to tune the parameters of your pipeline. A suggestion would be `ngram_range` for the vectorization, `use_idf` for the term frequency transformation, and `alpha` for the classifier.

Setting `n_jobs=-1` when constructing `GridSearchCV` makes it use all available CPU cores in your system for parallelization.

In [26]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3)}

In [27]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
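
Constructing the `GridSearchCV` object does not run the search yet; that happens on `fit`. The sketch below shows the full round trip on a made-up toy corpus. The data, the `cv=2` setting (needed because the toy set is tiny), and the variable names prefixed with `toy_` are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus with two hypothetical classes (0 = religion, 1 = graphics)
toy_docs = [
    "god is love", "faith and prayer", "church on sunday", "god and faith",
    "opengl on the gpu", "graphics card driver", "gpu renders pixels", "opengl graphics",
]
toy_target = [0, 0, 0, 0, 1, 1, 1, 1]

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
toy_parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
                  'tfidf__use_idf': (True, False),
                  'clf__alpha': (1e-2, 1e-3)}

# cv=2 only because the toy set is tiny; fit runs every parameter combination
toy_gs = GridSearchCV(toy_pipeline, toy_parameters, cv=2)
toy_gs.fit(toy_docs, toy_target)

print(toy_gs.best_score_)
for name in sorted(toy_parameters):
    print(f"{name}: {toy_gs.best_params_[name]!r}")
```

After fitting, the search object itself behaves like the best pipeline refit on all the data, so `toy_gs.predict(...)` can be called directly; `toy_gs.cv_results_` holds the scores for every combination.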

## Extras

Try the following:
- load your own data or other data (e.g. some available from https://archive.ics.uci.edu/ml/index.php)
- if you choose to work with non-text data (image, csv etc.), use suitable methods as described in http://scikit-learn.org/stable/datasets/index.html#loading-from-external-datasets
- for more advanced integration between pandas and sklearn, see https://github.com/scikit-learn-contrib/sklearn-pandas
- build some pipelines using different vectorized features and models for comparison