# Classifying News Articles with K-Nearest Neighbor Model

The 20 Newsgroups dataset collects approximately 20,000 newsgroup documents, partitioned nearly evenly across 20 newsgroups. It has become a popular dataset for demonstrating text classification and clustering. This exercise might seem difficult at first, especially for those of you who have not studied text processing, but you will eventually get it!

### Load Data

First, let's load the 20 Newsgroups data, which contains newsgroup articles and their category labels. We picked 5 newsgroups for this exercise.

In [1]:
from sklearn.datasets import fetch_20newsgroups

categories = ['comp.graphics', 'rec.motorcycles', 'sci.space', 'talk.politics.mideast', 'talk.religion.misc']

print("Loading 20 newsgroups dataset for categories:")
print(categories)

dataset = fetch_20newsgroups(subset='all', categories=categories,
                             shuffle=True, random_state=42)

Loading 20 newsgroups dataset for categories:
['comp.graphics', 'rec.motorcycles', 'sci.space', 'talk.politics.mideast', 'talk.religion.misc']


### Understanding the data

Your dataset is organized as a dictionary-like object with the following attributes.

In [2]:
dir(dataset)

['DESCR', 'data', 'description', 'filenames', 'target', 'target_names']

The targets are numerical codes for the news categories. The corresponding category names are also provided.

In [3]:
dataset['target']

array([1, 3, 1, ..., 4, 3, 2])

In [4]:
dataset['target_names']

['comp.graphics',
 'rec.motorcycles',
 'sci.space',
 'talk.politics.mideast',
 'talk.religion.misc']

Let's look at the number of news articles included.

In [5]:
len(dataset['data'])

4524

This is an example of a news article.

In [6]:
print(dataset['data'][0])

From: James Leo Belliveau <jbc9+@andrew.cmu.edu>
Subject: First Bike??
Organization: Freshman, Mechanical Engineering, Carnegie Mellon, Pittsburgh, PA
Lines: 17
NNTP-Posting-Host: po2.andrew.cmu.edu

 Anyone, 

    I am a serious motorcycle enthusiast without a motorcycle, and to
put it bluntly, it sucks.  I really would like some advice on what would
be a good starter bike for me.  I do know one thing however, I need to
make my first bike a good one, because buying a second any time soon is
out of the question.  I am specifically interested in racing bikes, (CBR
600 F2, GSX-R 750).  I know that this may sound kind of crazy
considering that I've never had a bike before, but I am responsible, a
fast learner, and in love.  Please give me any advice that you think
would help me in my search, including places to look or even specific
bikes that you want to sell me.

    Thanks  :-)

    Jamie Belliveau (jbc9@andrew.cmu.edu)  




### Text feature preprocessing

#### A little bit of background on how to go from text to feature


Our first task is to convert text into numerical feature vectors. You do not have to be too concerned about this; we have done this step for you. For those of you who want to understand the background of this transformation, you can read about it below. Feel free to shoot us any questions!

#### Term frequency

We want to transform text into a vector first. The easiest approach is to count how many times each word appears in the document.

Suppose this is our document d:

"The dog is chasing another dog."

We would like to transform it to a vector:
$$\begin{bmatrix}
\frac{1}{6} & \frac{1}{3} & \frac{1}{6} & \frac{1}{6} & \frac{1}{6}
\end{bmatrix}$$

which corresponds to term frequencies of:
$$\begin{bmatrix}
\text{the} & \text{dog} & \text{is} & \text{chasing} & \text{another}
\end{bmatrix}$$

The term frequencies vector is the "feature vector" of a document in our machine learning problem.

The frequency of term $t$ in document $d$ is defined as the following:

$$tf(t,d) = \frac{\text{number of times term } t \text{ appears in document } d}{\text{total number of words in document } d}$$
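As a quick check of the definition, here is a minimal sketch that computes term frequencies for the example document. The helper name `term_frequencies` and the very crude tokenizer (lowercase, alphabetic runs only) are assumptions made for illustration, not part of the notebook.

```python
from collections import Counter
import re

def term_frequencies(document):
    """Return {term: count / total number of words} for one document."""
    # Crude tokenizer (an assumption): lowercase and keep alphabetic runs only.
    words = re.findall(r"[a-z]+", document.lower())
    counts = Counter(words)
    total = len(words)
    return {term: count / total for term, count in counts.items()}

tf = term_frequencies("The dog is chasing another dog.")
for term, value in tf.items():
    print(f"{term:>8}: {value:.3f}")
```

The document has six words and "dog" appears twice, so its frequency is 1/3, while every other term gets 1/6, matching the vector above.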

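Often, we use a logarithmically scaled $tf$. The scaling is usually applied to the raw count of term $t$ in document $d$, written $f_{t,d}$ here, rather than to the fraction above, because the logarithm of a fraction below 1 is negative. For $f_{t,d} > 0$:

$$tf_{scaled}(t,d) = 1+\log(f_{t,d})$$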

#### Inverse Document Frequency

Now you can imagine that the frequencies of common words like "the" or "and" are going to be really high, even though the frequencies of these words do not tell us anything about the gist of the document. So people often use another measure called "inverse document frequency" or "idf". This is the ratio of the total number of documents to the number of documents that contain the term $t$.

$$idf(t) = \frac{\text{the number of documents}}{\text{the number of documents that contain term t}}$$

Usually, $idf$ is also transformed by logarithmic scaling:

$$idf_{scaled}(t) = \log(1+idf(t))$$

If a lot of documents contain term $t$, idf will be low, meaning that term is not important. If only a few documents contain term $t$, idf will be high.
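To make the two idf formulas concrete, the following sketch evaluates them on a tiny corpus. The three-document corpus and the helper names (`tokenize`, `idf`, `idf_scaled`) are hypothetical and exist only for this illustration.

```python
import math
import re

# Hypothetical three-document corpus, for illustration only.
corpus = [
    "The dog is chasing another dog.",
    "The cat is sleeping.",
    "The weather is nice today.",
]

def tokenize(document):
    """Return the set of distinct lowercase terms in a document."""
    return set(re.findall(r"[a-z]+", document.lower()))

doc_terms = [tokenize(doc) for doc in corpus]
n_docs = len(corpus)

def idf(term):
    """Number of documents divided by the number of documents containing term.

    Raises ZeroDivisionError for a term that appears in no document.
    """
    n_containing = sum(term in terms for terms in doc_terms)
    return n_docs / n_containing

def idf_scaled(term):
    """Logarithmically scaled idf, as defined above."""
    return math.log(1 + idf(term))

for term in ["the", "dog"]:
    print(term, idf(term), round(idf_scaled(term), 3))
```

Here "the" occurs in all three documents, so its idf is 1, while "dog" occurs in only one document and gets an idf of 3.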

#### tfidf

In practice, we often use $tf$ and $idf$ together, so each feature in the vector is the product of the two measures.

$$tfidf(t,d) = tf(t,d) \cdot idf(t)$$	


If a term has a high $tfidf$ score in a document, it appears frequently in that document but in only a small number of documents overall. Taken together, the two measures down-weight common terms in the analysis.
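The product can be sketched in a few lines using the unscaled $tf$ and $idf$ definitions. The toy corpus and function names below are assumptions for illustration; this is not how scikit-learn computes its weights.

```python
from collections import Counter
import re

# Hypothetical three-document corpus, for illustration only.
corpus = [
    "The dog is chasing another dog.",
    "The cat is sleeping.",
    "The weather is nice today.",
]
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]

def tf(term, words):
    """Fraction of the words in one document that are equal to term."""
    return Counter(words)[term] / len(words)

def idf(term):
    """Total number of documents over the number of documents containing term."""
    return len(tokenized) / sum(term in words for words in tokenized)

def tfidf(term, words):
    return tf(term, words) * idf(term)

first = tokenized[0]
scores = {term: tfidf(term, first) for term in dict.fromkeys(first)}
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:>8}: {score:.3f}")
```

In the first document, "dog" scores highest because it is both frequent in that document and rare in the corpus, while "the" and "is" end up at the bottom.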

Scikit-learn provides a class for calculating this, called `TfidfVectorizer`. We are going to create a `TfidfVectorizer` object and use its `fit_transform` method to generate the input vectors for our classifiers.

In [None]:
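from sklearn.feature_extraction.text import TfidfVectorizer

# Ignore terms that appear in more than 50% of documents, drop English stop words,
# and keep only the 2,500 most frequent remaining terms as features.
vectorizer = TfidfVectorizer(max_df=0.5, max_features=2500, stop_words='english', use_idf=True)
x = vectorizer.fit_transform(dataset['data'])  # sparse matrix: one row per article
y = dataset['target']                          # integer category codes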

### KNN Classifier

#### 1. Split data into train and test sets

#### 2. Initiate KNN classifier and fit the model to the data

#### 3. Compute the accuracy of the classification

If you want to try something new, use `sklearn.metrics.classification_report` to evaluate your classifier in more detail.
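Steps 1 to 3 could look roughly like the sketch below. It is one possible approach, not the reference solution. To keep it runnable without downloading anything, it uses a hypothetical twelve-sentence corpus with two made-up classes; in the notebook you would pass the `x` and `y` built earlier to `train_test_split` instead. The values `test_size=0.25` and `n_neighbors=3` are arbitrary choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy corpus: label 0 = motorcycles, label 1 = space.
texts = [
    "my first bike is a fast racing motorcycle",
    "looking for advice on a good starter bike",
    "the motorcycle engine needs a new clutch",
    "riding my bike on the highway with a helmet",
    "selling a used racing bike with new tires",
    "which helmet is best for motorcycle riding",
    "the shuttle launch was delayed by weather",
    "nasa plans a new mission to orbit mars",
    "the satellite reached a stable orbit",
    "astronauts on the station repaired the telescope",
    "the rocket launch put a satellite into orbit",
    "a new telescope will observe distant planets",
]
labels = [0] * 6 + [1] * 6

features = TfidfVectorizer(stop_words="english").fit_transform(texts)

# 1. Split into train and test sets, keeping the class balance in both.
x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42)

# 2. Create the KNN classifier and fit it to the training data.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train, y_train)

# 3. Measure accuracy on the held-out test set.
y_pred = knn.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred, zero_division=0))
```

With only three test sentences the reported numbers mean very little; the point is the sequence of calls.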

#### 4. Try changing the `n_neighbors` parameter or other parameters to see which setting works best.

#### 5. Try comparing the KNN model with, say, `MLPClassifier` and `LogisticRegression`. Which one performs best?
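One way to approach steps 4 and 5 is to score several candidate models with the same procedure. The sketch below uses 3-fold cross-validation, which is a choice made here and not something the exercise prescribes, on a hypothetical toy corpus so that it runs offline. The hyperparameter values are arbitrary starting points. Putting `TfidfVectorizer` inside the pipeline means it is refit on each training fold only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus: label 0 = motorcycles, label 1 = space.
texts = [
    "my first bike is a fast racing motorcycle",
    "looking for advice on a good starter bike",
    "the motorcycle engine needs a new clutch",
    "riding my bike on the highway with a helmet",
    "selling a used racing bike with new tires",
    "which helmet is best for motorcycle riding",
    "the shuttle launch was delayed by weather",
    "nasa plans a new mission to orbit mars",
    "the satellite reached a stable orbit",
    "astronauts on the station repaired the telescope",
    "the rocket launch put a satellite into orbit",
    "a new telescope will observe distant planets",
]
labels = [0] * 6 + [1] * 6

# Candidate models; the settings are illustrative, not tuned.
candidates = {
    "KNN (k=1)": KNeighborsClassifier(n_neighbors=1),
    "KNN (k=3)": KNeighborsClassifier(n_neighbors=3),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MLPClassifier": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                                   random_state=0),
}

results = {}
for name, model in candidates.items():
    # Vectorizer and classifier are fit together on each training fold.
    pipeline = make_pipeline(TfidfVectorizer(stop_words="english"), model)
    scores = cross_val_score(pipeline, texts, labels, cv=3)
    results[name] = scores.mean()
    print(f"{name:<20} mean accuracy: {results[name]:.2f}")
```

On the real five-newsgroup data you would replace `texts` and `labels` with `dataset['data']` and `dataset['target']`; expect the MLP to take noticeably longer to train than the other two.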