<a href="https://colab.research.google.com/github/EmilySanderson/EmilySanderson.github.io/blob/master/COMPSCI_4ML3_A3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Application of ML in analyzing text documents
In this assignment we use scikit-learn to work with text documents. If you missed the tutorial, you are encouraged to watch it. This is also an exercise in figuring out how to write code with a new machine learning package; this is a necessary skill in applied machine learning, since packages evolve quickly (and there are many of them), so being able to learn a tool within a reasonable time frame is important. If you need further details, check out this <a href="https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html" > scikit-learn example </a> or other scikit-learn documentation.

# Submission
- There are three tasks for you.
- Report the results and answer the questions in a pdf file, along with your other solutions.
- Additionally, submit your code in the same Jupyter notebook format (keep the overall format of the notebook unchanged).

Make a copy of this Colab notebook so that you can modify it for yourself. If Google Colab is slow, you can also download the notebook and use Jupyter Notebook on your computer (just like assignment 2). Using the online notebook has the benefit that all the required packages are already installed.


# Packages

First of all, let's import the packages we need for this assignment.


In [None]:
# load the required libraries
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import fetch_20newsgroups
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Dataset characteristics

Here we take a look at the structure and properties of the dataset. To keep the code fast, we pick just 4 of the 20 class labels in this dataset and classify the documents into these 4 categories. Each data point is a text document.


In [None]:
categories = ['alt.atheism', 'soc.religion.christian',
              'comp.graphics', 'sci.med']
twenty_train = fetch_20newsgroups(subset='train',
    categories=categories, shuffle=True, random_state=42)
twenty_test = fetch_20newsgroups(subset='test',
    categories=categories, shuffle=True, random_state=42)

print("Dataset properties on train section:")
print("\t Number of training data points: %d" % len(twenty_train.data))
print("\t Number of test data points: %d" % len(twenty_test.data))
print("\t Number of Classes: %d" % len(categories))
print("\t Class names: " ,(twenty_train.target_names))

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)


Dataset properties on train section:
	 Number of training data points: 2257
	 Number of test data points: 1502
	 Number of Classes: 4
	 Class names:  ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']


# A sample of the dataset
We can look at the first instance of the training set like this:

In [None]:
print(twenty_train.data[0])

From: sd345@city.ac.uk (Michael Collier)
Subject: Converting images to HP LaserJet III?
Nntp-Posting-Host: hampton
Organization: The City University
Lines: 14

Does anyone know of a good way (standard PC application/PD utility) to
convert tif/img/tga files into LaserJet III format.  We would also like to
do the same, converting to HPGL (HP plotter) files.

Please email any response.

Is this the correct group?

Thanks in advance.  Michael.
-- 
Michael Collier (Programmer)                 The Computer Unit,
Email: M.P.Collier@uk.ac.city                The City University,
Tel: 071 477-8000 x3769                      London,
Fax: 071 477-8565                            EC1V 0HB.



The category name of the instance can be found as follows:

In [None]:
print(twenty_train.target_names[twenty_train.target[0]])

comp.graphics


To get the categories of a range of data, e.g., the first 10 samples, we can do something like this:

In [None]:
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])

comp.graphics
comp.graphics
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
soc.religion.christian
sci.med
sci.med
sci.med


# Feature extraction
Since our data is text, we must turn the documents into vectors of numerical features before we can run classification models on them. In this section we extract features using the **Bag of Words** method, which works as follows:


*   Assign an integer ID to each word in the dataset (like a dictionary).
*   For each data point (document i), count the number of occurrences of each word w and store it in X[i,j], where i is the index of the document and j is the index of the word w in the dictionary.

Thus, if we have, e.g., 10,000 data points and 100,000 words in the dictionary, then X will be a 10,000 by 100,000 matrix, which is huge! The good news is that most elements of X are zero (each document uses only a small fraction of the words). Therefore, it is possible to store only the non-zero elements and save a lot of memory. Fortunately, the library we use supports "sparse" data representations, meaning that it does not actually store the zero values.
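
To make the two steps above concrete, here is a minimal pure-Python sketch on a made-up two-document corpus. The names `docs`, `vocab`, and `X_sparse` are illustrative assumptions, and plain whitespace splitting stands in for real tokenization; this is not how scikit-learn implements it internally.

```python
from collections import Counter

# Toy corpus (hypothetical, not taken from the 20 newsgroups data).
docs = ["the cat sat", "the cat saw the dog"]

# Step 1: assign an integer ID to each distinct word, in order of first appearance.
vocab = {}
for doc in docs:
    for word in doc.split():
        vocab.setdefault(word, len(vocab))

# Step 2: for each document i, store only the non-zero counts as {j: count},
# which is the idea behind a sparse representation of X[i, j].
X_sparse = []
for doc in docs:
    counts = Counter(vocab[word] for word in doc.split())
    X_sparse.append(dict(counts))

print(vocab)
print(X_sparse)
```

Each row of `X_sparse` only mentions the words that actually occur in that document, so its size grows with the document length rather than with the vocabulary size.
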
# Tokenizing with scikit-learn
In the following part we extract all the words used in the dataset and compute their occurrence counts in each document. The result shows that the number of documents is **2257** and the number of features (unique words in the whole set of documents) is **35788**.


In [None]:

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
X_train_counts.shape

(2257, 35788)

Up to here, we turned each document into an occurrence feature map (i.e., a bag-of-words representation). But there is an issue with this solution: longer documents tend to have larger occurrence values. This is not ideal; for example, if we just repeat the same text twice, we don't expect the category of that document to change, but all the occurrence values will double. Solution: normalize each document by dividing the occurrence count of each word by the total number of words in the document (*tf* normalization, where tf stands for term frequency). Note that scikit-learn's TfidfTransformer by default scales each document vector to unit Euclidean (L2) length rather than dividing by the word count; this serves the same purpose.
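
The effect of repeating a text can be checked with a small pure-Python sketch. The helper `term_frequencies` is a hypothetical name introduced here, and it divides by the total word count as described above (not by the L2 norm that TfidfTransformer uses by default).

```python
from collections import Counter

def term_frequencies(text):
    """Return {word: count / total number of words} for one document."""
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}

doc = "good good movie"
doubled = doc + " " + doc  # the same text repeated twice

raw_counts = Counter(doc.split())
raw_counts_doubled = Counter(doubled.split())
tf = term_frequencies(doc)
tf_doubled = term_frequencies(doubled)

print(raw_counts, raw_counts_doubled)  # raw counts differ between the two
print(tf, tf_doubled)                  # normalized values agree
```

The raw counts of the doubled document are twice as large, while the normalized values are the same for both documents.
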

Another issue is that some words are so common that they carry little information (think of words like "is", "the", etc.). To reduce the effect of those words, one can use the *tf-idf* method, where on top of normalizing by the length of the document (*tf*), we also downscale the weights of words that are present in many documents (*idf* stands for inverse document frequency).
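
As a rough illustration of the idf idea, the sketch below uses the textbook definition idf(w) = log(N / df(w)), where N is the number of documents and df(w) is the number of documents containing w. This is an assumption made for simplicity: scikit-learn's TfidfTransformer uses a smoothed variant by default, so its numbers will differ. The toy corpus and the helper `tfidf` are hypothetical.

```python
import math

# Toy corpus (hypothetical); "the" appears in every document.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = {}
for words in tokenized:
    for word in set(words):
        df[word] = df.get(word, 0) + 1

# Textbook idf (an assumption; not scikit-learn's smoothed formula).
idf = {word: math.log(n_docs / count) for word, count in df.items()}

def tfidf(words):
    """Weight each word of one document by (count / length) * idf."""
    total = len(words)
    return {w: (words.count(w) / total) * idf[w] for w in set(words)}

weights = tfidf(tokenized[0])
print(idf)
print(weights)
```

Under this definition a word that occurs in every document, such as "the" here, gets an idf of zero, while rarer words get larger weights.
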

If you are interested in learning more about tf-idf, feel free to check out the Wikipedia page. For this assignment, we will use both *tf* and *tf-idf* normalization.

The cell below applies ***TfidfTransformer*** with idf turned off. As expected, the feature dimension does not change after the **tf** step.

In [None]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)
X_train_tf.shape

(2257, 35788)

# Document classification
Support Vector Machines (SVMs) are one of the most common classifiers in practice. Here we train an SVM classifier on the transformed features and try to classify two tiny documents using the trained classifier.

In [None]:

clf = SVC().fit(X_train_tf, twenty_train.target)

In [None]:
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tf_transformer.transform(X_new_counts)

predicted = clf.predict(X_new_tfidf)

for doc, category in zip(docs_new, predicted):
    print('%r => %s' % (doc, twenty_train.target_names[category]))

'God is love' => soc.religion.christian
'OpenGL on the GPU is fast' => comp.graphics


# <font color="red">Task 1</font>
Given the numerical features, we can train a classifier. In the following, a simple linear SVM is trained on tf features and evaluated on the test dataset.
## Pipeline
We can create a "pipeline" for performing a sequence of steps, namely first extracting the words and creating vectors, then applying tf or tf-idf, and then training the classifier. This helps to make our code cleaner (and allows for more code optimization, etc.). We use a pipeline to chain vectorizer -> transformer -> classifier.



In [None]:
text_clf = Pipeline([
      ('vect', CountVectorizer()),
      ('tfidf', TfidfTransformer(use_idf=False)),
      ('clf', SVC(kernel='linear')),
  ])
text_clf.fit(twenty_train.data, twenty_train.target)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
print('linear kernel, accuracy:{}'.format(np.mean(predicted == twenty_test.target)))

linear kernel, accuracy:0.8808255659121171




*  **A. (Kernel effect - 15 points)** Try an RBF SVM (a version of SVM that uses a Gaussian kernel) on the above example. Report the test-set performance for three different gamma values: 0.70, 0.65, and 0.60 (see https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). Can you justify the results? (Why did the higher/lower gamma work better?)




In [None]:
# your code comes here

*  **B. (idf importance - 15 points)** How do the results of part A change if we turn on idf with *TfidfTransformer(use_idf=True)*? Report the results and justify them.

In [None]:
# your code comes here

# <font color="red">Task 2 - Confusion matrix</font>
The confusion matrix is a k x k matrix, where k is the number of classes. It gives more detailed information than the accuracy of a classifier alone. The element in row i and column j of this matrix is the number of data points from class i that were classified as class j.

In scikit-learn, the confusion matrix is a 2D array (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html).
Let's fix gamma = 0.7 and turn on the use_idf flag.
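
The definition above can be sketched in a few lines of pure Python. The function name `toy_confusion_matrix` and the label lists are made up for illustration; for the task itself, the scikit-learn function is the one to use.

```python
def toy_confusion_matrix(y_true, y_pred, k):
    """Entry [i][j] counts the points of true class i predicted as class j."""
    matrix = [[0] * k for _ in range(k)]
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[true_label][predicted_label] += 1
    return matrix

# Made-up labels for a 3-class problem (hypothetical data).
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 0]

cm = toy_confusion_matrix(y_true, y_pred, k=3)
for row in cm:
    print(row)

# The most common mistake is the largest off-diagonal entry.
mistakes = [(cm[i][j], i, j) for i in range(3) for j in range(3) if i != j]
count, true_class, predicted_class = max(mistakes)
print(count, true_class, predicted_class)
```

The diagonal holds the correctly classified points, so the accuracy is the sum of the diagonal divided by the total number of points.
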

**A. (15 points)** Report the confusion matrix on the test data. What is the most common mistake of the classifier? (for example, you can say that the data points from category xxxx were classified as category yyyy 200 times, which is more than any other pair of classes.)









In [None]:
# your code comes here

# <font color="red">Task 3 - The effect of n-grams</font>
Let's say we have a document that contains "apple watch", and another document that contains "I looked at my watch and had a bit of the apple". The problem is that the bag-of-words representation will say that each of these documents has one occurrence of the word apple and one occurrence of the word watch; therefore, we lose the important fact that the combination "apple watch" was present in the first document. To address this, it is possible to use "n-grams".

By default, CountVectorizer uses unigrams, which means it just counts the individual words in each document. The idea of n-grams is to also count sequences of n consecutive words. For example, if we use 2-grams, then "apple watch" will be treated as a single feature (as will things like "I looked", "my watch", "watch and", ...).
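
The "apple watch" example can be reproduced with a short pure-Python sketch. The helper `word_ngrams` is a hypothetical name, and it splits on whitespace as a simplifying assumption; CountVectorizer's default tokenizer behaves differently (for instance, it ignores single-character tokens), so the exact features will not match.

```python
def word_ngrams(text, min_n, max_n):
    """Return all word n-grams of `text` for n in [min_n, max_n]."""
    words = text.lower().split()
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(words) - n + 1):
            grams.append(" ".join(words[start:start + n]))
    return grams

doc_a = "apple watch"
doc_b = "I looked at my watch and had a bit of the apple"

# With unigrams only, both documents contain "apple" and "watch".
print(word_ngrams(doc_a, 1, 1))

# With unigrams and bigrams, only doc_a contains the feature "apple watch".
print("apple watch" in word_ngrams(doc_a, 1, 2))
print("apple watch" in word_ngrams(doc_b, 1, 2))
```
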


In scikit-learn, setting the tuple ngram_range = (min_n, max_n) to (1,2) counts both single words and sequences of two words. Further details are available at <a href="https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer" > this link </a>.

Use the following setting:
*   vectorizer = CountVectorizer(ngram_range=n_gram)
*   tf_idf = TfidfTransformer()

**A. (20 points)** Compare the test-set accuracy of an SVM with RBF kernel (gamma=0.7) for n_gram = (1,1), (1,2), and (2,2). Which one works best? Justify the result. Also report the number of features (dimension of the input) for each of the three cases.


In [None]:
# your code comes here

**B. Word or character analyzer? (20 points)** Now that we are using n-grams, we can use sequences of n characters rather than n words. In this section we investigate the feature space and classification performance when setting *analyzer='char'*. Repeat the previous part with *CountVectorizer(ngram_range=n_gram,analyzer='char')* where n_gram in [(1,2),(1,3),(1,4)]. Which of the three works best? Report test accuracies and justify the results. Also report the number of features.
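
To see what character n-grams look like before running the experiment, here is a minimal pure-Python sketch. The helper `char_ngrams` is a hypothetical name and only approximates the idea; it is not scikit-learn's implementation, and details such as whitespace handling are assumptions.

```python
def char_ngrams(text, min_n, max_n):
    """Return all character n-grams of `text` for n in [min_n, max_n]."""
    text = text.lower()
    return [
        text[start:start + n]
        for n in range(min_n, max_n + 1)
        for start in range(len(text) - n + 1)
    ]

grams = char_ngrams("apple", 1, 2)
print(grams)            # single characters followed by pairs of characters
print(len(set(grams)))  # number of distinct features for this one word
```
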

In [None]:
# your code comes here