# Project 2: Topic Classification

In this project, you'll work with text data from newsgroup posts on a variety of topics. You'll train classifiers to distinguish posts by topics inferred from the text. Whereas in digit classification each input is relatively dense (represented as a 28x28 matrix of pixels, many of which are non-zero), here each document is relatively sparse (represented as a bag-of-words): only a few words of the total vocabulary are active in any given document. The assumption is that a label depends only on the count of words, not their order.
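
To make the bag-of-words assumption concrete, here is a minimal sketch that counts words with Python's `collections.Counter`. It is only a stand-in for the vectorizer used later; the `bag_of_words` helper and the sample sentences are illustrative and not part of the project. Two sentences that use the same words in a different order get the same representation.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase, split on whitespace, and count each word."""
    return Counter(text.lower().split())

a = bag_of_words("the rocket reached orbit before the shuttle")
b = bag_of_words("the shuttle reached orbit before the rocket")

print(a == b)      # True: word order is discarded
print(a["the"])    # 2
```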

The `sklearn` documentation on feature extraction may be useful:
http://scikit-learn.org/stable/modules/feature_extraction.html

Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on Slack, but <b> please prepare your own write-up with your own code. </b>

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn library for importing the newsgroup data.
from sklearn.datasets import fetch_20newsgroups

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

import nltk




Load the data, stripping out metadata so that only textual features will be used, and restricting documents to 4 specific topics. By default, newsgroups data is split into training and test sets, but here the test set gets further split into development and test sets.  (If you remove the categories argument from the fetch function calls, you'd get documents from all 20 topics.)

In [2]:
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)
newsgroups_test  = fetch_20newsgroups(subset='test',
                                      remove=('headers', 'footers', 'quotes'),
                                      categories=categories)

num_test = int(len(newsgroups_test.target) / 2)
test_data, test_labels   = newsgroups_test.data[num_test:], newsgroups_test.target[num_test:]
dev_data, dev_labels     = newsgroups_test.data[:num_test], newsgroups_test.target[:num_test]
train_data, train_labels = newsgroups_train.data, newsgroups_train.target

print('training label shape:', train_labels.shape)
print('dev label shape:',      dev_labels.shape)
print('test label shape:',     test_labels.shape)
print('labels names:',         newsgroups_train.target_names)

training label shape: (2034,)
dev label shape: (676,)
test label shape: (677,)
labels names: ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc']


### Part 1:

For each of the first 5 training examples, print the text of the message along with the label.

In [3]:
def P1(num_examples=5):
    ### STUDENT START ###
    for label, text in zip(newsgroups_train.target[:num_examples],newsgroups_train.data[:num_examples]):
        # Labels index into target_names (alphabetical order), not the `categories` list above.
        print (f"Label:\n {newsgroups_train.target_names[label]} \n")
        print (f"Text:\n {text} \n")
    ### STUDENT END ###

P1(5)

Label:
 comp.graphics 

Text:
 Hi,

I've noticed that if you only save a model (with all your mapping planes
positioned carefully) to a .3DS file that when you reload it after restarting
3DS, they are given a default position and orientation.  But if you save
to a .PRJ file their positions/orientation are preserved.  Does anyone
know why this information is not stored in the .3DS file?  Nothing is
explicitly said in the manual about saving texture rules in the .PRJ file. 
I'd like to be able to read the texture rule information, does anyone have 
the format for the .PRJ file?

Is the .CEL file format available from somewhere?

Rych 

Label:
 talk.religion.misc 

Text:
 

Seems to be, barring evidence to the contrary, that Koresh was simply
another deranged fanatic who thought it neccessary to take a whole bunch of
folks with him, children and all, to satisfy his delusional mania. Jim
Jones, circa 1993.


Nope - fruitcakes like Koresh have been demonstrating such evil corruption
for centuri


### Part 2:

Transform the training data into a matrix of **word** unigram feature vectors.  What is the size of the vocabulary? What is the average number of non-zero features per example?  What is the fraction of the non-zero entries in the matrix?  What are the 0th and last feature strings (in alphabetical order)?<br/>
_Use `CountVectorizer` and its `.fit_transform` method.  Use `.nnz` and `.shape` attributes, and `.get_feature_names` method._

Now transform the training data into a matrix of **word** unigram feature vectors using your own vocabulary with these 4 words: ["atheism", "graphics", "space", "religion"].  Confirm the size of the vocabulary. What is the average number of non-zero features per example?<br/>
_Use `CountVectorizer(vocabulary=...)` and its `.transform` method._

Now transform the training data into a matrix of **character** bigram and trigram feature vectors.  What is the size of the vocabulary?<br/>
_Use `CountVectorizer(analyzer=..., ngram_range=...)` and its `.fit_transform` method._

Now transform the training data into a matrix of **word** unigram feature vectors and prune words that appear in fewer than 10 documents.  What is the size of the vocabulary?<br/>
_Use `CountVectorizer(min_df=...)` and its `.fit_transform` method._

Now again transform the training data into a matrix of **word** unigram feature vectors. What is the fraction of words in the development vocabulary that is missing from the training vocabulary?<br/>
_Hint: Build vocabularies for both train and dev and look at the size of the difference._
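
As a rough illustration of the hint, the sketch below uses two small hand-written sets in place of the real train and dev vocabularies (the words are made up). The set difference gives the words seen only in dev, and dividing by the dev vocabulary size gives the missing fraction.

```python
train_vocab = {"space", "orbit", "graphics", "image", "god"}
dev_vocab = {"space", "orbit", "render", "faith"}

missing = dev_vocab - train_vocab            # words seen only in dev
fraction_missing = len(missing) / len(dev_vocab)

print(sorted(missing))       # ['faith', 'render']
print(fraction_missing)      # 0.5
```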

Notes:
* `.fit_transform` makes 2 passes through the data: first it computes the vocabulary ("fit"), second it converts the raw text into feature vectors using the vocabulary ("transform").
* `.fit_transform` and `.transform` return sparse matrix objects.  Read more about them at http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.csr_matrix.html.
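
The two passes described in the notes can be sketched in plain Python. This is a simplified stand-in, not how `CountVectorizer` is implemented: the three toy documents and the dict-per-row storage are assumptions made for illustration. It also shows how the non-zero count relates to the averages asked for above.

```python
docs = ["space shuttle launch", "graphics card", "space graphics space"]

# Pass 1 ("fit"): collect the sorted vocabulary and give each word a column index.
all_words = sorted({w for d in docs for w in d.split()})
vocabulary = {word: col for col, word in enumerate(all_words)}

# Pass 2 ("transform"): store only non-zero counts, as {column: count} per document.
rows = []
for d in docs:
    row = {}
    for w in d.split():
        row[vocabulary[w]] = row.get(vocabulary[w], 0) + 1
    rows.append(row)

nnz = sum(len(row) for row in rows)                      # stored non-zero entries
avg_nonzero = nnz / len(docs)                            # per-document average
fraction_nonzero = nnz / (len(docs) * len(vocabulary))   # share of the full matrix

print(vocabulary)
print(nnz, avg_nonzero, fraction_nonzero)
```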

In [4]:
def P2():
    ### STUDENT START ###
    #Transform the training data into a matrix of word unigram feature vectors.
    vectorizer = CountVectorizer()
    vectorized = vectorizer.fit_transform(train_data)
    size = vectorized.shape[1]
    print(f'Size of the vocabulary: {size}')
    avg = float(vectorized.nnz)/vectorized.shape[0]
    print(f'The average number of non-zero features per example: {avg}')
    fraction = float(vectorized.nnz)/(vectorized.shape[0]*vectorized.shape[1])
    print(f'The fraction of the non-zero entries in the matrix: {fraction}')
    print(f'The 0th feature strings: {vectorizer.get_feature_names()[0]}')
    print(f'The last feature strings: {vectorizer.get_feature_names()[-1]}')
    
    #Now transform the training data into a matrix of word unigram feature vectors using your own vocabulary with these 4 words: ["atheism", "graphics", "space", "religion"]
    vectorizer2 = CountVectorizer(vocabulary= ["atheism", "graphics", "space", "religion"])      
    vectorized2 = vectorizer2.fit_transform(train_data)    
    avg2 = float(vectorized2.nnz)/vectorized2.shape[0]
    print(f'The average number of non-zero features per example: {avg2}')
    
    #Now transform the training data into a matrix of character bigram and trigram feature vectors.
    vectorizer3 = CountVectorizer(analyzer="char",ngram_range=(2,3))    
    vectorized3 = vectorizer3.fit_transform(train_data)
    size3 = vectorized3.shape[1]
    print(f'Size of the vocabulary: {size3}')
    #Now transform the training data into a matrix of word unigram feature vectors and prune words that appear in fewer than 10 documents.
    vectorizer4 = CountVectorizer(min_df=10)    
    vectorized4 = vectorizer4.fit_transform(train_data)
    size4 = vectorized4.shape[1]
    print(f'Size of the vocabulary: {size4}')
    
    #Now again transform the training data into a matrix of word unigram feature vectors.
    vectorizer_dev =  CountVectorizer()
    vectorized_dev = vectorizer_dev.fit_transform(dev_data)
    vocabulary_size_dev = vectorized_dev.shape[1]
    strings_in_dev_not_in_train = set(vectorizer_dev.get_feature_names()).difference(vectorizer.get_feature_names())
    fraction_missing_dev = float(len(strings_in_dev_not_in_train))/vocabulary_size_dev
    fraction_missing_train = float(len(strings_in_dev_not_in_train))/size
    print(f'The fraction of words in the development vocabulary that is missing from the training vocabulary: {fraction_missing_dev} of dev size or {fraction_missing_train} of training size')
    ### STUDENT END ###

P2()

Size of the vocabulary: 26879
The average number of non-zero features per example: 96.70599803343165
The fraction of the non-zero entries in the matrix: 0.0035978272269590263
The 0th feature strings: 00
The last feature strings: zyxel
The average number of non-zero features per example: 0.26843657817109146
Size of the vocabulary: 35478
Size of the vocabulary: 3064
The fraction of words in the development vocabulary that is missing from the training vocabulary: 0.24787640034470024 of dev size or 0.14981956173964806 of training size


### Part 3:

Transform the training and development data to matrices of word unigram feature vectors.

1. Produce several k-Nearest Neighbors models by varying k, including one with k set to optimize f1 score.  For each model, show the k value and f1 score.
1. Produce several Naive Bayes models by varying smoothing (alpha), including one with alpha set approximately to optimize f1 score.  For each model, show the alpha value and f1 score.
1. Produce several Logistic Regression models by varying L2 regularization strength (C), including one with C set approximately to optimize f1 score.  For each model, show the C value, f1 score, and sum of squared weights for each topic.

* Why doesn't k-Nearest Neighbors work well for this problem?
* Why doesn't Logistic Regression work as well as Naive Bayes does?
* What is the relationship between logistic regression's sum of squared weights vs. C value?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer` and its `.fit_transform` and `.transform` methods to transform data.
* You can use `KNeighborsClassifier(...)` to produce a k-Nearest Neighbors model.
* You can use `MultinomialNB(...)` to produce a Naive Bayes model.
* You can use `LogisticRegression(C=..., solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` attribute to get weights for each topic.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
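
For reference, the weighted f1 score can be worked out by hand: compute f1 for each class, then average with weights proportional to how many true examples each class has. The helper below, `weighted_f1`, is a hypothetical re-implementation for illustration (the labels are made up); in the project itself `metrics.f1_score` does this for you.

```python
def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's count in y_true."""
    pairs = list(zip(y_true, y_pred))
    total = 0.0
    for label in set(y_true):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn                 # number of true examples of this class
        total += f1 * support
    return total / len(y_true)

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(weighted_f1(y_true, y_pred))
```

Here classes 0 and 1 each have f1 of 0.8 and class 2 has f1 of 1.0, so the weighted average is (3·0.8 + 2·0.8 + 1·1.0) / 6.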

In [5]:
from sklearn.metrics import f1_score
    
def P3():
    #Produce several k-Nearest Neighbors models by varying k, including one with k set to optimize f1 score. For each model, show the k value and f1 score.
    best_fscore = 0
    for param in [1,2,3,4,5]:
        KNN = Pipeline([('vect', CountVectorizer()),
                     ('clf', KNeighborsClassifier(n_neighbors=param)),
                     ])
        KNN = KNN.fit(train_data, train_labels)
        KNN_predicted = KNN.predict(dev_data)
        fscore = f1_score(dev_labels,KNN_predicted,average='weighted')
        if fscore > best_fscore:
            best_fscore = fscore
            best_k = param
        print (f" K is {param}, F1-score of KNN: {fscore}" )
    print(f' K {best_k} has optimal f1 score of {best_fscore}')
    
    #Produce several Naive Bayes models by varying smoothing (alpha), including one with alpha set approximately to optimize f1 score. For each model, show the alpha value and f1 score.
    
    best_fscore = 0
    for param in np.linspace(1e-9,5,30):
        MNB = Pipeline([('vect', CountVectorizer()),
                         ('clf', MultinomialNB(alpha=param)),
                         ])
        MNB = MNB.fit(train_data, train_labels)
        MNB_predicted = MNB.predict(dev_data)
        fscore = f1_score(dev_labels,MNB_predicted,average='weighted')
        if fscore > best_fscore:
            best_fscore = fscore
            best_alpha = param
        print (f" Alpha is {param}, F1-score of Naive Bayes: {fscore}" )
    print(f'Alpha {best_alpha} has optimal f1 score of {best_fscore}')
    
    #Produce several Logistic Regression models by varying L2 regularization strength (C), including one with C set approximately to optimize f1 score. For each model, show the C value, f1 score, and sum of squared weights for each topic.
    
    best_fscore = 0
    for param in np.linspace(1e-5,2,10):
        lg = Pipeline([('vect', CountVectorizer()),
                         ('clf', LogisticRegression(C=param)),
                         ])
        lg = lg.fit(train_data, train_labels)
        lg_predicted = lg.predict(dev_data)
        fscore = f1_score(dev_labels,lg_predicted,average='weighted')
        if fscore > best_fscore:
            best_fscore = fscore
            best_c = param
        sum_square_weights = [round(sum(lg.named_steps['clf'].coef_[category_num]**2),2)
                                                                          for category_num in range(4)]
        print (f" C value is {round(param, 2)}, F1-score is {fscore}, and sum of squared weights is {sum_square_weights}" )
    print(f'C value {best_c} has optimal f1 score of {best_fscore}')
    
P3()

 K is 1, F1-score of KNN: 0.3805030018531525
 K is 2, F1-score of KNN: 0.38054212404441684
 K is 3, F1-score of KNN: 0.4084150225437623
 K is 4, F1-score of KNN: 0.4031227993847515
 K is 5, F1-score of KNN: 0.4287607236218357
 K 5 has optimal f1 score of 0.4287607236218357
 Alpha is 1e-09, F1-score of Naive Bayes: 0.7490691140139345
 Alpha is 0.1724137940689655, F1-score of Naive Bayes: 0.7860617245406876
 Alpha is 0.34482758713793105, F1-score of Naive Bayes: 0.787644978407044
 Alpha is 0.5172413802068965, F1-score of Naive Bayes: 0.7862862961995258
 Alpha is 0.689655173275862, F1-score of Naive Bayes: 0.7847459594060138
 Alpha is 0.8620689663448275, F1-score of Naive Bayes: 0.7810859843567709
 Alpha is 1.034482759413793, F1-score of Naive Bayes: 0.7777320236017224
 Alpha is 1.2068965524827586, F1-score of Naive Bayes: 0.7712103067772015
 Alpha is 1.3793103455517242, F1-score of Naive Bayes: 0.7714404484397059
 Alpha is 1.5517241386206897, F1-score of Naive Bayes: 0.771348752056862
 A



 C value is 0.0, F1-score is 0.36519316740206303, and sum of squared weights is [0.0, 0.0, 0.0, 0.0]
 C value is 0.22, F1-score is 0.7074106551088876, and sum of squared weights is [54.32, 46.31, 53.68, 46.66]
 C value is 0.44, F1-score is 0.7052789695911175, and sum of squared weights is [93.97, 76.63, 90.98, 81.4]
 C value is 0.67, F1-score is 0.6955933920829318, and sum of squared weights is [126.4, 100.83, 120.9, 109.94]
 C value is 0.89, F1-score is 0.694281290097654, and sum of squared weights is [154.25, 121.55, 146.38, 134.56]
 C value is 1.11, F1-score is 0.6948409274379154, and sum of squared weights is [178.98, 139.71, 168.82, 156.38]
 C value is 1.33, F1-score is 0.6904779034670234, and sum of squared weights is [201.18, 156.13, 189.03, 176.14]
 C value is 1.56, F1-score is 0.6905320168042494, and sum of squared weights is [221.56, 171.15, 207.31, 194.28]
 C value is 1.78, F1-score is 0.6923682134497545, and sum of squared weights is [240.27, 184.96, 224.15, 211.01]
 C valu

Why doesn't k-Nearest Neighbors work well for this problem? The feature vectors are high-dimensional and very sparse, so any two documents share few non-zero features. Distances between documents are therefore not very informative, and kNN has a hard time telling training examples apart.

Why doesn't Logistic Regression work as well as Naive Bayes does? The number of features is large compared to the number of training examples, so Logistic Regression has many weights to estimate from little data and is more prone to overfitting.

What is the relationship between logistic regression's sum of squared weights vs. C value? The sum of squared weights increases with C. C is the inverse of the regularization strength: when C is large the penalty is small and the coefficients are allowed to move further away from 0; when C is small the penalty is large and the coefficients shrink toward 0.
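
The relationship can be reproduced on a toy problem. The sketch below is not the sklearn solver: it fits a single weight by plain gradient descent on the objective `0.5*w^2 + C * (sum of logistic losses)`, with made-up 1-D data and a hypothetical helper `fit_weight`. The squared weight grows as C grows, mirroring the table above.

```python
import math

# Toy 1-D data with labels in {-1, +1}; the last point contradicts the rest,
# so the unregularized optimum is finite.
xs = [1.0, 2.0, -1.0, -2.0, 1.0]
ys = [1, 1, -1, -1, -1]

def fit_weight(C, steps=5000):
    """Minimize 0.5*w^2 + C * sum(log(1 + exp(-y*w*x))) by gradient descent."""
    w = 0.0
    # Step size 1/L, where L is an upper bound on the objective's curvature.
    step = 1.0 / (1.0 + C * sum(x * x for x in xs) / 4.0)
    for _ in range(steps):
        grad = w - C * sum(y * x / (1.0 + math.exp(y * w * x)) for x, y in zip(xs, ys))
        w -= step * grad
    return w

C_values = [0.01, 0.1, 1.0, 10.0, 100.0]
fitted = [fit_weight(C) for C in C_values]
for C, w in zip(C_values, fitted):
    print(f"C={C:<6} squared weight={w * w:.4f}")
```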

### Part 4:

Transform the data to a matrix of word **bigram** feature vectors.  Produce a Logistic Regression model.  For each topic, find the 5 features with the largest weights (that's 20 features in total).  Show a 20 row (features) x 4 column (topics) table of the weights.

Do you see any surprising features in this table?

Notes:
* Train on the transformed training data.
* You can use `CountVectorizer` and its `.fit_transform` method to transform data.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a Logistic Regression model.
* You can use `LogisticRegression`'s `.coef_` attribute to get weights for each topic.
* You can use `np.argsort` to get indices sorted by element value. 
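
A small sketch of the `np.argsort` hint, using a made-up weight vector. Note the difference between the largest weights and the largest weights by magnitude: sorting on absolute values also selects strongly negative weights.

```python
import numpy as np

weights = np.array([0.2, -0.9, 0.5, 0.1, 0.7, -0.3])

# argsort returns indices in ascending order of value, so the last k
# entries (reversed) are the indices of the k largest weights.
k = 3
top_k = np.argsort(weights)[-k:][::-1]
print(top_k)              # [4 2 0]

# Sorting by absolute value instead also picks up large negative weights.
top_k_abs = np.argsort(np.abs(weights))[-k:][::-1]
print(top_k_abs)          # [1 4 2]
```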

In [6]:
def P4():
    ### STUDENT START ###
    #Transform the data to a matrix of word bigram feature vectors. Produce a Logistic Regression model. 
    
    vectorizer = CountVectorizer(ngram_range=(2,2)) 
    vectorizedtrain = vectorizer.fit_transform(train_data)
    lr = Pipeline([('vect', CountVectorizer(ngram_range=(2,2))),
                     ('clf', LogisticRegression(C=0.2))
                      ])
    lr.fit(train_data, train_labels)
    weights = lr.named_steps['clf'].coef_
    lr_predict = lr.predict(dev_data)
    # Indices of the 5 largest-magnitude weights for each of the 4 topics.
    idx_list = map(lambda category: np.argpartition(abs(weights[category]), -5)[-5:], range(4))

    idx = [i for j in idx_list for i in j]
    # Fetch the feature names once instead of once per index.
    all_feature_names = vectorizer.get_feature_names()
    feature_names = [all_feature_names[ind] for ind in idx]
    largest_weights = [np.round(weights[category][idx],5) for category in range(4)]

    print_template = '|| {0:30} ||{1:10}|{2:10}|{3:10}|{4:10}|' 
    print ('\nBigrams\n')
    for row in range(20):
        print (print_template.format(feature_names[row], largest_weights[0][row], largest_weights[1][row],
                              largest_weights[2][row], largest_weights[3][row]) )
    ### STUDENT END ###

P4()


Bigrams

|| in this                        ||   0.36223|  -0.00991|  -0.37137|  -0.08571|
|| for the                        ||  -0.36596|   0.20903|  -0.06999|  -0.10565|
|| claim that                     ||   0.40469|  -0.13775|  -0.19039|  -0.08422|
|| cheers kent                    ||   0.41872|  -0.49562|  -0.47815|   0.42447|
|| looking for                    ||  -0.48307|    0.8482|  -0.37346|  -0.42481|
|| comp graphics                  ||  -0.20343|   0.53674|  -0.26674|  -0.18136|
|| is there                       ||  -0.24088|   0.55336|  -0.35165|  -0.17129|
|| looking for                    ||  -0.48307|    0.8482|  -0.37346|  -0.42481|
|| out there                      ||  -0.21577|    0.5851|  -0.35794|  -0.20996|
|| in advance                     ||  -0.35536|   0.65336|  -0.33018|  -0.31511|
|| and such                       ||  -0.15692|  -0.25872|   0.46337|  -0.16983|
|| sci space                      ||  -0.19029|  -0.25517|   0.47133|  -0.16153|
|| cheers kent    

In [7]:
vectorizer = CountVectorizer(ngram_range=(2,2)) 
vectorized_train = vectorizer.fit_transform(train_data)
print ('\nCategories for "cheers kent":')
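# Look up the bigram's column instead of hard-coding it, and map labels through
# target_names (alphabetical order) rather than sorting `categories` in place.
bigram_index = vectorizer.vocabulary_['cheers kent']
[newsgroups_train.target_names[cat_index] for cat_index in train_labels[vectorized_train[:,bigram_index].nonzero()[0]]]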


Categories for "cheers kent":


['alt.atheism',
 'alt.atheism',
 'talk.religion.misc',
 'talk.religion.misc',
 'talk.religion.misc',
 'alt.atheism',
 'talk.religion.misc',
 'talk.religion.misc',
 'alt.atheism',
 'talk.religion.misc',
 'talk.religion.misc',
 'talk.religion.misc',
 'alt.atheism',
 'alt.atheism',
 'talk.religion.misc',
 'alt.atheism',
 'talk.religion.misc',
 'alt.atheism',
 'talk.religion.misc',
 'alt.atheism',
 'alt.atheism',
 'alt.atheism',
 'alt.atheism',
 'alt.atheism',
 'talk.religion.misc',
 'alt.atheism',
 'talk.religion.misc',
 'talk.religion.misc',
 'alt.atheism',
 'alt.atheism',
 'talk.religion.misc',
 'talk.religion.misc',
 'alt.atheism',
 'talk.religion.misc']

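ANSWER: "cheers kent" is surprising because it is not a topical phrase. A closer look shows that it appears only in `alt.atheism` and `talk.religion.misc` posts, which is why it gets a large positive weight for both of those topics.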

### Part 5:

To improve generalization, it is common to try preprocessing text in various ways before splitting into words. For example, you could try transforming strings to lower case, replacing sequences of numbers with single tokens, removing various non-letter characters, and shortening long words.

Produce a Logistic Regression model (with no preprocessing of text).  Evaluate and show its f1 score and size of the vocabulary.

Produce an improved Logistic Regression model by preprocessing the text.  Evaluate and show its f1 score and size of the vocabulary.  Try for an improvement in f1 score of at least 0.02.

How much did the improved model reduce the vocabulary size?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `CountVectorizer(preprocessor=...)` to preprocess strings with your own custom-defined function.
* `CountVectorizer` default is to preprocess strings to lower case.
* You can use `LogisticRegression(C=0.5, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `metrics.f1_score(..., average="weighted")` to compute f1 score.
* If you're not already familiar with regular expressions for manipulating strings, see https://docs.python.org/2/library/re.html, and re.sub() in particular.
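
As a quick illustration of `re.sub`, the sketch below applies a few of the transformations mentioned above to a made-up sentence. The `NUM` placeholder token and the sample text are assumptions chosen for the example, not a recommended preprocessor.

```python
import re

text = "Launched in 1993, the probe cost $250 million!!! See file_v2.txt"

# Replace every run of digits with a single placeholder token.
no_numbers = re.sub(r"\d+", " NUM ", text)

# Replace anything that is not a letter or whitespace with a space,
# then collapse repeated whitespace and lowercase the result.
letters_only = re.sub(r"[^A-Za-z\s]", " ", no_numbers)
cleaned = re.sub(r"\s+", " ", letters_only).strip().lower()

print(cleaned)
# launched in num the probe cost num million see file v num txt
```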

In [8]:
def better_preprocessor(s):
    """
    A better preprocessor: replaces digit sequences, underscores, and characters
    repeated four or more times with spaces, and removes words with fewer than
    4 characters.
    """

    preprocessed_s = re.sub(r"\d+", " ", s)

    preprocessed_s = re.sub(r"_+"," ",preprocessed_s)
    preprocessed_s = re.sub(r"(.)\1{3,}",' ',preprocessed_s)
    preprocessed_s = re.sub(r"\b[\w']{1,3}\b", " ",preprocessed_s)
    preprocessed_s = re.sub(r"\s{3,}", " ",preprocessed_s)
    return preprocessed_s

def no_preprocessor(s):
    """
    Do no preprocessing.
    """
    return s

def calc_F1_preprocess(func):
   
    vectorizer = CountVectorizer(preprocessor=func)
    train = vectorizer.fit_transform(train_data)
    dev = vectorizer.transform(dev_data)
    lr = LogisticRegression()
    lr.fit(train, train_labels)
    lr_predict = lr.predict(dev)
    f1 = f1_score(dev_labels,lr_predict,average='weighted')
    print (f"Vocabulary size: {train.shape[1]}")
    print (f"F1-score: {f1}")
    return [train.shape[1],f1]
    
def P5():
    print ('No preprocessor:\n')
    res_no_prepro = calc_F1_preprocess(no_preprocessor)
    print ('\nBetter preprocessor:\n')
    res_with_prepro = calc_F1_preprocess(better_preprocessor)
    print (f'\nVocabulary size is decreased by {res_no_prepro[0]-res_with_prepro[0]}.')
    print (f'\nF1-score has increased by {round(res_with_prepro[1]-res_no_prepro[1],2)}.')

P5()

No preprocessor:

Vocabulary size: 33291
F1-score: 0.7023340087555402

Better preprocessor:

Vocabulary size: 27704
F1-score: 0.7216598231816684

Vocabulary size is decreased by 5587.

F1-score has increased by 0.02.


### Part 6:

The idea of regularization is to avoid learning very large weights (which are likely to fit the training data, but not generalize well) by adding a penalty to the total size of the learned weights. Logistic regression seeks the set of weights that minimizes errors in the training data AND has a small total size. The default L2 regularization computes this size as the sum of the squared weights (as in Part 3 above). L1 regularization computes this size as the sum of the absolute values of the weights. Whereas L2 regularization makes all the weights relatively small, L1 regularization drives many of the weights to 0, effectively removing unimportant features.
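
The difference between the two penalties can be seen on a small made-up weight vector. The sketch below computes both penalty sizes and then applies the shrinkage step associated with each penalty (soft-thresholding for L1, proportional scaling for L2). These helper functions are illustrative only and are not how sklearn's solver is written, but they show why L1 produces exact zeros while L2 only makes weights smaller.

```python
weights = [3.0, -0.4, 0.05, -2.0, 0.2]

l1_penalty = sum(abs(w) for w in weights)   # sum of absolute values
l2_penalty = sum(w * w for w in weights)    # sum of squares

def l1_shrink(w, t):
    """Soft-thresholding: move w toward 0 by t, stopping at exactly 0."""
    if abs(w) <= t:
        return 0.0
    return w - t if w > 0 else w + t

def l2_shrink(w, t):
    """Proportional shrinkage: scale w down, never reaching exactly 0."""
    return w / (1.0 + t)

t = 0.5
print([l1_shrink(w, t) for w in weights])   # [2.5, 0.0, 0.0, -1.5, 0.0]
print([round(l2_shrink(w, t), 3) for w in weights])
```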

For several L1 regularization strengths ...<br/>
* Produce a Logistic Regression model using the **L1** regularization strength.  Reduce the vocabulary to only those features that have at least one non-zero weight among the four categories.  Produce a new Logistic Regression model using the reduced vocabulary and **L2** regularization strength of 0.5.  Evaluate and show the L1 regularization strength, vocabulary size, and f1 score associated with the new model.

Show a plot of f1 score vs. log vocabulary size.  Each point corresponds to a specific L1 regularization strength used to reduce the vocabulary.

How does performance of the models based on reduced vocabularies compare to that of a model based on the full vocabulary?

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `LogisticRegression(..., penalty="l1")` to produce a logistic regression model using L1 regularization.
* You can use `LogisticRegression(..., penalty="l2")` to produce a logistic regression model using L2 regularization.
* You can use `LogisticRegression(..., tol=0.015)` to produce a logistic regression model using relaxed gradient descent convergence criteria.  The gradient descent code that trains the logistic regression model sometimes has trouble converging with extreme settings of the C parameter. Relax the convergence criteria by setting tol=.015 (the default is .0001).

In [None]:
def get_vocabulary(vectorizer, train_features, train_labels, C_val, tol_val):
    """Return the words with a non-zero L1 weight for at least one category."""
    # The liblinear solver supports the L1 penalty.
    model = LogisticRegression(penalty='l1', C=C_val, tol=tol_val, solver='liblinear')
    model.fit(train_features, train_labels)
    # coef_ has shape (n_categories, n_features); nonzero()[1] gives the column
    # (feature) index of every non-zero weight.
    nonzero_idx = set(model.coef_.nonzero()[1].tolist())
    return sorted(word for word, index in vectorizer.vocabulary_.items()
                  if index in nonzero_idx)

def get_f1(train_data, train_labels, dev_data, dev_labels, vocabulary, tol_val):
    """Train an L2 model (C=0.5) on the reduced vocabulary and return its dev f1 score."""
    vectorizer = CountVectorizer(vocabulary=vocabulary)
    train_features = vectorizer.transform(train_data)
    dev_features = vectorizer.transform(dev_data)
    model = LogisticRegression(penalty='l2', C=0.5, tol=tol_val, solver='liblinear')
    model.fit(train_features, train_labels)
    return metrics.f1_score(dev_labels, model.predict(dev_features), average='weighted')

def P6():
    
    C_vals = [1e-3,0.1,0.5,1,2,5,10,50,100,1000]
    tol_val = 0.015   # relaxed convergence criterion suggested in the notes
    vocab_sizes = []
    f1_scores = []

    # Fit the full-vocabulary vectorizer once; every L1 model reuses these features.
    vectorizer = CountVectorizer()
    train_features = vectorizer.fit_transform(train_data)

    for C_val in C_vals:
        new_vocab = get_vocabulary(vectorizer, train_features, train_labels, C_val, tol_val)
        if not new_vocab:
            # A very strong L1 penalty can zero out every weight; nothing to train on.
            print(f'L1 C value is {C_val}: no features kept, skipping')
            continue
        f1 = get_f1(train_data, train_labels, dev_data, dev_labels, new_vocab, tol_val)
        vocab_sizes.append(len(new_vocab))
        f1_scores.append(f1)
        print(f'L1 C value is {C_val}, vocabulary size is {len(new_vocab)}, F1-score is {f1}')
        
    # -------------------------- plot --------------------------------- #
    plt.plot(np.log(vocab_sizes), f1_scores, marker='o')
    plt.ylabel('F1 score') 
    plt.xlabel('Vocabulary size (log)')
    plt.title('F1 score against log vocabulary size')  
    plt.show() 

P6()

ANSWER:

### Part 7:

How is `TfidfVectorizer` different than `CountVectorizer`?

Produce a Logistic Regression model based on data represented in tf-idf form, with L2 regularization strength of 100.  Evaluate and show the f1 score.

Show the 3 documents with highest R ratio, where ...<br/>
$R\ \text{ratio} = \text{maximum predicted probability} \div \text{predicted probability of correct label}$

Explain what the R ratio describes.  What kinds of mistakes is the model making? Suggest a way to address one particular issue that you see.
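
A minimal worked example of the R ratio on hand-written probability vectors (the numbers and the `r_ratio` helper are made up for illustration). When the correct label also has the highest predicted probability the ratio is exactly 1; it grows as the model puts more probability on a wrong label.

```python
def r_ratio(probabilities, correct_label):
    """Maximum predicted probability divided by the probability of the correct label."""
    return max(probabilities) / probabilities[correct_label]

# Correct label (index 2) also has the highest probability: R is exactly 1.
print(r_ratio([0.1, 0.2, 0.6, 0.1], correct_label=2))    # 1.0

# The model is confident in class 1 but the correct label is 3: R is large.
print(r_ratio([0.01, 0.97, 0.01, 0.01], correct_label=3))
```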

Notes:
* Train on the transformed training data.
* Evaluate on the transformed development data.
* You can use `TfidfVectorizer` and its `.fit_transform` method to transform data to tf-idf form.
* You can use `LogisticRegression(C=100, solver="liblinear", multi_class="auto")` to produce a logistic regression model.
* You can use `LogisticRegression`'s `.predict_proba` method to access predicted probabilities.

In [11]:
def calc_R(probability,correct_label):
    """R ratio: maximum predicted probability / predicted probability of the correct label."""
    return float(np.max(probability))/probability[correct_label]

def P7():
    vectorizer = TfidfVectorizer()
    train_features = vectorizer.fit_transform(train_data)
    dev_features = vectorizer.transform(dev_data)
    lr = LogisticRegression(C=100)
    lr.fit(train_features,train_labels)
    lr_predict_prob = lr.predict_proba(dev_features)
    pred_labels = lr.predict(dev_features)
    r = list(map(calc_R,lr_predict_prob,dev_labels))
    # Indices of the 3 largest R ratios, largest first.
    index = np.argsort(r)[-3:][::-1]
    # Labels index into target_names (alphabetical order).
    target_names = newsgroups_train.target_names
    for rank, doc_index in enumerate(index, start=1):
        print (f'Error rank is {rank}\n')
        print (f'R-ratio is {r[doc_index]}\n')
        print (f'Observed category is {target_names[dev_labels[doc_index]]}\n')
        print (f'Predicted category is {target_names[pred_labels[doc_index]]}\n')
P7()

Error rank is 1

R-ratio is 929.3580170576168

Observed category is talk.religion.misc

Predicted category is comp.graphics

Error rank is 2

R-ratio is 325.00408025526406

Observed category is talk.religion.misc

Predicted category is comp.graphics

Error rank is 3

R-ratio is 287.306306820337

Observed category is alt.atheism

Predicted category is talk.religion.misc



ANSWER: `CountVectorizer()` returns raw integer counts of each term in a document, while `TfidfVectorizer()` returns float scores. tf-idf is not just a simple count of the terms in a document: it also takes into account how often each term occurs across the corpus, so terms that appear in many documents are down-weighted.
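
The reweighting can be worked through by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer` (smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the three tiny documents are made up. A word that appears in every document ("the") gets the minimum idf of 1, while rarer words get larger weights.

```python
import math

docs = [["space", "launch", "the"],
        ["graphics", "the", "the"],
        ["space", "the"]]

n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})
doc_freq = {w: sum(1 for d in docs if w in d) for w in vocab}

# Smoothed inverse document frequency: rarer words get larger weights.
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf(doc):
    """Count * idf for each vocabulary word, scaled to unit Euclidean length."""
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for w in vocab:
    print(f"{w:10} df={doc_freq[w]} idf={idf[w]:.3f}")
print([round(v, 3) for v in tfidf(docs[0])])
```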

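The R ratio compares the maximum predicted probability with the predicted probability of the correct label. It equals 1 when the model's top prediction is the correct label, and it grows as the model becomes more confident in a wrong label. The documents with the highest R ratio are therefore the model's most confident mistakes.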

The mistake the model is making is that it incorrectly classifies some documents as comp.graphics when they are supposed to belong to another category.

A possible solution: if some technical features are good predictors of comp.graphics but are also used often in other categories, they could be removed from the model to increase accuracy.

### Part 8 EXTRA CREDIT:

Produce a Logistic Regression model to implement your suggestion from Part 7.