# Retrain Classifier

Run this notebook to retrain the classifier on new data.  
Put the new data into the Data folder and name it 'Labelled Data.xlsx'.  
Then run each of the following cells in order to create the new classifier.  
Once it has been created, copy the output files over for the pipeline:  
* OurClassifier.p --> Pipeline
* content-wnLemm-FeatureSet.csv --> Pipeline/Data
* title-wnLemm-FeatureSet.csv --> Pipeline/Data
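
The copy step can also be scripted. The following is a minimal sketch, assuming the three files are written to a single source directory; `copy_artifacts`, `ARTIFACTS`, and both directory arguments are hypothetical names introduced for illustration and are not part of the notebook's modules.

```python
import shutil
from pathlib import Path

# File names and destinations come from the list above. Where the notebook
# actually writes these files is an assumption, so the source directory is
# passed in by the caller.
ARTIFACTS = {
    "OurClassifier.p": "",                    # goes to Pipeline
    "content-wnLemm-FeatureSet.csv": "Data",  # goes to Pipeline/Data
    "title-wnLemm-FeatureSet.csv": "Data",    # goes to Pipeline/Data
}


def copy_artifacts(src_dir, pipeline_dir):
    """Copy each retrained artifact into its place under pipeline_dir."""
    copied = []
    for name, subdir in ARTIFACTS.items():
        src = Path(src_dir) / name
        dest_dir = Path(pipeline_dir) / subdir
        dest_dir.mkdir(parents=True, exist_ok=True)
        # copy2 raises FileNotFoundError if the artifact was never created.
        copied.append(Path(shutil.copy2(src, dest_dir / name)))
    return copied
```

Calling `copy_artifacts(".", "Pipeline")` after the last cell would copy all three files, provided they sit in the working directory.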



### 1. Clean the Data

The first step cleans the articles and preps them for the rest of the classifier training.  
Along with other cleaning functions, this script removes:
* duplicate articles
* invalid 'articles' (e.g. "Your usage has been flagged", Chinese characters, etc.)  

*From articles:*
* html tags (e.g. 'div', etc.)
* stop words (e.g. 'and', 'the', dates, prepositions, etc.)
* standard phrases (e.g. 'click here', 'read more') 
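
To make the cleaning rules above concrete, here is a rough sketch of how they could be applied to a list of raw article strings. It is not the `DataClean` implementation: the word lists, the regular expressions, and the function names `clean_text` and `clean_articles` are all assumptions chosen for illustration.

```python
import re

# Illustrative lists only; the lists used by DataClean are not shown in
# this notebook.
STOP_WORDS = {"and", "the", "of", "in", "on", "to", "a"}
STANDARD_PHRASES = ["click here", "read more"]
INVALID_MARKERS = ["your usage has been flagged"]

TAG_RE = re.compile(r"<[^>]+>")


def clean_text(text):
    """Lowercase the text, then strip HTML tags, standard phrases and stop words."""
    text = TAG_RE.sub(" ", text).lower()
    for phrase in STANDARD_PHRASES:
        text = text.replace(phrase, " ")
    # Keeping only ASCII letters also drops digits and non-Latin characters.
    words = re.findall(r"[a-z]+", text)
    return " ".join(w for w in words if w not in STOP_WORDS)


def clean_articles(articles):
    """Drop invalid and duplicate articles, then return the cleaned remainder."""
    seen, cleaned = set(), []
    for raw in articles:
        if any(marker in raw.lower() for marker in INVALID_MARKERS):
            continue
        text = clean_text(raw)
        # An empty result (e.g. an article with no Latin text) is discarded.
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return cleaned
```

Duplicates are detected after cleaning in this sketch, so two articles that differ only in markup count as the same article.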

In [1]:
import DataClean as dc

articleDB = dc.main()

### 2. Select Features
This script selects the words from the article set that best help the classifier determine whether an article is market moving.  
It normalizes the words and then selects the top 1000 according to the Mutual Information metric.

Because both the article body and title provide a lot of information, this script runs for both of those datasets and outputs two feature sets.
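
As an illustration of ranking words by Mutual Information, the sketch below scores each word by the MI between its presence in a document and a binary label, then keeps the top `k`. It assumes whitespace-tokenised, already-normalised text and 0/1 labels; `mutual_information` and `top_features` are hypothetical helpers, and `FeatureSelection` may compute the metric differently.

```python
import math
from collections import Counter


def mutual_information(present, labels):
    """MI (in nats) between a binary word-presence vector and binary labels."""
    n = len(labels)
    joint = Counter(zip(present, labels))
    px = Counter(present)
    py = Counter(labels)
    mi = 0.0
    # Only observed (x, y) pairs contribute; unseen pairs have p(x, y) = 0.
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi


def top_features(docs, labels, k):
    """Rank vocabulary words by MI with the label and keep the top k."""
    doc_sets = [set(d.split()) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    scores = {
        w: mutual_information([int(w in s) for s in doc_sets], labels)
        for w in vocab
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

The same function would be called twice, once on the titles and once on the article bodies, to produce the two feature sets.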

In [2]:
import FeatureSelection as fs

titleFts, contentFts = fs.main(articleDB)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
starting Binary Encoding


2228it [00:00, 2599.56it/s]


Finished Bin Encoding. Collecting Highest Features
     MI_Values    target_group
833   0.020135       regulator
305   0.019532          winner
800   0.019016      revolution
303   0.018799            hold
96    0.018552        briefing
695   0.018299           music
269   0.018228            save
257   0.018132          rising
25    0.018127            year
965   0.017988            ripe
244   0.017538        industry
917   0.017254           fresh
98    0.016643            work
617   0.016491        explains
530   0.016397            must
465   0.016351  cryptocurrency
310   0.016340            elon
72    0.015912            sell
636   0.015793            perk
473   0.015578         deficit
255   0.015539            game
494   0.015494           avoid
320   0.015160          energy
464   0.015135            lead
245   0.014777          secret
243   0.014552        trillion
295   0.014089         venture
577   0.013745        broadcom
672   0.013729           space
763   0.013634     

2228it [00:16, 136.71it/s]


Finished Bin Encoding. Collecting Highest Features
     MI_Values    target_group
97    0.040577        retailer
41    0.028876           store
154   0.028782          retail
613   0.027383  infrastructure
498   0.026625         shopper
408   0.022393        declined
684   0.022369          disney
436   0.021520          agency
842   0.020074          author
633   0.019995      california
23    0.019614            sale
159   0.019408            used
952   0.019239        original
141   0.019181           house
58    0.018724            high
348   0.018098          always
955   0.017832       secretary
274   0.017770           chain
223   0.017527        question
877   0.017382         reached
3     0.016860          amazon
911   0.016598         raising
336   0.016530         getting
7     0.016520            like
474   0.016477      republican
91    0.016047            city
193   0.015982         growing
298   0.015937          europe
246   0.015800          social
522   0.015755     

### 3. Encode Features

Now that the top features have been selected, each article is encoded into a vector according to the features that it contains.

In other words, this script creates a matrix where each row represents an article (i) and each column is a selected feature (j). A cell receives a 1 if the feature (i.e. word) j appears in article i, otherwise it receives a 0. 
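
The encoding described above can be sketched in a few lines. This assumes each article is a whitespace-separated string of normalised words; `encode_articles` is a hypothetical helper, not the `FeatureEncoding` API.

```python
def encode_articles(articles, features):
    """Return a binary matrix: rows are articles (i), columns are features (j)."""
    matrix = []
    for text in articles:
        words = set(text.split())
        # Cell (i, j) is 1 if feature j appears in article i, otherwise 0.
        matrix.append([1 if feature in words else 0 for feature in features])
    return matrix
```

For example, with features `["retailer", "music", "energy"]`, the article `"retailer store sale"` is encoded as `[1, 0, 0]`.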



In [3]:
import FeatureEncoding as fe

titleEnc, contentEnc = fe.main(titleFts, contentFts, articleDB)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Padmanie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
encoding title features
encoding content features


### 4. Create a new logistic regression classifier

Finally, this script takes in the encoded articles and feature sets to combine them into a single matrix.
This matrix is fed into a logistic regression classifier and tested with various combinations of hyperparameters.

Based on the results of this testing (hyperparameter tuning), the best combination is used to fit a logistic regression model, which is stored as a new classifier, "OurClassifier.p".
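
The hyperparameter search described above might look roughly like the following scikit-learn sketch. The output below reports a best penalty and a best C, so those are the two hyperparameters searched here; the grid values, the `liblinear` solver, the accuracy scoring, and the function name `tune_logistic_regression` are assumptions, and the `LogisticRegression` module in this project may work differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV


def tune_logistic_regression(title_enc, content_enc, labels, cv=5):
    """Stack the two encodings side by side and grid-search penalty and C."""
    # One row per article; title columns first, then content columns.
    X = np.hstack([title_enc, content_enc])
    # Assumed grid; the values actually tested are not shown in the notebook.
    param_grid = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1, 10]}
    search = GridSearchCV(
        LogisticRegression(solver="liblinear"),  # supports both l1 and l2
        param_grid,
        cv=cv,
        scoring="accuracy",
    )
    search.fit(X, labels)
    return search
```

After fitting, `search.best_params_` holds the winning penalty and C, and `search.best_estimator_` is the model refitted on all of the data with those settings.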

This pickle file contains the logistic regression classifier, including its hyperparameters and feature weights, and can be used regularly on the new data that the Pipeline pulls.
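
For reference, storing and reloading a fitted classifier with Python's standard `pickle` module can be sketched as below. The helper names `save_classifier` and `load_classifier` are hypothetical, and how the Pipeline actually loads the file is not shown in this notebook.

```python
import pickle


def save_classifier(clf, path="OurClassifier.p"):
    """Serialise a fitted classifier (or any picklable object) to disk."""
    with open(path, "wb") as fh:
        pickle.dump(clf, fh)


def load_classifier(path="OurClassifier.p"):
    """Load a previously stored classifier."""
    # Only unpickle files you created yourself; pickle can execute code on load.
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

A reloaded scikit-learn model exposes the same `predict` method as the original, so it can be applied directly to newly encoded articles.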

In [6]:
import LogisticRegression as lr

results = lr.main(contentEnc, titleEnc)

(2228, 351)
RangeIndex(start=0, stop=2228, step=1)
Best Penalty: l2 Best C: 1
Best Penalty: l2 Best C: 0.1
Best Penalty: l2 Best C: 0.01
Best Penalty: l1 Best C: 0.1
Best Penalty: l2 Best C: 0.1
0.7169249471928401
0.8534286873605194
