# Topic Modelling - ADS09

For this assignment, you will be working with a collection of articles on ten topics: baseball, cryptography, electronics, hardware, medicine, mideast, motorcycles, politics, religion, and space. The posts are extracted from the 20 Newsgroups dataset.

Your tasks in this assignment will include preprocessing this data, and predicting the topics from this collection of texts using a supervised machine learning algorithm.

## 0. Settings

In [179]:
import sys
import pickle
import numpy as np
import pandas as pd
import sidetable
import matplotlib.pyplot as plt
import seaborn as sns 
import re
import string
import spacy
from spacy_langdetect import LanguageDetector
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction import text
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

%matplotlib inline

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

## 1. Import the data 

In [130]:
train = pd.read_csv("data/X_train.csv")
labels = pd.read_csv("data/y_train.csv")
test = pd.read_csv("data/X_test.csv")

In [131]:
display(train.head(10))
train.shape
#(train.iloc[1])

Unnamed: 0,text
0,csc2imd@cabell.vcu.edu (Ian M. Derby) writes:\...
1,In <30MAR93.02086551.0010@MUSIC.LIB.MATC.EDU P...
2,In article <15780004@hpspdla.spd.HP.COM garyr@...
3,"Hi, baseball fans! So what do you say? Don't y..."
4,"In article <93104.233239ISSBTL@BYUVM.BITNET, <..."
5,"NOTE: Saturday, April 20th's scores should be ..."
6,Does that mean they have to pay his salary? D...
7,In <1993Apr19.194025.8967@adobe.com snichols@a...
8,In article <1qkkodINN5f5@jhunix.hcf.jhu.edu pa...
9,In article <C5p6xq.GuI@me.utoronto.ca steinman...


(6384, 1)

#### Create numeric labels for the categories

In [159]:
le = preprocessing.LabelEncoder()
labels['label_num'] = le.fit_transform(labels['label'])
labels.groupby(['label','label_num']).label.agg('count')

label         label_num
baseball      0            570
cryptography  1            652
electronics   2            596
hardware      3            580
medicine      4            596
mideast       5            606
motorcycles   6            661
politics      7            770
religion      8            736
space         9            617
Name: label, dtype: int64

There are 10 classes with between 570 and 770 examples each, so the dataset is reasonably well balanced.
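
As a rough check on the balance claim, the counts from the output above can be turned into per-class shares and a largest-to-smallest ratio. This is a minimal sketch; the variable names are illustrative and the counts are typed in by hand from that output.

```python
import pandas as pd

# Class counts copied from the groupby output above.
counts = pd.Series({
    "baseball": 570, "cryptography": 652, "electronics": 596,
    "hardware": 580, "medicine": 596, "mideast": 606,
    "motorcycles": 661, "politics": 770, "religion": 736, "space": 617,
})

shares = counts / counts.sum()                  # fraction of the training set per class
imbalance_ratio = counts.max() / counts.min()   # largest class relative to the smallest

print(shares.round(3))
print("imbalance ratio: %.2f" % imbalance_ratio)
```

A ratio close to 1 means the classes are of similar size, so plain accuracy is a reasonable first metric here.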

## 2. Preprocessing - First Pass
Here we define a function that performs some standard text cleaning and lemmatises the data.


In [162]:
# Load the spaCy model
nlp = spacy.load('en_core_web_md')

def preprocess(text, lemmatise, pos=None):
    # Basic preprocessing
    text = text.lower()  # make lower case
    text = re.sub(r'[\w-]+@([\w-]+\.)+[\w-]+', '', text)  # remove email addresses
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # remove words containing digits
    # Replace carriage returns, line breaks and tabs with a space
    # (Python's re has no /.../i delimiters, so a plain character class is used)
    text = re.sub(r'[\r\n\t]+', ' ', text)

    # Pass the cleaned text through the spaCy model
    nlp_text = nlp(text)

    if lemmatise == 'on':
        if pos is None:
            lemmas = [token.lemma_ for token in nlp_text if token.lemma_ != '-PRON-']
        else:
            lemmas = [token.lemma_ for token in nlp_text if token.pos_ in pos]
        return ' '.join(lemmas)

    else:
        if pos is None:
            return nlp_text
        else:
            # pos is a list of tags, so test membership as in the lemmatised branch
            return [token for token in nlp_text if token.pos_ in pos]
    

In [163]:
# Call the preprocessing function twice: first lemmatising only, then keeping only noun/verb/adjective lemmas
train['lemm'] = train['text'].apply(lambda x: preprocess(x, 'on')) 
train['nounadjverb'] = train['text'].apply(lambda x: preprocess(x, 'on', ['NOUN','VERB','ADJ'])) 
train.head()

Unnamed: 0,text,lemm,nounadjverb
0,csc2imd@cabell.vcu.edu (Ian M. Derby) writes:\...,ian m derby write since someone bring up spo...,write bring sport radio sportswrite happen big...
1,In <30MAR93.02086551.0010@MUSIC.LIB.MATC.EDU P...,in pfan write for those of who know who be...,write know s team mascot will give walking pap...
2,In article <15780004@hpspdla.spd.HP.COM garyr@...,in article gary rosen write thomas miller ...,article write think weekend strange one strang...
3,"Hi, baseball fans! So what do you say? Don't y...",hi baseball fan so what do say do not think de...,baseball fan say think deserve mean consider g...
4,"In article <93104.233239ISSBTL@BYUVM.BITNET, <...",in article write i would like to make every...,article write would like make aware win lead g...


## 3. Feature Extraction
Here we convert the text into TF-IDF feature vectors.

In [164]:
stop_words = text.ENGLISH_STOP_WORDS

# create a function that vectorises the data using tfidf
def text2vec(train, test):
    tfidf = TfidfVectorizer(stop_words=stop_words, ngram_range = (1,1), max_df = 0.7)
    train_transformed = tfidf.fit_transform(train)
    test_transformed = tfidf.transform(test)
    return tfidf,train_transformed, test_transformed
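
The fit-on-train, transform-on-test pattern used in `text2vec` can be seen on a toy corpus. This is a minimal sketch: the documents are invented for illustration, and only the `TfidfVectorizer` calls mirror the function above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only.
toy_train = ["the pitcher threw the ball", "the chip encrypts the key"]
toy_test = ["the pitcher lost the key", "unseen words only"]

tfidf = TfidfVectorizer(stop_words="english")
train_vecs = tfidf.fit_transform(toy_train)  # learns the vocabulary and IDF weights
test_vecs = tfidf.transform(toy_test)        # reuses them without refitting

print(sorted(tfidf.vocabulary_))
print(train_vecs.shape, test_vecs.shape)
print(test_vecs[0].nnz, test_vecs[1].nnz)
```

Both matrices share the same columns because the vocabulary comes from the training text only. Words that appear only in the test text are dropped, which is why the second test document ends up with no non-zero entries.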

#### Split the data into train and test sets

In [166]:
X_train, X_test, y_train, y_test = train_test_split(train, labels, test_size=0.2, random_state=42)
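
The split above is purely random. If keeping the class proportions identical in both parts matters, `train_test_split` also accepts a `stratify` argument. The sketch below uses a toy frame invented for illustration; it is an optional variation, not what the notebook does.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame, invented for illustration only: 8 "a" rows and 2 "b" rows.
toy = pd.DataFrame({"text": ["doc %d" % i for i in range(10)],
                    "label": ["a"] * 8 + ["b"] * 2})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["text"]], toy["label"],
    test_size=0.5, random_state=42, stratify=toy["label"])

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Because features and labels are split in the same call, their row indices stay aligned, just as `train` and `labels` do in the cell above.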

#### Vectorise using the preprocessor (lemmatise only)

In [167]:
vectorizer, vectors_train, vectors_test = text2vec(X_train['lemm'].tolist(), X_test['lemm'].tolist())

In [168]:
print("\nTrain Data format:  {}   Non-zero components estimate:  {}".format(vectors_train.shape,vectors_train.nnz / float(vectors_train.shape[0])))
print("Test Data format:   {}   Non-zero components estimate:  {}".format(vectors_test.shape,vectors_test.nnz / float(vectors_test.shape[0])))


Train Data format:  (5107, 37325)   Non-zero components estimate:  77.54944194243195
Test Data format:   (1277, 37325)   Non-zero components estimate:  73.23805794831637
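
The "non-zero components" figure is `nnz` divided by the number of rows, that is, the average number of distinct terms per document. Dividing it by the vocabulary size gives the matrix density. A small worked calculation with the numbers printed above:

```python
# Figures copied from the output above.
n_features = 37325
avg_nonzero_train = 77.54944194243195

density = avg_nonzero_train / n_features  # fraction of non-zero cells per row
print("density: %.4f%%" % (100 * density))
```

Roughly 0.2% of the cells are non-zero, which is why the vectors are kept as a sparse matrix rather than converted to a dense array.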


## 4. Classify and Evaluate

#### Define some helper functions

In [172]:
def show_top10(classifier, categories, vectorizer):
    feature_names = np.asarray(vectorizer.get_feature_names())
    for i, category in enumerate(sorted(categories)):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))

def classify(categories, vectorizer, vectors_train, vectors_test, train_set, clf):
    clf.fit(vectors_train, train_set)
    predictions = clf.predict(vectors_test)
    show_top10(clf, categories, vectorizer)
    return predictions  # reuse the predictions instead of calling predict() twice


#### Predict with Naive Bayes

In [173]:
cats = sorted(labels['label'].value_counts().index.to_list())
# Create test prediction
test_pred = classify(cats,vectorizer, vectors_train, vectors_test, y_train['label_num'],  clf=MultinomialNB(alpha=.01))
print(classification_report( y_test['label_num'], test_pred, target_names=cats))

baseball: fan pitch run hit player win baseball year team game
cryptography: escrow algorithm nsa phone use clipper government encryption chip key
electronics: amp good output voltage power line work battery circuit use
hardware: problem monitor disk bus use ide controller scsi card drive
medicine: gordon cause medical know article patient disease doctor food msg
mideast: armenian say article jewish people turkish arab jews israeli israel
motorcycles: bmw helmet rider like dog motorcycle article ride dod bike
politics: just make say right clayton man homosexual government article people
religion: moral morality value people article jesus theory objective say god
space: pat like shuttle article nasa just orbit moon launch space
              precision    recall  f1-score   support

    baseball       0.99      1.00      1.00       136
cryptography       0.96      0.99      0.98       130
 electronics       0.94      0.90      0.92       120
    hardware       0.92      0.93      0.93   

## 5. Model Improvements
The baseline Naive Bayes model gives an accuracy of 91%.

#### Compare with a standard LinearSVC
Accuracy: 90%

In [178]:
test_pred = classify(cats, vectorizer, vectors_train, vectors_test, y_train['label_num'],  clf=LinearSVC())
print(classification_report( y_test['label_num'], test_pred, target_names=cats))

baseball: pitcher career phillies season cub hit player team game baseball
cryptography: des encrypt pgp wiretap sternlight security nsa encryption clipper key
electronics: amp input battery cool audio tv radar copy voltage circuit
hardware: disk bio cache modem motherboard computer monitor pc drive card
medicine: food medicine treatment needle pain gordon msg medical doctor disease
mideast: jewish armenian turkey policy zionism jews turkish arab israel israeli
motorcycles: ama honda rider dog helmet bmw motorcycle ride bike dod
politics: william child drieux deficit liberal cramer homosexual clayton kaldis gay
religion: beast koresh odwyer god objective universe moral bible mormons christian
space: earth sky moon orbit scispace shuttle launch nasa pat space
              precision    recall  f1-score   support

    baseball       0.98      1.00      0.99       136
cryptography       0.98      0.98      0.98       130
 electronics       0.96      0.94      0.95       120
    hardware  

#### Create a pipeline and combine it with GridSearchCV
Score on the held-out split: 0.913

In [181]:
text_clf = Pipeline([('tfidf_vectorizer', TfidfVectorizer(stop_words = stop_words)),
                     ('clf', MultinomialNB())
                    ])

text_clf.fit(X_train['lemm'].tolist(), y_train['label_num'])
test_pred = text_clf.predict(X_test['lemm'].tolist())

parameters = {'tfidf_vectorizer__ngram_range': [(1, 1), (1, 2)],
              'tfidf_vectorizer__max_df': [0.5, 0.8, 1.0],
              'clf__alpha': [0.1, 0.01]}

grid = GridSearchCV(text_clf, param_grid=parameters, cv=5)
grid.fit(X_train['lemm'].tolist(), y_train['label_num'])
print("score = %3.3f" % (grid.score(X_test['lemm'].tolist(), y_test['label_num'])))
print(grid.best_params_)


score = 0.913
{'clf__alpha': 0.01, 'tfidf_vectorizer__max_df': 0.8, 'tfidf_vectorizer__ngram_range': (1, 2)}
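
The score printed above comes from `grid.score` on the held-out split. The mean cross-validated score of the winning parameter combination is stored separately in `grid.best_score_`. The following sketch shows both on a toy corpus invented for illustration; the pipeline step names match the ones used above, and the small grid and `cv=2` are assumptions made to keep the example tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only.
docs = ["bat ball glove", "ball pitcher bat", "pitcher glove ball", "bat pitcher glove",
        "key cipher code", "cipher chip key", "chip code cipher", "key chip code"]
y = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("tfidf_vectorizer", TfidfVectorizer()),
                 ("clf", MultinomialNB())])
params = {"tfidf_vectorizer__ngram_range": [(1, 1), (1, 2)],
          "clf__alpha": [0.1, 0.01]}

grid = GridSearchCV(pipe, param_grid=params, cv=2)
grid.fit(docs, y)

print("mean cross-validated accuracy: %.3f" % grid.best_score_)
print("accuracy on the data passed to score(): %.3f" % grid.score(docs, y))
print(grid.best_params_)
```

With the default `refit=True`, the grid object is refitted on all the data passed to `fit` using the best parameters, so `grid.predict` and `grid.score` can be used directly afterwards.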


--------------------------------------------------- 
# Version 2.0 without spaCy
This version drops spaCy because of issues submitting to KATE when spaCy is used.

## 2a. Preprocess
Remove spaCy from the preprocess function.

In [183]:
def preprocess(text):
    # Basic preprocessing: make text lowercase, then remove email addresses,
    # punctuation and words containing digits
    text = text.lower()  # make lower case
    text = re.sub(r'[\w-]+@([\w-]+\.)+[\w-]+', '', text)  # remove email addresses
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # remove punctuation
    text = re.sub(r'\w*\d\w*', '', text)  # remove words containing digits
    # Replace carriage returns, line breaks and tabs with a space
    # (Python's re has no /.../i delimiters, so a plain character class is used)
    text = re.sub(r'[\r\n\t]+', ' ', text)
    return text
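
To see what each cleaning step removes, the same sequence can be run on a single made-up string. This is a self-contained sketch: `clean_text` and the sample are hypothetical. It writes the whitespace step as the character class `[\r\n\t]+`, because Python's `re` treats `/` and `/i` delimiters as literal characters.

```python
import re
import string

def clean_text(text):
    """Illustrative copy of the cleaning steps: lowercase, strip emails,
    punctuation and digit-bearing tokens, then normalise line breaks and tabs."""
    text = text.lower()
    text = re.sub(r'[\w-]+@([\w-]+\.)+[\w-]+', '', text)             # drop email addresses
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    text = re.sub(r'\w*\d\w*', '', text)                             # drop tokens containing digits
    text = re.sub(r'[\r\n\t]+', ' ', text)                           # line breaks and tabs -> space
    return text

# Sample string, invented for illustration only.
sample = "Ian (ian@example.com) writes:\nThe 1993 season\tstarts in April!"
cleaned = clean_text(sample)
print(repr(cleaned))
print(cleaned.split())
```

Removed tokens leave their surrounding spaces behind, so the cleaned string can contain runs of spaces. That is harmless here because `TfidfVectorizer` tokenises on word boundaries.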
    


Rerun the preprocess function on the dataframe and redo the train/test split.

In [184]:
train['processed'] = train['text'].apply(lambda x: preprocess(x)) 
train.head()
X_train, X_test, y_train, y_test = train_test_split(train, labels, test_size=0.2, random_state=42)


## 2b. GridSearch
Repeat the grid search with the newly processed data.

In [186]:
text_clf = Pipeline([('tfidf_vectorizer', TfidfVectorizer()),
                     ('clf', MultinomialNB())
                    ])

text_clf.fit(X_train['processed'].tolist(), y_train['label_num'])
test_pred = text_clf.predict(X_test['processed'].tolist())

parameters = {'tfidf_vectorizer__ngram_range': [(1, 1), (1, 2)],
              'tfidf_vectorizer__max_df': [0.5, 0.8, 1.0],
              'tfidf_vectorizer__stop_words': ['english', None],
              'tfidf_vectorizer__max_features': [None, 5000, 10000],
              'clf__alpha': [0.01, 0.1]}

grid = GridSearchCV(text_clf, param_grid=parameters, cv=5)
grid.fit(X_train['processed'].tolist(), y_train['label_num'])
print("score = %3.3f" % (grid.score(X_test['processed'].tolist(), y_test['label_num'])))
print(grid.best_params_)

score = 0.915
{'clf__alpha': 0.01, 'tfidf_vectorizer__max_df': 0.5, 'tfidf_vectorizer__max_features': None, 'tfidf_vectorizer__ngram_range': (1, 2), 'tfidf_vectorizer__stop_words': 'english'}


#### Refit using the best params

In [187]:
# Retrain  with the best parameters

text_clf = Pipeline([('tfidf_vectorizer', TfidfVectorizer(max_df = 0.5, stop_words = 'english', ngram_range= (1, 2))),
                     ('clf', MultinomialNB(alpha = 0.01))
                    ])

text_clf.fit(X_train['processed'].tolist(), y_train['label_num'])
test_pred = text_clf.predict(X_test['processed'].tolist())

print(classification_report( y_test['label_num'], test_pred, target_names=cats))

              precision    recall  f1-score   support

    baseball       1.00      1.00      1.00       136
cryptography       0.98      1.00      0.99       130
 electronics       0.95      0.96      0.95       120
    hardware       0.95      0.94      0.95       102
    medicine       0.99      0.96      0.97       114
     mideast       0.80      0.84      0.82       113
 motorcycles       1.00      0.99      0.99       139
    politics       0.74      0.71      0.72       156
    religion       0.82      0.85      0.83       149
       space       0.97      0.97      0.97       118

    accuracy                           0.92      1277
   macro avg       0.92      0.92      0.92      1277
weighted avg       0.92      0.92      0.92      1277

