# Task 1 - Text Classification
## Creating a benchmark analysis with different algorithms and feature extractors.
### Algorithms: Multinomial Naïve Bayes, Logistic Regression, Support Vector Machines, Decision Trees
### Feature Extractors: CountVectorizer, Word2Vec, Doc2Vec, Fastai

### Import all the necessary libraries

In [9]:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from gensim.models import KeyedVectors, Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.metrics import accuracy_score
import random

### Choose a few categories from the full set of 20

In [3]:
# Load some categories from the training set
categories = [
    'alt.atheism',
    'talk.religion.misc']

In [4]:
print("Loading 20 newsgroups dataset for categories:")
print(categories)

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']


### Loading data

In [5]:
data = fetch_20newsgroups(subset='train', categories=categories)
print(f"{len(data.filenames)} documents")
print(f"{len(data.target_names)} categories")
print()

857 documents
2 categories



In [6]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.2, random_state=0)

In [7]:
classification_algorithms = {'Multinomial Naïve Bayes': MultinomialNB(),
                            'Logistic Regression': LogisticRegression(),
                            'Support Vector Machine': SVC(),
                            'Decision Tree': DecisionTreeClassifier()}

### CountVectorizer

In [9]:
for name, clf in classification_algorithms.items():
    # Chain the bag-of-words vectorizer and the classifier into one estimator
    pipeline = Pipeline([
        ("CountVectorizer", CountVectorizer()),
        (name, clf),
    ])
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    print(f"Accuracy for CountVectorizer with {name}: {score}")

Accuracy for CountVectorizer with Multinomial Naïve Bayes: 0.9186046511627907


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Accuracy for CountVectorizer with Logistic Regression: 0.9244186046511628
Accuracy for CountVectorizer with Support Vector Machine: 0.6511627906976745
Accuracy for CountVectorizer with Decision Tree: 0.8023255813953488
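
The warning in the output above recommends raising `max_iter` (or scaling the data) for `LogisticRegression`. A minimal sketch of the first option follows, run on a tiny hypothetical corpus rather than the newsgroups data; the value `max_iter=1000` is an assumption, not a tuned setting, and the scores reported above were produced with the default.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for X_train / y_train
docs = [
    "atheism argument evidence",
    "atheism debate evidence",
    "church prayer faith",
    "church sermon faith",
]
labels = [0, 0, 1, 1]

pipeline = Pipeline([
    ("CountVectorizer", CountVectorizer()),
    # max_iter=1000 is an assumed value, raised from the default of 100
    ("Logistic Regression", LogisticRegression(max_iter=1000)),
])
pipeline.fit(docs, labels)

# n_iter_ reports how many solver iterations were actually used
n_iter = pipeline.named_steps["Logistic Regression"].n_iter_
predictions = pipeline.predict(["prayer and faith"])
print(n_iter, predictions)
```

If `n_iter_` stays below `max_iter`, the solver stopped on its own tolerance and no convergence warning is raised.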


### Word2Vec

In [14]:
model_path = 'GoogleNews-vectors-negative300.bin'
w2v_model = KeyedVectors.load_word2vec_format(model_path, binary=True)

In [15]:
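def embedding_feats(list_of_lists):
    # Average the Word2Vec vectors of the in-vocabulary tokens of each document.
    # Each element of list_of_lists should be a list of tokens; a raw string
    # is iterated character by character.
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists:
        feat_for_this = np.zeros(DIMENSION)
        count_for_this = 0
        for token in tokens:
            if token in w2v_model:
                feat_for_this += w2v_model[token]
                count_for_this += 1
        if count_for_this:
            feats.append(feat_for_this / count_for_this)
        else:
            # No token found in the vocabulary: avoid dividing by zero
            feats.append(zero_vector)
    return feats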

In [16]:
vector_train = embedding_feats(X_train)
vector_test = embedding_feats(X_test)

In [18]:
# Skip MultinomialNB (the first entry): it rejects the negative values found in embeddings
for ca in list(classification_algorithms.values())[1:]:
    ca.fit(vector_train, y_train)
    y_pred = ca.predict(vector_test)
    score = accuracy_score(y_test, y_pred)
    print(f"Accuracy for Word2Vec with {ca}: {score}")

Accuracy for Word2Vec with LogisticRegression(): 0.5755813953488372
Accuracy for Word2Vec with SVC(): 0.5581395348837209
Accuracy for Word2Vec with DecisionTreeClassifier(): 0.5232558139534884
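
`X_train` and `X_test` hold raw strings, so the inner loop of `embedding_feats` walks over single characters rather than words, which may partly explain the near-chance scores above. The sketch below illustrates the difference with a hypothetical three-dimensional embedding table in place of the GoogleNews vectors; `toy_vectors` and `average_embedding` are illustrative names, and whitespace splitting is only one possible tokenizer.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the 300-dimensional model
toy_vectors = {
    "objective": np.array([1.0, 0.0, 0.0]),
    "morality": np.array([0.0, 1.0, 0.0]),
    "o": np.array([0.0, 0.0, 1.0]),  # single characters can be vocabulary entries too
}

def average_embedding(tokens, vectors, dimension=3):
    """Mean of the vectors of the tokens found in `vectors` (zero vector if none)."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dimension)
    return np.mean(found, axis=0)

text = "objective morality"

# Passing the raw string iterates over characters: only "o" matches
char_level = average_embedding(text, toy_vectors)
# Splitting on whitespace first yields word-level tokens
word_level = average_embedding(text.split(), toy_vectors)

print(char_level)
print(word_level)
```

In the notebook, the equivalent change would be to tokenize each document before calling `embedding_feats`, for example with `[doc.split() for doc in X_train]`; the accuracies would then need to be recomputed.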


### Doc2Vec

In [32]:
# Tag each training document. Note that d is a raw string here, so the model
# is trained on its characters rather than on word tokens.
d2vtrain = [TaggedDocument(d, tags=[str(i)]) for i, d in enumerate(X_train)]
d2v_model = Doc2Vec(vector_size=50, alpha=0.025, min_count=10, dm=1, epochs=100)
d2v_model.build_vocab(d2vtrain)
d2v_model.train(d2vtrain, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)
d2v_model.save("d2v_train.model")

In [33]:
# Convert the sentences to TaggedDocuments
documents = [TaggedDocument(words=sentence.split(), tags=[str(i)]) for i, sentence in enumerate(X_train)]

# Infer the vectors for the documents
vectors = [d2v_model.infer_vector(document.words) for document in documents]

# Convert the list of vectors to a numpy array
vector_train = np.array(vectors)

In [34]:
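# A second Doc2Vec model is trained on the test documents here, so the test
# vectors inferred in the next cell come from a different embedding space
# than the training vectors.
d2vtest = [TaggedDocument(d, tags=[str(i)]) for i, d in enumerate(X_test)]
d2v_model = Doc2Vec(vector_size=50, alpha=0.025, min_count=10, dm=1, epochs=100)
d2v_model.build_vocab(d2vtest)
d2v_model.train(d2vtest, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)
d2v_model.save("d2v_test.model")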

In [35]:
# Convert the sentences to TaggedDocuments
documents = [TaggedDocument(words=sentence.split(), tags=[str(i)]) for i, sentence in enumerate(X_test)]

# Infer the vectors for the documents
vectors = [d2v_model.infer_vector(document.words) for document in documents]

# Convert the list of vectors to a numpy array
vector_test = np.array(vectors)

In [31]:
# MultinomialNB (the first entry) is skipped because Doc2Vec vectors contain negative values
for ca in list(classification_algorithms.values())[1:]:
    ca.fit(vector_train, y_train)
    y_pred = ca.predict(vector_test)
    score = accuracy_score(y_test, y_pred)
    print(f"Accuracy for Doc2Vec with {ca}: {score}")

Accuracy for Doc2Vec with LogisticRegression(): 0.5058139534883721
Accuracy for Doc2Vec with SVC(): 0.5465116279069767
Accuracy for Doc2Vec with DecisionTreeClassifier(): 0.436046511627907


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


### Choose the best model

In [14]:
pipeline = Pipeline([("CountVectorizer", CountVectorizer()),
                     ("Logistic Regression", LogisticRegression())])
pipeline.fit(X_train, y_train)


### Use the model to classify a piece of text

In [13]:
sample = random.choice(X_test)
sample

"From: mathew@mantis.co.uk (mathew)\nSubject: Re: After 2000 years, can we say that Christian Morality is\nOrganization: Mantis Consultants, Cambridge. UK.\nLines: 12\nX-Newsreader: rusnews v1.01\n\nfrank@D012S658.uucp (Frank O'Dwyer) writes:\n> (b) I am neither a Christian nor a theist, but I believe in objective\n> morality in preference to a relativist soup of gobbledegook.\n\nWell, there are two approaches we can take here.  One is to ask you what this\nobjective morality is, assuming it's not a secret.\n\nThe other is to ask you what you think is wrong with relativism, so that we\ncan correct your misconceptions :-)\n\n\nmathew\n"

In [16]:
pipeline.predict([sample])

array([1], dtype=int64)
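
The pipeline returns a class index rather than a name. A minimal sketch of mapping it back to a newsgroup follows; it assumes `data.target_names` lists the categories in the order printed earlier, and it hard-codes both that list and the prediction so that it runs on its own.

```python
import numpy as np

# Assumption: same order as data.target_names in the notebook
target_names = ['alt.atheism', 'talk.religion.misc']

# The array returned by pipeline.predict([sample]) above
prediction = np.array([1])

# Look up the name for each predicted class index
predicted_names = [target_names[i] for i in prediction]
print(predicted_names)
```

In the notebook itself, `data.target_names[pipeline.predict([sample])[0]]` would give the same lookup without hard-coding the list.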