<a href="https://colab.research.google.com/github/MohammedHamood/20NewsGroup/blob/main/20NewsGroup_LR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 20NewsGroup - Logistic Regression

MEMO: Include any decisions about training/validation split, regularization
strategies, any optimization tricks, setting hyper-parameters, etc.

## Data Pre-Processing

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import time
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

import pipelines_FEngineering as BaseLine
from sklearn.datasets import fetch_20newsgroups
import preprocessingNLP as PNLP

from imblearn.over_sampling import RandomOverSampler
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

from sklearn.decomposition import PCA, NMF,TruncatedSVD
from sklearn.feature_selection import SelectKBest, chi2

from sklearn.metrics import classification_report

# Import Dataset
TNGD_Train = fetch_20newsgroups(subset='train',shuffle=True, random_state=42,remove=['headers', 'footers', 'quotes'])
TNGD_test = fetch_20newsgroups(subset='test',shuffle=True, random_state=42,remove=['headers', 'footers', 'quotes'])

# Preprocessing
print("PREPROCESSING ...")
TNGD_Train.data = PNLP.customNLP(TNGD_Train.data)
TNGD_test.data = PNLP.customNLP(TNGD_test.data)
TNGD_Train.data, TNGD_Train.target = PNLP.removeEmptyInstances(TNGD_Train.data, TNGD_Train.target)



print("PREPROCESSING DONE!")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
PREPROCESSING ...




PREPROCESSING DONE!


## Cross-Validation Splitting Strategy

Let's first analyse how the cross-validation splitting strategy affects the cross-validated accuracy of a Logistic Regression classifier with its default parameters. In the following experiment, Logistic Regression is applied to the dataset with different numbers of folds (5 to 15). The elapsed time of each run is also recorded.

In [None]:
# Logistic Regression Model
accuracies = []
folds = []
runtimes = []
for i in range(5,16):
  start = time.perf_counter()
  LR_base = BaseLine.Pipeline_FeatureEngineering(TNGD_Train.data, TNGD_Train.target,
                                                parameters={}, CV=i,
                                                reductionMethod=TruncatedSVD(n_components=100),
                                                reductionType=3, model=LogisticRegression(n_jobs=-1))
  end = time.perf_counter()
  elapsed_time = end-start
  accuracies.append(LR_base.best_score_)
  folds.append(i)
  runtimes.append(elapsed_time)

# Display results (matplotlib.pyplot is already imported as plt above)
fig = plt.figure(figsize=(12,5))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
ax1.plot(folds, accuracies, 'o-')
ax1.set_ylabel('Accuracy')
ax1.set_xlabel('Number of Folds')
ax2.plot(folds, runtimes, 'o-')
ax2.set_ylabel('Training Time (s)')
ax2.set_xlabel('Number of Folds')
plt.tight_layout()
plt.show()

The first graph shows that, within the range of 5 to 15 folds, 11-fold cross-validation gives the best accuracy.
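Differences between fold counts can be small compared with the fold-to-fold spread, so it may help to look at the standard deviation next to the mean. The following is a minimal sketch on a synthetic dataset from `make_classification` (an assumption made for brevity, not the 20 Newsgroups data); `cv_summary` is a hypothetical helper and is not part of `pipelines_FEngineering`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (assumption: used only to keep the sketch fast)
X, y = make_classification(n_samples=600, n_features=20,
                           n_informative=5, random_state=0)

def cv_summary(n_folds):
    """Return (mean, std) of the validation accuracy over n_folds folds."""
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
    return scores.mean(), scores.std()

# Compare a few fold counts; the std shows how noisy each estimate is
summary = {k: cv_summary(k) for k in (5, 10, 15)}
for k, (mean, std) in summary.items():
    print("folds={0:2d}  mean={1:.3f}  std={2:.3f}".format(k, mean, std))
```

If the gap between two fold counts is smaller than the reported standard deviations, the ranking between them is probably not meaningful.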

## Setting Hyper-Parameters

In the following section, RandomizedSearchCV is used to evaluate the validation accuracy of Logistic Regression for random combinations of the hyper-parameters used in the validation pipeline. The relevant parameters for CountVectorizer are max_features and ngram_range, whereas the parameters for TfidfTransformer are use_idf and norm. Logistic Regression is used with its default values except for max_iter=400, which gives the solver enough iterations to converge on this dataset.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer,TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import Normalizer

# Set possible parameters
parameters = {
     'vect__max_features': (None, 50000, 100000),
     'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams or bigrams
     'tfidf__use_idf': (True, False),
     'tfidf__norm': ('l1', 'l2'),
}

# Create a pipeline
pip = Pipeline([('vect', CountVectorizer()),('tfidf', TfidfTransformer()),
                ('Norm', Normalizer(copy=False)),('clf', LogisticRegression(max_iter=400, n_jobs=-1))])

# Initialize RandomizedSearchCV
n_iter_search = 10
cv_folds = 11
LR_rand_search = RandomizedSearchCV(pip, param_distributions=parameters, 
                               n_iter=n_iter_search, cv=cv_folds)

# Utility function to report best scores
def report(results, n_top=10):
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            print("Model with rank: {0}".format(i))
            print("Mean validation score: {0:.3f} (std: {1:.3f})"
                  .format(results['mean_test_score'][candidate],
                          results['std_test_score'][candidate]))
            print("Parameters: {0}".format(results['params'][candidate]))
            print("")

# Execute RandomizedSearchCV and print best results
LR_rand_search.fit(TNGD_Train.data, TNGD_Train.target)
report(LR_rand_search.cv_results_)

These results give a rough idea of suitable parameters for CountVectorizer and TfidfTransformer.
RandomizedSearchCV can give different rankings from run to run because it samples parameter combinations at random and no random_state is fixed here. Nevertheless, to improve the chances of obtaining the best accuracy, the parameters ranked first are reused in the next experiments.
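The run-to-run variation comes from how candidates are drawn. As a small illustration, `ParameterSampler` (the scikit-learn utility that draws random candidates from a parameter space) returns the same candidates whenever the same `random_state` is given, so passing a fixed `random_state` to `RandomizedSearchCV` should make the sampled combinations repeatable. The seed value `0` below is an arbitrary choice for this sketch.

```python
from sklearn.model_selection import ParameterSampler

# Same search space as the first random search (written as lists)
param_grid = {
    'vect__max_features': [None, 50000, 100000],
    'vect__ngram_range': [(1, 1), (1, 2)],
    'tfidf__use_idf': [True, False],
    'tfidf__norm': ['l1', 'l2'],
}

# Two draws with the same seed (seed 0 is an arbitrary assumption)
first_draw = list(ParameterSampler(param_grid, n_iter=10, random_state=0))
second_draw = list(ParameterSampler(param_grid, n_iter=10, random_state=0))

same_candidates = first_draw == second_draw
print("Identical candidates with a fixed seed:", same_candidates)
```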

Now, RandomizedSearchCV is used to evaluate the hyper-parameters that could potentially improve the validation accuracy of Logistic Regression. Only the relevant parameters of the model are evaluated. The 'saga' solver, a variant of Stochastic Average Gradient, is used for optimization because it supports both L1 and L2 regularization and tends to converge faster on large datasets. A Normalizer is also included in the pipeline so that every document vector has unit norm, which keeps the features on a comparable scale and helps 'saga' converge quickly.
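To make the role of the Normalizer concrete: it rescales each sample (row), not each feature column, to unit L2 norm by default. The sketch below uses a made-up three-document corpus (an assumption for illustration) and shows that rows coming out of `TfidfTransformer(norm='l1')` are rescaled to unit L2 norm by a following `Normalizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import Normalizer

# Toy corpus (hypothetical, only for illustration)
docs = ["the cat sat", "the dog sat on the mat", "cats and dogs"]

counts = CountVectorizer().fit_transform(docs)

# TF-IDF rows normalized so that their entries sum to 1 (L1 norm)
tfidf_l1 = TfidfTransformer(norm='l1').fit_transform(counts)
l1_row_sums = tfidf_l1.toarray().sum(axis=1)

# Normalizer then rescales every row to unit L2 norm (its default)
normalized = Normalizer().fit_transform(tfidf_l1)
l2_row_norms = np.linalg.norm(normalized.toarray(), axis=1)

print("L1 row sums after TF-IDF:", l1_row_sums)
print("L2 row norms after Normalizer:", l2_row_norms)
```

One consequence is that, with a `Normalizer` placed after the transformer, the model always sees L2-normalized rows regardless of the `norm` chosen for `TfidfTransformer`.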

In [None]:
# Get parameters used for rank 1
indices = np.flatnonzero(LR_rand_search.cv_results_['rank_test_score'] == 1)
rank1_params = LR_rand_search.cv_results_['params'][indices[0]]
ngram_range = rank1_params['vect__ngram_range']
max_features = rank1_params['vect__max_features']


# Set relevant parameters
parameters = {
     'clf__penalty': ('l2', 'l1'),
     'clf__max_iter': (300, 400, 500),
     'clf__C': (0.5, 1.0, 10.0),
     'clf__tol': (0.001, 0.0001)
}

# Create a pipeline
pip = Pipeline([('vect', CountVectorizer(ngram_range=ngram_range, max_features=max_features)),
                ('tfidf', TfidfTransformer()),
                ('Norm', Normalizer(copy=False)),('clf', LogisticRegression(solver='saga', n_jobs=-1))])

# Initialize RandomizedSearchCV
n_iter_search=10
cv_folds=10
LR_rand_search2 = RandomizedSearchCV(pip, param_distributions=parameters, 
                               n_iter=n_iter_search, cv=cv_folds)

# Execute RandomizedSearchCV and print best results 
LR_rand_search2.fit(TNGD_Train.data, TNGD_Train.target)
report(LR_rand_search2.cv_results_)

In the following section, GridSearchCV is used to find the regularization strength that maximizes the validation accuracy of Logistic Regression.

The previous results help to select the hyper-parameters. The default tolerance of 0.0001 seems to be a good value for the stochastic solver. It also seems that a higher validation accuracy is obtained when no penalty is applied during training. Although this might look good, it could also be a sign of overfitting, which could then lead to a poor test accuracy. To avoid this issue, an L1 (lasso) penalty is applied with an appropriate regularization strength. This penalty is a good fit here because it produces sparse weights, which acts as a form of feature selection by reducing the number of features effectively used from the design matrix. This in turn reduces the complexity of the model. Furthermore, every run converged before reaching the maximum number of iterations provided in the parameters, so 300 iterations seems like an appropriate maximum value. Finally, GridSearchCV helps to choose the appropriate regularization strength for training, which scikit-learn expresses through the inverse-strength parameter C.
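The sparsity argument can be checked on a small scale. The sketch below is an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the newsgroup features, an arbitrary strong regularization (`C=0.02`), and a hypothetical helper `count_zero_weights`. It counts how many coefficients are exactly zero under each penalty with the 'saga' solver.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 50 features, only 5 of them informative (assumption)
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           n_redundant=0, random_state=0)

def count_zero_weights(penalty):
    """Fit Logistic Regression with 'saga' and count exactly-zero weights."""
    clf = LogisticRegression(penalty=penalty, solver='saga', C=0.02,
                             max_iter=5000, random_state=0)
    clf.fit(X, y)
    return int(np.sum(clf.coef_ == 0))

zeros_l1 = count_zero_weights('l1')
zeros_l2 = count_zero_weights('l2')

print("Zero weights with L1 penalty:", zeros_l1)
print("Zero weights with L2 penalty:", zeros_l2)
```

The L1 fit is expected to zero out many of the uninformative features, while the L2 fit only shrinks them.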

In [None]:
# Set relevant parameters
parameters = {
    'clf__C': (0.1, 1.0, 10)
}

# Reuse the rank-1 TfidfTransformer setting from the first random search
use_idf = rank1_params['tfidf__use_idf']

# Create a pipeline
pip = Pipeline([('vect', CountVectorizer(ngram_range=ngram_range, max_features=max_features)),
                ('tfidf', TfidfTransformer(use_idf=use_idf)),
                ('Norm', Normalizer(copy=False)),
                ('clf', LogisticRegression(max_iter=300, penalty='l1', solver='saga', n_jobs=-1))])

# Initialize and execute GridSearchCV on the 20 Newsgroups training set
gs_clf = GridSearchCV(pip, parameters, cv=cv_folds, n_jobs=-1)
start_time = time.perf_counter()
gs_clf = gs_clf.fit(TNGD_Train.data, TNGD_Train.target)
elapsed_time = time.perf_counter() - start_time

# Print results
print("Validation Accuracy: " + str(gs_clf.best_score_))
print("Best Estimator: " + str(gs_clf.best_estimator_))
print("Optimal Parameters: " + str(gs_clf.best_params_))
print("Total Search Time: %.3f seconds" % elapsed_time)
print("")

# mean_test_score is the mean accuracy on the held-out validation folds
val_accuracies = gs_clf.cv_results_['mean_test_score']
fit_times = gs_clf.cv_results_['mean_fit_time']
train_params = gs_clf.cv_results_['params']

for i in range(len(fit_times)):
  print("Parameter: {0}".format(train_params[i]))
  print("Training Time: %.3f seconds" % fit_times[i])
  print("Validation Accuracy: {0:.3f}".format(val_accuracies[i]))
  print("")