

**Name:** Jeong Woo Yang


## Downloading and loading Data

This code downloads the prepared split of the Reddit data and loads it into training, validation and test sets.

In [50]:
!wget -O reddit_data_split.zip https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EapVNOIV84tPnQuuFBNgG9UBYIWipQ9JL4QTfSgRtIacBw?download=1
!unzip -o reddit_data_split.zip

--2022-03-21 20:10:38--  https://gla-my.sharepoint.com/:u:/g/personal/jake_lever_glasgow_ac_uk/EapVNOIV84tPnQuuFBNgG9UBYIWipQ9JL4QTfSgRtIacBw?download=1
Resolving gla-my.sharepoint.com (gla-my.sharepoint.com)... 52.105.15.53
Connecting to gla-my.sharepoint.com (gla-my.sharepoint.com)|52.105.15.53|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: /personal/jake_lever_glasgow_ac_uk/Documents/Teaching/reddit_data_split.zip [following]
--2022-03-21 20:10:39--  https://gla-my.sharepoint.com/personal/jake_lever_glasgow_ac_uk/Documents/Teaching/reddit_data_split.zip
Reusing existing connection to gla-my.sharepoint.com:443.
HTTP request sent, awaiting response... 200 OK
Length: 468327 (457K) [application/x-zip-compressed]
Saving to: ‘reddit_data_split.zip’


2022-03-21 20:10:39 (876 KB/s) - ‘reddit_data_split.zip’ saved [468327/468327]

Archive:  reddit_data_split.zip
  inflating: reddit_test.json        
  inflating: reddit_train.json       
  inflating: reddit_va

In [51]:
import json

with open('reddit_train.json') as f:
    train_data = json.load(f)
with open('reddit_val.json') as f:
    validation_data = json.load(f)
with open('reddit_test.json') as f:
    test_data = json.load(f)

print("Number of posts in training data:", len(train_data))
print("Number of posts in validation data:", len(validation_data))
print("Number of posts in test data:", len(test_data))

Number of posts in training data: 1200
Number of posts in validation data: 400
Number of posts in test data: 400


Q1 - Comparing Classifiers [8 marks]

Use the text from the reddit posts (known as “body”) to train classification models using the Scikit Learn package. The labels to predict are the subreddit for each post. Conduct experiments using the following combinations of classifier models and feature representations:

Dummy Classifier with strategy="most_frequent"
Dummy Classifier with strategy="stratified"
LogisticRegression with One-hot vectorization 
LogisticRegression with TF-IDF vectorization (default settings)
SVC Classifier with One-hot vectorization (SVM with RBF kernel, default settings)

(a) An important first step for any machine learning project is to explore the dataset. Calculate counts for the various labels and comment on the distribution of labels in the training/validation/test sets [1 mark]

(b) Implement the five classifiers above, train them on the training set and evaluate on the test set. Discuss the classifier performance in comparison to the others and preprocessing techniques [5 marks]

For the above classifiers, report the classifier accuracy as well as macro/weighted-averaged precision, recall, and F1 (to three decimal places). Show the overall results obtained by the classifiers on the training and test sets in one table, and highlight the best performance. For the best performing classifier (by weighted F1 on the test set), include a bar chart with the F1 score for each class (subreddits on the x-axis, F1 score on the y-axis).
Analyse and discuss the effectiveness of the classifiers. Your discussion should include how the models perform relative to the baselines and each other. It should discuss the classifiers’ behaviours with respect to: 

1) Appropriate model “fit” (how well is the model fit to the training/test dataset),

2) Dataset considerations (e.g. how are labels distributed, any other dataset issues?)

3) Classifier models (and their key parameters).
(c) Choose your own classifier/tokenization/normalisations approach, and report on its performance with respect to the five previous ones on the test set. [2 marks]
You should describe your selected classifier and vectorization approach including a justification for its appropriateness.


## Q1:

Data cleaning and tokenization

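1. Only the subreddit and body columns were kept: subreddit as the label and body as the input text.

2. body was run through text_pipeline_spacy, which drops stopwords, punctuation and whitespace tokens and lowercases the lemma of each remaining token.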

### Q1a: An important first step for any machine learning project is to explore the dataset. Calculate counts for the various labels and comment on the distribution of labels in the training/validation/test sets [1 mark]

All three datasets (training, validation and test) were cleaned.

As 4 of the subreddits are gaming related, words such as 'game' and 'play' are among the most frequent tokens. Furthermore, because Reddit is a community where similar people gather to share opinions, opinion-related words such as 'like', 'know', 'try' and 'think' are also near the top.

In [52]:
import pandas as pd 
labels = ["author","body","id","score","subreddit","title"]
df_train = pd.DataFrame(train_data,columns = labels)
df_test = pd.DataFrame(test_data,columns = labels)
df_validation = pd.DataFrame(validation_data,columns = labels)

df_train=df_train.drop(['title', 'id','score','author'], axis = 1)
df_train.head(5)

Unnamed: 0,body,subreddit
0,"Long story short, I saw ESO in my library, dow...",PS4
1,I have seen a video online where someone took ...,pcgaming
2,"Hi, hope this is the right place/way to post t...",NintendoSwitch
3,After buying a majority share in Limelight/Alc...,antiMLM
4,Is it ok for me to drink coffee in the morning...,HydroHomies


In [53]:
df_train['subreddit'].value_counts()

tea               146
NintendoSwitch    145
PS4               142
Coffee            136
pcgaming          135
HydroHomies       134
xbox              132
antiMLM           128
Soda              102
Name: subreddit, dtype: int64

In [54]:
df_validation['subreddit'].value_counts()

antiMLM           54
NintendoSwitch    52
tea               48
Soda              43
PS4               43
pcgaming          43
Coffee            42
HydroHomies       38
xbox              37
Name: subreddit, dtype: int64

In [55]:
df_test['subreddit'].value_counts()

Coffee            56
NintendoSwitch    52
PS4               48
pcgaming          47
xbox              44
antiMLM           44
tea               42
HydroHomies       38
Soda              29
Name: subreddit, dtype: int64

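There are 9 subreddits in total in this dataset. The labels are fairly balanced in all three splits: per-class counts range from 102 (Soda) to 146 (tea) in the training set, from 37 (xbox) to 54 (antiMLM) in the validation set, and from 29 (Soda) to 56 (Coffee) in the test set. Soda is the smallest class in both the training and test sets.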
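The per-split counts above can also be gathered with one small helper so that counts and class shares sit side by side. The following is a minimal sketch using only the standard library. `label_distribution` and the toy posts are illustrative names and data, not part of the notebook; the real `train_data`, `validation_data` and `test_data` lists would be passed in their place, assuming each post is a dict with a `subreddit` key.

```python
from collections import Counter


def label_distribution(posts, label_key="subreddit"):
    """Return {label: (count, share)} for a list of post dicts, most frequent first."""
    counts = Counter(post[label_key] for post in posts)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.most_common()}


# Toy stand-in for one split of the data (hypothetical posts).
toy_posts = [
    {"subreddit": "tea"},
    {"subreddit": "tea"},
    {"subreddit": "Soda"},
    {"subreddit": "xbox"},
]

toy_dist = label_distribution(toy_posts)
for label, (n, share) in toy_dist.items():
    # One row per label: name, absolute count, share of the split.
    print(f"{label:<15}{n:>5}{share:>8.1%}")
```

Looking at shares rather than raw counts makes the three splits directly comparable even though the training set is three times larger than the other two.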

In [56]:
subreddits = df_train.subreddit.unique()
print(subreddits)

['PS4' 'pcgaming' 'NintendoSwitch' 'antiMLM' 'HydroHomies' 'Coffee' 'xbox'
 'Soda' 'tea']


Token Analysis to check whether the most frequent tokens are consistent throughout the three datasets (test, validation, training)

In [57]:
from collections import Counter
def doc_frequency(corpus):
    """Return the 20 tokens that occur in the largest number of documents."""
    doc_freq = Counter()
    for d in corpus:
        # Count each token at most once per document.
        for t in set(d):
            doc_freq[t] += 1
    return doc_freq.most_common(20)

In [58]:
import spacy

# Load the small English model (en_core_web_sm) without the NER component.
# The download command for the medium model is left commented out; it is not used here.
#!python -m spacy download en_core_web_md

nlp = spacy.load('en_core_web_sm', disable=['ner'])
# The tagger and parser are removed as well.
nlp.remove_pipe('tagger')
nlp.remove_pipe('parser')

# Download a stopword list
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [59]:
train_labels = df_train['subreddit']
validation_labels = df_validation['subreddit']
test_labels = df_test['subreddit']

In [60]:
def text_pipeline_spacy(text):
    tokens = []
    doc = nlp(text)
    for t in doc:
        if not t.is_stop and not t.is_punct and not t.is_space:
            tokens.append(t.lemma_.lower())
    return tokens

In [61]:
## Solution 
from sklearn.feature_extraction.text import CountVectorizer

# Pass in the tokenizer as the tokenizer to the vectorizer.
# Create a one-hot encoding vectorizer.
one_hot_vectorizer = CountVectorizer(tokenizer=text_pipeline_spacy, binary=True)
train_features = one_hot_vectorizer.fit_transform(df_train['body'])

# This creates input features for our classification on all subsets of our collection.
validation_features = one_hot_vectorizer.transform(df_validation['body'])
test_features = one_hot_vectorizer.transform(df_test['body'])

In [62]:
#Cleaned dataframe for the sake of analysis
test_cleaned = df_test["body"].apply(text_pipeline_spacy)
validation_cleaned = df_validation["body"].apply(text_pipeline_spacy)
train_cleaned = df_train["body"].apply(text_pipeline_spacy)

In [63]:
print("Test Data token statistics")
print(doc_frequency(test_cleaned))
print("Validation Data token statistics")
print(doc_frequency(validation_cleaned))
print("Train Data token statistics")
print(doc_frequency(train_cleaned))

Test Data token statistics
[('like', 122), ('game', 101), ('know', 95), ('try', 82), ('play', 81), ('time', 78), ('get', 77), ('want', 76), ('find', 70), ('think', 66), ('good', 63), ('look', 62), ('buy', 53), ('work', 53), ('go', 49), ('people', 48), ('coffee', 45), ('new', 45), ('thing', 45), ('well', 44)]
Validation Data token statistics
[('like', 124), ('know', 89), ('game', 84), ('want', 83), ('try', 71), ('think', 66), ('get', 63), ('find', 62), ('look', 60), ('play', 60), ('time', 57), ('good', 52), ('people', 50), ('buy', 47), ('well', 46), ('lot', 45), ('water', 44), ('thing', 44), ('go', 43), ('drink', 42)]
Train Data token statistics
[('like', 341), ('game', 283), ('know', 264), ('get', 235), ('think', 214), ('try', 210), ('want', 201), ('play', 185), ('buy', 183), ('find', 178), ('look', 177), ('time', 175), ('good', 157), ('go', 151), ('new', 142), ('work', 141), ('well', 141), ('drink', 136), ('water', 135), ('tea', 130)]


### Q1b: Implement the five classifiers above, train them on the training set and evaluate on the test set. Discuss the classifier performance in comparison to the others and preprocessing techniques [5 marks]

For the above classifiers, report the classifier accuracy as well as macro/weighted-averaged precision, recall, and F1 (to three decimal places). Show the overall results obtained by the classifiers on the training and test sets in one table, and highlight the best performance. For the best performing classifier (by weighted F1 on the test set), include a bar chart with the F1 score for each class (subreddits on the x-axis, F1 score on the y-axis). Analyse and discuss the effectiveness of the classifiers. Your discussion should include how the models perform relative to the baselines and each other. It should discuss the classifiers’ behaviours with respect to:

1) Appropriate model “fit” (how well is the model fit to the training/test dataset),

2) Dataset considerations (e.g. how are labels distributed, any other dataset issues?)

3) Classifier models (and their key parameters).

The five classifiers are:

Dummy Classifier with strategy="most_frequent"
Dummy Classifier with strategy="stratified"
LogisticRegression with One-hot vectorization
LogisticRegression with TF-IDF vectorization (default settings)
SVC Classifier with One-hot vectorization (SVM with RBF kernel, default settings)
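Since the task asks for both macro- and weighted-averaged scores, it may help to see how the two averages differ. The sketch below computes per-class F1 by hand and then averages it both ways: the macro average treats every class equally, while the weighted average weights each class by its support (number of true examples). The function names and the four toy labels are illustrative assumptions, not part of the notebook; in practice `classification_report` produces the same figures.

```python
from collections import Counter


def per_class_f1(true_labels, predicted_labels):
    """Return {label: (f1, support)} computed from two parallel label lists."""
    labels = sorted(set(true_labels) | set(predicted_labels))
    support = Counter(true_labels)
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (f1, support[label])
    return scores


def macro_and_weighted_f1(scores):
    """Average per-class F1 equally (macro) and by support (weighted)."""
    total_support = sum(s for _, s in scores.values())
    macro = sum(f1 for f1, _ in scores.values()) / len(scores)
    weighted = sum(f1 * s for f1, s in scores.values()) / total_support
    return macro, weighted


# Toy example: three "tea" posts and one "Soda" post (hypothetical labels).
y_true = ["tea", "tea", "tea", "Soda"]
y_pred = ["tea", "tea", "Soda", "Soda"]

macro_f1, weighted_f1 = macro_and_weighted_f1(per_class_f1(y_true, y_pred))
print(f"macro F1 = {macro_f1:.3f}, weighted F1 = {weighted_f1:.3f}")
```

With roughly balanced classes, as in this dataset, the two averages stay close to each other; a large gap between them would point to a classifier doing much better on the frequent classes than on the rare ones.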


In [64]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import fbeta_score

def evaluation_summary(description, true_labels, predictions, target_classes=None):
  # target_classes is kept so that existing calls still work, but it is not passed on.
  # classification_report matches target_names to the sorted labels by position, so a
  # list in any other order mislabels the rows. Leaving target_names unset makes
  # scikit-learn print the actual label of each row.
  print("Evaluation for: " + description)
  print(classification_report(true_labels, predictions, digits=3, zero_division=0))
  print('\nConfusion matrix:\n', confusion_matrix(true_labels, predictions)) # Note the order here is true, predicted
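One detail worth checking when display names are passed to `classification_report`: when no `labels` argument is given, the report rows follow the sorted order of the labels found in the data, and `target_names` are attached to those rows purely by position. For string labels, sorted order puts uppercase names before lowercase ones. The sketch below uses only the nine subreddit names from this dataset to show how a hand-picked display order lines up against the sorted order; the variable names are illustrative.

```python
# A display order chosen by hand (hypothetical choice of ordering).
hand_ordered = ["tea", "NintendoSwitch", "PS4", "Coffee", "pcgaming",
                "HydroHomies", "xbox", "antiMLM", "Soda"]

# Order in which the report rows are produced: plain sorted order,
# where uppercase letters sort before lowercase ones.
sorted_labels = sorted(hand_ordered)
print(sorted_labels)

# Pair each row (actual label) with the name that would be printed for it
# if hand_ordered were passed as target_names.
for actual, printed in zip(sorted_labels, hand_ordered):
    marker = "" if actual == printed else "  <- would be mislabelled"
    print(f"row for {actual!r:<18} printed as {printed!r}{marker}")
```

A quick sanity check on any report is to compare the `support` column with the `value_counts()` of the evaluated split: each row's support should equal the count of the label printed next to it.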

In [65]:
#Solution
from sklearn.dummy import DummyClassifier

# Note: both baselines are evaluated on the validation set here.

# Baseline 1: predicts labels at random, following the training label distribution.
# No random_state is set, so score() and predict() draw different random predictions
# and the two accuracies printed for this model need not agree.
dummy_prior = DummyClassifier(strategy='stratified')
dummy_prior.fit(train_features, train_labels)
print(dummy_prior.score(validation_features, validation_labels))

dummy_prior_predicted_labels = dummy_prior.predict(validation_features)
evaluation_summary("Dummy Prior", validation_labels, dummy_prior_predicted_labels)

# Baseline 2: always predicts the most frequent training label.
dummy_mf = DummyClassifier(strategy='most_frequent')
dummy_mf.fit(train_features, train_labels)
print(dummy_mf.score(validation_features, validation_labels))

dummy_mf_predicted_labels = dummy_mf.predict(validation_features)
evaluation_summary("Dummy Majority", validation_labels, dummy_mf_predicted_labels)

0.1325
Evaluation for: Dummy Prior
                precision    recall  f1-score   support

        Coffee      0.118     0.143     0.129        42
   HydroHomies      0.143     0.132     0.137        38
NintendoSwitch      0.098     0.096     0.097        52
           PS4      0.082     0.093     0.087        43
          Soda      0.048     0.047     0.047        43
       antiMLM      0.121     0.074     0.092        54
      pcgaming      0.071     0.070     0.071        43
           tea      0.152     0.146     0.149        48
          xbox      0.137     0.189     0.159        37

      accuracy                          0.107       400
     macro avg      0.108     0.110     0.108       400
  weighted avg      0.108     0.107     0.106       400


Confusion matrix:
 [[ 6  1  7  4  6  6  2  4  6]
 [ 5  5  5  2  6  3  6  2  4]
 [ 9  9  5  6  5  1  3  9  5]
 [ 1  3  9  4  7  4  8  4  3]
 [ 8  3  4  6  2  3  5  8  4]
 [11  4  8 12  2  4  3  4  6]
 [ 2  5  6  7  3  3  3  6  8]
 [ 5

In [66]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(solver='saga', random_state=101)
lr_model = lr.fit(train_features, train_labels)

lr_predicted_labels = lr_model.predict(test_features)
evaluation_summary("LR onehot", test_labels, lr_predicted_labels)

Evaluation for: LR onehot
                precision    recall  f1-score   support

        Coffee      0.806     0.893     0.847        56
   HydroHomies      0.810     0.895     0.850        38
NintendoSwitch      0.818     0.692     0.750        52
           PS4      0.600     0.688     0.641        48
          Soda      0.767     0.793     0.780        29
       antiMLM      0.868     0.750     0.805        44
      pcgaming      0.659     0.574     0.614        47
           tea      0.860     0.881     0.871        42
          xbox      0.733     0.750     0.742        44

      accuracy                          0.765       400
     macro avg      0.769     0.768     0.767       400
  weighted avg      0.768     0.765     0.764       400


Confusion matrix:
 [[50  0  0  1  2  0  1  2  0]
 [ 0 34  0  1  1  1  0  1  0]
 [ 3  0 36  3  0  0  5  0  5]
 [ 1  0  3 33  1  0  6  0  4]
 [ 1  3  0  1 23  0  0  1  0]
 [ 3  3  0  2  1 33  0  2  0]
 [ 0  2  4  9  0  2 27  0  3]
 [ 3  0  0  0



In [67]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfIdf_vectorizer = TfidfVectorizer()
train_features = tfIdf_vectorizer.fit_transform(df_train['body'])

# This creates input features for our classification on all subsets of our collection.
validation_features = tfIdf_vectorizer.transform(df_validation['body'])
test_features = tfIdf_vectorizer.transform(df_test['body'])

lr = LogisticRegression(solver='saga', random_state=101)
lr_model = lr.fit(train_features, train_labels)

lr_predicted_labels = lr_model.predict(test_features)
evaluation_summary("LR TF-IDF", test_labels, lr_predicted_labels)

Evaluation for: LR TF-IDF
                precision    recall  f1-score   support

        Coffee      0.895     0.911     0.903        56
   HydroHomies      0.968     0.789     0.870        38
NintendoSwitch      0.672     0.788     0.726        52
           PS4      0.569     0.688     0.623        48
          Soda      0.788     0.897     0.839        29
       antiMLM      0.841     0.841     0.841        44
      pcgaming      0.692     0.574     0.628        47
           tea      0.902     0.881     0.892        42
          xbox      0.806     0.659     0.725        44

      accuracy                          0.777       400
     macro avg      0.793     0.781     0.783       400
  weighted avg      0.787     0.777     0.778       400


Confusion matrix:
 [[51  0  0  0  2  0  0  3  0]
 [ 0 30  2  1  1  3  1  0  0]
 [ 1  1 41  4  1  1  1  0  2]
 [ 0  0  6 33  1  0  5  0  3]
 [ 0  0  0  2 26  0  0  1  0]
 [ 2  0  1  2  0 37  1  0  1]
 [ 0  0  8 10  0  1 27  0  1]
 [ 2  0  0  0

In [68]:
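from sklearn.svm import SVC

# Note: train_features and test_features were last assigned in the TF-IDF cell above,
# so this SVC is fitted on TF-IDF features rather than on the one-hot features.
svc = SVC(kernel="rbf")
svc_model = svc.fit(train_features, train_labels)

svc_predicted_labels = svc_model.predict(test_features)
evaluation_summary("SVC", test_labels, svc_predicted_labels)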

Evaluation for: SVC
                precision    recall  f1-score   support

        Coffee      0.960     0.857     0.906        56
   HydroHomies      1.000     0.763     0.866        38
NintendoSwitch      0.714     0.769     0.741        52
           PS4      0.449     0.729     0.556        48
          Soda      0.885     0.793     0.836        29
       antiMLM      0.850     0.773     0.810        44
      pcgaming      0.474     0.574     0.519        47
           tea      0.973     0.857     0.911        42
          xbox      0.852     0.523     0.648        44

      accuracy                          0.738       400
     macro avg      0.795     0.738     0.755       400
  weighted avg      0.785     0.738     0.750       400


Confusion matrix:
 [[48  0  0  2  2  0  3  1  0]
 [ 0 29  2  1  0  3  3  0  0]
 [ 0  0 40  7  0  1  3  0  1]
 [ 0  0  4 35  0  0  7  0  2]
 [ 0  0  0  4 23  0  2  0  0]
 [ 0  0  0  4  0 34  6  0  0]
 [ 0  0  7 11  0  1 27  0  1]
 [ 2  0  0  1  1  0

### Q1c: Choose your own classifier/tokenization/normalisation approach, and report on its performance with respect to the five previous ones on the test set. [2 marks] You should describe your selected classifier and vectorization approach including a justification for its appropriateness.

In [141]:
from sklearn.preprocessing import LabelEncoder

from tensorflow import keras
layers = keras.layers
models = keras.models
words = 250  # vocabulary size passed to the Keras Tokenizer; also the input width of the network

# Feed-forward network: one hidden layer of 512 ReLU units,
# followed by a softmax output over the 9 subreddits.
model = models.Sequential()
model.add(layers.Dense(512, input_shape=(words,)))
model.add(layers.Activation('relu'))
model.add(layers.Dense(9))
model.add(layers.Activation('softmax'))

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [142]:
df_train_dl = pd.DataFrame(train_data,columns = labels)
df_test_dl = pd.DataFrame(test_data,columns = labels)

In [143]:

tokenize = keras.preprocessing.text.Tokenizer(num_words=words, 
                                              char_level=False)
tokenize.fit_on_texts(df_train_dl["body"]) # fit tokenizer to our training text data
x_train = tokenize.texts_to_matrix(df_train_dl["body"])
x_test = tokenize.texts_to_matrix(df_test_dl["body"])

train_cat = df_train_dl["subreddit"]
test_cat = df_test_dl["subreddit"]

encoder = LabelEncoder()
encoder.fit(train_cat)

y_train = encoder.transform(train_cat)
y_test = encoder.transform(test_cat)

y_train = keras.utils.to_categorical(y_train, 9)
y_test = keras.utils.to_categorical(y_test, 9)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

history = model.fit(x_train, y_train,
                    batch_size=10,
                    epochs=3,
                    verbose=1)

(1200, 250)
(400, 250)
(1200, 9)
(400, 9)
Epoch 1/3
Epoch 2/3
Epoch 3/3


In [124]:
print(x_train)
print(x_test)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 1. 0.]
 [1. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 1. 0. 0.]]


In [144]:
score = model.evaluate(x_test, y_test,
                       batch_size=32, verbose=1)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

Test loss: 1.0720510482788086
Test accuracy: 0.6449999809265137
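
`model.evaluate` reports only loss and accuracy, whereas the five earlier classifiers are compared on precision, recall and F1. To put the network on the same footing, its softmax outputs first have to be turned back into subreddit names, after which the same evaluation function can be applied. The sketch below shows that decoding step in plain Python with toy numbers. `argmax`, `decode_predictions` and the toy values are illustrative; wiring it into the notebook would mean taking the probabilities from `model.predict(x_test)` and the class names from `encoder.classes_`, which is an assumption about how the pieces would be connected rather than something the notebook does.

```python
def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=values.__getitem__)


def decode_predictions(probabilities, class_names):
    """Map each row of class probabilities to the name of its most likely class."""
    return [class_names[argmax(row)] for row in probabilities]


# Toy stand-ins: three classes and three posts (hypothetical numbers).
toy_classes = ["Coffee", "Soda", "tea"]
toy_probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]

toy_predictions = decode_predictions(toy_probabilities, toy_classes)
print(toy_predictions)
```

With NumPy arrays the same step is usually written as `encoder.inverse_transform(probabilities.argmax(axis=1))`, which yields string labels that can be compared against `test_labels` directly.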





## Q2: Parameter tuning - Tune the parameters for both the vectorizer and classifier on the validation set (or using CV-fold validation on the train). [5 marks]

### Q2a:

In [74]:
tfIdf_vectorizer = TfidfVectorizer()
train_features = tfIdf_vectorizer.fit_transform(df_train['body'])

# This creates input features for our classification on all subsets of our collection.
validation_features = tfIdf_vectorizer.transform(df_validation['body'])
test_features = tfIdf_vectorizer.transform(df_test['body'])

lr = LogisticRegression(solver='lbfgs', random_state=101)
lr_model = lr.fit(train_features, train_labels)

lr_predicted_labels = lr_model.predict(test_features)
evaluation_summary("LR TF-IDF", test_labels, lr_predicted_labels)
lr.get_params().keys()

Evaluation for: LR TF-IDF
                precision    recall  f1-score   support

        Coffee      0.911     0.911     0.911        56
   HydroHomies      0.909     0.789     0.845        38
NintendoSwitch      0.695     0.788     0.739        52
           PS4      0.559     0.688     0.617        48
          Soda      0.788     0.897     0.839        29
       antiMLM      0.837     0.818     0.828        44
      pcgaming      0.675     0.574     0.621        47
           tea      0.902     0.881     0.892        42
          xbox      0.806     0.659     0.725        44

      accuracy                          0.775       400
     macro avg      0.787     0.778     0.779       400
  weighted avg      0.783     0.775     0.776       400


Confusion matrix:
 [[51  0  0  0  2  0  0  3  0]
 [ 0 30  2  1  1  3  1  0  0]
 [ 0  1 41  5  1  1  1  0  2]
 [ 0  0  5 33  1  0  6  0  3]
 [ 0  0  0  2 26  0  0  1  0]
 [ 2  2  0  2  0 36  1  0  1]
 [ 0  0  8 10  0  1 27  0  1]
 [ 2  0  0  0

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

In [70]:
from sklearn.pipeline import Pipeline

prediction_pipeline = Pipeline([
              ('Tf-idf', TfidfVectorizer()),
              ('logreg', LogisticRegression(random_state=101))
              ])

In [148]:
from sklearn.model_selection import GridSearchCV
params = {
   'Tf-idf__sublinear_tf': (True, False),
   'Tf-idf__max_features': [3000, 4000, 5400, 6300, 7000, 8000, 9000],
   'logreg__solver': ["sag", "saga", "lbfgs", "newton-cg"],
   # Note: 1000 is listed twice, so only four distinct C values are searched.
   'logreg__C': [1, 10, 1000, 1000, 10000]
}
grid_search = GridSearchCV(prediction_pipeline, param_grid=params, n_jobs=1, verbose=1, scoring='f1_weighted', cv=2)
print("Performing grid search...")
print("pipeline:", [name for name, _ in prediction_pipeline.steps])
print("parameters:")
print(params)
grid_search.fit(df_train['body'], train_labels)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(params.keys()):
  print("\t%s: %r" % (param_name, best_parameters[param_name]))

Performing grid search...
pipeline: ['Tf-idf', 'logreg']
parameters:
{'Tf-idf__sublinear_tf': (True, False), 'Tf-idf__max_features': [3000, 4000, 5400, 6300, 7000, 8000, 9000], 'logreg__solver': ['sag', 'saga', 'lbfgs', 'newton-cg'], 'logreg__C': [1, 10, 1000, 1000, 10000]}
Fitting 2 folds for each of 280 candidates, totalling 560 fits


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Best score: 0.741
Best parameters set:
	Tf-idf__max_features: 6300
	Tf-idf__sublinear_tf: True
	logreg__C: 1000
	logreg__solver: 'sag'
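
The log line "Fitting 2 folds for each of 280 candidates, totalling 560 fits" follows directly from the size of the grid: every combination of values is one candidate. The sketch below recounts the candidates with `itertools.product` and shows how many remain once repeated values are dropped. `count_candidates` and `deduplicate` are illustrative helper names, not scikit-learn functions; the grid itself is copied from the cell above.

```python
from itertools import product

param_grid = {
    "Tf-idf__sublinear_tf": (True, False),
    "Tf-idf__max_features": [3000, 4000, 5400, 6300, 7000, 8000, 9000],
    "logreg__solver": ["sag", "saga", "lbfgs", "newton-cg"],
    "logreg__C": [1, 10, 1000, 1000, 10000],
}


def count_candidates(grid):
    """Number of parameter combinations an exhaustive search would try."""
    return len(list(product(*grid.values())))


def deduplicate(grid):
    """Drop repeated values from each parameter list while keeping their order."""
    return {name: list(dict.fromkeys(values)) for name, values in grid.items()}


n_all = count_candidates(param_grid)                    # 2 * 7 * 4 * 5
n_distinct = count_candidates(deduplicate(param_grid))  # 2 * 7 * 4 * 4
folds = 2

print(f"{n_all} candidates, {n_all * folds} fits")
print(f"{n_distinct} distinct candidates, {n_distinct * folds} fits")
```

Counting the grid before launching the search is a cheap way to estimate run time, which matters here because several of the fits stop at the iteration limit and a larger `max_iter` would make each fit slower.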




In [149]:
tuned_pipeline = Pipeline([
              ('Tf-idf', TfidfVectorizer(max_features=6300, sublinear_tf=True)),
              ('logreg', LogisticRegression(C=1000, solver='sag', random_state=101))
              ])

lr_model = tuned_pipeline.fit(df_train['body'], train_labels)

lr_predicted_labels = lr_model.predict(df_test['body'])
evaluation_summary("LR TF-IDF", test_labels, lr_predicted_labels)

Evaluation for: LR TF-IDF
                precision    recall  f1-score   support

        Coffee      0.862     0.893     0.877        56
   HydroHomies      0.970     0.842     0.901        38
NintendoSwitch      0.750     0.750     0.750        52
           PS4      0.571     0.667     0.615        48
          Soda      0.758     0.862     0.806        29
       antiMLM      0.778     0.795     0.787        44
      pcgaming      0.725     0.617     0.667        47
           tea      0.881     0.881     0.881        42
          xbox      0.732     0.682     0.706        44

      accuracy                          0.773       400
     macro avg      0.781     0.777     0.777       400
  weighted avg      0.778     0.772     0.773       400


Confusion matrix:
 [[50  0  0  0  4  0  0  2  0]
 [ 1 32  0  0  0  4  1  0  0]
 [ 0  0 39  8  0  0  1  0  4]
 [ 0  0  6 32  1  2  3  0  4]
 [ 1  1  1  1 25  0  0  0  0]
 [ 3  0  1  1  1 35  1  2  0]
 [ 0  0  4  8  0  2 29  1  3]
 [ 2  0  0  0



### Q2b:

## Q3:

### Q3a: In this task your goal is to add two features to (try to) improve subreddit classification performance obtained in Q2.
You must implement and describe two new classifier features and add them to the tuned model from Q2. Examples include adding other properties of the posts, leveraging embedding-based features, different vectorization approaches, etc. (This is your chance to be creative!) As before, report the results in terms of evaluation metrics on the test data. Additionally, include a well-labelled confusion matrix and discuss the result in reference to Q2 and what helped (or didn’t) and why you think so. In summary:

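I would use the post title as additional training text, since titles within a subreddit would discuss similar topics, and the author name as a second feature. In Q3b each of them is appended to the body text before TF-IDF vectorization.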

### Q3b:

In [73]:
df_train2 = pd.DataFrame(train_data,columns = labels)

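# Note: the two columns are joined without a separator, so the last word of the body
# and the first word of the title are merged into a single token.
df_train2["both"] = df_train2["body"] + df_train2["title"]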

df_train2=df_train2.drop(['title', 'id','score','author'], axis = 1)
#print(df_train2['body'].iloc[0])
#print(df_train2['both'].iloc[0])
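
When two text fields are combined into one input for a vectorizer, plain string concatenation joins them with no whitespace in between, and the same combination has to be applied to every split that is later transformed. The sketch below is one hedged way to do this on the raw post dicts. `combine_fields` and the toy post are hypothetical; it assumes posts are dicts with `title` and `body` keys, as the column list used in this notebook suggests.

```python
def combine_fields(post, fields=("title", "body"), sep=" "):
    """Join the chosen text fields of one post, skipping missing or empty ones."""
    parts = [post.get(field) or "" for field in fields]
    return sep.join(part for part in parts if part)


# Toy post (hypothetical content).
toy_post = {"title": "Need advice", "body": "My kettle broke"}

combined = combine_fields(toy_post, fields=("body", "title"))
print(combined)

# The same helper would be mapped over every split so that training and
# evaluation texts are built in exactly the same way, for example:
#   train_texts = [combine_fields(p) for p in train_data]
#   test_texts = [combine_fields(p) for p in test_data]
```

Keeping the combination in one function makes it harder for the training and test inputs to drift apart, which would otherwise make the reported scores difficult to interpret.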

In [25]:
tfIdf_vectorizer = TfidfVectorizer()

train_features = tfIdf_vectorizer.fit_transform(df_train2['both'])

# This creates input features for the validation and test subsets.
# Note: these are built from the body text only, while the vectorizer
# and the classifier are fitted on body + title.
validation_features = tfIdf_vectorizer.transform(df_validation['body'])
test_features = tfIdf_vectorizer.transform(df_test['body'])

lr = LogisticRegression(solver='saga', random_state=101)
lr_model = lr.fit(train_features, train_labels)

lr_predicted_labels = lr_model.predict(test_features)
evaluation_summary("LR TF-IDF", test_labels, lr_predicted_labels)
lr.get_params().keys()

Evaluation for: LR TF-IDF
                precision    recall  f1-score   support

        Coffee      0.962     0.911     0.936        56
   HydroHomies      0.968     0.789     0.870        38
NintendoSwitch      0.745     0.788     0.766        52
           PS4      0.500     0.688     0.579        48
          Soda      0.806     0.862     0.833        29
       antiMLM      0.857     0.818     0.837        44
      pcgaming      0.600     0.638     0.619        47
           tea      0.950     0.905     0.927        42
          xbox      0.875     0.636     0.737        44

      accuracy                          0.780       400
     macro avg      0.807     0.782     0.789       400
  weighted avg      0.803     0.780     0.786       400


Confusion matrix:
 [[51  1  0  0  2  0  1  1  0]
 [ 0 30  2  2  1  2  1  0  0]
 [ 0  0 41  5  1  1  3  0  1]
 [ 0  0  5 33  1  0  7  0  2]
 [ 0  0  0  3 25  0  0  1  0]
 [ 0  0  0  4  0 36  4  0  0]
 [ 0  0  4 11  0  1 30  0  1]
 [ 1  0  0  0

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

In [154]:
df_train3 = pd.DataFrame(train_data,columns = labels)
df_validation3 = pd.DataFrame(validation_data,columns = labels)
df_test3 = pd.DataFrame(test_data,columns = labels)

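# Note: body and author are joined without a separator, so the author name
# is attached directly to the last word of the body.
df_train3["both"] = df_train3["body"] + df_train3["author"]
df_validation3["both"] = df_validation3["body"] + df_validation3["author"]
df_test3["both"] = df_test3["body"] + df_test3["author"]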
df_train3.head(5)

Unnamed: 0,author,body,id,score,subreddit,title,both
0,XC-XERZ,"Long story short, I saw ESO in my library, dow...",queqfu,0,PS4,Can I get banned for having a game that I didn...,"Long story short, I saw ESO in my library, dow..."
1,ZachTheKing,I have seen a video online where someone took ...,1eujoa,0,pcgaming,How to get a Kinect sensor to work with a PC?,I have seen a video online where someone took ...
2,BluePenguin2002,"Hi, hope this is the right place/way to post t...",m00bx7,5,NintendoSwitch,Switch Only Charges with GoPro Cable,"Hi, hope this is the right place/way to post t..."
3,100fluffyclouds,After buying a majority share in Limelight/Alc...,q13pvx,15,antiMLM,L’Occitane going down MLM route?,After buying a majority share in Limelight/Alc...
4,Epicskeleton53,Is it ok for me to drink coffee in the morning...,rxiv2g,2,HydroHomies,Guys i need your help,Is it ok for me to drink coffee in the morning...


In [156]:
tfIdf_vectorizer = TfidfVectorizer()

train_features = tfIdf_vectorizer.fit_transform(df_train3['both'])

# This creates input features for our classification on all subsets of our collection.
validation_features = tfIdf_vectorizer.transform(df_validation3['both'])
test_features = tfIdf_vectorizer.transform(df_test3['both'])

lr = LogisticRegression(solver='saga', random_state=101)
lr_model = lr.fit(train_features, train_labels)

lr_predicted_labels = lr_model.predict(test_features)
evaluation_summary("LR TF-IDF", test_labels, lr_predicted_labels)
lr.get_params().keys()

Evaluation for: LR TF-IDF
                precision    recall  f1-score   support

        Coffee      0.895     0.911     0.903        56
   HydroHomies      0.935     0.763     0.841        38
NintendoSwitch      0.667     0.808     0.730        52
           PS4      0.559     0.688     0.617        48
          Soda      0.806     0.862     0.833        29
       antiMLM      0.833     0.795     0.814        44
      pcgaming      0.676     0.532     0.595        47
           tea      0.860     0.881     0.871        42
          xbox      0.784     0.659     0.716        44

      accuracy                          0.765       400
     macro avg      0.780     0.767     0.769       400
  weighted avg      0.774     0.765     0.765       400


Confusion matrix:
 [[51  0  0  0  2  0  0  3  0]
 [ 0 29  2  2  0  3  1  1  0]
 [ 0  0 42  4  1  1  2  0  2]
 [ 0  0  6 33  1  0  5  0  3]
 [ 1  0  0  1 25  0  0  2  0]
 [ 2  2  0  2  0 35  1  0  2]
 [ 0  0  9 11  0  1 25  0  1]
 [ 2  0  0  0

dict_keys(['C', 'class_weight', 'dual', 'fit_intercept', 'intercept_scaling', 'l1_ratio', 'max_iter', 'multi_class', 'n_jobs', 'penalty', 'random_state', 'solver', 'tol', 'verbose', 'warm_start'])

### Q3c: