# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Use the class notebooks as a reference to implement the model, then train and test it on the corresponding datasets.

You can follow these steps:
1. Read training-test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Experimentation (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__
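
As a rough illustration of what a K Nearest Neighbors classifier does at prediction time, the sketch below finds the k training points closest to a query and takes a majority vote over their labels. It is a plain-Python toy, not scikit-learn's implementation; the function name `knn_predict` and the two-dimensional points are made up for the example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: label 1 clusters near (1, 1), label 0 near (5, 5)
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = [1, 1, 1, 0, 0, 0]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))
print(knn_predict(train_points, train_labels, (4.9, 5.0), k=3))
```

In the assignment itself the points are high-dimensional text vectors rather than 2-D coordinates, but the neighbour search and vote follow the same idea.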

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

#### __Training data:__
Let's read our training data. Here, we have the text and label fields. Label is 1 for positive reviews and 0 for negative reviews.

In [2]:
import pandas as pd

train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
train_df.head()

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0


#### __Test data:__

In [3]:
import pandas as pd

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
test_df.head()

Unnamed: 0,text,label
0,What I hoped for (or even expected) was the we...,0
1,Garden State must rate amongst the most contri...,0
2,There is a lot wrong with this film. I will no...,1
3,"To qualify my use of ""realistic"" in the summar...",1
4,Dirty War is absolutely one of the best politi...,1


## 2. Train a KNN Classifier
Here, you will apply the pre-processing operations we covered in class. Then you can split your dataset into training and validation sets. For your first submission, you will use the __K Nearest Neighbors Classifier__. It is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).

In [4]:
#Count entities in dataset
train_df["label"].value_counts()

#Count missing values
print(train_df.isna().sum())

#Text Processing
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')

stop = stopwords.words('english')

excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]

stop_words = [word for word in stop if word not in excluding]

snow = SnowballStemmer('english')

def process_text(texts):
    final_text_list=[]
    for sent in texts:

        # Treat missing or non-string entries as empty text
        if not isinstance(sent, str):
            sent = ""

        filtered_sentence=[]

        sent = sent.lower()
        sent = sent.strip()
        sent = re.sub(r'\s+', ' ', sent)   # Collapse runs of whitespace
        sent = re.sub(r'<.*?>', '', sent)  # Remove HTML tags

        for w in word_tokenize(sent):

            if(not w.isnumeric()) and (len(w)>2) and (w not in stop_words):

                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence)

        final_text_list.append(final_string)

    return final_text_list

#Training - Validation Split
from sklearn.model_selection import train_test_split

X=train_df[["text"]]
Y=train_df["label"]
X_train, X_val, y_train, y_val = train_test_split(X,
                                                  Y,
                                                  test_size=0.10,
                                                  shuffle=True,
                                                  random_state=324
                                                 )

text     0
label    0
dtype: int64


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [5]:
print("Processing the text fields")
train_text_list = process_text(X_train["text"].tolist())
val_text_list = process_text(X_val["text"].tolist())


Processing the text fields


In [6]:
#Processing Data with Pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier


import gensim
from gensim.models import Word2Vec

w2v = gensim.models.Word2Vec()  # Not used by the pipeline below
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
    #( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10)),
    ('knn', KNeighborsClassifier())
                                ])

## 3. Make predictions on your test dataset

Once we select our best-performing model, we can use it to make predictions on the test dataset. Call __.fit()__ on your training data with the best-performing K value, then call __.predict()__ on your test data to get your test predictions.

In [7]:
from sklearn import set_config
set_config(display='diagram')
pipeline

#Train Classifier
X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

In [8]:
#Test Classifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[648 632]
 [555 665]]
              precision    recall  f1-score   support

           0       0.54      0.51      0.52      1280
           1       0.51      0.55      0.53      1220

    accuracy                           0.53      2500
   macro avg       0.53      0.53      0.53      2500
weighted avg       0.53      0.53      0.53      2500

Accuracy (validation): 0.5252
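
The cells above score the pipeline on the validation split only. A minimal sketch of the test-set step described in this section follows. It uses tiny made-up stand-ins (`toy_train_texts`, `toy_train_labels`, `toy_test_texts`) so that it runs on its own; in the notebook the inputs would presumably be the processed training texts and `process_text(test_df["text"].tolist())`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up, already-cleaned texts standing in for the processed IMDB reviews
toy_train_texts = [
    "great film love", "wonder act great stori", "love this movi great",
    "terribl film bore", "bad act terribl stori", "bore bad movi",
]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test_texts = ["great stori love", "terribl bore movi"]

# n_neighbors=3 is chosen only because the toy set is tiny
test_pipeline = Pipeline([
    ("text_vect", TfidfVectorizer(use_idf=True)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
])

# Fit on the training texts, then predict on the unseen test texts
test_pipeline.fit(toy_train_texts, toy_train_labels)
test_predictions = test_pipeline.predict(toy_test_texts)
print(test_predictions)
```

Because the vectorizer lives inside the pipeline, the test texts are transformed with the vocabulary learned from the training texts, so no separate `transform` call is needed.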


## 4. Experimentation

For each of the following tasks, track both the **weighted F1-score** and **accuracy**:

1. **Change the binary parameter in CountVectorizer**: Test both `binary=True` and `binary=False`, and evaluate performance.
2. **Switch to TfidfVectorizer**: Replace the CountVectorizer with TfidfVectorizer and compare results.
3. **Adjust the max_features**: Experiment with different values of `max_features` for both TfidfVectorizer and CountVectorizer (`binary=True`).
4. **Optimize KNN**: Select the best-performing model from task 3 and vary the number of neighbors (`n_neighbors`) in the KNN classifier.
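
To make the two tracked metrics concrete, the sketch below recomputes accuracy and the support-weighted F1-score by hand from the confusion matrix printed for the baseline pipeline (rows are true labels, columns are predicted labels, as in scikit-learn's `confusion_matrix`). The helper name `accuracy_and_weighted_f1` is made up for this example.

```python
def accuracy_and_weighted_f1(cm):
    """Compute accuracy and support-weighted F1 from a square confusion matrix."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n_classes))

    weighted_f1 = 0.0
    for i in range(n_classes):
        true_pos = cm[i][i]
        support = sum(cm[i])                                 # samples whose true label is i
        predicted = sum(cm[r][i] for r in range(n_classes))  # samples predicted as i
        # F1 = 2*TP / (predicted + support); taken as 0 when the denominator is 0
        denom = predicted + support
        f1 = 2 * true_pos / denom if denom else 0.0
        weighted_f1 += f1 * support / total

    return correct / total, weighted_f1

# Confusion matrix printed by the baseline run above
baseline_cm = [[648, 632],
               [555, 665]]
accuracy, weighted_f1 = accuracy_and_weighted_f1(baseline_cm)
print(f"accuracy={accuracy:.4f}, weighted F1={weighted_f1:.2f}")
```

The printed values agree with the `weighted avg` row and the accuracy line of the baseline classification report (0.53 and 0.5252).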


In [9]:
# Task 1

w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
    #( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[648 632]
 [555 665]]
              precision    recall  f1-score   support

           0       0.54      0.51      0.52      1280
           1       0.51      0.55      0.53      1220

    accuracy                           0.53      2500
   macro avg       0.53      0.53      0.53      2500
weighted avg       0.53      0.53      0.53      2500

Accuracy (validation): 0.5252


In [10]:
# Task 1

w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=False,
    #( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

# When CountVectorizer(binary=True): weighted f1-score = 0.53 and Accuracy (validation) = 0.5252
# When CountVectorizer(binary=False): weighted f1-score = 0.51 and Accuracy (validation) = 0.5148

[[662 618]
 [595 625]]
              precision    recall  f1-score   support

           0       0.53      0.52      0.52      1280
           1       0.50      0.51      0.51      1220

    accuracy                           0.51      2500
   macro avg       0.51      0.51      0.51      2500
weighted avg       0.52      0.51      0.51      2500

Accuracy (validation): 0.5148
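
For reference, the difference between the two `binary` settings can be seen on a pair of made-up sentences: with `binary=False` each cell holds the number of times a word occurs, while with `binary=True` it only records whether the word occurs at all. The texts and variable names below are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_texts = ["good good movie", "bad movie"]  # made-up examples

counts = CountVectorizer(binary=False).fit_transform(toy_texts).toarray()
flags = CountVectorizer(binary=True).fit_transform(toy_texts).toarray()

# Columns follow the alphabetically sorted vocabulary: bad, good, movie
print(counts.tolist())  # [[0, 2, 1], [1, 0, 1]]
print(flags.tolist())   # [[0, 1, 1], [1, 0, 1]]
```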


In [11]:
# Task 2
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=False,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[681 599]
 [588 632]]
              precision    recall  f1-score   support

           0       0.54      0.53      0.53      1280
           1       0.51      0.52      0.52      1220

    accuracy                           0.53      2500
   macro avg       0.53      0.53      0.53      2500
weighted avg       0.53      0.53      0.53      2500

Accuracy (validation): 0.5252


In [12]:
# Task 2

w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=False,
    ( 'text_vect', TfidfVectorizer(use_idf=False,
                                  max_features=10)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

# When TfidfVectorizer(use_idf=True): weighted f1-score = 0.53 and Accuracy (validation) = 0.5252
# When TfidfVectorizer(use_idf=False): weighted f1-score = 0.52 and Accuracy (validation) = 0.524

[[685 595]
 [595 625]]
              precision    recall  f1-score   support

           0       0.54      0.54      0.54      1280
           1       0.51      0.51      0.51      1220

    accuracy                           0.52      2500
   macro avg       0.52      0.52      0.52      2500
weighted avg       0.52      0.52      0.52      2500

Accuracy (validation): 0.524
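
A small sketch of what `use_idf` changes, again on made-up sentences: without IDF the vector is just the length-normalized term counts, whereas with IDF a word that appears in every text is down-weighted relative to rarer words. The texts and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_texts = ["good good movie", "bad movie"]  # made-up examples

tf_only = TfidfVectorizer(use_idf=False).fit_transform(toy_texts).toarray()
tf_idf = TfidfVectorizer(use_idf=True).fit_transform(toy_texts).toarray()

# Columns follow the alphabetically sorted vocabulary: bad, good, movie
good, movie = 1, 2

# Without IDF, "good" (2 occurrences) weighs twice as much as "movie" (1 occurrence) in the first text
ratio_tf = tf_only[0, good] / tf_only[0, movie]
# With IDF, "movie" occurs in every text and is down-weighted, so the ratio grows
ratio_tfidf = tf_idf[0, good] / tf_idf[0, movie]

print(ratio_tf, ratio_tfidf)
```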


In [14]:
# Task 3

w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
    #( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[648 632]
 [555 665]]
              precision    recall  f1-score   support

           0       0.54      0.51      0.52      1280
           1       0.51      0.55      0.53      1220

    accuracy                           0.53      2500
   macro avg       0.53      0.53      0.53      2500
weighted avg       0.53      0.53      0.53      2500

Accuracy (validation): 0.5252


In [15]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
    #( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=100)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[783 497]
 [442 778]]
              precision    recall  f1-score   support

           0       0.64      0.61      0.63      1280
           1       0.61      0.64      0.62      1220

    accuracy                           0.62      2500
   macro avg       0.62      0.62      0.62      2500
weighted avg       0.63      0.62      0.62      2500

Accuracy (validation): 0.6244


In [13]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
    #( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=1000)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[753 527]
 [395 825]]
              precision    recall  f1-score   support

           0       0.66      0.59      0.62      1280
           1       0.61      0.68      0.64      1220

    accuracy                           0.63      2500
   macro avg       0.63      0.63      0.63      2500
weighted avg       0.63      0.63      0.63      2500

Accuracy (validation): 0.6312


In [18]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True,
    #( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10000)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[770 510]
 [464 756]]
              precision    recall  f1-score   support

           0       0.62      0.60      0.61      1280
           1       0.60      0.62      0.61      1220

    accuracy                           0.61      2500
   macro avg       0.61      0.61      0.61      2500
weighted avg       0.61      0.61      0.61      2500

Accuracy (validation): 0.6104


In [16]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[681 599]
 [588 632]]
              precision    recall  f1-score   support

           0       0.54      0.53      0.53      1280
           1       0.51      0.52      0.52      1220

    accuracy                           0.53      2500
   macro avg       0.53      0.53      0.53      2500
weighted avg       0.53      0.53      0.53      2500

Accuracy (validation): 0.5252


In [17]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=100)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[841 439]
 [395 825]]
              precision    recall  f1-score   support

           0       0.68      0.66      0.67      1280
           1       0.65      0.68      0.66      1220

    accuracy                           0.67      2500
   macro avg       0.67      0.67      0.67      2500
weighted avg       0.67      0.67      0.67      2500

Accuracy (validation): 0.6664


In [19]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=1000)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[875 405]
 [299 921]]
              precision    recall  f1-score   support

           0       0.75      0.68      0.71      1280
           1       0.69      0.75      0.72      1220

    accuracy                           0.72      2500
   macro avg       0.72      0.72      0.72      2500
weighted avg       0.72      0.72      0.72      2500

Accuracy (validation): 0.7184


In [20]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10000)),
    ('knn', KNeighborsClassifier())
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

# Best performing model in Task 3 is TfidfVectorizer(use_idf=True, max_features=10000)

[[ 914  366]
 [ 197 1023]]
              precision    recall  f1-score   support

           0       0.82      0.71      0.76      1280
           1       0.74      0.84      0.78      1220

    accuracy                           0.77      2500
   macro avg       0.78      0.78      0.77      2500
weighted avg       0.78      0.77      0.77      2500

Accuracy (validation): 0.7748


In [21]:
# Task 4

w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10000)),
    ('knn', KNeighborsClassifier(n_neighbors=10))
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[1000  280]
 [ 243  977]]
              precision    recall  f1-score   support

           0       0.80      0.78      0.79      1280
           1       0.78      0.80      0.79      1220

    accuracy                           0.79      2500
   macro avg       0.79      0.79      0.79      2500
weighted avg       0.79      0.79      0.79      2500

Accuracy (validation): 0.7908


In [22]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10000)),
    ('knn', KNeighborsClassifier(n_neighbors=100))
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[1035  245]
 [ 229  991]]
              precision    recall  f1-score   support

           0       0.82      0.81      0.81      1280
           1       0.80      0.81      0.81      1220

    accuracy                           0.81      2500
   macro avg       0.81      0.81      0.81      2500
weighted avg       0.81      0.81      0.81      2500

Accuracy (validation): 0.8104


In [23]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10000)),
    ('knn', KNeighborsClassifier(n_neighbors=1000))
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

[[1152  128]
 [ 376  844]]
              precision    recall  f1-score   support

           0       0.75      0.90      0.82      1280
           1       0.87      0.69      0.77      1220

    accuracy                           0.80      2500
   macro avg       0.81      0.80      0.80      2500
weighted avg       0.81      0.80      0.80      2500

Accuracy (validation): 0.7984


In [24]:
w2v = gensim.models.Word2Vec()
pipeline = Pipeline([
    #('text_vect', CountVectorizer(binary=True,
    ( 'text_vect', TfidfVectorizer(use_idf=True,
                                  max_features=10000)),
    ('knn', KNeighborsClassifier(n_neighbors=10000))
                                ])

from sklearn import set_config
set_config(display='diagram')
pipeline

X_train = train_text_list
X_val = val_text_list

pipeline.fit(X_train, y_train.values)

from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

val_predictions = pipeline.predict(X_val)
print(confusion_matrix(y_val.values, val_predictions))
print(classification_report(y_val.values, val_predictions))
print("Accuracy (validation):", accuracy_score(y_val.values, val_predictions))

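# Best performing model is TfidfVectorizer(use_idf=True, max_features=10000) with KNeighborsClassifier(n_neighbors=100): weighted f1-score = 0.81 and Accuracy (validation) = 0.8104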

[[1217   63]
 [ 613  607]]
              precision    recall  f1-score   support

           0       0.67      0.95      0.78      1280
           1       0.91      0.50      0.64      1220

    accuracy                           0.73      2500
   macro avg       0.79      0.72      0.71      2500
weighted avg       0.78      0.73      0.71      2500

Accuracy (validation): 0.7296
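
The experiment cells repeat the same build, fit, and score steps for each setting. One way to compress that is a small helper plus a loop, sketched below. The helper name `evaluate_knn` and the toy texts are made up so the block runs on its own; in the notebook the inputs would presumably be `train_text_list`, `y_train.values`, `val_text_list`, and `y_val.values`, with the candidate `n_neighbors` values taken from the runs in this section.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

def evaluate_knn(train_texts, train_labels, val_texts, val_labels,
                 n_neighbors, max_features=None):
    """Fit a TF-IDF + KNN pipeline and return (accuracy, weighted F1) on the validation data."""
    pipeline = Pipeline([
        ("text_vect", TfidfVectorizer(use_idf=True, max_features=max_features)),
        ("knn", KNeighborsClassifier(n_neighbors=n_neighbors)),
    ])
    pipeline.fit(train_texts, train_labels)
    predictions = pipeline.predict(val_texts)
    return (accuracy_score(val_labels, predictions),
            f1_score(val_labels, predictions, average="weighted"))

# Made-up stand-ins for the processed training and validation texts
toy_train_texts = [
    "great film love", "wonder act great stori", "love this movi great",
    "terribl film bore", "bad act terribl stori", "bore bad movi",
]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_val_texts = ["great stori love", "terribl bore movi"]
toy_val_labels = [1, 0]

# Score each candidate K and keep the results for comparison
results = {}
for k in (1, 3, 5):
    results[k] = evaluate_knn(toy_train_texts, toy_train_labels,
                              toy_val_texts, toy_val_labels, n_neighbors=k)
    print(f"n_neighbors={k}: accuracy={results[k][0]:.4f}, weighted F1={results[k][1]:.4f}")

# Pick the setting with the highest validation accuracy
best_k = max(results, key=lambda k: results[k][0])
print("best n_neighbors:", best_k)
```

Keeping the scores in a dictionary makes it easy to compare settings side by side instead of scrolling through separate cell outputs.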
