<a href="https://colab.research.google.com/github/SirigineediDivya/FMML_LABS_PROJECTS_26/blob/main/Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-3 : Using KNN for Text Classification
#### Module Coordinator: Jashn Arora


---

## **Section 1: Understanding NLP tools**

In this lab we will be using KNN on a real world NLP application i.e. is text classification. But first look at some NLP techniques for text classification and tools that we use when we want to use python for NLP.

## Section 1.2: Data Cleaning and Preprocessing step

Raw text must be processed and converted into a form so that it is suitable to use with various machine-learning algorithms.  
In case of text, there are lots of things that need to be taken into account.  


1.   Removing numbers from the text
2.   Handling capitalization and punctuation.
3.   Stemming and Lemmatizing text.  

And most importantly, one can't just use words or images directly in algorithms; they need to be converted into vectors- a form that algorithms can understand.



### **NLTK**
NLTK (or Natural Language Tool Kit) is a commonly used library for processing text. We will use this tool in this lab. Lets first install it.


In [1]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


True

In [2]:
import re
import numpy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from bs4 import BeautifulSoup

def cleanText(text, lemmatize, stemmer):
    """Method for cleaning text from train and test data. Removes numbers, punctuation, and capitalization. Stems or lemmatizes text."""

    if isinstance(text, float):
        text = str(text)
    if isinstance(text, numpy.int64):
        text = str(text)
    try:
        text = text.decode()
    except AttributeError:
        pass

    soup = BeautifulSoup(text, "lxml")
    text = soup.get_text()
    text = re.sub(r"[^A-Za-z]", " ", text)
    text = text.lower()


    if lemmatize:
        wordnet_lemmatizer = WordNetLemmatizer()

        def get_tag(tag):
            if tag.startswith('J'):
                return wordnet.ADJ
            elif tag.startswith('V'):
                return wordnet.VERB
            elif tag.startswith('N'):
                return wordnet.NOUN
            elif tag.startswith('R'):
                return wordnet.ADV
            else:
                return ''

        text_result = []
        tokens = word_tokenize(text)  # Generate list of tokens
        tagged = pos_tag(tokens)
        for t in tagged:
            try:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0], get_tag(t[1][:2])))
            except:
                text_result.append(wordnet_lemmatizer.lemmatize(t[0]))
        return text_result

    if stemmer:
        text_result = []
        tokens = word_tokenize(text)
        snowball_stemmer = SnowballStemmer('english')
        for t in tokens:
            text_result.append(snowball_stemmer.stem(t))
        return text_result

In [44]:
sample_text = "Troubl
sample_text_result = cleanText(sample_text, lemmatize=False, stemmer=True)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text)
print(sample_text_result)
sample_text_result = cleanText(sample_text, lemmatize=True, stemmer=False)
sample_text_result = " ".join(str(x) for x in sample_text_result)
print(sample_text_result)

LookupError: 
**********************************************************************
  Resource [93mpunkt_tab[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('punkt_tab')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mtokenizers/punkt_tab/english/[0m

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************


## Section 1.2: BAG OF WORDS

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.

The approach is very simple and flexible, and can be used in many ways for extracting features from documents.

A bag-of-words is a representation of text that describes the occurrence of words within a document.
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.

In [4]:
# Functions to convert document(s) to a list of words, with the option of removing stopwords. Returns document-term matrix.

def createBagOfWords(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = CountVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer = CountVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    bag_of_words_train = vectorizer.fit_transform(clean_train).toarray()
    bag_of_words_test = vectorizer.transform(clean_test).toarray()
    return bag_of_words_train, bag_of_words_test

## Section 1.3: TF-IDF
TF-IDF technique is used to find meaning of sentences consisting of words and cancels out the incapabilities of Bag of Words technique which is good for text classification or for helping a machine read words in numbers.

The number of times a term occurs in a document is called its Term frequency (TF).

 Document frequency is the number of documents in which the word is present.  Inverse DF (IDF) is the inverse of the document frequency which measures the informativeness of term *t*.




In [38]:
def createTFIDF(train, test, remove_stopwords, lemmatize, stemmer):
    if remove_stopwords:
        vectorizer = TfidfVectorizer(analyzer='word', input='content', stop_words=stopwords.words('english'))
    else:
        vectorizer =  TfidfVectorizer(analyzer='word', input='content')

    clean_train = []
    for paragraph in train:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_train.append(paragraph)

    clean_test = []
    for paragraph in test:
        paragraph_result = cleanText(paragraph, lemmatize, stemmer)
        paragraph = " ".join(str(x) for x in paragraph_result)
        clean_test.append(paragraph)

    tfidf_train = vectorizer.fit_transform(clean_train).toarray()
    tfidf_test = vectorizer.transform(clean_test).toarray()
    return tfidf_train, tfidf_test

# **Section 2: UNDERSTANDING THE DATA : A REVIEWS DATASET**

Sentiment analysis is the interpretation and classification of emotions (such as positive, negative and neutral) within text data using text analysis techniques.  
Given below is a dataset consisting of reviews along with sentiment class (positive or negative).

In [6]:
# Upload the Reviews CSV file that has been shared with you.
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving spam.csv to spam.csv


In [29]:
import pandas as pd
df = pd.read_csv('reviews.csv')

In [28]:
df = df.dropna()

In [27]:
df.to_csv('reviews.csv', index=False)

# **Section 3: KNN MODEL**

Given below are two KNN models; in the first case we are using Bag-of-Words and in the second case we are using TF-IDF.
Note the different metrics and parameters used in each.

In [26]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 1: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('reviews.csv')
    X_train, X_test, y_train, y_test = train_test_split(training_data["sentence"], training_data["sentiment"],
                                                        test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    # print(X_train)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2,
                                         metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

Note: Cross-validation will be discussed in detail in the upcoming lab session.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Define the function that uses TF-IDF and KNN
def tfidf_knn(X, y):
    # Step 1: Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Step 2: Initialize the TF-IDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')

    # Step 3: Fit and transform the training data using TF-IDF
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

    # Step 4: Transform the test data
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    # Step 5: Initialize the KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=5)

    # Step 6: Train the KNN model
    knn.fit(X_train_tfidf, y_train)

    # Step 7: Make predictions on the test set
    predicted = knn.predict(X_test_tfidf)

    # Step 8: Evaluate the model accuracy
    accuracy = accuracy_score(y_test, predicted)
    print(f"Accuracy: {accuracy:.4f}")

    return predicted, y_test

# Example usage:
# Assuming 'df' is your DataFrame and it has a 'sentence' column with text and 'label' column with the target labels
df = pd.read_csv('your_file.csv')  # Replace with the path to your file

# Ensure you check your column names
print(df.columns)

# Assign X (text data) and y (labels)
X = df['sentence']  # Replace 'sentence' with your actual text column name
y = df['label']     # Replace 'label' with your actual label column name

# Call the tfidf_knn function
predicted, y_test = tfidf_knn(X, y)

# You now have `predicted` (the model's predictions) and `y_test` (the actual test labels).

# Section 4: SPAM TEXT DATASET
Now let's use what we've learnt to classify texts as spam or not spam.

In [12]:
# Upload the spam text data CSV file that has been shared with you. You can also download the file from https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
# Run this cell, click on the 'Choose files' button and upload the file.
from google.colab import files
uploaded = files.upload()

Saving reviews.csv to reviews.csv


In [13]:
import pandas as pd
df = pd.read_csv('spam.csv')
df

Unnamed: 0,Category,Message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ã¼ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [14]:
df['Category'] = df['Category'].map({'ham': 0, 'spam': 1})

In [15]:
df.head(5)

Unnamed: 0,Category,Message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."


In [16]:
len(df)

5572

In [17]:
from sklearn import metrics, neighbors
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict

## TASK - 2: Tweak the models below and see results with different parameters and distance metrics.

def bow_knn():
    """Method for determining nearest neighbors using bag-of-words and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createBagOfWords(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='euclidean', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with BOW accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    print('\n')
    return predicted, y_test


def tfidf_knn():
    """Method for determining nearest neighbors using tf-idf and K-Nearest Neighbor algorithm"""

    training_data = pd.read_csv('spam.csv')
    training_data['Category'] = training_data['Category'].map({'ham': 0, 'spam': 1})
    X_train, X_test, y_train, y_test = train_test_split(training_data["Message"], training_data["Category"], test_size=0.2, random_state=5)
    X_train, X_test = createTFIDF(X_train, X_test, remove_stopwords=True, lemmatize=True, stemmer=False)
    knn = neighbors.KNeighborsClassifier(n_neighbors=5, weights='distance', algorithm='brute', leaf_size=30, p=2, metric='cosine', metric_params=None, n_jobs=1)

    knn.fit(X_train, y_train)
    predicted = knn.predict(X_test)
    acc = metrics.accuracy_score(y_test, predicted)
    print('KNN with TFIDF accuracy = ' + str(acc * 100) + '%')

    scores = cross_val_score(knn, X_train, y_train, cv=3)
    print("Cross Validation Accuracy: %0.2f" % (scores.mean()))
    print(scores)
    return predicted, y_test

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Assuming your DataFrame is loaded like this
df = pd.read_csv('your_file.csv')

# Check the columns to ensure 'sentence' exists
print(df.columns)

# Ensure you access the correct column, in this case 'sentence'
X = df['sentence']  # Assuming 'sentence' column contains the text data
y = df['label']  # Replace 'label' with the correct column for labels

# Step 1: Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english')

# Step 3: Fit and transform the training data
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

# Step 4: Transform the test data
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Step 5: Initialize KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)

# Step 6: Train the KNN model
knn.fit(X_train_tfidf, y_train)

# Step 7: Make predictions on the test set
predicted = knn.predict(X_test_tfidf)

# Step 8: Evaluate the model accuracy
accuracy = accuracy_score(y_test, predicted)
print(f"Accuracy: {accuracy:.4f}")

In [35]:
# This cell may take some time to run
predicted, y_test = bow_knn()

KeyError: 'sentence'

### Useful Resources for further reading
1. Stemming and Lemmatization: https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html
2. TF-IDF and BoW : https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/
3. TF-IDF: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html


### Questions to Think About and Answer
1. Why does the TF-IDF approach generally result in a better accuracy than Bag-of-Words ?
2. Can you think of techniques that are better than both BoW and TF-IDF ?
3. Read about Stemming and Lemmatization from the resources given below. Think about the pros/cons of each.

In [46]:
Why TF-IDF Generally Results in Better Accuracy Than Bag-of-Words

The TF-IDF (Term Frequency-Inverse Document Frequency) approach generally results in better accuracy than the Bag-of-Words (BoW) method because:

1. BoW (Bag-of-Words):

The BoW model simply counts the occurrence of each word in a document, without considering the frequency of a word across multiple documents.

Words that occur very frequently in the entire corpus, like common words (stopwords), are treated as highly significant.

This can lead to poor performance, as it doesn’t distinguish between important and unimportant words.



2. TF-IDF:

TF-IDF enhances BoW by incorporating two factors:

Term Frequency (TF): Measures how frequently a term occurs in a document.

Inverse Document Frequency (IDF): Measures how important a term is in the entire corpus. Words that occur in many documents have a lower IDF, meaning they are less significant.


This approach reduces the impact of frequently occurring words (like stopwords) and emphasizes terms that are unique and relevant to specific documents.




Thus, TF-IDF provides a better representation of the content by accounting for both the frequency of words in individual documents and their rarity across the whole corpus, leading to improved accuracy compared to BoW.

Techniques That Are Better Than Both BoW and TF-IDF

Some techniques that can perform better than both BoW and TF-IDF include:

1. Word Embeddings (e.g., Word2Vec, GloVe):

Word embeddings capture the semantic meaning of words by placing similar words close to each other in a continuous vector space.

Unlike BoW and TF-IDF, which treat words as independent features, embeddings allow words to retain contextual meaning and capture relationships (synonyms, antonyms, etc.).

Word2Vec and GloVe can also generalize better to unseen data and handle polysemy (words with multiple meanings).



2. Transformers (e.g., BERT, GPT):

Transformers like BERT and GPT are state-of-the-art models in NLP. They learn rich contextual representations of words and phrases in a given context.

These models don't rely on fixed word representations, but rather understand the meaning of a word in relation to the surrounding words.

Transformers can achieve high accuracy in various NLP tasks, including text classification, named entity recognition, and question answering.



3. Topic Modeling (e.g., LDA, NMF):

Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) are unsupervised techniques for identifying topics within a collection of text.

These techniques can capture the thematic structure of the text and can be used to represent documents in terms of topics rather than individual words, potentially offering better results for some tasks.




Stemming vs Lemmatization

Stemming and Lemmatization are two important text preprocessing techniques used to reduce words to their base or root forms, and each has its own pros and cons.

1. Stemming:

Definition: Stemming removes prefixes and suffixes to reduce a word to its base or root form. It is often performed by simple rules (e.g., removing "ing" or "es").

Pros:

Speed: Stemming is computationally faster than lemmatization because it uses straightforward rules.

Simplicity: It’s easy to implement and doesn't require a dictionary of words.


Cons:

Over-simplification: Stemming can lead to incorrect words (e.g., "running" becomes "run," but "ran" might be stemmed to something like "ra").

Loss of Meaning: Since stemming doesn't consider context or part of speech, it may create words that are not valid.




2. Lemmatization:

Definition: Lemmatization reduces words to their lemma, which is the correct base form, using a dictionary and part-of-speech information (e.g., "better" becomes "good").

Pros:

Accuracy: Lemmatization results in valid words and is more accurate in understanding the context of a word.

Grammatical Integrity: It keeps the word grammatically correct, which helps preserve the meaning.


Cons:

Complexity: Lemmatization is computationally more expensive because it requires a dictionary and part-of-speech tagging.

Slower: It’s slower than stemming due to its more complex approach.





Summary

TF-IDF generally performs better than BoW because it accounts for the importance of words based on both their frequency in a document and rarity across the corpus.

More advanced techniques such as Word Embeddings (Word2Vec, GloVe) and Transformers (BERT, GPT) are often better than both BoW and TF-IDF for many NLP tasks.

Stemming is faster but less accurate, while Lemmatization is more accurate but slower and computationally more expensive. The choice depends on the trade-off between speed and accuracy in a specific application.

SyntaxError: invalid character '’' (U+2019) (<ipython-input-46-5ce109b52639>, line 11)

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Define the function that uses TF-IDF and KNN
def tfidf_knn(X, y):
    # Step 1: Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Step 2: Initialize the TF-IDF Vectorizer
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')

    # Step 3: Fit and transform the training data using TF-IDF
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

    # Step 4: Transform the test data
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    # Step 5: Initialize the KNN Classifier
    knn = KNeighborsClassifier(n_neighbors=5)

    # Step 6: Train the KNN model
    knn.fit(X_train_tfidf, y_train)

    # Step 7: Make predictions on the test set
    predicted = knn.predict(X_test_tfidf)

    # Step 8: Evaluate the model accuracy
    accuracy = accuracy_score(y_test, predicted)
    print(f"Accuracy: {accuracy:.4f}")

    return predicted, y_test

# Example usage:
# Assuming 'df' is your DataFrame and it has a 'sentence' column with text and 'label' column with the target labels
df = pd.read_csv('your_file.csv')  # Replace with the path to your file

# Ensure you check your column names
print(df.columns)

# Assign X (text data) and y (labels)
X = df['sentence']  # Replace 'sentence' with your actual text column name
y = df['label']     # Replace 'label' with your actual label column name

# Call the tfidf_knn function
predicted, y_test = tfidf_knn(X, y)

# You now have `predicted` (the model's predictions) and `y_test` (the actual test labels).