# **The fifth in-class exercise (40 points in total, 4/18/2023)**

(20 points) The purpose of this question is to practice different machine learning algorithms for text classification, as well as performance evaluation. In addition, you are required to conduct *10-fold cross-validation (https://scikit-learn.org/stable/modules/cross_validation.html)* during training.

The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews. There are two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.

Algorithms:

(1) MultinomialNB

(2) SVM

(3) KNN

(4) Decision tree

(5) Random Forest

(6) XGBoost

(7) Word2Vec

(8) BERT

Evaluation measurement:

(1) Accuracy

(2) Recall

(3) Precision

(4) F-1 score
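
The four measurements listed above can be sketched from raw counts. This is a minimal illustration, not the scikit-learn implementation: the helper name `binary_metrics` and the toy labels are assumptions made for this example. Labels are kept as strings because the notebook reads them as characters, and precision, recall, and F1 depend on which class is treated as "positive" (the cells in this notebook pass `pos_label='0'`).

```python
def binary_metrics(y_true, y_pred, pos_label):
    """Return (accuracy, precision, recall, f1) for the class `pos_label`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == pos_label and p == pos_label)
    fp = sum(1 for t, p in pairs if t != pos_label and p == pos_label)
    fn = sum(1 for t, p in pairs if t == pos_label and p != pos_label)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Toy labels, made up for illustration only.
y_true = ["0", "0", "0", "1", "1", "1", "1", "1"]
y_pred = ["0", "0", "1", "1", "1", "1", "0", "0"]

# Accuracy is the same for both calls; the other three metrics are not.
print(binary_metrics(y_true, y_pred, pos_label="0"))
print(binary_metrics(y_true, y_pred, pos_label="1"))
```

Choosing `pos_label` changes which class the precision, recall, and F1 numbers describe, so it is worth stating the choice next to any reported score.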

In [10]:
# Import pandas library for data manipulation
import pandas as pd

# Import matplotlib.pyplot library for data visualization
import matplotlib.pyplot as plt

# Import warnings module to suppress warnings
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

# Import metrics from sklearn.metrics library for model evaluation
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score


## Training dataset

In [11]:
# Open the "stsa-train.txt" file in read mode
with open("stsa-train.txt") as txtf:

    # Create a list to store the lines of the file
    mylist = [line.rstrip('\n') for line in txtf]

    # Initialize empty lists to store labels and text
    labels = []
    text = []

    # Iterate through each line in the list
    for line in mylist:

        # Extract the label (first character) and text (remaining characters)
        label = line[0]
        tex = line[1:]

        # Append the label and text to their respective lists
        labels.append(label)
        text.append(tex)

    # Create a DataFrame from the lists of labels and text
    dataset = pd.DataFrame(list(zip(labels, text)), columns=['Reviews', 'Text'])

    # Display the first 5 rows of the DataFrame
    dataset.head()
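
The parsing step above can be sketched as a small reusable function. This is an illustration under stated assumptions: the helper names `parse_labeled_line` and `parse_labeled_lines` and the sample lines are made up, each line is assumed to start with a one-character label, and, unlike the cell above, the sketch strips surrounding whitespace from the text and skips blank lines (an empty line would otherwise raise an `IndexError` on `line[0]`).

```python
def parse_labeled_line(line):
    """Split one line into (label, text); the label is the first character."""
    line = line.rstrip("\n")
    return line[0], line[1:].strip()


def parse_labeled_lines(lines):
    """Parse an iterable of lines into parallel lists of labels and texts."""
    labels, texts = [], []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines instead of failing on line[0]
        label, text = parse_labeled_line(line)
        labels.append(label)
        texts.append(text)
    return labels, texts


# Sample lines, made up to resemble the format described above.
sample_lines = ["1 a stirring , funny film\n", "\n", "0 no movement , no yuks\n"]
sample_labels, sample_texts = parse_labeled_lines(sample_lines)
print(sample_labels)
print(sample_texts)
```

The same function could be applied to both the training and the test file, which would avoid maintaining two copies of the loop.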



## Training data preprocessing

In [12]:
import nltk

# Download stopwords
nltk.download('stopwords')

# Download WordNet
nltk.download('wordnet')

# Download Punkt tokenizer
nltk.download('punkt')

# Download movie reviews and Twitter samples for sentiment analysis
nltk.download('movie_reviews')
nltk.download('twitter_samples')

# Download averaged_perceptron_tagger for Part-of-Speech tagging
nltk.download('averaged_perceptron_tagger')

# Download conll2002 for Named Entity Recognition (NER)
nltk.download('conll2002')

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

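# Build the stopword set once; calling stopwords.words('english') for every token is slow
english_stopwords = set(stopwords.words('english'))

def preprocess(sentence):
    # Normalise to a lowercase string and drop the literal '{html}' marker
    sentence = str(sentence).lower()
    sentence = sentence.replace('{html}', "")
    # Strip HTML tags, URLs, and digits
    cleantext = re.sub(r'<.*?>', '', sentence)
    rem_url = re.sub(r'http\S+', '', cleantext)
    rem_num = re.sub('[0-9]+', '', rem_url)
    # Tokenize on runs of word characters
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(rem_num)
    # Keep tokens longer than 2 characters that are not stopwords
    filtered_words = [w for w in tokens if len(w) > 2 and w not in english_stopwords]
    # Stems and lemmas are computed here but are not used in the returned text
    stem_words = [stemmer.stem(w) for w in filtered_words]
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]
    # Return the filtered (unstemmed) tokens, which is what the output below shows
    return " ".join(filtered_words)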


dataset['cleanText']=dataset['Text'].map(lambda s:preprocess(s))
dataset.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2002 to /root/nltk_data...
[nltk_data]   Package conll2002 is already up-to-date!


| | Reviews | Text | cleanText |
|---|---|---|---|
| 0 | 1 | a stirring , funny and finally transporting r... | stirring funny finally transporting imagining ... |
| 1 | 0 | apparently reassembled from the cutting-room ... | apparently reassembled cutting room floor give... |
| 2 | 0 | they presume their audience wo n't sit still ... | presume audience sit still sociology lesson ho... |
| 3 | 1 | this is a visually stunning rumination on lov... | visually stunning rumination love memory histo... |
| 4 | 1 | jonathan parker 's bartleby should have been ... | jonathan parker bartleby end modern office ano... |
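
The cleaning steps used in `preprocess` can be sketched without NLTK to show what each stage does to a sentence. This is a simplified illustration, not the notebook's function: `preprocess_fast`, the tiny `STOPWORDS` set, and the sample sentence are assumptions made for this example (the notebook uses NLTK's English stopword list), and `re.findall(r"\w+", ...)` is used here in place of `RegexpTokenizer(r'\w+')`, which should behave the same way for this pattern.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the notebook uses
# nltk's stopwords.words('english') instead.
STOPWORDS = frozenset({"the", "and", "this", "that", "from", "their", "have", "been"})

# Compile the patterns once so they are not rebuilt for every sentence.
TAG_RE = re.compile(r"<.*?>")
URL_RE = re.compile(r"http\S+")
DIGITS_RE = re.compile(r"[0-9]+")
TOKEN_RE = re.compile(r"\w+")


def preprocess_fast(sentence, stopwords=STOPWORDS):
    """Lowercase, strip tags/URLs/digits, tokenize, and drop short or stop words."""
    text = str(sentence).lower()
    text = TAG_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = DIGITS_RE.sub("", text)
    tokens = TOKEN_RE.findall(text)
    return " ".join(w for w in tokens if len(w) > 2 and w not in stopwords)


print(preprocess_fast("This is a <b>visually</b> stunning film from 2002 http://example.com"))
# prints: visually stunning film
```

Membership tests against a set are constant time on average, which matters when the same check runs once per token across thousands of reviews.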


## Testing dataset

In [13]:
# Open the "stsa-test.txt" file in read mode
with open("stsa-test.txt") as txtf:

    # Create a list to store the lines of the test file
    mylist_test = [line.rstrip('\n') for line in txtf]

    # Initialize empty lists to store test labels and test text
    labels_test = []
    text_test = []

    # Iterate through each line in the test file list
    for line in mylist_test:

        # Extract the test label (first character) and test text (remaining characters)
        label_test = line[0]
        tex_test = line[1:]

        # Append the test label and test text to their respective lists
        labels_test.append(label_test)
        text_test.append(tex_test)

    # Create a DataFrame from the lists of test labels and test text
    dataset_test = pd.DataFrame(list(zip(labels_test, text_test)), columns=['Reviews', 'Text'])

    # Display the first 5 rows of the test DataFrame
    dataset_test.head()


## Testing data preprocessing

In [15]:
# Import the nltk library for natural language processing tasks
import nltk

# Import the re module for the regular-expression substitutions used below
import re


# Import the RegexpTokenizer class from nltk.tokenize for tokenization based on regular expressions
from nltk.tokenize import RegexpTokenizer

# Import the WordNetLemmatizer and PorterStemmer classes from nltk.stem for lemmatization and stemming, respectively
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Import the stopwords module from nltk to remove common words
from nltk.corpus import stopwords

# Create an instance of the WordNetLemmatizer class for lemmatization
lemmatizer = WordNetLemmatizer()

# Create an instance of the PorterStemmer class for stemming
stemmer = PorterStemmer()

# Define a function to preprocess the text in a given sentence
def preprocess(sentence):
    # Convert the sentence to a string
    sentence = str(sentence)

    # Convert the sentence to lowercase
    sentence = sentence.lower()

    # Remove the literal '{html}' placeholder
    sentence = sentence.replace('{html}', "")

    # Remove HTML tags using regular expressions
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', sentence)

    # Remove URLs
    rem_url = re.sub(r'http\S+', '', cleantext)

    # Remove numbers
    rem_num = re.sub('[0-9]+', '', rem_url)

    # Create an instance of the RegexpTokenizer class for tokenization
    tokenizer = RegexpTokenizer(r'\w+')

    # Tokenize the sentence based on regular expressions
    tokens = tokenizer.tokenize(rem_num)

    # Filter out words with length less than or equal to 2 and stop words
    filtered_words = [w for w in tokens if len(w) > 2 and w not in stopwords.words('english')]

    # Stem the filtered words (computed but not used in the returned text)
    stem_words = [stemmer.stem(w) for w in filtered_words]

    # Lemmatize the stemmed words (computed but not used in the returned text)
    lemma_words = [lemmatizer.lemmatize(w) for w in stem_words]

    # Join the filtered (unstemmed) words into a single string
    processed_sentence = " ".join(filtered_words)

    # Return the processed sentence
    return processed_sentence

# Apply the preprocess function to the 'Text' column in the test dataset and store the processed text in a new column named 'cleanText'
dataset_test['cleanText'] = dataset_test['Text'].map(lambda s: preprocess(s))

# Display the first 5 rows of the preprocessed test dataset
dataset_test.head()


| | Reviews | Text | cleanText |
|---|---|---|---|
| 0 | 0 | no movement , no yuks , not much of anything . | movement yuks much anything |
| 1 | 0 | a gob of drivel so sickly sweet , even the ea... | gob drivel sickly sweet even eager consumers m... |
| 2 | 0 | gangs of new york is an unapologetic mess , w... | gangs new york unapologetic mess whose saving ... |
| 3 | 0 | we never really feel involved with the story ... | never really feel involved story ideas remain ... |
| 4 | 1 | this is one of polanski 's best films . | one polanski best films |


## TF-IDF Vectorization

In [16]:
# Import the TfidfVectorizer class from the sklearn.feature_extraction.text module
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of the TfidfVectorizer class with lowercase set to False and analyzer set to 'word'
tfidf_vectorizer = TfidfVectorizer(lowercase=False, analyzer='word')

# Fit the TfidfVectorizer to the preprocessed training text and transform it into a TF-IDF matrix
train_tfidf = tfidf_vectorizer.fit_transform(dataset["cleanText"]).toarray()

# Transform the preprocessed test text into a TF-IDF matrix using the fitted TfidfVectorizer
test_tfidf = tfidf_vectorizer.transform(dataset_test["cleanText"]).toarray()
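
To make the fit/transform distinction above concrete, here is a hand-rolled sketch of TF-IDF weighting. It is an illustration, not scikit-learn's code: the function names `fit_idf` and `transform` and the toy documents are assumptions, and tokens are split on whitespace rather than with the vectorizer's token pattern. The weighting follows the smoothed formula documented for scikit-learn's defaults, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation.

```python
import math
from collections import Counter


def fit_idf(train_docs):
    """Learn the vocabulary and smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter()
    for doc in train_docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}


def transform(doc, idf):
    """Map a document to an L2-normalised {term: weight} dict; unseen terms are dropped."""
    counts = Counter(w for w in doc.split() if w in idf)
    weights = {term: tf * idf[term] for term, tf in counts.items()}
    norm = math.sqrt(sum(v * v for v in weights.values()))
    return {term: v / norm for term, v in weights.items()} if norm else {}


# Toy documents, made up for illustration only.
train_docs = ["stunning film", "boring film", "stunning visuals"]
idf = fit_idf(train_docs)

# "unseen" never occurred in the training documents, so it gets no weight.
test_vector = transform("stunning unseen film film", idf)
print(test_vector)
```

Fitting on the training text only, then reusing the learned vocabulary and IDF weights for the test text, keeps information from the test set out of the features.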


In [17]:
# Assign the preprocessed test TF-IDF matrix to the x_test variable
x_test = test_tfidf

# Assign the target labels for the test data to the y_test variable
y_test = dataset_test["Reviews"]


## Data partitioning

In [18]:
# Import the train_test_split function from sklearn.model_selection
from sklearn.model_selection import train_test_split

# Split the preprocessed training TF-IDF matrix and target labels into training and validation sets
x_train, x_valid, y_train, y_valid = train_test_split(train_tfidf, dataset["Reviews"], test_size=0.2, random_state=202)


## Algorithms

## 1. MultinomialNB

In [19]:
# Import the MultinomialNB class from sklearn.naive_bayes
from sklearn.naive_bayes import MultinomialNB

# Create an instance of the MultinomialNB classifier
classifier = MultinomialNB()

# Train the classifier using the training data
model = classifier.fit(x_train, y_train)

# Make predictions on the validation data
predictions_validation_set = classifier.predict(x_valid)

# Evaluate the classifier; precision, recall, and F1 are reported for the negative class (pos_label='0')
print("Accuracy of the Naive Bayes model on validation set is :", round(accuracy_score(y_valid, predictions_validation_set) * 100), "%")
print("Precision of the Naive Bayes model on validation set is :", round(precision_score(y_valid, predictions_validation_set, pos_label='0') * 100), "%")
print("Recall of the Naive Bayes model on validation set is :", round(recall_score(y_valid, predictions_validation_set, pos_label='0') * 100), "%")
print("F1 Score of the Naive Bayes model on validation set is :", round(f1_score(y_valid, predictions_validation_set, pos_label='0') * 100), "%")


Accuracy of the Naive Bayes model on validation set is : 78 %
Precision of the Naive Bayes model on validation set is : 83 %
Recall of the Naive Bayes model on validation set is : 69 %
F1 Score of the Naive Bayes model on validation set is : 76 %


In [20]:
# Import the classification_report function from sklearn.metrics
from sklearn.metrics import classification_report

# Generate a classification report for the Naive Bayes classifier on the validation set
cr_naive_validation = classification_report(y_valid, predictions_validation_set)

# Print the classification report
print("Classification Report: ", "\n", "\n", cr_naive_validation)


Classification Report:  
 
               precision    recall  f1-score   support

           0       0.83      0.69      0.76       667
           1       0.75      0.87      0.81       717

    accuracy                           0.78      1384
   macro avg       0.79      0.78      0.78      1384
weighted avg       0.79      0.78      0.78      1384



In [21]:
# Import the cross_val_score function from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Perform 10-fold cross-validation on the Naive Bayes classifier using the training data
naive_accuracies_validation = cross_val_score(estimator=classifier, X=x_train, y=y_train, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = naive_accuracies_validation.mean()

# Print the average cross-validation accuracy
print(f"Naive Bayes Model 10-fold cross-validation score on training set is :  {round(average_accuracy * 100)}%")


Naive Bayes Model 10-fold cross-validation score on training set is :  77%
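
What 10-fold cross-validation does to the row indices can be sketched in plain Python. This is a simplified, non-stratified illustration: the function name `k_fold_indices` is an assumption, and for classifiers `cross_val_score` with an integer `cv` uses stratified folds, which additionally balance the class proportions in each fold.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) lists for k contiguous folds over range(n_samples)."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


folds = list(k_fold_indices(n_samples=25, k=10))

# Each sample is held out exactly once; the model is refit k times.
for train_idx, val_idx in folds[:3]:
    print(len(train_idx), val_idx)
```

The reported cross-validation score is the mean of the k per-fold accuracies, each computed on rows the corresponding model did not see during fitting.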


In [22]:
# Make predictions on the test data using the trained Naive Bayes classifier
predictions_test_set = classifier.predict(x_test)

# Evaluate the performance of the classifier on the test data using accuracy, precision, recall, and F1-score metrics
print("Accuracy of the Naive Bayes model on test set is :", round(accuracy_score(y_test, predictions_test_set) * 100), "%")

print("Precision of the Naive Bayes model on test set is :", round(precision_score(y_test, predictions_test_set, pos_label='0') * 100), "%")
print("Recall of the Naive Bayes model on test set is :", round(recall_score(y_test, predictions_test_set, pos_label='0') * 100), "%")
print("F1 Score of the Naive Bayes model on test set is :", round(f1_score(y_test, predictions_test_set, pos_label='0') * 100), "%")


Accuracy of the Naive Bayes model on test set is : 79 %
Precision of the Naive Bayes model on test set is : 86 %
Recall of the Naive Bayes model on test set is : 71 %
F1 Score of the Naive Bayes model on test set is : 78 %


In [23]:
# Generate a classification report for the Naive Bayes classifier on the test set
cr_naive_test = classification_report(y_test, predictions_test_set)

# Print the classification report
print("Classification Report: ", "\n", "\n", cr_naive_test)


Classification Report:  
 
               precision    recall  f1-score   support

           0       0.86      0.71      0.78       912
           1       0.75      0.88      0.81       909

    accuracy                           0.79      1821
   macro avg       0.80      0.79      0.79      1821
weighted avg       0.80      0.79      0.79      1821



In [24]:
# Perform 10-fold cross-validation on the testing data (cross_val_score fits fresh clones of the classifier on test-set folds; it does not score the model trained above)
naive_accuracies_test = cross_val_score(estimator=classifier, X=x_test, y=y_test, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = naive_accuracies_test.mean()

# Print the average cross-validation accuracy
print(f"Naive Bayes Model 10-fold cross validation score on testing set is :  {round(average_accuracy * 100)}%")


Naive Bayes Model 10-fold cross validation score on testing set is :  73%
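
In the cells above, the TF-IDF vocabulary and IDF weights are fitted on all training rows before cross-validation, so every validation fold has already influenced the features. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to bundle the vectorizer and the classifier in a scikit-learn `Pipeline` so that each fold refits the vectorizer on its own training portion. The toy corpus and `cv=4` are made up for illustration; the exercise itself asks for `cv=10` on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for illustration; the notebook would pass
# dataset["cleanText"] and dataset["Reviews"] instead.
texts = ["great film", "wonderful acting", "stunning visuals", "lovely story",
         "boring plot", "terrible acting", "dull film", "awful story"]
labels = ["1", "1", "1", "1", "0", "0", "0", "0"]

# The pipeline refits TF-IDF inside every fold, then fits the classifier.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())

# cv=4 only because the toy corpus is tiny.
scores = cross_val_score(pipeline, texts, labels, cv=4)
print(scores.mean())
```

With this layout the raw text is passed to `cross_val_score`, and the final evaluation can call `pipeline.fit` on the training text followed by `pipeline.predict` on the test text.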


## SVM

In [25]:
# Import the svm module from sklearn (it provides the SVC class)
from sklearn import svm

# Create an instance of the SVC classifier
classifier_svm = svm.SVC()

# Train the classifier using the training data
model_svm = classifier_svm.fit(x_train, y_train)

# Make predictions on the validation data
svm_predictions_validation_set = classifier_svm.predict(x_valid)

# Evaluate the performance of the classifier using accuracy, precision, recall, and F1-score metrics
print("Accuracy of the SVM model on validation set is :", round(accuracy_score(y_valid, svm_predictions_validation_set) * 100), "%")
print("Precision of the SVM model on validation set is :", round(precision_score(y_valid, svm_predictions_validation_set, pos_label='0') * 100), "%")
print("Recall of the SVM model on validation set is :", round(recall_score(y_valid, svm_predictions_validation_set, pos_label='0') * 100), "%")
print("F1 Score of the SVM model on validation set is :", round(f1_score(y_valid, svm_predictions_validation_set, pos_label='0') * 100), "%")


Accuracy of the SVM model on validation set is : 79 %
Precision of the SVM model on validation set is : 79 %
Recall of the SVM model on validation set is : 76 %
F1 Score of the SVM model on validation set is : 77 %


In [26]:
# Import the classification_report function from sklearn.metrics
from sklearn.metrics import classification_report

# Generate a classification report for the SVM classifier on the validation set
cr_svm_validation = classification_report(y_valid, svm_predictions_validation_set)

# Print the classification report
print("Classification Report: ", "\n\n", cr_svm_validation)


Classification Report:  

               precision    recall  f1-score   support

           0       0.79      0.76      0.77       667
           1       0.78      0.82      0.80       717

    accuracy                           0.79      1384
   macro avg       0.79      0.79      0.79      1384
weighted avg       0.79      0.79      0.79      1384



In [27]:
# Import the cross_val_score function from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Perform 10-fold cross-validation on the SVM classifier using the training data
svm_accuracies_validation = cross_val_score(estimator=classifier_svm, X=x_train, y=y_train, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = svm_accuracies_validation.mean()

# Print the average cross-validation accuracy
print(f"SVM Model 10-fold cross validation score on training set is :  {round(average_accuracy * 100)}%")


SVM Model 10-fold cross validation score on training set is :  77%


In [30]:
# Make predictions on the test data using the trained SVM classifier
svm_predictions_test_set = classifier_svm.predict(x_test)

# Evaluate the performance of the classifier on the test data using accuracy, precision, recall, and F1-score metrics
print("Accuracy : ", round(accuracy_score(y_test, svm_predictions_test_set) * 100), "%")
print("Precision: ", round(precision_score(y_test, svm_predictions_test_set, pos_label='0') * 100), "%")
print("Recall : ", round(recall_score(y_test, svm_predictions_test_set, pos_label='0') * 100), "%")
print("F1 Score : ", round(f1_score(y_test, svm_predictions_test_set, pos_label='0') * 100), "%")

Accuracy :  79 %
Precision:  82 %
Recall :  75 %
F1 Score :  78 %


In [31]:
cr_svm_test = classification_report(y_test, svm_predictions_test_set)
print("Classification Report: ", "\n", "\n",cr_svm_test)


Classification Report:  
 
               precision    recall  f1-score   support

           0       0.82      0.75      0.78       912
           1       0.77      0.84      0.80       909

    accuracy                           0.79      1821
   macro avg       0.80      0.79      0.79      1821
weighted avg       0.80      0.79      0.79      1821



In [32]:
# Perform 10-fold cross-validation on the SVM classifier using the testing data
svm_accuracies_test = cross_val_score(estimator=classifier_svm, X=x_test, y=y_test, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = svm_accuracies_test.mean()

# Print the average cross-validation accuracy
print(f"SVM Model 10-fold cross validation score on testing set is : {round(average_accuracy * 100)}%")


SVM Model 10-fold cross validation score on testing set is : 72%


## KNN

In [33]:
# Import the KNeighborsClassifier class from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier

# Create an instance of the KNeighborsClassifier classifier with k=15
classifier_knn = KNeighborsClassifier(n_neighbors=15)

# Train the classifier using the training data
model_knn = classifier_knn.fit(x_train, y_train)

# Make predictions on the validation data
knn_predictions_validation_set = classifier_knn.predict(x_valid)

# Evaluate the performance of the classifier using accuracy, precision, recall, and F1-score metrics
print("Accuracy of the KNN model on validation set is :", round(accuracy_score(y_valid, knn_predictions_validation_set) * 100), "%")
print("Precision of the KNN model on validation set is :", round(precision_score(y_valid, knn_predictions_validation_set, pos_label='0') * 100), "%")
print("Recall of the KNN model on validation set is :", round(recall_score(y_valid, knn_predictions_validation_set, pos_label='0') * 100), "%")
print("F1 Score of the KNN model on validation set is :", round(f1_score(y_valid, knn_predictions_validation_set, pos_label='0') * 100), "%")


Accuracy of the KNN model on validation set is : 74 %
Precision of the KNN model on validation set is : 71 %
Recall of the KNN model on validation set is : 78 %
F1 Score of the KNN model on validation set is : 74 %
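
The idea behind the KNN classifier used above can be sketched in a few lines: find the k training points closest to a query and return the most common label among them. This is a minimal illustration, not scikit-learn's implementation; the function name `knn_predict` and the two-dimensional toy points are assumptions, and Euclidean distance with unweighted voting is used because that matches the default settings of `KNeighborsClassifier`.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Predict the majority label among the k nearest training points (Euclidean distance)."""
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D points, made up for illustration; real rows would be TF-IDF vectors.
train_x = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.2), (0.9, 1.0), (1.0, 0.9), (0.8, 0.8)]
train_y = ["0", "0", "0", "1", "1", "1"]

print(knn_predict(train_x, train_y, query=(0.15, 0.1), k=3))  # prints: 0
print(knn_predict(train_x, train_y, query=(0.85, 0.9), k=3))  # prints: 1
```

Because every prediction compares the query against all training rows, prediction time grows with the size of the training set and the number of TF-IDF features.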


In [34]:
# Import the classification_report function from sklearn.metrics
from sklearn.metrics import classification_report

# Generate a classification report for the KNN classifier on the validation set
cr_knn_validation = classification_report(y_valid, knn_predictions_validation_set)

# Print the classification report
print("Classification Report: ", "\n\n", cr_knn_validation)


Classification Report:  

               precision    recall  f1-score   support

           0       0.71      0.78      0.74       667
           1       0.77      0.71      0.74       717

    accuracy                           0.74      1384
   macro avg       0.74      0.74      0.74      1384
weighted avg       0.74      0.74      0.74      1384



In [35]:
# Import the cross_val_score function from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Perform 10-fold cross-validation on the KNN classifier using the training data
knn_accuracies_validation = cross_val_score(estimator=classifier_knn, X=x_train, y=y_train, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = knn_accuracies_validation.mean()

# Print the average cross-validation accuracy
print(f"KNN Model 10-fold cross validation score on training set is :  {round(average_accuracy * 100)}%")


KNN Model 10-fold cross validation score on training set is :  70%


In [36]:
# Make predictions on the test data using the KNN classifier
knn_predictions_test_set = classifier_knn.predict(x_test)

# Evaluate the performance of the classifier on the test data using accuracy, precision, recall, and F1-score metrics
print("Accuracy of the KNN model on test set is :", round(accuracy_score(y_test, knn_predictions_test_set) * 100), "%")
print("Precision of the KNN model on test set is :", round(precision_score(y_test, knn_predictions_test_set, pos_label='0') * 100), "%")
print("Recall of the KNN model on test set is :", round(recall_score(y_test, knn_predictions_test_set, pos_label='0') * 100), "%")
print("F1 Score of the KNN model on test set is :", round(f1_score(y_test, knn_predictions_test_set, pos_label='0') * 100), "%")


Accuracy of the KNN model on test set is : 73 %
Precision of the KNN model on test set is : 71 %
Recall of the KNN model on test set is : 77 %
F1 Score of the KNN model on test set is : 74 %


In [37]:
# Generate a classification report for the KNN classifier on the test set
cr_knn_test = classification_report(y_test, knn_predictions_test_set)

# Print the classification report
print("Classification Report: ", "\n\n", cr_knn_test)


Classification Report:  

               precision    recall  f1-score   support

           0       0.71      0.77      0.74       912
           1       0.75      0.69      0.72       909

    accuracy                           0.73      1821
   macro avg       0.73      0.73      0.73      1821
weighted avg       0.73      0.73      0.73      1821



In [38]:
# Perform 10-fold cross-validation on the KNN classifier using the testing data
knn_accuracies_test = cross_val_score(estimator=classifier_knn, X=x_test, y=y_test, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = knn_accuracies_test.mean()

# Print the average cross-validation accuracy
print(f"KNN Model 10-fold cross validation score on testing set is : {round(average_accuracy * 100)}%")


KNN Model 10-fold cross validation score on testing set is : 63%


## Decision Tree

In [40]:
# Import the DecisionTreeClassifier class from sklearn.tree
from sklearn.tree import DecisionTreeClassifier

# Create an instance of the DecisionTreeClassifier classifier
classifier_dt = DecisionTreeClassifier()

# Train the classifier using the training data
model_dt = classifier_dt.fit(x_train, y_train)

# Make predictions on the validation data
dt_predictions_validation_set = classifier_dt.predict(x_valid)

# Evaluate the performance of the classifier using accuracy, precision, and F1-score metrics
print("Accuracy of the Decision Tree Classifier model on validation set is :", round(accuracy_score(y_valid, dt_predictions_validation_set) * 100), "%")
print("Precision of the Decision Tree Classifier model on validation set is :", round(precision_score(y_valid, dt_predictions_validation_set, pos_label='0') * 100), "%")
print("F1 Score of the Decision Tree Classifier model on validation set is :", round(f1_score(y_valid, dt_predictions_validation_set, pos_label='0') * 100), "%")

Accuracy of the Decision Tree Classifier model on validation set is : 66 %
Precision of the Decision Tree Classifier model on validation set is : 63 %
F1 Score of the Decision Tree Classifier model on validation set is : 67 %
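
A decision tree grows by repeatedly choosing the feature threshold that makes the resulting groups as pure as possible. The sketch below illustrates that idea for a single feature using Gini impurity, which is the default criterion of `DecisionTreeClassifier`. It is a simplified illustration, not the library's algorithm: the function names `gini` and `best_threshold` and the toy values are assumptions made for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_impurity) for the best 'value <= threshold' split."""
    best = (None, gini(labels))  # no split at all is the baseline
    unique = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(unique, unique[1:]):
        threshold = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best


# Toy data, made up for illustration: the TF-IDF weight of one word per review.
tfidf_weight = [0.0, 0.0, 0.1, 0.4, 0.5, 0.6]
sentiment = ["1", "1", "1", "0", "0", "0"]

print(gini(sentiment))
print(best_threshold(tfidf_weight, sentiment))
```

A full tree repeats this search over every feature at every node, which is one reason single trees on thousands of sparse TF-IDF columns tend to overfit.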


In [41]:
# Import the classification_report function from sklearn.metrics
from sklearn.metrics import classification_report

# Generate a classification report for the Decision Tree Classifier on the validation set
cr_dt_validation = classification_report(y_valid, dt_predictions_validation_set)

# Print the classification report
print("Classification Report: ", "\n\n", cr_dt_validation)


Classification Report:  

               precision    recall  f1-score   support

           0       0.63      0.71      0.67       667
           1       0.69      0.61      0.65       717

    accuracy                           0.66      1384
   macro avg       0.66      0.66      0.66      1384
weighted avg       0.66      0.66      0.66      1384



In [42]:
# Import the cross_val_score function from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Perform 10-fold cross-validation on the Decision Tree Classifier using the training data
dt_accuracies_validation = cross_val_score(estimator=classifier_dt, X=x_train, y=y_train, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = dt_accuracies_validation.mean()

# Print the average cross-validation accuracy
print(f"Decision Tree Classifier Model 10-fold cross validation score on training set is : {round(average_accuracy * 100)}%")


Decision Tree Classifier Model 10-fold cross validation score on training set is : 65%


In [44]:
# Make predictions on the test data using the Decision Tree Classifier
dt_predictions_test_set = classifier_dt.predict(x_test)

# Evaluate the performance of the classifier on the test data using accuracy, precision, recall, and F1-score metrics
print("Accuracy of the Decision Tree Classifier model on test set is :", round(accuracy_score(y_test, dt_predictions_test_set) * 100), "%")
print("Precision of the Decision Tree Classifier model on test set is :", round(precision_score(y_test, dt_predictions_test_set, pos_label='0') * 100), "%")
print("Recall of the Decision Tree Classifier model on test set is :", round(recall_score(y_test, dt_predictions_test_set, pos_label='0') * 100), "%")
print("F1 Score of the Decision Tree Classifier model on test set is :", round(f1_score(y_test, dt_predictions_test_set, pos_label='0') * 100), "%")


Accuracy of the Decision Tree Classifier model on test set is : 67 %
Precision of the Decision Tree Classifier model on test set is : 66 %
Recall of the Decision Tree Classifier model on test set is : 71 %
F1 Score of the Decision Tree Classifier model on test set is : 68 %


In [45]:
# Import the classification_report function from sklearn.metrics
from sklearn.metrics import classification_report

# Generate a classification report for the Decision Tree Classifier on the test set
cr_dt_test = classification_report(y_test, dt_predictions_test_set)

# Print the classification report
print("Classification Report: ", "\n\n", cr_dt_test)


Classification Report:  

               precision    recall  f1-score   support

           0       0.66      0.71      0.68       912
           1       0.68      0.64      0.66       909

    accuracy                           0.67      1821
   macro avg       0.67      0.67      0.67      1821
weighted avg       0.67      0.67      0.67      1821



In [46]:
# Import the cross_val_score function from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Perform 10-fold cross-validation on the Decision Tree Classifier using the testing data
dt_accuracies_test = cross_val_score(estimator=classifier_dt, X=x_test, y=y_test, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = dt_accuracies_test.mean()

# Print the average cross-validation accuracy
print(f"Decision Tree Classifier Model 10-fold cross validation score on testing set is : {round(average_accuracy * 100)}%")


Decision Tree Classifier Model 10-fold cross validation score on testing set is : 63%


## Random Forest

In [47]:
# Import the RandomForestClassifier class from sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier

# Create an instance of the RandomForestClassifier classifier
classifier_rf = RandomForestClassifier()

# Train the classifier using the training data
model_rf = classifier_rf.fit(x_train, y_train)

# Make predictions on the validation data
rf_predictions_validation_set = classifier_rf.predict(x_valid)

# Evaluate the performance of the classifier using accuracy, precision, recall, and F1-score metrics
print("Accuracy of the Random Forest Classifier model on validation set is :", round(accuracy_score(y_valid, rf_predictions_validation_set) * 100), "%")
print("Precision of the Random Forest Classifier model on validation set is :", round(precision_score(y_valid, rf_predictions_validation_set, pos_label='0') * 100), "%")
print("Recall of the Random Forest Classifier model on validation set is :", round(recall_score(y_valid, rf_predictions_validation_set, pos_label='0') * 100), "%")
print("F1 Score of the Random Forest Classifier model on validation set is :", round(f1_score(y_valid, rf_predictions_validation_set, pos_label='0') * 100), "%")


Accuracy of the Random Forest Classifier model on validation set is : 72 %
Precision of the Random Forest Classifier model on validation set is : 71 %
Recall of the Random Forest Classifier model on validation set is : 72 %
F1 Score of the Random Forest Classifier model on validation set is : 72 %


In [48]:
# Import the classification_report function from sklearn.metrics
from sklearn.metrics import classification_report

# Generate a classification report for the Random Forest Classifier on the validation set
cr_rf_validation = classification_report(y_valid, rf_predictions_validation_set)

# Print the classification report
print("Classification Report: ", "\n\n", cr_rf_validation)


Classification Report:  

               precision    recall  f1-score   support

           0       0.71      0.72      0.72       667
           1       0.74      0.73      0.73       717

    accuracy                           0.72      1384
   macro avg       0.72      0.72      0.72      1384
weighted avg       0.72      0.72      0.72      1384



In [49]:
# Import the cross_val_score function from sklearn.model_selection
from sklearn.model_selection import cross_val_score

# Perform 10-fold cross-validation on the Random Forest Classifier using the training data
rf_accuracies_validation = cross_val_score(estimator=classifier_rf, X=x_train, y=y_train, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = rf_accuracies_validation.mean()

# Print the average cross-validation accuracy
print(f"Random Forest Classifier Model 10-fold cross validation score on training set is : {round(average_accuracy * 100)}%")


Random Forest Classifier Model 10-fold cross validation score on training set is : 72%
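
The cross-validation above reports accuracy only, while the exercise also asks for recall, precision, and F1. A minimal sketch of collecting all four per fold with `cross_validate` follows. It assumes synthetic data with integer labels; `X_demo`, `y_demo`, and `clf_demo` are hypothetical stand-ins, not the notebook's variables. With string labels such as `'0'`/`'1'`, the built-in `"precision"`, `"recall"`, and `"f1"` scorers would likely need to be replaced by `make_scorer(..., pos_label='0')`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the vectorized reviews (assumption: not the notebook's data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=50, random_state=0)

# Score each of the 10 folds on four metrics in a single pass
scoring = ["accuracy", "precision", "recall", "f1"]
cv_results = cross_validate(clf_demo, X_demo, y_demo, cv=10, scoring=scoring)

# cross_validate stores the per-fold scores under "test_<metric>"
for metric in scoring:
    fold_scores = cv_results[f"test_{metric}"]
    print(f"{metric}: mean={fold_scores.mean():.3f}, std={fold_scores.std():.3f}")
```

Reporting the standard deviation alongside the mean gives a rough sense of how much the score varies from fold to fold.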


In [50]:
# Make predictions on the test data using the Random Forest Classifier
rf_predictions_test_set = classifier_rf.predict(x_test)

# Evaluate the performance of the classifier on the test data using accuracy, precision, recall, and F1-score metrics
print("Accuracy of the Random Forest Classifier model on test set is :", round(accuracy_score(y_test, rf_predictions_test_set) * 100), "%")
print("Precision of the Random Forest Classifier model on test set is :", round(precision_score(y_test, rf_predictions_test_set, pos_label='0') * 100), "%")
print("Recall of the Random Forest Classifier model on test set is :", round(recall_score(y_test, rf_predictions_test_set, pos_label='0') * 100), "%")
print("F1 Score of the Random Forest Classifier model on test set is :", round(f1_score(y_test, rf_predictions_test_set, pos_label='0') * 100), "%")


Accuracy of the Random Forest Classifier model on test set is : 75 %
Precision of the Random Forest Classifier model on test set is : 74 %
Recall of the Random Forest Classifier model on test set is : 76 %
F1 Score of the Random Forest Classifier model on test set is : 75 %


In [51]:
# Import the classification_report function from sklearn.metrics
from sklearn.metrics import classification_report

# Generate a classification report for the Random Forest Classifier on the test set
cr_rf_test = classification_report(y_test, rf_predictions_test_set)

# Print the classification report
print("Classification Report: ", "\n\n", cr_rf_test)


Classification Report:  

               precision    recall  f1-score   support

           0       0.74      0.76      0.75       912
           1       0.75      0.73      0.74       909

    accuracy                           0.75      1821
   macro avg       0.75      0.75      0.75      1821
weighted avg       0.75      0.75      0.75      1821



In [52]:
# Import the cross_val_score function from sklearn.model_selection
from sklearn.model_selection import cross_val_score

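# Perform 10-fold cross-validation on the testing data
# (cross_val_score refits a fresh clone of the classifier on each split of the test data, so this does not score the model trained above)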
rf_accuracies_test = cross_val_score(estimator=classifier_rf, X=x_test, y=y_test, cv=10)

# Calculate the average cross-validation accuracy
average_accuracy = rf_accuracies_test.mean()

# Print the average cross-validation accuracy
print(f"Random Forest Classifier Model 10-fold cross validation score on testing set is : {round(average_accuracy * 100)}%")


Random Forest Classifier Model 10-fold cross validation score on testing set is : 65%


(20 points) The purpose of this question is to practice different machine learning algorithms for text clustering.
Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use a different text dataset of your choice.)

Apply the listed clustering methods to the dataset:

K-means

DBSCAN

Hierarchical clustering

Word2Vec

BERT

You can refer to the code at the following link:
https://www.kaggle.com/karthik3890/text-clustering

## K-means

In [2]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('Amazon_Unlocked_Mobile.csv')

# Display the first 5 rows of the DataFrame
df.head()

Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,I feel so LUCKY to have found this used (phone...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice up grade from my pantach revu...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,Very pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,It works good but it goes slow sometimes but i...,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,Great phone to replace my lost phone. The only...,0.0


In [1]:
# Import the TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of the TfidfVectorizer
tfidf_vect = TfidfVectorizer()

# Fit the TfidfVectorizer to the 'Reviews' column
tfidf_vects = tfidf_vect.fit_transform(df['Reviews'].values.astype('U'))

# Get the feature names (vocabulary); get_feature_names() was removed in scikit-learn 1.2
names = tfidf_vect.get_feature_names_out()

# Display the first 5 rows of the DataFrame
df.head()


NameError: ignored

In [None]:
# Import the KMeans algorithm from scikit-learn
from sklearn.cluster import KMeans

# Initialize an empty list to store WCSS values
wcss = []

# Iterate over a range of cluster numbers from 2 to 11
for i in range(2, 12):
    # Create a KMeans object with the specified number of clusters and initialization method
    kmeans = KMeans(n_clusters=i, init="k-means++", random_state=101)

    # Fit the KMeans object to the TF-IDF vectors
    kmeans.fit(tfidf_vects)

    # Calculate the within-cluster sum of squares (WCSS)
    wcss.append(kmeans.inertia_)

# Create a plot to visualize the elbow method
plt.figure(figsize=(11, 6))
plt.plot(range(2, 12), wcss, marker="o")

# Add labels and title to the plot
plt.title("The Elbow Method")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
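
The elbow in a WCSS curve can be hard to read, so a second criterion is sometimes helpful. The sketch below picks the number of clusters by silhouette score instead. It assumes synthetic 2-D blobs with hand-placed centers rather than the TF-IDF matrix; `X_demo`, `silhouette_by_k`, and `best_k` are hypothetical names introduced for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs (assumption: stand-in for the TF-IDF vectors)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
    cluster_std=0.5,
    random_state=42,
)

# Fit K-means for each candidate k and record the mean silhouette coefficient
silhouette_by_k = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=101).fit_predict(X_demo)
    silhouette_by_k[k] = silhouette_score(X_demo, labels)

# A higher silhouette means tighter, better-separated clusters
best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k)
print("Best k by silhouette:", best_k)
```

On a large sparse TF-IDF matrix, `silhouette_score` accepts a `sample_size` argument to keep the computation manageable.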


In [None]:
# Import the KMeans algorithm from scikit-learn and Counter for tallying cluster labels
from collections import Counter
from sklearn.cluster import KMeans

# Create a KMeans object with 6 clusters, k-means++ initialization, 10000 maximum iterations, and a random state of 50
model = KMeans(n_clusters=6, init='k-means++', max_iter=10000, random_state=50)

# Fit the KMeans model to the TF-IDF vectors
model.fit(tfidf_vects)

# Use the Counter class to count the occurrences of each cluster label
cluster_counts = Counter(model.labels_)

# Display the cluster counts
print(cluster_counts)


In [None]:
# `names` (the TF-IDF vocabulary) and `model` (the fitted KMeans) come from the cells above

# Set the number of top words to display
top_words = 7

# Get the cluster centers
centroids = model.cluster_centers_

# For each centroid, rank the vocabulary indices by descending TF-IDF weight
sorted_centroids = centroids.argsort()[:, ::-1]

# Iterate over each cluster
for cluster_num in range(6):
    # Extract the top words for the current cluster
    key_features = [names[i] for i in sorted_centroids[cluster_num, :top_words]]

    # Print the cluster number and top words
    print('Cluster', cluster_num + 1)
    print('Top Words:', key_features)


In [None]:
cluster_center=model.cluster_centers_
cluster_center

## DBSCAN

In [3]:
# Create an empty list to store reviews as word lists
reviews = []

# Iterate over the 'Reviews' column of the DataFrame and append each review as a list of words
for review in df['Reviews']:
    # Split the review string into a list of words using the `split()` method
    words = str(review).split()
    # Append the list of words to the 'reviews' list
    reviews.append(words)

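# Import Gensim for Word2Vec and NumPy for the vector arithmetic below
import gensim
import numpy as np

# Create a Word2Vec model with a dimensionality of 100 and 4 workers
# (gensim >= 4.0 names this parameter vector_size; older releases call it size)
w2v_model = gensim.models.Word2Vec(reviews, vector_size=100, workers=4)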

# Create an empty list to store word vectors
vectors = []

# Iterate over the 'reviews' list
for review in reviews:
    # Create an empty vector of zeros with a dimension of 100
    vector = np.zeros(100)
    # Initialize a counter variable
    count = 0

    # Iterate over the words in the current review
    for word in review:
        try:
            # Get the word vector from the Word2Vec model
            vec = w2v_model.wv[word]

            # Add the word vector to the current vector
            vector += vec

            # Increment the counter
            count += 1
        except KeyError:
            # Skip the word if it's not in the vocabulary
            pass

    # Divide the vector by the count to get the average word vector
    # (reviews with no in-vocabulary words keep the all-zero vector)
    if count > 0:
        vector /= count

    # Append the average word vector to the 'vectors' list
    vectors.append(vector)

# Convert the 'vectors' list to a NumPy array
vectors = np.array(vectors)

# Replace any NaN values in the vectors with 0
vectors = np.nan_to_num(vectors)


NameError: ignored
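
The averaging step in the cell above can be sketched on a toy vocabulary, which makes it easy to check by hand. This is a minimal illustration, assuming a plain dictionary of 3-dimensional vectors in place of a trained Word2Vec model; `toy_embeddings` and `average_vector` are hypothetical names, not part of Gensim.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model
toy_embeddings = {
    "great": np.array([1.0, 0.0, 0.0]),
    "phone": np.array([0.0, 1.0, 0.0]),
    "slow": np.array([0.0, 0.0, 1.0]),
}

def average_vector(tokens, embeddings, dim):
    """Mean of the embeddings of in-vocabulary tokens; zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known words: return zeros instead of dividing by zero
        return np.zeros(dim)
    return np.mean(known, axis=0)

docs = [["great", "phone"], ["slow", "phone", "unknownword"], ["???"]]
doc_vectors = np.vstack([average_vector(d, toy_embeddings, 3) for d in docs])
print(doc_vectors)
```

Handling the empty case explicitly avoids producing NaN values that would otherwise have to be cleaned up afterwards.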

In [None]:
from sklearn.cluster import DBSCAN
import numpy as np

minPts = 2*100

# Lower bound function: index of the first element of the sorted list that is >= target
def lower_bound(nums, target):
  l, r = 0, len(nums) - 1
  # Binary searching
  while l <= r:
    mid = l + (r - l) // 2
    if nums[mid] >= target:
      r = mid - 1
    else:
      l = mid + 1
  return l

def compute200thnearestneighbour(x, data):
  # Sorted list of the 200 smallest squared distances seen so far
  dists = []
  for val in data:
    # computing squared Euclidean distances
    dist = np.sum((x - val) **2 )
    if len(dists) < 200:
      dists.append(dist)
      dists.sort()
    elif dist < dists[199]:
      # Insert at the sorted position and drop the largest entry
      dists.insert(lower_bound(dists, dist), dist)
      dists.pop()

  # Dist 199 contains the squared distance of 200th nearest neighbour
  # (x itself is included when it is a row of data).
  return dists[199]

vectors.shape


In [None]:
# Define an empty list to store the 200th nearest neighbor distances
twohundrethneigh = []

# Iterate over the first 1000 vectors in the 'vectors' array
for val in vectors[:1000]:
    # Compute the 200th nearest neighbor distance for the current vector using the 'compute200thnearestneighbour' function
    distance = compute200thnearestneighbour(val, vectors[:1000])

    # Append the computed distance to the 'twohundrethneigh' list
    twohundrethneigh.append(distance)

# Sort the 'twohundrethneigh' list in ascending order
twohundrethneigh.sort()


In [None]:
# Import matplotlib for plotting
import matplotlib.pyplot as plt

# Set the figure size to 14x4 inches
plt.figure(figsize=(14, 4))

# Add a title to the plot
plt.title("Elbow Method for Finding the right Eps hyperparameter")

# Plot the 200th nearest neighbor distances against the number of points
plt.plot([x for x in range(len(twohundrethneigh))], twohundrethneigh)

# Add labels to the axes
plt.xlabel("Number of points")
plt.ylabel("Distance of 200th Nearest Neighbour")

# Display the plot
plt.show()
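
As a cross-check on the hand-written k-distance computation above, scikit-learn's `NearestNeighbors` can produce the same kind of curve. The sketch below assumes random synthetic data and a smaller `k` than the notebook's 200; `X_demo`, `k`, and `k_distances` are hypothetical names. Note that `kneighbors` returns Euclidean distances, not squared ones, which is the scale DBSCAN's `eps` is expressed in.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the averaged word vectors (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 5))

k = 20  # hypothetical value; the notebook uses 200

# When the fitted points are also the query points, each point is returned as
# its own nearest neighbour (distance 0), matching DBSCAN's min_samples,
# which counts the point itself.
nn = NearestNeighbors(n_neighbors=k).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)

# Distance to the k-th neighbour for every point, sorted for the elbow plot
k_distances = np.sort(distances[:, -1])
print(k_distances[:5], k_distances[-5:])
```

Plotting `k_distances` against its index gives the elbow curve; the distance at which the curve bends sharply upward is a common starting point for `eps`.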


In [None]:
# Create a DBSCAN model with eps=5 and min_samples of minPts
# (eps is a Euclidean distance, whereas the elbow plot above shows squared distances)
model_dbs = DBSCAN(eps=5, min_samples=minPts)

# Fit the model to the word vectors
model_dbs.fit(vectors)


In [None]:
# Copy the original DataFrame so the cluster labels can be added as a new column
df_dbs = df.copy()

# Add the cluster labels to the new DataFrame
df_dbs["DBS Cluster Label"] = model_dbs.labels_

# Display the new DataFrame
df_dbs
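
DBSCAN marks points that belong to no cluster with the label `-1`, so the raw label column is easier to interpret once clusters and noise are counted separately. A minimal sketch follows, assuming two synthetic blobs plus three hand-placed outliers; `X_demo`, `outliers`, `n_clusters`, and `n_noise` are hypothetical names, and the `eps` and `min_samples` values are illustrative only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two dense synthetic blobs (assumption: stand-in for the review vectors)
X_demo, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [8, 8]], cluster_std=0.5, random_state=0
)

# Add far-away points that should be flagged as noise
outliers = np.array([[20.0, -20.0], [-20.0, 20.0], [30.0, 30.0]])
X_demo = np.vstack([X_demo, outliers])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_demo)

# Label -1 is reserved for noise, so exclude it from the cluster count
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", n_noise)
```

If nearly every point ends up labelled `-1`, or everything falls into a single cluster, that usually suggests `eps` or `min_samples` needs adjusting.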


## Hierarchical clustering

In [None]:
import scipy
from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

# Perform hierarchical clustering on the word vectors using the 'ward' linkage method
dendro = hierarchy.dendrogram(hierarchy.linkage(vectors, method='ward'))

# Draw a horizontal line at y=20 to indicate the threshold for cutting the dendrogram
plt.axhline(y=20)

# Display the dendrogram
plt.show()
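
The horizontal line on the dendrogram corresponds to cutting the tree at a fixed height. The sketch below shows how such a cut can be turned into flat cluster labels with SciPy's `fcluster`. It assumes three synthetic groups and an illustrative threshold of 10; `X_demo`, `Z`, and `flat_labels` are hypothetical names.

```python
import numpy as np
from scipy.cluster import hierarchy

# Three tight synthetic groups (assumption: stand-in for the review vectors)
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=10.0, scale=0.3, size=(20, 2)),
])

# Build the Ward linkage matrix, the same structure the dendrogram is drawn from
Z = hierarchy.linkage(X_demo, method="ward")

# Cut the tree at height 10: merges above this height are undone
flat_labels = hierarchy.fcluster(Z, t=10.0, criterion="distance")
print(np.unique(flat_labels, return_counts=True))
```

Passing `criterion="maxclust"` instead asks for a target number of clusters rather than a height.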


In [None]:
# Import the AgglomerativeClustering class from scikit-learn
from sklearn.cluster import AgglomerativeClustering

# Create an AgglomerativeClustering object with 3 clusters and Ward linkage
# (Ward linkage always uses Euclidean distance; the affinity argument was deprecated in scikit-learn 1.2, so it is omitted)
cluster = AgglomerativeClustering(n_clusters=3, linkage='ward')

# Fit the AgglomerativeClustering object to the word vectors and predict cluster labels for each vector
Agg = cluster.fit_predict(vectors)


In [None]:
# Add the cluster labels to the original DataFrame
df['AVG-W2V Clus Label'] = cluster.labels_

# Display the first 5 rows of the DataFrame
df.head()


In [None]:
# Create a copy of the original DataFrame
hier_df = df.copy()

# Add the cluster labels predicted by the AgglomerativeClustering model to the copy DataFrame
hier_df["Hierarchical Cluster Labels"] = cluster.labels_

# Group the DataFrame by the "Hierarchical Cluster Labels" column and count the number of reviews in each cluster
cluster_counts = hier_df.groupby(["Hierarchical Cluster Labels"])["Reviews"].count()

# Display the cluster counts
print(cluster_counts)


In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.

K-means, DBSCAN, and hierarchical clustering are clustering algorithms, whereas Word2Vec and BERT are representation models that turn text into vectors for such algorithms to cluster, so the two groups play different roles. K-means yields a fixed number of compact clusters based on distance to centroids, but it is sensitive to outliers and handles non-spherical clusters poorly. DBSCAN's density-based approach can find irregularly shaped clusters and labels sparse points as noise, yet it is sensitive to the choice of eps and min_samples and struggles when cluster densities vary. Hierarchical clustering builds a tree of nested clusters that exposes multi-level relationships, but it is computationally expensive for large datasets. On the representation side, Word2Vec generates word vectors that capture semantic context but do not account for word order, while BERT produces contextual embeddings that capture word order and subtle linguistic variation, which suits intricate tasks such as question answering or sentiment analysis, at the cost of heavier computation.
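
A comparison like the one above can be backed by numbers. The sketch below scores two clusterings of the same data with the silhouette coefficient and measures how much they agree using the adjusted Rand index. It assumes synthetic blobs rather than the review vectors; `X_demo`, `km_labels`, `agg_labels`, and `ari` are hypothetical names.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Three synthetic blobs (assumption: stand-in for the review vectors)
X_demo, _ = make_blobs(
    n_samples=300, centers=[[0, 0], [6, 0], [3, 6]], cluster_std=0.7, random_state=1
)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)
agg_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_demo)

# Silhouette: internal quality of each clustering (higher is better)
print("silhouette (K-means):", round(silhouette_score(X_demo, km_labels), 3))
print("silhouette (Ward):", round(silhouette_score(X_demo, agg_labels), 3))

# Adjusted Rand index: agreement between the two labelings, ignoring label names
ari = adjusted_rand_score(km_labels, agg_labels)
print("agreement (ARI):", round(ari, 3))
```

The same two metrics apply whichever representation (TF-IDF, averaged Word2Vec, or BERT embeddings) produced the vectors, which makes them a convenient common yardstick.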
