# CS410: Natural Language Processing, Fall 2023
## A1: Sentiment Analysis Text Classification, Dan Jang - 10/9/2023

#### Description of Assignment

##### Introduction

##### Data Preparation
The dataset is the "Multilingual Amazon Reviews Corpus", a collection of customer product reviews stored in JSON format with several columns. This assignment uses a smaller subset of the original dataset and focuses on __two (2) columns__:
* "review_title" - the title text of the review.
* "stars" - an integer, either 1 or 5, where 1 indicates "negative" and 5 indicates "positive."

There will be a training set & a test set.

We will load the dataset using Python & use respective libraries to implement our text-classification model.
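
As a minimal sketch of the loading step, the snippet below reads one JSON object per line and keeps only the two columns of interest, mapping 5 stars to 1 (positive) and 1 star to 0 (negative). The one-object-per-line layout, the helper name `load_reviews`, and the sample records (including their `language` field) are assumptions made for illustration; they are not taken from the actual dataset files.

```python
import io
import json

def load_reviews(handle):
    """Read one JSON object per line and keep (title, label) pairs.

    Assumes each line holds a "review_title" string and a "stars" value
    of 1 or 5; 5 maps to 1 (positive) and anything else maps to 0.
    """
    rows = []
    for line in handle:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        label = 1 if record["stars"] == 5 else 0
        rows.append((record["review_title"], label))
    return rows

# Two made-up records standing in for the real file contents.
sample = io.StringIO(
    '{"review_title": "Works great", "stars": 5, "language": "en"}\n'
    '{"review_title": "Broke in a week", "stars": 1, "language": "en"}\n'
)
reviews = load_reviews(sample)
print(reviews)
```

The same function accepts an open file handle, so it could be pointed at a real file with `with open(path) as handle: load_reviews(handle)`.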

Optionally, we will preprocess the data if needed, e.g. by normalizing letter case.

##### Feature Engineering
We will choose a set of features to focus on in our text-classification model, e.g. *n*-grams, word counts, cue words, repeated punctuation, etc.
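
The feature types listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the cue-word lists, the regular expressions, and the function names `word_ngrams` and `extract_features` are hypothetical choices, not part of the assignment.

```python
import re

# Hypothetical cue-word lists, chosen only for illustration.
POSITIVE_CUES = {"great", "love", "excellent"}
NEGATIVE_CUES = {"bad", "broken", "terrible"}

def word_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract_features(text):
    """Compute a few hand-crafted features for one review title."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        "num_words": len(tokens),
        "bigrams": word_ngrams(tokens, 2),
        "positive_cues": sum(t in POSITIVE_CUES for t in tokens),
        "negative_cues": sum(t in NEGATIVE_CUES for t in tokens),
        # Runs of two or more '!' or '?' characters, e.g. "!!!" or "?!"
        "repeated_punct": len(re.findall(r"[!?]{2,}", text)),
    }

features = extract_features("Love it!!! Great value")
print(features)
```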

##### Text Classification Model
To build our text-classification model, we will __follow these steps__:
* Selection of any *two* suitable algorithms for text classification.
* Vectorization of the text data (conversion of text into numerical features).
* Training of the text-classification model using the training dataset, "sentiment_train.json."
* Evaluation of our text-classification model using the testing dataset, "sentiment_test.json."
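
To make the vectorization step concrete, here is a simplified bag-of-words sketch. It is a stand-in written for illustration, not the implementation used by any particular library; the whitespace tokenizer and the names `fit_vocabulary` and `transform` are assumptions. The point it shows is that the vocabulary is learned from the training texts only and then reused unchanged for the test texts.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def fit_vocabulary(texts):
    """Map each distinct training token to a column index, in sorted order."""
    vocab = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(vocab)}

def transform(texts, vocabulary):
    """Turn texts into count vectors; tokens unseen in training are ignored."""
    vectors = []
    for text in texts:
        counts = Counter(tokenize(text))
        vectors.append([counts.get(tok, 0) for tok in vocabulary])
    return vectors

# Made-up titles standing in for the training and test sets.
train_titles = ["great product", "bad product", "great great value"]
test_titles = ["great price"]

vocabulary = fit_vocabulary(train_titles)
x_train = transform(train_titles, vocabulary)
x_test = transform(test_titles, vocabulary)
print(vocabulary)
print(x_train)
print(x_test)
```

Because "price" never appears in the training titles, it has no column and is dropped from the test vector, which is one reason a small training vocabulary can limit a model.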

##### Results & Analysis
We will provide a detailed analysis of the model's performance by comparing the outputs of our two algorithms, and we will __include the following__:
* *F1-score* or other relevant metrics.
* Confusion matrix.
* Any challenges or limitations of the text-classification model/task.
* Suggestions for improvement in the performance of the text-classification model.
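
For reference, the two headline metrics can be computed by hand from binary labels. The sketch below assumes labels of 0 (negative) and 1 (positive) and lays the matrix out as `[[TN, FP], [FN, TP]]`, with rows for the true class and columns for the predicted class; the function names and the example labels are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 is the positive class."""
    tn = fp = fn = tp = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == 1 and guess == 1:
            tp += 1
        elif truth == 1:
            fn += 1
        elif guess == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def f1_from_counts(fp, fn, tp):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Made-up labels: three positive and three negative examples.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print([[tn, fp], [fn, tp]])
print(f1_from_counts(fp, fn, tp))
```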

#### Requirements


### Main Implementation: Text Classification

##### CS410: Natural Language Processing, Fall 2023 - 10/9/2023
##### A1: Sentiment Analysis Text Classification, Dan Jang
#### Objective: Exploring Natural Language Processing (NLP), by building a text-classifier
#### for a text classification task, predicting whether a piece of text is "positive" or "negative."

### 0.) Libraries
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import json
import pandas
import numpy as np
import matplotlib.pyplot as plot
import nltk

# ### 1.) Main Program Wrapper, a1_text_classifer
#class a1_text_classifer(object):

### 1.2.a) Gaussian Naïve Bayes algorithm using sklearn.naive_bayes.GaussianNB
### https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
### Returns four (4) thingys:
# I.) accuracy_score,
# II.) f1_score,
# III.) confusion_matrix,
# & IV.) classification_report.
def algo_one(xtrain, ytrain, xtest, ytest):
    gbayes = GaussianNB()
    
    gbayes.fit(xtrain, ytrain)
    predictionresults = gbayes.predict(xtest)
    
    return accuracy_score(ytest, predictionresults), f1_score(ytest, predictionresults), confusion_matrix(ytest, predictionresults), classification_report(ytest, predictionresults)
    
### 1.2.b) Logistic Regression algorithm using sklearn.linear_model.LogisticRegression
### https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
### Returns four (4) thingys:
# I.) accuracy_score,
# II.) f1_score,
# III.) confusion_matrix,
# & IV.) classification_report.
def algo_two(xtrain, ytrain, xtest, ytest):
    lreg = LogisticRegression()
    
    lreg.fit(xtrain, ytrain)
    predictionresults = lreg.predict(xtest)
    
    return accuracy_score(ytest, predictionresults), f1_score(ytest, predictionresults), confusion_matrix(ytest, predictionresults), classification_report(ytest, predictionresults)

def main(): #trainfile, testfile):
    print("Welcome, this is the main program for A1: Sentiment Analysis Text Classification.")
    print("Written by Dan J. for CS410: Natural Language Processing.")
    print("\nWe will use two classification algorithms, 1. Gaussian Näive Bayes & 2. Logistic Regression, to create a text-classifier to guess negative or positive sentimentiality based on various text-reviews of products.")
    
    ## For converting accuracy to percent
    percentness = float(100)
    
    ## 1.0.) Constants, Variables, & Datasets
    
    # trainfile = str(trainfile)
    # testfile = str(testfile)
    traindata = []
    testdata = []
    
    # 1.0.I.A) Debug Statements #1a for dataset loading times:
    print("Loading the training & testing datasets...")
    # with open(trainfile, "r") as trainfile:
    with open("sentiment_train.json", "r") as trainfile:
        #traindata = json.load(trainfile)
        for row in trainfile:
            traindata.append(json.loads(row))
        
    trainframe = pandas.DataFrame(traindata)
        
    # with open(testfile, "r") as testfile:
    with open("sentiment_test.json", "r") as testfile:
        #testdata = json.load(testfile)
        for row in testfile:
            testdata.append(json.loads(row))
        
    testframe = pandas.DataFrame(testdata)

    # 1.0.I.B) Debug Statements #1b for dataset loading times:
    print("Successfully loaded the training & testing datasets!\n")
    
    ## 1.0.1.) Initial Preprocessing of the training & testing data
    ## First, we isolate our two (2) columns, "review_title" & "stars."
    ## Second, we will convert values in the "stars" column so that 1 [negative] = 0 & 5 [positive] = 1.
    ## This will allow us to make the negative or positive sentiment a binary value-based thingy.
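    ## .copy() makes each two-column frame independent, avoiding pandas' SettingWithCopyWarning on the assignments below.
    trainframe = trainframe[['review_title', 'stars']].copy()
    trainframe['stars'] = trainframe['stars'].apply(lambda x: 1 if x == 5 else 0)
    
    testframe = testframe[['review_title', 'stars']].copy()
    testframe['stars'] = testframe['stars'].apply(lambda x: 1 if x == 5 else 0)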
    
    ## 1.1.) Vectorization of the text-reviews in the datasets using sklearn.feature_extraction.text.CountVectorizer.
    ## As a core component of text-classification, the vectorization process of the text-review data is essential for feature engineering in natural language processing.
    ## https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    vectorization_machine_9000 = CountVectorizer()
    xtrain = vectorization_machine_9000.fit_transform(trainframe['review_title'])
    xtrain = xtrain.toarray()
    ytrain = trainframe['stars']
    
    xtest = vectorization_machine_9000.transform(testframe['review_title'])
    xtest = xtest.toarray()
    ytest = testframe['stars']
    
    ### 1.2.) Run Text-Classification Algorithms & Print the Model Results
    print("-----\n")
    print("Running algorithms on le training & testing datasets...")
    algo1accuracy, algo1f1, algo1cmatrix, algo1creport = algo_one(xtrain, ytrain, xtest, ytest)
    algo2accuracy, algo2f1, algo2cmatrix, algo2creport = algo_two(xtrain, ytrain, xtest, ytest)
    
    print("...Done!")
    print("-----\n")
    
    print("Here are le results...\n")
    print("Algorithm #1: Gaussian Näive Bayes Performance, Metrics, & Results:")
    print("Accuracy: ", algo1accuracy * percentness, "%")
    print("F1 Score: ", algo1f1)
    print("Confusion Matrix: \n", algo1cmatrix)
    print("Classification Report: \n", algo1creport)
    print("-----\n")
    
    print("Algorithm #2: Logistic Regression Performance, Metrics, & Results:")
    print("Accuracy: ", algo2accuracy * percentness, "%")
    print("F1 Score: ", algo2f1)
    print("Confusion Matrix: \n", algo2cmatrix)
    print("Classification Report: \n", algo2creport)
    print("-----\n")

#a1_program = a1_text_classifer("sentiment_train.json", "sentiment_test.json")

#### Commented out codez
# def main():
    
if __name__ == "__main__":
    main()

Welcome, this is the main program for A1: Sentiment Analysis Text Classification.
Written by Dan J. for CS410: Natural Language Processing.

We will use two classification algorithms, 1. Gaussian Näive Bayes & 2. Logistic Regression, to create a text-classifier to guess negative or positive sentimentiality based on various text-reviews of products.
Loading the training & testing datasets...
Successfully loaded the training & testing datasets!

-----

Running algorithms on le training & testing datasets...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


...Done!
-----

Here are le results...

Algorithm #1: Gaussian Näive Bayes Performance, Metrics, & Results:
Accuracy:  59.199999999999996 %
F1 Score:  0.3664596273291925
Confusion Matrix: 
 [[948  52]
 [764 236]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.55      0.95      0.70      1000
           1       0.82      0.24      0.37      1000

    accuracy                           0.59      2000
   macro avg       0.69      0.59      0.53      2000
weighted avg       0.69      0.59      0.53      2000

-----

Algorithm #2: Logistic Regression Performance, Metrics, & Results:
Accuracy:  92.7 %
F1 Score:  0.9272908366533865
Confusion Matrix: 
 [[923  77]
 [ 69 931]]
Classification Report: 
               precision    recall  f1-score   support

           0       0.93      0.92      0.93      1000
           1       0.92      0.93      0.93      1000

    accuracy                           0.93      2000
   macro avg       0.93    

### Text-Classification Model Performance Analysis & Discussion

#### Initial Data Results, Metrics, & Analysis
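
As a starting point for reading the results, the sketch below recomputes a few summary figures directly from the two confusion matrices printed in the run output. It assumes the matrices are laid out as `[[TN, FP], [FN, TP]]` (rows are the true class, columns the predicted class); the helper name `summarize` is hypothetical.

```python
# Confusion matrices copied from the run output, laid out as
# [[TN, FP], [FN, TP]] with rows = true class, columns = predicted class.
reported = {
    "Gaussian Naive Bayes": [[948, 52], [764, 236]],
    "Logistic Regression": [[923, 77], [69, 931]],
}

def summarize(matrix):
    """Derive accuracy, per-class recall, and predicted-negative share."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "negative_recall": tn / (tn + fp),
        "positive_recall": tp / (tp + fn),
        # Share of all test reviews that the model labeled negative.
        "predicted_negative_share": (tn + fn) / total,
    }

summaries = {name: summarize(m) for name, m in reported.items()}
for name, stats in summaries.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Read this way, Gaussian Naïve Bayes labeled 1,712 of the 2,000 test titles (about 86%) as negative, which is consistent with its high negative recall (0.948) and low positive recall (0.236). Logistic Regression split its predictions almost evenly (about 50% negative) and reached similar recall on both classes.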

#### Comparative Analysis & Discussion

#### Text-Classification Challenges & Limitations

#### Discussion for Future Performance & Efficacy Improvements

### References & Resources

#### Libraries & Dependencies
matplotlib.pyplot

numpy

pandas

[sklearn.naive_bayes.GaussianNB](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html)

[sklearn.linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

[sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)

[sklearn.feature_extraction.text.CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

[sklearn.metrics.f1_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)

[sklearn.metrics.accuracy_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html)

[sklearn.metrics.confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html)

[sklearn.metrics.classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

#### References & Credits

[*NLP Tutorial for Text Classification in Python* by Vijaya Rani](https://medium.com/analytics-vidhya/nlp-tutorial-for-text-classification-in-python-8f19cd17b49e)

[*Using CountVectorizer to Extracting Features from Text* by *GeeksforGeeks*](https://www.geeksforgeeks.org/using-countvectorizer-to-extracting-features-from-text/#)

#### Special Thanks

[Fixing *sklearn ImportError: No module named _check_build*](https://stackoverflow.com/questions/23062524/sklearn-importerror-no-module-named-check-build)