In [29]:
import pandas as pd
import sys
from nltk.corpus import stopwords
import re
import emot
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.model_selection import train_test_split
import nltk
import numpy as np



In [30]:
data = pd.read_csv("master_data.csv").drop("Unnamed: 0", axis = 1)
data.head()

Unnamed: 0,review,rating,music_app
0,This is by far the best music app I have ever ...,5,Amazon
1,I really like this app but I have tried an tri...,4,Amazon
2,"This app is great, i've been using it for a co...",4,Amazon
3,Not a bad music app. Selection is good could b...,3,Amazon
4,"This is one of the most used app on my phone, ...",2,Amazon


### Approaches

We want to be able to see which classification works the best.
- We will first do the traditional approach with bag of words models, notably CountVectorize and TF-IDF vectorization
- Next, we will use spacy 

- The goal will be to beat the Bert model (Operation Beat BERT), which has a testing accuracy of 61%, and "1-off" accuracy of 85%. This means that our Bert model classifies 85% of reviews within 1 star of their actual score.

In [3]:
x = data.loc[1]["review"]

# code taken from https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b to remove emojis
# this doesn't have the most up to date emojis so we had to find an updated version for some unicode

def remove_emojis(string):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # chinese char
                               u"\U00002702-\U000027B0"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  
                               u"\u3030"
                               
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

print(x)
print(remove_emojis(x))

# henry package to emoji -> word

I really like this app but I have tried an tried to find out why every time I get a phone call or I go out of my app when I push pause on my music an go back within minutes it goes an refreshes the whole app. I’m literally in the middle of a song so what is the purpose of refreshing the app 🧐I tried to just start putting the music on an old phone I have that’s just for WiFi but even that if I’m closer to it an pause it, you can’t leave the app longer then a minute it seems without it just thinking your not coming back or something. An the other issue is it constantly takes random music that i do listen to often out of my music list. I’ll add a song to my music an sometimes a playlist too, an I’ll be listening to the song randomly later on a station that I choose an I’ll go an decide to add it to a playlist an see that it’s not even in my music. So I’ll add it again. I’ve only notice it since like I said I’ll decide to add it to a playlist an see it has the plus next to add music. But w

In [31]:
# stopwords 
# we know the reviews are about music, so we remove the stopword music
# we also remove punctuation
# overall, the revies are clean, although there are emojis present. 
# for now we will keep emojis as they could be an indication in our classification

stopword_list = set(stopwords.words('english') + ["music", ".", "!", "?", ",",":"])
stopword_list



{'!',
 ',',
 '.',
 ':',
 '?',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'music',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 '

In [5]:
#stemming, create an empty list, loop through and stem
stemmer = nltk.stem.porter.PorterStemmer()

reviews = data["review"].str.lower()

stemmed_list = []
for i in reviews:
    tokens = nltk.word_tokenize(i)
    x = ''
    for j in tokens:
        x = x + ' ' + stemmer.stem(word = j)
    stemmed_list.append(x)


In [6]:
stemmed_list[5:7]

[' i hate mani thing about thi app , but the featur most bothersom to me of late is the keep listen featur . the sleep timer is turn off , but everi hour i get ask am i still listen to music . did i stop the music ? ! i have my devic plug in across the room becaus it ’ s not compat with legrand digit audio ( fix thi too ! ) . so , everi hour when the music stop i have to go press keep listen . i ’ m not oppos to exercis , but i ’ m tri to work , and thi incess disrupt make me want to pay for a differ stream music servic . thi should onli happen if i engag the sleep timer . thi is a useless featur , especi if i am actual stream on wifi and have the stream onli on wifi featur engag or am in offlin mode listen to download music . get rid of thi . anoth huge annoy is the amount of data thi app use even in data saver mode . my daili commut ( 160 minut total stream time ) use 1 gb of data ! mayb it would use less data if the lyric featur could be disabl . final your app support wa useless fo

In [6]:
stemmed_df = pd.DataFrame(stemmed_list, columns = ["reviews"])
stemmed_df["ratings"] = data["rating"]
stemmed_df

Unnamed: 0,reviews,ratings
0,thi is by far the best music app i have ever ...,5
1,i realli like thi app but i have tri an tri t...,4
2,"thi app is great , i 've been use it for a co...",4
3,not a bad music app . select is good could be...,3
4,"thi is one of the most use app on my phone , ...",2
...,...,...
72046,𝐠𝐨𝐚𝐭,5
72047,i hate onli ad xd,5
72048,i ca n't search and play some song other than...,1
72049,just work well,5


In [8]:
stemmed_df_reviews = stemmed_df["reviews"]
stemmed_df_ratings = stemmed_df["ratings"]
X_train, X_test, y_train, y_test = train_test_split(stemmed_df_reviews, stemmed_df_ratings,
                                                    test_size = 0.2, random_state = 25)

In [9]:
# for our vectorization, we start with CountVectorizer

vectorizer = CountVectorizer(ngram_range = (1,2),
                            stop_words = stopword_list,
                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # removes emojis btw
                            max_features = 2000,
                            min_df = 2,
                            binary = True)

train_X = vectorizer.fit_transform(X_train)
lr1 = LogisticRegression(multi_class='multinomial')
lr1.fit(train_X, y_train)
y_train_pred = lr1.predict(train_X)

print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == y_train),4)} accurate.')

y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
print(f'Without dimensionality reduction, on the testing set, our model is {round(np.mean(y_test_pred == y_test),4)} accurate.')

# we get 66% accuracy on the training set and 45% on the testing set
# this was just a first pass, in a little we will loop through different parameters and find optimal values
# also it is important to note this was with a random state of 25 for train_test_split

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6649 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4464 accurate.


In [10]:
# what if we use tfidf?

vectorizer = TfidfVectorizer(ngram_range = (1,2),
                            stop_words = stopword_list,
                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                            max_features = 2000,
                            min_df = 2,
                            binary = True)

train_X = vectorizer.fit_transform(X_train)
lr1 = LogisticRegression(multi_class='multinomial')
lr1.fit(train_X, y_train)
y_train_pred = lr1.predict(train_X)

print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == y_train),4)} accurate.')

y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
print(f'Without dimensionality reduction, on the testing set, our model is {round(np.mean(y_test_pred == y_test),4)} accurate.')

# this is okay even though we are overfitting, and our accuracy on the test set is 48%. 
# Although BERT accuracy is 61% so this really isn't bad!
# this model does better and improves our testing accuracy by about 3% compared to CountVectorize

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6585 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4801 accurate.


In [11]:
# lets loop and find our best model for CountVectorize

ngrams = [(1,1), (1,2), (2,2), (3,3)]
max_features = [100, 500,1000,2000]
min_df = [2, 3]

champion_model = ""
champion_model_test_score = 0

for i in ngrams:
    for j in max_features:
        for k in min_df:
                print(f'ngrams = {i}; max_features = {j}; min_df = {k}')
                vectorizer = CountVectorizer(ngram_range = i,
                                            stop_words = stopword_list,
                                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                                            max_features = j,
                                            min_df = k,
                                            binary = True)

                train_X = vectorizer.fit_transform(X_train)
                lr1 = LogisticRegression(multi_class='multinomial')
                lr1.fit(train_X, y_train)
                y_train_pred = lr1.predict(train_X)

                print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == y_train),4)} accurate.')

                y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
                print(f'Without dimensionality reduction, on the testing set, our model is {round(np.mean(y_test_pred == y_test),4)} accurate.')
                if round(np.mean(y_test_pred == y_test),4) > champion_model_test_score:
                    champion_model_test_score = round(np.mean(y_test_pred == y_test),4)
                    champion_model = f'ngram = {i} ; max_features = {j} ; min_df = {k}'
                

ngrams = (1, 1); max_features = 100; min_df = 2
Without dimensionality reduction, on the training set, our model is 0.5698 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4547 accurate.
ngrams = (1, 1); max_features = 100; min_df = 3
Without dimensionality reduction, on the training set, our model is 0.5698 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4547 accurate.
ngrams = (1, 1); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6247 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4581 accurate.
ngrams = (1, 1); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6247 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4581 accurate.
ngrams = (1, 1); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6424 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4596 accurate.
ngrams = (1, 1); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6424 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4533 accurate.
ngrams = (1, 1); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6616 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4177 accurate.
ngrams = (1, 1); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6612 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4292 accurate.
ngrams = (1, 2); max_features = 100; min_df = 2
Without dimensionality reduction, on the training set, our model is 0.5683 accurate.
Without dimensionality reduction, on the testing set, our model is 0.5079 accurate.
ngrams = (1, 2); max_features = 100; min_df = 3
Without dimensionality reduction, on the training set, our model is 0.5683 accurate.
Without dimensionality reduction, on the testing set, our model is 0.5079 accurate.
ngrams = (1, 2); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6226 accurate.
Without dimensionality reduction, on the testing set, our model is 0.505 accurate.
ngrams = (1, 2); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6226 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4935 accurate.
ngrams = (1, 2); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6422 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4664 accurate.
ngrams = (1, 2); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6421 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4756 accurate.
ngrams = (1, 2); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6649 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4464 accurate.
ngrams = (1, 2); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6653 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4521 accurate.
ngrams = (2, 2); max_features = 100; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5091 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4522 accurate.
ngrams = (2, 2); max_features = 100; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5091 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4522 accurate.
ngrams = (2, 2); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5557 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4658 accurate.
ngrams = (2, 2); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5556 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4797 accurate.
ngrams = (2, 2); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5817 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4728 accurate.
ngrams = (2, 2); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5814 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4732 accurate.
ngrams = (2, 2); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6116 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4525 accurate.
ngrams = (2, 2); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6114 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4495 accurate.
ngrams = (3, 3); max_features = 100; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.4789 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4566 accurate.
ngrams = (3, 3); max_features = 100; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.4789 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4723 accurate.
ngrams = (3, 3); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.503 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4633 accurate.
ngrams = (3, 3); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.503 accurate.
Without dimensionality reduction, on the testing set, our model is 0.468 accurate.
ngrams = (3, 3); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5233 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4542 accurate.
ngrams = (3, 3); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5228 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4609 accurate.
ngrams = (3, 3); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5459 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4511 accurate.
ngrams = (3, 3); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5468 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4538 accurate.


In [12]:
print(f'For CountVectorization, our champion model is {champion_model} with an accuracy of \
{champion_model_test_score} on the testing set')

For CountVectorization, our champion model is ngram = (1, 2) ; max_features = 100 ; min_df = 2 with an accuracy of 0.5079 on the testing set


In [13]:
# lets loop and find our best model
ngrams = [(1,1), (1,2), (2,2), (3,3)]
max_features = [100, 500,1000,2000]
min_df = [2, 3]

champion_model = ""
champion_model_test_score = 0

for i in ngrams:
    for j in max_features:
        for k in min_df:
                print(f'ngrams = {i}; max_features = {j}; min_df = {k}')
                vectorizer = TfidfVectorizer(ngram_range = i,
                                            stop_words = stopword_list,
                                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                                            max_features = j,
                                            min_df = k,
                                            binary = True)

                train_X = vectorizer.fit_transform(X_train)
                lr1 = LogisticRegression(multi_class='multinomial')
                lr1.fit(train_X, y_train)
                y_train_pred = lr1.predict(train_X)

                print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == y_train),4)} accurate.')

                y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
                print(f'Without dimensionality reduction, on the testing set, our model is {round(np.mean(y_test_pred == y_test),4)} accurate.')
                if round(np.mean(y_test_pred == y_test),4) > champion_model_test_score:
                    champion_model_test_score = round(np.mean(y_test_pred == y_test),4)
                    champion_model = f'ngram = {i} ; max_features = {j} ; min_df = {k}'


ngrams = (1, 1); max_features = 100; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5782 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4299 accurate.
ngrams = (1, 1); max_features = 100; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5782 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4299 accurate.
ngrams = (1, 1); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6287 accurate.
Without dimensionality reduction, on the testing set, our model is 0.471 accurate.
ngrams = (1, 1); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6287 accurate.
Without dimensionality reduction, on the testing set, our model is 0.471 accurate.
ngrams = (1, 1); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6416 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4592 accurate.
ngrams = (1, 1); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6418 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4324 accurate.
ngrams = (1, 1); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6526 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4318 accurate.
ngrams = (1, 1); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6533 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4353 accurate.
ngrams = (1, 2); max_features = 100; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5771 accurate.
Without dimensionality reduction, on the testing set, our model is 0.5106 accurate.
ngrams = (1, 2); max_features = 100; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5771 accurate.
Without dimensionality reduction, on the testing set, our model is 0.5106 accurate.
ngrams = (1, 2); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6259 accurate.
Without dimensionality reduction, on the testing set, our model is 0.5029 accurate.
ngrams = (1, 2); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6259 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4905 accurate.
ngrams = (1, 2); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6434 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4519 accurate.
ngrams = (1, 2); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6434 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4838 accurate.
ngrams = (1, 2); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6585 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4801 accurate.
ngrams = (1, 2); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6587 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4747 accurate.
ngrams = (2, 2); max_features = 100; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5116 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4463 accurate.
ngrams = (2, 2); max_features = 100; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5116 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4463 accurate.
ngrams = (2, 2); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5603 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4656 accurate.
ngrams = (2, 2); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5602 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4809 accurate.
ngrams = (2, 2); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5849 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4698 accurate.
ngrams = (2, 2); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5851 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4635 accurate.
ngrams = (2, 2); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6102 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4485 accurate.
ngrams = (2, 2); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6095 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4486 accurate.
ngrams = (3, 3); max_features = 100; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.4787 accurate.
Without dimensionality reduction, on the testing set, our model is 0.459 accurate.
ngrams = (3, 3); max_features = 100; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.4787 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4714 accurate.
ngrams = (3, 3); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5045 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4658 accurate.
ngrams = (3, 3); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5045 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4688 accurate.
ngrams = (3, 3); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.523 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4546 accurate.
ngrams = (3, 3); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5226 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4646 accurate.
ngrams = (3, 3); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.543 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4512 accurate.
ngrams = (3, 3); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5431 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4576 accurate.


In [14]:
print(f'For TfIdfVectorization, our champion model is {champion_model} with an accuracy of \
{champion_model_test_score} on the testing set')

For TfIdfVectorization, our champion model is ngram = (1, 2) ; max_features = 100 ; min_df = 2 with an accuracy of 0.5106 on the testing set


## Takeaways:

- Both CountVectorize and TfIdf vectorization have accuracies revolving around 51% on the testing set.
- Honestly, this is pretty good considering BERT gave 61%!
- Both vectorizers had highest accuracy for ngram of (1,2), max_features = 100, and min_df = 2
- Let's now analyze our second accuracy metric, which is "off by 1" accuracy (if our model classifies a review as a 4 instead of 5 or a 2 instead of a 1 for example)

In [15]:
X_train, X_test, y_train, y_test = train_test_split(stemmed_df_reviews, stemmed_df_ratings,
                                                    test_size = 0.2, random_state = 25)


vectorizer = CountVectorizer(ngram_range = (1,2),
                            stop_words = stopword_list,
                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                            max_features = 100,
                            min_df = 2,
                            binary = True)

train_X = vectorizer.fit_transform(X_train)
lr1 = LogisticRegression(multi_class='multinomial')
lr1.fit(train_X, y_train)
y_train_pred = lr1.predict(train_X)

print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == y_train),4)} accurate.')

y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
print(f'Without dimensionality reduction, on the testing set, our model is {round(np.mean(y_test_pred == y_test),4)} accurate.')




Without dimensionality reduction, on the training set, our model is 0.5683 accurate.
Without dimensionality reduction, on the testing set, our model is 0.5079 accurate.


In [16]:
# 1 off accuracy, Count Vectorization
test_frame = pd.DataFrame(X_test)
test_frame["actual_ratings"] = y_test
test_frame["predicted_ratings"] = y_test_pred

test_frame["difference"] = abs(test_frame["actual_ratings"]-test_frame["predicted_ratings"]) <= 1

print(f'The off by 1 accuracy for our testing set for our CountVectorizer champion model is \
{round(sum(test_frame["difference"])/len(test_frame["difference"]),5)}')

The off by 1 accuracy for our testing set for our CountVectorizer champion model is 0.67102


In [17]:
X_train, X_test, y_train, y_test = train_test_split(stemmed_df_reviews, stemmed_df_ratings,
                                                    test_size = 0.2, random_state = 25)


vectorizer = TfidfVectorizer(ngram_range = (1,2),
                            stop_words = stopword_list,
                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                            max_features = 100,
                            min_df = 2,
                            binary = True)

train_X = vectorizer.fit_transform(X_train)
lr1 = LogisticRegression(multi_class='multinomial')
lr1.fit(train_X, y_train)
y_train_pred = lr1.predict(train_X)

print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == y_train),4)} accurate.')

y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
print(f'Without dimensionality reduction, on the testing set, our model is {round(np.mean(y_test_pred == y_test),4)} accurate.')




STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.5771 accurate.
Without dimensionality reduction, on the testing set, our model is 0.5106 accurate.


In [18]:
# 1 off accuracy, TfIdf Vectorization
test_frame = pd.DataFrame(X_test)
test_frame["actual_ratings"] = y_test
test_frame["predicted_ratings"] = y_test_pred

test_frame["difference"] = abs(test_frame["actual_ratings"]-test_frame["predicted_ratings"]) <= 1

print(f'The off by 1 accuracy for our testing set for our CountVectorizer champion model is \
{round(sum(test_frame["difference"])/len(test_frame["difference"]),5)}')

The off by 1 accuracy for our testing set for our CountVectorizer champion model is 0.66921


### Takeaways:
- Our Count Vectorizer, on our testing set, had an accuracy of 50.79%. If we look at 1 off accuracy, it was 67.1% accurate, or about 2/3.
- Our TfIdf Vectorizer, on our testing set, had an accuracy of 51.06%. If we look at 1 off accuracy, it was 66.92% accurate, also about 2/3
#### The big takeaway here is that we generally classify about half correctly, and this jumps up to 2/3 if we look at "within 1" ratings. 


Before we jump in and look into confusion matrix and AUROC, let's first test different random states for our train_test_splits

In [19]:
total = 0
random_list = [1, 7, 19, 25, 31, 44, 67, 125, 177, 255]
for i in random_list:
    X_train, X_test, y_train, y_test = train_test_split(stemmed_df_reviews, stemmed_df_ratings,
                                                        test_size = 0.2, random_state = i)

    vectorizer = CountVectorizer(ngram_range = (1,2),
                                stop_words = stopword_list,
                                token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                                max_features = 100,
                                min_df = 2,
                                binary = True)
    train_X = vectorizer.fit_transform(X_train)
    lr1 = LogisticRegression(multi_class='multinomial')
    lr1.fit(train_X, y_train)
    y_train_pred = lr1.predict(train_X)
    y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
    accuracy = round(np.mean(y_test_pred == y_test),4)
    print(f'With a random state of {i}, our test set accuracy is {accuracy}')
    total += accuracy

print(f'Our average accuracy is {round(total/len(random_list), 5)}')

# it looks like we got a little lucky choosing 25 as our random state, but we still have averaged 47.3% 
# accuracy on the testing set

# we select the closest value to this average for our final model, and use random state 255.

With a random state of 1, our test set accuracy is 0.424
With a random state of 7, our test set accuracy is 0.4368
With a random state of 19, our test set accuracy is 0.4796
With a random state of 25, our test set accuracy is 0.5079
With a random state of 31, our test set accuracy is 0.564
With a random state of 44, our test set accuracy is 0.4633
With a random state of 67, our test set accuracy is 0.4902
With a random state of 125, our test set accuracy is 0.4584
With a random state of 177, our test set accuracy is 0.4333
With a random state of 255, our test set accuracy is 0.4723
Our average accuracy is 0.47298


In [20]:
total = 0
random_list = [1, 7, 19, 25, 31, 44, 67, 125, 177, 255]
for i in random_list:
    X_train, X_test, y_train, y_test = train_test_split(stemmed_df_reviews, stemmed_df_ratings,
                                                        test_size = 0.2, random_state = i)

    vectorizer = TfidfVectorizer(ngram_range = (1,2),
                                stop_words = stopword_list,
                                token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                                max_features = 100,
                                min_df = 2,
                                binary = True)
    train_X = vectorizer.fit_transform(X_train)
    lr1 = LogisticRegression(multi_class='multinomial')
    lr1.fit(train_X, y_train)
    y_train_pred = lr1.predict(train_X)
    y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
    accuracy = round(np.mean(y_test_pred == y_test),4)
    print(f'With a random state of {i}, our test set accuracy is {accuracy}')
    total += accuracy

print(f'Our average accuracy is {round(total/len(random_list), 5)}')

# again, it seems like originally picking 25 was a little lucky, but here we obtain an average 46.97% accuracy on
# the testing set

# we will take random seed of 255 because this value is closest to the average

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 1, our test set accuracy is 0.4285


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 7, our test set accuracy is 0.4254


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 19, our test set accuracy is 0.4826


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 25, our test set accuracy is 0.5106


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 31, our test set accuracy is 0.5708


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 44, our test set accuracy is 0.4582


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 67, our test set accuracy is 0.4893


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 125, our test set accuracy is 0.4474


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 177, our test set accuracy is 0.4142


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


With a random state of 255, our test set accuracy is 0.4697
Our average accuracy is 0.46967


## Next Steps
- Our final choice is random seed of 255, since it is closest to the average scores of the 10 random seeds we chose 
- Let's first look at accuracy for Count and TfIdf vectorization.
- Then, we will look at "1 off accuracy"
- Finally, we will look at the confusion matrix and AUROC curves for both of these models

In [9]:
X_train, X_test, y_train, y_test = train_test_split(stemmed_df_reviews, stemmed_df_ratings,
                                                    test_size = 0.2, random_state = 255)

vectorizer = CountVectorizer(ngram_range = (1,2),
                            stop_words = stopword_list,
                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                            max_features = 100,
                            min_df = 2,
                            binary = True)

train_X = vectorizer.fit_transform(X_train)
lr1 = LogisticRegression(multi_class='multinomial')
lr1.fit(train_X, y_train)
y_train_pred = lr1.predict(train_X)
y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
accuracy = round(np.mean(y_test_pred == y_test),4)
print(f'Our test set accuracy is {accuracy}')

test_frame = pd.DataFrame(X_test)
test_frame["actual_ratings"] = y_test
test_frame["predicted_ratings"] = y_test_pred
test_frame["difference"] = abs(test_frame["actual_ratings"]-test_frame["predicted_ratings"]) <= 1
off1_accuracy = round(sum(test_frame["difference"])/len(test_frame["difference"]),5)
print(f'The off by 1 accuracy for our testing set for our CountVectorizer champion model is \
{off1_accuracy}')

Our test set accuracy is 0.4723
The off by 1 accuracy for our testing set for our CountVectorizer champion model is 0.64895


- This is actually pretty solid, about 2/3 of the reviews are classified within 1 of their actual score with our CountVectorizer

In [13]:
print(f'Our confusion matrix looks like \n \n  {confusion_matrix(y_test, y_test_pred)} \n')

x = confusion_matrix(y_test, y_test_pred)

stars = [1,2,3,4,5]

precision_list = []
recall_list = []
total_actual_list = []
total_predicted_list = []

for i in stars:
    precision = round(x[i-1,i-1]/sum(x[0:5,i-1]),4)
    precision_list.append(precision)
    recall = round(x[i-1,i-1]/sum(x[i-1,0:5]),4)
    recall_list.append(recall)
    total_actual_list.append(sum(x[i-1,0:5]))
    total_predicted_list.append(sum(x[0:5,i-1]))
    F1 = round(2*recall*precision/(precision+recall),4)
    print(f'For the {sum(x[i-1,0:5])} {i} star reviews, precision is {precision}, recall is {recall}, and our F1 score would be {F1} \n')

p = 0
for i in range(len(precision_list)):
    p += precision_list[i]*total_actual_list[i]
final_prec = round(p/sum(total_actual_list),4)

r = 0
for i in range(len(recall_list)):
    r += recall_list[i]*total_actual_list[i]
final_rec = round(r/sum(total_actual_list), 4)

final_F1 = round(2*final_rec*final_prec/(final_prec+final_rec),4)
print(f'\nOverall precision is {final_prec}; overall recall is {final_rec}; overall F1 is {final_F1}')

Our confusion matrix looks like 
 
  [[ 977   30  246  293 1749]
 [ 331   17  159  192  481]
 [ 262   16  229  281  626]
 [ 282    6  209  359 1014]
 [ 638   14  270  505 5225]] 

For the 3295 1 star reviews, precision is 0.3924, recall is 0.2965, and our F1 score would be 0.3378 

For the 1180 2 star reviews, precision is 0.2048, recall is 0.0144, and our F1 score would be 0.0269 

For the 1414 3 star reviews, precision is 0.2058, recall is 0.162, and our F1 score would be 0.1813 

For the 1870 4 star reviews, precision is 0.2202, recall is 0.192, and our F1 score would be 0.2051 

For the 6652 5 star reviews, precision is 0.5745, recall is 0.7855, and our F1 score would be 0.6636 


Overall precision is 0.4204; overall recall is 0.4724; overall F1 is 0.4449


- Looking at our confusion matrix, the most common error is our model classifying reviews as 5 star when they are in fact 1 star reviews, or 5 star when they are 4 star reviews. 
- We have really bad recall for 2 star reviews
- 5 star reviews have the highet F1 score as well as recall of 78.55%!


In [32]:
# what about AUROC?
#problem since not 2-D, ways around it but complicated
#print(f'Our AUROC is {round(roc_auc_score(y_test, y_test_pred),4)}\n')

In [14]:
# lets do the same for TfIdf
X_train, X_test, y_train, y_test = train_test_split(stemmed_df_reviews, stemmed_df_ratings,
                                                    test_size = 0.2, random_state = 255)

vectorizer = TfidfVectorizer(ngram_range = (1,2),
                            stop_words = stopword_list,
                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                            max_features = 100,
                            min_df = 2,
                            binary = True)

train_X = vectorizer.fit_transform(X_train)
lr1 = LogisticRegression(multi_class='multinomial')
lr1.fit(train_X, y_train)
y_train_pred = lr1.predict(train_X)
y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
accuracy = round(np.mean(y_test_pred == y_test),4)
print(f'Our test set accuracy is {accuracy}')

test_frame = pd.DataFrame(X_test)
test_frame["actual_ratings"] = y_test
test_frame["predicted_ratings"] = y_test_pred
test_frame["difference"] = abs(test_frame["actual_ratings"]-test_frame["predicted_ratings"]) <= 1
off1_accuracy = round(sum(test_frame["difference"])/len(test_frame["difference"]),5)
print(f'The off by 1 accuracy for our testing set for our CountVectorizer champion model is \
{off1_accuracy}')

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Our test set accuracy is 0.4697
The off by 1 accuracy for our testing set for our CountVectorizer champion model is 0.63625


- This is actually pretty solid, about 2/3 of the reviews are classified within 1 of their actual score with our TfidfVectorizer

In [27]:
print(f'Our confusion matrix looks like \n \n  {confusion_matrix(y_test, y_test_pred)} \n')

x = confusion_matrix(y_test, y_test_pred)

stars = [1,2,3,4,5]

precision_list = []
recall_list = []
total_actual_list = []
total_predicted_list = []

for i in stars:
    precision = round(x[i-1,i-1]/sum(x[0:5,i-1]),4)
    precision_list.append(precision)
    recall = round(x[i-1,i-1]/sum(x[i-1,0:5]),4)
    recall_list.append(recall)
    total_actual_list.append(sum(x[i-1,0:5]))
    total_predicted_list.append(sum(x[0:5,i-1]))
    F1 = round(2*recall*precision/(precision+recall),4)
    print(f'For the {sum(x[i-1,0:5])} {i} star reviews, precision is {precision}, recall is {recall}, and our F1 score would be {F1} \n')

p = 0
for i in range(len(precision_list)):
    p += precision_list[i]*total_actual_list[i]
final_prec = round(p/sum(total_actual_list),4)

r = 0
for i in range(len(recall_list)):
    r += recall_list[i]*total_actual_list[i]
final_rec = round(r/sum(total_actual_list), 4)

final_F1 = round(2*final_rec*final_prec/(final_prec+final_rec),4)

print(f'\nOverall precision is {final_prec}; overall recall is {final_rec}; overall F1 is {final_F1}')

[0.39093484 0.01642036 0.13834847 0.18013572 0.65128406]
Our confusion matrix looks like 
 
  [[1311    8  214  244 1518]
 [ 420   10  109  180  461]
 [ 366    7  160  224  657]
 [ 371    7  142  292 1058]
 [ 944    6  274  432 4996]] 

For the 3295 1 star reviews, precision is 0.3842, recall is 0.3979, and our F1 score would be 0.3909 

For the 1180 2 star reviews, precision is 0.2632, recall is 0.0085, and our F1 score would be 0.0165 

For the 1414 3 star reviews, precision is 0.178, recall is 0.1132, and our F1 score would be 0.1384 

For the 1870 4 star reviews, precision is 0.2128, recall is 0.1561, and our F1 score would be 0.1801 

For the 6652 5 star reviews, precision is 0.5749, recall is 0.7511, and our F1 score would be 0.6513 


Overall precision is 0.4198; overall recall is 0.4697; overall F1 is 0.4434


## Count and TfIdf Vectorization final comments

While we didn't achieve our goal of beating our BERT model (61% accuracy, 85% "1-off" accuracy), we managed to build a couple of models that get close.

- Our Count Vectorizer model, on the test set, achieves 47.23% accuracy with "1-off" accuracy of 64.9%. 
- Our TfIdf Vectorizer model, on the test set, achieves 46.97% accuracy, with "1-off" accuracy of 63.63%. 
- Also, it turns out there is an f1_score library in sklearn.metrics

Not too shabby!

# Spacy

In [44]:
#!spacy download en_core_web_md

In [32]:
import spacy
nlp = spacy.load("en_core_web_sm")

spacy_review = data["review"]
spacy_rating = data["rating"]

X_train, X_test, y_train, y_test = train_test_split(spacy_review, spacy_rating,
                                                    test_size = 0.2, random_state = 199)


In [33]:
helper_X_train = list(X_train.apply(lambda x: nlp(x).vector))

In [34]:
spacy_lr = LogisticRegression(multi_class='multinomial')

spacy_lr.fit(helper_X_train, y_train)
train_prediction = spacy_lr.predict(helper_X_train)

train_acc = round(np.mean(train_prediction == y_train), 4)
print(f'Using Spacy, our model is {train_acc} accurate on our training set')

Using Spacy, our model is 0.526 accurate on our training set


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [35]:
helper_X_test = list(X_test.apply(lambda x: nlp(x).vector))
test_prediction = spacy_lr.predict(helper_X_test)
test_acc = round(np.mean(test_prediction == y_test), 4)
print(f'Using Spacy, our model is {test_acc} accurate on our testing set')

Using Spacy, our model is 0.5274 accurate on our testing set


In [36]:
confusion_matrix(y_test, test_prediction)

array([[1488,    3,   11,    6, 1728],
       [ 428,    1,    2,    6,  704],
       [ 345,    1,    5,    5, 1068],
       [ 280,    0,    3,   20, 1644],
       [ 548,    1,   12,   16, 6086]])

In [41]:
print()
f1_score(y_test, test_prediction, average = "weighted")




0.42369706071540564

### Takeaways

- very close training and testing accuracy, sitting at around 53%
- this is pretty good considering BERT achieves 61%.
- the confusion matrix looks a lot cleaner than the bag of words models, however we are predicting mostly 1/5 star reviews

In [84]:
print(f'Our confusion matrix looks like \n \n  {confusion_matrix(y_test, test_prediction)} \n')

x = confusion_matrix(y_test, test_prediction)
print('here1')
stars = [1,2,3,4,5]

precision_list = []
recall_list = []
total_actual_list = []
total_predicted_list = []

for i in stars:
    precision = round(x[i-1,i-1]/sum(x[0:5,i-1]),4)
    precision_list.append(precision)
    recall = round(x[i-1,i-1]/sum(x[i-1,0:5]),4)
    recall_list.append(recall)
    total_actual_list.append(sum(x[i-1,0:5]))
    total_predicted_list.append(sum(x[0:5,i-1]))
    F1 = round(2*recall*precision/(precision+recall),4)
    print(f'For the {sum(x[i-1,0:5])} {i} star reviews, precision is {precision}, recall is {recall}, and our F1 score would be {F1} \n')

p = 0
for i in range(len(precision_list)):
    p += precision_list[i]*total_actual_list[i]
final_prec = round(p/sum(total_actual_list),4)

r = 0
for i in range(len(recall_list)):
    r += recall_list[i]*total_actual_list[i]
final_rec = round(r/sum(total_actual_list), 4)

final_F1 = round(2*final_rec*final_prec/(final_prec+final_rec),4)
print(f'\nOverall precision is {final_prec}; overall recall is {final_rec}; overall F1 is {F1}')

Our confusion matrix looks like 
 
  [[1488    3   11    6 1728]
 [ 428    1    2    6  704]
 [ 345    1    5    5 1068]
 [ 280    0    3   20 1644]
 [ 548    1   12   16 6086]] 

here1
For the 3236 1 star reviews, precision is 0.4817, recall is 0.4598, and our F1 score would be 0.4705 

For the 1141 2 star reviews, precision is 0.1667, recall is 0.0009, and our F1 score would be 0.0018 

For the 1424 3 star reviews, precision is 0.1515, recall is 0.0035, and our F1 score would be 0.0068 

For the 1947 4 star reviews, precision is 0.3774, recall is 0.0103, and our F1 score would be 0.0201 

For the 6663 5 star reviews, precision is 0.5419, recall is 0.9134, and our F1 score would be 0.6802 


Overall precision is 0.5273; overall recall is 0.5274; overall F1 is 0.6802


In [None]:
# the overall precision above is wrong, but don't want to change it as it would take 30 minutes to run the model. 

- **This model performs slightly better on the testing set than our Count/TFidf vectorization models. However, if we look at the "1-off" metric:**

In [90]:
print("For some reason I need to print this for the cell to run, probably due to spacy taking hella long to run \n")

test_frame = pd.DataFrame(X_test)


test_frame["actual_ratings"] = y_test
test_frame["predicted_ratings"] = test_prediction
test_frame["difference"] = abs(test_frame["actual_ratings"]-test_frame["predicted_ratings"]) <= 1

print(test_frame.head(10))

print(f'The off by 1 accuracy for our testing set for our Spacy model is \
{round(sum(test_frame["difference"])/len(test_frame["difference"]),5)}')

For some reason I need to print this for the cell to run, probably due to spacy taking hella long to run 

                                                  review  actual_ratings  \
10754  It didn’t take me long to decide to upgrade to...               5   
11304  It could be better on mobile for sure. I liste...               4   
20726  I have made my prime member subscription for 1...               1   
27304  While exiting the car mode, the app keeps comi...               3   
13028  I enjoy the rest of the app entirely, really. ...               5   
9520       Some of the music stops working after a month               5   
19218  So this is probably gonna be the shortest revi...               4   
26910  Good if you want to pay a subscription, terrib...               2   
10514  The playlists are great, suggestions are gener...               2   
50843  First of all its not ad free you have to pay f...               1   

       predicted_ratings  difference  
10754            

### Takeaways:

While we didn't achieve our goal of beating our BERT model (61% accuracy, 85% "1-off" accuracy), we managed to build a couple of models that get close.
Our Count Vectorizer model, on the test set, achieves 47.23% accuracy with "1-off" accuracy of 64.9%. The final F1-score is 0.664.
Our TfIdf Vectorizer model, on the test set, achieves 46.97% accuracy, with "1-off" accuracy of 63.63%. The final F1-score is 0.651
Not too shabby!

- All in all, our Spacy model performs slightly better than our Count/Tfidf vectorizer models. It achieves an accuracy of 52.74% on the testing set, with a "1-off" accuracy of 67.32%.

|  Model         | Testing Accuracy  | 1-off Accuracy |
| :---           |      :----:       |            ---:|
| CountVectorize |   47.23%          |     64.9%      |
| TfIdfVectorize |    46.97%         |     63.63%     |
| Spacy (sm)     |     52.74%        |      67.32%    |
| BERT           |     61%           |         85%    |