In [1]:
import pandas as pd
import sys
from nltk.corpus import stopwords
import re
import emot
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import nltk
import numpy as np



In [2]:
data = pd.read_csv("master_data.csv").drop("Unnamed: 0", axis = 1)
data.head()

Unnamed: 0,review,rating,music_app
0,This is by far the best music app I have ever ...,5,Amazon
1,I really like this app but I have tried an tri...,4,Amazon
2,"This app is great, i've been using it for a co...",4,Amazon
3,Not a bad music app. Selection is good could b...,3,Amazon
4,"This is one of the most used app on my phone, ...",2,Amazon


### Approaches

We want to be able to see which classification works the best.
- We will first do the traditional approach with bag of words models, notably CountVectorize and TF-IDF vectorization
- Next, we will do a spacy approach

- The goal will be to beat the Bert model, which has a testing set accuracy of _

In [3]:
x = data.loc[1]["review"]

# code taken from https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b to remove emojis
# this doesn't have the most up to date emojis so we had to find an updated version for some unicode

def remove_emojis(string):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               u"\U00002500-\U00002BEF"  # chinese char
                               u"\U00002702-\U000027B0"
                               u"\U00002702-\U000027B0"
                               u"\U000024C2-\U0001F251"
                               u"\U0001f926-\U0001f937"
                               u"\U00010000-\U0010ffff"
                               u"\u2640-\u2642"
                               u"\u2600-\u2B55"
                               u"\u200d"
                               u"\u23cf"
                               u"\u23e9"
                               u"\u231a"
                               u"\ufe0f"  
                               u"\u3030"
                               
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

print(x)
print(remove_emojis(x))

# henry package to emoji -> word

I really like this app but I have tried an tried to find out why every time I get a phone call or I go out of my app when I push pause on my music an go back within minutes it goes an refreshes the whole app. I’m literally in the middle of a song so what is the purpose of refreshing the app 🧐I tried to just start putting the music on an old phone I have that’s just for WiFi but even that if I’m closer to it an pause it, you can’t leave the app longer then a minute it seems without it just thinking your not coming back or something. An the other issue is it constantly takes random music that i do listen to often out of my music list. I’ll add a song to my music an sometimes a playlist too, an I’ll be listening to the song randomly later on a station that I choose an I’ll go an decide to add it to a playlist an see that it’s not even in my music. So I’ll add it again. I’ve only notice it since like I said I’ll decide to add it to a playlist an see it has the plus next to add music. But w

In [4]:
# stopwords 
# we know the reviews are about music, so we remove the stopword music
# we also remove punctuation
# overall, the revies are clean, although there are emojis present. 
# for now we will keep emojis as they could be an indication in our classification

stopword_list = set(stopwords.words('english') + ["music", ".", "!", "?", ",",":"])
stopword_list



{'!',
 ',',
 '.',
 ':',
 '?',
 'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 'her',
 'here',
 'hers',
 'herself',
 'him',
 'himself',
 'his',
 'how',
 'i',
 'if',
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it's",
 'its',
 'itself',
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'music',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'only',
 'or',
 'other',
 'our',
 'ours',
 '

In [7]:
#stemming, create an empty list, loop through and stem
stemmer = nltk.stem.porter.PorterStemmer()

reviews = data["review"].str.lower()

stemmed_list = []
for i in reviews:
    tokens = nltk.word_tokenize(i)
    x = ''
    for j in tokens:
        x = x + ' ' + stemmer.stem(word = j)
    stemmed_list.append(x)


In [12]:
stemmed_list[5:7]

[' i hate mani thing about thi app , but the featur most bothersom to me of late is the keep listen featur . the sleep timer is turn off , but everi hour i get ask am i still listen to music . did i stop the music ? ! i have my devic plug in across the room becaus it ’ s not compat with legrand digit audio ( fix thi too ! ) . so , everi hour when the music stop i have to go press keep listen . i ’ m not oppos to exercis , but i ’ m tri to work , and thi incess disrupt make me want to pay for a differ stream music servic . thi should onli happen if i engag the sleep timer . thi is a useless featur , especi if i am actual stream on wifi and have the stream onli on wifi featur engag or am in offlin mode listen to download music . get rid of thi . anoth huge annoy is the amount of data thi app use even in data saver mode . my daili commut ( 160 minut total stream time ) use 1 gb of data ! mayb it would use less data if the lyric featur could be disabl . final your app support wa useless fo

In [26]:
stemmed_df = pd.DataFrame(stemmed_list, columns = ["reviews"])
stemmed_df["ratings"] = data["rating"]
stemmed_df

Unnamed: 0,reviews,ratings
0,thi is by far the best music app i have ever ...,5
1,i realli like thi app but i have tri an tri t...,4
2,"thi app is great , i 've been use it for a co...",4
3,not a bad music app . select is good could be...,3
4,"thi is one of the most use app on my phone , ...",2
...,...,...
72046,𝐠𝐨𝐚𝐭,5
72047,i hate onli ad xd,5
72048,i ca n't search and play some song other than...,1
72049,just work well,5


In [65]:
stemmed_df_reviews = stemmed_df["reviews"]
stemmed_df_ratings = stemmed_df["ratings"]
X_train, X_test, y_train, y_test = train_test_split(stemmed_df_reviews, stemmed_df_ratings,
                                                    test_size = 0.2, random_state = 17)
# we now have training and testing data
X_train


30871                  love to hear regga and r & b here .
27029                                       solid interfac
17779     i ’ ve been use premium for year and i love i...
42078     amaz real thi is realli amaz music i have but...
67074     man thi app is goat . i like it a lot befor ,...
                               ...                        
37332     veri good and clear app . recent updat of dol...
25631     doe it requir amazon prime to play song ? i '...
42297     ca n't sign in to facebook with pearl pearl o...
34959                  as soon as i search a song boom ! !
64753                                                  👍👍👍
Name: reviews, Length: 57640, dtype: object

In [69]:
# for our vectorization, we start with CountVectorizer

vectorizer = CountVectorizer(ngram_range = (1,2),
                            stop_words = stopword_list,
                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # removes emojis btw
                            max_features = 2000,
                            min_df = 2,
                            binary = True)

train_X = vectorizer.fit_transform(X_train)
lr1 = LogisticRegression(multi_class='multinomial')
lr1.fit(train_X, y_train)
y_train_pred = lr1.predict(train_X)

print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == y_train),4)} accurate.')

y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
print(f'Without dimensionality reduction, on the testing set, our model is {round(np.mean(y_test_pred == y_test),4)} accurate.')

# this is really not good, we are overfitting and our accuracy is really low. 
train_X

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6637 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4183 accurate.


<57640x2000 sparse matrix of type '<class 'numpy.int64'>'
	with 891652 stored elements in Compressed Sparse Row format>

In [71]:
train_X.toarray()

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [58]:
# what if we use tfidf?

vectorizer = TfidfVectorizer(ngram_range = (1,2),
                            stop_words = stopword_list,
                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                            max_features = 2000,
                            min_df = 2,
                            binary = True)

train_X = vectorizer.fit_transform(X_train)
lr1 = LogisticRegression(multi_class='multinomial')
lr1.fit(train_X, y_train)
y_train_pred = lr1.predict(train_X)

print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == y_train),4)} accurate.')

y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
print(f'Without dimensionality reduction, on the testing set, our model is {round(np.mean(y_test_pred == y_test),4)} accurate.')

# this is really not good, we are overfitting and our accuracy is really low. 
# this model does better and improves our testing accuracy by over 5%

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6579 accurate.
Without dimensionality reduction, on the testing set, our model is 0.4761 accurate.


In [62]:
# lets loop and find our best model
ngrams = [(1,1), (2,2), (3,3)]
max_features = [500,1000,1500,2000,2500]
min_df = [2, 3, 5]

for i in ngrams:
    for j in max_features:
        for k in min_df:=
                print(f'ngrams = {i}; max_features = {j}; min_df = {k}')
                vectorizer = TfidfVectorizer(ngram_range = i,
                                            stop_words = stopword_list,
                                            token_pattern = '(?u)\\b[a-zA-Z][a-zA-Z]+\\b', # this will remove emojis btw
                                            max_features = j,
                                            min_df = k,
                                            binary = True)

                train_X = vectorizer.fit_transform(X_train)
                lr1 = LogisticRegression(multi_class='multinomial')
                lr1.fit(train_X, y_train)
                y_train_pred = lr1.predict(train_X)

                print(f'Without dimensionality reduction, on the training set, our model is {round(np.mean(y_train_pred == y_train),4)} accurate.')

                y_test_pred = lr1.predict(vectorizer.fit_transform(X_test))
                print(f'Without dimensionality reduction, on the testing set, our model is {round(np.mean(y_test_pred == y_test),4)} accurate.')

    

ngrams = (1, 1); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 500; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 1000; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 1500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 1500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 1500; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 2000; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 2500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 2500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (1, 1); max_features = 2500; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6518 accurate.
Without dimensionality reduction, on the testing set, our model is 0.3921 accurate.
ngrams = (2, 2); max_features = 500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 500; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 1000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 1000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 1000; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 1500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 1500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 1500; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 2000; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 2000; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 2000; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 2500; min_df = 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 2500; min_df = 3


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
ngrams = (2, 2); max_features = 2500; min_df = 5


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Without dimensionality reduction, on the training set, our model is 0.6089 accurate.
Without dimensionality reduction, on the testing set, our model is 0.434 accurate.
