### Review Classification Model

- The MultinomialNB classifier is suitable for text data where the features are counts of words or other discrete elements.
- Bernoulli naive Bayes, by contrast, is used when the features only record the presence or absence of each word, as with one-hot (binary) encoding.
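The difference between the two classifiers can be seen on a toy matrix. This is a minimal sketch, not part of the original notebook: the count matrix and labels below are made up for illustration. It relies on the `binarize=0.0` default of `BernoulliNB`, which maps every count above zero to 1.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are vocabulary terms
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([1, 1, 0, 0])

# MultinomialNB models how often each term occurs
multinomial_model = MultinomialNB().fit(X_counts, y)

# BernoulliNB models only whether each term occurs; with the default
# binarize=0.0 it thresholds the counts to 0/1 before fitting
bernoulli_on_counts = BernoulliNB().fit(X_counts, y)

# Fitting on an explicitly binarized matrix gives the same model
X_binary = (X_counts > 0).astype(int)
bernoulli_on_binary = BernoulliNB().fit(X_binary, y)

same_probabilities = np.allclose(
    bernoulli_on_counts.predict_proba(X_counts),
    bernoulli_on_binary.predict_proba(X_binary),
)
print(multinomial_model.predict(X_counts), same_probabilities)
```

Because `BernoulliNB` discards count magnitudes, it pairs naturally with `CountVectorizer(binary=True)`, while `MultinomialNB` can use the raw counts.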

In [1]:
#!python -m spacy download en_core_web_sm 

In [2]:
import nltk
import pandas as pd
import numpy as np
import spacy
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score, classification_report

# Load the small English spaCy model, which includes a lemmatizer
nlp = spacy.load("en_core_web_sm")

In [3]:
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

In [4]:
# Download NLTK resources
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/asthapuri/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/asthapuri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
# read the dataset
train = pd.read_csv('train.csv')
validation = pd.read_csv('validation.csv')
test = pd.read_csv('test.csv')

In [6]:
train.shape, validation.shape, test.shape

((17877, 5), (3831, 5), (3831, 5))

In [7]:
# view the train data
train.head()

Unnamed: 0,review_id,title,year,user_review,user_suggestion
0,460,Black Squad,2018.0,"Early Access ReviewVery great shooter, that ha...",1
1,2166,Tree of Savior (English Ver.),2016.0,I love love love playing this game!Super 100%!...,1
2,17242,Eternal Card Game,2016.0,Early Access ReviewAs a fan of MTG and Hearths...,1
3,6959,Tactical Monsters Rumble Arena,2018.0,Turn based strategy game similiar to FF Tactic...,1
4,8807,Yu-Gi-Oh! Duel Links,2017.0,This game has an insanely huge download for be...,0


In [8]:
# view the test data
test.head()

Unnamed: 0,review_id,title,year,user_review,user_suggestion
0,12053,Infestation: The New Z,2016.0,Unbelievable that this rehash copy and paste t...,0
1,12536,SMITE®,2015.0,I can't recommened this game in its current st...,0
2,747,Heroes & Generals,2016.0,Early Access ReviewThis game is constantly evo...,0
3,3214,World of Warships,2018.0,I play this game because it scratches an itch....,0
4,4036,World of Guns: Gun Disassembly,2016.0,"Finally, a game for people like us to enjoy! P...",1


In [9]:
# view the validation data
validation.head()

Unnamed: 0,review_id,title,year,user_review,user_suggestion
0,8604,Dungeon Defenders II,2015.0,Early Access Review* Ok Played the first DD lo...,1
1,20407,Minion Masters,2017.0,Product received for freeEarly Access ReviewSo...,1
2,636,Magic Duels,2018.0,Game is extremely unfun to play unless you wan...,0
3,10217,Robocraft,2016.0,Early Access ReviewThis used to be an amazing ...,0
4,9564,Realm of the Mad God,2014.0,"With stunning visuals, an immersive storyline,...",1


In [10]:
test['user_suggestion'].value_counts()

user_suggestion
1    2187
0    1644
Name: count, dtype: int64
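The class counts above give a useful reference point: a classifier that always predicts the majority class would already be right on about 57% of the test reviews. A small sketch of that arithmetic, with the counts copied from the output above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above
class_counts = {1: 2187, 0: 1644}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of always predicting the majority class
baseline_accuracy = class_counts[majority_class] / total

print(f"majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

Note that this baseline is for the test split only; the accuracies reported later are measured on the train and validation splits, whose class balance is not shown here.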

### Preprocessing

In [11]:
# Load the spaCy English model
# NER is not needed for lemmatization, so disable it to speed up processing
nlp = spacy.load("en_core_web_sm", disable=["ner"])

In [12]:
def preprocess_text(texts):
    # lemmatize the tokens and store them in a list
    processed_texts = []
    for doc in nlp.pipe(texts, n_process=-1):
        lemmatized_tokens = [token.lemma_.lower() for token in doc if token.is_alpha and token.lemma_ not in nlp.Defaults.stop_words]
        
        # Join the lemmatized tokens into a string
        processed_text = " ".join(lemmatized_tokens)
        
        processed_texts.append(processed_text)
        
    return processed_texts

### Explanation
This function preprocesses a list of texts using spaCy. Here's a breakdown:

1. Initialization: It creates an empty list processed_texts to store the processed versions of the input texts.
2. Looping: It iterates through the input texts using spaCy's nlp.pipe method with n_process=-1 for multi-core processing (if available).
3. Lemmatization and Stopword Removal: For each text (as a spaCy doc object):
    - It extracts the tokens (words).
    - It lemmatizes each token, converting words to their base form ("running" becomes "run").
    - It converts all tokens to lowercase.
    - The is_alpha attribute keeps only tokens made up entirely of alphabetic characters, so numbers, punctuation and mixed tokens such as "60fps" are dropped.
    - It removes stop words like "the" and "a" by checking each lemma against nlp.Defaults.stop_words.
4. Text Joining: It joins the remaining lemmatized tokens back into a single string.
5. Storage: It appends the processed text to the processed_texts list.
6. Return: Finally, the function returns the list of preprocessed texts.
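The filtering logic can be illustrated without loading a spaCy model. The sketch below is a simplified stand-in, not the notebook's function: it splits on whitespace, uses a tiny hypothetical stop-word set, and skips lemmatization entirely. It relies on spaCy documenting `token.is_alpha` as equivalent to `token.text.isalpha()`.

```python
# Hypothetical stand-ins: a tiny stop-word set and whitespace tokenization.
# spaCy's tokenizer, lemmatizer and stop-word list are far more complete.
STOP_WORDS = {"the", "a", "is", "this", "and", "at", "on"}


def filter_tokens(text):
    """Keep alphabetic, non-stop-word tokens and lowercase them."""
    kept = []
    for token in text.split():
        # str.isalpha() is False for numbers, punctuation and mixed tokens
        if token.isalpha() and token.lower() not in STOP_WORDS:
            kept.append(token.lower())
    return " ".join(kept)


print(filter_tokens("This game is a 10/10 and runs at 60fps on the Deck"))
# prints: game runs deck
```

The tokens "10/10" and "60fps" are dropped by the alphabetic check, and the remaining short function words are dropped by the stop-word check.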

In [13]:
# apply preprocess_text function to user_review column
train['user_review'] = preprocess_text(train['user_review'])
validation['user_review'] = preprocess_text(validation['user_review'])
test['user_review'] = preprocess_text(test['user_review'])

### Vectorization

#### One hot encoding

In [14]:
# min_df=0.001 drops any word that appears in fewer than 0.1% of the reviews,
# so it never enters the vocabulary built by the count vectorizer;
# binary=True records only presence (1) or absence (0) of each word
count_vectorizer_ohe = CountVectorizer(min_df=0.001, binary=True)
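The effect of `min_df` and `binary=True` is easier to see on a few documents. This is a small sketch with made-up reviews and an exaggerated threshold (`min_df=0.5` rather than `0.001`) so that pruning is visible on four documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus
docs = ["good good game", "good fun", "bad game", "boring"]

# min_df=0.5 on 4 documents keeps terms found in at least 2 of them
binary_vectorizer = CountVectorizer(min_df=0.5, binary=True)
binary_matrix = binary_vectorizer.fit_transform(docs)

# Same vocabulary, but raw counts instead of 0/1 indicators
count_vectorizer_demo = CountVectorizer(min_df=0.5)
count_matrix = count_vectorizer_demo.fit_transform(docs)

vocabulary = sorted(binary_vectorizer.vocabulary_)
print(vocabulary)                   # terms that survived the min_df filter
print(binary_matrix.toarray()[0])   # presence/absence for the first document
print(count_matrix.toarray()[0])    # raw counts for the first document
```

Only "game" and "good" appear in at least half of the documents, so the other terms are pruned. For the first document, the binary matrix records "good" as 1 while the count matrix records it as 2.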

In [15]:
#fit_transform user_review
count_vectorizer_ohe_train = count_vectorizer_ohe.fit_transform(train['user_review'])

#### Naive Bayes Model

In [16]:
# Naive Bayes Classifier
naive_bayes_classifier = BernoulliNB()

In [17]:
# fit the naive Bayes model on the train data and report the training accuracy
naive_bayes_classifier.fit(count_vectorizer_ohe_train, train['user_suggestion'])
naive_bayes_classifier.score(count_vectorizer_ohe_train, train['user_suggestion'])

0.825585948425351

In [18]:
# transform the validation data with the fitted vectorizer and report the validation accuracy
count_vectorizer_ohe_val = count_vectorizer_ohe.transform(validation['user_review'])
naive_bayes_classifier.score(count_vectorizer_ohe_val, validation['user_suggestion'])

0.8107543722265727

#### Count Vectorizer

In [19]:
# initialize a count vectorizer that keeps raw word counts
count_vectorizer = CountVectorizer(min_df=0.001)

In [20]:
#fit_transform user_review
count_vectorizer_train = count_vectorizer.fit_transform(train['user_review'])

#### Naive Bayes Model

In [21]:
# Naive Bayes Classifier
naive_bayes_classifier = MultinomialNB()

In [22]:
# fit the naive Bayes model on the train data and report the training accuracy
naive_bayes_classifier.fit(count_vectorizer_train, train['user_suggestion'])
naive_bayes_classifier.score(count_vectorizer_train, train['user_suggestion'])

0.8392347709347205

In [23]:
# transform the validation data with the fitted vectorizer and report the validation accuracy
count_vectorizer_val = count_vectorizer.transform(validation['user_review'])
naive_bayes_classifier.score(count_vectorizer_val, validation['user_suggestion'])

0.8274601931610546

#### TF-IDF

In [24]:
# import TfidfVectorizer (already imported above; repeated here so the cell runs on its own)
from sklearn.feature_extraction.text import TfidfVectorizer

In [25]:
# initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(min_df=0.001)
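For reference, the weighting that `TfidfVectorizer` applies with its default settings (`smooth_idf=True`, `norm='l2'`) can be sketched in plain Python: each term count is multiplied by `ln((1 + n) / (1 + df)) + 1` and each document vector is then scaled to unit length. The mini-corpus and helper names below are illustrative, and the sketch ignores tokenization details:

```python
import math

# Hypothetical pre-tokenized mini-corpus
docs = [["good", "game"], ["good", "fun"], ["bad", "game"], ["good"]]
n_docs = len(docs)


def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    weights = {term: doc.count(term) * smoothed_idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row = tfidf_row(docs[1])
print(row)
```

In the second document, "fun" receives a higher weight than "good" because it occurs in only one document, whereas "good" occurs in three.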

In [26]:
# fit the naive Bayes model on the TF-IDF features of the train data and report the training accuracy
tfidf_vectorizer_train = tfidf_vectorizer.fit_transform(train['user_review'])
naive_bayes_classifier.fit(tfidf_vectorizer_train, train['user_suggestion'])
naive_bayes_classifier.score(tfidf_vectorizer_train, train['user_suggestion'])

0.8413604072271634

In [27]:
# score the fitted model on the TF-IDF features of the validation data
tfidf_vectorizer_val = tfidf_vectorizer.transform(validation['user_review'])
naive_bayes_classifier.score(tfidf_vectorizer_val, validation['user_suggestion'])

0.822239624119029

#### Using n-gram with TF-IDF

In [28]:
tfidf_ngram_vectorizer = TfidfVectorizer(min_df=0.001, ngram_range=(1, 3))
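With `ngram_range=(1, 3)` the vocabulary contains single words, word pairs and word triples. The helper below is a plain-Python sketch of which features that setting asks for on one preprocessed review. It is not scikit-learn's implementation, and the sample text is made up:

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of length min_n..max_n, shortest first."""
    tokens = text.split()
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


print(word_ngrams("game crash constantly"))
# prints: ['game', 'crash', 'constantly', 'game crash', 'crash constantly', 'game crash constantly']
```

The longer n-grams capture short phrases, at the cost of a larger vocabulary. The `min_df` filter still applies, so rare phrases are pruned.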

#### Naive Bayes

In [29]:
# fit the naive Bayes model on the TF-IDF n-gram features of the train data and report the training accuracy
tfidf_ngram_vectorizer_train = tfidf_ngram_vectorizer.fit_transform(train['user_review'])
naive_bayes_classifier.fit(tfidf_ngram_vectorizer_train, train['user_suggestion'])
naive_bayes_classifier.score(tfidf_ngram_vectorizer_train, train['user_suggestion'])

0.8582536219723668

In [30]:
# score the fitted model on the TF-IDF n-gram features of the validation data
tfidf_ngram_vectorizer_val = tfidf_ngram_vectorizer.transform(validation['user_review'])
naive_bayes_classifier.score(tfidf_ngram_vectorizer_val, validation['user_suggestion'])

0.8277212216131559

#### Using n-gram with Count Vectorizer

In [31]:
count_ngram_vectorizer = CountVectorizer(min_df=0.001, ngram_range=(1, 3))

In [32]:
# fit the naive Bayes model on the count n-gram features of the train data and report the training accuracy
count_ngram_vectorizer_train = count_ngram_vectorizer.fit_transform(train['user_review'])
naive_bayes_classifier.fit(count_ngram_vectorizer_train, train['user_suggestion'])
naive_bayes_classifier.score(count_ngram_vectorizer_train, train['user_suggestion'])

0.8501426413827824

In [33]:
# score the fitted model on the count n-gram features of the validation data
count_ngram_vectorizer_val = count_ngram_vectorizer.transform(validation['user_review'])
naive_bayes_classifier.score(count_ngram_vectorizer_val, validation['user_suggestion'])

0.8277212216131559
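The five experiments above all follow the same fit-on-train, score-on-validation pattern, so they can also be expressed as a loop over `Pipeline` objects. This is a sketch on hypothetical toy data, not the notebook's code. The configuration names are illustrative, and `min_df` is left out because the toy corpus is too small for it to matter:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the preprocessed reviews
train_texts = ["great fun game", "love game", "boring bad game",
               "bad crash", "fun love", "boring crash"]
train_labels = [1, 1, 0, 0, 1, 0]
val_texts = ["great love", "bad boring"]
val_labels = [1, 0]

# One (vectorizer, classifier) pair per experiment
configurations = {
    "binary + BernoulliNB": (CountVectorizer(binary=True), BernoulliNB()),
    "counts + MultinomialNB": (CountVectorizer(), MultinomialNB()),
    "tfidf + MultinomialNB": (TfidfVectorizer(), MultinomialNB()),
    "tfidf 1-3 grams + MultinomialNB": (TfidfVectorizer(ngram_range=(1, 3)), MultinomialNB()),
    "counts 1-3 grams + MultinomialNB": (CountVectorizer(ngram_range=(1, 3)), MultinomialNB()),
}

scores = {}
for name, (vectorizer, classifier) in configurations.items():
    # The pipeline fits the vectorizer on the train texts only,
    # then reuses the fitted vocabulary when scoring validation texts
    model = Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])
    model.fit(train_texts, train_labels)
    scores[name] = (model.score(train_texts, train_labels),
                    model.score(val_texts, val_labels))

for name, (train_acc, val_acc) in scores.items():
    print(f"{name}: train={train_acc:.3f}, validation={val_acc:.3f}")
```

Using a fresh classifier per configuration also avoids reusing one `naive_bayes_classifier` object across experiments, which makes it easier to keep track of which features a fitted model expects.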

### POS tagging and NER

In [34]:
def preprocess_text_spacy(processed_texts):
    # Collect POS tags and entity types in a single pass over the texts
    pos_texts = []
    ner_texts = []
    for doc in nlp.pipe(processed_texts):
        # Part-of-speech tag for every token
        pos_texts.append(" ".join(token.pos_ for token in doc))

        # Entity type for every token, "O" for tokens outside any entity.
        # Note: the global nlp was loaded with NER disabled, so every tag is "O"
        # unless the pipeline is reloaded with the ner component enabled.
        ner_texts.append(" ".join(token.ent_type_ or "O" for token in doc))

    return [pos_texts, ner_texts]

In [44]:
#applying preprocess_text_spacy function to user_review column for train data
pos_texts, ner_texts = preprocess_text_spacy(train['user_review'])

In [45]:
# adding the lists as column to the dataset
train['pos_tags'] = pos_texts
train['ner_tags'] = ner_texts

In [46]:
del train['pos_tags']
del train['ner_tags']

In [47]:
def remove_noun(df):
    # Drop common and proper nouns from each review, keeping all other tokens.
    # Reuses the global nlp pipeline and streams the reviews through nlp.pipe
    filtered_reviews = []
    for doc in nlp.pipe(df['user_review']):
        filtered_review = " ".join(token.text for token in doc if token.pos_ not in ('NOUN', 'PROPN'))
        filtered_reviews.append(filtered_review)

    return filtered_reviews

In [48]:
w_noun_train = remove_noun(train)

#### TF-IDF with nouns removed

In [49]:
tfidf_wnoun_vectorizer = TfidfVectorizer(min_df=0.001)
tfidf_wnoun_vectorizer_train = tfidf_wnoun_vectorizer.fit_transform(w_noun_train)

#### Naive Bayes

In [50]:
# fit the naive Bayes model on the noun-free TF-IDF features and report the training accuracy
naive_bayes_classifier.fit(tfidf_wnoun_vectorizer_train, train['user_suggestion'])
naive_bayes_classifier.score(tfidf_wnoun_vectorizer_train, train['user_suggestion'])

0.7995748727415114