# Data mining Term Project
## Hengchao Wang

### References
BoardGameGeek Reviews Baseline Model
https://www.kaggle.com/ellpeeaxe/boardgamegeek-reviews-baseline-model

Word2vec In Supervised NLP Tasks. Shortcut
https://www.kaggle.com/vladislavkisin/word2vec-in-supervised-nlp-tasks-shortcut/comments

Because the dataset is very large, one-hot encoding of words and sentences is impractical. It leads to the curse of dimensionality: the resulting matrix is too large and too sparse to compute with. I therefore use the Word2Vec word embedding model to reduce the dimensionality of the input. The two references I relied on are linked above.

The underlying task is a regression problem. The input is a 300-dimensional vector built from word embeddings, and the output is the predicted rating for each review.

In [None]:
import numpy as np 
import pandas as pd 
import nltk
import re,string,unicodedata
import seaborn as sns
import gensim
import sklearn

from pandas import Series
from wordcloud import WordCloud,STOPWORDS
from bs4 import BeautifulSoup
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from gensim.models import word2vec, Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, BayesianRidge
import joblib
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn import ensemble

# Get data from csv

In [None]:
# get review and rating columns
review_path = 'bgg-13m-reviews.csv'

data = pd.read_csv(review_path, usecols=[2,3])
data.head()

In [None]:
# remove missing or empty comments
def remove_nan(data):
    # drop rows whose comment is NaN or contains only whitespace
    data = data[data['comment'].notna()]
    data = data[data['comment'].str.strip() != '']
    data = data.reset_index(drop=True)
    return data
data = remove_nan(data)
data.head()

**Data description. The number of reviews is 2.637756e+06.**

In [None]:
data.describe()

# Data preprocessing

For data preprocessing I use **ToktokTokenizer** from the **NLTK** library to tokenize the words. The English stopword list also comes from **NLTK**, and the HTML-stripping helper is built on the **beautifulsoup4** library. Regular expressions remove bracketed text and some special characters.

In [None]:
#Tokenization of text
tokenizer=ToktokTokenizer()
#Setting English stopwords
stopword_list=nltk.corpus.stopwords.words('english')

In [None]:
#Removing the html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing the square brackets (raw string avoids invalid-escape warnings)
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
#Apply function on review column (only the bracket removal is applied here)
data['comment']=data['comment'].apply(remove_between_square_brackets)

In [None]:
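#Define function for removing special characters
def remove_special_characters(text, remove_digits=True):
    # Keep letters, digits, whitespace and '.'; the '.' is kept so that
    # reviews can still be split into sentences later (see Challenge 2).
    # The range must be A-Z: A-z would also keep [ \ ] ^ _ and `.
    # Note: remove_digits is currently unused.
    pattern=r'[^a-zA-Z0-9\s.]'
    text=re.sub(pattern,'',text)
    return text
#Apply function on review column
data['comment']=data['comment'].apply(remove_special_characters)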

In [None]:
#Stemming the text
ps=nltk.stem.PorterStemmer()  # create the stemmer once and reuse it
def simple_stemmer(text):
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text
#Apply function on review column
data['comment']=data['comment'].apply(simple_stemmer)

In [None]:
#set stopwords to english
stop=set(stopwords.words('english'))
print(stop)

#removing the stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    # Look tokens up in the set `stop` (constant-time membership test)
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stop]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stop]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
#Apply function on review column
data['comment']=data['comment'].apply(remove_stopwords)

**After removing the stopwords we need to remove empty reviews again, because some short reviews become empty once their stopwords are removed.**
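
The effect can be reproduced on a toy example. This is only a sketch: `TOY_STOPWORDS` and `drop_stopwords` are hypothetical names for this illustration, and plain `str.split` stands in for the NLTK tokenizer used in the notebook.

```python
# Hypothetical, tiny stopword list used only for this illustration;
# the notebook itself uses NLTK's English stopword list.
TOY_STOPWORDS = {"it", "is", "what", "this", "a", "the"}

def drop_stopwords(text):
    """Remove stopwords from a whitespace-tokenized review."""
    kept = [tok for tok in text.split() if tok.lower() not in TOY_STOPWORDS]
    return " ".join(kept)

reviews = ["It is what it is", "This is a great game"]
cleaned = [drop_stopwords(r) for r in reviews]
print(cleaned)

# Keep only reviews that still contain at least one token
non_empty = [r for r in cleaned if r.strip()]
print(non_empty)
```

The first review consists only of stopwords, so it turns into an empty string and has to be filtered out before training.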

In [None]:
data = remove_nan(data)
data.to_csv('data_after_remove_st.csv', header=False, index=False, encoding = 'utf-8')

In [None]:
columns = ['rate', 'comment']
data = pd.read_csv('data_after_remove_st.csv',names = columns)

In [None]:
data['comment'] = data.comment.str.lower()
data['document_sentences'] = data.comment.str.split('.') 
# data['tokenized_sentences'] = data['document_sentences']
data['tokenized_sentences'] = list(map(lambda sentences:list(map(nltk.word_tokenize, sentences)),data.document_sentences))  
data['tokenized_sentences'] = list(map(lambda sentences: list(filter(lambda lst: lst, sentences)), data.tokenized_sentences))

In [None]:
data.head()

### Challenge 1
**Hint: a column of token lists does not survive a round trip through CSV. Once `tokenized_sentences` is saved to CSV, each cell is stored as a string, and it is still a string when loaded back.** This was one of the challenges I ran into. In the first few rounds of training the Word2Vec model, the final accuracy was very low. I checked the vectors of individual words and found that the values output by Word2Vec were below 0.0001, as if these words almost never appeared in the dataset, which did not make sense. I then inspected the model: `model.wv.vocab.keys()` was small too, and the vocabulary consisted of letters, not words. So the problem had to lie in the splitting or in the data format. Checking the type of each variable showed that the type of `tokenized_sentences` had changed. After searching for the issue, I found the cause: a list of strings cannot be stored as a list in a CSV file.
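
A minimal sketch of the problem, using only the standard library `csv` module rather than pandas (pandas' `to_csv` likewise writes list cells as their string representation). The sample tokens are made up, and `ast.literal_eval` is shown as one possible way to restore the lists, not as the fix used in this project.

```python
import ast
import csv
import io

tokens = [["great", "game"], ["fast", "to", "learn"]]

# Write one row whose second cell is a nested list; the csv module
# stores it as its string representation.
buffer = io.StringIO()
csv.writer(buffer).writerow([8.0, tokens])

buffer.seek(0)
rate, loaded = next(csv.reader(buffer))

print(type(loaded))       # the cell comes back as a plain string
print(list(loaded)[:5])   # iterating over it yields single characters

# One way to recover the nested list: parse the string as a Python literal
restored = ast.literal_eval(loaded)
print(restored == tokens)
```

Iterating over the loaded cell yields characters, which is consistent with the Word2Vec vocabulary ending up as letters instead of words.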

The next two cells show the faulty code, commented out.

In [None]:
# data.to_csv("data_after_pre.csv",sep=',',index=False, encoding = 'utf-8')

In [None]:
# data = pd.read_csv('data_after_pre.csv')

The next cell is not run when I train Word2Vec, because the Word2Vec model is trained on the whole dataset.
It is run when I train the regression models. With the computing resources I have, I can only use about 50k reviews to train the regression models, so I use 10k and 50k reviews and compare the results.

In [None]:
# Shuffle the rows and keep the first 100,000 (change the slice to use another sample size)
data = data.reindex(np.random.permutation(data.index))[:100000]

In [None]:
# split the data into training data and test data.
train, test, y_train, y_test = train_test_split(data, data['rate'], test_size=.2)

In [None]:
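# Check the type of one entry; use a position instead of a hard-coded
# index label, which may not exist in a random sample
type(train.tokenized_sentences.iloc[0])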

In [None]:
#Collecting a vocabulary
voc = []
for sentence in train.tokenized_sentences:
    voc.extend(sentence)
#     print(sentence)

print("Number of sentences: {}.".format(len(voc)))
print("Number of rows: {}.".format(len(train)))

In [None]:
voc[:10]

# Word2Vec model train save and load

The Word2Vec model uses 300 features. With one-hot encoding, the matrix would be about 150k * 2.6M, so the embedding avoids the curse of dimensionality.
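
A rough back-of-the-envelope comparison of the two representations. It assumes that 150k is the vocabulary size and 2.6M the number of reviews, which is my reading of the figures above; the variable names are illustrative only.

```python
# Rough size comparison, using the approximate figures quoted above.
n_reviews = 2_600_000      # about 2.6M reviews (assumed meaning of "2.6M")
vocab_size = 150_000       # about 150k distinct words (assumed meaning of "150k")
embedding_dim = 300        # Word2Vec vector size

one_hot_cells = n_reviews * vocab_size
dense_cells = n_reviews * embedding_dim

print(f"one-hot cells: {one_hot_cells:.2e}")
print(f"dense cells:   {dense_cells:.2e}")
print(f"reduction:     {one_hot_cells // dense_cells}x fewer columns")
```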

In [None]:
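# word2vector
num_features = 300     # dimensionality of the word vectors
min_word_count = 3     # words with frequency < 3 are ignored
num_workers = 16       # number of worker threads
context = 8            # context window size
downsampling = 1e-3    # downsampling threshold for frequent words

# Initialize and train the model (skip-gram with negative sampling).
# Note: `size` and `iter` are the gensim 3.x argument names;
# gensim >= 4 calls them `vector_size` and `epochs`.
W2Vmodel = Word2Vec(sentences=voc, sg=1, hs=0, workers=num_workers, size=num_features, min_count=min_word_count, window=context,
                    sample=downsampling, negative=5, iter=6)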

In [None]:
# Vocabulary size (gensim 3.x API; in gensim >= 4 use W2Vmodel.wv.key_to_index)
model_voc = set(W2Vmodel.wv.vocab.keys()) 
print(len(model_voc))

In [None]:
# model save
W2Vmodel.save("Word2Vec2")

In [None]:
# model load
W2Vmodel = Word2Vec.load('Word2Vec2')

### Challenge 2

Training the model sentence by sentence is more accurate than training it on whole reviews. Sentences have similar lengths, so the inputs have similar characteristics. That is why I did not remove '.' when removing noisy characters. This conclusion comes from comparing the two approaches.
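
The sentence-level structure can be sketched as follows. This is an illustration on a made-up review, with plain `str.split` standing in for `nltk.word_tokenize`.

```python
review = "great game. fast to learn and always changing. "

# Split the review into sentences on '.', then into word tokens,
# dropping sentences that end up empty.
sentences = [s.split() for s in review.split(".")]
sentences = [s for s in sentences if s]
print(sentences)

# A Word2Vec training corpus is then a flat list of such sentences,
# gathered across all reviews (the second review here is made up).
corpus = []
for tokenized_review in [sentences, [["love", "it"]]]:
    corpus.extend(tokenized_review)
print(len(corpus))
```

Each element of `corpus` is one tokenized sentence, which is the unit the model is trained on.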

In [None]:
def sentence_vectors(model, sentence):
    # Average the Word2Vec vectors of all known words in one review.
    # `sentence` is a list of tokenized sentences (a list of token lists).
    sent_vector = np.zeros(model.vector_size, dtype="float32")
    if sentence == [[]] or sentence == []:
        return sent_vector
    # Flatten the token lists into a single sequence of words
    words = np.concatenate(sentence)

    # Count the words that are known to the model
    nwords = 0
    # Sum up the vectors of all known words; `word in model.wv` avoids
    # rebuilding the whole vocabulary set on every call
    for word in words:
        if word in model.wv:
            sent_vector += model.wv[word]
            nwords += 1

    # Now get the average
    if nwords > 0:
        sent_vector /= nwords
    return sent_vector

In [None]:
train['sentence_vectors'] = list(map(lambda sen_group:
                                      sentence_vectors(W2Vmodel, sen_group),
                                      train.tokenized_sentences))
test['sentence_vectors'] = list(map(lambda sen_group:
                                    sentence_vectors(W2Vmodel, sen_group), 
                                    test.tokenized_sentences))

In [None]:
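def vectors_to_feats(df, ndim):
    # Stack the per-review vectors into an (n_rows, ndim) matrix and
    # expose each dimension as its own feature column w2v_0 ... w2v_{ndim-1}
    columns = [f'w2v_{i}' for i in range(ndim)]
    matrix = np.vstack(df['sentence_vectors'].to_list())[:, :ndim]
    return pd.DataFrame(matrix, index=df.index, columns=columns)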

In [None]:
X_train = vectors_to_feats(train, 300)
X_test = vectors_to_feats(test, 300)

In [None]:
train = pd.concat([X_train, y_train], axis=1)
test = pd.concat([X_test, y_test], axis=1)
train.to_csv('train_w2v_100k.csv')
test.to_csv('test_w2v_100k.csv')

In [None]:
# Load the feature files written in the previous cell
train = pd.read_csv('train_w2v_100k.csv').drop(columns = 'Unnamed: 0')
test = pd.read_csv('test_w2v_100k.csv').drop(columns = 'Unnamed: 0')
X_train = train.drop(columns = 'rate')
X_test = test.drop(columns = 'rate')
y_train = train.rate
y_test = test.rate

In [None]:
X_test

# Implement different regression models
I implement 4 regression models and compare them using root mean square error (RMSE) and mean absolute error (MAE).

**RMSE:** Root mean square error (RMSE) is the standard deviation of the residuals (prediction errors). Residuals measure how far the data points are from the regression line; RMSE measures how spread out these residuals are. It tells you how concentrated the data is around the line of best fit. 

**MAE:** Mean absolute error (MAE) is a measure of the errors between paired observations expressing the same phenomenon. It is the arithmetic average of the absolute errors |ei| = |yi - xi|, where yi is the prediction and xi the true value. 
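
Both metrics can be computed by hand on a few values. The ratings and predictions below are hypothetical and serve only to illustrate the two formulas.

```python
import math

y_true = [8.0, 6.0, 7.0, 9.0]   # hypothetical true ratings
y_pred = [7.0, 6.0, 9.0, 9.0]   # hypothetical predictions

errors = [p - t for p, t in zip(y_pred, y_true)]

# RMSE: square root of the mean squared error
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
# MAE: mean of the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(rmse, mae)
```

Because the errors are squared before averaging, RMSE penalizes the single large error (2.0) more heavily than MAE does.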

### Linear regression model
Linear regression is a basic and commonly used type of predictive analysis. The parameters of the linear equation are estimated with the least squares method.

[Linear regression introduction](https://machinelearningmastery.com/linear-regression-for-machine-learning/)
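
As a small illustration of least squares, the sketch below fits a line to toy data with `numpy.linalg.lstsq`. The data are made up, and this is not how `LinearRegression` is invoked in this notebook.

```python
import numpy as np

# Hypothetical toy data generated from y = 2*x + 1 (no noise)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

# Append a column of ones so the intercept is estimated as a weight
X_design = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve min_w ||X_design @ w - y||^2
w, residuals, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)
slope, intercept = w
print(slope, intercept)
```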

In [None]:
model_lr = LinearRegression()
model_lr.fit(X_train, y_train)

In [None]:
lr_y_predict=model_lr.predict(X_test)
y_test = np.array(y_test)

In [None]:
# (RMSE)
rmse = np.sqrt(mean_squared_error(y_test,lr_y_predict))

# (MAE)
mae = mean_absolute_error(y_test, lr_y_predict)

print('linear_regression_rmse = ', rmse)
print('linear_regression_mae = ', mae)

In [None]:
import os
os.makedirs('save', exist_ok=True)  # make sure the output directory exists
joblib.dump(model_lr, 'save/model_lr.pkl')

# model_lr = joblib.load('save/model_lr_1000k.pkl')

### SVR model
Support vector regression (SVR) is an application of the support vector machine (SVM) to regression problems.

Regression looks for the underlying relationship in a set of data. Regardless of whether the data falls into several categories, a formula is fitted to it, so that a new value can be predicted when a new input is given. SVR therefore looks for a hyperplane or function that fits all the data points, whatever their category, by keeping the distance from the data points to that hyperplane or function as small as possible.

[SVR introduction](https://towardsdatascience.com/an-introduction-to-support-vector-regression-svr-a3ebc1672c2)
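
SVR measures the fit with an epsilon-insensitive loss: errors smaller than epsilon are ignored, larger ones are penalized linearly. A minimal sketch follows; the function name is hypothetical, and 0.1 matches the default `epsilon` of scikit-learn's `SVR`.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Loss is zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

inside = epsilon_insensitive_loss(7.0, 7.05)   # error smaller than epsilon
outside = epsilon_insensitive_loss(7.0, 8.0)   # error larger than epsilon
print(inside, outside)
```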

In [None]:
model_svm = SVR()
model_svm.fit(X_train, y_train)

In [None]:
svm_y_predict=model_svm.predict(X_test)

In [None]:
# (RMSE)
rmse = np.sqrt(mean_squared_error(y_test,svm_y_predict))

# (MAE)
mae = mean_absolute_error(y_test, svm_y_predict)

print('svm_rmse = ', rmse)
print('svm_mae = ', mae)

In [None]:
# Save the SVR model (not the linear regression model)
joblib.dump(model_svm, 'save/model_svm.pkl')


### Bayesian Ridge model
In the Bayesian viewpoint, we formulate linear regression using probability distributions rather than point estimates. The response, y, is not estimated as a single value, but is assumed to be drawn from a probability distribution.

The output, y, is generated from a normal (Gaussian) distribution characterized by a mean and a variance. The mean for linear regression is the transpose of the weight matrix multiplied by the predictor matrix. The variance is the square of the standard deviation σ (multiplied by the identity matrix because this is a multi-dimensional formulation of the model).

[Bayesian Ridge introduction](https://towardsdatascience.com/introduction-to-bayesian-linear-regression-e66e60791ea7)
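
Written as a formula, the description above reads as follows, where the symbols β (weight matrix), X (predictor matrix) and I (identity matrix) are my notation for the quantities named in the text:

$$
y \sim \mathcal{N}\left(\beta^{\top} X,\ \sigma^{2} I\right)
$$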

In [None]:
model_bayes_ridge = BayesianRidge()
model_bayes_ridge.fit(X_train, y_train)

In [None]:
bayes_y_predict = model_bayes_ridge.predict(X_test)

In [None]:
# (RMSE)
rmse = np.sqrt(mean_squared_error(y_test,bayes_y_predict))

# (MAE)
mae = mean_absolute_error(y_test, bayes_y_predict)

print('BayesianRidge_rmse = ', rmse)
print('BayesianRidge_mae = ', mae)

In [None]:
joblib.dump(model_bayes_ridge, 'save/model_bayes.pkl')


### Random Forest Regression model

Random forest is a bagging technique and not a boosting technique. The trees in random forests are run in parallel. There is no interaction between these trees while building the trees.

The idea behind random forest regression is to combine bagging and ensembling with decision trees, as mentioned in the lecture.

[Random Forest Regression introduction](https://towardsdatascience.com/random-forest-and-its-implementation-71824ced454f)
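
The bagging idea can be sketched without any tree library. In this simplified illustration each "tree" is replaced by a hypothetical stand-in that predicts the mean of its bootstrap sample, and the toy ratings are made up.

```python
import random

random.seed(0)

# Hypothetical toy training targets (ratings)
ratings = [6.0, 7.0, 7.5, 8.0, 9.0, 5.5, 6.5, 8.5]

def fit_mean_predictor(sample):
    """Stand-in for a decision tree: always predicts the sample mean."""
    mean = sum(sample) / len(sample)
    return lambda: mean

# Bagging: fit each base learner on its own bootstrap sample
# (drawn with replacement), independently of the others.
n_estimators = 20
learners = []
for _ in range(n_estimators):
    bootstrap = random.choices(ratings, k=len(ratings))
    learners.append(fit_mean_predictor(bootstrap))

# The ensemble prediction is the average of the individual predictions
prediction = sum(learner() for learner in learners) / n_estimators
print(prediction)
```

Since the learners never look at each other's results, they could be fitted in parallel, which is the property mentioned above.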

In [None]:
model_random_forest_regressor = ensemble.RandomForestRegressor(n_estimators=20)
model_random_forest_regressor.fit(X_train, y_train)

In [None]:
random_forest_y_predict = model_random_forest_regressor.predict(X_test)

In [None]:
# (RMSE)
rmse = np.sqrt(mean_squared_error(y_test,random_forest_y_predict))

# (MAE)
mae = mean_absolute_error(y_test, random_forest_y_predict)

print('random_forest_rmse = ', rmse)
print('random_forest_mae = ', mae)

In [None]:
joblib.dump(model_random_forest_regressor, 'save/model_random_forest.pkl')


### Predict function for one review with four models

In [None]:
def predict(text):
    model_lr = joblib.load('save/model_lr.pkl')
    model_svm = joblib.load('save/model_svm.pkl')
    model_random_forest_regressor = joblib.load('save/model_random_forest.pkl')
    model_bayes_ridge = joblib.load('save/model_bayes.pkl')
    data = {'comment': Series(text)}
    data = pd.DataFrame(data)
    print(data)
    data['comment'] = data['comment'].apply(remove_between_square_brackets)
    data['comment'] = data['comment'].apply(remove_special_characters)
    data['comment'] = data['comment'].apply(simple_stemmer)
    data['comment'] = data['comment'].apply(remove_stopwords)

    # Lowercase, then split each review into sentences on '.'
    data['comment'] = data.comment.str.lower()
    data['document_sentences'] = data.comment.str.split('.')
    data['tokenized_sentences'] = list(
        map(lambda sentences: list(map(nltk.word_tokenize, sentences)), data.document_sentences))
    data['tokenized_sentences'] = list(
        map(lambda sentences: list(filter(lambda lst: lst, sentences)), data.tokenized_sentences))
    print(data)
    # sentence = data['tokenized_sentences'][0]
    W2Vmodel = Word2Vec.load("Word2Vec2")

    data['sentence_vectors'] = list(map(lambda sen_group:
                                        sentence_vectors(W2Vmodel, sen_group),
                                        data.tokenized_sentences))
    text = vectors_to_feats(data, 300)
    print(text)
    lr_y_predict = model_lr.predict(text)
    svm_y_predict = model_svm.predict(text)
    bayes_y_predict = model_bayes_ridge.predict(text)
    random_forest_y_predict = model_random_forest_regressor.predict(text)

    return lr_y_predict, svm_y_predict, random_forest_y_predict, bayes_y_predict


In [None]:
print(predict(["This is a great game.  I've even got a number of non game players enjoying it.  Fast to learn and always changing.",
        "This is a great game.  I've even got a number of non game players enjoying it.  Fast to learn and always changing."]))