In [1]:
import pandas as pd 
import numpy as np 

path = '../Datasets/cleaned_data.csv'
df = pd.read_csv(path)

df.head()

Unnamed: 0,rating,feedback,clean_reviews,Positive,Negative,Neutral,reviews_length
0,5,1,love echo,0.808,0.0,0.192,9
1,5,1,love,1.0,0.0,0.0,4
2,4,1,sometim play game answer question correct alex...,0.223,0.141,0.636,99
3,5,1,lot fun thing yr old learn dinosaur control l...,0.564,0.0,0.436,101
4,5,1,music,0.0,0.0,1.0,5


Splitting the data into training and test sets so the model's accuracy can be measured on unseen reviews

In [2]:
from sklearn.model_selection import train_test_split

X = df.loc[:,'clean_reviews']
y = df.loc[:,'rating']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=.8, random_state=0)
X_train.shape

(2520,)
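The ratings in this dataset are unevenly distributed, so a random split can give the training and test sets slightly different class proportions. As a minimal sketch, assuming a hypothetical toy label array `y_toy` (not the notebook's data), the `stratify` argument of `train_test_split` keeps the class proportions the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 80 five-star and 20 one-star reviews.
y_toy = np.array([5] * 80 + [1] * 20)
X_toy = np.arange(100)  # stand-in for the review texts

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.8, random_state=0, stratify=y_toy
)

# Both parts keep the 80/20 class ratio of the full toy data.
print((y_tr == 1).sum(), (y_te == 1).sum())
```

Whether stratification changes the results on the real data is not tested here; it is shown only as an option.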

### Vectorization

<img src='https://alvinntnu.github.io/NTNU_ENC2045_LECTURES/_images/text-representation-bow.gif' width='300' height='200' style="float: right;margin:5px 20px 5px 1px">  

The bag-of-words model is the simplest way to vectorize texts into numeric representations, and it is easy to automate. In short, it represents each text by its word-frequency list.  

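Issues with the bag-of-words text representation:  
- Word order is ignored.  
- Raw absolute frequency counts of words do not necessarily represent the meaning of the text properly.  
- Marginal frequencies play an important role: both the row totals (document length) and the column totals (how common a word is overall) affect how a raw count should be read.  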
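To make the first issue concrete, here is a small sketch on a hypothetical toy corpus (the sentences are made up for illustration). Two sentences with opposite meanings end up with identical count vectors because only word frequencies are kept:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: the first two documents differ only in word order.
docs = ["dog bites man", "man bites dog", "dog dog barks"]

toy_cv = CountVectorizer()
rows = toy_cv.fit_transform(docs).toarray()

# Columns are ordered alphabetically by term.
vocab = sorted(toy_cv.vocabulary_)
print(vocab)
print(rows)

# Word order is lost: the first two rows are identical.
same = (rows[0] == rows[1]).all()
print(same)
```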

In [3]:
from sklearn.feature_extraction.text import CountVectorizer

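# min_df=0. and max_df=1. are proportions, so no terms are filtered out
cv = CountVectorizer(min_df=0., max_df=1.)
# Cast to unicode so that missing reviews (NaN) become strings instead of raising an error
X_train_vector = cv.fit_transform(X_train.values.astype('U'))
X_test_vector = cv.transform(X_test.values.astype('U'))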
X_train_vector.shape, X_test_vector.shape

((2520, 2880), (630, 2880))

In [4]:
from sklearn.feature_extraction.text import TfidfTransformer

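tfidfTransformer = TfidfTransformer()
tfidfTransformer.fit(X_train_vector)
# Note: the TF-IDF matrix is only displayed here, not stored,
# so the model below is trained on the raw count vectors.
tfidfTransformer.transform(X_train_vector)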

<2520x2880 sparse matrix of type '<class 'numpy.float64'>'
	with 28880 stored elements in Compressed Sparse Row format>
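In the cell above the transformed matrix is displayed but not assigned to a variable. If TF-IDF weights were meant to be used for training, the results would have to be kept for both splits. A minimal sketch on a hypothetical toy corpus (the names `train_docs`, `test_docs`, `train_tfidf` and `test_tfidf` are illustrative, not from this notebook):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical toy data standing in for the cleaned reviews.
train_docs = ["love echo", "love music", "sound bad"]
test_docs = ["love sound"]

toy_cv = CountVectorizer()
train_counts = toy_cv.fit_transform(train_docs)
test_counts = toy_cv.transform(test_docs)

# Fit the IDF weights on the training counts only, then reuse them for the test counts.
toy_tfidf = TfidfTransformer()
train_tfidf = toy_tfidf.fit_transform(train_counts)
test_tfidf = toy_tfidf.transform(test_counts)

print(train_tfidf.shape, test_tfidf.shape)

# "sound" occurs in fewer training documents than "love", so it gets the larger weight.
love_w = test_tfidf[0, toy_cv.vocabulary_["love"]]
sound_w = test_tfidf[0, toy_cv.vocabulary_["sound"]]
print(sound_w > love_w)
```

Training on `train_tfidf` instead of the raw counts would be a different experiment from the one reported below, so the scores in this notebook would not carry over.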

### Building Machine Learning Model
In this part, I'll try a *Random Forest Classifier*. The other machine learning models are examined in the `../Notebooks/ml_boosting_search.ipynb` notebook, which also contains the final model.

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the `max_samples` parameter if `bootstrap=True` (default), otherwise the whole dataset is used to build each tree.
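The averaging described above can be checked directly on a fitted forest. The sketch below uses synthetic data from `make_classification` and illustrative parameter values (25 trees, `max_samples=0.5`), which are assumptions and not the settings used in this notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic toy data, only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Each tree is fit on a bootstrap sample half the size of the training set.
forest = RandomForestClassifier(
    n_estimators=25, bootstrap=True, max_samples=0.5, random_state=0
)
forest.fit(X_toy, y_toy)
print(len(forest.estimators_))

# The forest's class probabilities are the mean of the per-tree probabilities.
tree_mean = np.mean([tree.predict_proba(X_toy) for tree in forest.estimators_], axis=0)
matches = np.allclose(tree_mean, forest.predict_proba(X_toy))
print(matches)
```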

In [5]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=160, max_features='log2', max_depth=4, criterion='gini')

Fitting the model and predicting on the test set

In [6]:
%time rf.fit(X_train_vector, y_train)

CPU times: total: 219 ms
Wall time: 232 ms


In [7]:
y_preds = rf.predict(X_test_vector)

Calculating accuracy and error metrics

In [8]:
from sklearn import metrics

accuracy = metrics.accuracy_score(y_test, y_preds)
# MAE and MAPE treat the ratings (1-5) as numbers, so they measure how many stars off a prediction is
mae = metrics.mean_absolute_error(y_test, y_preds)
mape = metrics.mean_absolute_percentage_error(y_test, y_preds)

print(f'''
Sklearn Accuracy Score: {(accuracy*100):.2f} \n
Square Root of (MAE x 100): {np.sqrt(mae*100):.2f} \n
Mean Absolute Percentage Error: {(mape*100):.2f}
''')


Sklearn Accuracy Score: 70.95 

Square Root of (MAE x 100): 7.59 

Mean Absolute Percentage Error: 33.65



In [9]:
metrics.confusion_matrix(y_test, y_preds)

array([[  0,   0,   0,   0,  33],
       [  0,   0,   0,   0,  21],
       [  0,   0,   0,   0,  39],
       [  0,   0,   0,   0,  90],
       [  0,   0,   0,   0, 447]], dtype=int64)
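Every non-zero entry of this confusion matrix sits in the last column, which means the model predicts rating 5 for every test review. The short sketch below copies the matrix from the output above (rows are the true ratings 1 to 5) and derives a few numbers from it. The variable names are illustrative:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true rating, columns: predicted rating).
cm = np.array([
    [0, 0, 0, 0, 33],
    [0, 0, 0, 0, 21],
    [0, 0, 0, 0, 39],
    [0, 0, 0, 0, 90],
    [0, 0, 0, 0, 447],
])

support = cm.sum(axis=1)          # number of test reviews per true rating
recall = np.diag(cm) / support    # share of each rating that was predicted correctly
acc = np.trace(cm) / cm.sum()     # overall accuracy
majority_share = support.max() / support.sum()
balanced_acc = recall.mean()      # mean per-class recall

print(recall)
print(f"{acc:.4f} {majority_share:.4f} {balanced_acc:.2f}")
```

The accuracy equals the share of 5-star reviews in the test set (447 of 630), so the reported score is what a constant "always predict 5" baseline would achieve. The mean per-class recall is a more informative number for data this imbalanced.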