# Part 6: Machine Learning and Deep NLP

*Christina Brockway*

### Tasks:

**Machine Learning Model**
-  Drop any reviews that don't have a rating
-  Use the original review column as X and the classification as y
-  Use a modeling pipeline that contains both the text vectorizer and the model
    -  Select a sklearn vectorizer
    -  Select a classification model
-  Fit and evaluate the model
-  Document observations from the results

**Improve the Model: GridSearch Text Vectorization**
-  Construct a grid of parameters
-  Fit and evaluate grid search results
-  Document:
    -  What were best parameters?
    -  How does the best estimator perform? 
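
A minimal sketch of the grid search described above, assuming a TF-IDF plus random forest pipeline like the one built later in this notebook. The toy corpus, the parameter values, and the name `demo_grid` are illustrative assumptions, not the settings used in this project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical), standing in for the review text and labels
texts = ["loved it", "great film", "wonderful acting", "a real joy",
         "hated it", "awful film", "terrible acting", "a real bore"]
labels = ["high"] * 4 + ["low"] * 4

pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=42)),
])

# Pipeline step names prefix the parameter names: <step>__<parameter>
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "vectorizer__use_idf": [True, False],
}

demo_grid = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
demo_grid.fit(texts, labels)

# Best parameter combination found, and the pipeline refit with it
print(demo_grid.best_params_)
best_pipe = demo_grid.best_estimator_
```

Because the vectorizer sits inside the pipeline, each cross-validation fold refits it on that fold's training text only, so the vocabulary never leaks from the held-out fold.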

**Deep NLP (RNN)**
-  Create train/test/validation datasets
    -  Convert target categories to integers using LabelEncoder
    -  Create a TensorFlow dataset from X and y
    -  Split the data
-  Create a Keras TextVectorization layer for the RNN model
    -  Fit/adapt it on the training text
    -  Save the vocabulary size to use in the embedding layer
-  Build the RNN
-  Fit and evaluate the model
-  Document observations
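
The label encoding and three-way split from the list above can be sketched without TensorFlow. This is a sketch under assumptions: it uses `LabelEncoder` as the task list asks, but swaps in two calls to sklearn's `train_test_split` in place of splitting a TensorFlow dataset, and the toy data and 14/4/2 split sizes are made up for illustration.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Toy data (hypothetical): 20 reviews, half "high" and half "low"
texts = [f"review number {i}" for i in range(20)]
labels = ["high", "low"] * 10

# LabelEncoder sorts the classes alphabetically: high -> 0, low -> 1
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Hold out 6 of the 20 samples, keeping the class balance in each part
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, y, test_size=6, stratify=y, random_state=42)

# Divide the held-out samples into 4 for validation and 2 for testing
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=2, stratify=y_rest, random_state=42)

print(list(encoder.classes_))
print(len(X_train), len(X_val), len(X_test))
```

`encoder.inverse_transform` maps integer predictions back to the original category names when reporting results.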

#### Load Data and Imports

In [3]:
!pip install Unidecode

Collecting Unidecode
  Obtaining dependency information for Unidecode from https://files.pythonhosted.org/packages/84/b7/6ec57841fb67c98f52fc8e4a2d96df60059637cba077edc569a302a8ffc7/Unidecode-1.3.8-py3-none-any.whl.metadata
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Installing collected packages: Unidecode
Successfully installed Unidecode-1.3.8


In [47]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
pd.set_option('display.max_colwidth',300)

import nltk
from nltk.tokenize import word_tokenize
from nltk import TweetTokenizer
from nltk import ngrams
from nltk.probability import FreqDist
from nltk.stem import WordNetLemmatizer

import spacy
nlp=spacy.load('en_core_web_sm')

from wordcloud import WordCloud
from wordcloud import STOPWORDS

import joblib
import my_functions as mf


In [50]:
import sys, os
# Add the parent directory to the Python path so that modules
# located there can be imported
sys.path.append(os.path.abspath('../'))


In [17]:
df = joblib.load('data-NLP/processed_data.joblib')
df.head(1)

Unnamed: 0,review,rating,html,length,tokens,lemmas,tokens-joined,lemmas-joined
1,"a guilty pleasure for me personally, as i love...",9.0,,251,"[guilty, pleasure, personally, love, great, es...","[guilty, pleasure, personally, love, great, es...",guilty pleasure personally love great escape w...,guilty pleasure personally love great escape w...


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2422 entries, 1 to 8647
Data columns (total 8 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   review         2422 non-null   object 
 1   rating         2422 non-null   float64
 2   html           160 non-null    object 
 3   length         2422 non-null   int64  
 4   tokens         2422 non-null   object 
 5   lemmas         2422 non-null   object 
 6   tokens-joined  2422 non-null   object 
 7   lemmas-joined  2422 non-null   object 
dtypes: float64(1), int64(1), object(6)
memory usage: 170.3+ KB


In [22]:
# Drop irrelevant columns.
# No reviews need to be dropped for a missing rating: df.info() above
# shows 2422 non-null ratings for 2422 rows.
df = df.drop(columns=['html', 'length', 'tokens', 'lemmas', 'tokens-joined', 'lemmas-joined'])
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2422 entries, 1 to 8647
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   review  2422 non-null   object 
 1   rating  2422 non-null   float64
dtypes: float64(1), object(1)
memory usage: 121.3+ KB


In [31]:
# Split into high and low reviews
dff = df.copy()
def rating_groups(x):
    """Label a rating 'high' (>= 8.5) or 'low' (<= 4.0); anything in between gets None."""
    if x >= 8.5:
        return "high"
    elif x <= 4.0:
        return "low"
    else:
        return None

In [32]:
dff['label'] = df['rating'].map(rating_groups)
dff['label'].value_counts()

low     1223
high    1199
Name: label, dtype: int64
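
The counts above add up to all 2422 rows, so no review in this dataset falls between the two thresholds. On data that does contain mid-range ratings, `rating_groups` returns `None` for them, and those rows would need to be dropped before modeling. A small sketch with made-up ratings (the `demo` frame is hypothetical):

```python
import pandas as pd

def rating_groups(x):
    # Same thresholds as the function defined above
    if x >= 8.5:
        return "high"
    elif x <= 4.0:
        return "low"
    else:
        return None

# Made-up ratings, including two mid-range values (6.0 and 7.0)
demo = pd.DataFrame({"rating": [9.0, 3.0, 6.0, 8.5, 4.0, 7.0]})
demo["label"] = demo["rating"].map(rating_groups)

# Mid-range ratings have a missing label, so drop them before modeling
labeled = demo.dropna(subset=["label"])
print(labeled["label"].value_counts())
```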

In [33]:
dff.head(2)

Unnamed: 0,review,rating,label
1,"a guilty pleasure for me personally, as i love...",9.0,high
6,"the first underwhelmed me, but this one straig...",3.0,low


In [None]:
dff=dff.drop(columns=('rating'))

In [38]:
#Define X and y
X=dff['review']
y=dff['label']

#Train Test Split
X_train, X_test, y_train, y_test=train_test_split(X,y, random_state=42)
y_train.value_counts(normalize=True)

high    0.503855
low     0.496145
Name: label, dtype: float64

In [39]:
len(X_test)

606

In [40]:
len(X_train)

1816

In [46]:
#Create pipeline

# Select a sklearn vectorizer
tfidf_vectorizer = TfidfVectorizer(lowercase=True, stop_words ='english')

#Use RandomForestClassifier
rfc=RandomForestClassifier(random_state=42)

rfc_pipe = Pipeline([('vectorizer', tfidf_vectorizer), ('classifier', rfc)])


In [45]:
rfc_pipe.fit(X_train, y_train)

In [52]:
mf.evaluate_classification_network(rfc_pipe, X_train,y_train, X_test, y_test)


- Evaluating Network...


AttributeError: 'Series' object has no attribute 'as_numpy_iterator'
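
The traceback points at the cause: `as_numpy_iterator` is a method of TensorFlow's `tf.data.Dataset`, so `evaluate_classification_network` appears to expect TensorFlow datasets rather than the pandas Series passed here. A sklearn pipeline can instead be scored directly with the `metrics` module imported earlier. A minimal sketch, assuming toy data in place of the review split (the corpus and the name `demo_pipe` are hypothetical, and nothing here reflects results on the real data):

```python
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Toy data (hypothetical), standing in for X_train, y_train, X_test, y_test
train_texts = ["loved every minute", "a great movie",
               "hated every minute", "an awful movie"]
train_labels = ["high", "high", "low", "low"]
test_texts = ["great and loved", "awful and hated"]
test_labels = ["high", "low"]

demo_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("classifier", RandomForestClassifier(random_state=42)),
])
demo_pipe.fit(train_texts, train_labels)

# Report on both splits so that overfitting is easy to spot
for name, texts, labels in [("train", train_texts, train_labels),
                            ("test", test_texts, test_labels)]:
    preds = demo_pipe.predict(texts)
    print(f"--- {name} ---")
    print(metrics.classification_report(labels, preds, zero_division=0))
    print(metrics.confusion_matrix(labels, preds, labels=["high", "low"]))
```

With the notebook's own objects, the same calls would compare `rfc_pipe.predict(X_test)` against `y_test`, and `rfc_pipe.predict(X_train)` against `y_train`.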