# Using the scraped reviews with a trained model

Now that you have your data, you can use it however you like. Here is a small example that feeds the scraped reviews to a previously trained model.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

In [2]:
data = pd.read_csv('../data/scraped_reviews.csv') # load the data

In [3]:
data.head()

Unnamed: 0,reviews
0,"\r\n Jack Torrance needs to get away, anywher..."
1,\r\n Having watched Stanley Kubrick's film 'T...
2,\r\n I really enjoyed the 1980's film version...
3,\r\n I had never had the chance to pick up th...
4,\r\n This is my sixth (and the last) book of ...


As you can see, the text contains some unwanted characters (such as \r\n) left over from the HTML. We will remove them.

To keep things simple, I'll use just the first 500 reviews.

In [4]:
data = data.iloc[:500].copy()  # keep the first 500 rows; copy so later assignments act on an independent frame

Now I declare a function to remove unwanted characters.

In [5]:
def clean_html(text):
    # replace every run of non-alphanumeric characters with a single space
    text = re.sub(r'[^A-Za-z0-9]+', ' ', text)
    return text

In [6]:
data.reviews = data.reviews.map(clean_html)

In [7]:
data.head() # clean!

Unnamed: 0,reviews
0,Jack Torrance needs to get away anywhere will...
1,Having watched Stanley Kubrick s film The Shi...
2,I really enjoyed the 1980 s film version star...
3,I had never had the chance to pick up the nov...
4,This is my sixth and the last book of Stephen...
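
To see what the regular expression in clean_html actually does, here is a small standalone check. The sample string is made up for illustration. Note that a leading or trailing space survives the substitution, so the sketch also shows an optional .strip(), which the notebook itself does not use.

```python
import re

def clean_html(text):
    # same substitution as in the notebook
    return re.sub(r'[^A-Za-z0-9]+', ' ', text)

# made-up sample, not taken from the scraped data
sample = "\r\n Kubrick's film 'The Shining' (1980) was great!"

cleaned = clean_html(sample)
print(repr(cleaned))          # ' Kubrick s film The Shining 1980 was great '
print(repr(cleaned.strip()))  # 'Kubrick s film The Shining 1980 was great'
```

Punctuation such as apostrophes is replaced too, which is why "Kubrick's" becomes "Kubrick s" in the cleaned output above.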


Now I load my trained model, a logistic regression. It was trained on movie reviews for sentiment analysis, so it can serve as an example here.

In [8]:
path = '../models/trained_models/logit_optimizada.sav'

with open(path, 'rb') as f:  # the context manager closes the file after loading
    logistic_regression = pickle.load(f)  # loading the model ...
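
For context, a file like the one loaded here could be produced by pickling a fitted estimator. The sketch below is an assumption about how such a file might be written, not the actual training code behind this model: the toy texts, labels and temporary file name are invented, and only the step names mirror the pipeline used in this notebook. Keep in mind that pickle can execute arbitrary code on load, so only unpickle files you trust, ideally with the same scikit-learn version that saved them.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# toy training data, invented for this sketch (1 = positive, 0 = negative)
texts = ["great movie loved it", "wonderful acting", "terrible plot", "boring and bad"]
labels = [1, 1, 0, 0]

model = Pipeline(steps=[('vectorizador', TfidfVectorizer()),
                        ('clasificador', LogisticRegression())])
model.fit(texts, labels)

# save to a temporary file, then load it back the same way the notebook does
with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, 'toy_model.sav')  # hypothetical file name
    with open(tmp_path, 'wb') as f:
        pickle.dump(model, f)
    with open(tmp_path, 'rb') as f:
        restored = pickle.load(f)

# the restored pipeline accepts raw strings, just like the original
print(restored.predict(["loved the acting"]))
```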

In [9]:
logistic_regression  # it comes from a grid search, it's an optimized model

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('vectorizador', TfidfVectorizer()),
                                       ('clasificador', LogisticRegression())]),
             n_jobs=4,
             param_grid={'clasificador__C': [0.001, 0.01, 0.1, 1, 10, 100],
                         'clasificador__max_iter': [100, 250, 500, 1000],
                         'clasificador__penalty': ['l1', 'l2', 'elasticnet'],
                         'clasificador__solver': ['newton-cg', 'lbfgs',
                                                  'liblinear', 'sag', 'saga']},
             scoring='roc_auc', verbose=1)
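
Because the loaded object is a GridSearchCV, the winning configuration can be inspected through best_params_, and predict delegates to the best pipeline refitted on the full training data. The sketch below demonstrates this on a tiny invented dataset with a much smaller grid than the one above; the data, the grid and cv=2 are assumptions chosen to keep it fast.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# invented toy data: four positive and four negative texts
texts = ["great film", "loved it", "wonderful story", "really good book",
         "terrible film", "hated it", "awful story", "really bad book"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline(steps=[('vectorizador', TfidfVectorizer()),
                           ('clasificador', LogisticRegression())])

# parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(pipeline,
                      param_grid={'clasificador__C': [0.1, 1, 10]},
                      cv=2, scoring='roc_auc')
search.fit(texts, labels)

print(search.best_params_)   # the selected value of C
print(search.best_score_)    # mean cross-validated ROC AUC of that setting
print(search.predict(["good story", "bad film"]))
```

On the loaded model, the same attributes (logistic_regression.best_params_, logistic_regression.best_score_) should show which combination the original search selected.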

And now we are ready to predict! In this case, 1 means positive sentiment and 0 means negative sentiment.

In [10]:
data['predictions'] = logistic_regression.predict(data['reviews'])

data.head()

Unnamed: 0,reviews,predictions
0,Jack Torrance needs to get away anywhere will...,1
1,Having watched Stanley Kubrick s film The Shi...,0
2,I really enjoyed the 1980 s film version star...,1
3,I had never had the chance to pick up the nov...,0
4,This is my sixth and the last book of Stephen...,1


Translate the coded predictions into readable labels.

In [11]:
data.predictions = data.predictions.map(lambda x: 'Positive' if x == 1 else 'Negative')

data

Unnamed: 0,reviews,predictions
0,Jack Torrance needs to get away anywhere will...,Positive
1,Having watched Stanley Kubrick s film The Shi...,Negative
2,I really enjoyed the 1980 s film version star...,Positive
3,I had never had the chance to pick up the nov...,Negative
4,This is my sixth and the last book of Stephen...,Positive
...,...,...
495,Stephen King always delivers So much better t...,Negative
496,I started reading this book because of the am...,Negative
497,Amazing book everyone who likes reading shoul...,Positive
498,I can see why King objected so much to the Ku...,Positive
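
With the labels in place, a quick way to summarise the result is to count how many reviews fall into each class. This sketch uses a small stand-in DataFrame with made-up rows, since the real data lives in the notebook; on the notebook's DataFrame the same value_counts call would apply to data.predictions.

```python
import pandas as pd

# stand-in for the notebook's DataFrame; rows are made up for illustration
data = pd.DataFrame({
    'reviews': ["loved it", "not for me", "a classic", "too long"],
    'predictions': [1, 0, 1, 0],
})

# a dict lookup is an alternative to the lambda used above
label_names = {1: 'Positive', 0: 'Negative'}
data['predictions'] = data['predictions'].map(label_names)

counts = data['predictions'].value_counts()                # absolute counts per label
shares = data['predictions'].value_counts(normalize=True)  # fractions that sum to 1
print(counts)
print(shares)
```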


Done! As you can see, once you have a trained text model, building a scraper to gather new data is no big deal.

Keep in mind that this model was trained on a completely different kind of text, so the predictions will be really poor. This is just an example to show how easy it is to get data from the wild into your model.