<a href="https://colab.research.google.com/github/ArunKoundinya/DeepLearning/blob/main/posts/deep-learning-project-msis/AmazonReviews_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Amazon Reviews Sentiment Analysis - Part 2

In this notebook we perform the following steps:

-  Loading the data
-  Traditional ML: Random Forest on a 100K-row (1 lakh) sample using bi-grams, with hyperparameter tuning
-  Traditional ML: Random Forest and SVM on 2K- and 10K-row samples using uni-grams and bi-grams, with both count and TF-IDF features, with hyperparameter tuning

## Table of Contents
- [1 - Packages](#1)
- [2 - Loading the Dataset](#2)
- [3 - Traditional ML](#3)
    - [Random Forest with 100K Dataset](#best-model)
    - [Random Forest & SVM with 2K and 10K Datasets](#best-model-1)

<a name='1'></a>
## 1 - Packages

In [None]:
!pip install pandarallel

Collecting pandarallel
  Downloading pandarallel-1.6.5.tar.gz (14 kB)
  Preparing metadata (setup.py) ... done
Collecting dill>=0.3.1 (from pandarallel)
  Downloading dill-0.3.8-py3-none-any.whl (116 kB)
     116.3/116.3 kB 2.7 MB/s eta 0:00:00
Building wheels for collected packages: pandarallel
  Building wheel for pandarallel (setup.py) ... done
  Created wheel for pandarallel: filename=pandarallel-1.6.5-py3-none-any.whl size=16673 sha256=826f449636a5923f88b6df7b1281db0266c5b15779adaae2b48879de97c44033
  Stored in directory: /root/.cache/pip/wheels/50/4f/1e/34e057bb868842209f1623f195b74fd7eda229308a7352d47f
Successfully built pandarallel
Installing collected packages: dill, pandarallel
Successfully installed dill-0.3.8 pandarallel-1.6.5


In [None]:
from google.colab import drive
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from itertools import product

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, classification_report

from pandarallel import pandarallel


In [None]:
# Initialize pandarallel
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


<a name='2'></a>
## 2 - Loading the Dataset

We load the sampled train and test datasets that were saved in Part 1.

In [None]:
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/')

testdata = pd.read_csv('test_data_sample_complete.csv')
traindata = pd.read_csv('train_data_sample_complete.csv')


Mounted at /content/drive



Here we draw a random sample of 100K (1 lakh) training rows and 10K test rows with random state `42`, then remap the labels from {1, 2} to {0, 1}.


In [None]:
train_data = traindata.sample(n=100000, random_state=42)
test_data = testdata.sample(n=10000, random_state=42)

train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})
test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})

In [None]:
#del traindata,testdata

In [None]:
train_data['class_index'].value_counts()

class_index
0    50013
1    49987
Name: count, dtype: int64

In [None]:
train_data.head(1)

| | class_index | review_combined_lemma |
|---|---|---|
| 2079998 | 0 | expensive junk product consists piece thin fle... |


In [None]:
train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')
test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')

<a name='3'></a>
## 3 - Traditional ML

In [None]:
X_train = train_data.review_combined_lemma
y_train = train_data.class_index

X = test_data.review_combined_lemma
y = test_data.class_index

In [None]:
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [None]:
print(X_train.shape)
print(X_dev.shape)
print(X_test.shape)

(100000,)
(5000,)
(5000,)


In [None]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100000 entries, 2079998 to 100143
Data columns (total 2 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   class_index            100000 non-null  int64 
 1   review_combined_lemma  100000 non-null  object
dtypes: int64(1), object(1)
memory usage: 2.3+ MB



Instead of using cross-validation, we took the dev set from the same distribution as the test set (the held-out sample is split in half) and tuned the model's hyperparameters against it.

Drawing the dev and test sets from the same distribution is a sound way to define the train, dev, and test datasets, because the dev score then reflects what to expect on the test set.
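As a minimal sketch of this dev/test protocol, the snippet below splits a held-out sample in half with `train_test_split`. The toy texts and labels are hypothetical stand-ins for the review data, and the `stratify` argument is an addition that the notebook itself does not use; it keeps the class balance the same in both halves.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the held-out reviews and their 0/1 labels.
held_out_texts = [f"review {i}" for i in range(20)]
held_out_labels = [i % 2 for i in range(20)]

# Split the held-out sample in half: one half for tuning (dev),
# the other for the final check (test).
X_dev, X_test, y_dev, y_test = train_test_split(
    held_out_texts,
    held_out_labels,
    test_size=0.5,
    random_state=42,
    stratify=held_out_labels,  # assumption: not used in the notebook
)

print(len(X_dev), len(X_test))
```

Only the dev half should be used to pick hyperparameters; the test half is scored once at the end.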


<a name='best-model'></a>
## 3.1 Random Forest with Hyperparameter Tuning

In [None]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5, 10]
}

best_model = None
best_params = None
best_dev_accuracy = 0.0


for rf_params in product(*param_grid.values()):
    #print(rf_params)
    pipeline = Pipeline([
        ('bigram', CountVectorizer(ngram_range=(2, 2))),
        ('randomforestclassifier', RandomForestClassifier(random_state=42, **dict(zip(param_grid.keys(), rf_params))))
    ])

    pipeline.fit(X_train, y_train)

    dev_predictions = pipeline.predict(X_dev)
    dev_accuracy = accuracy_score(y_dev, dev_predictions)

    if dev_accuracy > best_dev_accuracy:
        best_dev_accuracy = dev_accuracy
        best_model = pipeline
        best_params = dict(zip(param_grid.keys(), rf_params))

print("Best parameters:", best_params)
print("Development set accuracy:", best_dev_accuracy)

dev_predictions = best_model.predict(X_dev)
accuracy_score(y_dev, dev_predictions)

Best parameters: {'n_estimators': 200, 'max_depth': 20, 'min_samples_split': 2}
Development set accuracy: 0.7908


0.7908

In [None]:
print("Best Model Hyperparameters:")
print(best_params)
print(f"Development Accuracy: {best_dev_accuracy}")

Best Model Hyperparameters:
{'n_estimators': 200, 'max_depth': 20, 'min_samples_split': 2}
Development Accuracy: 0.7908


In [None]:
train_predictions = best_model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Training Accuracy: {train_accuracy}")

test_predictions = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test Accuracy: {test_accuracy}")

Training Accuracy: 0.82103
Test Accuracy: 0.7934


In [None]:
best_model

On the Colab runtime (GPU/CPU), the code above took about 10 minutes.

Running it on the whole sample would take more than 4 hours, and the session kept getting disconnected partway through when we tried.
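One possible way to cut the runtime of such a search, sketched here under assumptions, is to fit the vectorizer once and reuse the resulting sparse matrices for every hyperparameter combination, instead of re-fitting it inside each pipeline. The tiny corpus, the reduced grid, and the use of `ParameterGrid` and `n_jobs=-1` are illustrative choices, not what the notebook ran.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid

# Tiny hypothetical corpus standing in for the lemmatized reviews.
train_texts = [
    "great product love", "terrible junk broke", "love great value",
    "junk waste money", "great value love", "broke terrible waste",
]
train_labels = [1, 0, 1, 0, 1, 0]
dev_texts = ["love great value", "terrible junk broke"]
dev_labels = [1, 0]

# Fit the bi-gram vectorizer once; every grid point reuses these matrices.
vectorizer = CountVectorizer(ngram_range=(2, 2))
X_train_vec = vectorizer.fit_transform(train_texts)
X_dev_vec = vectorizer.transform(dev_texts)

# Reduced grid for illustration only.
param_grid = {"n_estimators": [10, 20], "max_depth": [5, 10]}

best_params, best_dev_accuracy = None, -1.0
for params in ParameterGrid(param_grid):
    # n_jobs=-1 builds the trees on all available cores.
    model = RandomForestClassifier(random_state=42, n_jobs=-1, **params)
    model.fit(X_train_vec, train_labels)
    dev_accuracy = accuracy_score(dev_labels, model.predict(X_dev_vec))
    if dev_accuracy > best_dev_accuracy:
        best_params, best_dev_accuracy = params, dev_accuracy

print(best_params, best_dev_accuracy)
```

The trade-off is that the best classifier is no longer wrapped in a single `Pipeline`, so the fitted vectorizer has to be kept alongside it for later predictions.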


<a name='best-model-1'></a>
## 3.2 Random Forest & SVM with Hyperparameter Tuning

Now let us run the same search with both a count vectorizer and a TF-IDF vectorizer over uni-grams and bi-grams, first on a 2K sample and then on a 10K sample, comparing SVM alongside Random Forest.

I expect this to take longer.

In [None]:
train_data = traindata.sample(n=2000, random_state=42)
test_data = testdata.sample(n=100, random_state=42)

train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})
test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})

train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')
test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')

X_train = train_data.review_combined_lemma
y_train = train_data.class_index

X = test_data.review_combined_lemma
y = test_data.class_index

X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [None]:
# Defining the variations of vectorizers
vectorizers = {
    'CountVectorizer': CountVectorizer(ngram_range=(1, 2)),
    'TfidfVectorizer': TfidfVectorizer(ngram_range=(1, 2))
}

# Defining the classifiers along with their respective hyperparameters
classifiers = {
    'RandomForestClassifier': {
        'model': RandomForestClassifier,
        'params': {
            'n_estimators': [200],
            'max_depth': [10, 20],
            'min_samples_split': [2, 5, 10]
        }
    },
    'SVC': {
        'model': SVC,
        'params': {
            'kernel': ['linear', 'rbf']
        }
    }
}

# Initialize variables to store the best model and its performance
best_model_new = None
best_dev_accuracy_new = 0.0

# Iterate over all combinations of vectorizers and classifiers
for vectorizer_name, vectorizer in vectorizers.items():
    for classifier_name, classifier_data in classifiers.items():
        classifier_model = classifier_data['model']
        classifier_params = classifier_data['params']

        for params in product(*classifier_params.values()):
            if classifier_model == RandomForestClassifier:
                pipeline = Pipeline([
                    ('vectorizer', vectorizer),
                    ('classifier', classifier_model(random_state=42, **dict(zip(classifier_params.keys(), params))))
                ])
            elif classifier_model == SVC:
                pipeline = Pipeline([
                    ('vectorizer', vectorizer),
                    ('classifier', classifier_model(**dict(zip(classifier_params.keys(), params))))
                ])

            pipeline.fit(X_train, y_train)

            dev_predictions = pipeline.predict(X_dev)
            dev_accuracy = accuracy_score(y_dev, dev_predictions)

            if dev_accuracy > best_dev_accuracy_new:
                best_dev_accuracy_new = dev_accuracy
                best_model_new = pipeline

print(f"Best Development Accuracy: {best_dev_accuracy_new}")

print("Best Model:")
print(best_model_new)

dev_predictions = best_model_new.predict(X_dev)
print(f"Accuracy on Development Set: {accuracy_score(y_dev, dev_predictions)}")


Best Development Accuracy: 0.94
Best Model:
Pipeline(steps=[('vectorizer', CountVectorizer(ngram_range=(1, 2))),
                ('classifier',
                 RandomForestClassifier(max_depth=20, n_estimators=200,
                                        random_state=42))])
Accuracy on Development Set: 0.94



The code above took about `5` minutes.

On the 10K sample it is estimated to run for about `50` minutes, and the larger vocabulary adds to that.

Also, because of the number of iterations, I would rather spend my compute resources on the Deep Learning models than on this search.

However, we will use the 10K miniature dataset as the baseline for the NN modeling for now, given the constraints on execution time and compute units.
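Runtime estimates like these can be checked by timing a fit at a couple of sample sizes before committing to a long run. The helper below is a generic sketch; `time_call` and `fake_fit` are hypothetical names, and `fake_fit` only simulates work in place of a real `pipeline.fit` call.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_fit(n_rows):
    """Hypothetical stand-in for fitting a pipeline on n_rows samples."""
    return sum(i * i for i in range(n_rows))


# Time the stand-in at the two sample sizes used in this section.
timings = {}
for n_rows in (2_000, 10_000):
    _, seconds = time_call(fake_fit, n_rows)
    timings[n_rows] = seconds

print(timings)
```

In the notebook, `fake_fit` would be replaced by a function that samples `n_rows` reviews and fits one pipeline on them.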

In [None]:
train_data = traindata.sample(n=10000, random_state=42)
test_data = testdata.sample(n=1000, random_state=42)

train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})
test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})

train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')
test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')

X_train = train_data.review_combined_lemma
y_train = train_data.class_index

X = test_data.review_combined_lemma
y = test_data.class_index

X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [None]:
# Defining the variations of vectorizers
vectorizers = {
    'CountVectorizer': CountVectorizer(ngram_range=(1, 2)),
    'TfidfVectorizer': TfidfVectorizer(ngram_range=(1, 2))
}

# Defining the classifiers along with their respective hyperparameters
classifiers = {
    'RandomForestClassifier': {
        'model': RandomForestClassifier,
        'params': {
            'n_estimators': [200],
            'max_depth': [10, 20],
            'min_samples_split': [2, 5, 10]
        }
    },
    'SVC': {
        'model': SVC,
        'params': {
            'kernel': ['linear', 'rbf']
        }
    }
}

# Initialize variables to store the best model and its performance
best_model_final = None
best_dev_accuracy_final = 0.0

# Iterate over all combinations of vectorizers and classifiers
for vectorizer_name, vectorizer in vectorizers.items():
    for classifier_name, classifier_data in classifiers.items():
        classifier_model = classifier_data['model']
        classifier_params = classifier_data['params']

        for params in product(*classifier_params.values()):
            if classifier_model == RandomForestClassifier:
                pipeline = Pipeline([
                    ('vectorizer', vectorizer),
                    ('classifier', classifier_model(random_state=42, **dict(zip(classifier_params.keys(), params))))
                ])
            elif classifier_model == SVC:
                pipeline = Pipeline([
                    ('vectorizer', vectorizer),
                    ('classifier', classifier_model(**dict(zip(classifier_params.keys(), params))))
                ])

            pipeline.fit(X_train, y_train)

            dev_predictions = pipeline.predict(X_dev)
            dev_accuracy = accuracy_score(y_dev, dev_predictions)

            if dev_accuracy > best_dev_accuracy_final:
                best_dev_accuracy_final = dev_accuracy
                best_model_final = pipeline

print(f"Best Development Accuracy: {best_dev_accuracy_final}")

print("Best Model:")
print(best_model_final)

# Re-score the pipeline selected in this run (best_model_final, not the earlier best_model_new)
dev_predictions = best_model_final.predict(X_dev)
print(f"Accuracy on Development Set: {accuracy_score(y_dev, dev_predictions)}")


Best Development Accuracy: 0.884
Best Model:
Pipeline(steps=[('vectorizer', CountVectorizer(ngram_range=(1, 2))),
                ('classifier', SVC(kernel='linear'))])
Accuracy on Development Set: 0.884


In [None]:
best_model_final

In [None]:
train_predictions = best_model_final.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Training Accuracy: {train_accuracy}")

test_predictions = best_model_final.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test Accuracy: {test_accuracy}")

Training Accuracy: 1.0
Test Accuracy: 0.87


### Ambitious Part: Checking Whether the Larger Run Executes

In [None]:
!pip install pandarallel



In [None]:
from google.colab import drive
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from itertools import product

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, classification_report

from pandarallel import pandarallel


In [None]:
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/')

testdata = pd.read_csv('test_data_sample_complete.csv')
traindata = pd.read_csv('train_data_sample_complete.csv')


Mounted at /content/drive


In [None]:
train_data = traindata.sample(n=100000, random_state=42)
test_data = testdata.sample(n=10000, random_state=42)

train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})
test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})

train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')
test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')

X_train = train_data.review_combined_lemma
y_train = train_data.class_index

X = test_data.review_combined_lemma
y = test_data.class_index

X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [None]:
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5]
}

best_model = None
best_params = None
best_dev_accuracy = 0.0


for rf_params in product(*param_grid.values()):
    #print(rf_params)
    pipeline = Pipeline([
        ('bigram', TfidfVectorizer(ngram_range=(1, 2))),
        ('randomforestclassifier', RandomForestClassifier(random_state=42, **dict(zip(param_grid.keys(), rf_params))))
    ])

    pipeline.fit(X_train, y_train)

    dev_predictions = pipeline.predict(X_dev)
    dev_accuracy = accuracy_score(y_dev, dev_predictions)

    if dev_accuracy > best_dev_accuracy:
        best_dev_accuracy = dev_accuracy
        best_model = pipeline
        best_params = dict(zip(param_grid.keys(), rf_params))

print("Best parameters:", best_params)
print("Development set accuracy:", best_dev_accuracy)

dev_predictions = best_model.predict(X_dev)
accuracy_score(y_dev, dev_predictions)

Best parameters: {'n_estimators': 200, 'max_depth': 20, 'min_samples_split': 5}
Development set accuracy: 0.8316


0.8316

In [None]:
train_predictions = best_model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Training Accuracy: {train_accuracy}")

test_predictions = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test Accuracy: {test_accuracy}")

Training Accuracy: 0.86478
Test Accuracy: 0.8368


In [None]:
best_model

### Experiment

In [None]:
from google.colab import drive
import os
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline
from itertools import product

from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression


from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, classification_report

from pandarallel import pandarallel


In [None]:
# Initialize pandarallel
pandarallel.initialize(progress_bar=True)

INFO: Pandarallel will run on 4 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [None]:
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/')

testdata = pd.read_csv('test_data_sample_complete.csv')
traindata = pd.read_csv('train_data_sample_complete.csv')


Mounted at /content/drive


In [None]:
train_data = traindata.sample(n=100000, random_state=42)
test_data = testdata.sample(n=10000, random_state=42)

train_data['class_index'] = train_data['class_index'].map({1:0, 2:1})
test_data['class_index'] = test_data['class_index'].map({1:0, 2:1})

train_data['review_combined_lemma'] = train_data['review_combined_lemma'].fillna('')
test_data['review_combined_lemma'] = test_data['review_combined_lemma'].fillna('')

X_train = train_data.review_combined_lemma
y_train = train_data.class_index

X = test_data.review_combined_lemma
y = test_data.class_index

X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [None]:
param_grid = {
    'n_estimators': [ 200],
    'max_depth': [20],
    'min_samples_split': [5]
}

best_model = None
best_params = None
best_dev_accuracy = 0.0


for rf_params in product(*param_grid.values()):
    #print(rf_params)
    pipeline = Pipeline([
        ('bigram', TfidfVectorizer(ngram_range=(1, 2))),
        ('randomforestclassifier', RandomForestClassifier(random_state=42, **dict(zip(param_grid.keys(), rf_params))))
    ])

    pipeline.fit(X_train, y_train)

    dev_predictions = pipeline.predict(X_dev)
    dev_accuracy = accuracy_score(y_dev, dev_predictions)

    if dev_accuracy > best_dev_accuracy:
        best_dev_accuracy = dev_accuracy
        best_model = pipeline
        best_params = dict(zip(param_grid.keys(), rf_params))

print("Best parameters:", best_params)
print("Development set accuracy:", best_dev_accuracy)

dev_predictions = best_model.predict(X_dev)
accuracy_score(y_dev, dev_predictions)

Best parameters: {'n_estimators': 200, 'max_depth': 20, 'min_samples_split': 5}
Development set accuracy: 0.829


0.829

In [None]:
best_model

In [None]:
train_predictions = best_model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_predictions)
print(f"Training Accuracy: {train_accuracy}")

test_predictions = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, test_predictions)
print(f"Test Accuracy: {test_accuracy}")

Training Accuracy: 0.86982
Test Accuracy: 0.8374


In [None]:
import pickle

with open('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/best_model_traditional.pkl', 'wb') as f:
    pickle.dump(best_model, f)


### Streamlit

In [None]:
pip install streamlit

Collecting streamlit
  Downloading streamlit-1.33.0-py2.py3-none-any.whl (8.1 MB)
     8.1/8.1 MB 17.5 MB/s eta 0:00:00
Collecting gitpython!=3.1.19,<4,>=3.0.7 (from streamlit)
  Downloading GitPython-3.1.43-py3-none-any.whl (207 kB)
     207.3/207.3 kB 17.6 MB/s eta 0:00:00
Collecting pydeck<1,>=0.8.0b4 (from streamlit)
  Downloading pydeck-0.9.0b1-py2.py3-none-any.whl (5.8 MB)
     5.8/5.8 MB 39.7 MB/s eta 0:00:00
Collecting watchdog>=2.1.5 (from streamlit)
  Downloading watchdog-4.0.0-py3-none-manylinux2014_x86_64.whl (82 kB)
     83.0/83.0 kB 7.5 MB/s eta 0:00:00
Collecting gitdb<5,>=4.0.1 (from gitpython!=3.1.19,<4,>=3.0.7->streamlit)
  Downloading gitdb-4.

In [None]:
import pickle
from google.colab import drive
import os
drive.mount('/content/drive')
os.chdir('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/')

with open('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/best_model_traditional.pkl', 'rb') as f:
    loaded_model = pickle.load(f)


Mounted at /content/drive


In [None]:
import pandas as pd
review_title = "Greasy"
review_text = "I thought this was a very greasy lotion. I didn't care for it, but that's my opinion."
cols= ['review_title','review_text']
data = {'review_title': str(review_title),'review_text': str(review_text)}
#print(data)
df=pd.DataFrame([list(data.values())], columns=cols)

In [None]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text into words
    words = word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    # Join the words back into a single string
    text = ' '.join(words)
    return text

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [None]:
stop_words = set(stopwords.words('english')) - { 'not', 'no', 'couldn', "couldn't", "wouldn't", "shouldn't", "isn't",
                                                "aren't", "wasn't", "weren't", "don't", "doesn't", "hadn't", "hasn't",
                                                 "won't", "can't", "mightn't","needn't","nor","shouldn","should've","should",
                                                 "weren","wouldn","mustn't","mustn","didn't","didn","doesn","did","does","hadn",
                                                 "hasn","haven't","haven","needn","shan't"}

In [None]:
df['review_combined'] = df['review_title'] + " " + df['review_text']
df['review_combined_lemma'] = df['review_combined'].apply(preprocess)

In [None]:
df.values.tolist()

[['Greasy',
  "I thought this was a very greasy lotion. I didn't care for it, but that's my opinion.",
  "Greasy I thought this was a very greasy lotion. I didn't care for it, but that's my opinion.",
  'greasy thought greasy lotion didnt care thats opinion']]
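The output above shows `didn't` arriving as `didnt`. A minimal sketch of why: punctuation is stripped before the stop-word filter runs, so apostrophes are gone by the time tokens are compared against the negation words kept out of the stop-word list. The snippet uses `str.split` instead of NLTK's `word_tokenize` to stay self-contained, which is a simplification, and `kept_negations` is only a small subset of the notebook's set.

```python
import string

# Small subset of the negation words the notebook keeps out of the stop-word list.
kept_negations = {"not", "no", "didn't", "don't", "can't"}


def strip_punctuation(text):
    """Lowercase the text and delete every punctuation character."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


cleaned = strip_punctuation("I didn't care for it.")
tokens = cleaned.split()  # simplification: the notebook uses word_tokenize

# After stripping, the contraction no longer matches any apostrophe form.
matches = [token for token in tokens if token in kept_negations]
print(tokens, matches)
```

The negation still reaches the model as the single token `didnt`, so the signal is preserved; the apostrophe entries in the exclusion set just never get a chance to match.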

In [None]:
loaded_model.predict(df['review_combined_lemma'].values)

array([0])

In [None]:
%%writefile app.py
import numpy as np
import pandas as pd
import streamlit as st
import pickle
import warnings
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

model = pickle.load(open('/content/drive/My Drive/MSIS/IntroductiontoDeepLearning/Project/best_model_traditional.pkl', 'rb'))
cols= ['review_title','review_text']


def preprocess(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    # Tokenize text into words
    words = word_tokenize(text)
    # Remove stopwords
    words = [word for word in words if word not in stop_words]
    # Lemmatize words
    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word) for word in words]
    # Join the words back into a single string
    text = ' '.join(words)
    return text

stop_words = set(stopwords.words('english')) - { 'not', 'no', 'couldn', "couldn't", "wouldn't", "shouldn't", "isn't",
                                                "aren't", "wasn't", "weren't", "don't", "doesn't", "hadn't", "hasn't",
                                                 "won't", "can't", "mightn't","needn't","nor","shouldn","should've","should",
                                                 "weren","wouldn","mustn't","mustn","didn't","didn","doesn","did","does","hadn",
                                                 "hasn","haven't","haven","needn","shan't"}

def main():
    st.title("Sentiment Predictor")
    html_temp = """
    <div style="background:#025246 ;padding:10px">
    <h2 style="color:white;text-align:center;">Sentiment Prediction App </h2>
    </div>
    """
    st.markdown(html_temp, unsafe_allow_html = True)

    review_title = st.text_area('REVIEW TITLE')
    review_text = st.text_area('REVIEW TEXT')
    features = [[review_title,review_text]]

    data = {'review_title': str(review_title),'review_text': str(review_text)}

    df=pd.DataFrame([list(data.values())], columns=cols)

    df['review_combined'] = df['review_title'] + " " + df['review_text']
    df['review_combined_lemma'] = df['review_combined'].apply(preprocess)

    if st.button("Predict"):
        # model.predict returns an array; take its single element
        prediction = model.predict(df['review_combined_lemma'].values)[0]

        if prediction == 1:
            st.success('Positive!!')
        else:
            st.success('Negative!!')

if __name__=='__main__':
    main()

Overwriting app.py


In [None]:
! wget -q -O - ipv4.icanhazip.com

35.229.33.35


In [None]:
! streamlit run app.py & npx localtunnel --port 8501


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Network URL: http://172.28.0.12:8501
  External URL: http://35.229.33.35:8501

npx: installed 22 in 3.407s
your url is: https://poor-ties-throw.loca.lt
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!