<a href="https://colab.research.google.com/github/Que1Pereza2/Mr.CrabsAnalyzer/blob/main/CanYouFeelItNow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Imports block

In [31]:
import re
import nltk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from imblearn.under_sampling import RandomUnderSampler

This block reads the None2775.csv file and loads its contents into the reviews DataFrame.

In [32]:
reviews = pd.read_csv("None2775.csv")

This block converts the scores from strings to integers, undersamples the positive reviews so that both scores appear in equal numbers, and creates the features and labels arrays.

In [33]:
reviews['score'] = reviews['score'].str.replace('"', '').astype(int)

majority_class = reviews[reviews.score == 1]
minority_class = reviews[reviews.score == 0]

# Downsample majority class
majority_downsampled = majority_class.sample(n=len(minority_class), random_state=42)

# Combine minority class with downsampled majority class
balanced_df = pd.concat([majority_downsampled, minority_class])

# Shuffle the resulting DataFrame
balanced_df = balanced_df.sample(frac=1, random_state=42).reset_index(drop=True)


features = balanced_df.iloc[:,0].values
labels = balanced_df.iloc[:,1].values


                                                  review  score
7      "plot carol danvers becomes one of the univers...      0
10     "i've liked brie larson in other films  but sh...      0
31     "i was left with the general feeling  that  ca...      0
34     "i got bored after 30 min  the story line is b...      0
41     "this is a very controversial marvel film  whi...      0
...                                                  ...    ...
10047  "season 2 took a nosedive  extremely boring fi...      0
10048  "first season was ok  the plot was interesting...      0
10049  "i am going to be vague on the plot and focus ...      0
10050  "as great as the series is and krysten ritter ...      0
10051  "i only watched this show for kilgrave  with o...      0

[2774 rows x 2 columns]
                                                 review  score
0     "while it doesn't reinvent the superhero genre...      1
1     "i am upset at all the flack this movie has go...      1
2     "enjoyed thi
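The balancing step above can also be illustrated without pandas. The following is a minimal sketch on made-up toy data, not the notebook's own code: the `undersample` helper and the toy rows are hypothetical, and the seed of 42 only mirrors the `random_state` used above.

```python
import random
from collections import Counter


def undersample(rows, label_of, seed=42):
    """Randomly drop rows from larger classes until every class matches the smallest one."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    # The smallest class sets the target size for all classes.
    target = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    # Shuffle so the classes are not grouped together.
    rng.shuffle(balanced)
    return balanced


# Toy data (hypothetical): six positive reviews and two negative ones.
toy_rows = [("good %d" % i, 1) for i in range(6)] + [("bad %d" % i, 0) for i in range(2)]
balanced_rows = undersample(toy_rows, label_of=lambda row: row[1])
class_counts = Counter(label for _, label in balanced_rows)
print(class_counts)
```

After balancing, both classes in the toy data have two rows each, which is the same effect the pandas `sample` and `concat` calls aim for on the real reviews.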

This block cleans the review text with regular expressions and converts it into TF-IDF feature vectors for the classifier to train on.

In [34]:
processed_features = []

for review in features:
    # Replace every non-word character (punctuation, quotes) with a space
    processed_feature = re.sub(r'\W', ' ', str(review))

    # Remove single letters surrounded by whitespace
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

    # Remove a single letter at the start of the text
    # (the caret is left unescaped so it anchors to the start of the string)
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

    # Collapse runs of whitespace into a single space
    processed_feature = re.sub(r'\s+', ' ', processed_feature)

    # Remove a leading 'b' prefix
    processed_feature = re.sub(r'^b\s+', '', processed_feature)

    # Convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)

# Keep at most 2500 terms that occur in at least 7 reviews and in at most 80% of them
vectorizer = TfidfVectorizer(max_features=2500, min_df=7, max_df=0.8)
processed_features = vectorizer.fit_transform(processed_features).toarray()

{'10': 3.329803870767581, '100': 5.689547377601575, '11': 5.9324935562119645, '12': 5.478238283934369, '13': 5.036405531655329, '14': 6.676934031159461, '15': 5.5438355664201815, '16': 6.325536144321572, '1st': 5.65108109677378, '20': 5.214653763061648, '2008': 6.066024948836487, '2011': 6.788159666269685, '2012': 5.446985740430264, '2019': 6.4024971854577, '2021': 6.289168500150697, '2023': 6.84878428808612, '2d': 6.731001252429737, '2nd': 5.446985740430264, '30': 5.372877768276542, '3d': 5.121563339995636, '3rd': 6.187385805840755, '40': 6.09501248570974, '45': 6.443319179977956, '50': 6.187385805840755, '60': 6.576850572602479, '80': 6.530330556967585, '90': 5.771225408615843, '90s': 6.155637107526175, 'abilities': 5.65108109677378, 'ability': 5.670129291744474, 'able': 4.6797305877165964, 'about': 2.433715846210945, 'above': 5.29063967003957, 'absolute': 5.29063967003957, 'absolutely': 4.19202738137146, 'absurd': 6.220175628663746, 'accent': 5.729552712215274, 'accents': 5.68954737
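The values printed above appear to be the vectorizer's per-term idf weights. With scikit-learn's default smoothing, the weight of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below reproduces one such weight under two assumptions that are not stated in the notebook: that the balanced corpus holds 2 × 2774 reviews, and that the term occurs in 15 of them (a value back-calculated from the printed weight).

```python
import math


def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# Assumption: the balanced corpus holds 2 * 2774 reviews.
n_docs = 2 * 2774
# Assumption: the term occurs in 15 reviews (inferred, not shown in the notebook).
idf_value = smooth_idf(n_docs, 15)
print(round(idf_value, 4))
```

Under these assumptions the result comes out close to the 6.8488 listed for '2023', which suggests that rarer terms receive the larger weights in the dictionary.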

This block splits the data into training and test sets and fits a random forest classifier on the training data.

In [35]:
X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)

text_classifier = RandomForestClassifier(criterion="entropy",n_estimators=200, random_state=42)

text_classifier.fit(X_train, y_train)

predictions = text_classifier.predict(X_test)

[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 ...
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
[0 0 1 ... 1 0 1]
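The size of each split follows from `test_size=0.2`. The arithmetic is sketched below, assuming the balanced set contains 2 × 2774 rows (inferred from the row count printed earlier) and that a fractional test size is rounded up, which is how `train_test_split` treats a float `test_size`.

```python
import math

# Assumption: balanced set size inferred from the 2774 negative reviews.
n_samples = 2 * 2774
test_size = 0.2

# A fractional test size is rounded up; the training set gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

The resulting test size of 1110 agrees with the total support shown in the classification report.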



Random forest evaluation metrics

In [36]:
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))

[[489  85]
 [102 434]]
              precision    recall  f1-score   support

           0       0.83      0.85      0.84       574
           1       0.84      0.81      0.82       536

    accuracy                           0.83      1110
   macro avg       0.83      0.83      0.83      1110
weighted avg       0.83      0.83      0.83      1110

0.8315315315315316
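The reported figures can be checked by hand from the confusion matrix. The sketch below is a dependency-free recomputation using the matrix printed above; the helper names `precision` and `recall` are illustrative and not part of the notebook.

```python
# Confusion matrix from the output above: rows are true labels, columns are predictions.
matrix = [[489, 85],
          [102, 434]]

total = sum(sum(row) for row in matrix)
# Accuracy is the share of predictions on the diagonal.
accuracy = (matrix[0][0] + matrix[1][1]) / total


def precision(m, cls):
    """Correct predictions of cls divided by all predictions of cls (column sum)."""
    return m[cls][cls] / sum(row[cls] for row in m)


def recall(m, cls):
    """Correct predictions of cls divided by all true members of cls (row sum)."""
    return m[cls][cls] / sum(m[cls])


for cls in (0, 1):
    print(cls, round(precision(matrix, cls), 2), round(recall(matrix, cls), 2))
print(round(accuracy, 4))
```

The recomputed values match the classification report: 102 positive reviews were predicted as negative and 85 negative reviews as positive, out of 1110 test reviews.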


Using the classifier

In [37]:
reviewToPredict = input("Please provide a review!\n ")
# predict() returns an array, so compare its single element
if text_classifier.predict(vectorizer.transform([reviewToPredict]).toarray())[0] == 1:
    print("The review is positive")
else:
    print("The review is negative")

KeyboardInterrupt: Interrupted by user
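The prediction cell passes the raw input straight to the vectorizer, whereas the training reviews were cleaned first. One way to keep the two consistent is to wrap the cleaning steps in a function. The sketch below uses a hypothetical helper name, `clean_review`, and mirrors the regex steps from the cleaning cell, with the start-of-string anchor written as an unescaped `^`.

```python
import re


def clean_review(text):
    """Apply the regex cleaning steps used on the training reviews to one string."""
    text = re.sub(r'\W', ' ', str(text))         # non-word characters become spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)  # drop single letters between spaces
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)    # drop a single letter at the start
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    text = re.sub(r'^b\s+', '', text)            # drop a leading 'b' prefix
    return text.lower()


# Hypothetical sample input, not taken from the dataset.
cleaned = clean_review('"I LOVED it!!  A   great film."')
print(repr(cleaned))
```

The cleaned string could then be passed on as `vectorizer.transform([clean_review(reviewToPredict)])`, so that user input goes through the same steps as the training text.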