<a href="https://colab.research.google.com/github/Kevinlodaya/Sentiment-Analysis-using-linear-classifiers-and-unsupervised-clustering/blob/main/Linear_%26_Classification_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Import required packages

In [1]:
# Importing standard libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
import pandas as pd
import scipy
import math
import random

# Importing linear classification algorithms
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB

# Importing the clustering algorithms
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

# Importing preprocessing functions
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import PCA
from sklearn.decomposition import TruncatedSVD

# Importing metrics
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import f1_score


# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

## What does the dataset look like?
Let's use a standard dataset from Amazon that contains customer reviews and ratings. The original dataset has three features: name (the name of the product), review (the customer's review of the product), and rating (the customer's rating of the product, ranging from 1 to 5). The review column is the input, and the rating column is used to derive the sentiment of each review. Here are some important data preprocessing steps:
- The dataset has about 183,500 rows of data. There are 1147 null values; the rows containing them will be removed.
- As the dataset is large, some machine learning algorithms take a long time to run on it. The plan is to work with 30% of the data, which is still 54,000+ data points and should be representative of the whole dataset. Note that the split cell further down goes further and keeps only the first 5,000 training and 500 test reviews.
- If the rating is 1 or 2, the review is considered negative. If the rating is 3, 4, or 5, the review is considered positive. We add a new column named `sentiments` to the dataset that uses 1 for positive reviews and 0 for negative reviews.

We read and display the contents of the dataset below.
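The labeling rule and the 30% sample described above can be sketched on a toy table. This is only an illustration under stated assumptions: the `toy` DataFrame and its rows are hypothetical stand-ins for the Amazon data, and `DataFrame.sample` is one possible way to draw the sample, not necessarily what was used for this project.

```python
import pandas as pd

# Toy stand-in for the reviews table (hypothetical rows, not real data)
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(20)],
    "rating": [1, 2, 3, 4, 5] * 4,
})

# Ratings 1-2 become negative (0); ratings 3-5 become positive (1)
toy["sentiments"] = (toy["rating"] >= 3).astype(int)

# Draw a reproducible 30% random sample of the rows
sample = toy.sample(frac=0.3, random_state=42)

print(len(sample), "of", len(toy), "rows kept")
print(sample["sentiments"].value_counts())
```

A plain random sample preserves the class proportions only approximately; on a table this small the positive/negative ratio of the sample can drift noticeably from the full table.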

In [2]:
!wget  # Add your dataset link/URL here (the log below shows the archive that was downloaded)
!unzip  # Add the name of the downloaded archive here, e.g. amazon_reviews.zip
data = pd.read_csv('amazon_reviews.csv')
data.head()

--2023-10-03 19:55:31--  https://cdn.iisc.talentsprint.com/ADSMI/Datasets/amazon_reviews.zip
Resolving cdn.iisc.talentsprint.com (cdn.iisc.talentsprint.com)... 172.105.52.210
Connecting to cdn.iisc.talentsprint.com (cdn.iisc.talentsprint.com)|172.105.52.210|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 29949034 (29M) [application/zip]
Saving to: ‘amazon_reviews.zip.2’


2023-10-03 19:55:36 (8.02 MB/s) - ‘amazon_reviews.zip.2’ saved [29949034/29949034]

Archive:  amazon_reviews.zip
replace amazon_reviews.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: amazon_reviews.csv      


| | name | review | rating |
|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 |


# Exploratory Data Analysis and Preprocessing

In [3]:
# Check the number of rows and columns
num_rows, num_columns = data.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

# Summary of the dataset
data.info()

# Statistical description of the features
print(data.describe())

# Check for duplicate values
duplicates = data.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

# Show the top 5 and the last 5 rows of the data
print("Top 5 rows:")
print(data.head())
print("\nLast 5 rows:")
print(data.tail())
print('Unique Ratings:', sorted(list(data['rating'].unique())))

Number of rows: 183531
Number of columns: 3
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183531 entries, 0 to 183530
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   name    183213 non-null  object
 1   review  182702 non-null  object
 2   rating  183531 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 4.2+ MB
              rating
count  183531.000000
mean        4.120448
std         1.285017
min         1.000000
25%         4.000000
50%         5.000000
75%         5.000000
max         5.000000
Number of duplicate rows: 62
Top 5 rows:
                                                name  \
0                           Planetwise Flannel Wipes   
1                              Planetwise Wipe Pouch   
2                Annas Dream Full Quilt with 2 Shams   
3  Stop Pacifier Sucking without tears with Thumb...   
4  Stop Pacifier Sucking without tears with Thumb...   

                                            

In [4]:
# 2. Preprocessing
# Check for null values
null_values = data.isnull().sum()
print("Null values per column:")
print(null_values)

# Handle null values (if any)
data = data.dropna()  # Remove rows with null values

# Create a new column 'sentiments' based on the 'rating' column
data['sentiments'] = data['rating'].apply(lambda x: 1 if x >= 3 else 0)

# Display the first few rows after preprocessing
print("\nData after preprocessing:")
print(data.head())

Null values per column:
name      318
review    829
rating      0
dtype: int64

Data after preprocessing:
                                                name  \
0                           Planetwise Flannel Wipes   
1                              Planetwise Wipe Pouch   
2                Annas Dream Full Quilt with 2 Shams   
3  Stop Pacifier Sucking without tears with Thumb...   
4  Stop Pacifier Sucking without tears with Thumb...   

                                              review  rating  sentiments  
0  These flannel wipes are OK, but in my opinion ...       3           1  
1  it came early and was not disappointed. i love...       5           1  
2  Very soft and comfortable and warmer than it l...       5           1  
3  This is a product well worth the purchase.  I ...       5           1  
4  All of my kids have cried non-stop when I trie...       5           1  


# Split train and test data

In [5]:
# Split the data into features (reviews) and target (sentiments)
X = data['review']
y = data['sentiments']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train = X_train[:5000]
y_train = y_train[:5000]
X_test = X_test[:500]
y_test = y_test[:500]
print(X_train.shape,X_test.shape,y_train.shape,y_test.shape)

# Tokenize and vectorize the text data with TF-IDF (TfidfVectorizer ships with scikit-learn)
# The vectorizer is fit on the training reviews only and then reused for the test reviews
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # You can adjust max_features as needed
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
print(X_train_tfidf.shape,X_test_tfidf.shape)
print('Training value')
print(y_train.value_counts())
print('Test value')
print(y_test.value_counts())

(5000,) (500,) (5000,) (500,)
(5000, 5000) (500, 5000)
Training value
1    4276
0     724
Name: sentiments, dtype: int64
Test value
1    415
0     85
Name: sentiments, dtype: int64
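To make the TF-IDF step above more concrete, here is a minimal sketch on a toy corpus. The example sentences are hypothetical; only `TfidfVectorizer` and its `fit_transform`/`transform` calls mirror the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the review text
train_docs = [
    "great product, love it",
    "terrible product, broke fast",
    "love the soft fabric",
]
test_docs = ["love this great gadget"]

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_test = vectorizer.transform(test_docs)        # reuses them; unseen words are ignored

print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
print(X_train.shape, X_test.shape)     # one column per vocabulary term
print(X_test.nnz)                      # number of known terms found in the test sentence
```

Words that appear only in the test sentence ("this", "gadget") have no column, which is why the vectorizer must be fit on the training data and merely applied to the test data.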


# Implementation using K-Nearest Neighbor (KNN) Classifier

In [6]:
# Initialize the KNN classifier
k = 5
knn_classifier = KNeighborsClassifier(n_neighbors=k)

# Train the KNN classifier on the training data
knn_classifier.fit(X_train_tfidf, y_train)

# Make predictions on the test data
y_pred = knn_classifier.predict(X_test_tfidf)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Generate a classification report
report = classification_report(y_test, y_pred)
print("Classification Report:\n", report)

#Generate a classification report for Train data
y_train_pred = knn_classifier.predict(X_train_tfidf)
accuracy_train = accuracy_score(y_train, y_train_pred)
report_train = classification_report(y_train, y_train_pred)

print(f"Accuracy on Training Data: {accuracy_train}")
print("Classification Report:\n", report_train)

Accuracy: 0.83
Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.02      0.04        85
           1       0.83      1.00      0.91       415

    accuracy                           0.83       500
   macro avg       0.67      0.51      0.48       500
weighted avg       0.78      0.83      0.76       500

Accuracy on Training Data: 0.8656
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.07      0.13       724
           1       0.86      1.00      0.93      4276

    accuracy                           0.87      5000
   macro avg       0.93      0.54      0.53      5000
weighted avg       0.88      0.87      0.81      5000
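One way to read the KNN test accuracy above is to compare it with a majority-class baseline, that is, a classifier that always predicts class 1. The sketch below only re-uses the class counts and the accuracy printed in the outputs above; the variable names are illustrative.

```python
# Class counts copied from the test-split output above
n_positive, n_negative = 415, 85
n_total = n_positive + n_negative

# Accuracy of a classifier that always predicts the majority class (1)
baseline_accuracy = n_positive / n_total
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")

# KNN test accuracy copied from the output above
knn_accuracy = 0.83
print(f"KNN gain over baseline: {knn_accuracy - baseline_accuracy:+.2f}")
```

With such imbalanced classes, accuracy alone says little; the recall of 0.02 for class 0 in the report shows that almost no negative reviews are detected.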



# Implementation using Support Vector Machines (SVM) Classifier:

**Implementation using Support Vector Machines (SVM) Classifier**:  (3 points)
  - First reduce the features using PCA
  - Use a hard-margin classifier
  - Use a soft-margin classifier
  - Use a kernel SVM classifier



Background:
The next classifier we look into is the support vector machine (SVM).

![wget](https://cdn.talentsprint.com/aiml/aiml_2020_b14_hyd/experiment_details_backup/linear_data.png)

While other classifiers such as the perceptron and logistic regression use a similar concept of finding a boundary between two classes with a straight line, SVMs aim to maximize the margin around this boundary. Therefore, the SVM not only tries to find a boundary, it tries to find the best boundary that separates the two classes. With simple tricks, two-class classification can be extended to multiclass classification. The formal formulation of an SVM is:

$g(x) = w^Tx + b$ is the equation of the line we want to find, with weights $w$ and a bias $b$.

As seen from the figure, $g(x) = k$ and $g(x) = -k$ are the two worst lines for classification, since each sits right at the boundary of one of the classes. We need to maximize the distance of the separating line from both classes.

Therefore,

Maximize $k$ such that:

$w^Tx_i + b \geq k \: \text{for} \: d_i = 1$

$w^Tx_i + b \leq -k \: \text{for} \: d_i = -1$

Rescaling $w$ and $b$ so that $k = 1$, the constraints become $d_i(w^Tx_i + b) \geq 1$, and maximizing the margin is equivalent to minimizing $||w||$.

Introducing Lagrange multipliers $\alpha_i \geq 0$ for these constraints gives the final objective:

$J(w, b, \alpha) = \frac{1}{2}w^Tw - \sum_{i=1}^{N}\alpha_i d_i(w^Tx_i + b) + \sum_{i=1}^{N}\alpha_i$

which is minimized with respect to $w$ and $b$ and maximized with respect to $\alpha$.
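The quantities in the formulation above can be inspected on a fitted model. The following is a minimal sketch on four hypothetical 2-D points, assuming that a very large `C` in sklearn's `SVC` is an acceptable approximation of the hard-margin problem; it reads $w$ and $b$ from the fitted classifier and checks the constraint $d_i(w^Tx_i + b) \geq 1$.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable toy set (hypothetical points), labels d_i in {-1, +1}
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
d = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin problem
clf = SVC(kernel="linear", C=1e6).fit(X, d)

w = clf.coef_[0]        # weight vector w
b = clf.intercept_[0]   # bias b
g = X @ w + b           # g(x) = w^T x + b for every training point

print("w =", w, "b =", b)
print("d_i * g(x_i) =", d * g)  # each value should be >= 1 up to solver tolerance
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
```

For these points the separating line sits halfway between the two columns of points, so the margin width should come out close to the horizontal gap between them.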

There are multiple types of SVM. We first use the standard linear SVM and check the performance of the model. However, an SVM is not practical to apply directly to this dataset.

The data is too large, and the standard SVM implementation in `sklearn` takes a long time to run on it. Therefore, we first apply PCA-based dimensionality reduction to the input data, then train different types of SVM and compare their performance. Because dimensionality reduction discards some information, a slight drop in performance is expected, but the reduction in SVM training time makes the trade-off worthwhile.

Principal component analysis finds a list of the principal axes in the data and uses those axes to describe the dataset. Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components, resulting in a lower-dimensional projection of the data that preserves the maximal data variance.


**Hints**
- Define the PCA model using sklearn's **TruncatedSVD**
- Fit the training data using **model.fit**
- Reduce the dimensions of the training data using **model.transform**
- Reduce the dimensions of the testing data using **model.transform**


- Use sklearn's **svm.SVC**. Appropriately choose the arguments - *kernel*, *gamma*, and *C* for hard-margin, soft-margin and kernel SVM classifiers.
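The first four hints can be sketched as follows. This is an illustration on a hypothetical toy corpus with `n_components=2`; note that `TruncatedSVD` works directly on the sparse TF-IDF matrix and does not center the data, so it is closely related to PCA but not identical to it.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the review text
train_docs = [
    "love this product",
    "great quality and soft fabric",
    "works great, love it",
    "complete waste of money",
    "broke after one week",
    "terrible quality, do not buy",
]
test_docs = ["great product", "waste of money"]

tfidf = TfidfVectorizer()
X_train_sparse = tfidf.fit_transform(train_docs)
X_test_sparse = tfidf.transform(test_docs)

# Define the model, fit it on the training data, then transform both splits
svd = TruncatedSVD(n_components=2, random_state=42)
svd.fit(X_train_sparse)
X_train_red = svd.transform(X_train_sparse)
X_test_red = svd.transform(X_test_sparse)

print(X_train_red.shape, X_test_red.shape)
print("explained variance ratio:", svd.explained_variance_ratio_.sum())
```

Because the input stays sparse, this route avoids the `.toarray()` call that a dense `PCA` needs, which can matter for memory on larger vocabularies.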



In [7]:
# Reduce the number of PCA components
n_components = 100
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train_tfidf.toarray())
X_test_pca = pca.transform(X_test_tfidf.toarray())

# Optimize SVM parameters
# 1. Hard-Margin Classifier
hard_margin_classifier = SVC(kernel='linear', C=10.0)  # Larger C penalizes margin violations more; C=10.0 only approximates a hard margin

# Train the hard-margin SVM classifier on the training data
hard_margin_classifier.fit(X_train_pca, y_train)

# Make predictions on the test data
y_pred_hard_margin = hard_margin_classifier.predict(X_test_pca)

# Evaluate the hard-margin SVM classifier
accuracy_hard_margin = accuracy_score(y_test, y_pred_hard_margin)
print("Hard-Margin SVM Classifier:")
print(f"Accuracy: {accuracy_hard_margin}")
report_hard_margin = classification_report(y_test, y_pred_hard_margin)
print("Classification Report:\n", report_hard_margin)

# 2. Soft-Margin Classifier
soft_margin_classifier = SVC(kernel='linear', C=1.0)  # Smaller C tolerates more margin violations (softer margin)

# Train the soft-margin SVM classifier on the training data
soft_margin_classifier.fit(X_train_pca, y_train)

# Make predictions on the test data
y_pred_soft_margin = soft_margin_classifier.predict(X_test_pca)

# Evaluate the soft-margin SVM classifier
accuracy_soft_margin = accuracy_score(y_test, y_pred_soft_margin)
print("\nSoft-Margin SVM Classifier:")
print(f"Accuracy: {accuracy_soft_margin}")
report_soft_margin = classification_report(y_test, y_pred_soft_margin)
print("Classification Report:\n", report_soft_margin)

# 3. Kernel SVM Classifier (RBF Kernel)
kernel_svm_classifier = SVC(kernel='rbf', C=1.0)  # RBF kernel with the default gamma; adjust kernel, gamma, and C as needed

# Train the kernel SVM classifier on the training data
kernel_svm_classifier.fit(X_train_pca, y_train)

# Make predictions on the test data
y_pred_kernel_svm = kernel_svm_classifier.predict(X_test_pca)

# Evaluate the kernel SVM classifier
accuracy_kernel_svm = accuracy_score(y_test, y_pred_kernel_svm)
print("\nKernel SVM Classifier:")
print(f"Accuracy: {accuracy_kernel_svm}")
report_kernel_svm = classification_report(y_test, y_pred_kernel_svm)
print("Classification Report:\n", report_kernel_svm)



Hard-Margin SVM Classifier:
Accuracy: 0.85
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.14      0.24        85
           1       0.85      1.00      0.92       415

    accuracy                           0.85       500
   macro avg       0.85      0.57      0.58       500
weighted avg       0.85      0.85      0.80       500


Soft-Margin SVM Classifier:
Accuracy: 0.83
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00        85
           1       0.83      1.00      0.91       415

    accuracy                           0.83       500
   macro avg       0.41      0.50      0.45       500
weighted avg       0.69      0.83      0.75       500


Kernel SVM Classifier:
Accuracy: 0.848
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.13      0.22        85
           1       0.85      1.00     

In [8]:
# Print the accuracy for each classifier on the training data
train_accuracy_hard_margin = hard_margin_classifier.score(X_train_pca, y_train)
train_accuracy_soft_margin = soft_margin_classifier.score(X_train_pca, y_train)
train_accuracy_kernel_svm = kernel_svm_classifier.score(X_train_pca, y_train)

print("\nTraining Data Accuracy:")
print(f"Hard-Margin SVM Classifier: {train_accuracy_hard_margin}")
print(f"Soft-Margin SVM Classifier: {train_accuracy_soft_margin}")
print(f"Kernel SVM Classifier: {train_accuracy_kernel_svm}")

# Print the accuracy for each classifier on the test data
print("\nTest Data Accuracy:")
print(f"Hard-Margin SVM Classifier: {accuracy_hard_margin}")
print(f"Soft-Margin SVM Classifier: {accuracy_soft_margin}")
print(f"Kernel SVM Classifier: {accuracy_kernel_svm}")


Training Data Accuracy:
Hard-Margin SVM Classifier: 0.8826
Soft-Margin SVM Classifier: 0.8552
Kernel SVM Classifier: 0.9176

Test Data Accuracy:
Hard-Margin SVM Classifier: 0.85
Soft-Margin SVM Classifier: 0.83
Kernel SVM Classifier: 0.848
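The difference between the "hard" and "soft" settings above comes down to the value of `C`. The sketch below, on hypothetical overlapping 2-D blobs rather than the review data, shows how the number of support vectors typically changes with `C`; the specific values of `C` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping toy clusters (hypothetical data, not the review features)
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=42)

n_support = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    n_support[C] = int(clf.n_support_.sum())  # total support vectors over both classes
    print(f"C={C:>6}: support vectors={n_support[C]:3d}, "
          f"train accuracy={clf.score(X, y):.3f}")
```

A small `C` allows many points inside the margin, so many of them become support vectors; a large `C` narrows the margin and leaves fewer.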


# Implementation using Decision Trees

   **Implementation using Decision Trees**:  (1 point)

Decision trees are supervised machine learning algorithms that can perform classification, regression, and even multioutput tasks, and they can handle complex datasets. As the name suggests, a decision tree uses a tree-like model of decisions to classify or predict. It progressively divides the dataset into smaller groups based on a descriptive feature until it reaches sets that are small enough to be described by a single label.

The most important part of a decision tree is its explainability!

Decision trees also have many real-world applications. For example:

1. In the healthcare sector: to develop clinical decision analysis tools that allow decision-makers to apply evidence-based medicine and make objective clinical decisions when faced with complex situations.
2. Virtual assistants (chatbots): to develop chatbots that provide information and assistance to customers in any required domain.
3. Retail and marketing: sentiment analysis detects the pulse of customer feedback and emotions, allows organizations to learn about customer choices, and drives decisions.

**Hint**
Use sklearn's **DecisionTreeClassifier** function
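The explainability mentioned above can be made tangible by printing the learned rules. The sketch below fits a small tree on a hypothetical toy corpus and uses sklearn's `export_text` to show the splits; the sentences, labels, and `max_depth=3` are assumptions made only for this illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical mini-corpus: 1 = positive, 0 = negative
docs = ["love it", "great quality", "love the quality",
        "waste of money", "broke quickly", "terrible waste"]
labels = [1, 1, 1, 0, 0, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)

# A shallow tree keeps the printed rules short
tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, labels)

# Vocabulary terms ordered by column index, so rules show words instead of feature numbers
feature_names = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
rules = export_text(tree_clf, feature_names=feature_names)
print(rules)
```

Each line of the printout is a threshold on the TF-IDF weight of one word, which makes it possible to see which words drive a prediction.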

In [9]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# Initialize the Decision Tree classifier
decision_tree_classifier = DecisionTreeClassifier(random_state=42)

# Train the Decision Tree classifier on the training data
decision_tree_classifier.fit(X_train_tfidf, y_train)

# Make predictions on the training data
y_pred_train_decision_tree = decision_tree_classifier.predict(X_train_tfidf)

# Calculate accuracy on the training data
accuracy_train_decision_tree = accuracy_score(y_train, y_pred_train_decision_tree)
print("Decision Tree Classifier (Training Data):")
print(f"Accuracy: {accuracy_train_decision_tree}")

# Make predictions on the test data
y_pred_test_decision_tree = decision_tree_classifier.predict(X_test_tfidf)

# Calculate accuracy on the test data
accuracy_test_decision_tree = accuracy_score(y_test, y_pred_test_decision_tree)
print("\nDecision Tree Classifier (Testing Data):")
print(f"Accuracy: {accuracy_test_decision_tree}")

# Generate a classification report for the test data
report_decision_tree = classification_report(y_test, y_pred_test_decision_tree)
print("\nClassification Report (Testing Data):\n", report_decision_tree)

Decision Tree Classifier (Training Data):
Accuracy: 1.0

Decision Tree Classifier (Testing Data):
Accuracy: 0.81

Classification Report (Testing Data):
               precision    recall  f1-score   support

           0       0.44      0.41      0.42        85
           1       0.88      0.89      0.89       415

    accuracy                           0.81       500
   macro avg       0.66      0.65      0.66       500
weighted avg       0.81      0.81      0.81       500



# Implementation using Ensemble Classifier:

In [10]:
# Reduce the number of PCA components
n_components = 100
pca = PCA(n_components=n_components)
X_train_pca = pca.fit_transform(X_train_tfidf.toarray())
X_test_pca = pca.transform(X_test_tfidf.toarray())

# Instantiate a MinMaxScaler
scaler = MinMaxScaler()

# Scale the PCA features to [0, 1]; MultinomialNB below requires non-negative inputs
X_train_pca_scaled = scaler.fit_transform(X_train_pca)
X_test_pca_scaled = scaler.transform(X_test_pca)

# Define individual classifiers
logistic_classifier = LogisticRegression()
knn_classifier = KNeighborsClassifier()
svm_classifier = SVC(kernel='linear', C=1.0, probability=True)
naive_bayes_classifier = MultinomialNB()

# Create a Voting Classifier with 'soft' voting
ensemble_classifier = VotingClassifier(estimators=[
    ('logistic', logistic_classifier),
    ('knn', knn_classifier),
    ('svm', svm_classifier),
    ('naive_bayes', naive_bayes_classifier)
], voting='soft')

# Train the ensemble classifier on the training data
ensemble_classifier.fit(X_train_pca_scaled, y_train)

# Make predictions on the test data
y_pred_ensemble = ensemble_classifier.predict(X_test_pca_scaled)

# Evaluate the ensemble classifier
accuracy_ensemble = accuracy_score(y_test, y_pred_ensemble)
print("Ensemble Classifier:")
print(f"Accuracy: {accuracy_ensemble}")
report_ensemble = classification_report(y_test, y_pred_ensemble)
print("Classification Report:\n", report_ensemble)

Ensemble Classifier:
Accuracy: 0.834
Classification Report:
               precision    recall  f1-score   support

           0       0.67      0.05      0.09        85
           1       0.84      1.00      0.91       415

    accuracy                           0.83       500
   macro avg       0.75      0.52      0.50       500
weighted avg       0.81      0.83      0.77       500
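Soft voting, as used in the cell above, averages the class probabilities of the individual classifiers and picks the class with the highest average. The sketch below reproduces that by hand on hypothetical synthetic data; `GaussianNB` is swapped in for `MultinomialNB` here only because the toy features can be negative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (hypothetical, not the review features)
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("logistic", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier()),
        ("nb", GaussianNB()),
    ],
    voting="soft",
).fit(X, y)

# Soft voting by hand: average the per-model class probabilities, then take the argmax
probas = np.stack([est.predict_proba(X) for est in ensemble.estimators_])
avg_proba = probas.mean(axis=0)
manual_pred = ensemble.classes_[avg_proba.argmax(axis=1)]

print("matches VotingClassifier.predict:",
      np.array_equal(manual_pred, ensemble.predict(X)))
```

Because probabilities are averaged, a single very confident model can outweigh two mildly disagreeing ones, which is the main difference from majority ("hard") voting.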



In [11]:
#Evaluate the ensemble classifier on the training data
train_accuracy_ensemble = ensemble_classifier.score(X_train_pca_scaled, y_train)
train_report_ensemble = classification_report(y_train, ensemble_classifier.predict(X_train_pca_scaled))

# Print the results
print("Ensemble Classifier - Training Data:")
print(f"Accuracy: {train_accuracy_ensemble}")
print("Classification Report:\n", train_report_ensemble)


Ensemble Classifier - Training Data:
Accuracy: 0.8764
Classification Report:
               precision    recall  f1-score   support

           0       0.88      0.17      0.28       724
           1       0.88      1.00      0.93      4276

    accuracy                           0.88      5000
   macro avg       0.88      0.58      0.61      5000
weighted avg       0.88      0.88      0.84      5000



# Implementation using Clustering

In [12]:
# Define the number of clusters
n_clusters = 2

# Without PCA
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
kmeans.fit(X_train_tfidf)  # Assuming X_train_tfidf is your TF-IDF feature matrix

# Make predictions on test data
test_predictions = kmeans.predict(X_test_tfidf)

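# A helper function that maps each cluster to the most common true label among its members
def label(n_clusters, real_labels, labels):
    real_labels = np.asarray(real_labels)  # drop the pandas index so boolean masks align by position
    permutation = []
    for i in range(n_clusters):
        idx = labels == i
        # Choose the most common label among data points in the cluster
        values, counts = np.unique(real_labels[idx], return_counts=True)
        permutation.append(int(values[counts.argmax()]))
    return permutation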

# Use the label function to map cluster labels to binary labels (0 or 1)
cluster_labels = label(n_clusters, y_test, test_predictions)

# Print cluster labels
print("Cluster Labels without PCA:", cluster_labels)

# With PCA (assuming X_train_pca and X_test_pca are PCA-transformed data)
kmeans_pca = KMeans(n_clusters=n_clusters, random_state=42)
kmeans_pca.fit(X_train_pca)  # Assuming X_train_pca is PCA-transformed training data

# Make predictions on PCA-transformed test data
test_predictions_pca = kmeans_pca.predict(X_test_pca)

# Use the label function to map cluster labels to binary labels (0 or 1)
cluster_labels_pca = label(n_clusters, y_test, test_predictions_pca)

# Print cluster labels with PCA
print("Cluster Labels with PCA:", cluster_labels_pca)

Cluster Labels without PCA: [1, 1]
Cluster Labels with PCA: [1, 1]
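Once each cluster has been mapped to a label, the mapping can be applied to the cluster assignments to obtain class predictions and an accuracy score. The sketch below does this on two hypothetical, well-separated blobs; the dictionary `cluster_to_label` plays the role of the list returned by `label` above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score

# Two well-separated toy clusters (hypothetical data)
X, y_true = make_blobs(n_samples=200, centers=[[0, 0], [6, 6]],
                       cluster_std=1.0, random_state=42)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
clusters = kmeans.predict(X)

# Map every cluster to the most common true label among its members
cluster_to_label = {}
for c in range(2):
    values, counts = np.unique(y_true[clusters == c], return_counts=True)
    cluster_to_label[c] = int(values[counts.argmax()])

# Translate cluster ids into class predictions and score them
y_pred = np.array([cluster_to_label[c] for c in clusters])
accuracy = accuracy_score(y_true, y_pred)
print(cluster_to_label, accuracy)
```

When both clusters map to the same label, as in the `[1, 1]` output above, this procedure predicts class 1 for every review and cannot do better than the majority-class baseline.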


In [13]:
# Make predictions on PCA-transformed train data
train_predictions_pca = kmeans_pca.predict(X_train_pca)

# Use the label function to map cluster labels to binary labels (0 or 1)
cluster_labels_pca_train = label(n_clusters, y_train, train_predictions_pca)

# Print cluster labels with PCA - Train data
print("Cluster Labels with PCA:", cluster_labels_pca_train)

Cluster Labels with PCA: [1, 1]


# Testing my own sentences using the best trained model
- Ensemble classifier, chosen based on precision and recall

In [14]:
def sentiment_label(predicted_sentiment):
    if predicted_sentiment[0] == 0:
        return "Negative sentiment"
    else:
        return "Positive sentiment"

# Input your own sentence
positive_sentence = "This product exceeded my expectations. I love it!"

# Vectorize your sentence using the same TF-IDF vectorizer
positive_sentence_tfidf = tfidf_vectorizer.transform([positive_sentence])
positive_sentence_pca = pca.transform(positive_sentence_tfidf.toarray())
positive_sentence_pca_scaled = scaler.transform(positive_sentence_pca)

# Predict the sentiment of your sentence using the best trained model
predicted_sentiment = sentiment_label(ensemble_classifier.predict(positive_sentence_pca_scaled))
print(f"{positive_sentence} --> {predicted_sentiment}")

This product exceeded my expectations. I love it! --> Positive sentiment


In [15]:
# Input your own sentence
negative_sentence = "This product is an absolute disaster. It's a complete waste of money, and I regret buying it. I wouldn't wish this on my worst enemy."

# Vectorize your sentence using the same TF-IDF vectorizer
negative_sentence_tfidf = tfidf_vectorizer.transform([negative_sentence])
negative_sentence_pca = pca.transform(negative_sentence_tfidf.toarray())
negative_sentence_scaled = scaler.transform(negative_sentence_pca)

# Predict the sentiment of your sentence using the best trained model
predicted_sentiment = sentiment_label(ensemble_classifier.predict(negative_sentence_scaled))
print(f"{negative_sentence} --> {predicted_sentiment}")

This product is an absolute disaster. It's a complete waste of money, and I regret buying it. I wouldn't wish this on my worst enemy. --> Positive sentiment
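The vectorize, reduce, scale, and predict steps repeated in the two cells above can be bundled so that a raw sentence goes in and a label comes out. The following is a sketch under stated assumptions: it uses sklearn's `make_pipeline` on a hypothetical toy corpus, and swaps in `TruncatedSVD` and `LogisticRegression` for the PCA step and the ensemble to stay small; `predict_sentiment` is an illustrative helper name.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Hypothetical mini training set: 1 = positive, 0 = negative
train_reviews = [
    "I love it", "great product", "works great, love the quality",
    "complete waste of money", "terrible, broke quickly", "regret buying it",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Every step is fit on the training text once and then reused for new sentences
model = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
    MinMaxScaler(),
    LogisticRegression(),
)
model.fit(train_reviews, train_labels)

def predict_sentiment(sentence):
    """Return a readable sentiment label for one raw sentence."""
    predicted = model.predict([sentence])[0]
    return "Positive sentiment" if predicted == 1 else "Negative sentiment"

for sentence in ["I love this product", "A complete waste of money"]:
    print(f"{sentence} --> {predict_sentiment(sentence)}")
```

Keeping the transformers and the classifier in one object makes it harder to forget a step, or to apply a step fitted on different data, when scoring new text.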
