**Step One:** Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


**Step Two:** Read the Amazon reviews from the CSV file into a pandas DataFrame as follows:

In [None]:
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/A Sentiment Analysis Example-Classifying Product/AmazonReviews.csv')

**Step Three:** Check the total number of reviews and look at the first few reviews loaded into the DataFrame

In [None]:
print('The number of reviews: ', len(df))
print(df[['title', 'rating']].head(10))

The number of reviews:  338
                                               title              rating
0  Recommend getting, but also get proof of Pytho...  5.0 out of 5 stars
1        Easy to Follow, Good Intro for Self Learner  5.0 out of 5 stars
2  Great for starting your journey on learning py...  5.0 out of 5 stars
3  Great inner content! Not that great outer qual...  4.0 out of 5 stars
4                                 Tips for beginners  5.0 out of 5 stars
5                                              Great  5.0 out of 5 stars
6  Can be used both as a reference and a teaching...  5.0 out of 5 stars
7  I have a whole bookshelf on Python books. Hand...  5.0 out of 5 stars
8    simple y con el detalle de aprendizaje adecuado  5.0 out of 5 stars
9                Very detailed and clearly explained  5.0 out of 5 stars


**Step Four:** **Data cleansing** to filter out reviews that aren't written in English

In [None]:
%pip install google_trans_new



**Step Five:** Check whether `google_trans_new` has fixed the known bug that raises a `JSONDecodeError` exception during language detection. If it has not, install the `langdetect` library instead and use it to detect the language of each review.

In [None]:
%pip install langdetect

Collecting langdetect
  Downloading langdetect-1.0.9.tar.gz (981 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: langdetect
  Building wheel for langdetect (setup.py) ... done
  Created wheel for langdetect: filename=langdetect-1.0.9-py3-none-any.whl size=993223 sha256=a03abc0c0876904005dde335db7464cf8904ac8f5ff9e83a65aaa5bc161c91e9
  Stored in directory: /root/.cache/pip/wheels/0a/f2/b2/e5ca405801e05eb7c8ed5b3b4bcf1fcabcd62

In [None]:
from langdetect import detect, DetectorFactory
import pandas as pd

# Set seed for reproducibility
DetectorFactory.seed = 0

# Function to detect language; returns 'unknown' if detection fails
# (for example, on empty or non-text titles)
def safe_detect(title):
    try:
        return detect(title)
    except Exception:
        return 'unknown'

# Apply the language detection to the 'title' column
df['lang'] = df['title'].apply(safe_detect)

# Display the rows with the new 'lang' column
display(df[['title', 'rating', 'lang']])

                                                 title              rating  lang
0    Recommend getting, but also get proof of Pytho...  5.0 out of 5 stars    en
1          Easy to Follow, Good Intro for Self Learner  5.0 out of 5 stars    en
2    Great for starting your journey on learning py...  5.0 out of 5 stars    en
3    Great inner content! Not that great outer qual...  4.0 out of 5 stars    en
4                                   Tips for beginners  5.0 out of 5 stars    no
...                                                ...                 ...   ...
333                                                Bad  1.0 out of 5 stars    so
334                    Just published for making money  1.0 out of 5 stars    en
335                                   Not genuine book  1.0 out of 5 stars    af
336                                            Not bad  1.0 out of 5 stars    id


**Step Six:** Filter the dataset, keeping only those reviews that are written in English

In [None]:
df = df[df['lang'] == 'en']

# This operation should reduce the total number of rows in the dataset.
# To verify that it worked, count the number of rows in the updated DataFrame:

print(len(df))

248
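
The drop from 338 to 248 rows is worth inspecting, because the detection output above tags some short English titles with other language codes (for example, "Bad" as `so`). The following is a minimal sketch of how the discarded rows could be listed. It uses a small hand-built DataFrame that copies a few rows from that output rather than the real file, and the names `sample` and `dropped` are illustrative only.

```python
import pandas as pd

# A few (title, lang) pairs copied from the detection output shown earlier.
sample = pd.DataFrame({
    'title': ['Tips for beginners', 'Bad', 'Just published for making money',
              'Not genuine book', 'Not bad'],
    'lang': ['no', 'so', 'en', 'af', 'id'],
})

# Rows that the English-only filter would discard.
dropped = sample[sample['lang'] != 'en']
print(dropped)

# Count the discarded rows per detected language code.
print(dropped['lang'].value_counts())
```

On the real data, a check like this would have to run before `df` is overwritten by the filter.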


**Step Seven:** Split the reviews into a training set for developing the model and a testing set for evaluating its accuracy. Also transform the natural language of the review titles into numerical data that the model can understand.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
reviews = df['title'].values
ratings = df['rating'].values
reviews_train, reviews_test, y_train, y_test = train_test_split(reviews, ratings, test_size=0.2, random_state=1000)
vectorizer = CountVectorizer()
vectorizer.fit(reviews_train)
x_train = vectorizer.transform(reviews_train)
x_test = vectorizer.transform(reviews_test)

An explanation of the structures in Step Seven:
*   **reviews_train** An array containing the review titles chosen for training
*   **reviews_test** An array containing the review titles chosen for testing
*   **y_train** An array containing the star ratings corresponding to the reviews in reviews_train
*   **y_test** An array containing the star ratings corresponding to the reviews in reviews_test
*   **x_train** A matrix containing the set of feature vectors for the review titles found in the reviews_train array
*   **x_test** A matrix containing the set of feature vectors for the review titles found in the reviews_test array

In [None]:
# Check the number of rows in the matrix generated from the reviews_train array
print(x_train.shape[0])

198


In [None]:
# Check the number of rows in the matrix generated from the reviews_test array
print(x_test.shape[0])

50
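
The 198/50 split follows from applying `test_size=0.2` to the 248 English reviews. The sketch below reproduces the sizes on placeholder data; the list `items` merely stands in for the review titles and is not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

# 248 placeholder items standing in for the 248 English review titles.
items = list(range(248))

train_items, test_items = train_test_split(items, test_size=0.2, random_state=1000)

# 20% of 248 is 49.6, which is rounded up to 50 for the test set;
# the remaining 198 items form the training set.
print(len(train_items), len(test_items))
```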


In [None]:
# Check the length of the feature vectors in the training matrix
print(x_train.shape[1])

413


This means that 413 unique words occur in the training set’s review titles. This collection of words is called the vocabulary dictionary of the dataset.

In [None]:
print(x_train.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
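
The printed matrix is mostly zeros because each title uses only a handful of the vocabulary words. A toy sketch with three made-up titles (not taken from the dataset) shows how `CountVectorizer` builds the vocabulary and the count vectors, assuming its default settings (lowercasing, tokens of two or more characters). The names `toy_titles`, `toy_vectorizer`, `toy_matrix`, and `vocab` are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up titles, used only for illustration.
toy_titles = ['Great book', 'Not bad', 'Great great intro']

toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

# vocabulary_ maps each word to a column index; list the words in column order.
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)

# One row per title, one column per vocabulary word, each cell a word count.
print(toy_matrix.toarray())
```

Here the word "great" appears twice in the third title, so its column holds a 2 in that row, while every word absent from a title contributes a 0.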


**Step Eight: Training the Model** using scikit-learn’s `LogisticRegression` classifier

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier.fit(x_train, y_train)

**Step Nine: Evaluating the Model** using the classifier's `predict()` method

In [None]:
import numpy as np
predicted = classifier.predict(x_test)
accuracy = np.mean(predicted == y_test)
print("Accuracy:", round(accuracy,2))

Accuracy: 0.46


**The Model's Confusion Matrix** - A grid that compares predicted
classifications with actual classifications. A confusion matrix can help reveal
the model’s accuracy within each individual class, as well as show whether the
model is likely to confuse two classes (mislabel one class as another).

In [48]:
from sklearn import metrics
import numpy as np

# Function to convert a rating string such as "5.0 out of 5 stars" to an integer
def convert_rating_to_int(rating_str):
    try:
        # Extract the numerical part before " out of"
        return int(float(rating_str.split(' out of')[0]))
    except (AttributeError, ValueError):
        return np.nan  # Non-string or unparseable rating

# Convert the string ratings to integers
y_test_int = np.array([convert_rating_to_int(rating) for rating in y_test])
predicted_int = np.array([convert_rating_to_int(rating) for rating in predicted])

# Remove NaN values that might have resulted from conversion errors
valid_indices = ~np.isnan(y_test_int) & ~np.isnan(predicted_int)
y_test_int = y_test_int[valid_indices].astype(int)
predicted_int = predicted_int[valid_indices].astype(int)

# Rows are the actual ratings, columns are the predicted ratings
print(metrics.confusion_matrix(y_test_int, predicted_int, labels=[1, 2, 3, 4, 5]))

[[ 3  1  0  2  5]
 [ 1  0  0  1  6]
 [ 0  0  1  2  2]
 [ 0  0  1  1  4]
 [ 0  1  0  1 18]]
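
In this matrix, rows are the actual ratings (1 to 5) and columns are the predicted ratings, so the diagonal holds the correct predictions. The following short sketch, with the values copied from the output above into a plain list named `cm` (an illustrative name), shows how overall accuracy and per-class recall could be read off it.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted).
cm = [
    [3, 1, 0, 2, 5],
    [1, 0, 0, 1, 6],
    [0, 0, 1, 2, 2],
    [0, 0, 1, 1, 4],
    [0, 1, 0, 1, 18],
]

# Correct predictions sit on the diagonal.
correct = sum(cm[i][i] for i in range(5))
total = sum(sum(row) for row in cm)
accuracy = correct / total
print("Accuracy:", round(accuracy, 2))

# Recall for each class: correct predictions divided by the row total.
recall = [cm[i][i] / sum(cm[i]) for i in range(5)]
for stars, value in zip(range(1, 6), recall):
    print(stars, "stars:", round(value, 2))
```

The bottom row shows that 18 of the 20 five-star reviews were classified correctly, while five of the 11 one-star reviews and six of the 8 two-star reviews were predicted as five stars.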


In [50]:
# get the number of entries for each rating group
print(df.groupby('rating').size())

rating
1.0 out of 5 stars     36
2.0 out of 5 stars     21
3.0 out of 5 stars     29
4.0 out of 5 stars     58
5.0 out of 5 stars    104
dtype: int64
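
These counts show that the classes are imbalanced, which gives useful context for the accuracy figure. As a rough sketch, with the counts copied from the output above into an illustrative dictionary named `counts`, the share of each rating and the accuracy of a baseline that always predicts the most common rating can be computed as follows. Note that these figures describe the whole filtered dataset, not the 50-review test set.

```python
# Review counts per rating, copied from the output above.
counts = {1: 36, 2: 21, 3: 29, 4: 58, 5: 104}
total = sum(counts.values())

# Share of the dataset that falls into each rating group.
for stars, n in counts.items():
    print(stars, "stars:", round(n / total, 2))

# Accuracy of a baseline that always predicts the most common rating.
baseline = max(counts.values()) / total
print("Majority-class baseline:", round(baseline, 2))
```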


**Classification Metrics** using the `classification_report()` function found in scikit-learn’s `metrics` module

In [54]:
print(metrics.classification_report(y_test_int, predicted_int, labels = [1,2,3,4,5]))

              precision    recall  f1-score   support

           1       0.75      0.27      0.40        11
           2       0.00      0.00      0.00         8
           3       0.50      0.20      0.29         5
           4       0.14      0.17      0.15         6
           5       0.51      0.90      0.65        20

    accuracy                           0.46        50
   macro avg       0.38      0.31      0.30        50
weighted avg       0.44      0.46      0.40        50
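
The `macro avg` and `weighted avg` rows differ because the first treats every class equally while the second weights each class by its support. The sketch below recomputes both F1 averages from the confusion matrix printed earlier, copied into a plain list named `cm`; all variable names are illustrative, and the formula used is F1 = 2·TP / (2·TP + FP + FN).

```python
# Confusion matrix copied from the earlier output (rows: actual, columns: predicted).
cm = [
    [3, 1, 0, 2, 5],
    [1, 0, 0, 1, 6],
    [0, 0, 1, 2, 2],
    [0, 0, 1, 1, 4],
    [0, 1, 0, 1, 18],
]
n = len(cm)

# Support: number of actual reviews per class (row totals, TP + FN).
support = [sum(row) for row in cm]
# Number of predictions per class (column totals, TP + FP).
predicted_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]

# Per-class F1: 2*TP / ((TP + FN) + (TP + FP)).
f1 = [2 * cm[k][k] / (support[k] + predicted_totals[k]) for k in range(n)]

# Macro average: plain mean over classes.
macro_f1 = sum(f1) / n
# Weighted average: mean weighted by each class's support.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / sum(support)

print([round(f, 2) for f in f1])
print("macro:", round(macro_f1, 2), "weighted:", round(weighted_f1, 2))
```

The weighted figure is higher than the macro figure because the five-star class, which the model handles best, also has the largest support.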

