The project aims to analyze user reviews of Amazon Alexa products, classify them as positive or negative based on user feedback, and predict the sentiment of reviews using machine learning models. The dataset used consists of user ratings, feedback labels, product variations, and textual reviews.
The dataset used in this project is the publicly available Amazon Alexa Reviews Dataset.
The objective is to preprocess the text data, build a model pipeline, and evaluate the performance of different classifiers (Random Forest, XGBoost, and Decision Tree) for sentiment prediction.
- The dataset was loaded and examined for missing values.
- A new column, `length`, was created to represent the length of each review.
- The dataset was analyzed to understand the distribution of ratings, feedback, and product variations.
- Text reviews were cleaned by removing non-alphabetic characters, converting text to lowercase, and splitting them into words.
- The stopwords were removed, and the remaining words were stemmed using the Porter Stemmer.
- The corpus was built for use in the model.
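The preprocessing described above can be reproduced with a short script. The sketch below assumes the reviews live in a DataFrame column named `verified_reviews` and that the data file is a tab-separated `amazon_alexa.tsv` export; both names are assumptions, not part of the original write-up.

```python
import re
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download('stopwords')  # one-time download of the stopword list

df = pd.read_csv('amazon_alexa.tsv', sep='\t')      # file name and format are assumptions
df['length'] = df['verified_reviews'].str.len()     # review length in characters

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

corpus = []
for review in df['verified_reviews'].fillna(''):
    # keep only alphabetic characters, lowercase, and tokenize on whitespace
    words = re.sub('[^a-zA-Z]', ' ', review).lower().split()
    # drop stopwords and stem the remaining words with the Porter Stemmer
    words = [stemmer.stem(w) for w in words if w not in stop_words]
    corpus.append(' '.join(words))
```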
- A `CountVectorizer` was applied to convert the text data into a bag-of-words representation.
- A total of 2500 features were selected using the `CountVectorizer`.
- The dataset was split into a training and testing set (80-20 split).
- Features were scaled using `MinMaxScaler` to ensure that the values were between 0 and 1.
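A sketch of the vectorization, split, and scaling steps, assuming the `corpus` and `df` objects from the preprocessing sketch above; the `random_state` value is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

cv = CountVectorizer(max_features=2500)      # 2500-term bag-of-words representation
X = cv.fit_transform(corpus).toarray()
y = df['feedback'].values                    # binary sentiment labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)    # 80-20 split

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)      # fit the scaler on training data only
X_test = scaler.transform(X_test)            # scale test features into the same [0, 1] range
```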
Multiple classifiers were trained and evaluated, including:
- Random Forest Classifier
- XGBoost Classifier
- Decision Tree Classifier
Each model's performance was evaluated based on training and testing accuracy, confusion matrix, and cross-validation scores.
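The training and evaluation loop can be sketched as follows. The models use library defaults here (the tuned hyperparameters are reported later), and the 10-fold cross-validation count is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score

models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name)
    print('  train accuracy:', accuracy_score(y_train, model.predict(X_train)))
    print('  test accuracy: ', accuracy_score(y_test, y_pred))
    print('  CV mean:       ', cross_val_score(model, X_train, y_train, cv=10).mean())
    print(confusion_matrix(y_test, y_pred))
```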
- Rating Distribution: The majority of reviews have a rating of 5 stars (72.57%), followed by 4 stars (14.44%) and lower ratings.

Inference: Most users rate the product highly (5 stars), suggesting general satisfaction with the product. This high concentration of 5-star reviews may lead to an imbalanced dataset that could affect model performance, potentially biasing it toward predicting positive sentiments.
- Feedback Distribution: A significant portion (91.84%) of feedback labels is positive (feedback = 1), with a smaller portion being negative (feedback = 0).

Inference: The feedback data is imbalanced, with most reviews being positive. This imbalance is important to consider when building models, as it might result in the model being more likely to predict positive feedback due to the higher prevalence of positive reviews in the dataset.
- Variation Distribution: The product variation with the highest count is 'Black Dot' (16.38% of the dataset), followed by 'Charcoal Fabric' (13.65%).

Inference: Variations like 'Black Dot' and 'Charcoal Fabric' are more popular, indicating a preference for specific product models. This insight could be leveraged for targeted marketing or sales strategies, as customers may have a higher preference for certain variations.
- Review Length: Reviews have an average length of around 132 characters, with some extremely long reviews reaching up to 2853 characters.

Inference: There is a diverse range of review lengths, with the median review length around 74 characters. Shorter reviews may reflect users who are either more concise or may have less detailed feedback, while longer reviews could be indicative of more in-depth satisfaction or dissatisfaction. The correlation between review length and sentiment could provide further insights for improving sentiment analysis models.
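The distribution figures above can be reproduced with a few pandas calls. The column names (`rating`, `feedback`, `variation`) follow the common layout of the Alexa reviews dataset and are assumptions here.

```python
print(df['rating'].value_counts(normalize=True) * 100)     # rating share in percent
print(df['feedback'].value_counts(normalize=True) * 100)   # positive vs. negative feedback share
print(df['variation'].value_counts(normalize=True) * 100)  # product variation share
print(df['length'].describe())                             # mean, median, and max review length
```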
- Text Processing: The reviews were processed and transformed using the `CountVectorizer` to convert the text into a format suitable for machine learning models. Stopwords were excluded and words were stemmed to improve the model's ability to focus on key terms.
- Modeling:
- Three different models were trained:
- Random Forest Classifier
- XGBoost Classifier
- Decision Tree Classifier
- Hyperparameter tuning was performed using Grid Search to optimize the Random Forest model (see the sketch after this list).
- Accuracy, cross-validation scores, and confusion matrices were used to evaluate model performance.
- Evaluation: Performance metrics, including training accuracy, testing accuracy, and confusion matrices, were used to assess each model's effectiveness in predicting the sentiment (positive or negative) of reviews.
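The Grid Search step mentioned under Modeling can be sketched as below. The grid is illustrative and only spans the parameters reported in the results; the fold count is an assumption.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 100, 120],
    'min_samples_split': [2, 8, 12],
    'n_estimators': [100, 300, 500],
}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,            # fold count assumed
    n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)   # best combination found on the training data
```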
Random Forest Classifier results:
- Training Accuracy: 99.37%
- Testing Accuracy: 93.97%
- Cross-validation mean accuracy: 93.85%
- Best Hyperparameters:
  - `bootstrap`: True
  - `max_depth`: 100
  - `min_samples_split`: 8
  - `n_estimators`: 300
- Confusion Matrix:

Inference: The Random Forest model shows high performance on both training and testing datasets. The confusion matrix reveals a relatively low number of misclassifications, indicating the model's robustness. However, further tuning might help improve recall for the minority class.
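A sketch of evaluating the tuned Random Forest from the grid search above; the accuracies reported in this section come from the original run and will vary with the data split.

```python
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_score

rf_best = grid_search.best_estimator_   # bootstrap=True, max_depth=100, min_samples_split=8, n_estimators=300
print('train accuracy:', accuracy_score(y_train, rf_best.predict(X_train)))
print('test accuracy: ', accuracy_score(y_test, rf_best.predict(X_test)))
print('CV mean:       ', cross_val_score(rf_best, X_train, y_train, cv=10).mean())  # fold count assumed
print(confusion_matrix(y_test, rf_best.predict(X_test)))
```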
XGBoost Classifier results:
- Training Accuracy: 97.42%
- Testing Accuracy: 93.33%
- Confusion Matrix:

Inference: XGBoost performs well, with slightly lower training accuracy than Random Forest but comparable performance on the testing data. The confusion matrix highlights that the model is effective at identifying both positive and negative reviews with minimal errors.
Decision Tree Classifier results:
- Training Accuracy: 99.37%
- Testing Accuracy: 92.22%
- Confusion Matrix:

Inference: The Decision Tree model also performs well but has a slightly lower testing accuracy than Random Forest and XGBoost. The confusion matrix shows some misclassifications, particularly in distinguishing between positive and negative reviews.
The confusion matrices for each model highlight how well the models distinguish between positive and negative reviews:
- False Positives (FP): Reviews predicted as positive but are actually negative.
- False Negatives (FN): Reviews predicted as negative but are actually positive.
- True Positives (TP): Reviews correctly predicted as positive.
- True Negatives (TN): Reviews correctly predicted as negative.
The evaluation metrics include:
- Accuracy: The overall percentage of correct predictions made by the model.
- Precision, Recall, and F1-Score: These metrics can be calculated using the confusion matrix and provide more insights into model performance.
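Precision, recall, and F1 can be derived directly from the confusion-matrix cells listed above. The sketch below uses the tuned Random Forest (`rf_best`) from the earlier sketch as an example.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_pred = rf_best.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()   # TN, FP, FN, TP for the binary labels

precision = tp / (tp + fp)                   # share of predicted positives that are truly positive
recall = tp / (tp + fn)                      # share of true positives the model recovers
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)

print(classification_report(y_test, y_pred))  # per-class precision, recall, and F1 in one call
```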
- Data Imbalance: The dataset is imbalanced, with a higher percentage of positive reviews. Future work could include methods like oversampling, undersampling, or using weighted loss functions to improve classification for the minority class (see the sketch after this list).
- Model Improvement: The current models can be further fine-tuned using advanced techniques such as ensemble learning or deep learning models (e.g., LSTM or BERT for sentiment analysis).
- Review Complexity: The project could be expanded to analyze sentiment based on product-specific features and review sentiment trends over time.
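One simple way to address the class imbalance noted above is class weighting; the sketch below is illustrative and not part of the original pipeline.

```python
from sklearn.ensemble import RandomForestClassifier

# 'balanced' reweights classes inversely to their frequency, so the minority
# (negative) class contributes more during training
rf_weighted = RandomForestClassifier(
    n_estimators=300, class_weight='balanced', random_state=42)
rf_weighted.fit(X_train, y_train)
print(rf_weighted.score(X_test, y_test))
```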
The project demonstrates the ability to classify Amazon Alexa product reviews into positive and negative categories effectively using machine learning models. The Random Forest, XGBoost, and Decision Tree models all perform well, with Random Forest yielding the highest accuracy on the test set. This analysis can help businesses assess customer sentiment and make informed decisions on product development.