# Introduction

So, imagine this: you are on the hunt for a new smartphone, scrolling through Amazon's endless array of options. You're not just interested in what the specs say or what the manufacturers claim – you want to know what real users have to say. That's where the magic of customer reviews comes in!

Think about it: every review is a goldmine of insights waiting to be discovered. Are users raving about the camera quality, or are they complaining about the battery life? Or perhaps there is a hidden feature that no one is talking about (this is common, since most of us do not have the time to dig through the nooks and crannies of our smartphones).

This project dives into the world of Amazon phone reviews, using Python to uncover trends and patterns through exploratory data analysis. We will do some data cleaning and wrangling, explore the data, and answer some questions that might help you find your next dream phone 😁. The data used in this project was sourced from data.world. You can view the full dataset [here](https://data.world/opensnippets/amazon-mobile-phones-reviews).

Below are some questions that will be answered:

1- What are the top 5 most reviewed phones on Amazon?

2- Which phone has the highest average review rating?

3- What percentage of reviews are verified purchases?

4- Is there a correlation between helpful_count and review_rating?

5- How does the distribution of review ratings vary across different phone brands?

6- What are the most commonly mentioned colors for phones in the reviews?

7- Are there any trends in the frequency of reviews over time?

8- Which sub-category of phones (e.g., "Smartphones," "Android Phones," "iPhones") receives the most positive reviews?

9- Do certain product features or styles correlate with higher review ratings?

10- Are there any differences in review sentiment between different countries or regions?



### Data Description
a) **product:** The name or model of the product being reviewed (e.g., "iPhone 12 Pro Max").

b) **product_company:** The company or brand that manufactures the product (e.g., "Apple").

c) **profile_name:** The name or username of the person who wrote the review.

d) **review_title:** The title or headline of the review.

e) **review_rating:** The numerical rating assigned by the reviewer (e.g., on a scale of 1 to 5 stars).

f) **review_text:** The main body of the review, containing the reviewer's opinions, experiences, and feedback.

g) **helpful_count:** The number of users who found the review helpful or voted it as such.

h) **total_comments:** The total number of comments or replies on the review.

i) **review_country:** The country or region associated with the reviewer.

j) **reviewed_at:** The date and time when the review was posted.

k) **url:** The URL of the product page on Amazon.

l) **crawled_at:** The date and time when the review data was collected or crawled.

m) **_id:** Unique identifier for the review.

n) **verified_purchase:** Indicates whether the reviewer is a verified purchaser of the product (e.g., "Yes" or "No").

o) **color:** Color variant of the product (if applicable).

p) **style_name:** Style or model variant of the product (if applicable).

q) **size_name:** Size variant of the product (if applicable).

r) **category:** The general category to which the product belongs (e.g., "Electronics").

s) **sub_category:** A more specific sub-category within the main category (e.g., "Smartphones").
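
The column list above can double as a quick sanity check once the data is loaded. The sketch below is only an illustration: `DESCRIBED_COLUMNS` and `compare_columns` are hypothetical names introduced here, the identifier column is spelled `_id` because that is how it appears in the `data.info()` output, and the small `sample` frame is made up for the demonstration.

```python
import pandas as pd

# Column names taken from the data description above; the identifier is
# written as "_id" because that is how it appears in the loaded DataFrame.
DESCRIBED_COLUMNS = [
    "product", "product_company", "profile_name", "review_title",
    "review_rating", "review_text", "helpful_count", "total_comments",
    "review_country", "reviewed_at", "url", "crawled_at", "_id",
    "verified_purchase", "color", "style_name", "size_name",
    "category", "sub_category",
]


def compare_columns(df, expected):
    """Return described columns missing from df and columns not described."""
    actual = list(df.columns)
    return {
        "missing": [c for c in expected if c not in actual],
        "undescribed": [c for c in actual if c not in expected],
    }


# Tiny hypothetical frame, used only to demonstrate the helper.
sample = pd.DataFrame(
    {"product": ["Phone A"], "review_rating": ["5.0 out of 5 stars"], "extra": [1]}
)
report = compare_columns(sample, DESCRIBED_COLUMNS)
print(report["undescribed"])
```

Calling `compare_columns(data, DESCRIBED_COLUMNS)` on the real DataFrame would list any column that the description does not cover, which is worth checking before the cleaning step.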


In [2]:
# import the libraries required for EDA

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [6]:
# read the data
data = pd.read_json('amazon_one_plus_reviews.json')
print(data.head())

                                             product product_company  \
0  OnePlus Nord 5G (Gray Onyx, 8GB RAM, 128GB Sto...         OnePlus   
1  OnePlus Nord 5G (Gray Onyx, 8GB RAM, 128GB Sto...         OnePlus   
2  OnePlus Nord 5G (Gray Onyx, 8GB RAM, 128GB Sto...         OnePlus   
3  OnePlus Nord 5G (Gray Onyx, 8GB RAM, 128GB Sto...         OnePlus   
4  OnePlus Nord 5G (Gray Onyx, 8GB RAM, 128GB Sto...         OnePlus   

      profile_name                   review_title       review_rating  \
0           Nikhil        *Read before you buy!!*  5.0 out of 5 stars   
1             Amit  Near to mid range  Perfection  5.0 out of 5 stars   
2        aishwarya                   Great price!  5.0 out of 5 stars   
3          vasu a.              Beast in OnePlus.  5.0 out of 5 stars   
4  Amazon Customer        Changed to Nord from 6t  5.0 out of 5 stars   

                                         review_text  \
0  \n  Yea..pre-ordered on 28 July, got it on 4 A...   
1  \n  Got it de

In [7]:
# get the descriptive statistics of the data
data.describe()

| | total_comments |
|---|---|
| count | 30612.0 |
| mean | 0.072325 |
| std | 0.444049 |
| min | 0.0 |
| 25% | 0.0 |
| 50% | 0.0 |
| 75% | 0.0 |
| max | 24.0 |
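
By default, `describe()` skips text (object) columns, which is why only `total_comments` shows up. A minimal sketch of how the text columns could be summarized as well, using a small made-up `sample` frame rather than the real data:

```python
import pandas as pd

# Tiny hypothetical frame mixing text and numeric columns.
sample = pd.DataFrame(
    {
        "product_company": ["OnePlus", "OnePlus", "OnePlus"],
        "review_rating": [
            "5.0 out of 5 stars",
            "4.0 out of 5 stars",
            "5.0 out of 5 stars",
        ],
        "total_comments": [0, 2, 0],
    }
)

# include="all" adds count/unique/top/freq rows for the non-numeric columns.
summary = sample.describe(include="all")
print(summary.loc[["count", "unique", "top", "freq"]])
```

The `unique`, `top`, and `freq` rows give a first impression of how varied each text column is before any cleaning is done.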


Looking at the data, it is clearly messy and needs cleaning. Currently, only one column (`total_comments`) is numeric, as the descriptive statistics table shows. A couple of columns, such as `review_rating` and `helpful_count`, are stored as objects but would be more useful as numbers. These will be converted during cleaning.
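
One way the conversion to numeric columns could look is sketched below. This is a sketch under assumptions, not the cleaning code used later: the `review_rating` strings follow the format visible in the `head()` output, but the `helpful_count` strings are an assumed format and may differ in the real data. The `sample` frame is made up for illustration.

```python
import pandas as pd

# Hypothetical rows. The review_rating format matches the head() output;
# the helpful_count format is an assumption and may differ in the real data.
sample = pd.DataFrame(
    {
        "review_rating": [
            "5.0 out of 5 stars",
            "3.0 out of 5 stars",
            "1.0 out of 5 stars",
        ],
        "helpful_count": ["1,024 people found this helpful", "12", ""],
    }
)

# Pull the leading number out of strings like "5.0 out of 5 stars".
sample["review_rating"] = (
    sample["review_rating"]
    .str.extract(r"^(\d+(?:\.\d+)?)", expand=False)
    .astype(float)
)

# Drop thousands separators, take the first run of digits, and coerce
# anything unparseable (such as an empty string) to NaN.
digits = (
    sample["helpful_count"]
    .str.replace(",", "", regex=False)
    .str.extract(r"(\d+)", expand=False)
)
sample["helpful_count"] = pd.to_numeric(digits, errors="coerce")

print(sample.dtypes)
```

Using `errors="coerce"` keeps unparseable values as `NaN` instead of raising, so they can be inspected and handled separately.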

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30612 entries, 0 to 30611
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   product            30612 non-null  object        
 1   product_company    30612 non-null  object        
 2   profile_name       30612 non-null  object        
 3   review_title       30612 non-null  object        
 4   review_rating      30612 non-null  object        
 5   review_text        30612 non-null  object        
 6   helpful_count      30612 non-null  object        
 7   total_comments     30612 non-null  int64         
 8   review_country     30612 non-null  object        
 9   reviewed_at        30612 non-null  datetime64[ns]
 10  url                30612 non-null  object        
 11  crawled_at         30612 non-null  datetime64[ns]
 12  _id                30612 non-null  object        
 13  verified_purchase  30612 non-null  object        
 14  color 