# Amazon_data

## Part 1: Database and Jupyter Notebook Set Up
### Use the following command to import the dataset into MongoDB:

 mongoimport --type csv -d Clean_Data_Resources -c amazon_cleanData --headerline --drop amazon_cleanData.csv

In [None]:
# Import dependencies
from pymongo import MongoClient
from pprint import pprint
import pandas as pd

In [None]:
import re
import warnings
from collections import Counter

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
from textblob import TextBlob

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [None]:
# Create an instance of MongoClient
mongo = MongoClient(port=27017)

In [None]:
# confirm that our new database was created
print(mongo.list_database_names())

In [None]:
# assign the Clean_Data_Resources database to a variable name
db = mongo['Clean_Data_Resources']

In [None]:
# review the collections in our amazon database
print(db.list_collection_names())

In [None]:
# review a document in the amazon_cleanData collection
# Display the first document using pprint, before any filter is applied
pprint(db.amazon_cleanData.find_one())

In [None]:
# assign the collection to a variable
amazon = db['amazon_cleanData']

In [None]:
# Display the total number of items in the amazon collection
amazon.count_documents({})

### Analysis

The Jupyter Notebook analysis shows that the 'category' column of the Amazon dataset contains 8 unique values that describe different kinds of cable. The queries below use these 8 category values to derive some interesting facts and to understand the trends of the different cables sold on the Amazon website.
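The unique cable categories could also be listed directly from MongoDB. The following is a minimal sketch, assuming `collection` is a pymongo collection such as the `amazon` object used in this notebook; the helper name `cable_categories` is hypothetical, and the regex assumes every cable category name ends with "cables".

```python
def cable_categories(collection):
    """Return the sorted distinct 'category' values whose name ends in 'cables'."""
    # Case-insensitive match on categories ending with "cables"
    cable_filter = {'category': {'$regex': 'cables$', '$options': 'i'}}
    # distinct(key, filter) returns the unique values of `key`
    # among the documents that match `filter`
    return sorted(collection.distinct('category', cable_filter))

# Usage inside the notebook (assumes `amazon` is already defined):
# categories = cable_categories(amazon)
# print(len(categories), "cable categories found")
```

If the result does not contain exactly 8 entries, the category strings used in the queries below would need to be revisited.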

### Computer USB Cables

In [None]:
# Create some queries to understand the trends of cables available in the dataset
# Create a query that finds all computer USB cables in the 'amazon' collection
# Filter results by category
query = {'category': 'ComputersAccessoriesAccessoriesPeripheralsCablesAccessoriesCablesUSBCables'}

# Capture the results to a variable
results1 = list(amazon.find(query))

# Print the number of results
Computer_USBs = int(amazon.count_documents(query))
print("Total Number of items which are computer USB cables in result are:", Computer_USBs)


In [None]:
## Pretty print the results
pprint(results1[0:5])

In [None]:
# Initialize an empty list to store every computer USB cable document from the query
comp_USBs = []

# Loop through each document in the results and append it to the list
for item in results1:
    comp_USBs.append(item)

# comp_USBs is now a list of dictionaries
pprint(comp_USBs)

In [None]:
#create a dataframe for computer USBs
comp_USBs_df  = pd.DataFrame(comp_USBs)
comp_USBs_df.head(5)



In [None]:
#finding the maximum rating counts in the comp_USBs_df dataframe
max_value_index = comp_USBs_df['rating_count'].idxmax()
max_value = comp_USBs_df['rating_count'].max()

print(f"Index of the maximum rating count: {max_value_index}")
print(f"Maximum rating count value: {max_value}")

In [None]:
# Create a query that finds the top 5 computer USB cables in the 'amazon' collection
# Filter by category, a rating of at least 4, and a rating count between 50,000 and 180,000
query = {'category': 'ComputersAccessoriesAccessoriesPeripheralsCablesAccessoriesCablesUSBCables',
        'rating_count': {'$gte': 50000,  '$lte':180000}, 'rating':{'$gte':4}}

# Display the  fields
fields = {'category' :1, 'product_id':1, 'product_name': 1 , 'discounted_price': 1, 'discount_percentage':1,'actual_price': 1, "rating":1,"review_title":1, "review_content":1, "rating_count":1}

# sort in descending order by rating_count
sort = [('rating_count', - 1)]

# limit the results to the first 5
limit = 5

#  Execute the query
result2= (list(amazon.find(query, fields).sort(sort).limit(limit)))


In [None]:
#displaying the result
result2 
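
The same "filter, sort by rating count, keep the top 5" pattern is repeated for several categories in this notebook. As a hedged sketch, it could be wrapped in two small helpers; the names `top_rated_query` and `top_rated` are hypothetical and not part of the original notebook, and `collection` is assumed to be a pymongo collection.

```python
def top_rated_query(category, min_count, max_count=None, min_rating=4):
    """Build the filter for the 'most popular items' queries."""
    count_range = {'$gte': min_count}
    if max_count is not None:
        # Only add an upper bound when one is requested
        count_range['$lte'] = max_count
    return {
        'category': category,
        'rating_count': count_range,
        'rating': {'$gte': min_rating},
    }

def top_rated(collection, query, fields=None, limit=5):
    """Return up to `limit` matching documents, highest rating_count first."""
    cursor = collection.find(query, fields).sort([('rating_count', -1)]).limit(limit)
    return list(cursor)

# Usage inside the notebook (assumes `amazon` and `fields` are defined):
# query = top_rated_query(
#     'ComputersAccessoriesAccessoriesPeripheralsCablesAccessoriesCablesUSBCables',
#     min_count=50000, max_count=180000)
# result2 = top_rated(amazon, query, fields)
```

Because `top_rated` already returns a list of dictionaries, the result can be passed straight to `pd.DataFrame` without an intermediate copy loop.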

In [None]:
# Initialize an empty list to store the top 5 computer USB cables with a rating of at least 4
popular_comp_USBs = []

# Loop through each document in the results and append it to the list
for item in result2:
    popular_comp_USBs.append(item)

# popular_comp_USBs is now a list of dictionaries
pprint(popular_comp_USBs)

In [None]:
# Create a DataFrame from the list 
df_Top_comp_USBs = pd.DataFrame(popular_comp_USBs)

# Print the number of rows in the DataFrame
print("The number of rows of the dataframe is:", len(df_Top_comp_USBs))

In [None]:
#displaying dataframe containing top 5 computer USB Cables which has ratings greater than equal to 4
df_Top_comp_USBs

In [None]:
#checking column names
print(df_Top_comp_USBs.columns)

In [None]:
# Reorder the columns
Reordered_Top_comp_USBs= df_Top_comp_USBs[[
    "product_id", "product_name", "category", "actual_price", "discounted_price",
    "discount_percentage", "rating", "review_title", "review_content","rating_count"
]]

# Display the DataFrame
print("The 5 most popular computer USB cables with a rating of at least 4 are the following:")
Reordered_Top_comp_USBs

In [None]:
# Now let's create a bar chart for the Top  5 computer USB Cables which  have ratings greater than equal to 4
filtered_df = Reordered_Top_comp_USBs
fig = px.bar(filtered_df, x='product_id', y='rating_count',
             hover_data=['product_name', 'actual_price', 'discount_percentage'], color='discount_percentage',
             labels={'rating_count': 'Rating Counts'}, height=400)

# Update layout for a clearer x-axis and set the title
fig.update_layout(
    title="Top 5 Most Popular Computer USBs with Rating Score >=4",
    xaxis_title="Product ID",
    yaxis_title="Rating Counts",
    yaxis=dict(type='linear')  
)

# Display the figure
fig.show()

In [None]:
#saving the chart into the output Viz folder 
fig.write_html('../outputs_Viz/Most_popular_computer_USBs1.html')



## Computer USB Cable Analysis

The chart and DataFrame above show the top 5 computer USB cables with a rating score >= 4 and a rating count between 50,000 and 180,000. Here, **{"Product_id": B07DC4RZPY, "Product_name": Amazon Basics USB A to Lightning MFi Certified...}** is the most popular computer USB cable, with the highest rating count in our Amazon dataset: 178,817 (written 1,78,817 in Indian digit grouping). This product has the highest actual_price (1999 INR); after a 65% discount, its final discounted price is 709 INR, which is still the most expensive compared to the other 4 popular products.
The highest rating count suggests that this product is very popular among customers, since customers generally leave a rating or review only after using a product. A rating score >= 4, combined with the highest price and a moderate discount percentage, points to a large market share.
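
As a quick sanity check on the prices quoted above, the discount percentage can be recomputed from the actual and discounted prices. This sketch assumes the dataset's `discount_percentage` is the relative price reduction rounded to a whole percent; the source does not state how the column was derived, and the function name is hypothetical.

```python
def discount_percentage(actual_price, discounted_price):
    """Percentage discount implied by the two prices, rounded to a whole number."""
    return round((1 - discounted_price / actual_price) * 100)

# Prices quoted in the analysis above (INR)
print(discount_percentage(1999, 709))  # 65
```

The recomputed value agrees with the 65% reported for this product under that rounding assumption.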

To better understand whether customers perceive this product in a positive or negative light, the next step is a sentiment analysis of its review content.


In [None]:
#creating dataframe with product id, product name and review content to make the sentiment analysis for top 5 most popular computer USB Cables
review_df = pd.DataFrame(Reordered_Top_comp_USBs[["product_id", "product_name","review_content", "review_title"]])
review_df

In [None]:
# Define a function to apply sentiment analysis
def analyze_sentiment(text):
    analysis = TextBlob(text)
    return analysis.sentiment

# Apply the function to the 'review_content' column
review_df['sentiment'] = review_df['review_content'].apply(analyze_sentiment)

# review_df now includes a new 'sentiment' column with the sentiment analysis results
print(review_df[['review_content', 'sentiment']])



### Sentiment Analysis on Most Popular Computer USB Cables

Sentiment analysis is a technique that tries to determine whether the words used in a text express positive, negative, or neutral feelings. Here, each 'sentiment' is a pair of numbers in the format (polarity, subjectivity). Based on the output of the sentiment analysis, we can observe the following:

**Sentiment Polarity:**

The polarity score ranges from -1 (most negative) to 1 (most positive). All five reviews have positive polarity, indicating that the reviews are positive overall.
The first review has a polarity of approximately 0.215, suggesting it is moderately positive.
The second review has a slightly lower polarity of approximately 0.199, which is still positive but a bit more subdued.
The last three reviews have identical polarity scores of approximately 0.240, indicating a positive sentiment that is slightly more pronounced than in the first two reviews.

**Sentiment Subjectivity:**

The subjectivity score ranges from 0 (most objective) to 1 (most subjective). Higher scores indicate a more personal opinion or subjective text, while lower scores suggest factual or objective text.
The first review has a subjectivity score of approximately 0.438, which is moderately subjective.
The second review's subjectivity score is similar at approximately 0.432.
The last three reviews, which are identical, have a subjectivity score of approximately 0.544, suggesting they are more subjective than the first two reviews.
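
TextBlob returns the sentiment as a named tuple with `polarity` and `subjectivity` fields, so the two numbers can be read separately. The sketch below turns a polarity score into a text label; the function name and the `neutral_band` threshold are illustrative assumptions, not TextBlob settings.

```python
def polarity_label(polarity, neutral_band=0.05):
    """Map a polarity score in [-1, 1] to 'positive', 'negative' or 'neutral'.

    `neutral_band` is an arbitrary cut-off chosen for illustration.
    """
    if polarity > neutral_band:
        return 'positive'
    if polarity < -neutral_band:
        return 'negative'
    return 'neutral'

# Polarity scores reported above for the five reviews
for score in (0.215, 0.199, 0.240, 0.240, 0.240):
    print(score, polarity_label(score))
```

With this threshold, all five reviews are labelled positive, which matches the reading given above.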

**Duplicate Reviews:**

The last three reviews are identical, which is reflected in the identical sentiment scores. This could be because multiple reviewers left the same feedback, possibly through a review template or automated responses.
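
Duplicate review text can be detected before running the sentiment analysis, rather than inferred from identical scores. This is a minimal sketch, assuming each row is a dictionary with `product_id` and `review_content` keys, as in the documents returned by the query; the helper name `duplicated_reviews` is hypothetical.

```python
from collections import defaultdict

def duplicated_reviews(rows):
    """Group product ids that share the same review text."""
    by_text = defaultdict(list)
    for row in rows:
        # Normalize whitespace and case so trivial differences do not hide duplicates
        key = ' '.join(str(row['review_content']).split()).lower()
        by_text[key].append(row['product_id'])
    # Keep only review texts that appear for more than one product
    return [ids for ids in by_text.values() if len(ids) > 1]

# Usage inside the notebook (assumes `result2` holds the top-5 documents):
# print(duplicated_reviews(result2))
```

Each inner list holds the product ids that share one review text, which makes it easier to decide whether to keep a single copy before scoring.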

**General Observations:**

Overall, the reviews of these five cables are positive, as indicated by the positive polarity scores. The reviews also mix subjective opinions with objective statements, with subjectivity scores in the moderate range.


**Actionable Insights:**

For actionable business insights, one might look at the individual reviews that correspond to extreme sentiment scores (very high or very low) to understand what drives strong customer sentiment.
The sentiment analysis could be used to gauge overall customer satisfaction, identify areas for product improvement, or highlight aspects that customers particularly appreciate.
Sentiment analysis provides a high-level overview of the sentiment but does not capture nuances such as sarcasm, context-specific language, or mixed sentiments within the same text. A more detailed understanding of customer opinions would require a deeper analysis, potentially using more advanced NLP techniques or manual review.






### TV HDMI Cables

In [None]:
#  Create a query that finds all TV HDMI cables in the 'amazon' collection
# Filter results by name
query = {'category': 'ElectronicsHomeTheater,TVVideoAccessoriesCablesHDMICables'}


# Capture the results to a variable
results3 = list(amazon.find(query))

# Print the number of results
TV_HDMI_cables = int(amazon.count_documents(query))
print("Total Number of items which are TV HDMI cables in result are:", TV_HDMI_cables)


In [None]:
# Initialize an empty list to store every TV HDMI cable document from the query
TV_HDMIs = []

# Loop through each document in the results and append it to the list
for item in results3:
    TV_HDMIs.append(item)

# TV_HDMIs is now a list of dictionaries
pprint(TV_HDMIs)

In [None]:
# Create a DataFrame from the list 
df_TV_HDMIs = pd.DataFrame(TV_HDMIs)

# Display the DataFrame
df_TV_HDMIs.head()

In [None]:
# Create a query that finds the top 5 TV HDMI cables in the 'amazon' collection
# Filter by category, a rating of at least 4, and a rating count between 10,000 and 450,000
query = {'category': 'ElectronicsHomeTheater,TVVideoAccessoriesCablesHDMICables',
        'rating_count': {'$gte': 10000,  '$lte':450000}, 'rating':{'$gte':4}}

# Display the  fields
fields = {'category' :1, 'product_id':1, 'product_name': 1 , 'discounted_price': 1, 'discount_percentage':1,'actual_price': 1, "rating":1,"review_title":1, "review_content":1, "rating_count":1}

# sort in descending order by rating_count
sort = [('rating_count', - 1)]

# limit the results to the first 5
limit = 5

#  Execute the query
result4= (list(amazon.find(query, fields).sort(sort).limit(limit)))

In [None]:
result4

In [None]:
# Initialize an empty list to store the top 5 TV HDMI Cables having ratings greater than equal to 4
top_tv_HDMIs = []

# Loop through each document in the results and append it to the list
for item in result4:
   top_tv_HDMIs.append(item)
  
# Display the list
pprint(top_tv_HDMIs)

In [None]:
# Create a DataFrame from the list 
df_Top_TV_hdmi = pd.DataFrame(top_tv_HDMIs)

# Display the DataFrame
print(df_Top_TV_hdmi)

In [None]:
df_Top_TV_hdmi.head()

In [None]:
# Reorder the columns; .copy() creates an independent DataFrame so that
# adding the hover_text column later does not write to a slice of df_Top_TV_hdmi
Reordered_Top_tv_hdmi = df_Top_TV_hdmi[[
    "product_id", "product_name", "category", "actual_price", "discounted_price",
    "discount_percentage", "rating", "review_title", "review_content","rating_count"
]].copy()

# Display the first rows of the DataFrame
print("The most popular TV HDMI cables with a rating of at least 4 are the following:")
Reordered_Top_tv_hdmi.head()

In [None]:
## Create the scatter plot for the Most popular  5 TV HDMIs cables
# Create a hover text column by formatting the desired information
Reordered_Top_tv_hdmi['hover_text'] = 'Product Name: ' + Reordered_Top_tv_hdmi['product_name'] + \
                                      '<br>Category: ' + Reordered_Top_tv_hdmi['category'] + \
                                      '<br>Actual Price: INR ' + Reordered_Top_tv_hdmi['actual_price'].astype(str) + \
                                      '<br>Discounted Price: INR ' + Reordered_Top_tv_hdmi['discounted_price'].astype(str) + \
                                      '<br>Rating Count: ' + Reordered_Top_tv_hdmi['rating_count'].astype(str)

# Create the scatter plot
fig = go.Figure(data=go.Scatter(
    x=Reordered_Top_tv_hdmi['product_id'],
    y=Reordered_Top_tv_hdmi['rating_count'],
    mode='markers',
    marker=dict(size=[100, 80, 60, 40, 20],  # Adjust sizes as needed
                color=[0, 1, 2, 3, 4]),  # Adjust colors as needed
    text=Reordered_Top_tv_hdmi['hover_text'],  # Use the custom hover text
    hoverinfo='text'
))

# Customize layout
fig.update_layout(
    title='Most Popular TV HDMI Cables with Ratings >= 4',
    xaxis_title='Product ID',
    yaxis_title='Rating Counts',
    xaxis=dict(showgrid=False),
    yaxis=dict(showgrid=True),
    plot_bgcolor='skyblue'
)

# Show the plot
fig.show()
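
The hard-coded marker sizes in the cell above assume exactly five rows. As an alternative sketch, the sizes could be derived from the rating counts themselves; `bubble_sizes` is a hypothetical helper, and the 20 to 100 range simply mirrors the sizes used above.

```python
def bubble_sizes(values, smallest=20, largest=100):
    """Scale `values` linearly into marker sizes between `smallest` and `largest`."""
    values = list(values)
    low, high = min(values), max(values)
    if high == low:
        # All values are equal: use one mid-range size for every marker
        return [(smallest + largest) / 2] * len(values)
    span = high - low
    return [smallest + (v - low) / span * (largest - smallest) for v in values]

# Usage inside the notebook (assumes Reordered_Top_tv_hdmi is defined):
# marker=dict(size=bubble_sizes(Reordered_Top_tv_hdmi['rating_count']))
```

This keeps the bubble size tied to the data, so the chart still renders sensibly if the query returns fewer than five products. The helper expects at least one value.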

In [None]:
#saving the chart into the output Viz folder 
fig.write_html('../outputs_Viz/Most_popular_tv_bubble_HDMIs2.html')

## TV HDMIs Analysis

Based on the chart and DataFrame above, **{"product_id": B014I8SX4Y, "product_name": Amazon Basics High-Speed HDMI Cable, 6 Feet}** is the most popular TV HDMI cable in the dataset, with a rating count of 426,973 (written 4,26,973 in Indian digit grouping). It has the highest actual price (1400 INR), and with a 78% discount its discounted price is 309 INR. It is not the most affordable TV HDMI cable in the dataset, but it is definitely very popular, considering its rating score of 4 or above and the highest rating count.

In [None]:
#creating dataframe with product id, product name and review content to make sentiment analysis of the top 5 most popular TV HDMIs
review_TVs_df = pd.DataFrame(Reordered_Top_tv_hdmi[["product_id", "product_name","review_content", "review_title"]])
review_TVs_df



In [None]:
# Define a function to apply sentiment analysis
def analyze_sentiment1(text):
    analysis = TextBlob(text)
    return analysis.sentiment

# Apply the function to the 'review_content' column
review_TVs_df['sentiment'] = review_TVs_df['review_content'].apply(analyze_sentiment1)

# Now, review_TVs_df includes a new column 'sentiment' with sentiment analysis results
print(review_TVs_df[['review_content', 'sentiment']])

### Sentiment Analysis of column "review_content":

**Sentiment Polarity:**

The polarity score ranges from -1 (very negative) to +1 (very positive). All the sentiments here have positive polarity, indicating positive reviews.
The first three reviews have identical polarity scores of approximately 0.375, suggesting moderate positivity. The duplicated reviews and sentiment scores suggest either repeated entries or very similar content across these reviews.
The fourth review has a higher polarity of about 0.604, indicating a strongly positive sentiment towards the product.
The fifth review shows a slightly higher than moderate positive sentiment, with a polarity of approximately 0.414.

**Sentiment Subjectivity:**

The subjectivity score ranges from 0 (very objective) to 1 (very subjective). Scores closer to 1 suggest personal opinions, feelings, or experiences, while scores closer to 0 suggest factual statements.
The first three reviews, which are identical, have a subjectivity score of about 0.577, indicating a mix of subjective opinions and objective facts.
The fourth review has a higher subjectivity score of about 0.686, suggesting that the review is based more on personal experience and opinion.
The fifth review has a subjectivity score of around 0.480, indicating that it is fairly balanced between factual information and personal opinion.

**Repetition of Reviews:**

The first three reviews are identical in both content and sentiment scores. This repetition might be due to the same review being posted multiple times.

**Overall Sentiment:**

Overall, the sentiment analysis suggests that the reviews of the HDMI cables are generally positive, with varying degrees of positivity and subjectivity.
The fourth review is the most positive and the most subjective, which may reflect particularly strong personal satisfaction with the product.

### Total Cables in the dataset

In [None]:
# Create a query that finds all cables in the 'amazon' collection
# Match every category whose name ends with "cables" (case-insensitive)

query = {'category': {'$regex': "cables$", '$options': 'i'}}

# Capture the results to a variable
result5 = list(amazon.find(query))

# Print the number of results
Total_cables = int(amazon.count_documents(query))
print("Total Number of cables in result are:",Total_cables  )


In [None]:
# Copy the query results into a list
Cables = []
for i in result5:
    Cables.append(i)

In [None]:
Cables[0:3]

In [None]:
# Create a DataFrame from the list 
df_cables = pd.DataFrame(Cables)

#reorder the df_Cables dataframe
reordered_df_cables = df_cables[["category", "product_name", "discounted_price", "actual_price", "rating", "rating_count"]]

# Display the DataFrame
print("Total number of cables in the dataset are:" , int(len(df_cables)))
reordered_df_cables

In [None]:
# Create a query that finds all RCA cables in the 'amazon' collection
# Filter results by category
query = {'category':  'ElectronicsHomeTheater,TVVideoAccessoriesCablesRCACables'}

# Capture the results to a variable
result6 = amazon.find(query)

# Print the number of results
RCA_cables = int(amazon.count_documents(query))
print("Total Number of rca cables in result are:", RCA_cables)


In [None]:
#  Create a query that finds all optical cables in the 'amazon' collection
# Filter results by name
query = {'category':  'ElectronicsHomeTheater,TVVideoAccessoriesCablesOpticalCables'}

# Capture the results to a variable
result7 = amazon.find(query)

# Print the number of results
optical_cables = int(amazon.count_documents(query))
print("Total Number of optical cables in result are:", optical_cables)

In [None]:
# Create a query that finds all DVI cables in the 'amazon' collection
# Filter results by category
query = {'category': 'ComputersAccessoriesAccessoriesPeripheralsCablesAccessoriesCablesDVICables'}

# Capture the results to a variable
result8 = amazon.find(query)

# Print the number of results
DVI_cables = int(amazon.count_documents(query))
print("Total Number of DVI cables in result are:", DVI_cables)

In [None]:
# Create a query that finds all speaker cables in the 'amazon' collection
# Filter results by category
query = {'category':  'ElectronicsHomeTheater,TVVideoAccessoriesCablesSpeakerCables'}

# Capture the results to a variable
result9 = amazon.find(query)

# Print the number of results
Speaker_cables = int(amazon.count_documents(query))
print("Total Number of speaker cables in result are:", Speaker_cables)

In [None]:
# Create a query that finds all ethernet cables in the 'amazon' collection
# Filter results by category
query = {'category':   'ComputersAccessoriesAccessoriesPeripheralsCablesAccessoriesCablesEthernetCables'}

# Capture the results to a variable
result10 = amazon.find(query)

# Print the number of results
ethernet_cables = int(amazon.count_documents(query))
print("Total Number of ethernet cables in result are:", ethernet_cables)

In [None]:
# Create a query that finds all SATA cables in the 'amazon' collection
# Filter results by category
query = {'category':   'ComputersAccessoriesAccessoriesPeripheralsCablesAccessoriesCablesSATACables'}

# Capture the results to a variable
results11 = amazon.find(query)

# Print the number of results
SATA_cables = int(amazon.count_documents(query))
print("Total Number of SATA cables in result are:",SATA_cables)
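
The eight separate count queries above could also be expressed as a single aggregation. The pipeline below is a sketch, not the notebook's original method; it assumes every cable category name ends with "cables", and the function name is hypothetical.

```python
def cable_count_pipeline():
    """Aggregation pipeline that counts documents per cable category."""
    return [
        # Keep only categories whose name ends with "cables" (case-insensitive)
        {'$match': {'category': {'$regex': 'cables$', '$options': 'i'}}},
        # One output document per category, with the number of items in it
        {'$group': {'_id': '$category', 'count': {'$sum': 1}}},
        # Largest categories first
        {'$sort': {'count': -1}},
    ]

# Usage inside the notebook (assumes `amazon` is the collection defined earlier):
# for doc in amazon.aggregate(cable_count_pipeline()):
#     print(doc['_id'], doc['count'])
```

A single pass like this also reveals any category that matches the regex but is not among the eight queried individually.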

## Number of Cables Available in the Dataset

In [None]:
print("There are eight kinds of Cables in this dataset, they are:")
print("1. Computer USB Cables are:", Computer_USBs)
print("2. TV HDMI Cables are:" , TV_HDMI_cables)
print("3. RCA Cables are:" , RCA_cables)
print("4. Optical Cables are:" , optical_cables)
print("5. DVI Cables are:" , DVI_cables)
print("6. Speaker Cables are:" , Speaker_cables)
print("7. ethernet Cables are:" , ethernet_cables)
print("8. SATA Cables are:" , SATA_cables)
print("Therefore, Total Cables = Computer_USBs + TV_HDMI_cables + RCA_cables + optical cables + DVI cables + speaker cables + ethernet cables + SATA cables =", Total_cables)
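
The last print statement above states that the eight counts add up to `Total_cables`, but it does not check this. A small check, sketched below with a hypothetical helper name, would make any mismatch visible, for example if the regex query matched a category outside the eight listed.

```python
def check_total(counts_by_type, expected_total):
    """Return (matches, computed_sum) for per-type counts against a total."""
    computed = sum(counts_by_type.values())
    return computed == expected_total, computed

# Usage inside the notebook, with the variables defined in the cells above:
# counts = {"Computer USB": Computer_USBs, "TV HDMI": TV_HDMI_cables,
#           "RCA": RCA_cables, "Optical": optical_cables, "DVI": DVI_cables,
#           "Speaker": Speaker_cables, "Ethernet": ethernet_cables,
#           "SATA": SATA_cables}
# matches, computed = check_total(counts, Total_cables)
# print("Counts add up to the total:", matches, "| sum of counts:", computed)
```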


In [None]:
#creating the total Cables dataframe
cables_types_df = pd.DataFrame({
    "Cable_Type": ["Computer USB Cables", "TV HDMI Cables", "RCA Cables", "Optical Cables",
                   "DVI Cables", "Speaker Cables", "Ethernet Cables", "SATA Cables"],
    "Count": [Computer_USBs, TV_HDMI_cables, RCA_cables, optical_cables,
              DVI_cables, Speaker_cables, ethernet_cables, SATA_cables]
})

cables_types_df


In [None]:
# Creating a pie Chart from the cable dataframe
import plotly.graph_objects as go

labels = cables_types_df["Cable_Type"]

values = cables_types_df["Count"]

# pull is given as a fraction of the pie radius
fig = go.Figure(data=[go.Pie(labels=labels, values=values, pull=[ 0.2, 0,0,0,0 ,0,0,0])])
# Adding a title to the pie chart
fig.update_layout(title_text="Distribution of Different Types of Cables")
fig.show()

In [None]:
#saving the chart into the output folder 
fig.write_html('../outputs_Viz/Cables_types_pie3.html')

## Cable Market Availability Analysis

Based on the cables DataFrame and pie chart, the findings are as follows:
- Computer USB Cables overwhelmingly dominate the chart, at 83.7% of all cables. Among the cable types listed, they are the most common (or most stocked) item in this dataset.
- TV HDMI Cables are the second most common cable type at 11.1%, far behind Computer USB Cables but still a clear second.
- RCA Cables, Optical Cables, Ethernet Cables, DVI Cables, Speaker Cables, and SATA Cables each make up between 0.526% and 1.58% of the dataset, that is, less than 2% individually. Their marginal representation could point to niche demand or specialization in these cable types.
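
The percentages shown in the pie chart are each count divided by the total. The sketch below shows that arithmetic; the helper name is hypothetical, and the example counts are placeholders for illustration only, not values from the dataset.

```python
def percentage_shares(counts_by_type, digits=1):
    """Return each type's share of the total, in percent."""
    total = sum(counts_by_type.values())
    return {name: round(100 * n / total, digits) for name, n in counts_by_type.items()}

# Placeholder counts for illustration only; they are NOT taken from the dataset
example = {'Type A': 8, 'Type B': 1, 'Type C': 1}
print(percentage_shares(example))
# {'Type A': 80.0, 'Type B': 10.0, 'Type C': 10.0}
```

In the notebook, the same function could be applied to the counts stored in `cables_types_df` to reproduce the shares read off the chart.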

In [None]:
#  Create a query that finds all Smart Televisions in the 'amazon' collection
# Filter results by category name
query = {'category': 'ElectronicsHomeTheater,TVVideoTelevisionsSmartTelevisions'}

# Capture the results to a variable
result12 = list(amazon.find(query))

# Print the number of results
print("Total Number of items which are all Smart Televisions in result are:", amazon.count_documents(query))


In [None]:
## Pretty print the 10 records
pprint(result12[0:10])

In [None]:
# Initialize an empty list to store every Smart Television document from the query
Smart_TVs = []

# Loop through each document in the results and append it to the list
for item in result12:
    Smart_TVs.append(item)

# Smart_TVs is now a list of dictionaries
pprint(Smart_TVs)


In [None]:
#creating a dataframe of the Smart Tvs
# Create a DataFrame from the list 
df_Smart_TV = pd.DataFrame(Smart_TVs)

# Display the DataFrame
#print(df_Smart_TV)
df_Smart_TV.head(5)


In [None]:
# Create a query that finds the top 5 Smart Televisions in the 'amazon' collection
# Filter by category, a rating of at least 4, and a rating count of at least 10,000
query = {'category': 'ElectronicsHomeTheater,TVVideoTelevisionsSmartTelevisions',
            'rating_count': {'$gte': 10000}, 'rating':{'$gte':4}}
         
# Display the required fields
fields = {'category' :1, 'product_id':1, 'product_name': 1 , 'discounted_price': 1, 'discount_percentage':1,'actual_price': 1, "rating":1,"review_title":1, "review_content":1, "rating_count":1}

# sort in descending order by 
sort = [ ('rating_count', -1)]

limit = 5

# Using find() with sort and limit as chained method calls
result13 = (list(amazon.find(query, fields).sort(sort).limit(limit)))

# Print the number of documents that match the query (before the limit of 5 is applied)
print("Total number of Smart Televisions matching the query:", amazon.count_documents(query))

In [None]:
#displaying the result
result13


In [None]:
# Initialize an empty list to store the top 5 smart Tvs have ratings greater than equal to 4
top_SmartTVs = []

# Loop through each document in the results and append it to the list
for item in result13:
  top_SmartTVs.append(item)
  
# Display the list
pprint(top_SmartTVs)


In [None]:
# Create a DataFrame from the list 
df_Top_Smart_TV = pd.DataFrame(top_SmartTVs)

# Display the DataFrame
#print(df_Top_Smart_TV)
df_Top_Smart_TV.head()


In [None]:
#creating an interactive barcharts showing top 5 smart TVs with rating >= 4
df_smart_TVs = df_Top_Smart_TV

fig = px.bar(df_smart_TVs, y='rating_count', x='product_id', text_auto= True,
              hover_data=['product_name', 'actual_price', 'discounted_price'],
            title="5 Most Popular Smart Tvs with rating >= 4")
fig.show()

In [None]:
#saving the chart into the output folder 
fig.write_html('../outputs_Viz/most_popular_smartTvs4.html')

## Most Popular Smart Tv Analysis

Based on the chart and DataFrame above, **{"product_id": B08Y55LPBF, "product_name": Redmi 126 cm (50 inches) 4K Ultra HD Android S...}** is the most popular Smart TV in the dataset, with a rating count of 45,238. Its actual price is 44999 INR; after a discount of 27%, it costs customers 32999 INR. The discount percentage is moderate, but given its rating score of 4 or above and the highest rating count, it is the most popular Smart TV in the dataset.

In [None]:
#creating dataframe with product id, product name and review content to make the sentiment analysis of the most popular Smart Tvs 
review_smartTVs_df = pd.DataFrame(df_Top_Smart_TV[["product_id", "product_name","review_content", "review_title"]])
review_smartTVs_df

# Define a function to apply sentiment analysis
def analyze_sentiment2(text):
    analysis = TextBlob(text)
    return analysis.sentiment

# Apply the function to the 'review_content' column
review_smartTVs_df['sentiment'] = review_smartTVs_df['review_content'].apply(analyze_sentiment2)

# review_smartTVs_df now includes a new 'sentiment' column with the sentiment analysis results
print(review_smartTVs_df[['review_content', 'sentiment']])


### Sentiment Analysis of Most popular Smart Tvs

**Sentiment Polarity:**

All the sentiment polarity scores are positive, which indicates that the reviews have a generally positive tone. The polarity scores are moderate rather than extremely positive, suggesting that while the reviews are favorable, they may contain some reservations or less enthusiastic language.

**Sentiment Subjectivity:**

The subjectivity scores are also moderate, which implies that the reviews contain a mix of objective facts and subjective opinions. None of the reviews is entirely subjective or entirely objective; they balance factual information about the products with personal opinions or experiences.

**Similar Sentiment in Multiple Reviews:**

Reviews 2, 3, and 4 have identical sentiment scores, suggesting that the content of these reviews is either very similar or duplicated.
This could indicate that the same review has been posted multiple times.

However, sentiment analysis tools have a limitation here: they may not capture nuanced expressions of sentiment, such as sarcasm or mixed emotions within the text. Hence, these scores should be interpreted as part of a broader analysis that might include a manual review of the text for context.

### Statistical Data Analysis 

#### Calculating correlation coefficient between some numerical columns

In [None]:
# Calculating the correlation coefficient between 'discounted_price' and  "actual_price"
corr = df_Top_Smart_TV['actual_price'].corr(df_Top_Smart_TV['discounted_price'])
corr

### Analysis

The correlation coefficient between 'actual_price' and 'discounted_price' is approximately 0.94, computed on `df_Top_Smart_TV`, which holds only the most-rated Smart TVs selected above (at most five rows). This value indicates a strong positive linear relationship: as the actual price increases, the discounted price tends to increase as well, and vice versa. The two prices move largely in tandem, which suggests that the discount may be roughly proportional to the actual price. Because the coefficient rests on so few products, it is indicative rather than conclusive.
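
pandas' `Series.corr` uses the Pearson method by default. For reference, the same coefficient can be computed by hand as the covariance divided by the product of the standard deviations. The sketch below is a plain-Python illustration with a hypothetical function name; it is not the notebook's original code.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    xs, ys = list(xs), list(ys)
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError('need two sequences of equal length with at least 2 values')
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Usage inside the notebook (assumes df_Top_Smart_TV is defined):
# pearson_r(df_Top_Smart_TV['actual_price'], df_Top_Smart_TV['discounted_price'])
```

The formula is symmetric in its two arguments, so swapping the columns gives the same value. It is undefined when either column is constant, because the denominator is then zero.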






In [None]:
# Calculating the correlation coefficient between  "rating_count" and 'discounted_price'
corr = df_Top_Smart_TV['rating_count'].corr(df_Top_Smart_TV['discounted_price'])
corr

In [None]:
# The correlation is symmetric, so swapping the two columns returns the same value as the previous cell
corr = df_Top_Smart_TV['discounted_price'].corr(df_Top_Smart_TV['rating_count'])
corr

### Identifying the correlations between some numerical fields on the Most popular Smart TV Dataframe


The correlation coefficient between 'rating_count' and 'discounted_price' is approximately 0.054, computed on `df_Top_Smart_TV`. This value is very close to 0 and indicates a very weak positive linear relationship between the number of ratings a product has received and its discounted price. In other words, changes in the discounted price do not correspond in any meaningful linear way to changes in the rating count, and vice versa.






In [None]:
# Prepare the data for a bar chart with one bar for the actual price and another for the discounted price
df_filtered = df_Top_Smart_TV
# Transform the DataFrame to long format: one row per product and price type
df_long = df_filtered.melt(id_vars=['rating', 'product_id','product_name', 'category'], value_vars=['actual_price', 'discounted_price'],
                           var_name='Price Type', value_name='Price')
df_long.head()

In [None]:
# Create the bar chart using the long format DataFrame
fig = px.bar(df_long, x='Price Type', y='Price', color='Price Type', barmode='group', hover_data=['category','product_id','product_name' ])

# Update layout for a clearer x-axis and set the title
fig.update_layout(
    title="Discounted Price Vs Actual Prices for Smart TVs with Rating Score 4",
    xaxis_title="Price Type",
    yaxis_title="Price (in INR)",
    xaxis_tickangle=-45,
    legend_title="Price Type",)
    
# Show the figure
fig.show()

In [None]:
#saving the chart into the output folder 
fig.write_html('../outputs_Viz/Amazon_bar1.html')


### Analysis of discounted price and actual price

The chart is a bar graph comparing the actual and discounted prices of Smart TVs with a rating score of 4 or above, with one bar showing the actual price of each product and another showing its discounted price. The y-axis shows the price in INR (Indian Rupees), and the x-axis shows the price type, distinguishing actual from discounted prices.

Actual prices are higher than discounted prices, which is expected, since a discount reduces the original amount. The scale of the y-axis also indicates the magnitude of the prices: actual prices reach up to about 150k INR, while discounted prices are consistently lower.