# **Top 50 Bestselling Novels on Amazon with RStudio**

## Scenario

I'm a junior data analyst working for a business intelligence consultant. I have been asked to lead a project for a brand new client. This will involve everything from defining the business task all the way through presenting my data-driven recommendations. I will choose the topic, ask the right questions, identify a good dataset and ensure its integrity, conduct analysis, create compelling data visualizations, and prepare a presentation.


This project is for Amazon. The client wants to reach its customers better in the novel segment: it wants to understand what those customers prefer and how it can improve novel sales on its platform.



## Ask




#### Objective

Find and present data-driven recommendations that help Amazon improve its market position in novel sales. Key factors are sales volume and customer preferences by category. This is an example of a problem that requires the analyst to categorize things: the company's goal is to improve customer satisfaction, measured here through rankings and reviews.

The primary stakeholders are MacKenzie Bezos (former wife of Jeff Bezos), Fidelity Management & Research Company, and BlackRock Institutional Trust Company. The audience is the Senior Manager, Operations Research & Supply Chain Analytics, and the secondary stakeholder is the Analytics Team Lead Manager.


## Prepare

Before I start manipulating the data, I want to make sure the dataset is a "good" one: it has to meet a standard of quality so that it can help me answer the questions and solve the problem my stakeholders have.

#### Data Source Description

The data comes from the Kaggle dataset [Top 50 Bestselling Novels 2009-2021 of Amazon](https://www.kaggle.com/datasets/zwl1234/top-50-bestselling-novels-20092021-of-amazon). The file contains data on the top 50 bestselling novels on Amazon for each year from 2009 to 2021. The data was collected from the amazon.com website and Kaggle. The inspiration behind it was Greywasp's Top 50 Bestselling Novels 2009-2020 of Amazon.

* **Format:** CSV file
* **Data period:** 2009-2021
* **License:** CC0: Public Domain
* **Size:** 37 KB
* **Sources:** Amazon and Kaggle; data collected from Amazon.com. Additionally, the book prices have been rounded up and match the prices as of 18 August 2022.
* **Author name:** Greywasp


#### Is the data reliable? 

The dataset appears accurate, complete and unbiased. It is released under the CC0 license, a legal document that dedicates a copyrighted work to the public domain. It is also hosted on a well-regarded data platform, which gives it a Usability score of 10 (a number calculated by Kaggle that represents how well a dataset is documented).

#### Is the data original?

Yes, the data was obtained directly from the Kaggle web page.

#### Is the data comprehensive?

Yes, the key fields are complete and sufficient to answer the question or find a solution.

#### Is the data current?

Yes, the data was uploaded this year, and the latest update as of today was on 9 August 2022.
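
One way to back the completeness claim is to count the rows per year, since a top-N list should contribute the same number of rows every year. The sketch below is only an illustration: `books` is a hypothetical miniature "top 2" dataset (not the real CSV), and `expected_per_year` is an assumed name.

```r
# Hypothetical miniature of the dataset: a "top 2" list for three years,
# with one row deliberately missing from 2011.
books <- data.frame(
  year = c(2009, 2009, 2010, 2010, 2011),
  name = c("A", "B", "C", "D", "E")
)

expected_per_year <- 2  # would be 50 for the real top-50 data

# Count rows per year and keep the years whose count differs from the target.
rows_per_year <- table(books$year)
incomplete_years <- names(rows_per_year)[as.vector(rows_per_year) != expected_per_year]

incomplete_years  # years that need a closer look
```

On the real data, the same check could be run on the imported data frame before any rows are dropped.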


## Process

Here comes the fun part. In this stage I begin the data wrangling to prepare the dataset for the analysis stage. The dataset is small enough that I could easily do this project in Excel, but I want to improve my programming skills and prepare myself for larger datasets, so I decided to use RStudio as the data wrangling and analysis tool. I will use Tableau for more polished visualizations so that the final work is clean for my stakeholders.

The structure for this will be:

* Identify relevant data
* Establish a Format
* Data Cleaning
* Validation


### Loading libraries

In [None]:

# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load
library(tidyverse) # metapackage of all tidyverse packages

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Importing data

In [None]:
#importing data

df <- read.csv('../input/top-50-bestselling-novels-20092021-of-amazon/Amazon Top 50 Books 2009-2021 - Reworked Sheet (1).csv')


### Exploring

In [None]:

glimpse(df)

In [None]:
summary(df)

### Formatting

In [None]:
# convert all column names to lower case

names(df) <- tolower(names(df))
head(df,5)

In [None]:
# removing commas from the reviews column and converting it to numeric
# (gsub() is vectorized, so no lapply() is needed)

df$reviews <- as.numeric(gsub(",", "", df$reviews))
head(df,5)
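
To make the comma-removal step concrete, here is a minimal sketch on a few made-up values. `raw_reviews` and `clean_reviews` are hypothetical names used only for illustration; the real column comes from the Kaggle CSV.

```r
# Hypothetical sample values formatted with thousands separators.
raw_reviews <- c("17,350", "2,052", "460", "1,234,567")

# Strip every comma (fixed = TRUE treats "," as a literal, not a regex),
# then convert the cleaned strings to numbers.
clean_reviews <- as.numeric(gsub(",", "", raw_reviews, fixed = TRUE))

# Strings that still fail to parse would become NA, so count them as a sanity check.
sum(is.na(clean_reviews))
```

Checking the number of `NA` values right after the conversion is a cheap way to notice unexpected formats, such as a stray currency symbol.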

In [None]:
#removing dollar sign

options(digits=2)
df$price <- as.numeric(gsub("\\$", "", df$price))
df$price_r <- as.numeric(gsub("\\$", "", df$price_r))
df$year <- as.numeric(df$year)
head(df,5)

In [None]:
# append the year to the name to differentiate entries that share the same title

df$name <- paste(df$name, df$year)
head(df)
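
The reason for joining name and year can be seen on a toy example: a title that makes the list in two different years is duplicated until the year is appended. The data frame below and the `name_year` column are hypothetical; the notebook itself overwrites `name` instead of adding a column.

```r
# Hypothetical data: "Book A" appears in the list for two different years.
books <- data.frame(
  name = c("Book A", "Book A", "Book B"),
  year = c(2019, 2020, 2020)
)

# anyDuplicated() returns the index of the first duplicate, or 0 if there is none.
anyDuplicated(books$name)       # 2: the title alone is not unique

# Appending the year gives each row its own label.
books$name_year <- paste(books$name, books$year)
anyDuplicated(books$name_year)  # 0: the combined label is unique
```

Keeping the combined label in a separate column is a design choice of this sketch: it preserves the original title for later grouping by book.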

In [None]:
# coercing multiple columns to factor

cols <- c("name","author","genre")
df[cols] <- lapply(df[cols],factor)
head(df,2)

In [None]:
#looking for NA values

colSums(is.na(df))

In [None]:
# looking for 0 in price column (inconsistency)
df%>%
filter(price==0)

In [None]:
# dropping observations where price = 0

df <- df %>%
  filter(price != 0)
head(df)

# check: no rows with price = 0 should remain
df %>%
  filter(price == 0) %>%
  count()

## Analyze the data

Now let's take a first look at the prepared dataset.



In [None]:
# top 10 based on highest number of reviews and user rating ("most selling and rating")

dftop <- df %>% arrange(desc(reviews),desc(user.rating))
dftop <- dftop[1:10,]
dftop

Based on the top 10, fiction seems to be the more popular genre, but not necessarily the better reading experience compared with non-fiction. Because we do not have more data to support this idea, it remains a hypothesis; we could improve our customer insight if we had the number of copies sold.

Another interesting point is that all the books in this top 10 are from 2020 and 2021, perhaps because of the pandemic, but I need more data to support that hypothesis.
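
The two observations above (genre mix and year range of the top entries) can be checked with a couple of base R calls. This is a minimal sketch on a made-up six-row data frame with a "top 3" instead of a top 10; the values and the `top3` name are hypothetical.

```r
# Hypothetical data with the same columns used in the analysis.
books <- data.frame(
  name        = paste("Book", LETTERS[1:6]),
  genre       = c("Fiction", "Fiction", "Non Fiction", "Fiction", "Non Fiction", "Fiction"),
  year        = c(2021, 2020, 2021, 2019, 2020, 2021),
  reviews     = c(900, 800, 700, 300, 600, 500),
  user.rating = c(4.5, 4.6, 4.8, 4.7, 4.9, 4.4)
)

# Sort by reviews, then by rating (both descending), and keep the first 3 rows.
top3 <- head(books[order(-books$reviews, -books$user.rating), ], 3)

table(top3$genre)  # how many top entries fall in each genre
range(top3$year)   # earliest and latest year among the top entries
```

Because reviews accumulate over time and are only a proxy for sales, these counts describe the ranking itself rather than actual purchases.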


In [None]:
# analysis of the top 10 by genre and price

tapply(dftop$price, dftop$genre, summary) #summary of genre
table(dftop$genre)

ggplot(dftop,aes(genre,price,colour=genre))+
geom_boxplot()

In this top 10, the mean price is almost the same in both genres: between 11 and 12 dollars after rounding.

In [None]:
tapply(dftop$user.rating, dftop$genre, summary) #summary of genre
table(dftop$genre)

ggplot(dftop, aes(genre,user.rating, colour=genre))+
geom_violin()

As mentioned before, within the top 10 books, customers rated the non-fiction genre higher than fiction.

In [None]:
#Price Analysis by genre

tapply(df$price, df$genre, summary) #summary of genre
table(df$genre)                     # count of the different levels in genre


ggplot(df, aes(x=price))+                                          #set the plot with xlabel = price
geom_histogram(aes(color= genre, fill = genre),                    #set histogram and separate categories by color
               position = "identity", bins = 100, alpha = 0.6) 


#### Non-fiction pays better
 
In general, the **mean price is 11 dollars for fiction books and 14.5 dollars for non-fiction**. The difference shows up in the typical price ranges: **fiction prices mostly fall between 6 and 13 dollars**, whereas **non-fiction prices mostly fall between 8 and 17 dollars**. This suggests that non-fiction customers are willing to pay more for their books than fiction customers.
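
A typical price range per genre can be read from the first and third quartiles. The sketch below shows one way to compute them in base R; the prices are made up, and `price_ranges` is a hypothetical name.

```r
# Hypothetical prices, five books per genre.
books <- data.frame(
  genre = rep(c("Fiction", "Non Fiction"), each = 5),
  price = c(6, 8, 10, 12, 14, 8, 11, 14, 17, 20)
)

# Split the prices by genre and compute the quartiles for each group.
# The result is a matrix with one column per genre and one row per quantile.
price_ranges <- sapply(
  split(books$price, books$genre),
  quantile,
  probs = c(0.25, 0.5, 0.75)
)

price_ranges
```

Reading the `25%` and `75%` rows side by side gives the range in which the middle half of each genre's prices falls, which is less sensitive to a few expensive titles than the mean.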


In [None]:
# User rating analysis by genre

tapply(df$user.rating, df$genre, summary) #summary of genre
table(df$genre)

ggplot(df, aes(genre,user.rating, colour=genre))+
geom_violin()

#### Similar mean rating, but non-fiction has a higher floor
 
Overall, both genres have a similar mean rating, but **non-fiction has a higher minimum, starting at a score of 4, while fiction starts near 3.5. There are also more non-fiction than fiction books in the dataset.**


In [None]:
# summary of reviews across the years

do.call(rbind,tapply(df$reviews, df$year, summary))

ggplot(df, aes(year,reviews,colour=genre))+
geom_point()+
geom_smooth(se= F)

### Customers are more willing to share their book opinions with others
 
**From 2019 to 2021, the mean number of reviews per book increased from about 16,000 to about 56,000.** The number of reviews has grown sharply over this period, which suggests that our customers are more willing to share their opinions with others.
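
The yearly means behind this kind of statement can be computed with `aggregate()`. The sketch below uses made-up review counts; `mean_reviews` and the `growth` column are hypothetical names.

```r
# Hypothetical review counts, two books per year.
books <- data.frame(
  year    = c(2019, 2019, 2020, 2020, 2021, 2021),
  reviews = c(10000, 20000, 30000, 50000, 40000, 80000)
)

# Mean number of reviews per year.
mean_reviews <- aggregate(reviews ~ year, data = books, FUN = mean)

# Growth relative to the first year in the table.
mean_reviews$growth <- mean_reviews$reviews / mean_reviews$reviews[1]

mean_reviews
```

A growth factor makes the change easier to communicate to stakeholders than two raw averages.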


In [None]:
# summary of price across the years

do.call(rbind,tapply(df$price, df$year, summary))

ggplot(df, aes(year,price,colour=genre))+
geom_point()+
geom_smooth(se=F)

### 11 dollars is a strong price
 
Even though the price fell from 15 to 12 dollars between 2009 and 2017, **it has remained at 11 dollars over the last 3 years.**


## Act
 
In this part I suggest some action points to improve sales, with insights based on the previous analysis.
 
### Conclusions 
 
* Non-fiction books have a larger presence among the bestsellers.
* Non-fiction customers are willing to pay more for books than fiction customers.
* Customers in general write more reviews than before.
* Customers are willing to pay 11 dollars or more for their books.
 
### Suggestion
 
1. Create a “loyalty” media campaign that rewards customers who share their reviews with others or help new customers purchase books on the Amazon platform. Also keep in mind the prices that customers are willing to pay in order to design offers focused on non-fiction books.
