<a href="https://colab.research.google.com/github/KacperKaszuba0608/Projects-R/blob/main/Creating_An_Efficient_Data_Analysis_Workflow_Part2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating An Efficient Data Analysis Workflow - Part2

## Introduction

In the project "Creating An Efficient Data Analysis Workflow", we are taking on
the role of an analyst for a book company. The company has provided us more data 
on some of its 2019 book sales, and it wants us to extract some usable knowledge 
from it. It launched a new program encouraging customers to buy more books on 
July 1st, 2019, and it wants to know if this new program was successful at 
increasing sales and improving review quality.

## Loading Libraries and Reading in File


In [1]:
library(tidyverse)
library(lubridate)

df <- read.csv("https://github.com/KacperKaszuba0608/Datasets/raw/main/sales2019.csv")
head(df)

“running command 'timedatectl' had status 1”
── [1mAttaching packages[22m ─────────────────────────────────────── tidyverse 1.3.1 ──

[32m✔[39m [34mggplot2[39m 3.4.0      [32m✔[39m [34mpurrr  [39m 1.0.1 
[32m✔[39m [34mtibble [39m 3.1.8      [32m✔[39m [34mdplyr  [39m 1.0.10
[32m✔[39m [34mtidyr  [39m 1.3.0      [32m✔[39m [34mstringr[39m 1.4.1 
[32m✔[39m [34mreadr  [39m 2.1.3      [32m✔[39m [34mforcats[39m 0.5.2 

── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()


Attaching package: ‘lubridate’


The following objects are masked from ‘package:base’:

    date, intersect, setdiff, union




Unnamed: 0_level_0,date,user_submitted_review,title,total_purchased,customer_type
Unnamed: 0_level_1,<chr>,<chr>,<chr>,<int>,<chr>
1,5/22/19,it was okay,Secrets Of R For Advanced Students,7.0,Business
2,11/16/19,Awesome!,R For Dummies,3.0,Business
3,6/27/19,Awesome!,R For Dummies,1.0,Individual
4,11/6/19,Awesome!,Fundamentals of R For Beginners,3.0,Individual
5,7/18/19,Hated it,Fundamentals of R For Beginners,,Business
6,1/28/19,Never read a better book,Secrets Of R For Advanced Students,1.0,Business


## Extracting informations


In [2]:
# Number of rows and columns
dim(df)

In [3]:
# Checking type of columns
glimpse(df)

Rows: 5,000
Columns: 5
$ date                  [3m[90m<chr>[39m[23m "5/22/19", "11/16/19", "6/27/19", "11/6/19", "7/…
$ user_submitted_review [3m[90m<chr>[39m[23m "it was okay", "Awesome!", "Awesome!", "Awesome!…
$ title                 [3m[90m<chr>[39m[23m "Secrets Of R For Advanced Students", "R For Dum…
$ total_purchased       [3m[90m<int>[39m[23m 7, 3, 1, 3, NA, 1, 5, NA, 7, 1, 7, NA, 3, 2, 0, …
$ customer_type         [3m[90m<chr>[39m[23m "Business", "Business", "Individual", "Individua…


In [4]:
# Checking missing values in each column
length(which(is.na(df$user_submitted_review)))
length(which(is.na(df$date)))
length(which(is.na(df$title)))
length(which(is.na(df$total_purchased)))
length(which(is.na(df$customer_type)))

Dataset consists of the 5 column and 5000 rows. Data has missing value in 
`total_purchased` column and `user_submitted_review`. Types of column is as follows:

* `date` - contain information about date in character type
* `user_submitted_review` - contains review from reader in character type
* `title` - title of a book in character type
* `total_purchased` - number of purchased books in range 0:12, integer type
* `customer_type` - type of customer in character type, with 2 level 'Business' and 'Individual'

## Data Cleaning


### Handling Missing Values

For total_purchased, we're going to replace all of the NA values with an
average value that we calculate from the complete dataset.

In [5]:
# Removing NA values from second column
df <- df %>%
  filter(!(is.na(user_submitted_review))) %>%
  mutate(total_purchased = ifelse(is.na(total_purchased),
                                  round(mean(!is.na(total_purchased)),3),
                                  total_purchased))

### Processing Review Data

At first I extract unique values from `user_submitted_review`. 
Then I choose some words which will be information about is a review 
positive or negative.

In [6]:
unique(df$user_submitted_review)

Above we have few sentence and I think postivie word/phrase can be: 'Awesome', 'okay', 
'learned a lot', 'Never read a better book' and 'OK'. The negative word/phrase can be:
'Hated', 'not needed', 'not recommend' and 'other books were better'. 

If I have this words, now I can create a function which will returns a value 
indicating if the review is positive or not.

In [7]:
p_or_n <- function(review){
  result <- case_when(str_detect(review, 'Awesome')~TRUE,
                      str_detect(review, 'okay')~TRUE,
                      str_detect(review, 'learned a lot')~TRUE,
                      str_detect(review, 'Never read a better book')~TRUE,
                      str_detect(review, 'OK')~TRUE,
                      TRUE~FALSE)
}

df <- df %>%
  mutate(positive_or_not = unlist(map(user_submitted_review, p_or_n)))

## Comparing Book Sales Between Pre- and Post-Program Sales

I can finally make a move towards answering the main question of the analysis, 
Was the new book program effective in increasing book sales? The program started 
on July 1, 2019 and the data we have contains all of the sales for 2019. But at 
first I have to change type od column to date type and check how many books was 
before and after July 1, 2019.

In [8]:
# Changing types of date column
df <- df %>%
  mutate(date = mdy(date),
         when_date = ifelse(date < '2019-07-01', 'Pre-', 'Post-'))

book_program_status <- df %>%
  group_by(when_date) %>%
  summarize(books_purchased = sum(total_purchased))

book_program_status

when_date,books_purchased
<chr>,<dbl>
Post-,7990.808
Pre-,8145.956


As we see the result of purchased book after the new book program on July 1, 2019
wasn't effective. We can see a decrease of number of books sold from approximately 
8167 to 7970 books. Coclusion of this program is that the new book program wasn't
effective for company.

## Comparing Book Sales Within Customer Type

In previous step of analysis we claim that the new book program hadn't been effective.
That's why I make a step further and check if it's possible that individual customers
responded better to the program and bought more books in response to the program.
Or, it could have been businesses that bought more books.

In [9]:
customers <- df %>%
  group_by(when_date, customer_type) %>%
  summarize(books_purchased = sum(total_purchased)) %>%
  arrange(customer_type, when_date)

customers

[1m[22m`summarise()` has grouped output by 'when_date'. You can override using the
`.groups` argument.


when_date,customer_type,books_purchased
<chr>,<chr>,<dbl>
Post-,Business,5535.822
Pre-,Business,5546.255
Post-,Individual,2454.986
Pre-,Individual,2599.701


Comparing Book Sales Within Customer Type doesn't give us me satisfied answer, because
still the new program doesn't bring benefits to the company.

## Comparing Review Sentiment Between Pre- and Post-Program Sales

The last question that I need to answer with the data is, **did review scores improve
as a result of the program?** The answer is below.

In [10]:
better_reviews <- df %>%
  group_by(when_date) %>%
  summarize(positive_or_not = sum(positive_or_not)) %>%
  arrange(-positive_or_not)

better_reviews

when_date,positive_or_not
<chr>,<int>
Pre-,1134
Post-,1128



The reviews seems to be better before the new book program, but this difference is 
very small.