# Exploratory Data Analysis in R for Absolute Beginners

Note: You can consult the solution for this workspace in the file browser (`notebook-solution.ipynb`).

In this code-along, we'll learn the basics of R, the Tidyverse package, and the Lubridate package to determine which user onboarding 
flow the product team at DataCamp should focus on this quarter.

We will be using a synthetic dataset titled "user_page_view_history.csv", available in our Workspace today.

This data set contains:
- Page view history of users.
- List of pages visited, the page that referred them, onboarding flow label, and the date of each visit.

## Table Of Contents
1. The Foundations
2. Exploring the Data
3. Onboarding Flows

## The Foundations of R

### What is R?

R is a programming language we can use to tell computers what to do! R can be used to solve problems as simple as 2 + 2 or as complex as prototyping a dashboard.

In [47]:
# We could use R like a calculator
2 + 2

### What is a variable?

Sometimes after we have performed a calculation, we would like to save the output. We can do this by storing outputs as variables!

You can think of a variable as a box. Much like the boxes we use on a day-to-day basis, **variables** can store objects and be given a name (an alias) so we can find them later. In R, we store data objects inside variables using the assignment arrow `<-`.

Here are three common data objects we like to store:
- **Numbers** (`thenumberfour <- 4`)
- **Strings** (`thewordhello <- "hello"`)
- **Results Of Calculations** (`resultoftwoplustwo <- 2+2`)

In [48]:
# Here are three examples
thenumberfour <- 4
thewordhello <- "hello"
resultoftwoplustwo <- 2 + 2

# We can view the contents of a variable by typing its name
resultoftwoplustwo
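
To check what kind of object a variable holds, base R's `class()` function can be used. The short sketch below shows that R stores a plain `4` as a general number ("numeric") and only treats it as an integer when an `L` is appended. The name `thenumberfour_int` is made up for this illustration.

```r
# A plain number is stored as "numeric"; adding L asks R for an integer
thenumberfour <- 4
thenumberfour_int <- 4L   # hypothetical name, for illustration only
thewordhello <- "hello"

class(thenumberfour)      # "numeric"
class(thenumberfour_int)  # "integer"
class(thewordhello)       # "character"
```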

### What is a function?

![functioninmath](functioninmath.jpg)

Similar to functions in mathematics, **functions** take some input and create some output. In computer science you can create custom functions (although we won't be covering that in this lesson) or use premade functions available to us in R. For example, we could write our own function that computes 2 + 2, or we could use the built-in function `sum(2, 2)`.

In [49]:
# Here is a simple and commonly used function
x <- sum(2,2)
x
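
Custom functions are outside the scope of this lesson, but for the curious, here is a minimal sketch of what one could look like. The name `add_two_numbers` and its arguments `a` and `b` are invented for this example.

```r
# add_two_numbers is a hypothetical function written for this sketch
add_two_numbers <- function(a, b) {
  a + b
}

# It gives the same answer as the built-in sum()
add_two_numbers(2, 2)
sum(2, 2)
```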

Today we will be using these twelve functions:
- [library()](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/library)
- [read_csv()](https://www.rdocumentation.org/packages/qtl2/versions/0.24/topics/read_csv)
- [select()](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/select)
- [filter()](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/filter)
- [mutate()](https://www.rdocumentation.org/packages/dplyr/versions/0.5.0/topics/mutate)
- [floor_date()](https://www.rdocumentation.org/packages/lubridate/versions/1.3.3/topics/floor_date)
- [group_by()](https://www.rdocumentation.org/packages/dplyr/versions/1.0.10/topics/group_by)
- [summarise()](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/summarise)
- [n()](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/n)
- [mean()](https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/mean)
- [ungroup()](https://www.rdocumentation.org/packages/multiplyr/versions/0.1.1/topics/ungroup)
- [count()](https://www.rdocumentation.org/packages/plyr/versions/1.8.8/topics/count)

Let's try using the `library()` function to load the `tidyverse` package.

In [50]:
# Load tidyverse
library(tidyverse)

### What is a package?

The `sum()` function we used earlier comes from a package that is preloaded when R starts. For data analysis, we will be using the `tidyverse` package, which contains many functions for manipulating data.
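
As a small illustration, `sum()` lives in the `base` package, which R attaches automatically. The sketch below uses only base R: `search()` lists the attached packages, and the `package::function` form names the package a function comes from.

```r
# search() lists everything R has attached; base is always there
"package:base" %in% search()

# The package::function form spells out where a function comes from
base::sum(2, 2)
```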

Now that we know the basics, let's get started!

## Exploratory Analysis

To figure out which onboarding flow we should optimize, we have to figure out where users are coming from, where they are going, and when they are doing it. To do this, we need to familiarize ourselves with the data we have available. We typically do this by conducting exploratory analysis!

### Loading the Data

First things first, let's load our data.

#### Instructions

The dataset is stored in a CSV file named `user_page_view_history.csv` in the `data` folder.
- Load the data from the CSV file "user_page_view_history.csv" using the `read_csv` function and store the dataset to the variable `user_page_view_history`.
- View the data by typing out `user_page_view_history`.

In [51]:
# Read the CSV file "user_page_view_history.csv" and assign it to the user_page_view_history variable
user_page_view_history <- read_csv('data/user_page_view_history.csv')

# Type the variable name to view the result!
user_page_view_history

Rows: 40000 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): current_page_url, referral_page_url, user_flow
date (1): date_visited

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


current_page_url,referral_page_url,user_flow,date_visited
<chr>,<chr>,<chr>,<date>
www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels,blogToCourse,2022-08-04
www.datacamp.com/users/sign_up,www.datacamp.com/blog/data-analytics-projects-all-levels,blogToSignup,2022-05-17
www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels,blogToCourse,2022-11-02
www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels,blogToCourse,2022-05-06
www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels,blogToCourse,2022-10-26
www.datacamp.com/users/sign_up,www.datacamp.com/tutorial/create-histogram-plotly,tutorialToSignup,2022-01-19
www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels,blogToCourse,2022-02-10
www.datacamp.com/users/sign_up,www.datacamp.com/tutorial/create-histogram-plotly,tutorialToSignup,2022-11-03
www.datacamp.com/users/sign_up,www.datacamp.com/blog/data-analytics-projects-all-levels,blogToSignup,2022-01-21
www.datacamp.com/users/sign_up,www.datacamp.com/blog/data-analytics-projects-all-levels,blogToSignup,2022-12-12
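
The message printed by `read_csv()` suggests specifying the column types. Here is a minimal sketch of how that could look, assuming the `readr` package (part of the tidyverse). So that it runs on its own, it writes two rows copied from the output above to a temporary file; the names `tmp` and `example_views` are made up for this illustration.

```r
library(readr)

# Write a tiny two-row copy of the data to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "current_page_url,referral_page_url,user_flow,date_visited",
  "www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels,blogToCourse,2022-08-04",
  "www.datacamp.com/users/sign_up,www.datacamp.com/tutorial/create-histogram-plotly,tutorialToSignup,2022-01-19"
), tmp)

# Stating the column types up front means read_csv() has nothing to guess,
# so it does not print the column specification message
example_views <- read_csv(
  tmp,
  col_types = cols(
    current_page_url  = col_character(),
    referral_page_url = col_character(),
    user_flow         = col_character(),
    date_visited      = col_date()
  )
)

example_views
```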


This looks a bit difficult to read. Let's reorganise the columns below using the `select()` function.

In [52]:
# Reorganise the columns using select
user_page_view_history %>%
	select(date_visited, user_flow, current_page_url, referral_page_url)

date_visited,user_flow,current_page_url,referral_page_url
<date>,<chr>,<chr>,<chr>
2022-08-04,blogToCourse,www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels
2022-05-17,blogToSignup,www.datacamp.com/users/sign_up,www.datacamp.com/blog/data-analytics-projects-all-levels
2022-11-02,blogToCourse,www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels
2022-05-06,blogToCourse,www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels
2022-10-26,blogToCourse,www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels
2022-01-19,tutorialToSignup,www.datacamp.com/users/sign_up,www.datacamp.com/tutorial/create-histogram-plotly
2022-02-10,blogToCourse,www.datacamp.com/courses,www.datacamp.com/blog/data-analytics-projects-all-levels
2022-11-03,tutorialToSignup,www.datacamp.com/users/sign_up,www.datacamp.com/tutorial/create-histogram-plotly
2022-01-21,blogToSignup,www.datacamp.com/users/sign_up,www.datacamp.com/blog/data-analytics-projects-all-levels
2022-12-12,blogToSignup,www.datacamp.com/users/sign_up,www.datacamp.com/blog/data-analytics-projects-all-levels
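
When only a couple of columns need to move to the front, typing every column name can be avoided. The sketch below assumes the `dplyr` package and uses its `everything()` helper, which stands for "all remaining columns". The two-row `example_views` table is a small stand-in built from rows shown above.

```r
library(dplyr)

# A tiny stand-in for the real dataset
example_views <- tibble(
  current_page_url  = c("www.datacamp.com/courses", "www.datacamp.com/users/sign_up"),
  referral_page_url = c("www.datacamp.com/blog/data-analytics-projects-all-levels",
                        "www.datacamp.com/tutorial/create-histogram-plotly"),
  user_flow         = c("blogToCourse", "tutorialToSignup"),
  date_visited      = as.Date(c("2022-08-04", "2022-01-19"))
)

# Put date_visited and user_flow first, then keep every other column as-is
reordered <- example_views %>%
  select(date_visited, user_flow, everything())

reordered
```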


#### Thoughts

After loading the dataset, there are a few key observations to note. We have a dataset titled `user_page_view_history` containing four columns:
- `current_page_url`: The URL of the current webpage.
- `referral_page_url`: The URL of the previous webpage.
- `user_flow`: Category for each onboarding flow.
- `date_visited`: The date of the page view.

We have 40,000 rows, where each row contains the page viewed, the page that led the user there, an onboarding flow label, and the date the page view happened.

### What pages do users enter DataCamp from?

Let's start by figuring out what pages users enter DataCamp from and how many users enter through each page!

#### Instructions

1. Use the `group_by()` function to group by the referral page in the `referral_page_url` column.
2. Use the `summarise()` and `n()` functions to count the number of rows for each referral page.
3. Use a line plot to better visualize your data.

In [53]:
# Add code below!
user_page_view_history %>%
	group_by(referral_page_url) %>%
	summarise(page_visits = n()) %>%
	ungroup()

referral_page_url,page_visits
<chr>,<int>
www.datacamp.com/blog/data-analytics-projects-all-levels,19891
www.datacamp.com/tutorial/create-histogram-plotly,20109


This can also be simplified using the count() function.

In [54]:
# Add code below! Simplify using count()
user_page_view_history %>%
	count(referral_page_url)

referral_page_url,n
<chr>,<int>
www.datacamp.com/blog/data-analytics-projects-all-levels,19891
www.datacamp.com/tutorial/create-histogram-plotly,20109
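
To see that the two approaches agree, here is a small sketch on made-up data, assuming `dplyr`. The shortened labels "blog" and "tutorial" are placeholders, not real values from the dataset. The last call uses the `sort = TRUE` argument of `count()`, which puts the largest group first.

```r
library(dplyr)

# Made-up stand-in data: five page views from two referral pages
example_views <- tibble(
  referral_page_url = c("blog", "tutorial", "tutorial", "blog", "tutorial")
)

# Long form: group, count the rows in each group, then ungroup
long_form <- example_views %>%
  group_by(referral_page_url) %>%
  summarise(n = n()) %>%
  ungroup()

# Short form: count() does the same work in one step
short_form <- example_views %>%
  count(referral_page_url)

# Both give the same counts
long_form$n == short_form$n

# sort = TRUE orders the result from most to least frequent
sorted_form <- example_views %>%
  count(referral_page_url, sort = TRUE)

sorted_form
```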


#### Thoughts

Take notes on your findings below by hovering over the cell below and clicking Add Text!

It looks like users enter our website through two different webpages:
- [A Blog](https://www.datacamp.com/blog/data-analytics-projects-all-levels)
- [A Tutorial](https://www.datacamp.com/tutorial/create-histogram-plotly)

The tutorial page appears to bring more users onto our website, but only by a thin margin (20,109 visits versus 19,891).

In short: users enter DataCamp in two ways, through a blog post or a tutorial. The tutorial has slightly more traffic, but only by a very thin margin.

### What do they do after?

Now you try! What pages do users visit after they land? How many users visit each page?

#### Instructions

Add your code below, visualize your results, and take notes on your findings!

In [55]:
# Add code below! Try solving using the group_by() function.
user_page_view_history %>%
	group_by(current_page_url) %>%
	summarise(page_visits = n()) %>%
	ungroup()

current_page_url,page_visits
<chr>,<int>
www.datacamp.com/courses,19952
www.datacamp.com/users/sign_up,20048


#### Thoughts

Take notes on your findings below!

It looks like users go on to two different webpages after they land:
- [Courses](https://www.datacamp.com/courses)
- [Registration](https://www.datacamp.com/users/sign_up)

The registration page appears to be the most popular next step for users on our website, but only by a thin margin (20,048 visits versus 19,952).

In short: users go on to either the courses page or the sign-up page, with roughly equal traffic.

### What paths do users take through onboarding?

#### Instructions

Add your code below, visualize your results, and take notes on your findings!

In [56]:
# Add code below! Try solving using the count() function. 
user_page_view_history %>%
	count(user_flow) %>%
	ungroup()

user_flow,n
<chr>,<int>
blogToCourse,9948
blogToSignup,9943
tutorialToCourse,10004
tutorialToSignup,10105


#### Thoughts

Take notes on your findings below!


There are four flows users tend to follow when onboarding. This makes sense because there are two referral pages and two current pages. Although there is a bit of variability, all four onboarding flows seem like equal candidates.
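
To put that variability into numbers, each flow's share of the total traffic can be computed. This is a minimal sketch assuming `dplyr`. The counts are copied from the `count(user_flow)` output above, and the `share` column name is made up.

```r
library(dplyr)

# Counts copied from the count(user_flow) output
flow_counts <- tibble(
  user_flow = c("blogToCourse", "blogToSignup", "tutorialToCourse", "tutorialToSignup"),
  n = c(9948, 9943, 10004, 10105)
)

# Divide each count by the total to get that flow's share of all page views
flow_shares <- flow_counts %>%
  mutate(share = n / sum(n))

flow_shares
```

Every share lands within half a percentage point of 25%, which is why none of the flows stands out yet.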

At this point it looks like we can choose any path. How can we make a better decision?

### When did these visits happen?

We have one more column we haven't explored yet, the date column (`date_visited`)! It's important to know how old our data is and what time frame it covers. Let's visualize the number of page visits over time!

#### Instructions

Graph the number of visits over time! If the graph is difficult to read, group by the week instead of the day.

In [57]:
library(lubridate)

user_page_view_history %>%
	filter(date_visited <= "2022-12-31") %>%
	mutate(week = floor_date(date_visited, "week")) %>%
	group_by(week) %>%
	summarise(page_visits = n()) %>%
	ungroup()

week,page_visits
<date>,<int>
2022-01-09,699
2022-01-16,863
2022-01-23,808
2022-01-30,813
2022-02-06,824
2022-02-13,804
2022-02-20,784
2022-02-27,826
2022-03-06,833
2022-03-13,818


#### Thoughts

Take notes on your findings below!

There are two major takeaways from this plot. First, our data goes back as far as January 10, 2022. Second, traffic dropped during July 2022. This is our first major finding! Until now, traffic appeared relatively consistent: we used to average around 120 visits per day, but that number fell during July 2022. This graph opens the door to a number of questions!
- What was the most visits we've had in a day?
- What day did we have the least? Did something happen?
- What is the average?
- What happened between July 2022 and August 2022?

#### Instructions

Quantify how large the drop in visitations was using the formula:

(new average - old average) / old average
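
The formula can be written as a small R helper. This is only a sketch: the name `percent_change` is made up, and the numbers 100 and 90 are invented to show how it behaves, not taken from the dataset.

```r
# Hypothetical helper that applies the formula above
percent_change <- function(new_average, old_average) {
  (new_average - old_average) / old_average
}

# Made-up numbers: going from 100 to 90 is a change of -0.1, a 10% drop
percent_change(90, 100)
```

A negative result means a decrease, and multiplying by 100 turns it into a percentage.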

In [58]:
page_views_weekly <- user_page_view_history %>%
	filter(date_visited <= "2022-12-31") %>%
	mutate(week = floor_date(date_visited, "week")) 

page_views_weekly %>%
	group_by(week) %>%
	summarise(page_visits = n()) %>%
	ungroup() %>%
	filter(week >= "2022-08-01") %>%
	mutate(average_views = mean(page_visits))

week,page_visits,average_views
<date>,<int>,<dbl>
2022-08-07,759,711.3333
2022-08-14,716,711.3333
2022-08-21,700,711.3333
2022-08-28,691,711.3333
2022-09-04,678,711.3333
2022-09-11,749,711.3333
2022-09-18,677,711.3333
2022-09-25,751,711.3333
2022-10-02,678,711.3333
2022-10-09,668,711.3333


#### Thoughts

Take notes on your findings below!

Before August 1st, 2022 we averaged 801 views per week; now we average 711. This is an 11% decrease in traffic!
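
The cell above only computes the average after August 1st. One way to get both averages at once is to label each week as "before" or "after" and then group by that label. The sketch below assumes `dplyr` and uses four made-up weekly counts, so its numbers will not match the real data.

```r
library(dplyr)

# Made-up weekly counts: two weeks before August 1st and two weeks after
weekly_counts <- tibble(
  week = as.Date(c("2022-07-17", "2022-07-24", "2022-08-07", "2022-08-14")),
  page_visits = c(800, 820, 700, 720)
)

# Label each week, then average the visits within each label
period_averages <- weekly_counts %>%
  mutate(period = if_else(week < as.Date("2022-08-01"), "before", "after")) %>%
  group_by(period) %>%
  summarise(average_views = mean(page_visits)) %>%
  ungroup()

period_averages
```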

### How has the landing page performed over time?

Now that we have a better handle on our data, let's figure out which onboarding flow to focus on. To decide between the flows, let's see how each landing page is performing over time!

In [59]:
# Add code below!
#page_views_weekly %>%
	#group_by(week, current_page_url) %>%
	#summarise(page_visits = n()) %>%
	#ungroup()

user_page_view_history %>%
	count(date_visited)


date_visited,n
<date>,<int>
2022-01-10,122
2022-01-11,122
2022-01-12,112
2022-01-13,114
2022-01-14,99
2022-01-15,130
2022-01-16,112
2022-01-17,125
2022-01-18,104
2022-01-19,123


#### Thoughts

Take notes on your findings below!

When we group by the date visited and the referral page, we see that the tutorial page has retained consistent performance over the past year, but the blog page has suffered a major drop in traffic. This opens the door to multiple questions:

- Why hasn't the tutorial page grown over time? Is this an area we need to expand?
- What happened to the blog page in July? Could this be an engineering bug?
- How did this affect the subsequent pages?

When we perform the same exercise on our current page urls, we see a similar pattern on our registration pages. It would also appear that the majority of registration traffic was driven by our blog page. Depending on the percentage of users that purchase after registration this could be a serious cause for concern!
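
The grouping described in these notes can be sketched as follows, assuming `dplyr`, `lubridate`, and `ggplot2`. The six rows of data are made up, with "blog" and "tutorial" as placeholder labels, so the result only shows the shape of the calculation. Swapping `referral_page_url` for `current_page_url` gives the second comparison.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Made-up page views: three in early July and three in early August
example_views <- tibble(
  date_visited = as.Date(c("2022-07-04", "2022-07-05", "2022-07-06",
                           "2022-08-08", "2022-08-09", "2022-08-10")),
  referral_page_url = c("blog", "blog", "tutorial", "blog", "tutorial", "tutorial")
)

# Count the page views for every combination of week and referral page
weekly_by_referral <- example_views %>%
  mutate(week = floor_date(date_visited, "week")) %>%
  count(week, referral_page_url, name = "page_visits")

weekly_by_referral

# One line per referral page makes it easier to compare them over time
referral_over_time <- ggplot(weekly_by_referral,
                             aes(x = week, y = page_visits, colour = referral_page_url)) +
  geom_line() +
  labs(x = "Week", y = "Page visits", colour = "Referral page")

referral_over_time
```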


### Which onboarding flow should the product team focus on?

When we first looked at the data in aggregate, it did not seem to matter which onboarding path we chose to optimize, as all four had roughly the same traffic. Upon further investigation, it seems that the blog page and the registration page have both seen a large drop in traffic. Let's see if the drop in blog page views is the cause of the drop in registration page views.

#### Instructions

Now you try! Graph the `user_flow` counts over time and record your findings!

In [60]:
# Add code below!
# TODO
#page_views_weekly %>%
#	group_by(week, user_flow) %>%
#	summarise(page_visits = n()) %>%
#	ungroup()

user_page_view_history %>%
	mutate(week = floor_date(date_visited, "week")) %>%
	count(week, user_flow)

week,user_flow,n
<date>,<chr>,<int>
2022-01-09,blogToCourse,151
2022-01-09,blogToSignup,223
2022-01-09,tutorialToCourse,167
2022-01-09,tutorialToSignup,158
2022-01-16,blogToCourse,195
2022-01-16,blogToSignup,265
2022-01-16,tutorialToCourse,208
2022-01-16,tutorialToSignup,195
2022-01-23,blogToCourse,199
2022-01-23,blogToSignup,237


#### Instructions

Now you try! Calculate the average page views before and after the first of August!

In [61]:
# Add code below!
page_views_weekly %>%
	group_by(week, user_flow) %>%
	summarise(page_visits = n()) %>%
	ungroup() %>%
	filter(week >= "2022-08-01",
          user_flow == "blogToSignup") %>%
	mutate(average = mean(page_visits))

[1m[22m`summarise()` has grouped output by 'week'. You can override using the
`.groups` argument.


week,user_flow,page_visits,average
<date>,<chr>,<int>,<dbl>
2022-08-07,blogToSignup,142,133.1905
2022-08-14,blogToSignup,142,133.1905
2022-08-21,blogToSignup,112,133.1905
2022-08-28,blogToSignup,138,133.1905
2022-09-04,blogToSignup,133,133.1905
2022-09-11,blogToSignup,133,133.1905
2022-09-18,blogToSignup,154,133.1905
2022-09-25,blogToSignup,139,133.1905
2022-10-02,blogToSignup,135,133.1905
2022-10-09,blogToSignup,119,133.1905
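
The message from `summarise()` appears because the data was grouped by two columns, and by default only the last grouping is dropped. A possible way to avoid it is the `.groups = "drop"` argument, which removes all grouping in one step so that `ungroup()` is no longer needed. The sketch below assumes `dplyr` 1.0.0 or later and uses three made-up rows.

```r
library(dplyr)

# Made-up stand-in data: three page views across two weeks
example_views <- tibble(
  week = as.Date(c("2022-08-07", "2022-08-07", "2022-08-14")),
  user_flow = c("blogToSignup", "blogToSignup", "blogToCourse")
)

# .groups = "drop" returns an ungrouped result and silences the message
flow_counts <- example_views %>%
  group_by(week, user_flow) %>%
  summarise(page_visits = n(), .groups = "drop")

flow_counts
```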


#### Thoughts

Take notes on your findings below! 

The blog-to-sign-up flow was once the most popular onboarding flow and has seen a 42% drop since July 2022. This should be brought up to the product team.