# 1. Ticket Sales Data

### Importing the data

Welcome! This course will give you some additional practice with importing and cleaning data through a series of four case studies. We’re assuming you’ve already completed these four courses:

 - Intro to R
 - Intermediate R
 - Importing Data Into R
 - Cleaning Data in R
 
If not, you’ll want to check those out first.

You'll be importing and cleaning four real datasets that are a little messier than before. Don't worry -- you're up for the challenge!

Your first dataset describes online ticket sales for various events across the country. It's stored as a comma-separated values (CSV) file called sales.csv. Let's jump right in!

Instructions
 - Import sales.csv to the variable sales using the read.csv() function. Set the stringsAsFactors argument to FALSE so that character strings are preserved.

In [1]:
# Import sales.csv: sales
sales <- read.csv(file = "sales.csv", stringsAsFactors = FALSE)
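
To see what stringsAsFactors changes, here is a minimal sketch that reads the same CSV text twice. The inline text and its column names are made up for illustration and stand in for sales.csv.

```r
# Hypothetical CSV contents standing in for sales.csv
csv_text <- "event_name,tickets
Concert A,2
Concert B,4"

# Same data, read with and without converting strings to factors
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factors$event_name)  # "factor"
class(as_strings$event_name)  # "character"
```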

### Examining the data

As you know from the Cleaning Data in R course, the first step when preparing to clean data is to inspect it. Let's refresh your memory on some useful functions that can do that:

 - dim() returns the dimensions of an object
 - head() displays the first part of an object
 - names() returns the names associated with an object

The sales data frame you created in the last exercise is pre-loaded in your workspace.

Instructions
 - View the dimensions of your new sales data frame.
 - Inspect the first 6 rows of sales.
 - View the column names of sales.

In [2]:
# View dimensions of sales
dim(sales)

# Inspect first 6 rows of sales
head(sales, 6)

# View column names of sales
names(sales)

X,event_id,primary_act_id,secondary_act_id,purch_party_lkup_id,event_name,primary_act_name,secondary_act_name,major_cat_name,minor_cat_name,...,edu_1st_indv_val,edu_2nd_indv_val,adults_in_hh_num,married_ind,child_present_ind,home_owner_ind,occpn_val,occpn_1st_val,occpn_2nd_val,dist_to_ven
1,abcaf1adb99a935fc661,43f0436b905bfa7c2eec,b85143bf51323b72e53c,7dfa56dd7d5956b17587,Xfinity Center Mansfield Premier Parking: Florida Georgia Line,XFINITY Center Mansfield Premier Parking,,MISC,PARKING,...,,,,,,,,,,
2,6c56d7f08c95f2aa453c,1a3e9aecd0617706a794,f53529c5679ea6ca5a48,4f9e6fc637eaf7b736c2,Gorge Camping - dave matthews band - sept 3-7,Gorge Camping,Dave Matthews Band,MISC,CAMPING,...,,,,,,,,,,59.0
3,c7ab4524a121f9d687d2,4b677c3f5bec71eec8d1,b85143bf51323b72e53c,6c2545703bd527a7144d,Dodge Theatre Adams Street Parking - benise,Parking Event,,MISC,PARKING,...,,,,,,,,,,
4,394cb493f893be9b9ed1,b1ccea01ad6ef8522796,b85143bf51323b72e53c,527d6b1eaffc69ddd882,Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow,Gexa Energy Pavilion VIP Parking,,MISC,PARKING,...,,,,,,,,,,
5,55b5f67e618557929f48,91c03a34b562436efa3c,b85143bf51323b72e53c,8bd62c394a35213bdf52,Premier Parking - motley crue,White River Amphitheatre Premier Parking,,MISC,PARKING,...,,,,,,,,,,
6,4f10fd8b9f550352bd56,ac4b847b3fde66f2117e,63814f3d63317f1b56c4,3b3a628f83135acd0676,Fast Lane Access: Journey,Fast Lane Access,Journey,MISC,SPECIAL ENTRY (UPSELL),...,,,,,,,,,,
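
As a quick refresher, the three functions can be tried on a small stand-in data frame. The data frame toy and its columns are hypothetical and not part of the sales data.

```r
# Hypothetical stand-in for sales
toy <- data.frame(
  event_name = c("Concert A", "Concert B", "Parking C"),
  tickets = c(2, 4, 1),
  stringsAsFactors = FALSE
)

dim(toy)      # rows, then columns: 3 2
head(toy, 2)  # first 2 rows only
names(toy)    # "event_name" "tickets"
```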


### Summarizing the data

Luckily, the rows and columns appear to be arranged in a meaningful way: each row represents an observation and each column a variable, or piece of information about that observation.

In R, there are a great many tools at your disposal to help get a feel for your data. Besides the three you used in the previous exercise, the functions str() and summary() can be very helpful.

The dplyr package, introduced in Cleaning Data in R, offers the glimpse() function, which can also be used for this purpose. The package is already installed on DataCamp; you just need to load it.

Instructions
  - Look at the structure of sales.
  - View a summary of your data.
  - Load the dplyr package.
  - Use glimpse() to look at your data.

In [3]:
# Look at structure of sales
str(sales)

# View a summary of sales
summary(sales)

# Load dplyr
library(dplyr)

# Get a glimpse of sales
glimpse(sales)

'data.frame':	5000 obs. of  46 variables:
 $ X                     : int  1 2 3 4 5 6 7 8 9 10 ...
 $ event_id              : chr  "abcaf1adb99a935fc661" "6c56d7f08c95f2aa453c" "c7ab4524a121f9d687d2" "394cb493f893be9b9ed1" ...
 $ primary_act_id        : chr  "43f0436b905bfa7c2eec" "1a3e9aecd0617706a794" "4b677c3f5bec71eec8d1" "b1ccea01ad6ef8522796" ...
 $ secondary_act_id      : chr  "b85143bf51323b72e53c" "f53529c5679ea6ca5a48" "b85143bf51323b72e53c" "b85143bf51323b72e53c" ...
 $ purch_party_lkup_id   : chr  "7dfa56dd7d5956b17587" "4f9e6fc637eaf7b736c2" "6c2545703bd527a7144d" "527d6b1eaffc69ddd882" ...
 $ event_name            : chr  "Xfinity Center Mansfield Premier Parking: Florida Georgia Line" "Gorge Camping - dave matthews band - sept 3-7" "Dodge Theatre Adams Street Parking - benise" "Gexa Energy Pavilion Vip Parking : kid rock with sheryl crow" ...
 $ primary_act_name      : chr  "XFINITY Center Mansfield Premier Parking" "Gorge Camping" "Parking Event" "Gexa Energy Pavilion VI

       X          event_id         primary_act_id     secondary_act_id  
 Min.   :   1   Length:5000        Length:5000        Length:5000       
 1st Qu.:1251   Class :character   Class :character   Class :character  
 Median :2500   Mode  :character   Mode  :character   Mode  :character  
 Mean   :2500                                                           
 3rd Qu.:3750                                                           
 Max.   :5000                                                           
                                                                        
 purch_party_lkup_id  event_name        primary_act_name   secondary_act_name
 Length:5000         Length:5000        Length:5000        Length:5000       
 Class :character    Class :character   Class :character   Class :character  
 Mode  :character    Mode  :character   Mode  :character   Mode  :character  
                                                                             
                          

"package 'dplyr' was built under R version 3.3.3"
Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union



Observations: 5,000
Variables: 46
$ X                      <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ event_id               <chr> "abcaf1adb99a935fc661", "6c56d7f08c95f2aa453...
$ primary_act_id         <chr> "43f0436b905bfa7c2eec", "1a3e9aecd0617706a79...
$ secondary_act_id       <chr> "b85143bf51323b72e53c", "f53529c5679ea6ca5a4...
$ purch_party_lkup_id    <chr> "7dfa56dd7d5956b17587", "4f9e6fc637eaf7b736c...
$ event_name             <chr> "Xfinity Center Mansfield Premier Parking: F...
$ primary_act_name       <chr> "XFINITY Center Mansfield Premier Parking", ...
$ secondary_act_name     <chr> "NULL", "Dave Matthews Band", "NULL", "NULL"...
$ major_cat_name         <chr> "MISC", "MISC", "MISC", "MISC", "MISC", "MIS...
$ minor_cat_name         <chr> "PARKING", "CAMPING", "PARKING", "PARKING", ...
$ la_event_type_cat      <chr> "PARKING", "INVALID", "PARKING", "PARKING", ...
$ event_disp_name        <chr> "Xfinity Center Mansfield Premier Parking: F...
$ ticket_text     
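
The output above shows that summary() reports quartiles for numeric columns but only length, class, and mode for character columns. A small sketch with a made-up data frame (toy is hypothetical) makes the difference easy to see.

```r
# Hypothetical data frame with one numeric and one character column
toy <- data.frame(
  X = 1:5,
  event_name = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

str(toy)  # one line per column: type and first values

num_summary <- summary(toy$X)           # quartiles and mean
chr_summary <- summary(toy$event_name)  # length, class, and mode only

num_summary[["Median"]]  # 3
chr_summary
```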

### Removing redundant info

You may have noticed that the first column of data is just a duplication of the row numbers. Not very useful. Go ahead and delete that column.

Remember that nrow() and ncol() return the number of rows and columns in a data frame, respectively.

Also, recall that you can use square brackets to subset a data frame as follows:

`my_df[1:5, ]  # First 5 rows of my_df`

`my_df[, 4]    # Fourth column of my_df`

Alternatively, you can remove rows and columns using negative indices. For example:

`my_df[-(1:5), ]  # Omit first 5 rows of my_df`

`my_df[, -4]      # Omit fourth column of my_df`

Instructions

 - Take a subset of sales to omit the first column. Assign the result to sales2.

In [5]:
## sales is available in your workspace

# Remove the first column of sales: sales2
sales2 <- sales[, -1]
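
Negative-index subsetting can be checked on a small example. This is a minimal sketch with a hypothetical data frame toy; it uses ncol() to confirm that exactly one column was dropped.

```r
# Hypothetical data frame whose first column just repeats the row numbers
toy <- data.frame(
  X = 1:3,
  event_id = c("a", "b", "c"),
  tickets = c(2, 4, 1),
  stringsAsFactors = FALSE
)

# Omit the first column
toy2 <- toy[, -1]

ncol(toy)    # 3
ncol(toy2)   # 2
names(toy2)  # "event_id" "tickets"
```

If only one column would remain, toy[, -1] returns a plain vector rather than a data frame; adding drop = FALSE keeps the data frame structure.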

### Information not worth keeping

Many of the columns have information that's of no use to us. For example, the first four columns contain internal codes representing particular events. The last fifteen columns also aren't worth keeping; there are too many missing values to make them worthwhile.

An easy way to get rid of unnecessary columns is to create a vector containing the column indices you want to keep, then subset the data based on that vector using single bracket subsetting.

Instructions
 - Create a vector called keep that contains the indices of the columns you want to save. Remember: you want to keep everything besides the first 4 and last 15 columns of sales2.
 - Subset the columns of sales2 using your vector and assign the result to sales3.

In [6]:
## sales2 is available in your workspace

# Define a vector of column indices: keep
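# Skip the first 4 and last 15 columns (this is 5:30, since sales2 has 45 columns)
keep <- 5:(ncol(sales2) - 15)

# Subset sales2 using keep: sales3
sales3 <- sales2[, keep]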
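
The same keep-vector pattern, sketched on a hypothetical 10-column data frame where the first 2 and last 3 columns are dropped. Computing the upper bound from ncol() avoids hard-coding the column count.

```r
# Hypothetical data frame with columns V1 to V10
toy <- as.data.frame(matrix(1:20, nrow = 2, ncol = 10))

# Keep everything except the first 2 and last 3 columns
keep <- 3:(ncol(toy) - 3)
toy_kept <- toy[, keep]

keep             # 3 4 5 6 7
names(toy_kept)  # "V3" "V4" "V5" "V6" "V7"
```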

### Separating columns

Some of the columns in your data frame include multiple pieces of information that should be in separate columns. In this exercise, you will separate such a column into two: one for date and one for time. You will use the separate() function from the tidyr package (already installed for you).

Take a look at the event_date_time column by typing head(sales3$event_date_time) in the console. You'll notice that the date and time are separated by a space. Therefore, you'll use sep = " " as an argument to separate().

Instructions
 - Load the tidyr package.
 - Split the event_date_time column of sales3 into "event_dt" and "event_time". Assign the result to sales4.
 - Split the sales_ord_create_dttm column of sales4 into "ord_create_dt" and "ord_create_time". Assign the result to sales5.

In [7]:
## sales3 is pre-loaded in your workspace

# Load tidyr
library(tidyr)

# Split event_date_time: sales4
sales4 <- separate(sales3, event_date_time, c("event_dt", "event_time"), sep = " ")

# Split sales_ord_create_dttm: sales5
sales5 <- separate(sales4, sales_ord_create_dttm, c("ord_create_dt","ord_create_time"), sep = " ")

"Too few values at 4 locations: 2516, 3863, 4082, 4183"

Uh oh! Did you see the warning message that just popped up in the console? No need to panic (yet). You'll sort it out in the next exercise.

### Dealing with warnings

Looks like that second call to separate() threw a warning. Not to worry; warnings aren't as bad as error messages. It's not saying that the command didn't execute; it's just a heads-up that something unusual happened.

The warning says Too few values at 4 locations. You may be able to guess already what the issue is, but it's still good to take a look.

The locations (i.e. rows) given in the warning are 2516, 3863, 4082, and 4183. Have a look at the contents of the sales_ord_create_dttm column in those rows.

Instructions
 - Assign a vector issues that contains the indices of the four troublesome rows: 2516, 3863, 4082, and 4183.
 - Subset sales3\$sales_ord_create_dttm to look at these observations. Remember to use sales3 (not sales4), since you want the data frame from before you separated columns!
 - For comparison, print element 2517 of sales3\$sales_ord_create_dttm, which did not cause a warning.

In [8]:
# Define an issues vector
issues <- c(2516, 3863, 4082, 4183)

# Print values of sales_ord_create_dttm at these indices
print(sales3$sales_ord_create_dttm[issues])

# Print a well-behaved value of sales_ord_create_dttm
print(sales3$sales_ord_create_dttm[2517])

[1] "NULL" "NULL" "NULL" "NULL"
[1] "2013-08-04 23:07:19"


Whew! The warning was just because of four missing values. You'll ignore them for now, but if your analysis depended on complete date/time information, you would probably need to delete those rows.

In [9]:
print(sales2$sales_ord_create_dttm[issues])

[1] "NULL" "NULL" "NULL" "NULL"
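
The situation can be reproduced on a toy column. In this sketch, toy and its second value are made up; fill = "right" is an existing argument of separate() that pads missing pieces with NA without issuing the warning. The last step shows one possible way to drop incomplete rows if complete date/time information were required.

```r
library(tidyr)

# Hypothetical column with one well-formed timestamp and one "NULL" string
toy <- data.frame(
  sales_ord_create_dttm = c("2013-08-04 23:07:19", "NULL"),
  stringsAsFactors = FALSE
)

# "NULL" contains no space, so it yields one piece; pad the missing one with NA
toy_split <- separate(toy, sales_ord_create_dttm,
                      c("ord_create_dt", "ord_create_time"),
                      sep = " ", fill = "right")

toy_split$ord_create_time  # "23:07:19" NA

# Keep only the rows where the time piece was actually present
toy_complete <- toy_split[!is.na(toy_split$ord_create_time), ]
nrow(toy_complete)  # 1
```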


### Identifying dates

Some of the columns in your dataset contain dates of different events. Right now, they are stored as character strings. That's fine if all you want to do is look up the date associated with an event, but if you want to do any comparisons or math with the dates, it's MUCH easier to store them as Date objects.

Luckily, all of the date columns in this dataset have the substring "dt" in their name, so you can use the str_detect() function of the stringr package to find the date columns. Then you can coerce them to Date objects using a function from the lubridate package.

You'll use lapply() to apply the appropriate lubridate function to all of the columns that contain dates. Recall the following syntax for lapply() applied to some data frame columns of interest:

`lapply(my_data_frame[, cols], function_name)`

Also recall that function names in lubridate combine the letters y, m, d, h, m, and s depending on the format of the date/time string being read in.

Instructions
 - Load the stringr package.
 - Use str_detect() to find values in the names() of sales5 containing the string "dt". Assign the resulting logical vector to the variable date_cols.
 - Load the lubridate package.
 - Coerce the date_cols into Date objects using lapply() and the appropriate function from lubridate. Conveniently, all date columns in sales5 are in year-month-day format, so you can use the ymd() function from lubridate.
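
As a quick illustration of the naming scheme, each function below is named after the order of the components in the string it parses. The date strings are made up for illustration.

```r
library(lubridate)

d1 <- ymd("2013-08-04")               # year, month, day
d2 <- mdy("08/04/2013")               # month, day, year
d3 <- dmy("04-08-2013")               # day, month, year
t1 <- ymd_hms("2013-08-04 23:07:19")  # date plus hours, minutes, seconds

class(d1)  # "Date"
d1 == d2   # TRUE
d2 == d3   # TRUE
```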

In [10]:
## sales5 is pre-loaded

# Load stringr
library(stringr)

# Find columns of sales5 containing "dt": date_cols
date_cols <- str_detect(names(sales5), "dt")

# Load lubridate
library(lubridate)

# Coerce date columns into Date objects
sales5[, date_cols] <- lapply(sales5[, date_cols], ymd)

"package 'lubridate' was built under R version 3.3.3"
Attaching package: 'lubridate'

The following object is masked from 'package:base':

    date

" 424 failed to parse."

Not again! Your code looks great, but there were a few more warnings… Sigh. You'll sort it out in the next exercise.
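
The detect-then-convert pattern can be sketched on a small data frame. Here toy and its values are hypothetical and contain only well-formed dates, so no parse warnings appear.

```r
library(stringr)
library(lubridate)

# Hypothetical data: two date columns stored as strings, one non-date column
toy <- data.frame(
  event_dt = c("2013-09-03", "2013-10-12"),
  ord_create_dt = c("2013-08-04", "2013-08-20"),
  venue_city = c("PORTLAND", "MANSFIELD"),
  stringsAsFactors = FALSE
)

# Logical vector marking the columns whose names contain "dt"
date_cols <- str_detect(names(toy), "dt")

# Convert each flagged column from character to Date
toy[, date_cols] <- lapply(toy[, date_cols], ymd)

sapply(toy, class)
```

Note that this form assumes at least two matching columns: with a single match, toy[, date_cols] would return a vector and lapply() would loop over its elements instead of over columns.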

### More warnings!

As you saw, some of the calls to ymd() caused a failure to parse warning. That's probably because of more missing data, but again, it's good to check to be sure.

The first two lines of code (provided for you here) create a list of logical vectors called missing. Each vector in the list indicates the presence (or absence) of missing values in the corresponding column of sales5. See if the number of missing values in each column is the same as the number of rows that failed to parse in the previous exercise.

As a reminder, here are the warning messages:

 - Warning message: 2892 failed to parse.
 - Warning message: 101 failed to parse.
 - Warning message: 4 failed to parse.
 - Warning message: 424 failed to parse.

Instructions
 - Run the first line as-is to regenerate the date_cols vector.
 - Run the second line as-is to generate a list of logical vectors representing missing values in the date columns of sales5.
 - Use sapply() to create a numerical vector containing the number of NA values in each date column. Call this vector num_missing.
 - Print out the num_missing vector.

In [11]:
## stringr is loaded

# Find date columns (don't change)
date_cols <- str_detect(names(sales5), "dt")

# Create logical vectors indicating missing values (don't change)
missing <- lapply(sales5[, date_cols], is.na)

# Create a numerical vector that counts missing values: num_missing
num_missing <- sapply(missing, sum)

# Print num_missing
num_missing

Yep, it was missing data again. Ah, the joys of working with real-life datasets!
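
The lapply()/sapply() counting pattern can be sketched in base R on a hypothetical data frame with known gaps, so the counts are easy to verify by eye.

```r
# Hypothetical date columns with 2 and 1 missing values respectively
toy <- data.frame(
  event_dt = as.Date(c("2013-09-03", NA, NA)),
  ord_create_dt = as.Date(c("2013-08-04", "2013-08-20", NA))
)

# List of logical vectors: TRUE where a value is missing
missing <- lapply(toy, is.na)

# sum() counts the TRUE values; sapply() simplifies the result to a named vector
num_missing <- sapply(missing, sum)
num_missing
#      event_dt ord_create_dt
#             2             1
```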

### Combining columns

Sure enough, the number of NAs in each column matches the numbers from the warning messages, so missing data is the culprit. How to proceed depends on your desired analysis. If you really need complete sets of date/time information, you might delete the rows or columns containing NAs.

As your last step, you'll use the tidyr function unite() to combine the venue_city and venue_state columns into one column with the two values separated by a comma and a space. For example, "PORTLAND" "MAINE" should become "PORTLAND, MAINE".

Instructions
 - Combine the venue_city and venue_state columns of sales5 into a new column called venue_city_state, containing the city and state names separated by a comma and a space. Call the resulting data frame sales6.
 - View the first 6 rows of sales6.

In [None]:
## tidyr is loaded

# Combine the venue_city and venue_state columns
sales6 <- unite(sales5, venue_city_state, venue_city, venue_state, sep = ", ")

# View the head of sales6
head(sales6, 6)
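
A minimal sketch of unite() on a hypothetical two-row data frame; toy and its values are made up for illustration.

```r
library(tidyr)

# Hypothetical venue data
toy <- data.frame(
  venue_city = c("PORTLAND", "MANSFIELD"),
  venue_state = c("MAINE", "MASSACHUSETTS"),
  stringsAsFactors = FALSE
)

# Paste the two columns together, separated by a comma and a space
toy_united <- unite(toy, venue_city_state, venue_city, venue_state, sep = ", ")

toy_united$venue_city_state  # "PORTLAND, MAINE" "MANSFIELD, MASSACHUSETTS"
names(toy_united)            # "venue_city_state"
```

By default unite() drops the input columns, which is why only venue_city_state remains; passing remove = FALSE keeps them alongside the new column.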