## Week 6: Loading and Cleaning Data

Today's lecture covers the basics of loading data into R and cleaning it. In class and in online tutorials (like DataCamp), the datasets you are given are impeccably clean and ready to use immediately. In practice this is rarely the case--by one commonly quoted estimate, about 75% of a data scientist's time goes to getting and cleaning data before any modeling! So today you will learn the basics of loading common static data formats and cleaning the data using regular expressions.

Up until now the instructors have written the code for reading and loading data into R for you. From this tutorial forward, we will leave the loading to you.

## Common Data Formats

<strong><a href="https://en.wikipedia.org/wiki/Comma-separated_values">CSV</a></strong> stands for Comma-Separated Values. These are text files in which each line is an observation and the variables are separated by commas. In a <code>.csv</code> file, the <strong>delimiter</strong> is the comma. There are similar offshoots of this format: tab-separated values or, more generally, delimiter-separated values. If a <code>.csv</code> file includes a header, it is the first line of the file. Also notice that the sample below has a comma inside the <code>Name</code> field--how is that possible?

    "Date","Name","Grade"
    "25 May","Bloggs, Fred","C"
    "25 May","Doe, Jane","B"
    "15 July","Bloggs, Fred","A"
    "15 April","Muniz, Alvin ""Hank""","A"
    
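One way to answer that question is to parse the sample directly. The sketch below pastes the sample into a string (the names `csvText` and `grades` are made up for this illustration) and reads it with `read.csv()`. Because every field is wrapped in double quotes, a comma inside the quotes is treated as text rather than as a delimiter, and a doubled `""` stands for a literal quote character.

```r
# The sample above as one string; a single-quoted R string lets us keep the double quotes
csvText <- '"Date","Name","Grade"
"25 May","Bloggs, Fred","C"
"25 May","Doe, Jane","B"
"15 July","Bloggs, Fred","A"
"15 April","Muniz, Alvin ""Hank""","A"'

# read.csv() splits on commas, but not on commas inside double quotes
grades <- read.csv(text = csvText, stringsAsFactors = FALSE)

dim(grades)    # 4 rows, 3 columns
grades$Name    # the commas stay inside the Name field
```

If the quotes were missing, each of these rows would appear to have four fields instead of three.
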
<strong><a href="http://json.org/">JSON</a></strong> stands for JavaScript Object Notation. This data storage format is very common in application and website databases because it is lightweight and flexible. The example below holds two records, one for John Smith and another for Jason Freeberg. As you can see, the two records have different fields--the first does not have a <code>middleName</code> or a <code>streetAddress2</code>, and the second does not have a <code>postalCode2</code>. If we attempted to store this in a tabular format, we would have many missing entries and waste space.

    [
      {
        "firstName": "John",
        "lastName": "Smith",
        "isAlive": true,
        "age": 25,
        "address": {
          "streetAddress": "21 2nd Street",
          "city": "New York",
          "state": "NY",
          "postalCode1": 3100,
          "postalCode2": 10021
        }
      },
      {
        "firstName": "Jason",
        "middleName": "Robert",
        "lastName": "Freeberg",
        "isAlive": true,
        "age": 21,
        "address": {
          "streetAddress": "6760 Sabado",
          "streetAddress2": "Unit B",
          "city": "Isla Vista",
          "state": "CA",
          "postalCode1": 93117
        }
      }
    ]
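
To make the "different fields per record" point concrete, here is a minimal sketch that parses two shortened, made-up records with `fromJSON()` from the rjson package (the same package used later in this lecture, assumed to be installed). The records are wrapped in a JSON array so the text is a single valid document; the variable names are invented for the example.

```r
library(rjson)

# Two shortened records with different sets of fields
jsonText <- '[
  {"firstName": "John", "lastName": "Smith", "age": 25},
  {"firstName": "Jason", "middleName": "Robert", "lastName": "Freeberg", "age": 21}
]'

# Each JSON object becomes a named list; the array becomes a list of records
people <- fromJSON(jsonText)

length(people)         # number of records
names(people[[1]])     # fields in the first record
names(people[[2]])     # fields in the second record

# Fields that only the second record has
setdiff(names(people[[2]]), names(people[[1]]))
```

Nothing forces the two records to share the same names, which is what makes JSON flexible and also what makes it awkward to flatten into a table.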

## Loading Data in R

Since R is statistical software, and statistics is the analysis of data, it makes sense that R has many built-in functions for loading data. R's general <code>read.table()</code> function reads tabular data, but you need to tell it the file's delimiter (the <code>sep</code> argument) unless the fields are separated by whitespace.
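
As a small illustration of the delimiter argument, the sketch below reads the same made-up records twice, once tab-separated and once pipe-separated, using the `text` argument of `read.table()` instead of a file. The data and variable names are invented for the example.

```r
# Made-up records, written once with tabs and once with pipes as the delimiter
tabText  <- "team\tarrests\nArizona\t5\nNew York Giants\t6"
pipeText <- "team|arrests\nArizona|5\nNew York Giants|6"

# sep tells read.table() which character splits the fields
fromTabs  <- read.table(text = tabText,  header = TRUE, sep = "\t")
fromPipes <- read.table(text = pipeText, header = TRUE, sep = "|")

# Both calls should produce the same data frame
identical(fromTabs, fromPipes)
```

With the default `sep = ""` (any whitespace), the value "New York Giants" would be split on its spaces, which is why naming the delimiter matters here.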

    # Package for working with JSON
    #install.packages("rjson")
    library(rjson)

    # Let's load last week's data as a refresher
    arrests <- read.table("nfl_arrests.csv", header = TRUE, sep = ",")
    head(arrests)

    # Now let's try a JSON collection from the internet
    jsonFile <- "http://api.worldbank.org/country?per_page=10&region=OED&lendingtype=LNX&format=json"
    jsonDoc <- fromJSON(paste(readLines(jsonFile), collapse = ""))
    str(jsonDoc)

    # Assumption: the second element of jsonDoc holds the country records (see the str() output)
    firstRecord <- jsonDoc[[2]][[1]]
    firstRecord


    season,week_num,day_of_week,gametime_local,home_team,away_team,home_score,away_score,OT_flag,arrests,division_game
    2011,1,Sunday,1:15:00 PM,Arizona,Carolina,28,21,,5,n
    2011,4,Sunday,1:05:00 PM,Arizona,New York Giants,27,31,,6,n
    2011,7,Sunday,1:05:00 PM,Arizona,Pittsburgh,20,32,,9,n
    2011,9,Sunday,2:15:00 PM,Arizona,St. Louis,19,13,OT,6,y
    2011,13,Sunday,2:15:00 PM,Arizona,Dallas,19,13,OT,3,n
    2011,14,Sunday,2:05:00 PM,Arizona,San Francisco,21,19,,4,y

    “incomplete final line found on 'http://api.worldbank.org/country?per_page=10&region=OED&lendingtype=LNX&format=json'”

    List of 2
     $ :List of 4
      ..$ page    : num 1
      ..$ pages   : num 4
      ..$ per_page: chr "10"
      ..$ total   : num 31
     $ :List of 10
      ..$ :List of 10
      .. ..$ id         : chr "AUS"
      .. ..$ iso2Code   : chr "AU"
      .. ..$ name       : chr "Australia"
      .. ..$ region     :List of 2
      .. .. ..$ id   : chr "EAS"
      .. .. ..$ value: chr "East Asia & Pacific"
      .. ..$ adminregion:List of 2
      .. .. ..$ id   : chr ""
      .. .. ..$ value: chr ""
      .. ..$ incomeLevel:List of 2
      .. .. ..$ id   : chr "HIC"
      .. .. ..$ value: chr "High income"
      .. ..$ lendingType:List of 2
      .. .. ..$ id   : chr "LNX"
      .. .. ..$ value: chr "Not classified"
      .. ..$ capitalCity: chr "Canberra"
      .. ..$ longitude  : chr "149.129"
      .. ..$ latitude   : chr "-35.282"
      ..$ :List of 10
      .. ..$ id         : chr "AUT"
      .. ..$ iso2Code   : chr "AT"
      .. ..$ name       : chr "Austria"
      .. ..$ region     :List of 2
      .. .. ..$ id   : chr "ECS"
      .. .. ..$ value: chr "Europe & Central Asia"
      .. ..$ adminregion:List of 2
 

    # Use read.table() to import "arrests.txt"

    #dirtyData <- read.table(<FILL-IN>)
    dirtyData <- read.table("arrests.txt", header = TRUE, sep = "\t")
    head(dirtyData)

    # Hint: open the file in a text editor--how is this file different from a .csv?

    Unnamed: 0,season,week_num,day_of_week,gametime_local,home_team,away_team,home_score,away_score,OT_flag,arrests,division_game
    689,2011,4,Sunday,1:00:00 PM,Philadelphia,San Francisco,23,24,,5,n
    745,2013,3,Sundayoops!,8:30:00 PM,Pittsburgh,123Chicago,23,40,,56,n
    964,2015,12,Sunday,12:00:00 PM,Tennessee,Oakland,21,24,,0,n
    448,2014,5,Sunday,1:00:00 PM,Jacksonville,Pittsburgh,9,17,,3,n
    666,2012,15,Sunday,1:25:00 PM,Oakland,Kansas City,0,15,,8,y
    21,2013,10,Sundayoops!,2:25:00 PM,Arizona,Houston,27,24,,3,n


## Checking Data

In data analysis, the phase that precedes modeling is called <strong>exploratory analysis</strong>. This step involves familiarizing yourself with the data by checking the dimensions, understanding the variables, making visualizations, and performing other sanity checks. Our last topic of the quarter will be data visualization with ggplot2, so for now let's cover the other ways we can explore our data.

The list below contains some tips that I have found very useful in my own exploratory analysis.
<ul>
    <li>
    Use <code>dim()</code> to get the dimensions of the dataframe. The functions <code>summary()</code> and <code>glimpse()</code> (from the dplyr package) are great for orienting yourself in a new dataset.
    </li>
    <li>
    Always understand the units and range of your numeric variables. Similarly, understand the naming convention of factor levels within categorical variables.
    </li>
    <li>
    Use <code>max()</code> and <code>min()</code> to check for odd values in numeric variables, and use <code>unique()</code> to check for incorrect levels in categorical variables.
    </li>
    <li>
    <code>table()</code> is great for getting frequency counts of the levels within categorical variables. You can give it two categorical variables to get two-way tables as well.
    </li>
</ul>
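
The tips above can be sketched on a tiny data frame. Everything here is made up for illustration (the `games` data frame and its values only imitate the kind of columns seen earlier), and only base R functions are used.

```r
# A small made-up data frame standing in for real data
games <- data.frame(
  day_of_week   = c("Sunday", "Sunday", "Sundayoops!", "Monday"),
  arrests       = c(5, 56, 0, 3),
  division_game = c("n", "n", "y", "n"),
  stringsAsFactors = FALSE
)

dim(games)                  # number of rows, then number of columns
summary(games$arrests)      # minimum, quartiles, mean, and maximum in one call
min(games$arrests)          # smallest value
max(games$arrests)          # largest value: is 56 plausible or a typo?
unique(games$day_of_week)   # exposes the malformed "Sundayoops!" value

# Two-way frequency table of weekday by division game
table(games$day_of_week, games$division_game)
```

None of these calls change the data; they only help you spot values that deserve a closer look before modeling.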

    # Get the dimensions of dirtyData
    <FILL-IN>

    # Try using glimpse() (from the dplyr library) on dirtyData
    library(plyr)
    library(dplyr)
    <FILL-IN>

    # Get the summary of the arrests column
    <FILL-IN>

    # Call unique() on the weekday, away team, and division game columns
    uniqueWeek <- unique(dirtyData$day_of_week)
    uniqueTeam <- <FILL-IN>
    uniqueDiv <- <FILL-IN>

    uniqueWeek
    uniqueTeam
    uniqueDiv

## Regular Expressions

<em>Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.</em>

A regular expression (regex) is a sequence of characters that defines a pattern to match in strings. Search functions can use these patterns to find matches, or to replace the matches with other characters. If you were observant, you noticed that the data above has some odd values, like "oops!" in <code>day_of_week</code> and "123" in <code>away_team</code>. We can use R's functions <code>grep()</code> and <code>grepl()</code> to find matches (<code>grep()</code> returns the positions of the matching elements, <code>grepl()</code> returns <code>TRUE</code> or <code>FALSE</code> for each element), and <code>gsub()</code> to replace matches with other characters.
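
Here is a minimal sketch of those three functions applied to values like the odd ones above. The vectors `days` and `teams` are made up for the example, and the patterns are only one possible way to match these particular values.

```r
# Values like the odd ones seen in the data above
days  <- c("Sunday", "Sundayoops!", "Monday")
teams <- c("Chicago", "123Chicago", "San Francisco")

# fixed = TRUE matches the literal text "oops!" rather than a regex
grepl("oops!", days, fixed = TRUE)   # one TRUE/FALSE per element

# ^ anchors to the start of the string, [0-9]+ matches one or more digits
grep("^[0-9]+", teams)               # positions of the matching elements

# gsub() replaces every match; here the replacement is the empty string
cleanDays  <- gsub("oops!", "", days, fixed = TRUE)
cleanTeams <- gsub("^[0-9]+", "", teams)

cleanDays
cleanTeams
```

Elements that do not match are returned unchanged, so it is safe to run `gsub()` over a whole column.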

### Regex Tips

<ul>
    <li>
    <strong>Do not make one big regular expression.</strong> Break down the regex into smaller, <em>more manageable</em>, problems. Use comments to help yourself keep track of the expressions.
    </li>
    <li>
    <strong>Use <a href="https://regex101.com">Regex101.com</a>.</strong> This website will check your expression against sample text. The top right breaks down your regex character-by-character, letting you know what it is <em>and is not</em> capturing. In the bottom right, there is a small window with common tokens and expressions.
    </li>
    <li>
    <strong>Test your regex.</strong> Double-check the test worked by 
    </li>
</ul>
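
One possible way to follow these tips in R is sketched below: each pattern is small and named, a helper applies them one step at a time, and the result is compared against cases whose correct answer is known in advance. The helper `cleanValue()` and the test values are hypothetical and only for illustration.

```r
# Small, separately named patterns instead of one large expression
leadingDigits <- "^[0-9]+"   # one or more digits at the start of the string
trailingOops  <- "oops!$"    # the text "oops!" at the end of the string

# Hypothetical helper that applies the patterns one step at a time
cleanValue <- function(x) {
  x <- gsub(leadingDigits, "", x)
  x <- gsub(trailingOops, "", x)
  x
}

# Test cases with known answers, including a value that should not change
input    <- c("123Chicago", "Sundayoops!", "Pittsburgh")
expected <- c("Chicago", "Sunday", "Pittsburgh")

# stopifnot() raises an error if the cleaned values differ from the expected ones
stopifnot(identical(cleanValue(input), expected))
```

Including values that should be left alone is a cheap way to catch a pattern that matches more than you intended.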