## Welcome to Week 1!
Last week, you had a crash course in data analysis and learned statistical transformations.
This week, you'll be introduced to applying these principles on a large set of data using R.

Before we start the fun stuff, let's review some vocabulary.
I know, I know - _groan_ - but this stuff really comes in handy when you want to look things up or ask questions.

*Data* are values that describe something.

*Information* is data that is useful. Raw data on its own is not meaningful; information is.

*Attributes* are the variables of your dataset. If you have a dataset that describes a window, the attributes of the dataset could be, for example, the dimensions of the window or the materials it is made from. Think of attributes as the headers of your dataset. In a dataframe, the values that describe the attributes are the data underneath the headers.

*Features* are attributes that are useful. Attributes are not meaningful. Features are. The moment that an attribute tells us something about the dataset is the moment that it becomes a feature. Features are the useful bits of information that are fed into machine learning algorithms. Attributes can't be fed into machine learning algorithms because they don't tell the algorithm anything. It's just a jumble of numbers.

*Feature Engineering* is the extraction of meaningful information out of attributes in our dataset. Think of it as turning attributes into features. An example of feature engineering is determining the day of the week given a date. We are creating a new feature (the day of the week) given an attribute (the date). Another example is extracting the surname out of a full name. This is considered feature engineering because surnames can identify people who are part of the same family in a dataset.
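
As a rough sketch of both examples in R (the sample dates and names below are made up for illustration, and the surname rule assumes the surname is simply the last word of the name). Don't worry about the syntax yet; the pieces are introduced later on.

```r
# Hypothetical sample attributes (not from any real dataset)
dates <- as.Date(c("2017-04-01", "2017-04-03"))
full_names <- c("Ada Lovelace", "Grace Brewster Hopper")

# Feature 1: day of the week as a number (0 = Sunday ... 6 = Saturday)
day_of_week <- as.POSIXlt(dates)$wday

# Feature 2: surname, assumed here to be everything after the last space
surname <- sub("^.* ", "", full_names)

day_of_week
surname
```

The date and the full name are the attributes; the day of the week and the surname are the features derived from them.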

With our vocabulary refreshed, let's dive right in with an investigation of how R works.
Write a number!

In [None]:
42

Write a mathematical expression!

In [None]:
15+27

This is the most basic thing you can do in R - write an expression that R will print the value of.
So far we know this works with numbers. Numbers are stored as numeric data types in R. 
Note that you can also use the colon operator to make a consecutive collection of numbers.
The colon operator works by placing a 'from' and 'to' number on either side of a colon.
For example, 3:8 will print the values _from_ three _to_ eight.

In [None]:
3:8

There are more than just numbers in R, though! Try writing 'Hello World' with single and double quotation marks.
This is a character data type. Sometimes it is called a string.

In [None]:
"Hello, world!"

In [None]:
'Hello, world!'

They simply get printed out on your page. To write things, you may also use the print() function. This is useful for documentation and debugging.

In [None]:
print("Aye maties!")

Character values can be modified with a variety of functions, but the ones I want to go over are substr() and paste().
substr() stands for substring, and it does what it says on the tin - given a start and end index, it gets a small string from a larger string.

In [None]:
substr("Hello world", 4,8)

paste() joins its arguments into a single character string, separated by a space by default. You can think of it as pasting more characters in.
Let's paste 'world!' to 'Hello' to make 'Hello world!'

In [None]:
paste("Hello", "world!")
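
The separator can be changed. A small sketch using the sep argument and the paste0() shortcut (the variable names are only for illustration):

```r
with_space <- paste("Hello", "world!")             # default separator is a space
with_dash  <- paste("Hello", "world!", sep = "-")  # custom separator
no_space   <- paste0("Hello", "world!")            # paste0() uses no separator

with_space
with_dash
no_space
```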

What's important to note is that using both single and double quotes is effectively the same.
You can tell when something is a character because it is surrounded by quotations.
In other words, both single and double quotation marks make character values. You can test this with the class() function. By putting a '==' between the two expressions, you are asking whether or not the two expressions are equal.

*NOTE: don't use '='! This is a reserved operator for assignment.*

In [None]:
class('Hello, world')
class("Hello World")
class('Hello, world')==class("Hello World")

This TRUE or FALSE value is called a logical. They are the results of a condition - in this case, the condition was that both the expressions had the same class.
Logicals are very useful for writing conditional code, i.e. code that gets executed only when a set of conditions is met.
The main operators for logicals are '==', '!=', '>', '<', '>=', '<='. The first asks if two values are the same, as shown above. The second asks if they are NOT the same, signified by the exclamation point, meaning not.
The last four are numerical comparisons, asking if something is greater than, less than, greater than or equal to, or less than or equal to something else.

In [None]:
"Poly"=="Martha"
"Poly"=="Poly"

5==5
5!=5

6<4

71>5
71>71

3>=3
5<=0

If you want two conditions to be met for something to count as true, you must use && to say that 'this AND this must be true for the condition to be true'. If you only need one of two conditions to be true, you must use || to say 'this OR this must be true for the condition to be true'.

In [None]:
5<7 && 8<=9 #Both conditions are true
5>7 && 8<=9 #First condition is false
5>7 || 8<=9 #First condition is false
5>7 || 8<=4 #Both conditions are false

(5>7 && 'Polly'=="Polly") || 800!=5 #First condition in brackets evaluates to false, 
                                    #however the second condition is true, so the OR returns a TRUE value.
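
&& and || are meant for single TRUE/FALSE values. When comparing a whole collection of values element by element, R also has single-character versions, & and |. A brief sketch, using an arbitrary range of numbers made with the colon operator:

```r
values <- 3:8                      # the numbers 3, 4, 5, 6, 7, 8

# & and | compare element by element and return one logical per element
in_range <- values > 4 & values < 8
at_edges <- values == 3 | values == 8

in_range
at_edges
```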

It gets tedious to write things out all the time, though.
You can store these values for use with variables. In R, the primary (and most idiomatic) way to assign values to a variable is to use the '<-' arrow. Once you have this variable, you can use it in place of the value.

In [None]:
x <- 42
x

Variable names should be short and descriptive. 

You can of course change your values, even to an entirely different data type!

In [None]:
x <- 'Forty-two'
x

Now that you know how to use your variables, let's install a package.
Packages contain useful functions to make data manipulation easy.
The two we're going to be working with are dplyr, for manipulating dataframes (you'll learn what these are later on), and RCurl, for dealing with online requests. We'll use this to get our data.


install.packages() accepts the name of the package you want to install and a repository to get it from. library() loads the packages you want.

In [None]:
install.packages("dplyr", repos = "https://cran.rstudio.com/", force=TRUE)
library("dplyr")
install.packages("RCurl", repos = "https://cran.rstudio.com/", force=TRUE)
library("RCurl")

In R, data can also be stored as a vector. Vectors with numeric, logical, and character datatypes are stored as 'atomic' vectors. 


*Bonus: Even single values in R are technically vectors. When we wrote x <- 42, x was stored as an atomic vector of length 1 containing the numeric value 42.*

To make a vector in R, use the c() function, meaning 'combine'.

In [None]:
c(1,2,3)

Vectors can be stored just as any other variable.

In [None]:
myVector<-c(4:9)
myVector

Each number being stored is an element of the vector. You can access each element by putting its index (place in the vector) in square brackets after the variable name. Note that R starts counting at 1, not 0.

In [None]:
myVector[1]

Accessing multiple consecutive elements is as simple as using the colon operator. However, to access a few non-consecutive elements, you should pass in a vector of the indices you want to access.

In [None]:
myVector[2:5]
myVector[c(3,5)]

Be careful with what you put in the brackets. If you put a number larger than the length of the vector, you'll get an NA value.

In [None]:
myVector[7]

To avoid this, you can check the length of your vector (and other datatypes) with the length() function.

In [None]:
length(myVector)

Therefore, a safer way to access the last element of your vector is to use the length() function as an index.

In [None]:
myVector[length(myVector)]

Negative numbers are a little strange. The negative sign tells R to leave out the element at that index.

Note that while this prints the vector without that element, this doesn't actually change the vector - in order to actually change the value of your variables, you need to use the assignment operator '<-'.

In [None]:
myVector[-1]

You can pass in a condition as an index as well. This will cause it to only return values that fit that condition (As well as NA values).

In [None]:
myVector[myVector<6]
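
To see why this works, it can help to look at the condition on its own. In this sketch (using a separate example vector, scores, so that myVector is left untouched), the comparison produces one TRUE or FALSE per element, and the brackets keep only the TRUE positions:

```r
scores <- c(4, 5, 6, 7, 8, 9)   # example vector with the same values as 4:9

keep <- scores < 6              # one logical value per element
keep

small_scores <- scores[keep]    # only the elements where keep is TRUE
small_scores
```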

By assigning a value to an out-of-bounds index, you can alter the length and contents of your vector.

In [None]:
myVector[7]<-10
myVector
myVector[10]<-42
myVector

_Notice that the indexes are assigned NA if you did not give them a value._

You can modify multiple elements in your vector.

In [None]:
myVector[8:10]<-c(11,12,13)
myVector

Splicing in values is easy. By using the c() function, you can simply surround the new values with the values of the old vector using indices. The c() function is very flexible and accepts any values separated by commas.

In [None]:
myVector<-c(myVector[1:5], c(42,42), 42, myVector[6:10])
myVector

One thing that vectors do not like is mixing data types. Look what happens when you try to store a character in a numeric vector.

In [None]:
myVector[length(myVector)]<-'Kitty'
myVector

Every one of the values of myVector is now a character! This is no good - working with Big Data means working with different data types from the same dataset.
Luckily, R has a data structure for this - lists!

In [None]:
myList<-list(c(1,2,3), "Kitty")
myList

Look how the information is laid out. Each element of the list sits at its own index and keeps its own data type.
The way lists store information is important because accessing elements with them is different than with vectors.
For example, writing a single square bracket returns a smaller _list_ containing the element at that index.
To get the vector stored at that index itself, you have to use double square brackets.

In [None]:
myList[1]
class(myList[1])
myList[[1]]
class(myList[[1]])

To access individual elements in the vectors, you have to get the vector with double brackets and then add a second index.

In [None]:
myList[[1]][2]

The reasoning behind this is much more obvious as the lists get more and more complex.

In [None]:
myComplexList<-list(myList, 5:11, c("Lazy", "Sleepy", "Fluffy"), c(TRUE, FALSE, NA), function(x) is.na(x))
    #yes, you CAN store functions!
myComplexList

How could I reference 'Kitty' from here?

We can use a single square bracket to get back myList, which is stored in the first index.

In [None]:
myComplexList[1]
length(myComplexList[1])

This is a list of length one, containing our list from before.
To actually get our list, we need to use two brackets.

In [None]:
myComplexList[[1]]

This is myList!
From there, we can reference the second index of that list with an additional square bracket.

In [None]:
myComplexList[[1]][2]

This returns a list. If we want the character value 'Kitty', we need to use double brackets.

In [None]:
myComplexList[[1]][[2]]

Compare this to what we did with just myList.

In [None]:
myList[[2]]

And that's that! A lot of information is stored in the form of lists, so make sure you know how to use them.
For the next section, let's make myList a little more complex.

In [None]:
myList[[3]]<-c("Cute", "Hungry", "Small")
myList[[4]]<-seq(6, 18, 3)
myList

A very useful family of functions for lists is the apply family. It comes in a variety of different flavours (lapply(), sapply(), and others), each with their own special uses. However, the premise is the same - pass in a list and a function that you want to apply to each element of that list.

To learn more about the apply functions, type ?lapply in your console to read the documentation.
For now, let's go ahead and try out a simple lapply function.

In [None]:
lapply(myList,function(x)
                    is.numeric(x))

The above code looks at all the vectors of that list and applies the is.numeric() function on them. x in our case is the element that is being passed in - you can see that the function accepts the element and applies is.numeric() on x.

You can also use the apply functions to extract things. Remember the double bracket, [[? Passing it in as a function along with an index will return the element at that index.

In [None]:
lapply(myList, '[[',1)

If you don't want your apply to return a list, you can either use sapply to return a vector or use the unlist() function.
Remember, however, that vectors cannot hold different variable types. This means your numerics will be coerced into characters.
Try putting this in a variable.

In [None]:
sapply(myList, '[[',1)
unlist(lapply(myList, '[[',1))
vectorFromList<-unlist(lapply(myList, '[[',1))

However, you can coerce the variables too! as.DATATYPE() will coerce a value into whatever data type you want, provided it can make the transition. For example, you can coerce character numbers into numerics, but you can't coerce words into numerics.

In [None]:
as.numeric(vectorFromList[4])
as.numeric(vectorFromList[3])

Before we move on to two-dimensional data, let's try out a more complicated function with our lapply.

In [None]:
print("Before apply:")
myList
print("After apply:")
lapply(myList, 
            function(x) {
                            if(is.numeric(x)) # this is an if statement. 
                                              #The code within its {brackets} only gets evaluated if the condition is true.
                               {
                                   return(x^2 + 1) # if x is a numeric, square it and add one.
                               }
                            else # this else statement tells R what to do if the condition is false.
                                {
                                    return(paste(x,"Kat")) # if x is not a numeric, add the word 'Kat' to it.
                                }
                        }
        )    

## Two-dimensional data
That concludes the one-dimensional data structures!
But where are the tables? The charts? Doesn't data usually come in those?

It does indeed: now we're going to go through two-dimensional data.
To start off, let's talk about matrices. You can make a matrix with the matrix() function - simply give it the number of rows and columns to start out with! Use ?matrix if you get stuck.

In [None]:
myMatrix<-matrix(nrow=3,ncol=3)
myMatrix

Thus a matrix is born. Right now it's empty, filled with NA values. Let's see what we can do with it.
If we ever need to know the dimensions of our matrix, we can use the dim() function.

In [None]:
dim(myMatrix)

To access a spot in the matrix, we have to use square brackets in the same way we did for vectors.
However, matrices need you to specify both row and column.
You can do this by separating the row index and the column index with a comma. 

Remember that the row goes first!
If you leave the column index blank, R interprets that as a reference to the ENTIRE row.
Likewise, leaving the row index blank refers to the entire column.

In [None]:
myMatrix[1,2]<-1
myMatrix[2,]<-2
myMatrix[3,]<-c(3,4,5)
myMatrix

Notice that the entire second row is filled with twos and the third row accepted all three numbers. This is because we did not specify a column.
Getting an index is otherwise very similar to vectors. Assigning, too. The only thing to keep in mind is that you _must_ specify the column and row, even if you want to refer to the entire row or column. There _must_ be a comma in your index.
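
The same comma rules apply when reading values. A small sketch with a separate, hypothetical matrix called demo (filled column by column, which is R's default):

```r
demo <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns, filled column by column

single_value <- demo[2, 3]      # row 2, column 3
first_row    <- demo[1, ]       # column left blank: the entire first row
second_col   <- demo[, 2]       # row left blank: the entire second column

demo
single_value
first_row
second_col
```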

In [None]:
myMatrix<-myMatrix[1:2,]
myMatrix

Now, myMatrix only has its first and second rows (but all its original columns). We can do operations to this entire matrix in a similar fashion to what we did with vectors.

In [None]:
myMatrix<-myMatrix+1
myMatrix

Let's use the colSums() function to find out the sum of each column.

In [None]:
colSums(myMatrix)

Uh oh! We don't want NA values in our sum.
To fix this problem, we should replace the NAs with zeros.
Remember that we can pass in conditions into the indices, so that only elements fulfilling those conditions will be returned.
By asking for the NA indices, we can replace any pesky NAs with zeros!

In [None]:
myMatrix[is.na(myMatrix)]<-0
myMatrix
colSums(myMatrix)

Hooray; it works! But wouldn't it be nice to have the sums as part of our matrix?

We can bind the values row-wise using rbind().

In [None]:
myMatrix<-rbind(myMatrix,colSums(myMatrix))
myMatrix

Similarly, if we wanted to add a column, we can use cbind().
Remember to keep your dimensions in mind when adding rows and columns!

In [None]:
myMatrix<-cbind(myMatrix, c(1,1,1))
myMatrix

We can also do some transformations to our matrix, such as transposing it.
Transposing a matrix can be done with the t() function.
Save this matrix in a separate variable.

In [None]:
myTransposedMatrix<-t(myMatrix)
myTransposedMatrix

Let's suppose we wanted to multiply our original matrix with our transposed matrix!
The %*% operator does just that.

In [None]:
myTransposedMatrix%*%myMatrix
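
Keep in mind that %*% is true matrix multiplication, which is different from the element-by-element * operator. A quick sketch on a small, made-up 2x2 matrix:

```r
A <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)

elementwise <- A * A         # multiplies matching entries
matrix_prod <- A %*% A       # rows of the left times columns of the right

elementwise
matrix_prod
```

For %*% to work, the number of columns of the left matrix has to match the number of rows of the right matrix.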

This is fun and all, but there is a problem.
Matrices, like their one-dimensional vector cousins, cannot support multiple data types.

In [None]:
myMatrix[1,1]<-'Foo'
myMatrix
class(myMatrix[1,2])

Like a vector, it coerces our numerics to characters.
What we need is a two-dimensional list... a dataframe!
### Data Frames
Data frames are the most useful data structures available to you. Since they can store different types of data in a way that is more easily accessible than a list, they are very commonly found in datasets.
You can coerce your matrix into a data frame by using the as.data.frame() function.

In [None]:
myFrame<-as.data.frame(myMatrix)
myFrame

Let's pretend we're making a menu.
The names V1, V2, V3, and V4 aren't very useful. They don't tell us anything!
Luckily we have a solution to this. Data frames have a names feature, which you can access and modify using the names() function. names() returns a vector of the names of the data frame. Read more about names() using the ? function for the documentation.

In [None]:
names(myFrame)
names(myFrame)<-c("Meal Name","Fish", "Chips", "Crisps")
myFrame

Not only do good names make your dataset easier to navigate, but they give you a new way to index your data!
The dollar sign operator '$' accepts a name as an index!
Let's give our meals some names.

In [None]:
myFrame$"Meal Name"<-c("Foo", "Baa", "Combo")
myFrame

Great! Now the only thing missing from our menu is a price.
Adding columns works much like it does with matrices. Remember that you need to state both the row and the column you are modifying.

In [None]:
myFrame[,5]<-c(4.55, 7.22, 10.33)
names(myFrame)[5]<-"Price"
myFrame

Excellent!
Before moving on, let's use the summary() function to get some insight on our menu.

In [None]:
summary(myFrame)
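
The '$' operator is just as handy for reading columns as for assigning them. A sketch on a small, hypothetical menu built directly with data.frame() (the column names meal and price are chosen for this example only):

```r
# Hypothetical menu, built directly rather than coerced from a matrix
menu <- data.frame(
  meal  = c("Foo", "Baa", "Combo"),
  price = c(4.55, 7.22, 10.33)
)

menu$price                          # the whole price column as a vector
average_price <- mean(menu$price)   # columns work with ordinary functions
pricey <- menu[menu$price > 5, ]    # rows where the condition is TRUE

average_price
pricey
```

Because the price column was created as numeric here, functions like mean() work on it directly.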

## Rain Data
Up until now, you've been learning to use the different data structures in R.
Let's use your knowledge to do an investigation of a real life dataset!

We know our dataset is about rainfall in the city of Toronto. Let's investigate April showers and see if we can find any interesting trends.

First, get the datasets. Pass the URL of the raw GitHub pages for the datasets ending with 04 (April) into the url() function, and pass all that into the read.csv() function. Remember to assign these to a variable!

In [1]:
rainData <- read.csv(url("https://raw.githubusercontent.com/bigdatachallenge/bdc_workshops/master/2017_rainfall_data/rainfall201704.csv"))
sitesData <- read.csv(url("https://raw.githubusercontent.com/bigdatachallenge/bdc_workshops/master/2017_rainfall_data/sites201704.csv"))

## The Data Science Process
Whenever we're working with a dataset we usually go through five stages:

1. Data Inspection
2. Normalization and cleaning
3. Gathering insights (Feature Engineering)
4. Generate the models
5. Storytelling

#### Data Inspection
Every good data science project begins with data inspection. When we inspect the data we want to look at the beast and see what's coming. The point of inspecting the data is to gather context on what we're looking at so we can begin to ask ourselves questions to answer!
Some things we can glimpse into are the means of the columns, the unique values of categorical variables, counts, maxes, mins, and modes. This should spark some questions if you see an abnormally high number in a column.

To take a 'glimpse' of the data, we can use the function head(). This will give us the first six rows of the dataset.


In [2]:
head(rainData)
head(sitesData)

id,name,date,rainfall
7677,RG_001,2017-04-01T00:00:00,0
7677,RG_001,2017-04-01T00:05:00,0
7677,RG_001,2017-04-01T00:10:00,0
7677,RG_001,2017-04-01T00:15:00,0
7677,RG_001,2017-04-01T00:20:00,0
7677,RG_001,2017-04-01T00:25:00,0


id,name,longitude,latitude
7677,RG_001,-79.47811,43.64768
7678,RG_002,-79.44362,43.6512
7679,RG_003,-79.40509,43.65662
7680,RG_004,-79.40283,43.67834
7681,RG_006,-79.3751,43.66127
7682,RG_007,-79.33114,43.67672


'Date' and 'rainfall' are pretty self-explanatory, as are 'longitude' and 'latitude'. But what do 'name' and 'id' mean?

Hmm... it would be nice to have some more information about the properties of the columns.
Thankfully, we can use summary().

In [3]:
summary(rainData)
summary(sitesData)

       id            name                         date           rainfall      
 Min.   :7674   RG_001 :  8353   2017-04-01T00:00:00:    45   Min.   :0.00000  
 1st Qu.:7685   RG_002 :  8353   2017-04-01T00:05:00:    45   1st Qu.:0.00000  
 Median :7697   RG_003 :  8353   2017-04-01T00:10:00:    45   Median :0.00000  
 Mean   :7717   RG_004 :  8353   2017-04-01T00:15:00:    45   Mean   :0.01158  
 3rd Qu.:7708   RG_006 :  8353   2017-04-01T00:20:00:    45   3rd Qu.:0.00000  
 Max.   :8049   RG_012 :  8353   2017-04-01T00:25:00:    45   Max.   :5.84200  
                (Other):325765   (Other)            :375613                    

       id            name      longitude         latitude    
 Min.   :7674   RG_001 : 1   Min.   :-79.59   Min.   :43.61  
 1st Qu.:7685   RG_002 : 1   1st Qu.:-79.51   1st Qu.:43.68  
 Median :7696   RG_003 : 1   Median :-79.43   Median :43.72  
 Mean   :7717   RG_004 : 1   Mean   :-79.42   Mean   :43.72  
 3rd Qu.:7708   RG_006 : 1   3rd Qu.:-79.33   3rd Qu.:43.76  
 Max.   :8049   RG_007 : 1   Max.   :-79.15   Max.   :43.82  
                (Other):40                                   

We can see that the IDs range from 7674 to 8049. Could these possibly be related to longitude and latitude?
To find out, we can use another function - unique(). unique() will give us a vector filled with all the unique values of that column.
By comparing the lengths of the unique values of two columns, we can guess that they correspond with each other.

In [4]:
length(unique(sitesData$name))==length(unique(sitesData$id))
length(unique(sitesData$name))==length(unique(sitesData$latitude))                                              

With this, we can guess that 'name' corresponds with ID, which in turn corresponds with location data.
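
Matching counts are only a hint: two columns can have the same number of unique values without pairing up one-to-one. A slightly stricter check, sketched here on a small made-up data frame (the sites object and its values are hypothetical), is to count the unique id/name pairs as well:

```r
# Hypothetical stand-in for a sites table
sites <- data.frame(
  id   = c(1, 2, 3),
  name = c("RG_A", "RG_B", "RG_C")
)

n_ids   <- length(unique(sites$id))
n_names <- length(unique(sites$name))
n_pairs <- nrow(unique(sites[, c("id", "name")]))  # distinct (id, name) rows

# If all three counts agree, each id pairs with exactly one name
one_to_one <- n_ids == n_names && n_names == n_pairs
one_to_one
```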

#### Normalization and Cleaning
Once we understand our data, we'll have to start cleaning it for modeling. This involves clearing rows with null values (or changing the null values to something else), removing useless columns (or columns that are too similar to others), placing all values onto a linear scale, or converting all numbers to a universal unit for your dataset.
Take a look at the datasets above.

Hmm... it looks like 'name' isn't really useful to us, since we could just use 'id'. Let's get rid of that column to make our datasets sleeker.

In [5]:
rainData<-rainData[,-2]
sitesData<-sitesData[,-2]
head(rainData)
head(sitesData)

id,date,rainfall
7677,2017-04-01T00:00:00,0
7677,2017-04-01T00:05:00,0
7677,2017-04-01T00:10:00,0
7677,2017-04-01T00:15:00,0
7677,2017-04-01T00:20:00,0
7677,2017-04-01T00:25:00,0


id,longitude,latitude
7677,-79.47811,43.64768
7678,-79.44362,43.6512
7679,-79.40509,43.65662
7680,-79.40283,43.67834
7681,-79.3751,43.66127
7682,-79.33114,43.67672
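
Dropping a column by position works, but it removes the wrong column without complaint if the column order ever changes. As an alternative sketch (on a small hypothetical data frame, not the real rainfall data), a column can also be dropped by name:

```r
# Hypothetical miniature version of a sites table
sites <- data.frame(
  id        = c(7677, 7678),
  name      = c("RG_001", "RG_002"),
  longitude = c(-79.47811, -79.44362)
)

# Keep every column whose name is not "name"
sites <- sites[, names(sites) != "name"]

names(sites)
```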


Great! Now that we have two sleek datasets, we can merge them together.
What is a common variable across both sets? ID! We can use ID to merge them.
To use merge(), pass in the two datasets you want to merge and the parameter they should be merged by.

In [6]:
total <- merge(rainData, sitesData, by="id")

In [7]:
head(total)

id,date,rainfall,longitude,latitude
7674,2017-04-01T00:00:00,0,-79.46603,43.69735
7674,2017-04-01T00:05:00,0,-79.46603,43.69735
7674,2017-04-01T00:10:00,0,-79.46603,43.69735
7674,2017-04-01T00:15:00,0,-79.46603,43.69735
7674,2017-04-01T00:20:00,0,-79.46603,43.69735
7674,2017-04-01T00:25:00,0,-79.46603,43.69735


Excellent!
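
To see what merge() does on a scale small enough to check by eye, here is a sketch with two tiny made-up data frames (readings and stations are hypothetical names). Each row of the first is matched to the row of the second that shares its id:

```r
readings <- data.frame(
  id       = c(1, 1, 2),
  rainfall = c(0.0, 0.2, 0.5)
)
stations <- data.frame(
  id       = c(1, 2),
  latitude = c(43.65, 43.70)
)

# Rows are paired up wherever the "id" values match
combined <- merge(readings, stations, by = "id")
combined
```

Every reading keeps its own row, and the station's latitude is repeated alongside each reading from that station.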
Let's take a look at that date column. Since part of data cleaning is making our variables easier to work with, we probably want to be able to pick apart the dates by year, day, month, and hour.

In [8]:
class(total$date)

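A factor?!
This is a strange data type. It stores values as a vector, but has an additional attribute: levels. These levels contain all of the unique values in the vector.
(If you are running R 4.0 or later, you will probably see "character" here instead, because read.csv() no longer converts text to factors by default. The strptime() step below works either way.)
Unfortunately, a factor isn't very useful to us, because there are going to be a lot of levels in this dataframe, since each measurement is taken a short time apart for an entire month!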
What we want to do is format our dates so that we can access them easily.
For this, we're going to use a function in R called strptime(). Let's go over its documentation.

In [9]:
?strptime()

_Functions to convert between character representations and objects of classes "POSIXlt" and "POSIXct" representing calendar dates and times._

Excellent! This is just what we need.
To use this function, we need to pass in the things we need to format - in our case, the date column - and the format that the date is in.
How do we write the format? Looking at the documentation, what we want is
%Y-%m-%dT%H:%M:%S
because that is what the data in our dataframe looks like, with a T in the middle.
Let's go ahead and pass that in!

In [10]:
formattedTime<-strptime(total$date, format='%Y-%m-%dT%H:%M:%S')
head(formattedTime)
class(formattedTime[1])

[1] "2017-04-01 00:00:00 EDT" "2017-04-01 00:05:00 EDT"
[3] "2017-04-01 00:10:00 EDT" "2017-04-01 00:15:00 EDT"
[5] "2017-04-01 00:20:00 EDT" "2017-04-01 00:25:00 EDT"

Hooray! Now our dates are easily accessible. Give it a shot - use your '$' operator and ask for the month of the first row!
In order to know what to ask for, use unlist() on an element in the column.

In [11]:
unlist(formattedTime[1])

In [12]:
formattedTime[1]$mon

3? But isn't this dataset April, the _fourth_ month?
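This is a quirk to note about the POSIXlt data type - some of its fields start counting at 0, not 1. The month field, mon, runs from 0 (January) to 11 (December), so April is stored as 3. Not every field works this way: the day of the month, mday, starts at 1, and year counts the years since 1900. These conventions are inherited from the C language's time structures, which are very old as far as computers are concerned.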
So now that we have formatted our dates to be easily accessible, we can go ahead and replace all those factors with their POSIX counterparts.

In [13]:
total$date<-formattedTime
head(total)

id,date,rainfall,longitude,latitude
7674,2017-04-01 00:00:00,0,-79.46603,43.69735
7674,2017-04-01 00:05:00,0,-79.46603,43.69735
7674,2017-04-01 00:10:00,0,-79.46603,43.69735
7674,2017-04-01 00:15:00,0,-79.46603,43.69735
7674,2017-04-01 00:20:00,0,-79.46603,43.69735
7674,2017-04-01 00:25:00,0,-79.46603,43.69735
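
The way POSIXlt stores its fields can be checked on a single timestamp. In this sketch (the timestamp is just an example, and tz = "UTC" is set only so the result does not depend on the local time zone), mon needs 1 added and year needs 1900 added to get calendar values:

```r
stamp <- strptime("2017-04-01T07:25:00",
                  format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")

calendar_month <- stamp$mon + 1      # mon counts from 0 (January is 0)
calendar_year  <- stamp$year + 1900  # year counts years since 1900
day_of_month   <- stamp$mday         # mday already counts from 1
hour_of_day    <- stamp$hour         # hour runs from 0 to 23

c(calendar_year, calendar_month, day_of_month, hour_of_day)
```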


#### Gathering Insights

With our data cleaned and normalized we feature engineer to discover more patterns and insights within our dataset. The goal of feature engineering is to provide more variables for our algorithms to play with. By manually identifying these patterns, the computer doesn't have to expend additional effort to discover these patterns algorithmically.
This is the hardest part of data science so don't worry if you need more explanation or aren't good at it. Feel free to message @wooden-plancks on Slack if you want further explanation!
From experience, steps 1-3 should take up 90% of your time. It also just so happens that the first three steps occur organically with the inception of your questions. Once you start looking at the data, you'll want to answer some questions... which will lead you to clean the data to gather insights from it. Then you may have more questions so you'll repeat the cycle.

We can ask our first question of our data: does it rain more closer to Lake Ontario? Since the lake lies to the south of Toronto, if our hypothesis is correct, precipitation should increase as latitude decreases.
To see if there is a correlation between the two, we can use the cor() function.

In [16]:
cor(total$latitude, total$rainfall)

There's basically no correlation! There could be multiple reasons for this, including the relationship between latitude and precipitation being non-linear, or the range in latitude just being too small.
We can keep on investigating to make sure that there is no significant correlation, but for now let's keep asking questions.
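
To illustrate the non-linear case: cor() measures linear association only by default, so a perfectly predictable but symmetric relationship can still score essentially zero. A sketch with made-up numbers (x and y are not from the rainfall data):

```r
x <- -5:5        # made-up values, symmetric around zero
y <- x^2         # y is completely determined by x, but not linearly

r <- cor(x, y)
r
```

A near-zero correlation therefore rules out a straight-line trend, not every kind of relationship.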
For example, let's ask this question:

In April, did it rain more in the morning?
To figure this out, we should look at the total rainfall for the entire month and group them by hours. We can do this using aggregate(). aggregate() takes in three things - the function you want to apply to your data, the data you want to apply the function to, and the list of values you want to aggregate your data by.

In [17]:
myData<-aggregate(total$rainfall, by=list(total$date$hour),FUN=mean)
myData
cor(myData[1], myData[2])

Group.1,x
0,0.008679317
1,0.004247254
2,0.001572286
3,0.001086207
4,0.0012
5,0.001410983
6,0.003339464
7,0.014837165
8,0.013294636
9,0.01566092


Unnamed: 0,x
Group.1,0.408948


Hey, this is interesting - there's a positive correlation! And not that much of a small one, either.
Can we say, tentatively, that it rains more later on in the day?
Obviously the correlation isn't a very effective way to describe the relationship. There are better, more complicated answers to our question we can get by generating models.
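
To make the grouping step concrete, here is a sketch of aggregate() on a handful of made-up readings (the hours and amounts are invented for illustration). Each distinct hour becomes one row holding the mean of its readings:

```r
# Made-up readings: the hour each was taken and the rainfall measured
hours    <- c(0, 0, 1, 1, 1)
rainfall <- c(0.2, 0.4, 0.0, 0.3, 0.9)

# Group the rainfall values by hour and take the mean of each group
hourly <- aggregate(rainfall, by = list(hours), FUN = mean)
hourly
```

The grouping column is named Group.1 and the aggregated values are named x, which matches the column names in the output above.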

#### Generating the models
Once our dataset has been prepared (and you've answered most of your simple questions), we can finally have some fun and do some machine learning to make predictions! Depending on the type of dataset, you may want to implement a regression algorithm to predict a numeric value (like predicting the speed of a car given these variables) or a classification algorithm to sort observations into different categories (like passing or failing a test). We'll dive deeper into machine learning in workshop #3.

All of this will build up and give you trends or surprise you with the absence thereof. Either way, you'll be telling a story.

#### Storytelling
The purpose of data science is to tell a story. We crunch numbers to discover new things and to propose new policies. Our world would crumble if decisions were not backed by data. So whatever you do when you're analyzing your dataset(s), always keep this purpose in the back of your mind.

These workshops will help you on your way to telling a story with your data. You have a long way to go, but for now, congratulations on completing Week One! You can now manipulate and clean data, and, using what you have learned, no dataset will be too big!