# R Markdown

This is a Jupyter Notebook running an R kernel.

Another option is R Markdown, a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown, see <http://rmarkdown.rstudio.com>.

## Hello World
Let's start with the classic "Hello World", then look at the types of variables R has, how to assign variables properly, and how to view a variable's structure.


In [30]:
print("Hello World")

a <- 'a'
b <- 1
c <- '1'
d <- TRUE
e = 'an equal sign works, but is not proper variable assignment'

#use the function 'str()' to explore the structure of a variable.
str(d)

[1] "Hello World"
 logi TRUE
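
To make the variable types concrete, the sketch below checks each assignment with `class()` and `typeof()`. The first four names mirror the cell above; the variable `f` and the `1L` integer literal are additions for illustration only.

```r
# Same assignments as in the cell above
a <- 'a'
b <- 1
c <- '1'
d <- TRUE

class(a)   # "character"
class(b)   # "numeric"
class(c)   # "character": the quotes make it text, not a number
class(d)   # "logical"

# A bare number is stored as a double; append L to get an integer
typeof(b)  # "double"
f <- 1L    # f is an illustrative addition, not part of the original cell
typeof(f)  # "integer"
```

`class()` reports how R treats the object, while `typeof()` reports how it is stored internally, which is why `b` shows up as "numeric" in one and "double" in the other.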


# Vectors
***

Vectors are powerful data structures in R.  There are two types:

* Atomic Vector
* List

An atomic vector can hold character, logical, integer, or numeric values.  If the elements are of mixed types, R coerces the whole vector to a common data type (see my_third_vector below, where the numbers 8 and 9 become character strings).

In [32]:
my_first_vector <- c('a','b','c','d')
my_second_vector <- c(1,2,3,4)
my_third_vector <- c('six','seven',8,9)
long_vector <- c(my_first_vector, my_second_vector)

str(long_vector)

 chr [1:8] "a" "b" "c" "d" "1" "2" "3" "4"
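
The coercion rule can be checked directly. The sketch below inspects my_third_vector from the cell above and tries a few other mixed combinations; the extra combinations are illustrative additions, not part of the original notebook.

```r
my_third_vector <- c('six', 'seven', 8, 9)
str(my_third_vector)        # chr [1:4] "six" "seven" "8" "9"

# Coercion moves toward the most flexible type:
# logical -> integer -> numeric (double) -> character
class(c(TRUE, 2L))          # "integer"
class(c(TRUE, 2.5))         # "numeric"
class(c(1, 'a'))            # "character"

# Converting back has to be done explicitly
as.numeric(c('8', '9'))     # 8 9
```

Text that does not look like a number, such as 'six', would become NA (with a warning) under `as.numeric()`.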


# Lists

A list is a generic vector that can contain other objects of any type. Unlike Python, indexing in R starts at 1.  Lists are sometimes referred to as recursive, since a list can contain other lists.  R has no built-in dictionary type like Python's dict, so a list (typically with named elements) is the closest thing.

In [33]:
my_first_list <- list('cat','dog','bird','snake')
my_second_list <- list(1,2,3,4)

my_third_list <- list(my_first_list, my_second_list)
str(my_third_list)

List of 2
 $ :List of 4
  ..$ : chr "cat"
  ..$ : chr "dog"
  ..$ : chr "bird"
  ..$ : chr "snake"
 $ :List of 4
  ..$ : num 1
  ..$ : num 2
  ..$ : num 3
  ..$ : num 4
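
Since the text compares lists to dictionaries, here is a minimal sketch of a named list used as a key-value lookup. The names `pet_sounds`, `cat`, `dog`, `cow`, and `fish` are made up for illustration.

```r
# A named list: each element is stored under a key
pet_sounds <- list(cat = 'meow', dog = 'woof')

pet_sounds[['cat']]            # "meow"
pet_sounds$dog                 # "woof"

# Add or overwrite an entry by assigning to a name
pet_sounds[['cow']] <- 'moo'
names(pet_sounds)              # "cat" "dog" "cow"

# Looking up a missing key in a list returns NULL rather than raising an error
is.null(pet_sounds[['fish']])  # TRUE
```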


In [34]:
my_third_list[[1]][1]
my_third_list[[2]][3]
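
The cell above mixes double and single brackets, and the difference matters. As a small sketch using the same lists: single brackets return a sub-list, while double brackets return the element itself.

```r
my_first_list <- list('cat', 'dog', 'bird', 'snake')
my_second_list <- list(1, 2, 3, 4)
my_third_list <- list(my_first_list, my_second_list)

# Single brackets keep the list wrapper: the result is a list of length 1
class(my_third_list[[1]][1])     # "list"
length(my_third_list[[1]][1])    # 1

# Double brackets extract the element itself
my_third_list[[1]][[1]]          # "cat"
my_third_list[[2]][[3]]          # 3
```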

# List vs Vector

A list stores values for retrieval, whereas a vector supports arithmetic on all of its elements at once.  Take a look below:

In [35]:
#remember: my_second_list <- list(1,2,3,4)
#Running the line below raises an error
print(my_second_list * 2)

ERROR: Error in my_second_list * 2: non-numeric argument to binary operator


In [36]:
#remember: my_second_vector <- c(1,2,3,4)
print(my_second_vector * 2)

[1] 2 4 6 8
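
If the data is already in a list, there are a couple of common ways around the error. The sketch below shows two of them on my_second_list; the variable name `doubled` is an illustrative addition.

```r
my_second_list <- list(1, 2, 3, 4)

# Option 1: flatten the list into an atomic vector, then multiply
unlist(my_second_list) * 2                    # 2 4 6 8

# Option 2: apply a function to each element; sapply simplifies the result to a vector
sapply(my_second_list, function(v) v * 2)     # 2 4 6 8

# lapply does the same work but returns a list
doubled <- lapply(my_second_list, function(v) v * 2)
class(doubled)                                # "list"
```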


# Matrices
***
Matrices are a necessity if you plan to do any linear algebra on your data.  They are also extremely fast and efficient for numerical analysis.  The equivalent in Python would be the NumPy package.

In [47]:
x <- 1:9 

myfirstmatrix <- matrix(x, nrow = 3, byrow = TRUE)
mysecondmatrix <- matrix(x, nrow = 3, byrow = FALSE)
str(myfirstmatrix)

# Matrix indexing in R works similarly to vector indexing, but with an additional dimension
myfirstmatrix[3,2]
myfirstmatrix[2,]  #returns the 2nd row
myfirstmatrix[,2]  #returns the 2nd column

print(myfirstmatrix * 2)

#can also name rows/cols
rownames(myfirstmatrix) <- LETTERS[1:3]
colnames(myfirstmatrix) <- letters[1:3]
print(myfirstmatrix)

print(myfirstmatrix["B", "c"])

 int [1:3, 1:3] 1 4 7 2 5 8 3 6 9


     [,1] [,2] [,3]
[1,]    2    4    6
[2,]    8   10   12
[3,]   14   16   18
  a b c
A 1 2 3
B 4 5 6
C 7 8 9
[1] 6
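
Two details from the cell above are easy to miss: what `byrow` changes, and that `*` is element-wise rather than a matrix product. The sketch below illustrates both with the same matrices; the names `elementwise` and `product` are illustrative additions.

```r
x <- 1:9
myfirstmatrix <- matrix(x, nrow = 3, byrow = TRUE)    # filled row by row
mysecondmatrix <- matrix(x, nrow = 3, byrow = FALSE)  # filled column by column (the default)

# The two fill orders are transposes of each other
all(t(myfirstmatrix) == mysecondmatrix)   # TRUE

# * multiplies element by element; %*% is the matrix product
elementwise <- myfirstmatrix * myfirstmatrix
product <- myfirstmatrix %*% myfirstmatrix

elementwise[1, ]   # 1 4 9
product[1, ]       # 30 36 42
```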


# Solving a system of linear equations:

## $$x_1 + 3x_2 = 7$$

## $$2x_1 + 4x_2 = 10$$

In [46]:
a <- matrix(c(1,2,3,4),ncol = 2)  #coefficient matrix, filled column by column
b <- c(7,10)                      #right-hand side of the two equations

print(a)
print(b)
solve(a,b) #solves for x1 and x2

     [,1] [,2]
[1,]    1    3
[2,]    2    4
[1] ""
[1]  7 10
[1] ""
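
A quick way to sanity-check the result of `solve()` is to multiply the coefficient matrix by the solution and compare it with the right-hand side. The sketch below does that for the system above; the name `solution` is an illustrative addition.

```r
a <- matrix(c(1, 2, 3, 4), ncol = 2)   # rows are (1, 3) and (2, 4)
b <- c(7, 10)

solution <- solve(a, b)
solution                       # 1 2, i.e. x1 = 1 and x2 = 2

# Multiplying the coefficients by the solution should give back b
a %*% solution

# Compare with a tolerance, since solve() works in floating point
all.equal(as.vector(a %*% solution), b)   # TRUE
```

Substituting by hand confirms it: 1 + 3(2) = 7 and 2(1) + 4(2) = 10.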


# Dataframes
***
A data frame is a table, or two-dimensional array-like structure, in which each column contains the values of one variable and each row contains one set of values from each column. Data frames share many of the properties of matrices and lists, and most of R's modeling software uses them as its fundamental data structure.

In [39]:
#diamonds is a dataset that ships with the ggplot2 package (not with base R)
library(ggplot2)

mynewdataframe <- data.frame(diamonds)

str(mynewdataframe)

head(mynewdataframe)
#tail(mynewdataframe)

'data.frame':	53940 obs. of  10 variables:
 $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
 $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
 $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
 $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
 $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...


  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
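
To show how a data frame combines list-like and matrix-like behavior without loading the full dataset, the sketch below rebuilds four columns of the six rows shown above by hand. The name `mini` is hypothetical, and `cut` and `color` are stored as plain text here rather than as ordered factors.

```r
mini <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c('Ideal', 'Premium', 'Good', 'Premium', 'Good', 'Very Good'),
  color = c('E', 'E', 'E', 'I', 'J', 'J'),
  price = c(326, 326, 327, 334, 335, 336),
  stringsAsFactors = FALSE   # keep text columns as character in every R version
)

# List-like: each column is a named element
mini$price            # 326 326 327 334 335 336
mini[['cut']][3]      # "Good"
length(mini)          # 4, the number of columns

# Matrix-like: [row, column] indexing and dimensions
mini[3, 'price']      # 327
dim(mini)             # 6 4
```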


In [40]:
#take a quick look at the df
summary(mynewdataframe)
dim(mynewdataframe)
nrow(mynewdataframe)
ncol(mynewdataframe)

     carat               cut        color        clarity          depth      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
                                    J: 2808   (Other): 2531                  
     table           price             x                y         
 Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
 Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
 Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
 3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.:

In [41]:
#view rows 3 and 4
print(mynewdataframe[3:4,])

#view columns CUT and COLOR for the head of the df
head(mynewdataframe[,c('cut','color')])

#notice mynewdataframe[ROW,COLUMN] syntax

  carat     cut color clarity depth table price    x    y    z
3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29 Premium     I     VS2  62.4    58   334 4.20 4.23 2.63


        cut color
1     Ideal     E
2   Premium     E
3      Good     E
4   Premium     I
5      Good     J
6 Very Good     J
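
The [ROW,COLUMN] syntax accepts more than positions and names. As a sketch, the block below uses a small hand-built data frame (the hypothetical `mini`, copied from the first six rows of the dataset as printed earlier) to show negative and logical indices.

```r
mini <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c('Ideal', 'Premium', 'Good', 'Premium', 'Good', 'Very Good'),
  color = c('E', 'E', 'E', 'I', 'J', 'J'),
  price = c(326, 326, 327, 334, 335, 336),
  stringsAsFactors = FALSE
)

# Negative indices drop rows or columns instead of selecting them
names(mini[, -1])                    # "cut" "color" "price"

# A logical vector in the row position keeps the rows where it is TRUE
rownames(mini[mini$price > 330, ])   # "4" "5" "6"

# Selecting a single column returns a vector by default; drop = FALSE keeps a data frame
class(mini[, 'cut'])                 # "character"
class(mini[, 'cut', drop = FALSE])   # "data.frame"
```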


In [42]:
#subsetting a dataframe
good_top <- head(subset(mynewdataframe, cut == 'Good'))
j_340_tail <- subset(mynewdataframe, color == 'J' & price < 340)

print(good_top)
print(j_340_tail)

   carat  cut color clarity depth table price    x    y    z
3   0.23 Good     E     VS1  56.9    65   327 4.05 4.07 2.31
5   0.31 Good     J     SI2  63.3    58   335 4.34 4.35 2.75
11  0.30 Good     J     SI1  64.0    55   339 4.25 4.28 2.73
18  0.30 Good     J     SI1  63.4    54   351 4.23 4.29 2.70
19  0.30 Good     J     SI1  63.8    56   351 4.23 4.26 2.71
21  0.30 Good     I     SI2  63.3    56   351 4.26 4.30 2.71
   carat       cut color clarity depth table price    x    y    z
5   0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6   0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
11  0.30      Good     J     SI1  64.0    55   339 4.25 4.28 2.73
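
`subset()` is a convenience wrapper around bracket indexing. The sketch below runs the same kind of filter both ways on a small hand-built data frame; `mini`, `with_subset`, and `with_brackets` are hypothetical names, and the rows are copied from the first six rows of the dataset as printed earlier.

```r
mini <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut   = c('Ideal', 'Premium', 'Good', 'Premium', 'Good', 'Very Good'),
  color = c('E', 'E', 'E', 'I', 'J', 'J'),
  price = c(326, 326, 327, 334, 335, 336),
  stringsAsFactors = FALSE
)

# subset() evaluates the condition inside the data frame, so column names can be used bare
with_subset <- subset(mini, color == 'J' & price < 340)

# The bracket form needs the data frame name spelled out for each column
with_brackets <- mini[mini$color == 'J' & mini$price < 340, ]

all.equal(with_subset, with_brackets)   # TRUE
rownames(with_subset)                   # "5" "6"
```

One difference worth knowing: `subset()` treats a missing (NA) condition as FALSE and drops that row, whereas the bracket form returns a row of NAs for it.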
