# Introduction to R
> R is an interpreted programming language that is mainly used in the scientific domain.
> It is widely used among statisticians and data miners.

## Basics

### Comments

In [2]:
# Booga booga
# Namaskara

### Integers

In [3]:
4L + 2L

### Strings

In [4]:
"Hello World"

### Floats

In [5]:
4.2

### Logical

In [6]:
TRUE

In [7]:
FALSE

---

## Arithmetic

In [10]:
# Additions are done with `+`
40 + 2

In [11]:
# Subtractions with `-`
44 - 2

In [12]:
# Multiplication with `*`
21 * 2 

In [13]:
# Divisions with `/`
84 / 2

In [14]:
# Exponentiation with `^`
7 ^ 8

In [15]:
# Modulo with `%%`
71 %% 5

---

## Variables & Assignment
Use the `<-` operator to assign values to variables.

In [16]:
answer_to_life <- 42

All arithmetic operations are supported on variables.

In [17]:
foo <- 21
bar <- 21

answer_to_life <- foo + bar
answer_to_life

---

## Types
The `class` function can be used to identify the underlying type of a variable or literal

In [18]:
class(42L)
class(42.0)
class("Forty Two")

---

### [ Exercise ]

What is the type of `TRUE` / `FALSE` ?

In [19]:
# Answer here

---

## Vectors
Vectors are one-dimensional arrays that can hold **one** type of data. The `c` function allows us to create a vector out of the provided values.

In [21]:
# Let us say we want to express the number of kilometers
# that we have run in the past 5 days. 
# We can use a vector for this.

kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_run

In [22]:
# We can also use it to track on which days of the week we ran

did_run <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
did_run
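
Because a vector holds only one type, `c` converts mixed values to a common type instead of failing. A small sketch of that behaviour (the names `mixed_numbers` and `mixed_text` are made up for this illustration):

```r
# Mixing numbers and logicals: the logicals become numbers (TRUE -> 1, FALSE -> 0)
mixed_numbers <- c(4.0, TRUE, FALSE)
class(mixed_numbers)
mixed_numbers

# Mixing in a string: every value becomes a string
mixed_text <- c(4.0, "five", TRUE)
class(mixed_text)
mixed_text
```

So if a vector of numbers unexpectedly prints with quotes around each value, a stray string in the input is a likely cause.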

---

### [ Exercise ]

For some analysis, let us put together the amount of kilometers that we have run over the past 2 weeks. The last week, we ran: 

Day of week | Kilometers
----------- | ------------
Monday      | 4
Tuesday     | 5.2
Wednesday   | 6
Thursday    | 5.2
Friday      | 5

This is expressed as the vector:

In [23]:
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)

This week, we ran:

Day of week | Kilometers
----------- | ------------
Monday      | 6
Tuesday     | 6.2
Wednesday   | 6
Thursday    | 7.2
Friday      | 7.5

Populate this in a vector `kms_this_week`

In [24]:
# Answer here

---

When we're looking at a vector, it makes more sense if we can somehow name all the values, right? 

Just looking at `kms_last_week` can become confusing. Let us use the `names` function to give each element the day of the week

In [26]:
names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

In [27]:
# Now, the data can stand independently and is much clearer
kms_last_week

**Note**: We're assigning a vector when we're giving names. So instead of repeating it multiple times, we can reuse the vector as well 


In [30]:
days_of_week <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(kms_last_week) <- days_of_week
kms_last_week

### Vector arithmetic

Arithmetic can be performed on vectors. Let us calculate the total number of kilometers that we ran on each day in the past 2 weeks

In [32]:
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)

total_kms_past_2_weeks <- kms_last_week + kms_this_week
total_kms_past_2_weeks
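
The other arithmetic operators work element by element in the same way, and a single number is applied to every element. A minimal sketch (the names `improvement` and `meters_last_week` are made up for this illustration):

```r
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
kms_this_week <- c(6.0, 6.2, 6.0, 7.2, 7.5)

# Element-wise difference: how much further did we run on each day this week?
improvement <- kms_this_week - kms_last_week
improvement

# A single number is applied to every element (here: kilometers to meters)
meters_last_week <- kms_last_week * 1000
meters_last_week
```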

But how many kilometers did we run in total each week? Sum each of the vectors using the `sum` function. Simple, no?


In [34]:
distance_last_week <- sum(kms_last_week)
distance_this_week <- sum(kms_this_week)

distance_last_week
distance_this_week

---

### [ Exercise ]

- Assign days of the week names to the `total_kms_past_2_weeks` vector using the `names` function.

In [35]:
# Answer here

- What is the total distance we ran across both weeks? Use the `total_kms_past_2_weeks` vector to arrive at your answer

In [36]:
# Answer here

---

### Vector element selection

Consider the `total_kms_past_2_weeks` vector. Let us say that we want to get the distance we ran across both weeks on Wednesday. We know that Wednesday is the 3rd day of the week, so we pick the **3rd** element from the vector like so:

In [37]:
total_kms_past_2_weeks[3]

**Note:** A very important thing to note here is that R begins its indexing from `1` and not `0` unlike most other programming languages.


What if we're interested in a section of the results, say our performance as the week comes to an end (Wednesday, Thursday, Friday)?

We can provide a vector of required indices like so:

In [38]:
total_kms_past_2_weeks[c(3,4,5)]

But say we have 100 elements in the vector. It would soon become tedious to select a range, say `50-72` or `44-62`, right? To solve this problem, R provides us with the range operator (`:`), which takes a starting number and an ending number and returns a vector containing all the numbers from one to the other. We can then use this to fetch the required elements.

In [39]:
# Let us look at just the range operator
1:5
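
To see the range operator pay off on a longer vector, here is a small sketch. The vector `hundred_values` is hypothetical and only exists to have 100 elements to pick from:

```r
# A hypothetical vector with 100 elements: 10, 20, ..., 1000
hundred_values <- seq(10, 1000, by = 10)

# Elements 50 to 72, without typing out every index
selected <- hundred_values[50:72]
length(selected)
selected
```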

---

### [ Exercise ]

Use the Range operator (`:`) to fetch the Monday - Wednesday section in the `total_kms_past_2_weeks` vector

In [40]:
# Answer here

---

Also, since we've given names to the vector elements, we can use those names to select elements instead of using indexes.

In [41]:
names(total_kms_past_2_weeks) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
total_kms_past_2_weeks["Wednesday"]
total_kms_past_2_weeks[c("Monday", "Tuesday")]

We can also perform logical operations on vectors. Let us check on which days in the last week we ran more than 5 kilometers

In [43]:
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)

names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

days_more_than_5 <- kms_last_week > 5
days_more_than_5

We can use logical operations in combination with the vector to select only those elements from a vector that match a condition.

Now that `days_more_than_5` contains a logical vector marking the days where we ran more than 5 kilometers, let us select _just_ those items into another vector

In [44]:
kms_last_week[days_more_than_5]
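
The intermediate variable is optional: the condition can go straight inside the brackets. And since `TRUE` counts as 1, `sum` on a logical vector answers the "how many days" question. A short sketch (the names `long_days` and `number_of_long_days` are made up for this illustration):

```r
kms_last_week <- c(4.0, 5.2, 6.0, 5.2, 5.0)
names(kms_last_week) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Condition written directly inside the brackets
long_days <- kms_last_week[kms_last_week > 5]
long_days

# TRUE counts as 1, so sum() of a logical vector counts the matching days
number_of_long_days <- sum(kms_last_week > 5)
number_of_long_days
```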

## Factors
Most data is categorical, meaning that it can be put into categories.

Let's start with something simple. Training for runs happens in 2 forms:

- Interval based training, where you focus on speed 
- Distance based training, where the focus is on endurance

Take a vector that represents all the days we did intervals / distance in the last 10 days. There are some days where we rest as well.

In [45]:
running_style <- c("INT", "INT", "DIST", "DIST", "DIST", "REST", "INT", "DIST", "DIST", "DIST")
names(running_style) <- 1:10
running_style

As you see, we can divide our runs into **categories**. Factors are used to represent these categories. Let us use the `factor` function to extract the factors out of this vector

In [46]:
running_style.f = factor(running_style)
running_style.f

Perfect, this tells us that we have 3 `level`s, i.e. we indeed have 3 running styles. 

We can confirm that `running_style.f` is indeed a factor variable by checking its underlying type with the `class` function

In [47]:
class(running_style.f)

Once we have our `level`s, we can rename them to suit us with the `levels` function (very similar to the `names` function)

In [48]:
levels(running_style.f)
levels(running_style.f) <- c("Endurance", "Speed", "Rest")
levels(running_style.f)
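
The order of the new names matters: `factor` sorts the levels alphabetically by default (`DIST`, `INT`, `REST`), so the replacement names have to be given in that same order. A minimal sketch that makes the mapping explicit with the `levels` and `labels` arguments of `factor` (the names `style_levels` and `running_style_labelled` are made up for this illustration):

```r
running_style <- c("INT", "INT", "DIST", "DIST", "DIST", "REST", "INT", "DIST", "DIST", "DIST")

# Levels are sorted alphabetically by default, not by first appearance
style_levels <- levels(factor(running_style))
style_levels

# Spelling out levels and labels together makes the mapping explicit
running_style_labelled <- factor(
  running_style,
  levels = c("DIST", "INT", "REST"),
  labels = c("Endurance", "Speed", "Rest")
)
running_style_labelled
```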

This also gives us access to a new function - `summary` which gives us a summary of the data

In [49]:
summary(running_style.f)

This quickly tells us that out of the 10 days we ran, on 6 we did distance runs, 3 were interval runs and we took 1 day of rest.

### Types of factor variables

As said, factors allow us to create _categorical_ variables. These variables can be of 2 types:

- Nominal
- Ordinal

#### Nominal Variables

By default, a factor is nominal, meaning that it distinguishes categories by name and without any assigned order. So trying a logical `<` or `>` operation against them won't give us a meaningful result

In [50]:
running_style.f = factor(running_style)
running_style.f[1] > running_style.f[2]

In Ops.factor(running_style.f[1], running_style.f[2]): '>' not meaningful for factors

[1] NA

As you see, the result is `NA`, and R warns us that `>` is not meaningful for factors (it is a warning, not an error)


#### Ordinal Variables

Passing an `ordered=TRUE` argument to `factor` makes the factor an ordinal variable, for which `<` and `>` are meaningful. (The shortened spelling `order=TRUE` also works, because R matches argument names partially.)

Consider the amount of kilometers run in the past 10 days

In [52]:
kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)

In [53]:
# Let's classify this into Long, Medium and Short runs manually
distance_type <- c("S", "M", "M", "M", "M", "M", "M", "M", "L", "L")

Now we can create a factor from this, but we understand an order here: Short < Medium < Long. To introduce an order, we need to pass `ordered=TRUE` (or `order=TRUE` for short) along with the `levels` in the order we require.


In [54]:
# `ordered` is the full argument name; `order` only works through partial matching
distance_type.f = factor(distance_type, ordered=TRUE, levels=c("S", "M", "L"))
distance_type.f
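
Instead of labelling each run by hand, the categories can also be derived from the distances with the `cut` function. This is only a sketch: the cut-offs (below 5 km is short, 7 km and above is long) are assumptions chosen so that the result matches the manual labels above, and `distance_type_cut` is a made-up name.

```r
kms_run <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)

# Assumed cut-offs: below 5 km is short, 5 to below 7 km is medium, 7 km and up is long
distance_type_cut <- cut(
  kms_run,
  breaks = c(0, 5, 7, Inf),
  labels = c("S", "M", "L"),
  right = FALSE,          # intervals are [0, 5), [5, 7), [7, Inf)
  ordered_result = TRUE   # return an ordered factor
)
distance_type_cut
```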

Now that we have an order in place, we can use `<` and `>`


In [55]:
distance_type.f[1]
distance_type.f[2]
distance_type.f[1] > distance_type.f[2]
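
Comparisons on an ordered factor also work on the whole vector at once, which is handy for counting or filtering. A minimal sketch (the names `at_least_medium` and `longest` are made up for this illustration):

```r
distance_type <- c("S", "M", "M", "M", "M", "M", "M", "M", "L", "L")
distance_type.f <- factor(distance_type, ordered = TRUE, levels = c("S", "M", "L"))

# Compare every element against a level at once
at_least_medium <- distance_type.f >= "M"
at_least_medium

# Count the runs that were medium or longer
sum(at_least_medium)

# The longest category that occurs in the data
longest <- max(distance_type.f)
longest
```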

The real reason why factors are important will be covered in forthcoming sessions. This just introduces the concept and the necessity for it.

## Data Frame

The Data Frame is R's most iconic type. Soon, you'll find out that a Data Frame
is great to express all kinds of data

Think of a Data Frame as a 2-dimensional structure having rows and columns. Each column may be of a different type, and each row can be thought of as representing an observation.

To quickly get started with data frames, let us use an inbuilt data frame in R that contains some data on cars. From the help:

> The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

This data is stored in `mtcars`. Let us look at it.

In [56]:
mtcars

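Car               | mpg  | cyl | disp  | hp  | drat | wt    | qsec  | vs | am | gear | carb
----------------- | ---- | --- | ----- | --- | ---- | ----- | ----- | -- | -- | ---- | ----
Mazda RX4         | 21.0 | 6   | 160.0 | 110 | 3.9  | 2.62  | 16.46 | 0  | 1  | 4    | 4
Mazda RX4 Wag     | 21.0 | 6   | 160.0 | 110 | 3.9  | 2.875 | 17.02 | 0  | 1  | 4    | 4
Datsun 710        | 22.8 | 4   | 108.0 | 93  | 3.85 | 2.32  | 18.61 | 1  | 1  | 4    | 1
Hornet 4 Drive    | 21.4 | 6   | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1  | 0  | 3    | 1
Hornet Sportabout | 18.7 | 8   | 360.0 | 175 | 3.15 | 3.44  | 17.02 | 0  | 0  | 3    | 2
Valiant           | 18.1 | 6   | 225.0 | 105 | 2.76 | 3.46  | 20.22 | 1  | 0  | 3    | 1
Duster 360        | 14.3 | 8   | 360.0 | 245 | 3.21 | 3.57  | 15.84 | 0  | 0  | 3    | 4
Merc 240D         | 24.4 | 4   | 146.7 | 62  | 3.69 | 3.19  | 20.0  | 1  | 0  | 4    | 2
Merc 230          | 22.8 | 4   | 140.8 | 95  | 3.92 | 3.15  | 22.9  | 1  | 0  | 4    | 2
Merc 280          | 19.2 | 6   | 167.6 | 123 | 3.92 | 3.44  | 18.3  | 1  | 0  | 4    | 4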


---

### [ Exercise ]

How will you find out what **type** `mtcars` is?

In [57]:
# Answer here

---

As you can see, it contains data about cars. Each row represents one particular car and its associated details.

One of the most important things when working with Data Frames, and with Data Science in general, is to spend time understanding the structure of the data. The structure, however, is independent of the data itself, so a _glimpse_ of the data is enough to get started.

For this purpose, R provides 2 functions, `head` and `tail`, that allow us to peek at the start and the end of the data frame


In [58]:
# head
head(mtcars)

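Car               | mpg  | cyl | disp | hp  | drat | wt    | qsec  | vs | am | gear | carb
----------------- | ---- | --- | ---- | --- | ---- | ----- | ----- | -- | -- | ---- | ----
Mazda RX4         | 21.0 | 6   | 160  | 110 | 3.9  | 2.62  | 16.46 | 0  | 1  | 4    | 4
Mazda RX4 Wag     | 21.0 | 6   | 160  | 110 | 3.9  | 2.875 | 17.02 | 0  | 1  | 4    | 4
Datsun 710        | 22.8 | 4   | 108  | 93  | 3.85 | 2.32  | 18.61 | 1  | 1  | 4    | 1
Hornet 4 Drive    | 21.4 | 6   | 258  | 110 | 3.08 | 3.215 | 19.44 | 1  | 0  | 3    | 1
Hornet Sportabout | 18.7 | 8   | 360  | 175 | 3.15 | 3.44  | 17.02 | 0  | 0  | 3    | 2
Valiant           | 18.1 | 6   | 225  | 105 | 2.76 | 3.46  | 20.22 | 1  | 0  | 3    | 1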


In [59]:
# tail
tail(mtcars)

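Car            | mpg  | cyl | disp  | hp  | drat | wt    | qsec | vs | am | gear | carb
-------------- | ---- | --- | ----- | --- | ---- | ----- | ---- | -- | -- | ---- | ----
Porsche 914-2  | 26.0 | 4   | 120.3 | 91  | 4.43 | 2.14  | 16.7 | 0  | 1  | 5    | 2
Lotus Europa   | 30.4 | 4   | 95.1  | 113 | 3.77 | 1.513 | 16.9 | 1  | 1  | 5    | 2
Ford Pantera L | 15.8 | 8   | 351.0 | 264 | 4.22 | 3.17  | 14.5 | 0  | 1  | 5    | 4
Ferrari Dino   | 19.7 | 6   | 145.0 | 175 | 3.62 | 2.77  | 15.5 | 0  | 1  | 5    | 6
Maserati Bora  | 15.0 | 8   | 301.0 | 335 | 3.54 | 3.57  | 14.6 | 0  | 1  | 5    | 8
Volvo 142E     | 21.4 | 4   | 121.0 | 109 | 4.11 | 2.78  | 18.6 | 1  | 1  | 4    | 2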


In [60]:
# Another way to get a quick glimpse of the data is to use the `str` function.
str(mtcars)

'data.frame':	32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...


The `str` function, as you can see, shows us some nice details. It tells us:

- The number of observations (rows) we have (`32`)
- The number of variables (columns) in consideration (`11`)
- Each column with its data type and its first few entries
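
If only the size or the column names are needed, base R also has dedicated functions for each. A short sketch on the same `mtcars` data frame:

```r
# Number of rows and columns, in that order
dim(mtcars)
nrow(mtcars)
ncol(mtcars)

# Column names only
names(mtcars)
```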

### Creating Data Frames

Let us create our own Data Frame to better understand their underlying concepts.

Let's put together a bunch of vectors representing the different variables (columns) in our data frame 

In [62]:
distance <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)
time_taken <- c(20.5, 28.0, 40.2, 24.1, 26.0, 42.0, 43.2, 40.1, 50.2, 50.7)
run_type <- c("S", "S", "E", "S", "S", "E", "E", "E", "E", "E") # S is speed; E is endurance
workout_after <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)

Now that we have 4 vectors, we can create a data frame from them using the `data.frame` function


In [64]:
running.df <- data.frame(distance, time_taken, run_type, workout_after)

In [65]:
# Let's print this to see what we get
running.df

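Row | distance | time_taken | run_type | workout_after
--- | -------- | ---------- | -------- | -------------
1   | 4.0      | 20.5       | S        | TRUE
2   | 5.2      | 28.0       | S        | FALSE
3   | 6.0      | 40.2       | E        | FALSE
4   | 5.2      | 24.1       | S        | FALSE
5   | 5.0      | 26.0       | S        | TRUE
6   | 6.0      | 42.0       | E        | TRUE
7   | 6.2      | 43.2       | E        | FALSE
8   | 6.0      | 40.1       | E        | FALSE
9   | 7.2      | 50.2       | E        | FALSE
10  | 7.5      | 50.7       | E        | FALSE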


Quite similar to the data frame we saw earlier with cars. Let's work with this!

---

### [ Exercise ]

Use the `head`, `tail` and the `str` function to inspect the data frame we just created (`running.df`)

In [66]:
# Answer here

---

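**Note:** When you run `str` on the data frame, check the type of the `run_type` column. In R versions before 4.0.0, `data.frame` automatically interprets character columns as the `Factor` type. From R 4.0.0 onwards the default is `stringsAsFactors = FALSE`, so the column stays `chr` unless you ask for the conversion.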
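
To get the same column type regardless of the R version, the `stringsAsFactors` argument of `data.frame` can be set explicitly. A minimal sketch using two of the vectors from above (the names `running_factor.df` and `running_chr.df` are made up for this illustration):

```r
distance <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)
run_type <- c("S", "S", "E", "S", "S", "E", "E", "E", "E", "E")

# Ask for the conversion explicitly so the result does not depend on the R version
running_factor.df <- data.frame(distance, run_type, stringsAsFactors = TRUE)
class(running_factor.df$run_type)
levels(running_factor.df$run_type)

# Keep the strings as plain character data instead
running_chr.df <- data.frame(distance, run_type, stringsAsFactors = FALSE)
class(running_chr.df$run_type)
```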

### Selecting elements

Rows and columns can be selected from the data frame in much the same way as elements of a vector, i.e. using `[` and `]`

Within `[` and `]` there are 2 parts - The row part and the column part separated by a comma (`,`)

In [67]:
# Let us pick up the value in the 4th column, 2nd row
running.df[2, 4]

---

### [ Exercise ]

- The row and column parts, much like the vector element selector, allow the use of the range operator (`:`). Select columns 1-2 for rows 5-9

In [68]:
# Answer here

R also makes it possible to omit one part of the 2 parts inside `[` and `]`. The separator is mandatory though. So now

- Select columns 1-2 for all rows     

In [69]:
# Answer here

- Select all columns for rows 5-9

In [70]:
# Answer here 

---

You can also use the name of the column to select instead of specifying the numbers

In [72]:
running.df[5:9, "distance"]

There are times where we want to operate only on one column. We have, as of now, understood that there are 2 ways to do this: 

In [74]:
# Using the index of the distance column
running.df[,1]

In [75]:
# Using the column name
running.df[, "distance"]

There's also a 3rd way, which you'll see used extensively throughout R, and that uses the `$` operator

In [76]:
running.df$distance

**Note:** Do note that when you are working on an **individual** column, the data structure is a `vector` and not a `data.frame`
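
When a one-column data frame is wanted instead of a vector, base R offers two options: pass `drop = FALSE` inside the brackets, or index with the column name alone and no comma. A short sketch, using a smaller two-column version of `running.df`:

```r
distance <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)
time_taken <- c(20.5, 28.0, 40.2, 24.1, 26.0, 42.0, 43.2, 40.1, 50.2, 50.7)
running.df <- data.frame(distance, time_taken)

# `$` and `[, "name"]` return a plain numeric vector
class(running.df$distance)
class(running.df[, "distance"])

# These two keep a one-column data frame instead
class(running.df[, "distance", drop = FALSE])
class(running.df["distance"])
```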


### Working with subsets

Let us say that we want to select the first 4 rows of the data frame. We can do so by passing a logical vector like so

In [77]:
running.df[c(TRUE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE),]

Row | distance | time_taken | run_type | workout_after
--- | -------- | ---------- | -------- | -------------
1   | 4.0      | 20.5       | S        | TRUE
2   | 5.2      | 28.0       | S        | FALSE
3   | 6.0      | 40.2       | E        | FALSE
4   | 5.2      | 24.1       | S        | FALSE


But this is tedious so R gives us a `subset` function to do the same thing in a more readable fashion

In [79]:
# Let's pick out all the days where we did endurance runs.
subset(running.df, subset = (run_type == "E"))

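Row | distance | time_taken | run_type | workout_after
--- | -------- | ---------- | -------- | -------------
3   | 6.0      | 40.2       | E        | FALSE
6   | 6.0      | 42.0       | E        | TRUE
7   | 6.2      | 43.2       | E        | FALSE
8   | 6.0      | 40.1       | E        | FALSE
9   | 7.2      | 50.2       | E        | FALSE
10  | 7.5      | 50.7       | E        | FALSE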
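
The logical vector for the brackets does not have to be typed out by hand either: a comparison on a column produces one, just as it did for plain vectors. A minimal sketch of the bracket form (the names `endurance_runs` and `long_runs` are made up for this illustration):

```r
distance <- c(4.0, 5.2, 6.0, 5.2, 5.0, 6.0, 6.2, 6.0, 7.2, 7.5)
time_taken <- c(20.5, 28.0, 40.2, 24.1, 26.0, 42.0, 43.2, 40.1, 50.2, 50.7)
run_type <- c("S", "S", "E", "S", "S", "E", "E", "E", "E", "E")
workout_after <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
running.df <- data.frame(distance, time_taken, run_type, workout_after)

# The same endurance rows, using a logical vector built from the column
endurance_runs <- running.df[running.df$run_type == "E", ]
endurance_runs

# Any condition works, for example runs longer than 6 km
long_runs <- running.df[running.df$distance > 6, ]
long_runs
```

Inside `subset` the column names can be used directly, while the bracket form needs the `running.df$` prefix.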


---

### [ Exercise ]

Pick out all those rows where we did endurance runs **and** worked out after the run

In [81]:
# Answer here

---

### Ordering

Ordering helps us to understand our data better and helps with comparison.

The `order` function helps us to do that in R. It is quite smart as well. Consider a vector

In [85]:
some_alphabets <- c("h", "a", "q", "z", "n", "r")
some_alphabets

In [83]:
# Let's call order on them and see what happens
order(some_alphabets)

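It gives us a vector of positions: the first number is the index of the smallest element, the second is the index of the next one, and so on. Now let us index the original vector with this one to get the values in sorted order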


In [84]:
o <- order(some_alphabets)
some_alphabets[o]
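
To make the link between the positions and the sorted values concrete, here is the same example with the result compared against the `sort` function, which returns the sorted values directly:

```r
some_alphabets <- c("h", "a", "q", "z", "n", "r")

# order() returns positions: "a" is at position 2, "h" at position 1, and so on
o <- order(some_alphabets)
o

# Indexing with those positions gives the sorted values, the same as sort()
some_alphabets[o]
identical(some_alphabets[o], sort(some_alphabets))
```

The positions are what make `order` useful for data frames: the same index vector can be used to rearrange whole rows, not just one column.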

We can also sort it in the opposite order using the `decreasing=TRUE` argument to `order`

---

### [ Exercise ]

- Sort the `some_alphabets` vector in the descending order

In [86]:
# Answer here    

- We can also use `order` on a column in a data frame. Order the `distance` column within our `running.df` data frame

In [87]:
# Answer here

- From the existing `running.df` data frame, create a new data frame (`running.df_ordered`) that is ordered by the `distance` column

In [88]:
# Answer here

---