<!-- img src="http://cognitiveclass.ai/wp-content/uploads/2017/11/cc-logo-square.png" width="150"-->

<h1 align=center>R BASICS</h1> 

### Welcome!

By the end of this notebook, you will have learned the basics of R!  

## Table of Contents


<ul>
<li><a href="#About-the-Dataset">About the Dataset</a></li>
<li><a href="#Simple-Mathematics-in-R">Simple Math in R</a></li>
<li><a href="#Variables-in-R">Variables in R</a></li>
<li><a href="#Vectors-in-R">Vectors in R</a></li>
<li><a href="#Strings-in-R">Strings in R</a></li>
</ul>
<p></p>
Estimated Time Needed: <strong>15 min</strong>

<hr>

<center><h2>About the Dataset</h2></center>

Which movie should you watch next? 

Let's say each of your friends tells you their favorite movies. You do some research on the movies and put it all into a table. Now you can begin exploring the dataset and asking questions about the movies. For example, you can check whether movies from certain genres tend to get better ratings. You can check how the production cost of movies changes across years, and much more. 

**Movies dataset**

The table gathered includes one row for each movie, with several columns for each movie characteristic:

- **name** - Name of the movie
- **year** - Year the movie was released
- **length_min** - Length of the movie (minutes)
- **genre** - Genre of the movie
- **average_rating** - Average rating on [IMDB](http://www.imdb.com/)
- **cost_millions** - Movie's production cost (millions in USD)
- **foreign** - Is the movie foreign (1) or domestic (0)?
- **age_restriction** - Age restriction for the movie
<br>


<img src = "https://ibm.box.com/shared/static/6kr8sg0n6pc40zd1xn6hjhtvy3k7cmeq.png" width = 90% align="left">

### We can use R to help us explore the dataset
But to begin, we'll need to get the basics, so let's get started!

<hr>

<center><h2>Simple Mathematics in R</h2></center>

Let's say you want to watch *Fight Club* and *Star Wars: Episode IV (1977)*, back-to-back. Do you have enough time to **watch both movies in 4 hours?** Let's try using simple math in R.  

What is the **total movie length** for Fight Club and Star Wars (1977)?
- **Fight Club**: 139 min
- **Star Wars: Episode IV**: 121 min

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
**Tip**: To run the grey code cell below, click on it, and press Shift + Enter.
</div>

In [1]:
139 + 121 

Great! You've determined that the total movie play time is **260 min**.  

**What is 260 min in hours?**

In [2]:
260 / 60

Well, it looks like it's **over 4 hours**, which means you can't watch *Fight Club* and *Star Wars (1977)* back-to-back if you only have 4 hours available!

<hr></hr>
<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<h4> [Tip] Simple math in R </h4>
<p></p>
You can do a variety of mathematical operations in R including:  
<li> addition: **2 + 2** </li>
<li> subtraction: **5 - 2** </li>
<li> multiplication: **3 \* 2** </li>
<li> division: **4 / 2** </li>
<li> exponentiation: **4 \*\* 2** or **4 ^ 2**</li>
</div>
<hr></hr>
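As a small sketch that goes one step beyond the operators listed in the tip, R also has integer division (`%/%`) and remainder (`%%`) operators. Assuming the same two movie lengths as above, they split the total into whole hours and leftover minutes:

```r
# Total running time in minutes (values from the example above)
total_min <- 139 + 121

hours   <- total_min %/% 60  # integer division: whole hours
minutes <- total_min %% 60   # remainder: minutes left over

hours    # 4
minutes  # 20
```

So the two movies take 4 hours and 20 minutes, which confirms that they do not fit into a 4-hour window.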

<center><h2>Variables in R</h2></center>

We can also **store** our output in **variables**, so we can use them later on. For example:

In [3]:
x <- 139 + 121

To return the value of **`x`**, we can simply run the variable as a command:

In [4]:
x

We can also perform operations on **`x`** and save the result to a **new variable**:

In [5]:
y <- x / 60
y

If we save something to an **existing variable**, it will **overwrite** the previous value:

In [6]:
x <- x / 60
x

It's good practice to use **meaningful variable names**, so you don't have to keep track of what variable is what:

In [7]:
total <- 139 + 121
total

In [8]:
total_hr <- total / 60
total_hr

You can put this all into a single expression, but remember to use **round brackets** to add together the movie lengths first, before dividing by 60.

In [None]:
total_hr <- (139 + 121) / 60
total_hr

<hr></hr>
<div class="alert alert-success alertsuccess" style="margin-top: 0px">
<h4> [Tip] Variables in R </h4>
<p></p>
As you just learned, you can use **variables** to store values for repeated use. Here are some more **characteristics of variables in R**:
<li>variables store a value, such as the output of an expression </li>
<li>variables are typically assigned using **<-**, but can also be assigned using **=**, as in **x <- 1** or **x = 1** </li>
<li>once created, variables can be removed from memory using **rm(**my_variable**)**  </li>
<p></p>
</div>
<hr></hr>
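The points in the tip can be tried out in a few lines. This is only an illustrative sketch; the variable names `runtime_min` and `runtime_hr` are made up for the example:

```r
runtime_min <- 139 + 121        # assignment with <-
runtime_hr = runtime_min / 60   # assignment with = also works here

exists("runtime_hr")   # TRUE: the variable is in memory
rm(runtime_hr)         # remove it
exists("runtime_hr")   # FALSE: the variable is gone
```

The base function `exists()` is used here only to show the effect of `rm()`.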

<center><h2>Vectors in R</h2></center>

What if we want to know the **movie length in _hours_**, not minutes, for _Toy Story_ and for _Akira_?
- **Toy Story (1995)**: 81 min
- **Akira (1998)**: 125 min

In [9]:
c(81, 125) / 60

As you see above, we've applied a single math operation to both of the items in **`c(81, 125)`**. You can even assign **`c(81, 125)`** to a variable before performing an operation.

In [None]:
# movie lengths in minutes for Toy Story and Akira
length_min <- c(81, 125)
length_min / 60

What we just did was create vectors, using the combine function **`c()`**. The **`c()`** function takes multiple items, then combines them into a **vector**. 

It's important to understand that **vectors** are used everywhere in R, and vectors are easy to use.

In [2]:
c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
c(1:10)

In [1]:
c(10:1) # 10 to 1
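One detail worth knowing, shown here as a minimal sketch: the colon operator already returns a vector, so wrapping it in `c()` is optional. Also, numbers typed out by hand are stored as "numeric", while `1:3` produces "integer" values. The names `a` and `b` are placeholders for this example.

```r
a <- c(1, 2, 3)   # typed out by hand
b <- 1:3          # built with the colon operator

class(a)                 # "numeric"
class(b)                 # "integer"
all(a == b)              # TRUE: the values compare as equal
identical(c(1:3), 1:3)   # TRUE: c() around a sequence changes nothing
```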

<hr></hr>
<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<h4> [Tip] # Comments</h4>  

Did you notice the **comment** after the **c(10:1)** above? Comments are very useful in describing your code. You can create your own comments by using the **#** symbol and writing your comment after it. R will interpret it as a comment, not as code.

<p></p>
</div>

<hr></hr>

<center><h2>Strings in R</h2></center>

R isn't just about numbers -- we can also work with strings. For example:

In [4]:
movie <- "Toy Story"
movie

In R, you can identify **character strings** when they are wrapped with **matching double (") or single (') quotes**.

Let's create a **character vector** for the following **genres**:
- Animation
- Comedy
- Biography
- Horror
- Romance
- Sci-fi

In [5]:
genres <- c("Animation", "Comedy", "Biography", "Horror", "Romance", "Sci-fi")
genres

<hr>

<center><h2>Vectors</h2></center>

<ul>
<li><a href="#Vector-Operations">Vector Operations</a></li>
<li><a href="#Subsetting-Vectors">Subsetting Vectors</a></li>
<li><a href="#Factors">Factors</a></li>
</ul>
<p></p>
Estimated Time Needed: <strong>15 min</strong>

**Vectors** are sequences of numbers, characters or logical values (a one-dimensional array). In other words, a vector is a simple tool to store your grouped data.

In R, you create a vector with the combine function **c()**. You place the vector elements, separated by commas, between the parentheses. Vectors will be very useful in the future as they allow you to apply operations on a series of data easily.

Note that the items in a vector <em>must be of the same class</em>, for example all should be either number, character, or logical.
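To see what the "same class" rule means in practice, here is a small sketch of what happens when you mix types anyway: R silently converts every item to a common class. The names `mixed_1` and `mixed_2` are made up for this illustration.

```r
mixed_1 <- c(1995, "Toy Story")   # number + character
mixed_2 <- c(1995, TRUE)          # number + logical

class(mixed_1)   # "character": 1995 was converted to "1995"
mixed_1
class(mixed_2)   # "numeric": TRUE was converted to 1
mixed_2
```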

### Numeric, Character and Logical Vectors

Let's say we have four movie release dates (1985, 1999, 2015, 1964) and we want to assign them to a single variable, `release_year`. This means we'll need to create a vector using **`c()`**.

Using numbers, this becomes a **numeric vector**.

In [None]:
release_year <- c(1985, 1999, 2015, 1964)
release_year

What if we use quotation marks? Then this becomes a **character vector**.

In [None]:
# Create a titles vector and assign values to it
titles <- c("Toy Story", "Akira", "The Breakfast Club")
titles

There are also **logical vectors**, which consist of TRUE's and FALSE's. They're particularly important when you want to check the contents of a vector.

In [None]:
titles == "Akira" # which item in `titles` is equal to "Akira"?
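A logical vector like this one can be summarized with a few base R helpers. The sketch below stores the comparison in `is_akira`, a name chosen only for the example:

```r
titles <- c("Toy Story", "Akira", "The Breakfast Club")
is_akira <- titles == "Akira"

which(is_akira)  # 2: the position of the TRUE value
sum(is_akira)    # 1: TRUE counts as 1, FALSE as 0
any(is_akira)    # TRUE: at least one item matched
```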

<hr></hr>
<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<h4> [Tip] TRUE and FALSE in R </h4>  

Did you know? R only recognizes `TRUE`, `FALSE`, `T` and `F` as special values for true and false. That means all other spellings, including *True* and *true*, are not interpreted by R as logical values.

<p></p>
</div>

<hr></hr>

<center><h2>Vector Operations</h2></center>

### Adding more elements to a vector

You can add more elements to a vector with the same **`c()`** function you use to create vectors:

In [None]:
release_year <- c(1985, 1999, 2015, 1964)
release_year

In [None]:
release_year <- c(release_year, 2016:2018)
release_year

### Length of a vector

How do we check how many items there are in a vector? We can use the **length()** function:

In [None]:
release_year
length(release_year)

### Head and Tail of a vector

We can also retrieve just the **first few items** using the **head()** function:

In [None]:
head(release_year) #first six items

In [None]:
head(release_year, n = 2) #first n items

In [None]:
head(release_year, 2)

We can also retrieve just the **last few items** using the **tail()** function:

In [None]:
tail(release_year) #last six items

In [None]:
tail(release_year, 2) #last two items

### Sorting a vector

We can also sort a vector:

In [None]:
sort(release_year)

We can also **sort in decreasing order**:

In [None]:
sort(release_year, decreasing = TRUE)

But if you just want the minimum and maximum values of a vector, you can use the **`min()`** and **`max()`** functions

In [None]:
min(release_year)
max(release_year)

### Average of Numbers

If you want to check the average cost of movies produced in 2014, what would you do? Of course, one way is to add all the numbers together, then divide by the number of movies:

In [None]:
cost_2014 <- c(8.6, 8.5, 8.1)

# sum results in the sum of all elements in the vector
avg_cost_2014 <- sum(cost_2014)/3
avg_cost_2014

In [None]:
avg_cost_2014 <- sum(cost_2014)/length(cost_2014)
avg_cost_2014

You also can use the <b>mean</b> function to find the average of the numeric values in a vector:

In [None]:
mean_cost_2014 <- mean(cost_2014)
mean_cost_2014

### Giving Names to Values in a Vector

Suppose you want to remember which year corresponds to which movie.

With vectors, you can give names to the elements of a vector using the **names()** function:

In [None]:
#Creating a year vector
release_year <- c(1985, 1999, 2010, 2002)

#Assigning names
names(release_year) <- c("The Breakfast Club", "American Beauty", "Black Swan", "Chicago")

release_year

Now, you can retrieve the values based on the names:

In [None]:
release_year[c("American Beauty", "Chicago")]

Note that the values of the vector are still the years. We can see this in action by adding a number to the first item:

In [None]:
release_year[1] + 100 #adding 100 to the first item changes the year

And you can retrieve the names of the vector using **`names()`**

In [None]:
names(release_year)[1:3]
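Names and values can also be combined in one lookup. As a minimal sketch using the same named vector, the code below finds the titles released after 2000 and the title with the latest year (`which.max()` is a base R function that returns the position of the largest value):

```r
release_year <- c(1985, 1999, 2010, 2002)
names(release_year) <- c("The Breakfast Club", "American Beauty", "Black Swan", "Chicago")

names(release_year)[release_year > 2000]   # "Black Swan" "Chicago"
names(which.max(release_year))             # "Black Swan"
```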

### Summarizing Vectors

You can also use the **"summary"** function for simple descriptive statistics: minimum, first quartile, median, mean, third quartile, maximum:

In [None]:
summary(cost_2014)
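The result of `summary()` behaves like a named vector, so individual statistics can be pulled out by name. A short sketch, where `cost_summary` is a name invented for the example:

```r
cost_2014 <- c(8.6, 8.5, 8.1)
cost_summary <- summary(cost_2014)

names(cost_summary)        # "Min." "1st Qu." "Median" "Mean" "3rd Qu." "Max."
cost_summary[["Median"]]   # 8.5
```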

### Using Logical Operations on Vectors

A vector can also consist of **`TRUE`** and **`FALSE`**, which are special **logical values** in R. These boolean values are used to indicate whether a condition is true or false.  

Let's check whether a movie year of 1997 is more recent than (**greater in value than**) 2000.

In [None]:
movie_year <- 1997
movie_year > 2000

You can also make a logical comparison across multiple items in a vector. Which movie release years here are "greater" than 2014?

In [1]:
movies_years <- c(1998, 2010, 2016)
movies_years > 2014

We can also check for **equivalence**, using **`==`**. Let's check which movie year is equal to 2015.

In [2]:
movies_years == 2015 # is equal to 2015?

If you want to check which ones are **not equal** to 2015, you can use **`!=`**

In [3]:
movies_years != 2015

<hr></hr>
<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<h4> [Tip] Logical Operators in R </h4>
<p></p>
You can do a variety of logical operations in R including:  
<li> Checking equivalence: **1 == 2** </li>
<li> Checking non-equivalence: **TRUE != FALSE** </li>
<li> Greater than: **100 > 1** </li>
<li> Greater than or equal to: **100 >= 1** </li>
<li> Less than: **1 < 2** </li>
<li> Less than or equal to: **1 <= 2** </li>
</div>
<hr></hr>
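Comparisons can be combined. The operators `&` (and), `|` (or) and `!` (not) are not listed in the tip above, so treat this as a supplementary sketch; the result names are made up for the example.

```r
movies_years <- c(1998, 2010, 2016)

between   <- movies_years > 2000 & movies_years < 2015   # both conditions hold
outside   <- movies_years < 2000 | movies_years > 2015   # at least one holds
not_recent <- !(movies_years > 2014)                     # negation

between      # FALSE  TRUE FALSE
outside      #  TRUE FALSE  TRUE
not_recent   #  TRUE  TRUE FALSE
```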

<a id="ref3"></a>
<center><h2>Subsetting Vectors</h2></center>

What if you wanted to retrieve the second year from the following **vector of movie years**?

In [None]:
movie_years <- c(1985, 1999, 2002, 2010, 2012)
movie_years

To retrieve the **second year**, you can use square brackets **`[]`**:

In [None]:
movie_years[2] #second item

To retrieve the **third year**, you can use:

In [None]:
movie_years[3]

And if you want to retrieve **multiple items**, you can pass in a vector:

In [None]:
movie_years[c(1,3)] #first and third items

**Retrieving a vector without some of its items**

To retrieve a vector without an item, you can use negative indexing. For example, the following returns a vector slice **without the first item**.

In [None]:
titles <- c("Black Swan", "Jumanji", "City of God", "Toy Story", "Casino")
titles[-1]

You can save the new vector using a variable:

In [None]:
new_titles <- titles[-1] #removes "Black Swan", the first item
new_titles
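Negative indexing also accepts a vector of positions. A brief sketch with the same titles, dropping the first two items and then the last item:

```r
titles <- c("Black Swan", "Jumanji", "City of God", "Toy Story", "Casino")

titles[-c(1, 2)]          # "City of God" "Toy Story" "Casino"
titles[-length(titles)]   # everything except the last item
```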

**Missing Values (NA)**

Sometimes values in a vector are missing and you have to show them using NA, which is a special value in R for "Not Available". For example, if you don't know the age restriction for some movies, you can use NA.

In [None]:
age_restric <- c(14, 12, 10, NA, 18, NA)
age_restric


<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<h4> [Tip] Checking NA in R </h4>
<p></p>
You can check if a value is NA by using the **is.na()** function, which returns TRUE or FALSE. 
<li> Check if NA: **is.na(NA)** </li>
<li> Check if not NA: **!is.na(2)** </li>
</div>
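Applied to a whole vector, `is.na()` returns one TRUE or FALSE per item. The sketch below uses the `age_restric` vector from above and also shows the `na.rm` argument of `mean()`, which tells R to skip missing values:

```r
age_restric <- c(14, 12, 10, NA, 18, NA)

is.na(age_restric)                 # FALSE FALSE FALSE TRUE FALSE TRUE
sum(is.na(age_restric))            # 2 missing values
mean(age_restric)                  # NA, because some values are missing
mean(age_restric, na.rm = TRUE)    # 13.5, computed from the known values
age_restric[!is.na(age_restric)]   # keep only the known values
```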


### Subsetting vectors based on a logical condition

What if we want to know which movies were released after the year 2000? We can simply apply a logical comparison across all the items in a vector:

In [None]:
release_year > 2000

To retrieve the actual movie years after year 2000, you can simply subset the vector using the logical vector within **square brackets "[]"**:

In [None]:
release_year[release_year > 2000] #returns a vector of the elements for which the condition is TRUE

As you may notice, subsetting vectors in R works by retrieving items that were TRUE for the provided condition. For example, `year[year > 2000]` can be verbally explained as: _"From the vector `year`, return only values where the values are TRUE for `year > 2000`"_.

You can even manually write out TRUE or T for the values you want to subset:

In [None]:
release_year
release_year[c(T, F, F, F)] #returns the values that are TRUE

<a id="ref4"></a>
<center><h2>Factors</h2></center>

Factors are variables in R which take on a limited number of different values; such variables are often referred to as **categorical variables**. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values. For example, the height of a tree is a continuous variable, but the titles of books would be a categorical variable.

One of the most important uses of factors is in statistical modeling; since categorical variables enter into statistical models differently than continuous variables, storing data as factors ensures that the modeling functions will treat such data correctly. 


Let's start with a _**vector**_ of genres:

In [None]:
genre_vector <- c("Comedy", "Animation", "Crime", "Comedy", "Animation")
genre_vector

As you may have noticed, you can theoretically group the items above into three categories of genres: _Animation_, _Comedy_ and _Crime_. In R-terms, we call these categories **"factor levels"**.

The function **as.factor()** (like the more general **factor()**) converts a vector into a factor, and creates a factor level for each unique element.

In [None]:
genre_factor <- as.factor(genre_vector)
levels(genre_factor)
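Under the hood, a factor stores each item as an integer code that points to one of its levels. This sketch makes that visible with `nlevels()` and `as.integer()`, using the same genre vector:

```r
genre_vector <- c("Comedy", "Animation", "Crime", "Comedy", "Animation")
genre_factor <- factor(genre_vector)

levels(genre_factor)       # "Animation" "Comedy" "Crime" (sorted alphabetically)
nlevels(genre_factor)      # 3
as.integer(genre_factor)   # 2 1 3 2 1: the position of each item's level
```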

### Summarizing Factors

When you have a large vector, it becomes difficult to identify which levels are most common (e.g., "How many 'Comedy' movies are there?").

To answer this, we can use **summary()**, which produces a **frequency table**, as a named vector.

In [None]:
summary(genre_factor)

And recall that you can sort the values of the table using **sort()**.

In [None]:
sort(summary(genre_factor)) #sorts values by ascending order
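A similar frequency count can be produced with the base function `table()`, which also works directly on a character vector. This is an alternative sketch, not a replacement for `summary()`; `genre_counts` is a name chosen for the example.

```r
genre_vector <- c("Comedy", "Animation", "Crime", "Comedy", "Animation")
genre_counts <- table(genre_vector)

sort(genre_counts, decreasing = TRUE)   # most common genres first

# Which genres occur most often?
names(genre_counts)[genre_counts == max(genre_counts)]   # "Animation" "Comedy"
```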

### Ordered factors

There are two types of categorical variables: a **nominal categorical variable** and an **ordinal categorical variable**.

A **nominal variable** is a categorical variable for names, without an implied order. This means that it is impossible to say that 'one is better or larger than the other'. For example, consider **movie genre** with the categories _Comedy_, _Animation_ and _Crime_. Here, there is no implicit order of low-to-high or high-to-low between the categories. 

In contrast, **ordinal variables** do have a natural ordering. Consider for example, **movie length** with the categories: _Very short_, _Short_ , _Medium_, _Long_, _Very long_. Here it is obvious that _Medium_ stands above _Short_, and _Long_ stands above _Medium_.

In [None]:
movie_length <- c("Very Short", "Short", "Medium","Short", "Long",
                        "Very Short", "Very Long")
movie_length

__`movie_length`__ should be converted to an ordinal factor since its categories have a natural ordering. By default, the function <b>factor()</b> transforms `movie_length` into an unordered factor. 

To create an **ordered factor**, you have to add two additional arguments: `ordered` and `levels`. 
- `ordered`: When set to `TRUE` in `factor()`, you indicate that the factor is ordered. 
- `levels`: In this argument in `factor()`, you give the values of the factor in the correct order.

In [None]:
movie_length_ordered <- factor(movie_length, ordered = TRUE , 
                                 levels = c("Very Short" , "Short" , "Medium", 
                                            "Long","Very Long"))
movie_length_ordered

Now, let's look at the summary of the ordered factor, <b>movie_length_ordered</b>:

In [None]:
summary(movie_length_ordered)
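One practical benefit of an ordered factor is that comparisons respect the order of the levels. A short sketch, rebuilding the same ordered factor so the example runs on its own:

```r
movie_length <- c("Very Short", "Short", "Medium", "Short", "Long",
                  "Very Short", "Very Long")
movie_length_ordered <- factor(movie_length, ordered = TRUE,
                               levels = c("Very Short", "Short", "Medium",
                                          "Long", "Very Long"))

movie_length_ordered < "Medium"   # TRUE for "Very Short" and "Short" items
max(movie_length_ordered)         # Very Long
```

With an unordered factor, the `<` comparison would not be meaningful and R would return `NA` with a warning.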

<hr>

<h1 align=center>What is an Array?</h1>
<br>
An array is a structure that holds values grouped together, like a 2 x 2 table of 2 rows and 2 columns. Arrays can also be **multidimensional**, such as a 2 x 2 x 2 array.

#### What is the difference between an array and a vector?

Vectors are always one-dimensional, like a single row of data. On the other hand, an array can be multidimensional (for example, stored as rows and columns). The "dimensions" indicate how many rows, columns and further layers of data there are.

#### Let's create a 4 x 3 array (4 rows, 3 columns)
The example below is a vector of 9 movie names, hence the data type is the same for all the elements.

In [None]:
# let's first create a vector of nine movies
movie_vector <- c("Akira", "Toy Story", "Room", "The Wave", "Whiplash",
                  "Star Wars", "The Ring", "The Artist", "Jumanji")
movie_vector

To create an array, we can use the **array()** function.

In [None]:
movie_array <- array(movie_vector, dim = c(4,3))
movie_array

Note that **arrays are created column-wise**. Did you also notice that there were only 9 movie names, but the array was 4 x 3? The original **vector doesn't have enough elements** to fill the entire array (which should have 4 x 3 = 12 elements). So R simply fills the rest of the empty values by going back to the beginning of the vector and starting again ("Akira", "Toy Story", "Room" in this case).

We also needed to provide **`c(4,3)`** as a second _argument_ to specify the number of rows (4) and columns (3) that we wanted.
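The same `dim` argument is how you build an array with more than two dimensions. As a minimal sketch with plain numbers (the name `cube` is made up for the example), here is the 2 x 2 x 2 case mentioned earlier:

```r
cube <- array(1:8, dim = c(2, 2, 2))   # 2 rows, 2 columns, 2 layers

dim(cube)       # 2 2 2
cube[, , 1]     # the first 2 x 2 layer, holding the values 1 to 4
cube[1, 2, 2]   # row 1, column 2, layer 2: the value 7
```

Because arrays are filled column-wise, the first layer receives 1 to 4 and the second layer receives 5 to 8.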

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
**[Tip] What is an "argument"? How are "arguments" different from "_parameters_"?**   
<br>
Arguments and parameters are terms you will hear constantly when talking about **functions**.  
- The _**parameters**_ are the input variables used in a function, like **dim** in the function **array()**.   
- The _**arguments**_ refer to the _values_ for those parameters that a function takes as inputs, like **c(4,3)**  
<br>
We actually don't need to write out the name of the parameter (dim) each time, as in:  
`array(movie_vector, c(4,3))`  
As long as we write the arguments out in the correct order, R can interpret the code.  

<br>
Arguments in a function may sometimes need to be of a **specific type**. For more information on each function, you can open up the help file by running the function name with a ? beforehand, as in:  
`?array`
<p></p>

</div>
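The difference between passing arguments by position and by name can be checked directly. In this sketch a shortened movie vector is used; `data` and `dim` are the first two parameters of `array()`:

```r
movie_vector <- c("Akira", "Toy Story", "Room", "The Wave")

by_position <- array(movie_vector, c(2, 2))                # order matters
by_name     <- array(dim = c(2, 2), data = movie_vector)   # order does not matter

identical(by_position, by_name)   # TRUE
```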

<h2 align=center>Array Indexing</h2>

Let's look at our array again:

In [None]:
movie_array

To access an element of an array, we should pass in **[row, column]** as the row and column number of that element.  
For example, here we retrieve **Whiplash** from row 1 and column 2:

In [None]:
movie_array[1,2] #[row, column]

To display all the elements of the first row, we should put 1 in the row and nothing in the column part. Be sure to keep in the comma after the `1`.

In [None]:
movie_array[1,]

Likewise, you can get the elements by column as below.

In [None]:
movie_array[,2]

To get the dimension of the array, **dim()** should be used.

In [None]:
dim(movie_array)

We can also do math on arrays. Let's create an array of the lengths of each of the nine movies used earlier.

In [None]:
length_vector <- c(125, 81, 118, 81, 106, 121, 95, 100, 104)
length_array <- array(length_vector, dim = c(3,3))
length_array

Let's add 5 to the array, to account for a 5-min bathroom break:

In [None]:
length_array + 5

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
**Tip**: Performing operations on objects, like adding 5 to an array, does not change the object. **To change the object, we would need to assign the new result to itself**.
</div>
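The tip can be verified with the movie lengths from above. This sketch checks one element before and after assigning the result back:

```r
length_vector <- c(125, 81, 118, 81, 106, 121, 95, 100, 104)
length_array <- array(length_vector, dim = c(3, 3))

length_array + 5      # prints the result, but does not store it
length_array[1, 1]    # still 125

length_array <- length_array + 5   # assign the result back to the object
length_array[1, 1]    # now 130
```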



<a id="ref3"></a>
<h2 align=center>Using Logical Conditions to Subset Arrays</h2>

Which movies can I finish watching in two hours? Using a logical condition, we can check which movies are less than 2 hours long.

In [None]:
mask_array <- length_array < 120 # TRUE for movies shorter than 120 minutes
mask_array

Using this array of TRUEs and FALSEs, we can subset the array of movie names:

In [None]:
x_vector <- c("Akira", "Toy Story", "Room", "The Wave", "Whiplash",
              "Star Wars", "The Ring", "The Artist", "Jumanji")
x_array <- array(x_vector, dim = c(3,3))

x_array[mask_array]

<hr>

<h1 align=center>What is a Matrix?</h1>

Matrices are a subtype of arrays. A matrix **must** have 2 dimensions, whereas arrays are more flexible and can have 1, 2 or more dimensions.  

To create a matrix out of a vector, you can use **matrix()**, which takes in an argument for the vector, an argument for the number of rows and another for the number of columns.

In [None]:
movie_matrix <- matrix(movie_vector, nrow = 3, ncol = 3)

In [None]:
movie_matrix

### Accessing elements of a matrix

As with arrays, you can use **[row, column]** to access elements of a matrix. To retrieve "Akira", you should use [1,1] as it lies in the first row and first column.

In [None]:
movie_matrix[1,1]

To get data from a certain range, the following code can help. This takes the elements from rows 2 to 3, and from columns 1 to 2.

In [None]:
movie_matrix[2:3, 1:2]
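Like arrays, matrices are filled column by column by default. If your data is laid out row by row, `matrix()` has a `byrow` argument. A small sketch with numbers, where `by_col` and `by_row` are names chosen for the example:

```r
by_col <- matrix(1:6, nrow = 2, ncol = 3)                 # default: fill columns first
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # fill rows first

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3
```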

<h2>Concatenation function</h2>

A concatenation function is used to combine two vectors into one vector that holds the values of both. In R, this is the **c()** function.<br>
Let's create a new vector of upcoming movies called upcoming_movie and add it to movie_vector to create new_vector.

In [None]:
upcoming_movie <- c("Fast and Furious 8", "xXx: Return of Xander Cage", "Suicide Squad")

In [None]:
new_vector <- c(movie_vector, upcoming_movie)

In [None]:
new_vector

<hr>

<a id="ref1"></a>
<h1 align=center>Lists</h1>

Let's take a look at lists in R. A list is an ordered collection of R objects of possibly different types, such as vectors, numbers, character strings and even other lists. You can think of a list as a container of related information that is well structured and easy to read. A list accepts items of different types, but a vector (or a matrix, which is a vector with two dimensions) doesn't. To create a list, just type __list()__ with your content inside the parentheses, separated by commas. Let's try it!

In [None]:
movie <- list("Toy Story", 1995, c("Animation", "Adventure", "Comedy"))

In the code above, the variable movie contains a list of 3 objects, which are a string, a numeric value, and a vector of strings. Easy, eh? Now let's print the content of the list. We just need to call its name.

In [None]:
movie

A list has a sequence and each element of a list has a position in that sequence, which starts from 1. If you look at our previous example, you can see that each element has its position represented by double square brackets "**[[ ]]**".

### Accessing items in a list
It is possible to retrieve only a part of a list using the _single square bracket operator_ "**[ ]**". This operator can also be used to get the element at a specific position; the result is itself a list. Take a look at the next example:

The index number 2 returns the second element of a list, if that element exists:

In [None]:
movie[2]

Or you can select a part or interval of elements of a list. In our next example we are retrieving the 2nd and 3rd elements:

In [None]:
movie[2:3]
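The difference between single and double square brackets is easiest to see by checking the class of the result. In this sketch, `[ ]` returns a smaller list, while `[[ ]]` returns the element itself:

```r
movie <- list("Toy Story", 1995, c("Animation", "Adventure", "Comedy"))

class(movie[2])     # "list": a list that contains the second element
class(movie[[2]])   # "numeric": the second element itself

movie[[2]] + 1      # 1996: arithmetic works on the element
movie[[3]][1]       # "Animation": first item of the vector stored in position 3
```

Trying `movie[2] + 1` would fail, because a list cannot be used in arithmetic.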

It looks a little confusing, but lists can also store names for their elements.

### Named lists

The following list is a named list:

In [None]:
movie <- list(name = "Toy Story",
             year = 1995,
             genre = c("Animation", "Adventure", "Comedy"))

Let me explain that: the list **movie** has some named objects within it. **name**, for example, is an object of type **character**, **year** is an object of type **numeric**, and **genre** is a vector with objects of type **character**.


Now take a look at this list. This time, it's full of information and well organized. It's clear what each element means. You can see that the elements have different types, and that's ok because it's a list.

In [None]:
movie

You can also get separated information from the list. You can use **listName\$selectorName**. The _dollar-sign operator_ **$** will give you the block of data that is related to selectorName.

Let's get the genre part of our movie list, for example.

In [None]:
movie$genre

Another way of selecting the genre element (this returns a list that contains it):

In [None]:
movie["genre"]

You can also use numerical selectors like an array. Here we are selecting elements from 2 to 3.

In [None]:
movie[2:3]

The function __class()__ returns the type of an object. You can use that function to retrieve the type of specific elements of a list:

In [None]:
class(movie$name)
class(movie$year)

### Adding, modifying, and removing items

Adding a new element is also very easy. The code below adds a new field named **age** and puts the numerical value 5 into it. In this case we use the double square brackets operator, because we are directly referencing a list member (and we want to change its content).

In [None]:
movie[["age"]] <- 5
movie

In order to modify, you just need to reference a list member that already exists, then change its content.

In [None]:
movie[["age"]] <- 6
# Now it's 6, not 5
movie

And removing is also easy! You just assign **_NULL_**, which represents the absence of a value, to it.

In [None]:
movie[["age"]] <- NULL
movie

### Concatenating lists

Concatenation is the process of putting things together, in sequence. And yes, you can do it with lists. Just call the function **_c()_**. Take a look at the next example:

In [None]:
# We split our previous list in two sublists
movie_part1 <- list(name = "Toy Story")
movie_part2 <- list(year = 1995, genre = c("Animation", "Adventure", "Comedy"))

# Now we call the function c() to put everything together again
movie_concatenated <- c(movie_part1, movie_part2)

# Check it out
movie_concatenated

Lists are really handy for organizing different types of elements in R, and also easy to use. Additionally, lists are also important since this type of data structure is essential to create data frames, our next covered topic.

<hr>

<a id="dataframes"></a>
<h1>DataFrames</h1>

A DataFrame is a structure that is used for storing data tables. Underneath it all, a data frame is a list of vectors of the same length, exactly like a table (each vector is a column). We call the function __data.frame()__ to create a data frame and pass vectors, which become our columns, as arguments. Each column should be given a name.

In [None]:
movies <- data.frame(name = c("Toy Story", "Akira", "The Breakfast Club", "The Artist",
                              "Modern Times", "Fight Club", "City of God", "The Untouchables"),
                    year = c(1995, 1998, 1985, 2011, 1936, 1999, 2002, 1987),
                    stringsAsFactors = FALSE)

Let's print the content of our recently created data frame:

In [None]:
movies

It's very easy! You can note how it looks like a table.

We can also use the __"$"__ selector to get some type of information. This operator returns the content of a specific column of a data frame (that's why we have to choose a name for each column).

In [None]:
movies$name

You retrieve data using numeric indexing, like in lists:

In [None]:
# This returns the first (1st) column
movies[1]

The function called __str()__ is one of the most useful functions in R. With this function you can obtain textual information about an object. In this case, it delivers information about the objects within a data frame. Let's see what it returns:

In [None]:
str(movies)

It shows that this data frame has 8 observations of 2 columns, called __name__ and __year__. The "name" column is a character column (because the data frame was created with `stringsAsFactors` turned off) and "year" is a numerical column. 

The class() function works for data frames as well. You can use it to determine the type of a column of a data frame.

In [None]:
class(movies$year)
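Besides `str()` and `class()`, a few other base functions describe the shape of a data frame. The sketch below uses a shortened three-row version of the `movies` data frame so that it runs on its own:

```r
movies <- data.frame(name = c("Toy Story", "Akira", "The Breakfast Club"),
                     year = c(1995, 1998, 1985),
                     stringsAsFactors = FALSE)

dim(movies)             # 3 2: rows, then columns
nrow(movies)            # 3
ncol(movies)            # 2
names(movies)           # "name" "year"
sapply(movies, class)   # the class of every column at once
```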

You can use numerical selectors to reach information inside the table.

In [None]:
movies[1,2] #1-Toy Story, 2-1995
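The `[row, column]` selector also accepts logical conditions and column names, just like the vector subsetting shown earlier. A minimal sketch on a shortened version of the data frame (`after_1990` is a name chosen for the example):

```r
movies <- data.frame(name = c("Toy Story", "Akira", "The Breakfast Club"),
                     year = c(1995, 1998, 1985),
                     stringsAsFactors = FALSE)

after_1990 <- movies[movies$year > 1990, ]   # rows where the condition is TRUE
after_1990

movies[movies$year > 1990, "name"]   # "Toy Story" "Akira"
movies[order(movies$year), ]         # rows sorted by year, oldest first
```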

The **_head()_** function is very useful when you have a large table and you need to take a peek at the first elements. This function returns the first 6 rows of a data frame (or the first 6 elements of a list).

In [None]:
head(movies)

Similar to the previous function, **_tail()_** returns the last 6 rows of a data frame or the last 6 elements of a list.

In [None]:
tail(movies)

Now let's try to add a new column to our data frame with the length of each movie in minutes.

In [None]:
movies['length'] <- c(81, 125, 97, 100, 87, 139, 130, 119)
movies

A new column was added to our data frame with just one line of code. We just needed to assign a vector to the data frame, and it became our new column.

Now let's try to add a new movie to our data set.

In [None]:
# Use list() rather than c(): c() would convert every value to character
movies <- rbind(movies, list(name = "Dr. Strangelove", year = 1964, length = 94))
movies

Remember, you can't add a list with more variables than the data frame, and vice-versa.

We don't need this movie anymore, so let's delete it. Here we are deleting row 9 (the row we just added) by assigning to movies the data frame without its 9th row.

In [None]:
movies <- movies[-9,]
movies

To delete a column you can just set it as **_NULL_**.

In [None]:
movies[["length"]] <- NULL
movies

That is it! You learned a lot about data frames and how easy it is to work with them. 

<hr>

Let's first download the dataset that we will use in this notebook:

In [None]:
# code to download the dataset
download.file("https://ibm.box.com/shared/static/n5ay5qadfe7e1nnsv5s01oe1x62mq51j.csv", destfile="movies-db.csv")

To begin, we can start by associating our data to a data frame. Let's call it `movies_data`.

In [None]:
movies_data <- read.csv("movies-db.csv", header=TRUE, sep=",")
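If you want to see what `read.csv()` does without downloading anything, you can feed it text directly through its `text` argument. The two rows below are a tiny stand-in built from values quoted earlier in this notebook, not the real file, and `csv_text` and `sample_data` are names made up for the sketch:

```r
# A tiny stand-in for the CSV file: a header line followed by two rows
csv_text <- "name,year,length_min
Toy Story,1995,81
Akira,1998,125"

sample_data <- read.csv(text = csv_text, header = TRUE, sep = ",",
                        stringsAsFactors = FALSE)

str(sample_data)          # 2 observations of 3 variables
sample_data$length_min    # 81 125
```

`header = TRUE` tells R that the first line contains the column names, and `sep = ","` that the values are separated by commas.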

<hr>

<a id="control"></a>
<center><h1>Control statements</h1></center>

**Control statements** are ways for a programmer to control which pieces of the program are executed at certain times. The syntax of control statements is very similar to plain English, and they closely resemble the logical decisions that we make all the time.

**Conditional statements** and **Loops** are the control statements that are able to change the execution flow. The expected execution flow is that each line and command should be executed in the order they are written. Control statements are able to change this, allowing you to skip parts of the code or to repeat blocks of code.

<hr>

<a id="ref2"></a>
<center><h2>Conditional Statements</h2></center>

We often want to check a conditional statement and then do something in response to that condition being true or false.

### If Statements

**If** statements are composed of a conditional check and a block of code that is executed if the check results in **true**. For example, assume we want to check a movie's year, and print something if it is greater than 2000:

In [None]:
movie_year = 2002

# If movie_year is greater than 2000...
if(movie_year > 2000){

    # ...we print a message saying that it is greater than 2000.
    print('Movie year is greater than 2000')

}

Notice that the code in the above block **{}** will only be executed if the check results in **true**. 

You can also add an `else` block to an `if` block -- the code in the `else` block will only be executed if the check results in **false**.

**Syntax:** 

if (condition) {
    # do something
} else {
    # do something else
}

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
**Tip**: This syntax can be spread over multiple lines for ease of creation and legibility.
</div>

Let's create a variable called **`movie_year`** and assign it the value 1997. Additionally, let's add an `if` statement to check if the value stored in **`movie_year`** is greater than 2000 or not -- if it is, then we want to output a message saying that the movie year is greater than 2000; if not, then we output a message saying that it is not greater than 2000.

In [None]:
movie_year = 1997

# If movie_year is greater than 2000...
if(movie_year > 2000){

    # ...we print a message saying that it is greater than 2000.
    print('Movie year is greater than 2000')

}else{ # If the above condition was not met (movie_year is not greater than 2000)...
    
    # ...then we print a message saying that it is not greater than 2000.
    print('Movie year is not greater than 2000') 

}

Feel free to change **`movie_year`**'s value to other values -- you'll see that the result changes based on it!

To create our conditional statements to be used with **`if`** and **`else`**, we have a few tools:

### Comparison operators
When comparing two values, you can use these operators:

<ul>
<li>equal: `==`</li>
<li>not equal: `!=`</li>
<li>greater/less than: `>` `<` </li>
<li> greater/less than or equal: `>=` `<=` </li>
</ul>

### Logical operators
Sometimes you want to check more than one condition at once. For example, you might want to check whether one condition **and** another condition are both true. Logical operators allow you to combine or modify conditions.

<ul>
<li> and: `&`</li>
<li> or: `|`</li>
<li> not: `!`</li>
</ul>

Let's try using these operators:

In [None]:
movie_year = 1997

# If movie_year is BOTH less than 2000 AND greater than 1990 -- both conditions have to be true! -- ...
if(movie_year < 2000 & movie_year > 1990 ) {
    # ...then we print this message.
    print('Movie year between 1990 and 2000') 
}

# If movie_year is EITHER greater than 2010 OR less than 2000 -- at least one of the conditions has to be true! -- ...
if(movie_year > 2010 | movie_year < 2000 ) {
    # ...then we print this message.
    print('Movie year is not between 2000 and 2010') 
}

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
**Tip**: All of these expressions return a value in Boolean (logical) format -- this format can only hold two values: `TRUE` or `FALSE`!
</div>
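
R also has the double forms `&&` and `||`. As a rough sketch of the difference: the single forms work element by element on vectors, while the double forms are meant for single values, which is what an `if` condition needs. The years below are made-up example values.

```r
movie_years <- c(1985, 1995, 2011)

# `&` compares element by element and returns one result per element.
in_nineties <- movie_years > 1990 & movie_years < 2000
in_nineties

# `&&` expects single values, which suits an `if` condition.
movie_year <- 1997
if (movie_year > 1990 && movie_year < 2000) {
    print("Movie year between 1990 and 2000")
}

# `!` flips each logical value.
!in_nineties
```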

### Subset

Sometimes, we don't want an entire dataset -- maybe in a dataset of people we want only people with age 18 and over, or in the movies dataset, maybe we want only movies that were created after a certain year. This means we want a **subset** of the dataset. In R, we can do this by utilizing the **`subset`** function.

Suppose we want a subset of the **`movies_data`** data frame composed of movies from a given year forward (e.g. year 2000) if a selected variable is **recent**, or from before that year if it is anything else (e.g. **old**). We can quite simply do that in R by doing this:

In [None]:
decade = 'recent'

# If the decade given is recent...
if(decade == 'recent' ){
    # Subset the dataset to include only movies from 2000 onward.
    subset(movies_data, year >= 2000)
} else { # If not...
    # Subset the dataset to include only movies before 2000.
    subset(movies_data, year < 2000)
}
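
The same kind of filtering can be written with logical indexing, and the result can be stored in a variable for later use. A minimal sketch, assuming a made-up three-row data frame (`demo_movies` is a hypothetical name):

```r
# Hypothetical stand-in data frame, for illustration only.
demo_movies <- data.frame(name = c("Toy Story", "City of God", "The Artist"),
                          year = c(1995, 2002, 2011),
                          stringsAsFactors = FALSE)

# Using subset()
recent_a <- subset(demo_movies, year >= 2000)

# Using a logical vector to pick rows (note the comma: rows first, then columns)
recent_b <- demo_movies[demo_movies$year >= 2000, ]

recent_a
recent_b
```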

<hr>

<a id="ref3"></a>
<center><h2>Loops</h2></center>

Sometimes, you might want to repeat a given operation many times. Maybe you don't even know how many times you want it to execute, but you have an idea like `once for every row in my dataset`. Repeated execution like this is handled by **loops**. In R, there are two main loop structures, **`for`** and **`while`**.

### The `for` loop
The `for` loop structure enables you to execute a code block once for every element in a given structure. For example, it would be like saying "execute this once for every row in my dataset", or "execute this once for every element in this column bigger than 10". **`for`** loops are a very useful structure that makes processing a large amount of data very simple.

Let's try to use a **`for`** loop to print all the years present in the **`year`** column in the **`movies_data`** data frame. We can do that like this:

In [None]:
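# Get the data for the "year" column in the data frame as a vector.
# (movies_data['year'] would return a one-column data frame, and the loop would run only once.)
years <- movies_data$year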

# For each value in the "years" variable...
# Note that "val" here is a variable -- it assumes the value of one of the data points in "years"!
for (val in years) {
    # ...print the year stored in "val".
    print(val)
}
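
When each iteration needs more than one column, a common pattern is to loop over row numbers instead of values. A minimal sketch, assuming a made-up two-row data frame (`demo_movies` and `lines` are hypothetical names):

```r
# Hypothetical stand-in data frame, for illustration only.
demo_movies <- data.frame(name = c("Toy Story", "Akira"),
                          year = c(1995, 1998),
                          stringsAsFactors = FALSE)

lines <- character(0)

# seq_len(nrow(...)) gives 1, 2, ..., number of rows.
for (i in seq_len(nrow(demo_movies))) {
    line <- paste(demo_movies$name[i], "was released in", demo_movies$year[i])
    print(line)
    lines <- c(lines, line)   # keep each line so it can be reused later
}
```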

### The `while` loop
As you can see, the `for` loop is useful for a controlled flow of repetition. However, what if we don't know when we want to stop the loop? What if we want to keep executing a code block until a certain threshold has been reached, or maybe when a logical expression finally results in an expected fashion?

The `while` loop exists as a tool for repeated execution based on a condition. The code block will keep being executed until the given logical condition returns `FALSE`.

Let's try using `while` to print the first five movie names of our dataset. It can be done like this:

In [None]:
# Creating a start point.
iteration = 1

# We want to repeat until we reach the sixth operation -- but not execute the sixth time.
# While iteration is less or equal to five...
while (iteration <= 5) {
    
    print(c("This is iteration number:",as.character(iteration)))
    
    # ...print the "name" column of the iteration-th row.
    print(movies_data[iteration,]$name)
    
    # And then, we increase the "iteration" value -- so that we actually reach our stopping condition
    # Be careful of infinite while loops!
    iteration = iteration + 1
}
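
A `while` loop can also be stopped early with `break`. The sketch below uses made-up ratings and stops at the first rating below 7; the condition `i <= length(ratings)` keeps the loop from running past the end of the vector.

```r
ratings <- c(8.3, 8.1, 7.9, 6.2, 8.0)  # made-up values for illustration

i <- 1
while (i <= length(ratings)) {
    if (ratings[i] < 7) {
        break               # leave the loop immediately
    }
    print(ratings[i])
    i <- i + 1
}

i   # position of the first rating below 7
```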

### Applying Functions to Vectors

One of the most common uses of loops is to **apply a given function to every element in a vector**. Any of the loop structures can do that; however, R conveniently provides a much simpler way: **vectorized operations**.

R is a very smart language when it comes to element-wise operations. For example, you can perform an operation on a whole vector by applying that operation directly to it. Let's try that out:

In [None]:
# First, we create a vector...
my_list <- c(10,12,15,19,25,33)

# ...we can try adding two to all the values in that vector.
my_list + 2

# Or maybe even raising them to the power of two (^ is R's exponent operator).
my_list ^ 2

# We can also sum two vectors element-wise!
my_list + my_list

R makes it very simple to operate over vectors -- anything you think should work will probably work. Try to mess around with vectors and see what you find out!
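
To make the connection to loops concrete, the sketch below computes the same result twice: once with an explicit `for` loop and once with a single vectorized expression (`my_vector`, `looped`, and `vectorized` are illustrative names).

```r
my_vector <- c(10, 12, 15, 19, 25, 33)

# Loop version: fill a result vector one element at a time.
looped <- numeric(length(my_vector))
for (i in seq_along(my_vector)) {
    looped[i] <- my_vector[i] + 2
}

# Vectorized version: one expression, no loop.
vectorized <- my_vector + 2

identical(looped, vectorized)
```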

This is the end of the **Loops and Conditional Execution in R** notebook. Hopefully, now you know how to manipulate the flow of your code to your needs. Thank you for reading this notebook, and good luck on your studies.

<hr>
#### Scaling R with big data

As you learn more about R, if you are interested in exploring platforms that can help you run analyses at scale, you might want to sign up for a free account on [IBM Watson Studio](http://cocl.us/dsx_rp0101en), which allows you to run analyses in R with two Spark executors for free.



Let's first download the dataset that we will use in this notebook:

In [None]:
# code to download the dataset
download.file("https://ibm.box.com/shared/static/n5ay5qadfe7e1nnsv5s01oe1x62mq51j.csv", destfile="movies-db.csv")

<hr>

<a id='ref1'></a>
<center><h2>What is a Function?</h2></center>

A function is a re-usable block of code which performs operations specified in the function.

There are two types of functions :

- **Pre-defined functions**
- **User-defined functions**

<b>Pre-defined</b> functions are those that are already defined for you, whether it's in R or within a package. For example, **`sum()`** is a pre-defined function that returns the sum of its numeric inputs.

<b>User-defined</b> functions are custom functions created and defined by the user. For example, you can create a custom function to print **Hello World**.

<h3><b>Pre-defined functions</b></h3>

There are many pre-defined functions, so let's start with the simple ones.

Using the **`mean()`** function, let's get the average of these three movie ratings:
- **Star Wars (1977)** - rating of 8.7 
- **Jumanji** - rating of 6.9
- **Back to the Future** - rating of 8.5

In [None]:
ratings <- c(8.7, 6.9, 8.5)
mean(ratings)

We can use the **`sort()`** function to sort the movie ratings in _ascending order_.

In [None]:
sort(ratings)

You can also sort by _decreasing_ order, by adding in the argument **`decreasing = TRUE`**.

In [None]:
sort(ratings, decreasing = TRUE)

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
<h4> [Tip] How do I learn more about the pre-defined functions in R? </h4>
<p></p>
We will be introducing a variety of **pre-defined functions** to you as you learn more about R. There are just too many functions, so there's no way we can teach them all in one sitting. But if you'd like to take a quick peek, here's a short reference card for some of the commonly-used pre-defined functions:   
https://cran.r-project.org/doc/contrib/Short-refcard.pdf
</div>

<h3>User-defined functions</h3>

Functions are very easy to create in R:

In [None]:
printHelloWorld <- function(){
    print("Hello World")
}
printHelloWorld()

To use it, simply run the function with **`()`** at the end:

In [None]:
printHelloWorld()

But what if you want the function to provide some **output** based on some **inputs**?

In [None]:
add <- function(x, y) {
    x + y
}
add(3, 4)

As you can see above, you can create functions with the following syntax to take in inputs (as its arguments), then provide some output.

**`f <- function(<arguments>) {  `    
  `  Do something`  
  `  Do something`  
  `  return(some_output)`  
`}  `**
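
As one concrete instance of this template, here is a sketch of a function with two intermediate steps and an explicit return (`minutesToHours` is a hypothetical helper, not part of the notebook):

```r
# Hypothetical helper: convert a movie length in minutes to hours.
minutesToHours <- function(minutes) {
    hours <- minutes / 60        # do something
    hours <- round(hours, 2)     # do something else
    return(hours)                # return the output
}

minutesToHours(139)
```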


<hr>

<a id='ref2'></a>
<center><h2>Explicitly returning outputs in user-defined functions</h2></center>

In R, the value of the last expression evaluated in the function is automatically returned as the output of the function.

#### You can also explicitly tell the function to return an output.

In [None]:
add <- function(x, y){
    return(x + y)
}
add(3, 4)

It's good practice to use the `return()` function to explicitly tell the function to return the output.

<hr>

<a id='ref3'></a>
<center><h2>Using IF/ELSE statements in functions</h2></center>

The **`return()`** function is particularly useful if you have any IF statements in the function, when you want your output to be dependent on some condition:

In [None]:
isGoodRating <- function(rating){
    #This function returns "NO" if the input value is less than 7. Otherwise it returns "YES".
    
    if(rating < 7){
        return("NO") # return NO if the movie rating is less than 7
    
    }else{
        return("YES") # otherwise return YES
    }
}

isGoodRating(6)
isGoodRating(9.5)
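
Note that `if` expects a single TRUE or FALSE, so a function written this way checks one rating at a time. For a whole vector of ratings, the built-in `ifelse()` is one option. A minimal sketch using the three ratings from earlier:

```r
ratings <- c(8.7, 6.9, 8.5)

# ifelse(test, value_if_true, value_if_false) works element by element.
verdicts <- ifelse(ratings < 7, "NO", "YES")
verdicts
```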

<hr>

<a id='ref4'></a>
<center><h2>Setting default argument values in your custom functions</h2></center>

You can set a default value for arguments in your function. For example, in the **`isGoodRating()`** function, what if we wanted to create a threshold for what we consider to be a good rating?  
  
Perhaps by default, we should set the threshold to 7:

In [None]:
isGoodRating <- function(rating, threshold = 7){
    if(rating < threshold){
        return("NO") # return NO if the movie rating is less than the threshold
    }else{
        return("YES") # otherwise return YES
    }
}

isGoodRating(6)
isGoodRating(10)

Notice how we did not have to explicitly specify the second argument (threshold), but we could specify it. Let's say we have a higher standard for movie ratings, so let's bring our threshold up to 8.5:

In [None]:
isGoodRating(8, threshold = 8.5)

Great! Now you know how to create default values. **Note that** if you know the order of the arguments, you do not need to write out the argument names, as in:

In [None]:
isGoodRating(8, 8.5) #rating = 8, threshold = 8.5
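
Conversely, when every argument is named, the order no longer matters. The sketch below repeats the function definition so that it runs on its own (`by_position` and `by_name` are illustrative variable names):

```r
isGoodRating <- function(rating, threshold = 7){
    if(rating < threshold){
        return("NO")
    }else{
        return("YES")
    }
}

by_position <- isGoodRating(8, 8.5)                     # matched by position
by_name <- isGoodRating(threshold = 8.5, rating = 8)    # matched by name, order swapped

by_position
by_name
```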

<hr>

<a id='ref5'></a>
<center><h2>Using functions within functions</h2></center>

Using functions within functions is no big deal. In fact, you've already used the **`print()`** and **`return()`** functions. So let's try making our **`isGoodRating()`** more interesting.

Let's create a function that can help us decide on which movie to watch, based on its rating. We should be able to provide the name of the movie, and it should return **NO** if the movie rating is below 7, and **YES** otherwise.

First, let's read in our movies data:

In [None]:
my_data <- read.csv("movies-db.csv")
head(my_data)

Next, do you remember how to check the value of the **average_rating** column if we specify a movie name?  
Here's how:

In [None]:
# Within my_data, the row should be where the "name" column equals "Akira"
# AND the column should be "average_rating"

akira <- my_data[my_data$name == "Akira", "average_rating"]
akira

isGoodRating(akira)

Now, let's put this all together into a function, that can take any **moviename** and return a **YES** or **NO** for whether or not we should watch it.

In [None]:
watchMovie <- function(data, moviename){
    rating <- data[data["name"] == moviename,"average_rating"]
    return(isGoodRating(rating))
}

watchMovie(my_data, "Akira")

**Make sure you take the time to understand the function above.** Notice how the function expects two inputs: `data` and `moviename`, and so when we use the function, we must also input two arguments.

*But what if we only want to watch really good movies? How do we set our rating threshold that we created earlier?*
<br>
Here's how:

In [None]:
watchMovie <- function(data, moviename, my_threshold){
    rating <- data[data$name == moviename,"average_rating"]
    return(isGoodRating(rating, threshold = my_threshold))
}

Now our watchMovie takes three inputs: **data**, **moviename** and **my_threshold**

In [None]:
watchMovie(my_data, "Akira", 7)

*What if we want to still set our default threshold to be 7?*
<br>
Here's how we can do it:

In [None]:
watchMovie <- function(data, moviename, my_threshold = 7){
    rating <- data[data[,1] == moviename,"average_rating"]
    return(isGoodRating(rating, threshold = my_threshold))
}

watchMovie(my_data,"Akira")

As you can imagine, if we assign the output to a variable, the variable will hold the value **YES**.

In [None]:
a <- watchMovie(my_data, "Akira")
a

While the **watchMovie** is easier to use, I can't tell what the movie rating actually is. How do I make it *print* what the actual movie rating is, before giving me a response? To do so, we can simply add in a **print** statement before the final line of the function.  

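We can also use the built-in **`paste()`** function to concatenate a sequence of character strings together into a single string. Note that this version of **watchMovie** no longer takes `data` as an argument; it reads directly from the `my_data` variable defined earlier.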

In [None]:
watchMovie <- function(moviename, my_threshold = 7){
    rating <- my_data[my_data[,1] == moviename,"average_rating"]

    memo <- paste("The movie rating for", moviename, "is", rating)
    print(memo)
    
    return(isGoodRating(rating, threshold = my_threshold))
}

watchMovie("Akira")

Just note that the returned output is actually the resulting value of the function:

In [None]:
x <- watchMovie("Akira")

In [None]:
print(x)
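
One case these functions do not handle is a movie name that is not in the data: the lookup then returns an empty vector, and the `if` check fails with an error that is hard to read. A possible guard is sketched below; `watchMovieSafe` and `demo_data` are hypothetical names, and the ratings are made-up values.

```r
# Hypothetical stand-in for the notebook's data; values are made up.
demo_data <- data.frame(name = c("Akira", "Jumanji"),
                        average_rating = c(8.1, 6.9),
                        stringsAsFactors = FALSE)

watchMovieSafe <- function(data, moviename, my_threshold = 7){
    rating <- data[data$name == moviename, "average_rating"]

    # No matching row: rating is an empty vector, so stop with a clear message.
    if(length(rating) == 0){
        stop(paste("Movie not found:", moviename))
    }

    if(rating[1] < my_threshold){
        return("NO")
    }else{
        return("YES")
    }
}

watchMovieSafe(demo_data, "Akira")
```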

<hr>

<a id='ref6'></a>
<center><h2>Global and local variables</h2></center>

So far, we've been creating variables within functions, but did you notice what happens to those variables outside of the function?  

Let's try to see what **memo** returns:

In [None]:
watchMovie <- function(moviename, my_threshold = 7){
    rating <- my_data[my_data[,1] == moviename,"average_rating"]
    
    memo <- paste("The movie rating for", moviename, "is", rating)
    print(memo)
    
    isGoodRating(rating, threshold = my_threshold)
}

watchMovie("Akira")

In [None]:
memo

**We got an error:**  ` object 'memo' not found`. **Why?**  

It's because all the variables we create in the function remain within the function. In technical terms, this is a **local variable**, meaning that the variable assignment does not persist outside the function. The `memo` variable only exists within the function.    

But there is a way to create **global variables** from within a function -- where you can use the global variable outside of the function. It is typically _not_ recommended that you use global variables, since it may become harder to manage your code, so this is just for your information.  

To create a **global variable**, we need to use this syntax:
> **`x <<- 1`**


Here's an example of a global variable assignment:

In [None]:
myFunction <- function(){
    y <<- 3.14
    return("Hello World")
    }
myFunction()

In [None]:
y #created only in the myFunction function
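
A common alternative to `<<-` is to return everything the caller needs, for example in a list, and let the caller decide what to keep. A minimal sketch (`makeGreeting` and `pi_approx` are hypothetical names):

```r
makeGreeting <- function(){
    pi_approx <- 3.14    # local variable: exists only while the function runs
    return(list(message = "Hello World", value = pi_approx))
}

result <- makeGreeting()
result$message
result$value
```

Here nothing is created outside the function except `result`, which the caller assigned explicitly.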

<hr>

Now, we are going to take a look at how R stores and outputs its objects. In other words, we'll find out how R handles different kinds of data.

<hr>

<a id="ref1"></a>
<center><h2>What is an Object?</h2></center>

Everything that you manipulate in R, literally every entity in R, is considered an **object**. In real life, we think of an object as something that we can hold and look at. R objects are a lot like that. For example, a vector is one kind of object in R. 

An object in R has different kinds of properties or attributes. One of the attributes in objects is called the **class** of the object. The **class** of an object is the data type of this object. For instance, the class of vector can be numeric if it's composed of numeric values or character if it's composed of string values. The various classes (data types) of objects in R are important for data science programming.

### Class
<p>The most common classes (data types) of objects in R are:</p>

<ul>
<li>numeric (real numbers)</li>
<li>character</li>
<li>integer</li>
<li>logical (TRUE/FALSE)</li>
<li>complex</li>
</ul>

If you want to know about the data type of your values, you can use the **"class()"** function and add the variables' name to it. Let's create a variable from the average rating of some movies and then find which data types they belong to:

In [None]:
movie_rating <- c(8.3, 5.2, 9.3, 8.0) # create a vector from average ratings 
movie_rating # print the variable

To check what is the data type, let's use **class()**

In [None]:
class(movie_rating) # show the variable's data type

As you see, the **class()** function shows that the data type of values in the vector is **numeric**.

<div class="alert alert-success alertsuccess" style="margin-top: 20px">
**Tip:** A vector can only contain objects of the same class. However, a list can have different data types. 
</div>

### Numeric

Decimal values are called numerics. They are the default computational data type in R. In the example below, if we assign a decimal value to a variable average_rating, then average_rating will be of numeric type.

In [None]:
average_rating <- 8.3       # assign a decimal value
average_rating           

Using **class** to check the data type results in **numeric**

In [None]:
class(average_rating)      

### Character

A character object is used to represent string values in R. Strings are simply text values. 

In [None]:
movies <-c("Toy Story", "Akira", "The Breakfast Club", "The Artist")
movies
class(movies)

If numbers and texts are combined in a vector, everything is converted to the class **character**. Let's make a vector from combined movie names and their production year, then find the data type for the vector

In [None]:
combined <- c("Toy Story", 1995, "Akira", 1998)
combined
class(combined)
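
The same kind of automatic conversion happens for other mixtures: R picks the most general type that can hold every element. A short sketch with made-up values:

```r
class(c(TRUE, 2))        # logical and numeric together become numeric
class(c(2, "two"))       # numeric and character together become character
class(c(TRUE, "yes"))    # logical and character together become character

# TRUE is stored as 1 once it is converted to numeric.
mixed <- c(TRUE, 2)
mixed
```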

When you simply enter numbers into R, they will be saved as class **numeric** by default. For example in the following vector, even though the numbers are integers, they are stored as numeric type in R:

In [None]:
movie_length <- c(80, 110, 90, 80) # create a vector from movie length
movie_length # print the variable
class(movie_length)

### Integer

An integer is a number that can be written without a fractional component. For example, 21, 4, 0, and −2048 are integers, while 9.75, 5 1⁄2, and √2 are not. In R, when you create a variable from the mentioned numbers, they are not going to be stored as integer data type. In order to get the integer class we need to convert the variable type from numeric to integer using **as.integer()** function. Let's create a vector and check if the data type is numeric.

In [None]:
age_restriction <- c(12, 10, 18, 18) # create a vector from age restriction
age_restriction # print the vector

class(age_restriction)

In [None]:
integer_vector <- as.integer(age_restriction)
class(integer_vector)
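
An integer can also be written directly by adding the suffix `L` to a number. The sketch below compares the two forms and shows that `as.integer()` drops the fractional part rather than rounding (the variable names are illustrative).

```r
whole_number <- 18       # stored as numeric
integer_literal <- 18L   # the L suffix creates an integer directly

class(whole_number)
class(integer_literal)
is.integer(integer_literal)

# as.integer() truncates toward zero; it does not round.
truncated <- as.integer(9.75)
truncated
```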

### Logical

The logical class contains TRUE/FALSE values (Boolean values). Let's create a vector with logical values and check its class: 

In [None]:
logical_vector <- c(TRUE, FALSE, FALSE, TRUE, TRUE) # creating the vector (T and F also work, but they can be overwritten)
class(logical_vector)

A logical value is often created via comparison between variables. In the below example, we will compare the length of movies **<i>Toy Story and Akira</i>**.

In [None]:
length_Akira <- 125
length_ToyStory <- 81

If we assign the result of the comparison to a variable, the variable will be FALSE if the statement is false, and TRUE if the statement is true.

In [None]:
x <- length_ToyStory > length_Akira      # is Toy Story longer than Akira? 
x              

In [None]:
x <- length_Akira > length_ToyStory # is Akira longer than Toy Story?               
x # print the logical value

The resulting variable is of type logical

In [None]:
class(x)       # print the class name of x

### Complex

A complex number is a number that can be expressed in the form a + bi, where a and b are real numbers and i is the imaginary unit.

In [None]:
z = 8 + 6i     # create a complex number 
z

In [None]:
class(z)

<hr>

<a id="ref2"></a>
<center><h2>Converting One Class to Another</h2></center>

We can convert (coerce) one data type to another if we desire. For example, we can convert objects from other data types into character values with the **"as.character()"** function. In the following example, we convert numeric value into character:

In [None]:
year <- as.character(1995) # convert a numeric value into character data type
year                    # print the value of year in character data type

As we mentioned before, in order to create an integer variable in R, we can use the **"as.integer()"** function. In the following example, even though 81 is a whole number, R saves it as numeric data type by default. So you need to convert the number to integer afterwards if necessary.

In [None]:
length_ToyStory <- 81
class(length_ToyStory)

In [None]:
length_ToyStory <- as.integer(81) 
class(length_ToyStory)       # print the class name of length_ToyStory
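
The other `as.*()` functions follow the same pattern. A short sketch with made-up values, including what happens when a conversion is not possible:

```r
as.numeric("8.3")     # text that looks like a number converts cleanly
as.character(8.3)     # numeric to text
as.logical("TRUE")    # text to logical
as.integer(TRUE)      # TRUE becomes 1

# Text that is not a number becomes NA (R also issues a warning,
# which is silenced here with suppressWarnings()).
not_a_number <- suppressWarnings(as.numeric("eight"))
is.na(not_a_number)
```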

<hr>

<a id="ref3"></a>
<center><h2>Difference between Class and Mode</h2></center>

For a simple vector, the class and mode usually tell you the same thing: the data type of the values inside the vector (character, numeric, logical, etc.). One exception is an integer vector, whose class is **integer** but whose mode is **numeric**. For other objects such as a matrix, array, data frame, or list, class and mode mean different things. 

For those objects, the **class()** function shows the type of the data structure. What does that mean? The class of a matrix will be **matrix** regardless of what data types are **inside** the matrix. The same applies to lists, arrays, and data frames.

**Mode**, on the other hand, describes what type of data is found within the object and how those values are stored. So, you need to use the **mode()** function to find the data type of the values inside a matrix (character, numeric, logical, etc.).

So, in addition to classes such as numeric, character, integer, logical, and complex, we have other classes such as matrix, array, data frame, and list.

### Matrix

Let’s create a matrix storing the genre for each movie. Then, we will find the class and mode of the created matrix to see which information we will get from them.

First, let's check the effect of class and mode on a **vector**.

In [None]:
movies <- c("Toy Story", "Akira", "The Breakfast Club", "The Artist") # creating two vectors
genre <- c("Animation/Adventure/Comedy", "Animation/Adventure/Comedy", "Comedy/Drama", "Comedy/Drama")

class(movies)
mode(movies)

As you can see above, for a vector both class and mode show the data type of the values. Now let's create a matrix from these two vectors.

In [None]:
movies_genre <- cbind(movies, genre)
movies_genre 

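Now **class()** shows that the data type is **matrix** (in R 4.0 and later, the output is `"matrix" "array"`, because a matrix is also treated as a two-dimensional array).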

In [None]:
class(movies_genre)

And **mode** shows the data type of the elements of the matrix

In [None]:
mode(movies_genre)

For the matrix, the __class()__ shows how the values are stored and shown in R, in this case, in a matrix. However, __mode()__ shows the data type of values in the matrix. In the above example we have made a matrix filled with __character__ values.

### Array

A slightly more complicated version of the **matrix** data type is the **array** data type. Like a matrix, an **array** can only hold one data type, but it can have more than two dimensions. A three-dimensional array, for example, can be thought of as a stack of matrices. In the following, we are going to create an array from the integers 1 to 12 and then compare its class and mode: 

In [None]:
sample_array <- array(1:12, dim = c(3, 2, 2)) # create an array with dimensions 3 x 2 x 2 
sample_array
class(sample_array)
mode(sample_array)

So, the array's class is **array** and its mode is **numeric**.
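
The mode is **numeric** even though the array was built from integers, because `mode()` reports integers and decimals alike as numeric. To tell them apart, `typeof()` can be used alongside `class()` and `mode()`. A minimal sketch (`int_array` and `dbl_vector` are illustrative names):

```r
int_array <- array(1:12, dim = c(3, 2, 2))
dbl_vector <- c(8.3, 5.2)

class(int_array)     # the data structure
mode(int_array)      # "numeric" for both integers and decimals
typeof(int_array)    # the underlying storage type

typeof(dbl_vector)
```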

### Data Frame

Data frames are similar to arrays but they have certain advantages over arrays. Data frames allow you to associate each row and each column with a name of your choosing and allow **each column** of the data frame to have a **different data type** if you like. Let's create a data frame from the movie names, years, and lengths:

In [None]:
Name <- c("Toy Story", "Akira", "The Breakfast Club", "The Artist")
Year <- c(1995, 1998, 1985, 2011)
Length <- c(81, 125, 97, 100)
RowNames = c("Movie 1", "Movie 2", "Movie 3", "Movie 4")

sample_DataFrame <- data.frame(Name, Year, Length, row.names=RowNames) 
sample_DataFrame

class(sample_DataFrame)

So, the class of the above table is "data.frame".

### List

The final data type that we are going to go over is **list**. Lists are similar to vectors, but they can contain multiple data types. For example:

In [None]:
sample_List = list("Star Wars", 8.7, TRUE)
sample_List

class(sample_List)

mode(sample_List)

mode(sample_List[[3]])

As you see, we have character, numeric, and logical data types in the list. The data type of the third element in the list is logical, as the "mode()" function shows us. The **mode** for the entire list is **list**; it cannot show a single type for all the elements, since they don't all have the same data type.
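
List elements can also be given names, and there is a difference between `[[ ]]`, which extracts the element itself, and `[ ]`, which returns a shorter list. A minimal sketch (`demo_list` and its element names are hypothetical):

```r
demo_list <- list(title = "Star Wars", rating = 8.7, seen = TRUE)

demo_list$rating               # access an element by name

class(demo_list[["rating"]])   # the element itself
class(demo_list["rating"])     # a list of length one

# The class of every element at once
sapply(demo_list, class)
```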

<hr>

<a id="ref4"></a>
<center><h2>Attributes</h2></center>

Objects can have one or more __attributes__ that modify how R thinks about the object. Imagine you have a bowl of pasta and cheese. If you add spice, you change it into something with a new taste. Different spices make different dishes. 

Attributes are like spice. You can change any individual attribute of an object with the **<i>attr()</i>** function. You can also use the **attributes()** function to return a list of all of the attributes currently defined for that object. 

For example, in the following code, we will create a vector from the average ratings (8.3, 8.1, 7.9, 8) and costs (30, 10.4, 1, 15) of four movies, and then we change the __dim__ attribute of the vector. R will now treat z as if it were a 4-by-2 matrix.  

In [None]:
z <- c(8.3, 8.1, 7.9, 8, 30, 10.4, 1, 15)
z
attr(z, "dim") <- c(4,2)
z

Now, we can find the class and mode of the above matrix.

In [None]:
class(z)
mode(z)
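
The sketch below follows the same vector through its attribute changes: it lists the attributes before and after setting `dim`, uses the `dim(z) <-` shortcut, and then removes the attribute again (`second_rating` is an illustrative name).

```r
z <- c(8.3, 8.1, 7.9, 8, 30, 10.4, 1, 15)

attributes(z)              # a plain vector has no attributes yet

dim(z) <- c(4, 2)          # shortcut for attr(z, "dim") <- c(4, 2)
attributes(z)              # now lists the dim attribute

second_rating <- z[2, 1]   # matrix-style indexing works: row 2, column 1
second_rating

attr(z, "dim") <- NULL     # removing the attribute turns z back into a plain vector
is.matrix(z)
```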

<hr>

<a id='ref1'></a>
<center><h2>What is debugging and error handling?</h2></center>

*What do you get when you try to add  **`"a" + 10`**? An error!*

In [None]:
"a" + 10

*And what happens to your code if an error occurs? It halts!*

In [None]:
for(i in 1:10){
    #for every number, i, in the sequence 1 through 10:
    print(i + "a")
    }

These are very simple examples, and the sources of the errors are easy to spot. But when it's embedded in a large chunk of code with many parts, it can be difficult to identify _when_, _where_, and _why_ an error has occurred. This process of identifying the source of the error and fixing it is called **debugging**.

<hr>

<a id='ref2'></a>
<center><h2>Error Catching</h2></center>

If you know an error may occur, the best way to handle the error is to **`catch`** the error while it's happening, so that it doesn't halt the script.

#### No error:

In [None]:
tryCatch(10 + 10)

#### Error:

In [None]:
tryCatch("a" + 10) #Error

<h3>Error Catching with `tryCatch`:</h3>

**`tryCatch`** first _tries_ to run the code, and if it works, it executes the code normally. **But if it results in an error**, you can define what to do instead.

In [None]:
#If the code inside tryCatch raises an error, print a message instead. Overall, no error is generated and the code continues to run successfully.

tryCatch(10 + "a", 
         error = function(e) print("Oops, something went wrong!") ) #No error

In [None]:
#If error, return "10a" without an error

x <- tryCatch(10 + "a", error = function(e) return("10a") ) #No error
x

In [None]:
tryCatch(
    for(i in 1:3){
        #for every number, i, in the sequence of 1,2,3:
        print(i + "a")
        }
    , error = function(e) print("Found error.") )
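
The handler receives the error as its argument, so the original message can be read with `conditionMessage()`. `tryCatch` also accepts a `finally` expression that runs whether or not an error occurred. A minimal sketch of both:

```r
result <- tryCatch({
    10 + "a"
}, error = function(e) {
    # e holds the error; conditionMessage() extracts its text.
    print(paste("Caught:", conditionMessage(e)))
    NA                      # value returned in place of the failed result
}, finally = {
    print("Done.")          # runs in every case
})

result
```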

<hr>

<a id='ref3'></a>
<center><h2>Warning Catching</h2></center>

Aside from **errors**, there are also **warnings**. Warnings do not halt code, but are displayed when something is perhaps not running the way a user expects.

In [None]:
as.integer("A") #Converting "A" into an integer warns the user that the value is converted to NA

If needed, you can also use **`tryCatch`** to catch the warnings as they occur, without producing the warning message:

In [None]:
tryCatch(as.integer("A"), warning = function(e) print("Warning.") )
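
Two related options are sketched below: returning a fallback value from the warning handler, and silencing the warning with `suppressWarnings()` while keeping the `NA` result. The fallback value `-1L` is an arbitrary choice made for this illustration.

```r
# Replace the result with a fallback value when a warning occurs.
value <- tryCatch(as.integer("A"),
                  warning = function(w) {
                      print(paste("Warning caught:", conditionMessage(w)))
                      -1L
                  })
value

# Keep the NA result but hide the warning message.
quiet <- suppressWarnings(as.integer("A"))
quiet
```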




<hr>

### Welcome!

By the end of this notebook, you will have learned how to **import and read data** from different file types in R.

## Table of Contents


<ul>
<li><a href="#About-the-Dataset">About the Dataset</a></li>
<li><a href="#Reading-CSV-Files">Reading CSV Files</a></li>
<li><a href="#Reading-Excel-Files">Reading Excel Files</a></li>
<li><a href="#Accessing-Rows-and-Columns">Accessing Rows and Columns from dataset</a></li>
<li><a href="#Accessing-Built-in-Datasets-in-R">Accessing Built-in Datasets in R</a></li>
</ul>
<p></p>
Estimated Time Needed: <strong>15 min</strong>

<hr>

<a id="ref0"></a>
<h2 align=center>About the Dataset</h2>

**Movies dataset**

Here we have a dataset that includes one row for each movie, with several columns for each movie characteristic:

- **name** - Name of the movie
- **year** - Year the movie was released
- **length_min** - Length of the movie (minutes)
- **genre** - Genre of the movie
- **average_rating** - Average rating on [IMDB](http://www.imdb.com/)
- **cost_millions** - Movie's production cost (millions in USD)
- **foreign** - Is the movie foreign (1) or domestic (0)?
- **age_restriction** - Age restriction for the movie
<br>


<img src = "https://ibm.box.com/shared/static/6kr8sg0n6pc40zd1xn6hjhtvy3k7cmeq.png" width = 90% align="left">

Let's learn how to **import and read data** from two common types of files used to store tabular data (data stored in a table or a spreadsheet):
- **CSV files** (.csv)
- **Excel files** (.xls or .xlsx)

To begin, we'll need to **download the data**!

<a id="ref0"></a>
<h2 align=center>Download the Data</h2>

We've made it easy for you to get the data, which we've hosted online. Simply run the code cell below (Shift + Enter) to download the data to your current folder.

In [4]:
# Download datasets

# CSV file
download.file("https://ibm.box.com/shared/static/n5ay5qadfe7e1nnsv5s01oe1x62mq51j.csv", 
              destfile="movies-db.csv")

# XLS file
download.file("https://ibm.box.com/shared/static/nx0ohd9sq0iz3p871zg8ehc1m39ibpx6.xls", 
              destfile="movies-db.xls")

**If you ran the cell above, you have now downloaded the following files to your current folder:**
> movies-db.csv  
> movies-db.xls

In [7]:
system("ls",intern=TRUE)  # we'll cover system() in later workshops, but note that it will run shell commands for you

<a id="ref1"></a>
<center><h2>Reading CSV Files</h2></center>

#### What are CSV files?

Let's read data from a CSV file. CSV (Comma Separated Values) is one of the most common formats of structured data you will find. These files contain data in a table format, where in each row, columns are separated by a delimiter -- traditionally, a comma (hence comma-separated values).   
  
Usually, the first line in a CSV file contains the column names for the table itself. CSV files are popular because you do not need a particular program to open them.

#### Reading CSV files in R

In the **`movies-db.csv`** file, the first line of text is the header (names of each of the columns), followed by rows of movie information.

To read CSV files into R, we use the core function **`read.csv`**.  

`read.csv` is easy to use. All you need is the filepath to the CSV file. Let's try loading the file using the filepath to the `movies-db.csv` file we downloaded earlier:

In [None]:
# Load the CSV table into the my_data variable.
my_data <- read.csv("movies-db.csv")
my_data

The data was loaded into the `my_data` variable. But instead of viewing all the data at once, we can use the `head` function to take a look at only the top six rows of our table, like so:

In [None]:
# Print out the first six rows of my_data
head(my_data)

Additionally, you may want to take a look at the **structure** of your newly created table. R provides us with a function that summarizes an entire table's properties, called `str`. Let's try it out.

In [None]:
# Prints out the structure of your table.
str(my_data)

When we loaded the file with the `read.csv` function, we only had to pass it one parameter -- the **path** to our desired file.

If you're using Data Scientist Workbench, it is simple to find the path to your uploaded file. In the **Recent Data** section in the sidebar on the right, you can click the arrow to the left of the filename to see extra options -- one of these commands should be **Insert Path**, which automatically copies the path to your file into Jupyter Notebooks.
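
Beyond the path, `read.csv` accepts optional arguments such as `header`, `sep`, and `stringsAsFactors`. The sketch below is self-contained: it writes a tiny semicolon-separated file to a temporary location and reads it back. The file contents (`Movie A`, `Movie B`) are made-up placeholder rows, not values from the movies dataset.

```r
# Write a small semicolon-separated file to a temporary path (placeholder data).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name;year;length_min",
             "Movie A;2001;120",
             "Movie B;1999;95"), csv_path)

# header = TRUE: the first line holds the column names
# sep = ";": columns are separated by semicolons instead of commas
# stringsAsFactors = FALSE: keep text columns as character vectors
demo <- read.csv(csv_path, header = TRUE, sep = ";", stringsAsFactors = FALSE)
str(demo)
```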

-----------------

<a id="ref2"></a>
<center><h2>Reading Excel Files</h2></center>

Reading XLS (Excel Spreadsheet) files is similar to reading CSV files, but there's one catch -- R does not have a native function to read them. Thankfully, R has an extremely large repository of user-contributed packages, called *CRAN*. From there, we can download a package that lets us read XLS files.

To download a package, we use the `install.packages` function. Once installed, you do not need to install that same library ever again, unless, of course, you uninstall it.

In [None]:
# Download and install the "readxl" library
install.packages("readxl")  # note this may take a couple of minutes to complete as there will normally be dependencies to autoinstall as well

Whenever you are going to use a library that is not native to R, you have to load it into the R environment after you install it. In other words, you need to install once only, but to use it, you must load it into R for every new session. To do so, use the `library` function, which loads up everything we can use in that library into R.

In [None]:
# Load the "readxl" library into the R environment.
library(readxl)
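
Because installing takes time, a common pattern is to install a package only when it is missing. The sketch below assumes a hypothetical helper named `is_installed`; it relies on `requireNamespace`, which returns `TRUE` or `FALSE` instead of raising an error.

```r
# is_installed is a hypothetical helper used only for illustration.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")  # "stats" ships with R, so this is TRUE

if (!is_installed("readxl")) {
  message("readxl is missing; run install.packages(\"readxl\") first.")
}
```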

Now that we have our library and its functions ready, we can move on to actually reading the file. In `readxl`, there is a function called `read_excel`, which does all the work for us. You can use it like this:

In [None]:
# Read data from the XLS file and assign the table to the my_excel_data variable.
my_excel_data <- read_excel("movies-db.xls")

Since `my_excel_data` is now a dataframe in R, much like the one we created out of the CSV file, all of the native R functions can be applied to it, like `head` and `str`.

In [None]:
# Prints out the structure of your table.
# Tells you how many rows and columns there are, and the names and type of each column.
# This should be the very same as the other table we created, as they are the same dataset.
str(my_excel_data)

Much like the `read.csv` function, `read_excel` takes as its main parameter the **path** to the desired file.

<div class="alert alert-success alertsuccess">
<b>[Tip]</b>   
A **Library** is basically a collection of different classes and functions which are used to perform some specific operations. You can install and use libraries to add more functions that are not included in base R.
For example, the **readxl** library adds functions to read data from Excel files.
<br><br>
It's important to know that there are many other libraries too which can be used for a variety of things. There are also plenty of other libraries to read Excel files -- readxl is just one of them.
</div>

-----------------

<center><h2>Accessing Rows and Columns</h2></center>

Whenever we use functions to read tabular data in R, the default method of structuring this data in the R environment is using Data Frames -- R's primary data structure. Data Frames are extremely versatile, and R presents us many options to manipulate them.

Suppose we want to access the "name" column of our dataset. We can directly reference the column name on our data frame to retrieve this data, like this:

In [None]:
# Retrieve a subset of the data frame consisting of the "name" column
my_data['name']

Another way to do this is by using the `$` notation, which returns the column as a vector:

In [None]:
# Retrieve the data for the "name" column in the data frame.
my_data$name

You can also do the same thing using **double square brackets**, to get the `name` column as a vector.

In [None]:
my_data[["name"]]

Similarly, any particular row of the dataset can also be accessed. For example, to get the first row of the dataset with all column values, we can use:

In [None]:
# Retrieve the first row of the data frame.
my_data[1,]

The first value before the comma represents the **row** of the dataset and the second value (which is blank in the above example) represents the **column** of the dataset to be retrieved. By setting the first number as 1 we say we want data from row 1. By leaving the column blank we say we want all the columns in that row.

We can specify more than one column or row by using **`c`**, the **concatenate** function. By using `c` to concatenate a list of elements, we tell R that we want these observations out of the data frame. Let's try it out.

In [None]:
# Retrieve the first row of the data frame, but only the "name" and "length_min" columns.
my_data[1, c("name","length_min")]
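
To see how these access styles differ without needing the downloaded file, here is a short sketch that uses R's built-in `women` dataset (columns `height` and `weight`) as a stand-in for `my_data`. The last part uses a logical condition in the row position, which is a common way to filter rows.

```r
col_df  <- women["height"]     # single brackets: a one-column data frame
col_vec <- women[["height"]]   # double brackets (or $): a plain vector
class(col_df)
class(col_vec)

# Rows 1 to 3, both columns
women[1:3, c("height", "weight")]

# Only the rows where height is above 70
tall <- women[women$height > 70, ]
tall
```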

-----------------

<a id="ref4"></a>
<center><h2>Accessing Built-in Datasets in R</h2></center>

R provides various built-in datasets for users to utilize for different purposes. To know which datasets are available, R provides a simple function -- `data` -- that returns all of the present datasets' names with a small description beside them. The ones in the `datasets` package are all inbuilt.

In [None]:
# Displays a list of the inbuilt datasets. Opens in a new "window".
data()

As you can see, there are many different datasets already inbuilt in the R environment. Having to go through each of them to take a look at their structure and try to find out what they represent might be very tiring. Thankfully, R has documentation present for each inbuilt dataset. You can take a look at that by using the `help` function.

For example, if we want to know more about the `women` dataset, we can use the following function:

In [None]:
# Opens up the documentation for the inbuilt "women" dataset.
help(women)

Since the datasets listed are inbuilt, you do not need to import or load them to use them. If you reference them by their name, R already has the data frame ready.

In [None]:
women

<hr>

This notebook will provide information regarding reading text files, performing various operations on Strings and saving data into various types of files like text files, CSV files, Excel files etc.

## Table of Contents


<ul>
<li><a href="#About-the-Dataset">About the Dataset</a></li>
<li><a href="#Reading-Text-Files">Reading Text Files</a></li>
<li><a href="#String-Operations">String Operations</a></li>
<li><a href="#Writing-and-Saving-to-Files">Writing and Saving to Files</a></li>
</ul>
<p></p>
Estimated Time Needed: <strong>25 min</strong>


<hr>

<a id="ref0"></a>
<h2 align=center>About the Dataset</h2>

In this module, we are going to use the **The_Artist.txt** file. This file contains text data, a summary of the movie **The Artist**, and we are going to perform various operations on this data.

This is what our data looks like.

<img src = "https://ibm.box.com/shared/static/hqojozssqxupoanevcpzv4lbym7lynwa.png" width = 90% align="left">

Let's first **download** the data into your account:

In [None]:
# Download the data file
download.file("https://ibm.box.com/shared/static/l8v8g8e6uzk7yj2j1qc8ypezbhzukphy.txt", destfile="The_Artist.txt")

<hr>

<a id="ref1"></a>
<h2 align=center>Reading Text Files</h2>

To read text files in R, we can use the built-in R function **readLines()**. This function takes the **file path** as its argument and reads the whole file.

Let's read the **The_Artist.txt** file and see what it looks like.

In [None]:
my_data <- readLines("The_Artist.txt")
my_data

<div class="alert alert-block alert-success" style="margin-top: 20px">
**Tip:** If you got an error message here, make sure that you run the code cell above first to download the dataset.</div>

So, we get a character vector with one element per line of the file, and these elements can be accessed by index, just as we access the elements of an array.

Let's check the length of the **my_data** variable.

In [None]:
length(my_data)

The length of the **my_data** variable is **5**, which means it contains 5 elements.

Similarly, we can check the size of the file by using R's **file.size()** function, which takes the **file path** as its argument and returns the number of bytes. By executing the code block below, we will get **1065** at the output, which is the size of the file **in bytes**.

In [None]:
file.size("The_Artist.txt")  # same path we downloaded the file to

Another function, **scan()**, can also be used to read **.txt** files. The difference is that **readLines()** reads a text file line by line, whereas **scan()** reads it item by item, splitting on whitespace by default, so each word becomes one element.

In the example below, **scan()** takes two arguments. The first is the **file path**. The second, named **what**, specifies the type of data to read; passing an empty string tells **scan()** to read character strings. (The separator is controlled by a different argument, **sep**, which defaults to whitespace.)

In [None]:
my_data1 <- scan("The_Artist.txt", "")
my_data1

And if we check the length of the **my_data1** variable, we get the total number of elements at the output.

In [None]:
length(my_data1)
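
The difference between the two readers is easy to see on a tiny file. The sketch below is self-contained: it writes two made-up lines to a temporary file, so it does not depend on **The_Artist.txt**.

```r
# Create a small two-line text file (placeholder content).
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line here", "second line"), txt_path)

lines <- readLines(txt_path)                     # one element per line
words <- scan(txt_path, what = "", quiet = TRUE) # one element per word

length(lines)
length(words)
file.size(txt_path)  # size of the file in bytes
```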

<hr>

<a id="ref2"></a>
<h2 align=center>String Operations</h2>

There are many string operation methods in R which can be used to manipulate the data. We are going to use some basic string operations on the data that we read before. 

<h3 style="font-size:120%">nchar()</h3>

The first function is **nchar()**, which returns the total number of characters in the given string. Let's find out how many characters there are in the first element of the **my_data** variable.

In [None]:
nchar(my_data[1])

<br>

<h3 style="font-size:120%">toupper()</h3>

Now, sometimes we need the whole string to be in upper case. To do so, R has a function called **toupper()**, which takes a string as input and returns the whole string in upper case.

In [None]:
toupper(my_data[3])

In the code block above, we convert the third element of the character vector to upper case.

<br>

<h3 style="font-size:120%">tolower()</h3>

Similarly, the **tolower()** function can be used to convert a whole string into lower case. Let's convert the same string that we converted to upper case into lower case.

In [None]:
tolower(my_data[3])

We can clearly see the difference between the outputs of the last two functions.

<br>

<h3 style="font-size:120%">chartr()</h3>

What if we want to replace certain characters in a given string? This can be done in R with the **chartr()** function, which takes three arguments: the characters we want to replace, the new characters (one for each old character, in the same order), and the string on which to perform the operation.

Let's replace the **white spaces** with the **hyphen (“-”) sign** in the first element of the **my_data** variable.

In [None]:
chartr(" ", "-", my_data[1])
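
`chartr()` translates characters one for one: each character in the first argument is replaced by the character at the same position in the second. To replace a whole substring, `gsub()` is the usual tool. The strings below are made up for illustration.

```r
translated <- chartr("ab", "xy", "banana")  # a -> x, b -> y
translated

replaced <- gsub("an", "AN", "banana")      # replaces each "an" substring
replaced
```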

<br>

<h3 style="font-size:120%">strsplit()</h3>

Previously, we learned that we can read a file word by word using the **scan()** function. But what if we want to split a given string word by word?

This can be done using the **strsplit()** function. Let's split the string at the white spaces.

In [None]:
character_list <- strsplit(my_data[1], " ")
word_list <- unlist(character_list)
word_list

In the code block above, we split the string into words. **strsplit()** returns a list with one element per input string, and each element is a character vector of the pieces. Because we passed a single string, all the words sit inside the first list element, which is awkward to work with. So we used **unlist()** to convert the list into a plain character vector in which each word is a single element.
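
The list structure makes more sense when several strings are split at once. In the sketch below, the sentences are made up, and the pattern `"\\s+"` is a regular expression meaning "one or more whitespace characters", so repeated spaces do not produce empty pieces.

```r
sentences <- c("the quick  brown fox", "jumps over")
parts <- strsplit(sentences, "\\s+")

length(parts)   # one list element per input string
parts[[1]]      # the words of the first sentence
lengths(parts)  # number of words in each sentence
```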

<br>

<h3 style="font-size:120%">sort()</h3>

Sorting is also possible in R. Let's use the **sort()** function to sort the elements of the **word_list** character vector in ascending order.

In [None]:
sorted_list <- sort(word_list)
sorted_list

<br>

<h3 style="font-size:120%">paste()</h3>

Now that we have sorted all the elements of the **word_list** character vector, let's use the **paste()** function, which concatenates strings. Here we pass it two arguments: the strings we want to concatenate and the **collapse** argument, which defines the separator placed between the words.

Here, we are going to concatenate all words of the **sorted_list** character vector into a single string.

In [None]:
paste(sorted_list, collapse = " ")
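
`paste()` has two separator arguments that are easy to mix up: `sep` goes between the separate arguments, while `collapse` joins the elements of the resulting vector into one string. A small sketch with made-up inputs:

```r
first_last <- paste("The", "Artist", sep = " ")        # joins separate arguments
joined     <- paste(c("a", "b", "c"), collapse = "-")  # collapses one vector
both       <- paste(c("x", "y"), 1:2, sep = "_", collapse = "+")

first_last
joined
both
```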

<br>

<h3 style="font-size:120%">substr()</h3>

Another function, **substr()**, is used to get a subsection of a string.

Let's take an example to understand it better. In the example below, we call **substr()** with three arguments. The first is the string from which we want the substring. The second is the starting position, and the third is the stopping position; both positions are included in the result.

In [None]:
sub_string <- substr(my_data[1], start = 4, stop = 50)
sub_string

So, we read the first element of the character vector from the 4th position to the 50th position, and we store the resulting string in the **sub_string** variable.

<br>

<h3 style="font-size:120%">trimws()</h3>

The substring that we got in the code block above has some white space at its start and end. To quickly remove it, we can use R's **trimws()** function, as shown below.

In [None]:
trimws(sub_string)

So, at the output, we get the string without any white space at either end.

<br>

<h3 style="font-size:115%">str_sub()</h3>

To read a string from the end, we use the **stringr** library (install it first with `install.packages("stringr")` if it is not already available). This library contains the **str_sub()** function, which takes the same kind of arguments as **substr()** but also accepts negative positions, which count backwards from the end of the string.

In the example below, we provide a data string with negative start and end positions, which indicates that we are reading the string from the end.

In [None]:
library(stringr)
str_sub(my_data[1], -8, -1)

So, we read the string from position -8 (the eighth character from the end) to position -1 (the last character), which gives **talkies.**, including the full stop.
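
If you prefer not to load an extra package, the same kind of result can be obtained in base R by computing the start position from `nchar()`. The helper name `last_n` and the sample sentence are hypothetical, chosen only for illustration.

```r
# last_n is a hypothetical helper: return the last n characters of a string.
last_n <- function(s, n) {
  substr(s, nchar(s) - n + 1, nchar(s))
}

ending <- last_n("silent films to talkies.", 8)
ending
```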

<hr>

<a id="ref3"></a>
<h2 align=center>Writing and Saving to Files</h2>

After reading files, we can also write data into files and save them in different file formats like **.txt, .csv, .xls (for excel files) etc**. Let's take a look at some examples.

<h3 style="font-size:115%">Exporting as Text File</h3>

Suppose we want to export a matrix or a string to a **.txt** file. To do so, we can use the **write()** function, which writes the data into a file and saves it to disk.

Let's create a matrix and try to save it into a file.

In [None]:
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3)
m

In [None]:
write(m, file = "my_text_file.txt", ncolumns = 3, sep = " ")

In the code block above, we provide the input data and the name (with its path) of the file in which we want to store it. Because we are writing a matrix, we also provide the **ncolumns** and **sep** arguments.
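
One detail worth knowing: `write()` emits the elements of a matrix in column-major order, so the lines in the file do not follow the rows of the matrix as printed in R. If you want the file to mirror the printed layout, one option is to transpose the matrix first. A self-contained sketch using a temporary file:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)  # rows are (1, 3, 5) and (2, 4, 6)

out_path <- tempfile(fileext = ".txt")
# t(m) transposes the matrix so that each row of m becomes one line in the file
write(t(m), file = out_path, ncolumns = ncol(m), sep = " ")

written <- readLines(out_path)
written
```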

Let's try to write a string from our **my_data** variable into a file named **my_text_file2.txt**.

In [None]:
write(my_data[1], file = "my_text_file2.txt", ncolumns = 1, sep = " ")

So, we take the first element of the **my_data** variable and pass it as input to the write function. This time we set the **ncolumns** argument to 1 because we want a single column for the string.

<br>

<h3 style="font-size:115%">Exporting as CSV File</h3>

Just as we exported data to text files, we can also export data to **CSV** files. To do so, we need a data frame that contains data. For this, we can use one of the built-in datasets.

Let's use R's **CO2** dataset, which contains data about carbon dioxide uptake in grass plants. Let's see what the data looks like in the **CO2** dataset.

In [None]:
head(CO2)

Now, let's export this data into a **CSV** file. We will use the **write.csv()** function, which takes a data frame as input and a **file** argument to specify the output filename.

In [None]:
write.csv(CO2, file = "my_csv.csv")

When we execute the code block above, all the data is exported to the CSV file. However, the first column of the CSV file contains the row names (here, the row numbers), which we do not want. To leave them out, set **row.names** to **FALSE** in **write.csv()**.

In [None]:
write.csv(CO2, file = "my_csv.csv", row.names = FALSE)

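Note that **write.csv()** does not let you drop the header: it ignores attempts to change **col.names** and issues a warning. To write a file without column names, use **write.table()** with **sep = ","** and **col.names = FALSE**.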
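
The sketch below writes the first rows of **CO2** twice, once with a header via `write.csv()` and once without a header via `write.table()`, and then compares the number of lines in each file. It uses temporary files so that nothing is left in your working folder.

```r
# With a header row, no row names
with_header <- tempfile(fileext = ".csv")
write.csv(head(CO2), file = with_header, row.names = FALSE)
header_names <- names(read.csv(with_header))
header_names

# Without a header row: write.table() accepts col.names = FALSE
no_header <- tempfile(fileext = ".csv")
write.table(head(CO2), file = no_header, sep = ",",
            row.names = FALSE, col.names = FALSE)

length(readLines(with_header))  # header line plus six data lines
length(readLines(no_header))    # six data lines only
```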

<br>

<h3 style="font-size:115%">Exporting as Excel File</h3>

To save data into Excel files, we have to install an external library called **xlsx**, which provides easy methods to export data into **.xlsx** files. Note that **xlsx** relies on Java, so a Java installation must be available on your machine.

Let's install this library. (This may take a minute or two)

In [None]:
install.packages("xlsx")

In [None]:
library(xlsx)
write.xlsx(CO2, file = "my_excel.xlsx", row.names = FALSE)

So, exporting data to **.xlsx** files is similar to exporting to **.csv** files; only the function name is different, and we had to install an external library to do it.

<br>

<h3 style="font-size:115%">Exporting as .RData File</h3>

In R, we can also save files in the **.RData** format, which provides a way to save and load our R objects.

Let's create some simple variables and save them into a file with the extension **.RData**.

In [None]:
var1 <- "var1"
var2 <- "var2"
var3 <- "var3"

Now, to write to an **.RData** file, we will use the **save()** function. Its **list** argument is a character vector containing the names of all the objects we want to save (in this case, three variables), the **file** argument is the name of the file in which we are going to save the data, and the **safe** argument specifies whether the saving should be performed atomically.

In [None]:
save(list = c("var1", "var2", "var3"), file = "variables.RData", safe = TRUE)

A file named **variables.RData** is created in the current working directory.
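
Saved objects can be restored with `load()`, which returns the names of the objects it read. The sketch below saves two variables to a temporary file and loads them into a separate environment (named `restored` here purely for illustration), so that existing variables are not overwritten.

```r
var1 <- "var1"
var2 <- "var2"

rdata_path <- tempfile(fileext = ".RData")
save(list = c("var1", "var2"), file = rdata_path)

# Load into a fresh environment instead of the current workspace
restored <- new.env()
loaded_names <- load(rdata_path, envir = restored)

loaded_names   # names of the objects found in the file
restored$var1  # value of one restored object
```

Calling `load(rdata_path)` without `envir` would instead place the objects directly into the current workspace.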

<hr>

<h2 align=center>Regular Expressions (Regex)</h2>

In this notebook, we will study some simple Regular Expression terms and apply them with R functions.

### Table of contents

- <p><a href="#Loading-in-Data">Loading in Data</a></p>
- <p><a href="#Regular-Expressions">Regular Expressions</a></p>
- <p><a href="#Regular-Expression-in-R">Regular Expression in R</a></p>
<p></p>
<hr>

<a id="ref9001"></a>
# Loading in Data

Let's load in a small list of emails to perform some data analysis and take a look at it.

In [None]:
email_df <- read.csv("https://ibm.box.com/shared/static/cbim8daa5vjf5rf4rlz11330lvqbu7rk.csv")
email_df

So our simple dataset contains a list of names and a list of their corresponding emails. Let's say we want to simply count the frequency of email domains. But several problems arise before we can even attempt this. If we attempt to simply count the email column, we won't end up with what we want since every email is unique. And if we split the string at the '@', we still won't have what we want since even emails with the same domains might have different regional extensions. So how can we easily extract the necessary data in a quick and easy way?

<a id="funyarinpa"></a>
# Regular Expressions

Regular Expressions are generic expressions that are used to match patterns in strings and text. We can illustrate this with a simple expression that matches an email string. But before we write it, let's look at how an email is structured:

<code>test@testing.com</code>

So, an email is composed of a string, followed by an '@' symbol, followed by another string. In R regular expressions, we can express this as:

<code>.+@.+</code>

Where:
* The '.' symbol matches with any character.
* The '+' symbol repeats the previous symbol one or more times. So, '.+' will match any non-empty string.
* The '@' symbol only matches with the '@' character.

Now, for our problem, which is extracting the domain from an email excluding the regional URL code, we need an expression that specifically matches what we want:

<code>@.+\\.</code>

Where the <code>'\\.'</code> symbol specifically matches the '.' character. The backslash is doubled because the pattern is written inside an R string, where a single backslash must itself be escaped.
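
These patterns can be tried out directly. The sketch below uses `grepl()`, which returns `TRUE` or `FALSE` for each string, and then extracts the part matched by the second pattern. The candidate strings are made up for illustration.

```r
candidates <- c("test@testing.com", "not an email", "@nodomain")

# ".+@.+" needs at least one character on each side of the "@"
looks_like_email <- grepl(".+@.+", candidates)
looks_like_email

# "@.+\\." matches from the "@" up to and including a literal "."
m <- regexpr("@.+\\.", "test@testing.com")
domain_part <- regmatches("test@testing.com", m)
domain_part
```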

<a id="imyourpoutine"></a>
# Regular Expressions in R

Now let's look at some R functions that work with regular expressions.

The `grep` function below takes a Regular Expression and a vector of strings to search through, and returns the positions of the elements that match.

In [None]:
grep("@.+",  c("test@testing.com" , "not an email", "test2@testing.com"))

Grep also has an extra parameter called 'value' that changes the output to display the strings instead of the list positions. 

In [None]:
grep("@.+",  c("test@testing.com", "not an email", "test2@testing.com"), value=TRUE)

The next function, 'gsub', is a substitution function. It takes in a Regular Expression, the string you want to swap in with the matches and a list of strings you want to perform the swap with. The code cell below updates valid emails with a new domain:

In [None]:
gsub("@.+", "@newdomain.com", c("test@testing.com", "not an email", "test2@testing.com"))

The functions below, 'regexpr' and 'regmatches', work in conjunction to extract the matches found by a regular expression specified in 'regexpr'.

In [None]:
matches <- regexpr("@.*", c("test@testing.com", "not an email", "test2@testing.com"))
regmatches(c("test@testing.com", "not an email", "test2@testing.com"), matches)

This pair of functions is actually perfect for our problem since we simply need to extract the specific information we want. So let's use them with a Regular Expression like the one we defined above and store the extracted strings in a new column in our dataframe.

In [None]:
# Find the "@domain." part of each email, then store the matches in a new column
matches <- regexpr("@.*\\.", email_df[,'Email'])
email_df[,'Domain'] <- regmatches(email_df[,'Email'], matches)

And this is the resulting dataframe:

In [None]:
email_df

Now we can finally construct the frequency table for the domains in our dataframe!

In [None]:
table(email_df[,'Domain'])
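
One caveat: `.*` is greedy, so for an address with a multi-part ending such as `.co.uk`, the pattern `@.*\\.` keeps everything up to the last dot. As an alternative sketch, `sub()` with a capture group can return only the text between the '@' and the first dot. The email addresses below are hypothetical and are not taken from `email_df`.

```r
# Hypothetical addresses used only for illustration
emails <- c("ann@gmail.com", "bob@gmail.co.uk", "cy@yahoo.com")

# ([^.]+) captures everything after "@" up to the first "."
# "\\1" replaces the whole string with just that captured group
domain <- sub("^.*@([^.]+)\\..*$", "\\1", emails)
domain

domain_counts <- table(domain)
domain_counts
```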

<hr>

### Excellent! You have just completed the R basics notebook! 

<hr>

### the beginning ...:  
I hope you found R easy to learn! There's lots more to learn about R but you're well on your way.

<hr>
Copyright &copy; [IBM Cognitive Class](https://cognitiveclass.ai). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).