# Introduction to R

<img src="https://i.imgur.com/HeSzToG.png" width=400 align="center">

# Contents
1. CRISP-DM
2. What is R used for?
3. Downloading Anaconda
4. Comparison between Python & R
5. Basics of Jupyter Notebook
6. File Types
7. Data Terminology
8. Installing and adding packages
9. Reading documentation
10. Importing Data
11. Displaying Data

# 1. CRISP-DM


<img src='https://st3.ning.com/topology/rest/1.0/file/get/2808314343?profile=original' width=400 align = 'center'>

<p style="text-align:center;"><font size="2"> <i>A useful codification of the data mining process is given by the Cross Industry Standard Process for Data Mining
(CRISP-DM; Shearer, 2000)</i> </font> </p>

- <b>Business Understanding</b> <br> 
 - Framing a business problem in terms of expected value can allow us to systematically decompose it into data mining tasks. 
- <b>Data Understanding</b> <br> 
 - Understanding the strengths and limitations of the data because there is rarely an exact match with the problem <br>
 - Purchasing or obtaining data <br>
 - Identifying and understanding variables to use 
- <b>Data Preparation</b> <br> 
 - Removing missing values
 - Sorting, replacing, filtering, grouping, mapping data
 - Adding variables from other datasets 
- <b>Modeling</b> <br>
 - Machine Learning
- <b>Evaluation</b> <br> 
 - Assessing the data mining results rigorously to gain confidence that they are valid and reliable
- <b>Deployment</b> <br> 
 - Implementing the model into the business process
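
As a rough illustration of the Data Preparation step, the sketch below removes missing values, then filters, sorts, and groups using base R. It assumes the built-in `airquality` dataset, which is not used elsewhere in this notebook; the variable names are made up for the example.

```r
# airquality ships with the datasets package and contains missing values
library(datasets)

# Removing missing values: keep only complete rows
aq_clean <- na.omit(airquality)

# Filtering: keep the summer months (June to August)
aq_summer <- aq_clean[aq_clean$Month %in% 6:8, ]

# Sorting: hottest days first
aq_sorted <- aq_summer[order(-aq_summer$Temp), ]

# Grouping: mean ozone reading per month
ozone_by_month <- aggregate(Ozone ~ Month, data = aq_sorted, FUN = mean)
ozone_by_month
```

The same steps can be written with dplyr and tidyr, which are loaded later in this notebook.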

In [3]:
library(datasets)
library(help = "datasets")

In [4]:
str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


# 2. What is R used for? 

In [5]:
# Notes 
# Data cleaning 
# Modeling 
# Comparison to Python 


# 3. Downloading Anaconda

- Downloading Anaconda <br> 
https://www.anaconda.com/distribution/ <br>
- Adding R into Anaconda <br> 
https://docs.anaconda.com/anaconda/navigator/tutorials/r-lang/

# 4. Comparison between Python & R

| Parameter | R | Python | 
|------|:------|:------|
| Objective | Data Analysis and Statistical Modeling | Data Science, Web Development, Embedded Systems |
| Workability | Consists of many easy-to-use packages | Can easily perform matrix computation as well as optimization | 
| Integration | Locally run programs | Programs integrated with web apps for easy deployment | 
| Database Handling Capacity | Can struggle with large datasets | Often handles large datasets more comfortably | 
| IDE | RStudio, R GUI | Spyder, IPython, Jupyter Notebook | 
| Essential packages and libraries | ggplot2, tidyverse, caret | NumPy, pandas, SciPy, scikit-learn, TensorFlow | 

# 5. Basics of Jupyter Notebook

- <b>Restarting Kernel </b> 
 - Kernel > Restart 
- <b>Running cells </b>
 - Cell > Run All 
- <b>Code vs. Markdown </b>
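 - Markdown is a lightweight markup language that converts to HTML, so HTML tags also work inside Markdown cells 
 - Markdown cells are for formatted text and commentary; code cells are for code that runs 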

| Shortcut | Description |
|------|:------|
| Tab | Code Completion |
| a | Insert cell above |
| b | Insert cell below |
| Enter | Edit cell |
| d + d | Delete cell | 
| Shift + Enter | Run individual cell | 
| Esc + M | Change to Markdown | 
| Ctrl + Shift + ← | Highlight group of words | 
| Ctrl + Backspace | Delete word by word | 
| Ctrl + / | Make comment |

In [6]:
# Test box 

# Header 1 
## Header 2 
### Header 3 
#### Header 4 
<b> bold text </b> <br> 
<i> italicized text </i> <br> 
<br> newline <br>
- bullet point <br> 

<p> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. In ante metus dictum at tempor commodo ullamcorper a lacus. Lorem mollis aliquam ut porttitor leo a diam. Neque aliquam vestibulum morbi blandit cursus. Nunc faucibus a pellentesque sit amet porttitor. Id semper risus in hendrerit gravida rutrum quisque non tellus. Proin nibh nisl condimentum id venenatis a condimentum vitae sapien. Aliquet nec ullamcorper sit amet risus nullam eget. Augue eget arcu dictum varius. Donec ac odio tempor orci dapibus ultrices in iaculis. Diam sit amet nisl suscipit adipiscing bibendum est ultricies integer. </p> <p> Leo duis ut diam quam nulla. Tellus at urna condimentum mattis pellentesque id nibh. Proin fermentum leo vel orci. Mi sit amet mauris commodo quis imperdiet massa tincidunt nunc. Venenatis cras sed felis eget velit aliquet sagittis id consectetur. Fringilla urna porttitor rhoncus dolor purus non enim praesent. Consectetur purus ut faucibus pulvinar elementum integer enim neque. Ullamcorper malesuada proin libero nunc consequat interdum varius sit amet. Nibh praesent tristique magna sit amet purus. Egestas maecenas pharetra convallis posuere. Sed faucibus turpis in eu mi bibendum neque egestas. Pharetra diam sit amet nisl suscipit. Sed pulvinar proin gravida hendrerit. Risus at ultrices mi tempus imperdiet nulla malesuada pellentesque elit. Malesuada bibendum arcu vitae elementum curabitur. Cursus mattis molestie a iaculis at erat pellentesque. Pellentesque sit amet porttitor eget dolor morbi non. Lacus suspendisse faucibus interdum posuere lorem ipsum dolor sit. Tempor nec feugiat nisl pretium fusce id velit. Viverra justo nec ultrices dui sapien eget mi proin. </p> 

## Exercise 1: Using Markdown 

Format the text below using Markdown 

### RStudio

RStudio is an <b>integrated development environment</b> (IDE) that streamlines the R programming workflow into an easy-to-read layout. RStudio also works with useful tools (referred to as packages) for data manipulation (dplyr), cleaning (tidyr), visualization (ggplot2), report writing (rmarkdown & knitr), and publishing to the web (shiny & ggvis).

# 6. File Types

- Comma-separated values (.csv)
- R data file (.RData)
- Microsoft Excel Open XML Spreadsheet (.xlsx)
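
A minimal sketch of round-tripping the first two file types with base R follows. The data frame `scores` and the temporary file paths are made up for illustration. The `.xlsx` case is left out here because reading it needs the readxl package, which is covered later.

```r
# A tiny data frame to save (illustrative values only)
scores <- data.frame(name = c("Ann", "Ben"), score = c(90.5, 85))

# Temporary file paths so nothing is written to the working directory
csv_path   <- tempfile(fileext = ".csv")
rdata_path <- tempfile(fileext = ".RData")

# .csv: plain text, readable by almost any tool
write.csv(scores, csv_path, row.names = FALSE)
from_csv <- read.csv(csv_path)

# .RData: R's own binary format; load() restores objects under their saved names
save(scores, file = rdata_path)
rm(scores)
load(rdata_path)

# Both routes should give back the same scores
all.equal(from_csv$score, scores$score)
```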

# 7. Data Terminology

https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/

- Data > Dataset > Database 
- Vector > List > DataFrame  
- Observations & Variables
- Levels 
- Libraries / Packages 
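
The terms above can be sketched in a few lines of base R. All object names and values here are invented for the example.

```r
# Vector: every element shares one type
heights <- c(1.62, 1.75, 1.80)

# List: elements may have different types and lengths
person <- list(name = "Ann", age = 30, scores = c(90, 85))

# Data frame: rows are observations, columns are variables
df <- data.frame(
  height = heights,
  group  = factor(c("a", "b", "a"))
)

nrow(df)          # number of observations
ncol(df)          # number of variables
levels(df$group)  # distinct categories (levels) of the factor
```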

## Exercise 2: Creating vectors 

Create a vector with two elements, "A" and "B", and store it in ```grade``` 

In [7]:
# Answer here
# After part 10 then back here

# 8. Installing and adding packages

In [None]:
install.packages('something')

In [None]:
library(something)
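
A common pattern, sketched below, is to install a package only when it is missing and then load it. The variable `pkg` is a placeholder. It is set to `"stats"`, which ships with R, so the sketch runs without a network connection; swap in the package you actually need.

```r
# Placeholder package name; "stats" is always available in a standard R install
pkg <- "stats"

# Install only if the package is not already available
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE tells library() that pkg holds the name as a string
library(pkg, character.only = TRUE)
```

Installing is needed once per machine, while `library()` must be called in every new session.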

## Exercise 3: Importing packages 

Load dplyr, tidyr, and ggplot2 

In [9]:
library(dplyr)
library(tidyr)
library(ggplot2)

# 9. Reading documentation

https://www.rdocumentation.org/packages/readxl/versions/1.3.1/topics/read_excel

- Default values 
- Checking the package name 
- Importing package
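
The same information can usually be looked up from inside R. The sketch below uses `read.csv()` rather than `read_excel()` so that it runs without any extra packages; that substitution is an assumption made for the example.

```r
# Open the help page (uncomment in an interactive session)
# ?read.csv

# Default values: list the arguments and their defaults
args(read.csv)
formals(read.csv)$header   # default for the header argument
formals(read.csv)$sep      # default field separator

# Checking the package name: which package provides the function?
environmentName(environment(read.csv))
```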

In [10]:
library(readxl)

# 10. Importing Data 

| Function | What it does | Example | 
|------|:------|:------|
| read.csv() | Read a CSV file (base R) | read.csv('states.csv') |
| read_excel() | Read an Excel file (readxl package) | read_excel('filename.xlsx', sheet = 'Sheet 2') | 
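
Most import errors come from the file path, so a sketch of checking the path before reading may help. The file `states-demo.csv` and its contents are invented here so the example is self-contained.

```r
# Build a path in a platform-independent way (temporary folder used for illustration)
csv_path <- file.path(tempdir(), "states-demo.csv")

# Create a small file to read (made-up values)
writeLines(c("state,code", "Texas,TX", "Ohio,OH"), csv_path)

# Check where R is looking and whether the file exists before reading
getwd()
file.exists(csv_path)

# On Windows, write paths with forward slashes or doubled backslashes,
# e.g. "C:/data/states.csv" or "C:\\data\\states.csv"
states <- read.csv(csv_path, stringsAsFactors = FALSE)
str(states)
```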

## Exercise 4: Import csv file

Import ```loans-25k.csv``` and store in ```loans_df```

In [12]:
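# The file must be in the working directory; otherwise give its full path
# In paths, use forward slashes or doubled backslashes (a single backslash is an escape character)
# Store the result in a variable
loans_df <- read.csv('loans-25k.csv')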

Display the head of ```loans_df```

In [13]:
head(loans_df)

id,member_id,annual_inc,home_ownership,purpose,loan_amnt,int_rate,term,grade
63459002,67801839,131000,MORTGAGE,credit_card,35000,8.18,36 months,B
2224979,2637150,125000,MORTGAGE,debt_consolidation,29700,15.31,60 months,C
5979460,7451906,36070,MORTGAGE,debt_consolidation,6700,7.9,36 months,A
57126822,60829645,90000,RENT,credit_card,18000,8.18,60 months,B
1445209,1697667,90000,MORTGAGE,small_business,18000,7.62,36 months,A
55078160,58648875,50000,RENT,credit_card,10000,13.33,36 months,C


## Exercise 5: Import excel file 

In [14]:
terminology <- read_excel('excel_cheatsheet_copy.xlsx', sheet = 2)
terminology

LABEL,DESCRIPTION,COMMENTS
Regression,Used to predict data by line of best fit,
Eqn of Linear Regression Line,y = INTERCEPT + SLOPE(X),
Eqn of Exponential Regression Line,y = ae(bx),
Eqn of Logarithmic Regression Line,y = aLN(x) + b,Part of the analysis is to find the values of a and b
Eqn of Power Regression Line,y = axb,Part of the analysis is to find the values of a and b
Residual,Distance from point to regression line,
Residual Variance,Variance of residual Is also MS(residual) Is also (Standard Error of Estimate)^2,
MS,Mean Square,
ANOVA,Analysis of Variance,
Independent variable,What we are manipulating,Also known as factor


# 11. Displaying Data

Checking the data type of each column and the shape of the data frame

In [15]:
# str
str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...


glimpse() from dplyr: a more compact alternative to str()

In [16]:
# glimpse
glimpse(iris)

Observations: 150
Variables: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4,...
$ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7,...
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5,...
$ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2,...
$ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa...


Summary of dataset 

In [17]:
# summary
summary(iris)

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

Checking shape of dataframe

In [18]:
# dim
dim(iris)
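
`dim()` returns the number of rows followed by the number of columns. As a small sketch, the pieces can also be pulled out individually; `dims` is just an illustrative variable name.

```r
dims <- dim(iris)    # c(rows, columns)
dims[1]              # observations, same as nrow(iris)
dims[2]              # variables, same as ncol(iris)

names(iris)          # column names
sapply(iris, class)  # data type of each column
```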

Showing dataset

In [19]:
iris

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
4.6,3.4,1.4,0.3,setosa
5.0,3.4,1.5,0.2,setosa
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


Showing full dataset

In [20]:
# Raise the print limit so that no rows are cut off
options(max.print = 25000)
print(iris)

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
6            5.4         3.9          1.7         0.4     setosa
7            4.6         3.4          1.4         0.3     setosa
8            5.0         3.4          1.5         0.2     setosa
9            4.4         2.9          1.4         0.2     setosa
10           4.9         3.1          1.5         0.1     setosa
11           5.4         3.7          1.5         0.2     setosa
12           4.8         3.4          1.6         0.2     setosa
13           4.8         3.0          1.4         0.1     setosa
14           4.3         3.0          1.1         0.1     setosa
15           5.8         

Showing the first and last few rows of the dataset

In [21]:
# head and tail
head(iris)
tail(iris)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
145,6.7,3.3,5.7,2.5,virginica
146,6.7,3.0,5.2,2.3,virginica
147,6.3,2.5,5.0,1.9,virginica
148,6.5,3.0,5.2,2.0,virginica
149,6.2,3.4,5.4,2.3,virginica
150,5.9,3.0,5.1,1.8,virginica
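
Both functions show six rows by default. A short sketch of the `n` argument, which changes how many rows are returned, follows; the variable names are illustrative.

```r
first3 <- head(iris, n = 3)     # first 3 rows
last3  <- tail(iris, n = 3)     # last 3 rows
first5 <- head(iris, n = -145)  # all rows except the last 145

first3
last3
```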


Looking at range of data 

In [24]:
# max
max(iris$Petal.Length)
# min
min(iris$Petal.Length)
# unique
unique(iris$Species)
# n_distinct (from dplyr): number of distinct values
n_distinct(iris$Petal.Length)
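
A few base R alternatives, sketched below, do the same jobs without dplyr. The variable names are made up for the example.

```r
petal_range <- range(iris$Petal.Length)   # c(min, max) in one call
diff(petal_range)                         # spread between max and min

n_species <- length(unique(iris$Species)) # number of distinct values
species_counts <- table(iris$Species)     # number of rows per level

petal_range
n_species
species_counts
```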

## Exercise 6: Looking at loans dataframe

In [25]:
head(loans_df)

id,member_id,annual_inc,home_ownership,purpose,loan_amnt,int_rate,term,grade
63459002,67801839,131000,MORTGAGE,credit_card,35000,8.18,36 months,B
2224979,2637150,125000,MORTGAGE,debt_consolidation,29700,15.31,60 months,C
5979460,7451906,36070,MORTGAGE,debt_consolidation,6700,7.9,36 months,A
57126822,60829645,90000,RENT,credit_card,18000,8.18,60 months,B
1445209,1697667,90000,MORTGAGE,small_business,18000,7.62,36 months,A
55078160,58648875,50000,RENT,credit_card,10000,13.33,36 months,C


a. Find the range of annual income

In [26]:
# Answer here 
range(loans_df$annual_inc)

b. Find the unique labels for ```purpose``` variable

In [27]:
# Answer here 
unique(loans_df$purpose)

c. Get the summary of ```loans_df```

In [28]:
# Answer here 
summary(loans_df)

       id             member_id          annual_inc       home_ownership 
 Min.   :   57245   Min.   :   98268   Min.   :      0   ANY     :    1  
 1st Qu.: 9252315   1st Qu.:11030128   1st Qu.:  45000   MORTGAGE:12519  
 Median :34423708   Median :37086239   Median :  64000   NONE    :    2  
 Mean   :32487959   Mean   :35026396   Mean   :  74928   OTHER   :    1  
 3rd Qu.:54931016   3rd Qu.:58484786   3rd Qu.:  90000   OWN     : 2531  
 Max.   :68616060   Max.   :73518868   Max.   :2548000   RENT    : 9946  
                                                                         
               purpose        loan_amnt        int_rate             term      
 debt_consolidation:14802   Min.   :  500   Min.   : 5.32    36 months:17566  
 credit_card       : 5805   1st Qu.: 8000   1st Qu.: 9.99    60 months: 7434  
 home_improvement  : 1398   Median :12838   Median :12.99                     
 other             : 1179   Mean   :14697   Mean   :13.23                     
 major_purcha

d. Get the total number of observations and variables 

In [29]:
# Answer here 
dim(loans_df)

e. How many grades are there in total? 

In [30]:
# Answer here 
# Count the distinct grades rather than listing them
length(unique(loans_df$grade))

## [Back to top](#Contents) 