# **Welcome to MXB107: Introduction to Statistical Modelling**

```
.------------------------------------.
|   __  ____  ______  _  ___ _____   |
|  |  \/  \ \/ / __ )/ |/ _ \___  |  |
|  | |\/| |\  /|  _ \| | | | | / /   |
|  | |  | |/  \| |_) | | |_| |/ /    |
|  |_|  |_/_/\_\____/|_|\___//_/     |
'------------------------------------'

```


In this unit, you will be introduced to the R programming language through Google Colab, with a focus on its applications in statistical modelling. There's no need to install R or [RStudio](https://posit.co/download/rstudio-desktop/) on your computer: an internet connection and a web browser are all you need.

## **About Google Colab**

Google Colab is a free, cloud-based platform that allows you to write and run code directly from your browser, with no setup required. Originally designed for Python, it also supports other languages like R. Colab provides access to powerful computing resources, including GPUs, making it ideal for data analysis, machine learning, and interactive coding tutorials. Its integration with Google Drive ensures that your notebooks are automatically saved and easily shareable, making collaboration seamless and efficient.

## **Switching to the R Kernel in Google Colab**

By default, Google Colab uses Python as its programming language. To use R instead, you’ll need to manually switch the kernel by going to **Runtime > Change runtime type**, and selecting R as the kernel. This allows you to run R code in the Colab environment.

However, our notebook is already configured to use R by default. Unless something goes wrong, you shouldn’t need to change the runtime type manually.

### **Executing your first R code cell**

**To run a code cell**:

- Click the ▶️ button to the left of the cell,  
- Or press **Ctrl + Enter** (on Windows),  
- Or press **Cmd + Return** (on macOS).


The code cell below produces output only if the R kernel is active; otherwise, it raises an error.

In [1]:
if (!require("cowsay")) install.packages("cowsay"); library("cowsay")
say("Welcome to MXB107!", by = "pig")

Loading required package: cowsay

“there is no package called ‘cowsay’”
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)




 ____________________ 
< Welcome to MXB107! >
 -------------------- 
     \
      \

       _//| .-~~~-.
     _/oo  }       }-@
    ('')_  }       |
     `--'| { }--{  }
          //_/  /_/ [nosig]

### **Creating your first R code cell**

To create a new code cell, click the `+ Code` button at the top-left of the notebook. To insert a code cell immediately below an existing one, first click inside that code cell.



#### **Exercise**

Create a new code cell below this, enter `say("Your first code cell!", by = "dog")`, and execute it.

<details>
<summary>▶️ Click to show the solution</summary>

```r
say("Your first code cell!", by = "dog")
```

</details>

## **Introduction to the R Programming Language**

## **Atomic Data Types in R**


Atomic types are the simplest building blocks in R that store individual values such as numbers, text, or logicals. These atomic types form the foundation for creating more complex data structures such as vectors, lists, and data frames.

Think of atomic types like the individual cells in an Excel spreadsheet, each holding a single piece of data. Just as cells combine to create sheets and workbooks, atomic types combine in R to build complex datasets.

### **Double**

Double (double-precision floating-point) values represent real numbers (with or without decimals) and are one of the most common atomic types in R. For example, typing `0.1` creates a double.

In [2]:
0.1
typeof(0.1)

#### **Is This Exactly 0.1**?

Not necessarily (see [IEEE 754 - Wikipedia](https://en.wikipedia.org/wiki/IEEE_754)). There are uncountably many real numbers, but a computer has only finite memory, so most real numbers, including 0.1, are stored with a small representation error. These errors are usually negligible, but they can propagate and affect the accuracy of your results if not handled correctly (beyond the scope of this unit).
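
One way to see the stored value is to print more digits than R shows by default. The following is a small sketch using base R's `sprintf()` and `format()`; the variable names are illustrative only.

```r
# Print 0.1 with 20 decimal places to expose the stored binary approximation
stored <- sprintf("%.20f", 0.1)
stored

# The default formatting rounds to 7 significant digits, which hides the error
default_view <- format(0.1)
default_view
```

The first result should show non-zero digits far to the right of the decimal point, while the second shows just `0.1`.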

#### **Is `0.1 + 0.2` equal to `0.3`**?

In [3]:
0.1 + 0.2 == 0.3

**FYI: What about their numerical representations in our computers?**

In [4]:
writeBin(0.1 + 0.2, raw(), size = 8)  # 8 bytes = 64 bits
writeBin(0.3, raw(), size = 8)

[1] 34 33 33 33 33 33 d3 3f

[1] 33 33 33 33 33 33 d3 3f

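They differ only in the least significant byte, `34` versus `33`, which is printed first because the bytes are shown in little-endian order. The two values are adjacent doubles: this is the smallest possible difference between two doubles of this size.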

**`all.equal()` to the Rescue: Safer Number Comparison in R**

In [5]:
all.equal(0.1 + 0.2, 0.3)
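
A point worth knowing when using `all.equal()` inside an `if` statement: it returns `TRUE` when the values agree within a small tolerance, but a character description (not `FALSE`) when they differ. A minimal sketch, with illustrative variable names, wraps it in `isTRUE()`:

```r
x <- 0.1 + 0.2

# TRUE when the two values agree within the default tolerance
close_enough <- isTRUE(all.equal(x, 0.3))

# When the values differ, all.equal() returns a message rather than FALSE
far_apart <- all.equal(1, 1.1)

# isTRUE() turns that message into a plain FALSE
is_equal <- isTRUE(far_apart)

close_enough
far_apart
is_equal
```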

### **Integer**

In R, integers represent whole numbers without decimal points. You can create them using the `L` suffix, like `5L`, and check their type with `typeof(5L)`.

In [6]:
5L
typeof(5L)
typeof(5)

**FYI**: Integers and doubles are stored differently in memory. Integers use a fixed number of bits to represent whole numbers exactly, while doubles (floating-point numbers) follow the IEEE 754 standard to represent real numbers approximately.
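
As a rough illustration of the fixed number of bits, the sketch below looks at the largest integer R can store and what happens just past it. The variable names are made up for this example.

```r
# Largest value an R integer can hold (32-bit signed)
max_int <- .Machine$integer.max
max_int

# Going past the limit gives NA (with a warning) instead of wrapping around
overflowed <- suppressWarnings(max_int + 1L)
overflowed

# Adding a double instead converts the result to a double, which can hold it
as_double <- max_int + 1
typeof(as_double)
```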

### **Character**

Characters in R represent text data, such as letters, words, or sentences. They are enclosed in either single or double quotes, for example, `"hello"` or `'world'`.  

This data type is common to most programming languages and is used for many basic but essential tasks, such as specifying a path to a directory (e.g., `"content/path"`), specifying a name (e.g., an R package name such as `"ggplot2"`), or storing textual information.

In [7]:
"Welcome to MXB107!"
typeof("Welcome to MXB107!")
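
A few base R functions are handy for everyday work with characters. The sketch below is only an illustration; the variable names are not used elsewhere in this notebook.

```r
# Single and double quotes create the same character value
same_value <- identical("hello", 'hello')

# nchar() counts characters; paste() joins pieces with a space by default
n_chars  <- nchar("Welcome to MXB107!")
greeting <- paste("Welcome", "to", "MXB107!")

# file.path() builds a path from its components
example_path <- file.path("content", "path")

same_value
n_chars
greeting
example_path
```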

#### **Print a `character`**

Printing characters is useful when you want to display messages during execution, such as status updates or error messages (there are dedicated functions in R for issuing warnings and errors).

In [8]:
print("Faculty of Science")

[1] "Faculty of Science"
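
The dedicated functions mentioned above are `message()`, `warning()`, and `stop()`. The sketch below shows each one; `tryCatch()` is used here only so that the example keeps running and the messages can be inspected, and the message texts are made up.

```r
# message() writes a status update without the [1] prefix
message("Loading data...")

# warning() flags a problem; here it is caught so we can look at its text
caught_warning <- tryCatch(
  warning("input looks unusual"),
  warning = function(w) conditionMessage(w)
)

# stop() raises an error and halts execution unless it is caught
caught_error <- tryCatch(
  stop("input is invalid"),
  error = function(e) conditionMessage(e)
)

caught_warning
caught_error
```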


#### **What Does \# Do**?

In R, anything following `#` on a line is treated as a comment and is not executed. It’s good practice to include comments in your code to explain what it does, making it easier to understand and maintain.

### **Logical**

Logical values in R are `TRUE` and `FALSE`. The abbreviations `T` and `F` also work, but they are ordinary variables that can be overwritten, so the full names are safer. Logical values are used for decisions, filtering, and control flow, and can be created via comparison operators or, more generally, by evaluating expressions in R.


In [9]:
T
TRUE
F
FALSE

R can generate logical results via comparison operators:

- `==` for equality and `!=` for inequality
- `<` / `<=` for less than and less than or equal to
- `>` / `>=` for greater than and greater than or equal to

In [10]:
5 < 3
"a" == "a"
"a" < "b"

More generally, R can generate logical results by evaluating expressions. For example, functions like `is.numeric()` return `TRUE` or `FALSE` depending on the input.


In [11]:
is.numeric("a")
is.numeric(1)
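
`is.numeric()` has siblings for the other atomic types covered so far. A short sketch, with illustrative variable names:

```r
int_is_numeric     <- is.numeric(1L)     # integers count as numeric
text_is_character  <- is.character("a")
flag_is_logical    <- is.logical(TRUE)
logical_is_numeric <- is.numeric(TRUE)   # logicals do not count as numeric

int_is_numeric
text_is_character
flag_is_logical
logical_is_numeric
```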

## **R as a Calculator**

R is essentially just a really powerful calculator, kind of like those you may have been required to purchase for high school maths, only better and with a much higher limit on what it can achieve.

Doing computation in R involves writing code. We can create a code cell, type directly into that and run code, and it will give us an answer like a calculator.

### **Double Operations**

Common double operators include:

- `+` for addition (e.g., `2.5 + 1.5`)
- `-` for subtraction (e.g., `5.0 - 2.0`)
- `*` for multiplication (e.g., `3.0 * 4.0`)
- `/` for division (e.g., `10.0 / 2.0`)
- `^` for exponentiation (e.g., `10.0 ^ 2.0`)

These operations follow standard arithmetic rules and return results as doubles. You can use parentheses `()` to change the order of evaluation and control operator precedence in expressions. See [R: Operator Syntax and Precedence](https://stat.ethz.ch/R-manual/R-devel/library/base/html/Syntax.html) for more details.

In [12]:
2.5 + 1.5
typeof(2.5+1.5)

2*(5 - 2)^2 + 3
typeof(2*(5 - 2)^2 + 3)
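
Operator precedence can be surprising, especially around `^` and a leading minus sign. The sketch below (variable names are illustrative) compares expressions with and without parentheses:

```r
# ^ is applied before the unary minus, so this is -(2^2)
neg_square   <- -2^2
paren_square <- (-2)^2

# * is applied before +, unless parentheses say otherwise
no_parens   <- 2 + 3 * 4
with_parens <- (2 + 3) * 4

neg_square    # -4
paren_square  # 4
no_parens     # 14
with_parens   # 20
```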

#### **Exercise**

Work out how to divide 50 by 4, then compute 14 multiplied by the sum of 4 and 6, divided by 2. What are the data types of the results?

<details>
<summary>▶️ Click to show the solution</summary>

```r
50/4
typeof(50/4)

14*(4+6)/2
typeof(14*(4+6)/2)
```

</details>

### **Integer Operations**

All double operators apply to integers as well, with some caveats.

#### **Integer Addition/Subtraction/Multiplication**

The result keeps type `"integer"` as long as both operands are integers. If either operand is a double, the result is a double.




In [14]:
1L+1L
typeof(1L+1L)
typeof(1+1)

In [15]:
1L*1L
typeof(1L*1L)
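
To see what happens when integers and doubles are mixed, here is a small sketch (the variable names are illustrative):

```r
# Both operands are integers: the result stays an integer
both_integer <- typeof(2L * 3L)

# One operand is a double: the result becomes a double
mixed_sum     <- typeof(1L + 1)
mixed_product <- typeof(2L * 3.0)

both_integer
mixed_sum
mixed_product
```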

#### **Integer division**

Unlike C/C++, R's `/` operator does not perform integer division on integers, so the result of `5L/3L` is not `1L` but the double `1.666...`, as we would want.

In [16]:
5L/3L
typeof(5L/3L)

5/3
typeof(5/3)
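
If you do want the whole-number quotient and the remainder, R provides the `%/%` and `%%` operators. A brief sketch with illustrative variable names:

```r
quotient  <- 5L %/% 3L   # integer division: how many times 3 fits into 5
remainder <- 5L %% 3L    # remainder after that division

quotient          # 1
remainder         # 2
typeof(quotient)  # "integer"
```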

#### **Exercise**



What is the output of integer 5 raised to the power of integer 3? What is the data type of the result?

<details>
<summary>▶️ Click to show the solution</summary>

```r
5L^3L
typeof(5L^3L)
```

</details>

### **Logical Operators**

Logical operators allow you to combine or modify logical values (`TRUE`/`FALSE`):

- `&` : Element-wise AND  
- `|` : Element-wise OR  
- `!` : NOT (negation)  
- `&&` : Short-circuit AND (for single values; the right-hand side is evaluated only if needed)  
- `||` : Short-circuit OR (for single values; the right-hand side is evaluated only if needed)

The difference between `&` and `&&` (and similarly `|` vs `||`) is subtle and mainly relevant when working with vectors.  We’ll explore that in more detail later.


In [18]:
T & F
!(T&F)
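
As a small preview, the sketch below contrasts the element-wise and short-circuit forms. The `stop()` calls are there only to show that the right-hand side is skipped when the left-hand side already decides the result; variable names are illustrative.

```r
# & works element by element on vectors
elementwise_and <- c(TRUE, FALSE, TRUE) & c(TRUE, TRUE, FALSE)

# && and || are used with single values and skip the right-hand side
# when the left-hand side already determines the answer
short_and <- FALSE && stop("never evaluated")
short_or  <- TRUE || stop("never evaluated")

elementwise_and
short_and
short_or
```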

#### **Exercise**

Combine `3^2 < 2^3` and `log(3) > 1.5` via the logical `AND` operator. What is the result? Verify the result explicitly in R.

<details>
<summary>▶️ Click to show the solution</summary>

```r
(3^2 < 2^3) & (log(3) > 1.5)
```

</details>

## **Working with the R File System**

Knowing your current working directory is important when working with files in R. You can check it using the `getwd()` function.

In [20]:
getwd()


On Colab, R runs in a temporary virtual Linux environment, so `getwd()` will typically return `/content`, which is the default working directory during the session.

If you need to change the working directory (for example, to access data stored in a different folder), you can use the `setwd("your/path/here")` function. Always ensure the path is correct, especially when running notebooks on different systems.
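
One possible pattern for changing directory safely is sketched below. It assumes a hypothetical folder called `demo_dir` inside R's temporary directory, so it does not touch `/content`; the variable names are illustrative.

```r
# Hypothetical folder inside R's per-session temporary directory
target <- file.path(tempdir(), "demo_dir")
if (!dir.exists(target)) dir.create(target)

# setwd() invisibly returns the previous working directory
previous <- setwd(target)
inside <- basename(getwd())
inside

# Restore the original working directory afterwards
setwd(previous)
```

Checking `dir.exists()` first avoids the error that `setwd()` raises when the path does not exist.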

#### **Exercise**

Create a new directory named `our_working_directory` via `dir.create("our_working_directory")`. It will be a sub-directory of `/content`. Then, change the working directory to the newly created sub-directory.



<details>
<summary>▶️ Click to show the solution</summary>

```r
dir.create("our_working_directory")
setwd("./our_working_directory")
```

</details>

Check the current working directory:

In [22]:
getwd()

All datasets and core functionalities needed for this unit are hosted online in the following GitHub repository:
`"https://github.com/edelweiss611428/MXB107-Notebooks/tree/main"`
We will clone (i.e., download a copy of) the repository onto our virtual computer and change the working directory to access its contents.

**Do not modify the following**:

In [23]:
setwd("/content")

# Remove `MXB107-Notebooks` if it already exists
if (dir.exists("MXB107-Notebooks")) {
  system("rm -rf MXB107-Notebooks")
}

# Clone the repository
system("git clone https://github.com/edelweiss611428/MXB107-Notebooks.git")

# Change working directory to "MXB107-Notebooks"
setwd("MXB107-Notebooks")

# Print the current working directory
getwd()

# List contents inside the directory
list.files()
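
For reference, the directory clean-up step can also be written without a shell command, using base R's `unlink()`. The sketch below is not the unit's set-up code: it works on a hypothetical folder called `scratch_repo` inside R's temporary directory.

```r
# Hypothetical folder with one file in it, created just for this demonstration
scratch <- file.path(tempdir(), "scratch_repo")
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, "placeholder.txt"))
existed_before <- dir.exists(scratch)

# recursive = TRUE is needed to delete a directory and its contents
if (dir.exists(scratch)) {
  unlink(scratch, recursive = TRUE)
}
exists_after <- dir.exists(scratch)

existed_before
exists_after
```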

## **Packages in R**

R packages add new features that aren’t included in base R. You install packages via `install.packages("pkg_name")` to access these extra tools and load them with `library("pkg_name")`.

Many popular R packages are already available on Colab, so you often don’t need to install them, saving time. Therefore, it’s best to check if a package is installed before attempting to install or load it blindly.

Here, the `MASS` package ships with standard R installations as a recommended package, so `require("MASS")` succeeds and the system won't attempt to install `MASS`.

In [24]:
if (!require("MASS")) install.packages("MASS"); library("MASS")

Loading required package: MASS
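
The same check-then-install idea can be wrapped in a small function. The sketch below is one possible version: `ensure_package()` is a hypothetical helper (not part of this unit's code), and it is demonstrated on `tools`, a package that is always installed with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then attach it
ensure_package <- function(pkg) {
  # requireNamespace() checks availability without attaching the package
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name stored in a variable
  library(pkg, character.only = TRUE)
  invisible(pkg %in% loadedNamespaces())
}

loaded <- ensure_package("tools")
loaded
```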



Most curated R packages are distributed through [CRAN](https://cran.r-project.org/web/packages/available_packages_by_name.html) and [Bioconductor](https://bioconductor.org/packages/release/bioc/), and many others are available as GitHub repositories.

For CRAN packages, it is generally sufficient to use `install.packages("pkg_name")` to install them, but installing from other repositories like GitHub can be more complicated.

#### **Exercise**

Check if the CRAN package `"e1071"` is installed. If it is, load the package; if not, install it first and then load it.

<details>
<summary>▶️ Click to show the solution</summary>

```r
if (!require("e1071")) install.packages("e1071"); library("e1071")
```

</details>

In general, you won’t need to manually install any packages in this unit. Once you have cloned the unit's GitHub repository and changed the working directory to `"MXB107-Notebooks"`, running the `"R/preConfigurated.R"` script will automatically install and load all the necessary packages for you.

**Do not modify the following**:

In [26]:
source("R/preConfigurated.R")

Loading required package: ggplot2

Loading required package: dplyr


Attaching package: ‘dplyr’


The following object is masked from ‘package:MASS’:

    select


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Loading required package: magrittr

Loading required package: e1071

“there is no package called ‘e1071’”
Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)

also installing the dependencies ‘proxy’, ‘mlbench’, ‘randomForest’, ‘SparseM’, ‘slam’




This will also load some prepared utility functions that are not available in base R.

## **Getting Help in R**

If you forget what a particular function does, or what arguments (inputs) it accepts, you can use the `help()` function or, equivalently, `?function_name`. For example, try running the following:

In [27]:
help(mean)

In [28]:
?mean

The help page for the arithmetic mean tells us that:
- It requires an R object `x`, which is a numeric vector, a logical vector, or a date, date-time, or time-interval object,
- We can optionally tell it how much of the data to trim,
- And we can choose whether or not to drop any missing values from the calculation.
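
To make these arguments concrete, here is a small sketch on a made-up vector containing one extreme value and one missing value:

```r
# Made-up data: one extreme value (100) and one missing value (NA)
x <- c(1, 2, 3, 4, 100, NA)

# A missing value makes the result NA by default
plain_mean <- mean(x)

# na.rm = TRUE drops missing values before averaging
no_na_mean <- mean(x, na.rm = TRUE)

# trim = 0.2 also drops the lowest and highest 20% of the remaining values
trimmed_mean <- mean(x, na.rm = TRUE, trim = 0.2)

plain_mean    # NA
no_na_mean    # 22
trimmed_mean  # 3
```

With five non-missing values, trimming 20% from each end removes `1` and `100`, leaving the mean of `2`, `3`, and `4`.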

One disadvantage of Colab is that it does not render R documentation files very well, as it is primarily designed for Python. However, the interface and documentation display improve significantly when using dedicated IDEs like [RStudio](https://posit.co/download/rstudio-desktop/).

## **Tests (Optional)**

**Do not modify the following**:

In [41]:

if (!require("testthat")) install.packages("testthat"); library("testthat")

test_that("Test if all packages have been installed", {
  expect_true(require("ggplot2"))
  expect_true(require("dplyr"))
  expect_true(require("magrittr"))
})

test_that("Test if all utility functions have been loaded", {
  expect_true(exists("skewness"))
})

Test passed 😀
