# **Multiple Choice**

In [1]:
# Which function in R is used to combine vectors by rows or columns?
# A) bind() --> answer (in base R this means the rbind()/cbind() pair; there is no plain bind())
# B) add()
# C) c()
# D) merge()
# https://www.spsanderson.com/steveondata/posts/2024-11-19/
x <- c(1, 2, 3)
y <- c(4, 5, 6)
rbind(x, y)  # Combines x and y as rows
cbind(x, y)  # Combines x and y as columns

0,1,2,3
x,1,2,3
y,4,5,6


x,y
1,4
2,5
3,6


In [2]:
x + y  # Performs element-wise addition

In [3]:
c(x, y)  # Concatenates x and y into a single vector

In [4]:
df1 <- data.frame(id = c(1, 2, 3), value = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), value2 = c("d", "e", "f"))
merge(df1, df2, by = "id")  # Merges df1 and df2 based on the "id" column

id,value,value2
<dbl>,<chr>,<chr>
2,b,d
3,c,e


In [5]:
# What is the correct way to assign the value 10 to a variable named x in R?
# A) x <- 10
# B) x = 10
# C) Both A and B --> answer
# D) None of the above

In [6]:
x <- 10
print(x)
x = 10
print(x)

[1] 10
[1] 10


In [7]:
# What is a data.frame in R?
# A) A list of vectors of the same length --> answer
# B) A collection of numeric values
# C) A single vector
# D) A multi-dimensional array

**B) A collection of numeric values**

While a `data.frame` can store numeric values, it is not limited to numbers. Its columns can hold various data types, including numeric, character, logical, and factor.

**C) A single vector**

A `data.frame` is more complex than a single vector. It's a structure that holds multiple vectors, organized into rows and columns.

**D) A multi-dimensional array**

Although a `data.frame` shares some similarities with a multi-dimensional array, it is a distinct data structure in R. An array holds a single data type, whereas each column of a `data.frame` can have its own type, which makes it well suited to tabular data.
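Option A can be checked directly: internally, a `data.frame` is a list whose elements (the columns) are vectors of equal length. The following is a minimal sketch; the `people` data frame is a hypothetical example, not part of the original quiz.

```r
# Hypothetical example data, not from the original notebook
people <- data.frame(
  name = c("Alice", "Bob", "Charlie"),
  age = c(25, 30, 28)
)

is.list(people)        # TRUE: a data frame is stored as a list
typeof(people)         # "list"
lengths(people)        # every column (vector) has length 3
sapply(people, class)  # columns may differ in type
```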

In [8]:
# Create a data frame with three columns: Name, Age, and City
valid_df <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 28),
  City = c("New York", "London", "Paris")
)

# Display the data frame
valid_df


# Attempt to create a data frame with uneven vector lengths
invalid_df <- data.frame(
  Name = c("Alice", "Bob"),
  Age = c(25, 30, 28),
  City = c("New York", "London", "Paris")
)

# This will result in an error due to unequal vector lengths

Name,Age,City
<chr>,<dbl>,<chr>
Alice,25,New York
Bob,30,London
Charlie,28,Paris


ERROR: Error in data.frame(Name = c("Alice", "Bob"), Age = c(25, 30, 28), City = c("New York", : arguments imply differing number of rows: 2, 3


In [9]:
# Which operator is used for modulo in R?
# A) *
# B) %*%
# C) %
# D) %% --> answer

In [10]:
x <- 10
y <- 3

result <- x * y

# Print the result
print(result)

[1] 30


In [11]:
x <- 10
y <- 3

result <- x %*% y
print(result)

     [,1]
[1,]   30


In [13]:
x <- 10
y <- 3

result <- x % y
print(result)

ERROR: Error in parse(text = input): <text>:4:13: unexpected input
3: 
4: result <- x % y
               ^


In [14]:
x <- 10
y <- 3

result <- x %% y  # Calculate the remainder of 10 divided by 3
print(result)

[1] 1


**Which of the following statements is false about the importance of statistical analysis?**

- A) Statistical analysis offers valuable insights into patterns, trends, and relationships within datasets.

- B) Statistical analysis helps in identifying and handling missing values, outliers, and inconsistencies in the data.

- C) It eliminates the need for additional techniques, such as machine learning, for predictive modeling.

- D) Statistical optimization techniques based on data-driven insights enhance procedures, increase efficiency, and optimize resource allocation.

In [15]:
## Answer is C

**Which term best describes the process of collecting, analyzing, and interpreting data to uncover patterns, trends, and insights, often involving mathematical techniques and probability?**

- A) Data Mining

- B) Statistical Analysis

- C) Predictive Modeling

- D) Machine Learning

In [16]:
## Answer is B

In [17]:
# What does the apply() function do in R?
# A) Splits data into subsets
# B) Aggregates data into summaries
# C) Applies a function over rows or columns of a matrix or array --> answer
# D) Combines datasets

In [18]:
# Create a sample matrix
matrix_data <- matrix(1:9, nrow = 3, ncol = 3)

# Define a function to calculate the sum of elements
sum_function <- function(x) {
  sum(x)
}

# Apply the function to the rows of the matrix (MARGIN = 1)
apply(matrix_data, MARGIN = 1, FUN = sum_function)  # 12 15 18

# Apply the function to the columns of the matrix (MARGIN = 2)
apply(matrix_data, MARGIN = 2, FUN = sum_function)  # 6 15 24

In [19]:
# Which function from the dplyr package is used to summarize data after splitting it into groups?
# A) group_by()
# B) filter()
# C) summarize() --> answer
# D) mutate()

In [20]:
library(dplyr)


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [22]:
# Sample data
data <- data.frame(
  group = c("A", "A", "B", "B", "B"),
  value = c(10, 15, 20, 25, 30)
)

# Group the data by the 'group' column
grouped_data <- data %>%
  group_by(group)

# View the grouped data (notice the grouping message)
grouped_data

group,value
<chr>,<dbl>
A,10
A,15
B,20
B,25
B,30


In [23]:
# Filter rows where 'value' is greater than 20
filtered_data <- data %>%
  filter(value > 20)

# View the filtered data
filtered_data

group,value
<chr>,<dbl>
B,25
B,30


In [24]:
# Calculate the mean 'value' for each group
summarized_data <- data %>%
  group_by(group) %>%  # Group by 'group'
  summarize(mean_value = mean(value))  # Calculate mean for each group

# View the summarized data
summarized_data

group,mean_value
<chr>,<dbl>
A,12.5
B,25.0


In [25]:
# Add a new column 'double_value'
mutated_data <- data %>%
  mutate(double_value = value * 2)

# View the mutated data
mutated_data

group,value,double_value
<chr>,<dbl>,<dbl>
A,10,20
A,15,30
B,20,40
B,25,50
B,30,60


In [26]:
# Which function in R returns the square root of a number?
# A) sqrt() --> answer
# B) log()
# C) abs()
# D) exp()

In [27]:
# Calculate the square root of 9
result <- sqrt(9)
print(result)  # Output: 3

[1] 3


In [28]:
# Calculate the natural logarithm of 10
result <- log(10)
print(result)

[1] 2.302585


In [29]:
# Calculate the absolute value of -5
result <- abs(-5)
print(result)  # Output: 5

[1] 5


In [30]:
# Calculate the exponential of 2
result <- exp(2)
print(result)

[1] 7.389056


In [31]:
# What is the result of the following R expression: 5 %% 2?
# A) 2
# B) 1 --> answer
# C) 0
# D) 5

In [32]:
5 %% 2  # Remainder of 5 divided by 2, which is 1

# **IDENTIFICATION**

1.  This function is used in R to fit a linear regression model.


In [33]:
#Answer: lm()

2.   This function provides the low-level data type of an object, such as "integer", "double", or "character".



In [34]:
#Answer: typeof()

3.   This function returns the number of elements in a one-dimensional object or the total number of elements in a multi-dimensional object.



In [35]:
#Answer: length()

4.   This function retrieves metadata associated with an object, such as dimensions, names, or class information. If no metadata exists, it returns NULL.  


In [36]:
#Answer: attributes()
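The three functions from items 2 to 4 can be compared side by side. This is a small sketch; the objects `x` and `m` are made up for illustration.

```r
x <- c(a = 1.5, b = 2, c = 3)   # named numeric vector (hypothetical)
m <- matrix(1:6, nrow = 2)      # 2 x 3 integer matrix (hypothetical)

typeof(x)        # "double"
typeof(m)        # "integer"
typeof("text")   # "character"

length(x)        # 3
length(m)        # 6, the total number of cells

attributes(x)    # $names: "a" "b" "c"
attributes(m)    # $dim: 2 3
attributes(1:3)  # NULL, a plain vector carries no metadata
```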

5.   This is the most basic data structure in R. It is a one-dimensional, homogeneous collection of elements, all of which must be of the same data type (e.g., numeric, character, or logical).



In [37]:
#Answer: vector

6.   This data structure in R can hold elements of different data types (e.g., numeric, character, logical). It is essentially a collection of objects that are indexed.



In [38]:
#Answer: list

7.   This is a two-dimensional, homogeneous data structure where all elements are of the same data type. It is organized into rows and columns.



In [39]:
#Answer: matrix

8.   This two-dimensional data structure is widely used for storing and analyzing datasets. It can hold elements of different data types (e.g., numeric, character, factor) in each column, but all elements in a single column must be of the same type.





In [40]:
#Answer: data frame
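The four structures from items 5 to 8 can be contrasted in a few lines. The sketch below uses made-up values, and `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
v  <- c(1, 2, 3)                 # vector: one-dimensional, one type
l  <- list(1, "a", TRUE)         # list: elements may differ in type
m  <- matrix(1:6, nrow = 2)      # matrix: two-dimensional, one type
df <- data.frame(id = 1:2, name = c("a", "b"),
                 stringsAsFactors = FALSE)  # data frame: one type per column

mixed <- c(1, "a", TRUE)  # a vector cannot mix types...
mixed                     # ...so everything is coerced: "1" "a" "TRUE"

sapply(l, typeof)         # "double" "character" "logical"
dim(m)                    # 2 3
sapply(df, typeof)        # id: "integer", name: "character"
```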

9.   How do you get the range (maximum minus minimum) of `array <- c(10, 20, 5, 15, 25)` in R?



In [41]:
#Answer: max(array) - min(array)
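As a quick check of this answer: base R's `range()` returns the minimum and maximum rather than their difference, so `diff(range(...))` is an equivalent way to write it. The variable name `array` follows the question.

```r
array <- c(10, 20, 5, 15, 25)

spread <- max(array) - min(array)
spread              # 20

range(array)        # 5 25 (minimum and maximum, not the difference)
diff(range(array))  # 20, same as max(array) - min(array)
```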

10. Complete the R script.

x <- c(10, 20, 5, 15, 25, 10, 20, 10)

get_mode <- function(x){

   uniq <- unique(x)

   uniq[_______(tabulate(match(x, uniq)))]

}

get_mode(x)

In [42]:
#Answer: which.max

In [43]:
x <- c(10, 20, 5, 15, 25, 10, 20, 10)

get_mode <- function(x){

  uniq <- unique(x)

  uniq[which.max(tabulate(match(x, uniq)))]

}

get_mode(x)
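One caveat worth noting: `which.max()` returns only the first maximum, so `get_mode()` reports a single value even when several values tie for the highest count. A possible variant that returns all tied values is sketched below; `get_modes` is a hypothetical helper name, not part of the original exercise.

```r
# Hypothetical variant of get_mode() that keeps every tied mode
get_modes <- function(x) {
  uniq <- unique(x)
  counts <- tabulate(match(x, uniq))  # occurrences of each unique value
  uniq[counts == max(counts)]         # all values with the highest count
}

get_modes(c(10, 20, 5, 15, 25, 10, 20, 10))  # 10
get_modes(c(1, 1, 2, 2, 3))                  # 1 2
```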

# **ESSAY**

1. Discuss how splitting, applying transformations, and combining data impact the efficiency and scalability of data analysis in R. Include examples of how these steps contribute to solving real-world problems.

The split-apply-combine strategy improves the efficiency and scalability of data analysis in R by dividing data into smaller subsets, performing an operation on each subset independently, and merging the results. Because each subset is processed on its own, the code is easier to organize and the same logic extends to large datasets. Typical uses include calculating group statistics (e.g., average sales per region), cleaning and transforming data (e.g., imputing missing values), and financial modeling (e.g., computing moving averages for different stocks). In each case, analysts write the per-group logic once and let R apply it across all groups, which makes real-world problems more tractable.
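The three steps can be sketched in base R using the "average sales per region" example. The `sales` data below is invented for illustration, and `aggregate()` is shown as one way to do all three steps in a single call.

```r
# Hypothetical sales data, for illustration only
sales <- data.frame(
  region = c("North", "North", "South", "South", "South"),
  amount = c(100, 150, 200, 250, 300)
)

pieces <- split(sales$amount, sales$region)  # split: one vector per region
means  <- lapply(pieces, mean)               # apply: mean of each piece
result <- data.frame(                        # combine: back into one table
  region = names(means),
  mean_amount = unlist(means),
  row.names = NULL
)
result

# The same computation in one step
aggregate(amount ~ region, data = sales, FUN = mean)
```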

2. R is traditionally considered better for small to medium-sized datasets. Discuss strategies and tools available in R (such as `data.table` and `dplyr`) for handling large datasets efficiently.

R has long been a go-to tool for analyzing small to medium-sized datasets, offering a rich array of functions for statistical analysis and data manipulation. For instance, base R functions like `mean()`, `median()`, and `summary()` are widely used for descriptive statistics, while `lm()` is a staple for linear regression analysis. Visualization tools such as `plot()`, `boxplot()`, and `hist()` help uncover patterns and trends in smaller datasets. These functions make it easy to perform quick, in-depth analyses. When combined with packages like `dplyr`, which provides functions such as `filter()`, `select()`, and `mutate()`, R offers an intuitive, efficient workflow for exploring and transforming data frames with millions of rows.

As datasets grow larger, packages such as `data.table` and `dplyr` help with speed and memory efficiency. `dplyr` expresses common tasks with `group_by()`, `summarize()`, and `inner_join()`, while `data.table` performs the same grouping, aggregation, and joins through its own `DT[i, j, by]` syntax and can modify columns by reference instead of copying the table. For even larger datasets, R integrates with databases through packages like `DBI` and `RSQLite`, allowing users to query data directly without loading everything into memory. Distributed computing tools such as `SparkR` or `sparklyr` enable R to handle big data challenges by leveraging cluster computing frameworks. Together, these tools make R a versatile platform for both small-scale exploratory analysis and large-scale data processing.
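As a rough illustration of the `data.table` style, the sketch below repeats the grouped mean from the `dplyr` example earlier in this notebook. It assumes the `data.table` package is installed; the object names `dt` and `group_means` are chosen here for illustration.

```r
library(data.table)  # assumes the data.table package is installed

dt <- data.table(
  group = c("A", "A", "B", "B", "B"),
  value = c(10, 15, 20, 25, 30)
)

# Grouped aggregation: the j expression is evaluated once per group
group_means <- dt[, .(mean_value = mean(value)), by = group]
group_means

# := adds a column by reference, without copying the whole table
dt[, double_value := value * 2]
dt
```

On a table this small the difference is not noticeable; modifying by reference mainly matters when the table is large enough that copies become expensive.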

3.  Explain the concept of linear regression and its purpose in statistical analysis. How does R facilitate the implementation of linear regression models? Discuss the use of the `lm()` function and its key arguments.  

Linear regression is a fundamental statistical method used to understand the relationship between two or more variables. It models the dependent variable (Y) as a linear combination of one or more independent variables (X), enabling predictions and insights into how changes in X influence Y. For example, in examining how age impacts lung capacity, linear regression can determine the strength and nature of this relationship. Key outputs from a regression model include the slope (indicating the rate of change in Y per unit change in X), the intercept (value of Y when X is zero), and metrics like R-squared that measure how well the model explains the variation in Y.

R makes implementing linear regression straightforward, primarily through the `lm()` function. Its key arguments are `formula` (e.g., `Y ~ X`) and `data`, the data frame that contains the variables; optional arguments such as `subset`, `weights`, and `na.action` control which observations are used and how. Once the model is fitted, `summary()` gives a detailed evaluation of the results, including coefficients, p-values, the residual standard error, and R-squared. Additionally, `anova()` supports hypothesis testing, `attributes(mod)` lists the names of the components stored in the model object, and accessor functions such as `coef(mod)`, `residuals(mod)`, and `fitted(mod)` extract specific components. These features make R a powerful tool for performing and interpreting linear regression in data analysis.
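A minimal sketch of this workflow follows. It uses the built-in `mtcars` dataset (fuel economy `mpg` modeled on weight `wt`) purely as an assumed example; the essay itself does not name a dataset.

```r
# Fit a simple linear regression: mpg as a function of car weight
mod <- lm(mpg ~ wt, data = mtcars)

summary(mod)          # coefficients, p-values, R-squared
coef(mod)             # intercept and slope
head(residuals(mod))  # observed minus fitted values
head(fitted(mod))     # values predicted for the training rows
anova(mod)            # F-test for the model terms
attributes(mod)       # component names and class of the model object

# Predict mpg for a hypothetical car with wt = 3 (3,000 lbs)
predict(mod, newdata = data.frame(wt = 3))
```

The slope reported by `coef(mod)` is the estimated change in `mpg` per unit increase in `wt`, matching the interpretation of the slope given above.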