In [None]:
library(data.table)
library(datasets)
options(repr.matrix.max.rows=10, repr.matrix.max.cols=15) # limit how many rows and columns are shown when a table is printed

Note that the data must be accessible from anywhere without authentication, and a direct URL to the file (csv, xlsx, etc.) must be available. Kaggle datasets may therefore not be appropriate, since they require authentication.

Import the dataset into the session either by loading its package with library() and creating a new copy of the dataset, or by reading it from its URL with data.table::fread() for csv/tsv files or readxl::read_xlsx() for Excel files.

Make sure the imported dataset is a data.table. fread() returns one by default; with other import methods, convert the object to a data.table.

In [None]:
data <- data.table(quakes)
data
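The cell above converts with `data.table()`. As a minimal sketch of other routes to the same result, the block below uses `as.data.table()`, `setDT()`, and `fread()`. The object names `quakes_copy`, `quakes_df`, and `tiny` are hypothetical, and the inline text passed to `fread()` merely stands in for a real CSV file or URL.

```r
library(data.table)
library(datasets)

# as.data.table() returns a converted copy; the built-in data.frame is unchanged
quakes_copy <- as.data.table(quakes)

# setDT() converts by reference, so apply it to a local copy of the built-in dataset
quakes_df <- data.frame(quakes)
setDT(quakes_df)

# fread() returns a data.table directly; these lines stand in for a CSV file or URL
tiny <- fread(text = c("mag,depth", "4.8,562", "6.1,139"))

is.data.table(quakes_copy)
is.data.table(quakes_df)
is.data.table(tiny)
```

`setDT()` avoids copying the data, which can matter for large tables, while `as.data.table()` is the safer choice when the original object must stay a data.frame.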

<b>(Question)</b> Make an "i" operation for filtering rows of the dataset. The filter should not be trivial (for example, combine more than one condition with a logical operator). Explain what the filter does in a markdown cell above the code cell.


<b>(Explanation)</b> I am filtering the earthquakes dataset to keep only the earthquakes that meet both of the following conditions (combined with the `&` operator):

1. <b>Magnitude greater than 5.5:</b> the `mag` column (Richter magnitude) must exceed 5.5.

2. <b>Depth greater than 100:</b> the `depth` column (in km) must exceed 100.

In [None]:
data[ (mag > 5.5) & (depth > 100)]

<b>(Question)</b> Make a "j" operation by selecting and/or calculating more than one column with .(...) notation. For example, you can select one column, calculate a new column from existing ones, and return them together: .(existing_column, new_column = a_second_existing_column / a_third_existing_column)

<b>(Explanation)</b> Selects existing columns "mag" and "depth" as is. Creates a new column "mag_times_depth" by multiplying "mag" and "depth" values.

In [None]:
data[, .(mag, depth, mag_times_depth = mag * depth)]

<b>(Question)</b> Make a "j" operation again but this time assign the calculated new column back to the object with := notation. Then print the object to show that the new column is added

<b>(Explanation)</b>
1. <b>how_serious_quake:</b> Categorizes earthquake seriousness based on magnitude using the `how_big_quake` function.
2. <b>depth_category:</b> Categorizes earthquake depth using the `categorize_depth` function.

In [None]:
how_big_quake <- function(x) {
  if (x < 5) {
    return("No damage!")
  } else if (x < 6) {
    return("Minor damage!")
  } else {
    return("Slight or serious damage!")
  }
}

categorize_depth <- function(depth) {
  if (depth < 50) {
    return("Shallow")
  } else if (depth < 200) {
    return("Moderate")
  } else {
    return("Deep")
  }
}

# Columns can be referenced directly inside j; sapply() applies each scalar function row by row
data[, how_serious_quake := sapply(mag, how_big_quake)]
data[, depth_category := sapply(depth, categorize_depth)]

data
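The row-by-row `sapply()` calls above could also be written in vectorized form. The following is a minimal sketch, assuming data.table 1.13.0 or later (where `fcase()` is available). It works on a separate copy with the hypothetical name `quakes_dt`, so the notebook's `data` object is not modified.

```r
library(data.table)
library(datasets)

# Hypothetical copy so the notebook's `data` object is left untouched
quakes_dt <- as.data.table(quakes)

# fcase() checks the conditions in order and returns the value paired with the first TRUE one
quakes_dt[, how_serious_quake := fcase(
  mag < 5, "No damage!",
  mag < 6, "Minor damage!",
  default = "Slight or serious damage!"
)]
quakes_dt[, depth_category := fcase(
  depth < 50,  "Shallow",
  depth < 200, "Moderate",
  default = "Deep"
)]

head(quakes_dt)
```

Because `fcase()` operates on whole columns at once, it avoids one function call per row and keeps the thresholds next to the assignment.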

<b>(Question)</b> Make a "by" operation to calculate a summary measure on one or two columns using .(...) notation separately for each distinct value of one or more discrete/categoric/factor variables (note that dates or times can also be considered discrete variables)

<b>(Explanation)</b>
- Calculates the average depth (`avg_depth`) separately for each distinct value of the categorical `how_serious_quake` variable (stored as a character column).

This operation allows for the analysis of average depth based on the categorization of earthquake seriousness.

In [None]:
data[, .(avg_depth = mean(depth)), by = .(how_serious_quake)]
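A grouped summary can return several measures at once. The sketch below is an illustration rather than part of the original answer: it groups on a hypothetical logical column `strong` (magnitude of at least 5) created on a separate copy `quakes_dt`, and returns the mean depth, the median magnitude, and the group size via `.N`.

```r
library(data.table)
library(datasets)

# Hypothetical copy and grouping column, used only for this illustration
quakes_dt <- as.data.table(quakes)
quakes_dt[, strong := mag >= 5]

# One row per group, with three summary measures computed inside .(...)
summary_dt <- quakes_dt[, .(
  avg_depth  = mean(depth),
  median_mag = median(mag),
  n_quakes   = .N
), by = .(strong)]

summary_dt
```

Including `.N` alongside the averages shows how many earthquakes each group mean is based on.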

<b>(Question)</b> Make a dcast() operation so that the values of a column are summarized into rows and columns across two discrete/categoric variables. Assign the results to a new object and print the object

<b>(Explanation)</b>


- Creates a new object `dcast_data` by summarizing the count of occurrences for each combination of `how_serious_quake` and `depth_category`.

This operation provides a tabular summary, showing the distribution of earthquakes based on seriousness and depth categories.

In [None]:
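# value.var is set explicitly so dcast() does not have to guess which column to aggregate;
# with fun.aggregate = length, each cell is simply the number of rows in that combination
dcast_data <- dcast(data, how_serious_quake ~ depth_category, fun.aggregate = length, value.var = "mag")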
dcast_data

<b>(Question)</b> Make a melt() operation so that the cast object in the previous step is transformed into a long object again

<b>(Explanation)</b>
- Creates a new object `melt_data` by melting `dcast_data`, keeping "how_serious_quake" as the identifier variable. The former column names (the depth categories) go into a "depth_category" column, and the cell values go into a "count" column.

This operation reverts the tabular summary back to a long format for easier analysis and visualization.

In [None]:
melt_data <- melt(dcast_data, id.vars = "how_serious_quake", variable.name = "depth_category", value.name = "count")
melt_data
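One way to sanity-check a cast and melt round trip is to confirm that the counts still add up to the number of rows in the source table. The sketch below is self-contained and uses hypothetical names (`quakes_dt`, `strong`, `deep`, `wide`, `long`) and hypothetical category labels rather than the objects created above. It assumes a data.table version that provides `fifelse()`.

```r
library(data.table)
library(datasets)

# Hypothetical two-level categories, built on a separate copy of the dataset
quakes_dt <- as.data.table(quakes)
quakes_dt[, strong := fifelse(mag >= 5, "mag_5_plus", "mag_below_5")]
quakes_dt[, deep := fifelse(depth >= 200, "deep", "not_deep")]

# Wide: one row per `strong` level, one count column per `deep` level
wide <- dcast(quakes_dt, strong ~ deep, fun.aggregate = length, value.var = "mag")

# Long again: one row per (strong, deep) combination
long <- melt(wide, id.vars = "strong", variable.name = "deep", value.name = "count")

# Every quake falls in exactly one cell, so the counts must sum to the row count
sum(long$count) == nrow(quakes_dt)
```

Note that the melted table holds one row per category combination, not one row per earthquake, so it recovers the summary in long form rather than the original observations.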