# Midterm | R Lab for Applied Data Science

## General Instructions

*   This is an open-book, open-notes assignment. You are encouraged to use your course materials, online resources, and R documentation.
*   Show all your R code and output clearly within your Jupyter Notebook. Use markdown cells to provide explanations and interpretations of your code and results.
*   For each task, clearly state the task number and provide a concise answer or interpretation. A code cell is provided below each task.
*   Justify your choices of data manipulation techniques, visualizations, and analytical approaches. There is often more than one "correct" way to approach a data science problem.
*   Pay attention to code clarity, readability, and proper commenting.
*   The assignment is designed to assess your understanding and application of the concepts and techniques covered in Modules 1-6.

## Datasets

We'll be using several datasets for this midterm. Be sure to download all of them from the Canvas assignment before you begin. The data will be in a zip folder called `midterm-data.zip`. The datasets we'll be using are:

-  `penguins.rds` (from the `palmerpenguins` package)
-  `diamonds.rds` (from the `ggplot2` package)
-  `sleep.rds` (from the `VIM` package)
-  `sms-spam.rds` (from [Kaggle](https://www.kaggle.com/datasets/thedevastator/sms-spam-collection-a-more-diverse-dataset))

There are a total of **13 tasks**. Be sure to read through and answer every step for each task.

## Tasks

### Task 1: Exploring the Penguins Dataset

1. Load the `penguins.rds` dataset.
2. Examine the structure of the penguins dataset. What are the different variable types present (e.g., numeric, factor)? List at least three variables of different types and identify their types.
3. Calculate and present the mean and standard deviation of the `body_mass_g` (body mass in grams) for each penguin species. Briefly interpret these statistics - which species tends to be heavier on average?
4. Create a vector containing the unique island names where penguins were observed. How many unique islands are in the dataset?

In [None]:
# your code here...

### Task 2: Conditional Data Analysis

1. Using the penguins dataset, identify penguins that are of the `Gentoo` species and have a flipper length greater than 200 mm.
2. For these penguins, use `summary()` to report the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of bill depth (`bill_depth_mm`).

In [None]:
# your code here...

### Task 3: Diamonds Data Exploration and Filtering

1. Load the `diamonds.rds` dataset.
2. Filter the diamonds dataset to include only diamonds with a `cut` quality of `Ideal`.
3. Produce a scatterplot of `carat` vs. `price`. Color the points by the `color` variable.

In [None]:
# your code here...

### Task 4: Data Transformation and Summarization

1. Using the original diamonds dataset, create a new variable called `volume` which represents the volume of the diamond, calculated as `x` * `y` * `z`.
2. Group the diamonds data by `cut` quality.
3. For each cut quality, calculate the median `price` and the median `volume`.
4. Present the results in ascending order of median price. Which diamond cut quality has the lowest median price? 

In [None]:
# your code here...

### Task 5: Creating Categories and Summarizing Price per Carat

1. Using the diamonds dataset, create a new categorical variable called `carat_category` based on the `carat` variable. Define the categories as follows:
    1. "Small": diamonds with carat less than 0.5
    2. "Medium": diamonds with carat between 0.5 and 1 (inclusive of 0.5, exclusive of 1)
    3. "Large": diamonds with carat 1 or greater
2. Group the diamonds data by this newly created `carat_category`.
3. Plot the distribution of `price` for each `carat_category`.
4. Within each carat_category, calculate the average price per carat (i.e., price / carat).
5. Present the results showing the average price per carat for each category, in descending order of average price per carat.
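
The category boundaries in step 1 can be illustrated on a toy vector. The sketch below is only one possible approach, not a required one: it assumes base R's `cut()` with `right = FALSE`, so that each interval includes its lower bound and excludes its upper bound. The names `toy_carat` and `toy_category` and the values are made up for illustration.

```r
# Hypothetical carat values chosen to sit on and around the boundaries.
toy_carat <- c(0.3, 0.49, 0.5, 0.99, 1.0, 2.1)

# right = FALSE gives left-closed intervals: [0, 0.5), [0.5, 1), [1, Inf).
toy_category <- cut(
  toy_carat,
  breaks = c(0, 0.5, 1, Inf),
  labels = c("Small", "Medium", "Large"),
  right = FALSE
)

print(data.frame(toy_carat, toy_category))
```

With the default `right = TRUE`, a value of exactly 0.5 would fall in the lower interval instead, which would not match the definitions above.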

In [None]:
# your code here...

### Task 6: Data Joining

Imagine you had two datasets related to diamonds: one with diamond characteristics (diamonds dataset as we have) and another hypothetical dataset called `diamond_ratings` that contains information about diamond certifications and ratings, linked by a common diamond ID.

1. Explain conceptually how you would combine these two datasets based on a common ID. Describe which type of join you would use if you wanted to keep all rows from the diamonds dataset, even if some diamonds don't have a matching rating in `diamond_ratings`.
2. Provide example R code demonstrating the syntax of how you would perform this join, assuming both datasets exist and have a common ID column named 'diamond_id'. (You don't need to actually run code on hypothetical data, just show the syntax.)

In [None]:
# your code here...

### Task 7: A Simple Model of Diamond Prices

1. Build a simple linear model of `price` as a function of any one or more of the other variables. Apply any transformations as you see fit.
2. Produce an "actual vs prediction" plot.

In [None]:
# your code here...

### Task 8: Acquiring Financial Data for Multiple Stocks with `tidyquant`

1. Using the `tidyquant` package, retrieve the daily stock price data for Microsoft Corp. (MSFT) and Alphabet Inc. (GOOG) from January 1, 2021, to December 31, 2021.
2. For each stock, calculate the daily percentage change in the 'adjusted' closing price during this period.
3. Create a plot showing the distribution of the daily percentage change for each stock.
4. For each stock, calculate the cumulative percentage change as the cumulative product of the daily changes (don't forget to add 1 to each daily change first, then subtract 1 at the end).
5. Create a single time series plot showing the cumulative percent change for both MSFT and GOOG on the same plot over the specified period, using different colors for each stock.
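
The arithmetic in steps 2 and 4 can be sketched on a made-up price series. This is a minimal base R illustration under stated assumptions, not the required `tidyquant` workflow: `toy_price` is hypothetical, and changes are kept as fractions (multiply by 100 for percentages).

```r
# Hypothetical adjusted closing prices for one stock on five trading days.
toy_price <- c(100, 102, 99.96, 104.958, 104.958)

# Daily change as a fraction of the previous day's price
# (the first day has no previous price, so it has no change).
daily_change <- toy_price[-1] / toy_price[-length(toy_price)] - 1

# Add 1 to turn each change into a growth factor, compound with cumprod(),
# then subtract 1 to get back to a change relative to the starting price.
cumulative_change <- cumprod(1 + daily_change) - 1

print(round(daily_change, 4))
print(round(cumulative_change, 4))
```

A useful sanity check is that the compounded value on any day should equal that day's price divided by the first price, minus 1.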

In [None]:
# your code here...

### Task 9: Interactive Charts

1. Turn the plots you produced in Task 8 into interactive plots with plotly.

In [None]:
# your code here...

### Task 10: Identifying and Visualizing Missing Data in Sleep Dataset

1. Load the `sleep.rds` dataset (originally from the `VIM` package, where it is also available via `library(VIM); data(sleep)`).
2. Create a visualization using either the `visdat` or `naniar` package to visualize the pattern of missing data in the `sleep` dataset. Which pair of variables has the most concurrent missing observations?
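
A numeric cross-check can complement the visualization when reading off concurrent missingness. The sketch below is an assumption-laden illustration in base R on a made-up data frame (`toy` is hypothetical, not the real `sleep` data), and it does not replace the `visdat` or `naniar` plot the task asks for.

```r
# Hypothetical data frame with missing values (not the real sleep data).
toy <- data.frame(
  a = c(1, NA, 3, NA, 5),
  b = c(NA, NA, 3, NA, 5),
  c = c(1, 2, NA, 4, 5)
)

# is.na() gives a TRUE/FALSE matrix; multiplying by 1 makes it 0/1.
miss <- is.na(toy) * 1

# crossprod(m) computes t(m) %*% m, so entry [i, j] counts the rows where
# columns i and j are both missing; the diagonal holds per-column NA counts.
co_missing <- crossprod(miss)

print(co_missing)
```

The largest off-diagonal entry identifies the pair of columns that are most often missing together.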

In [None]:
# your code here...

### Task 11: Imputation of Missing Values in Sleep Dataset and Comparison 

Focus on the `NonD` column in the `sleep` dataset, which has missing values. Implement two different methods to impute these missing values:
1. **Method 1: Median Imputation:** Replace missing `NonD` values with the median of the available `NonD` values.
2. **Method 2: Regression Imputation (Simple):** Use a linear regression model to predict `NonD`. Choose one or more relevant predictors, but keep in mind that you want to fill in all the NAs, so the predictors themselves should be observed wherever `NonD` is missing.
3. After performing both imputations, produce a plot that shows the distribution of `NonD`, the filled values using the median, and the filled values using linear regression.
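
The two imputation methods can be sketched on a small made-up data frame. This is a minimal base R illustration, assuming a single fully observed predictor; the names `toy`, `x`, `y`, `y_median`, and `y_regression` are hypothetical and unrelated to the `sleep` columns.

```r
# Hypothetical data: y has gaps, x is fully observed.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(2.1, NA, 6.2, 7.9, NA, 12.1)
)
is_missing <- is.na(toy$y)

# Method 1: fill every gap with the median of the observed values.
toy$y_median <- ifelse(is_missing, median(toy$y, na.rm = TRUE), toy$y)

# Method 2: fit y ~ x on the complete rows (lm() drops NA rows by default),
# then predict y only where it was missing.
fit <- lm(y ~ x, data = toy)
toy$y_regression <- toy$y
toy$y_regression[is_missing] <- predict(fit, newdata = toy[is_missing, ])

print(toy)
```

Median imputation gives every gap the same value, while regression imputation lets the filled values vary with the predictor. If the predictor were also missing in a row where `y` is missing, `predict()` would return `NA` for that row.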

In [None]:
# your code here...

### Task 12: Building a More Complex Model of Diamond Prices

Begin by running the code below. We'll use a subset of the diamonds dataset so you don't have to wait so long for the model to train:

In [None]:
library(dplyr) # provides %>% and slice_sample()

set.seed(2025)

tb_diamonds = readRDS("midterm-data/diamonds.rds") # change this path if you need to

# keep a random 10% of the rows
tb_diamonds_subset = tb_diamonds %>% 
    slice_sample(prop = 0.1)

Using `tb_diamonds_subset`, you want to predict `price` using other diamond characteristics. 

1. Begin by engineering some features (consider taking the logarithm of `price`).
2. Next, do an 80/20 split of the data (80% training, 20% testing).
3. Using 5-fold CV on the training set, apply either a **wrapper method** (e.g. random forest) or an **embedded method** (e.g. LASSO) to select features.
4. With the selected features, fully train the model on the training set (if not done during the 5-fold CV training), then make predictions on the test set. Calculate the NRMSE.
5. Using your model formulation from Task 7, train the linear model on the 80% training data, then test it on the 20% test data. Calculate the NRMSE. How do your models compare?
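
The split-then-score pattern in steps 2, 4, and 5 can be sketched on made-up data. NRMSE is not defined in this document and several normalizations are in use (range, mean, standard deviation); the sketch below assumes normalization by the range of the observed values, so follow the course definition if it differs. The `nrmse()` helper and the `toy` data are hypothetical.

```r
# NRMSE under one assumed convention: RMSE divided by the observed range.
nrmse <- function(actual, predicted) {
  rmse <- sqrt(mean((actual - predicted)^2))
  rmse / (max(actual) - min(actual))
}

# Hypothetical data with a roughly linear relationship.
set.seed(1)
toy <- data.frame(x = runif(100))
toy$y <- 3 + 2 * toy$x + rnorm(100, sd = 0.1)

# 80/20 split: sample 80% of the row indices for training.
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Fit on the training rows only, then score on the held-out rows.
fit <- lm(y ~ x, data = train)
test_nrmse <- nrmse(test$y, predict(fit, newdata = test))
print(test_nrmse)
```

If the target is log-transformed, it matters whether the error is computed on the log scale or after transforming predictions back to the original scale, so the two models should be scored on the same scale to be comparable.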

In [None]:
# your code here...

### Task 13: Text Tokenization and Visualization of SMS Spam Data

**There is some coarse language in this dataset, so apologies beforehand!**

1. Load the `sms-spam.rds` dataset.
2. Tokenize the text (`sms`) column by word.
3. Remove stop words, and add a column for the sentiment of each word. Add a numeric value for the sentiment (e.g. -1 for negative and 1 for positive).
4. Perform a Student's t-test (`t.test()`) comparing the sentiment scores of non-spam text messages (`label == 0`) with those of spam text messages (`label == 1`). Is there a "statistically significant" (p-value < 0.05) difference between the mean sentiments? Interpret the mean sentiment of the spam messages.
5. Produce a wordcloud using the positive words in regular text messages and the positive words in the spam text messages. Do the same for the negative words.
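
The call pattern for step 4 can be sketched on two made-up score vectors. One detail worth knowing is that `t.test()` runs Welch's unequal-variance test by default, and the pooled-variance Student's version requires `var.equal = TRUE`. The vectors `toy_ham` and `toy_spam` below are hypothetical and say nothing about the real data.

```r
# Hypothetical word-level sentiment scores (-1 negative, 1 positive).
toy_ham <- c(1, 1, -1, 1, 1, -1, 1, 1, 1, -1)
toy_spam <- c(1, 1, 1, 1, -1, 1, 1, 1, 1, 1)

# Default: Welch's t-test, which does not assume equal variances.
welch <- t.test(toy_ham, toy_spam)

# var.equal = TRUE gives the classic pooled-variance Student's t-test.
student <- t.test(toy_ham, toy_spam, var.equal = TRUE)

print(welch$p.value)
print(student$p.value)
print(student$estimate) # the two group means
```

With scores coded as -1 and 1, a group mean above 0 indicates more positive than negative words in that group, and a mean below 0 indicates the opposite.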

In [None]:
# your code here...