# Embarking on a Practical R Exercise Adventure

Welcome to today's R exercise session! This is your chance to apply your newfound knowledge and skills in R to real-world problems. Throughout this exercise, you'll tackle challenges, experiment with data, and hone your analytical abilities. Think of it as a journey – one where you'll encounter obstacles, make discoveries, and emerge with a deeper understanding of R and its applications.

As you navigate through the exercises, remember to embrace the process. Don't be afraid to try new things, make mistakes, and learn from them. Each step you take brings you closer to mastery, so keep pushing forward and exploring the possibilities. Your persistence and dedication will undoubtedly pay off, both in your academic pursuits and future endeavors.

So, let's embark on this journey together and see where it takes us. Get ready to unleash the power of R and uncover valuable insights from the data. Enjoy the adventure, and remember, you've got this!

### Task:
In this exercise, you'll dive into a series of R tasks designed to challenge your skills and reinforce your understanding of the language. Your mission is to fill in the missing code snippets (???) where indicated and complete each task.

### Submission:
Fill in the missing code snippets (???) provided in the Markdown format below to complete the exercise.


## Exercise Overview: Preparing Data

Prepare the `Hitters` dataset for analysis, focusing on predicting the number of assists made by players. Your goal is to clean the dataset and ensure it's ready for analysis.

To begin, load the `ISLR` library and access the `Hitters` dataset. Remove any rows with missing values (NA) and prepare a predictors matrix (`X`) without an intercept term. Convert this matrix to a dataframe named `data`. Next, append the assists column from the original dataset to the `data` dataframe. Remove the unnecessary `NewLeague` column and any remaining rows with missing values.

Complete the ??? task by filling in the missing code snippets to ensure the dataset is clean and ready for further analysis.


In [1]:
# Load the ISLR library
library(???)

# Access the Hitters dataset
data(????)

# Remove rows with NA values to ensure clean data
HittersClean <- na.omit(Hitters)

# Prepare the predictors matrix without intercept
X <- model.matrix(??? ~ . - 1, data = HittersClean) # Preparing predictors

# Convert the matrix to a dataframe
data <- as.data.frame(X)

# Append the assist as outcome variable
data$????? <- HittersClean$???


# Remove row names

rownames(data) <- NULL

# Now 'data' is a dataframe that includes both predictors and the outcome variable

data = as.data.frame(data)

# Remove the NewLeague column
data <- data[, !names(data) %in% "NewLeague"]

# Remove rows with missing values
data <- na.omit(data)



### Exercise Overview: Splitting Data

In this exercise, you will split the cleaned `Hitters` dataset into training and test sets for further analysis. This step is crucial for evaluating the performance of machine learning models.

To begin, set the seed for reproducibility using `set.seed(123)`. Then, calculate the size of the training set based on a split ratio of 0.8. Next, randomly sample row indices to create the training set, ensuring that it comprises 80% of the data. The remaining rows will form the test se sets.


In [2]:

# Set the seed for reproducibility
??????

# Calculate the size of the training set
??? = 0.8
training_size <- ?????

# Randomly sample row indices for the training set
training_indices <- sample(???? size = training_size)

# Create ???

# Create the test set
test <-????

### Exercise Overview: OLS

In this exercise, you will perform Ordinary Least Squares (OLS) regression to predict the `assist` variable in the given dataset. You will then evaluate the performance of the model by calculating the Mean Squared Error (MSE) and printing it to the console.

1. **Fit OLS Regression Model:**
   - Use the `lm()` function to fit a linear regression model to predict `assist` using all available predictors.

2. **Evaluate Model Performance:**
   - Display the summary of the OLS model to understand its performance.
   - Predict the `assist` values for the test dataset using the fitted OLS model.
   - Calculate the Mean Squared Error (MSE) for the predictions.
   - Print the MSE value to the console.

In [3]:
library(glmnet)
library(dplyr)

# Fit the linear regression model
ols <- lm(???? ~ ., data = train)
# Display the summary of the model to understand its performance
summary(ols)

# Predicting the assist for the test dataset
ols$predict_outcome <- ???

# Calculating the Mean Squared Error (MSE) for our predictions
??? <- mean((test$?? - ols$predict_outcome)^2)
# Print the MSE to the console
print(predMSEols)

Loading required package: Matrix

Loaded glmnet 4.1-8


Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union





Call:
lm(formula = salary ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-753.89 -173.98  -11.87  159.32 1815.62 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  214.09783  124.12599   1.725  0.08618 . 
AtBat         -1.68499    0.75317  -2.237  0.02644 * 
Hits           5.06495    2.89376   1.750  0.08168 . 
HmRun         -2.68840    6.86520  -0.392  0.69579   
Runs          -0.95353    3.41442  -0.279  0.78034   
RBI            1.41046    3.00406   0.470  0.63924   
Walks          6.35620    2.11308   3.008  0.00299 **
Years        -13.14742   14.12616  -0.931  0.35318   
CAtBat        -0.25953    0.15700  -1.653  0.09996 . 
CHits          0.65158    0.80755   0.807  0.42076   
CHmRun         1.29483    1.86077   0.696  0.48737   
CRuns          1.25425    0.89170   1.407  0.16119   
CRBI           0.46318    0.84092   0.551  0.58242   
CWalks        -0.68669    0.39171  -1.753  0.081

“prediction from a rank-deficient fit may be misleading”


[1] 138149.8


### Exercise Overview: Lasso and Ridge Regression

In this exercise, you will estimate Lasso and Ridge regression models to predict the `assist` variable using all available predictors except the outcome variable in the training dataset. You will also perform cross-validation to find the optimal lambda value for each model.

1. **Lasso Model:**
   - Estimate a Lasso model using all predictors except the outcome variable (`assist`) in the `train` dataset.
   - Perform cross-validation to find the optimal lambda value for the Lasso model.

2. **Ridge Model:**
   - Estimate a Ridge model using all predictors except the outcome variable (`assist`) in the `train` dataset.
   - Perform cross-validation to find the optimal lambda value for the Ridge model.

Complete the ??? task by filling in the missing co

In [8]:
# Estimate a Lasso model using all predictors except the outcome variable
lasso <- ???(as.matrix(train[, -ncol(train)]), train$???, alpha = ???)

# Perform cross-validation to find the optimal lambda value
lasso.cv <- cv.glmnet(as.matrix(train[, -ncol(train)]), train$????, type.measure = "mse", nfolds = 5, alpha = 1)



ridge <- glmnet(as.matrix(train[, -ncol(train)]), train$????, alpha = ???)
# Cross-validate the Ridge model 
ridge.cv <- cv.glmnet(as.matrix(train[, -ncol(train)]), train$????, type.measure = "mse", nfolds = 5, alpha = 0)

ERROR: Error in parse(text = x, srcfile = src): <text>:2:46: unexpected ','
1: # Estimate a Lasso model using all predictors except the outcome variable
2: lasso <- ???(as.matrix(train[, -ncol(train)]),
                                                ^


In [5]:


# Predict using the fitted Lasso model
lasso$predict_outcome <- predict(lasso, newx = as.matrix(test[, -ncol(test)]), s = lasso.cv$lambda.min)

# Calculate the MSE
predMSElasso <- mean((t????? - lasso$predict_outcome)^2)
print(paste0("MSE: ", predMSElasso))



[1] "MSE: 130313.877940595"


In [6]:
# Predict using the fitted Lasso model
ridge$predict_outcome <- predict(ridge, newx = as.matrix(test[, -ncol(test)]), s = ridge.cv$lambda.min)

# Calculate the MSE
???? - ridge$predict_outcome )^2)
print(paste0("MSE: ", predMSEridge))

[1] "MSE: 132630.476586086"


### Exercise Overview: Comparing

In this exercise, you will compare the Mean Squared Error (MSE) values obtained from three different regression models: Ordinary Least Squares (OLS), Lasso, and Ridge regression. Additionally, you will determine which model is most suitable based on their respective MSE values.

1. **Comparing MSE:**
   - Print the MSE values obtained from the OLS, Lasso, and Ridge regression models.

2. **Selecting the Best Model:**
   - Based on the MSE values, determine which regression model is most appropriate for predicting the `assist` variable in the given dataset.

Complete the ??? task by filling in the

In [7]:
????

[1] 138149.8 130313.9 132630.5
