<div style="text-align: center;"> <h3>Linear Model</h3>
<h5>Problem Set 2</h5>
<h5>April 22, 2025</h5>    
<h5><u>By Romand Lansangan</u></h5>
    </div>
    
---

## Internet Use and GDP

The U.S. Central Intelligence Agency (2010) contains information on topics such as geography, people, the economy, communications, and transportation for most countries in the world. For example, the INTERNET data file contains data from 2010 relating to gross domestic product (GDP) per capita in thousands of dollars (Gdp) and the percentage of the population that are Internet users (Int) for 212 countries. Here, GDP is based on purchasing power parities to account for between-country differences in price levels. This problem investigates whether there is a linear association between these two variables. In particular, **how effective is it to use Gdp to predict Int using simple linear regression?**

In [None]:
library(tidyverse)  # attaches readr, dplyr, and ggplot2, among others

options(repr.plot.width=10, repr.plot.height=5)


In [None]:
df <- read_csv("internet.csv")
head(df)

In [None]:
nrow(df)

In [None]:
summary(df)

#### (a) Find the least squares line for the data, that is, use statistical software to find the intercept and slope of the least squares line, and write out the equation of that line.

In [None]:
first <- lm(Int ~ Gdp, data = df)
summary(first)
plot(first)

The `Residuals vs Fitted` plot shows residuals scattered randomly around zero, with no discernible increasing or decreasing pattern as the fitted values increase. This is consistent with the linearity assumption, that is, the errors have mean zero across the range of fitted values. (This plot alone does not establish that the error terms are independent.)

The `Normal Q-Q` plot suggests that the residuals do not stray far from a normal distribution.

The `Scale-Location` plot shows a roughly horizontal line across the range of fitted values, indicating approximately constant error variance.

Therefore, it is reasonable to model `Int` as a linear function of `Gdp` plus a random error.
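The three diagnostic plots discussed above can also be requested individually through the `which` argument of `plot()` for `lm` objects. The following is only a sketch: it uses R's built-in `mtcars` data as a stand-in (an assumption, since the INTERNET file is not reproduced here), and `fit_demo` is a hypothetical name.

```r
# Stand-in data: the built-in mtcars data set, NOT the INTERNET file
fit_demo <- lm(mpg ~ wt, data = mtcars)

# Put the three plots side by side, then restore the previous layout
old_par <- par(mfrow = c(1, 3))
plot(fit_demo, which = 1:3)  # 1 = Residuals vs Fitted, 2 = normal Q-Q, 3 = Scale-Location
par(old_par)
```

Viewing the plots in one row makes it easier to compare them against the three assumptions being checked.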

In [None]:
coefs <- coef(first)

slope <- coefs["Gdp"]              # estimated slope
intercept <- coefs["(Intercept)"]  # estimated intercept

print(paste("Slope:", slope))
print(paste("Intercept:", intercept))

The regression line can then be written as:

$$
E(Int) = 1.3609 \cdot Gdp + 12.3628
$$

#### (b) Interpret the estimates of the slope and the intercept in the context of the problem.

The slope of about $1.36$ indicates that the expected value of `Int` (the percentage of a country's population who are Internet users) increases by about $1.36$ percentage points for every one-unit increase in `Gdp`, that is, for every additional 1,000 dollars of GDP per capita.

The intercept of about $12.36$ indicates that a country with `Gdp = 0` would be expected to have about $12.36$ percent Internet users. This has no meaningful practical interpretation: the intercept here is a mathematical extrapolation beyond the observed data, since the range of `Gdp` in the sample does not include 0. The interpretation is therefore more theoretical than practical.

#### (c) Predict the percentage of Internet users if GDP per capita is $20,000.

In [None]:
gdp = 20000 

manual_prediction <- intercept + slope * (gdp/1000) # Gdp is in thousands
print(manual_prediction)
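The same kind of prediction can be obtained with `predict()`, which avoids retyping the coefficients. The sketch below uses the built-in `mtcars` data as a stand-in (an assumption, not the INTERNET data); its `wt` column is recorded in thousands of pounds, so it needs the same kind of unit conversion as `Gdp`. The names `fit_demo`, `weight_lb`, and `new_car` are hypothetical.

```r
# Stand-in data: built-in mtcars, where wt is weight in 1000 lb
fit_demo <- lm(mpg ~ wt, data = mtcars)

weight_lb <- 3000                              # value in raw units
new_car <- data.frame(wt = weight_lb / 1000)   # convert to the model's units

# Prediction via predict() and via the fitted equation
pred <- predict(fit_demo, newdata = new_car)
manual <- coef(fit_demo)[1] + coef(fit_demo)[2] * new_car$wt

# The two approaches should agree up to floating-point error
print(all.equal(unname(pred), unname(manual)))
```

The column name in `newdata` must match the predictor name used in the model formula; otherwise `predict()` cannot find the variable.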

#### (d) Draw a scatterplot with Int on the vertical axis and Gdp on the horizontal axis, and add the least squares line to the plot.

In [None]:
ggplot(df) +
  geom_point(aes(x = Gdp, y = Int), col = "blue") +
  geom_smooth(aes(x = Gdp, y = Int, color = "Regression Line (LSM)"), 
              method = "lm", se = FALSE, formula = y ~ x, linewidth = 1) +
  ggtitle("Internet Use vs GDP per Capita") + 
  ylab("Internet users (% of population)") + 
  xlab("GDP per capita (thousands of dollars)") +
  theme(plot.title = element_text(hjust = 0.5)) 

## Movie Data and Box Office Receipts

The MOVIES data file contains data on 25 movies from “The Internet Movie Database” (www.imdb.com). Based on this dataset, we wish to investigate whether all-time US box office receipts (Box, in millions of US dollars unadjusted for inflation) are associated with any of the following variables:
* Rate = Internet Movie Database user rating (out of 10)
* User = Internet Movie Database users rating the movie (in thousands)
* Meta = “Metascore” based on 35 critic reviews (out of 100)
* Len = Runtime (in minutes)
* Win = Award wins
* Nom = Award nominations

Theatrical box office receipts (movie ticket sales) may include theatrical re-release receipts, but exclude video rentals, television rights, and other revenues.

In [None]:
df_2 <- read_csv("movies.csv")

head(df_2)

summary(df_2)

#### (a) Write out a regression equation for a multiple linear regression model for predicting response Box from just three predictors: Rate, User, and Meta.

In [None]:
predictors = c('Rate', 'User', 'Meta')

X <- df_2 %>%
    select(all_of(predictors))

y <- df_2$Box

second <- lm(y ~ . , data=X)

summary(second)

$$
E(Box) = 35.4962 \cdot Rate + 0.4328 \cdot User + 1.2462 \cdot Meta - 169.0862 
$$

#### (b) Interpret the estimated regression parameter for Rate in the context of the problem.

The coefficient of `Rate` indicates that the expected `Box` of a movie increases by about $35.4962$ million US dollars for every one-point increase in its Internet Movie Database user rating (`Rate`), holding `User` and `Meta` constant.
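The "holding the other predictors constant" reading of a coefficient can be checked numerically: raising one predictor by one unit while keeping the rest fixed changes the prediction by exactly that coefficient. This is a sketch using the built-in `mtcars` data as a stand-in (an assumption, not the MOVIES data); `fit_demo`, `base`, and `bumped` are hypothetical names.

```r
# Stand-in data: built-in mtcars, NOT the MOVIES file
fit_demo <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Two hypothetical cars that differ only in wt, by exactly one unit
base <- data.frame(wt = 3, hp = 120, qsec = 18)
bumped <- transform(base, wt = wt + 1)

# Difference in predictions when only wt changes
change <- predict(fit_demo, newdata = bumped) - predict(fit_demo, newdata = base)

# This difference should equal the estimated coefficient of wt
print(all.equal(unname(change), unname(coef(fit_demo)["wt"])))
```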


#### (c) Look at the graphs produced by the "plot" function to do assumption checks for the model in item (a).

In [None]:
plot(second)

The `Residuals vs Fitted` plot shows residuals scattered randomly around zero, with no discernible increasing or decreasing pattern as the fitted values increase. This is consistent with the **errors having mean zero** across the range of fitted values (the linearity assumption); the plot alone does not establish that the error terms are independent.

The `Normal Q-Q` plot suggests that the residuals do not stray far from a normal distribution, that is, they are **approximately normally distributed**.

The `Scale-Location` plot shows a roughly horizontal line across the range of fitted values, indicating that the errors are **approximately equal in variance**.

Therefore, it is reasonable to represent `Box` as a linear function of `Rate`, `User`, and `Meta` plus a random error.

#### (d) Use statistical software to fit the following complete model for Box as a function of all six predictor variables:
$$E(Box) = \beta_0 + \beta_1 \cdot Rate + \beta_2 \cdot User + \beta_3 \cdot Meta + \beta_4 \cdot Len + \beta_5 \cdot Win + \beta_6 \cdot Nom$$

In [None]:
X <- df_2 %>%
    select(-c("Box", "Movie"))

third <- lm(y ~ ., data=X)
summary(third)

$$E(Box) = -172.28110 + 
35.34769 \cdot Rate + 
0.38894 \cdot User + 
1.25615 \cdot Meta + 
0.02473 \cdot Len - 
0.02080 \cdot Win + 
0.37261 \cdot Nom$$

In [None]:
rss_3 <- sum(residuals(third)^2)

print(paste("RSS:", rss_3))

#### (e) Use statistical software to fit the following reduced model:
$$E(Box) = \beta_0 + 
\beta_1 \cdot Rate + 
\beta_2 \cdot User + 
\beta_3 \cdot Meta$$

In [None]:
second

$$E(Box) = -169.0862 + 
35.4962 \cdot Rate + 
0.4328 \cdot User + 
1.2462 \cdot Meta$$

In [None]:
rss_2 <- sum(residuals(second)^2)

print(paste("RSS:", rss_2))
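The RSS computed above is directly tied to the residual standard error compared in part (f): $s = \sqrt{RSS / (n - k - 1)}$, where $k$ is the number of predictors. The sketch below checks this relationship on the built-in `mtcars` data, used here only as a stand-in (an assumption, not the MOVIES data); `fit_demo`, `rss_demo`, and `s_demo` are hypothetical names.

```r
# Stand-in data: built-in mtcars, NOT the MOVIES file
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

# Residual sum of squares, computed as in the cells above
rss_demo <- sum(residuals(fit_demo)^2)

# For an lm fit, deviance() returns the same quantity
print(all.equal(rss_demo, deviance(fit_demo)))

# Residual standard error: sqrt(RSS / residual degrees of freedom)
s_demo <- sqrt(rss_demo / df.residual(fit_demo))
print(all.equal(s_demo, summary(fit_demo)$sigma))
```

Because the complete model uses more degrees of freedom, a smaller RSS does not automatically mean a smaller $s$.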

#### (f) Compare the values of residual standard error (s) and adjusted $R^2$ in the reduced and complete models.

In [None]:
second_summary <- summary(second)

second_adj_r_squared <- second_summary$adj.r.squared

second_residual_se <- second_summary$sigma

print(paste("Adjusted R-squared (Reduced Model):", second_adj_r_squared))
print(paste("Residual standard error (Reduced Model):", second_residual_se))

In [None]:
# `third` is the complete model fitted in part (d); run that cell first in the current session
summary(third)
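For the comparison asked for in part (f), it can help to collect the residual standard error and adjusted $R^2$ of both models into one small table. The following is a sketch only: it uses nested models on the built-in `mtcars` data as a stand-in (an assumption, not the MOVIES data), and `reduced_demo`, `complete_demo`, and `comparison` are hypothetical names.

```r
# Stand-in data: built-in mtcars, NOT the MOVIES file
reduced_demo <- lm(mpg ~ wt + hp, data = mtcars)
complete_demo <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

models <- list(reduced = reduced_demo, complete = complete_demo)

# One row per model: residual standard error (s) and adjusted R-squared
comparison <- data.frame(
  model = names(models),
  s = sapply(models, function(m) summary(m)$sigma),
  adj_r_squared = sapply(models, function(m) summary(m)$adj.r.squared),
  row.names = NULL
)
print(comparison)
```

When both models are fitted to the same response and the same observations, adjusted $R^2$ equals $1 - s^2 / s_y^2$, where $s_y^2$ is the sample variance of the response, so the model with the smaller $s$ is also the one with the larger adjusted $R^2$.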
