# Final
### Author: Sahand Motameni
### Date: 12/3/2024

## Guidelines

Select **Run All** to run every cell in the document, which loads the required packages with `import` statements.

Remember that continuing to develop a sound workflow for reproducible data analysis is important as you complete the lab and other assignments in this course.
This assignment includes periodic reminders to **Run all, commit, and sync** your changes to GitHub.
You should have at least 3 commits with meaningful commit messages by the end of the assignment.


Use the remainder of this `.ipynb` to answer the questions however you see fit. Show all code and outputs where applicable.

# Part 1 - Employment

A large university knows that about 70% of the full-time students are employed at least 5 hours per week.
The members of the Statistics Department wonder if the same proportion of their students work at least 5 hours per week.
They randomly sample 25 majors and find that 15 of the students (60%) work 5 or more hours each week.

## Question 1



To set up a simulation for estimating the proportion of statistics majors who work 5 or more hours weekly, we can model the sampling process based on the university-wide assumption that 70% of full-time students work at least 5 hours. Using a binomial distribution, we simulate 25 students repeatedly (e.g., 10,000 trials) with a 70% probability of working. For each trial, we calculate the proportion of students working at least 5 hours to create a distribution of sample proportions.

We then compare the observed proportion (60%) to this simulated distribution. This comparison helps assess whether the observed value aligns with what we’d expect if the true population proportion were 70%. If the observed proportion is within the likely range of outcomes, it suggests consistency with the assumption. Alternatively, a formal hypothesis test using the simulation results can determine whether there is evidence to reject the assumption.
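The simulation described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the notebook's actual code. The function name `simulate_null_counts`, the seed, and the two-sided comparison rule are assumptions made for the example.

```python
import random

N_STUDENTS = 25        # sample size from the problem statement
NULL_P = 0.70          # university-wide proportion assumed under the null
OBSERVED_COUNT = 15    # statistics majors in the sample who work 5+ hours

def simulate_null_counts(n=N_STUDENTS, p=NULL_P, trials=10_000, seed=42):
    """Simulate how many of n students work 5+ hours, repeated `trials` times."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    return [sum(rng.random() < p for _ in range(n)) for _ in range(trials)]

null_counts = simulate_null_counts()
null_props = [count / N_STUDENTS for count in null_counts]

# Two-sided comparison: how often is a simulated count at least as far
# from the expected count (17.5) as the observed count (15)?
expected_count = N_STUDENTS * NULL_P
observed_gap = abs(OBSERVED_COUNT - expected_count)
p_value = sum(
    abs(count - expected_count) >= observed_gap - 1e-9 for count in null_counts
) / len(null_counts)

print(f"simulated two-sided p-value: {p_value:.3f}")
```

A large simulated p-value would indicate that a sample proportion of 60% is not unusual when the true proportion is 70%.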

## Question 2



To approximate the bounds of the 95% confidence interval from the provided bootstrap distribution:

1. The 95% confidence interval corresponds to the range between the 2.5th percentile and the 97.5th percentile of the distribution.
2. Observing the histogram, the distribution appears roughly symmetric. Visually, the range of the bootstrap statistics seems concentrated between approximately 0.45 and 0.75.

The precise boundaries of the interval could be calculated numerically from the bootstrap results, but based on the graph, a reasonable approximation of the 95% confidence interval is:

[0.45, 0.75].
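The percentile method in step 1 can be sketched in code. The bootstrap distribution from the figure is not available here, so this example assumes it was built by resampling the Part 1 sample (15 of 25 students working). The function name `bootstrap_percentile_ci`, the seed, and the index-based percentile rule are all illustrative choices, and the resulting bounds need not match a visual reading of the histogram.

```python
import random

def bootstrap_percentile_ci(sample, reps=10_000, level=0.95, seed=0):
    """Percentile bootstrap interval for the sample proportion."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record each resample's proportion.
    stats = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(reps))
    alpha = 1 - level
    lower_idx = int((alpha / 2) * reps)           # about the 2.5th percentile
    upper_idx = int((1 - alpha / 2) * reps) - 1   # about the 97.5th percentile
    return stats[lower_idx], stats[upper_idx]

# Assumed sample: 1 = works 5+ hours, 0 = does not.
sample = [1] * 15 + [0] * 10
lower, upper = bootstrap_percentile_ci(sample)
print(f"approximate 95% interval: [{lower:.2f}, {upper:.2f}]")
```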

## Question 3


e: None of the above

# Part 2 - Blizzard

In 2020, employees of Blizzard Entertainment circulated a spreadsheet to anonymously share salaries and recent pay increases amidst rising tension in the video game industry over wage disparities and executive compensation. (Source: [Blizzard Workers Share Salaries in Revolt Over Pay](https://www.bloomberg.com/news/articles/2020-08-03/blizzard-workers-share-salaries-in-revolt-over-wage-disparities))

The name of the data frame used for this analysis is `blizzard_salary` and the variables are:

- `percent_incr`: Raise given in July 2020, as percent increase with values ranging from 1 (1% increase) to 21.5 (21.5% increase)

- `salary_type`: Type of salary, with levels `Hourly` and `Salaried`

- `annual_salary`: Annual salary, in USD, with values ranging from $50,939 to $216,856.

- `performance_rating`: Most recent review performance rating, with levels `Poor`, `Successful`, `High`, and `Top`. The `Poor` level is the lowest rating and the `Top` level is the highest rating.

The first ten rows and the `.info()` output of `blizzard_salary` are shown below:

```
   percent_incr salary_type  annual_salary performance_rating
0           1.0        year            1.0               High
1           1.0        year            1.0         Successful
2           1.0        year            1.0               High
3           1.0      Hourly        33987.2         Successful
4           NaN      Hourly        34798.4               High
5           NaN      Hourly        35360.0                NaN
6           NaN      Hourly        37440.0                NaN
7           0.0      Hourly        37814.4                NaN
8           4.0      Hourly        41100.8                Top
9           1.2      Hourly        42328.0                NaN


<class 'pandas.core.frame.DataFrame'>
Index: 409 entries, 0 to 465
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   percent_incr        370 non-null    float64
 1   salary_type         409 non-null    object 
 2   annual_salary       409 non-null    float64
 3   performance_rating  298 non-null    object 
dtypes: float64(2), object(2)
memory usage: 16.0+ KB
None
```

## Question 4

C: For every additional $1,000 of annual salary, the model predicts the raise to be higher, on average, by 0.016%.

## Question 5

C: Adjusted $R^2$ of `raise_2_fit` is higher than adjusted $R^2$ of `raise_1_fit`.

## Question 6

My teammate’s interpretation is incorrect because the coefficient for performance_rating_Successful represents a comparison to the reference category, which is likely High (the category not shown in the output). A negative coefficient means that individuals with a Successful rating receive, on average, raises that are 1.73 percentage points lower than those with a High rating, after accounting for annual salary. It doesn’t imply that Successful ratings lead to lower raises in an absolute sense, only that they are smaller compared to the reference group. To interpret the coefficients correctly, it’s essential to understand that they describe differences relative to the baseline group, not absolute values.
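To illustrate what "relative to the reference category" means, here is a small sketch with made-up numbers. It ignores the salary adjustment for simplicity: in a model with only an intercept and dummy variables, the intercept equals the reference group's mean and each coefficient equals that group's mean minus the reference mean. The values in `toy_raises`, the helper `dummy_coefficients`, and the choice of `High` as the reference are all assumptions for illustration, not values from `blizzard_salary`.

```python
from statistics import mean

# Made-up raise percentages for illustration only (not from blizzard_salary).
toy_raises = {
    "High": [4.0, 5.0, 6.0],
    "Successful": [3.0, 3.5, 2.5],
    "Top": [7.0, 8.0],
}

def dummy_coefficients(groups, reference):
    """Intercept and offsets for a dummy-coded model with one categorical predictor."""
    intercept = mean(groups[reference])  # mean of the reference group
    offsets = {
        level: mean(values) - intercept  # difference from the reference mean
        for level, values in groups.items()
        if level != reference
    }
    return intercept, offsets

intercept, offsets = dummy_coefficients(toy_raises, reference="High")
print(intercept, offsets)
```

In this toy data the `Successful` offset is negative even though every `Successful` raise is positive, which is the distinction between a relative difference and an absolute value.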

## Question 7

b: “Successful”, “High”, “Top”

## Question 8

c: Figure 3

## Question 9

A parsimonious model is a model that achieves the best balance between simplicity and explanatory power. It includes only the most relevant predictors necessary to describe the relationship between variables, avoiding overfitting by excluding unnecessary complexity. From a data science perspective, it’s about finding the sweet spot where the model performs well on both training and unseen data while remaining interpretable and computationally efficient.

## Question 10

c. The model predicts that the percentage increase employees with Successful performance get, on average, is higher by a factor of 6.502427 compared to the employees with Poor performance rating.

## Question 11

a & d

# Part 3 - Calculus

## Question 12

$$
g(x) = \left( \sin(x^2) + \cos(ax) \right)^k
$$

Using the chain rule:

$$
\frac{d}{dx} g(x) = k \left( \sin(x^2) + \cos(ax) \right)^{k-1} \cdot \frac{d}{dx} \left( \sin(x^2) + \cos(ax) \right)
$$

Derivative of the inner function:

$$
\frac{d}{dx} \left( \sin(x^2) + \cos(ax) \right) = \cos(x^2) \cdot 2x - \sin(ax) \cdot a
$$

Substitute:

$$
\frac{d}{dx} g(x) = k \left( \sin(x^2) + \cos(ax) \right)^{k-1} \left( 2x \cos(x^2) - a \sin(ax) \right)
$$
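As a sanity check, the derivative can be compared against a central finite difference at an arbitrary point. This is only a numerical spot check under assumed test values (`x0`, `a0`, `k0` are arbitrary), not a proof.

```python
import math

def g(x, a, k):
    """g(x) = (sin(x^2) + cos(a x))^k"""
    return (math.sin(x**2) + math.cos(a * x)) ** k

def g_prime(x, a, k):
    """Derivative of g obtained with the chain rule."""
    inner = math.sin(x**2) + math.cos(a * x)
    return k * inner ** (k - 1) * (2 * x * math.cos(x**2) - a * math.sin(a * x))

def central_difference(f, x, h=1e-6):
    """Numerical derivative estimate (f(x+h) - f(x-h)) / (2h)."""
    return (f(x + h) - f(x - h)) / (2 * h)

x0, a0, k0 = 0.5, 2.0, 3  # arbitrary test values
numeric = central_difference(lambda x: g(x, a0, k0), x0)
analytic = g_prime(x0, a0, k0)
print(abs(numeric - analytic))  # should be very small
```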

## Question 13

$$
\int_a^b \left( e^{cx} + \frac{1}{x^n} \right) dx
$$

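By linearity, split the integral into two parts (assuming $c \neq 0$ and $n \neq 1$, so that both antiderivatives below are valid):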

$$
\int_a^b e^{cx} dx + \int_a^b \frac{1}{x^n} dx
$$

   $$
   \int e^{cx} dx = \frac{1}{c} e^{cx} \quad \text{so} \quad \left[ \frac{1}{c} e^{cx} \right]_a^b = \frac{1}{c} e^{cb} - \frac{1}{c} e^{ca}
   $$

   $$
   \int \frac{1}{x^n} dx = \frac{x^{1-n}}{1-n} \quad \text{so} \quad \left[ \frac{x^{1-n}}{1-n} \right]_a^b = \frac{b^{1-n}}{1-n} - \frac{a^{1-n}}{1-n}
   $$

Thus:

$$
\frac{1}{c} \left( e^{cb} - e^{ca} \right) + \frac{b^{1-n} - a^{1-n}}{1-n}
$$
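The closed form can be spot-checked numerically. The sketch below compares it with a composite Simpson's rule estimate for one arbitrary choice of constants. The values `a0`, `b0`, `c0`, `n0` are assumptions, chosen with $0 < a < b$, $c \neq 0$, and $n \neq 1$.

```python
import math

def closed_form(a, b, c, n):
    """(1/c)(e^{cb} - e^{ca}) + (b^{1-n} - a^{1-n}) / (1 - n)"""
    return (math.exp(c * b) - math.exp(c * a)) / c + (b ** (1 - n) - a ** (1 - n)) / (1 - n)

def simpson(f, a, b, steps=1000):
    """Composite Simpson's rule; `steps` must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

a0, b0, c0, n0 = 1.0, 2.0, 0.5, 3  # arbitrary test values
numeric = simpson(lambda x: math.exp(c0 * x) + 1 / x**n0, a0, b0)
exact = closed_form(a0, b0, c0, n0)
print(abs(numeric - exact))  # should be very small
```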

# Part 4 - Linear Algebra

## Question 14

$$
x^\top = \begin{bmatrix} 
x_1 & x_2 & x_3 & x_4 
\end{bmatrix}
$$

## Question 15

$$
N^\top = \begin{bmatrix} 
n_{11} & n_{21} & n_{31} & n_{41} \\ 
n_{12} & n_{22} & n_{32} & n_{42}
\end{bmatrix}
$$

## Question 16

1. C has 3 rows and 2 columns, so its dimensions are 3×2.
2. D has 2 rows and 3 columns, so its dimensions are 2×3.

3.1. Yes, the product CD is defined because the number of columns of C (2) equals the number of rows of D (2).


3.2. The dimensions of CD will be 3×3.
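The dimension rule used above can be written as a small helper. `product_shape` is a hypothetical function introduced only for illustration.

```python
def product_shape(left_shape, right_shape):
    """Return the shape of the product of two matrices, or raise if undefined."""
    left_rows, left_cols = left_shape
    right_rows, right_cols = right_shape
    if left_cols != right_rows:
        raise ValueError(
            f"cannot multiply: {left_cols} columns vs {right_rows} rows"
        )
    # The product takes its rows from the left matrix and columns from the right.
    return (left_rows, right_cols)

print(product_shape((3, 2), (2, 3)))  # (3, 3), the shape of CD
```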

## Question 17

1. E is a 3×2 matrix because it has 3 rows and 2 columns.

2. F is a 2×1 matrix because it has 2 rows and 1 column.

3.1. Yes, the product EF is defined because the number of columns of E (2) equals the number of rows of F (2).


3.2. The product EF is a 3×1 matrix:
   $$
   EF = \begin{bmatrix} 
   e_{11}f_{11} + e_{12}f_{21} \\ 
   e_{21}f_{11} + e_{22}f_{21} \\ 
   e_{31}f_{11} + e_{32}f_{21} 
   \end{bmatrix}
   $$
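The entry-by-entry pattern above can be checked with concrete numbers. The matrices `E` and `F` below are arbitrary examples with the same shapes (3×2 and 2×1), and `matmul` is a plain-Python helper written for this sketch.

```python
def matmul(A, B):
    """Multiply matrices given as lists of rows."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("columns of A must equal rows of B")
    # Entry (i, j) is the sum over k of A[i][k] * B[k][j].
    return [
        [sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

E = [[1, 2], [3, 4], [5, 6]]  # 3x2, arbitrary values
F = [[10], [20]]              # 2x1, arbitrary values
EF = matmul(E, F)
print(EF)  # [[50], [110], [170]]
```

Each row of the result is $e_{i1}f_{11} + e_{i2}f_{21}$, for example $1 \cdot 10 + 2 \cdot 20 = 50$ for the first row.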

# Wrap-up

## Submission

Before you wrap up the final assignment, make sure all of your documents are updated on your GitHub repo.
We will be checking these to make sure you have been practicing how to commit and push changes.

You must turn in the `.ipynb` file by the submission deadline to be considered "on time".


## Checklist

Make sure you have:

-   attempted all questions
-   run all code in your Jupyter notebook
-   committed and pushed everything to your GitHub repository such that the Git pane in VS Code is empty


## Grading

The final is graded out of a total of 100 points.


1. **Question 1**: 5 points

2. **Question 2**: 5 points

3. **Question 3**: 5 points

4. **Question 4**: 5 points

5. **Question 5**: 5 points

6. **Question 6**: 5 points
   
7. **Question 7**: 5 points

8. **Question 8**: 5 points

9. **Question 9**: 5 points

10. **Question 10**: 5 points

11. **Question 11**: 5 points

12. **Question 12**: 10 points

13. **Question 13**: 10 points

14. **Question 14**: 5 points

15. **Question 15**: 5 points

16. **Question 16**: 7.5 points

17. **Question 17**: 7.5 points

Total: **100 points**
