# Assignment 3

As before, if a question can be answered with 'yes/no', or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using Python's `help()` function to get information about the datasets and functions in question. The internet is also a great resource when coding (though note that no outside searches are required by the assignment!).

Please bring questions that you cannot work out on your own to office hours or work periods, or share them with your peers on Slack. We will work through the issue with you.

In [109]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import warnings

### Question 1: Resampling via Bootstrapping

Now, we'll use the `iris` dataset, which we will load into Python using the `statsmodels` library. As always, start by reviewing the dataset by printing it.

In [56]:
# Import
iris = sm.datasets.get_rdataset('iris', 'datasets')
df_full = pd.DataFrame(iris.data)
df_full

| |Sepal.Length|Sepal.Width|Petal.Length|Petal.Width|Species|
|---|---|---|---|---|---|
|0|5.1|3.5|1.4|0.2|setosa|
|1|4.9|3.0|1.4|0.2|setosa|
|2|4.7|3.2|1.3|0.2|setosa|
|3|4.6|3.1|1.5|0.2|setosa|
|4|5.0|3.6|1.4|0.2|setosa|
|...|...|...|...|...|...|
|145|6.7|3.0|5.2|2.3|virginica|
|146|6.3|2.5|5.0|1.9|virginica|
|147|6.5|3.0|5.2|2.0|virginica|
|148|6.2|3.4|5.4|2.3|virginica|


In [57]:
df = pd.DataFrame({'X': df_full['Sepal.Width'], 'Y': df_full['Sepal.Length']})
df

| |X|Y|
|---|---|---|
|0|3.5|5.1|
|1|3.0|4.9|
|2|3.2|4.7|
|3|3.1|4.6|
|4|3.6|5.0|
|...|...|...|
|145|3.0|6.7|
|146|2.5|6.3|
|147|3.0|6.5|
|148|3.4|6.2|


_(i)_ Create an `alpha_func(D, idx)` function that takes the sepal width (`X`) and sepal length (`Y`) for the rows in `idx` and calculates alpha.

In [63]:
# Your code here

def alpha_func(D, idx):
   cov_ = np.cov(D[['X','Y']].loc[idx], rowvar=False)
   return ((cov_[1,1] - cov_[0,1]) /
           (cov_[0,0]+cov_[1,1]-2*cov_[0,1]))

Test the code below

In [64]:
alpha_func(df, range(100))

0.6189498510165619
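
The expression inside `alpha_func` is the weight that minimizes the variance of the combination `alpha*X + (1 - alpha)*Y`. A minimal sketch of that claim, assuming synthetic data rather than `iris` (the `toy` frame, the seed, and the `combo_var` helper are illustrative and not part of the assignment), compares the closed-form value with a brute-force grid search:

```python
import numpy as np
import pandas as pd

def alpha_func(D, idx):
    # Same formula as in the cell above: (var(Y) - cov(X, Y)) / (var(X) + var(Y) - 2 cov(X, Y))
    cov_ = np.cov(D[['X', 'Y']].loc[idx], rowvar=False)
    return (cov_[1, 1] - cov_[0, 1]) / (cov_[0, 0] + cov_[1, 1] - 2 * cov_[0, 1])

# Synthetic stand-in data (assumption): X has unit variance, Y has variance 4
rng = np.random.default_rng(0)
toy = pd.DataFrame({'X': rng.normal(size=200),
                    'Y': rng.normal(scale=2.0, size=200)})

alpha_hat = alpha_func(toy, toy.index)

def combo_var(a):
    # Sample variance of the weighted combination a*X + (1 - a)*Y
    return np.var(a * toy['X'] + (1 - a) * toy['Y'], ddof=1)

# Brute-force search over a grid of candidate weights in [0, 1]
grid = np.linspace(0, 1, 1001)
best = grid[np.argmin([combo_var(a) for a in grid])]

print(alpha_hat, best)
```

The two printed values should agree up to the grid spacing, because `np.cov` and `np.var(..., ddof=1)` use the same sample-variance convention.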

_(ii)_ Construct a new bootstrap data set and recompute alpha

In [65]:
rng = np.random.default_rng(5)
alpha_func(df,
           rng.choice(100,
                      100,
                      replace=True))

0.602762653525732
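
To see what "a new bootstrap data set" means in practice, it can help to inspect the indices that `rng.choice(..., replace=True)` returns. The sketch below is only an illustration (the variable names are ours): because rows are drawn with replacement, some rows appear several times and others not at all, and for large `n` roughly 63% of the rows are expected to appear at least once.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100

# One bootstrap resample: n row positions drawn with replacement from 0..n-1
idx = rng.choice(n, n, replace=True)

n_unique = np.unique(idx).size           # rows that appear at least once
expected_share = 1 - (1 - 1 / n) ** n    # probability that a given row is drawn at least once

print(f"{n_unique} of {n} rows appear at least once")
print(f"expected share of distinct rows: {expected_share:.3f}")
```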

Imagine we are analysts working for a shipping company. The company wants to know the average length of iris sepals, to inform space allotment on an upcoming shipment. The relevant variable in the dataset is `Sepal.Length`.

_(iii)_ Why is it (perhaps) not sufficient to simply calculate the mean of `Sepal.Length`? What additional information will performing a bootstrap provide?

The mean of a single sample is a point estimate specific to that sample; on its own it says nothing about how much the estimate would vary from one sample to the next. By generating many resamples, the bootstrap approximates the sampling distribution of the mean, which lets us compute the standard error and confidence intervals and so quantify the uncertainty in the estimate.

_(iv)_ We can perform bootstrapping in Python by defining a simple function, `boot_SE()`, for computing the bootstrap standard error. Remember, because bootstrapping involves randomness, we must first set a seed for reproducibility!

In [125]:
# Add your code here to set the seed


# Statistic of interest: mean of the rows of D selected by idx (an array of index labels)
def one_param_mean(D, idx):
    return np.mean(D.loc[idx])

def boot_SE(func,
            D,
            n=None,
            B=1000,
            seed=0):
    # Returns the B bootstrap replicates of func;
    # take np.std of the returned array to obtain the standard error
    rng = np.random.default_rng(seed)
    bootstr_values = []
    n = n or D.shape[0]
    for _ in range(B):
        idx = rng.choice(D.index,
                         n,
                         replace=True)
        bootstr_values.append(func(D, idx))      # statistic for this resample

    return np.array(bootstr_values)


sepal_df = df_full['Sepal.Length']                         # make array with our one column
bootstr_arr = boot_SE( one_param_mean, sepal_df, B=10000, seed=0 )
ste = np.std(bootstr_arr)
ste

0.06770609155591377
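
For the sample mean specifically, the bootstrap standard error can be sanity-checked against the textbook formula `s / sqrt(n)`. The sketch below assumes a synthetic sample in place of `Sepal.Length` (the location, scale, seed, and variable names are made up for illustration) and shows the two estimates landing close together:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a single observed sample (assumption, not the iris data)
sample = rng.normal(loc=5.8, scale=0.8, size=150)

# Bootstrap: resample positions with replacement and record the mean each time
B = 2000
boot_means = np.array([
    sample[rng.integers(0, sample.size, sample.size)].mean()
    for _ in range(B)
])

boot_se = boot_means.std(ddof=1)                           # spread of the bootstrap means
analytic_se = sample.std(ddof=1) / np.sqrt(sample.size)    # s / sqrt(n)

print(boot_se, analytic_se)
```

The bootstrap becomes more valuable for statistics such as alpha, where no simple closed-form standard error is available.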

_(v)_ Evaluate the accuracy of our alpha estimate with B = 1000

In [123]:
bootstr_arr = boot_SE( one_param_mean, sepal_df, B=1000, seed=0 )
ste = np.std(bootstr_arr)
ste

0.06649295592099295
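
Item _(v)_ asks about the alpha estimate, while the cell above reuses `one_param_mean`. As a hedged sketch of how the same resampling loop could be pointed at alpha instead, the example below uses synthetic data and a hypothetical helper named `boot_se_sketch` (both are assumptions for illustration, not the assignment's reference solution):

```python
import numpy as np
import pandas as pd

def alpha_func(D, idx):
    # Same alpha formula as in part (i)
    cov_ = np.cov(D[['X', 'Y']].loc[idx], rowvar=False)
    return (cov_[1, 1] - cov_[0, 1]) / (cov_[0, 0] + cov_[1, 1] - 2 * cov_[0, 1])

def boot_se_sketch(func, D, B=1000, seed=0):
    # Hypothetical helper: standard deviation of func over B bootstrap resamples
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    stats = np.array([func(D, rng.choice(D.index, n, replace=True))
                      for _ in range(B)])
    return stats.std()

# Synthetic two-column frame standing in for the X/Y data (assumption)
rng = np.random.default_rng(0)
toy = pd.DataFrame({'X': rng.normal(size=150),
                    'Y': rng.normal(scale=2.0, size=150)})

alpha_se = boot_se_sketch(alpha_func, toy, B=1000, seed=0)
print(alpha_se)
```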

_(vi)_ What is the original mean value of `Sepal.Length`?

In [91]:
# Your code here
np.mean ( sepal_df )

5.843333333333334

Next, let's draw a new set of bootstrap samples (`boot_se_samples`) of `Sepal.Length`, in order to compute its bootstrapped mean and standard deviation.

_(vii)_ Write code to review the bootstrapped mean value and the standard deviation of the bootstrapped samples. Compare the mean against its original value. Then, review the bootstrapped range by using `t_range = np.ptp(boot_se_samples)`.

In [132]:
# Add your code here
bootstr_arr_new = boot_SE( one_param_mean, sepal_df, B=1000, seed=5 )
boot_mean = np.mean(bootstr_arr_new)      # mean of the bootstrapped means
t_range = np.ptp(bootstr_arr_new)         # range (max - min) of the bootstrapped means

boot_mean, t_range

(5.843912666666667, 0.3680000000000003)

_(viii)_ Next, let's compute a 95% confidence interval for the mean value of iris sepal length. (Hint: use the `np.percentile` function.)

In [135]:
# Add your code here
conf_inter = np.percentile(bootstr_arr_new, [2.5,97.5])
conf_inter

array([5.71731667, 5.96735   ])
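
The percentile interval simply reads off the 2.5th and 97.5th percentiles of the bootstrapped means. When the bootstrap distribution is roughly symmetric and bell-shaped, it should sit close to the normal-approximation interval `mean ± 1.96 * SE`. A small sketch on synthetic data (the sample, seed, and variable names are assumptions for illustration) puts the two side by side:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in sample (assumption, not the iris data)
sample = rng.normal(loc=5.8, scale=0.8, size=150)

# Bootstrapped means from 2000 resamples drawn with replacement
boot_means = np.array([
    rng.choice(sample, sample.size, replace=True).mean()
    for _ in range(2000)
])

# Percentile interval: middle 95% of the bootstrap distribution
pct_lo, pct_hi = np.percentile(boot_means, [2.5, 97.5])

# Normal-approximation interval built from the bootstrap standard error
center = sample.mean()
se = boot_means.std(ddof=1)
norm_lo, norm_hi = center - 1.96 * se, center + 1.96 * se

print((pct_lo, pct_hi), (norm_lo, norm_hi))
```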

_(ix)_ Use the plot function to create a histogram of the bootstrapped samples. What does this histogram show?

In [None]:
#Complete this

# Create a figure and axis
fig, ax = plt.subplots()

# Create the histogram
#Add your code here

# Add a title
#Add your code here

# Add a label to the x-axis
#Add your code here

# Add a label to the y-axis
#Add your code here

# Show the plot
plt.show()
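
If you are unsure which Matplotlib calls fit the placeholders, the following is one possible sketch, assuming synthetic bootstrapped means and a hypothetical helper named `plot_boot_hist` (the bin count, labels, and data are illustrative choices, not required by the assignment):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_boot_hist(values, bins=30):
    # Hypothetical helper: histogram of bootstrap replicates with title and axis labels
    fig, ax = plt.subplots()
    ax.hist(values, bins=bins, edgecolor="white")
    ax.axvline(np.mean(values), color="black", linestyle="--",
               label="bootstrap mean")
    ax.set_title("Bootstrap distribution of the sample mean")
    ax.set_xlabel("Bootstrapped mean")
    ax.set_ylabel("Frequency")
    ax.legend()
    return fig, ax

# Synthetic bootstrapped means standing in for the real ones (assumption)
rng = np.random.default_rng(2)
sample = rng.normal(5.8, 0.8, size=150)
boot_means = np.array([
    rng.choice(sample, sample.size, replace=True).mean()
    for _ in range(1000)
])

fig, ax = plot_boot_hist(boot_means)
```

In a notebook, follow the last line with `plt.show()` to render the figure. The width of the histogram gives a visual sense of the standard error.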

_(x)_ Given your bootstrapped analysis, what do you recommend to the shipping company?

In [None]:
# Write your answer here

# Criteria

|Criteria            |Complete           |Incomplete          |
|--------------------|---------------|--------------|
|Bootstrapping|All steps are done correctly and the answers are correct.|At least one step is done incorrectly leading to a wrong answer.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Note:

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate whom you have worked with in your pull request by tagging their GitHub usernames. Separate submissions are required.


### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-3`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_3.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applying_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
