# Assignment 3

As before, if a question can be answered with 'yes/no' or a numeric value, you may simply state as much. If you incorporate code from the internet (which is not required and generally not advisable), please cite the source within your code (providing a URL is sufficient).

We will go through comparable code and concepts in the live learning session. If you run into trouble, start by using Python's `help()` function to get information about the datasets and functions in question. The internet is also a great resource when coding, though no outside searches are required for this assignment.

Please bring questions that you cannot work out on your own to office hours or work periods, or share them with your peers on Slack. We will work through the issue with you.

In [27]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

ModuleNotFoundError: No module named 'numpy'

### Question 1: Resampling via Bootstrapping

Now, we'll use the `iris` dataset, which we will load into Python using the `statsmodels` library. As always, start by reviewing a description of the dataset and printing the dataset.

In [28]:
# Import
iris = sm.datasets.get_rdataset('iris', 'datasets')
df = pd.DataFrame(iris.data)

NameError: name 'sm' is not defined
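Reviewing a dataset usually means printing its first rows, its shape, and its summary statistics. The sketch below uses a small stand-in frame with synthetic values and two of the `iris` column names (an assumption, so that it runs without downloading anything). With the real objects, `print(iris.__doc__)` shows the dataset description, and the same calls apply to `df`.

```python
import numpy as np
import pandas as pd

# Stand-in data: synthetic values, NOT the real iris measurements.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Sepal.Length": rng.normal(5.8, 0.8, size=10).round(1),
    "Sepal.Width": rng.normal(3.0, 0.4, size=10).round(1),
})

print(demo.head())      # first five rows
print(demo.shape)       # (number of rows, number of columns)

# Count, mean, std, min, quartiles, and max for each numeric column
summary = demo.describe()
print(summary)
```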

_(i)_ Create an `alpha_func(D, idx)` function that uses the `Sepal.Length` and `Sepal.Width` values of the rows of `D` selected by `idx` to calculate alpha

In [26]:
# Your code here
def alpha_func(D, idx):
    sepal_length = D.iloc[idx]['Sepal.Length']
    sepal_width = D.iloc[idx]['Sepal.Width']
    alpha = (sepal_length + sepal_width).mean()
    return alpha

Test the function with the code below

In [None]:
alpha_func(df, range(100))

_(ii)_ Construct a new bootstrap data set and recompute alpha

In [None]:
rng = np.random.default_rng(0)
alpha_func(df, rng.choice(100, 100, replace=True))
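A bootstrap data set is built by drawing row positions with replacement, so some rows appear more than once and others are left out. The sketch below (variable names are illustrative and not part of the assignment) counts how many distinct rows one resample contains. For large `n` the expected share approaches 1 - 1/e, roughly 63%.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Draw n row positions from 0..n-1 with replacement.
idx = rng.choice(n, size=n, replace=True)

# Share of distinct rows that made it into the resample.
unique_fraction = np.unique(idx).size / n

print(f"Distinct rows in the resample: {unique_fraction:.3f}")
print(f"Large-n limit, 1 - 1/e:        {1 - np.exp(-1):.3f}")
```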

Imagine we are analysts working for a shipping company. The company wants to know the average sepal length of iris flowers, to inform space allotment on an upcoming shipment. The relevant variable in the dataset is `Sepal.Length`.

_(iii)_ Why is it (perhaps) not sufficient to simply calculate the mean of `Sepal.Length`? What additional information will performing a bootstrap provide to us?

_(iv)_ We can perform bootstrapping in Python by defining a simple function, `boot_SE()`, that computes the bootstrap standard error. Remember, because bootstrapping involves randomness, we must first set a seed for reproducibility!

In [None]:
# Set the seed for reproducibility
np.random.seed(0)

# Function to compute bootstrap standard error
def boot_SE(data, num_bootstrap=1000):
    boot_means = []
    for _ in range(num_bootstrap):
        boot_sample = data.sample(frac=1, replace=True)
        boot_mean = boot_sample['Sepal.Length'].mean()
        boot_means.append(boot_mean)
    return np.std(boot_means)

# Compute bootstrap standard error for Sepal.Length
se_bootstrap = boot_SE(df)
print(f'Bootstrap standard error of Sepal.Length: {se_bootstrap}')
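For a sample mean, the bootstrap standard error can be checked against the textbook formula s / sqrt(n). The sketch below is a minimal illustration under stated assumptions: `boot_se_mean` is a hypothetical helper (not the assignment's `boot_SE`), and the data are synthetic stand-in values rather than the real `Sepal.Length` column. Passing a fixed seed makes the result repeatable.

```python
import numpy as np

def boot_se_mean(x, num_bootstrap=2000, seed=0):
    """Bootstrap standard error of the mean of a 1-D array (illustrative helper)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    means = np.empty(num_bootstrap)
    for b in range(num_bootstrap):
        # Resample n positions with replacement and record the mean.
        means[b] = x[rng.integers(0, n, size=n)].mean()
    return means.std(ddof=1)

# Synthetic stand-in values, NOT the real Sepal.Length column.
x = np.random.default_rng(42).normal(loc=5.8, scale=0.8, size=150)

se_boot = boot_se_mean(x)
se_formula = x.std(ddof=1) / np.sqrt(len(x))

print(f"Bootstrap SE: {se_boot:.4f}")
print(f"Formula SE:   {se_formula:.4f}")
```

The two values should agree closely for the mean. The bootstrap becomes more useful for statistics that have no simple standard error formula.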


_(v)_ Evaluate the accuracy of our alpha estimate with B = 1000

In [29]:
# Your code here
# Number of bootstrap samples
B = 1000

# Array to store bootstrap alpha estimates
bootstrap_alphas = []

# Perform bootstrap resampling
for _ in range(B):
    bootstrap_idx = np.random.choice(df.index, size=len(df), replace=True)
    alpha_value = alpha_func(df, bootstrap_idx)
    bootstrap_alphas.append(alpha_value)

# Calculate the mean and standard error of the bootstrap alpha estimates
bootstrap_alphas = np.array(bootstrap_alphas)
alpha_mean = bootstrap_alphas.mean()
alpha_se = bootstrap_alphas.std()

print(f'Mean of bootstrap alpha estimates: {alpha_mean}')
print(f'Standard error of bootstrap alpha estimates: {alpha_se}')

NameError: name 'np' is not defined
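The resampling loop can also be written without an explicit Python loop by drawing all index sets at once. This is a sketch on synthetic stand-in columns (not the real iris data), and it assumes alpha is the mean of `Sepal.Length + Sepal.Width` over the selected rows, as in `alpha_func` above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in columns, NOT the real iris data.
n, B = 150, 1000
length = rng.normal(5.8, 0.8, size=n)
width = rng.normal(3.0, 0.4, size=n)
total = length + width            # per-row quantity that alpha averages

# One row of positions per bootstrap replicate: shape (B, n).
idx = rng.integers(0, n, size=(B, n))

# Select the resampled rows, then average within each replicate.
boot_alphas = total[idx].mean(axis=1)

alpha_hat = total.mean()               # estimate on the original sample
alpha_se = boot_alphas.std(ddof=1)     # bootstrap standard error

print(f"alpha estimate: {alpha_hat:.4f}")
print(f"bootstrap SE:   {alpha_se:.4f}")
```

The index matrix holds `B * n` integers, so for very large data sets the loop version may be the more memory-friendly choice.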

_(vi)_ What is the original mean value of `Sepal.Length`?

Next, let's look _inside_ our bootstrapping to understand the new, bootstrapped sample we have created. Let's review the bootstrapped range using `t_range = np.ptp(boot_se_samples)`.
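`np.ptp` ("peak to peak") returns the maximum minus the minimum of an array. A tiny check on made-up numbers (the values below are illustrative, not bootstrap output):

```python
import numpy as np

# Made-up values standing in for a handful of bootstrap means.
values = np.array([5.71, 5.93, 5.80, 5.88])

spread = np.ptp(values)

# ptp is the same as max minus min.
same_as_max_minus_min = np.isclose(spread, values.max() - values.min())
print(spread, same_as_max_minus_min)
```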

_(vii)_ Write code to review the bootstrapped mean value, and the standard deviation of the bootstrapped samples. Compare the mean against its original value.

In [30]:
# Add your code here
# Array to store bootstrap mean values of Sepal.Length
boot_se_samples = []
# Bootstrap resampling
for _ in range(B):
    boot_sample = df.sample(frac=1, replace=True)
    boot_sample_mean = boot_sample['Sepal.Length'].mean()
    boot_se_samples.append(boot_sample_mean)

# Convert to numpy array for further calculations
boot_se_samples = np.array(boot_se_samples)

# Calculate the range of the bootstrapped samples
t_range = np.ptp(boot_se_samples)
print(f'Bootstrapped range (ptp) of Sepal.Length: {t_range}')

# Calculate the mean and standard deviation of the bootstrapped samples
boot_mean = boot_se_samples.mean()
boot_std = boot_se_samples.std()

print(f'Bootstrapped mean value of Sepal.Length: {boot_mean}')
print(f'Standard deviation of bootstrapped samples: {boot_std}')

# Compare the bootstrapped mean against the original mean value
original_mean_sepal_length = df['Sepal.Length'].mean()
print(f'Original mean value of Sepal.Length: {original_mean_sepal_length}')
print(f'Difference between bootstrapped mean and original mean: {boot_mean - original_mean_sepal_length}')


NameError: name 'df' is not defined

_(viii)_ Next, let's compute a 95% confidence interval for the mean value of iris sepal length. (Hint: use the `np.percentile` function)

In [31]:
# Add your code here
conf_interval = np.percentile(boot_se_samples, [2.5, 97.5])
print(f'95% Confidence Interval for the mean Sepal.Length: {conf_interval}')

NameError: name 'np' is not defined
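The percentile interval takes the 2.5th and 97.5th percentiles of the bootstrapped means directly. A common alternative is the normal-approximation interval, estimate ± 1.96 × bootstrap SE. The sketch below compares the two on synthetic stand-in data (not the real `Sepal.Length` column); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in sample, NOT the real Sepal.Length column.
x = rng.normal(loc=5.8, scale=0.8, size=150)

# Bootstrap the mean: B replicates, each of size len(x).
B = 2000
idx = rng.integers(0, len(x), size=(B, len(x)))
boot_means = x[idx].mean(axis=1)

# Percentile interval: read the bounds off the bootstrap distribution.
lo_pct, hi_pct = np.percentile(boot_means, [2.5, 97.5])

# Normal-approximation interval: estimate +/- 1.96 * bootstrap SE.
se = boot_means.std(ddof=1)
lo_norm, hi_norm = x.mean() - 1.96 * se, x.mean() + 1.96 * se

print(f"Percentile interval: ({lo_pct:.3f}, {hi_pct:.3f})")
print(f"Normal interval:     ({lo_norm:.3f}, {hi_norm:.3f})")
```

When the bootstrap distribution is roughly symmetric and bell-shaped, as is typical for a mean, the two intervals nearly coincide. A noticeable gap between them suggests skewness in the bootstrap distribution.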

_(ix)_ Use the plot function to create a histogram of the bootstrapped samples. What does this histogram show?

In [None]:
import matplotlib.pyplot as plt

# Create a figure and axis
fig, ax = plt.subplots()

# Create the histogram
ax.hist(boot_se_samples, bins=30, edgecolor='black')

# Add a title
ax.set_title('Histogram of Bootstrapped Sepal.Length Means')

# Add a label to the x-axis
ax.set_xlabel('Bootstrapped Sepal.Length Mean')

# Add a label to the y-axis
ax.set_ylabel('Frequency')

# Show the plot
plt.show()

_(x)_ Given your bootstrapped analysis, what do you recommend to the shipping company?

In [None]:
# Write your answer here
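# Plan the space allotment around the average Sepal.Length, but do not rely on the point estimate alone.
# Bootstrapping shows how much the estimated mean varies from resample to resample, so use the
# standard error and the 95% confidence interval to understand the plausible range.
# Plan for some extra space to accommodate that variation and manage risk.
# Overall, the bootstrap analysis supports the reliability of the mean Sepal.Length estimate.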

# Criteria

|Criteria            |Complete           |Incomplete          |
|--------------------|---------------|--------------|
|Bootstrapping|All steps are done correctly and the answers are correct.|At least one step is done incorrectly leading to a wrong answer.|

## Submission Information

🚨 **Please review our [Assignment Submission Guide](https://github.com/UofT-DSI/onboarding/blob/main/onboarding_documents/submissions.md)** 🚨 for detailed instructions on how to format, branch, and submit your work. Following these guidelines is crucial for your submissions to be evaluated correctly.

### Note:

If you like, you may collaborate with others in the cohort. If you choose to do so, please indicate with whom you have worked in your pull request by tagging their GitHub username. Separate submissions are required.


### Submission Parameters:
* Submission Due Date: `HH:MM AM/PM - DD/MM/YYYY`
* The branch name for your repo should be: `assignment-3`
* What to submit for this assignment:
    * This Jupyter Notebook (assignment_3.ipynb) should be populated and should be the only change in your pull request.
* What the pull request link should look like for this assignment: `https://github.com/<your_github_username>/applied_statistical_concepts/pull/<pr_id>`
    * Open a private window in your browser. Copy and paste the link to your pull request into the address bar. Make sure you can see your pull request properly. This helps the technical facilitator and learning support staff review your submission easily.

Checklist:
- [ ] Created a branch with the correct naming convention.
- [ ] Ensured that the repository is public.
- [ ] Reviewed the PR description guidelines and adhered to them.
- [ ] Verified that the link is accessible in a private browser window.

If you encounter any difficulties or have questions, please don't hesitate to reach out to our team via our Slack at `#cohort-3-help`. Our Technical Facilitators and Learning Support staff are here to help you navigate any challenges.
