## Question 1:
A government studied the output of farms in their country over the course of a year in an attempt to understand the impact of new policies. They want to estimate the average change in yield from Jan. 1, 2015 to Dec. 31, 2015. A simple random sample of 70 farms (out of more than 700 total) was taken, and the percent change in total yield was measured over that period. A 95% confidence interval based on this sample is (-2%, 4.5%), which is based on the normal model for the mean.

### Part A: 

#### a)
This confidence interval is not valid as we do not know the population distribution of the change in yield of all farms in the country.
>Disagree. We do not need to know the population parameters to be able to create a confidence interval from our sample. 

#### b)
We are 95% confident that the average change in yield for these 70 farms is between -2% and 4.5%. (Be sure to include a precise description of what 95% confident means in this context.)
>Wrong. The sample mean (average change in yield) is *exactly* between -2% and 4.5%. 95% confidence refers to reptetition of the sampling process and how many of those sample confidence intervals would include the population mean.

#### c)
We are 95% confident that the average change in yield for all farms in the country is between -2% and 4.5%. (Be sure to include a precise description of what 95% confident means in this context.)
>Yes, to be more precise, 95% confident here means that of 95% of samples we take, the population mean will lie within their confidence intervals. There is, however, still a chance that the population mean does not lie within this specific samples' confidence interval.

#### d)
95% of the samples have a sample mean between -2% and 4.5%.
>Wrong. 95% of the samples have the population mean within their confidence intervals.

#### e)
A 99% confidence interval would be narrower than the 95% confidence interval since we need to be more sure of our estimate.
>A 99% confidence interval is wider in term of upper and lower bound, because this increases the possibility of the true population mean falling within that range.

#### f)
In order to decrease the margin of error of a 95% confidence interval to half of what it is now we would need to double the sample size.
>The margin of error in our confidence interval is 5% here. Doubling the sample size will decrease the Standard Error and make the confidence interval narrower, but not by as much as half.


### Part B:

#### a)
Compute the sample mean and margin of error.
<center>Sample Mean:</center>
<br></br>
\begin{align}
\frac{(-2\%)-4.5\%}{2}= -3.25
\end{align}

\begin{align}
\text{4.5%+(-3.25%) = 1.25%}
\end{align}
<br></br>

<center>Margin of Error:</center>
<br></br>
\begin{align}
\text{Confidence Interval =}\sigma\pm\text{Zscore}\times\text{SE}
\end{align}
<br></br>
\begin{align}
\text{4.5% = 1.25% + 1.96}\times\text{SE}
\end{align}
<br></br>
\begin{align}
\text{SE = 1.66%}
\end{align}


#### b)
Describe some scenarios in which it might only be possible to gather data from 70 farms, rather than every farm, such that we need to use statistical inference.
>Given a limitation on time and/or resources our researchers might only be able to gather data from some of the farms.
_______________
>Not all farms might have data on their yield changes over time readily available and thus we can only gather data from farms that have doucmentation on their yield.


### Part C:

Suppose a more comprehensive study shows the average yield for all farms in the country during 2015 turns out to be -4% with a standard deviation of 2.5%, and the data look nearly normal. Give both a qualitative and quantitative argument about what is likely to happen to the yield on the 70 farms from the first study during 2016. Be sure to discuss the merits of any assumptions you make.
>This would suggest that the farms in our sample are somewhat outliers when compared to the entire population, as their yield is on average much better than that of other farms. Given regression to the mean, it is much more likely that these farms will fall closer to the mean when tested again in the following year, thus the sample's average yield levels are very likely to lie closer towards -4%. This hypothesis assumes, that there is a good portion of luck involved in the farming process, such as change of weather, parasites, etc.


## Question 2:
Create a simulation that approximates the *standard error* for the sampling distribution of some distribution for a fixed sample size.
By drawing further samples construct a confidence interval around each sample mean, then keep track of the number of times the population mean falls within the confidence intervals. Try this for several different confidence levels.

Write a paragraph describing how this simulation illustrates the correct interpretation of confidence intervals.
>Our simulation shows that, if we repeat the sampling procedure many times, the true population mean will lie within the confidence interval x amounts times, where x is the percentage we chose for our confidence interval. The more often we repeat this process, the more exactly the simulation will approximate the confidence interval.


In [7]:
import numpy as np
import random
import scipy.stats as st

#sample size is fixed at 1.000 and population size is 10.000

def create_unif_pop(n):
    distrib = 1000 * np.random.random_sample((n, ))
    return distrib

def sample_procedure(dist):
    sample_size = 1000
        
    sample = np.random.choice(dist, (sample_size, ))
    return sample

def standard_error(sample):
    sample_mean = np.mean(sample)
    stdev = np.std(sample, ddof=1)
    standard_error = stdev / np.sqrt(1000)
    return standard_error

def confidence_interval(sample, conf_percent, dist):
    se = standard_error(sample)
    ci_top = np.mean(sample) + (st.norm.ppf(conf_percent)*se)
    ci_bottom = np.mean(sample) - (st.norm.ppf(conf_percent)*se)
    
    pop_mean = np.mean(dist)
    
    if ci_bottom <= pop_mean and pop_mean <= ci_top:
        return True 
    return False

#main loop
population_dist = create_unif_pop(10000)
in_interval = 0
repetitions = float(raw_input("How many tests would you like to run? "))
conf_percentage = float(raw_input("Put in your desired confidence percentile as a decimal. "))

for _ in range (int(repetitions)):
    sample = sample_procedure(population_dist)
    if confidence_interval(sample, conf_percentage, population_dist):
        in_interval += 1.0

print "Your true population mean lies within the confidence interval approximately " + str((in_interval / repetitions)*100) + "% of the time." 

How many tests would you like to run? 1000
Put in your desired confidence percentile as a decimal. 0.99
Your true population mean lies within the confidence interval approximately 98.2% of the time.
