1\. **Hurricanes per Year**

The number of hurricanes in 2005 was 15. The historic average is 6.3. Is this number significantly different?
- Assume the number of hurricanes is random, i.e. follows the Poisson distribution.
- Consider a probability statistically significant if it corresponds to a Z score of 3 or larger with respect to a normal distribution.

**Hint**: compute the probability of observing 15 or more hurricanes in a single year.

In [None]:
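import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy as sp
import scipy.stats     # makes sp.stats available on every SciPy version
import scipy.optimize  # makes sp.optimize available on every SciPy version
import pickle
import math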

In [None]:
x0 = 15
mu = 6.3
# P(X >= 15) = P(X > 14): sf(k) returns P(X > k), so it is evaluated at x0 - 1
p = sp.stats.poisson.sf(x0 - 1, mu)
print("The probability of having 15 or more hurricanes is: ", p)
# Z score whose one-sided normal tail probability equals p
Z = sp.stats.norm.isf(p)
print("Its Z score with respect to a normal distribution is: ", Z)
print("Significant (Z >= 3):", Z >= 3)
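
As a cross-check, the tail probability can also be summed directly from the Poisson probability mass function, using $P(X \ge 15) = 1 - \sum_{k=0}^{14} e^{-\mu}\mu^k/k!$. The sketch below relies only on the standard library; the function name `poisson_tail` is illustrative and not part of the exercise.

```python
import math

def poisson_tail(k_min, mu):
    """Return P(X >= k_min) for X ~ Poisson(mu) by direct summation of the pmf."""
    below = sum(math.exp(-mu) * mu**k / math.factorial(k) for k in range(k_min))
    return 1.0 - below

# 15 or more hurricanes in a year with a historic average of 6.3
p_tail = poisson_tail(15, 6.3)
print("P(X >= 15) =", p_tail)
```

The result should agree with `sp.stats.poisson.sf(14, 6.3)` up to floating-point rounding.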

2\. **Pairwise t-test**

In an experiment, a group of 10 individuals agreed to participate in a study of blood pressure changes following exposure to halogen lighting. Resting systolic blood pressure was recorded for each individual. The participants were then exposed to 20 minutes in a room lit only by halogen lamps. A post-exposure systolic blood pressure reading was recorded for each individual. The results are presented in the following data set:

```python
pre = np.array([120, 132, 120, 110, 115, 128, 120, 112, 110, 100])
post = np.array([140, 156, 145, 130, 117, 148, 137, 119, 127, 135])
```

Determine whether the change in blood pressures within our sample was statistically significant.

**Hint:**
in this case, a Student's $t$-test should be performed to compare the two datasets.
Use the following test statistic:

$$T = \frac{\bar{x}_1 - \bar{x}_2}{\sigma \sqrt{\frac{2}{n}}}$$

and 

$$\sigma = \sqrt{\frac{\sigma_1^2 + \sigma_2^2}{2}}$$

In [None]:
pre = np.array([120, 132, 120, 110, 115, 128, 120, 112, 110, 100])
post = np.array([140, 156, 145, 130, 117, 148, 137, 119, 127, 135])

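# sample variances (ddof=1) of the two datasets
var1 = np.var(pre, ddof=1)
var2 = np.var(post, ddof=1)
n = len(pre)

mean1 = np.mean(pre)
mean2 = np.mean(post)

# combined standard deviation, as defined in the hint
sigma = np.sqrt((var1 + var2) / 2)

T = abs(mean1 - mean2) / (sigma * np.sqrt(2 / n))

print("T =", T)

# two-sided p-value from the Student's t distribution with n - 1 degrees of freedom
p = 2 * sp.stats.t.sf(T, n - 1)

print("pvalue =", p)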
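
The statistic in the hint treats the two samples as if they were independent. Since each individual is measured twice, a paired test on the per-subject differences is a common alternative. The sketch below is a swapped-in technique for comparison, not the method requested by the hint: it computes the paired statistic by hand and checks it against `scipy.stats.ttest_rel`.

```python
import numpy as np
from scipy import stats

pre = np.array([120, 132, 120, 110, 115, 128, 120, 112, 110, 100])
post = np.array([140, 156, 145, 130, 117, 148, 137, 119, 127, 135])

# per-subject change in blood pressure
diff = post - pre
n = len(diff)

# paired t statistic: mean difference over its standard error
t_manual = diff.mean() / (diff.std(ddof=1) / np.sqrt(n))
p_manual = 2 * stats.t.sf(abs(t_manual), n - 1)

# the same test as provided by SciPy
result = stats.ttest_rel(post, pre)
t_scipy, p_scipy = result.statistic, result.pvalue

print("manual: T =", t_manual, "p =", p_manual)
print("scipy:  T =", t_scipy, "p =", p_scipy)
```

The paired version uses the spread of the differences rather than the spread of each sample, so it is usually more sensitive when subjects differ a lot from one another.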

3\. **Curve fitting of temperature in Alaska** 

The temperature extremes in Alaska for each month, starting in January, are given by (in degrees Celsius):

max:  `17,  19,  21,  28,  33,  38, 37,  37,  31,  23,  19,  18`

min: `-62, -59, -56, -46, -32, -18, -9, -13, -25, -46, -52, -58`

* Plot these temperatures.
* Find a suitable function that can describe min and max temperatures.
* Fit this function to the data with `scipy.optimize.curve_fit()`.
* Plot the result. Is the fit reasonable? If not, why?
* Is the time offset for min and max temperatures the same within the fit accuracy?

In [None]:
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
tmin = [-62, -59, -56, -46, -32, -18, -9, -13, -25, -46, -52, -58]
tmax = [17, 19, 21, 28, 33, 38, 37, 37, 31, 23, 19, 18]
fig = plt.figure(figsize=(13, 8))
plt.plot(months, tmin, 'bo', label='Tmin')
plt.plot(months, tmax, 'ro', label='Tmax')

plt.xlabel('Month')
plt.ylabel('Temperature [°C]')
plt.title("Temperature in Alaska")
# the data look periodic over the year, so a sine with a 12-month period is a natural choice

def approx(t, a, b, c):
    # a: amplitude, b: time offset in months, c: mean level
    # the period is fixed to 12 months; it must not depend on the input array
    return a * np.sin(2 * np.pi * (t + b) / 12) + c

t = np.arange(12)
popt1, pcov1 = sp.optimize.curve_fit(approx, t, tmin)
popt2, pcov2 = sp.optimize.curve_fit(approx, t, tmax)
# b is only defined modulo 12 months, and it shifts by 6 if the sign of a flips
print("Tmin (a, b, c):", popt1, "+-", np.sqrt(np.diag(pcov1)))
print("Tmax (a, b, c):", popt2, "+-", np.sqrt(np.diag(pcov2)))
plt.plot(months, approx(t, *popt1), label='approx Tmin', color='b')
plt.plot(months, approx(t, *popt2), label='approx Tmax', color='r')
plt.legend()
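
To answer the last question, the two fitted offsets can be compared in units of their combined uncertainty. The following is a minimal sketch, assuming a sine with a fixed 12-month period; the function name `seasonal` and the starting values `p0_min` and `p0_max` are rough guesses read off the data, chosen so that both fits converge to a positive amplitude and the offsets are directly comparable.

```python
import numpy as np
from scipy import optimize

tmin = np.array([-62, -59, -56, -46, -32, -18, -9, -13, -25, -46, -52, -58], dtype=float)
tmax = np.array([17, 19, 21, 28, 33, 38, 37, 37, 31, 23, 19, 18], dtype=float)
t = np.arange(12)

def seasonal(t, a, b, c):
    # a: amplitude, b: time offset in months, c: mean level
    return a * np.sin(2 * np.pi * (t + b) / 12) + c

# starting values (assumption): amplitude, offset and mean estimated by eye
p0_min = [25.0, -3.0, -40.0]
p0_max = [10.0, -3.0, 27.0]
popt_min, pcov_min = optimize.curve_fit(seasonal, t, tmin, p0=p0_min)
popt_max, pcov_max = optimize.curve_fit(seasonal, t, tmax, p0=p0_max)

# offsets and their 1-sigma uncertainties from the covariance matrices
b_min, b_max = popt_min[1], popt_max[1]
err_min, err_max = np.sqrt(pcov_min[1, 1]), np.sqrt(pcov_max[1, 1])

# difference of the offsets in units of the combined uncertainty
pull = (b_min - b_max) / np.hypot(err_min, err_max)
print("offset Tmin: %.2f +- %.2f months" % (b_min, err_min))
print("offset Tmax: %.2f +- %.2f months" % (b_max, err_max))
print("difference in sigma: %.2f" % pull)
```

A difference of a few sigma or more would suggest that the minimum and maximum temperatures do not peak at the same time of year.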

4\. **Fit the residuals**

Read the `data/residuals_261.pkl` file. If you haven't got it already, download it from here:

```bash
wget https://www.dropbox.com/s/3uqleyc3wyz52tr/residuals_261.pkl -P data/
```

The feature named `residuals` contains the residuals (defined as $y_i - \hat{y}_i$) of a linear regression as a function of the independent variable `distances`.

- Considering only the `residuals` feature, create a histogram with the appropriate binning and plot it.
- Set the appropriate Poisson uncertainty for each bin (thus, for each bin, $\sigma_i = \sqrt{n_i}$, where $n_i$ is the number of entries in each bin).
- By looking at the distribution of the residuals, define an appropriate function and fit it to the histogram of the residuals
- Perform a goodness-of-fit test. Is the p-value of the fit satisfactory?

In [None]:
#!wget https://www.dropbox.com/s/3uqleyc3wyz52tr/residuals_261.pkl -P data/
with open('data/residuals_261.pkl', 'rb') as infile:
    new_dict = pickle.load(infile).item()
res = np.array(new_dict['residuals'])
print(res)

fig = plt.figure(figsize=(13, 8))
# keep only the core of the distribution
mask = abs(res) < 2
y, edges, _ = plt.hist(res[mask], bins=40, color='blue')
centers = (edges[1:] + edges[:-1]) / 2
plt.xlabel('Residuals')
plt.ylabel('N. of occurrences')

# Poisson uncertainty on each bin: sigma_i = sqrt(n_i)
err_y = np.sqrt(y)
plt.errorbar(centers, y, yerr=err_y, fmt='.r')

# Gaussian peak (mean mu, height b, width sigma) on top of a constant offset o
def fit(x, mu, b, sigma, o):
    return o + b * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# empty bins have zero uncertainty, so they are excluded from the weighted fit
used = y > 0
p0 = [0., y.max(), res[mask].std(), 0.]  # rough starting values
popt, pcov = sp.optimize.curve_fit(fit, centers[used], y[used], p0=p0,
                                   sigma=err_y[used], absolute_sigma=True)
y_fit = fit(centers, *popt)
plt.plot(centers, y_fit, color='black', label='optimized curve')
plt.legend()

# a t-test on the mean bin content does not test the shape of the fit, so only the chi2 test is used
# chi2 applies because the bin contents are approximately normal when the counts are large enough (>= 5)
chi2 = np.sum((y[used] - y_fit[used]) ** 2 / err_y[used] ** 2)
ndof = used.sum() - len(popt)  # number of bins used minus number of fitted parameters
print("Using chi2:")
print("Chi2:", chi2, "ndof:", ndof)
pvalue2 = sp.stats.chi2.sf(chi2, ndof)
print("p-value =", pvalue2)  # if the p-value is < 0.05, the fit is considered unsatisfactory
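
The same procedure can be tried without the data file. The sketch below uses a synthetic Gaussian sample as a stand-in for the residuals (an assumption made purely for illustration; the seed, sample size and width are arbitrary), fits a Gaussian to the histogram with Poisson uncertainties, and computes the $\chi^2$ p-value with the number of degrees of freedom set to the number of bins used minus the number of fitted parameters.

```python
import numpy as np
from scipy import optimize, stats

# synthetic stand-in for the residuals (not the real data set)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=0.3, size=5000)

counts, edges = np.histogram(sample, bins=30)
centers = 0.5 * (edges[1:] + edges[:-1])

# Poisson uncertainties; empty bins are dropped to avoid dividing by zero
used = counts > 0
errors = np.sqrt(counts[used])

def gauss(x, amp, mu, sigma):
    return amp * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))

# weighted least-squares fit, starting from simple estimates of the sample
p0 = [counts.max(), sample.mean(), sample.std()]
popt, pcov = optimize.curve_fit(gauss, centers[used], counts[used], p0=p0,
                                sigma=errors, absolute_sigma=True)

# chi2 statistic and its p-value
chi2 = np.sum(((counts[used] - gauss(centers[used], *popt)) / errors) ** 2)
ndof = used.sum() - len(popt)
p_value = stats.chi2.sf(chi2, ndof)
print("chi2 = %.1f, ndof = %d, p-value = %.3f" % (chi2, ndof, p_value))
```

Passing `absolute_sigma=True` tells `curve_fit` to treat the uncertainties as absolute rather than relative weights, which is what the $\chi^2$ interpretation requires.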

5\. **Temperatures in Munich**

Get the following data file:

```bash
wget https://www.dropbox.com/s/7gy9yjl00ymxb8h/munich_temperatures_average_with_bad_data.txt -P data/
```

which gives the temperature in Munich every day for several years.


Fit the following function to the data:

$$f(t) = a \cos(2\pi t + b)+c$$

where $t$ is the time in years.

- Make a plot of the data and the best-fit model in the range 2008 to 2012.

   - What are the best-fit values of the parameters?

   - What is the overall average temperature in Munich, and what are the typical daily average values predicted by the model for the coldest and hottest time of year?

   - What is the meaning of the $b$ parameter, and what physical sense does it have?


- Now fit the data with the function $g(t)$, which has one more parameter than $f(t)$:
$$g(t) = a \cos(2\pi b t + c)+d$$
   - What are the RSS for $f(t)$ and $g(t)$?
   - Use the Fisher F-test to determine whether the additional parameter is motivated.
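
For reference, the F statistic for two nested models is usually written as below. The symbols $N$ (number of data points), $p_f = 3$ and $p_g = 4$ (number of fitted parameters in each model) are notation introduced here and do not appear in the exercise text.

$$F = \frac{(\mathrm{RSS}_f - \mathrm{RSS}_g)/(p_g - p_f)}{\mathrm{RSS}_g/(N - p_g)}$$

If the extra parameter is not needed, $F$ follows an F distribution with $(p_g - p_f,\ N - p_g)$ degrees of freedom, and the p-value is the upper-tail probability of the observed $F$.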

In [None]:
#!wget https://www.dropbox.com/s/7gy9yjl00ymxb8h/munich_temperatures_average_with_bad_data.txt -P data/
path = 'data/munich_temperatures_average_with_bad_data.txt'
date, temp = np.loadtxt(path, usecols=(0, 1), unpack=True)
# select 2008-2012; the cut on |temp| drops the bad entries
# (assumption: bad values lie far outside the physical temperature range)
indexes = np.where((date >= 2008) & (date <= 2012) & (np.abs(temp) < 50))

def f(t, a, b, c):
    return a * np.cos(2 * np.pi * t + b) + c

temp_red = temp[indexes]
date_red = date[indexes]
popt, pcov = sp.optimize.curve_fit(f, date_red, temp_red)
curve_fit = f(date_red, *popt)
fig = plt.figure(figsize=(13, 8))
plt.plot(date_red, temp_red, '.b', label='Temperature measurements')
plt.plot(date_red, curve_fit, 'red', label='Best-fit')
plt.xlabel('Year')
plt.ylabel('Temperature [°C]')
plt.legend()
plt.title('Temperature in Munich (2008-2012)')

print('best-fit parameters:', popt)
# in the model, c is the yearly average and |a| is the amplitude of the seasonal swing
a_fit, b_fit, c_fit = popt
print('average temperature [°C]:', c_fit)
print('hottest temperature predicted by the model [°C]:', c_fit + abs(a_fit))
print('coldest temperature predicted by the model [°C]:', c_fit - abs(a_fit))

# the b parameter is the phase of the oscillation:
# physically, it sets when the coldest and hottest days occur during the year

def g(t, a, b, c, d):
    return a * np.cos(2 * b * np.pi * t + c) + d

# start from the best-fit f (frequency b = 1), so that g can only improve on it
poptg, pcovg = sp.optimize.curve_fit(g, date_red, temp_red, p0=[a_fit, 1., b_fit, c_fit])
curveg_fit = g(date_red, *poptg)
# residual sum of squares
RSSf = np.sum((curve_fit - temp_red) ** 2)
RSSg = np.sum((curveg_fit - temp_red) ** 2)
print('RSSf:', RSSf)
print('RSSg:', RSSg)

# Fisher F-test for nested models with npar_1 < npar_2 fitted parameters
def Ftest(ssr_1, ssr_2, npar_1, npar_2, npoints, verbose=True):
    F = ((ssr_1 - ssr_2) / (npar_2 - npar_1)) / (ssr_2 / (npoints - npar_2))
    pval = sp.stats.f.sf(F, npar_2 - npar_1, npoints - npar_2)
    alpha = 0.05
    if verbose:
        print("p-value: %.3f" % pval, ", additional parameter necessary:", "YES" if pval < alpha else "NO")
    return pval

nparf = 3  # parameters of f: a, b, c
nparg = 4  # parameters of g: a, b, c, d
N = len(temp_red)
pval = Ftest(RSSf, RSSg, nparf, nparg, N)