# Bonus: Temperature Analysis I

In [1]:
# import the required libraries
import pandas as pd
from datetime import datetime as dt

In [2]:
# import the required csv file
df = pd.read_csv("./Resources/hawaii_measurements.csv")
df.head()

Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [3]:
# get the datatypes of the columns within the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19550 entries, 0 to 19549
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   station  19550 non-null  object 
 1   date     19550 non-null  object 
 2   prcp     18103 non-null  float64
 3   tobs     19550 non-null  int64  
dtypes: float64(1), int64(1), object(2)
memory usage: 611.1+ KB


In [4]:
# Convert the date column format from string to datetime
df["date"] = pd.to_datetime(df["date"])
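
If the date format is known in advance, it can be passed explicitly so that values which do not match raise an error instead of being guessed at. A minimal sketch, assuming ISO-style `YYYY-MM-DD` strings as in the preview above; the `sample` frame is made up for illustration and is not the measurements file:

```python
import pandas as pd

# Hypothetical three-row stand-in for the measurements file
sample = pd.DataFrame({"date": ["2010-01-01", "2010-01-02", "2010-01-03"]})

# An explicit format makes parsing strict: a non-matching value raises an error
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

# The column now holds datetime values instead of strings
is_datetime = pd.api.types.is_datetime64_any_dtype(sample["date"])
print(is_datetime)
```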

In [5]:
# verify if the datatype of the date column is updated as expected
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19550 entries, 0 to 19549
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   station  19550 non-null  object        
 1   date     19550 non-null  datetime64[ns]
 2   prcp     18103 non-null  float64       
 3   tobs     19550 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 611.1+ KB


In [6]:
df.head()

Unnamed: 0,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [7]:
# Set the date column as the DataFrame index
df.set_index(["date"],inplace=True)
df.head()

date,station,prcp,tobs
2010-01-01,USC00519397,0.08,65
2010-01-02,USC00519397,0.0,63
2010-01-03,USC00519397,0.0,74
2010-01-04,USC00519397,0.0,76
2010-01-06,USC00519397,,73


### Compare June and December data across all years 

In [8]:
from scipy import stats
import numpy as np

In [9]:
# Filter data for desired months
june_data = df.loc[df.index.month == 6]
june_data

date,station,prcp,tobs
2010-06-01,USC00519397,0.00,78
2010-06-02,USC00519397,0.01,76
2010-06-03,USC00519397,0.00,78
2010-06-04,USC00519397,0.00,76
2010-06-05,USC00519397,0.00,77
...,...,...,...
2017-06-26,USC00516128,0.02,79
2017-06-27,USC00516128,0.10,74
2017-06-28,USC00516128,0.02,74
2017-06-29,USC00516128,0.04,76


In [10]:
dec_data = df.loc[df.index.month == 12]
dec_data

date,station,prcp,tobs
2010-12-01,USC00519397,0.04,76
2010-12-03,USC00519397,0.00,74
2010-12-04,USC00519397,0.00,74
2010-12-06,USC00519397,0.00,64
2010-12-07,USC00519397,0.00,64
...,...,...,...
2016-12-27,USC00516128,0.14,71
2016-12-28,USC00516128,0.14,71
2016-12-29,USC00516128,1.03,69
2016-12-30,USC00516128,2.37,65
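
The two month filters above follow the same pattern, so they could be wrapped in a small helper. A sketch, assuming a DataFrame with a `DatetimeIndex`; `filter_month` and the toy frame are hypothetical names used only for illustration:

```python
import pandas as pd

def filter_month(frame: pd.DataFrame, month: int) -> pd.DataFrame:
    """Return the rows whose DatetimeIndex falls in the given month (1-12)."""
    return frame.loc[frame.index.month == month]

# Toy data for illustration only
toy = pd.DataFrame(
    {"tobs": [78, 76, 74, 64]},
    index=pd.to_datetime(["2010-06-01", "2011-06-02", "2010-12-03", "2011-12-06"]),
)

june_rows = filter_month(toy, 6)
dec_rows = filter_month(toy, 12)
print(len(june_rows), len(dec_rows))
```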


In [11]:
# Identify the average temperature for June
average_temp_june = june_data["tobs"].mean()
print(f"The Average June temperature in Hawaii is {average_temp_june}")

The Average June temperature in Hawaii is 74.94411764705882


In [12]:
# Identify the average temperature for December
average_temp_dec = dec_data["tobs"].mean()
print(f"The Average December temperature in Hawaii is {average_temp_dec}")

The Average December temperature in Hawaii is 71.04152933421226
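
Both averages can also be computed in one pass by grouping on the month of the index. The sketch below uses a made-up frame rather than the Hawaii data, so the numbers it prints are not comparable to the results above:

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame(
    {"tobs": [78, 76, 74, 64]},
    index=pd.to_datetime(["2010-06-01", "2011-06-02", "2010-12-03", "2011-12-06"]),
)

# Keep only June (6) and December (12), then average tobs per month
selected = toy.loc[toy.index.month.isin([6, 12])]
monthly_mean = selected.groupby(selected.index.month)["tobs"].mean()
print(monthly_mean)
```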


In [13]:
# Create collections of temperature data
jun_collection = june_data["tobs"]
jun_collection

date
2010-06-01    78
2010-06-02    76
2010-06-03    78
2010-06-04    76
2010-06-05    77
              ..
2017-06-26    79
2017-06-27    74
2017-06-28    74
2017-06-29    76
2017-06-30    75
Name: tobs, Length: 1700, dtype: int64

In [14]:
dec_collection = dec_data["tobs"]
dec_collection

date
2010-12-01    76
2010-12-03    74
2010-12-04    74
2010-12-06    64
2010-12-07    64
              ..
2016-12-27    71
2016-12-28    71
2016-12-29    69
2016-12-30    65
2016-12-31    65
Name: tobs, Length: 1517, dtype: int64

- **Null Hypothesis**: Hawaii is reputed to enjoy mild weather all year round, so the mean temperatures in June and December are the same.
- **Alternate Hypothesis**: The mean temperatures in June and December are different.

### Run a paired t-test
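A paired t-test requires each June observation to be matched with exactly one December observation. The two samples have different sizes (1700 June readings versus 1517 December readings), so they cannot be paired as they stand.
An unpaired (independent two-sample) t-test is performed instead.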
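
The size mismatch can be checked before a paired test is attempted. A sketch with made-up numbers; `paired_ttest_or_none` is a hypothetical helper, while `stats.ttest_rel` is SciPy's paired test, which needs two samples of equal length:

```python
from scipy import stats

def paired_ttest_or_none(a, b):
    """Run a paired t-test, or return None when the samples cannot be paired."""
    if len(a) != len(b):
        return None
    return stats.ttest_rel(a, b)

# Made-up readings for illustration; not the Hawaii data
unequal = paired_ttest_or_none([78, 76, 77], [74, 64])
equal = paired_ttest_or_none([78, 76, 77], [74, 64, 70])
print(unequal, equal is not None)
```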

### Run an unpaired t-test

Before conducting the two-sample t-test, check whether the two samples have similar variances. 
As a rule of thumb, if the ratio of the larger variance to the smaller one is less than 4:1, the samples can be treated as having equal variance.
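
The rule of thumb can be expressed as a small function. This is a sketch on made-up samples; `variance_ratio` is a hypothetical helper, and it relies on `np.var`, which by default computes the population variance (`ddof=0`):

```python
import numpy as np

def variance_ratio(a, b):
    """Ratio of the larger variance to the smaller one (population variance, ddof=0)."""
    var_a, var_b = np.var(a), np.var(b)
    return max(var_a, var_b) / min(var_a, var_b)

# Made-up samples for illustration
x = [1.0, 2.0, 3.0, 4.0]    # variance 1.25
y = [2.0, 3.0, 5.0, 6.0]    # variance 2.5

ratio_xy = variance_ratio(x, y)
print(ratio_xy, ratio_xy < 4)
```

Dividing the larger variance by the smaller one makes the result independent of the argument order.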

In [15]:
# June tobs data variance
np.var(jun_collection)

10.604524221453236

In [16]:
# December tobs data variance
np.var(dec_collection)

14.022665558302293

In [17]:
# find the ratio between the 2 samples
ratio = np.var(dec_collection) / np.var(jun_collection)
ratio

1.3223285897102357

As the variance ratio (about 1.32) is less than 4:1, equal variances can be assumed and the standard unpaired t-test can be run.
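
Had the ratio exceeded the threshold, Welch's t-test, which does not assume equal variances, would be the usual alternative. In SciPy it is selected by passing `equal_var=False` to `stats.ttest_ind`. A sketch on made-up samples, not the Hawaii data:

```python
from scipy import stats

# Made-up samples for illustration; not the Hawaii data
warm = [78, 76, 78, 76, 77, 79, 74]
cool = [76, 74, 74, 64, 64, 71, 69, 65]

# Pooled-variance (Student) t-test, the default
pooled = stats.ttest_ind(warm, cool)

# Welch's t-test: variances are estimated separately for each sample
welch = stats.ttest_ind(warm, cool, equal_var=False)

print(pooled.pvalue, welch.pvalue)
```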

In [18]:
# Run a 2-sample t-test
stats.ttest_ind(jun_collection, dec_collection)

Ttest_indResult(statistic=31.60372399000329, pvalue=3.9025129038616655e-191)
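
For reference, the default `stats.ttest_ind` statistic is based on a pooled variance estimate: the difference in sample means divided by `sqrt(pooled_var * (1/n_a + 1/n_b))`. The sketch below reproduces it by hand on made-up samples; `pooled_t_statistic` is a hypothetical helper, not part of SciPy:

```python
import numpy as np
from scipy import stats

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)
    # Sample variances (ddof=1), weighted by their degrees of freedom
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Made-up samples for illustration; not the Hawaii data
warm = [78, 76, 78, 76, 77, 79, 74]
cool = [76, 74, 74, 64, 64, 71, 69, 65]

manual_t = pooled_t_statistic(warm, cool)
scipy_t = stats.ttest_ind(warm, cool).statistic
print(np.isclose(manual_t, scipy_t))
```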

### Analysis

### Use the t-test to determine whether the difference in means, if any, is statistically significant. 


The unpaired t-test performed on the June and December collections gives the following results:
- The p-value (about 3.9e-191) is far below 0.05, so the null hypothesis can be rejected. 
- The data therefore support the alternate hypothesis: the mean June and December temperatures differ. 
- This is consistent with the mean temperatures calculated above. The December average (about 71.0 °F) is roughly 3.9 °F lower than the June average (about 74.9 °F).

### Will you use a paired t-test or an unpaired t-test? Why?

- An unpaired t-test should be used here, because the June and December tobs samples have different sizes (1700 and 1517 observations) and so cannot be matched one-to-one.