# Temperature Analysis I

In [1]:
import pandas as pd
from datetime import datetime as dt

In [2]:
# "tobs" is "temperature observations"
df = pd.read_csv('../Resources/hawaii_measurements.csv')
df.head()

,station,date,prcp,tobs
0,USC00519397,2010-01-01,0.08,65
1,USC00519397,2010-01-02,0.0,63
2,USC00519397,2010-01-03,0.0,74
3,USC00519397,2010-01-04,0.0,76
4,USC00519397,2010-01-06,,73


In [3]:
# Convert the date column format from string to datetime
df['date'] = pd.to_datetime(df['date'])
df.dtypes

station            object
date       datetime64[ns]
prcp              float64
tobs                int64
dtype: object

In [4]:
# Set the date column as the DataFrame index
df = df.set_index('date')
df

date,station,prcp,tobs
2010-01-01,USC00519397,0.08,65
2010-01-02,USC00519397,0.00,63
2010-01-03,USC00519397,0.00,74
2010-01-04,USC00519397,0.00,76
2010-01-06,USC00519397,,73
...,...,...,...
2017-08-19,USC00516128,0.09,71
2017-08-20,USC00516128,,78
2017-08-21,USC00516128,0.56,76
2017-08-22,USC00516128,0.50,76


In [5]:
# Drop the date column
# Nothing to drop: set_index('date') already moved 'date' out of the columns and into the index.
# Dropping the index itself would discard the dates that the month filters below rely on.
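The note above can be checked on a tiny frame. This is a minimal sketch using a hypothetical two-row DataFrame (`demo`, `indexed`, and `no_dates` are illustrative names, not part of the notebook); it shows that `date` is no longer a column after `set_index`, and what actually dropping the index would do.

```python
import pandas as pd

# Hypothetical two-row frame that mirrors the notebook's column names
demo = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-01", "2010-01-02"]),
    "station": ["USC00519397", "USC00519397"],
    "tobs": [65, 63],
})
indexed = demo.set_index("date")

# 'date' now lives in the index, so it is absent from the columns
print("date" in indexed.columns)  # False

# Dropping the index for real throws the dates away and restores 0, 1, ...
no_dates = indexed.reset_index(drop=True)
print(list(no_dates.columns))
```

Because the later cells filter on `df.index.month`, keeping the dates in the index is the safer choice here.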

### Compare June and December data across all years 

In [6]:
from scipy import stats

In [7]:
# Filter data for desired months
june_df = df.loc[(df.index.month == 6)]
december_df = df.loc[(df.index.month == 12)]
print(june_df)
print(december_df)

                station  prcp  tobs
date                               
2010-06-01  USC00519397  0.00    78
2010-06-02  USC00519397  0.01    76
2010-06-03  USC00519397  0.00    78
2010-06-04  USC00519397  0.00    76
2010-06-05  USC00519397  0.00    77
...                 ...   ...   ...
2017-06-26  USC00516128  0.02    79
2017-06-27  USC00516128  0.10    74
2017-06-28  USC00516128  0.02    74
2017-06-29  USC00516128  0.04    76
2017-06-30  USC00516128  0.20    75

[1700 rows x 3 columns]
                station  prcp  tobs
date                               
2010-12-01  USC00519397  0.04    76
2010-12-03  USC00519397  0.00    74
2010-12-04  USC00519397  0.00    74
2010-12-06  USC00519397  0.00    64
2010-12-07  USC00519397  0.00    64
...                 ...   ...   ...
2016-12-27  USC00516128  0.14    71
2016-12-28  USC00516128  0.14    71
2016-12-29  USC00516128  1.03    69
2016-12-30  USC00516128  2.37    65
2016-12-31  USC00516128  0.90    65

[1517 rows x 3 columns]


In [8]:
# Identify the average temperature for June
june_avg = june_df["tobs"].mean()
june_avg = round(june_avg,2)
june_avg

74.94

In [9]:
# Identify the average temperature for December
dec_avg = december_df["tobs"].mean()
dec_avg = round(dec_avg,2)
dec_avg

71.04

In [10]:
# Create collections of temperature data
# December has 1517 rows (see above) and ttest_rel needs equal-length inputs,
# so randomly sample the larger June set down to 1517 rows.
december_row_count = len(december_df)
june_temps = june_df['tobs'].sample(n=december_row_count, random_state=1)
dec_temps = december_df["tobs"]

print(june_temps)
print(dec_temps)

date
2012-06-21    71
2011-06-26    72
2012-06-28    74
2012-06-19    72
2016-06-09    76
              ..
2010-06-09    80
2010-06-22    78
2015-06-23    78
2010-06-26    75
2016-06-09    71
Name: tobs, Length: 1517, dtype: int64
date
2010-12-01    76
2010-12-03    74
2010-12-04    74
2010-12-06    64
2010-12-07    64
              ..
2016-12-27    71
2016-12-28    71
2016-12-29    69
2016-12-30    65
2016-12-31    65
Name: tobs, Length: 1517, dtype: int64


In [11]:
# Run paired t-test
stats.ttest_rel(june_temps, dec_temps)

Ttest_relResult(statistic=30.394318428923157, pvalue=7.506857985454749e-159)

### Analysis

Paired vs unpaired t-test
The key differences between a paired and unpaired t-test are summarized below.

A paired t-test is designed to compare the means of the same group or item under two separate scenarios. An unpaired t-test compares the means of two independent or unrelated groups.

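The standard (Student's) unpaired t-test assumes that the two groups have equal variance; Welch's variant of the unpaired test drops that assumption. A paired t-test works on the difference within each pair, so it does not require the two sets of measurements to have equal variance.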
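To make the paired case concrete: a paired t-test gives the same result as a one-sample t-test on the per-pair differences against a mean of zero. The sketch below illustrates this with synthetic data (the arrays `before` and `after` are made-up values for illustration, not the Hawaii measurements).

```python
import numpy as np
from scipy import stats

# Synthetic, illustrative data only: 50 matched measurements
rng = np.random.default_rng(0)
before = rng.normal(loc=75.0, scale=3.0, size=50)
after = before - 4.0 + rng.normal(loc=0.0, scale=1.0, size=50)

# Paired test on the two matched samples
paired = stats.ttest_rel(before, after)

# One-sample test on the within-pair differences, null mean = 0
one_sample = stats.ttest_1samp(before - after, popmean=0.0)

# The two statistics agree up to floating-point error
print(paired.statistic, one_sample.statistic)
```

This is why the pairing has to be meaningful: the test only sees the differences, so lining up unrelated rows by position produces differences that have no real interpretation.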

The p-value (about 7.5e-159) is far below the 1% significance level, so we reject the null hypothesis that the June and December average temperatures are equal.

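However, the June and December observations are not matched pairs: the June values were randomly sampled down to 1517 rows and lined up with the December values by position only. An unpaired t-test is therefore the more appropriate choice.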
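A minimal sketch of the unpaired alternative, using `scipy.stats.ttest_ind` with `equal_var=False` (Welch's test). The data here are synthetic stand-ins: the sample sizes match the row counts above and the means are loosely based on the averages computed earlier, but the spreads are assumptions, so the printed numbers are not results for the real dataset.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the two months; NOT the real measurements.
# The scale (standard deviation) values are assumed for illustration.
rng = np.random.default_rng(42)
june_like = rng.normal(loc=75.0, scale=3.0, size=1700)
december_like = rng.normal(loc=71.0, scale=3.7, size=1517)

# Welch's unpaired t-test: no pairing, no equal-variance assumption,
# and the two samples may have different lengths.
welch = stats.ttest_ind(june_like, december_like, equal_var=False)
print(welch.statistic, welch.pvalue)

# Decision at the 1% significance level
reject_null = welch.pvalue < 0.01
print(reject_null)
```

In the notebook itself, the equivalent call would presumably be `stats.ttest_ind(june_df['tobs'], december_df['tobs'], equal_var=False)`, which also removes the need to sample June down to 1517 rows. That call has not been run here, so no result is quoted for it.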