# basic statistical analysis with python

In this exercise, we'll take a look at some basic statistical analysis with Python, starting with using `pandas` to calculate descriptive statistics for our datasets, before moving on to a few common examples of hypothesis tests using `statsmodels`.
 
## data

The data used in this exercise are the historic meteorological observations from the [Armagh Observatory](https://www.metoffice.gov.uk/weather/learn-about/how-forecasts-are-made/observations/recording-observations-for-over-100-years) (1853-present), the Oxford Observatory (1853-present), the Southampton Observatory (1855-2000), and Stornoway Airport (1873-present). These are the same observations, downloaded from the [UK Met Office](https://www.metoffice.gov.uk/research/climate/maps-and-data/historic-station-data), that we used in previous exercises. I have copied the **combined_stations.csv** data into this folder; this is the same file that you created while working through the "pandas" exercise.


## loading libraries

As before, we load the packages that we will use in the exercise at the beginning:

In [2]:
import pandas as pd
from pathlib import Path

Next, we'll use `pd.read_csv()` to load the combined station data. We'll also use the `parse_dates` argument to tell `pandas` to read the `date` column as a date:

In [9]:
station_data = pd.read_csv(Path('data', 'combined_stations.csv'), parse_dates=['date'])
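To check that `parse_dates` has done what we expect, we can look at the `dtype` of the column. The sketch below uses a tiny, made-up CSV held in memory (not the real station data, and the rainfall values are invented) so that it runs on its own; only the column names are meant to mirror **combined_stations.csv**.

```python
import io
import pandas as pd

# a tiny, made-up CSV standing in for combined_stations.csv (values are illustrative only)
csv_text = """date,station,rain
1853-01-01,armagh,57.3
1853-02-01,armagh,32.3
"""

# without parse_dates, the date column is read as plain text
as_text = pd.read_csv(io.StringIO(csv_text))

# with parse_dates, pandas converts the column to a datetime type
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])

print(as_text['date'].dtype)
print(as_dates['date'].dtype)
```

Having the column stored as a datetime means that we can use date-based tools such as the `.dt` accessor (for example, `as_dates['date'].dt.year`) later on.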

## descriptive statistics

Before diving into statistical tests, we'll spend a little bit of time expanding on calculating *descriptive* statistics using `pandas`. We have seen a little bit of this already, using `.groupby()` and `.mean()` to calculate the mean value of `rain` for each station.
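As a reminder of that pattern, here is a minimal sketch using a small, made-up table (the name `toy` and its values are invented for illustration, not taken from the station data). It also shows `.agg()`, which is one way to calculate several statistics at once:

```python
import pandas as pd

# made-up values for illustration only (not real observations)
toy = pd.DataFrame({
    'station': ['armagh', 'armagh', 'oxford', 'oxford'],
    'rain': [50.0, 70.0, 40.0, 60.0],
})

# mean rainfall for each station
mean_rain = toy.groupby('station')['rain'].mean()

# several descriptive statistics at once, one column per statistic
rain_stats = toy.groupby('station')['rain'].agg(['mean', 'median', 'std'])

print(mean_rain)
print(rain_stats)
```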

### describing variables using .describe()

First, we'll have a look at `.describe()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)), which provides a summary of each of the (numeric) columns in the table:

In [10]:
station_data.describe()

| | date | year | month | tmax | tmin | air_frost | rain | sun |
|---|---|---|---|---|---|---|---|---|
| count | 7617 | 7617.0 | 7617.0 | 7456.0 | 7456.0 | 7431.0 | 7613.0 | 4529.0 |
| mean | 1937-12-30 20:08:36.108704256 | 1937.539976 | 6.500591 | 13.109804 | 6.015102 | 3.2464 | 71.944266 | 117.97006 |
| min | 1853-01-01 00:00:00 | 1853.0 | 1.0 | -0.2 | -5.8 | 0.0 | 0.0 | 10.6 |
| 25% | 1898-04-01 00:00:00 | 1898.0 | 3.0 | 8.8 | 2.8 | 0.0 | 40.6 | 65.6 |
| 50% | 1937-12-01 00:00:00 | 1937.0 | 7.0 | 12.7 | 5.4 | 1.0 | 64.6 | 109.9 |
| 75% | 1977-08-01 00:00:00 | 1977.0 | 10.0 | 17.2 | 9.5 | 5.0 | 94.9 | 162.0 |
| max | 2022-12-01 00:00:00 | 2022.0 | 12.0 | 27.4 | 16.2 | 29.0 | 377.5 | 350.3 |
| std | | 46.697064 | 3.452793 | 5.079977 | 3.916344 | 4.792938 | 42.623085 | 62.756761 |


In the output above, we can see the count (**count**), mean (**mean**), minimum (**min**), 1st quartile (**25%**), median (**50%**), 3rd quartile (**75%**), maximum (**max**), and standard deviation (**std**) values of each numeric variable.

With this, we can quickly see where we might have errors in our data - for example, if we have non-physical or nonsense values in our variables. When first getting started with a dataset, it can be a good idea to check over the dataset using `.describe()`.
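As a rough sketch of what such a check might look like, the example below builds a small, made-up table containing two deliberately bad values (a negative rainfall total and a month of 13), then uses the **min** and **max** rows of `.describe()` to spot them. The table `toy` and the limits used here are assumptions for illustration; the appropriate limits depend on the variable.

```python
import pandas as pd

# made-up table with two deliberately non-physical values
toy = pd.DataFrame({
    'month': [1, 2, 13],          # 13 is not a valid month
    'rain': [57.3, -4.0, 61.0],   # rainfall can't be negative
})

summary = toy.describe()

# the min and max rows are often enough to reveal out-of-range values
print(summary.loc[['min', 'max'], ['month', 'rain']])

# pull out the offending rows so that we can inspect them
bad_rows = toy[(toy['rain'] < 0) | ~toy['month'].between(1, 12)]
print(bad_rows)
```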

### using .describe() to summarize groups

What if we wanted to get a summary based on some grouping - for example, for each station? We could filter the table to create a separate `DataFrame` for each value of `station`, then call `.describe()` on each of these objects in turn.

Not surprisingly, however, there is an easier way, using `.groupby()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)). First, `.groupby()` divides the table into groups based on the values of `station`; calling `.describe()` on the result then summarizes each group separately:

In [11]:
station_data.groupby('station').describe()

| station | date count | date mean | date min | date 25% | date 50% | date 75% | date max | date std | year count | year mean | ... | rain max | rain std | sun count | sun mean | sun min | sun 25% | sun 50% | sun 75% | sun max | sun std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| armagh | 2040 | 1937-12-16 05:40:14.117646848 | 1853-01-01 00:00:00 | 1895-06-23 12:00:00 | 1937-12-16 12:00:00 | 1980-06-08 12:00:00 | 2022-12-01 00:00:00 | | 2040.0 | 1937.5 | ... | 226.4 | 33.23648 | 1112.0 | 104.207374 | 17.8 | 62.3 | 100.7 | 137.4 | 256.0 | 50.4036 |
| oxford | 2040 | 1937-12-16 05:40:14.117646848 | 1853-01-01 00:00:00 | 1895-06-23 12:00:00 | 1937-12-16 12:00:00 | 1980-06-08 12:00:00 | 2022-12-01 00:00:00 | | 2040.0 | 1937.5 | ... | 192.9 | 31.436236 | 1128.0 | 128.725621 | 18.2 | 72.675 | 123.3 | 174.175 | 322.8 | 62.917199 |
| southampton | 1743 | 1927-08-01 19:10:01.032702208 | 1855-01-01 00:00:00 | 1891-04-16 00:00:00 | 1927-08-01 00:00:00 | 1963-11-16 00:00:00 | 2000-03-01 00:00:00 | | 1743.0 | 1927.125645 | ... | 280.7 | 42.11573 | 1163.0 | 137.346518 | 22.7 | 75.0 | 132.1 | 191.2 | 350.3 | 69.970405 |
| stornoway | 1794 | 1948-03-16 22:46:57.391304448 | 1873-07-01 00:00:00 | 1910-11-08 12:00:00 | 1948-03-16 12:00:00 | 1985-07-24 06:00:00 | 2022-12-01 00:00:00 | | 1794.0 | 1947.749164 | ... | 377.5 | 49.329277 | 1126.0 | 100.773801 | 10.6 | 52.1 | 97.4 | 140.25 | 294.1 | 57.731746 |
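To see that grouping gives the same numbers as summarizing each station by hand, here is a minimal sketch comparing the two approaches on a small, made-up table (the name `toy` and its values are invented for illustration):

```python
import pandas as pd

# made-up values for illustration only (not real observations)
toy = pd.DataFrame({
    'station': ['armagh', 'armagh', 'oxford', 'oxford'],
    'rain': [50.0, 70.0, 40.0, 60.0],
})

# the long way: select the rows for each station, then describe them
manual = {}
for name in toy['station'].unique():
    subset = toy.loc[toy['station'] == name, 'rain']
    manual[name] = subset.describe()

# the short way: let groupby do the splitting for us
grouped = toy.groupby('station')['rain'].describe()

# compare the mean from each approach, station by station
for name, summary in manual.items():
    print(name, summary['mean'], grouped.loc[name, 'mean'])
```

The grouped version returns a single table with one row per station, which is usually easier to work with than a collection of separate summaries.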


In [12]:
group_summary = station_data.groupby('station').describe()

In [21]:
group_summary.loc['armagh'].index

MultiIndex([(     'date', 'count'),
            (     'date',  'mean'),
            (     'date',   'min'),
            (     'date',   '25%'),
            (     'date',   '50%'),
            (     'date',   '75%'),
            (     'date',   'max'),
            (     'date',   'std'),
            (     'year', 'count'),
            (     'year',  'mean'),
            (     'year',   'min'),
            (     'year',   '25%'),
            (     'year',   '50%'),
            (     'year',   '75%'),
            (     'year',   'max'),
            (     'year',   'std'),
            (    'month', 'count'),
            (    'month',  'mean'),
            (    'month',   'min'),
            (    'month',   '25%'),
            (    'month',   '50%'),
            (    'month',   '75%'),
            (    'month',   'max'),
            (    'month',   'std'),
            (     'tmax', 'count'),
            (     'tmax',  'mean'),
            (     'tmax',   'min'),
            (     'tmax',   
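
The index shown above is a `MultiIndex` because the columns of `group_summary` have two levels: the variable name, and the statistic. The sketch below shows a few ways that we might select values from a table like this. It uses a small, made-up table (`toy`, with invented values) in place of the station data, so the numbers are illustrative only.

```python
import pandas as pd

# made-up values for illustration only (not real observations)
toy = pd.DataFrame({
    'station': ['armagh', 'armagh', 'oxford', 'oxford'],
    'rain': [50.0, 70.0, 40.0, 60.0],
    'sun': [100.0, 120.0, 130.0, 150.0],
})

group_summary = toy.groupby('station').describe()

# all statistics for one variable: a table with one row per station
rain_summary = group_summary['rain']

# one statistic for one variable, for every station
mean_rain = group_summary[('rain', 'mean')]

# a single value: the row label first, then the (variable, statistic) pair
armagh_mean_rain = group_summary.loc['armagh', ('rain', 'mean')]

# the same statistic for every variable, using a cross-section
# on the second level of the column labels
all_means = group_summary.xs('mean', axis=1, level=1)

print(armagh_mean_rain)
print(all_means)
```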

In [8]:
pd.to_datetime(station_data['date'])

0      1853-01-01
1      1853-02-01
2      1853-03-01
3      1853-04-01
4      1853-05-01
          ...    
7612   2022-08-01
7613   2022-09-01
7614   2022-10-01
7615   2022-11-01
7616   2022-12-01
Name: date, Length: 7617, dtype: datetime64[ns]