## File I/O Revisited


### Data processing

It is often useful to store datasets in NumPy arrays. NumPy provides a number of functions for computing statistics on the data in an array, such as the mean, standard deviation, minimum, and maximum.

For example, let's calculate some properties from the Stockholm temperature dataset used above.

http://bolin.su.se/data/stockholm/homogenized_daily_mean_temperatures.php

To download the data, click [here](http://bolin.su.se/data/stockholm/files/stockholm-historical-weather-observations-ver-1.0.2016/temperature/daily/stockholm_daily_mean_temperature_1756_2016.txt)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# load the whitespace-delimited text file into the NumPy array 'data'
data = np.genfromtxt('Data/stockholm_daily_mean_temperature_1756_2017.txt')

In [None]:
#96594 rows and 7 columns
data.shape

In [None]:
print(data)

### To understand the data and what its columns represent, you can always check the website that published the dataset

https://bolin.su.se/data/stockholm-thematic/files/stockholm-historical-weather-observations-2017/temperature/daily/README_stockholm_daily_mean_temperature.txt

| Column | Data |
|--------|------|
| 1-3    | Year, month, day |
| 4      | Daily average temperature according to observations. Unit: C. Missing values: -999.0 |
| 5      | Daily average temperature after homogenization and with gaps filled in using data from Uppsala (see Moberg et al. 2002) |
| 6      | Daily average temperature after adjustment before September 1858 for a supposed warm bias of May-August temperatures (see Moberg et al. 2003) |
| 7      | Data ID indicating the source: 1 = Stockholm; 2 = Uppsala (adjusted to represent Stockholm); 3 = Stockholm, automatic station (used from 2013 onwards) |


#### Another source of datasets
* Kaggle, a popular data analysis and competition website with many datasets: https://www.kaggle.com/datasets


In [None]:
#visualize the data

### Creating time series data (for x-axis)
# year
print(data[:,0])

# month
print(data[:,1]/12.0)

# day
print(data[:,2]/365)

# combine year, month and day into one approximate decimal-year value
# (e.g. 1st Jan 1700 = 1700 + 1/12 + 1/365 = 1700.0861)
print(data[:,0]+data[:,1]/12.0+data[:,2]/365)

# create a 2x2 grid of subplots in a 14 inch x 4 inch figure (only the top-left one is used below)
fig, ax = plt.subplots(2,2,figsize=(14,4))

# plotting data (time (year+month+day) vs daily average temperature (column 6)) in subplot
ax[0,0].plot(data[:,0]+data[:,1]/12.0+data[:,2]/365, data[:,5])

# set the labels and title
ax[0,0].set_title('temperatures in Stockholm')
ax[0,0].set_xlabel('year')
ax[0,0].set_ylabel('temperature (C)');
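The x-axis above uses `year + month/12 + day/365`, which is a quick approximation: it shifts every date by up to a month and ignores leap years. If a more exact time axis is needed, one option is to compute the fraction of the year that has elapsed with the standard `datetime` module. The following is a minimal sketch; the helper name `decimal_year` and the `sample` rows are made up for illustration and are not part of the dataset.

```python
from datetime import date

import numpy as np


def decimal_year(year, month, day):
    """Return a date as a fractional year (hypothetical helper)."""
    d = date(int(year), int(month), int(day))
    start = date(d.year, 1, 1)        # first day of this year
    end = date(d.year + 1, 1, 1)      # first day of the next year
    return d.year + (d - start).days / (end - start).days


# Made-up rows using the same (year, month, day) layout as columns 1-3
sample = np.array([[1756.0, 1.0, 1.0],
                   [1756.0, 7.0, 2.0],
                   [2016.0, 12.0, 31.0]])

t = np.array([decimal_year(y, m, d) for y, m, d in sample])
print(t)
```

With the real array, the same list comprehension could be applied to `data[:, :3]`, at the cost of a slower Python-level loop over all rows.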


### mean

In [None]:
# the observed temperature is column 4 of the table, i.e. index 3 in the array
np.mean(data[:,3])

The daily mean temperature in Stockholm since year 1756 has been about 6.11 C.
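The README table lists `-999.0` as the missing-value marker for column 4. If any such entries are present in the column being averaged, they would pull the mean far below the true value, so it may be worth filtering them out first. The sketch below uses a small made-up array (`obs`) rather than the real file to show the effect.

```python
import numpy as np

MISSING = -999.0  # missing-value marker documented for column 4

# Made-up observations; two of them are marked as missing
obs = np.array([2.5, MISSING, -1.0, 4.5, MISSING])

naive_mean = np.mean(obs)         # the markers are treated as real temperatures
valid = obs != MISSING            # boolean mask: True for real measurements
clean_mean = np.mean(obs[valid])  # mean over real measurements only

print(naive_mean, clean_mean)
```

On the real data, the equivalent check would be `np.sum(data[:,3] == -999.0)`, which counts how many marked entries the column contains.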

### standard deviation and variance

In [None]:
np.std(data[:,3]), np.var(data[:,3])
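Two details of these functions are easy to overlook: the variance is the square of the standard deviation, and by default both divide by the number of samples `N` (`ddof=0`). Passing `ddof=1` gives the sample estimate that divides by `N - 1`. A small sketch with made-up values (`temps`):

```python
import numpy as np

temps = np.array([1.0, 3.0, 5.0, 7.0])  # made-up values, mean is 4.0

pop_std = np.std(temps)                 # default ddof=0: divide by N
pop_var = np.var(temps)                 # equals pop_std ** 2
sample_std = np.std(temps, ddof=1)      # divide by N - 1 instead

print(pop_std, pop_var, sample_std)
```

For a dataset with tens of thousands of rows the difference between the two conventions is negligible, but it matters for small subsets.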

### min and max

In [None]:
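# lowest daily average temperature, skipping roughly the first 50 years of data
np.min(data[365*50:,3])
# lowest daily average temperature in the whole record
# (a notebook cell displays only the value of its last expression)
np.min(data[:,3])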

In [None]:
# highest daily average temperature
np.max(data[:,3])
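`np.min` and `np.max` return only the extreme values. To find out on which date an extreme occurred, `np.argmin` and `np.argmax` return the row index instead, which can then be used to look up the date columns. A minimal sketch on made-up rows (the values in `rows` are invented, not taken from the dataset):

```python
import numpy as np

# Made-up rows: year, month, day, temperature
rows = np.array([[1987.0, 1.0, 9.0, -18.2],
                 [1987.0, 1.0, 10.0, -21.4],
                 [1987.0, 1.0, 11.0, -19.9]])

i_min = np.argmin(rows[:, 3])                     # index of the coldest row
year, month, day = rows[i_min, :3].astype(int)    # date columns of that row

print(year, month, day, rows[i_min, 3])
```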

## Computations on subsets of arrays

We can compute with subsets of the data in an array using indexing, fancy indexing, and the other methods of extracting data from an array (described above).

For example, let's go back to the temperature dataset:

The data format is: year, month, day, observed daily average temperature, homogenized temperature, adjusted temperature, data source ID.

If we are interested in the average temperature only in a particular month, say April, then we can create an index mask and use it to select only the data for that month:

In [None]:
np.unique(data[:,1]) # the month column takes values from 1 to 12

In [None]:
# boolean mask that is True for the rows whose month is April
mask_april = data[:,1] == 4

In [None]:
# the observed temperature is column 4 of the table, i.e. index 3 in the array
np.mean(data[mask_april,3])
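Masks can also be combined to select on more than one condition, for example April days from a given year onwards. Boolean arrays are combined element-wise with `&` (and) and `|` (or); the Python keywords `and` and `or` do not work on arrays. The sketch below uses made-up rows in the same column order as the dataset.

```python
import numpy as np

# Made-up rows: year, month, day, temperature
rows = np.array([[1999.0, 4.0, 1.0, 3.0],
                 [1999.0, 5.0, 1.0, 9.0],
                 [2001.0, 4.0, 1.0, 5.0],
                 [2001.0, 4.0, 2.0, 7.0]])

is_april = rows[:, 1] == 4       # True for April rows
since_2000 = rows[:, 0] >= 2000  # True for rows from the year 2000 onwards
mask = is_april & since_2000     # element-wise AND of the two masks

print(np.mean(rows[mask, 3]))
```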

In [None]:
# Quick exercise: get July only data



# Expected output: 17.46297709923664

# Exercise: generate a bar chart showing the mean temperature for each of the 12 months

![image.png](attachment:image.png)

