# Lab 8 Tasks - Solution

In this lab we will use NumPy to load and analyse daily temperature data collected for Sydney, Australia in 2016. The dataset contains 4 columns:
- *MinTemp*: Minimum daily temperature (Degrees Celsius)
- *MaxTemp*: Maximum daily temperature (Degrees Celsius)
- *Temp9am*: Temperature at 9am (Degrees Celsius)
- *Temp3pm*: Temperature at 3pm (Degrees Celsius)

## Task 1 

Use the Python *urllib.request* module to download a numeric dataset in CSV (comma-separated values) format and save it to disk:

http://mlg.ucd.ie/modules/COMP41680/temperature.csv

Use NumPy to load this dataset into a 2D NumPy array. Note that you should skip the first row of the file, since it is not part of the numeric data. See:

https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html

Check the shape (number of rows and columns) of the loaded array.

In [1]:
# retrieve the remote file
import urllib.request
url = "http://mlg.ucd.ie/modules/COMP41680/temperature.csv"
response = urllib.request.urlopen(url)
raw_csv = response.read().decode()

In [2]:
# save the file
with open("temperature.csv", "w") as fout:
    fout.write(raw_csv)

In [3]:
import numpy as np
data = np.loadtxt("temperature.csv", delimiter=",", skiprows=1)
data

array([[21.4, 28.4, 25.2, 27.5],
       [21.2, 28.5, 23.2, 26.3],
       [22. , 28.8, 26.3, 27.7],
       ...,
       [22.6, 36.6, 28.1, 31.8],
       [23.9, 33.3, 27.3, 32.1],
       [24.1, 30. , 27.7, 26.4]])

In [4]:
print("Data has %d rows and %d columns" % data.shape)

Data has 731 rows and 4 columns
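
The effect of `skiprows=1` can be checked without a network connection. The following is a minimal sketch that loads a made-up two-row CSV from memory; the header names and the sample values are assumptions for illustration, not the contents of the real file.

```python
import io
import numpy as np

# Hypothetical miniature CSV standing in for temperature.csv
# (header names and values are assumed for illustration)
sample_csv = (
    "MinTemp,MaxTemp,Temp9am,Temp3pm\n"
    "21.4,28.4,25.2,27.5\n"
    "21.2,28.5,23.2,26.3\n"
)

# skiprows=1 drops the first line, which cannot be parsed as floats
sample = np.loadtxt(io.StringIO(sample_csv), delimiter=",", skiprows=1)
print("Sample has %d rows and %d columns" % sample.shape)
```

Without `skiprows=1`, `np.loadtxt` would try to convert the header text to floating-point numbers and raise an error.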


## Task 2

Calculate basic summary statistics for the overall data.

In [5]:
print("Range is [%.1f, %.1f]" % (data.min(), data.max()))
print("Mean is %.1f" % data.mean() )
print("Standard deviation is %.1f" % data.std())

Range is [5.0, 40.9]
Mean is 19.7
Standard deviation is 5.7


Calculate basic summary statistics for each column in the data (corresponding to MinTemp, MaxTemp, Temp9am, Temp3pm):

In [6]:
# get the mean value of each column
col_mean = np.mean(data, axis=0)
# get the minimum value of each column
col_min = np.min(data, axis=0)
# get the maximum value of each column
col_max = np.max(data, axis=0)
# get the standard deviation of each column
col_std = np.std(data, axis=0)
# display the results
for col in range(data.shape[1]):
    print("Column %d:  Min=%.1f\tMax=%.2f\tMean=%.1f\tStd=%.1f" % 
          (col, col_min[col], col_max[col], col_mean[col], col_std[col]))

Column 0:  Min=5.0	Max=27.10	Mean=15.2	Std=4.6
Column 1:  Min=11.7	Max=40.90	Mean=23.5	Std=4.7
Column 2:  Min=6.7	Max=32.40	Mean=18.3	Std=5.0
Column 3:  Min=11.0	Max=40.70	Mean=21.9	Std=4.5
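
The `axis` argument controls the direction of the aggregation. As a small sketch on a made-up array (the values below are illustrative only), `axis=0` collapses the rows to give one value per column, while `axis=1` collapses the columns to give one value per day:

```python
import numpy as np

# Toy array: 3 days (rows) x 4 measurements (columns); values are made up
toy = np.array([[10.0, 20.0, 12.0, 18.0],
                [11.0, 25.0, 14.0, 22.0],
                [ 9.0, 21.0, 13.0, 20.0]])

col_means = toy.mean(axis=0)  # one mean per column, shape (4,)
row_means = toy.mean(axis=1)  # one mean per row (day), shape (3,)
print(col_means.shape, row_means.shape)
```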


Calculate the number of days where the maximum temperature was above 30 degrees Celsius.

In [7]:
above_30 = data[:, 1] > 30
days_above_30 = np.sum(above_30)
print("Number of days > 30 degrees: %d" % days_above_30)

Number of days > 30 degrees: 50
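
Counting works here because a comparison on an array yields a Boolean mask, and `True` is treated as 1 when summed. A minimal sketch with made-up values shows two equivalent ways to count, and how the same mask can select the matching entries:

```python
import numpy as np

# Made-up maximum temperatures for five days
max_temps = np.array([28.4, 31.0, 30.0, 36.6, 29.9])

mask = max_temps > 30              # Boolean array, True where strictly above 30
count_sum = np.sum(mask)           # True counts as 1
count_nz = np.count_nonzero(mask)  # equivalent count
hot_days = max_temps[mask]         # the values that satisfied the condition
print(count_sum, count_nz, hot_days)
```

Note that a day of exactly 30.0 degrees is not counted, because the comparison is strict.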


## Task 3

Create a scatter plot comparing the minimum and maximum temperatures for each day.

In [8]:
# we can do this using Matplotlib
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
# create the figure
plt.figure(figsize=(6.5, 5.5))
# draw the scatter plot
ax = plt.scatter(data[:,0], data[:,1], color="darkorange", s=20)
plt.xlabel("Minimum Temperature (C)", fontsize=13);
plt.ylabel("Maximum Temperature (C)", fontsize=13);

Calculate the temperature range for each day (i.e. the difference between the maximum and minimum temperature).

Plot these range values visually using a histogram containing 5 bins.

In [9]:
# calculate the ranges
temperature_range = data[:, 1] - data[:, 0]
# create the histogram
plt.figure(figsize=(7, 4.5))
plt.hist(temperature_range, bins=5, edgecolor='black')
plt.xlabel("Temperature Range (C)", fontsize=13);
plt.ylabel("Frequency", fontsize=13);
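
If the bin counts are needed as numbers rather than as a plot, `np.histogram` performs the same equal-width binning. This is a sketch on made-up range values, not the lab data:

```python
import numpy as np

# Made-up daily temperature ranges
ranges = np.array([2.0, 4.0, 5.5, 7.0, 8.0, 9.5, 12.0])

# 5 equal-width bins spanning the minimum to the maximum value
counts, edges = np.histogram(ranges, bins=5)
print(counts)  # number of values in each bin
print(edges)   # 6 bin edges delimiting the 5 bins
```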

Create a scatter plot comparing the temperatures at 9am and 3pm for each day.

In [10]:
# create the figure
plt.figure(figsize=(6.5, 5.5))
# draw the scatter plot
ax = plt.scatter(data[:,2], data[:,3], color="darkred", s=20)
plt.xlabel("9am Temperature (C)", fontsize=13);
plt.ylabel("3pm Temperature (C)", fontsize=13);

Calculate the differences between the temperatures at 9am and 3pm for each day. What is the mean difference?

On how many days was the temperature warmer at 9am than 3pm?

In [11]:
# calculate the differences (3pm minus 9am)
temperature_diff = data[:, 3] - data[:, 2]
print("Mean temperature difference between 3pm and 9am: %.2f degrees" % temperature_diff.mean())
# check for negative difference values
warmer_9am = temperature_diff < 0
days_warmer_9am = np.sum(warmer_9am)
print("Number of days warmer at 9am than 3pm: %d" % days_warmer_9am)

Mean temperature difference between 3pm and 9am: 3.66 degrees
Number of days warmer at 9am than 3pm: 51


## Task 4

Create a new version of the dataset where all of the Celsius temperatures have been converted to Fahrenheit (see [here](https://www.metric-conversions.org/temperature/celsius-to-fahrenheit.htm)).

In [12]:
# apply the conversion formula
data2 = (data * 1.8) + 32
data2

array([[70.52, 83.12, 77.36, 81.5 ],
       [70.16, 83.3 , 73.76, 79.34],
       [71.6 , 83.84, 79.34, 81.86],
       ...,
       [72.68, 97.88, 82.58, 89.24],
       [75.02, 91.94, 81.14, 89.78],
       [75.38, 86.  , 81.86, 79.52]])
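
One way to sanity-check the conversion is to apply the inverse formula and confirm that the original values come back. The sketch below uses a few made-up Celsius values plus 21.4, which appears in the first row of the data:

```python
import numpy as np

celsius = np.array([0.0, 21.4, 37.0, 100.0])

# F = C * 1.8 + 32, applied elementwise to the whole array
fahrenheit = celsius * 1.8 + 32

# Inverse: C = (F - 32) / 1.8
back = (fahrenheit - 32) / 1.8

# allclose tolerates tiny floating-point rounding differences
print(np.allclose(back, celsius))
```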

Create a scatter plot of the maximum daily temperature in Celsius and maximum daily temperature in Fahrenheit.

In [13]:
# create the figure
plt.figure(figsize=(6.5, 5.5))
# draw the scatter plot
ax = plt.scatter(data[:,1], data2[:,1], color="darkgreen", s=20)
plt.xlabel("Maximum Temperature (C)", fontsize=13);
plt.ylabel("Maximum Temperature (F)", fontsize=13);

## Task 5

Normalise the new Fahrenheit version of the data by applying **min-max normalisation** to each of the columns in the array.

Then display updated summary statistics for the data.

In [14]:
# calculate the minimum and maximum values per column
min_values = np.min(data2, axis=0)
print(min_values)
max_values = np.max(data2, axis=0)
print(max_values)

[41.   53.06 44.06 51.8 ]
[ 80.78 105.62  90.32 105.26]


In [15]:
# now apply the normalisation
data3 = (data2 - min_values) / (max_values - min_values)
data3

array([[0.74208145, 0.57191781, 0.71984436, 0.55555556],
       [0.73303167, 0.57534247, 0.64202335, 0.51515152],
       [0.76923077, 0.58561644, 0.76264591, 0.56228956],
       ...,
       [0.79638009, 0.85273973, 0.83268482, 0.7003367 ],
       [0.85520362, 0.73972603, 0.80155642, 0.71043771],
       [0.86425339, 0.62671233, 0.81712062, 0.51851852]])

In [16]:
print("Range is [%.1f, %.1f]" % (data3.min(), data3.max()))
print("Mean is %.1f" % data3.mean() )
print("Standard deviation is %.1f" % data3.std())

Range is [0.0, 1.0]
Mean is 0.4
Standard deviation is 0.2
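
The overall range of [0, 1] does not by itself show that every column was rescaled. A per-column check is more informative. The sketch below applies the same formula, (x - min) / (max - min), to a made-up two-column array and relies on broadcasting to subtract each column's own minimum:

```python
import numpy as np

# Made-up array with two columns on different scales
toy = np.array([[50.0,  80.0],
                [60.0, 100.0],
                [70.0,  90.0]])

col_min = toy.min(axis=0)  # shape (2,), one minimum per column
col_max = toy.max(axis=0)  # shape (2,), one maximum per column

# Broadcasting stretches the (2,) vectors across all rows
scaled = (toy - col_min) / (col_max - col_min)

# After scaling, each column should span exactly 0 to 1
print(scaled.min(axis=0), scaled.max(axis=0))
```

A column whose values are all identical would make the denominator zero, so this sketch assumes every column contains at least two distinct values.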


## Task 6

Export the final normalised version of the Fahrenheit data as a comma-separated text file, with values rounded to 3 decimal places.

In [17]:
np.savetxt("temperature-norm-fahrenheit.csv", data3, fmt="%.3f", delimiter=",")
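
To confirm that the export behaves as intended, the file can be read back and compared with the original values. This sketch writes a small array (its values are taken from the first two rows and columns of the normalised output above) to a temporary directory; the file name `check.csv` is a hypothetical choice:

```python
import os
import tempfile
import numpy as np

values = np.array([[0.74208145, 0.57191781],
                   [0.73303167, 0.57534247]])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "check.csv")  # hypothetical file name
    np.savetxt(path, values, fmt="%.3f", delimiter=",")

    # Inspect the raw text of the first line
    with open(path) as fin:
        first_line = fin.readline().strip()

    # Reload the file; there is no header row, so no skiprows is needed
    reloaded = np.loadtxt(path, delimiter=",")

print(first_line)
print(reloaded.shape)
```

Because the values are rounded on export, the reloaded array matches the original only to within about 0.0005.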