# Week 9 (Mon) - Introduction to Numpy 2 & Plotting Intro

Python lists:

* are very flexible
* don't require uniform numerical types
* are very easy to modify (inserting or appending objects).

However, flexibility often comes at the cost of performance, and lists are not the ideal object for numerical calculations.

This is where **Numpy** comes in. Numpy is a Python module that defines a powerful n-dimensional array object that uses C and Fortran code behind the scenes to provide high performance.

The downside of Numpy arrays is that they have a more rigid structure, and require a single numerical type (e.g. floating point values), but for a lot of scientific work, this is exactly what is needed.

The Numpy module is imported with:

In [None]:
import numpy

In the rest of this course, and in many packages, the following convention is used instead:

In [None]:
import numpy as np

Numpy is used so often that the shorter name ``np`` saves a lot of typing compared to ``numpy``.
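With Numpy imported, the trade-off between lists and arrays described above can be seen in a small sketch. The variable names here are illustrative and are not used elsewhere in these notes:

```python
import numpy as np

# A list can hold mixed types and is easy to grow
mixed = [1, 2.5, "three"]
mixed.append(4)

# An array stores a single type: here the integers are converted to floats
values = np.array([1, 2.5, 3])
print(values.dtype)

# Arithmetic acts on every element of an array at once, with no explicit loop
doubled = values * 2
print(doubled)

# The same operator applied to a list repeats the list instead
repeated = [1, 2.5, 3] * 2
print(repeated)
```

The last line is a common source of confusion: ``*`` on a list means repetition, while on an array it means element-wise multiplication.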

A very useful function in Numpy is [numpy.loadtxt](http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html), which makes it easy to read column-based data from a text file.


For example, we can read a whole file into a single two-dimensional array:

In [None]:
data = np.loadtxt('columns.txt')
data

Or we can read the individual columns:

In [None]:
date, temperature = np.loadtxt('columns.txt', unpack=True)

In [None]:
date

In [None]:
temperature

There are additional options to skip header rows, ignore comments, and read only certain columns. See the documentation for more details.
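As a rough illustration of those options, the sketch below reads from an in-memory string instead of a file so that it is self-contained. The text, including the extra ``station`` column, is made up for this example and is not the contents of ``columns.txt``:

```python
from io import StringIO

import numpy as np

# Made-up file contents: one header row, one comment line, three data rows
text = """date temperature station
# this comment line is ignored
2001.0 3.5 1
2001.5 18.2 1
2002.0 -1.0 2
"""

date, temperature = np.loadtxt(
    StringIO(text),
    skiprows=1,       # skip the header row
    comments='#',     # ignore lines starting with '#' (this is the default)
    usecols=(0, 1),   # read only the first two columns
    unpack=True,      # return one array per column
)

print(date)
print(temperature)
```

With a real file, the filename string would simply take the place of ``StringIO(text)``.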

## Masking

The index notation ``[...]`` is not limited to single-element indexing or multiple-element slicing. One can also pass a list or array of indices:

In [None]:
x = np.array([1,6,4,7,9,3,1,5,6,7,3,4,4,3])
x[[1,2,4,3,3,2]]

This returns a new array composed of elements 1, 2, 4, etc. of the original array, in the order requested (an index can appear more than once).

Alternatively, one can also pass a boolean array of ``True/False`` values, called a **mask**, indicating which items to keep:

In [None]:
x[np.array([True, False, False, True, True, True, False, False, True, True, True, False, False, True])]

Written out like this, a mask is too verbose to be useful. However, a comparison involving the array returns exactly this kind of boolean array:

In [None]:
x > 3.4

It is therefore possible to extract subsets from an array using the following simple notation:

In [None]:
x[x > 3.4]

Conditions can be combined:

In [None]:
x[(x > 3.4) & (x < 5.5)]
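The combination above uses ``&`` for "and". A short sketch of the other element-wise operators, ``|`` for "or" and ``~`` for "not", using the same array (the result names are illustrative):

```python
import numpy as np

x = np.array([1, 6, 4, 7, 9, 3, 1, 5, 6, 7, 3, 4, 4, 3])

# & means "and": both conditions must hold
both = x[(x > 3.4) & (x < 5.5)]
print(both)       # [4 5 4 4]

# | means "or": at least one condition must hold
either = x[(x < 2) | (x > 8)]
print(either)     # [1 9 1]

# ~ means "not": it flips True and False
not_small = x[~(x < 5)]
print(not_small)  # [6 7 9 5 6 7]
```

The parentheses around each comparison are required, because ``&`` and ``|`` bind more tightly than ``>`` and ``<``. Python's ``and`` and ``or`` keywords cannot be used here, since they do not work element by element on arrays.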

Of course, the boolean **mask** can be derived from an array other than ``x``, as long as it has the same size:

In [None]:
x = np.linspace(-1., 1., 14)
y = np.array([1,6,4,7,9,3,1,5,6,7,3,4,4,3])

In [None]:
y[(x > -0.5) & (x < 0.4)]

Since the mask itself is an array, it can be stored in a variable and used as a mask for different arrays:

In [None]:
keep = (x > -0.5) & (x < 0.4)
x_new = x[keep]
y_new = y[keep]

In [None]:
x_new

In [None]:
y_new
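Since a stored mask is just a boolean array, it can be inspected like any other array. The sketch below repeats the definitions from the cells above, counts the selected elements, and shows what happens when the mask is applied to an array of the wrong size (``short`` is a made-up array for this illustration):

```python
import numpy as np

x = np.linspace(-1., 1., 14)
y = np.array([1, 6, 4, 7, 9, 3, 1, 5, 6, 7, 3, 4, 4, 3])
keep = (x > -0.5) & (x < 0.4)

print(keep.dtype, keep.shape)   # bool (14,)

# True counts as 1, so the sum is the number of elements kept
n_kept = keep.sum()
print(n_kept)

# Applying the mask to an array of a different size raises an IndexError
short = np.array([10, 20, 30])
try:
    short[keep]
    size_mismatch = False
except IndexError as err:
    size_mismatch = True
    print("IndexError:", err)
```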

A mask can also appear on the left hand side of an assignment:

In [None]:
y[y > 5] = 0.

In [None]:
y
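One detail worth knowing: selecting with a mask produces a new array (a copy), whereas assigning through a mask changes the original array in place. A minimal sketch, using the original values of ``y`` and an illustrative name ``subset``:

```python
import numpy as np

y = np.array([1, 6, 4, 7, 9, 3, 1, 5, 6, 7, 3, 4, 4, 3])

# Selecting with a mask gives a copy, so changing it leaves y alone
subset = y[y > 5]
subset[:] = -1
print(y.max())   # still 9

# Assigning through a mask modifies y itself
y[y > 5] = 0
print(y)         # [1 0 4 0 0 3 1 5 0 0 3 4 4 3]
```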

## Exercise 1

The [data/munich_temperatures_average_with_bad_data.txt](data/munich_temperatures_average_with_bad_data.txt) data file gives the temperature in Munich every day for several years:

In [None]:
!head munich_temperatures_average_with_bad_data.txt  # shows the first 10 lines of a file

Read in the file using ``np.loadtxt``. The data contains bad values, which you can identify by looking at the minimum and maximum values of the array. Use masking to get rid of the bad temperature values.

In [None]:
# your solution here

### NaN values

In arrays, some of the values are sometimes NaN - meaning *Not a Number*. If you multiply a NaN value by another value, you get NaN, and if there are any NaN values in a summation, the total result will be NaN. One way to get around this is to use ``np.nansum`` instead of ``np.sum`` in order to find the sum:

In [None]:
x = np.array([1,2,3,np.nan])

In [None]:
np.nansum(x)

In [None]:
np.nanmax(x)

You can also use ``np.isnan`` to tell you where values are NaN. For example, ``array[~np.isnan(array)]`` will return all the values that are not NaN (because ~ means 'not'):

In [None]:
np.isnan(x)

In [None]:
x[np.isnan(x)]

In [None]:
x[~np.isnan(x)]
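The behaviour described above can be checked directly. The sketch below compares ``np.sum`` with ``np.nansum``, and shows that a NaN fails ordinary comparisons, so the ``np.isnan`` mask can be combined with other conditions using ``&`` (the result names are illustrative):

```python
import numpy as np

x = np.array([1, 2, 3, np.nan])

# A single NaN turns the ordinary sum into NaN
total = np.sum(x)
print(total)        # nan

# The nan-aware versions skip NaN values
safe_total = np.nansum(x)
print(safe_total)   # 6.0

# A comparison involving NaN is False, so NaN never passes a > or < test
print(x > 1.5)      # [False  True  True False]

# The isnan mask combines with other conditions like any other mask
above = x[~np.isnan(x) & (x > 1.5)]
print(above)        # [2. 3.]
```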

## Exercise 2

The [data/munich_temperatures_average_with_bad_data_nan.txt](data/munich_temperatures_average_with_bad_data_nan.txt) data file gives the temperature in Munich every day for several years:

In [None]:
!head munich_temperatures_average_with_bad_data_nan.txt  # shows the first 10 lines of a file

The data contains bad values AND NaNs! Use masking to get rid of both the bad temperature values and the NaN values.

In [None]:
# your solution here

## Plotting data is one of the easiest ways to determine if something may be odd/wrong with your data.

Visualizing data is also a key way to help interpret things in science (__e.g., a picture is worth a thousand words, especially when your data is 100,000+ data points!__).

In [None]:
# Import the needed functions
from matplotlib import pyplot as plt
# matplotlib.pyplot is the standard for making plots; its interface is modeled on MATLAB
import numpy as np

### Now let's plot our data to see if we got all of the outliers!!!  YAY WE FINALLY GET TO MAKE PLOTS!

In [None]:
# READ IN DATA FILE
data = np.genfromtxt("munich_temperatures_average_with_bad_data.txt")
Date = data[:,0]
Temp = data[:,1] 

# PLOT TO VERIFY FILE READ 
# This creates a figure object and sets the size
fig1 = plt.figure(figsize=(10,6))
# This allows subplots, for now this is set to 1 plot (1x1, plot 1)
ax1 = fig1.add_subplot(111)

# This makes a scatter plot of the data (typically what is used in science data)
ax1.scatter(Date,Temp,s=15,c='b',alpha=0.3)

# add the labels
ax1.set_xlabel("Date $(years)$",size=16)   # allows LaTeX style formatting
ax1.set_ylabel("Temperature $(C)$",size=16)   # allows LaTeX style formatting

# This increases the size of the tick labels
ax1.xaxis.set_tick_params(labelsize=12)
ax1.yaxis.set_tick_params(labelsize=12)
plt.show()

We can limit the range of the x-axis to make sure we are seeing the individual data points:

In [None]:
# READ IN DATA FILE
data = np.genfromtxt("munich_temperatures_average_with_bad_data.txt")
Date = data[:,0]
Temp = data[:,1] 

# PLOT TO VERIFY FILE READ 
# This creates a figure object and sets the size
fig1 = plt.figure(figsize=(10,6))
# This allows subplots, for now this is set to 1 plot (1x1, plot 1)
ax1 = fig1.add_subplot(111)

# This makes a scatter plot of the data (typically what is used in science data)
ax1.scatter(Date,Temp,s=15,c='b',alpha=0.3)

# add the labels
ax1.set_xlabel("Date $(years)$",size=16)   # allows LaTeX style formatting
ax1.set_ylabel("Temperature $(C)$",size=16)   # allows LaTeX style formatting

# This increases the size of the tick labels
ax1.xaxis.set_tick_params(labelsize=12)
ax1.yaxis.set_tick_params(labelsize=12)

# LIMIT TO 2 YEARS (mid-2001 to mid-2003)
ax1.set_xlim(2001.5,2003.5)
plt.show()
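The two cells above repeat the same code with a different x range. Since ``add_subplot(111)`` is shorthand for a layout of 1 row, 1 column, plot 1, the same call can place both views side by side in one figure. The sketch below uses synthetic data (a made-up seasonal curve, not the Munich file), and the names ``ax_full`` and ``ax_zoom`` are illustrative:

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-in for the temperature file: a made-up seasonal curve
date = np.linspace(1995., 2005., 1000)
temp = 10. + 10. * np.sin(2. * np.pi * date)

fig = plt.figure(figsize=(12, 5))

# 1 row, 2 columns, first panel: the full date range
ax_full = fig.add_subplot(1, 2, 1)
ax_full.scatter(date, temp, s=15, c='b', alpha=0.3)
ax_full.set_xlabel("Date $(years)$", size=16)
ax_full.set_ylabel("Temperature $(C)$", size=16)

# 1 row, 2 columns, second panel: the same data, zoomed in on two years
ax_zoom = fig.add_subplot(1, 2, 2)
ax_zoom.scatter(date, temp, s=15, c='b', alpha=0.3)
ax_zoom.set_xlabel("Date $(years)$", size=16)
ax_zoom.set_xlim(2001.5, 2003.5)
```

As in the cells above, ``plt.show()`` displays the figure.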

## Exercise 3

Now remake the plot using the masked data (from the exercises above), so that the bad data points are removed.

NOTE: the scale of the y-axis SHOULD change.

In [None]:
#code here