<a href="https://colab.research.google.com/github/m-edal/Earth-Env-DS-MSc-Course/blob/main/labs/W2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W2: NumPy, pandas, xarray

- Contributors: Dr. Zhonghua Zheng, Yuan Sun
- Course Unit: Earth and Environmental Data Science (EART60702)
- Last modified date: 4 February, 2024

## Intended Learning Outcomes
- NumPy and pandas: understand and apply NumPy for numerical computations, leveraging vectorization and broadcasting, along with pandas for handling and analyzing data.
- Time Series Analysis and Visualization: employ pandas for time series data analysis using datetime functionalities and visualize the results with pandas' built-in plotting tools.
- xarray: conduct data operations on multidimensional datasets using xarray, integrating it with NumPy and pandas workflows.

## 1. NumPy and pandas (20 mins)
**NumPy:**
- NumPy (Numerical Python) is the fundamental package for scientific computing in Python: https://numpy.org/doc/stable/.
- NumPy is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems: https://numpy.org/doc/stable/user/absolute_beginners.html.
- NumPy can be used to perform a wide variety of mathematical operations on arrays.

**pandas:**
- pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive: https://pandas.pydata.org/docs/getting_started/overview.html
- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a DataFrame.
- The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.

In [1]:
# import package
import numpy as np
import pandas as pd
print(np.__version__)
print(pd.__version__)

2.4.2
3.0.0


### 1.1 reshaping, indexing, and slicing

In [2]:
arr = np.arange(24)
print(arr.shape)
arr

(24,)


array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23])

In [3]:
# reshape
arr = arr.reshape(6,4)
print(arr.shape)
arr

(6, 4)


array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [4]:
sliced_arr0 = arr[1,1] # Element at row 1, column 1
sliced_arr0

np.int64(5)

In [5]:
sliced_arr0 = len(arr[arr>5]) # Number of elements greater than 5 (boolean mask)
sliced_arr0

18

In [6]:
sliced_arr0 = arr[1:] # Rows 1 to the end
sliced_arr0

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [7]:
sliced_arr1 = arr[1:4] # Rows 1 to 3
sliced_arr1

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [8]:
sliced_arr1_1 = arr[1:4, :] # Rows 1 to 3, All columns
sliced_arr1_1

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [9]:
sliced_arr2 = arr[1:4,2:4] # Rows 1 to 3, Columns 2 to 3
sliced_arr2

array([[ 6,  7],
       [10, 11],
       [14, 15]])

In [10]:
sliced_arr3 = arr[1:4, 2] # Rows 1 to 3, Column 2
sliced_arr3

array([ 6, 10, 14])

In [11]:
sliced_arr4 = arr[1:4:2] # Rows 1 and 3 (step of 2)
sliced_arr4

array([[ 4,  5,  6,  7],
       [12, 13, 14, 15]])

### 1.2 find the indices (row and column)

In [12]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [13]:
# find a certain value
value_to_find = 15
(row_indices, col_indices) = np.where(arr == value_to_find)
print(row_indices, col_indices)

[3] [3]


How can we find the indices (row and column) of the maximum value?
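
One possible approach, given as a minimal sketch rather than the only answer: `np.argmax` returns the position of the maximum in the flattened array, and `np.unravel_index` converts that position into a (row, column) pair. The array is rebuilt here so the block runs on its own; the names `flat_index`, `row`, and `col` are illustrative.

```python
import numpy as np

arr = np.arange(24).reshape(6, 4)  # same 6x4 array as in section 1.1

# np.argmax returns the position of the maximum in the flattened array
flat_index = np.argmax(arr)

# np.unravel_index converts the flat position into (row, column)
row, col = np.unravel_index(flat_index, arr.shape)
print(row, col)

# alternative: reuse np.where, which returns every position holding the maximum
rows, cols = np.where(arr == arr.max())
print(rows, cols)
```

The `np.where` variant is useful when the maximum may occur more than once, since `np.argmax` reports only the first occurrence.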

### 1.3 vectorization

In [14]:
arr = np.random.rand(1000000)

In [15]:
%%time
squares_loop = [x**2 for x in arr]

CPU times: total: 266 ms
Wall time: 268 ms


In [16]:
%%time 
squares_vectorized = arr**2

CPU times: total: 0 ns
Wall time: 3.31 ms


### 1.4 broadcasting

In [17]:
a = np.random.rand(100000000)
b1 = 5
b2 = np.full(100000000, 5)
b2

array([5, 5, 5, ..., 5, 5, 5], shape=(100000000,))

In [18]:
%%time
c1 = a*b1
c1

CPU times: total: 3.53 s
Wall time: 3.93 s


array([1.60161186, 1.39081297, 1.01801909, ..., 4.10923003, 3.15633526,
       4.22002058], shape=(100000000,))

In [19]:
%%time
c2 = a*b2
c2

CPU times: total: 8.7 s
Wall time: 11.1 s


array([1.60161186, 1.39081297, 1.01801909, ..., 4.10923003, 3.15633526,
       4.22002058], shape=(100000000,))
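
In the timings above, multiplying by the scalar `b1` gives the same result as multiplying by the full array `b2` and runs faster, plausibly because the scalar is broadcast without a second large array having to be read. Broadcasting also works between arrays of different shapes. The sketch below uses small illustrative arrays (`col`, `row`, and `table` are names chosen for this example) to show how shapes `(3, 1)` and `(4,)` combine into `(3, 4)`.

```python
import numpy as np

col = np.arange(3).reshape(3, 1)  # shape (3, 1): values 0, 1, 2 as a column
row = np.arange(4)                # shape (4,):   values 0, 1, 2, 3

# shapes (3, 1) and (4,) are broadcast to a common shape (3, 4);
# no explicit loop or manual tiling is needed
table = col * 10 + row
print(table.shape)
print(table)
```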

### 1.5 create series and dataframe

In [20]:
# create a range of dates (a DatetimeIndex)
dates = pd.date_range('20240208', periods = 6)
dates

DatetimeIndex(['2024-02-08', '2024-02-09', '2024-02-10', '2024-02-11',
               '2024-02-12', '2024-02-13'],
              dtype='datetime64[us]', freq='D')

In [21]:
# create a DataFrame that uses the dates as its index
df = pd.DataFrame(np.random.randn(6,4), 
                  index = dates, 
                  columns=["a", "b", "c", "d"])
df

| | a | b | c | d |
|---|---|---|---|---|
| 2024-02-08 | 0.796912 | -1.024625 | -1.179377 | 0.440488 |
| 2024-02-09 | -1.197907 | -0.545967 | 1.600085 | 1.331685 |
| 2024-02-10 | -0.202527 | -0.881864 | -0.735409 | -0.604624 |
| 2024-02-11 | -1.440737 | -0.316153 | 0.798047 | 0.362789 |
| 2024-02-12 | -1.211179 | -0.209103 | 0.782553 | 0.910664 |
| 2024-02-13 | 2.133906 | 1.975542 | 1.544847 | 0.136753 |


In [22]:
# show the index of a dataframe
df.index

DatetimeIndex(['2024-02-08', '2024-02-09', '2024-02-10', '2024-02-11',
               '2024-02-12', '2024-02-13'],
              dtype='datetime64[us]', freq='D')

In [23]:
# quickly describe the dataframe
df.describe()

| | a | b | c | d |
|---|---|---|---|---|
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | -0.186922 | -0.167028 | 0.468458 | 0.429626 |
| std | 1.414048 | 1.095872 | 1.167145 | 0.6641 |
| min | -1.440737 | -1.024625 | -1.179377 | -0.604624 |
| 25% | -1.207861 | -0.79789 | -0.355919 | 0.193262 |
| 50% | -0.700217 | -0.43106 | 0.7903 | 0.401638 |
| 75% | 0.547053 | -0.235866 | 1.358147 | 0.79312 |
| max | 2.133906 | 1.975542 | 1.600085 | 1.331685 |


In [24]:
# transposition
df.T

| | 2024-02-08 | 2024-02-09 | 2024-02-10 | 2024-02-11 | 2024-02-12 | 2024-02-13 |
|---|---|---|---|---|---|---|
| a | 0.796912 | -1.197907 | -0.202527 | -1.440737 | -1.211179 | 2.133906 |
| b | -1.024625 | -0.545967 | -0.881864 | -0.316153 | -0.209103 | 1.975542 |
| c | -1.179377 | 1.600085 | -0.735409 | 0.798047 | 0.782553 | 1.544847 |
| d | 0.440488 | 1.331685 | -0.604624 | 0.362789 | 0.910664 | 0.136753 |


### 1.6 questions

Q1: How to convert a 1-D array into a 2-D array?
- https://numpy.org/doc/stable/user/absolute_beginners.html
- This can be done with **np.newaxis** or **np.expand_dims**.
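
As a minimal sketch of Q1 (the array `a` and the result names are illustrative), the block below adds a new axis with `np.newaxis` and `np.expand_dims`, and shows `reshape` as a third option when the target shape is known.

```python
import numpy as np

a = np.arange(6)                      # 1-D array, shape (6,)

row_vec = a[np.newaxis, :]            # new leading axis  -> shape (1, 6)
col_vec = a[:, np.newaxis]            # new trailing axis -> shape (6, 1)
col_vec2 = np.expand_dims(a, axis=1)  # same result as col_vec
grid = a.reshape(2, 3)                # explicit target shape -> (2, 3)

print(row_vec.shape, col_vec.shape, col_vec2.shape, grid.shape)
```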

Q2: How to calculate the mean squared error?
- Mean squared error (MSE) is an important metric in regression analysis.
```
y_pred = np.array([1.0, 2.0, 3.0, 4.0])
y_true = np.array([1.1, 1.9, 3.1, 3.9])
```
- https://numpy.org/doc/stable/user/absolute_beginners.html
- Implementing the formula in NumPy, where `n` is the number of samples: `error = (1/n) * np.sum(np.square(y_pred - y_true))`
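
A minimal runnable sketch of the formula, using the `y_pred` and `y_true` arrays given above; the names `mse_formula` and `mse_mean` are illustrative. The second line computes the same quantity with `np.mean`.

```python
import numpy as np

y_pred = np.array([1.0, 2.0, 3.0, 4.0])
y_true = np.array([1.1, 1.9, 3.1, 3.9])

n = len(y_true)  # number of samples

# direct translation of the formula: (1/n) * sum of squared differences
mse_formula = (1 / n) * np.sum(np.square(y_pred - y_true))

# equivalent shorthand: the mean of the squared differences
mse_mean = np.mean((y_pred - y_true) ** 2)

print(mse_formula, mse_mean)  # both are approximately 0.01
```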


Q3: How to select the "a" column for the date "2024-02-08" in `df`?

```python
df.iloc[0,0] # if we know the position
df.loc["2024-02-08", "a"] # or df['a'].loc['2024-02-08']
```

Q4: How to get the positive elements from `df`?

```python
df[df>0] # keep the positive values; all other entries become NaN
df[df['a']>0] # get the rows of a > 0; or use df.query('a>0')
```

## 2. Time Series Analysis and Visualization (15 mins)

### 2.1 basic datetimes

In [25]:
np.array(['2024-02-08', '2024-02-09', '2024-02-10'], dtype='datetime64')

array(['2024-02-08', '2024-02-09', '2024-02-10'], dtype='datetime64[D]')

### 2.2 parsing time series information from various sources 

In [26]:
import datetime
dti = pd.to_datetime(["02/01/2024", 
                      np.datetime64("2024-02-02"), 
                      datetime.datetime(2024, 2, 3)])
dti

DatetimeIndex(['2024-02-01', '2024-02-02', '2024-02-03'], dtype='datetime64[us]', freq=None)

In [27]:
ts = pd.Series(np.random.randn(29), index=pd.date_range("2024-02-01", "2024-02-29"))
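
A short sketch of what the datetime index makes possible for a series like `ts`, assuming the same daily February 2024 index. The series is rebuilt so the block runs on its own, and the names `first_week`, `weekly_mean`, and `rolling_mean` are illustrative.

```python
import numpy as np
import pandas as pd

ts = pd.Series(np.random.randn(29),
               index=pd.date_range("2024-02-01", "2024-02-29"))

# label-based slicing with date strings includes both endpoints
first_week = ts["2024-02-01":"2024-02-07"]

# resample the daily values to weekly means
weekly_mean = ts.resample("W").mean()

# 7-day rolling mean; the first 6 entries are NaN because the window is incomplete
rolling_mean = ts.rolling(window=7).mean()

print(len(first_week))
print(weekly_mean)
print(rolling_mean.isna().sum())
```

Each result is itself a Series with a datetime index, so calling `.plot()` on it draws time on the X-axis (this requires matplotlib to be installed).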

### 2.3 questions
Q1: Please create a NumPy array that includes all the dates in February 2024.


Q2: Please create a pandas Series that includes all the dates in February 2024.
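
One possible approach to Q1 and Q2, given as a sketch rather than the only answer; the names `feb_np` and `feb_pd` are illustrative. Note that the stop date in `np.arange` is exclusive, whereas `pd.date_range` includes both endpoints.

```python
import numpy as np
import pandas as pd

# Q1: NumPy array of daily dates; the stop date (1 March) is excluded
feb_np = np.arange("2024-02-01", "2024-03-01", dtype="datetime64[D]")

# Q2: pandas Series holding the same dates; both endpoints are included
feb_pd = pd.Series(pd.date_range("2024-02-01", "2024-02-29", freq="D"))

print(feb_np.shape, len(feb_pd))  # 2024 is a leap year, so 29 days each
```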


Q3: Please use `np.random.randn` to create a pandas Series (Y-axis) with the dates from Q2 as the index (X-axis), then produce a line plot and a scatter plot.


## 3. xarray (15 mins)
- xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like multidimensional arrays, which allows for a more intuitive, more concise, and less error-prone developer experience: https://docs.xarray.dev/en/stable/index.html
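- Compared with a plain NumPy array, an xarray object carries named dimensions and coordinate labels, so data can be selected by name (for example `lon=10`) rather than by integer position.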

In [28]:
import xarray as xr
xr.__version__

ModuleNotFoundError: No module named 'xarray'

In [None]:
!conda install -c conda-forge xarray -y

### 3.1 create a DataArray 
- data, dimensions (optional), coordinates (optional)

In [None]:
# create an xarray DataArray
data = np.random.randn(3,4,3) # create data with shape (time, lat, lon)
lat = [-20, -10, 10, 20]
lon = [10, 20, 30]
time = pd.date_range("2023-01-01", periods=3)
array = xr.DataArray(data, coords = [time, lat, lon], 
                     dims=['time', 'lat', 'lon'], 
                     name = "foo") # 3D array ('time', 'lat', and 'lon' are the dimension names)
array

### 3.2 indexing and selecting data

In [None]:
array[2:]

In [None]:
array.sel(lon=10)

### 3.3 deal with NetCDF

In [None]:
# Export a netcdf file
array.to_netcdf('output.nc')

In [None]:
# read in a netcdf file
ds = xr.open_dataset('output.nc')
ds

### 3.4 check NetCDF basic information

In [None]:
ds.dims

In [None]:
ds.attrs

In [None]:
ds.coords

In [None]:
ds.data_vars

Please fork the repo (https://github.com/m-edal/Earth-Env-DS-MSc-Course/tree/main) and add your project description to the `README.md` file.

### 3.5 questions
Q1: Please provide a figure of `foo` where the X-axis is `lon`, the Y-axis is `lat`, and the plotted values are the mean over `time`.


Q2: Please provide a figure of `foo` where the X-axis is `time` and the Y-axis is the mean value over `lat` and `lon`.
