<a href="https://colab.research.google.com/github/m-edal/Earth-Env-DS-MSc-Course/blob/main/labs/W1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# W1: Review of Python: NumPy, Pandas, and Basic Operation of Environmental Datasets

- Contributors: Dr. Zhonghua Zheng, Yuan Sun
- Course Unit: Earth and Environmental Data Science (EART60702)
- Last modified date: 29 January, 2024

## Intended Learning Outcomes (ILOs)
- NumPy Array Proficiency: learn to create and manipulate NumPy arrays, understanding both one-dimensional and multi-dimensional array operations.
- Data Handling with Pandas: gain skills in importing, exporting, and manipulating data using Pandas DataFrames.
- Environmental Data Analysis: apply NumPy and Pandas skills to analyze environmental datasets and use these analyses in a project context.

## 1. NumPy (15 mins)
- NumPy (Numerical Python) is the fundamental package for scientific computing in Python: https://numpy.org/doc/stable/.

- NumPy is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems: https://numpy.org/doc/stable/user/absolute_beginners.html.

- NumPy can be used to perform a wide variety of mathematical operations on **arrays**.

In [1]:
# import package
import numpy as np

In [2]:
# check numpy version
np.__version__

'2.4.2'

### 1.1 Ways to create a NumPy 1-D array

In [3]:
# way1
a = np.array([1, 2, 3])
a

array([1, 2, 3])

In [4]:
# way2
b = np.zeros(2)
b

array([0., 0.])

In [5]:
# way3
c = np.ones(2)
c

array([1., 1.])

In [6]:
# way4
d = np.arange(4)
d

array([0, 1, 2, 3])

In [7]:
# specify the dtype (float, int, etc.)
# for more details on np.arange(), see: https://numpy.org/doc/stable/reference/generated/numpy.arange.html#numpy-arange
d_f = np.arange(4, dtype=float)
d_f

array([0., 1., 2., 3.])

In [8]:
# way5
e = np.arange(2, 9, 2)
e

array([2, 4, 6, 8])

In [9]:
# way6
f = np.linspace(0, 10, num=5)
f

array([ 0. ,  2.5,  5. ,  7.5, 10. ])

In [10]:
# way7: create an array from existing data
a0 = np.array([1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
a1 = a0[3:8]
a1

array([4, 5, 6, 7, 8])
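
One detail worth knowing about way7: a basic slice such as `a0[3:8]` is a view that shares memory with the original array, so writing to it also changes `a0`. The sketch below illustrates this and shows `.copy()` as one way to get an independent array; the names `view` and `copy` are only for illustration.

```python
import numpy as np

a0 = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

view = a0[3:8]          # a view: shares memory with a0
copy = a0[3:8].copy()   # an independent copy of the same elements

view[0] = 100           # this also changes a0[3]
copy[1] = -1            # this does not change a0

print(a0[3])                        # 100
print(a0[4])                        # 5
print(np.shares_memory(a0, view))   # True
print(np.shares_memory(a0, copy))   # False
```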

### 1.2 2-D or Multi-D array

In [11]:
a1D = np.array([1, 2, 3, 4])
a2D = np.array([[1, 2], [3, 4]])
a3D = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

a3D

array([[[1, 2],
        [3, 4]],

       [[5, 6],
        [7, 8]]])
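
To see how a multi-dimensional array is organised, it can help to inspect its attributes. The following is a minimal sketch using the same `a3D` values; `a_reshaped` is a name introduced here for illustration.

```python
import numpy as np

a3D = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(a3D.ndim)    # number of dimensions (axes): 3
print(a3D.shape)   # length along each axis: (2, 2, 2)
print(a3D.size)    # total number of elements: 8

# index one element: first block, second row, first column
print(a3D[0, 1, 0])  # 3

# reshape the 8 elements into a 2-D array with 2 rows and 4 columns
a_reshaped = a3D.reshape(2, 4)
print(a_reshaped.shape)  # (2, 4)
```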

### 1.3 Basic array operations

In [12]:
d1 = np.array([1, 2])
d2 = np.ones(2, dtype=int)

In [13]:
# addition
d1 + d2

array([2, 3])

In [14]:
# subtraction
d1 - d2

array([0, 1])

In [15]:
# broadcasting
d1 * 1.6

array([1.6, 3.2])
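
Broadcasting also works between arrays of different shapes, as long as the shapes are compatible. The sketch below uses made-up temperature values (the numbers and the names `temps`, `offsets`, and `corrected` are hypothetical) to add one correction per column to every row.

```python
import numpy as np

# hypothetical temperatures (degrees C): 2 days (rows) x 3 stations (columns)
temps = np.array([[10.0, 12.0, 14.0],
                  [11.0, 13.0, 15.0]])

# one hypothetical correction per station, shape (3,)
offsets = np.array([0.5, -0.5, 1.0])

# shapes (2, 3) and (3,) are compatible: offsets is applied to every row
corrected = temps + offsets
print(corrected)

# a scalar is broadcast to every element, so the shape is unchanged
temps_kelvin = temps + 273.15
print(temps_kelvin.shape)  # (2, 3)
```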

In [16]:
# sum
d1.sum()

np.int64(3)

In [17]:
# max
d1.max()

np.int64(2)

In [18]:
# min
d1.min()

np.int64(1)

Note:
- The product operator `*` operates elementwise on NumPy arrays.
- The matrix product can be computed with the `@` operator (Python >= 3.5) or with the `dot` function or method.

In [19]:
a = np.array([[1, 0],
              [0, 1]])
b = np.array([[4, 1],
              [2, 2]])

In [20]:
a * b

array([[4, 0],
       [0, 2]])

In [21]:
np.multiply(a, b)

array([[4, 0],
       [0, 2]])

In [22]:
a @ b

array([[4, 1],
       [2, 2]])

In [23]:
np.matmul(a, b)

array([[4, 1],
       [2, 2]])

In [24]:
a.dot(b)

array([[4, 1],
       [2, 2]])

In [25]:
np.dot(a, b)

array([[4, 1],
       [2, 2]])

## 2. Pandas (15 mins)
- pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive: https://pandas.pydata.org/docs/getting_started/overview.html
- When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you. pandas will help you to explore, clean, and process your data. In pandas, a data table is called a `DataFrame`.
- The two primary data structures of pandas, `Series` (1-dimensional) and `DataFrame` (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering.

In [26]:
import pandas as pd

In [27]:
# check pandas version
pd.__version__

'3.0.0'
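
As a quick illustration of the two data structures introduced above, the sketch below builds a `Series` and then a `DataFrame` that contains it. The values and the names `s`, `df_demo`, `temp_c`, and `rain_mm` are hypothetical.

```python
import pandas as pd

# a Series is a 1-D labelled array (values here are hypothetical)
s = pd.Series([3.2, 4.1, 5.0], index=["mon", "tue", "wed"], name="temp_c")
print(s["tue"])  # access a value by its label

# a DataFrame is a 2-D table; each of its columns is a Series sharing one index
df_demo = pd.DataFrame({"temp_c": s, "rain_mm": [0.0, 1.5, 0.2]})
print(type(df_demo["temp_c"]))
print(df_demo.shape)  # (3, 2)
```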

### 2.1 Importing and exporting data

First, let's learn how to create a DataFrame.

In [28]:
# create a DataFrame from a dictionary
df0 = pd.DataFrame(
    {"A": 1.0,
     "B": pd.Timestamp("20240128"),
     "C": pd.Series(1, index=list(range(4)), dtype = float),
     "D": pd.Categorical(["test", "train","foo","test"])
     }
)
df0

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2024-01-28 | 1.0 | test |
| 1 | 1.0 | 2024-01-28 | 1.0 | train |
| 2 | 1.0 | 2024-01-28 | 1.0 | foo |
| 3 | 1.0 | 2024-01-28 | 1.0 | test |


In [29]:
# create a dataframe from a numpy array
a = np.array([[-2.58289208,  0.43014843, -1.24082018, 1.59572603],
              [ 0.99027828, 1.17150989,  0.94125714, -0.14692469],
              [ 0.76989341,  0.81299683, -0.95068423, 0.11769564],
              [ 0.20484034,  0.34784527,  1.96979195, 0.51992837]])

df = pd.DataFrame(a)
df.head()

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | -2.582892 | 0.430148 | -1.24082 | 1.595726 |
| 1 | 0.990278 | 1.17151 | 0.941257 | -0.146925 |
| 2 | 0.769893 | 0.812997 | -0.950684 | 0.117696 |
| 3 | 0.20484 | 0.347845 | 1.969792 | 0.519928 |


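Export a DataFrame to a CSV file. Passing `index=False` leaves the row index out of the file; otherwise it is written as an extra, unnamed first column: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html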

In [30]:
df0.to_csv("foo.csv", index=False)
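
To see what `index=False` changes, the following sketch writes a small hypothetical DataFrame both ways and reads each result back. It writes to in-memory buffers (`io.StringIO`) rather than files only to keep the example self-contained; the names `df_demo`, `cols_with`, and `cols_without` are illustrative.

```python
import io
import pandas as pd

df_demo = pd.DataFrame({"A": [1.0, 2.0], "B": ["x", "y"]})

# write to in-memory text buffers instead of files
with_index = io.StringIO()
without_index = io.StringIO()
df_demo.to_csv(with_index)                  # default: index=True
df_demo.to_csv(without_index, index=False)  # row index is not written

# rewind the buffers and read them back
with_index.seek(0)
without_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)     # ['Unnamed: 0', 'A', 'B']
print(cols_without)  # ['A', 'B']
```

With the default setting, the saved index comes back as an extra `Unnamed: 0` column when the file is read again.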

Read the CSV file that we exported.

In [31]:
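# read the CSV file exported above
# (in Colab, upload your own CSV first if you want to read a different file)
df1 = pd.read_csv('foo.csv')

# if you want to specify the path (it must end with a path separator)
# path = 'XXX/XXX/XXX/XXX/'
# df.to_csv(path + 'pd.csv')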

In [32]:
df1

|   | A | B | C | D |
|---|---|---|---|---|
| 0 | 1.0 | 2024-01-28 | 1.0 | test |
| 1 | 1.0 | 2024-01-28 | 1.0 | train |
| 2 | 1.0 | 2024-01-28 | 1.0 | foo |
| 3 | 1.0 | 2024-01-28 | 1.0 | test |


### 2.2 Manipulating DataFrames

In [33]:
# name the dataframe column
name = ['first_column', 'second_column', 'third_column', 'fourth_column']
df2 = pd.DataFrame(a, columns = name)
df2.head()

|   | first_column | second_column | third_column | fourth_column |
|---|---|---|---|---|
| 0 | -2.582892 | 0.430148 | -1.24082 | 1.595726 |
| 1 | 0.990278 | 1.17151 | 0.941257 | -0.146925 |
| 2 | 0.769893 | 0.812997 | -0.950684 | 0.117696 |
| 3 | 0.20484 | 0.347845 | 1.969792 | 0.519928 |


In [34]:
# add a new column to an existing dataframe
df2['fifth_column'] = ['Hi', 'Hello', 'bonjour', 'nihao']
df2

|   | first_column | second_column | third_column | fourth_column | fifth_column |
|---|---|---|---|---|---|
| 0 | -2.582892 | 0.430148 | -1.24082 | 1.595726 | Hi |
| 1 | 0.990278 | 1.17151 | 0.941257 | -0.146925 | Hello |
| 2 | 0.769893 | 0.812997 | -0.950684 | 0.117696 | bonjour |
| 3 | 0.20484 | 0.347845 | 1.969792 | 0.519928 | nihao |


In [35]:
# select a single column; each column of a DataFrame is a pandas Series
df2_first_column = df2['first_column']
df2_first_column.head() # df2_first_column is a pandas series

0   -2.582892
1    0.990278
2    0.769893
3    0.204840
Name: first_column, dtype: float64

In [36]:
# select more than one column
df2_multi_column = df2[['first_column', 'third_column']]

df2_multi_column.head() # df2_multi_column is a dataframe

|   | first_column | third_column |
|---|---|---|
| 0 | -2.582892 | -1.24082 |
| 1 | 0.990278 | 0.941257 |
| 2 | 0.769893 | -0.950684 |
| 3 | 0.20484 | 1.969792 |


In [37]:
# know the shape of a dataframe
df2.shape

(4, 5)

In [38]:
# filter rows
above_1 = df2[df2['second_column']>1] # select rows whose 'second_column' value>1
above_1

|   | first_column | second_column | third_column | fourth_column | fifth_column |
|---|---|---|---|---|---|
| 1 | 0.990278 | 1.17151 | 0.941257 | -0.146925 | Hello |
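
Row filters can also combine several conditions. The sketch below is a minimal example on a small DataFrame with rounded, illustrative values; `df_demo`, `both`, and `subset` are names introduced here. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first_column": [-2.58, 0.99, 0.77, 0.20],
    "second_column": [0.43, 1.17, 0.81, 0.35],
})

# combine conditions with & (and) or | (or)
both = df_demo[(df_demo["first_column"] > 0) & (df_demo["second_column"] > 0.5)]
print(both)

# .loc selects rows by condition and columns by name in one step
subset = df_demo.loc[df_demo["second_column"] > 1, ["first_column"]]
print(subset)
```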


Optional: working with an xlsx file

In [39]:
# read an xlsx file
# you may change the Excel file and sheet_name according to your data
y = pd.read_excel('https://github.com/m-edal/Earth-Env-DS-MSc-Course/raw/main/labs/data/sample.xlsx', sheet_name = 'Sheet1')

# export a dataframe to a xlsx
df.to_excel('pd.xlsx', sheet_name = 'export')

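ImportError: `Import openpyxl` failed.  Use pip or conda to install the openpyxl package.

Note: this error means that the optional `openpyxl` package, which pandas uses to read and write xlsx files, is not installed. Install it (for example with `!pip install openpyxl`) and run the cell again.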

## 3. Basic Operation of Environmental Datasets (15 mins)

In [None]:
!wget https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip

In [None]:
!unzip -o jena_climate_2009_2016.csv.zip

In [None]:
df = pd.read_csv("jena_climate_2009_2016.csv")
df

You can then follow the TensorFlow time series tutorial to analyse this dataset: https://www.tensorflow.org/tutorials/structured_data/time_series
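
A common first step with a time series like this is to parse the timestamps and aggregate over time. The sketch below uses a tiny synthetic stand-in rather than the real file. The column names `Date Time` and `T (degC)` and the day.month.year timestamp format are assumptions based on the tutorial linked above, and the values are made up; check them against `df.columns` and `df.head()` before reusing this.

```python
import pandas as pd

# tiny synthetic stand-in for the downloaded file (assumed column names, made-up values)
df_demo = pd.DataFrame({
    "Date Time": ["01.01.2009 00:10:00", "01.01.2009 12:10:00",
                  "02.01.2009 00:10:00", "02.01.2009 12:10:00"],
    "T (degC)": [-8.0, -6.0, -4.0, -2.0],
})

# parse the timestamps (assumed format) and use them as the index
df_demo["Date Time"] = pd.to_datetime(df_demo["Date Time"], format="%d.%m.%Y %H:%M:%S")
df_demo = df_demo.set_index("Date Time")

# daily mean temperature
daily = df_demo["T (degC)"].resample("D").mean()
print(daily)
```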

## 4. Data for Project 1 (15 mins)

Please download the data here: https://www.dropbox.com/scl/fi/azzx0olpeyx45rixlsgdn/project_1.csv?rlkey=b4fj8cnmc4ytyezppfbhpky3t&dl=0

The definitions of **some** variables are available here: https://www.cesm.ucar.edu/community-projects/lens/data-sets

In [None]:
df = pd.read_csv("~/Downloads/project_1.csv") # You may change the path
df
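
Before choosing a research question, it may help to check the type and completeness of each variable. The helper `first_look` below is a hypothetical function written for this purpose, and `df_demo` is a made-up stand-in with invented column names; replace it with the DataFrame read from `project_1.csv`.

```python
import pandas as pd

def first_look(frame):
    """Return one row per column: dtype, number of missing values, number of unique values."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(),  # missing values are not counted
    })

# hypothetical stand-in; replace with the DataFrame read from project_1.csv
df_demo = pd.DataFrame({"var_a": [1.0, None, 3.0], "var_b": ["x", "y", "y"]})

summary = first_look(df_demo)
print(summary)

# summary statistics for the numeric columns
print(df_demo.describe())
```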

## 5. Homework

- Sign up for the GitHub [Student Developer Pack](https://education.github.com/pack).
- Think about the (research) question for Project 1.