<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Fundamental Python Libraries for Data Handling

This notebook shows how to import data from external files, write data back to files, and work with the most common Python libraries for handling data.

In [None]:
!pip install pandas
!pip install numpy
!pip install xarray
!pip install py7zr
!pip install matplotlib

In [None]:
import pandas as pd
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
import py7zr

## 1. Reading and Writing Text Files

Up until now, we’ve worked only with data created inside our own code. But in real-world projects, you’ll mainly work with existing data and files. Let's see how to open and read text files, as well as how to save (write) new files.

We can use the `with` statement together with `f.write()` **to save a text file**:

In [None]:
new_string = "These are my notes on experiment A."

# Save as text file (to a relative path) in write mode ("w"):
with open("../Data/Notes.txt", "w") as f:
    f.write(new_string)

**To import a text file**, we use almost the same structure, but change the mode to "r" (read):

In [None]:
# Open the file "Iris.csv" (from a relative path) in read mode ("r"):
with open("../Data/Iris.csv", "r") as f:
    data = f.read()

print(data)

This reads the contents of the .csv file and stores them as a single string in the variable `data`. Notice, however, that this kind of import does **not account for the comma separation**.

If you want to read the file in a **structured** way using the comma (",") as a separator, then instead of using `f.read()`, you should use the `csv` module or the `pandas` library. These tools understand the structure of CSV files and will split the data into rows and columns for you.
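
As a minimal sketch of the structured approach, the cell below uses the standard-library `csv` module. To stay self-contained it reads from a small in-memory string instead of a file on disk; the column names and values are made up for illustration and are not the real Iris data.

```python
import csv
import io

# Hypothetical CSV content standing in for a file on disk
csv_text = "species,petal_length\nsetosa,1.4\nvirginica,5.1\n"

# io.StringIO behaves like an opened text file, so csv.reader can consume it
with io.StringIO(csv_text) as f:
    reader = csv.reader(f, delimiter=",")
    header = next(reader)  # the first row holds the column names
    rows = list(reader)    # remaining rows, each a list of strings

print(header)
print(rows)
```

Note that `csv.reader` returns every value as a string, so numbers still need to be converted (for example with `float()`). `pandas` does this conversion for you.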

*Note: Terms like module, package, and library are often used interchangeably in Python. While they all refer to reusable Python software, there are subtle differences in structure and scale. A module is the simplest unit, just a single .py file containing Python code. When several modules are grouped together in a folder, this is known as a package. A library, on the other hand, is a more general term. It typically refers to a collection of related packages and modules that together provide tools for a specific purpose.*

## 2. Automatically Loading Multiple Text Files with `glob`

Suppose you have a folder with multiple .txt files and want to **load all of them automatically** for analysis in Python. The `glob` module lets you search a folder for files whose names match a pattern.

This is handy for batch processing or analyzing many files at once without opening each one manually. You can also combine `glob` with `os` or `pathlib` to interact with the operating system (e.g., navigating folders, creating/deleting files).

In this example, we’ll load all .txt files from a folder and print their contents:

In [None]:
# Import the glob module (part of Python’s standard library)
import glob

# Get all text files in the "Samples" folder
files = glob.glob("../Data/Samples/*.txt")

# Read and display contents of each file
for filepath in files:
    with open(filepath, "r", encoding="utf-8") as f:
        content = f.read()
        print(content)
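
The same idea can be sketched with `pathlib`, whose `Path.glob()` method does the pattern matching. The cell below is only an illustration: it creates a temporary folder with made-up file names so that it runs without the `Samples` folder.

```python
import tempfile
from pathlib import Path

# Create a temporary folder with a few small files (hypothetical sample data)
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "sample_a.txt").write_text("first note", encoding="utf-8")
    (folder / "sample_b.txt").write_text("second note", encoding="utf-8")
    (folder / "ignore_me.csv").write_text("x,y", encoding="utf-8")

    # Path.glob yields the matching paths; sorted() gives a reproducible order
    contents = {}
    for filepath in sorted(folder.glob("*.txt")):
        contents[filepath.name] = filepath.read_text(encoding="utf-8")

print(contents)
```

Sorting the matches is a deliberate choice here: the order in which files are returned depends on the file system, so sorting keeps the results reproducible.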

## 3. Using `pandas` for Tabular Data

`pandas` is the most widely used Python library for working with **tabular data** (data arranged in rows and columns), commonly found in files like .csv or Excel spreadsheets. `pandas` reads such files into **dataframes**, which keep both the data and its structure intact. This allows you to efficiently explore, organize, and analyze data within your code.

**Dataframes** are similar to Excel spreadsheets or database tables. They have a **2-dimensional** data structure and **labeled axes (rows and columns)**. These are **indexed** for efficient data retrieval.
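
To make the labeled axes concrete, here is a small sketch with a hand-made DataFrame. The sample IDs, column names, and values are invented for illustration.

```python
import pandas as pd

# Hypothetical measurements, with sample IDs as row labels
samples = pd.DataFrame(
    {"length_cm": [5.1, 4.9, 6.3], "width_cm": [3.5, 3.0, 3.3]},
    index=["s1", "s2", "s3"],
)

print(samples.shape)                  # (number of rows, number of columns)
print(samples.loc["s2", "width_cm"])  # look up a value by row and column label
print(samples.iloc[0, 0])             # look up a value by integer position
```

`.loc` selects by label and `.iloc` by position, so both axes can be addressed either way.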

<img src="../Images/dataframe.png" style="width: 300px;">

Let's use `pandas` for **importing** the text file from Sect. 1:

In [None]:
Iris = pd.read_csv("../Data/Iris.csv")
print(Iris)

To **create a DataFrame** from a dictionary:

In [None]:
names_dict = {"names": ["Alice", "Mary", "Kim", "Deniz", "Carla", "Linus"]}
df = pd.DataFrame(names_dict)
print(df)

**Saving** a DataFrame to CSV:

In [None]:
df.to_csv("../Data/names.csv", index=False)

Setting `index=False` prevents pandas from writing row indices to the file.
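
The effect of `index=False` is easiest to see side by side. In this sketch, `to_csv()` is called without a file path, in which case it returns the CSV text as a string instead of writing a file. The small DataFrame is a made-up example.

```python
import pandas as pd

people = pd.DataFrame({"names": ["Alice", "Mary"]})

# Without a path, to_csv() returns the CSV content as a string
with_index = people.to_csv()                # default: index=True
without_index = people.to_csv(index=False)

print(with_index)
print(without_index)
```

With the default setting, the output gains an extra, unnamed first column holding the row numbers.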

## 4. Using `numpy` for Fast Numerical Computing

`numpy` is **the foundational library for numerical computing**, supporting large, multi-dimensional **arrays** and vectorized operations. An array is a structure for storing elements of the same type. Arrays can be one-dimensional or multi-dimensional (like a matrix).

In fact, many Python libraries are built on top of `numpy`, including `pandas`.
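
A short sketch of what "vectorized" and "same type" mean in practice. The temperature values are hypothetical.

```python
import numpy as np

# Hypothetical temperatures in degrees Celsius
celsius = np.array([12.5, 18.0, 21.3, 9.8])

# Vectorized: the arithmetic is applied to every element at once, no loop needed
fahrenheit = celsius * 9 / 5 + 32

# All elements share one type: mixing an int and a float gives a float array
mixed = np.array([1, 2.5])

print(fahrenheit)
print(celsius.dtype, mixed.dtype)
```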

<img src="../Images/array.png" style="width: 600px;"> <img src="../Images/pandas_numpy.png" style="width: 300px;">

Let's **import** some data (temperature anomaly time series) as a 2D NumPy array:

In [None]:
time_series = np.loadtxt('../Data/NOAA_time_series.csv', skiprows=5, delimiter=',')
print(time_series[:5]) # slice the array to print the first 5 "rows" (along axis 0)

To **create** a 2D NumPy array from scratch:

In [None]:
data = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(data.shape)

NumPy can also handle **multi-dimensional data**, such as 3D arrays, which are useful for representing things like image stacks or time-varying data. Here's a basic example:

In [None]:
# Create a 3D array: 2 "layers", each 3x3
data = np.array([
    [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]],

    [[10, 11, 12],
     [13, 14, 15],
     [16, 17, 18]]
])

print(data.shape)
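
Indexing works the same way in three dimensions, with one index per axis. The sketch below rebuilds the same 2 x 3 x 3 layout with `np.arange` and `reshape` (a shortcut chosen here for brevity) and picks out a few parts of it.

```python
import numpy as np

# Same layout as above: 2 layers, each 3x3, filled with the numbers 1 to 18
stack = np.arange(1, 19).reshape(2, 3, 3)

first_layer = stack[0]                 # the first 3x3 layer
center_values = stack[:, 1, 1]         # the middle element of every layer
layer_means = stack.mean(axis=(1, 2))  # one mean value per layer

print(first_layer)
print(center_values)
print(layer_means)
```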

### Converting Between NumPy Arrays and pandas DataFrames

NumPy arrays are great for fast numerical operations. However, they lack labels (like column names), which makes DataFrames more convenient for data exploration and analysis.

Here is how you can **convert** between the two data structures:

In [None]:
# Convert NumPy array to pandas DataFrame
df = pd.DataFrame(time_series, columns=['Year', 'Anomaly'])
print(df.head()) # .head prints the first 5 rows of the df by default

In [None]:
# Convert DataFrame back to NumPy array
array_again = df.to_numpy()  # recommended over the older df.values attribute
print(array_again[:5]) 

## 5. Complex File Structures: `xarray` for netCDF Files

`xarray` is **like `pandas` for netCDF**: a powerful library for handling and analyzing labeled multi-dimensional arrays, commonly used for time series data in the Earth sciences.

Let's see how 3D climate data can be easily imported using `xarray`. The dataset contains two variables (t2m and tp), each with three dimensions: (time, latitude, longitude).
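
To get a feel for this structure without the data file, the sketch below builds a tiny stand-in dataset in memory. Only the variable and dimension names follow the description above; the coordinates and the random values are assumptions made for illustration and are NOT real climate data.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic stand-in: made-up coordinates and random values
rng = np.random.default_rng(seed=0)
time = pd.date_range("2020-01-01", periods=2, freq="D")
latitude = [50.0, 51.0, 52.0]
longitude = [8.0, 9.0, 10.0, 11.0]
dims = ("time", "latitude", "longitude")
shape = (len(time), len(latitude), len(longitude))

ds = xr.Dataset(
    data_vars={
        "t2m": (dims, 280 + rng.random(shape)),
        "tp": (dims, rng.random(shape) / 1000),
    },
    coords={"time": time, "latitude": latitude, "longitude": longitude},
)

print(ds["t2m"].dims)                 # the dimension names of one variable
first_step = ds["t2m"].isel(time=0)   # 2D slice at the first time step
time_mean = ds["tp"].mean(dim="time") # average over the time dimension
print(first_step.shape, time_mean.dims)
```

Because the dimensions are named, operations such as `mean(dim="time")` refer to an axis by name rather than by position.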

If the .7z archive has not been unpacked yet, extract it first using the `py7zr` library:

In [None]:
with py7zr.SevenZipFile('../Data/ERA5_snippet.7z', mode='r', password='secret') as archive:
    archive.extractall(path='../Data/')

Import file:

In [None]:
ERA5 = xr.open_dataset('../Data/ERA5_snippet.nc')
print(ERA5)