# Processing data with pandas II

<div class="alert alert-warning"><b>Attention</b><br/>

Students can follow the lesson and fill in their student notebooks using Binder.<br/>
<a href="https://mybinder.org/v2/gh/NIGS-GeoPython-2022/notebooks/HEAD?labpath=L6%2Fadvanced-data-processing-with-pandas_eq.ipynb"><img alt="Binder badge" src="https://img.shields.io/badge/launch-binder-red.svg" style="vertical-align:text-bottom"></a>
</div>

This week we will continue developing our skills using [pandas](https://pandas.pydata.org/) to process real data. 

## Motivation

In this lesson, we will use our data manipulation and analysis skills to analyze a seismic catalog and investigate the claim that the Philippines experiences 20 earthquakes a day. 
Along the way we will cover a number of useful techniques in pandas including:

- renaming columns
- iterating data frame rows and applying functions
- data aggregation
- repeating the analysis task for several input files

## Input data

In the lesson this week we are using earthquake data from the PHIVOLCS website. Making data open-access is a fairly newly accepted concept in the Philippines, so sometimes data can come in... inconvenient formats.

## Downloading the data

The first step for today's lesson is to get the data.

PHIVOLCS has an [Earthquake Information page](https://www.phivolcs.dost.gov.ph/index.php/earthquake/earthquake-information3) where bulletins of newly detected earthquakes are summarized. Thanks to the tabular format, you can easily copy the data into spreadsheet software and save it as a .csv file (which can be read in easily with pandas!). The data folder contains an example of a csv file with data from January to March 2021.

## Reading the data

In order to get started, let's first import pandas: 

In [None]:
import pandas as pd

At this point, we can already have a quick look at the data file `pivs202101.csv` and how it is structured. We can notice at least one thing we need to consider when reading in the data:

```{admonition} Column name format
Depending on your spreadsheet software of choice, you may notice '\n' in the column names. If you recall earlier lessons, this is the code for printing text on the next line. This is not useful for pandas, so we are going to get rid of these with the `.columns.str.replace()` function.

```

In [None]:
# Define relative path to the file
fp = r"data/eq_data/pivs202101.csv"

# Read data
data = pd.read_csv(fp)

# replace the \n in the column names with '' to delete them
data.columns = data.columns.str.replace('\n', '')
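To see the effect of this clean-up in isolation, here is a minimal sketch on a made-up one-row table. The column names and values are assumptions chosen to mimic the exported headers; they are not read from the real file.

```python
import pandas as pd

# Toy table whose headers contain a line break, as they might after a spreadsheet export
toy = pd.DataFrame({"Latitude\n(ºN)": [14.6], "Depth\n(km)": [10]})

# Remove the literal newline character from every column name
# (regex=False makes it explicit that we match the plain character)
toy.columns = toy.columns.str.replace("\n", "", regex=False)

print(list(toy.columns))
```

Passing `regex=False` is optional here, but it documents that no pattern matching is intended.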

Let's see how the data looks by printing the first five rows with the `head()` function:

All seems ok. However, we won't be needing the location column to analyze the statistics of the catalog.  We can check all column names by running `data.columns`:

### Reading in the data once again

This time, we will read in only some of the columns using the `usecols` parameter. Let's read in columns that might be somehow useful to our analysis, or at least that contain some values that are meaningful to us, including the date, latitude, longitude, depth, and magnitude.

In [None]:
# Read in only selected columns
data = pd.read_csv(
    fp,
    usecols=["Date - Time\n(Philippine Time)", "Latitude\n(ºN)", "Longitude\n(ºE)", "Depth\n(km)", "Mag"],
)
# replace the \n in the column names with '' to delete them
data.columns = data.columns.str.replace('\n', '')

# Check the dataframe
data.head()

Okay so we can see that the data was successfully read to the DataFrame.

## Renaming columns

As we saw above, some of the column names are a bit awkward and difficult to interpret. Luckily, it is easy to alter labels in a pandas DataFrame using the [rename](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) function. In order to change the column names, we need to tell pandas how we want to rename the columns using a dictionary that lists old and new column names.

Let's first check again the current column names in our DataFrame:

In [None]:
data.columns

<div class="alert alert-info"><b>Dictionaries</b><br/>

A [dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries) is a specific data structure in Python for storing key-value pairs. During this course, we will use dictionaries mainly when renaming columns in a pandas DataFrame, but dictionaries are useful for many different purposes! For more information about Python dictionaries, check out [this tutorial](https://realpython.com/python-dicts/).
</div>

We can define the new column names using a [dictionary](https://www.tutorialspoint.com/python/python_dictionary.htm) where we list "`key: value`" pairs, in which the original column name (the one which will be replaced) is the key and the new column name is the value.

- Let's change the following:
   
   - `Date - Time(Philippine Time)` to `TIME`
   - `Latitude(ºN)` to `LAT`
   - `Longitude(ºE)` to `LON`
   - `Depth(km)` to `DEP`

In [None]:
# Create the dictionary with old and new names
new_names = {"Date - Time(Philippine Time)": "TIME", "Latitude(ºN)": "LAT", "Longitude(ºE)": "LON", "Depth(km)": "DEP"}

# Let's see what the variable new_names looks like
new_names

In [None]:
# Check the data type of the new_names variable
type(new_names)

From above we can see that we have successfully created a new dictionary. 

Now we can change the column names by passing that dictionary using the parameter `columns` in the `rename()` function:

In [None]:
# Rename the columns
data = data.rename(columns=new_names)

# Print the new columns
print(data.columns)

Perfect, now our column names are easier to understand and use. 
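As a small aside, the sketch below shows two properties of `rename()` on a made-up table (the values are illustrative assumptions): keys in the dictionary that do not match any column are ignored by default, and the original DataFrame is left untouched unless you assign the result.

```python
import pandas as pd

# Toy table with two of the awkward names and one name we want to keep
toy = pd.DataFrame({"Latitude(ºN)": [14.6], "Longitude(ºE)": [121.0], "Mag": [2.1]})

# "Depth(km)" is not a column of the toy table; rename() skips it by default
mapping = {"Latitude(ºN)": "LAT", "Longitude(ºE)": "LON", "Depth(km)": "DEP"}

renamed = toy.rename(columns=mapping)

print(list(renamed.columns))
print(list(toy.columns))  # the original table still has the old names
```

This is why the lesson assigns the result back with `data = data.rename(columns=new_names)`.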

### Check your understanding

For magnitude, PHIVOLCS generally reports the surface-wave magnitude ($M_s$), as opposed to $m_b$ from the USGS data you looked at in the last lesson. As you might have guessed at this point, we will do some conversions later. $M_s$ can be converted to moment magnitude $M_w$ using the following equation [(Lolli et al., 2014)](https://academic.oup.com/gji/article/199/2/805/616348?login=false#86403295):

$$
M_w = \exp(2.133 + 0.063 M_s) - 6.205, \quad M_s \leq 5.5
$$
$$
M_w = \exp(-0.109 + 0.229 M_s) + 2.586, \quad M_s > 5.5
$$

## Data properties

As we learned last week, it's always a good idea to check basic properties of the input data before proceeding with the data analysis. Let's check the:

- Number of rows and columns

- Top and bottom rows

- Data types of the columns

- Descriptive statistics

## Using your own functions in pandas 

Now it's again time to determine the equivalent moment magnitude! Yes, we have done this before, but this time we will learn how to apply our own functions to data in a pandas DataFrame. Moment magnitude is a useful quantity because it is more closely related to the energy released in an earthquake (something you can code as long as you know the relevant equations!).

**We will define a function for the magnitude conversion, and apply this function to the surface-wave magnitude value on each row of the DataFrame. Output magnitude values will be stored in a new column called** `Mw`.

We will first see how we can apply the function row-by-row using a `for` loop and then we will learn how to apply the method to all rows more efficiently all at once.

### Defining the function

For both of these approaches, we first need to define our magnitude conversion function following the equations above:

In [None]:
def Ms_to_Mw(Ms):
    """Function to convert surface magnitude to moment magnitude (based on Lolli et al., 2014)

    Parameters
    ----------

    Ms: float
        Surface wave magnitude

    Returns
    -------

    Moment magnitude (float)
    """
    import math
    
    # Apply the piecewise relation: one branch for Ms <= 5.5, the other for Ms > 5.5
    if Ms <= 5.5:
        Mw = math.exp(2.133 + (0.063 * Ms)) - 6.205
    
    elif Ms > 5.5:
        Mw = math.exp(-0.109 + (0.229 * Ms)) + 2.586

    return Mw

Let's test the function with some known value:

In [None]:
Ms_to_Mw(7.0)

Let's also print out the first rows of our data frame to see our input data before further processing: 

In [None]:
data.head()

### Iterating over rows

We can apply the function one row at a time using a `for` loop and the [iterrows()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html) method. In other words, we can use the `iterrows()` method and a `for` loop to repeat a process *for each row in a pandas DataFrame*. Please note that iterating over rows is a rather inefficient approach, but it is still useful to understand the logic behind the iteration.

When using the `iterrows()` method it is important to understand that `iterrows()` accesses not only the values of one row, but also the `index` of the row as well. 

Let's start with a simple for loop that goes through each row in our DataFrame.

<div class="alert alert-info"><b>Single quotes inside double quotes</b><br/>
We use single quotes to select the column `Mag` of the row in the example below. This is because using double quotes would result in a `SyntaxError` since Python would interpret this as the end of the string for the `print()` function.
</div>

In [None]:
# Iterate over the rows
for idx, row in data.iterrows():

    # Print the index value
    print(f"Index: {idx}")

    # Print the row
    print(f"Mag: {row['Mag']}\n")

    break

<div class="alert alert-info"><b>Breaking a loop</b><br/>

When developing a for loop, you don't always need to go through the entire loop if you just want to test things out. 
The [break](https://www.tutorialspoint.com/python/python_break_statement.htm) statement in Python terminates the current loop wherever it is placed, and we used it here just to check out the values on the first row.
With a large data file or dataset, you might not want to print out thousands of values to the screen!
</div>

We can see that the `idx` variable indeed contains the index value at position 0 (the first row) and the `row` variable contains all the data from that given row stored as a pandas `Series`.
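The `(index, row)` pairs can be illustrated on a tiny made-up table. The values and the letter index below are assumptions for illustration only. Note one side effect: because each row is packed into a single `Series`, integer columns next to float columns are converted to floats.

```python
import pandas as pd

# Toy table with a non-default index so the index values are easy to spot
toy = pd.DataFrame({"Mag": [2.1, 3.4], "DEP": [10, 25]}, index=["a", "b"])

# Each iteration yields a tuple: (index label, row as a pandas Series)
pairs = [(idx, row["Mag"]) for idx, row in toy.iterrows()]

# Look at the first pair only, similar to using break in a loop
first_idx, first_row = next(toy.iterrows())

print(pairs)
print(type(first_row))
print(first_row["DEP"])  # stored as a float inside the row Series
```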

Let's now create an empty column `Mw` for the moment magnitude and update the values in that column using the `Ms_to_Mw` function we defined earlier.

In [None]:
# Create an empty float column for the output values
data["Mw"] = 0.0

# Iterate over the rows
for idx, row in data.iterrows():

    # Convert the surface-wave magnitude to moment magnitude
    Mw = Ms_to_Mw(row["Mag"])

    # Update the value of the 'Mw' column with the converted value
    data.at[idx, "Mw"] = Mw

<div class="alert alert-info"><b>Reminder: .at or .loc?</b><br/>

Here, you could also use `data.loc[idx, "Mw"] = Mw` to achieve the same result. 
    
If you only need to access a single value in a DataFrame, [DataFrame.at](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html) is faster compared to [DataFrame.loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html), which is designed for accessing groups of rows and columns. 
</div>

Finally, let's see what our DataFrame looks like now after the calculations above.

In [None]:
data.head()

### Applying the function

pandas DataFrames and Series have a dedicated method `.apply()` for applying functions on columns (or rows!). When using `.apply()`, we pass the function name (without parentheses!) as an argument to the `apply()` method. Let's start by applying the function to the `Mag` column that contains the surface-wave magnitude values.

In [None]:
data["Mag"].apply(Ms_to_Mw)

The results look logical and we can store them permanently into a new column (overwriting the old values): 

In [None]:
data["Mw"] = data["Mag"].apply(Ms_to_Mw)

<div class="alert alert-info"><b>Should I use .iterrows() or .apply()?</b><br/>
It is possible to apply the function on several columns at once. The dataframe can also be re-ordered at the same time.  Say, if you were processing temperature data, you can do the following

data[["TEMP_F", "MIN", "MAX"]].apply(fahr_to_celsius)
    
    
We are teaching the `.iterrows()` method because it helps to understand the structure of a DataFrame and the process of looping through DataFrame rows. However, using `.apply()` is often more efficient in terms of execution time. 

At this point, the most important thing is that you understand what happens when you are modifying the values in a pandas DataFrame. When doing the course exercises, either of these approaches is ok!
</div>

Let's check the output:

In [None]:
data.head(10)
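For larger catalogs, the same piecewise conversion can also be written without calling a Python function once per row. The sketch below is one possible alternative, not part of the lesson's required workflow: it uses `numpy.where` to choose between the two branches of the equation for a whole column at once. The sample magnitudes and the helper name `ms_to_mw_scalar` are made up for illustration.

```python
import math

import numpy as np
import pandas as pd


def ms_to_mw_scalar(ms):
    """Scalar version of the conversion, mirroring the function defined in this lesson."""
    if ms <= 5.5:
        return math.exp(2.133 + 0.063 * ms) - 6.205
    return math.exp(-0.109 + 0.229 * ms) + 2.586


# Made-up surface-wave magnitudes covering both branches
ms = pd.Series([3.0, 5.5, 7.0], name="Mag")

# Evaluate both branches for every value, then pick the right one element by element
low_branch = np.exp(2.133 + 0.063 * ms) - 6.205
high_branch = np.exp(-0.109 + 0.229 * ms) + 2.586
mw_vectorized = pd.Series(np.where(ms <= 5.5, low_branch, high_branch), index=ms.index, name="Mw")

# The row-by-row result from .apply() should agree with the vectorized one
mw_applied = ms.apply(ms_to_mw_scalar)
print(np.allclose(mw_vectorized, mw_applied))
```

Both branches are computed for all rows here, which is wasteful in principle but usually still faster than a Python-level loop.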

## Parsing dates

We will eventually want to group our data per day. Currently, the date and time information is stored in the column `TIME` (which was originally titled `Date - Time(Philippine Time)`):

Let's have a closer look at the date and time information we have by checking the values in that column, and their data type:

In [None]:
data["TIME"].head(10)

In [None]:
data["TIME"].tail(10)

The `TIME` column contains several events per day. The timestamp for the first observation in the dataframe is `31 January 2021 - 10:50 PM`, i.e. from 31st of January 2021 at 10:50 PM Philippine Standard Time.

In [None]:
data["TIME"].dtypes

The information is stored as objects ('O').

We want to **aggregate the data on a daily level**, and in order to do so we need to "label" each row of data based on the day when the record was observed. In order to do this, we need to somehow separate information about the month and the day for each row.

We create these "labels" by making a new column (or an index) containing information about the day (including the year, but excluding hours and minutes).

Before taking that step, we should first convert the contents in the `TIME` column to character strings for convenience.

In [None]:
# Convert to string
data["TIME_STR"] = data["TIME"].astype(str)

### String slicing

Now that we have converted the date and time information into character strings, we next need to "cut" the needed information from the [string objects](https://docs.python.org/3/tutorial/introduction.html#strings). If we look at the latest time stamp in the data, you can see that there is a systematic pattern `DD Month YYYY - HH:MM AM/PM`.

In [None]:
date = "01 January 2021 - 12:26 AM"
date[0:2]

Based on this information, we can slice the correct range of characters from the `TIME_STR` column using [pandas.Series.str.slice()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.slice.html)


In [None]:
# Slice the string
data["DAY"] = data["TIME_STR"].str.slice(start=0, stop=10)

# Let's see what we have
data.head()

Nice! Now we have "labeled" the rows based on information about date and time, but only including the day and month in the labels.
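One caveat with fixed-position slicing is that month names differ in length, so a slice that fits `January` may cut other months short. The sketch below uses two made-up timestamps in the same `DD Month YYYY - HH:MM AM/PM` pattern to compare the slice with a split on the ` - ` separator, which is one possible alternative.

```python
import pandas as pd

# Two made-up timestamps following the pattern used in the catalog
times = pd.Series(["01 January 2021 - 12:26 AM", "01 February 2021 - 03:10 PM"])

# Fixed-position slice: fine for January, but February loses its last letter
sliced = times.str.slice(start=0, stop=10)

# Splitting on the separator keeps the whole date part regardless of month length
date_part = times.str.split(" - ").str[0]

print(sliced.tolist())
print(date_part.tolist())
```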

<div class="alert alert-info"><b>Many methods of splicing and manipulating strings</b><br>
    
There are many approaches to splitting strings in pandas; string slicing is only one of them. String slicing is useful if the column has consistent structure or syntax.

You can explore and try other methods (e.g., separating a string based on [whitespaces](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Whitespace)) described [here](https://blog.hubspot.com/website/pandas-split-string).
    
</div>

### Check your understanding

Create a new column `'MONTH'` with information about the month without the year.

In [None]:
# Possible Solution
# Extract information about month from the TIME_STR column into a new column 'MONTH'
# we use the str.split function
data["MONTH"] = data['TIME_STR'].str.split(pat=' ', expand=True)[1]

# Check the result
data[["DAY", "MONTH"]]

### Datetime (optional for Lesson 6)

In pandas, we can convert dates and times into a new data type [datetime](https://docs.python.org/3.7/library/datetime.html) using the [pandas.to_datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html) function.

In [None]:
# Convert character strings to datetime
data["DATE"] = pd.to_datetime(data["TIME_STR"])

In [None]:
# Check the output
data["DATE"].head()
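If automatic parsing turns out to be slow or guesses wrongly, the layout can be spelled out with the `format` parameter. The sketch below assumes every timestamp follows the `DD Month YYYY - HH:MM AM/PM` pattern with English month names; the two sample strings are taken from the examples in this lesson.

```python
import pandas as pd

stamps = pd.Series(["31 January 2021 - 10:50 PM", "01 January 2021 - 12:26 AM"])

# %d day, %B full month name, %Y year, %I hour on a 12-hour clock, %M minutes, %p AM/PM
parsed = pd.to_datetime(stamps, format="%d %B %Y - %I:%M %p")

print(parsed)

# The same strftime codes can be used to build a month+day label
print(parsed.dt.strftime("%m%d").tolist())
```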

<div class="alert alert-info"><b>Pandas Series datetime properties</b><br/>

There are several methods available for accessing information about the properties of datetime values. Read more from the pandas documentation about [datetime properties](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#datetime-properties).
</div>

Now, we can extract different time units based on the datetime-column using the [pandas.Series.dt](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.html) accessor:

In [None]:
data["DATE"].dt.day

In [None]:
data["DATE"].dt.month # this will have a more interesting output if your data frame has data from different months

We can also combine the datetime functionalities with other methods from pandas. For example, we can check the number of unique years in our input data: 

In [None]:
data["DATE"].dt.year.nunique()

For the final analysis, we need combined information of the day and month. One way to achieve this is to use the  `format` parameter to define the output datetime format according to [strftime(format)](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) method:

In [None]:
# Convert to datetime and keep only day and the month
data["MONTH_DAY"] = pd.to_datetime(data["TIME_STR"], format="%d %B", exact=False).dt.strftime('%m%d')

`exact=False` finds the characters matching the specified format and drops out the rest (the year, hours and minutes are excluded in the output).
`%B` corresponds to the long name (e.g., January) of a month.

In [None]:
data["MONTH_DAY"]

Let us also change the `MONTH` column to have the number instead of the long name. As you can imagine, numbers are easier to deal with programmatically.

In [None]:
data["MONTH"] = data["DATE"].dt.month 

Now each row has a numeric label for its month (an integer) in the `MONTH` column.

## Aggregating data in Pandas by grouping

Here, we will learn how to use [pandas.DataFrame.groupby](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) which is a handy method for compressing large amounts of data and computing statistics for subgroups.

We will use the groupby method to calculate the average number of events per day:

  1. **Grouping the data** based on the day
  2. Calculating the average for each day (each group) 
  3. Storing those values into **a new DataFrame** called `daily_data`

Before we start grouping the data, let's once more see what our input data looks like.

In [None]:
print(f"number of rows: {len(data)}")

In [None]:
data.head()

We have quite a number of rows of earthquake data, and several observations per day. Our goal is to create an aggregated data frame that would have only one row per day.

Let's **group** our data based on the unique month and day combinations stored in the `MONTH_DAY` column (which we created using `.dt.strftime`).

In [None]:
grouped = data.groupby(data["MONTH_DAY"])

<div class='alert alert-info'>
It is also possible to create combinations of months and days on-the-fly when grouping the data:
    
```
# Group the data 
grouped = data.groupby(['DAY', 'MONTH'])
```
</div>

Let's explore the new variable `grouped`.

In [None]:
type(grouped)

In [None]:
len(grouped)

We have a new object with type `DataFrameGroupBy` with 31 groups. In order to understand what just happened, let's also check the number of unique month and day combinations in our data:

In [None]:
data["MONTH_DAY"].nunique()

Length of the grouped object should be the same as the number of unique values in the column we used for grouping. For each unique value, there is a group of data.
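The relationship between groups and unique values can be checked on a tiny made-up table (three events on two days; all values are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({
    "MONTH_DAY": ["0101", "0101", "0102"],
    "Mag": [2.1, 3.4, 1.9],
})

grouped_toy = toy.groupby("MONTH_DAY")

# One group per unique label
n_groups = len(grouped_toy)
n_unique = toy["MONTH_DAY"].nunique()

# Number of rows that ended up in each group
sizes = grouped_toy.size()

print(n_groups, n_unique)
print(sizes)
```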

Let's explore our grouped data even further. 

We can check the "names" of each group.

In [None]:
# Next line will print out all 31 group "keys"
grouped.groups.keys()

### Accessing data for one group

Let us now check the contents for the group representing January 01, 2021 (the name of that group is `0101` if you followed this notebook). We can get the values of that day from the grouped object using the `get_group()` method.

In [None]:
# Specify a day (as character string)
day = "0101"

# Select the group
group1 = grouped.get_group(day)

In [None]:
# Let's see what we have
group1

Ahaa! As we can see, a single group contains a **DataFrame** with values only for that specific day. Let's check the DataType of this group.

In [None]:
type(group1)

So, as noted above, one group is a pandas DataFrame! This is really useful, because we can now use all the familiar DataFrame methods for calculating statistics, etc. for this specific group. We can, for example, calculate the max values for all variables using the statistical functions that we have seen already (e.g. mean, std, min, max, median, etc.).

We can do that by using the `max()` function that we already used during Lesson 5. 

- Let's calculate the max for the following attributes all at once:

    - `Mag`
    - `Mw`
    - `DEP`

In [None]:
# Specify the columns that will be part of the calculation
max_cols = ["Mag", "Mw", "DEP"]

# Calculate the max values all at one go
max_values = group1[max_cols].max()

# Let's see what we have
print(max_values)

Above, we saw how you can access data from a single group. In order to get information about all groups (all days) we can use a `for` loop or methods available in the grouped object.

### For loops and grouped objects

When iterating over the groups in our `DataFrameGroupBy` object it is important to understand that a single group in our `DataFrameGroupBy` actually contains not only the actual values, but also information about the `key` that was used to do the grouping. Hence, when iterating over the data we need to assign the `key` and the values into separate variables.

So, let's see how we can iterate over the groups and print the key and the data from a single group (again using `break` to only see what is happening for the first group).

In [None]:
# Iterate over groups
for key, group in grouped:
    # Print key and group
    print(f"Key:\n {key}")
    print(f"\nFirst rows of data in this group:\n {group.head()}")

    # Stop iteration with break command
    break

OK, so from here we can see that the `key` contains the name of the group based on the day.

Let's build on this and see how we can create a DataFrame where we calculate the max values for all those seismicity attributes that we were interested in. We will repeat some of the earlier steps here so you can see and better understand what is happening.

In [None]:
# Collect the aggregated values of each group in a list
daily_rows = []

# The columns that we want to aggregate
max_cols = ["Mag", "Mw", "DEP"]

# Iterate over the groups
for key, group in grouped:

    # Calculate max
    max_values = group[max_cols].max()

    # Add the key (i.e. the month+day label) into the aggregated values
    max_values["MONTH_DAY"] = key

    # Store the aggregated values for this group
    daily_rows.append(max_values)

# Build the DataFrame of aggregated values
# (DataFrame.append was removed in pandas 2.0, so we build it from the list instead)
daily_data = pd.DataFrame(daily_rows).reset_index(drop=True)

Now, let us see what we have.

In [None]:
grouped

In [None]:
print(daily_data)

Awesome! Now we have aggregated our data and we have a new DataFrame called `daily_data` where we have max values for each day in the data set.
Does this help you identify the time and date of the largest event in January 2021?

### Finding the max for all groups at once

We can also achieve the same result by computing the max of all columns for all groups in the grouped object.

In [None]:
grouped.max()

## Detecting most seismically active days

Now that we have aggregated our data on daily level, all we need to do is to sort our results in order to check which days in January 2021 had the most earthquakes. A simple approach is to select all days in the data, group the data and check which group(s) have the highest count of earthquakes.

We can start this by selecting all records that are from January (regardless of the year).

In [None]:
january = data[data["MONTH"] == 1]

Next, we can take a subset of columns that might contain interesting information.

In [None]:
january = january[["MONTH_DAY", "LON", "LAT", "DEP", "Mw"]]

We can group by month and day.

In [None]:
grouped = january.groupby(by="MONTH_DAY")

And then we can calculate the mean for each group (MONTH_DAY) and save it to a different variable. We rename the columns appropriately.
It will also be interesting to find the smallest and largest events on those days and the number (count) of earthquakes.

In [None]:
daily_stats = grouped[['Mw','DEP']].mean().rename(columns={'Mw': 'Mw_mean','DEP': 'DEP_mean'})

daily_stats[['Mw_min', 'DEP_min']] = grouped[['Mw','DEP']].min()
daily_stats[['Mw_max', 'DEP_max']] = grouped[['Mw','DEP']].max()

daily_stats['N_events'] = grouped['Mw'].count()

In [None]:
daily_stats.head()
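The same kind of summary table can also be built in a single call using named aggregation, where each keyword is an output column and each value is a `(column, function)` pair. This is an alternative sketch on made-up numbers, not a replacement for the step-by-step version above.

```python
import pandas as pd

# Made-up events on two days
toy = pd.DataFrame({
    "MONTH_DAY": ["0101", "0101", "0102"],
    "Mw": [4.0, 5.0, 4.5],
    "DEP": [10.0, 30.0, 20.0],
})

# Output column name = (input column, aggregation function)
stats = toy.groupby("MONTH_DAY").agg(
    Mw_mean=("Mw", "mean"),
    Mw_max=("Mw", "max"),
    DEP_mean=("DEP", "mean"),
    N_events=("Mw", "count"),
)

print(stats)
```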

We can sort and check the most seismically active days. We can sort the data frame in a descending order to do this.

In [None]:
daily_stats.sort_values(by="N_events", ascending=False).head(10)

And finally, take the average number of events per day in January 2021:

In [None]:
daily_stats['N_events'].mean()
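One thing to keep in mind when interpreting this average: a day with no recorded events has no group at all, so it is left out of the mean rather than counted as zero. Whether this matters for the real catalog depends on the data. The sketch below uses made-up timestamps and an assumed three-day period to show the difference.

```python
import pandas as pd

# Made-up events: two on 1 January, none on 2 January, one on 3 January
dates = pd.to_datetime(["2021-01-01 00:26", "2021-01-01 13:05", "2021-01-03 08:40"])
events = pd.DataFrame({"DATE": dates})

# Count events per calendar day (days without events are simply absent)
per_day = events.groupby(events["DATE"].dt.floor("D")).size()

# Reindex over every day in the period so that quiet days count as 0
all_days = pd.date_range("2021-01-01", "2021-01-03", freq="D")
per_day_full = per_day.reindex(all_days, fill_value=0)

mean_observed = per_day.mean()   # average over days that had events
mean_full = per_day_full.mean()  # average over all days in the period

print(mean_observed, mean_full)
```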

## Repeating the data analysis with a larger dataset

To wrap up today's lesson, let's repeat the data analysis steps above for all the available data we have (!!). First, it would be good to confirm the path to the **folder** where all the input data are located.

The idea is that we will repeat the analysis process for each input file using a (rather long) for loop! Here we have all the main analysis steps with some additional output info, all in one long code cell.

In [None]:
import pandas as pd

fp = 'data/eq_data/pivs202101.csv'

# Read selected columns of  data 


# replace the \n in the column names with '' to delete them



# Rename the columns
new_names = {
    "Date - Time(Philippine Time)": "TIME", 
    "Latitude(ºN)": "LAT", 
    "Longitude(ºE)": "LON", 
    "Depth(km)": "DEP"
}



# Create column for moment magnitude



# Convert Magnitudes to Moment magnitudes


# Convert TIME to string, and the original time column to datetime


# Put the month in one column. 


# Parse year and month to a new column


# Group by month and day



# Get mean, min, and max values for each group, save these to new columns
daily_stats = grouped[['Mw','DEP']].mean().rename(columns={'Mw': 'Mw_mean','DEP': 'DEP_mean'})
daily_stats[['Mw_min', 'DEP_min']] = grouped[['Mw','DEP']].min()
daily_stats[['Mw_max', 'DEP_max']] = grouped[['Mw','DEP']].max()
daily_stats['N_events'] = grouped['Mw'].count()
mean_Nevents = daily_stats['N_events'].mean()

# Let's see what we have

# Print info about the current input file:
print(f"MONTH: {data.at[0, 'MONTH']} {data.at[0, 'YEAR']}")
print(f"NUMBER OF EVENTS RECORDED: {len(data)}")

print(f'Average number of earthquakes per day: {round(mean_Nevents, 1)}') # round to the nearest 1 decimal place
print(daily_stats.head())

# Print info
print(daily_stats.sort_values(by="N_events", ascending=False).head(5))
print("\n")

At this point we will use the `glob()` function from the module `glob` to list our input files. `glob()` is a handy function for finding files in a directory that match a given pattern.

In [None]:
import glob

In [None]:
file_list = glob.glob(r"data/eq_data/pivs*csv")

```{note}
Note that we're using the \* character as a wildcard, so any file whose path starts with `data/eq_data/pivs` and ends with `csv` will be added to the list of files we will iterate over. We specifically use `pivs` as the starting part of the file names so that other files in the data folder are not included in the list!
```

In [None]:
print(f"Number of files in the list: {len(file_list)}")
print(file_list)

Now, you should have all the relevant file names in a list, and we can loop over the list using a for loop.

In [None]:
for fp in file_list:
    print(fp)
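To try the pattern matching without touching the real data folder, the sketch below writes a few throwaway files into a temporary directory and lists them with `glob`. The file names and contents are made up for illustration. The list is sorted because `glob` does not guarantee any particular order.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two small made-up catalog files with different numbers of rows
    for name, n_rows in [("pivs202101.csv", 3), ("pivs202102.csv", 2)]:
        pd.DataFrame({"Mag": [2.0] * n_rows}).to_csv(os.path.join(tmp_dir, name), index=False)

    # A file that should NOT match the pattern
    open(os.path.join(tmp_dir, "notes.txt"), "w").close()

    # Find the catalog files and sort them for a predictable order
    file_list = sorted(glob.glob(os.path.join(tmp_dir, "pivs*csv")))

    # Read each file and record how many rows it has
    counts = {os.path.basename(fp): len(pd.read_csv(fp)) for fp in file_list}

print(counts)
```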

In [None]:
# Repeat the analysis steps for each input file:
for fp in file_list:

    # Read selected columns of  data 


    # replace the \n in the column names with '' to delete them
    data.columns = data.columns.str.replace('\n', '')

    # Rename the columns
    new_names = {
        "Date - Time(Philippine Time)": "TIME", 
        "Latitude(ºN)": "LAT", 
        "Longitude(ºE)": "LON", 
        "Depth(km)": "DEP"
    }

    data = data.rename(columns=new_names)

    # Create column
 

    # Convert Magnitudes to Moment magnitudes
 

    # Convert TIME to string

    
    # Put the month in one column. 
    
    
    # Parse year and month

    
    # Group by month and day

    
    # Get mean and max values for each group
    daily_stats = grouped[['Mw','DEP']].mean().rename(columns={'Mw': 'Mw_mean','DEP': 'DEP_mean'})
    daily_stats[['Mw_min', 'DEP_min']] = grouped[['Mw','DEP']].min()
    daily_stats[['Mw_max', 'DEP_max']] = grouped[['Mw','DEP']].max()
    daily_stats['N_events'] = grouped['Mw'].count()
    mean_Nevents = daily_stats['N_events'].mean()

    # Let's see what we have
    # Print info about the current input file:
    print(f"MONTH: {data.at[0, 'MONTH']} {data.at[0, 'YEAR']}")
    print(f"NUMBER OF EVENTS RECORDED: {len(data)}")
    
    print(f'Average number of earthquakes per day: {round(mean_Nevents, 1)}') # round to the nearest 1 decimal place
    print(daily_stats.head())

    # Print info
    print(daily_stats.sort_values(by="N_events", ascending=False).head(5))
    print("\n")

So, what can we conclude about the average number of earthquakes?

As an exercise, try getting the data for the next months (April, May, and so on...) from the PHIVOLCS website and see what the results are.
Do you get similar numbers? Why or why not?