# <center> Coral Cover Analysis </center>



## Part 0: Import necessary libraries and load data

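The first step is to import the libraries you'll need. In this notebook we mainly use pandas to work with the data, Matplotlib, seaborn, and Plotly Express for plotting, and GeoPandas for spatial data. NumPy is imported as well, although it is not used directly in this example. You can import these libraries as follows: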

**Note**: if you get the error `ModuleNotFoundError: No module named 'x'`, add another cell, run `%pip install x` (or `pip install x`), and then re-run the import cell.
This error means the package is not installed in the Python environment that the Jupyter Notebook is running in, which can happen whether the notebook runs online or on your own machine.
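
To see which packages are missing before running the imports, you can ask Python whether it can find each one. This is a minimal sketch using only the standard library; the package list is illustrative and `find_missing` is a hypothetical helper name, not part of this notebook.

```python
import importlib.util


def find_missing(packages):
    """Return the names in `packages` that cannot be found in this environment."""
    return [name for name in packages if importlib.util.find_spec(name) is None]


# Illustrative list of top-level import names (these can differ from pip names).
wanted = ["pandas", "numpy", "matplotlib", "plotly", "geopandas", "seaborn"]
missing = find_missing(wanted)

if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All packages found.")
```

Any name that is reported can then be installed from a new cell before you run the import cell again.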

In [None]:
import pandas as pd
import numpy as np
#import os
import matplotlib.pyplot as plt
import matplotlib.tri as tri
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import geopandas as gpd
import geodatasets
import contextily as cx
from scipy.interpolate import griddata

In [None]:
#pip install pandas

**Load Data from a CSV File**

To load data from a CSV file into a pandas DataFrame, you can use the `pd.read_csv()` function. This function is versatile: through its arguments it can handle different delimiters, header layouts, and encodings.

In this example, we'll use it to read data from `percent_cover_DFAT.csv` located in a subfolder called `data`. 

We assign the name `coral_data` to the DataFrame so we can reference it later in our analysis. You can choose any name you like, but make sure each dataset you load has a unique, descriptive name so you don’t overwrite existing data.

Here's how you can do it:


In [None]:
coral_data=pd.read_csv('data/percent_cover_DFAT.csv') 
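
To see how the loading functions behave without needing a file on disk, the sketch below reads a small made-up table from memory. The column names and values are invented for illustration. It also shows that `pd.read_table()` is the tab-separated counterpart of `pd.read_csv()`.

```python
import io
import pandas as pd

# Made-up data standing in for a file on disk.
csv_text = "image name,HC,SC\n1_01.png,40.0,5.0\n1_02.png,35.0,10.0\n"
tsv_text = csv_text.replace(",", "\t")

from_csv = pd.read_csv(io.StringIO(csv_text))             # comma-separated (default)
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")   # tab-separated via sep
from_table = pd.read_table(io.StringIO(tsv_text))         # read_table defaults to tabs

print(from_csv)
print(from_csv.equals(from_tsv), from_csv.equals(from_table))
```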

## Part 1: Formatting the data

#### Preparing Your Data Before Analysis

When you first load a dataset, it’s important to spend time cleaning and formatting it before you begin plotting or running calculations. Raw data often comes with issues that can lead to errors or misleading results if left unchecked.

1. **Check that the data has loaded correctly**
- Use commands like `df.head()` to view the first few rows and `df.info()` to see column names, data types, and the number of non-missing values in each column.
- Make sure the headings are recognised as column names, and not treated as data entries (sometimes CSV files contain extra header rows).

2. **Inspect the indexing**
- Every pandas DataFrame has an index (the row labels). By default, it starts from 0, 1, 2, etc.
- Sometimes pandas may use the first column of the data as the index, and if it isn't a true index this may cause problems later on when merging data or making plots.
- When loading data, you can specify which column is the index, or tell pandas not to use any column as the index: `df = pd.read_csv("data.csv", index_col="SiteID")` or `df = pd.read_csv("data.csv", index_col=False)`.

3. **Look at both the first and last rows**
- The last row can sometimes contain “totals” or other automatic summaries added by data collection software, which should not be analysed as real observations. Use `df.tail()` to check this.

4. **Verify structure and consistency**
- Confirm that numeric columns are stored as numbers (e.g., `int64`, `float64`), and categorical variables are stored as text (`object`).

5. **Missing values**
- Ensure missing values are correctly identified.
- When a site has no species present, the data might be recorded as 0 (true zero) or left blank, which pandas will read as `NaN`.
- By default, `NaN` values are treated as missing data, meaning they are skipped in calculations (like averages) and in plots.
- It’s important to check your dataset and decide whether a value represents missing data or a true zero, and then code it appropriately.

By taking these steps early, you set up a clean, consistent dataset that you can trust for plotting, summary statistics, and further analysis.
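
The checks above can be sketched on a small made-up table. The column names and values are invented for illustration. The table has one blank hard coral value and a trailing summary row, and the sketch shows how much the mean changes depending on whether the blank is treated as missing or as a true zero.

```python
import io
import pandas as pd

# Made-up survey table: one blank HC value and a trailing summary row.
raw = io.StringIO(
    "SiteID,HC,MA\n"
    "1,40,10\n"
    "2,,30\n"
    "3,20,20\n"
    "Mean,30,20\n"
)
df = pd.read_csv(raw)

print(df.tail(2))    # the last row is a summary, not an observation
print(df.dtypes)     # SiteID is read as text because of the 'Mean' label

df = df.iloc[:-1].copy()                 # drop the summary row
df["SiteID"] = df["SiteID"].astype(int)  # now the IDs can be stored as integers

n_missing = df["HC"].isna().sum()            # number of blank HC cells
mean_ignoring_nan = df["HC"].mean()          # the blank is skipped
mean_with_zero = df["HC"].fillna(0).mean()   # the blank is treated as a true zero

print(n_missing, mean_ignoring_nan, mean_with_zero)
```

Which of the two means is correct depends on what the blank cell represents, which is why this decision needs to be made before any analysis.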

Below are some typical commands that are useful for this initial step. 

In [None]:
#coral_data.fillna(0, inplace=True)                            # Replace NaNs with 0
#coral_data.head()                                             # preview first 5 rows
#coral_data.info()                                             # summary of columns and data types
#coral_data.describe()                                         # quick statistics
#coral_data.isna().sum()                                       # check missing values
#coral_data = coral_data.dropna()                              # drop rows with missing values
#coral_data["col"] = coral_data["col"].fillna(0)               # replace NaN with 0 in one column
#coral_data.columns = coral_data.columns.str.strip()           # remove spaces in column names
#coral_data = coral_data.rename(columns={"old": "new"})        # rename a column
#coral_data = coral_data.drop(columns=["unnecessary"])         # delete a column
#coral_data = coral_data.drop(coral_data.index[-1])            # delete the last row
#coral_data = coral_data.drop(coral_data.index[0])             # delete the first row
#coral_data["HC"] = coral_data["HC"].astype(int)               # convert to integer
#coral_data["Site"] = coral_data["Site"].astype("category")    # convert to category
#print(coral_data["Zone"].unique())                            # print unique values
#print(coral_data["Zone"].value_counts())                      # print counts of each unique value

#### CoralNet example

First, inspect the dataset to confirm it loaded correctly and the structure makes sense.

In [None]:
coral_data

In [None]:
# check missing values
coral_data.isna().sum() 

In [None]:
# drop rows with missing values
# there are still 265 rows so there are no rows with missing values
coral_data = coral_data.dropna()
coral_data

Based on our data we have:
- **Rows:** each row is a single image.
- **Columns:** each column is a benthic label.
- **Values:** percent cover per image.
- **Summary row:** a row at the end that calculates the average percent cover for each benthic label across all images. We want to delete this row.
- **Image name:** These were assigned before uploading to CoralNet and follow the pattern `SiteID_Framenumber.png`. We want to split these into separate columns for site ID and frame number.
- No missing values

**Step 1 — delete the last row from the data**

In [None]:
# In Python, the index -1 refers to the last element (here, the last row)
coral_data = coral_data.drop(coral_data.index[-1])  
coral_data

**Step 2 — remove the `.png` extension**

In [None]:
# delete text from file name 
coral_data['image name'] = coral_data['image name'].str.replace('.png', '', regex=False)  # remove the .png extension
coral_data

**Step 3 — split `SiteID` and `Frame`:**

In [None]:
# Split the values in the 'image name' column into parts wherever an underscore ('_') appears.
# With expand=True, each part is placed into its own separate column (instead of a list in a single cell).
# The result is stored in 'split_cols'
split_cols = coral_data['image name'].str.split('_', expand=True) 

split_cols

In [None]:
# Now we add the columns in split_cols back into our original DataFrame; they are labelled 0 and 1 by default
coral_data = coral_data.join(split_cols)
coral_data

We want to change the default column names to `Site` and `Frame`

In pandas, this is done using the `.rename()` function. Column labels can be stored as either **integers** or **strings**, so it’s important to match the correct type when renaming:
- If the column labels are integers (e.g., `0`, `1`), pass them as integers in the dictionary.
- If the column labels are strings (e.g., `'0'`, `'1'`), pass them as strings (with quotes).

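If unsure, check the labels with `coral_data.columns`. Note that by default `.rename()` does not raise an error when a label is not found; it simply leaves the column names unchanged, so always look at the output to confirm the rename worked.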
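
The difference between integer and string labels can be sketched on a toy example. The image names below are made up. It also shows the optional `errors="raise"` argument of `.rename()`, which turns a silent mismatch into a `KeyError`.

```python
import pandas as pd

# Toy frame with integer column labels, as produced by str.split(expand=True).
parts = pd.Series(["12_003", "12_004"]).str.split("_", expand=True)
print(parts.columns.tolist())

wrong = parts.rename(columns={"0": "Site", "1": "Frame"})   # string keys: no match, no error
right = parts.rename(columns={0: "Site", 1: "Frame"})       # integer keys: match

print(wrong.columns.tolist())
print(right.columns.tolist())

# Ask pandas to complain when a label is missing instead of ignoring it.
try:
    parts.rename(columns={"0": "Site"}, errors="raise")
    raised = False
except KeyError:
    raised = True
print("KeyError raised:", raised)
```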

In [None]:
# Rename columns
coral_data = coral_data.rename(columns={0: 'Site', 1: 'Frame'}) 
coral_data

**Step 4 — Add and merge data**

After you classify the underwater benthos at each site, you often want to add **descriptor** or **environmental variables** to aid your analysis.

For our example, we will add **latitude**, **longitude**, and **depth** for each **site**.

We can do this by merging the CoralNet dataset with another CSV file that contains these details, using the site ID as the key for the merge.

First, let's load our `Metadata_dropcamera` file in the same way we did above.

In [None]:
zone_data = pd.read_csv("data/Metadata_dropcamera.csv")
zone_data

In [None]:
print(zone_data["Zone"].unique()) 

In [None]:
print(zone_data["Zone"].value_counts())

We now use the `merge()` function, specifying the two DataFrames (`coral_data` and `zone_data`) and the column to use as the key (`Site`).
The `how` argument controls which rows are kept:

- `left` → keeps all rows from the left DataFrame (`coral_data`) and only matching rows from the right.
- `right` → keeps all rows from the right DataFrame (`zone_data`) and only matching rows from the left.
- `inner` → keeps only the rows where the key exists in both DataFrames.
- `outer` → keeps all rows from both DataFrames, filling missing values with NaN where no match exists.
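
The four options can be compared on two small made-up tables. The site numbers and values are invented: site 4 has cover data but no zone, and site 3 has a zone but no cover data.

```python
import pandas as pd

# Made-up tables: site 4 is missing from zones, site 3 is missing from cover.
cover = pd.DataFrame({"Site": [1, 1, 2, 4], "HC": [40.0, 35.0, 20.0, 10.0]})
zones = pd.DataFrame({"Site": [1, 2, 3], "Zone": ["Reef slope", "Lagoon 1", "Lagoon 2"]})

row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(cover, zones, on="Site", how=how)
    row_counts[how] = len(merged)
    print(how, "->", len(merged), "rows")
```

Comparing the row count before and after a merge is a quick way to spot keys that failed to match.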

In [None]:
# Perform a left merge to keep all rows from coral_data
#data = pd.merge(coral_data, zone_data, on='Site', how='left')
#data

If you uncomment and run the code above, you will get an error. That is because the `Site` columns are stored as two different types, `object` and integer, so pandas will not match them. In `coral_data`, `Site` was split out of the image name, so it is stored as text. We can use the `.dtypes` attribute to check how each column is stored. It is best practice to check your variable types at the beginning to avoid errors like this.

Typical data types include:
- `int64` (Integer): Whole numbers (no decimals).
- `float64` (Floating Point): Numbers with decimals.
- `object` (Generic Python Object, usually Text): Mixed or text/string data.
- `category`: Categorical variable (a fixed set of labels, optionally ordered).
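
The type mismatch can be reproduced on two made-up tables. This is a sketch under the assumption that one key column holds text and the other holds integers; the values are invented.

```python
import pandas as pd

# Made-up tables: Site is text in one and integer in the other.
cover = pd.DataFrame({"Site": ["1", "2"], "HC": [40.0, 20.0]})
zones = pd.DataFrame({"Site": [1, 2], "Zone": ["Reef slope", "Lagoon 1"]})

print(cover.dtypes["Site"], zones.dtypes["Site"])

try:
    pd.merge(cover, zones, on="Site", how="left")
    merge_failed = False
except ValueError as err:
    merge_failed = True
    print("Merge failed:", err)

# Convert the text key to integers, then merge again.
cover["Site"] = cover["Site"].astype(int)
merged = pd.merge(cover, zones, on="Site", how="left")
print(merged)
```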

##### Option 1: 

Call `.dtypes` on one DataFrame at a time. A notebook cell only displays the result of its last line, so move the comment marker (`#`) to see each one in turn.

In [None]:
# Option 1: a cell only displays its last expression, so uncomment one line at a time
#zone_data.dtypes 
#coral_data.dtypes 

# Option 2: print both data frames
print('zones:\n',zone_data.dtypes,'\n\ncoralnet:\n',coral_data.dtypes)

**Using the `print()` function**

The `print()` function is useful for displaying multiple outputs at the same time, with a mix of text and values.
Use `''` (or `""`) around text strings.

- Separate multiple items with a `,` → they will be printed with spaces between them.
- Use `\n` inside a string to create a new line (like pressing Enter).

**Formatting variable types**

Now we use the `.astype()` function to change **site** in our DataFrames to an **integer variable**. The site IDs are whole numbers, and even though they do not imply any order, it is easier to treat them as integers here so that repeated site numbers match each other across DataFrames.

We will change **zone** to a categorical variable.

By default, categorical variables in pandas have **no inherent order** (e.g., categories like *low, medium, high* are treated equally). However, you can assign an order if it makes sense for your analysis.

For example, we can assign an order to our **zone** variable, based on a spectrum of connectivity to the open ocean:

- **Reef slope** → high connectivity
- **Reef flat** → medium connectivity
- **Lagoons 1–3** → low connectivity

Among the lagoons, **Lagoon 1** is the largest and most exposed to dominant wave energy, while **Lagoon 3** is the smallest and most sheltered. Therefore, we consider **Lagoon 3** to be the most isolated of all five zones.
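
The effect of assigning an order can be sketched on a toy Series. The three-zone list is a shortened, illustrative version of the order described above, and "Reef crest" is a made-up label used to show what happens to values that are not in the list.

```python
import pandas as pd

zones = pd.Series(["Lagoon 2", "Reef slope", "Lagoon 1", "Reef slope"])

# Illustrative order, from high to low connectivity.
order = ["Reef slope", "Lagoon 1", "Lagoon 2"]
zone_dtype = pd.CategoricalDtype(categories=order, ordered=True)
ordered = zones.astype(zone_dtype)

alphabetical = zones.sort_values().tolist()       # plain text sorts A to Z
by_connectivity = ordered.sort_values().tolist()  # ordered categories sort by the list
more_isolated = (ordered > "Reef slope").tolist() # comparisons follow the order

# A label that is not in the category list becomes NaN.
unknown = pd.Series(["Reef crest"]).astype(zone_dtype)

print(alphabetical)
print(by_connectivity)
print(more_isolated)
print(unknown.isna().tolist())
```

Because unlisted labels silently become `NaN`, it is worth comparing the category list against the unique values in the data before setting the order.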

In [None]:
# Convert Site to integer in both DataFrames so the merge keys match
zone_data["Site"] = zone_data["Site"].astype("int")
coral_data["Site"] = coral_data["Site"].astype("int")

# Convert Frame and Zone to categorical
coral_data["Frame"] = coral_data["Frame"].astype("category")
zone_data["Zone"] = zone_data["Zone"].astype("category")

# Set the category order
zone_order = ["Reef slope", "Reef flat", "Lagoon 1", "Lagoon 2", "Lagoon 3"]
zone_data["Zone"] = zone_data["Zone"].cat.set_categories(zone_order, ordered=True)

#check it worked
print('zones:\n',zone_data.dtypes,'\ncoralnet:\n',coral_data.dtypes)

Now our `merge()` function will work because the **Site** variable has the same type in both DataFrames.

In [None]:
data = pd.merge(coral_data, zone_data, on='Site', how='left')
data

**Export the DataFrame to a CSV file**

We have finished formatting our data for analysis. If you want to continue the analysis in another application, you can save the formatted data as a CSV file using the following code.

`data.to_csv("formatted_data.csv", index=False)`
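
A CSV file stores only text, so some formatting does not survive a save and reload. The sketch below uses a made-up two-row table and an in-memory buffer instead of a real file. It shows that a categorical column comes back as plain text, and that leaving out `index=False` adds an extra unnamed column on reload.

```python
import io
import pandas as pd

# Made-up formatted table with a categorical column.
formatted = pd.DataFrame({"Site": [1, 2], "Zone": ["Reef slope", "Lagoon 1"]})
formatted["Zone"] = formatted["Zone"].astype("category")

# Save without the index, then reload.
buffer = io.StringIO()
formatted.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(formatted["Zone"].dtype, "->", reloaded["Zone"].dtype)

# Save with the index (the default), then reload.
with_index = io.StringIO()
formatted.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

print(reloaded_with_index.columns.tolist())
```

If you reload the file in Python later, the categorical conversion and ordering steps need to be applied again.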



## Part 2: Plotting the data

There are many different plots we can use in Python.

These include (some are pandas or Matplotlib methods, others are seaborn functions):

`.boxplot()` → shows the spread and distribution of data, highlighting the median, quartiles, and potential outliers.

`.barplot()` → bar chart (mean values with error bars)

`.pie()` → pie chart (proportions)

`.scatterplot()` → scatter plot (relationship between two variables)

`.lineplot()` → line graph (trend over time or sequence)

`.hist()` → histogram (frequency of values)

`.heatmap()` → colour-coded matrix (great for correlations)

Let's start by looking at the reef-scale distribution of each variable, beginning with `HC`.

In [None]:
data[['HC']].boxplot()
plt.show()

We can use `.iloc[:, 1:13]` to select all rows (`:`) and only columns 2 through 13 (`1:13`). Remember that in Python indexing starts at 0, so column index 1 is actually the second column, and the end index (13) is not included.
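
The same slicing rule can be checked on a small made-up table with five columns. The names and values are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "image name": ["a", "b"],
    "HC": [40, 20],
    "SC": [5, 10],
    "MA": [15, 30],
    "Site": [1, 2],
})

# Positions 1, 2 and 3 are selected; position 4 ('Site') is excluded.
subset = toy.iloc[:, 1:4]
print(subset.columns.tolist())
```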

In [None]:
data.iloc[:, 1:13].boxplot(figsize=(12, 3))
plt.show()

We can use the `describe()` function to quickly calculate common statistics, and then use these to make other plots based on the means and standard deviations.
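
As a minimal sketch with made-up values, the output of `describe()` can be transposed so that each variable becomes a row, which makes it easy to pick out the mean and standard deviation columns.

```python
import pandas as pd

toy = pd.DataFrame({"HC": [40.0, 20.0, 30.0], "MA": [10.0, 30.0, 20.0]})

stats = toy.describe().T          # one row per variable, one column per statistic
summary = stats[["mean", "std"]]  # keep only the mean and standard deviation

print(stats.columns.tolist())
print(summary)
```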

In [None]:
data.describe() 

In [None]:
df_stat = data.describe() 

In [None]:
df_stat_coral = df_stat.T[0:12]   # transpose and select rows
df_stat_coral[["mean"]].plot.pie(
    y="mean",           # column to use for pie
    legend=False,       # optional: hide legend (labels on wedges instead)
    autopct='%1.1f%%',  # show percentages on slices
    figsize=(6, 6)      # adjust figure size
)

plt.ylabel("")  # remove the automatic 'mean' label on the y-axis
plt.title("Mean values by column")
plt.show()

In [None]:
df_stat_coral = df_stat.T[0:12]   # transpose and select rows
df_stat_coral[["mean"]].plot.bar(
    y="mean",           # column to use for the bar heights
    legend=False,       # optional: hide legend
    figsize=(6, 6)      # adjust figure size
)

plt.ylabel("Mean percent cover")  # label the y-axis
plt.title("Mean values by column")
plt.show()

### Group by a variable such as "reef zone"

We want to calculate the mean cover of each benthic class (e.g., hard coral, macroalgae, CCA, soft coral, etc.) across each reef zone.

To do this in pandas, we can use the `groupby()` function:

- `groupby("Zone")` → groups the dataset by reef zone.
- `.mean()` → calculates the mean for each numeric column within each group.
- This way, we end up with a table where each row is a reef zone, and each column is the average percent cover of a benthic class.
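
The grouping step can be sketched on a small made-up table with two zones. The values are invented. The second call uses `.agg()` to return the mean together with the standard error of the mean (`sem`) for one column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Zone": ["Reef slope", "Reef slope", "Lagoon 1", "Lagoon 1"],
    "HC": [40.0, 20.0, 10.0, 30.0],
    "MA": [10.0, 20.0, 50.0, 30.0],
})

# One row per zone, one column per benthic class.
zone_means_toy = toy.groupby("Zone").mean(numeric_only=True)

# Mean and standard error of HC within each zone.
hc_summary = toy.groupby("Zone")["HC"].agg(["mean", "sem"])

print(zone_means_toy)
print(hc_summary)
```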

In [None]:
# Group by reef zone and calculate the mean for each benthic class
# observed=False keeps every zone category in the result, even one with no rows
zone_means = data.groupby("Zone", observed=False).mean(numeric_only=True).iloc[:, 0:12]

zone_means

In [None]:
# Plot a bar chart of mean HC per zone (no error bars)
zone_means["HC"].plot.bar(figsize=(6,4))

plt.title("Mean HC by Reef Zone")
plt.xlabel("Reef Zone")
plt.ylabel("Mean HC (%)")
plt.show()

Alternatively, we can use the seaborn package (imported as `sns`) to plot the mean and standard error directly from the original DataFrame.

In [None]:
sns.barplot(data=data, x="Zone", y="HC", estimator="mean", errorbar="se")

plt.title("Mean HC by Reef Zone")
plt.xlabel("Reef Zone")
plt.ylabel("Mean HC (%)")
plt.show()

## Part 3: Spatial distribution

**Option 1: Overlaying data onto a map**

Here we plot the `lat`, `long`, and `HC` columns from the merged `data` DataFrame.

The code creates a scatter map using Plotly Express `px.scatter_mapbox` (recent Plotly releases deprecate this function in favour of `px.scatter_map`, which takes similar arguments). Here's a breakdown of the key parameters:

- **lat** and **lon**: Latitude and longitude columns from your DataFrame.
- **size**: Optional and not used below; a column that sets the size of the markers.
- **color**: The color of the markers, determined by the `HC` column.
- **color_continuous_scale**: The color scale used for the markers.
- **center**: The initial center of the map.
- **opacity**: The opacity of the markers.
- **zoom**: The initial zoom level of the map.
- **mapbox_style**: The style of the map (in this case, **open-street-map**).

In [None]:
fig = px.scatter_mapbox(data, lat = 'lat', lon = 'long',
                        color = 'HC', color_continuous_scale = 'rainbow',
                        center = dict(lat = -23.498, lon = 152.07), opacity=0.7,
                        zoom = 12, mapbox_style = 'open-street-map')
fig.show()

In [None]:
# Save the figure as a PNG image (800 x 500 pixels); this requires the kaleido package
#fig.write_image("bubble-map-plotly.png", width = 800, height = 500)

#Or save as html to view in a browser:
fig.write_html("bubble-map-plotly.html")

**Option 2: Overlaying a shapefile onto a typical scatterplot**

In [None]:
# Convert to numeric, forcing non-numeric entries to NaN
data['lat'] = pd.to_numeric(data['lat'], errors='coerce')
data['long'] = pd.to_numeric(data['long'], errors='coerce')

print(data[['lat', 'long']].dtypes)

In [None]:
plt.figure(figsize=(6,4))
plt.scatter(data['long'], data['lat'])
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Sample locations")
plt.show()


In [None]:
# Load the shapefile
Reef = gpd.read_file("data/OTR_DryReef.shp")


# Plot
fig, ax = plt.subplots(figsize=(8, 6))

# Background polygon
Reef.plot(ax=ax, color="lightgrey", edgecolor="black", zorder=0) #zorder 0 means it will appear in the background
# Points coloured by HC
ax.scatter(data['long'],
           data['lat'],
           c=data['HC'],
           cmap="viridis",
           zorder=1 # zorder 1 means it will appear in the foreground
)

plt.show()


**Explanation of subplots**

In Matplotlib, a figure (`fig`) is the overall container for plots, and an axes (`ax`) is a single subplot within that figure where the data is drawn. By using `fig, ax = plt.subplots()`, we create both the figure and one or more axes objects. Passing `ax=ax` into plotting functions (e.g., `Reef.plot(ax=ax, ...)`) tells the function to draw its output on that specific subplot, instead of creating a new figure each time. This is useful for layering multiple datasets (such as reef polygons and survey points) in the same plot, or for arranging multiple subplots within one figure.
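
The layering idea can be sketched without a shapefile. In this minimal example a made-up rectangle stands in for the reef outline and three made-up points stand in for survey sites. Both are drawn on the same `ax`, and a colour bar is added so the point colours can be read.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 4))

# Made-up polygon standing in for a reef outline (background layer).
ax.fill([0, 4, 4, 0], [0, 0, 3, 3],
        facecolor="lightgrey", edgecolor="black", zorder=0)

# Made-up survey points drawn on the same axes, coloured by a value.
points = ax.scatter([1, 2, 3], [1, 2, 1], c=[10, 40, 25],
                    cmap="viridis", zorder=1)
fig.colorbar(points, ax=ax, label="HC (%)")

ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Two layers on one axes")
plt.show()
```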