# Correct code, with detailed explanation. It is meant to show the trend in each plot. To compare overall trends, all plots must share the same scale.
# The y-limits are not the same here, so side-by-side comparison may be difficult.

### There are multiple peak events within an annual record, so the middle of the landfall window is picked based on the hurricane start and end dates.
### This middle value is the same for all stations to allow a fair comparison. Because this is a storm surge event, all of these values are calculated from the first station, Kiptopeke (KK). It is therefore important to input KK separately (and first) to determine these ranges.

# Each station should have the same date ranges for the pre-hurricane, landfall, and post-hurricane periods.
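
A minimal sketch of how one set of shared windows could be derived from the reference station's peak, assuming the 3.5-day landfall half-width and 7-day pre/post windows used for Sandy 2012. The function name `shared_windows` and its dictionary layout are illustrative and not part of the original code.

```python
import pandas as pd

def shared_windows(peak_time,
                   half_landfall=pd.Timedelta(days=3.5),
                   side=pd.Timedelta(days=7)):
    """Return (start, end) pairs for each period, centred on one reference peak."""
    landfall_start = peak_time - half_landfall
    landfall_end = peak_time + half_landfall
    return {
        "Pre-hurricane": (landfall_start - side, landfall_start),
        "Landfall": (landfall_start, landfall_end),
        "Post-hurricane": (landfall_end, landfall_end + side),
    }

# Example peak time: the KK peak reported for Sandy 2012
windows = shared_windows(pd.Timestamp("2012-10-29 13:18:00"))
for name, (start, end) in windows.items():
    print(f"{name}: {start} -> {end}")
```

Computing the windows once and reusing the same dictionary for every station keeps the comparison periods identical, regardless of when each station's own peak occurs.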

Baseline magnitude upper and lower thresholds are set to isolate the baseline from periods where the interaction is extremely high, such as hurricane- or wind-induced peaks, monsoon depressions, etc.
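
The threshold idea can be sketched on a toy frame as follows. The data values are hypothetical; the column names `TSRWI` and `Labels` and the ±0.40 thresholds mirror the analysis code. Only rows labeled `Baseline` are filtered, and event rows are always kept.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the analysis code
df = pd.DataFrame({
    "TSRWI": [0.10, 0.55, -0.45, 0.80, -0.20],
    "Labels": ["Baseline", "Baseline", "Baseline", "Landfall", "Baseline"],
})

lower, upper = -0.40, 0.40  # baseline thresholds in metres

in_band = df["TSRWI"].between(lower, upper)  # inclusive on both ends
is_event = df["Labels"] != "Baseline"        # event rows are never filtered out
filtered = df[in_band | is_event].reset_index(drop=True)
print(filtered)
```

The two Baseline rows outside the band (0.55 and -0.45) are dropped, while the Landfall row at 0.80 survives because the filter does not apply to it.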

In [6]:
"""

################################ READ ME ######################################################
Comment Main:
There are intermediate comments all over the code. All the comments are very important to understand the functionality of the code.
So, here, your job is to read all the comments first then you are going to run the code.
################################ READ ME ######################################################



Job: Look into the distribution of NLI magnitude shifts during normal and extreme conditions

Capacity to handle 1 station
Notes:
Comment_1:
After importing the csv, the column headers are renamed from their specific station-wise code
to a more generic version: "TideOnly", "Surge", "RiverTide", "Wave", and "AllForcing"

Comment_2:
#S1: T = Tide
#S2: T+SS = Tide + Surge # Wind pressure was considered, no wave component
#S3: T+RF = Tide + River Flow – No Wind/Pressure was considered here.
#S4: T+W = Tide + Wave – No Pressure was considered here in the Atlantic as well as the FM CPB model domain. 
#S5: Atlantic ocean with all component – P/W, wave, river flow
#Everything is considered here. For now, observation readings are taken as s5.
#====================================================================================================================================#
#NLI_1: S2-S1: (T+SS)-T = Tide-Surge-Interaction (TSI) + Contribution of SS
#NLI_2: S3-S1: (T+RF)-T = Tide-River-Interaction (TRI) + Contribution of RF
#NLI_3: S4-S1: (T+W)-T= Tide-Wave-Interaction (TWI) + Contribution of Wave
#NLI_all: S5-S1: (T+SS+RF+W)-T=Tide-Surge-River-Wave-Interaction



"""

import time
import glob
from netCDF4 import Dataset
import csv
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import linregress
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

%matplotlib qt
### annotation function
def annotate_r2(x, y, **kwargs):
    # Calculate the linear regression
    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    # Calculate R-squared
    r2 = r_value**2
    # Plot the annotation with R-squared value
    ax = plt.gca()
    ax.text(.05, .8, f'$R^2 = {r2:.2f}$', transform=ax.transAxes, fontsize=12, color='black')



station_dict = {
    "SP": "Sewells Point, VA",
    "MP": "Money Point, VA",
    "KK": "Kiptopeke, VA",
    "WAC": "Wachapreague, VA",
    "YOR": "Yorktown USCG, VA",
    "WP": "Windmill Point, VA",
    "OC": "Ocean City Inlet, MD",
    "LEW": "Lewisetta, VA",
    "SI": "Solomons Island, MD",
    "DC": "Washington, DC",
    "BH": "Bishops Head, MD",
    "CC": "Chesapeake City, MD",
    "BAL": "Baltimore, MD",
    "TB": "Tolchester Beach, MD",
    "ANN": "Annapolis, MD",
    "CAM": "Cambridge, MD",
    "DAH": "Dahlgren, VA",
    "CBBT": "CBBT, Chesapeake Channel, VA"
}


# Import the data and convert it to the standard format. See Comment_1


####################### INPUT START #############################
# The middle (landfall) window is 7 days, and the pre- and post-hurricane windows are also 7 days each.
# These 7-day windows are based on Sandy 2012 and are valid for it; for shorter, faster hurricanes they may not be appropriate.
# It is therefore extremely important to plot the time series and judge the windows visually.
# An open concern: would different time windows still be valid for comparative analysis?

timedelta_landfall= pd.Timedelta(days=3.5)
timedelta_X2= pd.Timedelta(days=7.0)

# The code searches for the peak water level within this period and then assigns the 7-day windows for pre-hurricane and post-hurricane
# For Irene, landfall was on 27 August 2011
# For Sandy, landfall was on 29 October 2012

hurricane_start = pd.to_datetime('2012-10-26')
hurricane_end = pd.to_datetime('2012-11-02')

fig, axs = plt.subplots(4, 3, figsize=(15, 18))

stations = ['KK', 'LEW', 'ANN']
input_directory = r'F:\DataExtractionStations\Station_2012'

ensemble_size = 500

baseline_upper_threshold=0.40
baseline_lower_threshold=-0.40
####################### INPUT END #############################

for station_index, station in enumerate(stations):
    directory_path = f"{input_directory}\\{station}.csv"
    print(f"Station {station_index + 1}: {station} - Directory: {directory_path}")

    
    data_raw=pd.read_csv(directory_path)

    # Columns are renamed by searching each column header for the specific terms listed below.
    rename_mapping = {
        'TideOnly': 'Harmonic Tide',
        'Surge': 'Surge',
        'RiverTide': 'RiverTide',
        'Wave': 'Wave',
        'AllForcing': 'AllForcing'
    }
    data_raw.rename(columns=lambda x: next((v for k, v in rename_mapping.items() if k in x), x), inplace=True)
    
    # See comment 2
    data_raw['TSI+SS'] = data_raw['Surge'] - data_raw['Harmonic Tide']
    data_raw['TRI+RF'] = data_raw['RiverTide'] - data_raw['Harmonic Tide']
    data_raw['TWI+WV'] = data_raw['Wave'] - data_raw['Harmonic Tide']
    data_raw['TSRWI'] = data_raw['AllForcing'] - data_raw['Harmonic Tide']
    
    # isolate the required data for plotting
    data_df = data_raw[['DateTime (GMT)', 'AllForcing','Harmonic Tide', 'TSI+SS', 'TRI+RF', 'TWI+WV', 'TSRWI']]
    
    ####################### INPUT START #############################
    # Drop the initial 15 days of data for stability (10 samples/hour * 24 h * 15 days at 6-minute spacing)
    data_df = data_df.iloc[10*24*15:,:].reset_index(drop=True)
    
    ####################### INPUT END #############################
    
    
    """
    Comment_3:
    segmentation description: 
    
    Hurricane_landfall: 
    The peak water level time for the station is found from the AllForcing simulation water level (standard). 
    A check is made that it falls within the landfall time boundary 10/29/2012 - 10/30/2012. 
    Then 3.5 days before and after this index are taken as the peak hurricane landfall timeline.
    For example: KK max water level is at '2012-10-30 10:48:00', value: 0.7862355799560957. The 7 days centered on this peak
    are labeled as hurricane landfall. 
    
    pre-hurricane: 7 days prior 
    post-hurricane: 7 days post
    Baseline: 16-Jan-2012 to 31-Dec-2012; Except: Timelines for landfall, pre-hurricane and post-hurricane
    """
    
    #--------------------------Details in comment_3 creating dividing labels within the dataframe ---------------------------
    data_df['DateTime (GMT)'] = pd.to_datetime(data_df['DateTime (GMT)'], format='%Y-%m-%d %H:%M:%S')
    
    # Find the landfall date: the maximum water level and its corresponding datetime
    # Filter data to include only rows within the hurricane timeframe

    
    data_df_hurricane = data_df[(data_df['DateTime (GMT)'] >= hurricane_start) & 
                                 (data_df['DateTime (GMT)'] <= hurricane_end)]
    
    # Find the maximum water level within the hurricane period
    max_wl = data_df_hurricane['AllForcing'].max()
    
    # Find the corresponding date for the max water level within the hurricane period
    max_wl_date = data_df_hurricane.loc[data_df_hurricane['AllForcing'] == max_wl, 'DateTime (GMT)'].iloc[0]
    
    print(f"Max water level date: {max_wl_date}, station: {station}")


############################ Fixing the landfall date center based on the coastal station!#######################
    ############################ It will be same for all other station ##########################################
    # Check if the date is within the specified range
    if not (hurricane_start <= max_wl_date <= hurricane_end):
        warning_message = f"Warning: Maximum 'AllForcing' value occurs on {max_wl_date}, which is outside the specified date range."
        print(warning_message)
    
    if station == 'KK': 
    # Define landfall, pre-hurricane, and post-hurricane periods
        landfall_start = max_wl_date - timedelta_landfall
        landfall_end = max_wl_date + timedelta_landfall
    
        pre_hurricane_start = landfall_start - timedelta_X2
        pre_hurricane_end = landfall_start
        
        post_hurricane_start = landfall_end
        post_hurricane_end = landfall_end + timedelta_X2
############################ Fixing the landfall date center based on the coastal station!#######################
    
    # Add the 'Labels' column with default 'Baseline' value
    data_df['Labels'] = 'Baseline'

    # Label the rows according to the date ranges
    data_df.loc[(data_df['DateTime (GMT)'] >= landfall_start) & (data_df['DateTime (GMT)'] <= landfall_end), 'Labels'] = 'Landfall'
    data_df.loc[(data_df['DateTime (GMT)'] >= pre_hurricane_start) & (data_df['DateTime (GMT)'] < pre_hurricane_end), 'Labels'] = 'Pre-hurricane'
    data_df.loc[(data_df['DateTime (GMT)'] > post_hurricane_start) & (data_df['DateTime (GMT)'] <= post_hurricane_end), 'Labels'] = 'Post-hurricane'
    
    # Combine the counts with the start and end datetimes for each categorical value
    label_summary = data_df.groupby('Labels').agg(
        Count_6min=('DateTime (GMT)', 'size'),
        Start_DateTime=('DateTime (GMT)', 'min'),
        End_DateTime=('DateTime (GMT)', 'max')
    )
    
    # Display the combined summary
    print(label_summary)
    
    
    """
    The baseline data contains 79,190 data points, roughly 50 times more than the data points available for landfall and 
    pre/post-hurricane conditions. To create a representative and comparable KDE plot, the baseline data needed to be sampled.
    
    If a single sample were taken, it might miss important underlying trends, such as the highly right-skewed distribution
    observed in the TRI+RF data. To better capture these trends, a Monte Carlo sampling approach with 10,000 iterations could be applied.
    However, this would cause a scaling issue. 
    
    In the first iteration (i == 0), all the KDEs are plotted together: 'Baseline' and the others
    ('Pre-hurricane', 'Landfall', and 'Post-hurricane'). This dataset combines the subsampled 'Baseline' data with all other
    label data, so each label's curve appears smaller, because the total area under the curves sums to 1.
    
    This is because the KDEs are normalized within each label, leading to overlapping distributions.
    
    In the other iterations (i != 0), only the KDE for the 'Baseline' subsample is plotted, without the other labels.
    As a result, the KDE plots in these iterations, especially the blue curves representing 'Baseline', appear larger:
    with no normalization effect across multiple labels, the 'Baseline' KDE is more prominent and occupies more space in the plot.
    
    
    Therefore, here, I have resorted to a simple random selection of non-extreme data points.
    Each of the ensemble_size ensemble members draws 1680 data points from the 79,190 baseline points and plots a KDE every time.
    
    The custom line style makes the pre- and post-hurricane plots invisible in later iterations by setting linewidth = 0.
    
    """
    
    ####################### INPUT START#############################
    data_labeled_df=data_df.drop(columns=['AllForcing'])
    # isolating non-baseline event from the dataset.
    # Apply the threshold filter specifically for rows labeled as 'Baseline'
    data_labeled_df = data_labeled_df[
        ((data_labeled_df['TSRWI'] >= baseline_lower_threshold) & 
         (data_labeled_df['TSRWI'] <= baseline_upper_threshold)) | 
        (data_labeled_df['Labels'] != 'Baseline')
    ]
    


    
    # Find the minimum count across labels (other than Baseline) to set the sampling count.
    min_count = data_labeled_df['Labels'].value_counts().min()
    
    
    
    
    first_iteration_styles = {
        'Pre-hurricane': { 'color': 'darkorange', 'linewidth': 3.25, 'alpha': 1.0},
        'Landfall': { 'color': 'red', 'linewidth': 3.25, 'alpha': 1.0},
        'Post-hurricane': { 'color': 'green', 'linewidth': 3.25, 'alpha': 1.0},
        'Baseline': {'color': 'blue', 'linewidth': 0.25, 'alpha': 0.20}
    }
    
    # different colors are set to detect a code crash
    subsequent_iteration_styles = {
        'Pre-hurricane': { 'color': 'orange', 'linewidth': 0, 'alpha': 0.3},
        'Landfall': { 'color': 'brown', 'linewidth': 0, 'alpha': 0.4},
        'Post-hurricane': { 'color': 'cyan', 'linewidth': 0, 'alpha': 0.5},
        'Baseline': {'color': 'blue', 'linewidth': 0.15, 'alpha': 0.05}
    }
    
    
    
    ####################### INPUT END #############################
    
    # Initialize a list to keep track of plotted labels
    plotted_labels = []
    
    
    for i in range(ensemble_size):
        # Subsample Baseline data to smallest label category count, random sampling [MC not used]
        baseline_subsample = data_labeled_df[data_labeled_df['Labels'] == 'Baseline'].sample(n=min_count, random_state=None)
        # creating unified dataframe form sampled data
        data_balanced_df = pd.concat([baseline_subsample, data_labeled_df[data_labeled_df['Labels'] != 'Baseline']]) 
    
        # setting kde plot labels and iterating through subplot position
        if i == 0: # iteration 0 is used for the legend labels and for all visible KDE curves; later iterations show only Baseline
            custom_styles = first_iteration_styles
        else:
            custom_styles = subsequent_iteration_styles
    
        for r, (ax, variable) in enumerate(zip(axs[:,station_index], ['TSI+SS', 'TRI+RF', 'TWI+WV', 'TSRWI'])):  # This will be the station iteration
            for label in data_balanced_df['Labels'].unique():
                # Keyword arguments shared by every KDE call for this label
                kde_kwargs = dict(
                    data=data_balanced_df[data_balanced_df['Labels'] == label],
                    x=variable,
                    ax=ax,
                    common_norm=True,
                    alpha=custom_styles[label]['alpha'],
                    color=custom_styles[label]['color'],
                    linewidth=custom_styles[label]['linewidth']
                )
                # Attach a legend label only the first time a label is plotted
                if label not in plotted_labels:
                    kde_kwargs['label'] = label
                    plotted_labels.append(label)
                sns.kdeplot(**kde_kwargs)
    
            if r == 0:
                ax.set_title(station_dict[station],fontsize=15)
                
            ax.set_xlabel(f'{variable} Magnitude (m)', fontsize=15)  # X-label in font size 15
            ax.set_ylabel('Density', fontsize=16)  # Y-label in font size 16
            
            # Set tick parameters for both x and y axes with label size 16
            ax.tick_params(axis='both', which='major', labelsize=16)
                
    
    
    
handles, labels = axs[0, 0].get_legend_handles_labels()
labels[0]='Baseline Ensemble'
fig.legend(handles, labels, loc='upper right', bbox_to_anchor=(0.98, 0.735), bbox_transform=plt.gcf().transFigure, ncol=1, frameon=True, fontsize=16)

# Adjust layout for clarity
plt.tight_layout()
plt.show()


    
    



Station 1: KK - Directory: F:\DataExtractionStations\Station_2012\KK.csv
Max water level date: 2012-10-29 13:18:00, station: KK
                Count_6min      Start_DateTime        End_DateTime
Labels                                                            
Baseline             79190 2012-01-16 00:00:00 2012-12-31 23:00:00
Landfall              1681 2012-10-26 01:18:00 2012-11-02 01:18:00
Post-hurricane        1680 2012-11-02 01:24:00 2012-11-09 01:18:00
Pre-hurricane         1680 2012-10-19 01:18:00 2012-10-26 01:12:00
Station 2: LEW - Directory: F:\DataExtractionStations\Station_2012\LEW.csv
Max water level date: 2012-10-30 19:00:00, station: LEW
                Count_6min      Start_DateTime        End_DateTime
Labels                                                            
Baseline             79190 2012-01-16 00:00:00 2012-12-31 23:00:00
Landfall              1681 2012-10-26 01:18:00 2012-11-02 01:18:00
Post-hurricane        1680 2012-11-02 01:24:00 2012-11-09 01:18:00
Pre-

In [2]:
data_balanced_df

Unnamed: 0,DateTime (GMT),Harmonic Tide,TSI+SS,TRI+RF,TWI+WV,TSRWI,Labels
71793,2011-11-11 03:18:00,-0.005926,-0.240650,0.012269,-0.261779,-0.242609,Baseline
12668,2011-03-09 18:48:00,-0.049959,-0.014087,0.061635,0.091257,0.033794,Baseline
66305,2011-10-19 06:30:00,0.066635,-0.072909,0.019043,-0.073640,-0.057179,Baseline
68343,2011-10-27 18:18:00,0.011930,-0.205325,0.012660,-0.073957,-0.199424,Baseline
33199,2011-06-03 07:54:00,0.078932,-0.191525,0.012889,-0.191645,-0.183249,Baseline
...,...,...,...,...,...,...,...
56270,2011-09-07 11:00:00,-0.008436,0.112396,0.065675,0.080040,0.186379,Post-hurricane
56271,2011-09-07 11:06:00,-0.010334,0.114610,0.065134,0.081250,0.188405,Post-hurricane
56272,2011-09-07 11:12:00,-0.011856,0.116761,0.064573,0.082402,0.190239,Post-hurricane
56273,2011-09-07 11:18:00,-0.012988,0.118766,0.063997,0.083429,0.191745,Post-hurricane
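
The frame above has shuffled Baseline indices followed by event rows in time order because each ensemble member is a random Baseline subsample concatenated with all non-Baseline rows. A minimal sketch of that balancing step on hypothetical toy data (the helper `draw_balanced`, the toy sizes, and the fixed seeds are assumptions made only so the example is repeatable; the original code samples with `random_state=None`):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed only so the sketch is repeatable

# Hypothetical toy frame: many Baseline rows, few event rows
toy = pd.DataFrame({
    "TSRWI": rng.normal(size=1100),
    "Labels": ["Baseline"] * 1000 + ["Landfall"] * 60 + ["Post-hurricane"] * 40,
})

min_count = toy["Labels"].value_counts().min()  # size of the smallest category
baseline = toy[toy["Labels"] == "Baseline"]
events = toy[toy["Labels"] != "Baseline"]

def draw_balanced(seed):
    """One ensemble member: a Baseline subsample plus all event rows."""
    sample = baseline.sample(n=min_count, random_state=seed)
    return pd.concat([sample, events])

members = [draw_balanced(seed) for seed in range(5)]
print(members[0]["Labels"].value_counts())
```

Drawing many such members and overlaying their Baseline KDEs gives a visual spread for the baseline distribution without letting the much larger Baseline group dominate a single plot.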


In [3]:
data_labeled_df

Unnamed: 0,DateTime (GMT),Harmonic Tide,TSI+SS,TRI+RF,TWI+WV,TSRWI,Labels
0,2011-01-16 00:00:00,-0.000303,0.026484,0.002133,0.139975,0.036253,Baseline
1,2011-01-16 00:06:00,-0.000783,0.028737,0.002164,0.141346,0.038428,Baseline
2,2011-01-16 00:12:00,-0.000978,0.030892,0.002188,0.142825,0.040530,Baseline
3,2011-01-16 00:18:00,-0.000897,0.033351,0.002205,0.144742,0.042967,Baseline
4,2011-01-16 00:24:00,-0.000533,0.035922,0.002217,0.146837,0.045558,Baseline
...,...,...,...,...,...,...,...
83986,2011-12-31 22:36:00,0.001765,-0.329691,0.017297,-0.226313,-0.302975,Baseline
83987,2011-12-31 22:42:00,0.007922,-0.331081,0.016991,-0.229799,-0.303931,Baseline
83988,2011-12-31 22:48:00,0.014227,-0.332509,0.016708,-0.233279,-0.305004,Baseline
83989,2011-12-31 22:54:00,0.020680,-0.333922,0.016450,-0.236618,-0.306108,Baseline
