# Ocean Wave Data from Buoys Reported by National Data Buoy Center (NDBC)

## Table of Contents

1. [**Importing Necessary Libraries**](#0)

2. [**Data Collection**](#10)

3. [**A look at the Data**](#20)

4. [**Exploratory Data Analysis**](#30)

5. [**Model Development & Evaluation**](#40)
  
  5.1 [**K-Means Clustering**](#45)

## 1. Importing Necessary Libraries <a class="anchor" id="0"></a>

In [None]:
# Install and import the required libraries
# Uncomment the following lines if the libraries are not installed on your machine
#!conda install -c conda-forge beautifulsoup4 --yes 
#!conda install -c conda-forge lxml --yes
#!conda install -c conda-forge requests --yes
#!conda install -c conda-forge folium=0.9 --yes
#!conda install -c conda-forge windrose=1.6.7 --yes
#!conda install -c conda-forge ipywebrtc --yes

import sys
from bs4 import BeautifulSoup
import requests
from urllib.request import urlopen, Request
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from windrose import WindroseAxes
import folium
import json
from datetime import date, timedelta, datetime
# Put figures in the center
from IPython.core.display import HTML
import seaborn as sns
sns.set(style="whitegrid")
print('All libraries installed and imported')

## 2. Data Collection <a class="anchor" id="10"></a>

Now, let's get the list of all the buoys from the National Data Buoy Center. To access the data, we first need to create an account [here](https://data.planetos.com/datasets/noaa_ndbc_stdmet_stations?utm_source=github&utm_medium=notebook&utm_campaign=ndbc-wavewatch-iii-notebook). Once the account is created, the API key can be found under __Account Setting__. The dataset id for the National Data Buoy Center (NDBC) Standard Meteorological Data is __noaa_ndbc_stdmet_stations__. This dataset typically updates every hour.

### First, we use an API to get the buoys' latitudes, longitudes, and names.

In [None]:
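# @ hidden
APIKey = ''  # intentionally left blank: paste your own Planet OS API key here
DatasetID = 'noaa_ndbc_stdmet_stations'  # dataset id given in the text above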

In [None]:
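API_url = 'http://api.planetos.com/v1/datasets/%s/stations?apikey=%s' % (DatasetID, APIKey)
request = requests.get(API_url).json()
Stations = pd.DataFrame(request['station'])
df_stations = pd.DataFrame(Stations.columns)
# Columns 1 and 2 receive the first and second entry of each station's coordinate pair.
# Note: if the pair follows GeoJSON order [longitude, latitude], the column labels
# assigned below are swapped; BuoyLocationPlot2 swaps them back when plotting.
for indx, StName in enumerate(Stations.columns):
    coordinates = Stations.loc['SpatialExtent', StName]['coordinates']
    df_stations.loc[indx,1]=coordinates[0]
    df_stations.loc[indx,2]=coordinates[1]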

df_stations.columns = ['Station_Name','Latitude','Longitude']
print('number of stations:',df_stations.shape[0])
df_stations.head()
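
The request URL above is assembled with `%` string formatting. As a minimal sketch, the same URL can be built with `urllib.parse.urlencode`, which also escapes unusual characters in the key. The helper name `build_stations_url` and the value `"DEMO_KEY"` are hypothetical and used only for illustration:

```python
from urllib.parse import urlencode

def build_stations_url(dataset_id, api_key):
    """Return the Planet OS stations endpoint used above for one dataset."""
    base = 'http://api.planetos.com/v1/datasets/{}/stations'.format(dataset_id)
    return base + '?' + urlencode({'apikey': api_key})

# "DEMO_KEY" is a placeholder, not a working key
print(build_stations_url('noaa_ndbc_stdmet_stations', 'DEMO_KEY'))
```

Keeping the URL construction in one small function makes it easier to swap the dataset id without touching the request code.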

### Since the data from this API is not complete for every buoy and also has missing columns, we use the National Data Buoy Center website directly at https://www.ndbc.noaa.gov/ to get our information.

There is a list of stations categorized by their owners [here](https://www.ndbc.noaa.gov/to_station.shtml).

In [None]:
source=requests.get('https://www.ndbc.noaa.gov/to_station.shtml').text
soup = BeautifulSoup(source,'lxml')
Owners=[]
BuoyData={}

for owner in soup.find_all('h4'):
    Owners.append(owner.text)
# Pair each owner heading with the preformatted block that lists its stations
for owner, table in zip(Owners, soup.find_all('pre')):
    BuoyData[owner]=table.text.replace('\n','').strip()


In [None]:
# First, make all the station names uppercase so they match the station IDs listed on the website
df_stations['Station_Name']=df_stations['Station_Name'].str.upper()
# Now, put the owners and the latitudes and longitudes in the same dataframe
newlist1=[]
newlist2=[]
for indx1 in range(df_stations.shape[0]):
    for indx2,owner in enumerate(Owners):
        if (df_stations.iloc[indx1,0] in BuoyData[owner]):
            newlist1.append(owner)
            newlist2.append(df_stations.iloc[indx1,0])
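
The membership test above is a substring search, so a short station ID also matches any longer ID that contains it. The following is a minimal sketch of a whole-token variant. It assumes the IDs in each owner's listing are separated by whitespace (the cell above removes newlines, so they would have to be replaced by spaces instead). The function name `owners_for_station` and the toy listings are hypothetical:

```python
def owners_for_station(station_id, listings):
    """Return owners whose listing contains station_id as a whole token."""
    return [owner for owner, text in listings.items()
            if station_id in text.split()]

# Toy listings with made-up station IDs, standing in for BuoyData
listings = {'Owner A': '41001 41002 BURL1', 'Owner B': '4100 46001'}

print(owners_for_station('4100', listings))             # whole-token match
print([o for o, t in listings.items() if '4100' in t])  # substring match
```

With the toy data, the substring test also reports `Owner A` because `4100` is a prefix of `41001`, while the token test reports only `Owner B`.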


In [None]:
df_stations0=pd.DataFrame({'Station_Name':newlist2,'Owner':newlist1})
df_full1=pd.merge(df_stations0, df_stations, on='Station_Name')
print('Number of Stations:',df_full1.shape[0])
df_full1.head()

In [None]:
def BuoyLocationPlot2(buoys,MapInput,DataFrameInput,colorm,fill_colorm,Marker=False):
    """Add one circle marker per station to the feature group `buoys` and return it."""
    # folium expects [latitude, longitude]. The 'Longitude' column is passed first,
    # which implies the column labels in df_stations are swapped relative to their contents.
    for lat, lng, label in zip(DataFrameInput.Longitude, DataFrameInput.Latitude, DataFrameInput.Station_Name):
        buoys.add_child(
            folium.CircleMarker(
                [lat, lng],
                radius=6, # size of the circle markers
                color=colorm,
                fill=True,
                fill_color=fill_colorm,
                fill_opacity=0.8,
                popup=label
            )
        )

    # Optionally add standard pin markers directly to the map
    if Marker:
        for lat, lng, label in zip(DataFrameInput.Longitude, DataFrameInput.Latitude, DataFrameInput.Station_Name):
            folium.Marker([lat, lng], popup=label).add_to(MapInput)

    return buoys

### Available Data for each buoy
Not all the buoys have data available for the same period of time. Some have been operating only for a short time, some operated in the past but no longer do, and some only recently started collecting data. To get the full time range of the data available for each buoy, we use the following page from the National Data Buoy Center website:
https://www.ndbc.noaa.gov/historical_data.shtml#stdmet. Note that we are only interested in Standard Meteorological data. For a better understanding of Standard Meteorological data, see the information provided here: https://www.ndbc.noaa.gov/measdes.shtml

In [None]:
source=requests.get('https://www.ndbc.noaa.gov/historical_data.shtml').text
soup = BeautifulSoup(source,'lxml')

BuoyYearData={}
BigTable=soup.find_all('ul')[1]
SmallTable=BigTable.li.ul

for subtable in SmallTable.find_all('li'):
    lines = subtable.text.split('\n')
    StatName = lines[0].replace(':','')
    # The remaining lines hold the years; strip whitespace and drop the trailing entry
    tmplist = [value.strip() for value in lines[1:]]
    del tmplist[-1]
    BuoyYearData[StatName]=tmplist
    

# Add one column for each year to the main dataframe
YearsRange = range(1970,2019)
df_full = df_full1.copy(deep=True)
df_full.set_index(['Station_Name'],inplace=True)
for year in YearsRange:
    df_full[year] = None

# Flag with 1 every year for which a station has historical data.
# The year columns are taken from YearsRange directly, so 1970 is included.
first=np.array(YearsRange).astype('int64')
for StName in df_full.index:
    try:
        second=np.array(BuoyYearData[StName]).astype('int64')
    except (KeyError, ValueError):
        # Station missing from the historical listing, or non-numeric entries: leave as None
        continue
    df_full.loc[StName ,np.intersect1d(first,second)]=1

In [None]:
# Plotting the buoys with their corresponding owners
figmp=folium.Figure(width=1300, height=700)
WorldMap_map2 = folium.Map(location=[17.6078, -8.0817],tiles="Stamen Toner",zoom_start=2).add_to(figmp)
color=iter(plt.cm.rainbow(np.linspace(0,1,len(Owners))))
for indx,owner in enumerate(Owners):
    clr=matplotlib.colors.to_hex(next(color))
    feature_group = folium.map.FeatureGroup(name=owner)
    df_tmp = df_full1[df_full1['Owner']==owner]
    buoys=BuoyLocationPlot2(feature_group,WorldMap_map2,df_tmp,clr,clr)
    WorldMap_map2.add_child(buoys,name=owner,index=indx)
    
folium.TileLayer("Stamen Toner").add_to(WorldMap_map2) 
folium.TileLayer("OpenStreetMap").add_to(WorldMap_map2) 
folium.TileLayer("Stamen Terrain").add_to(WorldMap_map2)
WorldMap_map2.add_child(folium.map.LayerControl())
WorldMap_map2.save("WorldMapBuoysOwners.html")
WorldMap_map2

### Get the data for a specific time period for all the buoys on the list

In [None]:
from datetime import date, timedelta

BYear=input('Enter the year in the format YYYY: ')

MonthsNames=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
# Map each month name to the timestamp of its first day in the chosen year
Months={name:'{}-{:02d}-01 00:00:00'.format(BYear,i+1) for i,name in enumerate(MonthsNames)}


startD=input('Enter the start date in the format DDMon, like 10Jan: ')
endD=input('Enter the end date in the format DDMon, like 25Dec: ')
date_start = startD+str(BYear)
date_end = endD+str(BYear)
date_st = datetime.strptime(date_start, "%d%b%Y")
date_en = datetime.strptime(date_end, "%d%b%Y")
print()
print('start date:',date_st)
print('end date:',date_en)
print('')
print('')

# To record the data for all the buoys
column_names_tmp = ["Station_Name","AVG_WDIR", "AVG_WSPD", "AVG_WVHT", "AVG_APD", "AVG_MWD"]
Finaldf = pd.DataFrame(columns = column_names_tmp)

NoData=[] # To record the list of buoys with no data at the time period
count=0
for ii in range(len(df_full.index)):
    BName=df_full.index[ii]
    print('')
    print('Name of the station:'+str(BName))
    try: 
        print(BYear in BuoyYearData[BName])
        if (BYear in BuoyYearData[BName]):
            print('Available annual data for this buoy:',BuoyYearData[BName])
            print('Data is available for buoy '+str(BName))
            print("Now, let's continue!")
            url = "https://www.ndbc.noaa.gov/view_text_file.php?filename={}h{}.txt.gz&dir=data/historical/stdmet/".format(BName.lower(),BYear)
            print("Downloading the data ...")
            print(url)
            df_BName = pd.read_csv(url,na_values=[99,999,9999],delim_whitespace=True, skiprows=range(1,2))
            print('Download is done!')

            # Concatenate (#YY MM DD hh mm) into one column as DATE
            df_BName.rename(columns={'#YY':'year','MM':'month','DD':'day','hh':'hour','mm':'minute'},inplace=True)
            df_BName['DATE']=pd.to_datetime(df_BName[['year', 'month', 'day','hour','minute']])
            df_BName.drop(columns=['year','month','day','hour','minute'],inplace=True)
        
            df_BName=df_BName.loc[(df_BName['DATE']>=date_st) & (df_BName['DATE']<=date_en)]
            df_BName.sort_values(by='DATE',ascending=False,inplace=True)
            df_BName.reset_index(drop=True,inplace=True)
    
            listLabels = ['Wind Dir. (deg)', 'Wind speed (m/s)', 'Sign. Wave Height (m)', 'Domi. Wave Period (s)',\
                          'Ave. Wave Period (s)','Wave Dir. (deg)', 'Sea Level Pressure (hPa)','Air Temp. (C)', 'Sea Surface Temp. (C)',\
                          'Dewpoint Temp.','Station Visib. (nautical miles)','Water Level']
            df_BName.drop(columns=['GST'],inplace=True)
            
            df_BName=df_BName.dropna(axis=1, how="all")
            
            # Looking at each month
            hours=np.linspace(0,23,num=24)
            hrdic ={}
            emptydaysinamonth={}
            for idd,monthh in enumerate(MonthsNames):
                try:
                    mask = (df_BName['DATE'] >= Months[MonthsNames[idd]]) & (df_BName['DATE'] < Months[MonthsNames[idd+1]])
                except:
                    mask = df_BName['DATE'] >= Months[MonthsNames[idd]]

                df_month=df_BName.loc[mask]

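                try:
                    days = pd.date_range(Months[MonthsNames[idd]],Months[MonthsNames[idd+1]],freq='d',closed='left')
                except IndexError:
                    # December: run up to (but not including) 1 January so that 31 December is checked too
                    days = pd.date_range(str(BYear)+'-12-01',str(int(BYear)+1)+'-01-01',freq='d',closed='left')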

                # looking at each day within the month
                emptydays=[]
                for did,dayy in enumerate(days):


                    if (dayy.day not in pd.DatetimeIndex(df_month['DATE']).day):
                        emptydays.append(dayy.day)

                    else:
                        df_day=df_month[pd.DatetimeIndex(df_month['DATE']).day==dayy.day]

                        # looking at hours in a day
                        boolholder=0
                        for idh,hr in enumerate(hours):
                            if (hr in pd.DatetimeIndex(df_day['DATE']).hour):
                                boolholder+=1

                        if (24-boolholder)>=22:
                            emptydays.append(dayy.day)
            
                emptydaysinamonth[monthh]=emptydays
            
                # counting the pairs of consecutive empty days
                consdays=np.count_nonzero(np.diff(np.array(emptydaysinamonth[monthh]))==1)
            
                #print('Empty days in ',monthh,' are:',emptydaysinamonth[monthh])
                #print('Number of empty days in ',monthh,' is:',len(emptydaysinamonth[monthh]))
                #print('Number of consecutive days with no data:',consdays)

                if (len(emptydaysinamonth[monthh])>=11 or consdays>=5):
                    #print(idd)
                    try:
                        df_BName[pd.DatetimeIndex(df_BName["DATE"]).month==idd+1]=None
                    except:
                        None
                        
                
                df_BName=df_BName.dropna(axis=0, how="all")
    
        
            if len(pd.DatetimeIndex(df_BName["DATE"]).month.value_counts())>=8:
                # Average each available variable; variables without a column become None
                row = {'Station_Name': BName}
                for var in ['WDIR', 'WSPD', 'WVHT', 'APD', 'MWD']:
                    row['AVG_' + var] = df_BName[var].mean() if var in df_BName.columns else None
                # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
                Finaldf = pd.concat([Finaldf, pd.DataFrame([row])], ignore_index=True)
                    
                    
                count=count+1
                print(count)
                    
            else:
                
                NoData.append(BName)
            
        else:
            print('Buoy '+str(BName)+' has no data recorded for this period.')
    except Exception:
        print('Buoy '+str(BName)+' has no data recorded for this period.')
        NoData.append(BName) # Stores the name of the buoys with no recorded data
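
The download step above relies on the layout of the historical stdmet files: a header row, a units row that is skipped, and whitespace-separated records whose sentinel values (99, 999, 9999) mark missing data. The following dependency-free sketch parses a tiny sample in that layout. The sample rows are made up and reduced to a few columns, and `parse_stdmet` is a hypothetical helper, not part of the notebook:

```python
from datetime import datetime

# Made-up, reduced sample in the layout of an NDBC stdmet file
SAMPLE = """\
#YY  MM DD hh mm WDIR WSPD WVHT   APD MWD
#yr  mo dy hr mn degT m/s     m   sec degT
2018 01 01 00 50  310  8.2  1.50  5.10 999
2018 01 01 01 50  999  7.9 99.00  5.30 300
"""

MISSING = {99.0, 999.0, 9999.0}  # sentinel values treated as missing above

def parse_stdmet(text):
    """Parse stdmet text into a list of dicts with a DATE and float fields."""
    lines = text.strip().splitlines()
    header = lines[0].lstrip('#').split()
    records = []
    for line in lines[2:]:  # lines[1] is the units row
        fields = line.split()
        year, month, day, hour, minute = (int(v) for v in fields[:5])
        row = {'DATE': datetime(year, month, day, hour, minute)}
        for name, raw in zip(header[5:], fields[5:]):
            value = float(raw)
            row[name] = None if value in MISSING else value
        records.append(row)
    return records

for rec in parse_stdmet(SAMPLE):
    print(rec)
```

As in the cell above, the sentinels are applied to every column, so a genuine reading equal to a sentinel would also be treated as missing.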
        

In [None]:
len(pd.DatetimeIndex(df_BName["DATE"]).month.value_counts())
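
The month-level quality filter used in the download loop can be isolated as a small helper. This is a sketch of that rule under the thresholds used above (at least 11 empty days, or at least 5 adjacent empty-day pairs). The function name `month_is_sparse` and its parameter names are hypothetical. Note that the count is of adjacent pairs of empty days, not the length of the longest gap:

```python
def month_is_sparse(empty_days, max_empty=11, max_adjacent=5):
    """Return True when a month should be discarded under the rule above."""
    days = sorted(empty_days)
    # number of neighbouring entries that are exactly one day apart
    adjacent_pairs = sum(1 for a, b in zip(days, days[1:]) if b - a == 1)
    return len(days) >= max_empty or adjacent_pairs >= max_adjacent

print(month_is_sparse([3, 17, 25]))          # a few isolated gaps
print(month_is_sparse([4, 5, 6, 7, 8, 9]))   # six-day outage, five adjacent pairs
```

A station is then kept only if at least eight months survive this filter, which is what the cell above checks.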

In [None]:
print('Number of stations with recorded data from '+str(date_st)+' to '+ str(date_en)+ ' is '+ str(count))

In [None]:
Finaldf.head(10)

## 4. Exploratory Data Analysis <a class="anchor" id="30"></a>


Let's take a general look at the information for all five parameters: AVG_WDIR, AVG_WSPD, AVG_WVHT, AVG_APD, and AVG_MWD.

__WDIR__: Wind direction (the direction the wind is coming from in degrees clockwise from true N) during the same period used for WSPD

__WSPD__: Wind speed (m/s) averaged over an eight-minute period for buoys and a two-minute period for land stations. Reported Hourly. 

__WVHT__: Significant wave height (meters) is calculated as the average of the highest one-third of all of the wave heights during the 20-minute sampling period.

__APD__: Average wave period (seconds) of all waves during the 20-minute period.

__MWD__: The direction from which the waves at the dominant period (DPD) are coming. The units are degrees from true North, increasing clockwise, with North as 0 (zero) degrees and East as 90 degrees. 
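
To make the WVHT definition quoted above concrete, here is a small worked sketch that averages the highest one-third of a set of individual wave heights. The wave heights are made-up numbers and `significant_wave_height` is a hypothetical helper for illustration only:

```python
def significant_wave_height(heights):
    """Mean of the highest one-third of the individual wave heights (m)."""
    ranked = sorted(heights, reverse=True)
    n_top = max(1, len(ranked) // 3)
    top = ranked[:n_top]
    return sum(top) / len(top)

# Nine made-up wave heights in meters from one sampling window
waves = [0.5, 0.8, 1.0, 1.1, 1.2, 1.4, 1.8, 2.0, 2.2]
print(significant_wave_height(waves))  # mean of 2.2, 2.0 and 1.8
```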


In [None]:
# Drop the directional averages, which we are not interested in
df_analysis=Finaldf.drop(columns=['AVG_MWD','AVG_WDIR'])

In [None]:
# A dataframe without NaN values
df_analysis=df_analysis.dropna(axis=0)
df_analysis.head()

In [None]:
print('Number of remaining stations without any NaN values for the three remaining features: '+str(len(df_analysis.index)))

Let's find out if there is any correlation between these features.
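
Besides inspecting scatter plots, the strength of a linear relationship can be summarized with the Pearson correlation coefficient, which is also what pandas `DataFrame.corr()` computes by default. The following is a minimal worked sketch of the formula on made-up station averages. The `pearson` helper and the sample values are illustrative assumptions, not results from the dataset:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up station averages: wind speed (m/s) and wave height (m)
wspd = [4.0, 5.5, 6.0, 7.2, 8.1]
wvht = [0.9, 1.3, 1.4, 1.9, 2.2]
print(round(pearson(wspd, wvht), 3))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little linear relationship.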

In [None]:
sns.set(font_scale=2)
fig, axes = plt.subplots(1,3,figsize=(20, 8))
#fig.set_size_inches(25, 10)
#sns.scatterplot(x="AVG_WDIR", y="AVG_WSPD",data=df_analysis, s=200 ,ax=axes[0,0])
#axes[0,0].set_xlabel('Average Wind Dir. (Deg)')
#axes[0,0].set_ylabel('Average Wind Speed (m/s)')

#sns.scatterplot(x="AVG_WDIR", y="AVG_WVHT",data=df_analysis, s=200 ,ax=axes[0,1])
#axes[0,1].set_xlabel('Average Wind Dir. (Deg)')
#axes[0,1].set_ylabel('Average Sig. Wave Height (m)')

#sns.scatterplot(x="AVG_WDIR", y="AVG_APD",data=df_analysis, s=200 ,ax=axes[1,0])
#axes[1,0].set_xlabel('Average Wind Dir. (Deg)')
#axes[1,0].set_ylabel('Average Wave Period (s)')

sns.scatterplot(x="AVG_WSPD", y="AVG_WVHT",data=df_analysis, s=200 ,ax=axes[0])
axes[0].set_xlabel('Ave Wind Speed (m/s)')
axes[0].set_ylabel('Ave Wave Height (m)')

sns.scatterplot(x="AVG_WSPD", y="AVG_APD",data=df_analysis, s=200 ,ax=axes[1])
axes[1].set_xlabel('Ave Wind Speed (m/s)')
axes[1].set_ylabel('Ave Wave Period (s)')

sns.scatterplot(x="AVG_WVHT", y="AVG_APD",data=df_analysis, s=200 ,ax=axes[2])
axes[2].set_xlabel('Ave Wave Height (m)')
axes[2].set_ylabel('Ave Wave Period (s)')

plt.savefig('Exploratory.eps',bbox_inches='tight',dpi=300)

In [None]:
df_analysis_mod=df_analysis.copy()

In [None]:
sns.set(font_scale=2)
fig, axes = plt.subplots(1,3,figsize=(20, 8))

sns.scatterplot(x="AVG_WSPD", y="AVG_WVHT",data=df_analysis_mod, s=200 ,ax=axes[0])
axes[0].set_xlabel('Ave Wind Speed (m/s)')
axes[0].set_ylabel('Ave Wave Height (m)')

sns.scatterplot(x="AVG_WSPD", y="AVG_APD",data=df_analysis_mod, s=200 ,ax=axes[1])
axes[1].set_xlabel('Ave Wind Speed (m/s)')
axes[1].set_ylabel('Ave Wave Period (s)')

sns.scatterplot(x="AVG_WVHT", y="AVG_APD",data=df_analysis_mod, s=200 ,ax=axes[2])
axes[2].set_xlabel('Ave Wave Height (m)')
axes[2].set_ylabel('Ave Wave Period (s)')

plt.savefig('Exploratory.eps',bbox_inches='tight',dpi=300)

In [None]:
# Plotting the stations with the recorded data
df_analysis_modplot=pd.merge(df_analysis_mod, df_full1, on='Station_Name')

feature_group = folium.map.FeatureGroup(name='Stations with recorded data')

figmp=folium.Figure(width=1300, height=700)
WorldMap_map3 = folium.Map(location=[17.6078, -8.0817],tiles="Stamen Toner",zoom_start=2).add_to(figmp)

buoys=BuoyLocationPlot2(feature_group,WorldMap_map3,df_analysis_modplot,'red','blue')
# Add the single feature group directly; `owner` and `indx` from the earlier loop do not apply here
WorldMap_map3.add_child(buoys)
    
folium.TileLayer("Stamen Toner").add_to(WorldMap_map3) 
folium.TileLayer("OpenStreetMap").add_to(WorldMap_map3) 
folium.TileLayer("Stamen Terrain").add_to(WorldMap_map3)
WorldMap_map3.add_child(folium.map.LayerControl())
WorldMap_map3.save("WorldMapBuoysAnalysis.html")
WorldMap_map3

## 5. Model Development and Evaluation <a class="anchor" id="40"></a>

### 5.1 K-Means Clustering <a class="anchor" id="45"></a>

Since the variables have different ranges, let's standardize the data to zero mean and unit variance.
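
As a rough illustration of what this scaling does, each value is replaced by its z-score, (x - mean) / standard deviation, computed per feature. scikit-learn's `StandardScaler` uses the population standard deviation for this. The sketch below reproduces the calculation with the standard library on made-up values; `standardize` is a hypothetical helper:

```python
from statistics import mean, pstdev

def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Made-up average wave periods (s) for five stations
apd = [4.0, 5.0, 6.0, 7.0, 8.0]
z = standardize(apd)
print([round(v, 3) for v in z])
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the Euclidean distances used by k-means.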

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.metrics import accuracy_score
import matplotlib.cm as cm
import matplotlib.colors as colors

In [None]:
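# Scale only the three numeric features; Station_Name is an identifier, not a feature
Feaures=StandardScaler().fit_transform(df_analysis_mod[['AVG_WSPD','AVG_WVHT','AVG_APD']])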

First, we need to determine the right value for __k__. To do so, we have two possible approaches:

- **Elbow Method**
- **Silhouette analysis**

The __elbow method__ suggests a good number of clusters k based on the sum of squared distances (SSE) between the data points and their assigned clusters' centroids. We evaluate the SSE for different values of k and pick the k at which the curve starts to flatten out, forming an elbow.
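
The SSE described above is the quantity that `KMeans` reports as `inertia_`. As a minimal sketch of how it is computed, the example below evaluates it by hand for one and two clusters on made-up 2-D points. The `sse` helper and the toy centroids are assumptions for illustration:

```python
def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total

# Toy 2-D data with two obvious groups (made-up values)
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: every point is assigned to the overall mean
print(sse(points, [0, 0, 0, 0], [(5.0, 1.0)]))               # 104.0
# k = 2: one centroid per group
print(sse(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)]))  # 4.0
```

The sharp drop from k = 1 to k = 2, followed by only small gains for larger k, is the kind of bend the elbow plot is meant to reveal.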

In [None]:
# Elbow method to find the best k value
sse = []
list_k = list(range(1, 20))

for k in list_k:
    km = KMeans(init = "k-means++",n_clusters=k)
    km.fit(Feaures)
    sse.append(km.inertia_)

# Plot sse against k
plt.figure(figsize=(7, 5))
plt.plot(list_k, sse, '-o', markersize=10, linewidth=4)
plt.xlabel(r'Number of clusters [k]')
plt.ylabel('SSE');

plt.savefig('ElbowMethod.eps',bbox_inches='tight',dpi=250)

If the elbow method does not show a clear elbow, we can try silhouette analysis instead. **Silhouette analysis** can be used to determine the degree of separation between clusters. For each sample i:
- Compute the average distance to all other data points in the same cluster ($a_i$).
- Compute the average distance to all data points in the closest neighboring cluster ($b_i$).
- Compute the coefficient $s_i = (b_i - a_i) / \max(a_i, b_i)$.

The coefficient can take values in the interval [-1, 1].
- Close to 0: the sample is very close to the neighboring clusters.
- Close to 1: the sample is far away from the neighboring clusters.
- Close to -1: the sample is probably assigned to the wrong cluster.

Therefore, we want the coefficients to be as large as possible and close to 1 to have good clusters. See the full explanation here: https://towardsdatascience.com/k-means-clustering-algorithm-applications-evaluation-methods-and-drawbacks-aa03e644b48a
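
The three steps listed above can be sketched for 1-D data as follows. This is an illustrative, dependency-free implementation on made-up points, not the scikit-learn routine used in the next cell. The `silhouette` helper is hypothetical:

```python
def silhouette(points, labels):
    """Silhouette coefficient of each 1-D point (needs at least two clusters)."""
    scores = []
    for i, (x, own) in enumerate(zip(points, labels)):
        same = [abs(x - y) for j, (y, lab) in enumerate(zip(points, labels))
                if lab == own and j != i]
        if not same:  # singleton cluster: the coefficient is taken as 0
            scores.append(0.0)
            continue
        a = sum(same) / len(same)
        # mean distance to each other cluster; keep the closest one
        b = min(
            sum(abs(x - y) for y, lab in zip(points, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != own
        )
        scores.append((b - a) / max(a, b))
    return scores

# Two well-separated toy clusters on a line (made-up values)
points = [1.0, 2.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
print([round(s, 3) for s in silhouette(points, labels)])
```

With these toy clusters every coefficient is close to 1. Moving the point at 2.0 into the other cluster makes its coefficient negative, which matches the interpretation given above.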

In [None]:
fig, axs = plt.subplots(2, 3, figsize=(20, 15))
axs = axs.ravel()
#fig.subplots_adjust(hspace = .5, wspace=.2)

for i, k in enumerate([2, 3, 4, 5, 6, 7]):
    # Run the Kmeans algorithm
    km = KMeans(init = "k-means++",n_clusters=k)
    labels = km.fit_predict(Feaures)
    centroids = km.cluster_centers_

    # Get silhouette samples
    silhouette_vals = silhouette_samples(Feaures, labels,metric='euclidean')

    # Silhouette plot
    y_ticks = []
    y_lower, y_upper = 0, 0
    for ii, cluster in enumerate(np.unique(labels)):
        cluster_silhouette_vals = silhouette_vals[labels == cluster]
        cluster_silhouette_vals.sort()
        y_upper += len(cluster_silhouette_vals)
        axs[i].barh(range(y_lower, y_upper), cluster_silhouette_vals, edgecolor='none', height=1)
        axs[i].text(-0.05, (y_lower + y_upper) / 2, str(ii + 1))
        y_lower += len(cluster_silhouette_vals)

    # Get the average silhouette score and plot it
    avg_score = np.mean(silhouette_vals)
    axs[i].axvline(avg_score, linestyle='--', linewidth=2, color='green')
    axs[i].set_yticks([])
    axs[i].set_xlim([-0.1, 1])
    axs[i].set_xlabel('Silhouette coefficient values')
    axs[i].set_ylabel('Cluster labels')

plt.savefig('Silhouette2.eps',bbox_inches='tight',dpi=300)

In [None]:
# let's choose k=5
kclusters=5
# run k-means clustering
kmeans = KMeans(init = "k-means++",n_clusters=kclusters, random_state=0).fit(Feaures)
# check cluster labels generated for each row in the dataframe
print(kmeans.labels_[0:10])

In [None]:
# add clustering labels to the data frame
try:
    df_analysis_mod.insert(0, 'Cluster Labels', kmeans.labels_)
except:
    df_analysis_mod['Cluster Labels']=kmeans.labels_
    
df_analysis_mod.head()

In [None]:
# create map
df_analysis_modKMeansplot=pd.merge(df_analysis_mod, df_full1, on='Station_Name')

map_clusters = folium.Map(location=[17.6078, -8.0817], tiles="Stamen Toner", zoom_start=2)

# set color scheme for the clusters: one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, kclusters))
rainbow = [colors.rgb2hex(i) for i in colors_array]

# add markers to the map
# (the 'Longitude' column is passed first, matching BuoyLocationPlot2)
for lat, lon, poi, cluster in zip(df_analysis_modKMeansplot['Longitude'], df_analysis_modKMeansplot['Latitude'], df_analysis_modKMeansplot['Station_Name'], df_analysis_modKMeansplot['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[cluster],  # labels run from 0 to kclusters-1
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
       
map_clusters.save("WorldMapBuoysAnalysis_Clustering.html")
map_clusters

In [None]:
df_analysis_modKMeansplot.head()

In [None]:
df_clust_I = df_analysis_mod[df_analysis_mod['Cluster Labels']==0]
print('Stations in Cluster I:')
print(df_clust_I.Station_Name)
print()
df_clust_II = df_analysis_mod[df_analysis_mod['Cluster Labels']==1]
print('Stations in Cluster II:')
print(df_clust_II.Station_Name)
print()
df_clust_III = df_analysis_mod[df_analysis_mod['Cluster Labels']==2]
print('Stations in Cluster III:')
print(df_clust_III.Station_Name)
print()
df_clust_IV = df_analysis_mod[df_analysis_mod['Cluster Labels']==3]
print('Stations in Cluster IV:')
print(df_clust_IV.Station_Name)
print()
df_clust_V = df_analysis_mod[df_analysis_mod['Cluster Labels']==4]
print('Stations in Cluster V:')
print(df_clust_V.Station_Name)

In [None]:
fig, axs = plt.subplots(1, 3, figsize=(25, 12))
sns.boxplot(x="Cluster Labels", y="AVG_WSPD", data=df_analysis_mod,ax=axs[0])
sns.swarmplot(x="Cluster Labels", y="AVG_WSPD", data=df_analysis_mod,color=".15",ax=axs[0])
axs[0].set_xlabel('Clusters');
axs[0].set_ylabel('Ave Wind Speed (m/s)');
axs[0].set_xticklabels([1,2,3,4,5])
#axs[0].set_xticks([0,1,2,3,4],[1,2,3,4,5]);

sns.boxplot(x="Cluster Labels", y="AVG_WVHT", data=df_analysis_mod,ax=axs[1])
sns.swarmplot(x="Cluster Labels", y="AVG_WVHT", data=df_analysis_mod,color=".15",ax=axs[1])
axs[1].set_xlabel('Clusters');
axs[1].set_ylabel('Ave Wave Height (m)');
axs[1].set_xticklabels([1,2,3,4,5])


sns.boxplot(x="Cluster Labels", y="AVG_APD", data=df_analysis_mod,ax=axs[2])
sns.swarmplot(x="Cluster Labels", y="AVG_APD", data=df_analysis_mod,color=".15",ax=axs[2])
axs[2].set_xlabel('Clusters');
axs[2].set_ylabel('Ave Wave Period (s)');
axs[2].set_xticklabels([1,2,3,4,5])

plt.savefig('BoxPlot1Year.eps',bbox_inches='tight',dpi=300)

In [None]:
fig,ax1=plt.subplots(1,1,figsize=(25,10))
labels = kmeans.labels_
area1 = np.pi * ( df_analysis_mod['AVG_WSPD'] )**(2.5)
scatter=ax1.scatter(df_analysis_mod['AVG_WVHT'], df_analysis_mod['AVG_APD'],s=area1.astype(float), label=labels, c=labels.astype(float),edgecolors='blue', alpha=0.9)
# x holds the wave height and y holds the wave period, so label the axes accordingly
ax1.set_xlabel('Ave Wave Height (m)', fontsize=35)
ax1.set_ylabel('Ave Wave Period (s)', fontsize=35)

# produce a legend with the unique colors from the scatter
legend1 = ax1.legend(*scatter.legend_elements(),loc="upper left",prop={'size': 30},markerscale=4 ,title="Clusters")
# fixing the numbering in legend
legend1.texts[0].set_text('$\\mathdefault{I}$')
legend1.texts[1].set_text('$\\mathdefault{II}$')
legend1.texts[2].set_text('$\\mathdefault{III}$')
legend1.texts[3].set_text('$\\mathdefault{IV}$')
legend1.texts[4].set_text('$\\mathdefault{V}$')

ax1.add_artist(legend1)
# produce a legend with a cross section of sizes from the scatter
handles, labels = scatter.legend_elements(prop="sizes", alpha=0.6)
legend2 = ax1.legend(handles, labels, loc="lower right", title="$(WSPD)^{2.5}$")



plt.savefig('ClusterPlots.eps',bbox_inches='tight',dpi=250)
plt.show()

In [None]:
print('Mean values for cluster I:')
print(df_clust_I[['AVG_WSPD','AVG_WVHT','AVG_APD']].mean())
print('')
print('Mean values for cluster II:')
print(df_clust_II[['AVG_WSPD','AVG_WVHT','AVG_APD']].mean())
print('')
print('Mean values for cluster III:')
print(df_clust_III[['AVG_WSPD','AVG_WVHT','AVG_APD']].mean())
print('')
print('Mean values for cluster IV:')
print(df_clust_IV[['AVG_WSPD','AVG_WVHT','AVG_APD']].mean())
print('')
print('Mean values for cluster V:')
print(df_clust_V[['AVG_WSPD','AVG_WVHT','AVG_APD']].mean())

In [None]:
df_clust_II=df_clust_II.astype('float64')
df_clust_II[df_clust_II['AVG_APD']<4]

In [None]:
df_clust_V=df_clust_V.astype('float64')
df_clust_V.describe()

In [None]:
df_clust_I.describe()