# EDA Tools


Due to the NDA with the stakeholder, the actual EDA cannot be shown here.  
Instead, this notebook shows the standard plot types used in the EDA.

---


### Import libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go

---

### Read data file "Featureengineering"

The EDA uses the data set assembled in the feature engineering notebook. It consists of the high-frequency data with a reduced feature set, the daily noon report data, and the predictions from the engine model.


In [None]:
# read data from .csv file
df = pd.read_csv('../data/Featureselection03.csv')

In [None]:
# convert date to datetime
df['EntryDate'] = pd.to_datetime(df['EntryDate'])

### Read file "Feature importance list"

In [None]:
# read list with feature importance
data_log = pd.read_csv('../data/Capstone_features_Features.csv')

In [None]:
# create list of important features (feature importance < 3)
list_imp_feat = list(data_log[data_log['ModelImportance'] < 3]['VarName'])
len(list_imp_feat)

# create dataframe containing only important features
df_model = df[list_imp_feat].copy()

---

### Correlation matrix

A correlation matrix is used to check whether some features are strongly related to each other and therefore add little value. Only the important features (see the Featureengineering notebook) are included in this matrix.

In [None]:
# plot heatmap of correlations
plt.figure(figsize = (40,38))
sns.heatmap(df_model.corr(), annot = True, cmap = 'RdYlGn');
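With this many features the annotated heatmap can be hard to read, so it may help to list the strongly correlated pairs explicitly. The following is a minimal sketch, not part of the original analysis: `highly_correlated_pairs`, the threshold of 0.9, and the column names in the demo frame are hypothetical, and the demo data is synthetic.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    # Only numeric columns can be correlated
    corr = frame.select_dtypes("number").corr().abs()
    # Keep the upper triangle without the diagonal so each pair appears once
    upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper_mask).stack()
    # NaN entries (masked cells) never pass the comparison
    return pairs[pairs > threshold].sort_values(ascending=False)


# Synthetic stand-in for df_model (hypothetical column names)
demo = pd.DataFrame({
    "feat_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "feat_b": [2.0, 4.0, 6.0, 8.0, 10.0],   # exact multiple of feat_a
    "feat_c": [5.0, 1.0, 4.0, 2.0, 3.0],    # weakly related to the others
})
result = highly_correlated_pairs(demo, threshold=0.9)
print(result)
```

In the notebook itself, the same function could be called on `df_model` to obtain candidates for removal.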

---

### Histogram

Histograms give us a first overview of the distribution of target and features. By using different colors, influencing factors like passage type can be investigated.

In [None]:
px.histogram(df, x='ME.FMS.act.tPh',
        color='passage_type', 
        barmode='overlay',
        histnorm='percent')

---

### Scatter plot

Scatter plots visualise the relationship between the target and a feature, or between two features.

In [None]:
px.scatter(df,x='V.SOG.act.kn',
            y='ME.FMS.act.tPh',
            color='passage_type')

---

### Timeseries

Besides individual observations, the development of the target over time is also of interest. Comparing it with the time series of features helps to identify similar patterns and possible influencing factors.

In [None]:
# select the time window once and reuse it for both axes
in_window = (df['EntryDate'] > '2021-08-12 16:30:00') & (df['EntryDate'] <= '2021-08-22 20:45:00')
df_window = df[in_window]

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=df_window['EntryDate'],
    y=df_window['ME.FMS.act.tPh'],
    mode='markers', marker=dict(color='#ff6600', size=5)))
fig.show()
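To compare the target with a feature over the same window, one option is to filter the window once and rescale both series to a common range before plotting them together. The sketch below illustrates this on synthetic data; `select_window`, `min_max_scale`, and the column names `target` and `feature` are hypothetical and not taken from the original notebook.

```python
import pandas as pd


def select_window(frame, time_col, start, end):
    """Return the rows with start < time_col <= end."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    mask = (frame[time_col] > start) & (frame[time_col] <= end)
    return frame.loc[mask]


def min_max_scale(series):
    """Rescale a numeric series to [0, 1]; a constant series maps to 0."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span


# Synthetic stand-in data sampled every 15 minutes
demo = pd.DataFrame({
    "EntryDate": pd.date_range("2021-08-12 16:00", periods=6, freq="15min"),
    "target": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feature": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

window = select_window(demo, "EntryDate", "2021-08-12 16:30:00", "2021-08-12 17:00:00")
# After scaling, both columns share the range [0, 1] and can be drawn on one axis
scaled = window[["target", "feature"]].apply(min_max_scale)
print(scaled)
```

Min-max scaling is only one choice here; a secondary y-axis or standardisation would serve the same purpose.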

---

### Map

A map visualises the route the ship took during the investigated time period. Besides the date, other features can be used for coloring to highlight spatial patterns.

In [None]:
fig = px.scatter_mapbox(lat=df['V.GPSLAT.act.deg'],
                        lon=df['V.GPSLON.act.deg'],
                        color=df['EntryDate'].dt.month,
                        width=600, height=700,
                        title='Route during investigated time period', 
                        labels={'lat':'Latitude','lon':'Longitude','color':'Month'},
                        color_continuous_scale=px.colors.sequential.Oranges,
                        zoom=1.5)
fig.update_layout(mapbox_style="carto-positron",
                  title_font_family="Arial",
                  title_font_color="grey",
                  title_font_size=24,
                  title_x=0.5,
                  coloraxis_showscale=False
)
fig.show()

---

### Wind rose

A wind rose is useful for showing the frequencies of wind speed and direction. It combines the two wind features in one polar coordinate system. The following steps are needed for this plot:
1. Create a dataframe for the wind rose.
2. Transform wind speed from m/s to Beaufort and create bins.
3. Create 16 bins for wind directions. Make sure that the bin for 'North' includes 348.75° to 360° __and__ 0° to 11.25°. The other bins cover 22.5° each.
4. Group and count wind speed and direction.
5. Give the right names to wind directions.
6. Make a bar polar plot.

In [None]:
# 1. dataframe for wind rose (copy to avoid modifying a view of df)
wind_rose_df = df[['WEA.WDT.act.deg','WEA.WST.act.mPs']].copy()

# 2. wind speed bins (include_lowest=True keeps calm conditions of exactly 0 m/s)
wind_rose_df['wind_speed_bf'] = pd.cut(wind_rose_df['WEA.WST.act.mPs'],
        bins=[0, 0.3, 1.5, 3.3, 5.4, 7.9, 10.7, 13.8, 17.1, 20.7, 24.4, 28.4, 32.6, 100000],
        labels=['0 bf', '1 bf', '2 bf', '3 bf', '4 bf', '5 bf', '6 bf', '7 bf', '8 bf', '9 bf', '10 bf', '11 bf', '12 bf'],
        include_lowest=True)

# 3. wind direction bins
# we use numbers as labels to keep them in the right order during the next step,
# otherwise alphabetical sorting occurs
# label 1 appears twice so that both ends of the circle fall into 'North';
# duplicate labels require ordered=False
wind_rose_df['wind_dir'] = pd.cut(wind_rose_df['WEA.WDT.act.deg'],
        bins=[0, 11.25, 33.75, 56.25, 78.75, 101.25, 123.75, 146.25, 168.75, 191.25, 213.75, 236.25, 258.75, 281.25, 303.75, 326.25, 348.75, 360.00],
        labels=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1],
        include_lowest=True,  # keep a direction of exactly 0 degrees
        ordered=False)

# 4. group and count
wind_rose_df = wind_rose_df.groupby(['wind_dir','wind_speed_bf']).size().reset_index(name='frequency')

# 5. Convert wind directions
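# assign the result instead of using inplace=True on a column selection,
# which does not reliably modify the dataframe
wind_rose_df['wind_dir'] = wind_rose_df['wind_dir'].map({1:'N',2:'NNE',3:'NE',4:'ENE',5:'E',6:'ESE',7:'SE',8:'SSE',9:'S',10:'SSW',11:'SW',12:'WSW',13:'W',14:'WNW',15:'NW',16:'NNW'})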

# 6. bar polar plot
fig = px.bar_polar(wind_rose_df,theta='wind_dir',r='frequency',
                   color='wind_speed_bf',
                   color_discrete_sequence=px.colors.sequential.Oranges,
                   labels={'wind_speed_bf':'Wind Speed'})
fig.show()
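The binning in steps 2 and 3 can also be expressed with plain arithmetic, which makes the wrap-around of the 'North' sector explicit. The following is a minimal sketch using only the standard library; `compass_sector` and `beaufort` are hypothetical helper names, not part of the original notebook. Note one small difference from the `pd.cut` version above: the sectors here are closed on the left, so exactly 11.25° is assigned to NNE rather than N.

```python
from bisect import bisect_left

SECTORS = ['N', 'NNE', 'NE', 'ENE', 'E', 'ESE', 'SE', 'SSE',
           'S', 'SSW', 'SW', 'WSW', 'W', 'WNW', 'NW', 'NNW']

# Upper bounds in m/s of Beaufort 0 to 11, same edges as the bins above
BEAUFORT_UPPER = [0.3, 1.5, 3.3, 5.4, 7.9, 10.7, 13.8, 17.1, 20.7, 24.4, 28.4, 32.6]


def compass_sector(degrees):
    """Map a direction in degrees to one of 16 compass sectors of 22.5 degrees."""
    # Shift by half a sector so that North spans 348.75 to 360 and 0 to 11.25,
    # then wrap the index so both ends of the circle land in sector 0
    index = int(((degrees % 360) + 11.25) // 22.5) % 16
    return SECTORS[index]


def beaufort(speed_mps):
    """Map a wind speed in m/s to its Beaufort number (upper bounds inclusive)."""
    return bisect_left(BEAUFORT_UPPER, speed_mps)


print(compass_sector(355.0), compass_sector(5.0), compass_sector(90.0))
print(beaufort(0.0), beaufort(6.0), beaufort(40.0))
```

Applied to a dataframe column, these functions could replace the two `pd.cut` calls, for example via `Series.map`, at the cost of losing the categorical dtype.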