# Asia Pacific Storm Tracks Analysis (1956 - 2018)

<center><img src="https://images.r.cruisecritic.com/features/2016/07/au-cyclone-season-hero.jpg"/></center>

This notebook explores and analyzes the Asia Pacific Storm Tracks dataset, a consolidated history of tropical storm paths from 1956 to 2018 in the West Pacific, South Pacific, South Indian, and North Indian basins.

The dataset provides attributes such as storm name, date, time, wind speed, and GPS coordinates for each advisory point. Wind speeds are recorded in knots. We will conduct an introductory Exploratory Data Analysis (EDA) and data visualization using pandas, GeoPandas, SciPy, seaborn, and Plotly. The aim is to uncover patterns and relationships that are not obvious in the raw data.

## Import Libraries

In [None]:
import pandas as pd
import geopandas as gpd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from scipy import stats

## Load Data

In [None]:
shp_file_path = '/kaggle/input/asia-pacific-storm-dataset/UNISYS_tracks_1956_2018Dec31.shp'

# Read the shapefile into a GeoDataFrame
gdf = gpd.read_file(shp_file_path)

In [None]:
gdf.head()

## Data Preprocessing

In [None]:
# Convert ADV_DATE to datetime
gdf['ADV_DATE'] = pd.to_datetime(gdf['ADV_DATE'])

# Extract year and century from ADV_DATE
gdf['YEAR'] = gdf['ADV_DATE'].dt.year
# Subtract 1 before dividing so that the year 2000 falls in the 20th century, not the 21st
gdf['CENTURY'] = (gdf['YEAR'] - 1) // 100 + 1

## EDA (Exploratory Data Analysis)

In [None]:
# Display the information
gdf.info()

In [None]:
# Display the summary statistics
gdf.describe()

In [None]:
# Check for missing values
print(gdf.isnull().sum())

In [None]:
# Checking the skewness and kurtosis of the 'SPEED' variable
print('Skewness: ', stats.skew(gdf['SPEED']))
print('Kurtosis: ', stats.kurtosis(gdf['SPEED']))

The skewness value suggests that SPEED is left-skewed, with a longer tail on the left side. The kurtosis value is excess kurtosis (SciPy's default, for which a normal distribution scores 0) and indicates heavier tails and a sharper peak than a normal distribution.
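To make the sign conventions behind these readings concrete, here is a minimal sketch on synthetic data (not the storm dataset). It assumes an exponential sample as a stand-in for a long-tailed variable and shows how `stats.skew` changes sign when the tail flips, and how the `fisher` argument of `stats.kurtosis` switches between excess and Pearson kurtosis.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (an assumption for illustration, not the storm dataset)
rng = np.random.default_rng(0)
right_tailed = rng.exponential(scale=1.0, size=10_000)  # long tail on the right
left_tailed = -right_tailed                             # mirror image: long tail on the left

# Positive skewness means a longer right tail, negative means a longer left tail
skew_right = stats.skew(right_tailed)
skew_left = stats.skew(left_tailed)

# fisher=True (the default) gives excess kurtosis, where a normal distribution scores 0;
# fisher=False gives Pearson kurtosis, where a normal distribution scores 3
excess_kurt = stats.kurtosis(right_tailed)
pearson_kurt = stats.kurtosis(right_tailed, fisher=False)

print('Skewness (right-tailed):', skew_right)
print('Skewness (left-tailed): ', skew_left)
print('Excess kurtosis:        ', excess_kurt)
print('Pearson kurtosis:       ', pearson_kurt)
```

If a column contains missing values, both functions return `nan` by default; passing `nan_policy='omit'` is one way to skip them.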

In [None]:
# Checking the skewness and kurtosis of the 'PRESSURE' variable
print('Skewness: ', stats.skew(gdf['PRESSURE']))
print('Kurtosis: ', stats.kurtosis(gdf['PRESSURE']))

The skewness value suggests that PRESSURE is right-skewed, with a longer tail on the right side. Its negative excess kurtosis means it has lighter tails and a flatter peak than a normal distribution.

In [None]:
# Perform a Shapiro-Wilk test to check for normality
print('Shapiro-Wilk Test for SPEED:', stats.shapiro(gdf['SPEED']))

The null hypothesis of the Shapiro-Wilk test is that SPEED follows a normal distribution. The extremely low p-value (close to zero) is strong evidence against that hypothesis, so we conclude that SPEED is not normally distributed.

In [None]:
print('Shapiro-Wilk Test for PRESSURE:', stats.shapiro(gdf['PRESSURE']))

As with SPEED, the extremely low p-value for PRESSURE (close to zero) is strong evidence against the null hypothesis of normality, so we conclude that PRESSURE is not normally distributed either.
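One caveat on the Shapiro-Wilk results: the SciPy documentation notes that for samples larger than 5000 points the W statistic is accurate but the p-value may not be. If the track table has more rows than that, a common workaround is to test a random subsample, or to cross-check with a test suited to large samples. The sketch below illustrates both options on synthetic data; the gamma-distributed `values` array is an assumed stand-in for a column such as SPEED.

```python
import numpy as np
from scipy import stats

# Synthetic, skewed stand-in for a column such as SPEED (an assumption for illustration)
rng = np.random.default_rng(42)
values = rng.gamma(shape=2.0, scale=15.0, size=50_000)

# Option 1: Shapiro-Wilk on a random subsample of at most 5000 points
subsample = rng.choice(values, size=min(5000, values.size), replace=False)
w_stat, shapiro_p = stats.shapiro(subsample)

# Option 2: D'Agostino-Pearson omnibus test on the full data
k2_stat, normaltest_p = stats.normaltest(values)

print('Shapiro-Wilk (subsample): W =', w_stat, ' p =', shapiro_p)
print('normaltest (full data):   K2 =', k2_stat, ' p =', normaltest_p)
```

With this many observations, even small departures from normality produce tiny p-values, so a histogram or Q-Q plot is usually a more informative companion to either test.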

In [None]:
# Calculate the correlation between 'SPEED' and 'PRESSURE'
corr, p_value = stats.pearsonr(gdf['SPEED'], gdf['PRESSURE'])
print('Correlation between SPEED and PRESSURE: ', corr)
print('P-value: ', p_value)

There is a statistically significant positive correlation of approximately 0.2245 between SPEED and PRESSURE. This implies that as one of these variables increases, the other tends to increase as well, although the relationship is relatively weak.
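Because neither variable is normally distributed, a rank-based coefficient is a reasonable companion to Pearson's r. The sketch below uses synthetic data (not the storm dataset) to compare `stats.pearsonr` with `stats.spearmanr` on a monotonic but nonlinear relationship; the variables `x` and `y` are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic data with a monotonic but strongly nonlinear relationship
# (an assumption for illustration, not the storm dataset)
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=2_000)
y = np.exp(x) + rng.normal(scale=0.1, size=x.size)

# Pearson measures linear association; Spearman measures monotonic association via ranks
pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)

print('Pearson r:   ', pearson_r)
print('Spearman rho:', spearman_rho)
```

On the storm data, the equivalent call would be `stats.spearmanr(gdf['SPEED'], gdf['PRESSURE'])`. A large gap between the two coefficients would hint that the relationship is nonlinear or driven by outliers.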

In [None]:
# Perform a one-sample t-test to check whether the mean speed differs significantly from a hypothesized 50 knots
t_stat, p_value = stats.ttest_1samp(gdf['SPEED'], popmean=50)
print('T-statistic: ', t_stat)
print('P-value: ', p_value)

In [None]:
# Perform a chi-square test of independence between 'REGION' and 'TYPE'
contingency_table = pd.crosstab(gdf['REGION'], gdf['TYPE'])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print('Chi-square statistic: ', chi2)
print('P-value: ', p_value)

The chi-square statistic of 27604.7508 and a p-value that prints as 0.0 (too small to represent in floating point) indicate a statistically significant association between REGION and TYPE.
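With many rows, a chi-square test can be highly significant even when the association is weak, so an effect-size measure helps with interpretation. The sketch below computes Cramér's V, which ranges from 0 (no association) to 1 (perfect association). The helper `cramers_v` and the toy tables are hypothetical names introduced for illustration; they are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(table: pd.DataFrame) -> float:
    """Cramér's V for a contingency table (hypothetical helper, not from the notebook)."""
    # correction=False disables the Yates correction that SciPy applies to 2x2 tables
    # by default, so that V reaches exactly 1 for a perfect association
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    min_dim = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))

# Toy data: TYPE is fully determined by REGION
dependent = pd.DataFrame({
    'REGION': ['A'] * 50 + ['B'] * 50,
    'TYPE': ['X'] * 50 + ['Y'] * 50,
})
# Toy data: TYPE is split evenly within each REGION
independent = pd.DataFrame({
    'REGION': ['A'] * 50 + ['B'] * 50,
    'TYPE': (['X'] * 25 + ['Y'] * 25) * 2,
})

v_dependent = cramers_v(pd.crosstab(dependent['REGION'], dependent['TYPE']))
v_independent = cramers_v(pd.crosstab(independent['REGION'], independent['TYPE']))

print("Cramér's V (dependent):  ", v_dependent)
print("Cramér's V (independent):", v_independent)
```

Applied to the notebook's data, `cramers_v(contingency_table)` would give a sense of how strong the REGION and TYPE association is, beyond the fact that it is statistically significant.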

## Data Visualization

In [None]:
# Checking the correlation between numerical variables
corr = gdf.select_dtypes(include=['number']).corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Plotting the count of each region
plt.figure(figsize=(10,6))
sns.countplot(x='REGION', data=gdf)
plt.title('Count of Storms by Region')
plt.show()

In [None]:
# Year-wise analysis of storms
yearly_storms = gdf.groupby('YEAR')['STORM_NO'].nunique().reset_index()
fig = px.line(yearly_storms, x='YEAR', y='STORM_NO', title='Yearly Storms')
fig.show()

In [None]:
# Century-wise analysis of storms
century_storms = gdf.groupby('CENTURY')['STORM_NO'].nunique().reset_index()
fig = px.bar(century_storms, x='CENTURY', y='STORM_NO', title='Century-wise Storms')
fig.show()

In [None]:
# Year-wise average storm speed
yearly_speed = gdf.groupby('YEAR')['SPEED'].mean().reset_index()
fig = px.line(yearly_speed, x='YEAR', y='SPEED', title='Yearly Average Storm Speed')
fig.show()

In [None]:
# Year-wise average storm pressure
yearly_pressure = gdf.groupby('YEAR')['PRESSURE'].mean().reset_index()
fig = px.line(yearly_pressure, x='YEAR', y='PRESSURE', title='Yearly Average Storm Pressure')
fig.show()

## Geo Visualization

In [None]:
# Storm Locations on the Globe
fig = px.scatter_geo(
    gdf,
    lat='LAT',
    lon='LONG_',
    color='REGION',
    title='Storm Locations',
)
fig.update_traces(marker=dict(size=2))
fig.update_geos(showcoastlines=True, coastlinecolor="Black", showland=True)
fig.show()

In [None]:
# Storm locations over time
fig = px.scatter_geo(gdf, lat='LAT', lon='LONG_', color='YEAR', title='Storm Locations Over Time')
fig.update_traces(marker=dict(size=2))
fig.update_geos(showcoastlines=True, coastlinecolor="Black", showland=True)
fig.show()