# Exploratory Data Analysis on the Rain in Australia Dataset 📊
In this notebook, we analyse the `Rain in Australia` dataset and extract insights from it using different statistical techniques. We also build interactive visualizations with the `Plotly` library and explore the relations between the different features.

# Data Dictionary
In this section, we define each feature present in the dataset so that the later analysis is easier to follow.

| Features | Description |
| -------- | ----------- |
| Date | The date of observation |
| Location | The common name of the location of the weather station | 
| MinTemp | The minimum temperature in degrees Celsius |
| MaxTemp | The maximum temperature in degrees Celsius |
| Rainfall | The amount of rainfall recorded for the day in mm |
| Evaporation | The so-called Class A pan evaporation (mm) in the 24 hours to 9am |
| Sunshine | The number of hours of bright sunshine in the day. | 
| WindGustDir | The direction of the strongest wind gust in the 24 hours to midnight |
| WindGustSpeed | The speed (km/h) of the strongest wind gust in the 24 hours to midnight |
| WindDir9am | Direction of the wind at 9am |
| WindDir3pm | Direction of the wind at 3pm |
| WindSpeed9am | Wind speed (km/h) averaged over 10 minutes prior to 9am |
| WindSpeed3pm | Wind speed (km/h) averaged over 10 minutes prior to 3pm |
| Humidity9am | Humidity (percent) at 9am |
| Humidity3pm | Humidity (percent) at 3pm |
| Pressure9am | Atmospheric pressure (hPa) reduced to mean sea level at 9am |
| Pressure3pm | Atmospheric pressure (hPa) reduced to mean sea level at 3pm |
| Cloud9am | Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eighths. It records how many eighths of the sky are obscured by cloud. A 0 measure indicates a completely clear sky whilst an 8 indicates that it is completely overcast. |
| Cloud3pm | Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm. See Cloud9am for a description of the values |
| Temp9am | Temperature (degrees C) at 9am |
| Temp3pm | Temperature (degrees C) at 3pm |
| RainToday | Whether precipitation in the 24 hours to 9am exceeded 1 mm. Stored as `Yes`/`No` in this dataset |
| RainTomorrow | Whether it rains the next day (`Yes`/`No`). This is the response variable, derived from the amount of next-day rain in mm, a kind of measure of the "risk". |

# Import Libraries
In this section, we import all the libraries required to perform EDA in this notebook.

In [None]:
!pip install chart-studio --quiet

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as ex
import chart_studio.plotly as py
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, iplot, plot
import plotly.graph_objects as go
import missingno

primary_color='#2D3047'
secondary_color='#E84855'

init_notebook_mode(connected=True)
cf.go_offline()

# Load the dataset
In this section, we load the dataset into the notebook and check the first five rows of the dataset.

In [None]:
ds = pd.read_csv('../input/weather-dataset-rattle-package/weatherAUS.csv')
ds.head()

In [None]:
print(f"Dataset Shape: {ds.shape}")
print(f"Dataset Length: {len(ds)}")
print(f"Columns with missing values: {(ds.isna().sum() != 0).sum()}")

In [None]:
missingno.bar(ds, color=primary_color);
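
The bar chart shows how complete each column is. As a numeric complement, the share of missing values per column can be computed directly with pandas. The sketch below uses a small made-up frame (`toy`, an assumption for illustration) in place of `ds`, and the same expression should apply to the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the weather data; the values are made up for illustration.
toy = pd.DataFrame({
    "MinTemp": [13.4, 7.4, np.nan, 9.2],
    "Sunshine": [np.nan, np.nan, np.nan, 8.1],
    "Location": ["Albury", "Albury", "Cairns", "Cairns"],
})

# isna() marks missing cells; the column mean of those booleans is the missing share.
missing_pct = toy.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)
```

Sorting in descending order puts the least complete columns first, which makes it easier to decide which features need imputation.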

# Perform Statistics
In this section, we compute summary statistics over the dataset to draw insights from it. We also check the values of each column to see whether they are plausible or contain some sort of error.

In [None]:
ds.describe()

In [None]:
skew_col = []
skew_value = []
for col in ds.columns:
    if pd.api.types.is_numeric_dtype(ds[col]):
        skew_col.append(col)
        skew_value.append(ds[col].skew())
skew_dict = {
    "columns": skew_col,
    "values": skew_value
}
skew_ds = pd.DataFrame(skew_dict)
fig = ex.bar(data_frame=skew_ds, x='columns', y='values', title="Skew Values")
fig.update_traces(marker_color=primary_color)
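
The loop above collects the skewness of every numeric column. A more compact way to build the same two-column table is `DataFrame.skew(numeric_only=True)`. The sketch below assumes a small made-up frame (`toy`) instead of `ds`. The column names `columns` and `values` match the ones used for the bar chart.

```python
import pandas as pd

# Toy stand-in for the weather data; the values are made up for illustration.
toy = pd.DataFrame({
    "Rainfall": [0.0, 0.0, 0.2, 1.0, 25.0],          # long right tail
    "Humidity9am": [40.0, 50.0, 60.0, 70.0, 80.0],   # symmetric around 60
    "Location": ["A", "B", "C", "D", "E"],           # non-numeric, skipped
})

# Skewness of the numeric columns only, reshaped into a two-column frame.
skew_ds = (
    toy.skew(numeric_only=True)
       .rename_axis("columns")
       .reset_index(name="values")
)
print(skew_ds)
```

A positive value indicates a long right tail (as in the toy `Rainfall` column), while a value near zero indicates a roughly symmetric distribution.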

# Exploratory Data Analysis
In this section, we perform EDA over the dataset and look for patterns between the features and the target. We apply both univariate and bivariate techniques. This section helps us understand the dataset more deeply and surface the insights behind those patterns.

In [None]:
ds.head()

**Year vs RainTomorrow Features**

Let's find out which year has the most rain predictions according to the dataset.

In [None]:
year = [x.split('-')[0] for x in ds['Date']]
year_rain_value = {
    "year": year,
    "rain_tommorow": ds['RainTomorrow']
}
year_rain_ds = pd.DataFrame(year_rain_value)
rain_yes = year_rain_ds[year_rain_ds['rain_tommorow'] == 'Yes']
rain_no = year_rain_ds[year_rain_ds['rain_tommorow'] == 'No']

fig = go.Figure()

fig.add_trace(go.Bar(x=rain_yes['year'].value_counts().keys(), 
                     y=rain_yes['year'].value_counts(), marker=dict(color=primary_color),
                     name='Yes', text=rain_yes['year'].value_counts()))
fig.add_trace(go.Bar(x=rain_no['year'].value_counts().keys(), marker=dict(color=secondary_color),
                     y=rain_no['year'].value_counts(), 
                     name='No'))
fig.update_layout(title="Year vs Rain Tomorrow", xaxis=dict(
    title='Year'
), yaxis=dict(
    title='Frequency'
))

**Month vs RainTomorrow**

Let's check which month is the rainiest in Australia.

In [None]:
month = [x.split('-')[1] for x in ds['Date']]
year_rain_ds['month'] = month
month_rain_ds = year_rain_ds.drop('year', axis=1)
month_yes = month_rain_ds[month_rain_ds['rain_tommorow'] == 'Yes']
month_no = month_rain_ds[month_rain_ds['rain_tommorow'] == 'No']

fig = go.Figure()
fig.add_trace(go.Bar(x=month_yes['month'].value_counts().keys(), 
                     y=month_yes['month'].value_counts(), 
                     name='Yes', marker=dict(color=primary_color)))
fig.add_trace(go.Bar(x=month_no['month'].value_counts().keys(), 
                     y=month_no['month'].value_counts(), 
                     name='No', marker=dict(color=secondary_color)))
fig.update_layout(title="Month vs Rain Tomorrow", xaxis=dict(
    title="Month"
), yaxis=dict(
    title='Frequency'
))

From these two analyses, 2016 is the year and June is the month with the most rainy days in Australia in this dataset, which covers 2007-2017, based on the RainTomorrow feature.
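
Raw counts also depend on how many observations each year or month contains, so a rate can be a useful cross-check. The sketch below is one possible way to compute the share of `Yes` labels per year and per month. It uses a small made-up frame (`toy`) rather than `ds`, and the names `rate_by_year` and `rate_by_month` are illustrative.

```python
import pandas as pd

# Toy stand-in for the Date and RainTomorrow columns; values are made up.
toy = pd.DataFrame({
    "Date": ["2015-06-01", "2015-06-02", "2015-07-01",
             "2016-06-01", "2016-06-02", "2016-07-01"],
    "RainTomorrow": ["Yes", "No", "No", "Yes", "Yes", None],
})

# Parse the dates once, then drop rows without a label before computing rates.
dates = pd.to_datetime(toy["Date"])
labelled = toy.dropna(subset=["RainTomorrow"]).assign(
    year=dates.dt.year, month=dates.dt.month
)

# The mean of a boolean series is the share of True values.
is_rain = labelled["RainTomorrow"].eq("Yes")
rate_by_year = is_rain.groupby(labelled["year"]).mean()
rate_by_month = is_rain.groupby(labelled["month"]).mean()
print(rate_by_year)
print(rate_by_month)
```

If the ranking by rate differs from the ranking by count, the difference comes from uneven numbers of records per period.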

Next, let's find the minimum temperature at each location and check whether it shows any pattern that helps predict whether it rains.

**Location MinTemp Relation**

In this analysis, we check which location has the lowest minimum temperature and whether it is also the rainiest city. Is there any relationship between the temperature and the prediction?

In [None]:
location_temp = ds[['Location', 'MinTemp']]
value = location_temp.groupby('Location')['MinTemp'].min()

fig = go.Figure()
fig.add_trace(go.Bar(x=value.keys(),
                    y=value, name='Min Temp', marker=dict(color=primary_color)))
fig.update_layout(title="Location MinTemp Relation")

MountGinini has the lowest MinTemp, with a negative value, and Adelaide has a positive MinTemp. Let's check the rain records of these locations and see whether there is any relation.
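
Reading the extremes off a bar chart can be error-prone, so the locations can also be looked up directly. This is a minimal sketch on a made-up frame (`toy`). The temperatures are illustrative and are not the dataset's actual values.

```python
import pandas as pd

# Toy stand-in for the Location and MinTemp columns; values are made up.
toy = pd.DataFrame({
    "Location": ["MountGinini", "MountGinini", "Cairns", "Cairns", "Portland"],
    "MinTemp": [-8.0, 2.5, 9.2, 21.0, -1.5],
})

# Lowest recorded MinTemp per location, sorted from coldest to warmest.
lowest = toy.groupby("Location")["MinTemp"].min().sort_values()

coldest_location = lowest.idxmin()   # label of the smallest value
warmest_location = lowest.idxmax()   # label of the largest value
print(lowest)
print(coldest_location, warmest_location)
```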

In [None]:
location_today = ds[['Location', 'RainToday']]
location_today_yes = location_today[location_today['RainToday'] == 'Yes']
location_today_no = location_today[location_today['RainToday'] == 'No']

fig = go.Figure()
fig.add_trace(go.Bar(x=location_today_yes['Location'].value_counts().keys(),
                    y=location_today_yes['Location'].value_counts(),
                    name='Yes', marker=dict(color=primary_color)))
fig.add_trace(go.Bar(x=location_today_no['Location'].value_counts().keys(),
                    y=location_today_no['Location'].value_counts(),
                    name='No', marker=dict(color=secondary_color)))

This insight is quite interesting: the city with the most positive MinTemp does not have many rainy days, but Portland, which has a MinTemp of -1.5, has more rainy days than the other cities with negative temperatures. Cairns, the city with the second-most rainy days, has a MinTemp of 9.2.
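
Because locations can differ in the number of recorded days, a per-location share of rainy days may be easier to compare than raw counts. The sketch below uses `pd.crosstab` with `normalize="index"` on a made-up frame (`toy`), and the same call should work on `ds['Location']` and `ds['RainToday']`.

```python
import pandas as pd

# Toy stand-in for the Location and RainToday columns; values are made up.
toy = pd.DataFrame({
    "Location": ["Cairns", "Cairns", "Cairns", "Cairns", "Adelaide", "Adelaide"],
    "RainToday": ["Yes", "Yes", "No", "Yes", "No", "Yes"],
})

# Each row sums to 1: the share of "No" and "Yes" days per location.
rain_table = pd.crosstab(toy["Location"], toy["RainToday"], normalize="index")

# Locations ordered by their share of rainy days.
wettest = rain_table["Yes"].sort_values(ascending=False)
print(rain_table)
print(wettest)
```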

Let's check the relation between MinTemp and RainToday and see whether any pattern forms.

In [None]:
temp_today = ds[['RainToday', 'MinTemp']]
temp_today_yes = temp_today[temp_today['RainToday'] == 'Yes']
temp_today_no = temp_today[temp_today['RainToday'] == 'No']

fig = go.Figure()
fig.add_trace(go.Scatter(x=temp_today_yes['MinTemp'].value_counts().keys(),
                    y=temp_today_yes['MinTemp'].value_counts(), mode='markers', name='Yes',
                        marker=dict(color=primary_color)))
fig.add_trace(go.Scatter(x=temp_today_no['MinTemp'].value_counts().keys(),
                    y=temp_today_no['MinTemp'].value_counts(), mode='markers', name='No',
                        marker=dict(color=secondary_color)))
fig.update_layout(title='MinTemp and RainToday Relation', xaxis=dict(
    title="MinTemp"
), yaxis=dict(
    title='Frequency'
))

This clears up our previous observation.
```
As the temperature increases, the frequency of rainfall increases, but past a peak at about 10.4 the frequency of rainfall starts decreasing.
```
That is why a city like Cairns has a good amount of rainfall compared to other cities: its MinTemp lies close to 10.4. Let's look for the same relation with MaxTemp.
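
The frequency of each MinTemp value among rainy days also reflects how common that temperature is overall. One way to separate the two effects is to bin the temperature and compute the share of rainy days per bin. The sketch below assumes a made-up frame (`toy`) and arbitrary 5-degree bin edges, and neither comes from the dataset.

```python
import pandas as pd

# Toy stand-in for the MinTemp and RainToday columns; values are made up.
toy = pd.DataFrame({
    "MinTemp": [1.0, 3.0, 6.0, 8.0, 11.0, 13.0, 16.0, 18.0],
    "RainToday": ["No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes"],
})

# Assumed bin edges; each bin is a half-open interval such as (0, 5].
bins = [0, 5, 10, 15, 20]
temp_bin = pd.cut(toy["MinTemp"], bins=bins)

# Share of rainy days inside each temperature bin.
rain_share = toy["RainToday"].eq("Yes").groupby(temp_bin, observed=True).mean()
print(rain_share)
```

If the share per bin still peaks around the same temperature, the pattern is not just an artifact of how often that temperature occurs.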


**MaxTemp vs RainToday**

In [None]:
temp_today = ds[['RainToday', 'MaxTemp']]
temp_today_yes = temp_today[temp_today['RainToday'] == 'Yes']
temp_today_no = temp_today[temp_today['RainToday'] == 'No']

fig = go.Figure()
fig.add_trace(go.Scatter(x=temp_today_yes['MaxTemp'].value_counts().keys(),
                    y=temp_today_yes['MaxTemp'].value_counts(), mode='markers', name='Yes',
                        marker=dict(color=primary_color)))
fig.add_trace(go.Scatter(x=temp_today_no['MaxTemp'].value_counts().keys(),
                    y=temp_today_no['MaxTemp'].value_counts(), mode='markers', name='No',
                        marker=dict(color=secondary_color)))
fig.update_layout(title='MaxTemp and RainToday Relation', xaxis=dict(
    title="MaxTemp"
), yaxis=dict(
    title='Frequency'
))

We find a graph quite similar to the previous one, and we get approximately the same value as for MinTemp. Let's now look at Rainfall, Evaporation and Sunshine together with the RainToday feature.

In [None]:
combine_ds = ds[['Rainfall', 'Evaporation', 'Sunshine', 'RainToday']]
combine_ds.head()

Evaporation and Sunshine have a lot of missing values, which would leave empty spots in the graph. So we fill the missing values with the median of each column.

In [None]:
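# Work on an explicit copy so the assignments below do not trigger SettingWithCopyWarning
combine_ds = combine_ds.copy()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
combine_ds['Evaporation'] = combine_ds['Evaporation'].fillna(combine_ds['Evaporation'].median())
combine_ds['Sunshine'] = combine_ds['Sunshine'].fillna(combine_ds['Sunshine'].median())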
combine_ds.head()
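
Median imputation can also be applied to several columns in one step by passing a Series of medians to `DataFrame.fillna`. The sketch below shows this on a made-up frame (`toy`). The names `cols`, `medians` and `filled` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the combined frame; values are made up for illustration.
toy = pd.DataFrame({
    "Evaporation": [2.0, np.nan, 6.0, 4.0],
    "Sunshine": [np.nan, 9.0, 7.0, np.nan],
    "RainToday": ["No", "Yes", None, "No"],
})

cols = ["Evaporation", "Sunshine"]
medians = toy[cols].median()          # one median per column, NaN ignored

filled = toy.copy()                   # leave the original frame untouched
filled[cols] = filled[cols].fillna(medians)
print(filled)
```

Only the listed columns are filled. A missing `RainToday` label stays missing, so rows without a label still drop out when filtering on `Yes` or `No`.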

Let's plot these three features and check whether rainfall occurs or not.

In [None]:
combine_ds_yes = combine_ds[combine_ds['RainToday'] == 'Yes']
combine_ds_no = combine_ds[combine_ds['RainToday'] == 'No']

