# Data Visualization Notebook

## Objectives

*   Answer business requirement 3:
    * We subset the data to a given city and the last 5 years, then resample it and plot line charts (Rainfall x Time)
      * There will be 3 plots:
        * Resampled by year
        * Resampled by month
        * Resampled by day
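
The subset-then-resample step described above can be sketched as follows. This is only an illustration on synthetic data: `df_demo`, the date range and the random rainfall values are assumptions, not the real WeatherAustralia dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily rainfall for one hypothetical city (not the real dataset)
rng = np.random.default_rng(seed=0)
dates = pd.date_range("2010-01-01", "2017-12-31", freq="D")
df_demo = pd.DataFrame(
    {"RainfallToday": rng.gamma(shape=0.5, scale=4.0, size=len(dates))},
    index=dates,
)

# Keep only the last 5 calendar years present in the index
years_backward = 5
last_year = df_demo.index.year.max()
df_recent = df_demo[df_demo.index.year > last_year - years_backward]

# Resample the same series to the three granularities
rain_daily = df_recent["RainfallToday"].resample("D").mean()     # one row per day
rain_monthly = df_recent["RainfallToday"].resample("MS").mean()  # one row per month start
rain_yearly = df_recent["RainfallToday"].resample("YS").mean()   # one row per year start

print(len(rain_daily), len(rain_monthly), len(rain_yearly))
```

Each resampled series keeps a `DatetimeIndex`, so it can be passed directly to a line chart with time on the x axis.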


## Inputs

* outputs/datasets/collection/WeatherAustralia.csv

## Outputs

* Generate code that answers business requirement 3 and can be reused to build the Streamlit app

## Additional Comments | Insights | Conclusions




---

# Install Packages

In [None]:
! pip install matplotlib -U
! pip install pandas-profiling==2.11.0
! pip install plotly==4.14.0

# restart runtime - it is good practice after installing packages in Colab sessions
import os
os.kill(os.getpid(), 9)

# Setup GPU

* Go to Edit → Notebook Settings
* In the Hardware accelerator menu, select GPU
* note: when you select an option, either GPU, TPU or None, you switch among kernels/sessions

---
* How to know if I am using the GPU?
  * run the code below; if the output is an empty string, no GPU is attached; otherwise you are using a GPU in this session
  * Typically the output will be /device:GPU:0


In [None]:
import tensorflow as tf
tf.test.gpu_device_name()

# **Connection between: Colab Session and your GitHub Repo**

### Insert your **credentials**

* The variable's content will exist only while the session exists. Once this session terminates, the variable's content will be erased permanently.

In [None]:
from getpass import getpass
import os
from IPython.display import clear_output 
print("=== Insert your credentials === \nType in and hit Enter")
UserName = getpass('GitHub User Name: ')
UserEmail = getpass('GitHub User E-mail: ')
RepoName = getpass('GitHub Repository Name: ')
UserPwd = getpass('GitHub Personal Access Token: ')  # GitHub no longer accepts account passwords for Git operations
clear_output()
print("* Thanks for inserting your credentials!")
print(f"* You may now Clone your Repo to this Session, "
      f"then Connect this Session to your Repo.")

---

### **Clone** your GitHub Repo to your current Colab session

* So you can have access to your project's files

In [None]:
! git clone https://github.com/{UserName}/{RepoName}.git
! rm -rf sample_data   # remove content/sample_data folder, since we don't need it for this project

print("\n")
%cd /content/{RepoName}
print(f"\n\n* Current session directory is:  {os.getcwd()}")
print(f"* You may refresh the session folder to access {RepoName} folder.")

---

### **Connect** this Colab session to your GitHub Repo

* So that, if needed, you can push files generated in this session to your Repo.

In [None]:
!git config --global user.email {UserEmail}
!git config --global user.name {UserName}
!git remote rm origin
!git remote add origin https://{UserName}:{UserPwd}@github.com/{UserName}/{RepoName}.git

print(f"\n\n * The current Colab Session is connected to the following GitHub repo: {UserName}/{RepoName}")
print(" * You can now push new files to the repo.")

---

### **Push** generated/new files from this Session to GitHub repo

* Git status

In [None]:
! git status

* Git commit

In [None]:
CommitMsg = "added-cleaned-data"
!git add .
!git commit -m {CommitMsg}

* Git Push

In [None]:
!git push origin main


---

### **Delete** Cloned Repo from current Session

In [None]:
%cd /content
!rm -rf {RepoName}
print(f"\n * Please refresh session folder to validate that {RepoName} folder was removed from this session.")

---

# Load your data

In [None]:
import pandas as pd
df = pd.read_csv("/content/WalkthroughProject/outputs/datasets/collection/WeatherAustralia.csv")

df['Date'] = pd.to_datetime(df['Date'])
df['Day'] = df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
df['WeekDay']=df['Date'].dt.weekday
df['IsWeekend'] = df['WeekDay'].apply(lambda x: 1 if x >= 5 else 0)

df['WeekDay']=df['Date'].dt.day_name() # gets day name 
days_order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
df['WeekDay'] = pd.Series(df['WeekDay'], dtype=pd.CategoricalDtype(categories=days_order, ordered=True))

df.set_index(['Date'],drop=True,inplace=True)
df.head(3)

# Quick exploration with Pandas Profiling

In [None]:
df.columns.to_list()

In [None]:
from pandas_profiling import ProfileReport
pandas_report = ProfileReport(df=df,minimal=True)
pandas_report.to_notebook_iframe()

## Correlation

* which variables are more correlated with a given set of variables?

In [None]:
df_corr_spearman = df.corr(method="spearman")
df_corr_pearson = df.corr(method="pearson")

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def heatmap_correlation(df_corr,CorrThreshold,NumberOfColumns):

  if NumberOfColumns > 1:
      mask = np.zeros_like(df_corr, dtype=bool)
      mask[np.triu_indices_from(mask)] = True
      mask[abs(df_corr) < CorrThreshold] = True

      fig, ax = plt.subplots(figsize=(20,8))
      ax = sns.heatmap(data=df_corr,annot=True,
                       xticklabels=True,yticklabels=True,mask=mask,
                       cmap='viridis',annot_kws={"size": 8})
      plt.ylim(NumberOfColumns,0)
      plt.show()


def pairplot_correlation(df,transparency,hue=None):
  
  if hue is None:
    fig = sns.pairplot(data=df,plot_kws={'alpha':transparency})
  else:
    fig = sns.pairplot(data=df,hue= hue,plot_kws={'alpha':transparency})
  
  # hide the upper triangle, which mirrors the lower one
  for i, j in zip(*np.triu_indices_from(fig.axes, 1)):
      fig.axes[i, j].set_visible(False)
  
  plt.show()

* **Correlation Analysis**
  * Analyze how the target variable for your ML models is correlated with the other variables (features and target)
  * Analyze multicollinearity, that is, how the features are correlated among themselves
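
To illustrate why both Spearman and Pearson are computed, here is a small sketch on made-up data. The columns `x`, `y_monotonic` and `y_linear` are hypothetical and only serve to show how the two coefficients differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one linear and one monotonic but non-linear relationship
x = np.arange(1, 21, dtype=float)
df_demo = pd.DataFrame({
    "x": x,
    "y_linear": 2.0 * x + 1.0,       # perfectly linear in x
    "y_monotonic": np.exp(x / 3.0),  # always increasing, but strongly curved
})

corr_pearson = df_demo.corr(method="pearson")
corr_spearman = df_demo.corr(method="spearman")

# Spearman works on ranks, so any strictly increasing relationship scores 1;
# Pearson only scores 1 when the relationship is a straight line
print(corr_pearson.loc["x", ["y_linear", "y_monotonic"]])
print(corr_spearman.loc["x", ["y_linear", "y_monotonic"]])
```

A pair that scores high on Spearman but clearly lower on Pearson suggests a monotonic, non-linear relationship.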

In [None]:
print("Correlation Heatmap - Spearman: evaluates monotonic relationship \n")
heatmap_correlation(df_corr=df_corr_spearman, CorrThreshold=0.6,NumberOfColumns = len(df.columns))

In [None]:
print("Correlation Heatmap - Pearson: evaluates the linear relationship between two continuous variables \n")
heatmap_correlation(df_corr=df_corr_pearson,CorrThreshold=0.6,NumberOfColumns = len(df.columns))

## Predictive Power Score - PPS

* Either load the saved PPS analysis or calculate it; then prepare it for visualization

In [None]:
import ppscore as pps
try:
  pps_matrix_raw = pd.read_csv("/content/WalkthroughProject/outputs/feature_engineering/pps_analysis.csv")
except FileNotFoundError:
  pps_matrix_raw = pps.matrix(df)
  pps_matrix_raw.to_csv("/content/WalkthroughProject/outputs/feature_engineering/pps_analysis.csv",index=False)

pps_matrix = pps_matrix_raw.filter(['x', 'y', 'ppscore']).pivot(columns='x', index='y', values='ppscore')

* PPS score distribution
* It helps to set the PPS threshold for relevant relationships.
  * It is suggested that if Q3 (the 75th percentile) is lower than 0.2, a PPS greater than 0.2 indicates a relevant relationship
  * If Q3 is greater than 0.2, PPS values greater than Q3 indicate a relevant relationship
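
The threshold rule above can be written as a small helper. This is a sketch only: `suggest_pps_threshold` is a hypothetical name, not part of `ppscore`, and it assumes the scores arrive as a pandas Series such as `pps_matrix_raw['ppscore']`.

```python
import pandas as pd

def suggest_pps_threshold(ppscores, floor=0.2):
    """Return max(floor, Q3) of the scores below 1 (hypothetical helper)."""
    # Scores equal to 1 are excluded, mirroring the "ppscore < 1" filter used in this notebook
    scores = ppscores[ppscores < 1]
    q3 = scores.quantile(0.75)
    return float(max(floor, q3))

# Made-up scores, for illustration only
low_scores = pd.Series([0.0, 0.0, 0.1, 0.3, 1.0])
high_scores = pd.Series([0.1, 0.4, 0.5, 0.6, 1.0])

print(suggest_pps_threshold(low_scores))   # Q3 is below 0.2, so the floor applies
print(suggest_pps_threshold(high_scores))  # Q3 is above 0.2, so Q3 is used
```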

In [None]:
pps_matrix_raw.query("ppscore < 1").filter(['ppscore']).describe().T

* Function: Heatmap for PPS

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def heatmap_pps(df,PPS_Threshold):
    if len(df.columns) > 1:
        mask = np.zeros_like(df, dtype=bool)
        mask[abs(df) < PPS_Threshold] = True

        fig, ax = plt.subplots(figsize=(20,12))
        ax = sns.heatmap(df, annot=True, xticklabels=True,yticklabels=True,
                         mask=mask,cmap='rocket_r', annot_kws={"size": 7})
        
        plt.ylim(len(df.columns),0)
        plt.show()


In [None]:
print(f"* PPS detects linear or non-linear relationships between two columns.\n"
      f"* The score ranges from 0 (no predictive power) to 1 (perfect predictive power) \n")
heatmap_pps(df=pps_matrix,PPS_Threshold=0.2)

* pps heatmap with target

In [None]:
def heatmap_pps_target(df,NumberOfColumns):
  import matplotlib.pyplot as plt
  import seaborn as sns
  import numpy as np
  fig, ax = plt.subplots(figsize=(20,8))
  ax = sns.heatmap(
          df,
          xticklabels=True,
          yticklabels=True,
          annot=True,
          cmap='coolwarm',
          annot_kws={"size": 8})

  plt.ylim(NumberOfColumns,0)
  plt.show()

heatmap_pps_target(df=pps_matrix_raw,NumberOfColumns=df.shape[1])

# Data Visualization

* We will subset a given city and the data from the last 5 years

In [None]:
df_vis = df.filter(['Location','RainfallToday','Month', 'Year', 'WeekDay','State']).copy()
df_vis.head(3)

## Business Requirement 3

In [None]:
df_vis['Location'].unique()

In [None]:
city = 'Canberra'
years_backward = 5

df_city = df_vis.query(f"Location == '{city}'")
df_city = df_city.query(f"Year > {df_city['Year'].max() - years_backward}").drop(['Location'],axis=1)

print(f"* Index min: {df_city.index.min()} \n* Index max: {df_city.index.max()}")

* Daily Levels of RainfallToday and Moving Avg

In [None]:
window = 30
dfRolling= df_city.filter(['RainfallToday']).rolling(window=window).mean()
dfRolling.columns = [f"RainfallToday Rolling Avg {window} days"]
Df = df_city.filter(['RainfallToday']).merge(dfRolling,how='outer',left_index=True,right_index=True)

import plotly.express as px
fig = px.line(Df, x=Df.index, y=Df.columns.to_list(), title=f"Rainfall in {city} - Daily Levels and Rolling Moving Avg")
fig.update_yaxes(title="Rainfall (mm)")
fig.show()


# try with dots/marker instead of line
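
One detail of the rolling average used above: with the default settings, the first `window - 1` rows are `NaN`, so the moving-average line starts later than the daily line. The sketch below shows this on a tiny made-up series; passing `min_periods=1` is one possible alternative, not something the original cell does.

```python
import pandas as pd

# Made-up daily rainfall values, for illustration only
rain = pd.Series(
    [0.0, 3.0, 6.0, 0.0, 9.0, 3.0],
    index=pd.date_range("2017-01-01", periods=6, freq="D"),
    name="RainfallToday",
)

window = 3
# Default behaviour: a value is produced only once the window is full
rolling_strict = rain.rolling(window=window).mean()
# Alternative: average over whatever observations are available so far
rolling_partial = rain.rolling(window=window, min_periods=1).mean()

print(pd.concat({"strict": rolling_strict, "partial": rolling_partial}, axis=1))
```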

* Week day Seasonality

In [None]:
df_day = df_city.filter(['RainfallToday','WeekDay']).groupby(by=['WeekDay']).agg('mean')

fig = px.line(df_day, x=df_day.index, y='RainfallToday',title=f'Week Day Seasonality in {city}')
fig.update_xaxes(type='category')
fig.update_yaxes(title='Rainfall Levels',showticklabels=False)
fig.show()

* Monthly seasonality

In [None]:
df_month = df_city.filter(['RainfallToday','Month']).groupby(by=['Month']).agg('mean')
fig = px.line(df_month, x=df_month.index, y='RainfallToday',title=f'Rainfall - Monthly Seasonality in {city}')
fig.update_xaxes(type='category')
fig.update_yaxes(title='Rainfall Levels',showticklabels=False)
fig.show()

In [None]:
# df_month = df_city.filter(['RainfallToday']).resample(rule='MS').mean()
# fig = px.line(df_month, x=df_month.index, y=df_month.columns.to_list(), title=f"Rainfall in {city} - Month levels")
# fig.update_yaxes(title="Rainfall (mm)")
# fig.show()

* Yearly Seasonality

In [None]:
df_year = df_city.filter(['RainfallToday','Year']).groupby(by=['Year']).agg('mean')
fig = px.line(df_year, x=df_year.index, y='RainfallToday',title=f'Rainfall - Yearly Seasonality in {city}')
fig.update_xaxes(type='category')
fig.update_yaxes(title='Rainfall Levels',showticklabels=False)
fig.show()

* Avg Rainfall Levels per state

In [None]:
df_state = df_vis.filter(['RainfallToday','State']).groupby(by=['State']).agg('mean')
fig = px.bar(df_state, x=df_state.index, color=df_state.index,y='RainfallToday',
             title='Rainfall - Average Levels per State')
fig.update_xaxes(type='category')
fig.update_yaxes(title='Rainfall Avg Levels')
fig.show()

## EDA and plots

* Subset needed variables

In [None]:
variables_eda = ['RainfallToday','RainToday',
                 'Latitude','Longitude','Location', 'State',
                 'Day', 'Month', 'Year','WeekDay', 'IsWeekend']
df_eda = df.filter(variables_eda).copy()

years_backward = 5
df_eda = df_eda.query(f"Year > {df_eda['Year'].max() - years_backward}")

df_eda.head(3)

* Plots we are interested in
   * map plot, using lat and long, colored by state, size by rainfall, animated by Date
   * heatmap

### Map

* for a given year, animate by month, agg mean levels of RainfallToday
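
The aggregation behind the map can be sketched as: average rainfall per location and month, then attach one row of coordinates per location. The data below is made up (the coordinates are approximate and for illustration only), and deduplicating the coordinates before merging is an assumption about how to keep the merge small, not necessarily the approach used in the next cell.

```python
import pandas as pd

# Made-up daily records for two locations (illustrative values only)
df_demo = pd.DataFrame({
    "Location": ["Sydney", "Sydney", "Sydney", "Perth", "Perth"],
    "State": ["NSW", "NSW", "NSW", "WA", "WA"],
    "Latitude": [-33.87, -33.87, -33.87, -31.95, -31.95],
    "Longitude": [151.21, 151.21, 151.21, 115.86, 115.86],
    "Month": [1, 1, 2, 1, 1],
    "RainfallToday": [2.0, 4.0, 1.0, 0.0, 6.0],
})

# Mean rainfall per location and month
monthly = (df_demo
           .groupby(["Location", "Month"], as_index=False)["RainfallToday"]
           .mean())

# One row of coordinates per location, so the merge does not multiply rows
coords = df_demo[["Location", "State", "Latitude", "Longitude"]].drop_duplicates()

df_map_demo = (monthly
               .merge(coords, how="left", on="Location")
               .sort_values(by=["Location", "Month"])
               .reset_index(drop=True))
print(df_map_demo)
```

The result has one row per (Location, Month) pair, which is the shape `animation_frame='Month'` expects.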

In [None]:
map_year = 2015

df_map= df_eda.query(f"Year == {map_year}").copy()
df_map_month = df_map[['RainfallToday','Location','Month']].groupby(['Location','Month']).mean().reset_index()


df_map_month=(df_map_month
              .merge(df_map[['Location','State','Latitude','Longitude']],how='right',on='Location')
              .sort_values(by=['Location','Month'])
              .drop_duplicates()
              )
df_map_month

In [None]:
import plotly.express as px 


fig = px.scatter_mapbox(df_map_month.dropna(),
                        lat="Latitude", lon="Longitude", color="State",
                        hover_data=["RainfallToday",'Location'],
                        size='RainfallToday',
                        zoom=2.5,
                        mapbox_style="open-street-map",
                        animation_frame='Month',
                        center={"lat":-27,"lon":133},
                        size_max=15
                        )
fig.show()

### Rainfall Heatmap

* calendar heatmap of RainfallToday for a given city, one plot per year

In [None]:
! pip install calplot==0.1.7.2

In [None]:
df_eda['Location'].unique()

In [None]:
city = 'Sydney'
df_rain_flag= df_eda.query(f"Location == '{city}' ").copy()
df_rain_flag.head()

In [None]:
import matplotlib.pyplot as plt
import calplot
for year_heatmap in df_rain_flag['Year'].unique():
  print(f"\n * {year_heatmap} \n")
  fig= plt.figure(figsize=(20,5))
  calplot.yearplot(data=df_rain_flag.query(f"Year == {year_heatmap} ")['RainfallToday'],
                  dropzero= True,
                  cmap='GnBu',
                  linewidth =2,
                  # fillcolor='black'
                  );
  plt.show()

### Boxplot

In [None]:
df_eda.columns

In [None]:
select_x='Month'

fig = px.box(df_eda, x=select_x, y='RainfallToday',color=select_x,points ='outliers')
fig.update_layout(xaxis_type = 'category')
fig.show()