# Weather Data Prediction Model
This is a module of the Miloo Bootcamp: BOOTCAMP CYCLING PREDICTION-ARTIFICIAL INTELLIGENCE.

This module gives an example of how to develop a prediction model on weather data: installing and importing the required libraries, transforming the data, defining and training the model, evaluating its performance, and drawing conclusions from the trained model.

For more information about the dataset, see: https://www.kaggle.com/selfishgene/historical-hourly-weather-data

## 1. Import & Install Required Python Library
In this module, we use:
1. pandas for loading and transforming data
2. Matplotlib for visualizing data where needed
3. datetime for parsing dates and extracting parts such as the day name and month
4. sweetviz for quick summary statistics and visualization
5. statsmodels for statistical regression to understand the data
6. sklearn and xgboost for modeling, model evaluation, and prediction

In [None]:
!pip install sweetviz

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline
import datetime as dt
import statsmodels.api as sm
import seaborn as sns 
import numpy as np
import sweetviz as sv

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import confusion_matrix
from xgboost import XGBClassifier
# from IPython.core.display import display, HTML
# display(HTML("<style>.container { width:100% !important; }</style>"))

## 2. Data Transformation 
In this section, we transform the raw data and merge it into one DataFrame so that it is easier to process in later sections.
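
The merge step can be sketched on toy data. The values below are made up, and only two of the per-variable tables are shown. The sketch illustrates why a `dayname_y` column appears after merging two tables that both carry `dayname`, and why it can be dropped.

```python
import pandas as pd

# Toy stand-ins for two per-variable tables (hypothetical values).
humidity = pd.DataFrame({
    "Miami": [80.0, 75.0],
    "day": [20121001, 20121001],
    "hour": [13, 14],
    "dayname": ["Monday", "Monday"],
})
temperature = pd.DataFrame({
    "Miami": [300.15, 301.15],          # Kelvin
    "day": [20121001, 20121001],
    "hour": [13, 14],
    "dayname": ["Monday", "Monday"],
})

# Non-key columns that exist on both sides receive the _x / _y suffixes.
merged = pd.merge(humidity, temperature, how="inner", on=["day", "hour"])

# Drop the duplicated day name and give the city columns meaningful names.
merged = merged.drop(columns="dayname_y").rename(
    columns={"Miami_x": "humidity", "Miami_y": "temperature", "dayname_x": "dayname"}
)
merged["temperature"] = merged["temperature"] - 273.15  # Kelvin to Celsius
print(merged)
```

Renaming by column name rather than by position keeps the step correct even if the column order changes.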

### 2.1 Define Function to Transform Dataset

In [None]:
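def time_series_trans(dataset, look_back=1):
  """
  Create sliding windows over time-ordered data
  Input:
        dataset -> pandas Series (or other sliceable sequence) ordered by time
        look_back -> number of consecutive rows in each window
  Output:
        result -> np.array with one window of length look_back per row
  """
  result = []
  for i in range(len(dataset)-look_back-1):
        a = dataset[i:(i+look_back)]
        result.append(a)
  return np.array(result)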

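def get_one_sity(city, humidity, wind_speed, wind_dir, press, weat, temp):
  """
  Build one DataFrame for a single city by joining the per-variable tables on day and hour
  Input:
        city -> name of the city column, e.g. 'Miami'
        humidity, wind_speed, wind_dir, press, weat, temp -> DataFrames already processed by expand_time
  Output:
        df_sample -> DataFrame with columns day, hour, weather, dayname, humidity,
                     temperature (Celsius), pressure, wind_speed, wind_dir, weather2
  """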
  # humidity
  df_sample = pd.merge(humidity[[city,'day','hour','dayname']],weat[[city,'day','hour']],how='inner', left_on=['day','hour'],right_on=['day','hour'])
  df_sample = df_sample.dropna()
  df_sample.columns = ['humidity','day','hour','dayname','weather']

  # temperature
  df_sample = pd.merge(df_sample,temp[[city,'day','hour','dayname']],how='inner', left_on=['day','hour'],right_on=['day','hour'])
  df_sample = df_sample.dropna()
  df_sample.drop('dayname_y',inplace=True,axis=1)
  df_sample.columns = ['humidity','day','hour','dayname','weather','temperature']
  df_sample['temperature'] = df_sample['temperature']-273.15 # convert from kelvin to celcius 

  # pressure
  df_sample = pd.merge(df_sample,press[[city,'day','hour','dayname']],how='inner', left_on=['day','hour'],right_on=['day','hour'])
  df_sample = df_sample.dropna()
  df_sample.drop('dayname_y',inplace=True,axis=1)
  df_sample.columns = ['humidity','day','hour','dayname','weather','temperature','pressure']

    # wind speed
  df_sample = pd.merge(df_sample,wind_speed[[city,'day','hour','dayname']],how='inner', left_on=['day','hour'],right_on=['day','hour'])
  df_sample = df_sample.dropna()
  df_sample.drop('dayname_y',inplace=True,axis=1)
  df_sample.columns = ['humidity','day','hour','dayname','weather','temperature','pressure', 'wind_speed']

    # wind dir
  df_sample = pd.merge(df_sample,wind_dir[[city,'day','hour','dayname']],how='inner', left_on=['day','hour'],right_on=['day','hour'])
  df_sample = df_sample.dropna()
  df_sample.drop('dayname_y',inplace=True,axis=1)
  df_sample.columns = ['humidity','day','hour','dayname','weather','temperature','pressure', 'wind_speed','wind_dir']

  # rearrange column
  df_sample = df_sample[['day','hour','weather','dayname','humidity','temperature','pressure','wind_speed','wind_dir']]
    
    
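  # simplified weather
  # change weather granularity by mapping each description to its simplified category
  # (map the merged 'weather' column so the result stays aligned with the rows of df_sample)

  df_sample['weather2'] = df_sample['weather'].replace(dict_weather)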

  return df_sample

def expand_time(input_df,time_col):
    input_df['datetime'] = pd.to_datetime(input_df[time_col])
    input_df['year'] =  input_df['datetime'].dt.year
    input_df['month'] =  input_df['datetime'].dt.year * 100 + input_df['datetime'].dt.month
    input_df['day'] =  input_df['datetime'].dt.year * 10000 + input_df['datetime'].dt.month * 100 + input_df['datetime'].dt.day
    input_df['hour'] =  input_df['datetime'].dt.hour
    input_df['dayname'] = input_df['datetime'].apply(lambda x: dt.datetime.strftime(x, '%A'))
    
    return input_df
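
The `month` and `day` columns produced by `expand_time` are integer keys rather than calendar parts: `month` has the form YYYYMM and `day` has the form YYYYMMDD. A minimal sketch on two made-up timestamps shows the arithmetic. The variable names here are illustrative only, and `dt.day_name()` is used as an equivalent way to get the weekday name.

```python
import pandas as pd

# Two hypothetical timestamps in the same format as the dataset's datetime column.
ts = pd.to_datetime(pd.Series(["2012-10-01 13:00:00", "2013-01-05 00:00:00"]))

month_key = ts.dt.year * 100 + ts.dt.month                        # YYYYMM
day_key = ts.dt.year * 10000 + ts.dt.month * 100 + ts.dt.day      # YYYYMMDD
day_name = ts.dt.day_name()                                       # weekday name

print(pd.DataFrame({"month": month_key, "day": day_key, "dayname": day_name}))
```

Because the keys are plain integers, they sort chronologically and can be used directly as merge and group-by keys.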

### 2.2 Read all CSV files
In this sub-section, we download the dataset from GitHub and load it into the Jupyter notebook.

In [None]:
!wget https://raw.githubusercontent.com/Miloo-workshop/weather-prediction/main/data_archive.zip

In [None]:
!unzip data_archive.zip

In [None]:
# read all files

df_hum = pd.read_csv('archive/humidity.csv')
df_wind_dir = pd.read_csv('archive/wind_direction.csv')
df_wind_sp = pd.read_csv('archive/wind_speed.csv')
df_pres = pd.read_csv('archive/pressure.csv')
df_temp = pd.read_csv('archive/temperature.csv')
df_weat = pd.read_csv('archive/weather_description.csv')
df_city = pd.read_csv('archive/city_attributes.csv')
df_weat_sim = pd.read_excel('archive/weather_category_simplified.xlsx')
df_weat_sim.drop('count',axis=1,inplace=True)

dict_weather = {}

In [None]:
# convert to dict 
for index, row in df_weat_sim.iterrows():
    dict_weather[row['weather']] = row['weather2']

### 2.3 Transform Data
In this sub-section, we expand the date column into several formats (year, month, day, hour, and day name). We then work specifically with the Miami data.

In [None]:
# expand time column
df_hum = expand_time(df_hum,'datetime')
df_wind_dir = expand_time(df_wind_dir,'datetime')
df_wind_sp = expand_time(df_wind_sp,'datetime')
df_pres = expand_time(df_pres,'datetime')
df_weat = expand_time(df_weat,'datetime')
df_temp = expand_time(df_temp,'datetime')

In [None]:
# select city 
df_miami = get_one_sity(city='Miami',humidity = df_hum, wind_speed = df_wind_sp, wind_dir = df_wind_dir, press = df_pres, weat = df_weat, temp = df_temp)

In [None]:
df_miami.head()

#### 2.3.1 Keep Only Days with All 24 Hourly Records

In [None]:
# get the 24 hours 
df_miami_day = df_miami.groupby(['day']).agg({'hour':'count'}).reset_index()
df_miami_day = df_miami_day[df_miami_day['hour'] == 24]
df_miami2 = pd.merge(df_miami, df_miami_day, right_on=['day'],left_on=['day'],how='left')
df_miami2 = df_miami2.drop(['hour_y'],axis=1)

df_miami2 = df_miami2.dropna()
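
The same filter can be sketched without a merge, using a group-wise count broadcast back to each row. The helper name `keep_complete_days` and the toy data are assumptions for illustration, not part of the notebook.

```python
import pandas as pd

def keep_complete_days(df, day_col="day", hour_col="hour", expected=24):
    """Keep only rows belonging to days that have `expected` hourly records."""
    # transform('count') returns, for every row, the number of records in its day.
    counts = df.groupby(day_col)[hour_col].transform("count")
    return df[counts == expected].reset_index(drop=True)

# Toy data: one complete day (24 rows) and one incomplete day (5 rows).
toy = pd.DataFrame({
    "day": [20121001] * 24 + [20121002] * 5,
    "hour": list(range(24)) + list(range(5)),
})
complete = keep_complete_days(toy)
print(complete["day"].unique(), len(complete))
```

Unlike the merge above, this sketch resets the index of the result, so row labels are consecutive afterwards.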

In [None]:
df_miami2.head()

In [None]:
df_miami2.describe(include='all')

In [None]:
# !pip install pandas-profiling
# import sys
# !{sys.executable} -m pip install -U pandas-profiling[notebook]
# !jupyter nbextension enable --py widgetsnbextension

In [None]:
# from pandas_profiling import ProfileReport

In [None]:
# profile = ProfileReport(df_miami2)
# profile

In [None]:
# !pip freeze|grep pandas

### 2.4 Data Understanding
In this sub-section, we explore summary statistics of the data and visualize them with sweetviz.

In [None]:
my_report = sv.analyze(df_miami2)
my_report.show_notebook()

## 3. Feature Engineering 
In this section, we transform existing features and add new features based on the previously transformed data.

### 3.2 Encode Weather 
We encode the categorical weather as integers as follows:
1. cloudy = 0
2. fog = 1
3. rain = 2
4. sunny = 3
5. wind = 4
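
This numbering follows from how `LabelEncoder` works: it sorts the distinct labels and assigns codes in that order. A minimal sketch on a made-up list of labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, in arbitrary order and with a repeat.
labels = ["sunny", "rain", "cloudy", "wind", "fog", "rain"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted labels; the position of each label is its code.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
print(codes)
```

If a category is absent from the data, the codes of the categories that sort after it shift down by one, so it is worth checking `classes_` on the real data before relying on a fixed mapping.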

In [None]:
# encoding weather 
le = preprocessing.LabelEncoder()

# fit_transform fits the encoder and encodes the column in one step
df_miami2['weather2_encode'] = le.fit_transform(df_miami2['weather2'])

df_miami2.head()

### 3.3 Windowing Data 
To build the prediction model, we use the weather, temperature, humidity, and pressure of the three preceding hours as features, so the data must be transformed into the following format:

temperature hour-3 **|** temperature hour-2 **|** temperature hour-1 **|** humidity hour-3 **|** humidity hour-2 **|** humidity hour-1 **|** pressure hour-3 **|** pressure hour-2 **|** pressure hour-1 **|** weather hour-3 **|** weather hour-2 **|** weather hour-1 **|** weather target
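
The windowing idea can be sketched on a short made-up sequence. The helper `sliding_windows` is an illustrative stand-in that mirrors the loop bound of `time_series_trans`. A window of length 3 gives the lagged features, and a window of length 4 gives the same three lags plus the next value, which serves as the target.

```python
import numpy as np

def sliding_windows(values, look_back):
    """Return one window of `look_back` consecutive values per row."""
    values = np.asarray(values)
    n_windows = len(values) - look_back - 1   # same bound as time_series_trans
    return np.array([values[i:i + look_back] for i in range(n_windows)])

hourly = [10, 11, 12, 13, 14, 15, 16]         # hypothetical hourly readings

features = sliding_windows(hourly, look_back=3)      # hour-3, hour-2, hour-1
with_target = sliding_windows(hourly, look_back=4)   # three lags + target

print(features)
print(with_target)
```

The longer window yields one row fewer than the shorter one. When the windowed frames are concatenated side by side, the unmatched trailing row contains missing values and is removed by `dropna()`.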

In [None]:
# windowing
df_mi_temp = pd.DataFrame(time_series_trans(df_miami2['temperature'],look_back=3))
df_mi_temp = df_mi_temp.add_prefix('temp_')

df_mi_hum = pd.DataFrame(time_series_trans(df_miami2['humidity'],look_back=3))
df_mi_hum = df_mi_hum.add_prefix('hum_')

df_mi_pres = pd.DataFrame(time_series_trans(df_miami2['pressure'],look_back=3))
df_mi_pres = df_mi_pres.add_prefix('pres_')

df_mi_ws = pd.DataFrame(time_series_trans(df_miami2['wind_speed'],look_back=3))
df_mi_ws = df_mi_ws.add_prefix('ws_')

df_mi_weat = pd.DataFrame(time_series_trans(df_miami2['weather2_encode'],look_back=4))
df_mi_weat = df_mi_weat.add_prefix('weat_')

In [None]:
# collect all windowing
df_train = pd.concat([df_mi_temp,df_mi_hum,df_mi_pres,df_mi_weat],axis=1)
df_train = df_train.dropna()
df_train.head()

## 4. Split Data and Pre-modeling 
In this section, we split the data into two sets: training data and testing data. We also explore the data with a simple statistical regression to learn more about the dataset and its features.

### 4.1 Split Data into Training and Testing Data

In [None]:
# split train and test 
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(['weat_3'],axis=1), df_train['weat_3'], test_size=0.3, random_state=42)
# X_train, X_test, y_train, y_test = train_test_split(df_train.drop(['weat_3','weat_2','weat_1','weat_0'],axis=1), df_train['weat_3'], test_size=0.3, random_state=42)

### 4.2 Pre-modeling
This sub-section explores the data using statistical regression to describe each feature statistically.

In [None]:
# pre modelling 

model = sm.OLS(y_train, X_train)
results = model.fit()
print(results.summary())

## 5. Predictive Modeling
This section shows how to build prediction models using Decision Tree, Random Forest, and XGBoost. It also visualizes a single tree as a representation of the trained model.

### 5.1 Decision Tree Modeling
This sub-section covers Decision Tree modeling, from training to evaluation and visualization of the tree.

#### 5.1.1 Decision Tree Training

In [None]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)

#### 5.1.2 Visualize Decision Tree Model
In this sub-section, we visualize the trained Decision Tree model as a single tree.

In [None]:
tree.plot_tree(clf,max_depth=2,fontsize=5) 

In [None]:
y_pred = clf.predict(X_test)

#### 5.1.3 Decision Tree Performance Measurement
In this sub-section, we measure the performance of the Decision Tree model with the F1-score and show its confusion matrix.
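
One property of the metric is worth keeping in mind: for single-label multiclass problems, the micro-averaged F1-score equals plain accuracy, while the macro average weights every class equally. A minimal sketch on made-up labels (not taken from the weather data):

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical true and predicted classes with an imbalanced class 0.
y_true = [0, 0, 0, 0, 2, 2, 3, 3]
y_hat = [0, 0, 0, 0, 0, 2, 0, 3]

micro = f1_score(y_true, y_hat, average="micro")
macro = f1_score(y_true, y_hat, average="macro")
acc = accuracy_score(y_true, y_hat)

print("micro F1:", micro)
print("macro F1:", macro)
print("accuracy:", acc)
```

When one weather class dominates, comparing the micro and macro values gives a quick indication of how well the rarer classes are predicted.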

In [None]:
# check performance
print('f1 score', (f1_score(y_test, y_pred,average='micro')))
# precision_recall_fscore_support(y_pred, y_pred, average='micro')
cm = confusion_matrix(y_test, y_pred)
cm2 = pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

#### 5.1.4 Feature Importance
This sub-section will show importance of each feature based on Decision Tree model

In [None]:
df_feat_imp_tree = pd.DataFrame({'score':clf.feature_importances_, 'names' : X_train.columns})
df_feat_imp_tree.sort_values('score',ascending=False)
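
Because each variable appears as several lagged columns (for example `temp_0`, `temp_1`, `temp_2`), it can help to sum the importance scores per variable. The sketch below assumes column names of the form `<prefix>_<lag>` and uses made-up scores.

```python
import pandas as pd

# Hypothetical importance table with the same two columns as above.
imp = pd.DataFrame({
    "names": ["temp_0", "temp_1", "temp_2",
              "hum_0", "hum_1", "hum_2",
              "weat_0", "weat_1", "weat_2"],
    "score": [0.05, 0.05, 0.10, 0.05, 0.05, 0.10, 0.10, 0.10, 0.40],
})

# The part before the underscore identifies the underlying variable.
imp["variable"] = imp["names"].str.split("_").str[0]
by_variable = imp.groupby("variable")["score"].sum().sort_values(ascending=False)
print(by_variable)
```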

### 5.2 Random Forest Modeling
This sub-section covers Random Forest modeling, from training to evaluation and visualization of a single tree.

#### 5.2.1 Random Forest Training

In [None]:
# fit into rf classifier
rf = RandomForestClassifier(n_estimators=20,max_depth = 10,random_state=50)
rf.fit(X_train,y_train)

y_pred = rf.predict(X_test)

#### 5.2.2 Random Forest Performance Measurement
In this sub-section, we measure the performance of the Random Forest model with the F1-score and show its confusion matrix.

In [None]:
# check performance
print('f1 score', (f1_score(y_test, y_pred,average='micro')))
# precision_recall_fscore_support(y_pred, y_pred, average='micro')
cm = confusion_matrix(y_test, y_pred)
cm2 = pd.crosstab(y_test, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)

In [None]:
cm2

#### 5.2.3 Feature Importance
This sub-section will show importance of each feature based on Random Forest model

In [None]:
le.inverse_transform([0,1,2,3])

In [None]:
# feature importance
df_feat_imp = pd.DataFrame({'score':rf.feature_importances_, 'names' : X_train.columns})
df_feat_imp.sort_values('score',ascending=False)

#### 5.2.4 Extract and Visualize Random Forest Tree
In this sub-section, we visualize one tree extracted from the trained Random Forest model.

In [None]:
# Extract single tree
estimator = rf.estimators_[5]

from sklearn.tree import export_graphviz
# Export as dot file
export_graphviz(estimator, out_file='tree.dot', 
                feature_names = None,
                class_names = None,
                rounded = True, proportion = False, 
                precision = 2, filled = True
                )

# Convert to png using system command (requires Graphviz)
from subprocess import call
call(['dot', '-Tpng', 'tree.dot', '-o', 'tree.png', '-Gdpi=600'])

# Display in jupyter notebook
from IPython.display import Image
Image(filename = 'tree.png')
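
If Graphviz is not available, a plain-text rendering of a tree is one possible fallback. The sketch below uses `sklearn.tree.export_text` on a small forest trained on the iris dataset, which is only a stand-in for the weather features used in this module.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Stand-in data; the notebook's X_train / y_train would be used instead.
iris = load_iris()
forest = RandomForestClassifier(n_estimators=5, max_depth=2, random_state=0)
forest.fit(iris.data, iris.target)

# Render the first tree of the forest as indented text rules.
text_tree = export_text(forest.estimators_[0], feature_names=list(iris.feature_names))
print(text_tree)
```

The text output needs no system dependency and is easy to paste into a report, at the cost of being harder to read for deep trees.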

In [None]:
estimator

### 5.3 XGBoost Modeling
This sub-section covers XGBoost modeling, from training to evaluation and feature importance.

#### 5.3.1 XGBoost Training

In [None]:
# fit into xgb classifier

model = XGBClassifier(max_depth = 5,eta = 0.1, n_estimators=5, random_state = 50 )
model.fit(X_train, y_train)

y_pred_xgb = model.predict(X_test)

In [None]:
model

#### 5.3.2 XGBoost Performance Measurement
As with the Random Forest model earlier, we measure XGBoost performance with the F1-score and show its confusion matrix.

In [None]:
# check performance
print('f1 score', (f1_score(y_test, y_pred_xgb,average='micro')))
# precision_recall_fscore_support(y_pred, y_pred, average='micro')
cm = confusion_matrix(y_test, y_pred_xgb)
cm2 = pd.crosstab(y_test, y_pred_xgb, rownames=['True'], colnames=['Predicted'], margins=True)

#### 5.3.3 Feature Importance
This sub-section will show importance of each feature based on XGBoost model

In [None]:
# feature importance
df_feat_imp = pd.DataFrame({'score':model.feature_importances_, 'names' : X_train.columns})
df_feat_imp.sort_values('score',ascending=False)

## 6. Post-modeling 
This section shows an example of how to draw a conclusion about whether a person will cycle under a given weather condition. The conclusion is expressed as a probability of cycling.
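
The calculation multiplies the predicted probability of each weather class by a survey weight for that class, then reports the largest weighted value as a share of the weighted total. The sketch below works through the arithmetic for one row. The class probabilities are made up, the survey weights are the ones used in this section, and it is an assumption that the weights are ordered like the encoded classes (cloudy, fog, rain, sunny, wind).

```python
import numpy as np

classes = np.array(["cloudy", "fog", "rain", "sunny", "wind"])   # assumed class order
weather_proba = np.array([0.50, 0.05, 0.30, 0.10, 0.05])         # hypothetical model output
survey_input = np.array([0.76, 0.76, 0.8, 0.15, 0.05])           # survey weights per class

weighted = weather_proba * survey_input      # element-wise product per class
best = int(np.argmax(weighted))              # class with the largest weighted value
share = weighted[best] / weighted.sum()      # normalised so the shares sum to 1

print(classes[best], round(float(share), 4))
```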

In [None]:
idx = 20984 

# input feature
tes_pred = list(X_test.loc[idx])

# survey input 
# survey_input =  np.array([0.4, 0.3, 0.1, 0.15, 0.05 ])
survey_input =  np.array([0.76, 0.76, 0.8, 0.15, 0.05 ])

# output model x survey
# print ('bike percentage : ', le.classes_[np.argmax(np.multiply(rf.predict_proba(np.reshape(tes_pred,(1,-1)))[0], survey_input))], np.max(np.multiply(rf.predict_proba(np.reshape(tes_pred,(1,-1)))[0], survey_input)))

# weight the predicted class probabilities by the survey input (computed once and reused)
weighted = np.multiply(rf.predict_proba(np.reshape(tes_pred,(1,-1)))[0], survey_input)

print('actual :', y_test.loc[idx])
print('bike percentage : ', le.classes_[np.argmax(weighted)], np.max(weighted) / np.sum(weighted))

In [None]:
# bulk post modeling 
X_proba = pd.DataFrame(rf.predict_proba(X_test))
X_proba['proba']= X_proba.values.tolist()
X_proba['proba_index'] = X_proba.apply(lambda x:np.argmax(np.multiply(np.array(x['proba']),survey_input)),axis=1)
X_proba['proba_max'] = X_proba.apply(lambda x:np.max(np.multiply(np.array(x['proba']),survey_input)),axis=1)
X_proba['proba_all'] = X_proba.apply(lambda x:np.sum(np.multiply(np.array(x['proba']),survey_input)),axis=1)
X_proba['proba_bike'] = X_proba['proba_max']/X_proba['proba_all']
X_proba['pred_weather'] = X_proba['proba_index'].apply(lambda x:le.classes_[int(x)])
# X_proba['actual'] = pd.DataFrame(y_test).reset_index().drop('index',axis=1)
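
The row-wise `apply` calls above can also be written with NumPy broadcasting, which processes the whole probability matrix at once. This is a sketch under assumptions: `score_cycling` is a hypothetical helper name, and the probability rows are made up rather than taken from `rf.predict_proba`.

```python
import numpy as np
import pandas as pd

def score_cycling(proba, survey, class_names):
    """Weight each row of class probabilities by the survey input."""
    weighted = np.asarray(proba) * np.asarray(survey)   # broadcasts survey over rows
    best = weighted.argmax(axis=1)
    return pd.DataFrame({
        "pred_weather": np.asarray(class_names)[best],
        "proba_bike": weighted.max(axis=1) / weighted.sum(axis=1),
    })

class_names = ["cloudy", "fog", "rain", "sunny", "wind"]   # assumed class order
survey = [0.76, 0.76, 0.8, 0.15, 0.05]
proba = [[0.50, 0.05, 0.30, 0.10, 0.05],                   # hypothetical rows
         [0.00, 0.00, 0.10, 0.90, 0.00]]

scores = score_cycling(proba, survey, class_names)
print(scores)
```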

In [None]:
X_proba