# Geopotential Height Study

Your team are researchers working on geopotential height data collected from 1998 to 2004. You are required to design a model that forecasts geopotential height, so that mountain climbers can be informed about changes in these values.

This EDA challenge is the first step in preparing the data for forecasting. Your team is required to present its insights before moving on to developing the model.

The dataset to be used: eighthr.data, eighthr.names 

## You are constrained to matplotlib and plotly express for plots

In [1]:
## Load packages needed in this study
import warnings
warnings.filterwarnings('ignore')

# Data Transformation

## 1. Load data with header

In [2]:
### Load the data, which doesn't have a header; specify header=None


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,64,65,66,67,68,69,70,71,72,73
0,1/1/1998,0.8,1.8,2.4,2.1,2.0,2.1,1.5,1.7,1.9,...,0.15,10.67,-1.56,5795,-12.1,17.9,10330,-55,0,0.0
1,1/2/1998,2.8,3.2,3.3,2.7,3.3,3.2,2.9,2.8,3.1,...,0.48,8.39,3.84,5805,14.05,29.0,10275,-55,0,0.0
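
A minimal sketch of how this cell might be filled in, assuming `eighthr.data` is a comma-separated file without a header row. To stay self-contained it reads a two-row excerpt (first five fields only) from a string; in the notebook the file path would be passed instead.

```python
import io
import pandas as pd

# Excerpt of the first two records (first five fields only), standing in
# for the real file; in the notebook this would be 'eighthr.data'.
sample = io.StringIO(
    "1/1/1998,0.8,1.8,2.4,2.1\n"
    "1/2/1998,2.8,3.2,3.3,2.7\n"
)

# header=None tells pandas the first line is data, so the columns are
# numbered 0..n-1 instead of being named after the first record.
df = pd.read_csv(sample, header=None)
print(df.head())
```

Without `header=None`, pandas would consume the first record as column names and the dataset would silently lose one day.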


In [3]:
### Load the header information


Unnamed: 0,1,0 | two classes 1: ozone day,0: normal day
0,Date: ignore.,,
1,WSR0: continuous.,,


In [4]:
### Remove unneeded info such as 'ignore' and 'continuous' from the table above
### Fit the header on the data
### Replace '?' with '0' in the dataframe

Unnamed: 0,Date,WSR0,WSR1,WSR2,WSR3,WSR4,WSR5,WSR6,WSR7,WSR8,...,RH50,U50,V50,HT50,KI,TT,SLP,SLP_,Precp,ozone day
0,1/1/1998,0.8,1.8,2.4,2.1,2.0,2.1,1.5,1.7,1.9,...,0.15,10.67,-1.56,5795,-12.1,17.9,10330,-55,0,0.0
1,1/2/1998,2.8,3.2,3.3,2.7,3.3,3.2,2.9,2.8,3.1,...,0.48,8.39,3.84,5805,14.05,29.0,10275,-55,0,0.0
2,1/3/1998,2.9,2.8,2.6,2.1,2.2,2.5,2.5,2.7,2.2,...,0.6,6.94,9.8,5790,17.9,41.3,10235,-40,0,0.0
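
One possible way to carry out the three steps above is sketched here, assuming each line of `eighthr.names` after the first has the form `NAME: type.` and that the final data field is the class label. The two in-memory excerpts are hypothetical stand-ins for the real files, and the `?` in the second row is made up to show the replacement.

```python
import io
import pandas as pd

# Hypothetical excerpts standing in for 'eighthr.names' and 'eighthr.data'
names_txt = io.StringIO(
    "1,0 | two classes 1: ozone day, 0: normal day\n"
    "Date: ignore.\n"
    "WSR0: continuous.\n"
    "WSR1: continuous.\n"
)
data_txt = io.StringIO(
    "1/1/1998,0.8,1.8,0.\n"
    "1/2/1998,2.8,?,0.\n"   # '?' inserted for illustration only
)

# Skip the first line (class description) and keep only the text before ':'
lines = names_txt.read().splitlines()[1:]
columns = [line.split(':')[0].strip() for line in lines if line.strip()]
columns.append('ozone day')  # assumed name for the final class column

df = pd.read_csv(data_txt, header=None)
df.columns = columns          # fit the header on the data
df = df.replace('?', '0')     # missing-value marker becomes the string '0'
print(df)
```

Columns that contained `?` stay as strings after the replacement, so they still need a cast such as `astype(float)` before any numeric work.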


## 2. Transform data 

Geopotential height at 500 hPa and 700 hPa: 'HT50', 'HT70'

Sea level pressure: 'SLP'

SLP change from previous day: 'SLP_'

In [5]:
### Create a new dataframe df_m with 'Date','SLP','HT50','HT70','SLP_'
df_m.head(3)

Unnamed: 0,Date,SLP,HT50,HT70,SLP_
0,1/1/1998,10330,5795,3178.5,-55
1,1/2/1998,10275,5805,3172.0,-55
2,1/3/1998,10235,5790,3160.0,-40
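
A short sketch of the column selection, assuming the full dataframe is called `df` (a hypothetical name). The three rows are copied from the output above, with one extra column added to show that it is dropped.

```python
import pandas as pd

# Three rows copied from the output above, plus one unrelated column (WSR0)
df = pd.DataFrame({
    'Date': ['1/1/1998', '1/2/1998', '1/3/1998'],
    'WSR0': [0.8, 2.8, 2.9],
    'SLP': [10330, 10275, 10235],
    'HT50': [5795, 5805, 5790],
    'HT70': [3178.5, 3172.0, 3160.0],
    'SLP_': [-55, -55, -40],
})

# Select the columns of interest in the requested order; .copy() makes
# df_m independent of df so later column assignments do not warn.
df_m = df[['Date', 'SLP', 'HT50', 'HT70', 'SLP_']].copy()
print(df_m.head(3))
```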


# Data Visualization

## 3. Compare HT50 and HT70 over time

In [7]:
# Remember to use plotly express

## 4. Dashboard

Run this dashboard and use it to gain insight into each variable.

In [8]:
import dash
import dash_bootstrap_components as dbc
import numpy as np
import plotly.express as px
# dcc and html ship with the dash package itself from Dash 2.0 onwards
from dash import dcc, html
from dash.dependencies import Input, Output

app = dash.Dash()

df_m=df_m.replace('?','0')
df_m['HT50']=df_m['HT50'].astype(float)
df_m['HT70']=df_m['HT70'].astype(float)
df_m['SLP']=df_m['SLP'].astype(float)
df_m['SLP_']=df_m['SLP_'].astype(float)
# Day-over-day variation; diff() leaves NaN in the first row, so back-fill it
# with the second row's value (avoids chained assignment through .iloc)
df_m['HT50_'] = df_m['HT50'].diff().bfill()
df_m['HT70_'] = df_m['HT70'].diff().bfill()

col_options = []
for loc in df_m.columns[1:]:
    col_options.append({'label':str(loc),'value':loc})

app.layout = html.Div(children=[
    html.H1(children='Variation Visualization'),

    # Right-hand panel: variable selector
    html.Div([
        html.H3(children='Select Variable'),
        # The initial value must be a column name, not the whole option dict
        html.Div([dcc.Dropdown(id='feat', options=col_options,
                               value=col_options[0]['value'])]),
    ], style={'width': '20%', 'float': 'right', 'display': 'inline-block', 'height': '500px'}),

    # Main panel: scatter plot plus summary statistics
    html.Div([
        html.Div([
            dcc.Graph(id='graph'),
        ]),

        html.Div([html.H2('Median:'),
                  html.H1(id='dif')
                  ], style={'width': '20%', 'float': 'right', 'fontWeight': 'bold'}),

        html.Div([html.H2('Mean:'),
                  html.H1(id='dif2')
                  ], style={'width': '20%', 'float': 'right', 'fontWeight': 'bold'}),
    ], style={'width': '80%', 'float': 'right', 'display': 'inline-block', 'height': '500px'}),

    html.H3(children='Prepared by Ebude Yolande'),
])

@app.callback(
    Output(component_id='graph', component_property='figure'),
    [Input(component_id='feat', component_property='value')]
)
def graph(feat):
    # Scatter plot of the selected variable against the date
    fig = px.scatter(df_m, x='Date', y=feat)
    fig.update_layout(showlegend=False)
    return fig
    
    
@app.callback(
    Output(component_id='dif', component_property='children'),
    [Input(component_id='feat', component_property='value')]
)
def diff(feat):
    # Median of the selected column, rounded to one decimal place
    return round(float(df_m[feat].median()), 1)

@app.callback(
    Output(component_id='dif2', component_property='children'),
    [Input(component_id='feat', component_property='value')]
)
def diff1(feat):
    # Mean of the selected column, rounded to one decimal place
    return round(float(df_m[feat].mean()), 1)

if __name__ == '__main__':
    app.run_server(port='5050')

Dash is running on http://127.0.0.1:5050/



 * Running on http://127.0.0.1:5050/ (Press CTRL+C to quit)


# Statistics

## 5. Two-sample t-test on HT50_ and HT70_

H0: the two samples have the same mean

H1: the two sample means differ significantly

In [10]:
## Perform the two-sample test
## Conclude on the difference between these two samples
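
A hedged sketch of the test using `scipy.stats.ttest_ind`. The two arrays are synthetic stand-ins for `df_m['HT50_']` and `df_m['HT70_']`, and both the 0.05 significance level and the choice of Welch's variant (`equal_var=False`) are assumptions, not requirements of the exercise.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for df_m['HT50_'] and df_m['HT70_'] (made-up values)
rng = np.random.default_rng(0)
ht50_var = rng.normal(loc=0.0, scale=40.0, size=500)
ht70_var = rng.normal(loc=0.0, scale=25.0, size=500)

# Welch's t-test: compares the means without assuming equal variances
t_stat, p_value = stats.ttest_ind(ht50_var, ht70_var, equal_var=False)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print(f"p = {p_value:.3f} < {alpha}: reject H0, the means differ")
else:
    print(f"p = {p_value:.3f} >= {alpha}: fail to reject H0")
```

Failing to reject H0 only means the data give no evidence of a difference in means; the two variations can still differ in spread, which is worth checking separately.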


## 6. SARIMA model

In [16]:
## Train a SARIMAX model on the variation of HT50 with order=(1, 1, 1), seasonal_order=(1, 1, 0, 12), enforce_stationarity=False, enforce_invertibility=False
## Print the summary table
## Explain insights from correlation


# Statistical Modelling

## 7. Predict the next 100 values of HT50 based on the variation knowledge


In [19]:
### Use the model above to forecast the next 100 values of the variation of HT50
### Plot the prediction
import matplotlib.pyplot as plt

pred_uc = results.get_forecast(steps=100)
pred_ci = pred_uc.conf_int()
ax = df_m['HT50_'].plot(label='observed', figsize=(14, 7))
pred_uc.predicted_mean.plot(ax=ax, label='Forecast')
ax.fill_between(pred_ci.index,
                pred_ci.iloc[:, 0],
                pred_ci.iloc[:, 1], color='k', alpha=.25)
ax.set_xlabel('Date')
ax.set_ylabel('HT50_')
plt.legend()
plt.show()

In [20]:
#### New HT50 values
## Create a new dataframe df_new
## Create a Date column of 100 consecutive days (1-day step)
## Create a column HT50_ for the forecast variations of HT50
## Estimate the new HT50 values and assign them to a column HT50

df_new.head()

Unnamed: 0,Date,HT50_,HT50
0,2004-12-31,31.342868,5851.342868
1,2005-01-01,-12.442909,5807.557091
2,2005-01-02,-47.131082,5772.868918
3,2005-01-03,-12.477055,5807.522945
4,2005-01-04,34.411668,5854.411668
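
The table above is consistent with adding each forecast variation to a fixed base level of 5820 (for example 5820 + 31.342868 = 5851.342868). The sketch below rebuilds the five rows shown under that assumption; `base_ht50` is a hypothetical name, and in the notebook the variations would come from `pred_uc.predicted_mean` with `periods=100`.

```python
import pandas as pd

# First five forecast variations, copied from the output above
variation = [31.342868, -12.442909, -47.131082, -12.477055, 34.411668]
base_ht50 = 5820  # assumed base level, inferred from the table

df_new = pd.DataFrame({
    # Consecutive daily dates starting at the first forecast day
    'Date': pd.date_range(start='2004-12-31', periods=len(variation), freq='D'),
    'HT50_': variation,
})
# New level = fixed base + forecast variation (matches the table above)
df_new['HT50'] = base_ht50 + df_new['HT50_']
print(df_new)
```

If the variations are instead read as day-over-day differences, accumulating them with `base_ht50 + df_new['HT50_'].cumsum()` would be the alternative; the two choices give different series, so the one used should be stated in the presentation.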


In [22]:
### Join the new dataframe to the old dataframe
## Print the mean and median of the combined dataframe
## Plot the variation of HT50
## Conclude on the change of the mean and variance
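
A small sketch of the join and the summary statistics, using `pd.concat`. The frames `df_old` and `df_fc` are hypothetical stand-ins built from a few rows shown earlier in this notebook; in practice they would be `df_m` and `df_new`.

```python
import pandas as pd

# Stand-in for the observed data (three rows from the df_m output)
df_old = pd.DataFrame({
    'Date': pd.to_datetime(['1/1/1998', '1/2/1998', '1/3/1998'], format='%m/%d/%Y'),
    'HT50': [5795.0, 5805.0, 5790.0],
})
df_old['HT50_'] = df_old['HT50'].diff().bfill()

# Stand-in for the forecast data (two rows from the df_new output)
df_fc = pd.DataFrame({
    'Date': pd.to_datetime(['2004-12-31', '2005-01-01']),
    'HT50_': [31.342868, -12.442909],
    'HT50': [5851.342868, 5807.557091],
})

# Stack the forecast rows under the observed rows and renumber the index
df_all = pd.concat([df_old, df_fc], ignore_index=True)

print('mean:', df_all['HT50'].mean(), 'median:', df_all['HT50'].median())
print('variance of HT50_:', df_all['HT50_'].var())
```

Comparing these figures with the same statistics computed on the observed rows alone shows whether the forecast shifts the mean or changes the variance.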


# Presentation

For your presentation, focus on the following:

- Data transformation and the variations
- Why the variation was important for your study
- Model and prediction

Do you think your forecast was good?