## Import packages

In [1]:
import numpy as np
import pandas as pd

import re

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.neural_network import MLPClassifier
from sklearn import set_config

## Explore data

* Read file `weather_hist.xlsx`

In [3]:
df = pd.read_excel('./data/weather_hist.xlsx')
df.head(5)

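|   | date | time | temperature | dew_point | humidity | wind | wind_speed | wind_gust | pressure | precip. | condition |
|---|------|------|-------------|-----------|----------|------|------------|-----------|----------|---------|-----------|
| 0 | 2021-07-01 | 12:00 AM | 81 | 79 | 94 | WSW | 6 | 0 | 29.76 | 0.0 | Partly Cloudy |
| 1 | 2021-07-01 | 12:30 AM | 81 | 79 | 94 | WSW | 7 | 0 | 29.76 | 0.0 | Partly Cloudy |
| 2 | 2021-07-01 | 1:00 AM | 82 | 79 | 89 | SW | 6 | 0 | 29.76 | 0.0 | Fair |
| 3 | 2021-07-01 | 1:30 AM | 81 | 79 | 94 | SW | 6 | 0 | 29.76 | 0.0 | Fair |
| 4 | 2021-07-01 | 2:00 AM | 81 | 79 | 94 | SSW | 7 | 0 | 29.73 | 0.0 | Fair |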


* Data shape (rows, columns)

In [4]:
df.shape

(3162, 11)

### What is the meaning of each row?

Each row records the weather conditions at a specific date and time. Observations are collected every 30 minutes.
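The 30-minute spacing can be checked by combining `date` and `time` into a single timestamp and looking at the gap between consecutive rows. The following is a minimal sketch on a hypothetical `sample` frame that copies the first five rows shown above; the names `sample`, `timestamp`, `gaps` and `is_half_hourly` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows shown by df.head(5)
sample = pd.DataFrame({
    "date": ["2021-07-01"] * 5,
    "time": ["12:00 AM", "12:30 AM", "1:00 AM", "1:30 AM", "2:00 AM"],
})

# Join the two text columns and parse them as 12-hour clock timestamps
timestamp = pd.to_datetime(
    sample["date"] + " " + sample["time"],
    format="%Y-%m-%d %I:%M %p",
)

# Gap between consecutive observations; the first row has no predecessor
gaps = timestamp.diff().dropna()

# True when every gap in the sample is exactly 30 minutes
is_half_hourly = bool((gaps == pd.Timedelta(minutes=30)).all())
```

On the full `df`, the same check would also reveal missing or duplicated observations, if there are any.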

### What is the meaning of each column?

There are 11 columns:

* `date`: date on which the data was collected (YYYY-MM-DD)
* `time`: time of day in 12-hour format, from 12:00 AM to 11:30 PM
* `temperature`: air temperature (Fahrenheit)
* `dew_point`: the temperature below which water vapor begins to condense and dew can form; it varies with pressure and humidity (Fahrenheit)
* `humidity`: atmospheric moisture (percentage)
* `wind`: wind direction as a compass code (e.g. `WSW`, `SW`)
* `wind_speed`: the rate at which the wind passes a given point (mph, miles per hour)
* `wind_gust`: a brief increase in the speed of the wind (mph)
* `pressure`: sea level pressure (inches Hg)
* `precip.`: any liquid or frozen water that forms in the atmosphere and falls to the Earth (inches)
* `condition`: description of the weather


`condition` is the target attribute we want to predict.


### What are the data types of these columns?

In [5]:
df.dtypes

date            object
time            object
temperature      int64
dew_point        int64
humidity         int64
wind            object
wind_speed       int64
wind_gust        int64
pressure       float64
precip.        float64
condition       object
dtype: object

In [12]:
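# Split column names by dtype: numeric columns vs. text (object) columns.
# Note: cate_cols also contains the target column `condition`.
num_cols = list(df.select_dtypes(exclude='object').columns)
cate_cols = list(df.select_dtypes(include='object').columns)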
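Because `condition` has dtype `object`, a dtype-based split places the target among the categorical columns. One possible way to keep it out of the feature lists is sketched below on a hypothetical two-row `sample` frame with the same column names as `df`. The names `target`, `features`, `num_features` and `cate_features` are assumptions for illustration, and the sketch selects numeric columns with `include="number"` instead of excluding `object`, which is a swapped-in choice rather than the notebook's own approach.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as df
sample = pd.DataFrame({
    "date": ["2021-07-01", "2021-07-01"],
    "time": ["12:00 AM", "12:30 AM"],
    "temperature": [81, 81],
    "dew_point": [79, 79],
    "humidity": [94, 94],
    "wind": ["WSW", "WSW"],
    "wind_speed": [6, 7],
    "wind_gust": [0, 0],
    "pressure": [29.76, 29.76],
    "precip.": [0.0, 0.0],
    "condition": ["Partly Cloudy", "Partly Cloudy"],
})

target = "condition"

# Drop the target first so it cannot end up in either feature list
features = sample.drop(columns=[target])

# Numeric features: integer and float columns
num_features = features.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical (text) here
cate_features = [c for c in features.columns if c not in num_features]
```

Whether `date` and `time` should stay categorical or be converted to timestamps is a separate decision that this sketch does not make.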

### What will we predict?

In [13]:
df.condition.value_counts()

Partly Cloudy                1709
Mostly Cloudy                 900
Fair                          190
Light Rain                    153
Light Rain Shower              71
Light Rain with Thunder        42
T-Storm                        20
Thunder in the Vicinity        16
Rain Shower                     8
Heavy T-Storm                   8
Thunder                         7
Showers in the Vicinity         7
Fog                             6
Partly Cloudy / Windy           6
Light Rain Shower / Windy       5
Heavy T-Storm / Windy           4
Heavy Rain Shower               4
Heavy Rain Shower / Windy       2
Mostly Cloudy / Windy           2
Rain Shower / Windy             2
Name: condition, dtype: int64
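The counts above are heavily skewed towards two classes. A minimal sketch of how that skew could be quantified follows; it rebuilds the counts by hand from the output above so that it runs on its own. The names `counts`, `share` and `rare`, and the cut-off of 10 rows for a "rare" class, are assumptions for illustration only.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    "Partly Cloudy": 1709,
    "Mostly Cloudy": 900,
    "Fair": 190,
    "Light Rain": 153,
    "Light Rain Shower": 71,
    "Light Rain with Thunder": 42,
    "T-Storm": 20,
    "Thunder in the Vicinity": 16,
    "Rain Shower": 8,
    "Heavy T-Storm": 8,
    "Thunder": 7,
    "Showers in the Vicinity": 7,
    "Fog": 6,
    "Partly Cloudy / Windy": 6,
    "Light Rain Shower / Windy": 5,
    "Heavy T-Storm / Windy": 4,
    "Heavy Rain Shower": 4,
    "Heavy Rain Shower / Windy": 2,
    "Mostly Cloudy / Windy": 2,
    "Rain Shower / Windy": 2,
})

# Fraction of all rows that falls into each class
share = counts / counts.sum()

# Combined share of the two most frequent classes
top_two_share = share.iloc[:2].sum()

# Classes with fewer than 10 rows (hypothetical threshold)
rare = counts[counts < 10]
```

On the real data, `df.condition.value_counts(normalize=True)` gives the same per-class shares directly.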