# Weather Data Cleaning

# Introduction
This project is meant to gather insights on electricity usage.
This step cleans the raw weather data so that it can be correlated with the Energy Usage Data and used for visualizations and machine learning algorithms.

## Data Source
This data was collected using [Meteostat](https://github.com/meteostat/meteostat-python). The Meteostat Python library provides a simple API for accessing open weather and climate data. The historical observations and statistics are collected by Meteostat from different public interfaces, most of which are governmental.

Among the data sources are national weather services like the National Oceanic and Atmospheric Administration (NOAA) and Germany's national meteorological service (DWD).

# Goals
* become familiar with the dataset
* remove redundant data
* clean anomalous data

### Source: [Meteostat Documentation](https://dev.meteostat.net/python/hourly.html#data-structure)

|**Column**|**Description**|**Type**|
|-|-|-|
|**station**|Meteostat ID of the weather station (only if query refers to multiple stations)|String|
|**time**|datetime of the observation|Datetime64|
|**temp**|air temperature in *°C*|Float64|
|**dwpt**|dew point in *°C*|Float64|
|**rhum**|relative humidity in percent (*%*)|Float64|
|**prcp**|one hour precipitation total in *mm*|Float64|
|**snow**|snow depth in *mm*|Float64|
|**wdir**|average wind direction in degrees (*°*)|Float64|
|**wspd**|average wind speed in *km/h*|Float64|
|**wpgt**|peak wind gust in *km/h*|Float64|
|**pres**|average sea-level air pressure in *hPa*|Float64|
|**tsun**|one hour sunshine total in minutes (*min*)|Float64|
|**coco**|[weather condition code](https://dev.meteostat.net/docs/formats.html#weather-condition-codes) |Float64|
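
The units in the table suggest some simple sanity checks for the "clean anomalous data" goal. The sketch below is illustrative only: the `VALID_RANGES` bounds, the `count_out_of_range` helper, and the sample values are assumptions made for this example and do not come from Meteostat.

```python
import numpy as np
import pandas as pd

# Plausible bounds implied by the units in the table (assumed, not from Meteostat).
VALID_RANGES = {
    "rhum": (0.0, 100.0),   # percent
    "wdir": (0.0, 360.0),   # degrees
    "prcp": (0.0, np.inf),  # mm, cannot be negative
    "wspd": (0.0, np.inf),  # km/h, cannot be negative
}


def count_out_of_range(df: pd.DataFrame, ranges: dict) -> pd.Series:
    """Count values per column that fall outside [low, high]; NaN is not counted."""
    counts = {}
    for col, (low, high) in ranges.items():
        if col in df.columns:
            values = df[col]
            counts[col] = int(((values < low) | (values > high)).sum())
    return pd.Series(counts, dtype="int64")


# Made-up sample: one impossible humidity and one impossible wind direction.
sample = pd.DataFrame({
    "rhum": [44.0, 51.0, 120.0],
    "wdir": [190.0, 400.0, 180.0],
    "prcp": [0.0, 0.0, 0.0],
})
out_of_range = count_out_of_range(sample, VALID_RANGES)
print(out_of_range)
```

Columns missing from the frame are skipped, so the same dictionary can be reused after empty columns have been dropped.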

In [2]:
import pandas as pd
import numpy as np

In [4]:
# Load the raw weather CSV from the 'data' directory

# Define the directory path and the glob (wildcard) pattern
import glob
directory_path = "./data"
file_pattern = "weather_*.csv"

# Use glob.glob to match filenames based on the pattern
file_path = glob.glob(f"{directory_path}/{file_pattern}")[0]
weather_df_raw = pd.read_csv(filepath_or_buffer=file_path)
weather_df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8785 entries, 0 to 8784
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   time    8785 non-null   object 
 1   temp    8785 non-null   float64
 2   dwpt    8785 non-null   float64
 3   rhum    8785 non-null   float64
 4   prcp    8785 non-null   float64
 5   snow    0 non-null      float64
 6   wdir    8785 non-null   float64
 7   wspd    8785 non-null   float64
 8   wpgt    0 non-null      float64
 9   pres    8785 non-null   float64
 10  tsun    0 non-null      float64
 11  coco    8785 non-null   float64
dtypes: float64(11), object(1)
memory usage: 823.7+ KB
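
The `info()` output lists `time` as `object`, while the Meteostat table describes it as Datetime64, so the timestamps were read as plain strings. A minimal sketch of two ways to parse them follows. The inline CSV is a trimmed copy of the first rows of this dataset, and the names `csv_text`, `parsed_df`, and `loaded_df` are placeholders for this example.

```python
import io

import pandas as pd

# Two rows from the dataset, trimmed to a few columns for brevity.
csv_text = """time,temp,rhum
2022-10-21 00:00:00,13.0,44.0
2022-10-21 01:00:00,10.7,51.0
"""

# Option 1: parse_dates converts the listed columns while reading.
parsed_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["time"])
print(parsed_df.dtypes)

# Option 2: convert a frame that has already been loaded.
loaded_df = pd.read_csv(io.StringIO(csv_text))
loaded_df["time"] = pd.to_datetime(loaded_df["time"], format="%Y-%m-%d %H:%M:%S")
```

Either option would let later steps use time-based indexing and resampling, which string timestamps do not support.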


In [5]:
weather_df_raw.head()

Unnamed: 0,time,temp,dwpt,rhum,prcp,snow,wdir,wspd,wpgt,pres,tsun,coco
0,2022-10-21 00:00:00,13.0,1.1,44.0,0.0,,190.0,7.6,,1008.0,,3.0
1,2022-10-21 01:00:00,10.7,1.0,51.0,0.0,,160.0,7.6,,1008.0,,3.0
2,2022-10-21 02:00:00,9.0,1.5,59.0,0.0,,180.0,5.4,,1008.0,,3.0
3,2022-10-21 03:00:00,9.0,1.5,59.0,0.0,,180.0,5.4,,1008.0,,3.0
4,2022-10-21 04:00:00,7.6,1.5,65.0,0.0,,170.0,5.4,,1008.0,,1.0
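
The rows above are spaced one hour apart. One way to check that this holds for a whole series is to look for duplicated and missing timestamps. The sketch below uses made-up timestamps modeled on the head output, with one duplicate and one gap added on purpose. It is not a result from the real data.

```python
import pandas as pd

# Made-up timestamps: 01:00 appears twice and 03:00 is left out deliberately.
times = pd.to_datetime(pd.Series([
    "2022-10-21 00:00:00",
    "2022-10-21 01:00:00",
    "2022-10-21 01:00:00",
    "2022-10-21 02:00:00",
    "2022-10-21 04:00:00",
]))

# Number of rows whose timestamp already appeared earlier in the series.
duplicate_count = int(times.duplicated().sum())

# Build the full hourly range and list the expected stamps that are absent.
expected = pd.date_range(times.min(), times.max(), freq=pd.Timedelta(hours=1))
missing = expected.difference(pd.DatetimeIndex(times))

print(duplicate_count, list(missing))
```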


In [None]:
# copy raw data into a df to be cleaned
weather_df = weather_df_raw.copy()

In [7]:
# Print the unique values of the columns reported as entirely null
# to verify that snow, wpgt & tsun are empty
print([weather_df['snow'].unique(),
    weather_df['wpgt'].unique(),
    weather_df['tsun'].unique()])

[array([nan]), array([nan]), array([nan])]


In [9]:
weather_df = weather_df.drop(['snow', 'wpgt', 'tsun'], axis=1)
print(weather_df.columns)

Index(['time', 'temp', 'dwpt', 'rhum', 'prcp', 'wdir', 'wspd', 'pres', 'coco'], dtype='object')
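
Naming the empty columns by hand works for this file. A more general alternative, sketched below on a small stand-in frame, finds and drops every column that is entirely NaN. The names `demo_df`, `empty_cols`, and `cleaned_df` are placeholders for this example.

```python
import numpy as np
import pandas as pd

# Stand-in frame: 'snow' and 'wpgt' are entirely NaN, 'temp' is only partly missing.
demo_df = pd.DataFrame({
    "temp": [13.0, 10.7, np.nan],
    "snow": [np.nan, np.nan, np.nan],
    "wpgt": [np.nan, np.nan, np.nan],
})

# Columns in which every value is missing.
empty_cols = demo_df.columns[demo_df.isna().all()].tolist()

# Drop only those columns; partially filled columns such as 'temp' are kept.
cleaned_df = demo_df.dropna(axis="columns", how="all")

print(empty_cols)
print(cleaned_df.columns.tolist())
```

With `how="all"`, a column is removed only when it has no data at all, so a column with a few gaps is left for later handling.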


# Observations & TODOs
* column titles contain spaces; change them to underscores
* start-end time intervals all seem to be the same length
* 'TYPE', 'UNITS', 'NOTES' columns seem to have no variation in values, so they can be removed
* 'COST' is stored as text instead of a numeric type
* the 2 separate columns for date & time can be combined into datetime objects
* a usage duration column would likely simplify future data visualization/modeling
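
A few of these TODOs can be sketched in pandas. The frame below is made up for illustration: only the `TYPE`, `UNITS`, and `COST` names come from the list above, while the `START TIME` column, the row values, and the leading dollar sign in `COST` are assumptions.

```python
import pandas as pd

# Made-up rows; 'START TIME' and all values are hypothetical.
usage_df = pd.DataFrame({
    "START TIME": ["00:00", "00:15"],
    "TYPE": ["Electric usage", "Electric usage"],
    "UNITS": ["kWh", "kWh"],
    "COST": ["$0.03", "$0.05"],
})

# Replace spaces in column titles with underscores.
usage_df.columns = usage_df.columns.str.replace(" ", "_", regex=False)

# Drop columns that hold a single repeated value.
constant_cols = [c for c in usage_df.columns if usage_df[c].nunique(dropna=False) == 1]
usage_df = usage_df.drop(columns=constant_cols)

# Convert the text cost to a number, assuming a leading dollar sign.
usage_df["COST"] = pd.to_numeric(usage_df["COST"].str.replace("$", "", regex=False))

print(usage_df.dtypes)
```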