# Introduction 

In this notebook, I clean the `weather_features.csv` file located [here](https://github.com/KishenSharma6/Weather-Energy-Consumption-in-Spain/tree/master/Data/01_Raw_Data).

**Read in libraries for notebook**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

**Set notebook preferences**

In [2]:
#Set style for visualizations
plt.style.use('Solarize_Light2')

**Read in data**

In [3]:
#Set path to raw data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Spain Hourly Energy Demand and Weather'

#Read in raw data
df = pd.read_csv(path + '/Data/01_Raw_Data/weather_features.csv',  dtype = {'weather_id': 'object'})
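The read above joins a Windows-style base path to a forward-slash relative path by string concatenation. As a minimal sketch of a more portable alternative, the file location could be built with `pathlib`. The `project_root` value below is a hypothetical placeholder, not the author's actual directory.

```python
from pathlib import Path

# Hypothetical project root; replace with the real location of the project
project_root = Path('path/to/project')

# Build the path piece by piece so the separator matches the operating system
raw_file = project_root / 'Data' / '01_Raw_Data' / 'weather_features.csv'

print(raw_file.name)  # weather_features.csv
```

The resulting `raw_file` can be passed directly to `pd.read_csv`, with the same `dtype` argument as in the cell above.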

# Data Overview

**Data Preview**

In [4]:
#Print df shape
print('Shape of data:', df.shape)

#View head
df.head()

Shape of data: (178396, 17)


Unnamed: 0,dt_iso,city_name,temp,temp_min,temp_max,pressure,humidity,wind_speed,wind_deg,rain_1h,rain_3h,snow_3h,clouds_all,weather_id,weather_main,weather_description,weather_icon
0,2015-01-01 00:00:00+01:00,Valencia,270.475,270.475,270.475,1001,77,1,62,0.0,0.0,0.0,0,800,clear,sky is clear,01n
1,2015-01-01 01:00:00+01:00,Valencia,270.475,270.475,270.475,1001,77,1,62,0.0,0.0,0.0,0,800,clear,sky is clear,01n
2,2015-01-01 02:00:00+01:00,Valencia,269.686,269.686,269.686,1002,78,0,23,0.0,0.0,0.0,0,800,clear,sky is clear,01n
3,2015-01-01 03:00:00+01:00,Valencia,269.686,269.686,269.686,1002,78,0,23,0.0,0.0,0.0,0,800,clear,sky is clear,01n
4,2015-01-01 04:00:00+01:00,Valencia,269.686,269.686,269.686,1002,78,0,23,0.0,0.0,0.0,0,800,clear,sky is clear,01n


# Data Cleaning

**Drop duplicate rows and keep only the timestamp and temperature columns**

In [5]:
#Drop dupes
df.drop_duplicates(inplace=True)

#Subset 'dt_iso','temp'
df = df[['dt_iso','temp']]
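The order of these two steps matters: duplicates are dropped while every column is still present, so only rows that match in full are removed. A small sketch on made-up data (the `toy` frame is illustrative, not taken from the raw file) shows why subsetting first would behave differently:

```python
import pandas as pd

# Toy frame standing in for the raw data (values are made up)
toy = pd.DataFrame({
    'dt_iso': ['2015-01-01 00:00:00+01:00'] * 3,
    'city_name': ['Valencia', 'Valencia', 'Madrid'],
    'temp': [270.475, 270.475, 270.475],
})

# Full-row duplicates: only the repeated Valencia row is removed
deduped = toy.drop_duplicates()

# Subset afterwards; Madrid's row survives because it differed by city
subset = deduped[['dt_iso', 'temp']]

# Subsetting first would make all three rows identical and leave just one
subset_first = toy[['dt_iso', 'temp']].drop_duplicates()

print(len(deduped), len(subset), len(subset_first))  # 2 2 1
```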

**Average temp per time stamp**

This averages the temperatures of the 5 cities in the data set at each timestamp.

In [6]:
#Group by dt_iso and average temperature
df = df.groupby('dt_iso')['temp'].mean().reset_index()

#Check
display(df.head())

Unnamed: 0,dt_iso,temp
0,2015-01-01 00:00:00+01:00,272.491463
1,2015-01-01 01:00:00+01:00,272.5127
2,2015-01-01 02:00:00+01:00,272.099137
3,2015-01-01 03:00:00+01:00,272.089469
4,2015-01-01 04:00:00+01:00,272.1459
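One caveat about this average: the raw file has 178,396 rows, while one row per city per timestamp would give 5 × 35,064 = 175,320, so some city/timestamp pairs may still appear more than once after duplicates are dropped (for example, rows that differ only in their weather description). In that case a plain mean counts that city more than once. The sketch below, on made-up data, compares the plain mean with a two-step mean that first averages within each city. It assumes `city_name` is still available, which is not the case after the subset above.

```python
import pandas as pd

# Toy data: Valencia appears twice at the same timestamp (values are made up)
toy = pd.DataFrame({
    'dt_iso': ['t0', 't0', 't0'],
    'city_name': ['Valencia', 'Valencia', 'Madrid'],
    'temp': [270.0, 270.0, 276.0],
})

# Plain mean per timestamp: Valencia is counted twice
simple = toy.groupby('dt_iso')['temp'].mean()

# Two-step mean: average within each city first, then across cities
per_city = toy.groupby(['dt_iso', 'city_name'])['temp'].mean()
balanced = per_city.groupby(level='dt_iso').mean()

print(simple['t0'], balanced['t0'])  # 272.0 273.0
```

Whether the difference matters for the real data depends on how many such repeated pairs exist, which this notebook does not check.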


**Clean and set timestamp as index**

In [7]:
#Remove the UTC offset suffix (e.g. +01:00)
df['dt_iso'] = df['dt_iso'].str.replace('[+].*', '', regex=True)

#Rename dt_iso to date_time
df.rename(columns = {'dt_iso':'date_time'}, inplace = True)

#Set date_time as index
df.set_index('date_time', inplace=True)

#Check
display(df.head(3))

date_time,temp
2015-01-01 00:00:00,272.491463
2015-01-01 01:00:00,272.5127
2015-01-01 02:00:00,272.099137
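Removing the offset leaves the index as plain strings in local wall-clock time. If the raw file contains more than one UTC offset (the preview only shows `+01:00`, so this is an assumption), two different instants can end up with the same label around a daylight-saving change. A minimal sketch of an alternative, using made-up timestamps, is to parse the strings with `pd.to_datetime(..., utc=True)` instead of stripping the suffix:

```python
import pandas as pd

# Two made-up timestamps with the same wall-clock time but different offsets
stamps = pd.Series(['2015-10-25 02:00:00+02:00', '2015-10-25 02:00:00+01:00'])

# Stripping the offset collapses the two instants into one label
stripped = stamps.str.replace(r'[+].*', '', regex=True)

# Parsing with utc=True keeps them distinct (00:00 and 01:00 UTC)
parsed = pd.to_datetime(stamps, utc=True)

print(stripped.nunique(), parsed.nunique())  # 1 2
```

Parsing also gives a true datetime type rather than strings, which later resampling or plotting steps would need.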


**Convert Temperature from Kelvin to Fahrenheit**

The conversion from Kelvin to Fahrenheit is:
    °F = (K − 273.15) × 9/5 + 32

In [8]:
#Apply conversion formula to the temp column (vectorized, no lambda needed)
df['temp'] = (df['temp'] - 273.15) * (9/5) + 32

#Check
display(df.head())

date_time,temp
2015-01-01 00:00:00,30.814633
2015-01-01 01:00:00,30.85286
2015-01-01 02:00:00,30.108448
2015-01-01 03:00:00,30.091044
2015-01-01 04:00:00,30.19262
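As a quick sanity check of the formula, the first averaged value from earlier (272.491463 K) can be converted by hand and compared with the first row above. The helper name `kelvin_to_fahrenheit` is hypothetical and only used for this illustration.

```python
def kelvin_to_fahrenheit(kelvin):
    """Convert a temperature from Kelvin to degrees Fahrenheit."""
    return (kelvin - 273.15) * (9 / 5) + 32

# First averaged temperature shown earlier in the notebook
first_value = kelvin_to_fahrenheit(272.491463)
print(round(first_value, 6))  # 30.814633

# Freezing point of water: 273.15 K is 32 °F
freezing = kelvin_to_fahrenheit(273.15)
print(freezing)  # 32.0
```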


# Write file to CSV

In [9]:
#Print shape
print('Cleaned data shape:', df.shape)

#Write to CSV
df.to_csv(path + '/Data/02_Cleaned_Data/2020_0620_Cleaned_Weather_Features.csv')

Cleaned data shape: (35064, 1)
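A round-trip read is a simple way to confirm that the written file loads back with `date_time` as its index. The sketch below uses a two-row sample and a temporary directory rather than the real output path; the file name `cleaned_weather_sample.csv` is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Two-row sample shaped like the cleaned data
cleaned = pd.DataFrame(
    {'temp': [30.814633, 30.85286]},
    index=pd.Index(['2015-01-01 00:00:00', '2015-01-01 01:00:00'], name='date_time'),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Placeholder file name inside a throwaway directory
    out_file = os.path.join(tmp_dir, 'cleaned_weather_sample.csv')
    cleaned.to_csv(out_file)

    # Read back, restoring the index and parsing it as datetimes
    restored = pd.read_csv(out_file, index_col='date_time', parse_dates=True)

print(restored.shape)  # (2, 1)
```

Passing `index_col='date_time'` and `parse_dates=True` when reading is what turns the stored strings back into a datetime index; without them the timestamps come back as an ordinary text column.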
