# Weather Data 


This part of the report refers to the analysis of the weather data. These data include the complete catalog of local climatological data from 2009 through 2015 for NYC.

## Libraries
In the cell bellow, the libraries that will be used for the manipulation of the data are imported.
* Pandas is a software library written for data manipulation and analysis. 
* Numpy is a software library that is used for scientific computing in python. 
* Sqlalchemy is an opensource SQL toolkit that is used in python.



In [1]:
import pandas as pd
import numpy as np
from sqlalchemy import create_engine

## Getting the path of the weather CSV files 

The primary step in data analysis is to read the data to enable their manipulation. In the cell bellow the data are imported to Jupiter notebook from the path that they are saved in the computer. 

Since, we are interested in importing multiple files at once, a directory is created using the function listdir (imported from the os library), that allows the import and the manipulation of multiple files at once.  

To be able to analyze the data we need to read the data. The data is in the weather folder with the individual CSV files labelled 2009 to 2015. We make a list of files paths and read the required collumns subsequently concatenating them into a single data frame. 

In [2]:
from os import listdir
path = "C:/Users/Dimitris/Documents/Columbia/Tools For Analytics/Final Project/Weather Data"
directory = [path + "/" + filename for filename in listdir(path) if filename.endswith(".csv")]
directory

['C:/Users/Dimitris/Documents/Columbia/Tools For Analytics/Final Project/Weather Data/2009.csv',
 'C:/Users/Dimitris/Documents/Columbia/Tools For Analytics/Final Project/Weather Data/2010.csv',
 'C:/Users/Dimitris/Documents/Columbia/Tools For Analytics/Final Project/Weather Data/2011.csv',
 'C:/Users/Dimitris/Documents/Columbia/Tools For Analytics/Final Project/Weather Data/2012.csv',
 'C:/Users/Dimitris/Documents/Columbia/Tools For Analytics/Final Project/Weather Data/2013.csv',
 'C:/Users/Dimitris/Documents/Columbia/Tools For Analytics/Final Project/Weather Data/2014.csv',
 'C:/Users/Dimitris/Documents/Columbia/Tools For Analytics/Final Project/Weather Data/2015.csv']

## Importing the Weather files

The following cell illustrates the import of the weather files in the directory created above using a for loop to iterate through the directory. 

* The for loop allows for the weather data in a csv file to be appended to a dataframe. 

As it can be seen bellow, before the for loop, two empty dataframes, weather_hourly and weather_daily are created. Subsequently, as it can be seen in the for loop, the weather_hourly data are inserted in the empty database created. It's worth mentioning that using the command where we specify hourly_df, we filter the data, selecting only the columns that are worth analyzing for the scope of the project.

In [3]:
weather_hourly = pd.DataFrame()
weather_daily = pd.DataFrame()
for file in directory:
    df = pd.read_csv(file)
    
    hourly_df = df[["DATE","LATITUDE","LONGITUDE","HourlyPrecipitation","HourlyWindSpeed"]]
    weather_hourly = weather_hourly.append(hourly_df, ignore_index=True)
    weather_hourly = pd.concat([weather_hourly,hourly_df])
    
    daily_df = df[["DATE","DailyAverageWindSpeed"]]
    weather_daily = pd.concat([weather_daily,daily_df])

  df = pd.read_csv(file)
  weather_hourly = weather_hourly.append(hourly_df, ignore_index=True)
  df = pd.read_csv(file)
  weather_hourly = weather_hourly.append(hourly_df, ignore_index=True)
  df = pd.read_csv(file)
  weather_hourly = weather_hourly.append(hourly_df, ignore_index=True)
  df = pd.read_csv(file)
  weather_hourly = weather_hourly.append(hourly_df, ignore_index=True)
  df = pd.read_csv(file)
  weather_hourly = weather_hourly.append(hourly_df, ignore_index=True)
  df = pd.read_csv(file)
  weather_hourly = weather_hourly.append(hourly_df, ignore_index=True)
  df = pd.read_csv(file)
  weather_hourly = weather_hourly.append(hourly_df, ignore_index=True)


## Manipulating the date

Now we format the date column for future data manipulation based on date. 

In [4]:
weather_hourly["DATE"] = pd.to_datetime(weather_hourly["DATE"],format='%Y-%m-%d %H:%M:%S')
weather_hourly

Unnamed: 0,DATE,LATITUDE,LONGITUDE,HourlyPrecipitation,HourlyWindSpeed
0,2009-01-01 00:51:00,40.77898,-73.96925,,18.0
1,2009-01-01 01:51:00,40.77898,-73.96925,,18.0
2,2009-01-01 02:51:00,40.77898,-73.96925,,18.0
3,2009-01-01 03:51:00,40.77898,-73.96925,,8.0
4,2009-01-01 04:51:00,40.77898,-73.96925,,11.0
...,...,...,...,...,...
11382,2015-12-31 21:51:00,40.77898,-73.96925,0.00,
11383,2015-12-31 22:51:00,40.77898,-73.96925,0.00,7.0
11384,2015-12-31 23:51:00,40.77898,-73.96925,0.00,5.0
11385,2015-12-31 23:59:00,40.77898,-73.96925,,


## Filtering the NaN data

In the cell bellow, the data in the weather_hourly database are filtered to remove any rows that don't contain any information (Nan) . This filtering of the data is performed with the .dropna() function

In [5]:
weather_hourly_final = weather_hourly.dropna()
weather_hourly_final = weather_hourly_final.reset_index()
weather_hourly_final

Unnamed: 0,index,DATE,LATITUDE,LONGITUDE,HourlyPrecipitation,HourlyWindSpeed
0,36,2009-01-02 12:34:00,40.77898,-73.96925,T,14.0
1,37,2009-01-02 12:51:00,40.77898,-73.96925,T,11.0
2,38,2009-01-02 13:05:00,40.77898,-73.96925,T,7.0
3,39,2009-01-02 13:30:00,40.77898,-73.96925,T,9.0
4,40,2009-01-02 13:46:00,40.77898,-73.96925,T,10.0
...,...,...,...,...,...,...
76845,11379,2015-12-31 18:51:00,40.77898,-73.96925,0.00,3.0
76846,11380,2015-12-31 19:51:00,40.77898,-73.96925,0.00,6.0
76847,11381,2015-12-31 20:51:00,40.77898,-73.96925,0.00,10.0
76848,11383,2015-12-31 22:51:00,40.77898,-73.96925,0.00,7.0


## Checking the data

The .info function bellow is used to ensure that all of the collumns in the database don't contain any null values

In [6]:
weather_hourly_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76850 entries, 0 to 76849
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   index                76850 non-null  int64         
 1   DATE                 76850 non-null  datetime64[ns]
 2   LATITUDE             76850 non-null  float64       
 3   LONGITUDE            76850 non-null  float64       
 4   HourlyPrecipitation  76850 non-null  object        
 5   HourlyWindSpeed      76850 non-null  float64       
dtypes: datetime64[ns](1), float64(3), int64(1), object(1)
memory usage: 3.5+ MB


## Replacing  Precipitation Trace 


As it was interpreted reading the data, in some rows HourlyPrecipitation contained the letter T (for Trace). Research was contacted and it was deduced that T usually translates to less than 0.01. To enable the data intepretation, and for convenience, the letter T was swapped with 0.01. 

In [7]:
weather_hourly_final['HourlyPrecipitation'] = weather_hourly_final ['HourlyPrecipitation'].str.replace('T',"0.01")
weather_hourly_final

Unnamed: 0,index,DATE,LATITUDE,LONGITUDE,HourlyPrecipitation,HourlyWindSpeed
0,36,2009-01-02 12:34:00,40.77898,-73.96925,0.01,14.0
1,37,2009-01-02 12:51:00,40.77898,-73.96925,0.01,11.0
2,38,2009-01-02 13:05:00,40.77898,-73.96925,0.01,7.0
3,39,2009-01-02 13:30:00,40.77898,-73.96925,0.01,9.0
4,40,2009-01-02 13:46:00,40.77898,-73.96925,0.01,10.0
...,...,...,...,...,...,...
76845,11379,2015-12-31 18:51:00,40.77898,-73.96925,0.00,3.0
76846,11380,2015-12-31 19:51:00,40.77898,-73.96925,0.00,6.0
76847,11381,2015-12-31 20:51:00,40.77898,-73.96925,0.00,10.0
76848,11383,2015-12-31 22:51:00,40.77898,-73.96925,0.00,7.0


## Converting Pandas dataframe to SQL Table 

Here, the processed panda dataframe for weather hourly is converted into a SQL Table and consequently saved as a table, weather_hourly to an SQL Database named project.db


In [None]:
engine = create_engine('sqlite:///C:/Users/Dimitris/Downloads/Project.db', echo=False)
weather_hourly_final.to_sql('Weather_hourly', con=engine)

## Following same procedure for Weather_daily data

The same procedures and functions that were followed and used for the weather_hourly data are followed for the weather_daily data to format the data and enable their manipulation

In [None]:
weather_daily["DATE"] = pd.to_datetime(weather_daily["DATE"],format='%Y-%m-%d %H:%M:%S')
weather_daily_final = weather_daily.dropna()
weather_daily_final = weather_daily_final.reset_index()
weather_daily_final = weather_daily_final.drop(columns=["index"])

engine = create_engine('sqlite:///C:/Users/Dimitris/Downloads/Project.db', echo=False)
weather_daily_final.to_sql('Weather_daily', con=engine)