# ETL Project
<ul>
    <li>UofMN Data Visualization and Analytics Bootcamp</li>
    <li>Week 13 | ETL Project</li>
    <li>Created by: Stephanie Hartje, Chris Howard</li>
    <li>05/18/2019</li>
</ul>

### Project Description and Purpose
<p>This project extracts (E) data from multiple sources, uses the Python pandas library to transform (T) it into 
    useful tables, and then maps and loads (L) those tables into a SQL database. No direct analysis is done on
    the data in this project; the intention is to have a usable database for a theoretical analysis at the end of 
    the process.</p>
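<p>The three stages can be sketched end to end as follows. This is a minimal illustration rather than the project's actual pipeline: it assumes an in-memory SQLite database in place of MySQL, and the table name <code>hurricanes</code> and the two input rows are made up for the example.</p>

```python
import sqlite3

import pandas as pd

# Extract: a tiny stand-in for a source file (made-up rows, not real data)
raw = pd.DataFrame({
    "ID": ["XX010001", "XX020001"],
    "Date": [20010625, 20010705],
})

# Transform: split the integer date into year/month/day strings
date_str = raw["Date"].astype(str)
events = pd.DataFrame({
    "Year": date_str.str[0:4],
    "Month": date_str.str[4:6],
    "Day": date_str.str[6:8],
    "ID": raw["ID"],
})

# Load: write to an in-memory SQLite database
# (assumption: SQLite keeps the sketch self-contained; the project targets MySQL)
conn = sqlite3.connect(":memory:")
events.to_sql("hurricanes", conn, if_exists="append", index=False)

# Read the table back to confirm the load
loaded = pd.read_sql("SELECT * FROM hurricanes", conn)
```
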
<p>Our theoretical analysis looks for any (albeit spurious) correlation between solar eclipses, UFO sightings, and 
    several kinds of natural disaster, including hurricanes and earthquakes. Each event type has its own table 
    in the database with, at a minimum, an event date, some form of ID, and a location (including latitude and longitude where
    available). All dates have been separated into 'year', 'month', and 'day' columns so that events can easily be 
    compared for clusters around certain months as well as by year and location.</p>
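<p>As a rough sketch of why the split columns help, the snippet below separates date strings of varying precision into year, month, and day, then counts events per month. The helper <code>split_date</code> and the sample dates are hypothetical and only illustrate the idea; they are not taken from the project data.</p>

```python
import pandas as pd


def split_date(text):
    """Return (year, month, day); parts missing from the string become None."""
    # Drop a trailing 's' so decade-style entries such as '1940s' yield a year
    parts = str(text).strip().rstrip("s").split("-")
    parts += [None] * (3 - len(parts))
    return tuple(parts[:3])


# Made-up date strings with full, partial, and decade-only precision
events = pd.DataFrame({"Date": ["2001-06-24", "2002-07", "1940s", "2003-06-03"]})

# Build the three columns and place them in front of the original data
parts = pd.DataFrame(
    [split_date(d) for d in events["Date"]],
    columns=["Year", "Month", "Day"],
)
events = pd.concat([parts, events], axis=1)

# Count events per month; rows without a month are left out by groupby
per_month = events.groupby("Month").size()
```
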
<p>The SQL code for our database can be found in our repository, or opened directly in a new Jupyter window <a href='../edit/disaster_etl.sql'>using this link</a> if this notebook is being run locally within a copy of the repository.</p>

In [None]:
# imports
import pandas as pd
import numpy as np
import requests
from sqlalchemy import create_engine
import config

In [None]:
## Chris Extract/Transform below


In [None]:
# UFO data from Wikipedia: tables for the 19th and 20th centuries
ufo_url = 'https://en.wikipedia.org/wiki/List_of_reported_UFO_sightings'
ufo_tables = pd.read_html(ufo_url)  # fetch and parse the page once
ufo_df_19th = ufo_tables[5]
ufo_df_20th = ufo_tables[6]

# remove label row from 20th century data
ufo_df_20th = ufo_df_20th.drop(0)

# combine tables into single dataframe (DataFrame.append was removed in pandas 2.0)
ufo_df = pd.concat([ufo_df_19th, ufo_df_20th], ignore_index=True)

# use first row as column headers, then reindex removing top row
ufo_df.columns = ufo_df.iloc[0]
ufo_df = ufo_df.reindex(ufo_df.index.drop(0))

# split each 'Date' string into year/month/day; parts that are missing become None
dates = ufo_df['Date']
year = []
month = []
day = []
for date in dates:
    date = date.strip('s')
    date = date.split('-')
    year.append(date[0])
    if len(date) > 1:
        month.append(date[1])
    else:
        month.append(None)
    if len(date) > 2:
        day.append(date[2])
    else:
        day.append(None)

# insert 'Year' 'Month' 'Day' columns into the dataframe
ufo_df.insert(loc=0, column='Year', value=year)
ufo_df.insert(loc=1, column='Month', value=month)
ufo_df.insert(loc=2, column='Day', value=day)
ufo_df_clean = ufo_df[['Year', 'Month', 'Day', 'Date', 'Name', 'Country', 'Description']].copy()
ufo_df_clean


In [None]:
# eclipse data for 1901-2000 and 2001-2100 from local CSV files
eclipse_1900 = pd.read_csv('Data/1901-2000.csv', index_col=False)
eclipse_2000 = pd.read_csv('Data/2001-2100.csv', index_col=False)


In [None]:
## Stephanie Extract/Transform below

# Extract CSVs into DataFrames
#   AtlanticStorms from https://www.kaggle.com/noaa/hurricane-database
#     - Each date has up to 5 observations (but not all days have 5)
#     - Older data appears to use -999 for wind pressure and speed instead of something like NA
#     - ID: AL = Atlantic, XX = storm number for the year, YYYY = year
#   PacificStorms from https://www.kaggle.com/noaa/hurricane-database
#     - Each date has up to 5 observations (but not all days have 5)
#     - Older data appears to use -999 for wind pressure and speed instead of something like NA
#     - ID: EP = Pacific, XX = storm number for the year, YYYY = year

In [None]:
#Extract Atlantic Storm Data

AtlanticStorm = "Data/Atlantic_Storms.csv"
AtlanticStorm_df = pd.read_csv(AtlanticStorm)
AtlanticStorm_df.head()

In [None]:
#Extract Pacific Storm Data

PacificStorm = "Data/Pacific_Storms.csv"
PacificStorm_df = pd.read_csv(PacificStorm)
PacificStorm_df.head()

In [None]:
# Combine Atlantic and Pacific Storm Data

AtlPacStorms = [AtlanticStorm_df, PacificStorm_df]
AtlPacStorms_df = pd.concat(AtlPacStorms).reset_index(drop=True)
AtlPacStorms_df.head()

In [None]:
# Check that Pacific Storms are included in combined df

AtlPacStorms_df.loc[AtlPacStorms_df['ID'] == "EP011949"]

In [None]:
AtlPacStorms_df.dtypes

In [None]:
# Adjust date format

# make string version of original Date column, call it 'col'
AtlPacStorms_df['col'] = AtlPacStorms_df['Date'].apply(str)

# make the new columns using string indexing
AtlPacStorms_df['Year'] = AtlPacStorms_df['col'].str[0:4]
AtlPacStorms_df['Month'] = AtlPacStorms_df['col'].str[4:6]
AtlPacStorms_df['Day'] = AtlPacStorms_df['col'].str[6:8]

# drop the temporary helper column
AtlPacStorms_df.drop('col', axis=1, inplace=True)

#check result
AtlPacStorms_df.head()

In [None]:
#Select columns to keep

# take an explicit copy so later column assignments do not raise SettingWithCopyWarning
AtlPacStorms_df = AtlPacStorms_df[["Year", "Month", "Day", "ID", "Status", "Time", "Latitude", "Longitude"]].copy()
AtlPacStorms_df.head()


In [None]:
AtlPacStorms_df["Status"] = AtlPacStorms_df['Status'].astype(str)
AtlPacStorms_df.dtypes

In [None]:
# We are only interested in hurricanes, so keep only rows with Status = HU

AtlPacStorms_df = AtlPacStorms_df.loc[AtlPacStorms_df["Status"] == " HU"]
AtlPacStorms_df.head()

In [None]:
# Keep only the first observation of each unique ID

Hurricane_df = AtlPacStorms_df.drop_duplicates(subset=["ID"], keep="first")
Hurricane_df.head()

In [None]:
# Drop Status and Time columns

Hurricane_df = Hurricane_df[["Year", "Month", "Day", "ID", "Latitude", "Longitude"]]
Hurricane_df = Hurricane_df.reset_index(drop = True)
Hurricane_df.head()


In [None]:
## Chris Load below

In [None]:
conn = f"{config.username}:{config.password}@127.0.0.1/disaster_etl"
engine = create_engine(f'mysql+pymysql://{conn}')

In [None]:
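# list existing tables (Engine.table_names() was removed in SQLAlchemy 2.0)
from sqlalchemy import inspect
inspect(engine).get_table_names()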

In [None]:
ufo_df_clean.to_sql(name='ufo_sightings', con=engine, if_exists='append', index=False)

In [None]:
## Stephanie Load below