# Extract Transform Load (ETL)

## Purpose
- The CovidTracking API provides an extensive, regularly updated dataset that begins in January 2020. However, it does not include cases by day or cumulative deaths, both of which are available in the FinnHub API dataset. The pertinent columns from the CovidTracking API are 'date', 'state', 'positive', 'negative', and 'deathIncrease'. Separating out these parameters and combining them with the FinnHub data in one database makes later analysis easier.

## Project Steps:
- Import dataset from CovidTracking.com API, write to CSV
- Import dataset from Finnhub API, write to CSV
- Using the functions getTypes, describeData, and analyzeNaNs, get each dataset's dtypes, statistics, NaN counts, and length before and after cleaning to verify the changes in the data
- Observe the two datasets:
    the CovidTracking API is a dataset over time, whereas the FinnHub API contains only one day's
    worth of data
- Clean CovidTracking dataset and Finnhub dataset to match data
- Remove unneeded Columns
- Create a state-abbreviation column in the Finnhub dataset from its state column so the two datasets can be merged
- Merge dataframes on States
- Export data to MongoDB or PostgreSQL

## Project Team:
- Kent Thomas
- Cynthia Zhang
- Temitayo David Olanbiwonnu
- Khorolsuren Erdenebat
- Jen S/phi-6180


--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Module Creation
- The next three cells create importable modules that report a DataFrame's dtypes, count its NaNs, and describe its data
- Be sure to UNCOMMENT them when first running the notebook

In [2]:
# #%%writefile getTypes.py
# #Write file creates a module that can be imported with dependencies, %%writefile -a getTypes.py, remove if func is changed

# import pandas as pd
# import numpy as np
# import requests
# import os
# import json
# import matplotlib.pyplot as plt
# from IPython.core.display import HTML
# from datetime import date, datetime

# #getTypes analyzes a DataFrame's column types
# def getTypes(dataFrameName): #0
#     dtypesSeries = dataFrameName.dtypes #1
#     dtypesDF = dtypesSeries.to_frame().reset_index() #2
#     dtypesDF = dtypesDF.rename(columns={"index": "ColumnName", 0: "DataType"}) #3
#     columnNames = dtypesDF.columns.tolist() #4
#     #print(columnNames) #5
#     newColumns = dtypesDF['DataType'].unique().tolist() #6
#     dtypesList = [] #7
#     for i in newColumns:
#         dtypesList.append(str(i))
#     print('\n\nUnique values in dtypesDF::') #8
#     print('---------------------------')
#     for i in dtypesList: 
#         print(i)
#     print('---------------------------')
#     print(f'Lenght of dtypesDF:: {len(dtypesDF)}') #9
#     dtypesDict = {dtypesDF['ColumnName'][i]: dtypesDF['DataType'][i] for i in range(len(dtypesDF['ColumnName']))} #10
#     dictSlice = dict(list(dtypesDict.items())[0: 5]) #11
#     #print(dictSlice)
#     counter = 0 #12
#     dtypesGlobalsList = dtypesList
#     for i in dtypesList: #13
#         print(f'Global DataFrame Created:: {i}') #14
#         dtypesGlobalsList = dtypesDF.loc[(dtypesDF['DataType'] == i)] #15
#         globals()[i] = dtypesGlobalsList #16
#         globals()[i] = globals()[i].drop(['DataType'], axis=1).reset_index() #17
#         globals()[i] = globals()[i].rename(columns={"ColumnName": i}) #18
#         globals()[i] = globals()[i].sort_values(i) #19
#         counter += 1
#         #display(HTML(globals()[i].to_html())) #20
#     dtypesSummaryDF = pd.concat([int64, object, float64], axis=1, sort=False) #21
#     columnName = dtypesSummaryDF.columns.tolist() #22
#     #print(columnName) #23
#     counter = 0
#     removeIndex = []
#     for i in columnName: #24
#         if i != 'index':
#             removeIndex.append(columnName.pop(counter))
#             counter += 1
#     #print(removeIndex)
#     dtypesSummaryDF.drop(columns = removeIndex, inplace=True) #25
#     display(HTML(dtypesSummaryDF.to_html())) #26
    
# # <! getTypes( ):
# # /0/ getTypes analyzes a DataFrame's column types
# # /1/ Get dtypes as series
# # /2/ Make dataframe out of dtypes, output is messy next steps to format
# # /3/ Rename columns
# # /4/ Get column names list
# # /5/ Print column names list
# # /6/ DataFrame is long and uncompressed attempting to create a smaller one with column names as rows
# # /7/ Setting up list to get unique data types
# # /8/ Printing out the unique data types for observation
# # /9/ Getting length before manipulation
# # /10/ Creating a Dictionary to Pair values and flip keys/cols and values/rows
# # /11/ Grabbing a slice of the dictionary to confirm format
# # /12/ Setting up variables for lists and counter
# # /13/ For loop to manipulate dtypesList into Global Variables with individual dataframes
# # /14/ Print statement to confirm globals creation
# # /15/ Matching the data to the dtypesList/dtypesGlobalList into each new global variable
# # /16/ Defining globals into dataframes
# # /17/ Dropping the Datatype column, keeping only Column Name from Original DF
# # /18/ Renaming the ColumnName column to match the global variable name
# # /19/ Sorting the ColumnName names alphabetically
# # /20/ Displaying each dataframe as HTML within Jupyter//must have import statement in dependencies
# # /21/ Merging/Concatenating global dataframes into one Dataframe
# # /22/ Creating list of column names to remove extra indices
# # /23/ Printing list to confirm
# # /24/ For loop to pop instances that do not equal 'Index'// This is important because there may be many more dtypes in globals()[i] and there must be duplicates in list to drop all at once
# # /25/ Dropping removeIndex list
# # /26/ Displaying Final DF with all original DF column names as Html under their datatype
# #by ph1-6180
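
The same "column names grouped under their dtype" table can also be built without creating global variables. The following is a minimal sketch, not the notebook's method: the function name `summarize_dtypes` and the sample frame are hypothetical, and it assumes only that the goal is one output column per dtype.

```python
import pandas as pd

def summarize_dtypes(df):
    """Return a DataFrame with one column per dtype, listing the columns of that dtype."""
    groups = {}
    for col, dtype in df.dtypes.items():
        groups.setdefault(str(dtype), []).append(col)
    # Wrapping each list in a Series pads the shorter columns with NaN
    return pd.DataFrame({name: pd.Series(sorted(cols)) for name, cols in groups.items()})

# Hypothetical sample using column names from the CovidTracking data
sample = pd.DataFrame({
    "date": [20200925, 20200925],
    "deathIncrease": [6, -15],
    "positive": [8202.0, 150658.0],
})
summary = summarize_dtypes(sample)
print(summary)
```

Because the dtype names are discovered from the frame itself, this version does not depend on a fixed set such as `int64`, `object`, and `float64` being present.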
    

In [3]:
# #%%writefile describeData.py
# #Write file creates a module that can be imported with dependencies, %%writefile -a describeData.py appends, remove if func is changed
# #This function prints stats for strings and integer value columns
# import pandas as pd
# import numpy as np
# import requests
# import os
# import json
# import matplotlib.pyplot as plt
# from IPython.core.display import HTML
# from datetime import date, datetime

# def describeData(dataFrameName):
#     global keyHeaders, colsData, stringDescribe, intDescribe, keyStr, keyInt
#     keyStr, keyInt, keyHeaders, intDescribe, stringDescribe = [], [], [], [], []
#     for key, value in dataFrameName.items():
#         #grabs cols as keys into list
#         keyHeaders.append(key)
#     for i in keyHeaders:
#         #checks the cols data if string
#         if isinstance(dataFrameName[i][0], (str)):
#             stringDescribe.append(dataFrameName[keyHeaders][i].describe())
#         else:
#             intDescribe.append(dataFrameName[keyHeaders][i].describe())
#     stringDescribe = pd.DataFrame.from_dict(dict(zip(keyHeaders, stringDescribe)), orient='index')
#     intDescribe = pd.DataFrame.from_dict(dict(zip(keyHeaders, intDescribe)), orient='index') 
#     #adding pretty print to dataframes, don't forget import statement when copying code
#     display(HTML(stringDescribe.to_html()))
#     #print(stringDescribe)
#     display(HTML(intDescribe.to_html()))
#     #print(intDescribe)
#     lengthofDF = len(dataFrameName)
#     print(f'Dataframe has {lengthofDF} rows')
#     columnNames = dataFrameName.columns.tolist()
#     print(f'Column names for the Data Frame: \n\n{columnNames}')
#by ph1-6180
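
Note that `describeData` zips both result lists against the full `keyHeaders` list, so each row of the output appears to be labelled by position rather than by the column it describes. This would explain why the recorded output further down shows `WA` as the top value of a row labelled `date`. A minimal sketch of one way to keep the labels aligned is shown here; `describe_by_kind` and the sample frame are hypothetical names, and the split uses pandas' `select_dtypes` rather than the original first-value `isinstance` check.

```python
import pandas as pd

def describe_by_kind(df):
    """Describe text and numeric columns separately, indexed by their own column names."""
    numeric = df.select_dtypes(include="number")
    text = df.select_dtypes(exclude="number")
    # describe() raises on a frame with no columns, so fall back to an empty frame
    text_stats = text.describe().T if text.shape[1] > 0 else pd.DataFrame()
    numeric_stats = numeric.describe().T if numeric.shape[1] > 0 else pd.DataFrame()
    return text_stats, numeric_stats

# Hypothetical sample with one missing value in 'positive'
sample = pd.DataFrame({
    "date": [20200925, 20200925, 20200924],
    "state": ["WA", "WA", "MA"],
    "positive": [8202.0, float("nan"), 150658.0],
})
text_stats, numeric_stats = describe_by_kind(sample)
print(text_stats)
print(numeric_stats)
```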

In [4]:
# #%%writefile analyzeNaNs.py
# #Write file creates a module that can be imported with dependencies, %%writefile -a analyzeNaNs.py appends, remove if func is changed
# #This function analyzes the NaN's in a DF
# #Print/Returns a Dataframe with the NaN's count and the list of columns without NaN's
# import pandas as pd
# import numpy as np
# import requests
# import os
# import json
# import matplotlib.pyplot as plt
# from IPython.core.display import HTML
# from datetime import date, datetime

# def analyzeNaNs(dataFrameName):
#     columnNames = dataFrameName.columns.tolist()
#     NaNslist = []
#     noNaNs =[]
#     counter = 0
#     for i in columnNames:
#         colNaNs = dataFrameName[i].isna().sum()
#         NaNslist.append(colNaNs)
#         #print(f'{colNaNs} NaNs in {columnNames[counter]}')
#         counter += 1
#     #print(NaNslist)
#     #print(columnNames)
#     NaNsDF = pd.DataFrame(NaNslist, index = columnNames, columns =['NaNsCount'])
#     transposeDF = NaNsDF.T
#     for i in columnNames:
#         if transposeDF[i][0] == 0:
#             noNaNs.append(i)
#             transposeDF = transposeDF.drop([i], axis=1)
#     print('---------------------')
#     print('Columns with no NaNs::')
#     print('---------------------')
#     for j in range(len(noNaNs)):
#         alphaCols = sorted(noNaNs)
#         print(f'{alphaCols[j]}')
#     print('\n\n\n---------------------')
#     print('DataFrame of NaNs::')
#     print('---------------------')
#     display(HTML(transposeDF.to_html()))
#     #print(transposeDF)
# #by ph1-6180
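
For comparison, the per-column NaN counts that `analyzeNaNs` collects in a loop can be obtained from `DataFrame.isna().sum()` in one step. This is a minimal sketch with a hypothetical function name (`count_nans`) and sample data; it returns the two results instead of printing them.

```python
import pandas as pd

def count_nans(df):
    """Return (sorted list of columns without NaNs, one-row frame of NaN counts)."""
    counts = df.isna().sum()
    no_nans = sorted(counts[counts == 0].index)
    with_nans = counts[counts > 0].to_frame("NaNsCount").T
    return no_nans, with_nans

# Hypothetical sample: one NaN in 'positive', two in 'negative'
sample = pd.DataFrame({
    "date": [20200123, 20200123, 20200122],
    "state": ["MA", "WA", "MA"],
    "positive": [float("nan"), 0.0, 1.0],
    "negative": [float("nan"), 0.0, float("nan")],
})
no_nans, with_nans = count_nans(sample)
print(no_nans)
print(with_nans)
```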

--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Import Dependencies

In [5]:
# Dependencies
import getTypes
import describeData
import analyzeNaNs
import pandas as pd
import numpy as np
import requests
import os
import json
import matplotlib.pyplot as plt
from IPython.core.display import HTML
from datetime import date, datetime
#Kent Thomas, ph1-618O

--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Query Covid Tracking API

In [6]:
# Pathway to use later to export CSV for cleaning.
covidTrackingDataPath = "./Resources/daily_covid_tracking.csv"

# Country input statement to search for api
# country= input("Type a country you wish to search for: ")

# URL for the API call (kept in its own variable so it does not overwrite the CSV path)
covidTrackingURL = "https://api.covidtracking.com/v1/states/daily.json"

# Calling API and parsing the JSON response.
response = requests.get(covidTrackingURL).json()
#print(json.dumps(response, indent=4, sort_keys=True))
#Kent Thomas

In [7]:
#response

In [8]:
covidTrackingDF = pd.DataFrame(response)
covidTrackingDF
#Kent Thomas

Unnamed: 0,date,state,positive,negative,pending,totalTestResults,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,...,posNeg,deathIncrease,hospitalizedIncrease,hash,commercialScore,negativeRegularScore,negativeScore,positiveScore,score,grade
0,20200925,AK,8202.0,433130.0,,441332.0,43.0,,,,...,441332,6,0,0271fcc794329b2dd035402cb74aad56124c446b,0,0,0,0,0,
1,20200925,AL,150658.0,963364.0,,1097595.0,718.0,16852.0,,1782.0,...,1114022,-15,74,08184faa52c95520b2871a34cbea8adad1c1d054,0,0,0,0,0,
2,20200925,AR,79946.0,848822.0,,926294.0,478.0,5202.0,224.0,,...,928768,20,42,6d09ffffc7c9131f7489f7d8e3e4501ce55b8e1e,0,0,0,0,0,
3,20200925,AS,0.0,1571.0,,1571.0,,,,,...,1571,0,0,84c6296997b54d1474da9d9c51e33d48fda7fb5b,0,0,0,0,0,
4,20200925,AZ,216367.0,1211657.0,,1423603.0,521.0,21972.0,119.0,,...,1428024,28,30,a7a38e5509bf8d1ebfe03d1d9acd5febad3546d0,0,0,0,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11517,20200124,WA,0.0,0.0,,0.0,,,,,...,0,0,0,6f40087f42d06db4121e09b184785b4110cd4df8,0,0,0,0,0,
11518,20200123,MA,,,,2.0,,,,,...,0,0,0,885628de5b5c6da109b79adb7faad55e4815624a,0,0,0,0,0,
11519,20200123,WA,0.0,0.0,,0.0,,,,,...,0,0,0,978c05d8a7a9d46e9fa826d83215f5b9732f2c6d,0,0,0,0,0,
11520,20200122,MA,,,,1.0,,,,,...,0,0,0,0f3eebd5c4a00d0aaa235b0534bd4243794652b6,0,0,0,0,0,


--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Exporting CovidTracking CSV
- Covid Tracking Data json to csv

In [72]:
# *Date, state, positive, negative, deathIncrease
# Remove NaN's, Replace with 0's
# Change full state to abbreviation
# Drop all rows that are not the current date
# Read back in CSV for Cleaning

In [68]:
#covidTrackingDataPath = "./Resources/covidTrackingCurrent.csv"
outputPath = os.path.join(".", "Resources", "covidTrackingCurrent.csv")
#outputPath = "./Resources/covidTrackingCurrent.csv"
covidTrackingDF.to_csv(outputPath, index = False)
#Kent Thomas

--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Importing CovidTracking CSV
- Read back in Covid Tracking Data CSV for Cleaning

In [14]:
#covidTrackingCSV = pd.read_csv(os.path.join(".", "Resources", "covidTrackingCurrent.csv"))
covidTrackingDF = pd.read_csv(outputPath)
covidTrackingDF
#Kent Thomas

Unnamed: 0,date,state,positive,negative,pending,totalTestResults,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,...,posNeg,deathIncrease,hospitalizedIncrease,hash,commercialScore,negativeRegularScore,negativeScore,positiveScore,score,grade
0,20200925,AK,8202.0,433130.0,,441332.0,43.0,,,,...,441332,6,0,0271fcc794329b2dd035402cb74aad56124c446b,0,0,0,0,0,
1,20200925,AL,150658.0,963364.0,,1097595.0,718.0,16852.0,,1782.0,...,1114022,-15,74,08184faa52c95520b2871a34cbea8adad1c1d054,0,0,0,0,0,
2,20200925,AR,79946.0,848822.0,,926294.0,478.0,5202.0,224.0,,...,928768,20,42,6d09ffffc7c9131f7489f7d8e3e4501ce55b8e1e,0,0,0,0,0,
3,20200925,AS,0.0,1571.0,,1571.0,,,,,...,1571,0,0,84c6296997b54d1474da9d9c51e33d48fda7fb5b,0,0,0,0,0,
4,20200925,AZ,216367.0,1211657.0,,1423603.0,521.0,21972.0,119.0,,...,1428024,28,30,a7a38e5509bf8d1ebfe03d1d9acd5febad3546d0,0,0,0,0,0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11517,20200124,WA,0.0,0.0,,0.0,,,,,...,0,0,0,6f40087f42d06db4121e09b184785b4110cd4df8,0,0,0,0,0,
11518,20200123,MA,,,,2.0,,,,,...,0,0,0,885628de5b5c6da109b79adb7faad55e4815624a,0,0,0,0,0,
11519,20200123,WA,0.0,0.0,,0.0,,,,,...,0,0,0,978c05d8a7a9d46e9fa826d83215f5b9732f2c6d,0,0,0,0,0,
11520,20200122,MA,,,,1.0,,,,,...,0,0,0,0f3eebd5c4a00d0aaa235b0534bd4243794652b6,0,0,0,0,0,


In [15]:
#Running Module getTypes, analyzeNans and describeData to analyze DF types
#Be sure to uncomment getTypes and %%writefile and execute before importing
#Additionally recomment after executing
#getTypes.getTypes(covidTrackingDF)
#analyzeNaNs.analyzeNaNs(covidTrackingDF)
#describeData.describeData(covidTrackingDF)

In [16]:
print(type(covidTrackingDF))
covidTrackingDF.head()

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,date,state,positive,negative,pending,totalTestResults,hospitalizedCurrently,hospitalizedCumulative,inIcuCurrently,inIcuCumulative,...,posNeg,deathIncrease,hospitalizedIncrease,hash,commercialScore,negativeRegularScore,negativeScore,positiveScore,score,grade
0,20200925,AK,8202.0,433130.0,,441332.0,43.0,,,,...,441332,6,0,0271fcc794329b2dd035402cb74aad56124c446b,0,0,0,0,0,
1,20200925,AL,150658.0,963364.0,,1097595.0,718.0,16852.0,,1782.0,...,1114022,-15,74,08184faa52c95520b2871a34cbea8adad1c1d054,0,0,0,0,0,
2,20200925,AR,79946.0,848822.0,,926294.0,478.0,5202.0,224.0,,...,928768,20,42,6d09ffffc7c9131f7489f7d8e3e4501ce55b8e1e,0,0,0,0,0,
3,20200925,AS,0.0,1571.0,,1571.0,,,,,...,1571,0,0,84c6296997b54d1474da9d9c51e33d48fda7fb5b,0,0,0,0,0,
4,20200925,AZ,216367.0,1211657.0,,1423603.0,521.0,21972.0,119.0,,...,1428024,28,30,a7a38e5509bf8d1ebfe03d1d9acd5febad3546d0,0,0,0,0,0,


In [17]:
# #Checking column indices
# covidTrackingDF.columns[2]

In [18]:
# Keep only date, state, positive, negative, deathIncrease
columnNames = covidTrackingDF.columns.tolist()
#print(columnNames)
necessaryDataList = ['date', 'state', 'positive', 'negative', 'deathIncrease']
# Selecting the needed columns directly is equivalent to dropping every other column one by one
covidTrackingDF = covidTrackingDF[necessaryDataList]

In [19]:
covidTrackingDF

Unnamed: 0,date,state,positive,negative,deathIncrease
0,20200925,AK,8202.0,433130.0,6
1,20200925,AL,150658.0,963364.0,-15
2,20200925,AR,79946.0,848822.0,20
3,20200925,AS,0.0,1571.0,0
4,20200925,AZ,216367.0,1211657.0,28
...,...,...,...,...,...
11517,20200124,WA,0.0,0.0,0
11518,20200123,MA,,,0
11519,20200123,WA,0.0,0.0,0
11520,20200122,MA,,,0


In [33]:
#Running Module getTypes, analyzeNans and describeData to analyze DF types
#Be sure to uncomment getTypes and %%writefile and execute before importing
#Additionally recomment after executing
getTypes.getTypes(covidTrackingDF)



Unique values in dtypesDF::
---------------------------
int64
object
float64
---------------------------
Lenght of dtypesDF:: 5
Global DataFrame Created:: int64
Global DataFrame Created:: object
Global DataFrame Created:: float64


Unnamed: 0,int64,object,float64
0,date,state,positive
1,deathIncrease,,negative


In [34]:
analyzeNaNs.analyzeNaNs(covidTrackingDF)

---------------------
Columns with no NaNs::
---------------------
date
deathIncrease
state



---------------------
DataFrame of NaNs::
---------------------


Unnamed: 0,positive,negative
NaNsCount,99,237
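
The counts above (99 NaNs in `positive`, 237 in `negative`) relate to the planned step "Remove NaN's, Replace with 0's". A minimal sketch of that step on a hypothetical two-row sample is shown here. Treating a missing value as zero is an assumption: it does not distinguish "not reported" from "no cases".

```python
import pandas as pd

# Hypothetical rows shaped like the early CovidTracking records
raw = pd.DataFrame({
    "date": [20200123, 20200124],
    "state": ["MA", "WA"],
    "positive": [float("nan"), 0.0],
    "negative": [float("nan"), 0.0],
    "deathIncrease": [0, 0],
})

# Fill only the two columns known to contain NaNs, leaving the others untouched
cleaned = raw.fillna({"positive": 0, "negative": 0})
print(cleaned)
```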


In [37]:
describeData.describeData(covidTrackingDF)

Unnamed: 0,count,unique,top,freq
date,11522,56,WA,248


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
date,11522.0,20200610.0,195.9396,20200122.0,20200424.0,20200615.0,20200805.0,20200925.0
state,11423.0,49892.61,99346.18,0.0,1410.5,12493.0,54176.0,794040.0
positive,11285.0,553279.4,1179340.0,0.0,26334.0,162688.0,567822.0,13258207.0
negative,11522.0,16.98455,47.09346,-213.0,0.0,3.0,14.0,951.0


Dataframe has 11522 rows
Column names for the Data Frame: 

['date', 'state', 'positive', 'negative', 'deathIncrease']


In [38]:
#Getting the most current date from Covid Tracking API to merge to FinnHub
#Covid Tracking API orders the most recent data at index 0
#Matching all dates == ['date']Index 0
covidSubset = covidTrackingDF.loc[covidTrackingDF['date'] == covidTrackingDF['date'][0]].copy()
covidSubset.head()

Unnamed: 0,date,state,positive,negative,deathIncrease
0,20200925,AK,8202.0,433130.0,6
1,20200925,AL,150658.0,963364.0,-15
2,20200925,AR,79946.0,848822.0,20
3,20200925,AS,0.0,1571.0,0
4,20200925,AZ,216367.0,1211657.0,28
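
The subset above relies on the API returning the most recent date at index 0. As a hedged alternative, the latest date can be taken with `max()`, which does not depend on row order. The sample frame below is hypothetical.

```python
import pandas as pd

# Hypothetical sample deliberately ordered with an older date first
sample = pd.DataFrame({
    "date": [20200924, 20200925, 20200925],
    "state": ["AK", "AK", "AL"],
    "positive": [8100.0, 8202.0, 150658.0],
})

latest = sample["date"].max()  # YYYYMMDD integers sort chronologically
subset = sample.loc[sample["date"] == latest].copy()
print(subset)
```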


--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Creating States Dictionary

In [40]:
#Creating US States Dictionary to merge data sets
usStateAbbrev = {
    'Alabama': 'AL', 'Alaska': 'AK', 'American Samoa': 'AS', 'Arizona': 'AZ',
    'Arkansas': 'AR', 'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT',
    'Delaware': 'DE', 'District of Columbia': 'DC', 'Florida': 'FL', 'Georgia': 'GA',
    'Guam': 'GU', 'Hawaii': 'HI', 'Idaho': 'ID', 'Illinois': 'IL', 
    'Indiana': 'IN','Iowa': 'IA', 'Kansas': 'KS', 'Kentucky': 'KY',
    'Louisiana': 'LA', 'Maine': 'ME','Maryland': 'MD', 'Massachusetts': 'MA',
    'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS', 'Missouri': 'MO',
    'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV', 'New Hampshire': 'NH',
    'New Jersey': 'NJ', 'New Mexico': 'NM', 'New York': 'NY', 'North Carolina': 'NC',
    'North Dakota': 'ND', 'Northern Mariana Islands':'MP', 'Ohio': 'OH', 'Oklahoma': 'OK',
    'Oregon': 'OR', 'Pennsylvania': 'PA','Puerto Rico': 'PR', 'Rhode Island': 'RI',
    'South Carolina': 'SC', 'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX',
    'Utah': 'UT', 'Vermont': 'VT', 'Virgin Islands': 'VI', 'Virginia': 'VA',
    'Washington': 'WA', 'West Virginia': 'WV', 'Wisconsin': 'WI', 'Wyoming': 'WY'
}

In [42]:
#Add tutors code
# Column 0 holds the full state name; 'state_abbr' holds the two-letter code
usStateDF = pd.DataFrame(list(usStateAbbrev.keys()))
usStateDF['state_abbr'] = list(usStateAbbrev.values())
usStateDF.head()

Unnamed: 0,0,state_abbr
0,Alabama,AL
1,Alaska,AK
2,American Samoa,AS
3,Arizona,AZ
4,Arkansas,AR
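
The lookup table above ends up with a column literally named `0`. A minimal sketch of building it with explicit column names is shown here; `state_name` is a hypothetical column name, and any later merge would then need `right_on='state_name'` instead of `right_on=0`.

```python
import pandas as pd

# A small subset of the usStateAbbrev dictionary, for illustration only
state_abbrev = {"Alabama": "AL", "Alaska": "AK", "American Samoa": "AS"}

# dict.items() yields (name, abbreviation) pairs, one per row
state_df = pd.DataFrame(list(state_abbrev.items()), columns=["state_name", "state_abbr"])
print(state_df)
```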


--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Query FinnHub API

In [46]:
# Change full state to abbreviation
# Drop rows 55 - 63, Grand Princess
# Make new column with converted date
# Kent Thomas

In [43]:
# URL for the API call
us_covid_tracking_data = "https://finnhub.io/api/v1/covid19/us"

# Calling API and printing response.
responseFinnhub = requests.get(us_covid_tracking_data).json()
#print(json.dumps(responseFinnhub, indent=4, sort_keys=True)) #printing pretty the response
#Kent Thomas

In [25]:
responseFinnhubDF = pd.DataFrame(responseFinnhub)
responseFinnhubDF
#Kent Thomas

Unnamed: 0,state,case,death,updated
0,New York,457865,33095,2020-09-26 00:01:13
1,New Jersey,201662,16216,2020-09-26 00:01:13
2,California,800039,15414,2020-09-26 00:01:13
3,Michigan,132344,7035,2020-09-26 00:01:13
4,Florida,695887,13915,2020-09-26 00:01:13
...,...,...,...,...
57,Wuhan Evacuee,4,0,2020-09-26 00:01:13
58,Northern Mariana Islands,31,2,2020-09-26 00:01:13
59,US Military,63568,94,2020-09-26 00:01:13
60,Federal Bureau of Prisons,16381,125,2020-09-26 00:01:13


--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Exporting FinnHub CSV

In [77]:
outputPath = os.path.join(".", "Resources", "FinnhubCovidCurrent.csv")
#outputPath = "./Resources/FinnhubCovidCurrent.csv"
responseFinnhubDF.to_csv(outputPath, index = False)
#Kent Thomas

--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Importing FinnHub CSV

In [47]:
# Read back in CSV for Cleaning
FinnHubCSV = pd.read_csv(os.path.join(".", "Resources", "FinnhubCovidCurrent.csv"))
FinnHubCSV
#Kent Thomas

Unnamed: 0,state,case,death,updated
0,New York,457865,33095,2020-09-26 00:01:13
1,New Jersey,201662,16216,2020-09-26 00:01:13
2,California,800039,15414,2020-09-26 00:01:13
3,Michigan,132344,7035,2020-09-26 00:01:13
4,Florida,695887,13915,2020-09-26 00:01:13
...,...,...,...,...
57,Wuhan Evacuee,4,0,2020-09-26 00:01:13
58,Northern Mariana Islands,31,2,2020-09-26 00:01:13
59,US Military,63568,94,2020-09-26 00:01:13
60,Federal Bureau of Prisons,16381,125,2020-09-26 00:01:13


In [48]:
print(type(FinnHubCSV['updated']))

<class 'pandas.core.series.Series'>
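
The check above only confirms that `updated` is a Series; its values are read from the CSV as text. For the planned "new column with converted date" step, a minimal sketch is shown here, assuming the timestamp format seen in the output (`2020-09-26 00:01:13`) and a hypothetical new column name `date` in the same YYYYMMDD integer form that CovidTracking uses.

```python
import pandas as pd

# Hypothetical sample shaped like the Finnhub data
finnhub = pd.DataFrame({
    "state": ["New York", "New Jersey"],
    "updated": ["2020-09-26 00:01:13", "2020-09-26 00:01:13"],
})

# Parse the text timestamps, then derive a YYYYMMDD integer column
finnhub["updated"] = pd.to_datetime(finnhub["updated"], format="%Y-%m-%d %H:%M:%S")
finnhub["date"] = finnhub["updated"].dt.strftime("%Y%m%d").astype("int64")
print(finnhub)
```

In the recorded data the Finnhub timestamp falls on 2020-09-26 while the latest CovidTracking date is 20200925, so the two date columns would not necessarily match one-to-one.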


--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Merge on State

In [56]:
#Merging usStateDF with FinnHubDF to later merge with covidTrackingDF
addStateAbbrDF = FinnHubCSV.merge(usStateDF, left_on='state', right_on=0)
addStateAbbrDF = addStateAbbrDF.drop([0], axis =1)
#addStateAbbrDF.head(20)
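
The merge above uses the default inner join, so Finnhub rows whose name is not in the state dictionary (for example `Wuhan Evacuee` or `US Military`) are dropped silently. A hedged alternative is to map the names through the dictionary and list what failed to match before dropping anything. The sketch below uses a hypothetical three-row sample and a subset of `usStateAbbrev`.

```python
import pandas as pd

# Subset of the usStateAbbrev dictionary, for illustration only
state_abbrev = {"New York": "NY", "California": "CA"}

finnhub = pd.DataFrame({
    "state": ["New York", "California", "Wuhan Evacuee"],
    "case": [457865, 800039, 4],
})

# Names missing from the dictionary map to NaN
finnhub["state_abbr"] = finnhub["state"].map(state_abbrev)
unmatched = finnhub.loc[finnhub["state_abbr"].isna(), "state"].tolist()
matched = finnhub.dropna(subset=["state_abbr"])

print("Unmatched rows:", unmatched)
print(matched)
```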

--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Final Merge

In [63]:
#Merging covidTracking subset with FinnHub subset on state
combinedDF = covidSubset.merge(addStateAbbrDF, left_on='state', right_on='state_abbr')
combinedDF = combinedDF.drop(['state_x'], axis = 1)
combinedDF = combinedDF.rename(columns={"state_y": "state"})
combinedDF.head(20)

Unnamed: 0,date,positive,negative,deathIncrease,state,case,death,updated,state_abbr
0,20200925,8202.0,433130.0,6,Alaska,7132,46,2020-09-26 00:01:13,AK
1,20200925,150658.0,963364.0,-15,Alabama,148206,2509,2020-09-26 00:01:13,AL
2,20200925,79946.0,848822.0,20,Arkansas,79049,1246,2020-09-26 00:01:13,AR
3,20200925,216367.0,1211657.0,28,Arizona,215852,5560,2020-09-26 00:01:13,AZ
4,20200925,794040.0,13258207.0,84,California,800039,15414,2020-09-26 00:01:13,CA
5,20200925,67217.0,795588.0,9,Colorado,66892,2029,2020-09-26 00:01:13,CO
6,20200925,56587.0,1462564.0,2,Connecticut,56472,4500,2020-09-26 00:01:13,CT
7,20200925,20085.0,259289.0,1,Delaware,19947,630,2020-09-26 00:01:13,DE
8,20200925,695887.0,4510107.0,122,Florida,695887,13915,2020-09-26 00:01:13,FL
9,20200925,312514.0,2534980.0,52,Georgia,311789,6828,2020-09-26 00:01:13,GA


In [64]:
#Running Module getTypes, analyzeNans and describeData to analyze DF types
#Be sure to uncomment getTypes and %%writefile and execute before importing
#Additionally recomment after executing
getTypes.getTypes(combinedDF)



Unique values in dtypesDF::
---------------------------
int64
float64
object
---------------------------
Lenght of dtypesDF:: 9
Global DataFrame Created:: int64
Global DataFrame Created:: float64
Global DataFrame Created:: object


Unnamed: 0,int64,object,float64
0,date,state,positive
1,deathIncrease,updated,negative
2,case,state_abbr,
3,death,,


In [65]:
#This reports if there are any NaN's in DF
# If DataFrame of NaN's is empty there are no NaNs
analyzeNaNs.analyzeNaNs(combinedDF)

---------------------
Columns with no NaNs::
---------------------
case
date
death
deathIncrease
negative
positive
state
state_abbr
updated



---------------------
DataFrame of NaNs::
---------------------


NaNsCount


In [66]:
describeData.describeData(combinedDF)

Unnamed: 0,count,unique,top,freq
date,53,53,Connecticut,1
positive,53,1,2020-09-26 00:01:13,53
negative,53,53,SC,1


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
date,53.0,20200920.0,0.0,20200925.0,20200925.0,20200925.0,20200925.0,20200925.0
positive,53.0,131630.5,174619.4,69.0,24181.0,82045.0,148894.0,794040.0
negative,53.0,1673918.0,2356814.0,14776.0,394573.0,963364.0,1836216.0,13258207.0
deathIncrease,53.0,15.88679,25.26221,-15.0,2.0,7.0,19.0,122.0
state,53.0,131363.8,176020.4,31.0,24034.0,81221.0,147746.0,800039.0
case,53.0,3816.642,5861.551,2.0,446.0,1565.0,4500.0,33095.0


Dataframe has 53 rows
Column names for the Data Frame: 

['date', 'positive', 'negative', 'deathIncrease', 'state', 'case', 'death', 'updated', 'state_abbr']


--------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------

# <! Exporting CombinedDF CSV

In [78]:
outputPath = os.path.join(".", "Resources", "combinedCovidData.csv")
#outputPath = "./Resources/covidTrackingCurrent.csv"
combinedDF.to_csv(outputPath, index = False)

In [None]:
#Export CombinedDF to MongoDB or SQLite
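
The export step is left open in the notebook. As a minimal sketch of the SQLite option only, the standard-library `sqlite3` module can be combined with `DataFrame.to_sql`. The table name `covid_combined`, the in-memory database, and the two sample rows are assumptions for illustration; a real export would point at a database file and use `combinedDF`.

```python
import sqlite3
import pandas as pd

# Hypothetical two-row stand-in for combinedDF
combined = pd.DataFrame({
    "date": [20200925, 20200925],
    "state": ["Alaska", "Alabama"],
    "state_abbr": ["AK", "AL"],
    "positive": [8202.0, 150658.0],
    "death": [46, 2509],
})

conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
combined.to_sql("covid_combined", conn, if_exists="replace", index=False)

# Read the table back to confirm the round trip
round_trip = pd.read_sql("SELECT * FROM covid_combined", conn)
conn.close()
print(round_trip)
```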

In [1]:
# #Comment formatting
# for i in range(0, 26):
#     print("- /" + str(i) +'/ ')

In [None]:
## ph1-6180