# Potential Datasets 

#### Below are some datasets that could be used for Project 3. Feel free to find your own datasets as well if you have a particular interest or question. Some good websites for finding datasets include:


1. https://www.kaggle.com/
2. https://data.world/
3. https://opendata.cern.ch/
4. https://www.sciencebase.gov/catalog/
5. https://data.neonscience.org/

In [None]:
# Import NumPy, pandas, and datascience modules.
import numpy as np
import pandas as pd
from datascience import *

# Plotting modules
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
import warnings
warnings.simplefilter('ignore', UserWarning)

### Biology Datasets
including Palmer Penguins, Ecological Footprint, & Chronic Kidney Disease.

##### Palmer Penguins
See: https://allisonhorst.github.io/palmerpenguins/

In [None]:
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
penguin = Table.read_table(url)
penguin

##### Ecological Footprint 
This dataset measures the amount of ecological resources used by each country from 1961 to 2016. More information can be found at: https://data.world/footprint/nfa-2019-edition

In [None]:
url = 'data/NFA 2019 public_data.csv'
ecoFootprint = Table.read_table(url)
ecoFootprint

##### Chronic Kidney Disease 
More information on columns at: https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease

In [None]:
url = 'data/kidney_disease.csv'
kidneyDisease = Table.read_table(url)
kidneyDisease

__________________________

### Chemistry Datasets
including pKa, Periodic Table, High Entropy Alloys, & Wine Quality

##### Molecular acid dissociation constant, pKa data
See: https://github.com/samplchallenges/SAMPL7/tree/master/physical_property/pKa
SMILES (Simplified Molecular Input Line Entry System) is a text representation of chemical structure.

In [None]:
url = "https://raw.githubusercontent.com/robraddi/GP-SAMPL7/main/pKaDatabase/OChem/ochem0-2000.csv"
pka = Table.read_table(url)
pka
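For orientation, a SMILES string writes a molecule as a single line of text: letters stand for atoms, `=` marks a double bond, parentheses mark branches, and lowercase letters mark aromatic atoms. The sketch below uses three textbook molecules that are illustrative only and are not taken from the pKa table. The helper `count_element` is hypothetical and only works for simple strings like these.

```python
# Illustrative SMILES strings for three common molecules.
# These are NOT rows from the pKa dataset; they are textbook examples.
examples = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
}

def count_element(smiles, symbol):
    """Count a one-letter element symbol in a SMILES string (hypothetical helper).

    Aromatic atoms are written in lowercase, so both cases are counted.
    This naive count is only valid for simple strings that contain no
    two-letter elements such as Cl or Br.
    """
    return smiles.count(symbol.upper()) + smiles.count(symbol.lower())

# Number of carbon atoms in each example molecule.
carbon_counts = {name: count_element(s, "C") for name, s in examples.items()}
print(carbon_counts)
```

For real analysis of the SMILES column, a dedicated cheminformatics library would be a safer choice than string counting.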

##### Periodic Table 
More information on columns: https://www.kaggle.com/datasets/berkayalan/chemical-periodic-table-elements?select=chemical_elements.csv

In [None]:
url = 'data/chemical_elements.csv'
ptdf = pd.read_csv(url, sep = ';')
pt = Table.from_df(ptdf)
pt
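The `sep = ';'` argument is needed because this file separates fields with semicolons rather than commas. A minimal sketch of the difference, using a made-up two-row sample (the column names and values are illustrative, not copied from `chemical_elements.csv`):

```python
import io
import pandas as pd

# Made-up sample text in semicolon-delimited form (illustrative only).
sample = "symbol;atomic_number\nH;1\nHe;2\n"

# With the default comma separator, each line is read as one single field.
wrong = pd.read_csv(io.StringIO(sample))

# Passing sep=';' splits each line into the intended columns.
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)   # one column holding the whole line
print(right.shape)   # two separate columns
```

If a freshly loaded table shows a single wide column, checking the delimiter is a reasonable first step.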

##### High Entropy Alloys
See: https://www.sciencedirect.com/science/article/pii/S2352340921006302?via%3Dihub

In [None]:
url = 'data/high_entropy_alloys.csv'
alloys = Table.read_table(url)
alloys

##### Wine Quality Dataset 
More information at https://archive.ics.uci.edu/ml/datasets/wine+quality

In [None]:
url1 = 'data/winequality-red.csv'
url2 = 'data/winequality-white.csv'
# Both files are semicolon-delimited; label each with its wine type before combining.
df1 = pd.read_csv(url1, sep=';')
df1['type'] = 'red'
df2 = pd.read_csv(url2, sep=';')
df2['type'] = 'white'
# Stack the red and white rows into a single table.
wine = Table.from_df(pd.concat([df1, df2], ignore_index=True))
wine

________________________________________________________________

### Physics datasets
including exoplanets, Near Earth Objects (NEO), & CERN Electron Collision data. 

##### Exoplanets observed by Kepler telescope
See: https://exoplanets.nasa.gov/keplerscience/

In [None]:
url = "https://raw.githubusercontent.com/DataScienceTempleFirst/code-cod/main/kepler.csv"
exoplanets = Table.read_table(url)
exoplanets

##### Near Earth Objects 
Data is found at https://cneos.jpl.nasa.gov/ca/, but more information about the project can be found here: https://cneos.jpl.nasa.gov/

In [None]:
url = 'data/cneos_closeapproach_data.csv'
neo = Table.read_table(url)
neo

##### CERN Electron Collision Data 
The data were downloaded from https://www.kaggle.com/datasets/fedesoriano/cern-electron-collision-data, which is a modified version of the original data at https://opendata.cern.ch/record/304

In [None]:
url = 'data/dielectron.csv'
electron = Table.read_table(url)
electron

________________

### Public Health datasets
including Philadelphia vaccination rates, global vaccination rates, & Hepatitis C diagnosis.  

##### Philadelphia vaccination rates by zip code
COVID-19 Vaccinations

Shows distribution counts of first and second dose, as well as total dose information for all vaccinations performed by the health department. Also provides vaccinations by census tract, ZIP code, age, race, and sex. Vaccinations include residents and non-residents of Philadelphia. Updates daily.
See: https://www.opendataphilly.org/dataset/covid-vaccinations/resource/87ac5b4e-8491-41e3-8cf0-5bfebba2e3a0

In [None]:
url = "https://phl.carto.com/api/v2/sql?filename=covid_vaccines_by_zip&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator&q=SELECT%20*%20FROM%20covid_vaccines_by_zip"
phillyVax = Table.read_table(url)
phillyVax

In [None]:
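# Population by ZIP code, which may be useful for looking at Philly vaccination rates
url = "https://raw.githubusercontent.com/DataScienceTempleFirst/code-cod/main/PA_zip_pop.csv"
paPop = Table.read_table(url)
# Keep only Philadelphia County rows, sorted from most to least populous.
# (sort returns a new table, so it must be part of the displayed expression.)
paPop.where('county', 'Philadelphia').sort('pop', descending=True)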
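To turn vaccination counts into rates, the counts need to be matched to population by ZIP code. The sketch below shows one way to do that with a pandas merge on toy data. The column names `zip_code`, `fully_vaccinated`, and `zip`, and all of the numbers, are made up for illustration; they should be replaced with the actual column names found in `phillyVax` and `paPop`.

```python
import pandas as pd

# Toy stand-ins for the two tables (all values are invented).
vax = pd.DataFrame({"zip_code": [19104, 19122], "fully_vaccinated": [500, 300]})
pop = pd.DataFrame({"zip": [19104, 19122], "pop": [1000, 1200]})

# Match rows by ZIP code, then divide counts by population.
merged = vax.merge(pop, left_on="zip_code", right_on="zip")
merged["vax_rate"] = merged["fully_vaccinated"] / merged["pop"]

print(merged[["zip_code", "vax_rate"]])
```

With `datascience` tables, `Table.join` plays a similar role to the pandas merge used here.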

##### COVID Vaccination data by country

In [None]:
url = "https://raw.githubusercontent.com/DataScienceTempleFirst/code-cod/main/COVID_VAXDATA.csv"
globalVax = Table.read_table(url)
globalVax

##### Hepatitis C Diagnosis Datasets 
Hepatitis C is a liver disease caused by the hepatitis C virus (HCV). More information on the dataset can be found at: https://archive.ics.uci.edu/ml/datasets/HCV+data

In [None]:
url = 'data/HepatitisCdata.csv'
hepC = Table.read_table(url)
hepC

_____________________________________________________________________________________________

###  Other datasets
including Philadelphia Graduation Rates & Jeopardy

##### Philadelphia Open Data School Graduation Rates
This longitudinal open data file includes information about the graduation rates for schools broken out by: graduation rate type (four-year, five-year, or six-year), demographic category (EL status, IEP status, Economically Disadvantaged Status, Gender, or Ethnicity), and ninth-grade cohort. Students are attributed to the last school they actively attended in the respective graduation window, which ends on September 30 each year. Students are classified as EL, as having an IEP, and/or as economically disadvantaged if they were designated as such at any point during their high school career.
See: https://www.philasd.org/performance/programsservices/open-data/school-performance/#school_graduation_rates

In [None]:
url = "https://cdn.philasd.org/offices/performance/Open_Data/School_Performance/Graduation_Rates/SDP_Graduation_Rates_School_S.csv"
grad = Table.read_table(url)
grad

##### Jeopardy
see: https://www.jeopardy.com

In [None]:
contestant = "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/contestants.csv"
locations =  "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/locations.csv"
results =  "https://raw.githubusercontent.com/anuparna/jeopardy/master/dataset/final_results.csv"
loc = Table.read_table(locations)
contest = Table.read_table(contestant)
outcome = Table.read_table(results)
outcome