# Open Data

__Open Data__ refers to _data made freely available_ to the public, with few or no restrictions on its use, reuse, or redistribution (at most, requirements such as attribution). It is typically provided in a _machine-readable format_ and can be accessed and used by anyone for purposes such as research, analysis, and innovation. __Open Data__ plays a crucial role in promoting transparency, accountability, and collaboration in both the public and private sectors.

### Download and load CSV files

To load data from a _CSV_ file into a DataFrame, you can use the `pd.read_csv(url)` function from the pandas library, passing the actual _URL_ of the _CSV_ file you want to load. The call downloads the CSV file from the specified URL and loads it into a DataFrame, here assigned to `df`.

In [1]:
# https://catalog.data.gov/dataset/electric-vehicle-population-data
import pandas as pd

# URL of the CSV file
url_csv = 'https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD'

# Download the CSV file and load it into a DataFrame
df = pd.read_csv(url_csv)
print(df.shape)
df.head()


(191407, 17)


Unnamed: 0,VIN (1-10),County,City,State,Postal Code,Model Year,Make,Model,Electric Vehicle Type,Clean Alternative Fuel Vehicle (CAFV) Eligibility,Electric Range,Base MSRP,Legislative District,DOL Vehicle ID,Vehicle Location,Electric Utility,2020 Census Tract
0,5YJSA1E22K,King,Seattle,WA,98112.0,2019,TESLA,MODEL S,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,270,0,43.0,202233958,POINT (-122.300312 47.629782),CITY OF SEATTLE - (WA)|CITY OF TACOMA - (WA),53033010000.0
1,3MW39FS05R,Yakima,Zillah,WA,98953.0,2024,BMW,330E,Plug-in Hybrid Electric Vehicle (PHEV),Not eligible due to low battery range,20,0,15.0,264425178,POINT (-120.2658133 46.4063477),PACIFICORP,53077000000.0
2,1N4AZ0CP0F,King,Kent,WA,98031.0,2015,NISSAN,LEAF,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,84,0,11.0,114962025,POINT (-122.201564 47.402358),PUGET SOUND ENERGY INC||CITY OF TACOMA - (WA),53033030000.0
3,5YJSA1H20F,Snohomish,Bothell,WA,98012.0,2015,TESLA,MODEL S,Battery Electric Vehicle (BEV),Clean Alternative Fuel Vehicle Eligible,208,0,1.0,232724670,POINT (-122.206146 47.839957),PUGET SOUND ENERGY INC,53061050000.0
4,JTMAB3FV1N,Yakima,Yakima,WA,98908.0,2022,TOYOTA,RAV4 PRIME,Plug-in Hybrid Electric Vehicle (PHEV),Clean Alternative Fuel Vehicle Eligible,42,0,14.0,221023589,POINT (-120.611068 46.596645),PACIFICORP,53077000000.0
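
The preview above shows `Postal Code` parsed as a float (`98112.0`), which typically happens when an integer-like column contains missing values. The following is a minimal sketch of how the `dtype` and `usecols` arguments of `pd.read_csv` can control parsing. It assumes a small, hypothetical inline sample in place of the real download, and the name `sample_df` is illustrative.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for the downloaded file;
# the second row has no postal code.
sample_csv = io.StringIO(
    "County,City,Postal Code,Model Year,Make\n"
    "King,Seattle,98112,2019,TESLA\n"
    "Yakima,Zillah,,2024,BMW\n"
    "King,Kent,98031,2015,NISSAN\n"
)

# Read only the columns of interest and keep postal codes as text,
# so they are not converted to floats such as 98112.0.
sample_df = pd.read_csv(
    sample_csv,
    usecols=["City", "Postal Code", "Model Year"],
    dtype={"Postal Code": "string"},
)
print(sample_df.dtypes)
print(sample_df)
```

The same arguments can be passed when reading from a URL, since `pd.read_csv` accepts paths, URLs, and file-like objects.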


### Get a JSON from a URL and load

To load data from a _JSON_ file into a DataFrame, you can use the `pd.read_json(url)` function from the pandas library, passing the actual _URL_ of the _JSON_ file you want to load. Alternatively, as in the cell below, download the _JSON_ with `requests.get(url)`, decode it with `response.json()`, and pass the resulting list of records to `pd.DataFrame`.

In [2]:
# https://jsonplaceholder.typicode.com/

import requests

# URL of the JSON file
url_json = 'https://jsonplaceholder.typicode.com/comments'

# Download the JSON file
response = requests.get(url_json)
data = response.json()
print(type(data))

# Load the JSON data into a DataFrame
df = pd.DataFrame(data)
print(df.shape)
df.head()

<class 'list'>
(500, 5)


Unnamed: 0,postId,id,name,email,body
0,1,1,id labore ex et quam laborum,Eliseo@gardner.biz,laudantium enim quasi est quidem magnam volupt...
1,1,2,quo vero reiciendis velit similique earum,Jayne_Kuhic@sydney.com,est natus enim nihil est dolore omnis voluptat...
2,1,3,odio adipisci rerum aut animi,Nikita@garfield.biz,quia molestiae reprehenderit quasi aspernatur\...
3,1,4,alias odio sit,Lew@alysha.tv,non et atque\noccaecati deserunt quas accusant...
4,1,5,vero eaque aliquid doloribus et culpa,Hayden@althea.biz,harum non quasi et ratione\ntempore iure ex vo...
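
The two routes described in this section can be compared without network access. This is a minimal sketch that assumes a couple of made-up records shaped like the `/comments` payload; the names `records`, `df_from_list`, and `df_from_json` are illustrative.

```python
import io
import json
import pandas as pd

# Hypothetical records shaped like the /comments payload (values are made up).
records = [
    {"postId": 1, "id": 1, "name": "first comment", "email": "a@example.com", "body": "text one"},
    {"postId": 1, "id": 2, "name": "second comment", "email": "b@example.com", "body": "text two"},
]
json_text = json.dumps(records)

# Route 1: decode the JSON yourself, then build the DataFrame.
df_from_list = pd.DataFrame(json.loads(json_text))

# Route 2: let pandas parse the JSON; a file-like object (or a URL) is accepted.
df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_list.shape, df_from_json.shape)
print(df_from_json.columns.tolist())
```

Route 1 gives a chance to inspect or reshape the decoded payload before building the DataFrame, which helps when the records are nested inside a larger JSON object.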


### Consuming Free Public APIs

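To load data from a _public web API_ into a DataFrame, call the _API endpoint_ with `requests.get(url)`, check the response status, and pass the decoded JSON to `pd.DataFrame`. For endpoints that return a flat JSON array, `pd.read_json(url)` also works. Many APIs require an _API key_, which the examples below pass as a query parameter.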

In [5]:
# NASA picture of the day API
# https://api.nasa.gov/

import sys
import requests
from datetime import datetime, timedelta

# NASA API key
api_key = "CFWfPlCkmvdkkGgnslhSaJNPiASCg5bRCTzNeK9n"

# Calculate dates
end_date = (datetime.now() - timedelta(days=1)).strftime("%Y-%m-%d")
start_date = (datetime.now() - timedelta(days=31)).strftime("%Y-%m-%d")

# API URL with parameters
api_url = f'https://api.nasa.gov/planetary/apod?api_key={api_key}&start_date={start_date}&end_date={end_date}'

# Make the API request
response = requests.get(api_url)
if response.status_code == 200:
    nasa_data = response.json()
else:
    print("ERROR. Failed to retrieve data from NASA.")
    sys.exit(1)  # non-zero exit code signals the failure

# Load the JSON data into a DataFrame
nasa_df = pd.DataFrame(nasa_data)
nasa_df.to_csv('csv-files/nasa_photos.csv', index=False)
nasa_df.head() 

Unnamed: 0,date,explanation,hdurl,media_type,service_version,title,url,copyright
0,2024-06-01,Get out your red/blue glasses and float next t...,https://apod.nasa.gov/apod/image/2406/N0017288...,image,v1,Stereo Helene,https://apod.nasa.gov/apod/image/2406/N0017288...,
1,2024-06-02,"No one, presently, sees the Moon rotate like t...",,video,v1,Rotating Moon from LRO,https://www.youtube.com/embed/sNUNB6CMnE8?rel=0,
2,2024-06-03,Magnificent island universe NGC 2403 stands wi...,https://apod.nasa.gov/apod/image/2405/NGC2403-...,image,v1,NGC 2403 in Camelopardalis,https://apod.nasa.gov/apod/image/2405/NGC2403-...,(Team F.A.C.T.)
3,2024-06-04,Why does Comet Pons-Brooks now have tails poin...,https://apod.nasa.gov/apod/image/2406/Comet12P...,image,v1,Comet Pons-Brooks Develops Opposing Tails,https://apod.nasa.gov/apod/image/2406/Comet12P...,\nRolando Ligustri &\n Lukas Demetz \n
4,2024-06-05,What if you saw your shadow on Mars and it was...,https://apod.nasa.gov/apod/image/2406/NeretvaV...,image,v1,Shadow of a Martian Robot,https://apod.nasa.gov/apod/image/2406/NeretvaV...,
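
The request URL can also be assembled from a parameter dictionary, with the key read from the environment instead of being written into the notebook. This is a minimal sketch under two assumptions: the environment variable name `NASA_API_KEY` is hypothetical, and the fallback `DEMO_KEY` is assumed to be the rate-limited demo key described on https://api.nasa.gov/.

```python
import os
from datetime import datetime, timedelta
from urllib.parse import urlencode

# Assumption: the key is stored in an environment variable named NASA_API_KEY.
# The fallback "DEMO_KEY" is assumed to be NASA's rate-limited demo key.
api_key = os.environ.get("NASA_API_KEY", "DEMO_KEY")

# Same window as the cell above: from 31 days ago up to yesterday.
today = datetime.now()
params = {
    "api_key": api_key,
    "start_date": (today - timedelta(days=31)).strftime("%Y-%m-%d"),
    "end_date": (today - timedelta(days=1)).strftime("%Y-%m-%d"),
}

# urlencode escapes special characters in the parameter values.
api_url = "https://api.nasa.gov/planetary/apod?" + urlencode(params)

# Print the URL with the key masked so it does not end up in saved output.
print(api_url.replace(api_key, "<hidden>"))
```

The resulting `api_url` can be passed to `requests.get` exactly as in the cell above.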


In [12]:
# NY Times API
# https://developer.nytimes.com/docs/articlesearch-product/1/overview

import sys
import requests
import pandas as pd

# Parameters
query = "outer earth space exploration research"
api_key = "bi7nGsat8AZqwGQnK8xXjdmvG39A39YN"
pub_year = 2024

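# Construct the API URL and its query parameters.
# Passing them via `params` lets requests URL-encode the values and avoids
# the stray whitespace that backslash line continuations leave inside a string.
ny_url = "https://api.nytimes.com/svc/search/v2/articlesearch.json"
ny_params = {
    "q": query,
    "api-key": api_key,
    "begin_date": f"{pub_year}0101",
    "end_date": f"{pub_year}1231",
}

# Make the GET request
response = requests.get(ny_url, params=ny_params)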
if response.status_code == 200:
    ny_data = response.json()
else:
    print("ERROR. Cannot retrieve data from NY Times.")
    sys.exit(1)

# Extract the articles
ny_times_df = pd.DataFrame(ny_data['response']['docs'])
ny_times_df.to_csv('csv-files/ny_times.csv', index=False)
ny_times_df.head()

Unnamed: 0,abstract,web_url,snippet,lead_paragraph,print_section,print_page,source,multimedia,headline,keywords,pub_date,document_type,news_desk,section_name,byline,type_of_material,_id,word_count,uri,subsection_name
0,Earth’s stratosphere has never seen the amount...,https://www.nytimes.com/2024/01/09/science/roc...,Earth’s stratosphere has never seen the amount...,The high-altitude chase started over Cape Cana...,D,1.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'The New Space Race Is Causing New Po...,"[{'name': 'subject', 'value': 'Space and Astro...",2024-01-09T07:59:25+0000,article,Science,Science,"{'original': 'By Shannon Hall', 'person': [{'f...",News,nyt://article/d3050ab6-aceb-5a72-aa9a-aeab49ae...,2172,nyt://article/d3050ab6-aceb-5a72-aa9a-aeab49ae...,
1,STEM can’t solve all our problems.,https://www.nytimes.com/2024/04/05/opinion/nas...,STEM can’t solve all our problems.,The window to apply to be a NASA astronaut — a...,SR,8.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'The Next Frontier? Philosophy in Spa...,"[{'name': 'subject', 'value': 'Space and Astro...",2024-04-05T09:02:10+0000,article,OpEd,Opinion,"{'original': 'By Joseph O. Chapa', 'person': [...",Op-Ed,nyt://article/151aa2d4-6761-5811-98fa-105c782f...,904,nyt://article/151aa2d4-6761-5811-98fa-105c782f...,
2,Commercial moon landings will change how we lo...,https://www.nytimes.com/2024/01/21/opinion/moo...,Commercial moon landings will change how we lo...,The moon stands alone. It is unique in the kno...,SR,10.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'What We Do to the Moon Will Transfor...,"[{'name': 'subject', 'value': 'Private Spacefl...",2024-01-21T14:00:09+0000,article,OpEd,Opinion,"{'original': 'By Rebecca Boyle', 'person': [{'...",Op-Ed,nyt://article/bdb1808b-0bba-5b20-8543-1f7d5378...,1496,nyt://article/bdb1808b-0bba-5b20-8543-1f7d5378...,
3,He helped send the twin spacecraft on their wa...,https://www.nytimes.com/2024/06/14/science/spa...,He helped send the twin spacecraft on their wa...,"Edward C. Stone, the visionary physicist who d...",,,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...","{'main': 'Edward Stone, 88, Physicist Who Over...","[{'name': 'subject', 'value': 'Space and Astro...",2024-06-14T23:03:17+0000,article,Obits,Science,"{'original': 'By Sam Roberts', 'person': [{'fi...",Obituary (Obit),nyt://article/0d21ac7b-59a8-533c-a00b-9c31a39d...,951,nyt://article/0d21ac7b-59a8-533c-a00b-9c31a39d...,Space & Cosmos
4,NASA is conducting tests on what might be the ...,https://www.nytimes.com/2024/02/25/magazine/ma...,NASA is conducting tests on what might be the ...,Alyssa Shannon was on her morning commute from...,MM,28.0,The New York Times,"[{'rank': 0, 'subtype': 'xlarge', 'caption': N...",{'main': 'Can Humans Endure the Psychological ...,"[{'name': 'organizations', 'value': 'National ...",2024-02-25T10:04:08+0000,article,Magazine,Magazine,"{'original': 'By Nathaniel Rich', 'person': [{...",News,nyt://article/cc4d4b64-5e15-5bb9-afe0-578f67df...,6219,nyt://article/cc4d4b64-5e15-5bb9-afe0-578f67df...,
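
Several columns in the output above (`headline`, `byline`, `multimedia`, `keywords`) hold nested dictionaries or lists rather than plain values. One way to flatten nested dictionaries is `pd.json_normalize`. The sketch below assumes made-up records that only mimic the shape of `ny_data['response']['docs']`; the names `docs` and `flat_df` are illustrative.

```python
import pandas as pd

# Hypothetical records shaped like ny_data['response']['docs'] (values are made up).
docs = [
    {
        "web_url": "https://example.com/a",
        "headline": {"main": "First headline"},
        "byline": {"original": "By Author One"},
        "word_count": 2172,
    },
    {
        "web_url": "https://example.com/b",
        "headline": {"main": "Second headline"},
        "byline": {"original": "By Author Two"},
        "word_count": 904,
    },
]

# json_normalize expands nested dictionaries into columns such as 'headline_main'.
flat_df = pd.json_normalize(docs, sep="_")
print(flat_df.columns.tolist())
print(flat_df[["headline_main", "byline_original", "word_count"]])
```

Flat columns are easier to filter and to store as CSV than stringified dictionaries.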


### Web Scraping

To load data from a web page into a DataFrame using _web scraping_, you can combine the __requests__ library, which downloads the page, with the __BeautifulSoup__ library, which parses the HTML.

In [2]:
# web scraping from https://www.nytimes.com/2024/06/14/science/space/edward-stone-physicist-dead.html
# Install the en_core_web_lg model
#!pip install spacy
#!python -m spacy download en_core_web_lg
#!pip install BeautifulSoup4

import sys  # needed for sys.exit() below
import requests
import pandas as pd
import re
import spacy
from bs4 import BeautifulSoup

# Define a function to count words using regex
def count_words(text: str) -> int:
    return len(re.findall(r'\w+', text))

def get_keywords(text: str, nlp_model) -> str:
    document = nlp_model(text)
    context_keywords = ['space', 'astronomy', 'planet', 'moon', 'galaxy', 'cosmos', 'rocket']
    # Parse each context keyword once instead of once per token
    keyword_docs = [nlp_model(keyword) for keyword in context_keywords]
    # Keep nouns and proper nouns whose similarity to any context keyword exceeds 0.6
    candidates = [token for token in document if token.pos_ in ('NOUN', 'PROPN')]
    keywords = [token.text.lower() for token in candidates
                if any(token.similarity(keyword_doc) > 0.6 for keyword_doc in keyword_docs)]
    return ', '.join(keywords)

# ------------ Web Scraping ------------

# URL of the website to scrape
url = 'https://www.nytimes.com/2024/06/14/science/space/edward-stone-physicist-dead.html'

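# Headers to mimic a browser visit.
# Adjacent string literals are concatenated, so no stray whitespace
# from line continuations ends up inside the header value.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/58.0.3029.110 Safari/537.3')
}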

# Send a GET request to the website
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    soup = BeautifulSoup(response.text, 'html.parser')
else:
    print("ERROR. Failed to retrieve data using web scraping.")
    sys.exit(1)

# Extract the article content
paragraphs = soup.find_all('p')
data = []
for paragraph in paragraphs:
    data.append( paragraph.get_text() )

# Create a DataFrame from the extracted data
scrapping_df = pd.DataFrame(data, columns=['content'])
scrapping_df = scrapping_df[scrapping_df['content'].apply(count_words) > 10]
scrapping_df = scrapping_df[~scrapping_df['content'].str.contains(r'Thank you for your patience', case=False, regex=True)]

# Make simple NLP analysis
nlp = spacy.load('en_core_web_lg')
scrapping_df['keywords'] = scrapping_df['content'].apply(lambda p: get_keywords(p, nlp))

# Print the DataFrame
print(scrapping_df.shape)
scrapping_df = scrapping_df.reset_index(drop=True)
scrapping_df.to_csv('csv-files/scrapping.csv', index=False)
scrapping_df.head()

(6, 2)


Unnamed: 0,content,keywords
0,He helped send the twin spacecraft on their wa...,"spacecraft, earth"
1,"Edward C. Stone, the visionary physicist who d...","spacecraft, planets"
2,Inspired by the launch of the Soviet satellite...,
3,"Twin spacecraft, Voyager 1 and Voyager 2 were ...","spacecraft, space"
4,Dr. Stone was the program’s chief project scie...,physics
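
The extraction and filtering steps of the cell above can be tried offline on an inline HTML string. This is a minimal sketch assuming a hypothetical page; the HTML content and the name `demo_df` are made up for illustration, and the spaCy keyword step is left out.

```python
import re
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
html = """
<html><body>
  <h1>Sample article</h1>
  <p>Short note.</p>
  <p>The twin spacecraft left Earth decades ago and are still sending data home today.</p>
  <p>Thank you for your patience while we verify access to this sample page right now.</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <p> element.
paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
demo_df = pd.DataFrame(paragraphs, columns=["content"])

# Keep paragraphs with more than 10 words, then drop the access notice.
word_counts = demo_df["content"].apply(lambda text: len(re.findall(r"\w+", text)))
demo_df = demo_df[word_counts > 10]
demo_df = demo_df[~demo_df["content"].str.contains("Thank you for your patience", case=False)]
demo_df = demo_df.reset_index(drop=True)
print(demo_df)
```

Only the paragraph about the spacecraft survives both filters, which mirrors how the real cell discards short fragments and the site's waiting message.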


# Data Pre-Processing

__Pre-processing__ data is an essential step in the _data science workflow_. It involves _transforming raw data_ into a clean and structured format that is suitable for _analysis and modeling_. __Pre-processing__ typically includes several steps: _data cleaning_, _data integration_, _data transformation_, and _data reduction_.

_Data cleaning_ involves handling missing values, outliers, and inconsistencies in the data. _Missing values_ can be imputed or removed depending on the nature of the data and the analysis requirements. _Outliers_, extreme values that deviate markedly from the rest of the data, can be detected and then removed, capped, or kept, depending on whether they are errors or genuine observations. Inconsistencies in the data, such as conflicting values or duplicate records, need to be resolved to ensure data integrity.

_Data integration_ involves combining data from multiple sources into a unified dataset. This step may require resolving differences in data formats, units, or naming conventions. It is important to ensure that the integrated data is consistent and accurate.
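
As an illustration of data integration, the sketch below combines two hypothetical sources that describe vehicles with different column names and units. All names and values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sources describing vehicles with different conventions.
ranges_miles = pd.DataFrame({
    "vehicle_id": [1, 2, 3],
    "Electric Range": [270, 20, 84],   # miles
})
ranges_km = pd.DataFrame({
    "VehicleID": [3, 4],
    "range_km": [135.0, 335.0],        # kilometres
})

# Harmonize column names and units before combining.
KM_PER_MILE = 1.609344
source_a = ranges_miles.rename(columns={"Electric Range": "range_km"})
source_a["range_km"] = source_a["range_km"] * KM_PER_MILE
source_b = ranges_km.rename(columns={"VehicleID": "vehicle_id"})

# Stack both sources and keep one row per vehicle (the first source wins).
combined = (
    pd.concat([source_a, source_b], ignore_index=True)
      .drop_duplicates(subset="vehicle_id", keep="first")
      .reset_index(drop=True)
)
print(combined)
```

Which source should win when both describe the same vehicle is a design decision; `keep="first"` is only one possible rule.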

_Data transformation_ involves converting data into a suitable format for analysis. This may include _scaling numerical variables_, _encoding categorical variables_, or creating new derived features. Scaling ensures that variables are on a similar scale, which is important for certain algorithms. Encoding categorical variables converts them into numerical representations that can be processed by machine learning algorithms. Creating derived features involves extracting meaningful information from existing variables or combining multiple variables to capture complex relationships.
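
The three transformations mentioned above can be sketched with plain pandas. The sample values are hypothetical, and min-max scaling is just one of several common scaling choices.

```python
import pandas as pd

# Hypothetical sample with one categorical and two numerical columns.
cars = pd.DataFrame({
    "Make": ["TESLA", "BMW", "NISSAN", "TESLA"],
    "Electric Range": [270, 20, 84, 208],
    "Model Year": [2019, 2024, 2015, 2015],
})

# Min-max scaling: map each numerical column onto the [0, 1] interval.
numeric_cols = ["Electric Range", "Model Year"]
col_min = cars[numeric_cols].min()
col_max = cars[numeric_cols].max()
scaled = (cars[numeric_cols] - col_min) / (col_max - col_min)

# One-hot encoding: one indicator column per category.
encoded = pd.get_dummies(cars["Make"], prefix="Make")

# Derived feature: vehicle age relative to the newest model year in the sample.
derived = (cars["Model Year"].max() - cars["Model Year"]).rename("Age")

transformed = pd.concat([scaled, encoded, derived], axis=1)
print(transformed)
```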

_Data reduction techniques_ are used to reduce the dimensionality of the dataset while _preserving important information_. This is particularly useful when dealing with high-dimensional data or when computational resources are limited. Techniques such as _feature selection_ and _feature extraction_ can be applied to identify the most relevant variables or to create new variables that capture the essence of the data.
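
A simple form of feature selection drops columns that carry no information or that duplicate another column. The sketch below assumes a hypothetical numerical dataset and a correlation threshold of 0.95, which is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 'range_km' repeats 'range_miles' in another unit,
# and 'constant' carries no information.
data = pd.DataFrame({
    "range_miles": [270.0, 20.0, 84.0, 208.0, 42.0],
    "range_km": [434.5, 32.2, 135.2, 334.7, 67.6],
    "model_year": [2019.0, 2024.0, 2015.0, 2015.0, 2022.0],
    "constant": [1.0, 1.0, 1.0, 1.0, 1.0],
})

# 1. Drop columns with a single distinct value.
reduced = data.loc[:, data.nunique() > 1]

# 2. Drop one column out of each highly correlated pair (|r| > 0.95),
#    looking only at the upper triangle of the correlation matrix.
corr = reduced.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
reduced = reduced.drop(columns=to_drop)

print(reduced.columns.tolist())
```

Feature extraction methods such as principal component analysis go further by building new variables from combinations of the original ones.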

Overall, __pre-processing data__ is a critical step in data science as it ensures the quality and usability of the data for analysis and modeling tasks. By carefully handling missing values, outliers, inconsistencies, and transforming the data appropriately, _data scientists can obtain reliable insights_ and build accurate predictive models.

### Data Profiling

__Data profiling__ is an essential process in _data science_ that involves _analyzing_ and _understanding_ the characteristics of a dataset. It provides _valuable insights_ into the quality, structure, and content of the data, enabling data scientists to make informed decisions during the _data analysis_ and modeling stages.

During __data profiling__, various _statistical measures_ and techniques are applied to gain a comprehensive _understanding_ of the dataset. This includes examining the _data types_, _identifying missing values_, detecting outliers, assessing data distributions, and exploring _relationships between variables_. By performing these analyses, data scientists can uncover patterns, trends, and anomalies within the data.

In [None]:

# Assuming df is your DataFrame

# 2. Descriptive Statistics

# 3. Missing Values

# 4. Value Counts (for a categorical column named 'category_column')

# 5. Correlation
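
One possible way to fill in the steps outlined above is sketched here. It assumes a small hypothetical DataFrame named `profile_df`, so that it does not overwrite the `df` loaded earlier, with a categorical column named `category_column` as in the outline.

```python
import pandas as pd

# Hypothetical sample; 'category_column' matches the name used in the outline.
profile_df = pd.DataFrame({
    "category_column": ["BEV", "PHEV", "BEV", "BEV", None],
    "electric_range": [270.0, 20.0, 84.0, None, 42.0],
    "model_year": [2019, 2024, 2015, 2015, 2022],
})

# Descriptive statistics for the numerical columns
stats = profile_df.describe()

# Missing values per column
missing = profile_df.isna().sum()

# Value counts for the categorical column (missing values included)
counts = profile_df["category_column"].value_counts(dropna=False)

# Pairwise correlation between the numerical columns
correlation = profile_df.select_dtypes("number").corr()

print(profile_df.dtypes, stats, missing, counts, correlation, sep="\n\n")
```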


### Data Cleaning

__Data cleaning__ is a crucial step in the _data science process_. It involves identifying and _correcting errors_, _inconsistencies_, and _inaccuracies_ in the dataset to ensure its _quality_ and reliability. The process typically includes _handling missing values_, _removing duplicates_, _dealing with outliers_, and resolving inconsistencies in _data formats_ or units.

_Handling missing values_ is an important aspect of _data cleaning_. _Missing values_ can occur for various reasons, such as _data collection errors_ or incomplete records. Strategies for handling them include _imputation_, where missing values are replaced with estimates based on statistical techniques, and deletion, where rows or columns with missing values are removed from the dataset.
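
Both strategies can be sketched in a few lines of pandas. The sample data and the choice of median and most frequent value as estimates are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample with gaps in a numerical and a categorical column.
raw = pd.DataFrame({
    "electric_range": [270.0, None, 84.0, 208.0],
    "make": ["TESLA", "BMW", None, "TESLA"],
})

# Deletion: drop every row that has at least one missing value.
dropped = raw.dropna()

# Imputation: median for the numerical column,
# most frequent value for the categorical column.
imputed = raw.copy()
imputed["electric_range"] = imputed["electric_range"].fillna(imputed["electric_range"].median())
imputed["make"] = imputed["make"].fillna(imputed["make"].mode().iloc[0])

print(dropped)
print(imputed)
```

Deletion keeps only fully observed rows at the cost of losing data, while imputation keeps every row but introduces estimated values.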

_Removing duplicates_ is another key task in __data cleaning__. Duplicates can arise from data entry errors or data merging processes. Identifying and _removing duplicate records_ ensures that each observation in the dataset is unique and avoids bias in subsequent analyses.

In [None]:

## Data Cleaning

# Drop duplicates

# Reset index

# Removing outliers using mean and standard deviation
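
The steps outlined above could be filled in as follows. This is a sketch on hypothetical data: the column names, the values, and the threshold of three standard deviations are assumptions, not taken from the notebook.

```python
import pandas as pd

# Hypothetical sample: the last row duplicates the first one,
# and one price (845000) is far away from the others.
cleaning_df = pd.DataFrame({
    "vehicle_id": list(range(1, 13)) + [1],
    "base_msrp": [30000 + 1000 * i for i in range(11)] + [845000, 30000],
})

# Drop duplicates
cleaning_df = cleaning_df.drop_duplicates()

# Reset index
cleaning_df = cleaning_df.reset_index(drop=True)

# Remove outliers using mean and standard deviation:
# keep rows within 3 standard deviations of the mean.
mean = cleaning_df["base_msrp"].mean()
std = cleaning_df["base_msrp"].std()
within_limits = (cleaning_df["base_msrp"] - mean).abs() <= 3 * std
cleaned = cleaning_df[within_limits].reset_index(drop=True)

print(len(cleaning_df), len(cleaned))
```

On very small samples a single extreme value inflates the standard deviation so much that it may never exceed three standard deviations, so the rule works best with more than a handful of rows.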


# Exploratory Data Analysis

__Exploratory Data Analysis__ (EDA) is a crucial process in data science that involves _examining and understanding_ the characteristics of a _dataset_. It serves as a foundation for further analysis and modeling tasks. 

During __EDA__, data scientists employ various techniques to _gain insights_ into the data. This includes summarizing the main features and statistics of the dataset, visualizing the data through plots and charts, and identifying patterns and relationships between variables. 

Overall, __EDA__ plays a vital role in understanding the data, _formulating hypotheses_, and _generating insights_ that drive the subsequent steps in the _data science workflow_. It helps in making informed decisions, validating assumptions, and building robust models that can effectively solve real-world problems.

### Pandas Exploration

In [None]:
# column names
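
A few pandas calls cover most of a first exploration. The sketch below uses a hypothetical DataFrame named `explore_df`; the same calls apply to any DataFrame loaded earlier.

```python
import pandas as pd

# Hypothetical sample standing in for a loaded dataset.
explore_df = pd.DataFrame({
    "make": ["TESLA", "BMW", "NISSAN", "TESLA"],
    "model": ["MODEL S", "330E", "LEAF", "MODEL S"],
    "electric_range": [270, 20, 84, 208],
})

# Column names
column_names = explore_df.columns.tolist()
print(column_names)

# Size and data types
print(explore_df.shape)
print(explore_df.dtypes)

# Number of distinct values per column
print(explore_df.nunique())

# Rows with the largest electric range
print(explore_df.sort_values("electric_range", ascending=False).head(3))

# Average electric range per make
mean_range = explore_df.groupby("make")["electric_range"].mean()
print(mean_range)
```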


In [None]:
#!pip install ydata_profiling
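
Once the package is installed, an automated profiling report can be generated with its `ProfileReport` class. The sketch below wraps the call in a hypothetical helper, `build_profile_report`, and imports the package lazily so that the function can be defined even before installation. The report title and output file name are illustrative.

```python
import pandas as pd


def build_profile_report(frame: pd.DataFrame, output_path: str = "profile_report.html") -> None:
    """Write an HTML profiling report for `frame` (requires ydata_profiling)."""
    # Imported here so that defining the function does not require the package.
    from ydata_profiling import ProfileReport

    # minimal=True turns off the most expensive computations.
    report = ProfileReport(frame, title="Data Profiling Report", minimal=True)
    report.to_file(output_path)


# Usage (uncomment once ydata_profiling is installed and a DataFrame `df` exists):
# build_profile_report(df, "profile_report.html")
```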


### Extract-Transform-Load Pipelines



In [None]:
# Extract-Transform-Load Pipelines
# Step 1: Import Required Libraries

# Step 2: Define Functions for DataFrame Manipulation

# Step 3: Create the Pipeline
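
One way to follow the three steps outlined above is to write small functions and chain them with `DataFrame.pipe`. This is a sketch under stated assumptions: the function names, the inline CSV sample, and the reference year are all hypothetical.

```python
# Step 1: Import required libraries
import io
import pandas as pd


# Step 2: Define functions for DataFrame manipulation
def extract(source) -> pd.DataFrame:
    """Extract: read raw CSV data from a path, URL, or file-like object."""
    return pd.read_csv(source)


def clean_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Transform: normalize column names to snake_case."""
    return frame.rename(columns=lambda name: name.strip().lower().replace(" ", "_"))


def drop_incomplete(frame: pd.DataFrame, subset: list) -> pd.DataFrame:
    """Transform: remove rows with missing values in the given columns."""
    return frame.dropna(subset=subset)


def add_age(frame: pd.DataFrame, reference_year: int) -> pd.DataFrame:
    """Transform: derive the vehicle age from the model year."""
    return frame.assign(vehicle_age=reference_year - frame["model_year"])


def load(frame: pd.DataFrame, target) -> pd.DataFrame:
    """Load: write the result as CSV to a path or file-like object."""
    frame.to_csv(target, index=False)
    return frame


# Step 3: Create the pipeline
raw_csv = io.StringIO(
    "Make,Model Year,Electric Range\n"
    "TESLA,2019,270\n"
    "BMW,2024,\n"
    "NISSAN,2015,84\n"
)
output = io.StringIO()  # stands in for an output file

result = (
    extract(raw_csv)
    .pipe(clean_columns)
    .pipe(drop_incomplete, subset=["electric_range"])
    .pipe(add_age, reference_year=2024)
    .pipe(load, target=output)
)
print(result)
```

Because every step takes a DataFrame and returns a DataFrame, steps can be reordered, removed, or tested on their own.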
