# Open Data

__Open Data__ refers to the concept of making _data freely available_ to the public, with few or no restrictions on its use, reuse, or redistribution (at most, conditions such as attribution). It is typically provided in a _machine-readable format_ and can be accessed and used by anyone for purposes such as research, analysis, and innovation. __Open Data__ promotes transparency, accountability, and collaboration in both the public and private sectors.

### Download and load CSV files

To load data from a _CSV_ file into a DataFrame, call the `pd.read_csv(url)` function from the pandas library, where `url` is the actual _URL_ of the _CSV_ file you want to load. The statement `df = pd.read_csv(url)` downloads the CSV file from that URL and parses it into the `df` DataFrame.
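
As a minimal sketch of that call, the example below reads a small in-memory CSV instead of a real download, so it runs without network access. The column names and values are made up for illustration; with a real dataset you would pass the file's URL string to `pd.read_csv` in place of `sample_csv`.

```python
import io

import pandas as pd

# Stand-in for a downloaded file: pd.read_csv accepts a URL string, a local
# path, or any file-like object, so the same call covers all three cases.
# The columns below are hypothetical sample data.
sample_csv = io.StringIO(
    "make,model,year\n"
    "Tesla,Model 3,2021\n"
    "Nissan,Leaf,2019\n"
)

# With a real dataset this would be: df = pd.read_csv(url)
df = pd.read_csv(sample_csv)

print(df.shape)   # (number of rows, number of columns)
print(df.dtypes)  # column types inferred by pandas
```

Checking `df.shape` and `df.dtypes` right after loading is a cheap way to confirm that the delimiter and header row were interpreted as expected.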

In [None]:
# https://catalog.data.gov/dataset/electric-vehicle-population-data

# URL of the CSV file

# Download the CSV file and load it into a DataFrame


### Get a JSON from a URL and load

To load data from a _JSON_ file into a DataFrame, call the `pd.read_json(url)` function from the pandas library, where `url` is the actual _URL_ of the _JSON_ file you want to load. The statement `df = pd.read_json(url)` downloads the _JSON_ file from that URL and parses it into the `df` DataFrame. This works most directly when the JSON is table-like, for example a list of records that share the same keys.
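
The sketch below illustrates the same call on an in-memory JSON document, so it runs offline. The records are invented sample data (a list of objects with identical keys); with a real source you would pass the URL string to `pd.read_json` instead of `sample_json`.

```python
import io
import json

import pandas as pd

# Hypothetical records: a list of JSON objects with the same keys maps
# naturally onto DataFrame rows and columns.
records = [
    {"userId": 1, "id": 1, "title": "first post"},
    {"userId": 1, "id": 2, "title": "second post"},
]

# Wrap the JSON text in a file-like object; pd.read_json also accepts a URL
# or a local path.
sample_json = io.StringIO(json.dumps(records))

# With a real source this would be: df = pd.read_json(url)
df = pd.read_json(sample_json)

print(df.head())
```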

In [None]:
# https://jsonplaceholder.typicode.com/

# URL of the JSON file

# Download the JSON file

# Load the JSON data into a DataFrame


### Consuming Free Public APIs

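To load data from a _public web API_ into a DataFrame, you can pass the URL of the _API endpoint_ to the `pd.read_json(url)` function from the pandas library when the endpoint returns flat, table-like JSON. Many APIs instead require an API key or query parameters and return nested JSON; in that case, make the request with the __requests__ library and build the DataFrame from the parsed response.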
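
The following sketch shows the two pieces that usually surround the request itself: building a parameterized URL (including a date range) and flattening a nested response. It makes no network call. The endpoint, the parameter names, and the payload shape are all hypothetical placeholders, not the format of any particular API; check the provider's documentation for the real ones.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

import pandas as pd

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/v1/items"


def build_url(base_url, api_key, end, days=7):
    """Return base_url with a query string covering the `days` days up to `end`."""
    start = end - timedelta(days=days)
    params = {
        "api_key": api_key,  # hypothetical parameter names
        "start_date": start.isoformat(),
        "end_date": end.isoformat(),
    }
    return f"{base_url}?{urlencode(params)}"


url = build_url(BASE_URL, "YOUR_API_KEY", date(2024, 1, 21))

# Hypothetical nested payload, standing in for what response.json() might
# return after requests.get(url).
payload = {
    "status": "OK",
    "results": [
        {"id": 1, "title": "A", "media": {"type": "image"}},
        {"id": 2, "title": "B", "media": {"type": "video"}},
    ],
}

# json_normalize flattens nested objects into dotted column names.
df = pd.json_normalize(payload["results"])
print(list(df.columns))
```

Using `urlencode` rather than string concatenation keeps parameter values correctly escaped, and `pd.json_normalize` avoids hand-written loops over nested dictionaries.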

In [None]:
# NASA picture of the day API
# https://api.nasa.gov/

# NASA API key

# Calculate dates

# API URL with parameters

# Make the API request

# Load the JSON data into a DataFrame


In [None]:
# NY Times API
# https://developer.nytimes.com/docs/articlesearch-product/1/overview

# Parameters

# Construct the API URL

# Make the GET request

# Extract the articles


### Web Scraping

To load data from a website into a DataFrame using _web scraping_, you can combine the __requests__ library, which downloads the page's HTML, with the __BeautifulSoup__ library, which parses that HTML so you can extract the elements you need.
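
A minimal sketch of the parsing half of that workflow is shown below. To stay runnable offline, it parses a small inline HTML string that stands in for `response.text`; the HTML structure (paragraphs inside an `<article>` tag) is an assumption, and real pages usually need selectors tailored to their markup. The `count_words` helper is illustrative, and the example assumes the `beautifulsoup4` package is installed.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup


def count_words(text):
    """Count word tokens with a regex (runs of letters, digits, or apostrophes)."""
    return len(re.findall(r"[A-Za-z0-9']+", text))


# Inline HTML standing in for response.text from requests.get(url, headers=...).
html = """
<html><body>
  <h1>Sample headline</h1>
  <article>
    <p>First paragraph of the article.</p>
    <p>Second paragraph, a bit longer than the first.</p>
  </article>
</body></html>
"""

# Parse the document and collect the text of each paragraph in the article.
soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")
paragraphs = [p.get_text(strip=True) for p in article.find_all("p")]

# One row per paragraph, plus a simple derived measure.
df = pd.DataFrame({"paragraph": paragraphs})
df["word_count"] = df["paragraph"].apply(count_words)
print(df)
```

Before scraping a real site, check its terms of use and `robots.txt`, and confirm the request succeeded (for example with `response.raise_for_status()`) before parsing.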

In [None]:
# web scraping from https://www.nytimes.com/2024/01/21/opinion/moon-commercial-companies-transform.html
# Install the spaCy English model (the command below downloads en_core_web_lg)
#!python -m spacy download en_core_web_lg

# Define a function to count words using regex

# ------------ Web Scraping ------------

# URL of the website to scrape

# Headers to mimic a browser visit

# Send a GET request to the website

# Check if the request was successful

# Extract the article content

# Create a DataFrame from the extracted data

# Make simple NLP analysis

# Print the DataFrame


# Data Pre-Processing

__Pre-processing__ data is an essential step in the _data science workflow_. It involves _transforming raw data_ into a clean and structured format that is suitable for _analysis and modeling_. The __pre-processing__ process typically includes several steps such as _data cleaning_, _data integration_, _data transformation_, and _data reduction_.

_Data cleaning_ involves handling missing values, outliers, and inconsistencies in the data. _Missing values_ can be imputed or removed depending on the nature of the data and the analysis requirements. _Outliers_, extreme values that lie far from the bulk of the data, can be detected and then removed, capped, or kept, depending on whether they are errors or genuine observations. Inconsistencies in the data, such as conflicting values or duplicate records, need to be resolved to ensure data integrity.

_Data integration_ involves combining data from multiple sources into a unified dataset. This step may require resolving differences in data formats, units, or naming conventions. It is important to ensure that the integrated data is consistent and accurate.
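
As an illustration of integration, the sketch below joins two small, made-up tables whose key columns are named differently. The table and column names are hypothetical; the point is the pattern of harmonizing names first and then checking which rows failed to match.

```python
import pandas as pd

# Two hypothetical sources that use different names for the same key.
sales = pd.DataFrame(
    {"store_id": [1, 2, 3], "revenue_usd": [1000.0, 1500.0, 700.0]}
)
stores = pd.DataFrame(
    {"StoreID": [1, 2, 4], "city": ["Austin", "Boston", "Denver"]}
)

# Resolve the naming difference before joining.
stores = stores.rename(columns={"StoreID": "store_id"})

# An outer join keeps rows from both sides; indicator=True adds a "_merge"
# column saying whether each row came from the left, the right, or both.
merged = sales.merge(stores, on="store_id", how="outer", indicator=True)

# Rows present in only one source deserve a closer look.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched)
```

Inspecting the unmatched rows before switching to an inner join makes it harder to lose records silently during integration.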

_Data transformation_ involves converting data into a suitable format for analysis. This may include _scaling numerical variables_, _encoding categorical variables_, or creating new derived features. Scaling ensures that variables are on a similar scale, which is important for certain algorithms. Encoding categorical variables converts them into numerical representations that can be processed by machine learning algorithms. Creating derived features involves extracting meaningful information from existing variables or combining multiple variables to capture complex relationships.
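
The three transformations named above can be sketched with plain pandas as follows. The data and column names are invented, and standardization (z-scores) and one-hot encoding are just one common choice each for scaling and encoding.

```python
import pandas as pd

# Hypothetical data with two numeric columns and one categorical column.
df = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0],
        "weight_kg": [50.0, 60.0, 70.0, 80.0],
        "color": ["red", "blue", "red", "green"],
    }
)

# Scaling: standardize each numeric column to mean 0 and standard deviation 1
# (population standard deviation, ddof=0).
num_cols = ["height_cm", "weight_kg"]
scaled = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std(ddof=0)

# Encoding: one-hot encode the categorical column into indicator columns.
encoded = pd.get_dummies(df["color"], prefix="color")

# Derived feature: body-mass index computed from two existing columns.
bmi = (df["weight_kg"] / (df["height_cm"] / 100) ** 2).rename("bmi")

features = pd.concat([scaled, encoded, bmi], axis=1)
print(features.columns.tolist())
```

When a model is trained and evaluated on separate splits, the scaling statistics are normally computed on the training split only and then reused on the other splits.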

_Data reduction techniques_ are used to reduce the dimensionality of the dataset while _preserving important information_. This is particularly useful when dealing with high-dimensional data or when computational resources are limited. Techniques such as _feature selection_ and _feature extraction_ can be applied to identify the most relevant variables or to create new variables that capture the essence of the data.
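
A rough sketch of both ideas follows, using synthetic data. Feature selection is shown as two simple filters (dropping constant columns and one column of each highly correlated pair), and feature extraction as principal component analysis computed through a singular value decomposition. The thresholds `1e-12` and `0.95` are arbitrary illustrative values.

```python
import numpy as np
import pandas as pd

# Synthetic data: one redundant column and one constant column.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
df = pd.DataFrame(
    {
        "x": x,
        "x_copy": x * 2.0 + 0.01 * rng.normal(size=200),  # nearly redundant
        "noise": rng.normal(size=200),
        "constant": 1.0,  # carries no information
    }
)

# Feature selection, step 1: drop columns with (near-)zero variance.
kept = df.loc[:, df.var() > 1e-12]

# Step 2: drop the later column of each highly correlated pair, looking only
# at the upper triangle so every pair is considered once.
corr = kept.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected = kept.drop(columns=to_drop)

# Feature extraction: PCA via SVD of the centered (unscaled) data.
centered = (kept - kept.mean()).to_numpy()
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)
explained = singular_values**2 / np.sum(singular_values**2)
projected = centered @ components[:2].T  # first two principal components

print(selected.columns.tolist())
print(explained.round(3))
```

Selection keeps original, interpretable columns, while extraction produces new combined columns; which trade-off is preferable depends on the analysis.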

Overall, __pre-processing data__ is a critical step in data science as it ensures the quality and usability of the data for analysis and modeling tasks. By carefully handling missing values, outliers, inconsistencies, and transforming the data appropriately, _data scientists can obtain reliable insights_ and build accurate predictive models.

### Data Profiling

__Data profiling__ is an essential process in _data science_ that involves _analyzing_ and _understanding_ the characteristics of a dataset. It provides _valuable insights_ into the quality, structure, and content of the data, enabling data scientists to make informed decisions during the _data analysis_ and modeling stages.

During __data profiling__, various _statistical measures_ and techniques are applied to gain a comprehensive _understanding_ of the dataset. This includes examining the _data types_, _identifying missing values_, detecting outliers, assessing data distributions, and exploring _relationships between variables_. By performing these analyses, data scientists can uncover patterns, trends, and anomalies within the data.
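
The checks listed above map onto a handful of pandas calls. The sketch below runs them on a small invented DataFrame; `category_column` follows the placeholder name used in this notebook, and the other column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing number, a missing category, and one
# suspiciously large row.
df = pd.DataFrame(
    {
        "price": [10.0, 12.5, np.nan, 11.0, 250.0],
        "quantity": [1, 2, 2, 3, 50],
        "category_column": ["a", "b", "a", None, "a"],
    }
)

dtypes = df.dtypes                    # data type of each column
summary = df.describe()               # count, mean, std, quartiles (numeric)
missing = df.isna().sum()             # missing values per column
counts = df["category_column"].value_counts(dropna=False)  # category frequencies
corr = df.corr(numeric_only=True)     # pairwise correlation of numeric columns

print(dtypes, summary, missing, counts, corr, sep="\n\n")
```

Comparing the maximum with the 75th percentile in `describe()` output is a quick first hint that a column may contain outliers.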

In [None]:

# Assuming df is your DataFrame

# 2. Descriptive Statistics

# 3. Missing Values

# 4. Value Counts (for a categorical column named 'category_column')

# 5. Correlation


### Data Cleaning

__Data cleaning__ is a crucial step in the _data science process_. It involves identifying and _correcting errors_, _inconsistencies_, and _inaccuracies_ in the dataset to ensure its _quality_ and reliability. The process typically includes _handling missing values_, _removing duplicates_, _dealing with outliers_, and resolving inconsistencies in _data formats_ or units.

_Handling missing values_ is an important aspect of _data cleaning_. _Missing values_ can occur for various reasons, such as _data collection errors_ or incomplete records. Strategies for handling missing values include _imputation_, where missing values are replaced with estimated values based on statistical techniques, or deletion, where rows or columns with missing values are removed from the dataset.
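
Both strategies can be sketched in a few lines of pandas. The data is invented, and the specific choices here (dropping columns that are more than half empty, median for numbers, most frequent value for categories) are common defaults rather than rules.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in every column.
df = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0, 35.0],
        "city": ["Lima", "Quito", None, "Lima"],
        "score": [np.nan, np.nan, np.nan, 1.0],
    }
)

# Deletion of columns: drop columns where more than half the values are missing.
mostly_empty = df.columns[df.isna().mean() > 0.5]
trimmed = df.drop(columns=mostly_empty)

# Imputation: median for the numeric column, most frequent value for the
# categorical column.
imputed = trimmed.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode().iloc[0])

# Deletion of rows: keep only rows with no missing values at all.
dropped = trimmed.dropna()

print(imputed)
print(dropped)
```

Imputation keeps every row but adds estimated values, while deletion keeps only observed values but shrinks the dataset, so the better choice depends on how much data is missing and why.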

_Removing duplicates_ is another key task in __data cleaning__. Duplicates can arise from data entry errors or data merging processes. Identifying and _removing duplicate records_ ensures that each observation in the dataset is unique and avoids bias in subsequent analyses.
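
The sketch below removes duplicate rows, resets the index, and then applies a mean-and-standard-deviation outlier filter. The data is synthetic, and the cut-off of three standard deviations is a conventional but arbitrary assumption.

```python
import pandas as pd

# Synthetic data: 30 ordinary values, one extreme value, two repeated rows.
values = [10.0 + 0.5 * (i % 5) for i in range(30)] + [100.0]
df = pd.DataFrame({"id": range(31), "value": values})
df = pd.concat([df, df.iloc[[0, 1]]], ignore_index=True)  # inject duplicates

# Drop exact duplicate rows and renumber the index from 0.
deduped = df.drop_duplicates().reset_index(drop=True)

# Keep rows whose value lies within 3 standard deviations of the mean.
mean = deduped["value"].mean()
std = deduped["value"].std()
z_scores = (deduped["value"] - mean) / std
cleaned = deduped[z_scores.abs() <= 3].reset_index(drop=True)

print(len(df), len(deduped), len(cleaned))
```

Two caveats apply to this filter: the mean and standard deviation are themselves pulled toward the outliers, and with ten rows or fewer no value can lie more than three sample standard deviations from the mean, so the rule only makes sense on reasonably large samples.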

In [None]:

## Data Cleaning

# Drop duplicates

# Reset index

# Removing outliers using mean and standard deviation


# Exploratory Data Analysis

__Exploratory Data Analysis__ (EDA) is a crucial process in data science that involves _examining and understanding_ the characteristics of a _dataset_. It serves as a foundation for further analysis and modeling tasks. 

During __EDA__, data scientists employ various techniques to _gain insights_ into the data. This includes summarizing the main features and statistics of the dataset, visualizing the data through plots and charts, and identifying patterns and relationships between variables. 
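
As a small, text-only illustration of these techniques, the sketch below summarizes a numeric column by group, bins it into a frequency table (the tabular counterpart of a histogram), and measures the relationship between two variables. The data, column names, and bin edges are invented.

```python
import pandas as pd

# Hypothetical customer data.
df = pd.DataFrame(
    {
        "segment": ["a", "a", "b", "b", "b", "c"],
        "spend": [10.0, 14.0, 30.0, 34.0, 32.0, 5.0],
        "visits": [1, 2, 5, 6, 5, 1],
    }
)

# Summary statistics per group.
by_segment = df.groupby("segment")["spend"].agg(["count", "mean", "max"])

# Distribution of spend as counts per bin.
bins = pd.cut(df["spend"], bins=[0, 10, 20, 40])
histogram = bins.value_counts().sort_index()

# Strength of the linear relationship between two numeric variables.
spend_visits_corr = df["spend"].corr(df["visits"])

print(by_segment)
print(histogram)
print(round(spend_visits_corr, 3))
```

In a notebook, the same quantities are often easier to read as charts, for example through the DataFrame's `plot` accessor when a plotting backend such as matplotlib is installed.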

Overall, __EDA__ plays a vital role in understanding the data, _formulating hypotheses_, and _generating insights_ that drive the subsequent steps in the _data science workflow_. It helps in making informed decisions, validating assumptions, and building robust models that can effectively solve real-world problems.

### Pandas Exploration

In [None]:
# column names


In [None]:
#!pip install ydata_profiling


### Extract-Transform-Load Pipelines
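
One way to structure an extract-transform-load pipeline in pandas is to write each step as a small function that takes and returns a DataFrame, and to chain the steps with `DataFrame.pipe`. The sketch below is a minimal version under that assumption; the function names, column names, and in-memory source and target are all hypothetical stand-ins for real files or databases.

```python
import io

import pandas as pd


def extract(source):
    """Extract: read raw CSV data from a path, URL, or file-like object."""
    return pd.read_csv(source)


def drop_incomplete(df):
    """Transform: remove rows that contain missing values."""
    return df.dropna()


def add_total(df, price_col="price", qty_col="quantity"):
    """Transform: add a derived 'total' column without mutating the input."""
    return df.assign(total=df[price_col] * df[qty_col])


def load(df, target):
    """Load: write the result as CSV and return it for further chaining."""
    df.to_csv(target, index=False)
    return df


# In-memory stand-ins for an input file and an output file.
raw = io.StringIO("price,quantity\n2.5,4\n,3\n1.0,10\n")
out = io.StringIO()

# Each .pipe call passes the DataFrame to the next step.
result = (
    extract(raw)
    .pipe(drop_incomplete)
    .pipe(add_total)
    .pipe(load, target=out)
)

print(result)
```

Keeping each step as a pure function makes the steps easy to test on their own and to reorder or reuse in other pipelines.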



In [None]:
# Extract-Transform-Load Pipelines
# Step 1: Import Required Libraries

# Step 2: Define Functions for DataFrame Manipulation

# Step 3: Create the Pipeline
