# The Best Sources for Public Datasets -- Part 1

-   **[Census Datasets - Statistics
    Canada](https://www12.statcan.gc.ca/datasets/Index-eng.cfm):** A
    portal that gives you access to selected datasets from the Canadian
    census from 1981 to 2021.

-   **[Data.gov](https://data.gov/):** The home of the U.S. government's
    open data, where you can access over 300,000 datasets on topics such
    as agriculture, education, health, public safety, and more.

-   **[Data.world](https://data.world/datasets/canada):** A platform
    that hosts thousands of datasets contributed by users and
    organizations from around the world.

-   **[Datahub.io](https://datahub.io/):** A free, open-source data
    management platform that provides access to high-quality,
    standardized, and linked datasets.

-   **[Google Dataset
    Search](https://datasetsearch.research.google.com/):** A search
    engine that lets you find datasets from various domains and sources.

-   **[Kaggle](https://www.kaggle.com/):** A platform for data science
    and machine learning competitions, where you can find and download
    thousands of datasets uploaded by users and organizations.

-   **[Open data, statistics and archives -
    Canada.ca](https://www.canada.ca/en/services/science/open-data.html):**
    A website that provides data, statistics, analyses and archival
    information from the Canadian government.

-   **[UCI Machine Learning Repository](https://archive.ics.uci.edu/):**
    A collection of over 500 datasets that have been used for machine
    learning research and education.

-   **[Using open data \| Open Government - Government of
    Canada](https://open.canada.ca/en/using-open-data):** A website that
    helps you learn how to use open data, find applications and APIs,
    and share your own data projects.


Primary sources are original documents or records that provide first-hand evidence or direct information about a topic. Secondary sources analyze, interpret, or summarize primary sources.

# Many Packages Provide Access to Various Datasets

-   [Datasets](https://github.com/huggingface/datasets): A library that provides a collection of datasets and metrics for natural language processing, hosted by Hugging Face.

-   [OpenML](https://www.openml.org/apis): A platform that hosts thousands of datasets contributed by users and organizations from around the world. It also provides a standardized web API for retrieving data.

-   [scikit-learn](https://scikit-learn.org/stable/datasets.html): A popular library for machine learning in Python. It provides many built-in datasets, as well as functions to fetch datasets from external sources, such as OpenML.

-   [seaborn](https://github.com/mwaskom/seaborn-data): A library for statistical data visualization in Python. It also includes some sample datasets for demonstration purposes.

-   [tensorflow_datasets](https://www.tensorflow.org/datasets): A library that provides a collection of datasets ready to use with TensorFlow or other Python ML frameworks, such as Jax.

-   [yellowbrick](https://www.scikit-yb.org/en/latest/api/datasets/index.html): A library that extends scikit-learn with visual analysis and diagnostic tools. It also offers some datasets for testing and comparison.

<font color='Blue'><b>Examples:</b></font>

In [None]:
# Example: OpenML
try:
  import openml
except ImportError:
  !pip install openml
  import openml

import warnings
warnings.filterwarnings(action='ignore', category=FutureWarning, message=r".*download_data.*")


# List all datasets and their properties
openml.datasets.list_datasets(output_format="dataframe")

# Get dataset by ID
dataset = openml.datasets.get_dataset(61)

# Get dataset by name
dataset = openml.datasets.get_dataset('Fashion-MNIST', download_data=True, download_qualities=False, download_features_meta_data=False)

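# Get the data as a dataframe, splitting off the default target column as y
# (without the target argument, y would be None)
X, y, _, _ = dataset.get_data(target=dataset.default_target_attribute, dataset_format="dataframe")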

display(X)

In [None]:
# Example: scikit-learn
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data # features
y = iris.target # target
print(X.shape, y.shape)
print(iris.feature_names)
print(iris.target_names)

In [None]:
# Example: yellowbrick
try:
  from yellowbrick.datasets import load_concrete
except ImportError:
  !pip install yellowbrick
  from yellowbrick.datasets import load_concrete

dataset = load_concrete(return_dataset=True)
df = dataset.to_dataframe()
display(df)

# How to Handle Missing Values
- Missing values, or NaNs, are common in real-world data and can affect the performance and accuracy of machine learning models.
- One way to deal with missing values is to drop them from the dataset, either by rows or by columns.
- For a Series, dropping missing values is straightforward: the [dropna()](https://pandas.pydata.org/docs/reference/api/pandas.Series.dropna.html) method removes every element that is NaN.
- For a DataFrame, dropping missing values takes more care. A DataFrame must stay rectangular, so you cannot remove a single cell: every NaN you drop takes its entire row or column with it, including any valid values stored there.
- The DataFrame [dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html) method takes an axis (0 for rows, which is the default, or 1 for columns) and one criterion for dropping, either how or thresh (recent pandas versions raise an error if both are passed):
  - how: 'any' or 'all'. If 'any', drop the row or column if any value is NaN. If 'all', drop the row or column only if all of its values are NaN.
  - thresh: an integer. Drop the row or column if it has fewer than thresh non-NaN values.
- For example, ```df.dropna(axis=0, how='any')``` will drop any row that has at least one NaN value. ```df.dropna(axis=1, thresh=3)``` will drop any column that has fewer than three non-NaN values.
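
The Series case and the `how='all'` option described above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the names `s` and `df_demo` and their values are made up, and `isna().sum()` is added only as one convenient way to count missing values before deciding what to drop.

```python
import numpy as np
import pandas as pd

# A Series: dropna() simply removes the NaN elements
s = pd.Series([1.0, np.nan, 3.0])
s_clean = s.dropna()

# A hypothetical DataFrame whose second row is entirely NaN
df_demo = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [4.0, np.nan, np.nan],
})

# Count the missing values in each column before dropping anything
missing_per_column = df_demo.isna().sum()

# how='all' removes only the rows in which every value is NaN
rows_all = df_demo.dropna(axis=0, how="all")

# how='any' removes every row that contains at least one NaN
rows_any = df_demo.dropna(axis=0, how="any")

print(s_clean.tolist())
print(missing_per_column.to_dict())
print(rows_all.index.tolist())
print(rows_any.index.tolist())
```

Comparing `rows_all` with `rows_any` shows how much data each criterion costs: `how='any'` also discards the valid value of `A` in the last row.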

<font color='Blue'><b>Examples:</b></font>

In [None]:
import pandas as pd
import numpy as np

# Create a sample dataframe with some missing values
df = pd.DataFrame({"A": [1, 2, np.nan, 4], "B": [5, np.nan, 7, 8], "C": [np.nan, np.nan, np.nan, 10]})
print("Original DataFrame:")
display(df)

# Drop any row that has at least one NaN value
df1 = df.dropna(axis=0, how='any')
print("\nDataFrame after dropping rows with any NaN value:")
display(df1)

# Drop any column that has fewer than three non-NaN values
df2 = df.dropna(axis=1, thresh=3)
print("\nDataFrame after dropping columns with fewer than three non-NaN values:")
display(df2)

In [None]:
import pandas as pd

# https://climate.weather.gc.ca/climate_data/daily_data_e.html?StationID=50430
# Define the data retrieval link
link = 'https://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=50430&Year=2024&Month=1&Day=1&time=&timeframe=2&submit=Download+Data'

# Read the CSV file and select relevant columns
df = pd.read_csv(link, usecols=['Date/Time', 'Mean Temp (°C)'])

# Rename the 'Date/Time' column to 'Date', drop missing values, keep the last six rows, and set 'Date' as the index
df = df.rename(columns={'Date/Time': 'Date'}).dropna().tail(6).set_index('Date')

# Set the second and fourth rows to missing (iloc positions are zero-based)
df.iloc[1] = pd.NA  # second row
df.iloc[3] = pd.NA  # fourth row

# Display the original dataframe
print("Original DataFrame:")
display(df)

# Drop rows with missing values
df_clean = df.dropna()
print("\nDataFrame after dropping rows with missing values:")
display(df_clean)

# Filling NaN Values

- **Forward-fill**: This method fills each missing value with the most recent valid value before it. It suits time series data, where values are ordered by time and a missing value can be assumed to be similar to the previous one. For example, if the temperature for a day is missing, we can fill it with the temperature from the previous day.
- **Back-fill**: This method fills each missing value with the next valid value after it. It is useful when a gap sits at the start of a series, where there is no earlier value to carry forward, or when the following observation is the better stand-in. For example, if the stock price for a day is missing, we can fill it with the price from the next day. Because this borrows information from the future, it is a poor fit for data that will be used for forecasting.
- **Custom code**: This method lets us write our own logic to fill in the missing values, which is useful for specific cases where the other methods are unsuitable or inaccurate. For example, if the age of a person is missing, we can fill it with the mean or median age of the population, or use some other criterion based on the data.

The choice of method depends on the source and nature of the data and the desired outcome.
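
The custom-code option can be sketched with the age example above. This is a minimal illustration under assumed data: the `people` frame, its values, and the `age_filled` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with two missing entries
people = pd.DataFrame({"age": [22.0, np.nan, 35.0, 41.0, np.nan, 58.0]})

# Median of the observed ages (NaN values are skipped by default)
median_age = people["age"].median()

# Fill every missing age with that median, keeping the original column intact
people["age_filled"] = people["age"].fillna(median_age)

print("Median age:", median_age)
print(people)
```

The median is often preferred over the mean here because a few extreme values pull it far less.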

<font color='Blue'><b>Examples:</b></font>

In [None]:
# Backward fill missing values
df_bfill = df.bfill()
print("\nDataFrame after backward fill:")
display(df_bfill)

# Forward fill missing values
df_ffill = df.ffill()
print("\nDataFrame after forward fill:")
display(df_ffill)

# Linear interpolation for missing values
df_interpolated = df.interpolate(method='linear')
print("\nDataFrame after linear interpolation:")
display(df_interpolated)

# Detecting Errors in Real Measurements

Real-world data collection can introduce errors from sources such as equipment failure or environmental noise. Depending on the shape and characteristics of the data, you can identify these errors visually (for example, with a box plot or histogram) or statistically (for example, by measuring how far each value lies from the center of the data). Once identified, erroneous values can be handled in the same way as NaN values: a simple approach is to replace them with np.nan and then apply your chosen method for filling missing values.
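
One way to sketch this replace-then-fill approach is with z-scores, which measure each value's distance from the mean in units of standard deviation. The readings below are made up, and the cutoff of 3 standard deviations is an assumed convention rather than a rule from this notebook. Other distance measures or thresholds may suit your data better.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings with one obviously faulty value (85.0)
readings = pd.Series([20.1, 19.8, 20.3, 20.0, 19.9, 20.2,
                      85.0, 20.1, 19.7, 20.0, 20.2, 19.9])

# Distance of each reading from the mean, in standard deviations
z_scores = (readings - readings.mean()) / readings.std()

# Assumed cutoff: flag readings more than 3 standard deviations from the mean
threshold = 3.0
is_error = z_scores.abs() > threshold

# Replace the flagged readings with NaN, then fill them like any other gap
with_gaps = readings.mask(is_error, np.nan)
repaired = with_gaps.interpolate(method="linear")

print("Flagged positions:", readings.index[is_error].tolist())
print(repaired)
```

Keep in mind that a large error inflates the mean and standard deviation it is judged against, so in small samples a z-score cutoff can miss errors that a box plot would reveal.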


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Use a custom style for the plot (adjust the path to your style file)
plt.style.use('https://raw.githubusercontent.com/HatefDastour/ENSF444/main/Files/mystyle.mplstyle')

# Load the diabetes dataset
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Create a figure with two subplots, specifying the size and ratios
fig, (ax_box, ax_hist) = plt.subplots(2, figsize=(8, 5), gridspec_kw={"height_ratios": (.15, .85)})

# Draw a box plot of the 'bmi' column
sns.boxplot(x=df['bmi'], ax=ax_box)

# Draw a histogram of the 'bmi' column with a kernel density estimate
sns.histplot(x=df['bmi'], ax=ax_hist, kde=True)

# Add grid lines to the histogram on the x-axis
ax_hist.grid(axis='x')

# Omit the x-axis label for the box plot
ax_box.set(xlabel='')

# Add a title to the plot
plt.suptitle('Distribution of Body Mass Index (BMI)')

# Display the figure
plt.show()

# Optional Content

There are several ways to find outliers in your data using Python. One is the interquartile range (IQR) method, which treats a value as an outlier if it is below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles of the data and IQR = Q3 - Q1. The pandas library can compute the quartiles, the IQR, and the resulting outliers for each column of your dataframe. Here is an example for the bmi column:

In [None]:
# Import pandas library
import pandas as pd

# Load the diabetes dataset
from sklearn.datasets import load_diabetes
diabetes = load_diabetes()
df = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Calculate the IQR for the bmi column
Q1 = df['bmi'].quantile(0.25)
Q3 = df['bmi'].quantile(0.75)
IQR = Q3 - Q1

# Find the outliers using the IQR rule
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
outliers = df[(df['bmi'] < lower_bound) | (df['bmi'] > upper_bound)]

# Print the number and index of outliers
print("Number of outliers:", len(outliers))
print("Outlier indices:", outliers.index.values)

# Print the values of outliers
print("Outlier values:", outliers['bmi'].values)
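
The same rule can be applied to every column at once. The sketch below is one possible way to do it, assuming all columns are numeric. The helper name `iqr_outlier_mask`, its parameter `k`, and the `sample` data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def iqr_outlier_mask(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Return a boolean frame that is True where a value lies outside the IQR bounds."""
    q1 = frame.quantile(0.25)   # first quartile of each column
    q3 = frame.quantile(0.75)   # third quartile of each column
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Comparing a DataFrame with a Series aligns the Series on the column names
    return (frame < lower) | (frame > upper)

# Hypothetical data: column x contains one extreme value, column y does not
sample = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0],
    "y": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0],
})

mask = iqr_outlier_mask(sample)

# Number of outliers found in each column
print(mask.sum())
```

The boolean mask can then be passed to `sample.mask(mask)` to turn the flagged values into NaN before filling them.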