# Dataset

In this notebook you'll get familiar with the notions of datasets and dataset splitting.

---

## Part 1: Understanding Datasets

In this part, you'll learn how to load and visualize datasets using scikit-learn.

In [1]:
!pip install numpy
!pip install matplotlib
!pip install scikit-learn



In [2]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

### What is a Dataset?
In machine learning, a dataset is a collection of data points used to train and test models. It typically includes:
- **Features (input variables):** Attributes or properties used to make predictions.
- **Target (output variable):** What we are trying to predict (for supervised learning).
- **Samples:** Individual data points.
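
To make these three terms concrete, here is a minimal sketch using a small hand-written NumPy array. The values and the names `X` and `y` are illustrative assumptions, not taken from a real dataset; the point is only that each row is a sample, each column is a feature, and there is one target value per sample.

```python
import numpy as np

# Hypothetical measurements: 4 samples (rows), 3 features (columns).
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 6.0],
    [5.8, 2.7, 5.1],
])

# One target value per sample (here, a made-up class label).
y = np.array([0, 0, 1, 1])

n_samples, n_features = X.shape
print(f"samples: {n_samples}, features per sample: {n_features}")
print(f"first sample: features={X[0]}, target={y[0]}")
```
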

We will work with the **Iris dataset**, a classic dataset used to classify flowers into one of three species based on their features.

---

### Loading and Exploring a Dataset

#### Exercise 1: Load the Iris dataset
We will use the `load_iris()` function from `sklearn.datasets` to load the Iris dataset and explore its structure.
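
As a hint, one possible way to approach this is sketched below. `load_iris()` returns a dictionary-like `Bunch` object; the variable name `iris` and the choice to print only the first 500 characters of the description are assumptions made to keep the output short.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris dataset (no download needed).
iris = load_iris()

# The returned object behaves like a dictionary; list what it contains.
print(sorted(iris.keys()))

# DESCR holds a plain-text description of the dataset.
print(iris.DESCR[:500])
```
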

In [3]:
# Load the Iris dataset

# Print dataset description


#### Exercise 2: Explore Dataset Attributes
Print out key attributes like:
- Features (`data`)
- Target (`target`)
- Feature names (`feature_names`)
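
A possible sketch of this exploration follows. It reloads the dataset so the snippet runs on its own; the names `X` and `y`, and the extra per-class count, are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Features: one row per sample, one column per feature.
print("data shape:", X.shape)
print("feature names:", iris.feature_names)
print("first 3 rows:\n", X[:3])

# Target: one integer class label per sample.
print("target shape:", y.shape)
print("target names:", list(iris.target_names))

# How many samples belong to each class.
print("samples per class:", np.bincount(y))
```
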

In [4]:
# Explore dataset attributes


#### Exercise 3: Visualize the Data
- Create scatter plots to explore relationships between features.
- Use histograms to see the distribution of feature values.
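
The sketch below shows one way to draw both kinds of plot side by side. The choice of features (columns 0 and 2), the colormap, and the number of bins are arbitrary assumptions; try other combinations as part of the exercise.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter plot of two features, colored by class label.
ax_scatter.scatter(X[:, 0], X[:, 2], c=y, cmap="viridis")
ax_scatter.set_xlabel(iris.feature_names[0])
ax_scatter.set_ylabel(iris.feature_names[2])
ax_scatter.set_title("Two features, colored by species")

# Histogram of a single feature across all samples.
ax_hist.hist(X[:, 2], bins=20)
ax_hist.set_xlabel(iris.feature_names[2])
ax_hist.set_ylabel("count")
ax_hist.set_title("Distribution of one feature")

fig.tight_layout()
```

In a notebook the figure is rendered when the cell finishes; in a plain script, add `plt.show()` at the end.
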

In [None]:
# Visualize the data


### Exercise 4: Explore a Different Dataset
This time, we’ll work with the **California Housing dataset**, which is used for regression tasks. Each data point represents a block group in California, with features describing socioeconomic and geographical characteristics, and the target being the median house value of that block group.

Your tasks are:
1. Load the California dataset using `fetch_california_housing`.
2. Explore the dataset's structure:
   - Print the dataset description.
   - Check the names of the features and the target.
   - Display the first few rows of data.
3. Visualize:
   - Create scatter plots to explore the relationship between a chosen feature and the target.
   - Plot histograms for individual features to understand their distribution.
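
The exploration steps above can be sketched with a small helper. The functions `summarize` and `explore_california_housing` are hypothetical names introduced here, not part of scikit-learn. Note that `fetch_california_housing()` downloads the data on first use, so the sketch wraps that call in a function you run explicitly.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.utils import Bunch


def summarize(bunch: Bunch, n_rows: int = 5) -> dict:
    """Collect basic facts about a Bunch that has `data` and `target` arrays."""
    return {
        "n_samples": bunch.data.shape[0],
        "n_features": bunch.data.shape[1],
        "feature_names": list(bunch.feature_names),
        # Not every Bunch carries target names, so fall back to None.
        "target_names": bunch.get("target_names"),
        "head": bunch.data[:n_rows],
        "target_head": bunch.target[:n_rows],
    }


def explore_california_housing() -> dict:
    """Fetch the dataset (network access on first call) and print a summary."""
    housing = fetch_california_housing()
    print(housing.DESCR[:500])
    summary = summarize(housing)
    for key, value in summary.items():
        print(f"{key}: {value}")
    return summary
```

Calling `explore_california_housing()` in a cell prints the start of the description followed by the summary; the scatter plots and histograms can then be built the same way as for Iris, with the target on the y-axis of the scatter plot.
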


In [6]:
# Load the California Housing dataset

# Print dataset description

# Explore dataset attributes

# Visualize the data


## Part 2: Dataset Splitting

### Why Split a Dataset?
To evaluate the performance of a machine learning model, it’s crucial to test it on data it hasn’t seen before. This ensures the model generalizes well to new data.

- **Training Set**: Used to train the model.
- **Testing Set**: Used to evaluate the model's performance on unseen data.

### Exercise 5: Splitting a Dataset
We will use `sklearn.model_selection.train_test_split` to split the dataset into training and testing sets. Use the California Housing dataset loaded earlier.
You will:
1. Split the dataset into training and testing sets. You can use a common 80% train, 20% test proportion.
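
As a hint, here is a minimal sketch of an 80/20 split. It uses synthetic data (an assumption, so that it runs without downloading anything); in the exercise, replace `X` and `y` with the California Housing features and target. The `random_state` value is arbitrary and only makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic regression data: 100 samples, 3 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# test_size=0.2 keeps 20% of the samples for testing; the rest is for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print("train:", X_train.shape, y_train.shape)  # train: (80, 3) (80,)
print("test: ", X_test.shape, y_test.shape)    # test:  (20, 3) (20,)
```
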


In [7]:
# Use the California dataset loaded earlier

# Use train_test_split to create training and testing sets (80% train, 20% test)


2. Explore the sizes of the resulting datasets.

In [8]:
# Print the shapes of the splits


3. Visualize the training and testing distributions (for regression, visualize the target distributions).
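
One way to compare the two target distributions is to overlay normalized histograms that share the same bin edges. The sketch below again uses a synthetic, skewed target as a stand-in (an assumption); swap in the `y_train` and `y_test` arrays from your own split.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a regression dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
y = rng.gamma(shape=2.0, scale=1.0, size=1000)

_, _, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Shared bin edges so both histograms are directly comparable.
bins = np.histogram_bin_edges(y, bins=30)

fig, ax = plt.subplots(figsize=(6, 4))
# density=True normalizes each histogram, since the two sets differ in size.
ax.hist(y_train, bins=bins, density=True, alpha=0.5, label="train")
ax.hist(y_test, bins=bins, density=True, alpha=0.5, label="test")
ax.set_xlabel("target")
ax.set_ylabel("density")
ax.set_title("Target distribution: train vs. test")
ax.legend()
```

If the split is random, the two histograms should look broadly similar; large differences suggest the test set is not representative of the training set.
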

In [9]:
# Plot histograms of the training and testing target distributions


#### Now feel free to explore more datasets, from scikit-learn or elsewhere.

https://scikit-learn.org/1.5/datasets/toy_dataset.html