
# Access To Datasets

This notebook demonstrates **where datasets come from**, how to **load benchmark datasets**, and how to **inspect them before preprocessing**.

We cover:
- Online benchmark sources (conceptual)
- Supervised (classification) dataset example
- Unsupervised dataset example
- Datasets directly available in Python libraries

## 1. Benchmark Dataset Sources (Overview)

Common sources:
- **Kaggle** (competitions & research datasets)
- **UCI Machine Learning Repository** (classic ML datasets)
- **OpenML** (datasets + ML workflows)
- **Google Dataset Search** (dataset search engine)

In this course, we focus on *how data looks after you download it* and how to prepare it for ML.
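As a rough illustration of what "how data looks after you download it" means in practice, the sketch below reads a small CSV and runs three quick checks. The inline text stands in for a downloaded file; the column names and the missing value are assumptions made up for this example.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a downloaded file
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
4.9,,setosa
6.3,3.3,virginica
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (rows, columns)
print(df.dtypes)        # inferred type of each column
print(df.isna().sum())  # number of missing values per column
```

With a real download, `io.StringIO(csv_text)` would simply be replaced by the path to the file.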

## 2. Supervised Learning Example (Classification)
### Iris Dataset (from scikit-learn)

- Task: **Classification**
- Goal: Predict flower species
- Labels are available → *Supervised Learning*

In [5]:
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='class')

X.head(), y.head()

(   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
 0                5.1               3.5                1.4               0.2
 1                4.9               3.0                1.4               0.2
 2                4.7               3.2                1.3               0.2
 3                4.6               3.1                1.5               0.2
 4                5.0               3.6                1.4               0.2,
 0    0
 1    0
 2    0
 3    0
 4    0
 Name: class, dtype: int64)

### Dataset Shape and Classes

In [6]:
print('Features shape:', X.shape)
print('Target classes:', iris.target_names)

Features shape: (150, 4)
Target classes: ['setosa' 'versicolor' 'virginica']
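The integer codes in `iris.target` index into `iris.target_names`, so they can be turned into readable labels. The sketch below is one possible way to do that and to check class balance; the variable names `species` and `class_counts` are chosen here for illustration.

```python
from sklearn.datasets import load_iris
import pandas as pd

iris = load_iris()

# Map integer codes (0, 1, 2) to species names via iris.target_names
species = pd.Series(iris.target, name='species').map(
    lambda code: str(iris.target_names[code])
)

# Count samples per class to check whether the dataset is balanced
class_counts = species.value_counts()
print(class_counts)  # each of the three species appears 50 times
```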


### Combine Features and Target

In [7]:
iris_df = pd.concat([X, y], axis=1)
iris_df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


## 3. Unsupervised Learning Example
### Wine Dataset (ignoring labels)

- Task: **Clustering / Unsupervised Learning**
- Labels exist but are NOT used
- Goal: Discover structure in data

In [8]:
from sklearn.datasets import load_wine

wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


### Dataset Shape

In [9]:
wine_df.shape

(178, 13)
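To show what "discovering structure" can look like, here is a minimal clustering sketch on the wine features. It swaps in k-means as one possible method; `n_clusters=3`, `n_init=10`, and `random_state=0` are assumptions chosen for illustration, and the labels in `wine.target` are never used.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
import pandas as pd

wine = load_wine()
wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)

# Scale features so large-valued columns such as proline do not dominate distances
X_scaled = StandardScaler().fit_transform(wine_df)

# Assumed settings for illustration; only the features are passed to the model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_scaled)

# Number of samples assigned to each cluster
print(pd.Series(cluster_labels).value_counts().sort_index())
```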

## 4. Another Unsupervised Dataset
### Digits Dataset (unsupervised view)

Each sample is an 8×8 grayscale image flattened into 64 pixel features. High-dimensional numerical data like this often requires preprocessing (scaling, PCA, etc.).

In [10]:
from sklearn.datasets import load_digits

digits = load_digits()
digits_df = pd.DataFrame(digits.data)
digits_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,54,55,56,57,58,59,60,61,62,63
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,5.0,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,9.0,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0


### Digits Dataset Shape

In [11]:
digits_df.shape

(1797, 64)
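With 64 features per sample, dimensionality reduction is a natural next step. The following is a small sketch, assuming standard scaling followed by PCA down to two components; the choice of two components is for illustration only (for example, to allow a 2D scatter plot).

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()

# Standardize the 64 pixel features before projecting
X_scaled = StandardScaler().fit_transform(digits.data)

# Project onto the two directions of largest variance (assumed choice)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                     # (1797, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```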

## 5. Why Preprocessing is Needed

Before modeling, we usually:
- Handle missing values
- Scale features
- Encode categorical variables
- Reduce dimensionality

These steps will be covered in the next sections.
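As a preview, the four steps can be sketched with scikit-learn on a tiny table. Everything in `toy` is hypothetical data invented for this example, and the median imputation and two-component PCA are assumed choices rather than recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values and one categorical column
toy = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, 58.0],
    'income': [30000.0, 52000.0, np.nan, 61000.0],
    'city': ['Berlin', 'Paris', 'Berlin', 'Rome'],
})

# Numeric columns: fill gaps with the median, then standardize
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical column: one-hot encode; sparse_threshold=0.0 forces a dense array
preprocess = ColumnTransformer(
    [
        ('num', numeric, ['age', 'income']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),
    ],
    sparse_threshold=0.0,
)

features = preprocess.fit_transform(toy)                # 2 numeric + 3 city columns
reduced = PCA(n_components=2).fit_transform(features)   # reduce to 2 dimensions

print(features.shape, reduced.shape)
```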

## Summary

- Data can come from **online repositories** or **Python libraries**
- Supervised datasets include labels
- Unsupervised learning uses only the features, even when labels exist
- Understanding data structure is the **first step of preprocessing**

## 6. Loading a Dataset from Kaggle (Example)

Kaggle is one of the most popular platforms for real-world datasets.

**Important:** Kaggle datasets are usually downloaded as CSV files.
After downloading, you load them locally using pandas.

Below we demonstrate this using the **Titanic dataset**, one of the most famous Kaggle classification datasets.

### Step 1: Download from Kaggle (Conceptual)

1. Go to https://www.kaggle.com/datasets
2. Search for **Titanic - Machine Learning from Disaster**, or open the competition data page directly: https://www.kaggle.com/c/titanic/data
3. Download `train.csv`
4. Place it in the same folder as this notebook
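Because the download is manual, a missing file is the most likely failure. The sketch below wraps `pd.read_csv` in a hypothetical helper, `load_local_csv` (a name made up for this example), that raises a clearer error when the file is not where it is expected.

```python
from pathlib import Path

import pandas as pd


def load_local_csv(path='train.csv'):
    """Hypothetical helper: read a CSV, with a clearer error if it is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        raise FileNotFoundError(
            f"{csv_path} not found. Download it from Kaggle and place it "
            "in the same folder as this notebook."
        )
    return pd.read_csv(csv_path)
```

Calling `load_local_csv('train.csv')` then returns the same DataFrame as `pd.read_csv('train.csv')` when the file is present.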

In [3]:
import pandas as pd

# Load Kaggle Titanic dataset (after manual download)
# Make sure 'train.csv' is in the same directory as this notebook

titanic = pd.read_csv('train.csv')
titanic.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [15]:
# The shape covers all 12 columns, including the target column 'Survived'
print('Dataset shape:', titanic.shape)
print('Target classes:', titanic['Survived'].unique())

Dataset shape: (891, 12)
Target classes: [0 1]


### Dataset Overview

- **Target (label):** Survived (0 = No, 1 = Yes)
- **Task:** Binary Classification (Supervised Learning)
- Contains numerical + categorical features → preprocessing required

In [16]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
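The non-null counts above translate directly into missing-value counts. As a small worked example, the sketch below does the arithmetic using only numbers copied from the `info()` output; the variable names are chosen here for illustration.

```python
total_rows = 891  # from "RangeIndex: 891 entries" in the info() output

# Non-null counts copied from the info() output for the incomplete columns
non_null = {'Age': 714, 'Cabin': 204, 'Embarked': 889}

# For each column: (number missing, percentage missing)
missing_summary = {
    column: (total_rows - count, round(100 * (total_rows - count) / total_rows, 1))
    for column, count in non_null.items()
}

for column, (n_missing, pct) in missing_summary.items():
    print(f"{column}: {n_missing} missing ({pct}%)")
# Age: 177 missing (19.9%)
# Cabin: 687 missing (77.1%)
# Embarked: 2 missing (0.2%)
```

On the loaded DataFrame, `titanic.isna().sum()` reports the same counts without manual arithmetic.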


### Why This Dataset is Good for Preprocessing

- Missing values (Age, Cabin)
- Categorical features (Sex, Embarked)
- Different feature scales

This makes it an excellent real-world example for preprocessing.
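To make the three points concrete, here is a rough pandas-only sketch on the first five rows shown earlier (a subset of columns, retyped by hand so that it runs without `train.csv`). The specific choices (a `HasCabin` indicator, median fill for Age, one-hot encoding, z-score scaling) are assumptions for illustration, not the course's prescribed method.

```python
import pandas as pd

# First five rows of train.csv as displayed earlier (subset of columns)
sample = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Cabin': [None, 'C85', None, 'C123', None],
    'Embarked': ['S', 'C', 'S', 'S', 'S'],
})

prepared = sample.copy()

# Missing values: record whether a cabin is known, then drop the sparse column
prepared['HasCabin'] = prepared['Cabin'].notna().astype(int)
prepared = prepared.drop(columns='Cabin')

# Age has no gaps in these five rows; on the full data this fills the missing ages
prepared['Age'] = prepared['Age'].fillna(prepared['Age'].median())

# Categorical features: one-hot encode Sex and Embarked
prepared = pd.get_dummies(prepared, columns=['Sex', 'Embarked'])

# Different scales: standardize Age and Fare to zero mean and unit variance
for col in ['Age', 'Fare']:
    prepared[col] = (prepared[col] - prepared[col].mean()) / prepared[col].std()

print(prepared.round(2))
```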