# Notebook Imports

In [None]:
# from sklearn.datasets import load_boston
# ImportError: 
# `load_boston` has been removed from scikit-learn since version 1.2.

# In this special case, you can fetch the dataset from the original source:

# import pandas as pd
# import numpy as np

# data_url = "http://lib.stat.cmu.edu/datasets/boston"
# raw_df = pd.read_csv(data_url, sep=r"\s+", skiprows=22, header=None)
# data = np.hstack([raw_df.values[::2, :], raw_df.values[1::2, :2]])
# target = raw_df.values[1::2, 2]

In [None]:
# print(type(raw_df))
# print(type(data))
# print(type(target))

<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


1. Problem Definition

- First step in data science/ML: define the problem clearly and ask the right questions.

- Boston housing example: friend asks “How much does a house cost in Boston?” — without more info, the only valid answer is the average home price (~$567,500).

- Shows why vague or poorly phrased questions → weak solutions.

2. Real-World Example (Boston Housing)

- Factors affecting price: size of house, location (downtown/suburbs), features (rooms, crime, schools, etc.).

- Boss’s project: build a valuation tool for real estate agents in Boston.

Tool must:

- Predict house price based on features.

- Show contribution of each feature (interpretable, not a black box).

- Provide quick benchmark prices.

- Be user-friendly (could even be public like Zillow/Zoopla).

3. Key Takeaways

- Always start with clear, well-phrased goals.

- Need both relevant features and interpretable models.

- The “average price” approach works only when no other info is available, but it’s too simplistic for real-world use.
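To make the "average price" answer concrete, here is a minimal sketch of a mean baseline. The four prices are made-up illustrative values (picked so that their mean equals the ~$567,500 figure quoted earlier), not real Boston data, and `mean_baseline` is a hypothetical helper name.

```python
from statistics import mean

# Hypothetical sale prices in dollars; illustrative only, not real Boston data.
prices = [300_000, 450_000, 600_000, 920_000]


def mean_baseline(values):
    """Return a single number, the average, as the prediction for every house."""
    return mean(values)


baseline = mean_baseline(prices)

# Mean absolute error: how far off the one-number answer is, on average.
mae = mean(abs(p - baseline) for p in prices)

print(f"Baseline prediction: ${baseline:,.0f}")
print(f"Mean absolute error: ${mae:,.0f}")
```

The baseline ignores every feature, so its error is a useful floor: a model that uses size, location, and so on should do noticeably better than this.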

## It didn't work as expected...
### scikit-learn removed the Boston dataset in version 1.2 because of ethical concerns.
### Here are the notes anyway:

1. Data Gathering

- After defining the problem, the next step in DS/ML is gathering the data.

Typical sources:

- Downloading CSV files found online (e.g., via a Google search).

- Using practice datasets from Python libraries (e.g., scikit-learn).

- Scikit-learn provides clean, user-friendly toy datasets (few missing values, fewer formatting issues).

- Examples: Boston housing, Iris (flowers), Diabetes, Digits, Wine, Breast Cancer.

**Summary:**

The second stage in ML is data gathering. Scikit-learn used to offer clean practice datasets like the Boston housing dataset, which contained 506 samples and 13 features. After importing with load_boston, the data was stored in a Bunch object. While raw output was messy, the dataset was available in Jupyter, ready for exploration and preprocessing.

# CHANGE OF PLANS:
### We will use the California housing dataset instead.

# IMPORTING DATASET

In [2]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()

We can also add the original source as a clickable link using Markdown: `[text](URL)`.

[Click here to see the original dataset source](https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html)

In [3]:
type(housing) # Note: convert to DataFrame later for easier handling.

sklearn.utils._bunch.Bunch
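The note about converting to a DataFrame could look roughly like the sketch below. `bunch_to_frame` is a hypothetical helper, and `toy` is a tiny stand-in with placeholder numbers (not real rows) so the example runs without downloading anything.

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd


def bunch_to_frame(bunch, target_name="MedHouseVal"):
    """Combine a Bunch-like object's features and target into one DataFrame."""
    df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    df[target_name] = bunch.target
    return df


# Stand-in object with placeholder values; the real `housing` has 8 features.
toy = SimpleNamespace(
    data=np.array([[8.0, 41.0], [2.0, 16.0]]),
    feature_names=["MedInc", "HouseAge"],
    target=np.array([4.5, 0.9]),
)

df = bunch_to_frame(toy)
print(df.shape)
print(df.columns.tolist())
```

With the real object this would be `bunch_to_frame(housing)`. scikit-learn can also build the frame itself: `fetch_california_housing(as_frame=True)` fills the `frame` attribute with a pandas DataFrame.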

In [10]:
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]], shape=(20640, 8)),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894], shape=(20640,)),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': 

1. **Exploring the Dataset**

After defining the problem and gathering data, next step = exploring the dataset.

Exploration, visualization, and cleaning often happen together (issues only appear once you dig in).

Good starting questions for any dataset:

- What’s the source of the data?

- Is there a description/context of how it was collected?

- How many data points (rows)?

- How many features (columns)?

- What are the names of the features?

- What are the descriptions/units of the features?
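Most of these questions can be answered straight from the object's attributes. The sketch below assumes a Bunch-like object with `data`, `feature_names`, and `DESCR`; `describe_dataset` is a hypothetical helper and `toy` is a placeholder stand-in so the code runs offline.

```python
from types import SimpleNamespace

import numpy as np


def describe_dataset(bunch):
    """Collect row count, feature count, and feature names in one dict."""
    n_rows, n_features = bunch.data.shape
    return {
        "rows": n_rows,
        "features": n_features,
        "feature_names": list(bunch.feature_names),
        # Source, context, and units live in the free-text description.
        "has_description": bool(getattr(bunch, "DESCR", "")),
    }


# Placeholder stand-in; pass the real `housing` object instead.
toy = SimpleNamespace(
    data=np.zeros((4, 2)),
    feature_names=["MedInc", "HouseAge"],
    DESCR="Toy description text.",
)

summary = describe_dataset(toy)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The source, collection context, and units are not structured fields, so they still have to be read from the `DESCR` text.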

In [12]:
dir(housing)

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

In [16]:
print(type(housing.DESCR))
print(housing.DESCR)

<class 'str'>
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using 

In [4]:
print(type(housing.data))
print(housing.data.shape)
print(housing.data)

<class 'numpy.ndarray'>
(20640, 8)
[[   8.3252       41.            6.98412698 ...    2.55555556
    37.88       -122.23      ]
 [   8.3014       21.            6.23813708 ...    2.10984183
    37.86       -122.22      ]
 [   7.2574       52.            8.28813559 ...    2.80225989
    37.85       -122.24      ]
 ...
 [   1.7          17.            5.20554273 ...    2.3256351
    39.43       -121.22      ]
 [   1.8672       18.            5.32951289 ...    2.12320917
    39.43       -121.32      ]
 [   2.3886       16.            5.25471698 ...    2.61698113
    39.37       -121.24      ]]


In [18]:
print(type(housing.target))
print(housing.target)

<class 'numpy.ndarray'>
[4.526 3.585 3.521 ... 0.923 0.847 0.894]
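Because `DESCR` states that the target is expressed in units of $100,000, the printed values can be turned into dollar amounts by multiplying. A small sketch using only the six values visible in the output above:

```python
import numpy as np

# First and last three target values, copied from the printed output.
target_sample = np.array([4.526, 3.585, 3.521, 0.923, 0.847, 0.894])

# The unit is $100,000, so scale up to get median house values in dollars.
target_dollars = target_sample * 100_000

for value in target_dollars:
    print(f"${value:,.0f}")
```

The same multiplication applies to the full array, e.g. `housing.target * 100_000`.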


In [19]:
print(type(housing.feature_names))
print(housing.feature_names)

<class 'list'>
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']


In [20]:
print(type(housing.frame))
print(housing.frame)

<class 'NoneType'>
None


Dot notation & nesting:

- housing → Bunch object.

- .data → numpy array.

- .shape → tuple (rows, columns).

- Like Inception, dreams within dreams.
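The nesting can be traced one dot at a time. scikit-learn's Bunch behaves like a dictionary whose keys are also readable as attributes; `MiniBunch` below is a hypothetical imitation of that behavior (not the library class), used so the sketch runs without scikit-learn.

```python
import numpy as np


class MiniBunch(dict):
    """Hypothetical stand-in: a dict whose keys can also be read as attributes."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError as exc:
            raise AttributeError(name) from exc


# Placeholder array with the same number of columns as the housing data.
toy = MiniBunch(data=np.zeros((5, 8)))

levels = [
    toy,                # the container object
    toy.data,           # the array stored inside it
    toy.data.shape,     # the array's (rows, columns) tuple
    toy.data.shape[0],  # the row count inside that tuple
]

for level in levels:
    print(type(level).__name__)
```

Each dot (or index) steps one level deeper, and `toy.data` and `toy["data"]` return the same object.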

**Summary:**

Exploring a dataset begins with six key questions about its source, size, and features. The California dataset has 20,640 samples and 8 attributes, with context provided by its description text. Using '.DESCR' and '.shape' in Python helps confirm this information. Data exploration not only reveals structure but also teaches how Python objects nest attributes inside one another.