# Chapter 2 (End-to-End Machine Learning Project)


### Steps:

1. Look at the big picture
2. Get the data
3. Discover and visualize the data to gain insights
4. Prepare the data for Machine Learning algorithms
5. Select a model and train it
6. Fine-tune your model
7. Present your solution
8. Launch, monitor, and maintain your system

## Predicting house prices


Use the Machine Learning Project checklist (https://github.com/ageron/handson-ml/blob/master/ml-project-checklist.md) as a guide. It works for most ML projects, but you may need to adapt it to your needs.


### Frame the problem
Define your business objective. How does the company expect to use and benefit from this model? This will determine how you frame the problem, what algorithms you will select, what performance measure you will use to evaluate your model and how much effort you should spend tweaking it.




### Pipelines (Don't worry about this right now)

A sequence of data processing components is called a data pipeline. Components typically run asynchronously. Each component pulls in a large amount of data, processes it, and writes the result to another data store. Some time later, the next component in the pipeline pulls this data and produces its own output, and so on.
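The idea can be sketched with two tiny components that only communicate through data stores. This is a minimal illustration, not a real pipeline framework: the JSON files standing in for data stores, the function names `clean_component` and `summary_component`, and the sample records are all made up for this example. Here the components are simply called one after another; in a real pipeline each would be scheduled independently.

```python
import json
import tempfile
from pathlib import Path


def clean_component(raw_store: Path, clean_store: Path) -> None:
    """Pull raw records, drop those without a price, write the rest."""
    records = json.loads(raw_store.read_text())
    cleaned = [r for r in records if r.get("price") is not None]
    clean_store.write_text(json.dumps(cleaned))


def summary_component(clean_store: Path, summary_store: Path) -> None:
    """Pull cleaned records and write a small summary to the next store."""
    records = json.loads(clean_store.read_text())
    prices = [r["price"] for r in records]
    summary = {"count": len(prices), "mean_price": sum(prices) / len(prices)}
    summary_store.write_text(json.dumps(summary))


def run_demo() -> dict:
    """Run both components against temporary, file-based data stores."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "raw.json"
        clean = Path(tmp) / "clean.json"
        summary = Path(tmp) / "summary.json"

        # Hypothetical input data: one record has a missing price
        raw.write_text(json.dumps([
            {"district": "A", "price": 100.0},
            {"district": "B", "price": None},
            {"district": "C", "price": 300.0},
        ]))

        # Each component only knows its input store and its output store
        clean_component(raw, clean)
        summary_component(clean, summary)
        return json.loads(summary.read_text())


result = run_demo()
print(result)
```

Because each component depends only on the stores it reads and writes, one component can be replaced or rerun without touching the other.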


### Evaluating the current solution

The next question to ask is what the current solution looks like. 

### Design of the system

First you need to frame the problem: is it supervised, unsupervised, or Reinforcement Learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?



### Select a performance measure

For our scenario, since we are working with a regression problem, we will use the Root Mean Square Error (RMSE). It gives an idea of how much error the system typically makes in its predictions.

$$
\mathrm{RMSE}(X, h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2}
$$

Here $y^{(i)}$ is the label (the desired output value) of the $i$-th instance.

$m$ is the size of the dataset, i.e. the number of instances in the dataset you are measuring the RMSE on.


$x^{(i)}$ is a vector of all the feature values (excluding the label) of the $i$-th instance in the dataset. For example, if the first district in the dataset is located at longitude -118.29°, latitude 33.91°, and it has 1,416 inhabitants with a median income of 38,372, and the median house value is \$156,400, then






$$
x^{(1)} = \begin{pmatrix}
-118.29\\
33.91\\
1{,}416\\
38{,}372
\end{pmatrix}
$$





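$h$ is the system's prediction function, also called a hypothesis. Given an instance's feature vector $x^{(i)}$, it outputs a predicted value $\hat{y}^{(i)} = h(x^{(i)})$.


$\mathrm{RMSE}(X, h)$ is the cost function measured on the set of examples $X$ using your hypothesis $h$.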
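As a rough illustration of the definitions above, the RMSE can be computed in a few lines of plain Python. This is a minimal sketch: the function name `rmse` and the sample predictions and labels are made up for the example and are not taken from the housing dataset.

```python
import math


def rmse(predictions, labels):
    """Root Mean Square Error between predictions h(x^(i)) and labels y^(i)."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    m = len(labels)  # number of instances
    if m == 0:
        raise ValueError("need at least one instance")
    # Squared difference between each prediction and its label
    squared_errors = [(y_hat - y) ** 2 for y_hat, y in zip(predictions, labels)]
    return math.sqrt(sum(squared_errors) / m)


# Hypothetical median house values and predictions for three districts
labels = [156_400.0, 200_000.0, 310_000.0]
predictions = [158_400.0, 199_000.0, 312_000.0]

error = rmse(predictions, labels)
print(f"RMSE: {error:.2f}")
```

The individual errors here are 2,000, -1,000 and 2,000, so the RMSE comes out at roughly 1,732. It sits closer to the larger errors than a plain average of absolute errors (about 1,667) would, because squaring gives large errors more weight.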



### Check the Assumptions

It's good practice to list and verify the assumptions that have been made so far (by you or others); this can catch serious issues early on.


## Download the Data

In [14]:
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"

# Local directory for the dataset: datasets/housing
HOUSING_PATH = os.path.join("datasets", "housing")

HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"


def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory if it does not exist yet
    os.makedirs(housing_path, exist_ok=True)

    # Download the compressed archive into the target directory
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)

    # Extract the archive; the with-block closes the file afterwards
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

fetch_housing_data(HOUSING_URL, HOUSING_PATH)


The archive is downloaded from https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.tgz and saved locally as datasets/housing/housing.tgz.


## Loading the data in pandas

In [20]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

## Analyzing

In [21]:
housing = load_housing_data()
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


In [22]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


<i>Notice that the total_bedrooms attribute has only 20,433 non-null values, meaning that 207 districts are missing this feature.</i>
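The number of missing values can also be counted directly instead of being read off the info() output. The sketch below uses a small made-up DataFrame (the name `toy` and its values are illustrative, not the real housing data), so it runs without downloading anything. The same calls should work on the `housing` DataFrame.

```python
import pandas as pd

# Hypothetical miniature dataset: two of the four districts lack total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, float("nan"), 190.0, float("nan")],
})

# isnull() marks missing cells; summing the booleans counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Equivalent view: total rows minus the non-null count that info() reports
n_missing = len(toy) - toy["total_bedrooms"].count()
print(n_missing)
```

On the real dataset the second calculation corresponds to 20640 - 20433 = 207.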
    
    
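All attributes are numerical except the ocean_proximity field, whose type is object. Its values repeat in the first rows shown above (every row reads NEAR BAY), which suggests that it is a categorical attribute. To see which categories exist and how many districts belong to each, use value_counts().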

In [23]:
housing["ocean_proximity"].value_counts()

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64

The describe() method shows a summary of the numerical attributes.

In [24]:
housing.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |
