# End-to-End Machine Learning Project

#### Data used:
California Housing Prices dataset from the StatLib repository.

# Project Checklist 

- [x] Frame the problem and look at the bigger picture
- [x] Get the data
- [ ] Explore the data to gain insights
- [ ] Prepare the data to better expose the underlying data patterns to ML algorithms
- [ ] Explore many different models and shortlist the best ones
- [ ] Fine-tune the models and combine them into a solution
- [ ] Present the solution
- [ ] Launch, monitor and maintain the system
  

# Problem Statement

Welcome to the Machine Learning Housing Corporation! Your first task is to use California census data to build a model of housing prices in the state. This data includes metrics such as the population, median income, and median housing price for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). I will call them “districts” for short.
Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.
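
In code terms, this is a supervised regression task: one column is the target and the remaining columns are the inputs. The snippet below is a minimal sketch of that split on a toy table. The rows are made-up values for illustration only, and the column names (`population`, `median_income`, `median_house_value`) are assumed to match those in the dataset's CSV file.

```python
import pandas as pd

# Toy districts with made-up values, used only to illustrate the framing.
# Column names are assumed to match the real dataset.
districts = pd.DataFrame({
    "population": [1200, 850, 2300],
    "median_income": [3.5, 8.1, 2.2],
    "median_house_value": [180000.0, 420000.0, 95000.0],
})

# Target (label): the value the model should predict for each district.
y = districts["median_house_value"]

# Features: all the other metrics.
X = districts.drop(columns=["median_house_value"])

print(X.columns.tolist())
print(y.name)
```

Each row is one district, so a trained model maps a row of `X` to a single predicted price.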

# Getting the Data


Rather than manually downloading and decompressing the data, it’s usually preferable to write a function that does it for you. Automating this step helps in two ways:

- If the data changes regularly, you can write a small script that uses the function to fetch the latest data, or set up a scheduled job to do so automatically at regular intervals.
- If you need to install the dataset on multiple machines, the same function works on each of them.

In [1]:
# Set up the libraries

# Check the Python version (3.7 or later is required)
import sys
assert sys.version_info >= (3, 7)

# Check the scikit-learn version (1.0.1 or later is required)
from packaging import version
import sklearn
assert version.parse(sklearn.__version__) >= version.parse("1.0.1")

from pathlib import Path
import pandas as pd
import tarfile
import urllib.request

In [3]:
# Download the data (if not already cached) and load it
def load_housing_data():
    """Fetch the housing tarball on first use and return the CSV as a DataFrame."""
    tarball_path = Path("datasets/housing.tgz")
    # Download and extract only when the tarball is not already on disk
    if not tarball_path.is_file():
        Path("datasets").mkdir(parents=True, exist_ok=True)
        url = "https://github.com/ageron/data/raw/main/housing.tgz"
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as housing_tarball:
            housing_tarball.extractall(path="datasets")
    return pd.read_csv(Path("datasets/housing/housing.csv"))


housing = load_housing_data()
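
Note that `load_housing_data()` skips the download whenever the tarball already exists, so a scheduled job that calls it would never pick up updated data. One possible way around this is a variant with a `force` flag. The helper below is a hypothetical sketch, not part of the original notebook: the name `fetch_tarball` and its parameters are assumptions made for illustration.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_tarball(url, data_dir="datasets", filename="housing.tgz", force=False):
    """Download a tarball into data_dir and extract it there.

    Hypothetical helper for illustration. With force=True the tarball is
    downloaded and extracted again even if a local copy already exists.
    """
    data_dir = Path(data_dir)
    tarball_path = data_dir / filename
    if force or not tarball_path.is_file():
        data_dir.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, tarball_path)
        with tarfile.open(tarball_path) as tarball:
            tarball.extractall(path=data_dir)
    return tarball_path
```

A scheduled refresh script could then call `fetch_tarball(url, force=True)` with the same URL used above and reload the CSV with `pd.read_csv`, while interactive sessions keep the default `force=False` to reuse the cached copy.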
