
# CSC 319 1.5 Machine Learning 1
## Practical 2 [Lab Sheet] - Getting Our Hands Dirty with Data

In this section, we will go through an example project related to real estate. The main steps are:
1. Look at the big picture.
2. Get the data.
3. Discover and visualize the data to gain insights.
4. Prepare the data for Machine Learning algorithms.


#### Setup

First, let's import a few common modules and check that Python 3.5 or later is installed (although Python 2.x may work, it is deprecated so we strongly recommend you use Python 3 instead), as well as Scikit-Learn ≥ 0.20.

In [0]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

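# Scikit-Learn ≥0.20 is required
import sklearn
# Compare version numbers numerically; a plain string comparison mis-orders versions such as "0.9" and "0.20"
assert tuple(int(part) for part in sklearn.__version__.split(".")[:2]) >= (0, 20)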

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

import warnings

#### Working with Real Data

* Popular open data repositories:
 * [UC Irvine Machine Learning Repository](http://archive.ics.uci.edu/ml/)
 * [Kaggle datasets](https://www.kaggle.com/datasets)
 * [Amazon’s AWS datasets](https://registry.opendata.aws/)

* Meta portals (they list open data repositories):
 * http://dataportals.org/
 * http://opendatamonitor.eu/
 * http://quandl.com/

* Other pages listing many popular open data repositories:
 * [Wikipedia’s list of Machine Learning datasets](https://homl.info/9)
 * [Quora.com question](https://homl.info/10)
 * [Datasets subreddit](https://www.reddit.com/r/datasets)

In this section, we will be working with the [California Housing Prices Dataset](https://github.com/gurupratap-matharu/machine-learning-regression/tree/master/dataset).



### 1. Look at the Big Picture

Your task is to build a model of housing prices in California using the California census data. This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short.

Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics.

#### Frame Your Problem

First, you need to frame the problem: is it supervised, unsupervised, or reinforcement learning? Is it a classification task, a regression task, or something else? Should you use batch learning or online learning techniques?

#### Select a Performance Measure

A typical performance measure for regression problems is the Root Mean Square Error (RMSE). Another such measure is the Mean Absolute Error (MAE).
Both the RMSE and the MAE are ways to measure the distance between two vectors: the vector of predictions and the vector of target values.
The RMSE is more sensitive to outliers than the MAE because it squares each error before averaging. But when outliers are exponentially rare (as in a bell-shaped curve), the RMSE performs very well and is generally preferred.
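
To make the difference between the two measures concrete, here is a minimal sketch. The helper names `rmse` and `mae` and all the numbers are made up for illustration; they do not come from the housing dataset. Both prediction vectors have the same total absolute error, but one concentrates it in a single large miss.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean of the squared errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error: mean of the absolute errors."""
    errors = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return float(np.mean(np.abs(errors)))

# Hypothetical target values and two hypothetical prediction vectors
targets = [200000, 150000, 300000, 250000]
preds_even = [210000, 140000, 310000, 240000]     # every prediction is off by 10,000
preds_outlier = [200000, 150000, 300000, 290000]  # a single prediction is off by 40,000

print("MAE  (even errors):", mae(targets, preds_even))
print("MAE  (one outlier):", mae(targets, preds_outlier))
print("RMSE (even errors):", rmse(targets, preds_even))
print("RMSE (one outlier):", rmse(targets, preds_outlier))
```

The MAE is identical for both vectors, while the RMSE doubles for the vector with the single large error, which is what "more sensitive to outliers" means in practice.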


### 2. Get the Data

You can find the hosted dataset here: https://github.com/gurupratap-matharu/machine-learning-regression/blob/master/dataset/housing.csv

You are required to:
1. Read the data from the above URL and store it in a variable.
2. Print the first five rows of your loaded dataset (HINT: use the head() method).
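
As a hint, the sketch below shows the mechanics on a tiny in-memory CSV. The two sample rows are illustrative stand-ins, and `github_blob_to_raw` is a hypothetical helper, not part of pandas. It reflects the assumption that a GitHub "blob" page serves HTML, so `pd.read_csv` usually needs the corresponding raw-file URL instead; the resulting URL has not been verified here.

```python
import io
import pandas as pd

def github_blob_to_raw(url):
    """Hypothetical helper: turn a GitHub 'blob' page URL into the usual raw-file URL."""
    url = url.replace("https://github.com/", "https://raw.githubusercontent.com/", 1)
    return url.replace("/blob/", "/", 1)

blob_url = ("https://github.com/gurupratap-matharu/machine-learning-regression"
            "/blob/master/dataset/housing.csv")
raw_url = github_blob_to_raw(blob_url)
print(raw_url)
# pd.read_csv accepts a URL directly, for example: housing = pd.read_csv(raw_url)

# Offline stand-in: a two-row CSV with a few illustrative columns
sample_csv = io.StringIO(
    "longitude,latitude,median_house_value,ocean_proximity\n"
    "-122.23,37.88,452600.0,NEAR BAY\n"
    "-118.30,34.05,200000.0,INLAND\n"
)
sample = pd.read_csv(sample_csv)
print(sample.head())  # head() returns the first five rows by default
```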

In [0]:
# YOUR CODE SHOULD COME HERE

### 3. Discover and visualize the data to gain insights

The info() method is useful to get a quick description of the data, in particular the total number of rows, and each attribute’s type and number of non-null values.

You are required to:
1. Use the info() method to find out the types of the data fields present in our dataset and identify fields with null values.
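
The following sketch shows what to look for, using a small hypothetical DataFrame named `toy` (the values are invented). Alongside `info()`, `isnull().sum()` gives the number of missing values per column directly.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame with two missing values in total_bedrooms
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# Prints the row count, each column's type, and its non-null count
toy.info()

# Number of missing values in each column
null_counts = toy.isnull().sum()
print(null_counts)
```

A column whose non-null count is lower than the total number of rows contains missing values.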

In [0]:
# YOUR CODE SHOULD COME HERE

Note: The total_bedrooms attribute has only 20433 non-null values, while all the other attributes have 20640 non-null values, so 207 districts are missing this feature.

When you looked at the top five rows, you probably noticed that the values in the ocean_proximity column were repetitive, which means that it is probably a categorical attribute, while the rest are numerical attributes.

You are required to:
1. Find out all the different categories of values in the 'ocean_proximity' data column, and count the number of values in each category, using the value_counts() method.
2. Using the describe() method, find a summary of the numerical attributes of your dataset.
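
Here is a minimal sketch of both methods on a hypothetical DataFrame named `toy` with invented values. It also shows that `describe()` summarizes only the numerical columns of a mixed frame, and that its `count` row excludes missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one numerical column with a missing value, one categorical column
toy = pd.DataFrame({
    "median_income": [8.3, 3.1, 5.6, np.nan],
    "ocean_proximity": ["NEAR BAY", "INLAND", "INLAND", "NEAR OCEAN"],
})

# Number of rows in each category, most frequent first
counts = toy["ocean_proximity"].value_counts()
print(counts)

# count, mean, std, min, quartiles and max of the numerical columns
summary = toy.describe()
print(summary)
```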

In [0]:
# YOUR CODE SHOULD COME HERE

In [0]:
# YOUR CODE SHOULD COME HERE

Note: The describe() method has ignored the null values in the total_bedrooms data column.

Another great way to get a feel of the data is to visualize it through a histogram.

You are required to:
1. Draw histograms for all of the numerical attributes, using the whole dataset, in one single figure.
2. Draw a histogram for each of the numerical attributes, using the whole dataset, with one figure per attribute.

In [0]:
# YOUR CODE SHOULD COME HERE

In [0]:
# YOUR CODE SHOULD COME HERE

Sometimes, our data may have outliers. These can badly distort the model when we train our machine learning algorithm.

One of the best plots for visualizing outliers is a Box and Whisker plot.

You are required to:
1. Draw a Box and Whisker plot for the numerical attributes of our dataset.
2. Draw a Scatter plot for longitude and latitude columns of our dataset.
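
To help you read a Box and Whisker plot, the sketch below reproduces numerically the rule that matplotlib's box plot uses by default to mark outliers: points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 15.0], name="median_income")

q1 = values.quantile(0.25)  # first quartile (bottom edge of the box)
q3 = values.quantile(0.75)  # third quartile (top edge of the box)
iqr = q3 - q1               # interquartile range (height of the box)

# Points outside these fences are drawn as individual markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print("Fences:", lower_fence, upper_fence)
print(outliers)
```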

In [0]:
# YOUR CODE SHOULD COME HERE

In [0]:
# YOUR CODE SHOULD COME HERE

If you set the alpha option of the scatter plot to a low value such as 0.1, the result is much better: you can clearly see the high-density areas, namely the Bay Area and around Los Angeles and San Diego, plus a long line of fairly high density in the Central Valley, in particular around Sacramento and Fresno.

Our brains are very good at spotting patterns in pictures, but we need to choose the plot parameters carefully so that the visualization brings out the valuable details.

Now let’s look at the housing prices. In the plot below, the radius of each circle represents the district’s population (option s), and the color represents the price (option c). We use a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices).

In [0]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
        s=housing["population"]/100, label="population", figsize=(10,7),
        c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()

This image tells you that the housing prices are very much related to the location and to the population density.

#### Looking for Correlations

Since the dataset is not too large, we can easily compute the 'standard correlation coefficient' between every pair of attributes, using the corr() method.

In [0]:
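# Keep only the numerical columns: ocean_proximity holds text, and recent pandas versions raise an error on it
corr_matrix = housing.select_dtypes(include=[np.number]).corr()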
corr_matrix["median_house_value"].sort_values(ascending=False)
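
To see what the standard correlation coefficient (Pearson's r) measures, the sketch below computes it by hand and compares the result with pandas. The DataFrame `toy` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which house value rises almost linearly with income
toy = pd.DataFrame({
    "median_income": [2.0, 3.0, 4.0, 5.0, 6.0],
    "median_house_value": [110.0, 160.0, 190.0, 260.0, 280.0],
})

x = toy["median_income"].to_numpy()
y = toy["median_house_value"].to_numpy()

# Center both variables, then divide their co-variation by the product of their spreads
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# Series.corr computes the same Pearson coefficient by default
r_pandas = toy["median_income"].corr(toy["median_house_value"])
print(r_manual, r_pandas)
```

The coefficient ranges from -1 to 1 and captures only linear relationships, so a value near 0 does not rule out a nonlinear dependence.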

Another way to check for correlation between attributes is to use the scatter_matrix function provided by the pandas package, which plots every numerical attribute against every other numerical attribute.

Since there are 9 numerical attributes, you would get 9^2 = 81 plots, which would not fit on a page, so let’s just focus on a few promising attributes that seem most correlated with the median housing value.

In [0]:
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

Let's zoom in to the plot between median_income and median_house_value.

You are required to: 
1. Draw a scatter plot between median_income and median_house_value.

In [0]:
# YOUR CODE SHOULD COME HERE

### 4. Prepare the data for Machine Learning algorithms

#### Create Test and Train Subsets

Creating a test set is theoretically quite simple: just pick some instances randomly, typically 20% of the dataset (or less if your dataset is very large), and set them aside.

In [0]:
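def split_train_test(data, test_ratio):
  # Shuffle the row positions; the result differs on every run unless a seed is set
  shuffled_indices = np.random.permutation(len(data))
  # The first test_ratio share of the shuffled positions becomes the test set
  test_set_size = int(len(data) * test_ratio)
  test_indices = shuffled_indices[:test_set_size]
  train_indices = shuffled_indices[test_set_size:]
  # iloc selects rows by position, so this works for any index
  return data.iloc[train_indices], data.iloc[test_indices]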
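
Because the function above reshuffles on every call, each run produces a different test set. One possible variant, sketched here under the assumption that a fixed seed is acceptable for your experiment, makes the split reproducible. The name `split_train_test_seeded`, the `seed` parameter and the `toy` DataFrame are illustrative additions.

```python
import numpy as np
import pandas as pd

def split_train_test_seeded(data, test_ratio, seed=42):
    """Variant of split_train_test that returns the same split for the same seed."""
    rng = np.random.default_rng(seed)              # generator with a fixed seed
    shuffled_indices = rng.permutation(len(data))  # shuffled row positions
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# Hypothetical ten-row frame standing in for the housing data
toy = pd.DataFrame({"value": np.arange(10)})

train_a, test_a = split_train_test_seeded(toy, 0.2)
train_b, test_b = split_train_test_seeded(toy, 0.2)

print(len(train_a), "train +", len(test_a), "test")
print("Same test rows on both calls:", test_a.index.equals(test_b.index))
```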

You are required to:
1. Create two subsets of your dataset, with 20% randomly chosen data rows as the test_set and the rest as train_set.
2. View the size of the train_set and test_set.

In [0]:
# YOUR CODE SHOULD COME HERE

In [0]:
# YOUR CODE SHOULD COME HERE

In [0]:
# YOUR CODE SHOULD COME HERE