# Coursework 1: Data visualisation and simple analysis using Python

In this coursework, you will work with a dataset stored in the ".csv" format, which describes housing prices in Boston. This is a small dataset with only 506 cases, but it is a good illustration of how Python can be used for loading, visualising and analysing a dataset. The dataset was originally published in

\[1\] Harrison, D. and Rubinfeld, D.L. Hedonic prices and the demand for clean air, J. Environ. Economics & Management, vol.5, 81-102, 1978.

We performed minor edits to the dataset to suit this course. 

## Dataset
A copy of the .csv data should already be here if you git clone this repository. The .csv format is a plain-text spreadsheet format, which can be loaded in Python using pandas. It can also be opened and viewed using Microsoft Excel or LibreOffice.
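
As a minimal sketch of how pandas reads .csv data, the example below parses a small in-memory CSV. The three rows and the subset of columns are made-up placeholders, not values from the coursework file; with the real dataset you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Stand-in for the real file: three made-up rows with a subset of the columns.
# With the actual dataset, pass its path to pd.read_csv instead of a StringIO.
csv_text = """room,distance,house_value
6.5,4.1,24.0
6.4,5.0,21.6
7.2,5.0,34.7
"""

df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.head())          # first rows (up to 5 by default)
print(list(df_demo.columns))   # column names
print(df_demo.shape)           # (number of rows, number of columns)
```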

## Dataset description
Each row is one case, and there are 506 cases in total. Each column is an attribute. There are 12 attributes:

**crime_rate**: per capita crime rate by town

**zn**: proportion of residential land zoned for lots over 25,000 sq.ft.

**industry**: proportion of non-retail business acres per town

**charles**: Charles River dummy variable (1 if tract bounds river; 0 otherwise)

**nox**: nitric oxides concentration (parts per 10 million)

**room**: average number of rooms per dwelling

**age**: proportion of owner-occupied units built prior to 1940

**distance**: weighted distances to five Boston employment centres

**radial_highway**: index of accessibility to radial highways

**tax**: full-value property-tax rate per \$10,000

**pupil_teacher_ratio**: pupil-teacher ratio by town

**house_value**: median value of owner-occupied homes in \$1000s

## Import libraries
The code that imports the libraries is already provided. It includes:
* pandas: a library for loading .csv datasets
* numpy: a library for manipulating numbers and arrays
* matplotlib: for data visualisation
* seaborn: for data visualisation as well
* sklearn: for linear regression and machine learning

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

## 1. Load data, print the first few lines, column names and dataframe dimension using pandas (5 points).

## 2. Basic statistics (5 points).
Print the basic statistics (mean, standard deviation, minimum, maximum) for each of the data columns.
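
One common way to get these statistics is `DataFrame.describe()`. The sketch below runs it on synthetic data; the column names are borrowed from the dataset description, but the values are randomly generated for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (not the coursework dataset).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "room": rng.normal(6.0, 0.7, size=100),
    "house_value": rng.normal(22.0, 9.0, size=100),
})

# describe() returns count, mean, std, min, quartiles and max per numeric column.
summary = df_demo.describe()

# Keep only the rows requested in this task.
print(summary.loc[["mean", "std", "min", "max"]])
```

Note that pandas reports the sample standard deviation (normalised by n - 1) by default.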

## 3. Data visualisation.
### 3.1 Plot the histogram for each of the data columns (10 points).
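
A possible starting point is `DataFrame.hist`, which draws one histogram per numeric column. This is a sketch on synthetic data; the bin count, figure size and layout are arbitrary choices, not requirements of the task.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (not the coursework dataset).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "room": rng.normal(6.0, 0.7, size=200),
    "age": rng.uniform(0.0, 100.0, size=200),
    "house_value": rng.normal(22.0, 9.0, size=200),
})

# One subplot per column, arranged in a single row.
axes = df_demo.hist(bins=20, figsize=(9, 3), layout=(1, 3))
plt.tight_layout()
```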

### 3.2 Plot the correlation matrix between the data columns (10 points).
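
One way to approach this is to compute the Pearson correlation matrix with `DataFrame.corr()` and draw it with `seaborn.heatmap`. The sketch below uses synthetic data with an assumed relationship between the columns; the colour map and annotation settings are illustrative choices.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data with a built-in relationship (not the coursework dataset).
rng = np.random.default_rng(0)
n = 200
room = rng.normal(6.0, 0.7, size=n)
nox = rng.normal(0.55, 0.1, size=n)
house_value = 9.0 * room - 30.0 * nox + rng.normal(scale=3.0, size=n)
df_demo = pd.DataFrame({"room": room, "nox": nox, "house_value": house_value})

# Pairwise Pearson correlation coefficients between all columns.
corr = df_demo.corr()

# Fix the colour scale to [-1, 1] so that zero correlation sits at the midpoint.
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
```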

### 3.3 Plot the house value (the last data column) against each feature (each of the first 11 data columns) (10 points).
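
A loop over the feature columns with one scatter subplot each is one way to organise these plots. The sketch below uses two synthetic features named `feat_a` and `feat_b` (hypothetical names) and a synthetic `target` column; for the real data the grid would need 11 panels.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (not the coursework dataset).
rng = np.random.default_rng(0)
n = 200
df_demo = pd.DataFrame({
    "feat_a": rng.normal(size=n),
    "feat_b": rng.normal(size=n),
})
df_demo["target"] = 3.0 * df_demo["feat_a"] - 2.0 * df_demo["feat_b"] + rng.normal(scale=0.5, size=n)

feature_cols = [c for c in df_demo.columns if c != "target"]

# One panel per feature, sharing the y-axis because the target is the same.
fig, axes = plt.subplots(1, len(feature_cols), figsize=(8, 3), sharey=True)
for ax, col in zip(axes, feature_cols):
    ax.scatter(df_demo[col], df_demo["target"], s=10)
    ax.set_xlabel(col)
axes[0].set_ylabel("target")
fig.tight_layout()
```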

### 3.4 What is the top factor that contributes positively to house value, based on the univariate correlation coefficient? What is the top factor that contributes negatively? (4 points)

Note that there are other ways to evaluate the importance of each feature, such as using the coefficients of a multiple linear regression model. Here we only use the univariate correlation coefficient.
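
As a sketch of what ranking by univariate correlation can look like, the example below takes the target's column of the correlation matrix and sorts it. The data are synthetic, and `feat_a`, `feat_b`, `feat_c` and `target` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: feat_a pushes the target up, feat_b pushes it down.
rng = np.random.default_rng(0)
n = 200
a = rng.normal(size=n)
b = rng.normal(size=n)
c = rng.normal(size=n)
target = 3.0 * a - 2.0 * b + 0.1 * c + rng.normal(scale=0.5, size=n)
df_demo = pd.DataFrame({"feat_a": a, "feat_b": b, "feat_c": c, "target": target})

# Correlation of every feature with the target, sorted from most negative to most positive.
corr_with_target = df_demo.corr()["target"].drop("target").sort_values()
print(corr_with_target)

most_negative = corr_with_target.index[0]
most_positive = corr_with_target.index[-1]
print("most negative:", most_negative)
print("most positive:", most_positive)
```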

### Answer:

## 4. Linear regression.
### 4.1. Regress the house value against all the features (11 points).

We have split the whole dataset for you into a training set and a test set using a pre-defined ratio (80:20 in this case). We use the train_test_split function that has been imported from the sklearn library. The dataset split is provided for consistent evaluation. Please do not change the random_state seed.

In [None]:
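# Assumes `df` is the DataFrame loaded in Section 1.
# Features: the first 11 columns. Target: the last column (house_value).
X = df.iloc[:, :11]
y = df.iloc[:, 11]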
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

### Now fit a linear regression model to the training set, then visualise the house value predicted by this model against the true house value on the training and test sets.
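
The general pattern (fit on the training split, predict on both splits, plot predicted against true values) can be sketched as follows. The data are synthetic and the variable names (`X_demo`, `Xd_train` and so on) are hypothetical, chosen so they do not clash with the `X_train` and `y_train` defined above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: a linear signal plus Gaussian noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 20.0 + rng.normal(scale=1.0, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit on the training split only.
model = LinearRegression().fit(Xd_train, yd_train)

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
splits = [("Training set", Xd_train, yd_train), ("Test set", Xd_test, yd_test)]
for ax, (title, X_part, y_part) in zip(axes, splits):
    y_pred = model.predict(X_part)
    ax.scatter(y_part, y_pred, s=10)
    # Dashed diagonal: points on this line are predicted perfectly.
    lims = [y_part.min(), y_part.max()]
    ax.plot(lims, lims, "k--")
    ax.set_xlabel("True value")
    ax.set_ylabel("Predicted value")
    ax.set_title(title)
fig.tight_layout()
```

Points close to the dashed diagonal indicate accurate predictions; systematic departures from it suggest that a linear model is missing some structure.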

### 4.2 Quantitatively evaluate the prediction accuracy of the linear model using the root mean squared error (RMSE) as a metric. Report the metric for the training and test sets (4 points).
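
For $n$ true values $y_i$ and predictions $\hat{y}_i$, the metric is $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. A minimal sketch on four made-up numbers, using the `mean_squared_error` function imported above and taking the square root with NumPy:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Errors are 1, 0, -1, -2, so the squared errors are 1, 0, 1, 4 and the MSE is 6 / 4 = 1.5.
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)   # sqrt(1.5), about 1.2247

print(f"MSE = {mse:.4f}, RMSE = {rmse:.4f}")
```

The RMSE is in the same units as the target, so for this dataset it would be expressed in \$1000s.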

### 4.3 Suppose that you are interested in a house in the Boston area and have collected the following information about it. What would you estimate the value of this house to be? First explain your idea, then implement it.

| feature | value |
|---|---|
| number of rooms | 5 |
| distance | 2.5 |
| pupil_teacher_ratio | 13.5 |

#### Idea (5 points):

#### Implementation (10 points):

### 4.4 Open discussion: How can you predict the house value more accurately? (8 points)

### 4.5 Open discussion: How can you estimate the uncertainty for your prediction? (8 points)