# Coursework 1: Data loading, visualisation and simple analysis using Python

In this coursework, we work with a dataset stored in the ".csv" format that describes housing prices in Boston. It is a small dataset with only 506 cases, but it is a good illustration of how Python can be used for loading, visualising and analysing a dataset. The dataset was originally published in
\[1\] Harrison, D. and Rubinfeld, D.L. Hedonic prices and the demand for clean air, J. Environ. Economics & Management, vol.5, 81-102, 1978.

## Dataset
A copy of the .csv data is included in this repository, so it is available locally once you have run `git clone`. The .csv format is a plain-text format for tabular data, which can be opened as a spreadsheet using Microsoft Excel or LibreOffice.

## Import libraries
The code that imports the libraries is already provided. It includes:
* pandas: a library for loading .csv datasets
* numpy: a library for manipulating numbers and arrays
* matplotlib: for data visualisation
* seaborn: for data visualisation as well
* sklearn: for linear regression and machine learning

```python
# Data loading and numerical computation
import pandas as pd
import numpy as np

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Modelling, evaluation and preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
```

## 1. Load data and print the first few lines using pandas (10 points).
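As a minimal sketch of the pandas calls involved: the file name of the dataset is not stated in this document, so the example below reads a tiny inline CSV through `io.StringIO` instead. The three rows and the name `df_demo` are invented for illustration and are not taken from the real dataset. For the coursework, pass the path of the .csv file in the repository to `pd.read_csv`.

```python
import io

import pandas as pd

# Tiny stand-in CSV with made-up values (assumption: NOT rows from the real data).
csv_text = """crim,rm,medv
0.1,6.5,24.0
0.2,6.0,21.5
0.3,7.1,34.0
"""

# pd.read_csv accepts a file path or any file-like object.
df_demo = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows (the default is 5).
print(df_demo.head(2))

# shape is (number of rows, number of columns).
print(df_demo.shape)
```

In a notebook, leaving `df_demo.head()` as the last expression of a cell renders the rows as a table without needing `print`.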

## Dataset description
Each row is one case in the housing price data. There are 506 cases in total. Each column is an attribute; there are 14 attributes in total:

**crim**: per capita crime rate by town

**zn**: proportion of residential land zoned for lots over 25,000 sq.ft.

**indus**: proportion of non-retail business acres per town

**chas**: Charles River dummy variable (1 if tract bounds river; 0 otherwise)

**nox**: nitric oxides concentration (parts per 10 million)

**rm**: average number of rooms per dwelling

**age**: proportion of owner-occupied units built prior to 1940

**dis**: weighted distances to five Boston employment centres

**rad**: index of accessibility to radial highways

**tax**: full-value property-tax rate per \$10,000

**ptratio**: pupil-teacher ratio by town

**b**: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

**lstat**: percentage of the population with lower status

**medv**: median value of owner-occupied homes in \$1000s

## 2. Simple statistics (10 points).
Print the basic statistics (mean and standard deviation) for the crime rate, nitric oxides concentration, pupil-teacher ratio and median value of owner-occupied homes.
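One possible way to get both statistics in a single call is `DataFrame.agg`. The sketch below uses randomly generated stand-in values under the same column names, so the printed numbers are not the answers for the real dataset; `df_demo` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption: random values, not the real dataset).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "crim": rng.exponential(3.0, 50),
    "nox": rng.uniform(0.4, 0.9, 50),
    "ptratio": rng.uniform(12, 22, 50),
    "medv": rng.normal(22, 9, 50),
})

cols = ["crim", "nox", "ptratio", "medv"]

# agg with a list of function names returns one row per statistic.
stats = df_demo[cols].agg(["mean", "std"])
print(stats)
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, whereas `np.std` defaults to the population version (`ddof=0`), so the two can differ slightly.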

## 3. Data visualisation (30 points).
### 3.1 Plot the histogram distribution for each data column (10 points).
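A minimal sketch, assuming the data is held in a pandas DataFrame: `DataFrame.hist` draws one histogram per numeric column on a grid of axes. The data and column names below are synthetic placeholders, and the bin count of 20 is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data with four numeric columns (assumption: not the real dataset).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

# One histogram per column; returns a 2D array of matplotlib Axes.
axes = df_demo.hist(bins=20, figsize=(8, 6))

# Avoid overlapping titles and tick labels.
plt.tight_layout()
```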

### 3.2 Plot the correlation matrix between the data columns (10 points).
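As a sketch of one approach: `DataFrame.corr` returns the pairwise Pearson correlation matrix, which can then be drawn as an image. The example uses three synthetic columns (`x`, `y`, `z` are placeholder names) where `y` is built to correlate strongly with `x`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data (assumption: not the real dataset).
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df_demo = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.5, size=100),  # strongly correlated with x
    "z": rng.normal(size=100),                     # independent of x
})

# Pearson correlation between every pair of columns.
corr = df_demo.corr()

# Draw the matrix with a diverging colour map fixed to the range [-1, 1].
fig, ax = plt.subplots(figsize=(4, 4))
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
fig.colorbar(im, ax=ax)
```

With seaborn, `sns.heatmap(corr, annot=True)` produces a similar plot with the values written in each cell.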

### 3.3 Plot the house price (the last data column) against each feature (each of the first 13 data columns) (10 points).
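One way to lay this out is a grid of scatter plots, one per feature, with the target on the vertical axis. The sketch below assumes the target is the last column, as stated above, and uses three synthetic features (`f1` to `f3` and `target` are placeholder names); the same loop would extend to 13 features by changing `ncols`.

```python
import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data: three features plus a target in the last column.
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(80, 3)), columns=["f1", "f2", "f3"])
df_demo["target"] = 2 * df_demo["f1"] - df_demo["f2"] + rng.normal(scale=0.3, size=80)

features = df_demo.columns[:-1]  # every column except the last
target = df_demo.columns[-1]     # the last column

ncols = 2
nrows = math.ceil(len(features) / ncols)
fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3 * nrows), squeeze=False)

flat_axes = axes.ravel()
for ax, name in zip(flat_axes, features):
    ax.scatter(df_demo[name], df_demo[target], s=10)
    ax.set_xlabel(name)
    ax.set_ylabel(target)

# Hide any unused axes in the grid.
for ax in flat_axes[len(features):]:
    ax.set_visible(False)

fig.tight_layout()
```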

## 4. Linear regression (30 points).
### 4.1. Regress the house price against all the features (15 points).

* First, split the whole dataset into a training set and a test set using a pre-defined ratio (80:20 in this case).

* Then, train the linear regression model using the training set.

* Finally, plot the predicted house price on the training set.

The dataset split is provided for consistent evaluation. Please do not change the random_state seed.

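```python
# Assumes the DataFrame loaded in task 1 is named df.
X = df.iloc[:, :13]  # the first 13 columns are the features
y = df.iloc[:, 13]   # the last column (medv) is the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```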
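The three steps above can be sketched as follows. To stay self-contained, the example generates synthetic data with known coefficients rather than using `df`; the names `X_demo`, `y_demo` and the `Xd_`/`yd_` prefixes are hypothetical. Plotting predicted against actual values is one reasonable reading of "plot the predicted house price", not the only one.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: y is a known linear function of X plus small noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=200)

# 80:20 split with a fixed seed for reproducibility.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit ordinary least squares on the training set only.
model = LinearRegression().fit(Xd_train, yd_train)
pred_train = model.predict(Xd_train)

# Predicted vs actual: points on the dashed diagonal are perfect predictions.
fig, ax = plt.subplots()
ax.scatter(yd_train, pred_train, s=10)
lims = [yd_train.min(), yd_train.max()]
ax.plot(lims, lims, color="k", linestyle="--")
ax.set_xlabel("actual (training set)")
ax.set_ylabel("predicted (training set)")
```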

### 4.2 Quantitatively evaluate the linear model using the root mean squared error (RMSE) and the coefficient of determination (R squared, R2) on both the training set and the test set (15 points).
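For reference, RMSE is the square root of the mean of the squared errors, and R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes both on four hand-picked numbers (invented for illustration) so the results can be checked by hand.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hand-picked illustrative values (assumption: not model output on the real data).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

# Squared errors are 0.25, 0.25, 0 and 1, so MSE = 1.5 / 4 = 0.375.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Total sum of squares around the mean (6.0) is 20, so R2 = 1 - 1.5 / 20 = 0.925.
r2 = r2_score(y_true, y_pred)

print(f"RMSE = {rmse:.4f}, R2 = {r2:.3f}")
```

For the coursework, the same two calls would be applied twice: once to the training targets and predictions, and once to the test targets and predictions.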

## 5. Challenge yourself (20 points)

Previously, we used all 13 features to predict the house price. Some of these features may be more relevant to the price than others.

### 5.1 Explore the features and develop a linear model for house price prediction using only 5 features as input (10 points).

Hint: use either feature selection or dimensionality reduction.
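Both routes in the hint can be sketched with scikit-learn pipelines. The example below is only an illustration on synthetic data with 13 columns, of which the first three actually drive the target. `SelectKBest` with `f_regression` and `PCA` are one choice each among several, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 13 features, only the first three influence y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 13))
y_demo = (2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5 * X_demo[:, 2]
          + rng.normal(scale=0.1, size=200))

# Option A, feature selection: keep the 5 columns with the highest
# univariate F-score against the target, then fit a linear model.
select_model = make_pipeline(SelectKBest(f_regression, k=5), LinearRegression())
select_model.fit(X_demo, y_demo)
kept = select_model.named_steps["selectkbest"].get_support(indices=True)
print("selected column indices:", kept)

# Option B, dimensionality reduction: standardise, project onto the
# first 5 principal components, then fit a linear model.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pca_model.fit(X_demo, y_demo)
print("components used:", pca_model.named_steps["pca"].n_components_)
```

In the coursework setting, either pipeline would be fitted on the training split only, so that the selection or projection does not see the test data.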

### 5.2 Evaluate the quantitative performance of the new model in terms of RMSE and R2 on the test set (10 points).

## 6. Survey

How long did it take you to complete the coursework? What is your background, and how do you feel about the coursework?