# Technical Interview

The purpose of this technical interview is to assess the following skills:

- Analytics
  - Data collection
  - Data cleaning and preprocessing
  - Data visualisation
  - Modelling


- Programming
  - Proficiency with Python and PyData stack
  - Code readability and commenting
  - Problem solving


- Data / Business Understanding
  - Exhibit ability to understand an unfamiliar area
  - Interpretation of findings


Please limit any work to using the Python standard libraries and any of the following:

- numpy
- scipy
- pandas
- seaborn
- matplotlib
- scikit-learn
- statsmodels
- sktime
- gensim
- nltk
- spacy

You are not expected to use every library listed. Only use those you feel are appropriate for your submission.

Only use additional packages if you could not complete the work without them. If you do, install them in the top cell of the notebook, i.e. `!pip install <package>`. Do not use any packages that require API keys.

Please provide detailed explanations throughout the assessment.


## Submission

All work submitted is to be contained within this notebook.

Please follow the instructions in the email you were sent.

Before submitting this notebook, it is good practice to restart the kernel and run the whole notebook again. This confirms that the notebook runs without errors from a clean state, as it will when being assessed.

Compress the following files into the folder that you submit:

* `Assessment.ipynb`
* `PPR-ALL.csv`
* `HPM09.<time_of_download>.csv`

### Case Study

For this case study, you have been contracted by the Irish Revenue. Using only the data specified below, the Revenue require 3 primary objectives to be addressed:

1. Perform an exploratory data analysis to highlight interesting patterns in the data.
2. Identify residential sales which seem unusual i.e. residential sales which are unusually high or low considering the available features.
3. Provide your opinion on whether a model can reliably forecast monthly stamp duty. [Stamp duty](https://www.revenue.ie/en/property/stamp-duty/property/index.aspx) is a tax applied to the sale of property. For the purpose of this question, assume stamp duty is charged on all sales (exclusive of 13.5% VAT) at a rate of 1%.

Stamp duty is calculated using the formula below:

$$
\text{Stamp Duty Due} = \text{Stamp Duty Rate} \times (\text{Sale Price}-(\text{Sale Price} \times \text{VAT}))
$$

Substituting the assumed stamp duty rate of 1% and VAT of 13.5%:

$$
\text{Stamp Duty Due} = 0.01 \times (\text{Sale Price}-(\text{Sale Price} \times 0.135))
$$
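
As a quick check of the arithmetic, the formula can be written as a small Python function. This is a minimal sketch; the function name `stamp_duty_due` and its parameter names are illustrative and not part of the brief.

```python
def stamp_duty_due(sale_price: float, rate: float = 0.01, vat: float = 0.135) -> float:
    """Stamp duty under the simplified formula given in this case study.

    rate and vat default to the values assumed in the brief (1% and 13.5%).
    """
    # Remove the VAT portion from the sale price, then apply the duty rate.
    return rate * (sale_price - sale_price * vat)


# Example with a sale price of 300,000.
print(round(stamp_duty_due(300_000), 2))
```

For a sale price of €300,000 this gives roughly €2,595, since 13.5% of the price (€40,500) is deducted before the 1% rate is applied.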

#### Data

1. [Residential Property Price Register](https://www.propertypriceregister.ie/)
2. [Residential Property Price Index](https://data.cso.ie/table/HPM09) - [Data description](https://www.cso.ie/en/methods/surveybackgroundnotes/residentialpropertypriceindex/#:~:text=The%20Residential%20Property%20Price%20Index,residential%20properties%20sold%20in%20Ireland.&text=The%20index%20is%20mix%2Dadjusted,are%20sold%20in%20different%20months.)

Before beginning, make sure to visit the provided links to gain a better understanding of the data.

Please read through the entire assessment before beginning.

##### 1. Download and load datasets

First, download and load the data by following the steps below.

- **Property Prices**
  - Visit [link](https://www.propertypriceregister.ie/) and download all data using the "DOWNLOAD ALL" button
  - Unzip `PPR-ALL.zip` and load `PPR-ALL.csv` into a pandas DataFrame. Only load the following columns:

`Date of Sale (dd/mm/yyyy)`, `Address`, `County`, `Price (€)`, `VAT Exclusive`, `Not Full Market Price`, `Description of Property`

- **Residential Price Index (RPI)**
  - Visit [link](https://data.cso.ie/table/HPM09) and download all data as a CSV. The file will download with a name of the form `HPM09.<time_of_download>.csv`
  - Load `HPM09.<time_of_download>.csv` into a pandas DataFrame
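
The loading steps above could be sketched as follows. This is one possible approach under stated assumptions: the helper names `load_ppr` and `load_rpi` are hypothetical, the `cp1252` encoding is an assumption about how `PPR-ALL.csv` is encoded, and the price column is matched by prefix in case the euro sign in its header does not decode cleanly.

```python
import glob

import pandas as pd

# Columns requested in the brief, apart from the price column (matched by prefix below).
PPR_COLUMNS = {
    "Date of Sale (dd/mm/yyyy)", "Address", "County", "VAT Exclusive",
    "Not Full Market Price", "Description of Property",
}


def load_ppr(source, encoding="cp1252"):
    """Load only the requested columns of PPR-ALL.csv.

    The default encoding is an assumption; change it if the file fails to decode.
    """
    def wanted(name):
        # Keep the listed columns plus whichever column starts with "Price".
        return name in PPR_COLUMNS or name.startswith("Price")

    return pd.read_csv(source, usecols=wanted, encoding=encoding)


def load_rpi(pattern="HPM09.*.csv"):
    """Load the first file whose name matches the HPM09 download pattern."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError(f"No file matching {pattern!r}")
    return pd.read_csv(matches[0])
```

Typical usage would be `ppr = load_ppr("PPR-ALL.csv")` and `rpi = load_rpi()`, run from the folder that contains both files.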

##### 2. Data Cleaning & Processing

Next, complete the following steps to obtain the dataset on which the exploratory analysis will be performed.

Also perform any additional steps you consider warranted, based on your understanding of the data itself and/or the information provided in the links.

- **Property Price Data Cleaning**
  - Use only data from 2010 to 2019 (inclusive). Print the length of the DataFrame.
  - Keep only the following counties: `Dublin`, `Cork`, `Galway`, `Kildare` and `Meath`. Print the length of the DataFrame.


- **RPI Data Cleaning**
  - Use only data from 2010 to 2019 (inclusive). Print the length of the DataFrame.
  - Include only the `Residential Property Price Index` statistic. Print the length of the DataFrame.
  - Keep only the following types of residential property: `Dublin - all residential properties` and `National excluding Dublin - all residential properties`.


- **Combine Data**
  - Combine both data sources so that the type of residential property categorisations are aligned appropriately
  - Use the RPI information to adjust all sale prices so that they are comparable to December 2019 prices
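
The cleaning and combining steps above could look something like the sketch below. It is not a prescribed solution. It assumes the RPI table has already been reduced to three hypothetical columns, `month` (first day of the month), `region` and `rpi`, and that every county other than Dublin maps to the "National excluding Dublin" category. The new column names `sale_date`, `price` and `price_dec_2019` are also illustrative.

```python
import pandas as pd

COUNTIES = ["Dublin", "Cork", "Galway", "Kildare", "Meath"]
DUBLIN = "Dublin - all residential properties"
EX_DUBLIN = "National excluding Dublin - all residential properties"


def clean_ppr(ppr):
    """Parse dates and prices, then keep 2010-2019 sales in the five counties."""
    out = ppr.copy()
    out["sale_date"] = pd.to_datetime(out["Date of Sale (dd/mm/yyyy)"], format="%d/%m/%Y")
    # Locate the price column by prefix, then strip currency symbols and commas.
    price_col = next(c for c in out.columns if c.startswith("Price"))
    out["price"] = (
        out[price_col].astype(str).str.replace(r"[^\d.]", "", regex=True).astype(float)
    )
    out = out[out["sale_date"].dt.year.between(2010, 2019)]
    print(len(out))
    out = out[out["County"].isin(COUNTIES)]
    print(len(out))
    return out


def adjust_to_dec_2019(sales, rpi):
    """Scale each price by RPI(December 2019) / RPI(month of sale) for its region.

    `rpi` is assumed to hold one row per month and region, with hypothetical
    columns `month`, `region` and `rpi`.
    """
    sales = sales.copy()
    sales["month"] = sales["sale_date"].dt.to_period("M").dt.to_timestamp()
    # Assumption: all non-Dublin counties use the "National excluding Dublin" index.
    sales["region"] = sales["County"].map(lambda c: DUBLIN if c == "Dublin" else EX_DUBLIN)
    base = rpi.loc[rpi["month"] == pd.Timestamp("2019-12-01")].set_index("region")["rpi"]
    merged = sales.merge(rpi, on=["month", "region"], how="left", validate="many_to_one")
    merged["price_dec_2019"] = merged["price"] * merged["region"].map(base) / merged["rpi"]
    return merged
```

For example, a Dublin sale of €200,000 in a month where the Dublin index was 100 would be restated as €250,000 if the December 2019 Dublin index were 125.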

##### 3. Exploratory Data Analysis

The Revenue's brief explains that they are interested in finding patterns in the property price and residential price index information.

Examples include the change in property prices over time (adjusted and unadjusted for inflation), the number of properties sold, the distribution characteristics of prices and the difference between counties and types of properties.

This is not an all-encompassing list and should be expanded upon based on what you feel the client would be interested in knowing. The purpose of this question is to showcase your ability to extract useful information relevant for a client from a data source, visualise and elaborate on those findings.
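
As a starting point for this kind of analysis, a per-county yearly summary can be built with a single `groupby`. The sketch below is illustrative only; the function name `yearly_summary` and the column names `sale_date` and `price` are assumptions about how the cleaned data has been named.

```python
import pandas as pd


def yearly_summary(df, date_col="sale_date", price_col="price"):
    """Number of sales plus median and mean price for each county and year."""
    return (
        df.assign(year=df[date_col].dt.year)
        .groupby(["County", "year"])[price_col]
        .agg(sales="count", median_price="median", mean_price="mean")
        .reset_index()
    )
```

Comparing the median with the mean in such a table gives a first indication of how skewed the price distribution is within each county.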

##### 4. Outlier Detection

Using an unsupervised approach, implement a method to classify whether a property sale should be considered an outlier. Within the context of county and type of property, the Revenue are interested in two types of outliers:

* **Residential sales which may have been sold for abnormally low values** i.e. those which may have been sold for less than they are worth. The Revenue are interested in creating a backlog of property sales to investigate for tax avoidance.

* **Residential sales which may have been sold for abnormally high values** i.e. those sales which may have been entered in error, such as apartment blocks. The Revenue are interested in identifying sales which do not represent the sale of a single residential property. The client believes there may be many instances of a single sale covering several properties, such as apartment blocks.

One deliverable for the client is therefore a list of properties which were sold for abnormally low or high values, i.e. outliers.

The approach chosen should be informed by what you observed in the previous step.

Before the implementation, please explain in detail why the selected approach was chosen. Also provide a summary of your findings on the list of outliers identified.
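
To illustrate what a simple unsupervised method can look like, the sketch below flags sales using a robust z-score (median and median absolute deviation) of the log price within each county and property type. It is one possible baseline, not the expected answer. The function name `flag_price_outliers`, the `price_dec_2019` column and the threshold of 3.5 (a commonly quoted rule of thumb for this score) are all assumptions.

```python
import numpy as np
import pandas as pd


def flag_price_outliers(df, price_col="price_dec_2019",
                        group_cols=("County", "Description of Property"),
                        threshold=3.5):
    """Label each sale 'low', 'high' or 'normal' within its group."""
    out = df.copy()
    # Work on log prices because property prices are typically right-skewed.
    log_price = np.log(out[price_col])
    grouped = log_price.groupby([out[c] for c in group_cols])
    median = grouped.transform("median")
    # Median absolute deviation per group; groups with zero MAD get no score.
    mad = grouped.transform(lambda s: (s - s.median()).abs().median())
    out["robust_z"] = 0.6745 * (log_price - median) / mad.replace(0, np.nan)
    out["outlier"] = "normal"
    out.loc[out["robust_z"] < -threshold, "outlier"] = "low"
    out.loc[out["robust_z"] > threshold, "outlier"] = "high"
    return out
```

The median and MAD are used instead of the mean and standard deviation because they are far less affected by the extreme values the method is trying to find.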

##### 5. Forecasting Stamp Duty

The final client deliverable is an evaluation report which outlines the suitability of using a model to forecast monthly stamp duty.

For the evaluation report, please provide the following sections:

* Methodology - description of methodology and selection considerations.
* Evaluation - evaluation of model with comments.
* Summary of Findings - provide the client with a summary of your outcome findings and suggested future steps.

The Revenue have provided the following stamp duty formula.

$$
\text{Stamp Duty Due} = \text{Stamp Duty Rate} \times (\text{Sale Price}-(\text{Sale Price} \times \text{VAT}))
$$

where **Stamp Duty Rate** is assumed to be 1% for all sales and **VAT** is 13.5%.

Limit the evaluation to the data output from part 2 of the case study.
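
Any forecasting model needs a monthly target series and a baseline to beat. The sketch below aggregates stamp duty by month and scores a seasonal naive forecast (each month predicted by the value 12 months earlier) on a hold-out period. It is a hedged illustration of an evaluation set-up rather than a recommended model. The function names and the `sale_date` and `price` columns are assumptions, and the baseline assumes the monthly series has no missing months.

```python
import numpy as np
import pandas as pd


def monthly_stamp_duty(df, date_col="sale_date", price_col="price",
                       rate=0.01, vat=0.135):
    """Total stamp duty per calendar month under the formula in the brief."""
    duty = rate * (df[price_col] - df[price_col] * vat)
    return duty.groupby(df[date_col].dt.to_period("M")).sum()


def seasonal_naive_mape(series, horizon=12):
    """Mean absolute percentage error of a seasonal naive forecast.

    The last `horizon` months are held out and each one is forecast with the
    value observed 12 months earlier.
    """
    if horizon > 12 or len(series) < horizon + 12:
        raise ValueError("need horizon <= 12 and at least horizon + 12 months of data")
    actual = series.iloc[-horizon:].to_numpy()
    forecast = series.shift(12).iloc[-horizon:].to_numpy()
    return float(np.mean(np.abs(actual - forecast) / actual))
```

A candidate model that cannot improve on this baseline over the same hold-out period would be hard to describe as reliable.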