<div align="center">

# Programming for Data Analytics Project
## Residential Property Price Register Analysis
***

</div>

### Table of Contents


1. About the Project
   
2. Import the Libraries

3. Load the Data

4. Data Exploration
   
    4.1  Check the DataFrame using df.head()

    4.2  Check for the DataFrame dimensionality with pandas .info() method

    4.3  Generate descriptive statistics with pandas .describe() method

    4.4  Check for missing data using df.isna()

5. Data Analysis
   
    5.1  Analysis: 
     - 5.1.1 

     - 5.1.2 

     - 5.1.3 
  
    5.2 Analysis: 
 
     - 5.2.1 

     - 5.2.2 

     - 5.2.3 
 
6.  References

### 1. About the Project
***

### 2. Import the Libraries
***

I imported the following libraries to load, analyse, and plot the dataset.

- `matplotlib.pyplot`: A plotting interface for creating static, animated, and interactive visualizations in Python. It is closely integrated with NumPy and provides a MATLAB-like interface for creating plots.
- `numpy`: Provides multidimensional arrays and high-level mathematical functions, such as linear algebra operations.
- `pandas`: A fundamental data analysis and manipulation library for Python. It offers data structures and operations for manipulating numerical tables and time series.
- `seaborn`: A Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
- `datetime`: A built-in module that provides classes for manipulating dates and times.

In [22]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import datetime

### 3. Load the Data
***
The dataset comes from the Residential Property Price Register website (https://www.propertypriceregister.ie/) and is loaded here from a local CSV copy.


In [23]:
df = pd.read_csv("./data/ppr_all.csv")



### 4. Data Exploration
***

4.1 Check the DataFrame using df.head()

Let's take an initial look at the data.

In [24]:
df.head(5)

| | Date of Sale (dd/mm/yyyy) | Address | County | Eircode | Price in Euro | Not Full Market Price | VAT Exclusive | Description of Property | Property Size Description |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/01/2010 | 5 Braemor Drive, Churchtown, Co.Dublin | Dublin | | 343000.0 | No | No | Second-Hand Dwelling house /Apartment | |
| 1 | 03/01/2010 | 134 Ashewood Walk, Summerhill Lane, Portlaoise | Laois | | 185000.0 | No | Yes | New Dwelling house /Apartment | greater than or equal to 38 sq metres and less... |
| 2 | 04/01/2010 | 1 Meadow Avenue, Dundrum, Dublin 14 | Dublin | | 438500.0 | No | No | Second-Hand Dwelling house /Apartment | |
| 3 | 04/01/2010 | 1 The Haven, Mornington | Meath | | 400000.0 | No | No | Second-Hand Dwelling house /Apartment | |
| 4 | 04/01/2010 | 11 Melville Heights, Kilkenny | Kilkenny | | 160000.0 | No | No | Second-Hand Dwelling house /Apartment | |


4.2  Check for the DataFrame dimensionality with pandas .info() method

The .info() method in pandas prints a concise summary of the DataFrame: the number of columns, the column labels, the column data types, the memory usage, the range index, and the number of non-null values in each column. The method does not return a value; it only prints the information. [[1]](https://www.w3schools.com/python/pandas/ref_df_info.asp)

The output of the .info() method consists of several key components: [[2]](https://machinelearningtutorials.org/a-comprehensive-guide-to-using-the-pandas-dataframe-info-method/)

- The total number of rows (entries) in the DataFrame.

- A summary of each column, including:
  - The column name
  - The number of non-null values
  - The data type of the column

- The memory usage of the DataFrame as a whole.
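
The components listed above can be seen on a small scale with the following sketch. The `sample` DataFrame is a hypothetical three-row stand-in whose values are taken from the first rows of the dataset, and `buffer` and `summary_text` are names introduced only for this illustration. It assumes that redirecting the output with the `buf` parameter is acceptable for inspecting the text that .info() would otherwise print.

```python
import io

import pandas as pd

# Hypothetical three-row sample; values mirror the first rows of the dataset.
sample = pd.DataFrame(
    {
        "County": ["Dublin", "Laois", "Dublin"],
        "Eircode": [None, None, None],
        "Price in Euro": [343000.0, 185000.0, 438500.0],
    }
)

# info() writes its summary to a buffer (stdout by default) and returns None.
# memory_usage="deep" asks pandas to measure object columns in full.
buffer = io.StringIO()
result = sample.info(buf=buffer, memory_usage="deep")

summary_text = buffer.getvalue()
print(summary_text)
print("Return value of info():", result)
```

Capturing the text this way is only a convenience for inspection; calling `sample.info()` without `buf` prints the same summary directly.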

In [25]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 694503 entries, 0 to 694502
Data columns (total 9 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   Date of Sale (dd/mm/yyyy)  694503 non-null  object 
 1   Address                    694503 non-null  object 
 2   County                     694503 non-null  object 
 3   Eircode                    166138 non-null  object 
 4   Price in Euro              694503 non-null  float64
 5   Not Full Market Price      694503 non-null  object 
 6   VAT Exclusive              694503 non-null  object 
 7   Description of Property    694503 non-null  object 
 8   Property Size Description  52830 non-null   object 
dtypes: float64(1), object(8)
memory usage: 47.7+ MB


Upon review, I can gather the following information:

- The DataFrame contains 694503 rows and 9 columns.
- The columns are: "Date of Sale", "Address", "County", "Eircode", "Price in Euro", "Not Full Market Price", "VAT Exclusive", "Description of Property" and "Property Size Description".
- Two columns, "Eircode" and "Property Size Description", have fewer non-null values than the number of rows, indicating that they contain missing values.
- One quantitative variable is numeric with type float64: "Price in Euro".
- The other eight variables are stored with type object: "Date of Sale", "Address", "County", "Eircode", "Not Full Market Price", "VAT Exclusive", "Description of Property" and "Property Size Description". Most of these are qualitative; "Date of Sale" holds dates stored as text.
- The memory usage of this DataFrame is at least 47.7 MB. The "+" in the output indicates a lower bound, because the contents of object columns are not measured in full by default.
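
Because "Date of Sale" is stored as type object, date-based operations would first need a conversion. The following is a minimal sketch of one way to do that, assuming the dates follow the dd/mm/yyyy pattern named in the column label. The `sample` DataFrame and the new column name `"Date of Sale"` are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical sample using the first three dates and prices from the dataset.
sample = pd.DataFrame(
    {
        "Date of Sale (dd/mm/yyyy)": ["01/01/2010", "03/01/2010", "04/01/2010"],
        "Price in Euro": [343000.0, 185000.0, 438500.0],
    }
)

# An explicit day-first format ensures 03/01/2010 is read as 3 January,
# not 1 March.
sample["Date of Sale"] = pd.to_datetime(
    sample["Date of Sale (dd/mm/yyyy)"], format="%d/%m/%Y"
)

print(sample.dtypes)
print(sample["Date of Sale"].dt.day.tolist())
```

Passing an explicit `format` avoids ambiguity between day-first and month-first dates.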

###### [1] [w3schools Pandas DataFrame info() Method](https://www.w3schools.com/python/pandas/ref_df_info.asp)
###### [2] [Understanding the .info output - Machine Learning Tutorials](https://machinelearningtutorials.org/a-comprehensive-guide-to-using-the-pandas-dataframe-info-method/)

4.3  Generate descriptive statistics with pandas .describe() method

The .describe() method provides descriptive statistics that summarise the central tendency, the dispersion, and the shape of the dataset’s distribution. Its count row also helps to identify missing (NaN) data. [[3]](https://www.pythonlore.com/exploring-pandas-dataframe-describe-for-descriptive-statistics/)  [[4]](https://pandas.pydata.org/pandas-docs/version/0.20.2/generated/pandas.DataFrame.describe.html)

By default, describe() only generates descriptive statistics for the numeric columns in a pandas DataFrame. I specified include='all', which forces pandas to generate summaries for all columns in the DataFrame. Statistics that do not apply to a column’s data type are shown as NaN. The output includes the following statistics:


- Count: The number of non-null (non-empty) values in each column.

- Unique: The number of unique values in the column.

- Top: The most common value in the column.

- Frequency (freq): The frequency of the top value within the column.

- Mean: The average value of each numeric column.

- Standard deviation (std): It indicates how spread out the values are around the mean. A higher standard deviation means the values are more spread out from the mean, while a lower standard deviation means the values are closer to the mean.

- Minimum: The lowest value in each numeric column.

- Percentiles: The default percentiles reported by describe() are the 25th, 50th, and 75th (0.25, 0.5, and 0.75).

- First quartile (25th percentile): 25% of the data values are below this value.

- Second quartile (50th percentile): The median, the middle value of the dataset.

- Third quartile (75th percentile): 75% of the data values are below this value.

- Maximum: The highest value in each numeric column.
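
The statistics listed above can be illustrated on a small scale with the following sketch. The `sample` DataFrame is a hypothetical five-row stand-in built from the first rows of the dataset, so its numbers differ from those of the full dataset.

```python
import pandas as pd

# Hypothetical sample: county and price of the first five rows of the dataset.
sample = pd.DataFrame(
    {
        "County": ["Dublin", "Laois", "Dublin", "Meath", "Kilkenny"],
        "Price in Euro": [343000.0, 185000.0, 438500.0, 400000.0, 160000.0],
    }
)

# include="all" summarises both the text column and the numeric column.
summary = sample.describe(include="all")
print(summary)

# unique/top/freq apply only to the text column; mean, std, min,
# percentiles and max apply only to the numeric column. The rest are NaN.
print(summary.loc["top", "County"])
print(summary.loc["mean", "Price in Euro"])
```

In this sample, "Dublin" is the top county and the numeric-only rows for "County" are NaN, which mirrors the pattern in the full output.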

In [18]:
df.describe(include="all")


| | Date of Sale (dd/mm/yyyy) | Address | County | Eircode | Price in Euro | Not Full Market Price | VAT Exclusive | Description of Property | Property Size Description |
|---|---|---|---|---|---|---|---|---|---|
| count | 694503 | 694503 | 694503 | 166138 | 694503.0 | 694503 | 694503 | 694503 | 52830 |
| unique | 5145 | 622025 | 26 | 160850 | | 2 | 2 | 5 | 6 |
| top | 22/12/2014 | Broomfield, Midleton | Dublin | D24W9NN | | No | No | Second-Hand Dwelling house /Apartment | greater than or equal to 38 sq metres and less... |
| freq | 1542 | 21 | 217653 | 34 | | 659594 | 579152 | 576951 | 38096 |
| mean | | | | | 297785.9 | | | | |
| std | | | | | 1024681.0 | | | | |
| min | | | | | 5001.0 | | | | |
| 25% | | | | | 135000.0 | | | | |
| 50% | | | | | 227000.0 | | | | |
| 75% | | | | | 340000.0 | | | | |


Key interpretations can be made from this output for both the quantitative and the qualitative data.

Null values: Confirming the observations made from .info(), the counts for the "Eircode" and "Property Size Description" columns (166138 and 52830) are lower than the count of 694503 for the other columns, indicating missing data.

###### [3] [Understanding the Output of pandas.DataFrame.describe](https://www.pythonlore.com/exploring-pandas-dataframe-describe-for-descriptive-statistics/)
###### [4] [Pandas Documentation on pandas.DataFrame.describe](https://pandas.pydata.org/pandas-docs/version/0.20.2/generated/pandas.DataFrame.describe.html)

4.4 Check for missing data using df.isna()

In [19]:
print(df.isna().sum())

Date of Sale (dd/mm/yyyy)         0
Address                           0
County                            0
Eircode                      528365
Price in Euro                     0
Not Full Market Price             0
VAT Exclusive                     0
Description of Property           0
Property Size Description    641673
dtype: int64


Two of the columns, "Eircode" and "Property Size Description", have missing values. I will keep this in mind throughout the project.
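
To put these counts in proportion, the following sketch expresses them as a percentage of the 694503 rows. It uses the counts printed above rather than the full file. The `sample` DataFrame at the end is a hypothetical four-row illustration of how the same share could be computed directly with `isna().mean()`.

```python
import numpy as np
import pandas as pd

# Row count from df.info() and missing-value counts from df.isna().sum().
total_rows = 694503
missing_counts = pd.Series(
    {"Eircode": 528365, "Property Size Description": 641673}
)

# Share of rows with a missing value, as a percentage.
missing_pct = (missing_counts / total_rows * 100).round(2)
print(missing_pct)

# On a DataFrame, the mean of the boolean isna() mask gives the same share.
# Hypothetical four-row sample for illustration only.
sample = pd.DataFrame(
    {
        "County": ["Dublin", "Laois", "Dublin", "Meath"],
        "Eircode": [np.nan, "D24W9NN", np.nan, np.nan],
    }
)
sample_pct = sample.isna().mean() * 100
print(sample_pct)
```

On these figures, roughly 76% of "Eircode" values and 92% of "Property Size Description" values are missing.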

## 5.   Data Analysis

***

### 5.1  Analysis


5.1.1 

5.1.2

5.1.3

### 5.2  Analysis

5.2.1

5.2.2

5.2.3

## 6.   References

***

***
## End

