# Property Price Register EDA
***
AIM: Perform exploratory data analysis to better understand the story the data is telling
***
Tasks
* Review each of the input variables
* Understand the key trends emerging
* Begin to perform some data cleaning to help with future analysis
***
TODO
* Complete address cleaning process
* Develop automated visualizations
* Include external data sources (location details, House Price Indices)

## 1. Setup Notebook

In [None]:
# Import packages and modules
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
import re
import plotly.express as px

# Review the datasets stored in the input directory
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Switch on setting to allow all outputs to be displayed
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# Adjust options for displaying the float columns
pd.options.display.float_format = '{:,.2f}'.format

## 2. Import data

### 2a. Initial dataset review
***
The aim is to work out which import parameters are needed so that the initial read performs as much of the data cleaning as possible.
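As a rough illustration of cleaning at import time, the sketch below reads two made-up rows laid out like the register. The sample rows, the `parse_price` helper and the assumption that the currency symbol is a euro sign are all illustrative and not taken from the real file, which may also need an explicit `encoding` argument.

```python
import io
import pandas as pd

# Two made-up rows in a layout similar to the register (illustrative only)
sample_csv = io.StringIO(
    'Date of Sale (dd/mm/yyyy),Address,County,Price (€)\n'
    '03/01/2020,"1 Example Road, Dublin",Dublin,"€250,000.00"\n'
    '15/02/2020,"2 Sample Lane, Cork",Cork,"€180,500.00"\n'
)

def parse_price(text):
    """Hypothetical helper: strip the euro sign and commas, then cast to float."""
    return float(text.lstrip('€').replace(',', ''))

sample = pd.read_csv(
    sample_csv,
    header=0,                                              # discard the original header row
    names=['Date_of_Sale', 'Address', 'County', 'Price'],  # clean names up front
    converters={'Price': parse_price},                     # clean the price while reading
    parse_dates=['Date_of_Sale'],
    dayfirst=True,                                         # the dates are dd/mm/yyyy
)
print(sample.dtypes)
```

Doing the renaming, price parsing and date parsing inside `read_csv` would remove several of the follow-up cells, at the cost of a longer import call.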

In [None]:
# Import the dataset
df = pd.read_csv('/kaggle/input/residential-property-prices-2020/PPR-2020.csv')

In [None]:
# Preview the data
df.head()
df.shape
df.dtypes
df.describe(include='all')

In [None]:
# Review columns and rename to remove the spaces
df.columns # columns before adjustment
df.columns = df.columns.str.replace(' ', '_') # changing the spaces into underscores
df.columns # columns after adjustment

In [None]:
# Rename specific columns
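df = df.rename(columns={'Date_of_Sale_(dd/mm/yyyy)': 'Date_of_Sale'})
# The currency symbol in the price header can be mis-encoded, so match on the prefix instead
df = df.rename(columns=lambda c: 'Price' if c.startswith('Price_(') else c)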
df.head()

In [None]:
# Review memory usage to see the size and type of each column. Knowing whether a more compact
# data type is available makes better use of memory and helps if the dataset grows
df.info(memory_usage='deep')

In [None]:
# Strip the leading currency symbol and the thousands separators, then convert Price to float
df['Price'] = df['Price'].str[1:]
df['Price'] = df['Price'].str.replace(',', '', regex=False).astype(float)
df.head()
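The slice-based cleaning assumes every price starts with exactly one symbol character. A more defensive variant, sketched here on made-up values, drops every character that is not a digit or a decimal point and lets `pd.to_numeric` turn anything unparseable into `NaN`. The sample values, including the `'n/a'` entry, are invented for illustration.

```python
import pandas as pd

# Made-up raw values, including one that cannot be parsed
raw_price = pd.Series(['€250,000.00', '€1,180,500.50', 'n/a'])

# Keep only digits and the decimal point, then convert; bad values become NaN
cleaned = pd.to_numeric(
    raw_price.str.replace(r'[^0-9.]', '', regex=True),
    errors='coerce',
)
print(cleaned)
```

Counting the `NaN` values afterwards gives a quick check on how many rows failed to parse.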

In [None]:
# Check to see if the price data type has been changed correctly
df.dtypes
df.describe()

In [None]:
# Check whether any columns can be converted to categories. If a column has low cardinality
# (few unique values relative to the number of rows) then it makes sense to convert its data type
cardinality = df.apply(pd.Series.nunique) # Display the cardinality for each column
cardinality

From this output it makes sense to convert the final four columns to category data types
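One possible way to make that choice less dependent on reading the output by eye is to compare the number of unique values with the number of rows. The toy frame and the 0.5 cut-off below are assumptions chosen purely for illustration; the notebook itself uses a fixed count threshold.

```python
import pandas as pd

# Toy data: Address is unique per row, the other columns repeat
toy = pd.DataFrame({
    'Address': ['1 A St', '2 B St', '3 C St', '4 D St', '5 E St', '6 F St'],
    'County': ['Dublin', 'Cork', 'Dublin', 'Cork', 'Dublin', 'Galway'],
    'VAT_Exclusive': ['No', 'No', 'Yes', 'No', 'No', 'No'],
})

unique_counts = toy.nunique()              # number of distinct values per column
unique_ratio = unique_counts / len(toy)    # share of rows that are distinct

# Hypothetical cut-off: treat a column as categorical when at most half its rows are distinct
low_card_cols = unique_ratio[unique_ratio <= 0.5].index.tolist()
print(low_card_cols)
```

On a dataset with many thousands of rows a far smaller ratio would be appropriate.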

In [None]:
# Count the unique values per column once, then keep the columns with three or fewer
nunique = df.nunique()
cat_val = nunique[nunique <= 3].tolist()         # the unique-value counts that pass the threshold
cat_cols = nunique[nunique <= 3].index.tolist()  # the matching column names
cat_val
cat_cols

# Convert the cat_cols list to category data type
df[cat_cols] = df[cat_cols].astype('category')
df.dtypes

In [None]:
# Review the new size of the dataset
df.info(memory_usage='deep')
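To see why the conversion helps, the following sketch measures one repetitive text column before and after casting it to `category`. The data is synthetic, so the exact byte counts will differ from the register.

```python
import pandas as pd

# Synthetic column: three distinct values repeated many times
counties = pd.Series(['Dublin', 'Cork', 'Galway'] * 10_000)

as_text = counties.memory_usage(deep=True)
as_category = counties.astype('category').memory_usage(deep=True)

print(f'text: {as_text:,} bytes, category: {as_category:,} bytes')
```

A category column stores each distinct value once plus a small integer code per row, which is why the saving grows with the number of repeated values.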

In [None]:
# Convert Date_of_Sale to datetime; the source format is dd/mm/yyyy, so state it explicitly
df['Date_of_Sale'] = pd.to_datetime(df['Date_of_Sale'], format='%d/%m/%Y')
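Since the original column header gives the format as dd/mm/yyyy, the order of day and month matters whenever the day is 12 or lower. The small sketch below, using made-up dates, shows how the same strings land in different months depending on the format supplied.

```python
import pandas as pd

# Made-up dates where day and month could be confused
raw = pd.Series(['03/01/2020', '04/02/2020'])

day_first = pd.to_datetime(raw, format='%d/%m/%Y')    # 3 January, 4 February
month_first = pd.to_datetime(raw, format='%m/%d/%Y')  # 1 March, 2 April

print(day_first.dt.month.tolist(), month_first.dt.month.tolist())
```

Passing an explicit `format` also avoids per-row format guessing, which tends to be faster on large columns.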

In [None]:
df.head()
df.dtypes
df.info(memory_usage='deep')

### 2b. Missing value review

In [None]:
# Understand the missing values by column
df.isnull().sum()

# Create a function to review the proportion of missing values in each column
def missing_columns(df):
    # isnull().mean() gives the share of missing rows per column in a single pass
    return df.isnull().mean()

missing_columns(df)
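A sorted view of the missing share can make the drop candidates easier to spot. The sketch below uses a toy frame and a hypothetical 70% cut-off; neither comes from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with different amounts of missing data per column
toy = pd.DataFrame({
    'Postal_Code': ['Dublin 1', np.nan, np.nan, 'Dublin 4'],
    'Property_Size_Description': [np.nan, np.nan, np.nan, 'small'],
    'County': ['Dublin', 'Cork', 'Galway', 'Dublin'],
})

# Share of missing rows per column, worst first
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical rule: flag columns with more than 70% missing as drop candidates
to_drop = missing_share[missing_share > 0.7].index.tolist()
print(missing_share)
print(to_drop)
```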

The Property_Size_Description column can be dropped because it has a large number of missing values. The Postal_Code column needs a further review to see which counties have the most populated values.
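The Postal_Code review could start from a per-county coverage table along the following lines. The toy rows are invented, and the `rows_with_code` and `share_with_code` column names are hypothetical labels introduced for this sketch.

```python
import numpy as np
import pandas as pd

# Toy rows: only the Dublin sales carry a postal code
toy = pd.DataFrame({
    'County': ['Dublin', 'Dublin', 'Cork', 'Cork', 'Galway'],
    'Postal_Code': ['Dublin 1', 'Dublin 4', np.nan, np.nan, np.nan],
})

# For each county, count and share of rows where Postal_Code is populated
postal_coverage = (
    toy['Postal_Code'].notna()
       .groupby(toy['County'])
       .agg(['sum', 'mean'])
       .rename(columns={'sum': 'rows_with_code', 'mean': 'share_with_code'})
       .sort_values('share_with_code', ascending=False)
)
print(postal_coverage)
```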

In [None]:
# Drop the columns not required
df = df.drop(columns=['Property_Size_Description'])
df.head()

## 3. Initial EDA visualizations

In [None]:
# Price by the date of sale
fig = px.bar(df, x='Date_of_Sale', y='Price', title='Price by Time')
fig.show()

In [None]:
# Price grouped by month and county
price_county = df.groupby([df['Date_of_Sale'].dt.month, 'County'])['Price'].mean()
price_county = price_county.reset_index()
price_county
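Grouping on `dt.month` alone would merge the same month from different years if more years of data were added. One possible alternative, sketched on made-up sales, groups on a year-month period and reports the median and the row count alongside the mean, since a few very expensive sales can pull the mean upwards. The `Month` label is an illustrative name.

```python
import pandas as pd

# Made-up sales for illustration
toy = pd.DataFrame({
    'Date_of_Sale': pd.to_datetime(['2020-01-05', '2020-01-20', '2020-02-10', '2020-02-11']),
    'County': ['Dublin', 'Dublin', 'Dublin', 'Cork'],
    'Price': [300000.0, 500000.0, 250000.0, 150000.0],
})

# Year-month periods keep different years apart
month = toy['Date_of_Sale'].dt.to_period('M').rename('Month')

summary = (
    toy.groupby([month, 'County'])['Price']
       .agg(['mean', 'median', 'count'])
       .reset_index()
)
print(summary)
```

The `count` column helps flag county and month combinations where the average rests on very few sales.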

In [None]:
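# Display the average price by month of sale, with one bar per county.
# barmode='group' places the counties side by side; stacking would add the averages together
fig = px.bar(price_county, x='Date_of_Sale', y='Price', color='County', barmode='group', title='Average Price by Month and County')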
fig.show()