# **Data Visualization Course**

<br>

- **Group Project 2022/2023**
- **Academic Year: 2020-2023 | 3rd Trimester**
- **Professor: Pedro Cabral**

<br>

- **"King County House Prices"**
- **This notebook uses the *kc_house_data.csv* dataset**


<br>

> **Group composed by:**<p>
> Ana Carolina Ottavi, nº 20220541<p>
> Carolina Bezerra, nº 20220392 <p>
> Carolina Confraria, nº 20220711 <p>
> Daniella Camilato, nº 20221641  <p>

## 📖 Introduction

Within the scope of the __Data Visualization__ course, a project was proposed to assess each group's ability to deliver visualizations based on the different features of a dataset. The data for this project was provided by King County, Washington, and covers homes sold between May 2014 and May 2015.

## 📖 Dataset description


- id - Unique ID for each home sold

- date - Date of the home sale

- price - Price of each home sold

- bedrooms - Number of bedrooms

- bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower

- sqft_living - Square footage of the apartment's interior living space

- sqft_lot - Square footage of the land space

- floors - Number of floors

- waterfront - A dummy variable for whether the apartment was overlooking the waterfront or not

- view - An index from 0 to 4 of how good the view of the property was

- condition - An index from 1 to 5 on the condition of the apartment

- grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.

- sqft_above - The square footage of the interior housing space that is above ground level

- sqft_basement - The square footage of the interior housing space that is below ground level

- yr_built - The year the house was initially built

- yr_renovated - The year of the house’s last renovation

- zipcode - What zipcode area the house is in

- lat - Latitude

- long - Longitude

- sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors

- sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors


Verified from two sources:

- https://www.slideshare.net/PawanShivhare1/predicting-king-county-house-prices
- https://www.kaggle.com/datasets/harlfoxem/housesalesprediction

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
from plotly import express as px
import plotly.graph_objects as go
import ipywidgets as widgets
import plotly.offline as py

In [2]:
# load data
url = "https://raw.githubusercontent.com/AnaOttavi/Data_Visualization_Project/main/dataset.csv"
df = pd.read_csv(url)

In [3]:
df.columns

Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')

In [4]:
# Data Dimensions
print('Number of Rows: {}'.format(df.shape[0]))
print('Number of Columns: {}'.format(df.shape[1]))

Number of Rows: 21613
Number of Columns: 21


In [5]:
# Data Type
df.dtypes

id                 int64
date              object
price            float64
bedrooms           int64
bathrooms        float64
sqft_living        int64
sqft_lot           int64
floors           float64
waterfront         int64
view               int64
condition          int64
grade              int64
sqft_above         int64
sqft_basement      int64
yr_built           int64
yr_renovated       int64
zipcode            int64
lat              float64
long             float64
sqft_living15      int64
sqft_lot15         int64
dtype: object
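
The `date` column is loaded as `object` (plain strings), so sorting or grouping by time would need a conversion first. Below is a minimal sketch that uses the `datetime` import already present in the notebook. It assumes the raw values look like `20141013T000000`; this format is an assumption, since the notebook never displays the column's contents. The names `DATE_FORMAT`, `parse_sale_date`, and `sample_dates` are hypothetical.

```python
from datetime import datetime

# Assumed format of the raw 'date' strings (not shown in the notebook)
DATE_FORMAT = "%Y%m%dT%H%M%S"

def parse_sale_date(raw):
    """Convert a raw date string into a datetime object."""
    return datetime.strptime(raw, DATE_FORMAT)

# Hypothetical sample values for illustration only
sample_dates = ["20141013T000000", "20150225T000000"]
parsed = [parse_sale_date(s) for s in sample_dates]
print(parsed[0].year, parsed[0].month)
```

On the full DataFrame, the same format string could likely be passed to `pd.to_datetime(df['date'], format=DATE_FORMAT)`, provided the assumed format matches the data.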

In [6]:
# Color scale shared by the plots below
# (assumption: the original definition is missing from the notebook; 'Viridis' is a placeholder)
colorscale = 'Viridis'

# Create the scatter plot
fig = px.scatter(df, x='long', y='lat',
                 color='price',
                 size='price',
                 title='Relationship between Location and Price',
                 color_continuous_scale=colorscale)

# Show the plot
fig.show()

In [None]:
# Map showing the location of houses priced above 2,000,000
df_filtered = df[df['price'] > 2000000]

# Scatter plot on a Mapbox map, sized by living area and colored by price
fig = px.scatter_mapbox(df_filtered,
                        lat='lat', lon='long', hover_name='id', hover_data=['price'],
                        size='sqft_living', color='price',
                        width=620, color_continuous_scale=colorscale)

# Size, margins, map style, center, and zoom level of the map
fig.update_layout(
    height=440,
    margin={'r': 0, 't': 0, 'l': 0, 'b': 0},
    mapbox=dict(
        center=dict(lat=47.5015, lon=-121.972),
        style='open-street-map',
        zoom=8.1
    )
)

# Display the resulting plot
fig.show()

In [None]:
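# Scatter plot (fig3): price vs. number of bathrooms, colored by price and sized by living area
# Copy of 'condition'; it is recoded into categories in a later cell
df['condition_'] = df['condition']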
fig3 = px.scatter(df, x="price", y="bathrooms", color="price", size='sqft_living')
fig3.update_layout(height=450, width=800)

In [None]:
# Bar chart that represents the mean price by condition

# Derive a new column called 'condition_cat' from the 'condition' column,
# where properties with condition 1 or 2 are labeled 'low', properties with condition 3 or 4 are labeled 'medium',
# and properties with condition 5 are labeled 'high'
df['condition_cat'] = df['condition'].apply(lambda x: 'low' if x <= 2 else 'medium' if x <= 4 else 'high')

# Group the 'price' column by the 'condition_cat' column and calculate the mean
df1 = df[['price', 'condition_cat']].groupby('condition_cat').mean().reset_index()

# Create a bar chart using Plotly Express, specifying the 'x' and 'y' columns
fig = px.bar(df1, x='condition_cat', y='price', color_discrete_sequence=['#8FBC8F'])

# Set the plot and paper background colors to transparent
# fig.update_layout(plot_bgcolor='rgba(0,0,0,0)', paper_bgcolor='rgba(0,0,0,0)')

# Show the resulting plot
fig.show()
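
The binning described in the comments above can also be written as a small named function, which makes the label for every condition value explicit. This is a minimal sketch: `condition_category` is a hypothetical helper, and grouping condition 1 with 'low' is an assumption, since the notebook's comments only mention conditions 2 to 5.

```python
def condition_category(condition):
    """Map a 1-5 condition index to 'low', 'medium', or 'high'."""
    if condition <= 2:   # assumption: condition 1 is grouped with condition 2
        return "low"
    if condition <= 4:
        return "medium"
    return "high"

# Check the label assigned to each possible condition value
labels = [condition_category(c) for c in [1, 2, 3, 4, 5]]
print(labels)
```

The function could then be applied with `df['condition'].apply(condition_category)` in place of the inline lambda.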

In [None]:
# Bar chart of price by waterfront flag (0 = no waterfront, 1 = waterfront)
x_data = df['waterfront']
y_data = df['price']
fig = go.Figure(data=[go.Bar(x=x_data, y=y_data)])
fig.update_traces(marker_color='rgb(144, 202, 249)' , marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)

fig.show()
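
Passing the raw columns to `go.Bar` draws one bar per house at only two x positions, so the chart does not summarize the price of each group. If the goal is to compare typical prices, one option is to aggregate before plotting. The sketch below uses a small hypothetical DataFrame (`sample`) in place of the full dataset, and the choice of the mean as the summary statistic is an assumption.

```python
import pandas as pd

# Hypothetical sample rows standing in for the full dataset
sample = pd.DataFrame({
    "waterfront": [0, 0, 1, 1],
    "price": [300000.0, 500000.0, 1200000.0, 1800000.0],
})

# Replace the 0/1 flag with readable labels, then average the price per group
labels = {0: "No Waterfront", 1: "Waterfront"}
summary = (
    sample.assign(waterfront=sample["waterfront"].map(labels))
          .groupby("waterfront", as_index=False)["price"]
          .mean()
)
print(summary)
```

The resulting `summary` frame has one row per group, so it could be passed to `go.Bar(x=summary['waterfront'], y=summary['price'])` to get exactly two bars.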

In [None]:
# Group the 'price' column by the 'yr_built' column and calculate the mean
df1 = df[['price', 'yr_built']].groupby('yr_built').mean().reset_index()

# Create a line chart using Plotly Express, specifying the 'x' and 'y' columns, and setting the chart title
fig = px.line(df1, x="yr_built", y="price", title='Average price by year of construction',
              color_discrete_sequence=['#8FBC8F'])

# Show the resulting plot
fig.show()

In [None]:
# Create a plot with 'sqft_living' on the x-axis, 'price' on the y-axis, and 'zipcode' as the size and color
fig = px.scatter(df, x='sqft_living', y='price', size='zipcode', color='zipcode', 
                 title='Relationship between Price, Sqft Living, and Zipcode', color_continuous_scale=colorscale)

# Show the plot
fig.show()

In [None]:
# Recode 'condition' into 'condition_': 1-2 -> 'low', 3-4 -> 'medium', 5 -> 'high'
df['condition_'] = df['condition'].apply(lambda x: 'low' if x <= 2 else 'medium' if x <= 4 else 'high')

# Group the 'price' column by the 'condition_' column and calculate the mean
df1 = df[['price', 'condition_']].groupby('condition_').mean().reset_index()

# Create a bar chart using Plotly Express, specifying the 'x' and 'y' columns
fig4 = px.bar(df1, x='condition_', y='price', color_discrete_sequence=['#8FBC8F'])
# set the height and width of the chart
fig4.update_layout(height=450, width=650)


In [None]:
# Create bar chart
fig5 = go.Figure(data=[go.Bar(x=df['waterfront'].replace({1: 'Waterfront', 0: 'No Waterfront'}), y=df['price'])])

# Customize aspect
fig5.update_traces(marker_color='rgb(158,202,225)', marker_line_color='rgb(8,48,107)',
                  marker_line_width=1.5, opacity=0.6)

fig5.show()

In [None]:
trace = go.Scatter(
    x=df['yr_built'],
    y=df['price'],
    mode='markers',
    marker=dict(color='#8FBC8F')
)
data = [trace]

layout = go.Layout(title='Scatter plot of price vs year built')

fig9 = go.Figure(data=data, layout=layout)
# Create and add slider
steps = []
for i in range(len(fig9.data)):
    step = dict(
        method="update",
        args=[{"visible": [False] * len(fig9.data)},
              {"title": "Slider switched to step: " + str(i)}],  # layout attribute
    )
    step["args"][0]["visible"][i] = True  # Toggle i'th trace to "visible"
    steps.append(step)

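# 'active' must index an existing step; the figure has a single trace, so the slider starts at step 0
sliders = [dict(active=0, currentvalue={"prefix": "Frequency: "}, pad={"t": 50}, steps=steps)]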

fig9.update_layout(
    sliders=sliders
)
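
The slider loop above creates one step per trace in the figure, so a figure with a single trace ends up with a single step. The sketch below isolates the same step-building logic for several traces, for example one trace per group of construction years (a hypothetical split that the notebook does not perform). `build_slider_steps` is a hypothetical helper, written in plain Python so the structure of each step is easy to inspect.

```python
def build_slider_steps(n_traces):
    """Return one slider step per trace; step i makes only trace i visible."""
    steps = []
    for i in range(n_traces):
        visible = [False] * n_traces
        visible[i] = True  # show only the i-th trace at this step
        steps.append(dict(
            method="update",
            args=[{"visible": visible},
                  {"title": "Slider switched to step: " + str(i)}],
        ))
    return steps

# Three traces give three steps; 'active' must be a valid step index
steps = build_slider_steps(3)
sliders = [dict(active=0, pad={"t": 50}, steps=steps)]
print([s["args"][0]["visible"] for s in steps])
```

The `sliders` list has the same shape as the one passed to `fig9.update_layout(sliders=sliders)`, so it should slot in once the figure actually contains several traces.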