# Data Visualization


**Data Science for Business**

**Instructor:  Chris Volinsky**

***Original notebook developed 2024 by Krutika Savani***

This notebook contains code adapted from [the book](https://www.wiley.com/en-us/Data+Mining+for+Business+Analytics%3A+Concepts%2C+Techniques+and+Applications+in+Python-p-9781119549840):

**"Data Mining for Business Analytics: Concepts, Techniques and Applications in Python"** by Galit Shmueli, Peter C. Bruce, Peter Gedeck, Nitin R. Patel
Chapter 3 - Data Visualization

*Minor modifications have been made to align with updated libraries.*


## Data



Importing all the packages we need

In [None]:
import os
import calendar
import numpy as np
import networkx as nx
import pandas as pd
from pandas.plotting import scatter_matrix, parallel_coordinates
import seaborn as sns
from sklearn import preprocessing
import matplotlib.pyplot as plt  # pyplot is the recommended interface; pylab is discouraged

## Import King County Housing Data

We're going to use the King County Housing dataset from Kaggle, which includes homes sold between May 2014 and May 2015. Each record (row) represents a specific property sold in the King County (Seattle) area.

Please [download this file](https://drive.google.com/uc?export=download&id=1gJKDv2seqeijYe3616wg5wxHdOmSj041) to your local machine.

In [None]:
## Once you have downloaded the file, use the upload widget below to select it from your computer (Google Colab only)

from google.colab import files
uploaded = files.upload()


In [None]:
# import housing DF

housing_df = pd.read_csv("KingCountyHousing.csv")


In [None]:
housing_df.info()

In [None]:
housing_df.describe()

Here is the description of the columns of the data frame:

* id - Unique ID for each home sold
* date - Date of the home sale
* price - Price of each home sold
* bedrooms - Number of bedrooms
* bathrooms - Number of bathrooms, where .5 accounts for a room with a toilet but no shower
* sqft_living - Square footage of the home's interior living space
* sqft_lot - Square footage of the land space
* floors - Number of floors
* waterfront - A dummy variable for whether the home overlooks the waterfront or not
* view - An index from 0 to 4 of how good the view of the property was
* condition - An index from 1 to 5 on the condition of the home
* grade - An index from 1 to 13, where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design.
* sqft_above - The square footage of the interior housing space that is above ground level
* sqft_basement - The square footage of the interior housing space that is below ground level
* yr_built - The year the house was initially built
* yr_renovated - The year of the house’s last renovation
* zipcode - What zipcode area the house is in
* lat - Latitude
* long - Longitude
* sqft_living15 - The square footage of interior housing living space for the nearest 15 neighbors
* sqft_lot15 - The square footage of the land lots of the nearest 15 neighbors

## Basic Charts: Bar Charts, Line Graphs, and Scatter Plots

Here we will plot some simple graphs to give insight into the features we can use in modelling.

Our target is `price`, so let's look at that first:


In [None]:
plt.hist(housing_df.price)
plt.xlabel('Price')  # labels must be set before plt.show() renders the figure
plt.show()


## The default histogram is ugly! Try adding edgecolor='black' and bins=15
## Improve the labels with plt.xlabel() and plt.ylabel()

Because price is heavily right-skewed, it helps to transform it toward a more symmetric distribution. This helps with both the analysis and the interpretation of the variable. For prices and other monetary values, a `log` transformation is common; as the next histogram shows, it tends to make the distribution look more like a normal curve.

In [None]:
housing_df['log_price'] = np.log(housing_df.price)
# housing_df.drop(columns=['price'], inplace=True)

plt.hist(housing_df['log_price'], edgecolor='black', bins=15)
plt.xlabel('log Price')
plt.show()
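
To put a number on "less skewed", one option is to compare sample skewness before and after taking logs. The sketch below uses synthetic lognormal values (an assumption for illustration, not the housing data); the names `fake_price`, `skew_raw`, and `skew_log` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices"; these are NOT the King County data
rng = np.random.default_rng(0)
fake_price = pd.Series(rng.lognormal(mean=13.0, sigma=0.5, size=5000))

skew_raw = fake_price.skew()          # clearly positive: long right tail
skew_log = np.log(fake_price).skew()  # near 0: roughly symmetric after the log

print(f"skewness before: {skew_raw:.2f}, after: {skew_log:.2f}")
```

The same two `.skew()` calls could be applied to `housing_df.price` and `housing_df.log_price` to check the effect on the real data.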



Many of the other features would also benefit from this transformation, as their histograms show:

In [None]:
# plot histograms of all features
housing_df.hist(bins=15, edgecolor='black', grid=False, figsize=(15, 10), layout=(5, 5))
plt.tight_layout()
plt.show()

Let's explore the relationship between the square footage of the living space and the price.

Like price, the sqft features are quite skewed and could benefit from a log transformation.

In [None]:
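# Let's transform all of the sqft features using a log(x+1) transform
# (the +1 keeps zero values, e.g. a sqft_basement of 0, finite).
# Run this cell only once: re-running it applies the transform a second time.

for feature in ['sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement', 'sqft_living15', 'sqft_lot15']:
  housing_df[feature] = np.log(housing_df[feature]+1)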
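
As a small illustration of why the `+1` is there, the sketch below compares a plain log with the shifted log on a few made-up square-footage values (the series `sqft` is hypothetical). It also shows `np.log1p`, NumPy's built-in for `log(1 + x)`, which gives the same result.

```python
import numpy as np
import pandas as pd

# Hypothetical values; 0 stands for something like "no basement"
sqft = pd.Series([0, 500, 1200, 3000], name="sqft_basement")

with np.errstate(divide="ignore"):
    plain = np.log(sqft)       # log(0) is -inf, which breaks plots and models

shifted = np.log(sqft + 1)     # the transform used in the cell above
via_log1p = np.log1p(sqft)     # equivalent built-in

print(plain.tolist()[0], shifted.tolist()[0])
```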


In [None]:
## simple scatter plot of log sqft_living vs log_price

plt.scatter(housing_df.sqft_living, housing_df.log_price,alpha=0.1)
plt.xlabel('log sqft_living')
plt.ylabel('log price')
plt.show()


### Distribution Plots : Boxplots and Histograms

***Boxplots*** show how a numeric variable is distributed across the levels of a categorical one.

Let's explore the feature `condition`, since we don't know whether 1 is good or bad!

In [None]:
## boxplot of log_price for different values of condition
ax = housing_df.boxplot(column='log_price', by='condition')
ax.set_ylabel('log_price')
plt.title("Price vs. Condition")
plt.show()
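
The numbers behind a boxplot can also be tabulated directly. The sketch below computes per-group quartiles with `groupby`; the rows in `toy` are made up for illustration, and only the column names mirror the notebook's.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "condition": [1, 1, 3, 3, 3, 5, 5, 5],
    "log_price": [12.1, 12.4, 12.9, 13.0, 13.2, 13.1, 13.4, 13.6],
})

# Quartiles per group: the box spans the 25% to 75% range,
# and the line inside the box is the median (50%)
summary = (
    toy.groupby("condition")["log_price"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(summary)
```

Swapping `toy` for `housing_df` would give a numeric companion to the plot above.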

### Heatmaps: Visualizing Correlations

**Structure:**
- A correlation table for p variables has p rows and p columns.
- Represents all pairwise correlations between variables.

**Color Coding:**
- Darker shades indicate stronger (positive or negative) correlations.
- Easier and faster to interpret than scanning numerical values.
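
To make the structure concrete, here is a minimal sketch on synthetic data (the frame `toy` and its column names are hypothetical). It checks that the correlation table is p by p, symmetric, and has ones on the diagonal.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
toy = pd.DataFrame({
    "x": x,
    "pos": x + rng.normal(scale=0.5, size=200),   # built to correlate positively with x
    "neg": -x + rng.normal(scale=0.5, size=200),  # built to correlate negatively with x
    "noise": rng.normal(size=200),                # unrelated to x
})

toy_corr = toy.corr()   # p = 4 variables, so a 4 x 4 table
print(toy_corr.round(2))
```

Because the table is symmetric, every off-diagonal value appears twice; a heatmap of it carries the same redundancy.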

In [None]:
## Let's restrict the correlations to meaningful numeric features
## (drop identifiers, dates, and location codes; log_price stands in for price)

columns_to_remove = ['id', 'date','zipcode','lat','long','price']  # List of columns to remove
new_housing_df = housing_df.drop(columns=columns_to_remove)


In [None]:

corr = new_housing_df.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)
#sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,vmin=-1,vmax=1,cmap="RdBu")

# Can we make this default easier to read by using parameters vmin, vmax, and cmap??
# cmap options can be found at ColorBrewer - maybe "RdBu"


In [None]:
# Even better:
# Include information about values for readability

fig, ax = plt.subplots()
fig.set_size_inches(11, 7)
sns.heatmap(corr, annot=True, fmt=".1f", cmap="RdBu", center=0, ax=ax, vmin=-1, vmax=1)

Some of these correlations are worth exploring further (for example, `yr_built` and `condition`).

### Multidimensional Visualisation

Basic plots can be enhanced by incorporating features such as color, size, and multiple panels. These additions make it possible to visualize more than two variables at a time, providing a richer understanding of complex information.

Here we will use color to represent the `view` feature and marker style to represent `waterfront`.

In [None]:

scatter_plot = sns.scatterplot(data=housing_df, x='sqft_living', y='log_price', hue='view',
                                style='waterfront')

# Adding legend
scatter_plot.legend(fontsize='small', loc='upper left')

plt.show()

# Lots of overplotting... try a different palette ('viridis' or 'Blues'); alpha and s=20 don't help much

When you have a lot of points, plots can become very crowded because of overplotting. One technique to deal with this is to bin the points and draw a heatmap.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# First specify the bins for the x and y axes

x_bins = np.arange(5.5,9.5,0.2)  # Specify desired bin edges for sqft_living
y_bins = np.arange(11,16,0.2)  # Specify desired bin edges for log_price

# Create a pivot table with binned data
heatmap_data = pd.pivot_table(housing_df, values='view', index=pd.cut(housing_df['log_price'], bins=y_bins),
                             columns=pd.cut(housing_df['sqft_living'], bins=x_bins), aggfunc='mean')

heatmap_data = heatmap_data.iloc[::-1] # reverses the y-axis

# Create the heatmap
plt.figure(figsize=(10, 8))  # Adjust the figure size as needed
sns.heatmap(heatmap_data, cmap='YlOrRd', annot=True, fmt=".1f")  # Use a suitable colormap and display annotations with 1 decimal place
plt.title('Heatmap of Average View by sqft_living and log_price')
plt.xlabel('sqft_living')
plt.ylabel('log_price')
plt.show()
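
The heatmap above colors each cell by the average `view`. A related option is to color each cell by how many points fall in it, which shows where the overplotted points are concentrated. The sketch below does the binning with `np.histogram2d` on synthetic data; the arrays `x` and `y` and the bin edges are assumptions chosen to resemble the ranges used above.

```python
import numpy as np

# Synthetic stand-ins for log sqft_living (x) and log_price (y)
rng = np.random.default_rng(2)
x = rng.normal(loc=7.5, scale=0.4, size=10_000)
y = 13 + 0.8 * (x - 7.5) + rng.normal(scale=0.3, size=10_000)

x_edges = np.linspace(5.5, 9.5, 21)   # 20 bins of width 0.2
y_edges = np.linspace(11.0, 16.0, 26) # 25 bins of width 0.2

# counts[i, j] = number of points in x-bin i and y-bin j
counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
print(counts.shape, int(counts.sum()))
```

Passing `counts.T[::-1]` to `sns.heatmap` would put x on the horizontal axis and y increasing upward, matching the orientation of the plot above.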

Another way to show multiple dimensions is called a _small multiples_ plot, where one of the features is used to create different "panels" of plots that allow you to compare relationships across panels.

In [None]:

g = sns.FacetGrid(housing_df, col="grade", col_wrap=2)   # 'col_wrap' controls the number of columns
g.map(sns.scatterplot, "sqft_living", "log_price", alpha=0.5)
g.add_legend()  # Add a legend if needed
plt.show()

A special plot that uses scatter plots with multiple panels is the **scatter plot matrix**, which shows all pairwise scatter plots in a single display. Each row and each column of the matrix corresponds to one variable, so the panel at each intersection is the scatter plot of that pair of variables.

In [None]:
# Display scatterplots between the different variables
# The diagonal shows the distribution for each variable
# Can use scatter_matrix from pandas.plotting or
# pairplot in seaborn

# Keep only homes built after 2010; .copy() avoids a SettingWithCopyWarning
# when columns are added to this subset later
housing_df_recent = housing_df[housing_df.yr_built > 2010].copy()

df = housing_df_recent[['log_price','sqft_lot','sqft_living','grade']]
sns.pairplot(df)


## Plotly for interactive plots


Interactive plots allow you to quickly identify specific points that might be of interest.  You can define the features that show up on mouse-over.

In [None]:
## Here we are using plots from plotly, another visualization library
## plotly allows for interactive identification - to see which points are which


import plotly.express as px


# Add a column holding each home's id so it can be shown on mouse-over
housing_df_recent = housing_df_recent.assign(Index=housing_df_recent['id'])


# Create an interactive scatterplot using plotly
fig = px.scatter(housing_df_recent,  x='sqft_living', y='log_price',  title='SQFT v PRICE (color=floors)',color='floors',
                 hover_data={'floors': False,
                             'waterfront': ':.2f',
                             'condition': ':.2f',
                             'Index': True
                             },)


# Customize layout
fig.update_layout(
    xaxis_title='log sqft_living',
    yaxis_title='log price',
)
fig.update_layout(height=500)

# Display the interactive scatterplot
fig.show()


### Geographic plots using plotly

In [None]:
import plotly.express as px

fig = px.scatter_mapbox(housing_df_recent, lat="lat", lon="long", color="log_price",
                        size="log_price", color_continuous_scale=px.colors.sequential.Viridis,
                        size_max=5, zoom=9, mapbox_style="carto-positron",
                        hover_data=['bedrooms', 'bathrooms', 'sqft_living','price'])
fig.show()

### Other Plotly Graphs

The [plotly home page](https://plotly.com/python/) has many different ideas and inspiration for other types of visualizations.  There are infinite possibilities!  Play around with different visualizations for your own needs.

## Optional Assignment

The `tips` data is distributed along with `seaborn` and contains information about restaurant tips, the setting in which they were left, and the individual who left them.


In [None]:
tips = sns.load_dataset('tips')

Use visualization to discover things about this data set such as:

* What is the distribution of tips for male vs. female diners?
* Do people leave bigger tips on weekdays or weekends? At lunch or dinner?
* What is the relationship between total_bill and tip percentage?
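
As a possible starting point for these questions, the sketch below derives a tip percentage and a weekend flag. It uses a few made-up rows (`toy_tips`) that borrow the `tips` column names `total_bill`, `tip`, and `day`; the derived columns `tip_pct` and `weekend` are hypothetical names, not part of the dataset.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the real tips data
toy_tips = pd.DataFrame({
    "total_bill": [10.0, 20.0, 40.0, 25.0],
    "tip": [2.0, 3.0, 4.0, 5.0],
    "day": ["Thur", "Fri", "Sat", "Sun"],
})

# Tip as a percentage of the bill
toy_tips["tip_pct"] = 100 * toy_tips["tip"] / toy_tips["total_bill"]

# Flag weekend days so groups can be compared
toy_tips["weekend"] = toy_tips["day"].isin(["Sat", "Sun"])

weekend_means = toy_tips.groupby("weekend")["tip_pct"].mean()
print(weekend_means)
```

The same two derived columns could be added to `tips` and then passed to a boxplot or scatter plot.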

