# Data Visualization


**Fall 2024 - Data Science for Business**

**Instructor:  Chris Volinsky**

***Notebook developed 2024 by Krutika Savani***

This notebook contains code adapted from [the book](https://www.wiley.com/en-us/Data+Mining+for+Business+Analytics%3A+Concepts%2C+Techniques+and+Applications+in+Python-p-9781119549840):

**"Data Mining for Business Analytics: Concepts, Techniques and Applications in Python"** by Galit Shmueli, Peter C. Bruce, Peter Gedeck, Nitin R. Patel
Chapter 3 - Data Visualization

*Minor modifications have been made to align with updated libraries.*


## Data
We're going to use the Boston Housing dataset, which contains information on census tracts in the Boston area. Each record (row) describes one census tract.


The columns (features) are:

```
1. CRIM (Crime Rate): Represents the crime rate in the area.
2. ZN (Residential Zone Percentage): Indicates the percentage of residential land zoned for large lots (over 25,000 sq. ft).
3. INDUS (Non-Retail Business Percentage): Represents the percentage of land occupied by non-retail businesses.
4. CHAS (Charles River Indicator): A binary flag indicating whether the tract bounds the Charles River (1 if yes, 0 otherwise).
5. NOX (Nitric Oxide Concentration): Measures the concentration of nitric oxide in parts per 10 million.
6. RM (Average Number of Rooms): Represents the average number of rooms per dwelling.
7. AGE (Percentage of Older Units): Indicates the percentage of owner-occupied units built prior to 1940.
8. DIS (Weighted Distances to Employment Centers): Represents the weighted distances to five Boston employment centers.
9. RAD (Accessibility to Highways): An index indicating the accessibility to radial highways.
10. TAX (Property Tax Rate): Indicates the full-value property tax rate per $10,000.
11. PTRATIO (Pupil-to-Teacher Ratio): Represents the ratio of pupils to teachers in the area.
12. LSTAT (Percentage of Lower Status Population): Indicates the percentage of the population with lower socioeconomic status.
13. MEDV (Median Home Value): Represents the median value of owner-occupied homes in $1000s.
14. CAT.MEDV (Categorical Median Home Value): A binary variable indicating whether the median home value is above $30,000 (CAT.MEDV = 1) or not (CAT.MEDV = 0).

```

These features provide a comprehensive overview of various factors that can influence the median home value in different areas within Boston. The target variable is `CAT.MEDV`, which categorizes the median home value into two classes based on the $30,000 threshold.
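
To make the threshold concrete, here is a minimal sketch of how a flag like `CAT.MEDV` could be derived from `MEDV`. The values in `toy` are made up for illustration, and the strict `>` comparison is an assumption based on the "above $30,000" wording in the column description.

```python
import pandas as pd

# Toy MEDV values in $1000s (made up for illustration, not taken from the dataset)
toy = pd.DataFrame({'MEDV': [24.0, 34.7, 30.0, 50.0, 16.5]})

# Flag is 1 when the median value exceeds $30,000 (MEDV > 30), else 0.
# Assumption: a tract at exactly 30.0 falls in class 0.
toy['CAT_MEDV'] = (toy['MEDV'] > 30).astype(int)

print(toy)
```

Because the flag is computed directly from `MEDV`, the two columns carry overlapping information, which is worth keeping in mind when reading the correlation plots later on.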

Importing all the packages we need

In [None]:
import os
import calendar
import numpy as np
import networkx as nx
import pandas as pd
from pandas.plotting import scatter_matrix, parallel_coordinates
import seaborn as sns
from sklearn import preprocessing
import matplotlib.pyplot as plt  # pyplot is the recommended interface; pylab is discouraged

## Import Boston Housing Data

Boston Housing data can be found in Brightspace under DataSets,
or click [this link](https://drive.google.com/uc?export=download&id=1qSRQnfCIPi4-NZooLic9Y1TYHe8Fzk7M) to download it to your computer.

Method 1)
Save the file somewhere in your Google Drive where you can access it, OR

Method 2)
Upload it directly into Colab from your laptop.
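
For Method 1, a possible sketch is shown below. It assumes you are running in Colab and that the file sits at the top level of your Drive; the path in `csv_path` is hypothetical, so adjust it to wherever you saved the file. Outside Colab the `google.colab` import is unavailable and the mount step is skipped.

```python
from pathlib import Path

import pandas as pd

try:
    # google.colab only exists inside a Colab runtime
    from google.colab import drive
    drive.mount('/content/drive')  # prompts for authorization the first time
except ImportError:
    print("Not running in Colab; skipping the Drive mount.")

# Hypothetical location: change this to match your own Drive folder
csv_path = Path('/content/drive/MyDrive/BostonHousing.csv')

if csv_path.exists():
    housing_df = pd.read_csv(csv_path)
    print(housing_df.shape)
else:
    print(f"File not found at {csv_path}; check the path or use Method 2.")
```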

In [None]:
## Once the file is downloaded, use the upload dialog below to select it from your computer

from google.colab import files
uploaded = files.upload()
housing_df = pd.read_csv("BostonHousing.csv")


In [None]:
# rename CAT. MEDV column for easier data handling
housing_df = housing_df.rename(columns={'CAT. MEDV': 'CAT_MEDV'})

housing_df.head(10)

## Basic Charts: Bar Charts, Line Graphs, and Scatter Plots

Here we will plot some simple graphs to give insight into the features we can use in modelling


In [None]:
## simple scatter plot
housing_df.plot.scatter(x='LSTAT', y='MEDV', legend=False,color='blue')
plt.show()

# play with marker and alpha options

In [None]:
## barchart of CHAS vs. mean MEDV
# use groupby() to calculate means within a group
# compute mean MEDV per CHAS = (0, 1)
dataForPlot = housing_df.groupby('CHAS')['MEDV'].mean()  # select MEDV first so only that column is averaged
ax = dataForPlot.plot(kind='bar', figsize=[5, 3], color = ['orange','blue'])
ax.set_ylabel('Avg. MEDV')
plt.show()

### Distribution Plots : Boxplots and Histograms

In [None]:
## histogram of MEDV
ax = housing_df.MEDV.hist()  # tip: add edgecolor to histograms so the bars are distinguishable
ax.set_xlabel('MEDV'); ax.set_ylabel('count')
plt.title("Histogram of MEDV")
plt.show()

## default histogram is UGLY!! play with edgecolor='black', bins=15, grid=False
## make labels better with plt.xlabel()
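
As one way to apply the suggestions above, the sketch below sets `edgecolor`, `bins`, and `grid` and uses more descriptive labels. To stay self-contained it draws on randomly generated stand-in values (an assumption for illustration); in the notebook you would call the same method on `housing_df.MEDV`.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Stand-in values for MEDV (randomly generated, not the real data)
rng = np.random.default_rng(4)
medv = pd.Series(rng.uniform(5, 50, size=200), name='MEDV')

# edgecolor separates the bars, bins controls resolution, grid=False removes clutter
ax = medv.hist(bins=15, edgecolor='black', grid=False)
ax.set_xlabel('MEDV (median home value, $1000s)')
ax.set_ylabel('Number of tracts')
ax.set_title('Histogram of MEDV')
plt.show()
```

Trying a few different `bins` values is worthwhile, since too few bins hide structure and too many make the plot noisy.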


**Thought question:  What does this next variable mean and how should we treat it???**

In [None]:
# RAD (Accessibility to Highways)
housing_df['RAD'].value_counts()
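
One possible way to think about the question: if a numeric column takes only a handful of distinct values, it may behave more like a category than a continuous measurement. The sketch below shows two hedged options on toy values (illustrative only, not the real `RAD` column); which one is appropriate depends on the model you plan to use.

```python
import pandas as pd

# Toy values standing in for the RAD column (illustrative only)
rad = pd.Series([1, 2, 4, 5, 24, 24, 4, 5, 24], name='RAD')

# Few distinct values hints that this is an index with levels, not a continuous quantity
print(rad.nunique())

# Option A: an ordered categorical keeps the ranking of the index levels
rad_cat = pd.Categorical(rad, categories=sorted(rad.unique()), ordered=True)
print(rad_cat.categories.tolist())

# Option B: dummy (one-hot) columns, one per level
rad_dummies = pd.get_dummies(rad, prefix='RAD')
print(rad_dummies.columns.tolist())
```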

***Boxplots*** can show the difference of a numeric value across levels of a categorical one

In [None]:
## boxplot of MEDV for different values of CHAS
ax = housing_df.boxplot(column='MEDV', by='CHAS')
ax.set_ylabel('MEDV')
plt.title("")

With thoughtful plot construction, you can compare several numeric variables across the levels of a categorical variable.

In [None]:
## side-by-side boxplots
fig, axes = plt.subplots(nrows=1, ncols=4)
housing_df.boxplot(column='NOX', by='CAT_MEDV', ax=axes[0])
housing_df.boxplot(column='LSTAT', by='CAT_MEDV', ax=axes[1])
housing_df.boxplot(column='PTRATIO', by='CAT_MEDV', ax=axes[2])
housing_df.boxplot(column='INDUS', by='CAT_MEDV', ax=axes[3])
for ax in axes:
    ax.set_xlabel('CAT.MEDV')
plt.suptitle("")  # Suppress the overall title
plt.tight_layout()  # Increase the separation between the plots, avoid overplotting

### Heatmaps: Visualizing Correlations

**Structure:**
- A correlation table for p variables has p rows and p columns.
- Represents all pairwise correlations between variables.

**Color Coding:**
- With a diverging colormap centered at 0, darker shades indicate stronger (positive or negative) correlations.
- Easier and faster to interpret than scanning numerical values.

Example:
- The heatmap below displays pairwise correlations among all columns of `housing_df`: the 12 predictors, MEDV, and CAT_MEDV.
- Blue/red colors highlight positive vs. negative correlations.
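
The structural properties listed above can be checked on a small example. This is a minimal sketch on randomly generated columns (`A` to `D` are hypothetical names, not columns from the housing data), where `D` is constructed to be strongly negatively correlated with `A`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=['A', 'B', 'C'])
# D is built as the negative of A plus a little noise
toy['D'] = -toy['A'] + 0.1 * rng.normal(size=50)

corr = toy.corr()
values = corr.to_numpy()

print(corr.shape)                            # p rows and p columns for p variables
print(np.allclose(values, values.T))         # the table is symmetric
print(np.allclose(np.diag(values), 1.0))     # each variable correlates perfectly with itself
print(round(corr.loc['A', 'D'], 2))          # strong negative correlation by construction
```

Because the table is symmetric, only one triangle carries unique information, which is why some heatmaps mask the upper or lower half.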

In [None]:
## simple heatmap of correlations (without values)
corr = housing_df.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns)

# Can we make this default easier to read by using parameters vmin, vmax, and cmap??
# cmap options can be found at ColorBrewer - maybe "RdBu"


In [None]:
# Even better:
# Include information about values for readability

fig, ax = plt.subplots()
fig.set_size_inches(11, 7)
sns.heatmap(corr, annot=True, fmt=".1f", cmap="RdBu", center=0, ax=ax, vmin=-1, vmax=1)

### Multidimensional Visualisation

Basic plots can be enhanced by incorporating features such as color, size, and multiple panels. These additions allow more than one or two variables to be visualized at a time, providing a richer understanding of complex information.

In [None]:
# Plot first the data points for CAT.MEDV of 0 and then of 1
# Setting color to 'none' gives open circles
_, ax = plt.subplots()
for catValue, color in (0, 'C0'), (1, 'C1'):
  subset_df = housing_df[housing_df.CAT_MEDV == catValue]
  ax.scatter(subset_df.LSTAT, subset_df.NOX, color='none', edgecolor=color,marker='.')
ax.set_xlabel('LSTAT')
ax.set_ylabel('NOX')
ax.legend(["CAT.MEDV 0", "CAT.MEDV 1"])
plt.show()

In [None]:
# Similar plot using Seaborn (sns)

# Color the points by the value of CAT.MEDV

# use alpha instead of open circles
scatter_plot = sns.scatterplot(data=housing_df, x='LSTAT', y='NOX', hue='CAT_MEDV', palette=['C0', 'C1'], alpha=0.8)

# Adding legend
scatter_plot.legend(title='CAT_MEDV Legend', fontsize='small', loc='lower right')

plt.show()
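
Marker size can carry one more variable on top of color. The sketch below is one way to do this with plain matplotlib; the data are randomly generated stand-ins that reuse the housing column names, and the size formula `10 + 4 * MEDV` is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Randomly generated stand-in data (not the real housing values)
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'LSTAT': rng.uniform(2, 35, size=40),
    'NOX': rng.uniform(0.4, 0.9, size=40),
    'MEDV': rng.uniform(5, 50, size=40),
})
toy['CAT_MEDV'] = (toy['MEDV'] > 30).astype(int)

# Marker area grows with MEDV; color encodes CAT_MEDV
sizes = 10 + 4 * toy['MEDV']
colors = toy['CAT_MEDV'].map({0: 'tab:blue', 1: 'tab:orange'}).tolist()

fig, ax = plt.subplots()
ax.scatter(toy['LSTAT'], toy['NOX'], s=sizes, c=colors, alpha=0.6)
ax.set_xlabel('LSTAT')
ax.set_ylabel('NOX')
plt.show()
```

Size is harder for readers to judge precisely than position or color, so it tends to work best for a variable where only rough magnitude matters.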


A special plot that uses scatter plots with multiple panels is the **scatter plot matrix**, which shows all pairwise scatter plots in a single display. The panels are arranged so that each row and each column corresponds to one variable, and the panel at each intersection plots that pair of variables against each other.

In [None]:
# Display scatterplots between the different variables
# The diagonal shows the distribution for each variable
# can use scatter_matrix from pandas.plotting or
# pairplot in seaborn

df = housing_df[['CRIM', 'INDUS', 'LSTAT', 'MEDV']]
sns.pairplot(df)
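
For comparison, here is a sketch of the pandas alternative, `scatter_matrix`. It runs on randomly generated placeholder values that reuse the four column names, so the shapes of the point clouds are meaningless; the point is the layout of the returned grid of axes.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Placeholder values (random, not the real housing data)
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(60, 4)),
                   columns=['CRIM', 'INDUS', 'LSTAT', 'MEDV'])

# diagonal='hist' draws a histogram of each variable on the diagonal
axes = scatter_matrix(toy, diagonal='hist', figsize=(6, 6), alpha=0.6)

# One panel per (row variable, column variable) pair
print(axes.shape)
```

With p variables the grid has p x p panels, so the display gets crowded quickly; selecting a handful of columns first, as done above, keeps it readable.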


### Transformations

**Rescaling**:
Changing the scale in a display can enhance the plot and illuminate relationships. For example, the two cells below change both axes of a scatter plot, and then the y-axis of a boxplot, to a logarithmic (log) scale. Whereas the original plots are hard to read, the patterns become visible in log scale.

Log scales can be very useful and effective when you have skewed data.
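
To see why, the sketch below measures skewness before and after a log transform. It uses a randomly generated right-skewed sample (a lognormal draw, chosen as an assumption for illustration) rather than a housing column.

```python
import numpy as np
import pandas as pd

# Right-skewed sample: most values are small, a few are very large
rng = np.random.default_rng(3)
skewed = pd.Series(rng.lognormal(mean=0.0, sigma=1.5, size=500))

raw_skew = skewed.skew()
log_skew = np.log(skewed).skew()

print(f"skewness on the original scale: {raw_skew:.2f}")
print(f"skewness after log transform:   {log_skew:.2f}")
```

On the original scale the few large values stretch the axis and squeeze everything else into a corner; after the transform the values spread out more evenly, which is what makes patterns visible in the plots.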

In [None]:
# Avoid the use of scientific notation for the log axis
plt.rcParams['axes.formatter.min_exponent'] = 4
## scatter plot: regular and log scale
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(7, 4))
# regular scale
housing_df.plot.scatter(x='CRIM', y='MEDV', ax=axes[0])
axes[0].set_title("Original Scale")

# log scale
ax = housing_df.plot.scatter(x='CRIM', y='MEDV', logx=True, logy=True, ax=axes[1],alpha=0.6)
ax.set_yticks([5, 10, 20, 50])
ax.set_yticklabels([5, 10, 20, 50])
axes[1].set_title("Logarithmic Scale")

plt.tight_layout(); plt.show()

In [None]:
## boxplot: regular and log scale
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(7, 3))
# regular scale
ax = housing_df.boxplot(column='CRIM', by='CAT_MEDV', ax=axes[0])
ax.set_xlabel('CAT.MEDV'); ax.set_ylabel('CRIM')
# log scale
ax = housing_df.boxplot(column='CRIM', by='CAT_MEDV', ax=axes[1])
ax.set_xlabel('CAT.MEDV'); ax.set_ylabel('logCRIM'); ax.set_yscale('log')
# label each panel and suppress pandas' automatic "Boxplot grouped by ..." super-title
axes[0].set_title("Original Scale")
axes[1].set_title("Logarithmic Scale")
plt.suptitle("")

plt.tight_layout(); plt.show()

## Plotly for interactive plots


In [None]:
## Here we are using plots from plotly, another visualization library
## plotly allows for interactive identification - to see which points are which


import plotly.express as px

# Add a log-transformed column for MEDV
housing_df['log_MEDV'] = np.log(housing_df['MEDV'])

# Add a column holding the row index so it can be shown on hover
housing_df['Index'] = housing_df.index


# Create an interactive scatterplot using plotly
fig = px.scatter(housing_df,  x='CRIM', y='log_MEDV', log_x= True, title='CRIM vs. MEDV Interactive Scatterplot',
                 hover_data={
                             'log_MEDV': ':.2f',  # Customize hover for log(MEDV)
                             'LSTAT': ':.2f',  # Add LSTAT to hover data with customized formatting
                             'Index': True
                             },)


# Customize layout
fig.update_layout(
    xaxis_title='log(CRIM)',
    yaxis_title='log(MEDV)',
)
fig.update_layout(height=500)

# Display the interactive scatterplot
fig.show()


### Multiple dimensions using panels and color (with plotly)

This code generates an interactive scatterplot that visualizes the relationship between the logarithm of the crime rate and the logarithm of the median home value, **faceted by the 'CHAS' variable**. The color of the points represents the LSTAT (Percentage of Lower Status Population). Additional information is displayed on hover, including the logarithm of the median value, the percentage of lower status, and the row number (index).

In [None]:

# Create an interactive scatterplot using plotly with customized colors (showcasing usage of facet scatter plots)
# this is a plot in 4 dimensions!


fig = px.scatter(housing_df, x='CRIM', y='log_MEDV', log_x= True,facet_col='CHAS', color='LSTAT',
                 hover_data={'CHAS': False,  # Remove CHAS from hover data
                             'log_MEDV': ':.2f',  # Customize hover for log(MEDV)
                             'LSTAT': ':.2f',  # Add LSTAT to hover data with customized formatting
                             'Index': True
                             },
                 )


fig.update_layout(height=500)
fig.show()
