# A Notebook to Visualize Data Using a Scatterplot Matrix

This notebook shows you how to visualize a dataset using a [scatterplot](https://en.wikipedia.org/wiki/Scatter_plot) matrix. The plots are drawn with [Plotly](https://plot.ly), a popular graphing library for Python.

The next cell defines the helper functions used in the rest of this notebook. Make sure you run it before you proceed.

In [9]:
from plotly.offline import init_notebook_mode,iplot
from IPython.display import display  # used by loadHead; imported explicitly for clarity
import pandas as pd
import plotly.figure_factory as ff
import numpy as np
import ipywidgets as widgets
from ipywidgets import interact_manual

init_notebook_mode(connected=True) 
    
def ScatterPlotMatrix(filename,attrs,classes):
    df = pd.read_csv(filename)
    # Copy the selected columns so that adding the class column below
    # does not modify a view of df (avoids SettingWithCopyWarning).
    dataframe=df.loc[:,list(attrs)].copy()
    # Warn about every selected attribute that is not numerical.
    for column,dtype in dataframe.dtypes.items():
        if not pd.api.types.is_numeric_dtype(dtype):
            print('Warning:',column,'is not a numerical variable!')
    dataframe[classes]=df[classes]
    fig = ff.create_scatterplotmatrix(dataframe, diag='box', index=classes,height=800,width=800,
                                      title='Scatterplot Matrix with Box Plots along Diagonal')
    iplot(fig, filename='Box plots along Diagonal Subplots')

def searchFile():
    import os
    topdir='./Parallel Coordniates&Scatterplot Matrix/'
    f = []
    for (dirpath, dirnames, filenames) in os.walk(topdir):
        for filename in filenames:
            if filename.endswith('.csv'):
                f.append(os.path.realpath(os.path.join(dirpath,filename)))
    return f    
    
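def update_attributes(*args):
    # Relies on the notebook-level widgets filename, attributes and classes,
    # which are created by createWidgets() in a later cell.
    df = pd.read_csv(filename.value)
    classes.options=list(df.columns)
    attributes.options=list(df.columns)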
    
def createWidgets():
    style={'description_width': 'initial'}
    filename=widgets.Dropdown(options=searchFile(),description='Choose a file',disabled=False)
    attributes=widgets.SelectMultiple(description='Attributes you want to plot',style=style,disabled=False)
    classes=widgets.Select(description='Class name you want to observe',style=style,disabled=False)
    return filename,attributes,classes

def loadHead(dataset):
    # Show the first five rows without truncating the column list.
    df=pd.read_csv(dataset)
    with pd.option_context('display.max_columns', None):
        display(df.head())

## Data Format

The functions above expect data in a specific format. The first row contains the feature names, and every following row contains the values of those features, which can be either categorical or numerical. No feature may have missing values; otherwise the plotting function raises an error.
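Before plotting, it can help to check a file against these requirements. The following is a minimal sketch, assuming pandas is installed; the helper name `check_format` and the small in-memory CSV are hypothetical and are not part of the notebook's own functions.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
# The second data row is missing its "petal length" value.
csv_text = """sepal length,petal length,class
5.1,1.4,Iris-setosa
4.9,,Iris-setosa
6.3,6.0,Iris-virginica
"""

def check_format(df):
    """Return (columns with missing values, non-numerical columns)."""
    missing = [col for col in df.columns if df[col].isna().any()]
    non_numeric = [col for col in df.columns
                   if not pd.api.types.is_numeric_dtype(df[col])]
    return missing, non_numeric

df = pd.read_csv(io.StringIO(csv_text))
missing, non_numeric = check_format(df)
print('Columns with missing values:', missing)
print('Non-numerical columns:', non_numeric)
```

Columns reported as non-numerical are good candidates for the class variable, while columns with missing values need to be cleaned before the file is used.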

We provide three example datasets in the directory "Parallel Coordniates&Scatterplot Matrix". Run the following cell to display a dataset before you visualize it. All three datasets contain multivariate data; how many and which attributes you plot is at your discretion. The links below are for reference only. We have modified the datasets, so please use the copies in this directory rather than downloading them from these links.
* ["Edu-Data.csv"](https://www.kaggle.com/aljarah/xAPI-Edu-Data) is an educational dataset from Kaggle, a well-known platform for data science enthusiasts.
* ["iris.csv"](http://archive.ics.uci.edu/ml/datasets/Iris) is a classic classification dataset from the UCI Machine Learning Repository.
* ["Auto.csv"](http://www-bcf.usc.edu/~gareth/ISL/data.html) is an automobile dataset from the statistical learning textbook *An Introduction to Statistical Learning with Applications in R*.


In [7]:
dataset=input('Please enter the dataset you want to display: ')
loadHead(dataset)

Please enter the dataset you want to display: ./Parallel Coordniates&Scatterplot Matrix/iris.csv


| Unnamed: 0 | sepal length in cm | sepal width in cm | petal length in cm | petal width in cm | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


## Scatterplot Matrix with Box Plots along Diagonal
The following cell visualizes your data as a scatterplot matrix with box plots along the diagonal. When prompted, choose the CSV file you want to analyze, the attributes to use as the dimensions of the matrix, and the class (grouping) variable you want to observe by. We suggest using numerical variables as the attributes and a categorical variable as the class, since this convention makes the visualization more meaningful. Note that "Attributes you want to plot" is a multiple-selection widget: hold Command (Mac) or Control (Windows) while clicking to select several variables. Once you have finished choosing, click "Run Interact."

In [10]:
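filename,attributes,classes=createWidgets()
interact_manual(ScatterPlotMatrix,filename=filename,attrs=attributes,classes=classes)
# Refresh the attribute and class lists whenever a different file is chosen.
filename.observe(update_attributes,'value')
# Also fill the lists once for the file that is selected by default.
if filename.value:
    update_attributes()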

A Jupyter Widget

## Using Your Own Dataset

To use your own dataset, create a new CSV file and put it in the directory "Parallel Coordniates&Scatterplot Matrix". Make sure it follows the format of the datasets already in this directory: the first row must contain the feature names, and every following row contains the values of those features, which can be either categorical or numerical. Make sure there are no missing values in your dataset; otherwise you will get an error.

We suggest using numerical variables as the attributes and a categorical variable as the class, since this convention makes the visualization more meaningful.
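If your raw data has gaps, one way to prepare it is sketched below, assuming pandas is installed. The data values, the file name `my-data.csv`, and the temporary output directory are hypothetical stand-ins; in practice you would save into the directory named above. Dropping incomplete rows is only one possible cleaning strategy.

```python
import io
import os
import tempfile
import pandas as pd

# Made-up raw data; the second row is missing its "weight" value.
raw_text = """height,weight,group
1.70,65,A
1.82,,B
1.65,58,A
"""

raw = pd.read_csv(io.StringIO(raw_text))
clean = raw.dropna()  # drop every row that has a missing value

# Temporary directory used here as a stand-in for the real target directory.
out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, 'my-data.csv')

# index=False keeps pandas from writing the row index as an extra column,
# which would otherwise be read back under a name like "Unnamed: 0".
clean.to_csv(out_path, index=False)

reloaded = pd.read_csv(out_path)
print(list(reloaded.columns))
print(len(reloaded), 'rows without missing values')
```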


Once you have created the file, run the cell below.

In [11]:
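filename,attributes,classes=createWidgets()
interact_manual(ScatterPlotMatrix,filename=filename,attrs=attributes,classes=classes)
# Refresh the attribute and class lists whenever a different file is chosen.
filename.observe(update_attributes,'value')
# Also fill the lists once for the file that is selected by default.
if filename.value:
    update_attributes()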

A Jupyter Widget

Now you can print this notebook as a PDF file and turn it in.