# A Notebook to Visualize Data Using Parallel Coordinates

This notebook shows how to visualize data with parallel coordinates. [Parallel coordinates](https://en.wikipedia.org/wiki/Parallel_coordinates) are a common way of visualizing high-dimensional geometry and analyzing data that has many attributes.

For those interested in the code, the notebook uses predefined functions from the [plotly](https://plot.ly) library to plot the data, the [pandas](http://pandas.pydata.org) library to load and handle it, and ipywidgets to build the interactive controls.
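
As a rough illustration of the idea (separate from the notebook's own code), parallel coordinates draw each row of a table as a polyline that crosses one vertical axis per attribute. The sketch below uses a small made-up table and a hypothetical helper, `axis_positions`, to rescale every column to the interval [0, 1], which shows where each row would cross each axis.

```python
import pandas as pd

# Made-up example data: three rows, three numeric attributes.
df = pd.DataFrame({
    "height": [150.0, 175.0, 200.0],
    "weight": [50.0, 90.0, 70.0],
    "age": [20.0, 30.0, 40.0],
})

def axis_positions(frame):
    """Rescale each column to [0, 1]: 0 is the bottom of its axis, 1 the top.

    Assumes every column has at least two distinct values.
    """
    return (frame - frame.min()) / (frame.max() - frame.min())

positions = axis_positions(df)

# Each row is one polyline; its vertices are (axis index, position on that axis).
polylines = [list(enumerate(row)) for row in positions.itertuples(index=False)]
print(polylines[1])
```

The second row, for example, crosses the "height" axis halfway up, the "weight" axis at the top, and the "age" axis halfway up.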

In [None]:
from plotly.offline import init_notebook_mode,iplot
import plotly.graph_objs as go
import pandas as pd 
import ipywidgets as widgets
from ipywidgets import interact_manual
import os

init_notebook_mode(connected=True) 

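def produceDict(df,label):
    """Build one plotly Parcoords dimension (axis) from a single column.

    df is one column of the data (a pandas Series) and label is its name.
    """
    dic={}
    if df.dtypes in ('int64','float64'):
        # Numerical column: plot the raw values on an axis spanning min..max.
        dic['range']=[df.min(),df.max()]
        dic['label']=label
        dic['values']=df
        #dic['visible']=False
    else:
        # Categorical column: encode each category as an integer code and
        # show the original category names as tick labels.
        df=df.astype('category')
        encodedLabels=dict(enumerate(df.cat.categories))
        dic['range']=[0,len(encodedLabels)-1]
        dic['tickvals']=list(encodedLabels.keys())
        dic['label']=label
        dic['values']=df.cat.codes
        dic['ticktext']=list(encodedLabels.values())
    return dic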

def ParallelCoordinates(filename,attrs,classes):
    """Plot the chosen attributes of a CSV file, coloring the lines by class."""
    df = pd.read_csv(filename)
    # One axis per selected attribute.
    dimensions=[]
    for attr in attrs:
        dimensions.append(produceDict(df[attr],attr))
    # The class column determines the color of each line.
    lineDict=produceDict(df[classes],classes)
    line=dict(color=lineDict['values'],
              colorscale=[[0.0,'rgb(255,97,100)'],
                          [0.5,'rgb(131,245,115)'],
                          [1.0,'rgb(109,172,244)']],
              showscale=True,
              colorbar=dict(title=classes,ticks='outside'))
    # For a categorical class, label the color bar with the category names.
    if df[classes].dtypes not in ('int64','float64'):
        line['colorbar']['tickvals']=lineDict['tickvals']
        line['colorbar']['ticktext']=lineDict['ticktext']
    data=[go.Parcoords(line=line,dimensions=dimensions)]
    layout = go.Layout(title='Parallel Coordinates',
                       font=dict(color='#292A2A',size=13))
    fig = go.Figure(data=data, layout=layout)
    iplot(fig, filename='parcoords-basic')  #,image='svg'

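def searchFile():
    """Return the full paths of all CSV files under the data directory."""
    topdir='./Parallel Coordniates&Scatterplot Matrix/'
    f = []
    for (dirpath, dirnames, filenames) in os.walk(topdir):
        for filename in filenames:
            if filename.endswith('.csv'):
                f.append(os.path.realpath(os.path.join(dirpath,filename)))
    return f

def update_attributes(*args):
    """Refresh the attribute and class lists when a new file is selected.

    Relies on the notebook-level widgets filename, attributes, and classes
    returned by createWidgets().
    """
    df = pd.read_csv(filename.value)
    classes.options=list(df.columns)
    attributes.options=list(df.columns)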
    
def createWidgets():
    style={'description_width': 'initial'}
    filename=widgets.Dropdown(options=searchFile(),description='Choose a file',\
                              disabled=False)
    attributes=widgets.SelectMultiple(description='Attributes you want to plot',\
                                      style=style,disabled=False)
    classes=widgets.Select(description='Class name you want to observe',\
                           style=style,disabled=False)
    return filename,attributes,classes

def loadHead(dataset):
    """Display the first five rows of a CSV file, showing every column."""
    df=pd.read_csv(dataset)
    with pd.option_context('display.max_columns', None):
        display(df.head())

## Data Format

The functions above expect data in a specific format. The first row contains the feature names, and each remaining row contains the values of those features, which can be either categorical or numerical. Categorical variables, like nationality, take a limited number of possible values based on some qualitative property. They can be further divided into ordinal variables (the values have an order) and nominal variables (the values have no order). Numerical variables, like temperature, can be further divided into discrete and continuous variables.
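
To make the distinction concrete, here is a minimal sketch of how pandas can tell a numerical column from a categorical one and turn category labels into integer codes. The encoding step mirrors what `produceDict` does above; `pd.api.types.is_numeric_dtype` is a swapped-in, broader check than the notebook's own dtype comparison, and the column names and values are invented for illustration.

```python
import pandas as pd

# Invented sample: one numerical and one categorical (nominal) feature.
df = pd.DataFrame({
    "temperature": [21.5, 19.0, 23.1],
    "nationality": ["Kuwait", "Jordan", "Kuwait"],
})

# Numerical columns can be placed on an axis directly.
is_numeric = pd.api.types.is_numeric_dtype(df["temperature"])

# Categorical columns are first mapped to integer codes. The categories are
# sorted, so "Jordan" gets code 0 and "Kuwait" gets code 1.
nationality = df["nationality"].astype("category")
labels = list(nationality.cat.categories)
codes = list(nationality.cat.codes)

print(is_numeric, labels, codes)
```

The codes determine where each value sits on the axis, and the labels are what the axis ticks display.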

The features cannot have any missing values (empty cells or cells showing NaN or Null); otherwise, the function returns an error.
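
One way to check a file for missing values before plotting is sketched below. It reads a small inline CSV (a stand-in for a real file) and counts the empty cells in each column; `count_missing` is a hypothetical helper, not part of this notebook.

```python
import io
import pandas as pd

# Inline stand-in for a CSV file; the second data row has an empty "width" cell.
csv_text = (
    "length,width,species\n"
    "5.1,3.5,setosa\n"
    "4.9,,setosa\n"
    "6.3,3.3,virginica\n"
)

def count_missing(df):
    """Return the number of missing cells for each column that has any."""
    missing = df.isna().sum()
    return missing[missing > 0].to_dict()

df = pd.read_csv(io.StringIO(csv_text))
print(count_missing(df))

# Dropping incomplete rows is one simple remedy; whether it is appropriate
# depends on the dataset.
clean = df.dropna()
```

An empty result from `count_missing` means the file has no missing values.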

We provide three example datasets in the directory "Parallel Coordniates&Scatterplot Matrix". Run the following cell to display a dataset before you visualize it. How many and which attributes you plot is up to you. The links below are for reference only: we have modified the datasets, so please use the copies in this directory rather than the versions behind the links.
* ["Edu-Data.csv"](https://www.kaggle.com/aljarah/xAPI-Edu-Data) is an educational dataset from Kaggle, a well-known platform for data science enthusiasts.
* ["iris.csv"](http://archive.ics.uci.edu/ml/datasets/Iris) is a classic classification dataset from the UCI Machine Learning Repository.
* ["Auto.csv"](http://www-bcf.usc.edu/~gareth/ISL/data.html) is an automobile dataset from a statistical learning textbook called "An Introduction to Statistical Learning with Applications in R".

In [None]:
dataset=input('Please enter the dataset you want to display: ')
loadHead(dataset)

## Parallel Coordinates
The following cell generates a parallel coordinates visualization of your data. When prompted, choose the CSV file you want to analyze, the attributes you want to plot, and the class name you want to observe, which is used to color the lines. Note that "Attributes you want to plot" is a multiple-selection widget, so hold Command (Mac) or Control (Windows) while clicking to select several attributes. Once you have finished choosing, click "Run Interact".

**Tip: The attribute and class lists are filled in only when the file selection changes. To use the first listed file, select another file first and then select the first file again.**

In [None]:
filename,attributes,classes=createWidgets()
interact_manual(ParallelCoordinates,filename=filename,attrs=attributes,
                classes=classes)
filename.observe(update_attributes,'value')

## Using Your Own Dataset
To use your own dataset, create a new file and put it in the directory "Parallel Coordniates&Scatterplot Matrix". Make sure it follows the format of the datasets in this directory: the first row must contain the feature names, and the remaining rows must contain the categorical or numerical values of those features. Make sure there are no missing values in your dataset; otherwise, you will get an error.
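
A minimal sketch of producing a file in this format with pandas follows. The column names, the values, and the file name `my_data.csv` are made up, and the file is written to a temporary folder here rather than to the notebook's data directory.

```python
import os
import tempfile
import pandas as pd

# Made-up data: numerical and categorical features, no missing values.
df = pd.DataFrame({
    "horsepower": [130, 165, 150],
    "mpg": [18.0, 15.0, 18.0],
    "origin": ["USA", "USA", "Japan"],
})

# index=False keeps the row index out of the file, so the first row holds
# only the feature names.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "my_data.csv")
df.to_csv(path, index=False)

# Read the first line back to confirm it is the header row.
with open(path) as f:
    header = f.readline().strip()
print(header)
```

To use such a file with this notebook, save it in the data directory instead so that the file dropdown can find it.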

Once you have created the file, run the cell below.

In [None]:
filename,attributes,classes=createWidgets()
interact_manual(ParallelCoordinates,filename=filename,attrs=attributes,
                classes=classes)
filename.observe(update_attributes,'value')

Now you can print this notebook as a PDF file and turn it in.