## Parallel Coordinates with Python

The aim of this notebook is to explain how the program pyParallelCoordinate works and to show some examples.

This kind of graph is a common way of visualizing high-dimensional geometry and analyzing multivariate data.

![Example of PC](images/examplePC.png)



To build this graph from a large amount of data, I am going to use the modules `pandas` and `bokeh`.

The first module, `pandas`, allows us to create data frames, whereas `bokeh` lets us create the graph.


### Code

Now I am going to explain, briefly, the necessary functions.

First of all, we require the following modules, which are described in the file *requirements.txt*:

- unittest
- bokeh
- bokeh selenium
- phantomjs
- pandas
- math

Then, we only need to import the class `ParallelCoordinates` from the corresponding file:

In [30]:
from ParallelCoordinates import ParallelCoordinates


Next, create a `ParallelCoordinates` object with the following arguments:

- `file_name`: The path to the file whose data we want to plot as Parallel Coordinates.
- `header`: `None` by default; set it to `'infer'` if the file has a header on its first line.
- `my_delimiter`: The delimiter of the data. The default is `','`.

For example:


In [31]:
fun_pc = ParallelCoordinates('data/FUN.BB20002.tsv', my_delimiter='\t')
iris_pc = ParallelCoordinates('data/iris.csv', header='infer')

The attributes of this class are:

- `my_df`: the data frame obtained from the file
- `dict_categorical_var`: a dictionary, initially empty, that will hold the categorical variables and their levels
- `my_df_normalize`: the normalized version of `my_df`, computed only when it is needed
- `parallel_plot`: the final plot


Mainly, the constructor calls two functions:

- `read_file`
- `convert_categorical_to_number`


Now we are going to see how these functions are implemented:

### **`read_file`**

In [32]:
    import pandas as pd
    
    def read_file(self, file_name: str, header, my_delimiter: str) -> pd.DataFrame:
        """
        Read the file and build the data frame. Rows in which every column is missing are removed
        with 'dropna'. If there is no header, the variables are named with the structure 'Var-N'.
        :param file_name: the name of the file (with its path, if it isn't at the source root)
        :param header: 'None' (the default) or 'infer' if the file contains the header in the first line
        :param my_delimiter: the delimiter that splits the data in the file. The default is ','
        :return: The corresponding data frame
        """
        # Only None and 'infer' are accepted as header values
        if header is not None and header != 'infer':
            raise Exception("The header value must be 'None' or 'infer'")
        with open(file_name) as csv_file:
            df = pd.read_csv(csv_file, sep=my_delimiter, header=header)
            df.dropna(how='all', inplace=True)
            if header is None:
                my_header = list()
                for i in range(0, len(df.columns)):
                    my_header.append('Var-'+str(i))
                # Assign the names once the list is complete, so its length matches the columns
                df.columns = my_header
        return df

    


As we can see, the function is flexible: it can read different types of file, such as csv or tsv, because we can set the delimiter of the data and indicate whether there is a header at the beginning of the file.

Now we can show some examples of this function:


In [33]:
df_iris = iris_pc.read_file(file_name='data/iris.csv', header='infer', my_delimiter=',')
print(df_iris[1:10])

   sepal_length  sepal_width  petal_length  petal_width species
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa



And, in the case that there is no header:


In [34]:
df_fun = fun_pc.read_file(file_name='data/FUN.BB20002.tsv', header=None, my_delimiter='\t')
print(df_fun[1:10])

      Var-0     Var-1      Var-2
1  1.316355  0.039683  23.878968
2  0.940633  0.000000  37.283147
3  0.940633  0.000000  37.283147
4  1.351411  0.037327  22.461739
5  1.283916  0.000000  23.888448
6  1.119206  0.000000  30.811572
7  1.265007  0.000000  28.819444
8  1.266593  0.000000  23.983659
9  1.079389  0.000000  35.775862
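
The naming rule for header-less files can be illustrated without `pandas`. The following is a minimal sketch that uses only the standard library `csv` module, not the `read_csv` call used by the class; the helpers `default_header` and `read_rows` are hypothetical names introduced for this illustration, and the values stay as strings because no type conversion is attempted.

```python
import csv
import io


def default_header(n_columns):
    """Return the names 'Var-0' ... 'Var-(n-1)', the pattern used for header-less files."""
    return ['Var-' + str(i) for i in range(n_columns)]


def read_rows(text, delimiter=',', header=None):
    """Hypothetical helper: parse delimited text into (column names, rows)."""
    if header is not None and header != 'infer':
        raise ValueError("The header value must be None or 'infer'")
    # Skip completely empty lines, then split every remaining line on the delimiter
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    if header == 'infer':
        return rows[0], rows[1:]
    return default_header(len(rows[0])), rows


tsv_text = "1.31\t0.03\t23.87\n0.94\t0.00\t37.28\n"
names, rows = read_rows(tsv_text, delimiter='\t')
print(names)  # ['Var-0', 'Var-1', 'Var-2']
```

With `header='infer'`, the same helper takes the names from the first line instead of generating them.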


### **`convert_categorical_to_number`**

The next step is to process this data: columns of type string are converted to categorical, and the categorical values are then encoded as numbers. Since the constructor already calls these two functions, we can directly access the data frame stored in the object `iris_pc`.

In [35]:
    def convert_categorical_to_number(self)->None:
        """
        Convert the columns whose data type is string to categorical. The variable and its levels are saved
        in a dictionary (dict_categorical_var), and the categorical data is then replaced by its numeric codes
        :return:
        """
        for col in self.my_df.columns:
            if type(self.my_df[col].values.tolist()[0]) is str:
                self.my_df[col] = pd.Categorical(self.my_df[col])
                values_of_cat = dict(enumerate(self.my_df[col].cat.categories))
                self.dict_categorical_var[col] = values_of_cat
                self.my_df[col] = self.my_df[col].cat.codes

In [36]:

print(iris_pc.my_df[1:10])


   sepal_length  sepal_width  petal_length  petal_width  species
1           4.9          3.0           1.4          0.2        0
2           4.7          3.2           1.3          0.2        0
3           4.6          3.1           1.5          0.2        0
4           5.0          3.6           1.4          0.2        0
5           5.4          3.9           1.7          0.4        0
6           4.6          3.4           1.4          0.3        0
7           5.0          3.4           1.5          0.2        0
8           4.4          2.9           1.4          0.2        0
9           4.9          3.1           1.5          0.1        0



Also, the variables and the different values of the categorical data are stored in the attribute `dict_categorical_var`:


In [37]:

print(iris_pc.dict_categorical_var)


{'species': {0: 'setosa', 1: 'versicolor', 2: 'virginica'}}
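
To make the relation between the codes and the dictionary concrete, here is a minimal sketch in plain Python. It is not the `pandas` implementation used by the class; it assumes that the levels are the sorted distinct labels, which matches the dictionary printed above, and `encode_categories` is a hypothetical helper name.

```python
def encode_categories(values):
    """Hypothetical helper: map each distinct string to an integer code."""
    # code -> label, with the labels in sorted order
    levels = dict(enumerate(sorted(set(values))))
    # label -> code, the inverse mapping used to encode the column
    label_to_code = {label: code for code, label in levels.items()}
    codes = [label_to_code[value] for value in values]
    return codes, levels


species = ['setosa', 'virginica', 'versicolor', 'setosa']
codes, levels = encode_categories(species)
print(codes)   # [0, 2, 1, 0]
print(levels)  # {0: 'setosa', 1: 'versicolor', 2: 'virginica'}
```

Keeping `levels` around is what makes it possible to translate a numeric code on the plot back to its original label.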


### **`plot`**

After reading and processing the data, the next step is to plot it. For that we have the function `plot`:


In [38]:
    # Imports assumed by this method (the aliases are inferred from how the names are used below)
    import bokeh.plotting as bk
    from bokeh.io import output_file, output_notebook
    from bokeh.models.widgets import Panel, Tabs

    def plot(self, normalize=False, width=500, height=500, title='Parallel-Coordinates', show=True, notebook=False)->None:
        """
        Plot the data frame as a Parallel Coordinates graph, using the function get_multi_line_plot.
        If normalized data is requested, both versions are computed and shown in separate tabs.
        It also checks whether we are working in a notebook or not.
        :param normalize: Boolean that indicates whether the data must be normalized. If True, the function
        shows both the normalized and the original data
        :param width: The width of the figure. The default is 500
        :param height: The height of the figure. The default is 500
        :param title: The title of the figure. The default is 'Parallel-Coordinates'
        :param show: Boolean that indicates whether to show the plot in a browser
        :param notebook: Boolean that indicates whether we are using a notebook, in order to show the plot there
        :return:
        """
        if notebook:
            output_notebook()
        else:
            output_file(title + ".html")
        if normalize:
            self.my_df_normalize = self.normalize_data_frame(self.my_df)
            p1 = self.get_multi_line_plot(self.my_df_normalize, width, height, title)
            tab1 = Panel(child=p1, title="Normalize")
            p2 = self.get_multi_line_plot(self.my_df, width, height, title)
            tab2 = Panel(child=p2, title="No normalize")
            self.parallel_plot = Tabs(tabs=[tab1, tab2])
        else:
            self.parallel_plot = self.get_multi_line_plot(self.my_df, width, height, title)
        if show:
            bk.show(self.parallel_plot)


This function lets us plot the data, normalized or not, and show it.

It requires two other functions:

- `normalize_data_frame`
- `get_multi_line_plot`

### **`normalize_data_frame`**


In [39]:
    def normalize_data_frame(self, df: pd.DataFrame) -> pd.DataFrame:
        """
        Normalize a data frame. The data in the input data frame must be numeric
        :param df: The data frame that we want to normalize
        :return: The corresponding normalized data frame
        """
        df_norm = df.copy()
        for col in df_norm:
            # Divide every value by the total of its column, so each column sums to 1
            total = sum(df_norm[col])
            df_norm[col] = df_norm[col]/total
        return df_norm


In short, this function computes the normalized data frame: every value is divided by the sum of its column.


In [40]:
df_iris_norm= iris_pc.normalize_data_frame(iris_pc.my_df)
print(df_iris_norm[1:10])

   sepal_length  sepal_width  petal_length  petal_width  species
1      0.005590     0.006549      0.002483     0.001112      0.0
2      0.005362     0.006985      0.002306     0.001112      0.0
3      0.005248     0.006767      0.002661     0.001112      0.0
4      0.005705     0.007859      0.002483     0.001112      0.0
5      0.006161     0.008513      0.003015     0.002225      0.0
6      0.005248     0.007422      0.002483     0.001669      0.0
7      0.005705     0.007422      0.002661     0.001112      0.0
8      0.005020     0.006330      0.002483     0.001112      0.0
9      0.005590     0.006767      0.002661     0.000556      0.0
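
The arithmetic behind these values can be checked by hand. The sketch below reproduces the same sum normalization on plain Python lists instead of a data frame; `normalize_columns` is a hypothetical helper and the small table is made-up illustration data, not the iris file.

```python
def normalize_columns(table):
    """Hypothetical helper: divide every value by the total of its column."""
    normalized = {}
    for name, values in table.items():
        total = sum(values)
        normalized[name] = [value / total for value in values]
    return normalized


table = {'sepal_length': [5.0, 3.0, 2.0], 'species': [0, 1, 1]}
normalized = normalize_columns(table)
print(normalized)
# {'sepal_length': [0.5, 0.3, 0.2], 'species': [0.0, 0.5, 0.5]}
```

After this step every column adds up to 1, which explains why the normalized iris values are so small: each one is a share of a column total taken over all rows. A column whose total is 0 cannot be normalized this way.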



As we saw at the beginning, there is also an attribute, `my_df_normalize`, that is computed when we request a normalized plot.


### **`get_multi_line_plot`**

The last function that `plot` uses is `get_multi_line_plot`. It creates a figure that resembles a Parallel Coordinates graph as closely as possible and adds the necessary information.
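
The body of `get_multi_line_plot` is not reproduced here, so the following is only a sketch of the data layout such a figure needs. It assumes, from the function name, that the figure relies on Bokeh's `multi_line` glyph, which takes one list of x coordinates and one list of y coordinates per line; `to_multi_line_data` is a hypothetical helper, not part of the class.

```python
def to_multi_line_data(rows, n_axes):
    """Hypothetical helper: one polyline per row, one x position per variable."""
    # Every line crosses the same vertical axes, placed at x = 0, 1, ..., n_axes - 1
    xs = [list(range(n_axes)) for _ in rows]
    # The height at which a line crosses each axis is the value of that variable
    ys = [list(row) for row in rows]
    return xs, ys


rows = [(5.1, 3.5, 1.4), (4.9, 3.0, 1.4)]
xs, ys = to_multi_line_data(rows, n_axes=3)
print(xs)  # [[0, 1, 2], [0, 1, 2]]
print(ys)  # [[5.1, 3.5, 1.4], [4.9, 3.0, 1.4]]
```

Under that assumption, the two lists could be passed to a Bokeh figure as `figure.multi_line(xs, ys)`, so each row of the data frame becomes one line of the plot.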




In [54]:
iris_pc.plot(notebook=True)

Move the cursor to the bottom of the plot to see the labels.

Moreover, there is an option to plot normalized data:

In [43]:
iris_pc.plot(notebook=True, normalize = True)

### **`save`**

Finally, I am going to explain the functions related to saving the plot:

- `save`
- `file_name_with_ext_and_path`
- `my_path`


In [44]:
    def save(self, format='html', file_name='Parallel-Coordinates', path=None)->None:
        """
        Save the plot in a specific format.
        :param format: The format that we want for our file: 'html', 'png', 'svg' or 'all'
        :param file_name: The corresponding name of the file
        :param path: The path where we want to store the plot. By default we use the current directory and
        create a new one inside it: 'results'
        :return:
        """
        if path is None:
            path = self.my_path()
        valid_format = ['html', 'png', 'svg', 'all']
        if format not in valid_format:
            raise Exception('The format is incorrect')
        # Each output name is built from the original file_name, so 'all' never adds the path twice
        if format == 'html' or format == 'all':
            html_name = self.file_name_with_ext_and_path(file_name, 'html', path)
            # The 'title' parameter of plot() is not in scope here; file_name is used as the page title
            save(self.parallel_plot, filename=html_name, title=file_name)
        if format == 'png' or format == 'all':
            png_name = self.file_name_with_ext_and_path(file_name, 'png', path)
            be.export_png(self.parallel_plot, filename=png_name)
        if format == 'svg' or format == 'all':
            svg_name = self.file_name_with_ext_and_path(file_name, 'svg', path)
            self.parallel_plot.output_backend = "svg"
            be.export_svgs(self.parallel_plot, filename=svg_name)


We can save the plot in different formats:

- html
- png
- svg

Depending on the format, this function uses different methods; the value `'all'` saves the plot in the three formats. Moreover, when no path is given, the method defaults to the current path and creates a directory called 'results' in it. For that, there is a method called `my_path()`.

### **`my_path`**


In [45]:
    def my_path(self)->str:
        """
        Obtain the path where, as default, we want to store the data
        :return: The path correspond to the source root plus another directory; 'results'
        """
        path = os.getcwd() + '/results'
        if not os.path.exists(path):
            os.makedirs(path)
        return path


An example of the use:


In [46]:

print(iris_pc.my_path())


/Users/juanlu/Documents/Universidad/PAB/ParallelCoordinates/notebook/results
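
As a side note, the same behaviour can be written with `os.path.join` and the `exist_ok` flag of `os.makedirs`. This is a sketch of an alternative, not the method of the class; `results_dir` and its `base` parameter are hypothetical and exist only so that the example can run inside a temporary directory.

```python
import os
import tempfile


def results_dir(base=None, name='results'):
    """Hypothetical variant of my_path: return base/name, creating it if needed."""
    if base is None:
        base = os.getcwd()
    path = os.path.join(base, name)
    # exist_ok=True makes the call safe when the directory already exists
    os.makedirs(path, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    first = results_dir(tmp)
    second = results_dir(tmp)  # calling it a second time is harmless
    print(os.path.isdir(first), first == second)  # True True
```

Building the path with `os.path.join` avoids hard-coding the `/` separator.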


### **`file_name_with_ext_and_path`**


Next, the function `file_name_with_ext_and_path` checks whether `file_name` already contains the corresponding extension:


In [47]:
    def file_name_with_ext_and_path(self, file_name: str, format: str, path: str)->str:
        """
        Add the corresponding path and extension to the name under which we want to store the plot.
        If the name already has an extension, check that it matches the format.
        :param file_name: the file name under which we want to store the plot.
        :param format: The corresponding format {html, png, svg}
        :param path: The path where we want to store the data
        :return: The path joined with the file name and its extension
        """
        list_fn = file_name.split('.')
        if len(list_fn) == 1:
            file_name_plus_extension = list_fn[0] + '.' + format
        elif list_fn[1]!=format:
            file_name_plus_extension = list_fn[0] + '.' + format
        else:
            file_name_plus_extension = file_name
        return path + '/' + file_name_plus_extension


Some examples are:


In [48]:
print(iris_pc.file_name_with_ext_and_path('iris', 'svg', 'path/'))
print(iris_pc.file_name_with_ext_and_path('iris.html', 'html', 'path/'))
print(iris_pc.file_name_with_ext_and_path('iris.html', 'png', 'path/'))

path//iris.svg
path//iris.html
path//iris.png
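
The double slash in these results appears because the path passed in already ends with `/`. A possible variant, sketched here under the assumption that the standard library helpers `os.path.splitext` and `os.path.join` are acceptable replacements, avoids it; `name_with_ext_and_path` is a hypothetical name and is not part of the class.

```python
import os


def name_with_ext_and_path(file_name, fmt, path):
    """Hypothetical variant: force the extension to fmt and join the name to path."""
    # splitext splits at the last dot only, so 'iris.v2.html' keeps 'iris.v2' as its stem
    stem, _ext = os.path.splitext(file_name)
    # join does not add a second separator when path already ends with one
    return os.path.join(path, stem + '.' + fmt)


print(name_with_ext_and_path('iris', 'svg', 'path/'))           # path/iris.svg
print(name_with_ext_and_path('iris.html', 'png', 'path/'))      # path/iris.png
print(name_with_ext_and_path('iris.v2.html', 'html', 'path/'))  # path/iris.v2.html
```

The last call differs from the original function, which splits on every dot and would return a name based on `iris` alone.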


Now we can save the plot and check that it was stored correctly:


In [49]:
iris_pc.save(file_name='iris.png', format='png')

![Results of Iris](results/iris.png)




Now that we have seen how this class is implemented, we can plot three different data sets:

- iris.csv
- FUN.BB20002.tsv
- MaF14PF_M10.txt

### Iris

In [50]:
iris_pc.plot(notebook = True)

### Fun


In [51]:
fun_pc.plot(notebook = True, normalize = True)

### MaF14PF_M10

This file contains a large amount of data. For this reason, the range of colours of `viridis` is not enough, so the data is plotted in a single colour.

In [52]:
ma_pc = ParallelCoordinates(file_name='data/MaF14PF_M10.txt', my_delimiter=' ')
ma_pc.plot(notebook=True)