**Important:**
> If you are viewing this notebook on GitHub, hover over the upper-right part of the notebook (the theta symbol) and click *external view available with nbviewer*. This is necessary if you want to see the charts on GitHub.

---
# Exploratory Data Analysis
DataVis Supplementary Material

---


Exploratory data analysis is primarily used to investigate data sets before making any assumptions about them. Since human beings are visual creatures, data exploration often means using data visualization methods to better understand patterns within the data, detect anomalies, and find relations among the variables. With that knowledge, existing questions can be refined and new ones can be generated. The most common steps in exploratory data analysis are: **importing**, **cleaning**, **processing** and **visualizing** the data.

In [1]:
import numpy as np
import pandas as pd
import altair as alt
from sklearn import datasets
from sklearn.metrics import mutual_info_score as mis
from sklearn.preprocessing import StandardScaler

In [2]:
# Uncomment if you are using dark jupyter lab/notebook theme
#alt.renderers.set_embed_options(theme='dark')

---
## Data import

To introduce you to exploratory data analysis, we will import 3 different data sets: two have a categorical *target* variable and one has a continuous *target* variable.
Throughout this notebook, we will work with only one data set, *data_wine*, which is a classic multi-class data set. You should repeat the exploratory analysis with the other two data sets. Since we are using the *pandas* library to handle data sets, we will use *data frame* and *data set* interchangeably.

In [3]:
data_iris = datasets.load_iris (as_frame = True).frame
data_diabetes = datasets.load_diabetes (as_frame = True).frame
data_wine = datasets.load_wine (as_frame = True).frame

First, we will inspect the first five rows (entries) of the data set.

In [4]:
data_wine.head()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0,0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0,0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0,0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0,0


Then, we will inspect the summary of the data set.

In [5]:
data_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  targe

Last, we will inspect the descriptive statistics of the data set to get a feel for the underlying data.

In [6]:
data_wine.describe()

Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258,0.938202
std,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474,0.775035
min,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0,0.0
25%,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5,0.0
50%,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5,1.0
75%,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0,2.0
max,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0,2.0


---
## Correlation matrix chart

As you might already know, a correlation matrix is a table of correlation coefficients between pairs of variables. Examining correlation is one of the most important parts of inspecting data for machine learning purposes. Weakly correlated features bring practical benefits: training tends to be faster, and each feature contributes less redundant information. Conversely, if two features are very highly correlated, we can usually remove one of them from the data set without losing vital information.
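
As a rough illustration of the last point, the sketch below lists the feature pairs whose absolute Pearson correlation exceeds a threshold. The helper name `highly_correlated_pairs`, the threshold of 0.8, and the toy data are assumptions made for this example; they are not part of the notebook.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr(method='pearson').abs()
    # Keep only the strict upper triangle so that each pair is listed once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data (made up for illustration): 'b' is almost a copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

pairs = highly_correlated_pairs(toy)
print(pairs)
```

The same helper could be applied to `data_wine.drop('target', axis = 1)`; a suitable threshold depends on the data and the model.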

In [7]:
def create_corr_chart (data):
    
    # Removing target column
    new_data = data.drop ('target', axis = 1)
    
    # We are using Pearson correlation.
    # We need reset_index and melt to transform data set from wide to long format
    # For more info: https://altair-viz.github.io/user_guide/data.html#data-long-vs-wide
    corr = new_data.corr(method = 'pearson').reset_index().melt('index')
    corr.columns = ['var_1', 'var_2', 'correlation']
    
    # Create correlation matrix chart
    chart = alt.Chart(corr).mark_rect().encode(
        alt.X ('var_1', title = None, axis = alt.Axis(labelAngle = -45)),
        alt.Y ('var_2', title = None),
        alt.Color('correlation', legend = None, scale = alt.Scale(scheme = 'redblue', reverse = True)),
    ).properties(
        width = alt.Step(40),
        height = alt.Step(40)
    )
    
    # Create text values for each colored element on top of existing chart
    chart += chart.mark_text(size = 12).encode(
        alt.Text ('correlation', format = ".2f"),
        color = alt.condition("abs(datum.correlation) > 0.5", alt.value('white'), alt.value('black'))
    )
    
    # The filter keeps only one triangle of the matrix (each pair appears once).
    # To return the full matrix, remove .transform_filter(...)
    return chart.transform_filter("datum.var_1 < datum.var_2")
    

In [8]:
chart_corr_wine = create_corr_chart (data_wine)
chart_corr_wine

### Exercise:
(**EASY**) Show only the upper triangle of the chart above.

### Exercise:
(**MEDIUM**) Calculate the pairwise mutual information matrix for all features (data frame columns) and plot it. Use built-in functions of the *sklearn* library.

---
## Parallel coordinates chart

Visualizing data sets with 2 or 3 dimensions is relatively easy and can be done in multiple ways. Relations between variables can be visualized with scatter plots, bar charts, pie charts, etc. On the other hand, if we have data sets with more than 4 dimensions, visualizing relationships becomes challenging with conventional charts. A parallel coordinates chart, however, can include many dimensions, with each axis representing one variable of the data set. The biggest strength of this chart is that the variables can have values with different ranges and even different units.

In [9]:
def create_parallel_chart (data):
    
    new_data = data.reset_index().melt(id_vars = ['index', 'target'])
    
    chart = alt.Chart(new_data).mark_line().encode(
        alt.X ('variable:N'),
        alt.Y ('value:Q'),
        alt.Color ('target:N'),
        alt.Detail ('index:N'),
        opacity = alt.value(0.4),
    ).properties(width = 1000)
    
    return chart


In [10]:
# We omit the 'proline' and 'magnesium' columns because their much larger value ranges dominate the chart. Try it without removing them.
chart_parallel_wine = create_parallel_chart (data_wine.drop (['proline', 'magnesium'], axis = 1))
chart_parallel_wine

Scaling can sometimes be a crucial step in revealing patterns that were hidden before the transformation was applied. Scaling transforms the raw data of each variable to a scale that is shared with the other variables.

In [11]:
# Let's see what happens if we scale data
scaler_wine = StandardScaler()
scaled_data_wine = scaler_wine.fit_transform (data_wine.drop ('target', axis = 1))
scaled_data_wine = pd.DataFrame(scaled_data_wine, columns = data_wine.drop ('target', axis = 1).columns)
scaled_data_wine['target'] = data_wine['target']
scaled_data_wine.head()


Unnamed: 0,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,target
0,1.518613,-0.56225,0.232053,-1.169593,1.913905,0.808997,1.034819,-0.659563,1.224884,0.251717,0.362177,1.84792,1.013009,0
1,0.24629,-0.499413,-0.827996,-2.490847,0.018145,0.568648,0.733629,-0.820719,-0.544721,-0.293321,0.406051,1.113449,0.965242,0
2,0.196879,0.021231,1.109334,-0.268738,0.088358,0.808997,1.215533,-0.498407,2.135968,0.26902,0.318304,0.788587,1.395148,0
3,1.69155,-0.346811,0.487926,-0.809251,0.930918,2.491446,1.466525,-0.981875,1.032155,1.186068,-0.427544,1.184071,2.334574,0
4,0.2957,0.227694,1.840403,0.451946,1.281985,0.808997,0.663351,0.226796,0.401404,-0.319276,0.362177,0.449601,-0.037874,0
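
The first scaled value in the table above can be checked by hand. `StandardScaler` computes z = (x - mean) / std using the population standard deviation (ddof = 0), whereas `describe()` reports the sample standard deviation (ddof = 1), so the sketch below converts between the two. The input numbers are copied from the `describe()` output for the *alcohol* column; the variable and function names are illustrative.

```python
import math

# Values copied from the describe() output for the 'alcohol' column
n = 178
mean = 13.000618
sample_std = 0.811827  # describe() reports the sample standard deviation (ddof=1)

# StandardScaler divides by the population standard deviation (ddof=0)
population_std = sample_std * math.sqrt((n - 1) / n)

def standardize(x):
    """Z-score of x with respect to the 'alcohol' column."""
    return (x - mean) / population_std

# Raw 'alcohol' value of row 0 is 14.23
z_first_row = standardize(14.23)
print(round(z_first_row, 4))  # close to the 1.518613 shown for row 0
```

Small differences in the last digits come from the rounded inputs.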


In [12]:
chart_parallel_scaled_wine = create_parallel_chart (scaled_data_wine.drop (['proline', 'magnesium'], axis = 1))
chart_parallel_scaled_wine

With parallel coordinates charts, axis ordering is quite important, because a different ordering can reveal structures in the data that are not easily seen otherwise. Optimizing the order of the vertical axes can decrease the clutter of the parallel plot, i.e., minimize the number of intersections between lines.
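
One possible heuristic for choosing an order, sketched below, is to place next to each axis the remaining feature that is most positively correlated with it, since lines between positively correlated axes tend to run roughly in parallel rather than cross. This greedy approach, the helper name `greedy_axis_order`, and the toy data are assumptions made for illustration; the result is not guaranteed to be optimal.

```python
import numpy as np
import pandas as pd

def greedy_axis_order(frame, start=None):
    """Order columns so that neighbouring axes are strongly positively correlated."""
    corr = frame.corr(method='pearson')
    first = start if start is not None else corr.columns[0]
    order = [first]
    remaining = [c for c in corr.columns if c != first]
    while remaining:
        # Pick the remaining feature most correlated with the last placed axis
        nxt = corr.loc[order[-1], remaining].idxmax()
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Toy data (made up for illustration): 'b' follows 'a', 'd' follows 'c'
rng = np.random.default_rng(1)
a = rng.normal(size=300)
c = rng.normal(size=300)
toy = pd.DataFrame({
    'a': a,
    'c': c,
    'b': a + rng.normal(scale=0.1, size=300),
    'd': c + rng.normal(scale=0.1, size=300),
})

order = greedy_axis_order(toy)
print(order)
```

The resulting list could then be passed as the `sort` argument of `alt.X('variable:N', sort = order)` in a parallel coordinates chart.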

### Exercise:
(**EASY**) Try out different axis orderings, plot the chart, and see whether the visual structure improves.

### Exercise:
(**EASY**) Instead of standardizing the data, normalize it (for example, with min-max scaling). Then plot the normalized data with a parallel coordinates chart.

### Exercise:
(**MEDIUM**) Implement filtering, so that when we hover the mouse over a blue line (class 0), all lines other than the blue ones are grayed out.

### Exercise:
(**HARD**) Plot the minimum, maximum and median values of every variable on the parallel coordinates chart. Help and examples can be found here:
* https://github.com/altair-viz/altair/issues/1034
* https://stackoverflow.com/questions/54671453/parallel-coordinates-in-vega-lite/54701776#54701776
* https://vega.github.io/vega/examples/parallel-coordinates/

---
## Scatter plot matrix

A scatter plot is an efficient way to visualize 2D or 3D data (Cartesian plane + color). When we want to quickly visualize all pairwise relationships in a multivariate data set, we can draw a scatter plot for each pair of variables and arrange those plots in a matrix. This type of chart helps answer some important questions about the data set: which features are correlated, whether there is multicollinearity, and whether the data is linearly separable.

In [13]:
def create_scatter_matrix (data):
    
    # We want to remove 'target' from our list of features
    features = data.columns.values[data.columns.values != 'target']
    
    chart = alt.Chart(data).mark_circle().encode(
        alt.X(alt.repeat("column"), type = 'quantitative', scale = alt.Scale (nice = True)),
        alt.Y(alt.repeat("row"), type = 'quantitative', scale = alt.Scale (nice = True)),
        color = 'target:N'
    ).properties(
        width=150,
        height=150
    ).repeat(
        row = features,
        column = features
    )#.interactive()
    
    return chart


In [14]:
chart_scatter_wine = create_scatter_matrix (data_wine)
chart_scatter_wine

### Exercise:
(**EASY**) Implement linking and brushing for the chart above.
* Linking - *showing how a point, or set of points, behaves in each of the plots.*
* Brushing - *the points to be highlighted are interactively selected by a mouse and the scatterplot matrix is dynamically updated.*

### Exercise:
(**HARD**) Show only the lower triangle of the chart above.

---
---
# Homework

Use everything you learned in this notebook to explore and analyze the `ori.dat` data set. Use the *pandas* ***read_csv*** function, with appropriate parameters, to import the data.

### Good luck!

---
---