# DATA VISUALIZATION

> by Dr Juan H Klopper

- Research Fellow
- School for Data Science and Computational Thinking
- Stellenbosch University

## INTRODUCTION

Visualizing data is not only a pleasing activity; it can also provide a richer understanding of the data than summary statistics alone. Data visualization is also part and parcel of data communication, a key activity in Data Science.

## PACKAGES USED IN THIS NOTEBOOK

In this notebook, we take a look at plotly, one of the many plotting packages available in Python. We choose it for this course because it produces modern, multi-use plots, ready for online reports and print documents such as research papers. Matplotlib, seaborn, and altair are some of the other established plotting packages.

In [None]:
import pandas as pd

Two of the main plotting modules in plotly are the `graph_objects` and the `express` modules. We import these with the commonly used namespace abbreviations `go` and `px`. We also import the `io` module (as `pio`) to set a theme for our plots. We choose `plotly_white` since this notebook uses a white background. Typing `pio.templates` displays all the available themes.

In [None]:
import plotly.graph_objects as go
import plotly.express as px
import plotly.io as pio
pio.templates.default = 'plotly_white'

In [None]:
pio.templates # A list of the available plot themes

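This notebook was created on a computer with a retina display. To render high-resolution plots, we use the `%config` magic command below. (This setting configures the inline backend used for static images such as Matplotlib figures; plotly's interactive figures are drawn by the browser.)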

In [None]:
%config InlineBackend.figure_format = "retina" # For Retina type displays

Finally, the `%load_ext` magic command is used to set the way tables are printed to the screen in Google Colab.

In [None]:
%load_ext google.colab.data_table

The `drive` module allows us to connect to the files in our Google Drive.

In [None]:
from google.colab import drive  # Connect to Google Drive

## DATA IMPORT

Since our data is in the `DATA` folder in Google Drive, we need to mount the drive.

In [None]:
# Log on and list files in the DATA directory of our Google Drive
drive.mount('/gdrive')
%cd '/gdrive/My Drive/DATA SCIENCE/DATA'

The data set contains data on banking customers and is used in machine learning to predict whether a customer is likely to close their account.

In [None]:
df = pd.read_csv('customers.csv')

Later, we will also use data about rainfall in Australia.

In [None]:
rain = pd.read_csv('australia_rain.csv')

In [None]:
df # Viewing a table of the customer data

We consider some metadata about the dataframe object.

In [None]:
df.shape # Number of observations (rows) and variables (columns)

In [None]:
df.columns # Statistical variables (column names)

In [None]:
# Data types of each column
df.dtypes

A method that we have not used before is the `info` method. It prints a summary of the dataframe object to the screen, listing every variable, the number of non-missing values for each variable, and the data type of each variable. It does not return a new dataframe object.

In [None]:
df.info()
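
The sketch below, on a made-up miniature dataframe (the name `demo` and its values are illustrative only), shows that `info` prints its summary and returns `None`. When the same facts are needed as objects, `notna().sum()` and `dtypes` are one way to get them.

```python
import pandas as pd

# Hypothetical miniature table; the values are illustrative only
demo = pd.DataFrame({
    'Customer_Age': [45, 49, None],
    'Education_Level': ['Graduate', 'High School', 'Graduate']
})

result = demo.info()   # Prints the summary to the screen
print(result)          # None: nothing is returned

# The same facts as objects that can be reused in code
non_missing = demo.notna().sum()
data_types = demo.dtypes
print(non_missing)
print(data_types)
```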

We start our journey by introducing only the most commonly used plots. The plotly package is enormous and we cannot possibly look at all the plots that it can create. There is a lot more information at the [plotly](https://plotly.com/python/) website.

## BAR PLOT

Bar plots are great for indicating frequency and relative frequency. In other words, they count the occurrences of each sample space element of categorical or discrete data.

One axis, usually the horizontal axis, is reserved to indicate the sample space elements of the categorical variable. The other axis is used to show the frequency or relative frequency, i.e. the height of a bar. There are spaces in between the bars to indicate that the sample space elements are not continuous.

Below, we use the `value_counts` method on the series object of the `Attrition_Flag` variable.

In [None]:
# Calculate the frequency count
df.Attrition_Flag.value_counts()

A total of $8500$ customers are still with the bank (`Existing Customer`) and $1627$ have left the bank (`Attrited Customer`). We visualize this data as a bar plot. The `Figure` function in the graph_objects module creates an empty figure, which we assign to a computer variable. We then use the `add_trace` method on this figure to add a bar chart using the `Bar` function. We set the `x` axis values (the sample space elements of the categorical variable, as a Python list object) and the `y` axis values (the frequencies, also as a Python list object).

In [None]:
churn_fig = go.Figure() # Simple bar chart

churn_fig.add_trace(
    go.Bar(
        x=['Existing customers', 'Lost customers'],
        y=[8500, 1627]
    )
)

We entered the `x` and `y` values by hand. While this is easy enough when there are only a few sample space elements, it quickly becomes impractical. We can get the sample space elements using the `unique()` method. The elements are returned in the order in which they first appear in the relevant pandas series object.

In [None]:
df.Education_Level.unique()

This might not be the order in which we want the bars to appear in.

With the default argument values of the `.value_counts()` method, we get the frequencies in descending order.

In [None]:
df.Education_Level.value_counts()

This ordering might also not be useful, for example in the case of a statistical variable such as *Month*, where the calendar order matters more than the frequencies. The bottom line is that we have to be careful with our code when designing plots.
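
One possible way to impose a chosen order is the `reindex` method on the series of counts. The sketch below uses a made-up *Month* variable; the names `months`, `calendar_order`, `month_levels`, and `month_totals` are our own and not part of the customer data.

```python
import pandas as pd

# Made-up observations of a Month variable
months = pd.Series(['Mar', 'Jan', 'Mar', 'Feb', 'Mar', 'Feb'])

counts = months.value_counts()          # Sorted by frequency: Mar, Feb, Jan
calendar_order = ['Jan', 'Feb', 'Mar']  # The order we want on the axis
ordered = counts.reindex(calendar_order)

month_levels = ordered.index.tolist()
month_totals = ordered.values.tolist()
print(month_levels, month_totals)
```

The two lists can then be passed as the `x` and `y` arguments of a bar trace, and the bars follow the calendar order.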

The `Bar` function requires the frequencies as a list of values. We extract that list below using the `values` attribute and the `tolist` method.

In [None]:
totals = df.Education_Level.value_counts().values.tolist()
totals

Since we are using a series object, with the sample space elements as the index, we can use the `index` attribute. We can also convert it to a list using the `tolist` method.

In [None]:
levels = df.Education_Level.value_counts().index.tolist()
levels

With these lists saved, we can use them for our `x` and `y` arguments in the `Bar` function. We also use the `marker` argument and set its value using a dictionary. We specify seven colors, as well as a bar outline color and line thickness. Colors can be specified by name, such as `red` or `blue`, or by RGBA values. The latter uses four values. The first three are the red, green, and blue intensities, each ranging from $0$ to $255$. The last value is a fraction ($0$ to $1$) that sets the opacity.

In [None]:
# Simple bar plot
churn_fig = go.Figure()

churn_fig.add_trace(go.Bar(
    x=levels,
    y=totals,
    marker={'color':['red', 'rgba(50, 50, 50, 0.5)', 'rgba(75, 75, 75, 0.5)', 'rgba(100, 100, 100, 0.5)', 'rgba(150, 150, 150, 0.5)', 'rgba(175, 175, 175, 0.5)', 'rgba(200, 200, 200, 0.5)'],
            'line':{'color':'black', 'width':1}}
))

churn_fig.show()
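
The RGBA strings in the cell above follow a fixed pattern, so they can also be generated rather than typed. The helper `rgba` below is a hypothetical convenience function of our own, not part of plotly; it only builds the string and checks the ranges described earlier.

```python
def rgba(red, green, blue, alpha=1.0):
    """Return an RGBA color string (hypothetical helper, not a plotly function)."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError('Color intensities must be between 0 and 255')
    if not 0 <= alpha <= 1:
        raise ValueError('Opacity must be between 0 and 1')
    return f'rgba({red}, {green}, {blue}, {alpha})'

# One highlighted bar followed by six progressively lighter grey bars
grey_levels = [50, 75, 100, 150, 175, 200]
colors = ['red'] + [rgba(g, g, g, 0.5) for g in grey_levels]
print(colors)
```

The resulting list can be used as the value of the `'color'` key in the `marker` dictionary.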

The figure is assigned to a computer variable. We can use the `update_layout` method to add a title and axes labels.

In [None]:
# Add a title
churn_fig.update_layout(title='Number of customers in each education level')
churn_fig.show()

In [None]:
# Add axes labels
churn_fig.update_layout(xaxis=dict(title='Education level'),
                          yaxis=dict(title='Counts'))
churn_fig.show()

In the bar plot below, we add many more properties as an illustration of what is possible.

In [None]:
churn_fig = go.Figure()

churn_fig.add_trace(go.Bar(
    x=levels,
    y=totals,
    text=levels,
    textposition='outside',
    marker={'color':'deepskyblue',
            'line':{'color':'black', 'width':1},
            'opacity':0.9}
))

churn_fig.update_layout(title='Number of customers in each education level')

churn_fig.update_layout(xaxis = dict(title='Education level'),
                          xaxis_tickangle=-25,
                          yaxis=dict(title='Counts'))

churn_fig.show()

We can also create horizontal bar plots by setting the `orientation` argument to `h`, remembering to switch the axes information.

In [None]:
churn_fig = go.Figure()

churn_fig.add_trace(go.Bar(
    y=levels,
    x=totals,
    marker={'color':'orange',
            'line':{'color':'black', 'width':1},
            'opacity':0.9},
    orientation = 'h'
))

churn_fig.update_layout(title='Number of customers in each education level')

churn_fig.update_layout(yaxis = dict(title='Education level'),
                        xaxis=dict(title='Counts'))

churn_fig.show()

We can group data by another categorical variable. Below, we plot the frequencies of the different educational levels, divided by customer group (lost versus existing). We start with the `crosstab` function in pandas. It returns a dataframe object. We pass two categorical variables. The sample space elements of the variable listed first become the rows and the sample space elements of the second variable become the column names.

In [None]:
attr_edu = pd.crosstab(df.Attrition_Flag, df.Education_Level)
attr_edu

Note that the order of the educational levels is now different from when we used the `value_counts` method on a series object. We save the educational level values as a list.

In [None]:
levels = attr_edu.columns.tolist()
levels

Since our object is a dataframe, we can use indexing. Below, we do this by using `iloc` indexing. With it, we specify the row and the column values we require. Remember that the colon symbol is shorthand for including all the values (columns in this case).

In [None]:
attr_values = attr_edu.iloc[0, :].tolist()
attr_values

In [None]:
exis_values = attr_edu.iloc[1, :].tolist()
exis_values
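
Position-based indexing depends on knowing which group ended up in which row. As an alternative sketch, `loc` selects the rows by their labels instead. The dataframe `demo` below is made up for illustration, and its labels only mimic those in the customer data.

```python
import pandas as pd

# Made-up observations; the labels mimic those in the customer data
demo = pd.DataFrame({
    'Attrition_Flag': ['Existing Customer', 'Attrited Customer',
                       'Existing Customer', 'Existing Customer'],
    'Education_Level': ['Graduate', 'Graduate', 'College', 'Graduate']
})

table = pd.crosstab(demo.Attrition_Flag, demo.Education_Level)

# Select rows by label rather than by position
lost = table.loc['Attrited Customer', :].tolist()
existing = table.loc['Existing Customer', :].tolist()
print(table.columns.tolist(), lost, existing)
```

Selecting by label keeps the code correct even if the row order of the table changes.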

Now we can generate two traces by using the `add_trace` method twice. We also set the `barmode` argument to `group` in the `update_layout` method to create a grouped bar chart.

In [None]:
churn_edu_fig = go.Figure()

churn_edu_fig.add_trace(go.Bar(
    x=levels,
    y=attr_values,
    text=attr_values,
    textposition='outside',
    name='Lost customers',
    marker={'color':'orange', 'opacity':0.7}
))

churn_edu_fig.add_trace(go.Bar(
    x=levels,
    y=exis_values,
    text=exis_values,
    textposition='outside',
    name='Existing customers',
    marker={'color':'deepskyblue', 'opacity':0.7}
))

churn_edu_fig.update_layout(title='Education level by attrition group')

churn_edu_fig.update_layout(xaxis = dict(title='Education level'),
                           yaxis=dict(title='Counts'),
                           barmode='group')

churn_edu_fig.show()

#### Exercise (advanced)

Create the bar plot above, but group the horizontal axis by the customer groups.

#### Solution

One way to generate a plot for this exercise is to use the idea of creating arrays and lists so that we can loop over their elements.

We start by generating a dataframe object, using the reverse order of the categorical variables in the `crosstab` function.

In [None]:
attr_edu = pd.crosstab(df.Education_Level, df.Attrition_Flag)
attr_edu

We save the column values again.

In [None]:
groups = attr_edu.columns.tolist()
groups

We can extract all the dataframe values as an array.

In [None]:
values = attr_edu.values
values

We also still need a list of all the education levels. This is stored as the index of the dataframe object.

In [None]:
levels = attr_edu.index.tolist()
levels

Now we generate seven traces by looping over the seven rows of the (values) array and the seven elements of the (education level) list above. We could also have added each trace manually. The idea is to use the power of the Python language to do all the hard work.

In [None]:
churn_edu_fig = go.Figure()

# Loop over the rows of the values array and the elements of the levels list
for i in range(len(values)):
  churn_edu_fig.add_trace(go.Bar(
      x=groups,
      y=values[i],
      text=values[i],
      textposition='outside',
      name=levels[i],
))

churn_edu_fig.update_layout(title='Educational level frequencies by customer group')

# Using alternate dictionary syntax
churn_edu_fig.update_layout({'xaxis':{'title':'Customer group'},
                            'yaxis':{'title':'Counts'},
                            'barmode':'group'})

churn_edu_fig.show()

Below, we turn this into a stacked bar chart.

In [None]:
churn_edu_fig = go.Figure()

# Loop over the rows of the values array and the elements of the levels list
for i in range(len(values)):
  churn_edu_fig.add_trace(go.Bar(
      x=groups,
      y=values[i],
      text=values[i],
      textposition='outside',
      name=levels[i],
))

churn_edu_fig.update_layout(title='Educational level frequencies by customer group')

# Using alternate dictionary syntax
churn_edu_fig.update_layout({'xaxis':{'title':'Customer group'},
                            'yaxis':{'title':'Counts'},
                            'barmode':'stack'})

churn_edu_fig.show()

For more information about bar plots click [HERE](https://plotly.com/python/bar-charts/).

## HISTOGRAM

A histogram is used for continuous numerical variables. Binning is a method of dividing the range of values of the variable into intervals (bins) and counting how many values _fall within_ each interval.
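
To make the idea of binning concrete, here is a minimal sketch in plain Python. The ages and the bin width of five years are made up for illustration; plotly performs this counting for us when it draws a histogram.

```python
from collections import Counter

ages = [26, 31, 34, 35, 38, 41, 44, 44, 47, 52]  # Made-up ages
bin_width = 5

# Map each age to the lower edge of its interval, e.g. 34 -> 30
lower_edges = [(age // bin_width) * bin_width for age in ages]
frequencies = Counter(lower_edges)

# Each interval includes its lower edge and excludes its upper edge
for edge in sorted(frequencies):
    print(f'[{edge}, {edge + bin_width}): {frequencies[edge]}')
```

The height of each bar in a histogram is the frequency of its interval.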

Histograms are great at showing us the distribution of the data.  Below, we plot a histogram of the age of our customers.  This time, we make use of the plotly express library. The first argument is the dataframe object and the `x` argument is assigned to the `Customer_Age` column in the dataframe.

In [None]:
age_hist = px.histogram(df,
                        x='Customer_Age')
age_hist.show()

The bin size was set by default at one year increments. Note that there are no gaps between the bars as with a bar chart. This indicates the fact that the variable is continuous numerical.

We can create a **stacked** histogram by using the `color` argument to point to another categorical variable, `Attrition_Flag` in this case. The `labels` argument is set to a dictionary. It can be used to change text aspects of the axes labels of the plot. First, we see a plot without this argument and then with it to see the differences.

In [None]:
age_hist = px.histogram(
    df,
    x='Customer_Age',
    color='Attrition_Flag',
    title='Histogram of customer ages',
    opacity=0.7,
    marginal='rug'
)

age_hist.show()

Note the `Customer_Age` title of the `x` axis, taken from the column name. The dictionary is used below to replace this value. This is only available in the express module.

In [None]:
age_hist = px.histogram(
    df,
    x='Customer_Age',
    color='Attrition_Flag',
    title='Histogram of customer ages',
    opacity=0.7,
    marginal='rug',
    labels={
        'Attrition_Flag':'Customer group',
        'Customer_Age':'Customer age'
    }
)

age_hist.show()

The rug plot shows the actual values as small marks at the top of the plot. Since the values overlap (many customers with the same age), we see that the rug plot does not give us a good idea of the distribution of the data.

The graph objects module of plotly provides more flexibility.

In [None]:
age_hist = go.Figure()

age_hist.add_trace(go.Histogram(
    x=df.Customer_Age
))

age_hist.show()

Let's look at the age distribution for each of the customer groups. Using the `barmode='overlay'` option, we create a **non-stacked** histogram. Below, we set a start and an end value for the _x_ axis and also a bin size of five-year increments.

In [None]:
ages_cust = go.Figure()

ages_cust.add_trace(go.Histogram(
    x=df[df.Attrition_Flag == 'Existing Customer']['Customer_Age'],
    name='Existing customer',
    marker_color='deepskyblue',
    xbins=dict(start=25,
               end=80,
               size=5)
))
ages_cust.add_trace(go.Histogram(
    x=df[df.Attrition_Flag == 'Attrited Customer']['Customer_Age'],
    name='Lost customer',
    marker_color='orange',
    xbins=dict(start=25,
               end=80,
               size=5)
))

ages_cust.update_layout(barmode='overlay',
                         title='Age distribution of existing and lost customers',
                         xaxis=dict(title='Age'),
                         yaxis=dict(title='Count'))

ages_cust.update_traces(opacity=0.75)

ages_cust.show()

For more information on histograms from plotly click [HERE](https://plotly.com/python/histograms/).

## BOX PLOT

Box-and-whisker plots are another visual representation of the distribution of a continuous numerical variable.

Below, we create a box plot of the ages of each customer group.

In [None]:
# Simple box plots using express
ages_churn_box_px = px.box(
    df,
    x='Attrition_Flag',
    y='Customer_Age',
    title='Distribution of age in customer groups',
    labels={'Customer_Age':'Age', 'Attrition_Flag':'Customer group'})
ages_churn_box_px.show()
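
The box in such a plot spans the first to the third quartile, with a line at the median. As a rough sketch of the values involved, the quartiles and the commonly used fences at $1.5$ times the interquartile range can be computed with the standard library. The ages are made up, and the `inclusive` method is only one of several quartile conventions, so the numbers need not match those plotly shows on hover.

```python
import statistics

ages = [30, 35, 40, 45, 50, 55, 60, 65, 70]  # Made-up ages

# Quartiles; the 'inclusive' method is one of several conventions
q1, median, q3 = statistics.quantiles(ages, n=4, method='inclusive')
iqr = q3 - q1

# Fences at 1.5 times the interquartile range, a common whisker rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(q1, median, q3, iqr, lower_fence, upper_fence)
```

Values outside the fences are the ones usually drawn as individual outlier points.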

With the `Box` function in the graph_objects module, we can take more control over each trace than in the express module. Below, we extract ages for each customer group as list objects.

In [None]:
# Extracting list objects
exis_age = df[df.Attrition_Flag == 'Existing Customer']['Customer_Age'].to_list()
churn_age = df[df.Attrition_Flag == 'Attrited Customer']['Customer_Age'].to_list()
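
When there are more than two groups, writing one filter per group becomes tedious. One possible alternative, sketched here on a made-up dataframe named `demo`, is to let `groupby` split the data and to collect the lists in a dictionary.

```python
import pandas as pd

# Made-up observations; the column names mimic those in the customer data
demo = pd.DataFrame({
    'Attrition_Flag': ['Existing Customer', 'Attrited Customer',
                       'Existing Customer', 'Attrited Customer'],
    'Customer_Age': [45, 62, 51, 40]
})

# One list of ages per customer group, keyed by the group label
ages_by_group = {
    group: frame['Customer_Age'].tolist()
    for group, frame in demo.groupby('Attrition_Flag')
}
print(ages_by_group)
```

Each key and list pair can then be used for the `name` and `y` arguments of a separate trace.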

Now we create separate traces for each group. Code comments highlight the arguments.

In [None]:
# Adding separate traces and configuration
ages_churn_box = go.Figure()

ages_churn_box.add_trace(go.Box(
    y=exis_age,
    name='Existing customers',
    marker_color='green',
    boxmean=True, # Add a mean as a dotted line
    boxpoints='all' # Add all the age values as dots next to the box
))

ages_churn_box.add_trace(go.Box(
    y=churn_age,
    name='Lost customers',
    marker_color='red',
    boxmean='sd', # Add a mean and standard deviation as dotted lines
    boxpoints='all'
))

ages_churn_box.update_layout(title='Distribution of ages',
                             xaxis={'title':'Group'},
                             yaxis={'title':'Age'})

ages_churn_box.show()

You can learn more about box-and-whisker plots [HERE](https://plotly.com/python/box-plots/).

## SCATTER PLOT

Scatter plots allow us to view the difference between observations with respect to continuous numerical variables. If we restrict ourselves to two variables, each dot on a plane with an _x_ and a _y_ axis can represent the values for a pair of continuous numerical variables for each observation.

- Create a scatter plot of age ($x$ axis) vs months on book ($y$ axis).

In [None]:
age_mob = go.Figure()

age_mob.add_trace(go.Scatter(
    x=df.Customer_Age,
    y=df.Months_on_book,
    mode='markers'
))

age_mob.update_layout(title="Age vs month on books",
                      xaxis=dict(title="Age"),
                      yaxis=dict(title="Months"))

age_mob.show()

We can add a third variable in the form of the size of the marker. Below, we do just that and compare the age and months on book of the customers. To this, we add the number of dependents as the size of the marker. We also split the customers by the sample space elements of the `Attrition_Flag` variable.

We can go even further and add box-and-whisker plots, together with a linear model (using ordinary least squares).

In [None]:
# Box and whisker plots of the variables
age_mob_group_px = px.scatter(
    df,
    x='Customer_Age',
    y='Months_on_book',
    size='Dependent_count',  # Determines size of markers
    color='Attrition_Flag',  # Group by this variable
    marginal_y='box',
    marginal_x='box',
    trendline='ols',
    title='Age vs months on books',
    labels={'Customer_Age':'Age', 'Months_on_book':'Months'})  # Over-write column names
age_mob_group_px.show()

Instead of box-and-whisker plots, we can also add rug plots and histograms.

In [None]:
# Rug plot and histogram as marginal plots
px.scatter(
    df,
    x='Customer_Age',
    y='Months_on_book',
    size='Dependent_count',  # Determines size of markers
    color='Attrition_Flag',
    marginal_y='histogram',
    marginal_x='rug',
    trendline='ols',
    title='Comparing age vs months on books for each of the two groups',
    labels={'Customer_Age':'Age', 'Months_on_book':'Months'}).show()

Instead of the size of the marker as visual indicator of the third variable, we can also use color.  Below, we also add the `facet_col` argument.  This creates individual plots based on the sample space elements of a categorical variable and positions them as columns.

- Separate scatter plots
- _Third dimension_ by color scale

In [None]:
px.scatter(
    df,
    x='Customer_Age',
    y='Months_on_book',
    color='Dependent_count',
    facet_col='Attrition_Flag',
    trendline='ols',
    title='Separate scatter plots per group',
    labels={'Attrition_Flag':'Group', 'Months_on_book':'Months', 'Customer_Age':'Age', 'Dependent_count':'Dependents'},
    color_continuous_scale=px.colors.sequential.Viridis).show();

(Hover on the trendline to see the regression model and the coefficient of determination, $R^2$.  We will learn more about this when we look at linear regression.)
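
As a preview, the slope, intercept, and $R^2$ of an ordinary-least-squares line can be computed by hand for a small example. The data below are made up. Note also that plotly express relies on the statsmodels package being installed when `trendline='ols'` is used.

```python
x = [1, 2, 3, 4, 5]   # Made-up explanatory values
y = [2, 4, 5, 4, 5]   # Made-up response values

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)

# Least-squares estimates of the slope and intercept
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Coefficient of determination: 1 - (residual SS / total SS)
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot
print(slope, intercept, r_squared)
```

An $R^2$ close to $1$ means that the line explains most of the variation in the response, while a value close to $0$ means that it explains very little.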

You can learn more about scatter plots [HERE](https://plotly.com/python/line-and-scatter/).

## TIME SERIES DATA

Time series contain dates or sequential data of some form. Here, we use the Australian rainfall data set. We start by investigating the difference in maximum daily temperatures between Darwin and Hobart.

We start by using the `info` method to learn more about the dataframe object.

In [None]:
rain.info()

The data points for `MaxTemp`, the maximum temperature, are connected by line segments using the `line` function in the express module.

In [None]:
px.line(
    rain[(rain.Location == 'Darwin') | (rain.Location == 'Hobart')],
    x='Date',
    y='MaxTemp',
    color='Location',
    title='Maximum temperature in Darwin and Hobart between 2009 and 2017',
    labels={'MaxTemp':'Maximum temperature'}
)
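
Two small preparation steps can help with plots like this: converting the date column to a datetime type so that it is treated as dates, and selecting several locations with the `isin` method instead of chaining comparisons. The sketch below uses a made-up dataframe named `demo`; whether the `Date` column in the real file still needs converting is an assumption that the output of `rain.info()` would confirm.

```python
import pandas as pd

# Made-up observations; the column names mimic those in the rainfall data
demo = pd.DataFrame({
    'Date': ['2009-01-02', '2009-01-01', '2009-01-01', '2009-01-01'],
    'Location': ['Darwin', 'Hobart', 'Darwin', 'Perth'],
    'MaxTemp': [33.1, 22.4, 32.5, 30.0]
})

# Convert text to datetime values so that the axis is treated as dates
demo['Date'] = pd.to_datetime(demo['Date'])

# Keep only the rows for the chosen cities, in date order
cities = ['Darwin', 'Hobart']
subset = demo[demo.Location.isin(cities)].sort_values('Date')
print(subset)
```

Sorting by date before plotting keeps the line segments running from left to right.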

We can also view the individual data points as a scatter plot. We also add a range slider at the bottom of the plot.

In [None]:
px.scatter(
    rain[(rain.Location == 'Darwin') | (rain.Location == 'Hobart')],
    x='Date',
    y='MaxTemp',
    color='Location',
    title='Maximum temperature in Hobart and Darwin between 2009 and 2017',
    labels={'MaxTemp':'Maximum temperature'}
).update_xaxes(rangeslider_visible=True)

Lastly, let's look at the difference in daily rainfall between Perth in the West and Sydney in the East.

In [None]:
px.scatter(
    rain[(rain.Location == 'Perth') | (rain.Location == 'Sydney')],
    x='Date',
    y='Rainfall',
    color='Location',
    title='Daily rainfall in Sydney and Perth between 2009 and 2017'
).update_xaxes(rangeslider_visible=True)

You can learn more about plots for dates and times [HERE](https://plotly.com/python/time-series/).

## CONCLUSION

There is so much more to learn about plotly and we will see new types of plots in later notebooks.  Visit the homepage [HERE](https://plotly.com/python/).  For references on all the arguments click [HERE](https://plotly.com/python-api-reference/).