# Data Visualization and Exploration

Data visualization helps us understand the structure of the data. This step is especially important when we analyze data in preparation for applying machine learning. Common plots include histograms, box plots, scatter plots, and many more. Exploring the data often takes a lot of time, but the process helps us define the problem statement for our data set, which is very important.

In addition to data visualization, there are other ways to explore data, especially text data. Unsupervised methods such as word clouds and topic modeling are helpful for data exploration, description, and visualization. They are useful when we know little about the data but nonetheless need to tell a story about it, which is why they are common in data journalism. But they can be problematic: they may reveal false patterns or invite misleading statements. For example, topic modeling of Twitter feeds may give us very different patterns during the Super Bowl than during an event such as a political election.
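
As a minimal sketch of the counting step behind a word cloud, the snippet below tallies word frequencies in a tiny corpus. The three example sentences, the stop-word list, and the `tokenize` helper are all made up for illustration; a real analysis would use actual tweets and a proper stop-word list.

```python
import re
from collections import Counter

# Hypothetical mini-corpus standing in for a set of tweets.
docs = [
    "Great game tonight, what a game!",
    "The election results are in tonight.",
    "What a great halftime show.",
]

# A small, illustrative stop-word list (an assumption, not a standard list).
stop_words = {"a", "the", "are", "in", "what"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Count every token that is not a stop word.
word_counts = Counter(
    token
    for doc in docs
    for token in tokenize(doc)
    if token not in stop_words
)

print(word_counts.most_common(3))
```

A word cloud draws each word with a size proportional to counts like these, so whichever event dominates the corpus also dominates the picture.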




## Scientific graphing and charting

First, we will use some Python modules to illustrate different types of data in different formats. 

Graphing and charting help us to understand the basic statistics of data, such as frequency and distribution along certain values. 


### Line chart, bar chart, pie chart (review)

We'll now use the Matplotlib package for visualization in Python. Matplotlib is a multi-platform data visualization library built on NumPy arrays, and designed to work with the broader SciPy stack. It was conceived by John Hunter in 2002, originally as a patch to IPython for enabling interactive MATLAB-style plotting via gnuplot from the IPython command line. IPython's creator, Fernando Perez, was at the time scrambling to finish his PhD, and let John know he wouldn’t have time to review the patch for several months. John took this as a cue to set out on his own, and the Matplotlib package was born, with version 0.1 released in 2003. It received an early boost when it was adopted as the plotting package of choice of the Space Telescope Science Institute (the folks behind the Hubble Telescope), which financially supported Matplotlib’s development and greatly expanded its capabilities.

In [None]:
# import matplotlib
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline

# change image size
plt.rcParams['figure.figsize'] = [15, 9]

import numpy as np

# generate evenly spaced numbers over a specified interval 
x = np.linspace(0, 10, 100)
print(x) 

# plot two mathematical functions -- sin and cos
plt.plot(x, np.sin(x))
plt.plot(x, np.cos(x))
 
plt.show()

You can change the color and style of the lines.

In [None]:
import numpy as np
x = np.linspace(0, 10, 100)

fig = plt.figure()
# solid line
plt.plot(x, np.sin(x), '-', color='r')
# dashed line
plt.plot(x, np.cos(x), '--', color='b');

We can also make other types of graphs, such as Bar Charts.

In [None]:
import matplotlib.pyplot as plt

# Look at x = 4 and x = 6: both series have a bar there, so the bars overlap.
x1 = [1, 3, 4, 5, 6, 7, 9]
y1 = [4, 7, 2, 4, 7, 8, 3]

x2 = [2, 4, 6, 8, 10]
y2 = [5, 6, 2, 6, 2]

# Colors: https://matplotlib.org/api/colors_api.html

plt.bar(x1, y1, label="Blue Bar", color='b')
plt.bar(x2, y2, label="Green Bar", color='g')

plt.xlabel("bar number")
plt.ylabel("bar height")
plt.title("Bar Chart Example")
plt.legend()
plt.show()

Or histograms.

(Note: A histogram is similar to a bar chart, but a histogram groups numbers into ranges. The height of each bar shows how many values fall into each range. And you decide what ranges to use!)
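
To make the idea of "ranges" concrete, here is a small sketch that counts values per bin with NumPy, without drawing anything. The data values and the bin edges are arbitrary choices for illustration.

```python
import numpy as np

# Ten illustrative values and three hand-picked ranges: [0, 4), [4, 8), [8, 12].
data = np.array([1, 2, 2, 3, 5, 6, 6, 6, 9, 10])
bin_edges = [0, 4, 8, 12]

# counts[i] is the number of values that fall into the i-th range.
counts, edges = np.histogram(data, bins=bin_edges)

# A cumulative histogram is the running total of those counts.
cumulative = np.cumsum(counts)

print(counts)
print(cumulative)
```

Changing `bin_edges` changes the counts, which is why the choice of ranges affects how a histogram looks.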

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Use numpy to generate a bunch of random data in a bell curve around 5.
n = 5 + np.random.randn(1000)

m = np.arange(len(n))  # x positions: one bar per sample
plt.bar(m, n)
plt.title("Raw Data")
plt.show()

plt.hist(n, bins=20)
plt.title("Histogram")
plt.show()

plt.hist(n, cumulative=True, bins=20)
plt.title("Cumulative Histogram")
plt.show()

Or scatter plots.

(Note: A Scatter (XY) Plot has points that show the relationship between two sets of data.)

In [None]:
import matplotlib.pyplot as plt

x1 = [2, 3, 4]
y1 = [5, 5, 5]

x2 = [1, 2, 3, 4, 5]
y2 = [2, 3, 2, 3, 4]
y3 = [6, 8, 7, 8, 7]

# Markers: https://matplotlib.org/api/markers_api.html

plt.scatter(x1, y1)
plt.scatter(x2, y2, marker='v', color='r')
plt.scatter(x2, y3, marker='^', color='m')
plt.title('Scatter Plot Example')
plt.show()
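
The relationship that a scatter plot shows can also be summarized with a single number. As a rough sketch, the snippet below computes the Pearson correlation coefficient for one pair of lists from the scatter example (the values of `x2` and `y2`); using `np.corrcoef` here is our own addition, not part of the plotting code.

```python
import numpy as np

# Same values as x2 and y2 in the scatter plot example.
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3, 2, 3, 4])

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the
# correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

print(round(r, 3))
```

A value near 1 means the points rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is no clear linear trend.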

Or pie charts.

In [None]:
import matplotlib.pyplot as plt

labels = 'S1', 'S2', 'S3'
sections = [56, 66, 24]
colors = ['c', 'g', 'y']

plt.pie(sections, labels=labels, colors=colors,
        startangle=90,
        explode = (0, 0.1, 0),
        autopct = '%1.2f%%')

plt.axis('equal') # Try commenting this out.
plt.title('Pie Chart Example')
plt.show()

Now try to generate some complicated data and visualize it with lines.

In [None]:
# Create some data
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 500)
y = np.cumsum(rng.randn(500, 6), 0)

print("print out x:")
print(x)
print("print out y:")
print(y)
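
The call `np.cumsum(..., 0)` in the cell above takes a running sum down each column, so every column of `y` becomes its own random walk. A tiny sketch with hand-picked steps (an illustrative array, not the random data above) shows the effect:

```python
import numpy as np

# Three time steps (rows) for two independent series (columns).
steps = np.array([[1, -1],
                  [2,  0],
                  [-1, 3]])

# axis=0 accumulates down the rows, separately for each column.
walk = np.cumsum(steps, axis=0)

print(walk)
```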

In [None]:
# Plot the data with Matplotlib defaults
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');

### Exercise 1

Choose one of the chart types described above to illustrate the gender composition of our class.

In [None]:
## Complete your code here

labels = 'in-china', 'outside-of-china'
sections = [6,7]
colors = ['b', 'g']

plt.pie(sections, labels=labels, colors=colors,
        startangle=90,
        autopct = '%1.6f%%')

plt.axis('equal') # Try commenting this out.
plt.title('location composition')
plt.show()

## Visualization with Seaborn

One advantage of using Python is to make use of various packages that are available at our disposal.

Now let's take a look at how it works with Seaborn, a newer visualization package. As we will see, Seaborn has many of its own high-level plotting routines, but it can also overwrite Matplotlib's default parameters and in turn get even simple Matplotlib scripts to produce vastly superior output. We can set the style by calling Seaborn's set() method. By convention, Seaborn is imported as sns:

In [None]:
import seaborn as sns
sns.set()

# same plotting code as above!
plt.plot(x, y)
plt.legend('ABCDEF', ncol=2, loc='upper left');

We'll be using a cool Pokémon dataset (first generation). 

Once you've downloaded the CSV file, you can import it with Pandas.

Tip: The argument index_col=0 simply means we'll treat the first column of the dataset as the ID column.
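
To see what index_col=0 changes, here is a small sketch that reads the same CSV text twice. The three-column excerpt is a stand-in for illustration and is not the real Pokemon.csv file.

```python
import io
import pandas as pd

# A tiny illustrative CSV: an ID column, a name, and one stat.
csv_text = "#,Name,Attack\n1,Bulbasaur,49\n2,Ivysaur,62\n"

# Without index_col, the ID is an ordinary data column.
default_df = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the row index instead.
indexed_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_df.columns.tolist())
print(indexed_df.columns.tolist())
print(indexed_df.index.name)
```

Keeping the ID out of the data columns matters later: plotting functions that draw every numeric column would otherwise treat the ID as if it were a stat.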

In [None]:
# pandas for managing datasets
import pandas as pd

# Read dataset
df = pd.read_csv('Pokemon.csv', index_col=0, encoding='unicode_escape')

Here's what the dataset looks like:

In [None]:
# Display first 10 observations
df.head(10)

### Seaborn's plotting functions

One of Seaborn's greatest strengths is its diversity of plotting functions. For instance, making a scatter plot is just one line of code using the lmplot() function.

The recommended way is to pass your DataFrame to the data= argument, while passing column names to the axes arguments, x= and y=.

In [None]:
sns.lmplot(x='Attack', y='Defense', data=df)

We actually used Seaborn's function for fitting and plotting a regression line, which is why you see a diagonal line. (Seaborn 0.9 and later also provide a dedicated scatterplot() function, but lmplot() works for our purposes.)

Thankfully, each plotting function has several useful options that you can set. Here's how we can tweak the lmplot():

First, we'll set fit_reg=False to remove the regression line, since we only want a scatter plot.

Then, we'll set hue='Stage' to color our points by the Pokémon's evolution stage. This hue argument is very useful because it allows you to express a third dimension of information using color.

In [None]:
# Scatterplot arguments
sns.lmplot(x='Attack', y='Defense', data=df,
           fit_reg=False, # No regression line
           hue='Stage')   # Color by evolution stage
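
Conceptually, hue splits the data by a categorical column and draws each group in its own color. The sketch below imitates that with plain Matplotlib and a pandas groupby. It is only an approximation of the idea, not Seaborn's implementation, and the `toy` DataFrame holds made-up values rather than the real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stats for six hypothetical Pokémon in three evolution stages.
toy = pd.DataFrame({
    "Attack":  [49, 62, 82, 52, 64, 84],
    "Defense": [49, 63, 83, 43, 58, 78],
    "Stage":   [1, 2, 3, 1, 2, 3],
})

fig, ax = plt.subplots()

# One scatter call per stage; Matplotlib assigns each call a new color.
for stage, group in toy.groupby("Stage"):
    ax.scatter(group["Attack"], group["Defense"], label=f"Stage {stage}")

ax.set_xlabel("Attack")
ax.set_ylabel("Defense")
ax.legend(title="Stage")
```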

Looking better, but we can improve this scatter plot further. For example, all of our Pokémon have positive Attack and Defense values, yet our axes limits fall below zero. Let's see how we can fix that...

### Customizing with Matplotlib

Remember, Seaborn is a high-level interface to Matplotlib. From our experience, Seaborn will get you most of the way there, but you'll sometimes need to bring in Matplotlib.

Setting your axes limits is one of those times, but the process is pretty simple:

1. First, invoke your Seaborn plotting function as normal.
2. Then, invoke Matplotlib's customization functions. In this case, we'll use its ylim() and xlim() functions.

Here's our new scatter plot with sensible axes limits:

In [None]:
# Plot using Seaborn
sns.lmplot(x='Attack', y='Defense', data=df,
           fit_reg=False, 
           hue='Stage')
 
# Tweak using Matplotlib

plt.ylim(0, None)
plt.xlim(0, None)

### The role of Pandas

Even though this is a Seaborn tutorial, Pandas actually plays a very important role. You see, Seaborn's plotting functions benefit from a base DataFrame that's reasonably formatted.

For example, let's say we wanted to make a box plot for our Pokémon's combat stats:

In [None]:
# Boxplot
sns.boxplot(data=df)

Well, that's a reasonable start, but there are some columns we'd probably like to remove:

* We can remove the Total since we have individual stats.
* We can remove the Stage and Legendary columns because they aren't combat stats.

It turns out that this isn't easy to do within Seaborn alone. Instead, it's much simpler to pre-format your DataFrame.

Let's create a new DataFrame called stats_df that only keeps the stats columns:

In [None]:
# Pre-format DataFrame
stats_df = df.drop(['Total', 'Stage', 'Legendary'], axis=1)
 
# New boxplot using stats_df
sns.boxplot(data=stats_df)

### Seaborn themes

Another advantage of Seaborn is that it comes with decent style themes right out of the box. The default theme is called 'darkgrid'.

Next, we'll change the theme to 'whitegrid' while making a violin plot.

* Violin plots are useful alternatives to box plots.
* They show the distribution (through the thickness of the violin) instead of only the summary statistics.
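
For reference, the summary statistics behind a box plot can be computed directly. This sketch uses nine illustrative values and assumes the common convention of whiskers limited to 1.5 times the interquartile range, which is the default in both Matplotlib and Seaborn.

```python
import numpy as np

# Nine illustrative values.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# The box spans the first to third quartile; the line inside is the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, median, q3, iqr, lower_fence, upper_fence)
```

A violin plot keeps this information but also shows how the values are spread between and beyond the quartiles.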

For example, we can visualize the distribution of Attack by Pokémon's primary type:

In [None]:
# Set theme
sns.set_style('whitegrid')
 
# Violin plot
sns.violinplot(x='Type 1', y='Attack', data=df)

As you can see, Dragon types tend to have higher Attack stats than Ghost types, but they also have greater variance.

Now, Pokémon fans might find something quite jarring about that plot: The colors are nonsensical. Why is the Grass type colored pink or the Water type colored orange? We must fix this!

### Color palettes

Fortunately, Seaborn allows us to set custom color palettes. We can simply create an ordered Python list of color hex values.

Let's use Bulbapedia to help us create a new color palette:

In [None]:
pkmn_type_colors = ['#78C850',  # Grass
                    '#F08030',  # Fire
                    '#6890F0',  # Water
                    '#A8B820',  # Bug
                    '#A8A878',  # Normal
                    '#A040A0',  # Poison
                    '#F8D030',  # Electric
                    '#E0C068',  # Ground
                    '#EE99AC',  # Fairy
                    '#C03028',  # Fighting
                    '#F85888',  # Psychic
                    '#B8A038',  # Rock
                    '#705898',  # Ghost
                    '#98D8D8',  # Ice
                    '#7038F8',  # Dragon
                   ]

Wonderful. Now we can simply use the palette= argument to recolor our chart.

In [None]:
# Violin plot with Pokemon color palette
sns.violinplot(x='Type 1', y='Attack', data=df, 
               palette=pkmn_type_colors) # Set color palette

Much better!

Violin plots are great for visualizing distributions. However, since we only have 151 Pokémon in our dataset, we may want to simply display each point.

That's where the swarm plot comes in. This visualization will show each point, while "stacking" those with similar values:

In [None]:
# Swarm plot with Pokemon color palette
sns.swarmplot(x='Type 1', y='Attack', data=df, 
              palette=pkmn_type_colors)

That's handy, but can't we combine our swarm plot and the violin plot? After all, they display similar information, right?

### Overlaying plots

The answer is yes.

It's pretty straightforward to overlay plots using Seaborn, and it works the same way as with Matplotlib. Here's what we'll do:

1. First, we'll make our figure larger using Matplotlib.
2. Then, we'll plot the violin plot. However, we'll set inner=None to remove the bars inside the violins.
3. Next, we'll plot the swarm plot. This time, we'll make the points black so they pop out more.
4. Finally, we'll set a title using Matplotlib.

In [None]:
# Set figure size with matplotlib
plt.figure(figsize=(12,9))
 
# Create plot
sns.violinplot(x='Type 1',
               y='Attack', 
               data=df, 
               inner=None, # Remove the bars inside the violins
               palette=pkmn_type_colors)
 
sns.swarmplot(x='Type 1', 
              y='Attack', 
              data=df, 
              color='k', # Make points black
              alpha=0.7) # and slightly transparent
 
# Set title with matplotlib
plt.title('Attack by Type')

Awesome, now we have a pretty chart that tells us how Attack values are distributed across different Pokémon types. But what if we want to see all of the other stats as well?

### Exercise 2

When we explore the pattern of data, we often use pair plots. This is very useful for exploring correlations between multidimensional data, when you'd like to plot all pairs of values against each other.

Create a pairplot using the Pokémon dataset (Hint: you should drop columns that are not useful for plotting).

In [None]:
## your code here
sns.pairplot(data = stats_df)

### Putting it all together

Well, we could certainly repeat that chart for each stat. But we can also combine the information into one chart... we just have to do some data wrangling with Pandas beforehand.

First, here's a reminder of our data format:

In [None]:
stats_df.head()

As you can see, all of our stats are in separate columns. Instead, we want to "melt" them into one column.

To do so, we'll use Pandas's melt() function. It takes 3 arguments:

* First, the DataFrame to melt.
* Second, ID variables to keep (Pandas will melt all of the other ones).
* Finally, a name for the new, melted variable.

Here's the output:

In [None]:
# Melt DataFrame
melted_df = pd.melt(stats_df, 
                    id_vars=["Name", "Type 1", "Type 2"], # Variables to keep
                    var_name="Stat") # Name of melted variable
melted_df.head()

All 6 of the stat columns have been "melted" into one, and the new Stat column indicates the original stat (HP, Attack, Defense, Sp. Attack, Sp. Defense, or Speed). For example, it's hard to see here, but Bulbasaur now has 6 rows of data.

In fact, if you print the shape of these two DataFrames...

In [None]:
print( stats_df.shape )
print( melted_df.shape )

...you'll find that melted_df has 6 times the number of rows as stats_df.
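
The same row arithmetic is easy to check on a toy table. In this sketch the `wide` DataFrame is a made-up example with two rows and three stat columns, so the melted result should have 2 × 3 = 6 rows.

```python
import pandas as pd

# Made-up wide table: one row per Pokémon, one column per stat.
wide = pd.DataFrame({
    "Name":   ["A", "B"],
    "HP":     [45, 60],
    "Attack": [49, 62],
    "Speed":  [45, 60],
})

# Keep "Name" as the identifier and stack the three stat columns.
long = pd.melt(wide, id_vars=["Name"], var_name="Stat")

print(wide.shape)
print(long.shape)
print(long.columns.tolist())
```

Each original row appears once per melted column, and the numbers land in a column that pandas calls `value` by default.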

Now we can make a swarm plot with melted_df.

* But this time, we're going to set x='Stat' and y='value' so our swarms are separated by stat.
* Then, we'll set hue='Type 1' to color our points by the Pokémon type.

In [None]:
# Swarmplot with melted_df
sns.swarmplot(x='Stat', y='value', data=melted_df, hue='Type 1')

Finally, let's make a few final tweaks for a more readable chart:

1. Enlarge the plot.
2. Separate points by hue using the argument dodge=True.
3. Use our custom Pokémon color palette.
4. Adjust the y-axis limits to start at 0.
5. Place the legend to the right.

In [None]:
# 1. Enlarge the plot
plt.figure(figsize=(20,12))
 
sns.swarmplot(x='Stat', 
              y='value', 
              data=melted_df, 
              hue='Type 1', 
              dodge=True, # 2. Separate points by hue
              palette=pkmn_type_colors) # 3. Use Pokemon palette
 
# 4. Adjust the y-axis
plt.ylim(0, 260)
 
# 5. Place legend to the right
plt.legend(bbox_to_anchor=(1, 1), loc=2)

### Heatmap

Heatmaps help you visualize matrix-like data.

In [None]:
# Calculate correlations
corr = stats_df.select_dtypes(include='number').corr()  # numeric columns only
 
# Heatmap
sns.heatmap(corr)
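
To make the "matrix-like data" concrete, this sketch builds a small correlation matrix from made-up numbers. The `toy` DataFrame and its values are assumptions for illustration; the real heatmap uses the Pokémon stats.

```python
import pandas as pd

# Made-up stats: Defense rises with Attack, Speed falls exactly as Attack rises.
toy = pd.DataFrame({
    "Attack":  [10, 20, 30, 40],
    "Defense": [12, 18, 33, 41],
    "Speed":   [40, 30, 20, 10],
})

# Pairwise Pearson correlations: a square, symmetric matrix with 1s on the diagonal.
corr = toy.corr()

print(corr.round(2))
```

A heatmap simply colors each cell of this matrix by its value, which makes strong positive and negative correlations easy to spot.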

### Histogram

Histograms allow you to plot the distributions of numeric variables.

In [None]:
sns.histplot(df.Attack, kde=True)  # distplot() is deprecated since Seaborn 0.11

### Bar Plot

Bar plots help you visualize the distributions of categorical variables.

In [None]:
# Count Plot (a.k.a. Bar Plot)
sns.countplot(x='Type 1', data=df, palette=pkmn_type_colors)
 
# Rotate x-labels
plt.xticks(rotation=-45)
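
The bar heights in a count plot are simply the number of rows per category. As a sketch, the same numbers can be obtained with pandas; the `types` Series below is a made-up sample, not the real Type 1 column.

```python
import pandas as pd

# Made-up primary types for six hypothetical Pokémon.
types = pd.Series(
    ["Grass", "Fire", "Water", "Water", "Grass", "Water"],
    name="Type 1",
)

# value_counts() tallies each category, most frequent first.
counts = types.value_counts()

print(counts)
```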

### Factor Plot

Factor plots make it easy to separate plots by categorical classes. (Seaborn 0.9 renamed factorplot() to catplot(), which is what we use here.)

In [None]:
# Factor Plot (catplot() in current Seaborn)
g = sns.catplot(x='Type 1', 
                y='Attack', 
                data=df, 
                hue='Stage',  # Color by stage
                col='Stage',  # Separate by stage
                kind='swarm') # Swarmplot
 
# Rotate x-axis labels
g.set_xticklabels(rotation=-45)


### Density Plot

Density plots display the joint distribution of two variables.

Tip: Consider overlaying this with a scatter plot.

In [None]:
# Density Plot
sns.kdeplot(x=df.Attack, y=df.Defense)

In [None]:
# your code here

## Plotly

Another helpful module is Plotly's Chart Studio, a modern visualization toolbox.

In [None]:
!pip install --upgrade chart-studio

Chart Studio provides a web-service for hosting graphs. However, it does require signup.

In [None]:
import chart_studio
# Replace the placeholders with your own Chart Studio username and API key.
chart_studio.tools.set_credentials_file(username='your-username', api_key='your-api-key')
chart_studio.tools.set_config_file(world_readable=True, sharing='public')

After setting up the credentials, you can start plotting. 

In [None]:
import plotly.graph_objects as go
fig = go.Figure(
    data=[go.Bar(y=[2, 1, 3])],
    layout_title_text="A Figure Displayed with the 'colab' Renderer"
)
fig.show(renderer="colab")

In [None]:
# x and y given as DataFrame columns
import plotly.express as px
df = px.data.iris() # iris is a pandas DataFrame
fig = px.scatter(df, x="sepal_width", y="sepal_length")
fig.show()

Scatter plots with variable-sized circular markers are often known as bubble charts. Note that color and size data are added to hover information. You can add other columns to hover data with the hover_data argument of px.scatter.

In [None]:
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species",
                 size='petal_length', hover_data=['petal_width'])
fig.show()

The symbol argument can be mapped to a column as well. A wide variety of symbols are available.

In [None]:
import plotly.express as px
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species", symbol="species")
fig.show()

Scatter plots support faceting.

In [None]:
import plotly.express as px
df = px.data.tips()
fig = px.scatter(df, x="total_bill", y="tip", color="smoker", facet_col="sex", facet_row="time")
fig.show()

Line graph

In [None]:
import plotly.express as px
df = px.data.gapminder().query("continent == 'Oceania'")
fig = px.line(df, x='year', y='lifeExp', color='country')
fig.show()

The markers argument can be set to True to show markers on lines.

In [None]:
import plotly.express as px
df = px.data.gapminder().query("continent == 'Oceania'")
fig = px.line(df, x='year', y='lifeExp', color='country', markers=True)
fig.show()

The symbol argument can be used to map a data field to the marker symbol. A wide variety of symbols are available.

In [None]:
import plotly.express as px
df = px.data.gapminder().query("continent == 'Oceania'")
fig = px.line(df, x='year', y='lifeExp', color='country', symbol="country")
fig.show()

Line and Scatter Plots

In [None]:
import plotly.graph_objects as go

# Create random data with numpy
import numpy as np
np.random.seed(1)

N = 100
random_x = np.linspace(0, 1, N)
random_y0 = np.random.randn(N) + 5
random_y1 = np.random.randn(N)
random_y2 = np.random.randn(N) - 5

fig = go.Figure()

# Add traces
fig.add_trace(go.Scatter(x=random_x, y=random_y0,
                    mode='markers',
                    name='markers'))
fig.add_trace(go.Scatter(x=random_x, y=random_y1,
                    mode='lines+markers',
                    name='lines+markers'))
fig.add_trace(go.Scatter(x=random_x, y=random_y2,
                    mode='lines',
                    name='lines'))

fig.show()

Scatter with a Color Dimension

In [None]:
import plotly.graph_objects as go
import numpy as np

fig = go.Figure(data=go.Scatter(
    y = np.random.randn(500),
    mode='markers',
    marker=dict(
        size=16,
        color=np.random.randn(500), #set color equal to a variable
        colorscale='Viridis', # one of plotly colorscales
        showscale=True
    )
))

fig.show()