# Data Visualization in Python

Data visualization is the discipline of trying to understand data by placing it in a visual context so that patterns, trends and correlations that might not otherwise be detected can be exposed.

To extract the required information from the visuals we create, it is essential that we choose a representation that suits the type of data and the questions we are trying to answer. Below we go through a set of the most widely used representations and how to use them effectively.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [None]:
iris = pd.read_csv('iris.csv', names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])
print(iris.head())

In [None]:
wine_reviews = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
wine_reviews.head()

## Bar chart

A bar chart is used when we want to compare metric values across different subgroups of the data.

To plot a bar chart we can use the plot.bar() method, but first we need to prepare the data. We count the occurrences of each score using the value_counts() method and then order the result by score, from lowest to highest, using the sort_index() method.

In [None]:
wine_reviews['points'].value_counts().sort_index().plot.bar()
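
The difference between ordering by score and ordering by frequency is easy to mix up, so here is a minimal sketch on a small, hypothetical Series (the `points` values below are made up and only stand in for the wine-review column):

```python
import pandas as pd

# Hypothetical scores standing in for wine_reviews['points']
points = pd.Series([88, 90, 88, 87, 90, 88])

# value_counts() returns a Series: index = score, value = number of occurrences
counts = points.value_counts()

# sort_index() orders by the score itself (the index)
by_score = counts.sort_index()

# sort_values() orders by how often each score occurs
by_count = counts.sort_values()

print(by_score)
print(by_count)
```

For a bar chart of scores, sort_index() is usually the better fit because it keeps the x-axis in numeric order.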

It’s also really simple to make a horizontal bar-chart using the plot.barh() method.

In [None]:
wine_reviews['points'].value_counts().sort_index().plot.barh()

We can also plot data other than the number of occurrences.

In [None]:
wine_reviews.groupby("country").price.mean().sort_values(ascending=False)[:5].plot.bar()

In the example above we grouped the data by country and then took the mean of the wine prices, ordered it, and plotted the 5 countries with the highest average wine price.
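
The same group, average, sort and slice steps can be traced on a toy table. This is only a sketch: the `reviews` DataFrame, its country labels and its prices are invented for illustration. It also shows that a missing price is skipped when the mean is computed.

```python
import pandas as pd

# Hypothetical reviews; country names and prices are made up
reviews = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "price": [10.0, 30.0, 50.0, None, 5.0],
})

# Mean price per country (missing prices are ignored by default),
# ordered from most to least expensive
mean_price = (
    reviews.groupby("country")["price"]
    .mean()
    .sort_values(ascending=False)
)

# Keep only the two most expensive countries
top_two = mean_price.head(2)
print(top_two)
```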

## Line chart

A line chart is used for the representation of continuous data points. This visual can be effectively utilized when we want to understand the trend across time.

In Matplotlib we can create a line chart by calling the plot method. We can also plot multiple columns in one graph, by looping through the columns we want and plotting each column on the same axis.

In [None]:
# get columns to plot
columns = iris.columns.drop(['class'])
# create x data
x_data = range(0, iris.shape[0])
# create figure and axis
fig, ax = plt.subplots()
# plot each column
for column in columns:
    ax.plot(x_data, iris[column], label=column)
# set title and legend
ax.set_title('Iris Dataset')
ax.legend()

## Histogram

Histograms are used to observe the distribution of a single variable.

In Matplotlib we can create a histogram using the hist method. If we pass it discrete numeric data such as the points column from the wine-review dataset, it groups the values into bins (10 equal-width bins by default) and counts how many values fall into each bin.

In [None]:
# create figure and axis
fig, ax = plt.subplots()
# plot histogram
ax.hist(wine_reviews['points'])
# set title and labels
ax.set_title('Wine Review Scores')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
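
With integer scores, the default bins can merge neighbouring values. One possible way to give every score its own bar is to pass explicit bin edges at the half-integers. The sketch below assumes a small, made-up `scores` array rather than the real dataset.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical integer scores
scores = np.array([80, 81, 81, 82, 82, 82, 83, 85])

# Edges at 79.5, 80.5, ..., 85.5 so each integer falls in its own bin
edges = np.arange(scores.min() - 0.5, scores.max() + 1.5, 1)

fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(scores, bins=edges)
ax.set_title('Scores with one bin per value')
ax.set_xlabel('Points')
ax.set_ylabel('Frequency')
```

Here `counts` holds the number of observations per score, including a zero for the score 84, which does not occur in the sample.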

## Scatter plot

Scatter plots can be used to identify relationships between two variables. They are particularly useful when the dependent variable can take multiple values for the same value of the independent variable.

To create a scatter plot in Matplotlib we can use the scatter method. We will also create a figure and an axis using plt.subplots so we can give our plot a title and labels.

In [None]:
# create a figure and axis
fig, ax = plt.subplots()

# scatter the sepal_length against the sepal_width
ax.scatter(iris['sepal_length'], iris['sepal_width'])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')

We can give the graph more meaning by coloring each data point by its class. This can be done by creating a dictionary that maps each class to a color, then plotting the points one at a time in a for-loop and passing the respective color.

In [None]:
# create color dictionary
colors = {'Iris-setosa':'r', 'Iris-versicolor':'g', 'Iris-virginica':'b'}
# create a figure and axis
fig, ax = plt.subplots()
# plot each data-point
for i in range(len(iris['sepal_length'])):
    ax.scatter(iris['sepal_length'][i], iris['sepal_width'][i], color=colors[iris['class'][i]])
# set a title and labels
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
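
Plotting point by point works, but it issues one scatter call per row and gives no legend. A possible alternative, sketched here on a tiny hypothetical `sample` DataFrame that reuses the iris column names, is to plot one group per class:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Small hypothetical sample with the same column names as the iris DataFrame
sample = pd.DataFrame({
    'sepal_length': [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    'sepal_width': [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-versicolor',
              'Iris-versicolor', 'Iris-virginica', 'Iris-virginica'],
})
colors = {'Iris-setosa': 'r', 'Iris-versicolor': 'g', 'Iris-virginica': 'b'}

fig, ax = plt.subplots()
# One scatter call per class, so each class gets a single legend entry
for name, group in sample.groupby('class'):
    ax.scatter(group['sepal_length'], group['sepal_width'],
               color=colors[name], label=name)
ax.set_title('Iris Dataset')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
ax.legend()
```

Swapping `sample` for the full `iris` DataFrame should give the same picture as the loop above, with a legend added.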

## Scatter matrix

A scatter matrix plots a grid of pairwise relationships in a dataset.

Each cell of the grid plots one feature against another. The diagonal is filled with histograms of the individual features, and the remaining cells are scatter plots.

In [None]:
from pandas.plotting import scatter_matrix

fig, ax = plt.subplots(figsize=(12,12))
scatter_matrix(iris, alpha=1, ax=ax)

## Box plot

A box plot is used to show the shape of the distribution, its central value, and its variability.

In [None]:
fig = plt.figure(figsize=(5,7))
# Creating plot
plt.boxplot(wine_reviews['points'])
# show plot
plt.show()
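
To read a box plot it helps to know which numbers it draws. The sketch below computes them by hand for a small, made-up `values` array, assuming Matplotlib's default whisker setting of 1.5 times the interquartile range:

```python
import numpy as np

# Hypothetical scores, already sorted for readability
values = np.array([80, 84, 86, 87, 88, 88, 89, 90, 92, 100])

# The box spans the first to third quartile; the line inside is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these limits are drawn individually as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3, iqr)
print(lower_limit, upper_limit, outliers)
```

The whiskers themselves end at the most extreme data points that still lie inside the two limits, which here are 84 and 92.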

## Bubble chart

A bubble chart is a scatter plot in which the size of each marker encodes a third variable, so it can show relationships among three variables.

In [None]:
#Creating the dataset
np.random.seed(42)
N = 100
x = np.random.normal(170, 20, N)
y = x + np.random.normal(5, 25, N)
colors = np.random.rand(N)
area = (25 * np.random.rand(N))**2
df = pd.DataFrame({
    'X': x,
    'Y': y,
    'Colors': colors,
    "bubble_size":area})
#Creating the bubble chart
plt.scatter('X', 'Y', s='bubble_size',alpha=0.5, data=df)
#Adding the aesthetics
plt.title('Chart title')
plt.xlabel('X axis title')
plt.ylabel('Y axis title') 
#Show the plot
plt.show()
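
The cell above builds a `Colors` column but does not pass it to the plot. As a sketch of one way to use it, the marker color can encode a further variable through the `c` argument, with a colorbar as the key. The data here is regenerated randomly and is purely illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical random data with the same column names as above
rng = np.random.default_rng(42)
n = 50
df = pd.DataFrame({
    'X': rng.normal(170, 20, n),
    'Y': rng.normal(175, 30, n),
    'Colors': rng.random(n),
    'bubble_size': (25 * rng.random(n)) ** 2,
})

fig, ax = plt.subplots()
# s sets the marker area, c maps the Colors column onto the colormap
bubbles = ax.scatter('X', 'Y', s='bubble_size', c='Colors', alpha=0.5, data=df)
fig.colorbar(bubbles, ax=ax, label='Colors')
ax.set_title('Chart title')
ax.set_xlabel('X axis title')
ax.set_ylabel('Y axis title')
```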

## Pie chart

Pie charts can be used to identify proportions of the different components in a given whole.

In [None]:
#Creating the dataset
cars = ['AUDI', 'BMW', 'NISSAN',
        'TESLA', 'HYUNDAI', 'HONDA']
data = [20, 15, 15, 14, 16, 20]
#Creating the pie chart (one color per slice, so no color repeats on neighbouring slices)
plt.pie(data, labels=cars, colors=['#F0F8FF', '#E6E6FA', '#B0E0E6', '#7B68EE', '#483D8B', '#6A5ACD'])
#Adding the aesthetics
plt.title('Chart title')
#Show the plot
plt.show()
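
Since a pie chart is about proportions, it can help to print the percentage on each slice. A minimal sketch using the `autopct` argument of `pie`, with the same car data as above:

```python
import matplotlib.pyplot as plt

cars = ['AUDI', 'BMW', 'NISSAN', 'TESLA', 'HYUNDAI', 'HONDA']
data = [20, 15, 15, 14, 16, 20]

fig, ax = plt.subplots()
# autopct formats each slice's share of the total as a percentage label
wedges, labels, percentages = ax.pie(data, labels=cars, autopct='%1.1f%%')
ax.set_title('Chart title')
```

When `autopct` is set, `pie` returns the percentage text objects as a third value, which can then be restyled if needed.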

## Assignment

In [None]:
df=pd.read_csv("StudentsPerformance.csv")
df.head()

### Task 1. (1 point)

Prepare a visualization of the percentage distribution of gender and ethnicity.

In [None]:
# Your answer here

### Task 2. (1 point)

Prepare a visualization of the parental level of education.

In [None]:
# Your answer here

### Task 3. (1 point)

Check for a correlation between test preparation and the different test scores (math, reading and writing).

In [None]:
# Your answer here

### Task 4. (1 point)

Check for correlations between the different test scores (math, reading and writing).

In [None]:
# Your answer here

#### The Amazon rainforest fires in Brazil

In [None]:
df=pd.read_csv("amazon.csv", encoding='latin1')
df.head()

### Task 5. (1 point)

Show how the number of fires changes over time.


In [None]:
# Your answer here

### Task 6. (2 points)

Show the distribution of the number of fires in the hottest months.

In [None]:
# Your answer here

### Task 7. (1 point)

Visualize the average number of fires in Brazil per month.

In [None]:
# Your answer here

### Task 8. (1 point)

Check for a correlation between state, month and the number of fires.

In [None]:
# Your answer here

### Task 9. (1 point)

Visualize the relationship between state, year and the number of fires.

In [None]:
# Your answer here