# How to Create Effective and Engaging Data Visualizations with Jupyter Notebooks
Data visualization is the art and science of transforming data into visual forms that can communicate insights, patterns, trends, and relationships. Data visualization can help us understand complex data sets, tell compelling stories with data, and persuade our audience with evidence.

But how can we create effective and engaging data visualizations that stand out from the crowd? In this blog post, I will share some data visualization best practices and show you how to apply them using Jupyter notebooks, a powerful tool for interactive data analysis and presentation.

### What is Jupyter Notebook?
Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. You can use it for data cleaning, transformation, exploration, modeling, visualization, machine learning, and much more.

Jupyter Notebook supports multiple programming languages, such as Python, R, Julia, and Scala. You can also use libraries and frameworks such as pandas, NumPy, Matplotlib, seaborn, Plotly, scikit-learn, TensorFlow, and PyTorch to extend your data analysis and visualization capabilities.

One of the main advantages of Jupyter Notebook is that it lets you combine code, output, and explanation in a single document, making your work more transparent, reproducible, and shareable. You can also export your notebook as HTML, PDF, or slides, or publish it online using platforms like Binder, Colab, or GitHub.

### Data Visualization Best Practices
Let’s review some data visualization best practices that can help us create better charts and graphs.

But before we start working with data, let's import some Python libraries.
We will use pandas, NumPy, Matplotlib, and seaborn for this example. You can install these libraries using pip or conda, or use a pre-installed environment such as Anaconda.

To import the libraries, we can use the following code:

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Import data
df = pd.read_csv("fortune500.csv")
df.head()  # Show the first 5 rows of the DataFrame

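|   | Year | Rank | Company | Revenue (in millions) | Profit (in millions) |
|---|------|------|---------|-----------------------|----------------------|
| 0 | 1955 | 1 | General Motors | 9823.5 | 806.0 |
| 1 | 1955 | 2 | Exxon Mobil | 5661.4 | 584.8 |
| 2 | 1955 | 3 | U.S. Steel | 3250.4 | 195.4 |
| 3 | 1955 | 4 | General Electric | 2959.1 | 212.6 |
| 4 | 1955 | 5 | Esmark | 2510.8 | 19.1 |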


We can see that the data frame has five columns: Year, Rank, Company, Revenue, and Profit. The data frame has 25,500 rows, one for each company listed in the Fortune 500 from 1955 to 2005.
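
A quick way to confirm a description like this is to look at the shape and column names directly. The sketch below is only an illustration: it rebuilds the five rows shown by `df.head()` into a stand-in frame (the name `sample` is an assumption) so that it runs without the CSV file. With the real data you would call the same attributes on `df`.

```python
import pandas as pd

# Stand-in for the real data: the five rows shown by df.head() above.
sample = pd.DataFrame({
    "Year": [1955] * 5,
    "Rank": [1, 2, 3, 4, 5],
    "Company": ["General Motors", "Exxon Mobil", "U.S. Steel",
                "General Electric", "Esmark"],
    "Revenue (in millions)": [9823.5, 5661.4, 3250.4, 2959.1, 2510.8],
    "Profit (in millions)": [806.0, 584.8, 195.4, 212.6, 19.1],
})

# shape is a (rows, columns) tuple; columns holds the header names.
n_rows, n_cols = sample.shape
column_names = list(sample.columns)
print(n_rows, n_cols)
print(column_names)

# Size we would expect for the full data set:
# 500 companies per year, for every year from 1955 to 2005 inclusive.
expected_rows = 500 * (2005 - 1955 + 1)
print(expected_rows)
```

On the full data set, comparing `df.shape[0]` with `expected_rows` is a cheap sanity check that no year is missing or duplicated.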

### Exploring and cleaning the data
The next step is to explore and clean the data, to make sure that it is accurate, complete, and consistent. We can use various pandas methods and attributes to check the basic information, summary statistics, and missing values of the data frame.
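
The missing-value check mentioned above could look something like the following. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, not taken from the Fortune 500 file). It also shows a common pitfall: `isna()` only counts real missing values, so a text placeholder such as `"N.A."` goes unnoticed unless you look for it explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one truly missing revenue value
# and one text placeholder in the profit column.
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Revenue (in millions)": [100.0, np.nan, 300.0],
    "Profit (in millions)": ["10.5", "N.A.", "7"],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The placeholder string is not counted as missing, so count it separately.
placeholder_count = (toy["Profit (in millions)"] == "N.A.").sum()
print(placeholder_count)
```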

In [4]:
# Check for basic information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25500 entries, 0 to 25499
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Year                   25500 non-null  int64  
 1   Rank                   25500 non-null  int64  
 2   Company                25500 non-null  object 
 3   Revenue (in millions)  25500 non-null  float64
 4   Profit (in millions)   25500 non-null  object 
dtypes: float64(1), int64(2), object(2)
memory usage: 996.2+ KB


We can see that the data frame has 25,500 non-null entries, and that the data types are mostly numeric, except for the Company and Profit columns, which are objects. This is strange, because we would expect the Profit column to be numeric as well. Let’s investigate this further.
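
One way to find out which values keep a column from being numeric is to attempt the conversion with `pd.to_numeric(..., errors="coerce")` and inspect whatever fails. The sketch below uses a short hypothetical series rather than the real column; on the actual data you would pass `df['Profit (in millions)']` instead.

```python
import pandas as pd

# Hypothetical stand-in for the Profit column as read from the CSV (all text).
profit = pd.Series(["806.0", "584.8", "N.A.", "-12.3"],
                   name="Profit (in millions)")

# Values that cannot be parsed as numbers become NaN instead of raising an error.
as_number = pd.to_numeric(profit, errors="coerce")

# Keep only the original entries that failed to convert.
non_numeric = profit[as_number.isna()]
print(non_numeric.value_counts())
```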

In [5]:
# Check summary statistics of the dataframe
df.describe()

|   | Year | Rank | Revenue (in millions) |
|---|------|------|-----------------------|
| count | 25500.0 | 25500.0 | 25500.0 |
| mean | 1980.0 | 250.499765 | 4273.329635 |
| std | 14.71989 | 144.339963 | 11351.884979 |
| min | 1955.0 | 1.0 | 49.7 |
| 25% | 1967.0 | 125.75 | 362.3 |
| 50% | 1980.0 | 250.5 | 1019.0 |
| 75% | 1993.0 | 375.25 | 3871.0 |
| max | 2005.0 | 500.0 | 288189.0 |


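Note that `describe()` leaves out the Profit column, because pandas read it as text rather than as numbers. Some rows hold the string 'N.A.' instead of a value, so we drop those rows and then convert the column to a numeric type.

In [16]:
# Drop the rows whose profit is the placeholder string 'N.A.'
df.drop(df[df['Profit (in millions)'] == 'N.A.'].index, inplace=True)

In [17]:
# Convert Profit from text to numbers; the final cast to int drops any fractional part
df['Profit (in millions)'] = df['Profit (in millions)'].astype(float).astype(int)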

In [18]:
df.describe()

|   | Year | Rank | Revenue (in millions) | Profit (in millions) |
|---|------|------|-----------------------|----------------------|
| count | 25131.0 | 25131.0 | 25131.0 | 25131.0 |
| mean | 1979.926784 | 249.744777 | 4304.96178 | 207.578688 |
| std | 14.764827 | 144.443 | 11396.723687 | 1173.703728 |
| min | 1955.0 | 1.0 | 49.7 | -98696.0 |
| 25% | 1967.0 | 124.0 | 357.9 | 8.0 |
| 50% | 1980.0 | 250.0 | 1017.6 | 35.0 |
| 75% | 1993.0 | 375.0 | 3916.1 | 150.0 |
| max | 2005.0 | 500.0 | 288189.0 | 25330.0 |


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25131 entries, 0 to 25499
Data columns (total 5 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Year                   25131 non-null  int64  
 1   Rank                   25131 non-null  int64  
 2   Company                25131 non-null  object 
 3   Revenue (in millions)  25131 non-null  float64
 4   Profit (in millions)   25131 non-null  int32  
dtypes: float64(1), int32(1), int64(2), object(1)
memory usage: 1.1+ MB
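
The conversion above casts Profit to integers, which discards decimals, and it leaves gaps in the index where rows were dropped. If you would rather keep the fractional values, a possible alternative is sketched below. It is not the approach used in this post, and the toy frame is hypothetical. It relies on `pd.to_numeric` with `errors="coerce"`, followed by `dropna` and `reset_index`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw data.
toy = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "Profit (in millions)": ["806.0", "N.A.", "584.8"],
})

cleaned = toy.copy()

# Parse the text as floats; unparseable entries such as "N.A." become NaN.
cleaned["Profit (in millions)"] = pd.to_numeric(
    cleaned["Profit (in millions)"], errors="coerce"
)

# Drop the rows that failed to parse and renumber the index from 0.
cleaned = cleaned.dropna(subset=["Profit (in millions)"]).reset_index(drop=True)

print(cleaned.dtypes)
print(cleaned)
```

Working on a copy instead of using `inplace=True` also keeps the raw data available if you need to re-run the cleaning step.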


#### 1. Know your audience and purpose
Knowing who will read your visualization, and what you want them to take away from it, helps you tailor the chart to their needs and expectations and convey your message clearly. For example, for a business audience you might use a bar chart to show the sales performance of different products, with a catchy title and a call to action to persuade them to buy more. For a scientific audience you might use a scatter plot to show the correlation between two variables, with a descriptive title and a reference that supports your hypothesis.
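
To make the contrast concrete, here is a rough sketch that draws both kinds of chart from the same five rows shown by `df.head()` earlier. The variable names, titles, and figure size are illustrative choices, not recommendations from any style guide. The bar chart carries a headline-style title for a general audience, and the scatter plot carries a neutral, descriptive one.

```python
import matplotlib.pyplot as plt
import pandas as pd

# The five 1955 rows shown by df.head() earlier in the post.
top5_1955 = pd.DataFrame({
    "Company": ["General Motors", "Exxon Mobil", "U.S. Steel",
                "General Electric", "Esmark"],
    "Revenue (in millions)": [9823.5, 5661.4, 3250.4, 2959.1, 2510.8],
    "Profit (in millions)": [806.0, 584.8, 195.4, 212.6, 19.1],
})

fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(12, 4))

# Business-style view: a ranked bar chart with a title that states the takeaway.
ax_bar.barh(top5_1955["Company"], top5_1955["Revenue (in millions)"])
ax_bar.invert_yaxis()  # put the first (largest) company at the top
ax_bar.set_title("General Motors led the 1955 list by revenue")
ax_bar.set_xlabel("Revenue (in millions)")

# Scientific-style view: a scatter plot with a neutral, descriptive title.
ax_scatter.scatter(top5_1955["Revenue (in millions)"],
                   top5_1955["Profit (in millions)"])
ax_scatter.set_title("Revenue vs. profit, top five companies, 1955")
ax_scatter.set_xlabel("Revenue (in millions)")
ax_scatter.set_ylabel("Profit (in millions)")

fig.tight_layout()
```

In a notebook the figure renders inline below the cell. The same data supports both charts, and only the framing changes with the audience.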