# Introduction to Data Visualization
Many excellent tools for data analysis and visualization are now available, and many of them are designed for the Python ecosystem. Today we'll meet two of them, `pandas` and `plotly`, and use them to visualize the Gapminder dataset.

To use an external library in Python, we need to `import` it. It is a good idea to import all of your libraries at the beginning of your notebook:



In [13]:
#this line imports pandas and assigns it a nickname
import pandas as pd

#this line imports plotly express and assigns it a nickname
import plotly.express as px

#this line imports plotly's io module, which controls how figures are displayed
import plotly.io as pio

## `pandas`

`pandas` is the standard Python library for reading, writing, organizing, and manipulating tabular data. You are probably already familiar with tabular data in the form of a spreadsheet, such as Microsoft Excel or Google Sheets. `pandas` has a number of tools to read different spreadsheet formats into Python.

In this case, we are starting with a `.csv` file -- short for comma-separated values -- a standard format for saving spreadsheets. When we read a file into Python with `pd.read_csv`, we need to save the result to a variable. The object we are left with is called a `DataFrame`.
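
As a minimal sketch of what `pd.read_csv` gives back, the example below reads a few Gapminder-style rows from an in-memory string instead of a file. The names `csv_text` and `sample` are made up for illustration, and the rows are a trimmed copy of values that appear in this notebook's dataset.

```python
import io
import pandas as pd

# A tiny CSV written inline so the example runs without a network connection.
csv_text = """Year,Country,fertility,life
1964,Afghanistan,7.671,33.639
1965,Afghanistan,7.671,34.152
1966,Afghanistan,7.671,34.662
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)   # the kind of object read_csv returns
print(sample.shape)            # (number of rows, number of columns)
print(list(sample.columns))    # column names taken from the first line
```

The first line of the CSV becomes the column names, and every later line becomes one row of the `DataFrame`.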


In [15]:
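#pio.renderers holds the available renderers (it only displays if it is the last line of a cell)
pio.renderers
#this line tells plotly to draw figures using the JupyterLab renderer
pio.renderers.default = "jupyterlab"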

In [7]:
#The line below reads our file -- we pass the location of our file as an argument to read_csv --
#and saves it to a variable I called data
data = pd.read_csv('https://raw.githubusercontent.com/SkyIslandsMath/semester-2/master/data/gapminder.csv')

In [8]:
# we can now display our DataFrame by typing its name
data

| Unnamed: 0 | Year | Country | fertility | life | population | child_mortality | gdp | region |
|---|---|---|---|---|---|---|---|---|
| 0 | 1964 | Afghanistan | 7.671 | 33.639 | 10474903.0 | 339.7 | 1182.0 | South Asia |
| 1 | 1965 | Afghanistan | 7.671 | 34.152 | 10697983.0 | 334.1 | 1182.0 | South Asia |
| 2 | 1966 | Afghanistan | 7.671 | 34.662 | 10927724.0 | 328.7 | 1168.0 | South Asia |
| 3 | 1967 | Afghanistan | 7.671 | 35.170 | 11163656.0 | 323.3 | 1173.0 | South Asia |
| 4 | 1968 | Afghanistan | 7.671 | 35.674 | 11411022.0 | 318.1 | 1187.0 | South Asia |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10106 | 2002 | Aland |  | 81.800 | 26257.0 |  |  | Europe & Central Asia |
| 10107 | 2003 | Aland |  | 80.630 | 26347.0 |  |  | Europe & Central Asia |
| 10108 | 2004 | Aland |  | 79.880 | 26530.0 |  |  | Europe & Central Asia |
| 10109 | 2005 | Aland |  | 80.000 | 26766.0 |  |  | Europe & Central Asia |


If you scroll through the data, you may see some blank spots. We call this missing data, and it can cause problems for our analysis and visualization. 
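
Rather than scrolling to look for gaps, one option is to ask pandas to count them. The sketch below uses a small made-up DataFrame called `toy` as a stand-in for the Gapminder data; the same two method calls should work on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Gapminder DataFrame, with two missing values.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Aland", "Aland"],
    "life": [33.639, 81.8, 80.63],
    "fertility": [7.671, np.nan, np.nan],
})

# isna() marks each missing cell as True; sum() counts the True values per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count of zero has no missing values, so the counts show where the gaps are concentrated.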

There are many approaches to dealing with missing data, but today we will take the quickest and easiest one, even if it is not the best: we will simply drop any rows that are missing data.

In [16]:
#data.dropna() drops any rows in our dataframe that are missing data, we then assign
#this dataframe back to our variable called data
data = data.dropna()
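
To see how much data this approach throws away, it can help to compare row counts before and after. The sketch below again uses a made-up DataFrame named `toy`. The `subset=` variant is shown only as one gentler alternative, not as what this notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second row is missing gdp, the third is missing life and gdp.
toy = pd.DataFrame({
    "Country": ["Afghanistan", "Aland", "Aland"],
    "life": [33.639, 81.8, np.nan],
    "gdp": [1182.0, np.nan, np.nan],
})

# Default behavior: drop a row if ANY of its columns is missing.
all_complete = toy.dropna()

# Alternative: only drop rows that are missing a column we plan to use.
life_complete = toy.dropna(subset=["life"])

# dropna() returns a new DataFrame, so toy itself is unchanged.
print(len(toy), len(all_complete), len(life_complete))
```

Because `dropna()` returns a new DataFrame instead of changing the original, the result has to be assigned back to a variable, as the cell above does with `data`.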

## Plotly
Plotly is a tool we can use to visualize data. It has a number of different submodules -- we'll start by working with plotly express. 

Plotly express gives you less control over how you display your data than some of Plotly's other modules, but it is much easier to use.

We can create a scatter plot using the code below. `px.scatter()` is a function in plotly express that makes a scatter plot. 

We need to pass it several arguments -- first a pandas dataframe, and then the columns we want to use for our `x` and `y` values. We can pass further arguments that control other aspects of our graph, such as the size or color of our scatter points.

Make sure any column names you pass as arguments are surrounded by quotation marks and are written exactly as they appear in the dataframe (capitalization, underscores, spaces, etc.).
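
One way to catch a misspelled column name before plotting is to print the exact names and compare them with the ones you intend to use. This is only a sketch: `toy`, `wanted`, and `not_found` are illustrative names, and `"country"` is misspelled on purpose.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder DataFrame.
toy = pd.DataFrame({
    "Year": [1964, 1965],
    "Country": ["Afghanistan", "Afghanistan"],
    "life": [33.639, 34.152],
})

# Print the column names exactly as pandas stores them.
print(list(toy.columns))

# Names we plan to pass to a plotting function; "country" has the wrong capitalization.
wanted = ["Year", "life", "country"]

# Collect every requested name that does not match a real column.
not_found = [name for name in wanted if name not in toy.columns]
print(not_found)
```

Any name that appears in `not_found` needs to be corrected before it is passed as an argument.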


In [17]:
px.scatter(data,
           x="fertility",
           y="life",
           animation_frame="Year",
           animation_group="Country",
           size="population",
           color="region",
           hover_name="Country",
           size_max=75,
           range_x=[0, 8],
           range_y=[20, 85])

## Assignment
Use our Gapminder dataframe called `data` and the `px.scatter()` function to make your own plot. Use different columns than the plot above for the various parts of your graph (x-values, y-values, bubble size, color, slider/animation, etc.). You don't need to use all of the options available, but you can.

Make sure you run all the cells above so your variables are defined and your libraries are imported.

**Extra credit**: figure out what the "`animation_group=`" argument does, **because I have no idea.**

In [None]:
#call px.scatter with your arguments here

## Assignment 2
This is the beginning of a long-term assignment. Over the course of the next several weeks, you will find a dataset that interests you and visualize the data in some interesting way. One of the first steps is exploring what kinds of data are publicly available. You can find links to thousands of public datasets at the following sites:

- https://www.lib.ncsu.edu/teaching-and-learning-datasets
- https://guides.emich.edu/data/free-data

Explore some of the data that is available, and begin to think of an area that you might be interested in studying. You may work in groups of up to 3 on this project. I will allow you to select your own groups, but you must tell me what they are and I may refuse to allow particular groups of students (for any reason).
