[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/TobGerken/ISAT300/blob/main/2_DataVisualization.ipynb)

# Data Visualization

**This notebook is published on my GitHub. It is publicly accessible, but you cannot save your changes to my GitHub. Learning git and GitHub is beyond the scope of this course. If you are familiar with GitHub, you know what to do. If you don't know GitHub, you can save a personal copy of the file to your Google Drive, so that you can save your changes and access them at a later date.**

This notebook is a continuation from previous classes:

1. [GettingStarted](https://github.com/TobGerken/ISAT300/blob/main/1_GettingStarted.ipynb)

## Now let's get started

This really only scratches the surface and there are so many good resources around. Come and talk to me if you want to know more. I am also pointing out some resources along the way and will link to them on Canvas. 

We previously covered some [Pandas](https://pandas.pydata.org/) dataframe basics and performed some initial statistics. Well-crafted figures are a powerful tool to communicate our main findings and to [tell compelling stories with data](https://hbr.org/2013/04/how-to-tell-a-story-with-data).

Because we are still using pandas, we have to import it first.

In [None]:
# running this will import pandas.
import pandas as pd

## Getting help

We all need a little help sometimes. Check out the help command. There is going to be a lot of information provided. There is no need to understand all of it, but sometimes this can be helpful. When you are unsure, try `help()`.

In [None]:
help(pd.read_csv)

As a side note, when you edit a cell, you can also hit the `Tab` key and it will provide some suggestions on what you can do.
**Try it out: Type `pd.` and hit the Tab key in the cell below.**

## Reading data into a pandas dataframe

We will use the same data as before and will load it into a dataframe `df`. You can choose to load the data from the online source or to load a local copy.

In [None]:
# This loads the data, which is saved online 
df = pd.read_csv('https://raw.githubusercontent.com/TobGerken/ISAT300/main/Data/mpg_cated.csv')
# This would read a local copy of the data, provided that it is stored in the base folder.
# df = pd.read_csv('./mpg_cated.csv')

Let's remind ourselves what the data we loaded looks like by looking at the first few entries.

In [None]:
df.head()
# You can also display the last few entries 
# df.tail()

In [None]:
# This will give you the dimension of the data
df.shape

The `df.info()` command is another great way of understanding our data. It will provide information about the types of data and how many valid data entries there are. 

In [None]:
df.info()

**Question: Look at the `Non-Null Count` and `Dtype` columns. What do they indicate?**

Missing data can be a big problem for statistics and data visualization, since many methods may fail or produce wrong results when values are missing. Therefore, it is important to always check for missing values.
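As a minimal sketch of such a check, the cell below counts missing values per column. It uses a small made-up dataframe called `toy` (hypothetical values, not the course data), so that the result is easy to verify by eye. The same calls should work on `df`.

```python
import pandas as pd

# Hypothetical toy data, not the course data set; None marks a missing entry
# and is stored by pandas as NaN in a numeric column.
toy = pd.DataFrame({
    "mpg": [18.0, 15.0, None, 16.0],
    "horsepower": [130.0, None, None, 150.0],
})

# isna() flags every missing cell; sum() counts the flags per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() returns a copy that keeps only the rows without any missing value.
complete_rows = toy.dropna()
print(complete_rows.shape)
```

Whether dropping incomplete rows is appropriate depends on the question you are asking, so treat `dropna()` as one option rather than a default.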

## Getting Started with Data Visualization

Python contains very powerful capabilities for data visualization. One of these is the matplotlib library, which is capable of producing complex publication-quality figures with fine layout control in two and three dimensions.

[Matplotlib](http://matplotlib.org) is quite old and is built to be familiar to Matlab users. Although it is an older library, many other libraries are built on top of it and use its syntax. We might encounter some of these later in the semester.

One neat thing about matplotlib is that pandas is actually using the features of matplotlib to produce figures. This will become clearer later. 

For now, you just need to know that making plots is really easy. 

If we have a dataframe we can just call the `.plot()` method and a plot will appear. Neat ;-)

In [None]:
df.plot()

Now we have a plot of the dataframe, but it is not very useful for seeing what is going on.

**Q1: Why?** 

**Q2: What is missing from this plot?**

One of the things we notice is that we have created a line-plot for the entire dataframe. Each column, e.g. `mpg` or `weight`, is treated as a line, which makes it really difficult to differentiate between them. Also we lost any information about categorical data such as `origin`.

Let's try this again. Use your knowledge from the last lecture to select only the `mpg` and `horsepower` columns of the dataframe. Do you remember how to select two columns? The code below will only select a single column.

**Modify it to select two:** 
*Hint: You need to select a list of columns. Lists in Python are specified within square brackets, e.g. `list_of_columns = [item1, item2]`, and are then passed to the dataframe as `df[list_of_columns]`.*

In [None]:
df['mpg'].plot()

### Scatter Plots 

Scatter plots or x-y plots are a great way of visualizing relationships between variables, such as correlations. They are also a great tool to show uncertainty!

Luckily, they are built into the pandas plotting methods:

In [None]:
df.plot(kind = 'scatter', x='mpg', y ='horsepower')
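A scatter plot shows a relationship visually, and a correlation coefficient puts a number on it. The sketch below uses a small made-up dataframe `toy` (hypothetical values chosen to fall exactly on a straight line, not the course data) and the pandas `Series.corr` method, which computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical toy data: horsepower falls by 50 for every 10 mpg gained,
# so the points lie exactly on a falling straight line.
toy = pd.DataFrame({
    "mpg": [10.0, 20.0, 30.0, 40.0],
    "horsepower": [200.0, 150.0, 100.0, 50.0],
})

# Pearson correlation between the two columns (the default method of corr()).
r = toy["mpg"].corr(toy["horsepower"])

# A perfectly linear, falling relationship gives r = -1 (up to rounding).
print(round(r, 2))
```

Real data will rarely be this tidy, so expect values somewhere between -1 and 1 when you try `df['mpg'].corr(df['horsepower'])`.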


#### Making your plots pretty and useful. 

There are a lot of options for changing the appearance of your plots. For example, you can change the style of plot using the `kind` parameter and setting it to the desired plot type.

Your options are (not all of these may work for your data):

- 'line' : line plot (default)
- 'bar' : vertical bar plot
- 'barh' : horizontal bar plot
- 'hist' : histogram
- 'box' : boxplot
- 'kde' : Kernel Density Estimation plot
- 'density' : same as 'kde'
- 'area' : area plot
- 'pie' : pie plot
- 'scatter' : scatter plot (DataFrame only)
- 'hexbin' : hexbin plot (DataFrame only)

Similarly, there are many other things you can specify. Go to the documentation for the [dataframe plotting function](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) and explore your options. 
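As a small sketch of how `kind` and one additional keyword argument (`title`) are passed, the cell below draws the same made-up dataframe `toy` (hypothetical values, not the course data) with three different plot types. Each call to `DataFrame.plot` here creates a new figure.

```python
import pandas as pd

# Hypothetical toy data, only used to compare plot types.
toy = pd.DataFrame({
    "weight": [2100, 2500, 3200, 3900, 4400],
    "mpg": [33.0, 28.0, 21.0, 15.0, 12.0],
})

# Draw the same data three times, changing only the `kind` parameter.
# The title records which kind was used for each figure.
for kind in ["line", "box", "hist"]:
    ax = toy.plot(kind=kind, title=f"kind = '{kind}'")
```

Note how some plot types suit this data better than others, which is why not every `kind` is a sensible choice for every dataframe.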

**Use the documentation and your smarts to understand what the parameters below are doing.**

**Next, try changing the plot by setting the `ylabel` to something more sensible and by changing the color. Your color options are found [here](https://matplotlib.org/stable/gallery/color/named_colors.html#).**

In [None]:
df.plot(x='weight', y='mpg', kind = 'scatter', color = 'k', fontsize = 14, 
        xlabel = 'Weight of car in pounds', title = 'some title')

You can see that the fontsize of the title and the x and y labels did not change.

Unfortunately, we cannot do this directly using pandas. We have to use matplotlib directly. So we first import the `pyplot` module from the matplotlib library. [Pyplot](https://matplotlib.org/stable/tutorials/introductory/pyplot.html) is designed to easily create and manipulate matplotlib figures and works very similarly to plotting in [Matlab](https://www.mathworks.com/products/matlab.html), a commonly used (but expensive) programming language in science & engineering.

In [None]:
# this imports pyplot and makes it available when calling plt (this is a common convention to save some typing)
import matplotlib.pyplot as plt

df.plot(x='weight', y='mpg', kind = 'scatter', color = 'k', fontsize = 14, xlabel = 'Weight of car in pounds')

# We can then use the ylabel and title functions in pyplot, which are a bit more flexible and let us adjust the fontsize.
plt.ylabel('Miles Per Gallon', fontsize = 20, color = 'b')
plt.title('Some big Title in Red', fontsize = 20, color = 'r')

# I really recommend checking out the tutorial for pyplot.
# https://matplotlib.org/stable/tutorials/introductory/index.html
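The connection between pandas and matplotlib can also be seen from the other side: `df.plot()` returns a matplotlib `Axes` object, and that object has its own methods for labels and titles. The sketch below is an alternative to the `plt.ylabel` / `plt.title` calls above, not a replacement for them. It uses a made-up dataframe `toy` (hypothetical values, not the course data).

```python
import pandas as pd
from matplotlib.axes import Axes

# Hypothetical toy data, only used to have something to draw.
toy = pd.DataFrame({
    "weight": [2100, 2500, 3200, 3900],
    "mpg": [33.0, 28.0, 21.0, 15.0],
})

# The pandas plot call hands back the matplotlib Axes it drew on.
ax = toy.plot(x="weight", y="mpg", kind="scatter", color="k")
print(isinstance(ax, Axes))

# The Axes methods accept text properties such as fontsize directly.
ax.set_xlabel("Weight of car in pounds", fontsize=16)
ax.set_ylabel("Miles per gallon", fontsize=16)
ax.set_title("Toy example", fontsize=20)
```

Keeping a reference to `ax` can be handy once a notebook contains several figures, because it makes explicit which plot is being modified.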

### Histograms 
Text taken from: [Python for Data Visualization](https://www.linkedin.com/learning/python-for-data-visualization/effectively-present-data-with-python?autoplay=true&u=50844473) on LinkedIn Learning
> It is a common practice to create histograms to explore your data as it can give you a general idea of what your data looks like. A histogram is a summary of the variation in a measured variable. It shows the number of samples that occur in a category. A histogram is a type of frequency distribution.

> Histograms work by binning the entire range of values into a series of intervals and then counting how many values fall into each interval. While the intervals are often of equal size, they are not required to be.

It would be nice if we could summarize the distribution of our gas mileage.

In [None]:
df['mpg']
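Before plotting, it may help to see the binning-and-counting step from the quote above written out in numbers. The sketch below uses NumPy's `np.histogram` on a handful of made-up values (hypothetical, not the course data) with hand-picked interval edges. NumPy is an extra import that the notebook has not used so far.

```python
import numpy as np

# Hypothetical mileage values, chosen so the counts are easy to check by eye.
values = [12, 18, 21, 24, 26, 33]

# Three intervals: 10 to 20, 20 to 30, and 30 to 40.
edges = [10, 20, 30, 40]

# np.histogram counts how many values fall into each interval.
counts, returned_edges = np.histogram(values, bins=edges)
print(counts)
```

Two values fall between 10 and 20, three between 20 and 30, and one between 30 and 40. A histogram plot draws one bar per interval with these counts as heights.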

Let's create a histogram. 

**Compare the two histograms below. What does `bins` specify in each? What is the difference?**

You may have also noticed that I used two different methods to create a histogram. Either one works, but they behave slightly differently. 

In [None]:
df['mpg'].plot(kind = 'hist', bins =7)

In [None]:
df['mpg'].hist(bins = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50])

Finally, try creating a nice histogram that shows the distribution of `weight` and has all the features, like labels, titles, etc., that you would expect from a nice plot.

In [None]:
# You can use this as a starter.

df['weight'].plot(kind = 'hist', grid = True)