# Tutorial 8 - Matplotlib

TA: Collin Sakal


In [1]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### IMPORTANT!!!

Make sure you have **downloaded all files in the "Tutorial-8-Matplotlib" folder** for this tutorial. Both the Jupyter notebook and the Data.csv file need to be in the same directory (folder) for the code below to work.

**Please follow the steps below:**
1. Download the entire "Tutorial-8-Matplotlib" folder from Canvas 
2. Open Anaconda, go to Jupyter notebook
3. Click File -> Open -> Tutorial-8-Matplotlib -> Tutorial-8-Matplotlib-Jupyter-Notebook
4. Run the cell below "Import Data" and make sure there are no errors

**Alternatively:**
1. Download the Tutorial 8 Jupyter Notebook and Data.csv file 
2. Place both files in the same folder
3. Open the Jupyter notebook, run the cell below "Import Data", and make sure there are no errors
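If the import cell fails, a quick check like the one below can help confirm whether Data.csv is actually in the folder the notebook is running from. This is only an optional sketch and not part of the original tutorial; the names `data_path` and `found` are made up for illustration.

```python
from pathlib import Path

# Hypothetical helper variables, used only for this check
data_path = Path("Data.csv")    # file name used by the "Import Data" cell
found = data_path.exists()      # True if the file is in the current folder

if found:
    print("Data.csv found, the import cell should work.")
else:
    # Path.cwd() is the folder the notebook is currently running from
    print(f"Data.csv not found. Move it into: {Path.cwd()}")
```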

### Import Data

In [2]:
df = pd.read_csv("Data.csv")
df.head()

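| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148.0 | 72.0 | 35.0 | NaN | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183.0 | 64.0 | NaN | NaN | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89.0 | 66.0 | 23.0 | 94.0 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137.0 | 40.0 | 35.0 | 168.0 | 43.1 | 2.288 | 33 | 1 |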


### Goals for this tutorial

The aim of this tutorial is to explore how to use matplotlib to generate figures (histograms, scatterplots, etc.) to learn more about a data set. When conducting an analysis, this is called an "Exploratory Data Analysis", or "EDA" for short.

### Reference table for this tutorial

In [None]:
# Selecting a column from a pandas data frame
df.loc[:,'Glucose'] 

# Histogram Code
plt.hist(df.loc[:,'Glucose'],    # Data to be plotted
         bins = 10,              # Number of bins (bars) in the histogram
         color = 'red',          # Color of the histogram bars. Example colors: 'red', 'blue', 'black', 'pink'
         edgecolor = 'black')    # Color of the histogram bar outlines. 

# Scatterplot Code
plt.plot(df.loc[:,'Glucose'],  # x-axis data
         df.loc[:,'Age'],      # y-axis data
         'o',                  # plot x,y coordinates as circles
         color = 'red')        # color of the circles

# Lineplot code (to add another line, just add another plt.plot below)
plt.plot([1,2,3,4,5,6],              # x-axis data
         [9,8,7,6,5,4],              # y-axis data
         label = "Glucose by Age")   # label for the line

# Labeling plots
plt.title("Title")
plt.xlabel("x-axis label")
plt.ylabel("y-axis label")

# Adding a legend
legend = plt.legend(loc = 'upper center')

# Calculating column-wise statistics
df.loc[:, 'Glucose'].max()   # Maximum


**(1) Create a Histogram of the "Pregnancies" variable**

Please fill in the code below to create a histogram with:
1. Four bins 
2. The title "Histogram of Pregnancies"
3. X-axis label "Number of Pregnancies"
4. Y-axis label "Count"
5. Red bars and black bar outlines

In [None]:
# Please write your code below

# Histogram code


# Labeling Code


**(2) Re-create the histogram from (1) but with the number of bins equal to the maximum number of pregnancies**

**HINT:** You will need to calculate the maximum number of pregnancies in the Pregnancies column first, then pass that number to the bins argument.
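The pattern in the hint (compute a column maximum, then reuse it as the number of bins) can be sketched on made-up data as follows. The `toy` data frame and its "Visits" column are hypothetical and are not taken from Data.csv, so you will still need to adapt the idea to the Pregnancies column yourself.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data (NOT from Data.csv), used only to show the pattern
toy = pd.DataFrame({"Visits": [0, 1, 1, 2, 3, 3, 3, 5, 6, 8]})

# Step 1: calculate the column maximum and convert it to a plain integer
max_visits = int(toy.loc[:, "Visits"].max())

# Step 2: pass the maximum to the bins argument
plt.figure()
counts, bin_edges, patches = plt.hist(toy.loc[:, "Visits"],
                                      bins = max_visits,
                                      color = 'red',
                                      edgecolor = 'black')

# Labeling
plt.title("Histogram of Visits")
plt.xlabel("Number of Visits")
plt.ylabel("Count")
```

`plt.hist` returns the bar heights, the bin edges, and the drawn bars, so `len(counts)` is a simple way to confirm that the number of bins matches the maximum.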

**BONUS 1:** Why is the maximum number of pregnancies a good choice for the number of bins?

**BONUS 2:** What is one **useful** piece of information about the pregnancies variable that a histogram cannot tell us? (There are multiple correct answers)

In [None]:
# Please write your code below

# Histogram code


# Labeling Code

**(3) Suppose we are interested in the relationship between BMI and skin thickness. What type of plot would be useful for examining the relationship?**

1. Histogram 
2. Scatterplot

**(4) Create a scatterplot of BMI and skin thickness such that:**

1. Skin Thickness is on the x-axis, BMI on the y-axis
2. The title is "BMI versus Skin Thickness"
3. The x-axis title is "Skin Thickness"
4. The y-axis title is "Body Mass Index"
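As a rough sketch of the scatterplot pattern from the reference table, the example below plots two made-up lists and labels the axes. The lists `hours_studied` and `exam_score` are hypothetical and have nothing to do with Data.csv; for the exercise you would select the two columns with `df.loc` instead.

```python
import matplotlib.pyplot as plt

# Hypothetical toy data (NOT from Data.csv)
hours_studied = [1, 2, 3, 4, 5, 6]
exam_score = [52, 58, 61, 70, 74, 83]

# Scatterplot: the 'o' format string draws circles with no connecting line
plt.figure()
points = plt.plot(hours_studied,    # x-axis data
                  exam_score,       # y-axis data
                  'o',              # plot x,y coordinates as circles
                  color = 'red')[0] # plt.plot returns a list; keep the first item

# Labeling
plt.title("Exam Score versus Hours Studied")
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")
```

The first list passed to `plt.plot` always goes on the x-axis and the second on the y-axis, so the order of the arguments decides which variable ends up where.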

In [None]:
# Please write your code below

# Scatterplot code

# Labeling Code

**(5) What is the relationship between skin thickness and BMI?**

**BONUS 1:** Would we expect to see this relationship? Recall that BMI (Body Mass Index) estimates the amount of body fat a person has.

**BONUS 2:** Is there anything unusual about the data in the plot? If so, what?

**(6) Please choose your answer to the question below** 

Suppose we gather data on the number of people in the city our data was gathered from. We have two arrays of data: population (number of people), and year (2010 - 2020). What type of plot would allow us to examine changes in the population from 2010 - 2020? 

1. Histogram
2. Scatterplot
3. Lineplot

**BONUS:** Why are the other two plot options not ideal to examine this relationship?

In [None]:
population_fake_city = [5500, 6000, 7500, 7000, 7300, 8500, 10000, 11000, 11100, 13400, 15000]
year = [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020]

**(7) Make a lineplot with:**

1. Year on the X-axis
2. Population on the Y-Axis
3. Title "Population Trends in Fake City, USA"
4. X-axis label "Year"
5. Y-axis label "Population"

In [None]:
# Please write your code below

# Lineplot code

# Labeling code

**(8) Re-create the lineplot from (7) but:**

1. Add the population of city 2: Croton-on-Hudson, New York, USA
2. Add a label for each line (HINT: you will need to add a legend too!!)
3. Change the title to "Population trends of Croton-on-Hudson and Fake City, USA"
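Drawing two lines on one plot with a legend could look something like the sketch below. The data and the names `months`, `series_a`, and `series_b` are invented for illustration; the exercise itself uses the year and population lists from this notebook.

```python
import matplotlib.pyplot as plt

# Hypothetical toy data (NOT the population data from this tutorial)
months = [1, 2, 3, 4, 5, 6]
series_a = [9, 8, 7, 6, 5, 4]
series_b = [3, 4, 4, 5, 7, 8]

# Lineplot: each plt.plot call adds one line to the same figure
plt.figure()
plt.plot(months, series_a, label = "Series A")
plt.plot(months, series_b, label = "Series B")

# Labeling
plt.title("Two Made-Up Series")
plt.xlabel("Month")
plt.ylabel("Value")

# Legend: uses the label given to each plt.plot call
legend = plt.legend(loc = 'upper center')
```

Without the `plt.legend` call the labels are stored on the lines but never shown, which is why the hint above mentions adding a legend.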

**Fun fact**: Croton-on-Hudson, New York, USA is my hometown.

In [None]:
population_city_2 = [8070, 7983, 8031, 8113, 8168, 8202, 8209, 8257, 8183, 8155, 8121]

In [None]:
# Please write your code below

# Lineplot code

# Labeling code

# legend code

### Miscellaneous points for interested students

Though matplotlib is probably the most popular, Python has many other plotting libraries. Here are a few that you may find interesting:

1. For R users, the [plotnine](https://plotnine.readthedocs.io/en/stable/) package is similar to ggplot2 but for Python
2. [Plotly](https://plotly.com/python/) creates good-looking plots without much code, but the plots are more difficult to customize
3. [Seaborn](https://seaborn.pydata.org/) is also quite popular. You'll learn more about it in later lectures. 