----

# Data Visualization

----

### Table of Contents

1 - [Matplotlib](#section1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.1 - [Line Plots](#subsection1)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.2 - [Scatter Plots](#subsection2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.3 - [Bar Plots](#subsection3)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1.4 - [Histograms](#subsection4)<br>

2 - [Working with Data: Energy](#section2)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.1 - [Pie Charts](#subsection5)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.2 - [Bar Charts](#subsection6)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.3 - [Line Charts](#subsection7)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.4 - [Stacked Bar Charts](#subsection8)<br>

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2.5 - [Stacked Area Charts](#subsection9)<br>

3 - [What is a mathematical function?](#section3)<br>

4 - [Linear functions and simple models](#section4)<br>

5 - [BONUS: Gaussian functions and probability distributions](#section5)<br>

6 - [Exploring Python data visualization further...](#section6)<br>




As in our previous notebook, we first need to run some code to make sure everything is set up correctly. Please run the cells below.

## 1. Matplotlib <a name='section1'></a>

We'll create visualizations in Python using a popular Python package called matplotlib. Let's import matplotlib along with several other Python libraries that we will be using:

In [None]:
import math
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt

To plot something, we need data! To begin, we will use very simple arrays. Let's try plotting just two points, `(x1, y1) = (1,1)`; `(x2, y2) = (5,5)`

In [None]:
# EXAMPLE

xpoints = np.array([1,5])
ypoints = np.array([1,5])

plt.plot(xpoints, ypoints);

**Quick note**: Put a semicolon `;` at the end of the last line in a Jupyter notebook cell to stop the notebook from printing that line's return value. Try removing the semicolon in the above cell and see how the output changes.

You can use the keyword argument `marker` to mark each point:

In [None]:
# EXAMPLE

plt.plot(xpoints, ypoints, marker='o');

There are other markers that you can find in [matplotlib's documentation page](https://matplotlib.org/stable/api/markers_api.html).

Choose a marker from the documentation page and try it out below!

In [None]:
# EXERCISE - Try plotting with a marker of your choice:

plt.plot(xpoints, ypoints, marker=...);

If you pass the marker as a format string instead of using the `marker` keyword, the points are plotted without a connecting line:

In [None]:
# EXAMPLE

plt.plot(xpoints, ypoints, '^');

### 1.1 Line Plots <a id='subsection1'></a>

We've actually created line plots above, but let's go further. Any good plot needs a title and labels for the x and y axes:

In [None]:
# EXAMPLE

time = np.array([0, 5, 10, 15, 20])
distance = np.array([0, 12.5, 50, 112.5, 200])

plt.title('Distance over Time')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
plt.plot(time, distance);

In [None]:
# EXAMPLE - even more customization with colors, linewidth and linestyle

time = np.array([0, 5, 10, 15, 20])
distance = np.array([0, 12.5, 50, 112.5, 200])

plt.title('Distance over Time')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
plt.plot(time, distance, linewidth=5, color='darkmagenta', linestyle='dashed');

You can find a table of colors [here](https://www.w3schools.com/colors/colors_names.asp). Note that you can use either the color names like above or the hexadecimal value (for example the hexadecimal value for Dark Magenta is #8B008B).
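As a small check of the name/hexadecimal equivalence mentioned above, the sketch below uses `matplotlib.colors.to_hex` to convert a color name into its hexadecimal string. The variable name `hex_code` is ours and is only for illustration.

```python
from matplotlib.colors import to_hex

# Convert a named color into its hexadecimal code
hex_code = to_hex('darkmagenta')
print(hex_code)  # '#8b008b', the lowercase form of #8B008B

# Either form can be passed to the color keyword, for example color='#8B008B'
```

Matplotlib treats uppercase and lowercase hexadecimal codes the same way, so `'#8B008B'` and `'#8b008b'` give the same color.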

### 1.2 Scatter Plots <a name='subsection2'></a>

Scatter plots are useful when you are looking for a relationship between two variables.

We can use the `scatter()` function to draw a scatter plot:

In [None]:
# EXAMPLE

plt.title('Distance over Time')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
plt.scatter(time, distance);

In [None]:
# EXAMPLE - Two data sets

time2 = np.array([0, 5, 10, 15, 20])
distance2 = np.array([0, 50, 100, 150, 200])

plt.title('Distance over Time')
plt.xlabel('Time (s)')
plt.ylabel('Distance (m)')
plt.scatter(time, distance, color='red'); #Similar to before you can change the color
plt.scatter(time2, distance2, color='blue');

For the exercise below, you are given data points for average temperature (in Fahrenheit) and average humidity (in %) for Berkeley over 12 months (data from [Climate-Data.org](https://en.climate-data.org/north-america/united-states-of-america/california/berkeley-1266/)).  
Make two scatter plots - one for each data set. Don't forget the plot title and axes labels.

In [None]:
# Data sets for the exercise

months = np.array(["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
temperature = np.array([48.1, 49.9, 52.7, 55.3, 59.5, 63.9, 64.6, 64.9, 65, 61.3, 54, 48.6])
humidity = np.array([77, 78, 74, 66, 62, 60, 65, 66, 62, 63, 72, 76])

In [None]:
# EXERCISE - Create a scatter plot for Berkeley's average temperature data

plt.title('Berkeley\'s Average Temperature Over a Year')
plt.xlabel(...)
plt.ylabel('Temperature (Fahrenheit)')
plt.scatter(...);

In [None]:
# EXERCISE - Create a scatter plot for Berkeley's average humidity data

plt.title(...)
plt.xlabel(...)
plt.ylabel(...)
plt.scatter(...);

For the next exercises we will look at some data from the data survey taken earlier in BLDAP. Let's first read in the data.

In the cell below we will read in the survey data and also be running code to clean up the column names so they are easier to use. You do not need to modify the code, but we are keeping it here in case anyone would like to see how easy it is to do this with Python - just 3 lines!

In [None]:
### READING IN DATA - DO NOT MODIFY

# Survey as is saved from the system
survey_data = pd.read_csv('data/BLDAP2025DataSurvey.csv') #MODIFY AS NEEDED FOR THE YEAR

# Cleaning up the column names
old_columns = survey_data.columns
survey_data.rename(columns={old_columns[0]: 'Hours Slept', old_columns[1]:'Number of Siblings', old_columns[2]: 'Height', old_columns[3]: 'Home-LBNL Distance', old_columns[4]: 'Commute Time', old_columns[5]: 'Pineapple Pizza Rating'}, inplace=True)
survey_data.drop(columns=old_columns[0], inplace=True)

# Let's take a look at our dataframe
survey_data.head()

**EXERCISE**
Produce a scatter plot looking at participants' distances from home to the Lab and their commute time.
Is there a correlation or a trend?

*Hint: Check back to notebook 06 to see how to get just a column of data.*

In [None]:
#Grab relevant data from the dataframe (see notebook 06)
x_data = ... # Home-LBNL Distance data
y_data = ... # Commute Time data

#Create axes labels with units
plt.xlabel('...')
plt.ylabel('...')

#Plot data
plt.scatter(x_data, y_data);

**EXERCISE**

Try creating scatter plots for other data columns. What about the number of siblings and the pineapple pizza rating?

In [None]:
#Grab relevant data from the dataframe
x_data = ...
y_data = ...

#Create axes labels with units if relevant
plt.xlabel('...')
plt.ylabel('...')

#Plot data
plt.scatter(...);

### 1.3 Bar Plots <a name='subsection3'></a>

Bar plots represent categorical data with rectangular bars (vertical or horizontal) that are proportional to the values they represent.

In [None]:
# EXAMPLE

courses = np.array(["C", "C++", "Fortran", "Java", "MATLAB", "Python"])
students = np.array([10, 20, 5, 30, 30, 50])

plt.title('Number of students enrolled in coding classes')
plt.xlabel('Coding Courses')
plt.ylabel('Number of Students')
plt.bar(courses, students);

In [None]:
#EXAMPLE - same data but horizontal

plt.title('Number of students enrolled in coding classes')
plt.xlabel('Number of Students')
plt.ylabel('Coding Courses')
plt.barh(courses, students);

### 1.4 Histograms <a name='subsection4'></a>

A histogram is a common data distribution chart that is used to show the frequency with which specific values, or values within ranges, occur in a set of data.

The horizontal axis of a histogram displays the number range and the vertical axis represents the amount of data (frequency) that is present in each range.
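To make the counting behind a histogram concrete, here is a minimal sketch using `np.histogram`, which returns the counts and the bin boundaries without drawing anything. The values are made up for illustration, and `bins=4` asks for 4 equal-width bins.

```python
import numpy as np

# Made-up values, only for illustration
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

# Split the range 1 to 5 into 4 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=4)

print(edges)   # bin boundaries along the horizontal axis: 1, 2, 3, 4, 5
print(counts)  # number of values in each bin (the bar heights): 1, 2, 3, 3
```

Note that the last bin includes its right edge, so the value 5 is counted together with the two 4s.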

In [None]:
# EXAMPLE

hoursSlept = np.array([3, 4, 5, 5, 5, 6, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 9, 10, 10, 11, 12])

plt.title('Number of Hours Slept for A Class of 11th Graders')
plt.xlabel('Number of Hours')
plt.ylabel('Frequency')
plt.hist(hoursSlept);

Make the histogram again, but this time with 5 bins. We can do this by adding `bins=5` inside the histogram command, after the dataset.

Make sure to include the title and axis labels.

In [None]:
#EXERCISE

plt.title('Number of Hours Slept for A Class of 11th Graders')
plt.xlabel('Number of Hours')
plt.ylabel('Frequency')
plt.hist(hoursSlept, bins=...);

What do you notice when we reduce the number of bins to 5?

There is no "right" way to display a histogram, but it's important to keep in mind that some bin counts convey more information than others.  
For example, if we reduce the number of bins even further to 2, we no longer get a good grasp of the distribution of hours slept. Fill out the code below to use 2 bins:

In [None]:
#EXERCISE

plt.title...
plt.xlabel...
plt.ylabel...
plt.hist(hoursSlept, ...);

Let's play around with this a little more. To do this, we need to generate some data. We can use the function `np.random.normal()`, which generates normally distributed data.

Look online to figure out how to use this function.

In [None]:
# EXERCISE -
# Use np.random.normal to generate a normally distributed dataset with 100000 values.

x = ...
plt.hist(x);

In [None]:
# Play around with the number of bins for this histogram with random numbers.
# How many bins do you prefer?

plt.hist(x, ...);

**EXERCISE**

Go back to the survey data and create a histogram of the heights. Recall that the survey data was read into a dataframe called `survey_data`. Look at your previous exercises under scatter plots to see how you grabbed columns of data.

Does your histogram look like a bell? Play around with the number of bins.

In [None]:
height = ...

plt.hist(...);

**Discussion Question:** What is the difference between a bar plot and a histogram?

## 2. Working with Data: Energy <a name='section2'></a>

We will learn to plot some interesting data related to the energy sector. Most of this data comes from the US Energy Information Administration (EIA). The EIA was established in 1977, driven by the energy crisis of the 1970s, but its energy data goes back much further than that. The EIA has formalized data collection for the energy sector, and we use that data extensively in our research.

### 2.1 Pie Charts <a name='subsection5'></a>

Let's say you want to visualize how much energy was consumed in each sector in a given year. The EIA summarizes this data using British thermal units (Btu). 1 Btu is approximately equal to 1055 joules. In 2020, the total energy consumed in the United States was about 93 Quadrillion Btu (Quad Btu, in short).

Quick fact: 1 Quadrillion = 10^15. That's 1 followed by 15 zeros!
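The unit conversions above can be worked through in a few lines of plain Python. This is only a sketch using the approximate numbers quoted in the text. The constant and variable names are ours. The result is also converted to Trillion Btu, the unit used on the axes of the bar and line charts later in this section.

```python
BTU_IN_JOULES = 1055          # approximate conversion quoted above
QUADRILLION = 10**15
TRILLION = 10**12

total_quad_btu = 93           # approximate US consumption in 2020, in Quad Btu

total_btu = total_quad_btu * QUADRILLION
total_trillion_btu = total_btu // TRILLION
total_joules = total_btu * BTU_IN_JOULES

print(total_trillion_btu)     # 93000
print(f"{total_joules:.4e}")  # 9.8115e+19
```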

In [None]:
# EXAMPLE
tot_energy=pd.read_csv('data/total_consumption_2020.csv')
tot_energy

To create a pie chart, we will use the `pie()` function from matplotlib. We will pass in the Consumption column from our data set and use the Sector column as our labels. The `startangle` keyword allows us to place the start of the pie chart 90 degrees counterclockwise from the x-axis (you can experiment by changing the `startangle` back to its default value of 0 degrees and seeing how different the pie chart will look).

In [None]:
plt.pie(tot_energy['Consumption'], labels=tot_energy['Sector'], startangle=90);

What are some things you notice? Which type of energy consumption makes up the most of total energy consumption? The least?

*Write your answers here by editing this textbox.*



**Exercise:** Create the same pie chart for the California data and compare the fractions. The data is in a CSV file called `total_consumption_ca_2020.csv` located in the `data` folder.

In [None]:
#First read in the data file using pandas
ca_energy = pd.read_csv('...')
ca_energy

In [None]:
#Create your pie chart below
#Remember to use the Consumption column for the data and the Sector column for the labels

plt.pie(ca_energy[...], labels=ca_energy[...], startangle=90);

How does California's energy consumption compare to the total energy consumption?

*Write your observations here by editing this textbox.*

### 2.2 Bar Charts <a name='subsection6'></a>

Let us try to plot the US total consumption data using bar charts.

In [None]:
# EXAMPLE
plt.ylabel('Energy consumption in Trillion Btu')
plt.bar(tot_energy['Sector'], tot_energy['Consumption']);

If we sort this data in ascending or descending order, it is easier to compare. Go back to Notebook 07 (Pandas Dataframes) and use your good friend Google to see how to sort by values, then create a bar chart that shows the sector with highest consumption on the left - in other words, the data is presented in descending order.

If you did it correctly, your bar chart should look like this:

![](https://drive.google.com/uc?export=view&id=1YiH0eyWq9zuNMfxBzq3OdsmR-ZZ_I9Ez)

In [None]:
#First sort the values by Consumption
tot_energy_sort = tot_energy....
tot_energy_sort

In [None]:
#Now create a bar chart with the sorted values

plt.ylabel('...') #Don't forget to label your y-axis
plt.bar(...,...);

In [None]:
#Bonus - use the same sorted data but make it horizontal
#Hint - look at the earlier section on Bar charts in this notebook

plt.xlabel('...') #The consumption values now run along the x-axis, so label that axis
plt....

### 2.3 Line Charts <a name='subsection7'></a>

Line charts are generally useful for depicting continuous data (e.g., time series data). For example, if we want to plot the total energy consumption from 2000 to 2021, a line chart is a good option because it helps us understand how a value changes from one year to the next. Let us plot this data and see.

In [None]:
#First read in the data
energy_00_21 = pd.read_csv('data/consumption_2000_2021.csv')
energy_00_21.head()

In [None]:
#Now time to plot - don't forget labelled axes!
plt.xlabel('Year')
plt.ylabel('Energy consumption in Trillion Btu')
plt.plot(energy_00_21['Year'], energy_00_21['Consumption'], linewidth=2, color='blue');

What are some things you notice? When did total energy consumption decrease? Can you think why?

*Write your answers here by editing this textbox.*

**Exercise:** Let's plot the data again but this time make it so that the y-axis spans a range of 30,000 trillion Btu to 105,000 trillion Btu. We will use the `ylim` function to do this.

In [None]:
#EXERCISE

#First use plt.ylim() to set the y-axis to go from 30000 to 105000 trillion Btu
plt.ylim(...)

#Now create the line plot - don't forget labelled axes!
plt.xlabel('...')
plt.ylabel('...')
plt.plot(..., ..., linewidth=2, color='blue');

In the first plot, it appeared as if there are huge spikes (up and down) from one year to the next. But notice that the lower limit on y-axis is set to about 93,000 and upper limit is a little over 100,000 Trillion Btu and so, the yearly variation is amplified.

In the second plot, we see a more muted effect of the same data by setting the bottom value to a much smaller number.
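To put a number on this effect, the sketch below computes how much of the visible y-axis a given change covers under the two sets of limits. The 7,000 Trillion Btu change and the 101,000 upper limit are illustrative assumptions, not values read from the data. The helper name `fraction_of_axis` is ours.

```python
def fraction_of_axis(change, y_min, y_max):
    """Return the fraction of the visible y-axis height covered by a change."""
    return change / (y_max - y_min)

# Hypothetical year-to-year change, in Trillion Btu
change = 7000

# Limits similar to the first plot (roughly 93,000 to a little over 100,000)
zoomed_in = fraction_of_axis(change, 93000, 101000)

# Limits used in the exercise (30,000 to 105,000)
zoomed_out = fraction_of_axis(change, 30000, 105000)

print(f"{zoomed_in:.1%} of the axis height")   # 87.5% of the axis height
print(f"{zoomed_out:.1%} of the axis height")  # 9.3% of the axis height
```

The data are identical in both cases. Only the share of the plot that the change occupies is different.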

Which plot gives a more accurate picture of the trend in energy consumption? Why? Discuss with your group and write what your group came up with in the textbox below:

*Write your answers here by editing this textbox.*

**Exercise:** Let's visualize data over a larger range of years, from 1950 to 2021. The data file is called `consumption_1950_2021.csv` (located in the data folder). Try to figure out on your own the best values to set the y-axis range (you can look back to notebooks 06 and 07 to see if any pandas functions can help you out...)

In [None]:
#Use pandas to read in the data
energy_50_21 = pd....
energy_50_21.head()

In [None]:
#How will you figure out what values to use for the y-axis range?



In [None]:
#Now let's plot the data! Let's add a title and axis labels

#Use plt.ylim() to set the y-axis range
plt.ylim(30000, 105000)

plt.title('...')
plt.xlabel('...')
plt.ylabel('...')
plt.plot(..., ..., linewidth=2, color='red');

If you did it correctly, your graph should look similar to the below (note that you don't have to choose red as your line color):

![](https://drive.google.com/uc?export=view&id=1DwFySSArqPndai3l5Z_HrsMk9knQclM5)

What are some things you notice? What are some things you wonder?

*Write your answers here by editing this textbox.*

### 2.4 Stacked Bar Charts <a name='subsection8'></a>

The next type of visualization is called a stacked bar chart. If each data point is made up of several components that add up to a total, this type of graph is useful for visualizing those components as fractions of the total. We will use the EIA's Residential Energy Consumption Survey (RECS) data from 2015 to create this graph. RECS is a periodic survey conducted by the EIA that provides details about household energy consumption. The EIA also conducts similar surveys for commercial buildings (Commercial Building Energy Consumption Survey, or CBECS) and for the manufacturing sector (Manufacturing Energy Consumption Survey, or MECS).

Among other things, RECS contains the average values of end-use consumption across the regions of the United States (North-east, South, Midwest and West). RECS considers space heating, water heating, air conditioners and refrigerators to be the main end-uses in a household, and the remaining end-uses are combined into a category called "other".

In [None]:
#EXAMPLE - Read in RECS data from 2015

recs_15 = pd.read_csv('data/recs_enduse_2015.csv')
recs_15

In [None]:
#First let's get the data by grabbing columns
x = recs_15['Region']
y1 = recs_15['Space heating']
y2 = recs_15['Water heating']
y3 = recs_15['Air Conditioning']
y4 = recs_15['Refrigerators']
y5 = recs_15['Others']

In [None]:
#Let's make our stacked bar chart
#The keyword bottom sets where each bar starts, which lets us stack each data set on top of the previous ones

#Plotting Space heating consumption data at the bottom
plt.bar(x, y1, label=recs_15.columns[1]) #recs_15.columns[1] outputs 'Space heating'

#Stacking Water heating consumption data on top
plt.bar(x, y2, bottom=y1,label=recs_15.columns[2])

#Stacking Air conditioning consumption data on top
plt.bar(x, y3, bottom=y1+y2, label=recs_15.columns[3])

#Stacking Refrigerators consumption data on top
plt.bar(x, y4, bottom=y1+y2+y3, label=recs_15.columns[4])

#Stacking Others consumption data on the very top
plt.bar(x, y5, bottom=y1+y2+y3+y4, label=recs_15.columns[5])

#Adding the y-axis label and a legend
plt.ylabel('Energy consumption in Million Btu')
plt.legend();

As we can see, stacking the components one above the other gives us an idea about how much each end-use contributes to the total energy consumption. From the above plot, we see that space heating is pretty much the biggest fraction of the total energy consumption, at least in the North-east and Midwest regions.
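The `bottom` values used in the stacked bar chart are running sums of the layers below. The sketch here uses made-up numbers for three regions (not RECS data) to show how each layer's starting height is built up.

```python
import numpy as np

# Made-up values for three regions, only for illustration
y1 = np.array([10, 20, 30])   # first component (drawn at the bottom)
y2 = np.array([5, 5, 5])      # second component
y3 = np.array([1, 2, 3])      # third component

# Each layer starts where the layers below it end
bottom_y1 = np.zeros_like(y1)   # the first layer starts at zero
bottom_y2 = y1                  # same as bottom=y1
bottom_y3 = y1 + y2             # same as bottom=y1+y2

# The top of the last layer is the total for each region
totals = bottom_y3 + y3
print(totals)  # [16 27 38]
```

This works element by element because the columns are arrays of the same length, one entry per region.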

**Exercise:** Let's go back to the California total consumption data from 2020 (see section 2.1) and make a stacked bar chart.

In [None]:
ca_energy

In [None]:
#EXERCISE - First we will get the data
#Hint - All data is in the column 'Consumption'. First grab that column then grab each sector's value using its index
x = 'California'
y1 = ca_energy['...'][...] #Residential
y2 = ca_energy['...'][...] #Commercial
y3 = ca_energy['...'][...] #Industrial
y4 = ca_energy['...'][...] #Transportation

In [None]:
#EXERCISE - Create a stacked bar chart similar to the example above
plt.bar(..., ..., label='Residential')
plt.bar(..., ..., bottom=..., label='Commercial')
plt.bar(..., ..., bottom=..., label='Industrial')
plt.bar(..., ..., bottom=..., label='Transportation')

plt.ylabel('...')
plt.legend();

If you did it correctly, your graph should look similar to the below:

![](https://drive.google.com/uc?export=view&id=1MxARJ5r-PRYQcp1fp8CazOgq5Zfz6-sU)

### 2.5 Stacked Area Charts <a name='subsection9'></a>

Stacked area charts are similar to stacked bar charts. While stacked bar charts are a good option for visualizing a single year's data, as in the example above, stacked area charts are useful for seeing the trends in these components over several years. As an example, let us take a look at how the sector-wise energy consumption has evolved in the state of California from 1970 to 2020.

In [None]:
# EXAMPLE - Read in the data
ca_70_20 = pd.read_csv('data/ca_consumption_1970_2020.csv')
ca_70_20.head()

In [None]:
#Grab columns of data
x = ca_70_20['Year']
y1 = ca_70_20['Residential']
y2 = ca_70_20['Commercial']
y3 = ca_70_20['Industrial']
y4 = ca_70_20['Transportation']

In [None]:
# We will make this plot slightly bigger than default to notice the yearly variations
plt.figure(figsize=(12,6))

#Plot the data using the stackplot function
plt.stackplot(x , y1, y2, y3, y4, labels=['Residential', 'Commercial', 'Industrial', 'Transportation'])

#Axis labels and legend
plt.ylabel('Energy consumption in Billion Btu')
plt.xlabel('Year')
plt.legend(loc='upper left'); #Specify the location so that the trend is shown clearly

In this chart, we can clearly see that the energy consumption in the transportation sector has steadily increased over the years. If you look closely, you will also notice an increase in the commercial sector.

**Exercise:** Try to create a similar graph for the renewable generation capacity projection data. Every year, the EIA releases its Annual Energy Outlook (AEO), which assesses long-term energy trends based on certain assumptions and methodologies. This is meant to give us an idea of what may happen in the future given these assumptions. One of the datasets released as part of this study is the growth of renewable electricity generation. Think of this as the growth in the electricity generation (in units of Billion kWh) contributed by renewables.

The data is available in the file named `ren_generation_2020_2050.csv` located in the data folder.

In [None]:
#Read in the data using pandas
ren_20_50 = pd....
ren_20_50.head()

In [None]:
#Grab columns of data
x = ren_20_50['Year']
y1 = ren_20_50['Hydropower']
y2 = ren_20_50['...']
y3 = ren_20_50['...']
y4 = ren_20_50['...']
y5 = ...
y6 = ...
y7 = ...
y8 = ...

In [None]:
#We will again make the figure size slightly bigger to better see trends
plt.figure(figsize=(15,8))

#Use the stackplot function to plot data
plt.stackplot(......,
labels = ['Hydropower', ......])

#Don't forget axis labels and the legend.
#You may need to specify the location of the legend using the loc keyword
plt.ylabel('...')
plt.xlabel('...')
plt.legend(...);

What are the largest contributors among the renewables?
Looking into the future, which type of renewable energy source is projected to grow the most?
What are some other renewable energy sources showing a growing trend? How about those that appear flat?

*Write your answers here by editing this textbox.*

## 3. What is a mathematical function? <a name='section3'></a>

You have seen the concept of functions in computer science during a previous class (see notebook `01_functions.ipynb`).  
Here, we will look at the concept of functions from a mathematical point of view.
We will learn what different functions may mean, how they could be useful in real life and in research, and how to make cool visualizations that help us understand them.  

A **function is a relationship that, given one or many input values, performs an operation using them and produces one or many output values** (technically, in computer science, the output can also be nothing).  

If a function that we call $f$ takes one input and gives one output, we write:
$$
y = f(x)
$$
and we can visualize it as a curve in the Cartesian plane $(x,y)$.  
We call $x$ the *independent variable* (because we can choose it arbitrarily, independently of anything else) and $y$ the *dependent variable* (because its value depends on $x$).

### EXAMPLE: Let's plot some functions

In [None]:
# First, we need to define the x axis
x = np.linspace(-10,10,200) # Try changing 200 to 20 and see what happens!

# Let's prepare our canvas - this is optional, but makes the figure prettier
plt.figure(figsize=(4,3), dpi=150)

# Let's say we want to plot the identity function y = x
y = x
plt.plot(x, y, label='identity', color='dodgerblue', lw=2)

# We can also use the functions provided by the libraries like numpy
plt.plot(x, np.sin(x), label='sinusoidal', color='crimson', lw=4, linestyle='dashdot')

# And we can control the color, width and style of the lines
plt.plot(x, np.log(x*x+1), label='logarithmic', color='lime', linewidth=3, linestyle='dashed')

plt.legend()
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.show()

We can also define more complicated functions!  
For example: $y = \sin(2 \sin(2  \sin(2 \sin(x))))$ or try your own!

In [None]:
# EXERCISE

def my_function(x):
    y = ... # define your function here
    return y

x = np.linspace(-10,10,200) # define the x axis here

plt.xlabel('x')
plt.ylabel('y')
plt.plot(x, my_function(x)); # plot the function in your favorite style here

We can even combine functions with for loops to plot many lines at once!

In [None]:
# EXAMPLE

def line(x, m, b):
    return m * x + b

x = np.linspace(-10,10,200)

for m in np.linspace(-10,10,8):
    for b in np.linspace(-10,10,8):
        plt.plot(x, line(x,m,b))

# Set the y limit if you want the plot to have the same scale along x and y
#plt.ylim(-10,10) # Try commenting in and out this line and see what happens

plt.xlabel('x')
plt.ylabel('y = m x + b');

### EXERCISE: Write your own `python` function and take a look at it

Make it as complex as you can.  
You can use and combine for example `np.sin`, `np.cos`, `np.log`, `np.exp`, `np.abs`, etc.\
[Here](https://numpy.org/doc/stable/reference/routines.math.html#) is a list of mathematical functions in numpy.

Choose your favorite color, type of line and width.  
[Here](https://matplotlib.org/stable/gallery/color/named_colors.html#css-colors) is a list of color names you can use.  
[Here](https://matplotlib.org/stable/gallery/lines_bars_and_markers/linestyles.html) is a list of line styles you can use.

**Note that if you randomly define a function, it may not be well-defined for all possible $x$-values!!!**  
Some examples:

* $\dfrac{1}{x}$ is undefined for $x=0$
  
* $\sqrt{x}$ is undefined for $x<0$
  
* $\log{x}$ is undefined for $x \leq 0$
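When a function is evaluated outside its domain, numpy does not stop with an error. It emits a warning and returns special values such as `inf` or `nan`. The sketch below shows this for the three examples above. The warnings are silenced with `np.errstate` only to keep the output short.

```python
import numpy as np

x = np.array([-1.0, 0.0, 1.0])

# Silence the runtime warnings so we can look at the results themselves
with np.errstate(divide='ignore', invalid='ignore'):
    inverse = 1 / x        # 1/0 gives inf
    root = np.sqrt(x)      # sqrt(-1) gives nan ("not a number")
    logarithm = np.log(x)  # log(-1) gives nan, log(0) gives -inf

print(inverse)
print(root)
print(logarithm)

# np.isfinite marks the entries that are ordinary numbers
print(np.isfinite(logarithm))  # [False False  True]
```

If your plot has unexpected gaps, checking your `y` values with `np.isfinite` is a quick way to find where the function is not defined.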

In [None]:
# Define your function here
def my_weird_function(x):
    # do some magic
    y = ...
    return ...

# Plot your function here
plt.plot(..., ..., color='...', lw=...);

## 4. Linear functions and simple models <a name='section4'></a>

What are linear functions? When graphed, they look like straight lines.  
In fancy mathematical terms, a linear function is a polynomial of degree one or less.  

**They represent a quantity $y$ that increases or decreases at a constant rate when changing another quantity $x$.**

A general mathematical formula is:
$$
y = m x + b
$$

Technically, this is not the most general formula because it does not include vertical lines, which instead are described by:
$$
x = a
$$

Linear functions are useful for example to model the money you earn as the BLDAP weeks go on, or the distance you travel in time when moving at a constant speed.
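The "constant rate" property can be checked numerically: for a straight line, the change in $y$ divided by the change in $x$ is the same between any two points and equals $m$. The slope and intercept below are hypothetical values chosen for illustration.

```python
import numpy as np

def line(x, m, b):
    return m * x + b

# Hypothetical slope and intercept
m, b = 2.5, 10.0

# The x values do not need to be evenly spaced
x = np.array([0.0, 1.0, 2.0, 5.0, 9.0])
y = line(x, m, b)

# Change in y divided by change in x between neighbouring points
rates = np.diff(y) / np.diff(x)
print(rates)  # every entry equals m
```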

### EXERCISE: BLDAP savings

Let's build a simple model for your savings during your experience at BLDAP.   
You started with a certain amount of savings. Each week, you earn and spend a certain amount of dollars.  
Can you write a simple equation that lets you predict how much money you will have at the end of BLDAP?


In [None]:
def my_savings(w): # w is the week number
    my_stipend = 500  # dollars per week
    my_expenses = 100 # dollars per week
    initial_savings = 0 # dollars you had on day 1 of BLDAP
    return ... # equation here

w = np.linspace(0,6,7)
plt.figure(figsize=(4,3), dpi=150)
plt.plot(w, my_savings(w), marker='o', markersize=5, color='blue')
plt.xlabel('week')
plt.ylabel('dollars')
plt.show()

### EXERCISE: Linear model with survey data

We can use some real life data from the survey taken by high school summer interns at the Lab. For this exercise, we will use the data from 2024.  
The data we have is:
* Hours Slept
* Number of Siblings
* Height (cm)
* Home-LBNL Distance (miles)
* Commute Time (minutes)
* Pineapple Pizza Rating

Do you think we can find a linear relationship between any of these quantities?  
For example, do you think that the number of siblings could be a linear function of height?  
Any other ideas?



In [None]:
### READING IN DATA - DO NOT MODIFY

# Survey as is saved from the system
survey2024_data = pd.read_csv('data/BLDAP2024DataSurvey.csv') # We are using the 2024 survey results for this exercise

# Let's take a look at a summary of our dataframe
survey2024_data.describe()

In [None]:
distance = survey2024_data['Home-LBNL Distance (miles)']
time = survey2024_data['Commute Time (minutes)']
plt.scatter(time, distance, marker='*', s=100, color='red')
plt.xlabel('time [minutes]')
plt.ylabel('distance [miles]');

<font color='red'>**Something is very weird! ⚠**</font>  
**Can you spot the issue?**

In [None]:
# If we do this it makes more sense...
survey2024_data.replace({'Home-LBNL Distance (miles)':230},{'Home-LBNL Distance (miles)':23.0}, inplace=True)

# Grab again distance and time
distance = survey2024_data['Home-LBNL Distance (miles)']
time = survey2024_data['Commute Time (minutes)']

# Plot data after fix
plt.scatter(time, distance, marker='*', s=100, color='green')
plt.xlabel('time [minutes]')
plt.ylabel('distance [miles]');

Can we model the distance from the Lab as a linear function of the commute time?  
Let's **fit** our data with a straight line using `numpy`:

$$
distance = m \times time + b
$$

In [None]:
# EXERCISE

# Let's compute the line that best approximates our data using numpy
# m and b are the 2 coefficients that define such a line!
m, b = np.polyfit(time, distance, 1)
print(f"m = {m} miles/minute, b = {b} miles")

# Plot the raw data from the survey
plt.scatter(..., ...)

# Plot the regression (or fitting) line
plt.plot(..., ... * time + ... , color='red');
plt.xlabel('time [minutes]')
plt.ylabel('distance [miles]');

Not too bad, but not great! There are several **outliers**.  
We have obtained:  
$ distance \approx 0.16 \ miles/minute \times  time+ 9.20 \ miles$

**Does it make sense?**

We might expect that if someone lives at the Lab ($distance = 0$), then their commute takes $time = 0$.  
However, if we set $distance = 0$, we obtain a negative $time$!  
Our model makes sense only if $time > 0$, which implies $distance > 9.20 \ miles$.  
Yet the minimum $distance$ in our dataset is $2.5 \ miles$!
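The consistency check above can be worked through with the rounded coefficients quoted in the text. This is a sketch in plain Python. Your own fitted values may differ slightly from these rounded numbers.

```python
# Rounded coefficients quoted above
m = 0.16   # miles per minute
b = 9.20   # miles

# Solve 0 = m * time + b for time
time_at_zero_distance = -b / m
print(round(time_at_zero_distance, 1))  # -57.5 (minutes)

# Distance predicted by the line for a zero-minute commute
distance_at_zero_time = m * 0 + b
print(distance_at_zero_time)  # 9.2 (miles)
```

A commute of about minus 57 minutes is impossible, which is another way of seeing that the fitted line should not be trusted near $distance = 0$.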

We can also estimate the average speed of your commute using the coefficient $m$.  
$m$ represents the speed in miles per minute.  
If we multiply by $60$ we can get the speed in miles per hour.   

Let's look at the value and compare it to the average speed from the raw data.

In [None]:
print(f'Average commuting speed from linear regression [miles/hour] = {60 * m}')
# Note: 1 mile/minute = 1 mile / (1/60 hour) = 60 miles/hour

print(f'Average commuting speed from raw data [miles/hour] = {np.average(distance/time)*60}')

**Weird!**  
**We can conclude that there are several inconsistencies in this model!**  
**Maybe we can figure out why.**

How do you guys commute?
1. shuttle
2. car + shuttle
3. walk + shuttle
4. bus + shuttle
5. bike + shuttle
6. others

There are 2 main outliers: commute times of 105 and 120 minutes.   
What if we remove them?

In [None]:
#EXERCISE

# Fix the data: remove the 2 outliers
mask = (survey2024_data['Commute Time (minutes)']==105) | (survey2024_data['Commute Time (minutes)']==120)
survey2024_data = survey2024_data[~mask]

# Grab distance and time after fix
distance = survey2024_data['Home-LBNL Distance (miles)']
time = survey2024_data['Commute Time (minutes)']

# Fit the data with a straight line after fix
m, b = np.polyfit(time, distance, 1)
print(f"m = {m} miles/minute, b = {b} miles")

# Plot data + regression line
plt.scatter(..., ...,color='red', label='survey data')
plt.plot(..., ..., color='blue', label='regression line');
plt.xlabel('time [minutes]')
plt.ylabel('distance [miles]')
plt.legend()

print(f'Average commuting speed from linear regression [miles/hour] = {60 * m}')
# Note: 1 mile/minute = 1 mile / (1/60 hour) = 60 miles/hour

print(f'Average commuting speed from raw data [miles/hour] = {np.average(distance/time)*60}')

If everything went well, these last results should look much better!
* $b$ is now close to zero!
* The average speed from the linear model is consistent with the data!
* The average speed is $25 \ miles/hour$, which makes sense since it is the speed limit in many nearby areas!

**Conclusion:
When building a model, one should carefully consider its consistency and meaningfulness!**


## 5. BONUS: Gaussian functions and probability distributions <a name='section5'></a>

Gaussian functions look like bells 🔔🔔🔔  
They are routinely used in science as **probability distributions**.  
This is useful when a quantity is random 🎲  

Take for example the height of the world population.  
There will be an **average** height, [which we know from data is around 160 - 170 cm](https://en.wikipedia.org/wiki/Human_height#Average_around_the_world).  
However, not everyone is 170 cm tall: there are shorter and taller people than average!  
In other words, we can define the deviation from the average, which tells us about the spread of heights. This is called the **standard deviation**.

The probability distribution of height is approximately a Gaussian function (even though it's not fully consistent, [there can't be people of negative height](https://doi.org/10.1111/j.1740-9713.2013.00642.x)!)

The formula looks like this:
$$
f(x) = \frac{1}{\sqrt{2\pi} \sigma} \exp \left[ {-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \right]
$$

Here are some Greek letters: $\mu$ is pronounced "mu" and represents the average; $\sigma$ is pronounced "sigma" and represents the standard deviation, meaning the variability or fluctuations around the average.
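
To get a feel for what $\sigma$ does, here is a small sketch that uses Python's built-in `statistics.NormalDist` as an independent reference (this is an assumption of convenience; the notebook itself does not use it). It prints the height of the curve at $x = \mu$ and the probability of landing within one $\sigma$ of the average.

```python
from statistics import NormalDist

# Standard-library Gaussian, used here only as an independent reference
for sigma in (1, 2, 3):
    dist = NormalDist(mu=0, sigma=sigma)
    peak = dist.pdf(0)                                # curve height at x = mu
    within_one_sigma = dist.cdf(sigma) - dist.cdf(-sigma)
    print(f"sigma = {sigma}: peak = {peak:.4f}, "
          f"P(|x - mu| < sigma) = {within_one_sigma:.4f}")
```

The peak gets lower as $\sigma$ grows, because the total area under the curve must stay equal to 1. The probability of being within one $\sigma$ of the average stays at roughly 68% in every case. You can also use `dist.pdf` to check the values returned by your own function in the exercise.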

### EXERCISE: plot a Gaussian function

In [None]:
## Define here the formula
def gauss(x, mu, sigma):
    return 1/(np.sqrt(...)*...)*np.exp(...)

# Define the x axis
x = np.linspace(-10,10,200)

# Plot different Gauss functions with different sigmas
plt.figure()
plt.plot(x, gauss(x,0,1))
plt.plot(x, gauss(x,0,2))
plt.plot(x, gauss(x,0,3))

plt.xlabel('x')
plt.ylabel('y');

# Plot different Gauss functions with different mus
plt.figure()
plt.plot(x, gauss(x,1,1))
plt.plot(x, gauss(x,2,1))
plt.plot(x, gauss(x,3,1))

plt.xlabel('x')
plt.ylabel('y');

If you did the above exercise correctly, you should get the following plots:


<img src="https://drive.google.com/uc?export=view&id=1tKv5Imm2IoG-6ZVHo1JkgAgtmGdK0j7f" width="50%" height="50%">

### EXAMPLE: heights and Gaussians
From the survey we have your height data.  
Let's see what it looks like.

In [None]:
heights = survey_data['Height']
plt.plot(heights, 'o', color='violet')
plt.xlabel('Array Index [-]')
plt.ylabel('Height [cm]');

Let's plot a histogram of the heights and compare it to a Gaussian function.  
* For the histogram, use `plt.hist` with the option `density=True` (info [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html)).
* For the Gaussian function, use `gauss(x, mu, sigma)` as defined above.
* Calculate the average and standard deviation from the data. You can use `np.average` and `np.std`.

**Note:**  
We need to use the option `density=True` (the code below writes it as `density=1`, which has the same effect) to **normalize** the histogram and get a probability distribution.  
What does this mean?  

* Raw histogram: each bin displays the number of samples that fall in it.
* Probability distribution: each bin displays its raw count divided by the product of the total number of samples and the bin width, so the total area of the bars equals 1.

In other words, a probability distribution is obtained from rescaling (_normalizing_) a raw histogram.
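
Here is a minimal sketch of that rescaling done by hand, on made-up heights (the array `sample` is an assumption for illustration, not the survey data). It uses `np.histogram`, which accepts the same `density` option as `plt.hist`.

```python
import numpy as np

# Made-up heights in cm (not the survey data)
sample = np.array([150, 155, 160, 160, 165, 165, 165,
                   170, 170, 175, 180, 185], dtype=float)

# Raw histogram: number of samples per bin
counts, edges = np.histogram(sample, bins=5)
widths = np.diff(edges)

# Normalize by hand: count / (total number of samples * bin width)
density_by_hand = counts / (counts.sum() * widths)

# Same thing computed by NumPy
density_numpy, _ = np.histogram(sample, bins=5, density=True)

# The bar areas (height * width) now add up to 1
total_area = np.sum(density_by_hand * widths)
print(counts, density_by_hand, total_area)
```

The hand-made normalization should match what NumPy returns with `density=True`, and the total area should come out as 1 (up to rounding).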

In [None]:
# Make a histogram of your heights
plt.hist(heights, density=1, color='blue', alpha=0.2)

# Calculate the average and standard deviation of your heights
mean  = np.average(heights)
std = np.std(heights)

# Prepare the x axis = height values
x = np.linspace(145,190,200)

# Plot the Gaussian
plt.plot(x,gauss(x,mean,std), label='Gaussian', color='blue', lw=3)
plt.xlabel('Heights [cm]')
plt.ylabel('Probability distribution')
plt.legend();

It's not great. Why?  
Maybe because we have a limited data set from our survey. We might have some outliers too.

We can use a larger dataset from [this study](http://socr.ucla.edu/docs/resources/SOCR_Data/SOCR_Data_Dinov_020108_HeightsWeights.html).




In [None]:
# Read the data from the large dataset
data = pd.read_csv('./data/SOCR-HeightWeight.csv')

# Convert inches to cm
heights_new = data['Height(Inches)']*2.54

# Make the new histogram
plt.hist(heights_new, density=1, color='red', alpha=0.2, label='big dataset')

# Plot the new Gaussian function
mean_new  = np.average(heights_new)
std_new = np.std(heights_new)
plt.plot(x, gauss(x,mean_new, std_new), label='Gauss (big dataset)', color='red', lw=3)

# Re-plot survey data
plt.hist(heights, density=1, color='blue', alpha=0.2, label='survey data')
plt.plot(x, gauss(x,mean,std), label='Gauss (survey data)', color='blue', lw=3)

plt.xlabel('Heights [cm]')
plt.ylabel('Probability distribution')
plt.legend();

Interesting! The average of our data and the one from the large dataset are similar.   
The fluctuations around the average are larger in our data, probably again because of the limited number of samples.
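
To see how much a small sample can mislead us, here is a rough simulation sketch. The "true" average (170 cm) and standard deviation (7 cm) are made-up assumptions, and `spread_of_estimates` is a hypothetical helper written just for this illustration. It draws many random datasets of a given size and measures how much the estimated average and standard deviation change from one dataset to the next.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical "true" population: mean 170 cm, standard deviation 7 cm
true_mean, true_std = 170.0, 7.0

def spread_of_estimates(n_samples, n_repeats=500):
    """Draw n_repeats datasets of size n_samples and return how much the
    estimated mean and standard deviation vary from dataset to dataset."""
    draws = rng.normal(true_mean, true_std, size=(n_repeats, n_samples))
    means = draws.mean(axis=1)
    stds = draws.std(axis=1)
    return means.std(), stds.std()

small = spread_of_estimates(20)
large = spread_of_estimates(2000)
print(f"n = 20:   mean varies by {small[0]:.2f} cm, std by {small[1]:.2f} cm")
print(f"n = 2000: mean varies by {large[0]:.2f} cm, std by {large[1]:.2f} cm")
```

With only 20 samples, both estimates jump around much more than with 2000. This does not mean a small dataset always has a larger spread, only that its average and standard deviation are less reliable.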

## 6. Exploring Python data visualization further...<a name='section6'></a>

We've only looked at matplotlib (and haven't even explored all of its features!). Another popular data visualization library is [seaborn](https://seaborn.pydata.org/). There are also [Plotly](https://plot.ly/python/) and [Bokeh](http://bokeh.pydata.org/en/latest/), which can create interactive visualizations.

---
Notebook developed by: Alisa Bettale, Laurel Hales, Samanvitha Murthy, Arianna Formenti