# Data Analysis in Python!
Let's complete the Dry Lab activity in parallel with your spreadsheet work, but using Python in this notebook.

First, before doing anything else, choose "File - Save a Copy" and rename the notebook.  Then share it with Mr. P. just as you would share any other Google Doc or Sheet.

Our goal is not perfect mastery, but to get you started and to learn how to work with and edit code "snippets" so that they work in your particular application.

For some more "formal" terminology, this is a "Jupyter notebook" with blocks of code called cells. You can press shift+ENTER to run a cell and go on to the next one. You can also edit the code and run it again to see how the output changes.

# Importing Python Data Packages
To start with, we are going to import some commonly used, mathematical "packages" that contain resources we'll want to use in our analysis.  Note that in some programming languages "packages" are referred to as "libraries."  More details are in the block of Code below...


In [1]:
# Always start by importing the analysis packages we'll be using
# Notice that any line that starts with a " # " symbol is treated as a "comment" in Python, meaning it's for our reference and is not treated as code.

import pandas as pd  # Helps with organizing and formatting data.  The "pd" is a shorthand reference we can refer to this package as
import numpy as np   # "Num Py" is a package that contains common mathematical and statistical functions
import matplotlib as mpl # "Mat Plot Lib" is a package that helps with plotting and graphing data
import matplotlib.pyplot as plt  #  "Py Plot" is a sub-package of Mat Plot Lib that particularly helps with plotting and graphing data

# Also, don't forget to "run" this block of code to actually import the packages
# To "Run" a code block, make sure the cursor is in it and then press "shift" and "enter" at the same time
# You can also choose an option from the "Runtime" menu.  When running this block, it loads the packages in the background, so no output will be seen

# Entering the Data
In your spreadsheet, you entered your data with column headers.  In Python, we need to create a "Pandas Data Frame" to store our data.

There are many ways to import data, but we're going to start with manually entering it here.  Pandas Data Frames work best with data organized into rows and labeled columns.
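
As a quick illustration of that column-and-row layout, here is a minimal sketch with made-up numbers. The names `example`, `"length"`, and `"mass"` are placeholders invented for this sketch and are not part of the lab data.

```python
import pandas as pd

# Hypothetical example data (NOT the lab data): each dictionary key becomes a
# column label, and each list holds that column's values, one per row.
example = pd.DataFrame(
    { "length": [1.0, 2.0, 3.0],
     "mass": [2.5, 5.0, 7.5]
     })

print(example)          # shows the full table with its column labels
print(example.shape)    # (number of rows, number of columns)
print(example["mass"])  # pulls out a single column by its label
```

Every list needs the same number of entries, since each row takes one value from each column.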

The start of the data entry is seen below, but you should finish it out so that all of the data is present.

In [None]:
data = pd.DataFrame(
    { "d": [1.5, 2.0, 3.0, 5.0, 1.5, 2.0, 3.0, 5.0, __________ ],
     "h": [30.0, 30.0, 30.0, 30.0, 10.0, 10.0, 10.0, 10.0, _______],
     "t": [73.0, 41.2, 18.4, 6.8, 43.5, 23.7, 10.5, 3.9, ______]
     })

# To check to see if the data was entered correctly, you can use the "head" function to print a few lines
data.head(5)
# Use "shift and enter" to run the code and see the output

# Graphing the Data
Let's make our first "test plot" here, focusing on hole diameter vs time for a constant height of 30 cm.

Below is a chunk of code that creates a scatter plot, using the matplotlib pyplot package.  Take a look at each line and see if you can decipher what each one does.

In [None]:
xMin = 0
yMin = 0
h30 = data.loc[data["h"] == 30.0]
print(h30.head(5)) # view the data to show the 30cm height data has been isolated
plt.scatter(x = h30["d"], y = h30["t"])
plt.axis(xmin = xMin, ymin = yMin)
plt.title("Time vs Diameter, Height = 30 cm")
plt.xlabel("Diameter (cm)")
plt.ylabel("Time (s)")
plt.show()

# Manipulating Data in a DataFrame
As you can see, the relationship between Time and Diameter is not linear.

In your spreadsheet, you hopefully took the inverse (and maybe another step?) to achieve a linear result.  Here we'll do the same thing to our data, but in Python.

Nicely, the Pandas DataFrame allows us to do math operations on data quickly and easily.  In many other programming languages, a "For Loop" structure would be needed for this, but Pandas will do the math operation on ALL of the data in a column for us.
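
To see what Pandas is saving us from, the sketch below computes the same inverse twice: once with an explicit loop and once with a single column operation. The values and the names `sample`, `loopResult`, and `"inverseX"` are made up for this illustration.

```python
import pandas as pd

# Hypothetical values, used only to compare the two approaches
sample = pd.DataFrame({ "x": [1.0, 2.0, 4.0, 5.0] })

# Approach 1: a "for loop" that handles one value at a time
loopResult = []
for value in sample["x"]:
    loopResult.append(1 / value)

# Approach 2: one Pandas operation applied to the whole column at once
sample["inverseX"] = 1 / sample["x"]

print(loopResult)
print(sample["inverseX"].tolist())  # same numbers as the loop produced
```

Both approaches give the same numbers; the column version is simply shorter to write and easier to read.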

In [None]:
data["inverseD"] = 1 / data["d"]
data.head(5)

In [None]:
# Plot the data again, focusing on the 30.0 cm constant height.  Copy and Paste are your friends here...
xMin = 0
yMin = 0
h30 = data.loc[data["h"] == 30.0]
print(h30.head(5)) # view the data to show the 30cm height data has been isolated
plt.scatter(x = h30["inverseD"], y = h30["t"])
plt.axis(xmin = xMin, ymin = yMin)
plt.title("Time vs Inverse Diameter, Height = 30 cm")
plt.xlabel("Inverse Diameter (1/cm)")
plt.ylabel("Time (s)")
plt.show()

# Do the next conversion yourself
Note that another manipulation of the diameter values is needed to achieve a linear result.  Do this in the next code block, including a graph showing the result.  As mentioned before, Copy and Paste are your friends...and then manipulate the details.

In [None]:
data["inverseSquaredD"] = (1 / data["d"])**2  # Note:  use the "double star" operator to raise a value to a power.  Squaring "x" would be: x**2
data.head(5)

In [None]:
xMin = 0
yMin = 0
h30 = data.loc[data["h"] == 30.0]
print(h30.head(5)) # view the data to show the 30cm height data has been isolated
plt.scatter(x = h30["inverseSquaredD"], y = h30["t"])
plt.axis(xmin = xMin, ymin = yMin)
plt.title("Time vs Inverse Diameter Squared, Height = 30 cm")
plt.xlabel("Inverse Diameter Squared (1/cm)^2")
plt.ylabel("Time (s)")
plt.show()

# Plot them All!
To truly test and make sure the relationship holds for the diameters at all different heights, we should plot all of the data, separating for the different height values.

Below, the setup for two of the heights is shown.  Finish it out so that all 4 heights are included.
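
If the copy-and-paste approach starts to feel repetitive, a loop is one possible alternative. The sketch below uses made-up data and the Pandas `groupby` function to split a table into one group per height, then plots each group. The names `demo`, `"xValue"`, and `plottedHeights` are invented for this sketch, so treat it as an optional shortcut rather than the required method.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical data (NOT the lab data): two heights, three points each
demo = pd.DataFrame(
    { "h": [30.0, 30.0, 30.0, 10.0, 10.0, 10.0],
     "xValue": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
     "t": [6.0, 12.0, 18.0, 3.0, 6.0, 9.0]
     })

plottedHeights = []  # keeps track of which heights were plotted
# groupby("h") hands back one smaller table per unique height value
for height, group in demo.groupby("h"):
    plt.scatter(x = group["xValue"], y = group["t"], label = "h = " + str(height) + " cm")
    plottedHeights.append(height)

plt.axis(xmin = 0, ymin = 0)
plt.xlabel("Sample X")
plt.ylabel("Time (s)")
plt.legend()
plt.show()
print(plottedHeights)
```

The loop makes one `scatter` call per height, so the same code would work whether the table holds 2 heights or 4.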

In [None]:
xMin = 0
yMin = 0
h30 = data.loc[data["h"] == 30.0]
h10 = data.loc[data["h"] == 10.0]
# include the other 2 height isolations

plt.scatter(x = h30["inverseSquaredD"], y = h30["t"], label = "h = 30 cm")
plt.scatter(x = h10["inverseSquaredD"], y = h10["t"], label = "h = 10 cm")
# plot the other 2 heights too

plt.axis(xmin = xMin, ymin = yMin)
plt.title("Time vs Inverse Diameter Squared, All Heights")
plt.xlabel("Inverse Diameter Squared (1/cm)^2")
plt.ylabel("Time (s)")
plt.legend()
plt.show()  # Notice, putting multiple "scatter" calls will put them on the SAME plot, as long as they occur BEFORE a call to "show"

# Break and Head Back to your Spreadsheet
Take a moment and head back to the [reference document](https://docs.google.com/document/d/1aQ-b2ok1HR8vT8NGuAYUo-NDDfC9xxa9W5ASYKYT1Lo/edit?usp=sharing) you were working out of before coming to this notebook to complete the next phase of the analysis in the spreadsheet environment.

Once completed, then come back here to continue.

# Analyze Time and Height Next!
Essentially, repeat the process completed for Time and Diameter, but for Time and Height.

Some code blocks are set for you, but feel free to add more as needed.
Do your best to add some comments from time to time too to help.
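
As a reminder of how isolating rows works, the sketch below breaks it into two steps using made-up numbers. The names `trial`, `mask`, and `onlyA2` are placeholders for this illustration.

```python
import pandas as pd

# Hypothetical data (NOT the lab data)
trial = pd.DataFrame(
    { "a": [2.0, 2.0, 4.0, 4.0],
     "b": [1.0, 3.0, 1.0, 3.0]
     })

# Step 1: the comparison makes a True/False value for every row
mask = trial["a"] == 2.0
print(mask.tolist())

# Step 2: .loc keeps only the rows where the mask is True
onlyA2 = trial.loc[mask]
print(onlyA2)
```

Writing both steps on one line, as in the earlier height examples, does exactly the same thing.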

In [None]:
# Create a DataFrame called "d1_5" that isolates the data where the diameter is 1.5 cm


In [None]:
# Graph Time vs Height for a constant 1.5 cm diameter


In [None]:
# Create a new column in the data that manipulates the Height values to attempt to achieve a proportional relationship between Time and Height


In [None]:
# Graph Time vs the adjusted Height column, isolating only the 1.5 cm diameter data, checking for a proportional result


In [None]:
# If a proportional result has been reached, then graph Time vs adjusted Height for ALL of the data, but with the data separated into
# sets that each have a constant diameter


# Break and Head Back to your Spreadsheet
Take a moment and head back to the [reference document](https://docs.google.com/document/d/1aQ-b2ok1HR8vT8NGuAYUo-NDDfC9xxa9W5ASYKYT1Lo/edit?usp=sharing) you were working out of before coming to this notebook to complete the next phase of the analysis in the spreadsheet environment.

Once completed, then come back here to continue.

# The Combined Graph!
Finally, we need to create a graph that shows a proportional result.

Our end goal is to create an equation that correctly predicts the "time" value when "diameter" and "height" are entered into it.

You've already done most of the steps needed, but some guidance would be...
- Add a new column to the "data" DataFrame that manipulates the "Diameter" and "Height" values, based on the results you saw before
  - e.g. if T were proportional to height and T were proportional to diameter squared, the resulting column would be:  height * diameter^2
- Graph Time vs the new "Combined" column to assure the result is proportional
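
The column-combining step can be sketched as follows, using made-up numbers and the hypothetical height * diameter^2 form from the bullet above purely as an illustration. The powers that actually fit your data are for you to determine, and the names `demo` and `"combined"` are placeholders.

```python
import pandas as pd

# Hypothetical values (NOT the lab data)
demo = pd.DataFrame(
    { "d": [1.0, 2.0, 3.0],
     "h": [10.0, 10.0, 20.0]
     })

# Multiply one column by the square of another, row by row.
# This exact form is only an example; use the powers your own graphs support.
demo["combined"] = demo["h"] * demo["d"]**2
print(demo)
```

Pandas matches the two columns up row by row, so each "combined" value uses the height and diameter from the same row.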

# Best Fit Line and Equation
As a final result, we should add a "fit" line and have Python determine the equation.  This process is more involved than on a spreadsheet, unfortunately.  To help, there's a sample setup (with different data) showing how to do this below.

As before, use "Copy and Paste", but then manipulate the code so it creates a best fit line and equation for YOUR data.

In [None]:
# Combined Graph of your Data and Best Fit Line and Equation


# Best Fit Sample Code
The block of code below shows how to create a best fit line, showing it on a graph and also printing the resulting equation.
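
Before reading the full sample, it may help to see the fitting step on its own. This minimal sketch uses made-up points that lie exactly on the line y = 2x + 1; the names `xPoints`, `yPoints`, and `predicted` are invented for the illustration.

```python
import numpy as np

# Hypothetical points that lie exactly on y = 2x + 1
xPoints = np.array([0.0, 1.0, 2.0, 3.0])
yPoints = np.array([1.0, 3.0, 5.0, 7.0])

# The final argument, 1, asks for a degree-1 (straight line) fit.
# polyfit returns the coefficients with the highest power first: slope, then intercept.
slope, intercept = np.polyfit(xPoints, yPoints, 1)
print("y=%.3fx+%.3f"%(slope, intercept))

# Once the slope and intercept are known, the equation can predict new values
predicted = slope * 4.0 + intercept
print(predicted)
```

Real measurements will not sit perfectly on a line, so the slope and intercept from your own data will be best estimates rather than exact values.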

In [None]:
# The code below graphs a linear best fit line for a sample data set.  It also prints the resulting linear equation.

sampleData = pd.DataFrame(
    { "sampleX": [1, 2, 3, 4, 5, 6, 7, 8],
     "sampleY": [4, 8, 12, 16, 20, 24, 28, 32]
     })

xMin = 0
yMin = 0
xMax = np.max(sampleData["sampleX"])

plt.scatter(x = sampleData["sampleX"], y = sampleData["sampleY"])
plt.axis(xmin = xMin, ymin = yMin)
plt.title("Sample Plot")
plt.xlabel("Sample X")
plt.ylabel("Sample Y")

# Code to create best fit line is here
slope, intercept = np.polyfit(sampleData["sampleX"], sampleData["sampleY"], 1)
xValues = np.linspace(xMin, xMax, 200) # Creates a set of 200 evenly spaced "x" values from xMin to xMax
plt.plot(xValues, slope*xValues + intercept, color = "r") # Plots the x values and calculates the "y" values based on the best fit line result
plt.show()

print("y=%.3fx+%.3f"%(slope, intercept)) #Code to print the resulting best fit equation.  Should show below the graph