<a href="https://colab.research.google.com/github/Jonathan-Nyquist/PLAM/blob/main/Class06.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a id='Top'></a>
# Class06:
## Learning Objectives
* [Modules](#Modules)
* [Reading from and writing to a file](#Files)
* [Martian Challenge: Crater Analysis](#Challenge)

<a id='Modules'></a>
## Modules
[Top of notebook](#Top)

Notice this additional Markdown feature. From each link in the table of contents I've added a return link to the top of the notebook. Just examine this cell to see how it was done.

We already touched on modules in a previous notebook. Each module bundles functionality around a common theme. Because you may not need those functions in every application, Python lets you load them as needed. You have to import a module before you can use the functions and variables it contains. The example we used previously was the math module.

Here is a reminder of how that works.

In [None]:
import math
x = 2
y = math.sqrt(x)
print(f'The square root of {x} is {y}')

### Name Spaces: A key programming concept
When we imported math above all of the functions in the math module were imported, but they could only be accessed by prefixing each with math.xxxx. There is a reason for this. What if one of the functions you imported had the same name as a function you already defined in your code? Your function would be replaced by the one you imported.  The authors of the math module have no idea what function names you might be using, but because all the math module functions have the module name as a prefix, you can be pretty sure there will not be any name collisions. There will instead be two sets of function names in your program's namespace: the functions you have defined yourself, and the ones that start with "math."

But if you are sure the name of the function you want to import won't duplicate one of the functions you have already defined, there is another way to import just the functions you need instead of loading the entire module.

In [None]:
from math import sqrt
y = sqrt(x)
print('The square root of {0} is {1}'.format(x, y))

In this example only the sqrt function was imported from the math module and no prefix is needed.  The onus is on you to watch out for namespace collisions. You can even import all of the functions in math using the from command, but that really increases your risk of a namespace collision. Suppose, for example, somewhere in your code you created a list of the seven deadly sins.

In [None]:
sin = ['pride', 'envy', 'gluttony', 'anger', 'lust', 'greed', 'sloth']
print(sin)

Later on, you import all of the contents of the math library using "from".

In [None]:
# Imports everything
from math import *
print(sin)

What's this? Sin is now a function? What are the theological implications? Maybe that we can't function without sin ;-).

The math module includes the trigonometric functions sine and cosine named sin and cos, respectively. Your sin was purged by the flame of pure mathematics. A Python parable that illustrates why you should generally prefer "import" to "from" for loading modules.
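For contrast, here is a minimal sketch of the safer route: with a plain "import math", the list and the function can live side by side. The variable name sine_of_zero is just something I made up for this illustration.

```python
import math

# Our own variable named "sin" (a list, as in the example above)
sin = ['pride', 'envy', 'gluttony', 'anger', 'lust', 'greed', 'sloth']

# The math module's sine function lives behind the "math." prefix,
# so importing the module does not overwrite our list
sine_of_zero = math.sin(0)

print(sin[0])         # our list is intact
print(sine_of_zero)   # the math function is still available
```

Both names survive because one is called sin and the other is called math.sin.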

It does get tiresome to type the full module name before every function, especially when, unlike the math module, the module name is a long one. But rather than yield to the devil on your shoulder softly whispering "Just use 'from.' You know you want to." there is, as Buddha preached, a middle path. You can abbreviate the module name. For example, numpy is a popular module for working with arrays of numbers. Numpy is commonly abbreviated as np, but that is just convention. When using "import .... as" you can use any alias you wish.

In [None]:
# Regular import
import numpy
x = numpy.array([1,2,3,4,5])
x2 = x + 2
x3 = x ** 2
print(x)
print(x2)
print(x3)

In [None]:
# Import using an alias so we can type less
import numpy as np
x = np.array([1,2,3,4,5])
x2 = x + 2
x3 = x ** 2
print(x)
print(x2)
print(x3)

Since numpy was imported as np, instead of typing numpy.array() we just typed np.array(), which is quicker, but we avoided the Devil's temptation to type

from numpy import \*

which means we can define our own function or variable named "array" without worrying about namespace conflicts because array and np.array are clearly distinct.
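Here is a small sketch of that idea. To keep it runnable anywhere, I have swapped in the standard-library statistics module (aliased as st) as a stand-in for numpy; our own mean() function and the data list are made up for illustration.

```python
import statistics as st

def mean(values):
    """Our own, deliberately different 'mean': the midpoint of min and max."""
    return (min(values) + max(values)) / 2

data = [1, 2, 3, 4, 10]

# The alias keeps the two functions clearly distinct
print(mean(data))      # calls our function
print(st.mean(data))   # calls the module's function
```

The two calls give different answers, which shows that neither function has replaced the other.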

### Student Exercise
Import the math module as m and use m.sqrt() to print the square root of 144.

### Some examples of modules and their applications
There are hundreds of thousands of different modules (450,000 as of May 2023) you can import to add functionality to your program. Before you write a function from scratch, it pays to check whether an existing module does what you need.

In [None]:
# The sys module provides system-level information
import sys
print('The version of Python running is:')
print(sys.version_info)

In [None]:
# Get the local time and date
import time
localtime = time.localtime(time.time())
print("Local current time :", localtime)

# Get the timezone of the computer running your application
print("The current time zone is: ", time.tzname)

In [None]:
# Easy to build calendars into applications
import calendar
cal = calendar.month(2023, 9)
print("Here is the calendar:")
print(cal)

In [None]:
# Pandas is a popular module for working with information in the
# form of data tables similar to spreadsheets
import pandas as pd
df = pd.DataFrame({
   'col1': ['Item0', 'Item0', 'Item1', 'Item1'],
   'col2': ['Gold', 'Bronze', 'Gold', 'Silver'],
   'col3': [1, 2, 3, 4]
})
print()
print('A pandas dataframe')
print(df)

In [None]:
# Print one column of the dataframe
print(df['col2'])

In [None]:
# Adding a YouTube video
# Import the display submodule explicitly so it is guaranteed to be loaded
import IPython.display
# Video credit: https://www.youtube.com/watch?v=Ue4PCI0NamI
IPython.display.YouTubeVideo('Ue4PCI0NamI')

The examples above are just a few of the modules available for Python. Some are well-established, others are experimental projects still under development. We will see many more examples in this class, and I will call your attention to them when they are used. In all cases, the web is your best source of documentation.

**The take-home message is that before you write your own Python library, check what is available!**
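One quick way to check what a module you have already imported offers is Python's built-in dir() function. The sketch below lists the public names in the math module; the variable name public_names is my own choice.

```python
import math

# dir() returns the names defined in a module. Names that start with an
# underscore are internal, so we filter them out.
public_names = [name for name in dir(math) if not name.startswith('_')]

print(f'The math module defines {len(public_names)} public names.')
print(public_names[:10])

# Each function also carries a short description in its docstring
print(math.sqrt.__doc__)
```

The same approach should work for any module, although for large ones the online documentation is usually easier to read.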

<a id='Files'></a>
## Reading from and writing to a file
[Top of notebook](#Top)

Until now, we have been working mainly with data and examples contained within the Jupyter Notebook. The one exception was when we loaded in the newspaper article to count the number of times NASA was referenced.  At the time I told you we'd talk about files later. Later is now.

Files come in two flavors: those that contain human-readable text (ASCII files, UTF files) and those that are written in a format only computers can easily read (Binary). Why not always write files in human-readable form? Because binary files are more compact, so they take less space to store, and they are faster for computers to read and write.

Typically, large data files are going to be stored as binary numbers. Because we can't open and read binary files with an ordinary text editor, we have to know the format of the binary file (What variables? What order? How many?) before we can write Python programs to read or write the data. If the files were created with an open-source program the format is probably published, but if they were written using proprietary commercial software the format may or may not be publicly available depending on the vendor or the cleverness of hackers.

An example of a simple ASCII file is a .txt file, which can be created with a program such as Notepad (PC) or TextEdit (Mac). An example of a binary file would be a Microsoft Word document, with formatting codes, images, and other elements not easily expressed as plain text characters.
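To make the difference concrete, here is a rough sketch that writes the same three numbers once as text and once as binary, then compares the file sizes. The file names and the numbers are made up, and the binary layout (three 8-byte floats, written with the standard-library struct module) is simply a format I chose for this example.

```python
import os
import struct
import tempfile

values = [3.14159265358979, 2.71828182845905, 1.41421356237310]

# Work in a temporary folder so we don't clutter the notebook's directory
with tempfile.TemporaryDirectory() as folder:
    text_path = os.path.join(folder, 'values.txt')
    binary_path = os.path.join(folder, 'values.bin')

    # Text: each number is written out as human-readable characters
    with open(text_path, 'w') as f:
        for v in values:
            f.write(f'{v}\n')

    # Binary: each number is packed into 8 bytes ('d' = double-precision float)
    with open(binary_path, 'wb') as f:
        for v in values:
            f.write(struct.pack('d', v))

    text_size = os.path.getsize(text_path)
    binary_size = os.path.getsize(binary_path)

    # To read the binary file back we must already know it holds three doubles
    with open(binary_path, 'rb') as f:
        recovered = list(struct.unpack('3d', f.read()))

print(f'Text file:   {text_size} bytes')
print(f'Binary file: {binary_size} bytes')
print(recovered)
```

Notice that reading the binary file only works because we told struct exactly what to expect ('3d'). That is the "you have to know the format" problem in miniature.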

### Read a pure text file
Let's begin with a file that contains nothing but text. No numbers. No special characters. No binary data.

The file we will work with first is named "green_eggs.txt".  Let's see how to read it into a Python string variable. First, however, we must download the file to the machine where your instance of Colab is running.

In [None]:
!wget https://raw.githubusercontent.com/Jonathan-Nyquist/PLAM/main/green_eggs.txt

In [None]:
textfile = open('green_eggs.txt', 'r')
sam = textfile.read()
textfile.close()
print(sam)

The first line opened the file for reading and returned a "file object" that we named "textfile."  The file object has a read() method that returns the entire content of the file as a string, which we assigned to the variable "sam." Then we closed the file object and printed the string.  Not closing the file object is like leaving the fridge door open.  Maybe nothing bad will happen, but closing it is a good habit.

Notice that the open function is built into Python; we didn't need to import a module. The function was called with two parameters: the file name and 'r', which told Python to open the file for reading, not writing.

If we wanted to read the file one line at a time, we could do so with a loop.

In [None]:
textfile = open('green_eggs.txt', 'r')
for line in textfile:
    print(line)
textfile.close()

Note that we got extra blank lines because print() automatically adds a newline character at the end of its output, but each line in the file already ends with one. We can strip the newline from each line in the file and rely on the one that print() adds.

In [None]:
textfile = open('green_eggs.txt', 'r')
for line in textfile:
    print(line.strip())
textfile.close()
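An alternative to strip() is to leave the lines alone and tell print() not to add its own newline. The sketch below uses io.StringIO, which behaves like an open text file, so the example runs without downloading anything; the two lines of text are made up.

```python
import io

# io.StringIO stands in for an open text file in this illustration
fake_file = io.StringIO('first line\nsecond line\n')

raw_lines = []
for line in fake_file:
    raw_lines.append(line)
    print(repr(line))     # repr() makes the trailing '\n' visible
    print(line, end='')   # end='' stops print() from adding a second newline
```

One difference worth knowing: strip() also removes spaces at the start and end of a line, whereas end='' prints the line exactly as it appears in the file.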

We can close the file automatically if we wrap our commands in a "with" block (called a "context").

Notice that the Python code has one indent to show the lines inside the with block and a second indent to show the lines inside the for loop, which is inside the "with" block. When we exit the with block the file will be closed for us.

In [None]:
with open('green_eggs.txt', 'r') as textfile:
    for line in textfile:
        print(line.strip())
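If you would like to confirm that the "with" block really closes the file, every file object has a closed attribute you can inspect. This is a small sketch using a throwaway file; the file name demo.txt and its contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'demo.txt')

    # Create a small file to experiment with
    with open(path, 'w') as textfile:
        textfile.write('hello from the demo file\n')

    with open(path, 'r') as textfile:
        first_line = textfile.readline()
        closed_inside = textfile.closed   # checked while still inside the block

    closed_after = textfile.closed        # checked after leaving the block

print(f'Closed inside the with block? {closed_inside}')
print(f'Closed after the with block?  {closed_after}')
```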

**Student Challenge:** Download go.txt with the wget command below, then
open, read, and print the contents of the file "go.txt" in the cell below.

In [None]:
!wget https://raw.githubusercontent.com/Jonathan-Nyquist/PLAM/main/go.txt

### Writing to a text file
Writing works similarly, but we open the file for writing.

In [None]:
# Create a multiline string with triple quotes
poem = '''
Look at me!
Look at me!
Look at me NOW!
It is fun to have fun
But you have to know how.
'''

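# Write the string to a file.
# write() accepts the whole string at once, newlines included, so no loop is needed.
# (Looping over a string visits it one character at a time, not one line at a time.)
with open('look_at_me.txt', 'w') as textfile:
    textfile.write(poem)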

Tip: Strings in triple single quotes can continue over multiple lines.

There is now a file in the same folder as the notebook that contains this poem.
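Be aware that opening an existing file with 'w' erases whatever it contained. To add to the end of a file instead, open it with 'a' (append). Here is a minimal sketch; the file name notes.txt is hypothetical, and the file lives in a temporary folder that is deleted afterward.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'notes.txt')   # hypothetical file name

    # 'w' creates the file (or erases it if it already exists)
    with open(path, 'w') as textfile:
        textfile.write('first line\n')

    # 'a' keeps the existing contents and adds to the end
    with open(path, 'a') as textfile:
        textfile.write('second line\n')

    # Read the file back to confirm what was written
    with open(path, 'r') as textfile:
        contents = textfile.read()

print(contents)
```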

### Reading in CSV files
A very common file format is a table with multiple columns of values separated by commas, cleverly called a comma-separated values (CSV) file.

Something like this:
```
Name, Age_When_Elected
Biden, 78
Truman, 60
Madison, 57
Carter, 52
Clinton, 46
```
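To see that there is nothing magic about the format, here is a sketch that parses the sample above with Python's built-in csv module. The table is held in a string rather than a file so the example is self-contained; the variable names (sample, ages, youngest) are my own.

```python
import csv
import io

# The sample table from above, held in a string instead of a file
sample = '''Name, Age_When_Elected
Biden, 78
Truman, 60
Madison, 57
Carter, 52
Clinton, 46
'''

# io.StringIO lets csv read from the string as if it were an open file.
# skipinitialspace=True drops the space that follows each comma.
reader = csv.DictReader(io.StringIO(sample), skipinitialspace=True)

ages = {}
for row in reader:
    # Every value arrives as text, so convert the age to an integer
    ages[row['Name']] = int(row['Age_When_Elected'])

print(ages)
youngest = min(ages, key=ages.get)
print(f'Youngest in this list: {youngest}')
```

The csv module hands you one row at a time and leaves conversions and bookkeeping to you, which is fine for small jobs.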

Now we will read in some made-up quiz scores. The pandas module demonstrated above is ideal for working with tabular data and has a function for reading CSV files that automatically takes care of opening and closing the file.

In [None]:
!wget https://raw.githubusercontent.com/Jonathan-Nyquist/PLAM/main/scores.csv

In [None]:
import pandas as pd
quiz_scores = pd.read_csv('scores.csv')
print(quiz_scores)

Pandas read the data into a "dataframe" named quiz_scores, used the first row of the file for column headers, and numbered the rows sequentially.

In [None]:
# Print a single column by using the column name to index the dataframe.
quiz_scores['Quiz']

**Student Challenge**
Print out the column of the quiz_scores dataframe that has the student ids.

In [None]:
# Calculate some useful statistics on the scores.
quiz_scores['Quiz'].describe()

Whole books have been written about using the pandas module. It is very popular in the data science community. If you work a lot with data tables, buy a pandas book. I own three!

### Indego Bikes
But before we move on to the Martian Challenge, let's get an inkling of the real power of writing Python programs by working with a full-size data set. All sorts of data are available on the website [opendataphilly.org](https://www.opendataphilly.org/), including trip information for Indego bikes, https://www.rideindego.com/about/data/, which is downloadable in, you guessed it, CSV format. I've downloaded the data for the second quarter of 2016, a file with over 10,000 data records.

Temple University has several Indego Bike stations on campus, those blue bikes you can rent as needed.

[![Station-Map-Indego.jpg](https://i.postimg.cc/k5hjz0HQ/Station-Map-Indego.jpg)](https://postimg.cc/LqLBZb25)

In [None]:
!wget https://raw.githubusercontent.com/Jonathan-Nyquist/PLAM/main/Indego_Trips_2016Q2.csv

In [None]:
import pandas as pd

# Load bike trip data for the second quarter of 2016 into a dataframe
indigo = pd.read_csv('Indego_Trips_2016Q2.csv')

# We don't want to print all 10,000+ lines, so display only the
# first 10 records using the dataframe's head() method.
# The parameter you pass is the number of lines to display.
indigo.head(10)

**Student Challenge**
Print the last ten lines of the indigo dataframe using the tail() method.

In [None]:
# Lots of columns; let's find their names using the keys() method.
# If this reminds you of dictionary keys, it should!
print(indigo.keys())

In [None]:
# Get descriptive statistics for the duration of indigo trips.
indigo['duration'].describe()

In [None]:
# Wow! Over 17,000 trips. According to the website, duration
# is in seconds, so divide by 60 to convert to minutes.
stats = indigo['duration'].describe()
print(f'The average Indego trip lasted {stats["mean"]/60:.2f} minutes.')

The walk-up price for using an Indego Bike in 2016 was \$4 for 30 minutes. As we've just seen, the average trip then was just under 30 minutes. Probably not a coincidence, eh?

Interestingly, the minimum rental now is a \$15 day pass.

In [None]:
# A sneak peek at plotting data. Pandas objects have a built-in method called hist()
# that plots a histogram. It returns
# an axis object that has properties such as axis labels you can adjust.

# This tells Jupyter to plot inside the notebook rather than popping open a new window.
# Commands that start with % are called "magic" commands. They are used to adjust
# the Jupyter notebook's behavior, and are not part of the Python language.
%matplotlib inline

# Convert seconds to minutes
duration_in_min = indigo['duration']/60

# Make the plot
ax = duration_in_min.hist(bins=100)
ax.set_xlabel('Trip Duration (min)')
ax.set_ylabel('Number of Trips');

The histogram shows lots and lots of short trips and a few really long ones. What's up with the overnight trips? Did they steal the bike? Let's ignore those outliers and plot trips less than two hours.

In [None]:
# Create a new Series (a single column of data) holding only
# the values that meet a logical condition
short_trips = duration_in_min[duration_in_min < 120]

# Plot the new Series.
# The .hist() method returns an axis object that can be modified to customize the plot
ax = short_trips.hist(bins=50)
ax.set_xlabel('Trip Duration (min)')
ax.set_ylabel('Number of Trips');
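If the expression duration_in_min[duration_in_min < 120] looked mysterious, it may help to see roughly the same two steps spelled out in plain Python. This is only a sketch of the idea, not how pandas works internally, and the durations are made-up numbers rather than values from the Indego file.

```python
# Made-up durations in minutes (not taken from the Indego file)
durations = [5.0, 12.5, 45.0, 130.0, 8.0, 1440.0]

# Step 1: test every value, producing one True/False per element
mask = [d < 120 for d in durations]

# Step 2: keep only the values whose mask entry is True
short = [d for d, keep in zip(durations, mask) if keep]

print(mask)
print(short)
```

Pandas does both steps in a single expression: the comparison inside the brackets builds the True/False mask, and indexing with that mask keeps the matching rows.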

So we see the vast majority of trips are between a few minutes and half an hour.

**Don't worry if you didn't follow all of that code.** The point was to give you an idea of the power of Python. With just a few lines of code we analyzed the data from over 17,000 Indego bike trips between April and June of 2016.  Awesome!

Does ridership fall off in the winter? Do the rides get shorter in cold weather? How does ridership from Temple University compare with other Indego stations? It would not be hard to investigate such questions. This is only the tip of the iceberg -- there is an amazing amount of data available on this and other websites. No wonder careers in data science are so lucrative!

By the way, for a distribution that is not a bell-curve, but has a long tail with mostly short trips and a few long ones, the average is not a very good summary statistic because it is so affected by the few extreme values. The median (middle value) is a more robust measure of the typical trip duration. The result (below) is reasonable, as you can bike from point A to point B in most of Philly in less than fifteen minutes. Also, as mentioned, if you don't have an Indego membership, the walk-up price in 2016 was \$4 for 30 minutes, which is strong incentive to keep the ride short. It makes you wonder if the new pricing model has changed the pattern of ride duration. I'm sure the folks at Indego have their own data scientists working to optimize revenue without diminishing ridership.

In [None]:
med = indigo['duration'].median()/60
print('The median trip duration is {0:.2f} minutes.'.format(med))
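To see how strongly a single extreme value can pull the mean, here is a tiny sketch using the standard-library statistics module. The durations are made up for illustration (nine short rides plus one overnight rental); they are not values from the Indego file.

```python
import statistics

# Made-up trip durations in minutes: mostly short rides plus one
# overnight rental (these are NOT values from the Indego file)
durations = [8, 10, 11, 12, 14, 15, 18, 20, 25, 900]

mean_duration = statistics.mean(durations)
median_duration = statistics.median(durations)

print(f'Mean:   {mean_duration:.1f} minutes')
print(f'Median: {median_duration:.1f} minutes')
```

One 900-minute rental drags the mean above 100 minutes, even though nine of the ten rides lasted 25 minutes or less. The median barely notices it.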

<a id='Challenge'></a>
## Martian Challenge: Crater Analysis
[Top of notebook](#Top)

Mark Watney might wish he had a bike on Mars, but riding is probably tough in one of those pressurized suits. Besides, Mars is known for little blue men, not little blue bikes.

Even driving the rover, Watney had to contend with a problem we don't see in Philly - impact craters. (Well, some of the Philly potholes might come close.)

![Image of Martian Crater](https://d2pn8kiwq2w21t.cloudfront.net/images/PIA25889_MAIN_fullres_b.2e16d0ba.fill-1800x775-c10.jpg)

Driving the rover into and out of the really deep craters could be dangerous and use a lot of battery power. Let's download a file named 'marscraters.csv', which has information on several thousand Martian craters located between a longitude of 0-30 degrees and a latitude of 0-5 degrees (an arbitrary subregion chosen to reduce the size of the data set).

**The Student Task** is to use an approach similar to the one demonstrated for the Indego bikes to plot the histogram of crater depths.  Depths will be in a column named 'DEPTH_RIMFLOOR_TOPOG', which measures from the crater floor to the rim in units of kilometers. The data were obtained from: http://craters.sjrdesign.net/index.php

In [None]:
!wget https://raw.githubusercontent.com/Jonathan-Nyquist/PLAM/main/marscraters.csv