## [EEP153] Week 0



### Learning goals



1.  Open a `jupyter` notebook on `datahub.berkeley.edu`.
2.  Understand simple `python` expressions.
3.  Work with lists & dictionaries.
4.  Work with `pandas.DataFrames`.
5.  Submit indication of completion.



#### Simple Expressions



Python is a general-purpose language that is extended via *modules*.
One important building block of Python is the *expression*.  Here are some examples:



In [1]:
# Arithmetic 
1 + 1

# String (Delineated using single- or double-quotes)
"Hello world!"

# To see output, use the "print" function:
print(1+1)
print("Hello" + "world!")

The above provides examples of:

-   **Comments:** Text that begins with a "#" character.
-   **Function calls:** Something that takes arguments (in parentheses)
    and returns some output.  Here "print" is a
    function.
-   **Objects:** 1 and "Hello" are examples of objects.
-   **Operators:** "+" is an operator.  Notice that it behaves
    differently depending on what it's operating on;
    that's because its operation depends on the `type`
    of the objects (*operands*) it's operating on.
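
As a small illustration of how `+` depends on the type of its operands, here is a sketch using values that are not part of the original notebook:

```python
# "+" adds numbers but concatenates strings.
int_sum = 2 + 3        # integer addition -> 5
str_sum = "2" + "3"    # string concatenation -> "23"
print(int_sum)
print(str_sum)

# Mixing an integer and a string is not defined for "+", so Python raises a TypeError.
try:
    2 + "3"
except TypeError as err:
    print("TypeError:", err)
```
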

Predict the output of the following two lines:



In [1]:
print(type(1+1))
print(type("Hello"))

#### Lists



Strings and integers are simple examples of different data types (or
objects).  A very important and more complicated type is the
`list`.  Here are some examples.  Examine them, and predict the output:



In [1]:
a = [1,2,3]
b = ["Hello","world"]

c = a + b
print(c)
print(len(c))  # Here len returns the "length" of the list.

Optional extra practice to refresh your memory of list manipulation and slicing.
Examine these first, then predict the output.



In [1]:
print(c[2]) # Remember that Python starts counting list positions at 0.
print(c[1:2]) # Why does Python only return one item instead of two?
print(b[0]*4)
print(c[0::2])
print(list(map(lambda x: x*2,a)))
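
If the indexing and slicing rules feel rusty, the following sketch may help. It uses a made-up list (`letters` is not part of the notebook) so that the exercise above stays an exercise:

```python
letters = ["a", "b", "c", "d", "e"]  # illustrative list, not from the notebook

first = letters[0]          # indexing starts at 0
middle = letters[1:3]       # the start position is included, the stop position is excluded
every_other = letters[::2]  # a step of 2 takes every second element

# A list comprehension is a common alternative to list(map(lambda ...)).
upper = [s.upper() for s in letters]

print(first)
print(middle)
print(every_other)
print(upper)
```
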

#### Dictionaries



Another basic kind of compound object is the `dict` (dictionary;
also called an associative array or hash in other languages).  Predict
the output:



In [1]:
d = {'name': "Barney", 'species': "Dinosaur", 'age': 27, 'color': "Purple"}

print("{name} the {species} is {color}.".format(**d))
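
A few other common dictionary operations, sketched with a made-up dictionary (`pet` and its contents are illustrative, not part of the notebook):

```python
# Illustrative dictionary; the keys and values are made up for this sketch.
pet = {"name": "Clifford", "species": "Dog"}

pet["color"] = "Red"              # add a new key-value pair
species = pet["species"]          # look up a value by its key
age = pet.get("age", "unknown")   # .get returns a default when the key is missing

# An f-string is an alternative to str.format(**d).
sentence = f"{pet['name']} the {species} is {pet['color']}."
print(sentence)
print(sorted(pet.keys()))
```
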

#### DataFrames



A much more complicated data structure, the `DataFrame`, is provided by a module called
`pandas`; you can find a quick tutorial at
[http://pandas.pydata.org/pandas-docs/version/0.23/10min.html](http://pandas.pydata.org/pandas-docs/version/0.23/10min.html).  The
DataFrame object will be very important for us.

A `DataFrame` is basically a rectangular array of data, with names for rows
and columns, rather like a spreadsheet.  In fact, one important thing
one can do with DataFrames is to import data *from* spreadsheets.
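
To make the "rectangular array with named rows and columns" idea concrete, a DataFrame can also be built by hand. The data below are made up for illustration and are not the course spreadsheet:

```python
import pandas as pd

# Made-up data for illustration; not the course spreadsheet.
toy = pd.DataFrame(
    {
        "Food": ["Apples", "Rice", "Milk"],
        "Quantity": [2.0, 5.0, 1.0],
        "Price": [3.0, 4.0, 2.5],
    }
)

print(toy.shape)          # (number of rows, number of columns)
print(list(toy.columns))  # column names
print(list(toy.index))    # row labels; by default these are 0, 1, 2, ...
```
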



In [1]:
import pandas as pd

# Try looking at https://docs.google.com/spreadsheets/d/1ObK5N_5aVXzVHE7ZXWBg0kQvPS3k1enRwsUjhytwh5A in your browser.
SHEET = "https://docs.google.com/spreadsheets/d/1q1ikP1CXCcLf_Tq6VbhoskOYRvn_nDU5MHStIoVzMgA"

# The following line downloads the spreadsheet as CSV and turns it into a pandas DataFrame:
df = pd.read_csv(SHEET + "/export?format=csv")

# This line will show us only the first five rows of data. Try removing .head() to see the full list of items.
# Guess what happens if you replace .head() with .tail(). Try it out!
df.head()

If this worked (!) you should be able to see some data from a recent
shopping trip of mine.  What are the different variables available in
the DataFrame `df`?  They correspond to the *columns* of the spreadsheet.



In [1]:
df.columns

Now, what else can we do?  Let's figure out how much my total grocery
bill was:



In [1]:
df['Price'].sum()

Let's say I'm on a budget. Naturally, we'd want to identify the item(s)
I'm spending the most on. We can sort the values to investigate further.



In [1]:
# Sort by the 'Price' column; ascending=False puts the most expensive items first.
df.sort_values(by='Price', ascending=False).head()

Everything here looks straightforward, but let's take a closer look at
Red Endive and calculate the price per pound to make comparison easier.



In [1]:
# This line selects the 7th item in the dataframe (note the index number is 6 because we start counting at 0 when we use Python)
# and selects the 'Price' value for this particular item. It divides it by 'Quantity' to get the price per pound.
df.iloc[6]['Price']/df.iloc[6]['Quantity']

You'll find throughout the semester that unit price is a pretty useful statistic
to calculate. Let's do it for all the items on this grocery list. Thankfully we
don't have to do this one by one.



In [1]:
# This line creates a new column in our dataframe named 'Unit Price' and populates each row with the respective price value 
# divided by the quantity value.
df['Unit Price'] = df['Price']/df['Quantity']
df.head()
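
The column arithmetic above works row by row without an explicit loop. Here is a minimal sketch on a made-up frame (`toy` and its values are assumptions for illustration, not the grocery data):

```python
import pandas as pd

# Made-up prices and quantities, for illustration only.
toy = pd.DataFrame(
    {
        "Food": ["Apples", "Rice", "Milk"],
        "Quantity": [2.0, 5.0, 1.0],
        "Price": [3.0, 4.0, 2.5],
    }
)

# Dividing one column by another divides each row's Price by that row's Quantity.
toy["Unit Price"] = toy["Price"] / toy["Quantity"]

# The same calculation for a single row, selected by position with .iloc.
rice_unit_price = toy.iloc[1]["Price"] / toy.iloc[1]["Quantity"]

print(toy)
print(rice_unit_price)
```
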

Almost there! Let's pare down our dataframe to make it friendlier to the eye. We
don't want to see the following columns: Date, Location, NDB. Also, we only want to see
the first five items of the dataframe.

In the previous blocks, we used .iloc, which stands for integer (or index) location, and gave it an integer position to pick out
a row. In this section, we'll use .loc, which selects rows and columns by label, so we can refer to columns by name. For extra practice, try to
achieve the same result using .iloc instead.



In [1]:
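# Note that in both the .iloc and .loc syntax, the first set of parameters refers to rows and the second set refers to columns.
# Unlike .iloc, a .loc slice includes its end label, so 0:4 returns the five rows labeled 0 through 4.
df.loc[0:4, ['Food', 'Quantity', 'Units', 'Price', 'Unit Price']]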
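
One difference between the two accessors is easy to miss: a `.loc` slice includes its end label, while an `.iloc` slice stops before its end position. A minimal sketch on a made-up frame with the default integer index:

```python
import pandas as pd

# Made-up frame with the default integer index 0..5, for illustration only.
toy = pd.DataFrame({"Food": list("ABCDEF"), "Price": [1, 2, 3, 4, 5, 6]})

by_label = toy.loc[0:4, ["Food"]]   # label slice: includes the row labeled 4
by_position = toy.iloc[0:4, [0]]    # position slice: stops before position 4

print(len(by_label))     # 5 rows
print(len(by_position))  # 4 rows
```
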

Here's one last exercise that might be useful. Often you will only want to view data
that fits a certain criterion. In this case, let's only look at items where the unit price
is less than 1.



In [1]:
# This line will return all rows in the dataframe where the Unit Price is < 1. Using what we've covered so far,
# modify the view of this dataframe to only include Food and Unit Price.
df[df['Unit Price'] < 1]
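
It can help to see what the expression inside the brackets produces on its own. The sketch below uses made-up data and shows the intermediate True/False Series, plus one way to combine two conditions:

```python
import pandas as pd

# Made-up data, for illustration only.
toy = pd.DataFrame(
    {"Food": ["Apples", "Rice", "Milk"], "Unit Price": [1.5, 0.8, 2.5]}
)

mask = toy["Unit Price"] < 1   # a Series of True/False values, one per row
print(mask.tolist())

cheap = toy[mask]              # keeps only the rows where mask is True
print(cheap)

# Conditions combine with & (and) and | (or); each condition needs its own parentheses.
mid_range = toy[(toy["Unit Price"] > 1) & (toy["Unit Price"] < 2)]
print(mid_range["Food"].tolist())
```
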

Other topics worth refreshing that may be helpful for Project 1: basic visualizations, data types, indexes, and joins.
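
As a starting point for three of those topics (data types, index, joins), here is a hedged sketch. Both tables and all of their values are made up for illustration; visualization is left out to keep the example short:

```python
import pandas as pd

# Both tables are made up for this sketch.
purchases = pd.DataFrame({"Food": ["Apples", "Rice"], "Price": [3.0, 4.0]})
groups = pd.DataFrame({"Food": ["Apples", "Rice"], "Group": ["Fruit", "Grain"]})

print(purchases.dtypes)                 # the data type of each column

by_food = purchases.set_index("Food")   # use the Food column as the row labels
rice_price = by_food.loc["Rice", "Price"]
print(rice_price)

# A left join keeps every row of purchases and adds matching columns from groups.
joined = purchases.merge(groups, on="Food", how="left")
print(joined)
```
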



#### Final words



Throughout this class, you will be exposed to a variety of Python modules and tools, and the data that you work with
may or may not be clean. In any case, learning how to find and use online documentation and resources is a
valuable skill that will benefit you greatly in this course and beyond. Be sure to use our course discussion
for any questions you might have; there's a good chance a peer has a similar question or knows the answer.
As the semester goes on, course staff will update the "Useful Links/Resources" post with any outside Python resources
that may be helpful for the whole class.

