# Introduction to Python

Python is a powerful, object-oriented, open-source programming language. It is one of the most commonly used languages in data science and is increasingly used across academic disciplines and the private sector.  However, if you are accustomed to other languages like MATLAB or R, or are a novice programmer, it can take a while to get used to the Pythonic way of doing things.  This notebook seeks to provide you with a basic introduction to Python, with a focus on the syntax and the types of tasks you might do when you use `dplpy`.

Python is an **interpreted** language (like MATLAB or R), which means you don't need to compile code yourself for it to run.  This makes Python code relatively easy to debug, largely portable (e.g. cross-platform), easily readable, and tractable.  The primary trade-off is that it will generally be slower than a compiled language like FORTRAN.  Python is also very extensible: a large community has written and curated packages of code that allow you to do a lot with relatively little coding.

Python is also easy to use in Jupyter notebooks (like the one you're using now).  This makes it good for both teaching and sharing your code (and results and figures) with others.  We'll talk more about how to use Jupyter notebooks in the introductory part of the class.

In notebooks and Python code in general, you can comment out lines of code so they aren't read when the code runs. This can be useful for documenting what a line of code does or for troubleshooting or debugging.  To comment out a single line or a part of a line you can use `#`.

To begin with, though, Python can function as a simple calculator.  You can do arithmetic operations in a straightforward and intuitive way:

In [2]:
3 + 4 # this is a comment. When you run this code cell, you will see the answer displayed below the cell

7

Like other programming languages, you can also use **variables** to stand in for numbers and perform operations on those:

In [3]:
a = 3
b = 4

a+b  # when you run this code cell, you will see the answer displayed below the cell if it is the last line of code

7

You can also assign the result of an expression involving variables (or numbers) to another variable.  Here we also use the `print` function to have the result displayed after the code block:

In [4]:
c = b * a
print(c) # when you assign the result of an expression to a variable, the result doesn't automatically print to the notebook

12


In base Python we can also create **lists** of values (numbers or strings). The list is a common data type that can hold more than one value at a time and allows us to access (or _index_ into) these. Let's look at a simple example - note that we use **square brackets** to indicate the new list:

In [5]:
my_list = [1, 2, 3, 4, 5, 6]
my_list # just typing the list name here will show the list in the notebook output

[1, 2, 3, 4, 5, 6]

One of the most common things you'll need to do with lists (or other sequences) is get individual items out of them. To do so you need to "index" into the sequence, which basically means pointing to the location in the sequence of the item or items that you want. **Python is a zero-based language,** so the first item in a sequence type has an index of 0 (we say, the 0th position or location). To index into a sequence you use **square brackets** with integer value(s) inside.

A common pitfall is confusing indexing and function calls. Both list creation and indexing use square brackets `[]`, but function calls use parentheses `()`.


In [6]:
my_list[0], my_list[3]

(1, 4)
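To make the bracket-versus-parenthesis distinction concrete, here is a minimal sketch that reuses the same list. The variable names `first_item` and `n_items` are only illustrative, and `len` is Python's built-in function for counting the items in a sequence.

```python
my_list = [1, 2, 3, 4, 5, 6]

first_item = my_list[0]   # square brackets index into the list
n_items = len(my_list)    # parentheses call the built-in function len()

print(first_item, n_items)

# Swapping the two raises a TypeError, because a list is not a function
try:
    my_list(0)
except TypeError as err:
    print("TypeError:", err)
```

If you see an error saying that an object "is not callable", checking for parentheses where brackets were intended is a good first step.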

We can ask for values in other ways as well.  For instance, if we use an index of `-1` we will get the last item in the list:

In [7]:
my_list[-1] # in general, using the negative means to count backwards from the end of the sequence

6
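As a small sketch of how negative indices relate to positive ones, the example below compares `-1` with the equivalent index computed from the list length. The variable names are only illustrative.

```python
my_list = [1, 2, 3, 4, 5, 6]

last = my_list[-1]                         # last item
second_to_last = my_list[-2]               # one step back from the end
same_as_last = my_list[len(my_list) - 1]   # the equivalent positive index

print(last, second_to_last, same_as_last)
```

Because counting starts at zero, the last positive index is always one less than the length of the list.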

### An aside on zero-based counting in Python

Python is a zero-indexed language.  Instead of starting to count at 1 (which is intuitive), Python counts starting at zero. So in the `my_list` list, 1 is in position 0 (the 0th position).

Yes, this is really counterintuitive and I personally still struggle with it (especially coming from MATLAB and FORTRAN which start numbering with 1).  The reason for zero indexing is largely [mathematically motivated](https://www.cs.utexas.edu/users/EWD/transcriptions/EWD08xx/EWD831.html), where describing e.g. ranges of values is more naturally thought about with respect to zero, whereas index schemes that start at 1 are more typical of counting actual things.  Unfortunately, most of us do tend to think of numbers as primarily for counting ('I have 2 eggs', 'Take the 3rd right turn').  See a broader discussion of this and links here: https://craftofcoding.wordpress.com/2017/03/12/why-1-based-indexing-is-ok/

The only intuitive way of thinking about this I can offer you at this point is this:  Imagine Python indexing as answering the following question: 'Once I am in a starting position, how far do I have to go to get to the item I'm looking for?'.  So in the example below, if you start in the first position of the `fruits` list (orange), you need to move 3 positions to get to banana (or 'Starting at orange, I need to move three positions, 1st to apple, 2nd to pear, and 3rd to banana').  You could also think of how we talk about birthdays - e.g. your 1st birthday is actually the anniversary of your birth day (which is your 0th birthday).


In [8]:
fruits = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']
fruits[0], fruits[3]

('orange', 'banana')
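The "how far do I have to move" idea can be sketched with two built-in tools: `enumerate`, which pairs each item with its index, and the list method `.index`, which returns the position of the first match. The variable name `steps_to_banana` is only illustrative.

```python
fruits = ['orange', 'apple', 'pear', 'banana', 'kiwi', 'apple', 'banana']

# enumerate() pairs each item with its index, i.e. its offset from the start
for offset, fruit in enumerate(fruits):
    print(offset, fruit)

# .index() returns the offset of the first matching item
steps_to_banana = fruits.index('banana')
print(steps_to_banana)
```

Note that `.index` reports only the first `'banana'`, even though the list contains two.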

## From lists to arrays to DataFrames

Lists are ok, but we're actually going to want to work more commonly with _arrays_ -- imagine a 2 dimensional Excel spreadsheet (rows and columns) or even a 3 dimensional dataset like a gridded climate product (latitude, longitude, and time).  For working in multiple dimensions, we'll use arrays and a package of software called `numpy`, which extends the numerical methods in base Python.

To add other packages in Python, we use the `import` command.  One of the strengths of the Python language is the many, many additional **Packages** and **Libraries** available that allow you to extend and enhance Python with new modules.  **Modules** are, at their core, collections of code stored in a **.py** file.  A **Package**, then, is a collection of modules.  Modules may contain **functions** (that do things to your data, make calculations, etc.), but can also define data classes, types, or variables.

The terms **Library** and **Package** are often used interchangeably now (I'll definitely do so) to mean a collection of modules, although libraries can also specifically refer to a collection of several packages. In order to make use of these additional modules, we need to `import` them.

In [9]:
import numpy as np

Here, we shorten the imported name to `np`. This is a very widely adopted convention that makes your code more readable for other Python users.  I recommend always using `import numpy as np`.  In theory, you could alternatively import the package under its full name and call its functions as `numpy.function`, but the convention in Python has become to use `np.function`.
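As a rough illustration of the different import styles, the sketch below imports NumPy under both its full name and the `np` alias and shows that they refer to the same package. The `from math import sqrt` line is an extra example, not used elsewhere in this notebook, of importing a single function from a module in Python's standard library.

```python
import numpy            # full name: functions are called as numpy.<name>
import numpy as np      # conventional alias: the same functions as np.<name>
from math import sqrt   # import one function from the standard-library math module

print(np is numpy)                    # both names point to the same package
print(numpy.arange(3), np.arange(3))  # identical results either way
print(sqrt(16))
```

In practice you would pick one style per package; for NumPy that is almost always the `np` alias.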

There are a lot of functions in the NumPy package.  You can see the documentation here: https://numpy.org/doc/stable/.  

Let's create an array using numpy:

In [10]:
my_array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
my_array

array([[ 1,  2,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])

Let's go ahead and create another example array and look at some of NumPy's functionality.  In the code block below, we will use NumPy (`np`) to create a vector of the integers from 0 to 14 using `.arange`, then reshape (using `.reshape`) the resulting 15 numbers into a matrix with 3 rows and 5 columns.

In [11]:
a = np.arange(0,15,1).reshape((3, 5)) # since start and step are optional in .arange, you could also write a = np.arange(15).reshape(3, 5)
a

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

Observe a few things about the commands above.  First, the object-oriented focus of Python allows us to chain several operations or methods together in a row, instead of doing these on several lines.  Second, take a look at the function `.arange`.  This function takes an optional starting number, a stopping number, and an optional step size, and generates evenly spaced values over that interval, separated by the given step.  As we learned in the Introductory notebook, an odd behavior of `.arange` is that the interval does not include the stop value (!); with non-integer steps, floating-point round-off can occasionally make an exception (you can see this perhaps as 'starting with 0, give me 15 numbers', but it is still counter-intuitive to me).  See more about `.arange` here: https://numpy.org/doc/stable/reference/generated/numpy.arange.html.
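The behavior of `.arange` and of method chaining can be checked with a short sketch. The names `chained`, `vector`, and `stepwise` are only illustrative, and `np.linspace` is shown as one alternative (not used in the cells above) for when you want the end point included.

```python
import numpy as np

print(np.arange(15))         # 0 through 14: the stop value 15 is excluded
print(np.arange(2, 15, 3))   # start at 2 and step by 3: 2, 5, 8, 11, 14

# np.linspace takes a number of points instead of a step and includes the end point
print(np.linspace(0, 14, 15))

# Chaining the calls and doing them on separate lines give the same matrix
chained = np.arange(15).reshape((3, 5))
vector = np.arange(15)
stepwise = vector.reshape((3, 5))
print(np.array_equal(chained, stepwise))
```

Chaining saves lines, but splitting the steps up can make it easier to inspect intermediate results while debugging.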

`.reshape` allows us to change the shape of a set of numbers to a certain dimension of rows, columns, etc.  See here for more: https://numpy.org/doc/stable/reference/generated/numpy.reshape.html.  Notice that the shape of the new array (a matrix with 3 rows and 5 columns) is a _tuple_ indicated between parentheses (i.e. these values are both related to one another and immutable, since they together define the shape of the new array).
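Here is a minimal sketch of the two points above: the shape is passed as a tuple, and tuples cannot be modified once created. The name `new_shape` is only illustrative. The last part shows what happens when the requested shape does not match the number of values.

```python
import numpy as np

new_shape = (3, 5)   # a tuple: (rows, columns)
a = np.arange(15).reshape(new_shape)
print(a.shape)

# Tuples are immutable: assigning to one of their elements raises a TypeError
try:
    new_shape[0] = 5
except TypeError as err:
    print("TypeError:", err)

# The new shape must account for all 15 values; (4, 5) would need 20
try:
    np.arange(15).reshape((4, 5))
except ValueError as err:
    print("ValueError:", err)
```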

Now, our array (or, since it is a 2 dimensional array, we can call it a _matrix_ as well) also has other characteristics we can reveal using its attributes.  For instance, we can get the dimensions (how many rows, how many columns, etc.) of the matrix using `.shape`:


In [12]:
a.shape

(3, 5)

As we discussed before, attributes and methods accessed on an object as above often have an equivalent NumPy function. Instead of treating `a` like an object, we can pass it to the function as an argument:

In [13]:
np.shape(a)

(3, 5)
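Beyond `.shape`, arrays carry a few other descriptive attributes, and the zero-based indexing we used for lists extends to two dimensions. The sketch below assumes the same 3 by 5 array as above; `.ndim` and `.size` are standard NumPy attributes that were not introduced earlier in this notebook.

```python
import numpy as np

a = np.arange(15).reshape((3, 5))

print(a.shape)    # (3, 5): rows, columns
print(a.ndim)     # 2: number of dimensions
print(a.size)     # 15: total number of elements

# Index with [row, column], both counted from zero
print(a[0, 0])    # 0: top-left corner
print(a[1, 2])    # 7: second row, third column
print(a[-1, -1])  # 14: last row, last column
```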

## From arrays to DataFrames

Arrays are nice, but they remove the data from any context - we don't have column headers or know what the row order means.  `Pandas` is another library for Python with tools for handling and operating on tabular-like data.  It particularly excels at the type of data we might usually expect to find in flat text files, like CSV files or Excel spreadsheets.  It uses a powerful data structure called a `DataFrame` that is similar to arrays, but that streamlines the process of data handling and manipulation and time series analysis.  

Let's import `Pandas` as `pd`, following convention:

In [14]:
import pandas as pd


In [15]:
text_data = pd.read_csv('myData.csv')
text_data

   position  score  accuracy
0         1     43         3
1         2     21         6
2         3     39         4
3         4     11         2
4         5     47         3
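If you don't have `myData.csv` on hand, you can follow along by building an equivalent DataFrame directly in code. This is only a stand-in that assumes the file contains exactly the values displayed above; the real file may include more than is shown here.

```python
import pandas as pd

# Hypothetical stand-in for myData.csv, built from the values displayed above
text_data = pd.DataFrame({
    "position": [1, 2, 3, 4, 5],
    "score": [43, 21, 39, 11, 47],
    "accuracy": [3, 6, 4, 2, 3],
})

print(text_data)
print(text_data.shape)   # (rows, columns), just like a NumPy array
```

Each dictionary key becomes a column name, and pandas adds the default row index 0 to 4 automatically.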


In [16]:
text_data.columns

Index(['position', 'score', 'accuracy'], dtype='object')

In [17]:
text_data.index

RangeIndex(start=0, stop=5, step=1)

In [18]:
text_data.describe()

       position      score  accuracy
count  5.000000   5.000000  5.000000
mean   3.000000  32.200000  3.600000
std    1.581139  15.466092  1.516575
min    1.000000  11.000000  2.000000
25%    2.000000  21.000000  3.000000
50%    3.000000  39.000000  3.000000
75%    4.000000  43.000000  4.000000
max    5.000000  47.000000  6.000000


In [19]:
text_data["accuracy"]


0    3
1    6
2    4
3    2
4    3
Name: accuracy, dtype: int64

In [20]:
accuracy = text_data["accuracy"]
accuracy


0    3
1    6
2    4
3    2
4    3
Name: accuracy, dtype: int64

In [21]:
text_data[0:2] # get rows using Python counting logic (rows 0 and 1; the stop value 2 is excluded)

   position  score  accuracy
0         1     43         3
1         2     21         6


In [22]:
text_data.loc[0:2,:] # get rows by their index labels; unlike plain slicing, .loc includes the end label (2)

   position  score  accuracy
0         1     43         3
1         2     21         6
2         3     39         4
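Notice that `text_data[0:2]` returned two rows while `text_data.loc[0:2,:]` returned three. The sketch below puts the three selection styles side by side, assuming a DataFrame with the same values as the one displayed earlier (the constructor is a hypothetical stand-in for reading `myData.csv`).

```python
import pandas as pd

# Hypothetical stand-in for myData.csv
text_data = pd.DataFrame({
    "position": [1, 2, 3, 4, 5],
    "score": [43, 21, 39, 11, 47],
    "accuracy": [3, 6, 4, 2, 3],
})

by_slice = text_data[0:2]              # positions 0 and 1; stop is excluded
by_label = text_data.loc[0:2, :]       # labels 0, 1, and 2; end label is included
by_position = text_data.iloc[0:2, :]   # positions 0 and 1; stop is excluded

print(len(by_slice), len(by_label), len(by_position))   # 2 3 2
```

Here the index labels happen to be the integers 0 to 4, so labels and positions look alike; with a different index (dates, for example) `.loc` and `.iloc` would take quite different arguments.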


In [23]:
text_data.loc[:,["position","accuracy"]] # get columns by label, using their names

   position  accuracy
0         1         3
1         2         6
2         3         4
3         4         2
4         5         3


In [24]:
text_data.iloc[:,1] # this selects column 1, which is the "score" column

0    43
1    21
2    39
3    11
4    47
Name: score, dtype: int64

In [25]:
text_data[text_data["accuracy"]<=3] # get the DataFrame only where accuracy is <= 3

   position  score  accuracy
0         1     43         3
3         4     11         2
4         5     47         3
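The filtering above works in two steps: the comparison produces a column of `True`/`False` values (often called a mask), and the DataFrame keeps only the rows where the mask is `True`. Here is a sketch that separates the two steps, again using a hypothetical stand-in for `myData.csv`. The combined condition at the end is an illustrative extension, not something from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for myData.csv
text_data = pd.DataFrame({
    "position": [1, 2, 3, 4, 5],
    "score": [43, 21, 39, 11, 47],
    "accuracy": [3, 6, 4, 2, 3],
})

mask = text_data["accuracy"] <= 3   # a Series of True/False values, one per row
print(mask)

low_accuracy = text_data[mask]      # keep only the rows where the mask is True
print(low_accuracy)

# Conditions can be combined with & (and) or | (or); each needs its own parentheses
both = text_data[(text_data["accuracy"] <= 3) & (text_data["score"] > 40)]
print(both)
```

The original row labels (0, 3, and 4) are kept in the filtered result, which is why the index is no longer consecutive.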


In [26]:
text_data.iloc[2,1] = np.nan # replace the value of 39 in the "score" column (row 2, column 1) with NaN
text_data

   position  score  accuracy
0         1   43.0         3
1         2   21.0         6
2         3    NaN         4
3         4   11.0         2
4         5   47.0         3
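You may have noticed that the `score` values now display as `43.0` rather than `43`. `np.nan` is a floating-point value, so a column that holds it is stored as floats. The sketch below makes that conversion explicit before inserting the missing value. The explicit `astype` step is an assumption on my part rather than part of the cell above: I believe some recent pandas versions warn about (or reject) putting `NaN` directly into an integer column, and casting first avoids relying on that behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for myData.csv
text_data = pd.DataFrame({
    "position": [1, 2, 3, 4, 5],
    "score": [43, 21, 39, 11, 47],
    "accuracy": [3, 6, 4, 2, 3],
})

print(text_data["score"].dtype)   # an integer type to begin with

# Cast the column to float explicitly, then insert the missing value
text_data["score"] = text_data["score"].astype("float64")
text_data.iloc[2, 1] = np.nan

print(text_data["score"].dtype)          # float64
print(text_data["score"].isna().sum())   # number of missing values in the column
```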


In [27]:
text_data.dropna(how="any") # the row with the missing "score" entry is not displayed

   position  score  accuracy
0         1   43.0         3
1         2   21.0         6
3         4   11.0         2
4         5   47.0         3
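One detail worth knowing is that `.dropna` returns a new DataFrame and leaves the original untouched unless you assign the result to a variable. The sketch below illustrates this with a hypothetical stand-in for the data that already contains the missing score. The `subset` argument is a standard `dropna` parameter, not used in the cell above, that limits the check to particular columns.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the data above, with the missing score already in place
text_data = pd.DataFrame({
    "position": [1, 2, 3, 4, 5],
    "score": [43.0, 21.0, np.nan, 11.0, 47.0],
    "accuracy": [3, 6, 4, 2, 3],
})

cleaned = text_data.dropna(how="any")   # a new DataFrame without the incomplete row
print(len(text_data), len(cleaned))     # 5 4: the original still has all its rows

# Only drop rows where a particular column is missing
cleaned_by_score = text_data.dropna(subset=["score"])
print(cleaned_by_score)
```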
