# Building Python Proficiency

*2 hours*

**Contents:**

- [Objects in Python](#Objects-in-Python)
- [Other Python Data Structures](#Other-Python-Data-Structures)
- [Loading Python Packages](#Loading-Python-Packages)
- [Review: Sequences](#Review:-Sequences)
- [Iterating through Sequences](#Iterating-through-Sequences)
- [Using Sequences Effectively](#Using-Sequences-Effectively)

---

## Objects in Python

So far, we've treated Python like a *functional programming language.* That is, we have functions and we have values and that's it. Functions operate on values and return new values. We can chain functions together to get more complex values:

```python
print(round(dms_to_dd(24, 8, precision = 5)))
```

But one of Python's most important features requires us to understand **objects.**

**In Python, everything is an object. By "everything," I mean character strings...**

In [1]:
"Algeria".upper()

'ALGERIA'

In this example, the character string `"Algeria"` is an object.

**Objects have both data, or values, and behaviors.** That is, they are capable of storing values and also capable of carrying out pre-programmed behaviors on those values.

Here, the string `"Algeria"` has:

- **Values:** Its primary value is equal to the character string "Algeria"; it may have other values like its length--the number of characters.
- **Behaviors:** Its behaviors include the ability to capitalize all the letters in its value.

We can see what other behaviors are contained in the `"Algeria"` object by using the `dir()` function:

In [2]:
dir("Algeria")

['__add__',
 '__class__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmod__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'capitalize',
 'casefold',
 'center',
 'count',
 'encode',
 'endswith',
 'expandtabs',
 'find',
 'format',
 'format_map',
 'index',
 'isalnum',
 'isalpha',
 'isascii',
 'isdecimal',
 'isdigit',
 'isidentifier',
 'islower',
 'isnumeric',
 'isprintable',
 'isspace',
 'istitle',
 'isupper',
 'join',
 'ljust',
 'lower',
 'lstrip',
 'maketrans',
 'partition',
 'removeprefix',
 'removesuffix',
 'replace',
 'rfind',
 'rindex',
 'rjust',
 'rpartition',
 'rsplit',
 'rstrip',
 'split',
 'splitlines',
 'startswith',
 'strip',
 'swapcase',
 'title',
 'translate',
 'upper',
 'zfill']


For instance, there's a function to count how many times a given character (or any substring) occurs in the string:

In [3]:
"Hello, world!".count("l")

3

If everything in Python is an object, then variable names always refer to objects as well:

In [4]:
city = "Algiers"
city.lower()

'algiers'

The functions `lower()`, `upper()`, and `count()`, when they *belong* to an object, are referred to as **methods.** A method is just a function that belongs to an object. Its behavior may depend on that object and the object's value, as we saw with `upper()`.

Similar objects can be grouped together by their class or their `type`.

In [5]:
type("Algiers")

str

We already knew that `"Algiers"` is a string. But, more importantly, all strings share the same built-in behaviors:

In [6]:
dir("") == dir("Algiers")

True

In [7]:
"Atlantic Ocean".upper()

'ATLANTIC OCEAN'
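
Because methods live on the type, the same function can also be reached through `str` itself. The short sketch below is only an illustration of that idea (the variable names `via_object` and `via_class` are made up for this example):

```python
city = "Algiers"

# Calling the method on the object...
via_object = city.upper()

# ...is equivalent to calling the function stored on the str type
# and passing the object in explicitly as the first argument.
via_class = str.upper(city)

print(via_object)               # ALGIERS
print(via_object == via_class)  # True
```

In everyday code we almost always use the first form; the second just shows where the behavior comes from.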

Numbers are objects, too, though in Python, built-in numbers don't have a lot of cool behaviors we would normally use...

In [8]:
x = 3.14
x.is_integer()

False

In [9]:
# In Python versions before 3.12, is_integer() only belongs to "float" objects, not "int" objects
x = 3
x.is_integer()

AttributeError: 'int' object has no attribute 'is_integer'

In [10]:
dir([])

['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

In [11]:
cities = ['Algiers', 'Rome', 'Paris']
cities

['Algiers', 'Rome', 'Paris']

In [12]:
cities.reverse()

Why didn't we see any output?

In [13]:
cities

['Paris', 'Rome', 'Algiers']

In [14]:
cities.sort()

In [15]:
cities

['Algiers', 'Paris', 'Rome']

**It's important to realize that `list` methods modify the data in place, unlike string methods:**

In [16]:
"Rome".replace("R", "H")

'Home'

**This difference arises because Python strings are *immutable;* they can't be changed once they are created.** `"Rome"` and `"Home"` are two different strings, so string methods return a new string each time instead of modifying the original.
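
A side-by-side sketch may make the contrast easier to see. It uses only the built-in `list` and `str` methods shown above; the variable names `result` and `new_name` are chosen just for this example:

```python
cities = ["Rome", "Algiers", "Paris"]
name = "Rome"

# List methods such as sort() change the list itself and return None.
result = cities.sort()
print(result)    # None
print(cities)    # ['Algiers', 'Paris', 'Rome']

# String methods return a new string; the original is untouched.
new_name = name.replace("R", "H")
print(new_name)  # Home
print(name)      # Rome

# To keep a changed string, assign the result back to a variable.
name = name.replace("R", "H")
print(name)      # Home
```

This also answers the earlier question: `cities.reverse()` produced no output because, like `sort()`, it returns `None`.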

**Objects are key to advanced Python programming.** I wanted to introduce them here because we will be seeing them a lot.

---

### Challenge: Formatting Place Names

You received a Shapefile that shows a number of cities. In the attribute table, the city names are all written like:

```
NEW YORK CITY
MIAMI
KANSAS CITY
...
```

**How can we re-format these names so that they look like proper place names, e.g., "New York City"?**

Look at the list of methods that belong to string (`str`) objects. Is there a method defined on strings that does this for us? Use your intuition and the `help()` function.

---

## Other Python Data Structures

So far, we've learned about a few different Python data types:

- Strings or `str`, like `"Algeria"`
- Integers or `int`, like `42`
- Floats or `float`, like `3.14`

These data types are referred to as *atomic* data types because they contain just one value; they are like individual atoms. How can we work with molecules or whole chunks of matter, instead? We've seen just one data structure so far, the `list`:

In [17]:
cities

['Algiers', 'Paris', 'Rome']

A related data structure in Python is called the `tuple`; it looks just like a list but is represented by opening and closing parentheses instead of square brackets:

In [18]:
tuple(cities)

('Algiers', 'Paris', 'Rome')

In [19]:
my_tuple = (1, 2, 3)

In [20]:
dir(my_tuple)

['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getnewargs__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rmul__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'count',
 'index']

**Why does `tuple` have so few methods defined?**

Tuples are immutable, just like strings. They can't be changed once they are created. The only two public methods defined just give us information about the tuple's fixed, unchanging elements:

In [21]:
my_tuple.index(1)

0

If we want to change a tuple, we'll have to convert it to a `list` first; of course, we're really just getting a *copy* of the tuple's contents contained in a `list` instead.

In [22]:
list(my_tuple)

[1, 2, 3]
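
To see what "immutable" means in practice, here is a small sketch. It uses `try` and `except`, which we haven't covered yet; for now, read them as "attempt this, and if it fails with this kind of error, do that instead."

```python
my_tuple = (1, 2, 3)

# Trying to change an element of a tuple raises a TypeError.
try:
    my_tuple[0] = 99
except TypeError as error:
    print("Can't modify a tuple:", error)

# Converting to a list gives us a separate, mutable copy.
my_list = list(my_tuple)
my_list[0] = 99

print(my_list)   # [99, 2, 3]
print(my_tuple)  # (1, 2, 3)
```

Changing the list copy leaves the original tuple exactly as it was.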

**Another data structure is the dictionary, or `dict`.** Let's say we're working with Shapefiles for multiple countries and we want to identify [each country's 2-letter country code...](https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2#Officially_assigned_code_elements)

In [23]:
fips = {
    "Algeria": "DZ",
    "United States": "US"
}

A dictionary allows us to relate one kind of data to another. Here, we can look up a country's 2-letter code based on its name.

In [24]:
fips["Algeria"]

'DZ'

What do dictionaries know how to do?

In [25]:
dir({})

['__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__ior__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__or__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__ror__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'clear',
 'copy',
 'fromkeys',
 'get',
 'items',
 'keys',
 'pop',
 'popitem',
 'setdefault',
 'update',
 'values']

In [26]:
fips.keys()

dict_keys(['Algeria', 'United States'])

In [27]:
fips.values()

dict_values(['DZ', 'US'])

In [28]:
fips.items()

dict_items([('Algeria', 'DZ'), ('United States', 'US')])

One thing to note: Each of these outputs *looks* like a list but it is not actually a list.

In [29]:
type(fips.keys()) is list

False

In [30]:
type(cities) is list

True

In [31]:
list(fips.keys())

['Algeria', 'United States']
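
The sketch below illustrates one practical difference between these list-like objects (often called "views") and a real list. It re-creates the `fips` dictionary so that it can be run on its own, and again borrows `try`/`except` to catch the error:

```python
fips = {"Algeria": "DZ", "United States": "US"}
keys = fips.keys()

# A view supports membership tests, len(), and iteration...
print("Algeria" in keys)  # True
print(len(keys))          # 2

# ...but it cannot be indexed by position the way a list can.
try:
    keys[0]
except TypeError as error:
    print("Not a list:", error)

# Converting to a list makes positional indexing possible.
print(list(keys)[0])      # Algeria
```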

**The "keys" of a dictionary can be any *hashable* Python value (in practice, an immutable one such as a string, a number, or a tuple), but they are usually either strings or numbers:**

In [32]:
my_dict = {1: "One", 2: "Two", 42: "Forty-Two"}
my_dict[42]

'Forty-Two'
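
One restriction is worth knowing about: a key has to be *hashable*, which in practice means immutable. As a rough illustration, the sketch below uses tuples of (longitude, latitude) as keys; the dictionary name `names_by_location` is made up for this example.

```python
# Tuples are immutable, so they can be used as dictionary keys.
names_by_location = {
    (3.048, 36.765): "Algiers",
    (10.179, 36.803): "Tunis",
}
print(names_by_location[(3.048, 36.765)])  # Algiers

# Lists are mutable, so Python refuses to use one as a key.
try:
    bad = {[3.048, 36.765]: "Algiers"}
except TypeError as error:
    print("Can't use a list as a key:", error)
```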

In summary:

- A `list` is **mutable** (can be changed) and looks like: `[1, True, 3]`
- A `tuple` is **immutable** (cannot be changed) and looks like: `(1, True, 3)`
- A `dict` looks like: `{1: True, 2: False}` and...

**Is a dictionary mutable or immutable?**

### Using Dictionaries

Python dictionaries have lots of relevance for programming exercises. Suppose we had a list of countries, like in the country codes example, and we wanted to count how many times each unique country's name appears in the list.

In [33]:
# By the way, a quick way to make a list longer...
countries = ['Algeria'] * 3
countries

['Algeria', 'Algeria', 'Algeria']

In [34]:
countries.extend(['Tunisia', 'Libya'])
countries

['Algeria', 'Algeria', 'Algeria', 'Tunisia', 'Libya']

We can create a new, empty dictionary just like this:

In [35]:
counts = {}

An empty dictionary has no keys, yet!

In [36]:
counts.keys()

dict_keys([])

This next part I'll show for now, but we'll learn the details later:

In [37]:
for name in countries:
    if name in counts.keys():
        counts[name] = counts[name] + 1
    else:
        counts[name] = 1
        
counts

{'Algeria': 3, 'Tunisia': 1, 'Libya': 1}
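
For comparison, here are two other common ways to write the same count. Neither is required for this lesson; they are sketches of alternatives, using the dictionary method `get()` (listed in the `dir({})` output above) and the `Counter` class from the standard library's `collections` module.

```python
from collections import Counter

countries = ['Algeria', 'Algeria', 'Algeria', 'Tunisia', 'Libya']

# Option 1: dict.get() returns a default (here, 0) when the key is missing,
# which removes the need for the if/else branch.
counts = {}
for name in countries:
    counts[name] = counts.get(name, 0) + 1
print(counts)  # {'Algeria': 3, 'Tunisia': 1, 'Libya': 1}

# Option 2: Counter does the counting for us.
counter = Counter(countries)
print(counter["Algeria"])  # 3
```

The explicit `if`/`else` version is the easiest to read when you are starting out; the alternatives are mostly a matter of brevity.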

---

## Break!

*A 10-minute break for learners.*

---

## Loading Python Packages

There's a lot that we could do with what we've already seen from Python. However, many of Python's most powerful tools are available as packages that must be loaded before they can be used. In Python, the thing we load is called a "module"; a "package" is a collection of modules, though the two words are often used interchangeably.

For example, because not every program needs to work with dates and times, Python includes all that functionality in a separate module called `datetime`:

In [38]:
import datetime

In [39]:
today = datetime.date.today()
today

datetime.date(2024, 3, 19)

In [40]:
type(today)

datetime.date

**Recall that Python objects contain both data and behaviors. We learned that the behaviors are called *methods.* The data are called *attributes.*** `datetime.date` objects have some pretty helpful attributes:

In [41]:
today.month

3

In [42]:
today.year

2024

**One of the powerful consequences of everything being an object in Python is that objects know a lot about themselves and what you want them to do.**

In [43]:
today > datetime.date(2000, 1, 1)

True

What is the date 7 days from today?

In [44]:
today + datetime.timedelta(days = 7)

datetime.date(2024, 3, 26)

**Note that we're using the same Python operators that we used for doing math with numbers.** That's because `datetime.date` objects have (in order, from above), `__gt__` and `__add__` methods defined:

In [45]:
today.__add__

<method-wrapper '__add__' of datetime.date object at 0x7f59902ffc50>

In [46]:
help(today.__gt__)

Help on method-wrapper:

__gt__ = <method-wrapper '__gt__' of datetime.date object>
    Return self>value.
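
To make the link between operators and these double-underscore methods concrete, the sketch below calls them both ways. It uses fixed dates (the variable names `start` and `week` are just for this example) so the output doesn't depend on when you run it.

```python
import datetime

# Fixed dates, so the output does not depend on today's date
start = datetime.date(2024, 3, 19)
week = datetime.timedelta(days=7)

# Each operator is shorthand for a call to a double-underscore method
print(start > datetime.date(2000, 1, 1))        # True
print(start.__gt__(datetime.date(2000, 1, 1)))  # True

print(start + week)         # 2024-03-26
print(start.__add__(week))  # 2024-03-26
```

We would never call `__add__` directly in real code; the point is only that `+` and `>` work because the object defines these methods.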



### Advanced Importing

Because `datetime` can be a lot to type, if you're using it a lot, you might want to give it an *alias.*

In [47]:
import datetime as dt

dt.date.today()

datetime.date(2024, 3, 19)

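Why do we have to write `datetime.date` or `dt.date`? Because `date` is a class defined *inside* the `datetime` module, and the dot tells Python to look it up there. Larger packages are also organized into *sub-modules;* for now, you can think of these as nested folders.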

Another example of a built-in module is the `math` module.

In [48]:
import math

With any module, we can either import the entire module, as above; or, we can import a specific part of the module that we want to use:

In [49]:
from math import log10

log10(100)

2.0

In [50]:
# Because of "import math" we can also do:
math.log10(100)

2.0

Recall that the `is` operator can be used to see if two variables point to the same thing:

In [51]:
log10 is math.log10

True
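
The import styles above can also be combined. As a brief sketch: several names can be imported from one module at once, and a single imported name can be given an alias (the alias `log_base_10` is made up for this example).

```python
# Import specific names from a module (several at once, separated by commas)
from datetime import date, timedelta

# Import a single name and give it an alias
from math import log10 as log_base_10

print(date(2024, 3, 19) + timedelta(days=7))  # 2024-03-26
print(log_base_10(100))                        # 2.0
```

Importing specific names saves typing, at the cost of making it less obvious where a name like `date` came from.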

---

## Review: Sequences

**Sequences,** which include lists, tuples, and character strings, are fundamental to scientific programming in Python. We're never working with just a single piece of data, so there will always be a need to iterate through a collection of elements.

Recall that a **sequence** is any one of the Python data types: `list`, `tuple`, or `str` (string).

Remember how to test for membership in a list?

In [52]:
my_list = [1, 2, 3]
4 in my_list

False

Well, we can use this same syntax for *any* sequence, including strings!

In [53]:
4 in tuple(my_list)

False

In [54]:
'York' in 'New York City'

True

Note that this is case-sensitive, however.

In [55]:
'york' in 'New York City'

False

And, as such, when working with strings it's a good idea to *normalize* the strings if you don't care about case.

In [56]:
'New York City'.lower()

'new york city'

In [57]:
'york' in 'New York City'.lower()

True
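
If you need this kind of check often, it can be wrapped in a small function. The function name `contains_ignoring_case` is hypothetical, and this is only a sketch of the normalization idea described above:

```python
def contains_ignoring_case(text, fragment):
    """Return True if `fragment` appears in `text`, ignoring case."""
    # Normalize both strings so the comparison is like-for-like
    return fragment.lower() in text.lower()

print(contains_ignoring_case('New York City', 'york'))   # True
print(contains_ignoring_case('New York City', 'YORK'))   # True
print(contains_ignoring_case('New York City', 'Paris'))  # False
```

Note that both strings are normalized, not just the longer one; otherwise a search for `'YORK'` would still fail.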

---

## Iterating through Sequences

As we've seen, Python functions allow us to repeatedly apply the same behavior to new and different pieces of data. However, a function still has to be called, repeatedly, on each piece of data we want to work with. How can we iterate over all the data in a sequence?

**The basic tool for iterating through a sequence of data is the `for` loop.** Let's learn by example.

For these examples we'll be working with [data on cities in Africa from Natural Earth Data.](https://www.naturalearthdata.com/downloads/110m-cultural-vectors/) The `pandas` module is a community module that is great for working with tabular data, like this comma-separated values (CSV) file from Natural Earth Data.

In [58]:
import pandas as pd

places = pd.read_csv("data/cities_Africa.csv")
places

| Unnamed: 0 | name | adm1name | latitude | longitude | pop_max | pop_min |
|---|---|---|---|---|---|---|
| 0 | Lobamba | Manzini | -26.466668 | 31.199997 | 9782 | 4557 |
| 1 | Bir Lehlou | | 26.119167 | -9.652522 | 500 | 200 |
| 2 | Moroni | | -11.704158 | 43.240244 | 128698 | 42872 |
| 3 | Kigali | Kigali City | -1.951644 | 30.058586 | 860000 | 745261 |
| 4 | Mbabane | Hhohho | -26.316651 | 31.133335 | 90138 | 76218 |
| ... | ... | ... | ... | ... | ... | ... |
| 62 | Addis Ababa | Addis Ababa | 9.035256 | 38.698059 | 3100000 | 2757729 |
| 63 | Cape Town | Western Cape | -33.918065 | 18.433042 | 3215000 | 2432858 |
| 64 | Lagos | Lagos | 6.445208 | 3.389585 | 9466000 | 1536 |
| 65 | Nairobi | Nairobi | -1.281401 | 36.814711 | 3010000 | 2750547 |


In [59]:
type(places)

pandas.core.frame.DataFrame

Our CSV file is represented in our Python session as a `pandas` DataFrame. The DataFrame is basically a collection of sequences: each column is a sequence of values.

A `pandas` DataFrame can be indexed just like a dictionary.

In [60]:
places['name']

0         Lobamba
1      Bir Lehlou
2          Moroni
3          Kigali
4         Mbabane
         ...     
62    Addis Ababa
63      Cape Town
64          Lagos
65        Nairobi
66          Cairo
Name: name, Length: 67, dtype: object

Indexing the `'name'` column of the `places` DataFrame gives us something like a sequence of values. Therefore, we can index the first 10 values of that sequence like this:

In [61]:
places['name'][0:10]

0       Lobamba
1    Bir Lehlou
2        Moroni
3        Kigali
4       Mbabane
5          Juba
6        Dodoma
7      Laayoune
8      Djibouti
9        Banjul
Name: name, dtype: object

Let's assign the sequence of city names to a variable so it's easier to work with. A column in a `pandas` table has a method, `tolist()`, that will return that column's values as a list.

In [62]:
# Convert the "name" column to a Python list
cities = places['name'].tolist()

type(cities)

list

### The `for` Loop

A `for` loop is named after the first keyword, `for`, but it has two parts:

```python
for thing in sequence:
```

- `thing`, in the example above, can be any variable name you choose. We'll refer to the variable in the position currently occupied by `thing` as the *iterator variable.*
- `sequence` can be any Python expression or variable, so long as it evaluates to or points to a sequence.

In each iteration of the `for` loop, the iterator variable takes on a new value.

In [63]:
for color in ["red", "blue", "green"]:
    print(color)

red
blue
green


In the trivial example above, notice that we get three lines of output. This is because there are three elements in our sequence (the list of color names as strings). In the first iteration of the loop, the iterator variable `color` points to the string `"red"`. In the second iteration, it points to the string `"blue"`, and so on.

Note that the line with `print(color)` is indented four (4) spaces, just like the body of a function. In Python, the colon (`:`) character almost always indicates that a new code block is starting on the next line, so you should be used to indenting the next line and any other lines in the code block. Lines that are not indented are considered to be outside the code block; so, in the case of a `for` loop, lines outside of the code block are not run for each iteration of the `for` loop.

In [64]:
for color in ["red", "blue", "green"]:
    print(color)
print("done!")

red
blue
green
done!


The loop variable, `color`, gets assigned to each element of the sequence, in order, in each iteration. Therefore, when the `for` loop is complete, the `color` variable still points to the *last element* of the sequence.

In [65]:
color

'green'

Here's a slightly more interesting example. We can quickly filter the items of our `cities` names sequence by adding an `if` statement to the body of our `for` loop.

In [66]:
for name in cities:
    if 'Port' in name:
        print(name)

Porto-Novo
Port Louis


The body of a `for` loop can be as long as you want it to be. However, it's usually a good idea to keep `for` loops short, for readability. If you have a lot you need to do in the body of a `for` loop, write a function and then call that function inside the body of the `for` loop, just like we did with the `print()` function above.
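
As a sketch of that advice, the loop below hands each name to a helper function and keeps its own body to a single line. The function `describe()` and its labeling rule are made up for illustration, and `sample_cities` is a small stand-in for the full list of names.

```python
def describe(name):
    """Return a short label for a place name (a made-up rule, for illustration)."""
    if 'Port' in name:
        return name + ' (has "Port" in its name)'
    return name

# A small stand-in for the full list of city names
sample_cities = ['Porto-Novo', 'Kigali', 'Port Louis']

for name in sample_cities:
    print(describe(name))
```

If the labeling rule grows more complicated later, only `describe()` needs to change; the loop stays the same.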

---

## Using Sequences Effectively

Let's dig deeper using data. **How far away is Algiers from each of these cities?** 

Most of these are too far away to use a flat-earth approximation. **In geography, the shortest distance between two points on a sphere is a path described by...?**

That's right! A *great circle.* **The haversine formula calculates the great-circle distance between two points.** It treats the Earth as a perfect sphere, so the results are approximate, but it stays numerically accurate even for short distances.

In [67]:
import numpy as np

def haversine(p1, p2):
    x1, y1 = map(np.deg2rad, p1)
    x2, y2 = map(np.deg2rad, p2)
    dphi = np.abs(y2 - y1) # Difference in latitude
    dlambda = np.abs(x2 - x1) # Difference in longitude
    angle = 2 * np.arcsin(np.sqrt(np.add(
        np.power(np.sin(dphi / 2), 2),
        np.cos(y1) * np.cos(y2) * np.power(np.sin(dlambda / 2), 2)
    )))
    return float(angle * 6371e3) # Earth's radius: 6371e3 meters

I need two numbers to represent a location on earth (longitude and latitude), so each location (referred to as `p1` or `p2` in the function above) needs to be a *2-element sequence;* i.e., a 2-element list or 2-element tuple.

In [68]:
# Longitude, Latitude
ALGIERS = (3.048, 36.765)

# Distance between Algiers and Tunis?
haversine(ALGIERS, (10.179, 36.803))

634924.2858604608
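
If the NumPy calls make the formula hard to follow, the same calculation can be written with the standard library's `math` module. The version below is a sketch for comparison, not a replacement; the name `haversine_plain` and the `radius` parameter are introduced here for illustration.

```python
import math

def haversine_plain(p1, p2, radius=6371e3):
    """Great-circle distance in meters between (longitude, latitude) pairs."""
    lon1, lat1 = math.radians(p1[0]), math.radians(p1[1])
    lon2, lat2 = math.radians(p2[0]), math.radians(p2[1])
    dphi = lat2 - lat1     # Difference in latitude
    dlambda = lon2 - lon1  # Difference in longitude
    a = (math.sin(dphi / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius * math.asin(math.sqrt(a))

# Longitude, Latitude
ALGIERS = (3.048, 36.765)
TUNIS = (10.179, 36.803)

print(round(haversine_plain(ALGIERS, TUNIS) / 1000))  # about 635 (km)
```

The absolute values in the original function aren't needed here because the differences only appear inside squared sine terms, where the sign doesn't matter.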

Let's suppose that we have three different sequences of equal length.

In [69]:
cities = places['name'].tolist()
lat = places['latitude'].tolist()
lng = places['longitude'].tolist()

How far away from Algiers is Cairo? Let's start by getting the numeric index of Cairo.

In [70]:
'Cairo' in cities

True

In [71]:
cities.index("Cairo")

66

In [72]:
CAIRO = (lng[66], lat[66])

haversine(ALGIERS, CAIRO)

2710762.9480906716

The answer is in meters because of the way our function is written. If we want the answer in kilometers, we need to divide by 1000.

In [73]:
haversine(ALGIERS, CAIRO) / 1000

2710.7629480906717

**How can we calculate these distances for all cities?**

When we have sequences of equal length, or anytime the length of a sequence is predictable and important, we can use a `for` loop that iterates over the *numeric indices* of the sequence.

In [74]:
# Just for the first 10 cities... Can replace 10 with len(cities)
for i in range(0, 10):
    print(cities[i])

Lobamba
Bir Lehlou
Moroni
Kigali
Mbabane
Juba
Dodoma
Laayoune
Djibouti
Banjul


Here, we can use the index variable `i` to select elements in the same position in multiple lists.

In [75]:
distances = []
for i in range(0, len(cities)):
    city = cities[i]
    coords = (lng[i], lat[i])
    dist = haversine(ALGIERS, coords) / 1000
    distances.append(round(dist, 1))

In [76]:
print(cities[66], 'is', distances[66], 'km away from Algiers')

Cairo is 2710.8 km away from Algiers


**Python has some special built-in functions for working with sequences.**

The function `enumerate()`, when used in a `for` loop, takes any sequence and produces pairs of an indexing variable and each element of that sequence, in order.

In [77]:
for i, city in enumerate(cities):
    if lat[i] < 0:
        print(city)

Lobamba
Moroni
Kigali
Mbabane
Dodoma
Bujumbura
Port Louis
Lusaka
Harare
Bloemfontein
Pretoria
Maputo
Windhoek
Maseru
Antananarivo
Lilongwe
Brazzaville
Gaborone
Victoria
Dar es Salaam
Luanda
Johannesburg
Kinshasa
Cape Town
Nairobi


In the above example, we have *two* iterator variables. Confused? Consider this slight variation on the previous example. Below, I am using a `break` statement to make sure the loop only runs once; we'll talk more about `break` later.

In [78]:
for thing in enumerate(cities):
    print(thing)
    break

(0, 'Lobamba')


You should be able to tell that the iterator variable `thing` is a tuple containing two elements. The first element is a number, `0`, and the second element is a city name, like we would have expected. What `enumerate()` does is, instead of assigning the iterator variable (`thing`) to each value of the sequence (`cities`), it creates a tuple and makes the iterator variable point to that tuple instead.

So, in our first example, with two iterator variables, we were basically doing something like:

In [79]:
for thing in enumerate(cities):
    i = thing[0]
    city = thing[1]

But by writing two, comma-separated iterator variables, our `for` loop is a little cleaner and easier to read.
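
The same comma-separated unpacking works with another built-in function, `zip()`, which isn't covered above. As a rough sketch, `zip()` pairs up the elements at the same position in several sequences, so no index variable is needed. The `sample_` lists below are small stand-ins whose values come from the first rows of the table.

```python
# Small stand-in lists, so the full lists are left untouched
sample_cities = ['Lobamba', 'Bir Lehlou', 'Moroni']
sample_lat = [-26.466668, 26.119167, -11.704158]
sample_lng = [31.199997, -9.652522, 43.240244]

southern = []
# zip() pairs up the elements at the same position in each list
for city, latitude, longitude in zip(sample_cities, sample_lat, sample_lng):
    if latitude < 0:
        southern.append(city)

print(southern)  # ['Lobamba', 'Moroni']
```

Whether `enumerate()` or `zip()` reads better depends on whether you need the index itself or only the matching elements.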

**Another built-in function that can be useful to use with `for` loops is `reversed()`.** Can you guess what `reversed()` does?

In [80]:
reversed(cities)

<list_reverseiterator at 0x7f593e4c21a0>

Instead of returning the reversed list immediately, Python creates what's called an **iterator.** This is a special Python object that represents the transformation of the data that you want; in this case, the `cities` list in reverse order. By creating this representation, Python saves time and memory because no reversed copy of the list is built; each element is looked up only when you actually ask for it, as a `for` loop does.

```python
for city in reversed(cities):
    print(city)
```
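
To see that the object `reversed()` returns hands out elements only on request, we can ask for them one at a time with the built-in `next()` function. This is a small sketch using a short stand-in list named `sample`:

```python
sample = ['Algiers', 'Paris', 'Rome']
backwards = reversed(sample)

# next() asks for one element at a time
print(next(backwards))  # Rome
print(next(backwards))  # Paris

# list() collects whatever elements are left
print(list(backwards))  # ['Algiers']

# The object is now used up; the original list is unchanged
print(list(backwards))  # []
print(sample)           # ['Algiers', 'Paris', 'Rome']
```

One consequence worth remembering: the result of `reversed()` can only be looped over once. If you need the reversed data again, call `reversed()` again.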

---

### Challenge: Closest City to Algiers?

**What's the closest city in this dataset to Algiers?** Remember that there are built-in `min()` and `max()` functions, and you can call the `index()` method on a `list` to get the numeric position of an element.

In [81]:
distances.index(min(distances))

57

In [82]:
cities[57]

'Algiers'

Okay, sure, Algiers is in the cities dataset... Challenge yourself further by calculating the *second-closest* city to Algiers (i.e., the closest city if Algiers wasn't in the cities dataset).

---

## More Resources

- [The Python Standard Library](https://docs.python.org/3/library/index.html)
- Punit Jajodia: [In Python, Everything is an Object](https://www.youtube.com/watch?v=X1RN6ADsOW4)
- Whirlwind Tour of Python: [Basic Python Semantics: Variables and Objects](https://nbviewer.org/github/jakevdp/WhirlwindTourOfPython/blob/master/03-Semantics-Variables.ipynb)
- Whirlwind Tour of Python: [Built-in Data Structures](https://nbviewer.org/github/jakevdp/WhirlwindTourOfPython/blob/master/06-Built-in-Data-Structures.ipynb)