# Data Structures


## What's a data structure?
As its name implies, a data structure is a container that holds data. Just like some post office boxes hold packages and others hold letters, Python's built-in data structures have different purposes and uses. Use data structures to organize and perform operations on data. Python has the following built-in data structures: lists, dictionaries, sets, and tuples. Each container has different attributes and is used for a different purpose.

sources: (W3Schools, RealPython.com)

<img src="images/post_office_boxes.jpg" align="middle">

## Comparing Built-in Data Structures
Below is a comparison of four built-in data structures in Python.

![](images/Structures.jpg)

# Tuples
What is the proper pronunciation of "tuple"? Answer: either TEW-pull or Tupple (like the 'u' sound in pup).

* Ordered sequence of elements
* Tuples are immutable
* Parentheses denote a tuple
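
To see what immutability means in practice, here is a minimal sketch (the names `point` and `moved` are illustrative and not part of the notebook). Assigning to an index of a tuple raises a `TypeError`, so a "changed" tuple has to be built as a new object.

```python
point = (3, 4)

try:
    point[0] = 10          # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Build a new tuple instead of modifying the old one
moved = (10,) + point[1:]
print(point)
print(moved)
```

The original tuple is untouched; `moved` is a separate object assembled by concatenation and slicing.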

In [2]:
# Create an empty tuple
t = ()

# Create a tuple that holds values of different types
t = (4, "hello", True, 3.1)
print(t[3])  # index 3 is the fourth item

3.1


In [3]:
# Concatenation with a tuple

print(t)
print("Concatenate '7'")
print((t) + (7,)) # Note the comma--the comma tells Python that this is a tuple and not an int

(4, 'hello', True, 3.1)
Concatenate '7'
(4, 'hello', True, 3.1, 7)


In [4]:
# Iterating a tuple
for v in t:
    print(v)

4
hello
True
3.1


In [1]:
# zip function example

s = 'apple','banana','pear'
t = [0, 1, 2] 
zip(s, t)
for pair in zip(s, t):
    print(pair)

('apple', 0)
('banana', 1)
('pear', 2)


In [2]:
type(s)

tuple

In [3]:
def tip_options(amount):
    # Use a tuple to return more than one value
    return(amount, amount*1.10, amount*1.15, amount*1.2)

print(tip_options(30))
print(type(tip_options(30)))

(30, 33.0, 34.5, 36.0)
<class 'tuple'>


In [4]:
# Use a tuple to swap values
y = 5
x = 10
print('x =', x, 'y =', y)
(x, y) = (y, x)
print('x =', x, 'y =', y)

x = 10 y = 5
x = 5 y = 10


# Lists

A list is an ordered sequence of *items*. Lists are similar to arrays in other languages. One difference is that Lists can contain different types of data.

## Creating Lists
 
Lists are created using several methods.

In [3]:
#Use square brackets to make list

cities = [] # an empty list

In [4]:
print(cities)

[]


In [5]:
cities = ["Dallas","Chicago","Miami","Grand Rapids" ]
print(cities," is of type ", type(cities))

['Dallas', 'Chicago', 'Miami', 'Grand Rapids']  is of type  <class 'list'>


In [6]:
# Get the length of your list using the len() function
len(cities)

4

In [8]:
#Ordered -- accessible via index
print(cities[2]) # Print the third item in a list

Miami


In [11]:
print(cities[-1]) # Print the last item in a list

Grand Rapids


In [12]:
print(cities[1:3]) # print the second and third items. Second value is not inclusive.

['Chicago', 'Miami']


In [13]:
print(cities[1:]) # print the second item through the end of the list.

['Chicago', 'Miami', 'Grand Rapids']


In [14]:
# Lists are not limited to containing only values of a single type
# A list may contain objects such as another list 
my_list = [True, 0, "Greg Bott", 3.14159, ["steak","eggs","donuts"]]
print(my_list)

[True, 0, 'Greg Bott', 3.14159, ['steak', 'eggs', 'donuts']]


## Lists are Mutable

In [22]:
print(cities)
cities[2] = "San Antonio" # Replace the third entry ('Miami') with 'San Antonio'
print(cities) # mutated list object

['Dallas', 'Chicago', 'Miami', 'Grand Rapids']
['Dallas', 'Chicago', 'San Antonio', 'Grand Rapids']


## Adding items to a list
* Use append to add elements to the end of a list. This operation *mutates* the list.
```Python
<list>.append(element)
``` 

Combine lists using the extend() method or the + operator.
```Python
<list>.extend(iterable)

<list> + <list>
``` 

Insert an item at specified index location using insert().



In [23]:
# Use the append() method to add an item to the list
print(cities)
print('Adding Columbia...')
cities.append("Columbia")
print(cities)

['Dallas', 'Chicago', 'San Antonio', 'Grand Rapids']
Adding Columbia...
['Dallas', 'Chicago', 'San Antonio', 'Grand Rapids', 'Columbia']


In [24]:
# Using append() to add multiple items results in a list within a list
more_cities = ['St. Louis', 'Tempe', 'Atlanta']

# append() accepts exactly 1 argument and adds it as a single item,
#   so appending a list nests that list inside the original list
cities.append(more_cities) 
print(cities)

['Dallas', 'Chicago', 'San Antonio', 'Grand Rapids', 'Columbia', ['St. Louis', 'Tempe', 'Atlanta']]


In [25]:
cities[5][0]

'St. Louis'

In [31]:
# Use insert() to add an item to specific location in the list. 
print(cities)
cities.insert(1, 'Austin') # Insert Austin in the second position
print(cities)

['Dallas', 'Chicago', 'San Antonio', 'Grand Rapids', 'Columbia', ['St. Louis', 'Tempe', 'Atlanta']]
['Dallas', 'Austin', 'Chicago', 'San Antonio', 'Grand Rapids', 'Columbia', ['St. Louis', 'Tempe', 'Atlanta']]


In [32]:
# Reset the list to the original cities
cities = ['Dallas', 'Austin', 'Chicago', 'Miami', 'Grand Rapids', 'Columbia']

In [33]:
more_cities = ['St. Louis', 'Tempe', 'Atlanta']

# Use extend() when you want to add multiple values to a list
cities.extend(more_cities)
print(cities)

['Dallas', 'Austin', 'Chicago', 'Miami', 'Grand Rapids', 'Columbia', 'St. Louis', 'Tempe', 'Atlanta']


## Copying a list
If you use the following expression to create a new list, what you have is two references to a single object, NOT two lists.
```Python
list_b = list_a
```

In [34]:
list_a = [1,2,3,4,5]

list_b = list_a
print("a=",list_a)
print("b=",list_b)

a= [1, 2, 3, 4, 5]
b= [1, 2, 3, 4, 5]


In [35]:
# Change ONLY list_b
list_b[0] = 'Protein bar'
print("a=",list_a)
print("b=",list_b)

a= ['Protein bar', 2, 3, 4, 5]
b= ['Protein bar', 2, 3, 4, 5]


Even though I did not change list a, changing list b also changed list a because the two lists are not two separate objects. They are two references pointing to the same object.

In [36]:
list_a = [1,2,3,4,5]

# To make a *copy* of the object instead of referencing the same object, use copy()
list_b = list_a.copy()
print("a=",list_a)
print("b=",list_b)

a= [1, 2, 3, 4, 5]
b= [1, 2, 3, 4, 5]


In [37]:
# Change ONLY list_b
list_b[0] = 'Protein bar'
print("a=",list_a)
print("b=",list_b)

a= [1, 2, 3, 4, 5]
b= ['Protein bar', 2, 3, 4, 5]


This time, list a did not change. Now we have TWO separate objects. Therefore, changing list b does not affect list a and vice versa.
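
One caveat: `copy()` makes a *shallow* copy. The sketch below (variable names are illustrative) uses `is` to confirm that the copy is a distinct list object, then shows that a nested list is still shared between the two. `copy.deepcopy()` from the standard library copies nested objects as well.

```python
import copy

original = [1, 2, ['x', 'y']]
shallow = original.copy()
deep = copy.deepcopy(original)

print(shallow is original)        # False: two separate list objects
print(shallow[2] is original[2])  # True: the nested list is shared

shallow[2].append('z')            # also visible through 'original'
print(original)
print(deep)                       # the deep copy is unaffected
```

For flat lists like the ones in this section, `copy()` is all you need; the distinction only matters once lists contain other mutable objects.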

## Converting Lists to Strings and Back
Using the split() method to separate a string on a delimiter (e.g., a comma) creates a list object.

In [43]:
states = "Missouri, Alabama, Texas, Washington, Florida"
print("States is of ", type(states))

states = states.split(",")
print(states, "this is of",type(states))

States is of  <class 'str'>
['Missouri', ' Alabama', ' Texas', ' Washington', ' Florida'] this is of <class 'list'>


In [44]:
print(states[2])

 Texas


In [45]:
states = ','.join(states)
print(states, "this is of",type(states))

Missouri, Alabama, Texas, Washington, Florida this is of <class 'str'>
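
Splitting on `","` alone leaves a leading space on every item after the first, as the earlier output shows. Here is a minimal sketch of two common ways to avoid that (the variable names are illustrative): split on the full `", "` delimiter, or strip each piece in a list comprehension.

```python
states_str = "Missouri, Alabama, Texas, Washington, Florida"

# Option 1: split on the comma AND the space
by_delimiter = states_str.split(", ")

# Option 2: split on the comma, then strip whitespace from each piece
stripped = [name.strip() for name in states_str.split(",")]

print(by_delimiter)
print(stripped)
```

The second option is more forgiving when the spacing after each comma is inconsistent.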


In [47]:
# Remember that you may split on any character.
#   Here is an example of splitting an email address
#   into the user name (string prior to the '@' symbol)
#   and the domain (the part following the '@' symbol).

addr = 'monty@python.org'
uname, domain = addr.split('@')
split_email = addr.split('@' )
print(f"user name = {uname}")
print(f"domain name = {domain}")
print(split_email)

user name = monty
domain name = python.org
['monty', 'python.org']


## Using List Comprehensions

List comprehensions are a compact method to build lists using a single line of code.

Basic syntax
```python
[ expr for item in iterable ]
```

Instead of:

In [49]:
# Iteration method to load a list
L = []
for n in range(0,12):
    L.append(n ** 2)
print(L)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]


In [50]:
# Loading items using a List Comprehension
L = [n ** 2 for n in range(12)]
print(L)

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121]


## Removing Items from a List

In [58]:
# Remove elements by index
t = ['a', 'b', 'c', 'd', 'e']
del t[1]
print(t)

['a', 'c', 'd', 'e']

In [63]:
# Delete the last item on the list. Returns the item deleted. Mutates list.
x = t.pop()

print('Deleted ', x)
print('New list is ', t)

Deleted  e
New list is  ['a', 'c', 'd']


In [60]:
# Remove specific element (e.g., remove 'Chicago'), mutates the list.
print(cities)

cities.remove('Chicago')
print(cities)

In [56]:
# ERROR: remove() raises a ValueError if the item is not in the list.
cities.remove('St. Louis')

ValueError: list.remove(x): x not in list

Be careful when removing items from a list. If you attempt to remove items while iterating over the same list, items may be skipped. 

In [65]:
my_list = [1,2,3,4,5,6,7,8,9,10]

In [66]:
# The intent of this code is to remove numbers greater than 5 from my list.

for item in my_list:
    if item > 5:
        my_list.remove(item)

# ERROR: However, 7 and 9 remain because they were skipped as items were removed.
print(my_list)

[1, 2, 3, 4, 5, 7, 9]


One solution is to use a list comprehension.

In [67]:
my_list = [1,2,3,4,5,6,7,8,9,10]

# Only keep items in the list that are less than 6
my_list = [item for item in my_list if item < 6]

print(my_list)

[1, 2, 3, 4, 5]


Another solution is to iterate over the list in reverse. That way, deleting an item near the end (e.g., the item at index 9) doesn't shift the positions of the items that have not been visited yet.

In [68]:
my_list = [1,2,3,4,5,6,7,8,9,10]

# Iterate over the list in reverse order and delete items greater than 5
for item in reversed(my_list):
    if item > 5:
        my_list.remove(item)
        
print(my_list)

[1, 2, 3, 4, 5]


## Testing for membership

Use the in keyword to test for list membership.

In [69]:
print(cities)
print("Dallas" in cities)
print("Tuscaloosa" in cities)

['Dallas', 'Austin', 'Grand Rapids', 'Columbia', 'Tempe', 'Atlanta']
True
False


## Iterating a list

Use a for loop to iterate a list.

In [72]:
# Loop through list
for city in cities:
    print(city)

Dallas
Austin
Grand Rapids
Columbia
Tempe
Atlanta


Use the len() function to determine how many items are in the list and use that within a range() function.

In [73]:
for i in range(len(cities)):
    print(cities[i],end=" ")

Dallas Austin Grand Rapids Columbia Tempe Atlanta 
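
When you need both the position and the item, the built-in `enumerate()` is a common alternative to `range(len(...))`. A minimal sketch, using a shortened sample list (`sample_cities` is an illustrative name):

```python
sample_cities = ['Dallas', 'Austin', 'Grand Rapids']

# enumerate() yields (position, item) pairs; start=1 makes the count human-friendly
for position, city in enumerate(sample_cities, start=1):
    print(position, city)
```

This avoids indexing into the list by hand inside the loop.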

## List Concatenation


In [74]:
# Use the '+' operator to concatenate lists
a = [1,2,3]
b = [4,5,6]
c = a + b # does not mutate 'a' or 'b'

print("c = ", c)

print("a = ", a)
print("b = ", b)

c =  [1, 2, 3, 4, 5, 6]
a =  [1, 2, 3]
b =  [4, 5, 6]


## Extending a list

In [75]:
print("list 'a' = ", a)
print("list 'b' =", b)

a.extend(b) # This combines a and b, mutates a but not b

print("list 'a' =", a)
print("list 'b' =", b)
print(c)

list 'a' =  [1, 2, 3]
list 'b' = [4, 5, 6]
list 'a' = [1, 2, 3, 4, 5, 6]
list 'b' = [4, 5, 6]
[1, 2, 3, 4, 5, 6]


In [76]:
# Use the '*' operator to repeat items
print(a * 3)

[1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]


## Slicing Lists
You can return parts of a list using slicing operators. Other objects (e.g., strings and tuples) can also be sliced.

In [77]:
# Slicing operations

t = ['a', 'b','c','d','e','f','g']

# return the 2nd and 3rd elements in t
print(t[1:3])

['b', 'c']


In [78]:
# Omitting the first parameter tells the interpreter to start at the beginning
print(t[:3])

['a', 'b', 'c']


In [79]:
# Omitting the second parameter tells the interpreter to continue to the end
# start with the fourth element (index 3) and return all elements to the end of the list
print(t[3:])

['d', 'e', 'f', 'g']


In [81]:
# Get the last item
print("Last item in the list = ", t[-1])

Last item in the list =  g


In [82]:
# Use negative indexing to replace the last item in the list
t[-1] = 'watermelon'
print(t)

['a', 'b', 'c', 'd', 'e', 'f', 'watermelon']
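
As noted at the start of this section, strings and tuples can be sliced the same way. The sketch below (with illustrative values) also shows the optional third slice parameter, the step.

```python
word = "watermelon"
numbers = (0, 1, 2, 3, 4, 5, 6)

print(word[:5])       # first five characters
print(numbers[2:5])   # a slice of a tuple is a tuple
print(numbers[::2])   # every second element
print(word[::-1])     # a step of -1 reverses the sequence
```

A slice always returns a new object of the same type as the original, so neither `word` nor `numbers` is modified.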


## Sorting Lists

In [87]:
# Use sorted() to display a sorted list but not mutate it.
my_letters = ['n','r','y','x','a','w']

print("sorted list = ", sorted(my_letters))
print("my_letters = ", my_letters)

sorted list =  ['a', 'n', 'r', 'w', 'x', 'y']
my_letters =  ['n', 'r', 'y', 'x', 'a', 'w']


In [84]:
# Use the sort() method to sort the items in a list
my_letters.sort()
print(my_letters)

['a', 'n', 'r', 'w', 'x', 'y']


In [85]:
# Don't do this...sort() returns "None"
my_letters = my_letters.sort() 
print(my_letters)

None


In [88]:
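# Sort in descending order
#   (re-create the list first, because the previous cell set my_letters to None)
my_letters = ['n','r','y','x','a','w']
my_letters.sort(reverse=True)
print(my_letters)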

['y', 'x', 'w', 'r', 'n', 'a']
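
Both `sorted()` and `sort()` also accept a `key` argument, a function applied to each item before comparing. A minimal sketch with made-up names, showing why it matters for mixed-case strings:

```python
names = ['bob', 'Alice', 'dave', 'Carol']

default_order = sorted(names)                  # uppercase letters sort before lowercase
ignore_case = sorted(names, key=str.lower)     # compare lowercase versions instead

print(default_order)
print(ignore_case)
```

The `key` function only affects the comparison; the items in the result keep their original spelling.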


## Use a list to return more than one value from a function

In [89]:
# Use a list to return more than one value from a function
def tip_options(amount):
    # Use a list to return three tipping options (10%, 15%, 20%)
    return[amount, amount*1.10, amount*1.15, amount*1.2]

print(tip_options(30))
print(type(tip_options(30)))

[30, 33.0, 34.5, 36.0]
<class 'list'>


## Working With Nested Lists

In [90]:
my_list = [['Ford','Chevrolet','Volkswagen'],
           ['F150','Suburban','Passat'],
           ['Big Bang Theory','Young Sheldon','Mindhunter']]

print(my_list[0][1]) # row 1 (index 0), item 2 (index 1)
print(my_list[2][2]) # row 3 (index 2), item 3 (index 2)


Chevrolet
Mindhunter


In [91]:
# Use a list to swap values
y = 5
x = 10

print('x =', x, 'y =', y)

[x, y] = [y, x]

print('x =', x, 'y =', y)

x = 10 y = 5
x = 5 y = 10


# Sets
* Sets are unordered.
* Set elements are unique. Duplicate elements are not allowed.
* You may add or remove items from the set, but you cannot edit an item in a set.
* Accessing items by index (e.g., myset[1]) is NOT supported.
* Sets are denoted by curly braces.
* Membership tests are more efficient using sets than lists or tuples.

You can define a set using the set() function.
```python
x = set(<iter>)
```

In [93]:
my_list = ['a','b',1, 'c', 1]
set2 = set(my_list)
print(set2)
print(my_list)

{'b', 1, 'a', 'c'}
['a', 'b', 1, 'c', 1]


You can also create a set using curly braces {}. However, you cannot create an empty set using a pair of curly braces like you can for a list.

In [94]:
# INCORRECT
my_set = {}  # <-- results in a dictionary, NOT a set
print(type(my_set))

# Instead use the set constructor
my_set = set()
print(type(my_set))

<class 'dict'>
<class 'set'>


In [95]:
# Use curly braces to create a set
my_set = {1,1, 6,7, 3, 5,5,5,5,5, 'red'}
print(type(my_set))
print(my_set)

<class 'set'>
{1, 3, 5, 6, 7, 'red'}


In [None]:
# Error - sets are unordered and not accessible by subscript
my_set[0]
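
A minimal sketch of what happens when you index a set, and one way around it (the set `colors` is illustrative): catch the `TypeError`, or convert the set to a sorted list when you need a stable, indexable order.

```python
colors = {'red', 'green', 'blue'}

try:
    colors[0]
except TypeError as err:
    print("TypeError:", err)

# Convert to a sorted list when a stable, indexable order is needed
ordered = sorted(colors)
print(ordered[0])
```

You can still loop over a set with `for`; only positional access is unsupported.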

## Why do I care about sets?
Sets in Python provide the same benefits as sets in mathematics. Sets contain a well-defined collection of distinct objects called elements. Using the set object enables you to efficiently perform set operations such as union and intersection.

![](images/data_science_diagram.png) <br>
(image source: https://towardsdatascience.com)
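
The set operations mentioned above are available both as methods (e.g., `union()`, `intersection()`) and as operators. A minimal sketch with two small illustrative sets:

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

print(a | b)   # union: items in either set
print(a & b)   # intersection: items in both sets
print(a - b)   # difference: items in a but not in b
print(a ^ b)   # symmetric difference: items in exactly one of the sets
```

The operators require both operands to be sets, while the methods accept any iterable as an argument.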

## Creating sets
Use curly braces to denote a set or use the set() constructor. If you use set(), you must provide an iterable as the argument.

In [96]:
# Persons with expertise in specific areas
cs_expertise = {"Bill", "Matt", "Alexandra", "Joe", "Dexter"}
stats_expertise = set(["Dexter", "Subha", "Brad", "Bruce"])
business_expertise = {"Kay","Jonathan","Dexter","Suzanne", "Matt"}

The set() constructor accepts any iterable (e.g., a list) and drops duplicate values.

In [None]:
# Use the set() constructor to create a set; the argument must be <iter> (an iterable, e.g., a list)
my_set2 = set(['foo', 'bar', 3.141, 'bar'])
print(my_set2)

In [None]:
# Error creating tropical_fruits set using the set() constructor...why?
tropical_fruits = set("Guava", "Dragon Fruit", "Banana","Banana")
temperate_fruits = {"Apple", "Peach", "Plum"}

all_fruit = tropical_fruits.union(temperate_fruits)
print(all_fruit)
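
The cell above raises a `TypeError` because `set()` accepts at most one argument, and that argument must be an iterable. A corrected sketch wraps the values in a list (curly braces would work too):

```python
# Wrap the values in a single iterable (here, a list)
tropical_fruits = set(["Guava", "Dragon Fruit", "Banana", "Banana"])
temperate_fruits = {"Apple", "Peach", "Plum"}

all_fruit = tropical_fruits.union(temperate_fruits)
print(sorted(all_fruit))  # sorted() gives a predictable display order
```

The duplicate "Banana" is stored only once, so `tropical_fruits` ends up with three elements.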

In [97]:
# Who might be suited for Data Science (intersection of three topics)
data_scientists = cs_expertise.intersection(stats_expertise, business_expertise)
print(data_scientists)

{'Dexter'}


In [98]:
#Empty sets are evaluated as False
loch_ness_monsters = set()
print("The set of Loch Ness Monsters is " + str(bool(loch_ness_monsters)))
print()

#You can add, update, and remove items, but you cannot change items in a set
loch_ness_monsters.add("Marvin")
print("Added Marvin to monster set...")
print("The set of Loch Ness Monsters is " + str(bool(loch_ness_monsters)), loch_ness_monsters)
print("The length of the monster set is " + str(len(loch_ness_monsters)))

The set of Loch Ness Monsters is False

Added Marvin to monster set...
The set of Loch Ness Monsters is True {'Marvin'}
The length of the monster set is 1


In [100]:
# Reduce this list of grades to only have unique values
grades = {81,100,81,89,99,99,99,76,94,93,86,75,88,96,76,87,90,81,78,99,83,94,75,83,92,96,81,99,89,99,98,100,95,84,94,97,100,92,97,98,92,95,88,90,98,87,86,95,86,84,91,87,88,83,89,84,98,75,90,100,79,83,94,89,93,84,83,94,84,93,97,75,81,91,84,78,89,96,97,99,90,98,83,93,96,98,91,77,98,97,76,98,75,89,92,81,83,84,82,94,89,77,96,94,100,86,79,87,78,83,86,89,99,77,96,88,91,86,89,99,82,83,92,91,84,83,76,89,90,82,75,84,83,81,96,87,90,82,93,76,86,100,81,88,100,94,84,99,77,91,92,98,88,90,83,88}
print(grades)

{75, 76, 77, 78, 79, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100}


In [None]:
c_and_higher = set(range(75,101))

missing_grades = grades.symmetric_difference(c_and_higher)

print("What grades are missing from 75-100?: " + str(missing_grades))
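
`symmetric_difference()` returns the items that appear in exactly one of the two sets. It answers "what is missing?" only when one set is entirely contained in the other, as is the case with the grades above. A small sketch with made-up values shows how it differs from a plain difference:

```python
seen = {75, 76, 78, 101}
expected = set(range(75, 80))   # 75 through 79

missing = expected - seen       # expected values that were never seen
either_only = expected ^ seen   # also includes 101, which is only in 'seen'

print(sorted(missing))
print(sorted(either_only))
```

When the question is strictly "which expected values are absent?", `difference()` (or the `-` operator) states that intent more directly.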

# Dictionaries

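Think of dictionaries like a list, but with a flexible index. A list index must be an integer, but the keys used to look up values in a dictionary can be any hashable object, which includes immutable built-in types such as strings, numbers, and tuples of immutable items.

Dictionaries use key-value pairs to store and retrieve data. Since Python 3.7, dictionaries preserve the order in which keys were inserted, but items are accessed by key rather than by position. In other languages this structure might be called an *associative array*.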


## Creating dictionaries
Use curly braces and a colon to indicate to the interpreter that you are creating a dictionary data structure. A key can be any hashable object, such as a string, a number, or a tuple of immutable items.
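
A minimal sketch of the restriction on keys, using an illustrative `board` dictionary: a tuple works as a key, while a list (which is mutable) raises a `TypeError`.

```python
board = {}
board[(0, 0)] = 'X'        # a tuple of integers works as a key

try:
    board[[0, 1]] = 'O'    # a list cannot be used as a key
except TypeError as err:
    print("TypeError:", err)

print(board)
```

Only the tuple key is stored; the failed assignment leaves the dictionary unchanged.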

Pretty Print is a module that displays dictionaries in a more human-readable format.

In [1]:
import pprint as pp
import pathlib
import json

In [2]:
# The employee ID is associated with the employee name
employees = {"2334":"Greg Bott", "2335":"John Gilbert", "2336":"Bill Hampton","2337":"Joe Odom"}
pp.pprint(employees)

{'2334': 'Greg Bott',
 '2335': 'John Gilbert',
 '2336': 'Bill Hampton',
 '2337': 'Joe Odom'}


In [3]:
# Create an empty dictionary
person = {}

#Display the type of the 'person' variable
print(type(person))


person[1000] = {'first_name':'Greg', 'last_name':'Bott', 'spouse':'Amy', 'children':['John Davis', 'Piper', 'Will', 'Truett'], 'pets':{'Bama':'dog', 'TJ':'cat'}}
person[1001] = {'first_name':'Joe', 'last_name':'Devlin', 'spouse':'Suzanne', 'children':['CK', 'Alan', 'Devin', 'Tom'], 'pets':{'Orangey':'gold fish', 'Hammer':'turtle'}}

pp.pprint(person)

<class 'dict'>
{1000: {'children': ['John Davis', 'Piper', 'Will', 'Truett'],
        'first_name': 'Greg',
        'last_name': 'Bott',
        'pets': {'Bama': 'dog', 'TJ': 'cat'},
        'spouse': 'Amy'},
 1001: {'children': ['CK', 'Alan', 'Devin', 'Tom'],
        'first_name': 'Joe',
        'last_name': 'Devlin',
        'pets': {'Hammer': 'turtle', 'Orangey': 'gold fish'},
        'spouse': 'Suzanne'}}


### Creating a dictionary from a JSON file

In [7]:
# create Path object for file path
fb_game_path = pathlib.Path('files', '2017_Alabama_v_LSU.json')

if fb_game_path.exists():
    # open the data file in text mode, decoding it as UTF-8
    with open(fb_game_path, 'r', encoding='utf-8') as fp:

        # read the file contents into a string
        json_str = fp.read()

        # parse the JSON string into Python objects (here, a dictionary)
        game_data = json.loads(json_str)

        print(f'The game file includes {len(game_data)} records.')
else:
    print(f'{fb_game_path.name} does not exist')

The game file includes 8 records.
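
If you don't have the game file at hand, the same idea can be tried on an inline string. The JSON text below is a made-up stand-in that only mirrors the `drives`, `previous`, and `displayResult` keys used in this notebook; it is not the real game data.

```python
import json

# Hypothetical, heavily shortened stand-in for the game file
json_text = '{"drives": {"previous": [{"displayResult": "Punt"}, {"displayResult": "Touchdown"}]}}'

# json.loads() parses a JSON string: objects become dicts, arrays become lists
game = json.loads(json_text)

print(type(game))
results = [drive["displayResult"] for drive in game["drives"]["previous"]]
print(results)
```

For a file on disk, `json.load(fp)` reads and parses in one step, given an open file object `fp`.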


In [8]:
for drive in game_data['drives']['previous']:
    print(drive['displayResult'])

Punt
Punt
Touchdown
Punt
Punt
Interception
Touchdown
Field Goal
Punt
Punt
Punt
End of Half
Punt
Punt
Punt
Punt
Punt
Touchdown
Touchdown
Field Goal
Punt
Punt
Punt
Punt
Downs
End of Game


In [9]:
list(game_data.items())[:10]

[('scoringPlays',
  [{'period': {'number': 1},
    'homeScore': 7,
    'awayScore': 0,
    'scoringType': {'displayName': 'Touchdown',
     'name': 'touchdown',
     'abbreviation': 'TD'},
    'id': '400933904101939101',
    'text': 'J. Hurts pass,to I. Smith Jr. for 4 yds for a TD, (A. Pappanastos KICK)',
    'clock': {'displayValue': '6:08', 'value': 368},
    'team': {'uid': 's:20~l:23~t:333',
     'displayName': 'Alabama Crimson Tide',
     'logo': 'http://a.espncdn.com/i/teamlogos/ncaa/500/333.png',
     'links': [{'href': 'http://www.espn.com/college-football/team/_/id/333/alabama-crimson-tide',
       'text': 'Clubhouse'},
      {'href': 'http://www.espn.com/college-football/team/schedule/_/id/333',
       'text': 'Schedule'}],
     'id': '333',
     'abbreviation': 'ALA'},
    'type': {'id': '67', 'text': 'Passing Touchdown', 'abbreviation': 'TD'}},
   {'period': {'number': 1},
    'homeScore': 7,
    'awayScore': 0,
    'scoringType': {'displayName': 'Extra Point',
     'name'

In [10]:
for drive in game_data['drives']['previous']:
    for play in drive['plays']:
        print(play['text'])

Cameron Gamble kickoff for 57 yds , Henry Ruggs III return for 18 yds to the Alab 26
Damien Harris run for 5 yds to the Alab 31
Jalen Hurts sacked by Christian LaCouture for a loss of 10 yards to the Alab 21
Jalen Hurts pass complete to Robert Foster for 14 yds to the Alab 35
JK Scott punt for 56 yds, fair catch by D.J. Chark at the LSU 9
LSU Penalty, False Start (Edward Ingram) to the LSU 5
Derrius Guice run for 4 yds to the LSU 9
Danny Etling pass complete to Foster Moreau for 1 yd to the LSU 10
Danny Etling pass complete to Stephen Sullivan for 31 yds to the LSU 41 for a 1ST down
Danny Etling pass complete to Derrius Guice for 3 yds to the LSU 44
Danny Etling pass incomplete to D.J. Chark
Danny Etling pass incomplete
Zach Von Rosenberg punt for 46 yds, fair catch by Xavian Marks at the Alab 10
Bo Scarbrough run for a loss of 5 yards to the Alab 5
Bo Scarbrough run for 4 yds to the Alab 9
Jalen Hurts pass complete to Calvin Ridley for 15 yds to the Alab 24 for a 1ST down
Bo Scarbroug

## Accessing data in a dictionary
Use ```dict[key]``` to return the value from the key-value pair. If the key doesn't exist, Python raises a KeyError. Use the get() method to retrieve values and handle missing keys more gracefully.

You can also use ```keys()``` to list the keys in your dictionaries, ```values()``` to access the values of the key-value pairs, or ```items()``` to access both.
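
A minimal sketch of these access methods on a small dictionary (`staff` is an illustrative subset of the employees data). The view objects are wrapped in `list()` only to make the printed output easier to read.

```python
staff = {"2334": "Greg Bott", "2335": "John Gilbert"}

print(list(staff.keys()))
print(list(staff.values()))
print(list(staff.items()))           # (key, value) tuples

print(staff.get("9999"))             # missing key: returns None
print(staff.get("9999", "unknown"))  # missing key: returns the supplied default
```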

In [11]:
# Using the employee ID (key), display the name of the employee (value)
print(employees["2334"])

Greg Bott


In [12]:
# Print the first_name attribute of the person 1001 key.
print(person[1001]['first_name'])

Joe


### Using the get() method
Use the get() method to look up the value for a key. Using get() avoids a KeyError if the desired key does not exist. Instead, Python returns None (or a default value that you supply).

In [None]:
print(employees.get("2334"))
print(employees.get("9999"))

In [None]:
# You can also provide a default value if a key does not exist
print(employees.get('ss_number', 'no SS# provided'))

In [None]:
# Use a list as a value in a dictionary
make_model = {"Ford":["Mustang","Explorer","Focus"],"Volkswagen":["Passat","Jetta","Beetle"]}
print(make_model["Ford"])

In [None]:
print(person[1000]['first_name'])

In [None]:
# Replace values using a key
pp.pprint(person[1001])
person[1001]['pets'] = {'flying squirrel':'Rocky'}
pp.pprint(person[1001])

### Using a loop to examine a dictionary
Although looping through a dictionary fails to take advantage of the speed of a dictionary lookup, sometimes you may find it useful. Remember that dictionaries contain key/value pairs and that you must loop through them differently than you would a list.

In [13]:
# Attempting to iterate through a dictionary as you would a list 
#   will yield only the keys
for x in employees:
    print(x)

2334
2335
2336
2337


In [14]:
# Instead, use the items() method to return both the key and the value
for x, y in employees.items(): # x = key; y = value
    print(x,y) 

2334 Greg Bott
2335 John Gilbert
2336 Bill Hampton
2337 Joe Odom


### Using a loop to add values to a dictionary.
So far we have manually added items to a dictionary. Most often you will add items programmatically (e.g., using a loop) rather than manually. Below is part of the code I use to find duplicate files using an MD5 hash. A hash function is a one-way algorithm that maps an object to a fixed-length string. Files with identical contents always produce the same hash, and different files are very unlikely to share one by accident, so the hash serves as a fingerprint of the file.

We'll use the os module to access the file system and the hashlib to apply the MD5 algorithm to the files and then store them in a dictionary using the MD5 value as the key.

In [17]:
import os
import hashlib
import pprint as pp

# Create a blank dictionary
os_files = {}

# Hash Function
def hashfile(path, blocksize=65536):

    # Create an MD5 hasher object
    hasher = hashlib.md5()

    # Open the file in binary mode; the with statement closes it automatically
    with open(path, 'rb') as file_to_hash:

        # Load part of the file (the size of the block)
        #    We divide the file into smaller parts to avoid using all the
        #    available RAM
        buf = file_to_hash.read(blocksize)

        # Add blocks of the file until the buffer is empty
        while len(buf) > 0:
            hasher.update(buf)
            buf = file_to_hash.read(blocksize)

    # Return the MD5 hex digest of the file
    return hasher.hexdigest()

for file in os.listdir():
    # Use error handling to skip entries that cannot be read
    #    (e.g., directories or file permission issues)
    try:
        # Use the hash value (hex digest) as the key and use the file name as the value
        os_files[hashfile(file)] = file
    except OSError:
        pass
    
# Display the Dictionary
pp.pprint(os_files)

{'05aade0e76dfa34214f277937185ec35': 'Pandas_Fundamentals.ipynb',
 '0aff0bd7bf61db1fce8bc930d931e538': 'Selenium.ipynb',
 '0cfa707ca34410d87772c3c8b7fffd3f': 'Web Scraping-Requests-BS4.ipynb',
 '16ca058e744f4ba3ec97c3643c879e0f': 'demofile.txt',
 '327cba190668447db87feb5f6785aec9': 'requirements.txt',
 '359c3e68977fcba90476d7d25049e4e2': '.gitignore',
 '39bcab32f8c269ce04379b2fb404327b': 'Python Fundamentals-3-Data '
                                     'Structures.ipynb',
 '4c36da94662e0cd564c37d96b87f7262': 'Microsoft_Store Data Description.txt',
 '4dc083148b1bda2375631e63bb3d5dfc': 'mbox-short.txt',
 '52d1d2eb9e51d1775678976652e7e533': 'Python fundamentals-2.ipynb',
 '750f62e8b5cce295a6621795b9cd41d6': 'README.md',
 '92717f8bed6606196888253673fdf8f6': 'Database programming.ipynb',
 '92b8fca6d9cea8708ef71895e89c8fbf': 'Python Fundamentals-1.ipynb',
 'a4d99705417a14861b81f7d7402b81ac': 'Matplotlib_Introduction.ipynb',
 'cd761456f101104bae14792cf5df677c': 'Untitled.ipynb',
 'edee6f156e
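
Because each hash maps to a single file name in the dictionary above, a second file with the same contents would overwrite the first entry. One possible way to keep every name is to store a list per hash. The sketch below assumes small in-memory byte strings in place of real files; `fake_files` and its contents are hypothetical.

```python
import hashlib
from collections import defaultdict

# Hypothetical in-memory "files": name -> contents (bytes)
fake_files = {
    "a.txt": b"hello",
    "b.txt": b"world",
    "copy_of_a.txt": b"hello",
}

# defaultdict(list) creates an empty list the first time a hash is seen
files_by_hash = defaultdict(list)
for name, contents in fake_files.items():
    digest = hashlib.md5(contents).hexdigest()
    files_by_hash[digest].append(name)

# Any hash with more than one name points at duplicate contents
duplicates = [names for names in files_by_hash.values() if len(names) > 1]
print(duplicates)
```

With real files, the `hashfile()` function from the cell above could supply the digest in place of the inline `hashlib.md5(contents)` call.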

### Check for Values in a Dictionary

To determine if a value is present in the list stored under a key, use the *in* keyword.

In [None]:
print("Focus" in make_model["Ford"])
print("Explorer II" in make_model["Ford"])

### Access Specific item in key value

Individual values associated with a key may be accessed by an index value.

In [None]:
# Print the third value associated with the Ford key.
print(make_model['Ford'][2])

### Check for Keys in a Dictionary

In [None]:
search_key = "Focus"
if search_key in make_model:
    print(f"'{search_key}' key found in dictionary!")
else:
    print(f"'{search_key}' key NOT found in dictionary.")

### Attempting to Access Keys that Don't exist
If you attempt to access a key that does not exist within the dictionary, Python raises a KeyError exception.


In [None]:

person[1000] = {'first_name':'Greg', 'last_name':'Bott', 'spouse':'Amy', 'children':['John Davis', 'Piper', 'Will', 'Truett'], 'pets':{'Bama':'dog', 'TJ':'cat'}}
person[1001] = {'first_name':'Joe', 'last_name':'Devlin', 'spouse':'Suzanne', 'children':['CK', 'Alan', 'Devin', 'Tom'], 'pets':{'Orangey':'gold fish', 'Hammer':'turtle'}}


In [None]:
# ERROR: KeyError (key does not exist in the dictionary)
print(person['fname'])
print(person['ss_number'])
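
A minimal sketch of two ways to deal with a missing key, using a cut-down illustrative dictionary named `people`: catch the `KeyError`, or test for the key with `in` before accessing it.

```python
people = {1000: {'first_name': 'Greg'}}

try:
    print(people['fname'])
except KeyError as err:
    print("Missing key:", err)

# Checking first with 'in' avoids the exception
if 1000 in people:
    print(people[1000]['first_name'])
```

Note that `'fname'` fails here for two reasons: the top-level keys are the integers 1000 and 1001, and the nested key is spelled `'first_name'`.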

### Updating dictionary values
Assigning a value to an existing key/value pair will replace the value.

You can also use the update method to replace multiple values in a dictionary and add new values.

```update()``` takes a dictionary as its parameter.

In [None]:
person[1000].update({'first_name':'Gregory','ss_number':'123-45-6789','middle':'Hamilton'})
pp.pprint(person)

In [None]:
# Use the append() method of the list stored under a key to add a value
#   Here we are adding Sally to Joe Devlin's children
person[1001]['children'].append('Sally')
pp.pprint(person[1001])

### Removing an item from the dictionary
You can use del or pop to remove items from a dictionary. Just as in a list, ```pop``` returns the deleted value, which you can store in a variable.

In [None]:
# 'pets' is a key of the nested dictionary, so select the person first
del person[1000]['pets']
print(person)

In [None]:
print(person)
ss_number = person[1000].pop('ss_number')
print(ss_number)
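
Like indexing, `pop()` raises a `KeyError` when the key is missing, unless you pass a default as the second argument. A minimal sketch with an illustrative `record` dictionary:

```python
record = {'first_name': 'Greg', 'ss_number': '123-45-6789'}

removed = record.pop('ss_number')          # returns the removed value
missing = record.pop('ss_number', None)    # key is gone now; the default avoids a KeyError

print(removed)
print(missing)
print(record)
```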

### Clear a dictionary
Use the clear() method to empty the contents of a dictionary.

In [None]:
print(person)
person.clear()
print(person)

# Regular Expressions
A regular expression (RegEx) is a sequence of characters, written in a specialized syntax, that defines a pattern for matching strings or sets of strings. Python has a built-in package called re, which can be used to work with regular expressions. To use the re package, import re.

## Raw Strings

To keep Python from treating backslashes in a RegEx pattern as string escape sequences, prefix the pattern string with 'r' (e.g., ```r'\d+'```).
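
A minimal sketch of why the prefix matters, using the word-boundary token `\b` (the sample text is made up). In a normal string, Python converts `\b` to a backspace character before the pattern ever reaches the `re` module.

```python
import re

print(len('\b'))    # 1: Python turns \b into a single backspace character
print(len(r'\b'))   # 2: the raw string keeps the backslash and the 'b'

# Only the raw version reaches the regex engine as the word-boundary token
raw_matches = re.findall(r'\bcat\b', 'cat concat cat')
plain_matches = re.findall('\bcat\b', 'cat concat cat')

print(raw_matches)
print(plain_matches)
```

The raw pattern finds the two standalone words, while the plain pattern looks for literal backspace characters and finds nothing.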

## Regex Cheat Sheet

source: https://regexone.com

For an excellent interactive tutorial, go to https://regexone.com/lesson/introduction_abcs

To test and learn more about RegEx, https://regexr.com/ is also a helpful site.

abc…	Letters<br>
123…	Digits<br>
\d	Any Digit<br>
\D	Any Non-digit character<br>
.	Any Character<br>
\.	Period<br>
[abc]	Only a, b, or c<br>
[^abc]	Not a, b, nor c<br>
[a-z]	Characters a to z<br>
[0-9]	Numbers 0 to 9<br>
\w	Any Alphanumeric character<br>
\W	Any Non-alphanumeric character<br>
{m}	m Repetitions<br>
{m,n}	m to n Repetitions<br>
\*	Zero or more repetitions<br>
\+	One or more repetitions<br>
?	Optional character<br>
\s	Any Whitespace<br>
\S	Any Non-whitespace character<br>
^…$	Starts and ends<br>
(…)	Capture Group<br>
(a(bc))	Capture Sub-group<br>
(.*)	Capture all<br>
(abc|def)	Matches abc or def<br>
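
To try a few entries from the cheat sheet, here is a minimal sketch on a made-up sentence (the text and variable names are illustrative only).

```python
import re

text = "Order 66 shipped 2008-01-05 to monty@python.org"

digit_runs = re.findall(r'\d+', text)             # + : one or more digits
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)    # {m} : exact repetitions
emails = re.findall(r'[a-z]+@[a-z.]+', text)      # [a-z] : character ranges

# Capture groups pull out the parts of a match
match = re.search(r'(\w+)@(\w+)\.(\w+)', text)

print(digit_runs)
print(dates)
print(emails)
print(match.groups())
```

`findall()` returns a list of every match, while `search()` returns a match object for the first match (or None when nothing matches).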

In [None]:
# Import the built-in Regular Expressions package
import re
import pprint as pp

email_header = "From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008 Return-Path: <postmaster@collab.sakaiproject.org> for <source@collab.sakaiproject.org>;Received: (from apache@localhost) Author:  stephen.marquard@uct.ac.za"

found_text = re.findall(r'\d\d:\d\d:\d\d', email_header)
print(found_text)
print("found text is of type",type(found_text))

author = re.findall(r'Author:\s+\S+', email_header)
print(author)

In [None]:
# The with statement closes the file automatically
with open("files/mbox-short.txt", "r") as mboxfile:

    for line in mboxfile:
        line = line.rstrip()

        # Search for lines that contain 'F', followed by any 2 characters, followed by 'm:'
        if re.search(r'F..m:', line):
            print(line)

In [None]:
# Store all email addresses into a list (deep treatment of finding emails using Regex: https://www.regular-expressions.info/email.html)
all_emails_list = []
with open("files/mbox-short.txt", "r") as mboxfile:
    for line in mboxfile:
        line = line.rstrip()
        x = re.findall(r'\S+@\S+\.\D\D\D', line)
        #x = re.findall(r'[A-Za-z0-9._%+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,}', line)
        if len(x) > 0:
            all_emails_list.extend(x)
print(len(all_emails_list))
pp.pprint(all_emails_list) # Many duplicate emails


In [None]:
# Collect every 'rev=' followed by any five characters, then count the unique values
all_revs_list = []
with open("files/mbox-short.txt", "r") as mboxfile:
    for line in mboxfile:
        line = line.rstrip()
        x = re.findall(r'rev=.....', line)
        if len(x) > 0:
            all_revs_list.extend(x)

print(all_revs_list)
all_revs_set = set(all_revs_list)
print(len(all_revs_set))