# INFO 212: Data Science Programming 1
___

### Week 2, Lecture 1
___

### Mon., April 13, 2020
---

**Question:**
- What built-in capabilities does Python provide for data analysis?

**Objectives:**
- Create, change, and iterate Python tuples, lists, dicts
- Index and slice tuples, lists, and dicts
- Use comprehensions 

## Data Structures and Sequences
Python’s data structures are simple but powerful. Mastering their use is critical for data science programming.

### Tuple
A tuple is a fixed-length, immutable sequence of Python objects. The easiest way to
create one is with a comma-separated sequence of values:

```
# Create a tuple
tup = 4, 5, 6
tup
```

```
# create a nested tuple
nested_tup = (4, 5, 6), (7, 8)
nested_tup
```

```
# any sequence or iterator can be converted to a tuple
tuple([4, 0, 2])
tup = tuple('string')
tup
```

####  Access Elements
Elements can be accessed with square brackets [] as with most other sequence types.

```
tup[0]
```

A tuple is immutable: once it is created, it’s not possible to change which object is stored in each slot. The assignment below raises a TypeError:

```
tup = tuple(['foo', [1, 2], True])
tup[2] = False
```

If an object inside a tuple is mutable, such as a list, you can modify it in-place:

```
tup[1].append(3)
tup
```

You can concatenate tuples using the + operator to produce longer tuples:

```
(4, None, 'foo') + (6, 0) + ('bar',)
```

Multiplying a tuple by an integer, as with lists, has the effect of concatenating together that many copies of the tuple:

```
('foo', 'bar') * 4
```

#### Unpacking tuples
If you try to assign to a tuple-like expression of variables, Python will attempt to
unpack the value on the righthand side of the equals sign.

```
tup = (4, 5, 6)
a, b, c = tup
b
```

Even sequences with nested tuples can be unpacked:

```
tup = 4, 5, (6, 7)
a, b, (c, d) = tup
d
```

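Using this functionality you can easily swap the values of two variables. In many
languages a swap needs a temporary variable, as in the first block below; in Python, unpacking does it in a single statement, as in the second block: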
```
tmp = a
a = b
b = tmp
```

```
a, b = 1, 2
a
b
b, a = a, b
a
b
```

A common use of variable unpacking is iterating over sequences of tuples or lists:

```
seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
for a, b, c in seq:
    print('a={0}, b={1}, c={2}'.format(a, b, c))
```

Python 3 added extended tuple unpacking with the *rest syntax to help
with situations where you may want to “pluck” a few elements from the beginning of
a tuple and collect the remaining elements in a list.

```
values = 1, 2, 3, 4, 5
a, b, *rest = values
a, b
rest
```

As a matter of convention, many Python programmers will use
the underscore (_) for unwanted variables:

```
a, b, *_ = values
```

#### Tuple methods
Since the size and contents of a tuple cannot be modified, it is very light on instance
methods. A particularly useful one (also available on lists) is count, which counts the
number of occurrences of a value:

```
a = (1, 2, 2, 2, 3, 4, 2)
a.count(2)
```

#### Exercise
Create a tuple of lists of student information including name, phone, email. Print out the information 

### List
In contrast with tuples, lists are variable-length and their contents can be modified
in-place. You can define them using square brackets [] or using the list type function:

```
a_list = [2, 3, 7, None]
tup = ('foo', 'bar', 'baz')
b_list = list(tup)
b_list
b_list[1] = 'peekaboo'
b_list
```

The list function is frequently used in data processing as a way to materialize an
iterator or generator expression:

```
gen = range(10)
gen
list(gen)
```
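The same applies to a generator expression, which produces its values lazily. A small sketch, with `squares`, `squares_list`, and `leftover` as illustrative names:

```python
# A generator expression computes nothing until it is consumed
squares = (x ** 2 for x in range(5))

# list() pulls every value out of the generator
squares_list = list(squares)
squares_list  # [0, 1, 4, 9, 16]

# The generator is now exhausted, so a second pass yields nothing
leftover = list(squares)
leftover  # []
```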

#### Adding and removing elements

Elements can be appended to the end of the list with the append method:

```
b_list.append('dwarf')
b_list
```

Using insert you can insert an element at a specific location in the list:

```
b_list.insert(1, 'red')
b_list
```

The inverse operation to insert is pop, which removes and returns an element at a particular index:

```
b_list.pop(2)
b_list
```

Elements can be removed by value with remove, which locates the first such value and
removes it from the list:

```
b_list.append('foo')
b_list
b_list.remove('foo')
b_list
```

Check if a list contains a value using the in keyword:

```
'dwarf' in b_list
```

The keyword not can be used to negate in:

```
'dwarf' not in b_list
```

Checking whether a list contains a value is a lot slower than doing so with dicts and
sets (to be introduced shortly), as Python makes a linear scan across the values of the
list, whereas it can check the others (based on hash tables) in constant time.
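
As a rough sketch of the difference, the same membership test can be written against a list and against a set built from it. The names `values_list` and `values_set` are made up for illustration, and no timings are shown because they vary by machine:

```python
# Hypothetical data: the integers 0..99999 stored two ways
values_list = list(range(100_000))
values_set = set(values_list)

# Both answer the same question; the list is scanned element by element,
# while the set uses a hash lookup
in_list = 99_999 in values_list
in_set = 99_999 in values_set
missing = -1 in values_set
```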

#### Concatenating and combining lists

Adding two lists with + concatenates them into a new list:

```
[4, None, 'foo'] + [7, 8, (2, 3)]
```

If you have a list already defined, you can append multiple elements to it using the
extend method:

```
x = [4, None, 'foo']
x.extend([7, 8, (2, 3)])
x
```

Note that list concatenation by addition is a comparatively expensive operation since
a new list must be created and the objects copied over. Using extend to append elements
to an existing list, especially if you are building up a large list, is usually preferable.
Thus,
```
everything = []
for chunk in list_of_lists:
    everything.extend(chunk)
```

is faster than
```
everything = []
for chunk in list_of_lists:
    everything = everything + chunk
```

#### Sorting

You can sort a list in-place (without creating a new object) by calling its sort
method:
```
a = [7, 2, 5, 1, 3]
a.sort()
a
```

We can sort a collection of strings by their lengths:
```    
b = ['saw', 'small', 'He', 'foxes', 'six']
b.sort(key=len)
b
```

#### Slicing
You can select sections of most sequence types by using slice notation, which in its
basic form consists of start:stop passed to the indexing operator []:

```
seq = [7, 2, 3, 7, 5, 6, 0, 1]
seq[1:5]
```

Slices can also be assigned to with a sequence:
```
seq[3:4] = [6, 3]
seq
```

While the element at the start index is included, the stop index is not included, so
that the number of elements in the result is stop - start.
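
A quick check of the `stop - start` rule, using a fresh list so the result does not depend on the slice assignment above (the names `nums` and `piece` are only for illustration):

```python
nums = [7, 2, 3, 7, 5, 6, 0, 1]
start, stop = 1, 5

piece = nums[start:stop]    # elements at indices 1, 2, 3, 4
piece                       # [2, 3, 7, 5]
len(piece) == stop - start  # True
```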


Either the start or stop can be omitted, in which case they default to the start of the
sequence and the end of the sequence, respectively:
```
seq[:5]
seq[3:]
```

Negative indices slice the sequence relative to the end:
```
seq[-4:]
seq[-6:-2]

# The next examples use a list of characters
seq = ['H', 'E', 'L', 'L', 'O', '!']
seq[-2:-6:-1]      # ['O', 'L', 'L', 'E']
seq[-5:5]          # ['E', 'L', 'L', 'O']
seq[-3] == seq[3]  # True: both refer to the same element
```

The following figure shows a helpful illustration of slicing with positive and negative
integers. In the figure, the indices are shown at the “bin edges” to help show
where the slice selections start and stop using positive or negative indices.
![](https://i.imgur.com/zJA7O16.png)

A step can also be used after a second colon to, say, take every other element:
```
seq[::2]  # ['H', 'L', 'O'] when seq is ['H', 'E', 'L', 'L', 'O', '!']
```

A clever use of this is to pass -1, which has the useful effect 
of reversing a list or tuple:
```
seq[::-1]
```

### Built-in Sequence Functions
Python has a handful of useful sequence functions that you should familiarize yourself
with and use at any opportunity.

#### enumerate

It’s common when iterating over a sequence to want to keep track of the index of the
current item. A do-it-yourself approach would look like:

```
i = 0
for value in collection:
    # do something with value
    i += 1
```

Since this is so common, Python has a built-in function, enumerate, which yields
(i, value) tuples as you iterate:
```
for i, value in enumerate(collection):
    pass  # do something with value

# For example, with a list of characters:
seq = ['H', 'E', 'L', 'L', 'O', '!']
for i, v in enumerate(seq):
    print('value={1} and index={0}'.format(i, v))
# value=H and index=0
# value=E and index=1
# value=L and index=2
# value=L and index=3
# value=O and index=4
# value=! and index=5
```

When you are indexing data, a helpful pattern that uses enumerate is computing a
dict mapping the values of a sequence (which are assumed to be unique) to their
locations in the sequence:

```
some_list = ['foo', 'bar', 'baz']
mapping = {}
for i, v in enumerate(some_list):
    mapping[v] = i
mapping
```

#### sorted

The sorted function returns a new sorted list from the elements of any sequence:
```
sorted([7, 1, 2, 6, 0, 3, 2])
sorted('horse race')
```
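
A short sketch contrasting sorted with the list sort method from earlier; the variable names are illustrative:

```python
original = [7, 1, 2, 6, 0, 3, 2]

new_list = sorted(original)  # returns a new list; original is untouched
unchanged = list(original)   # copy taken before sorting in place

result = original.sort()     # sorts in place and returns None
```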

#### zip

zip “pairs” up the elements of a number of lists, tuples, or other sequences. It returns
an iterator of tuples, which list can materialize:
```
seq1 = ['foo', 'bar', 'baz']
seq2 = ['one', 'two', 'three']
zipped = zip(seq1, seq2)
list(zipped)
```

zip can take an arbitrary number of sequences, and the number of elements it produces
is determined by the shortest sequence:
```
seq3 = [False, True]
list(zip(seq1, seq2, seq3))
```

A very common use of zip is simultaneously iterating over multiple sequences, possibly
also combined with enumerate:
```
for i, (a, b) in enumerate(zip(seq1, seq2)):
    print('{0}: {1}, {2}'.format(i, a, b))
```

Given a “zipped” sequence, zip can be applied in a clever way to “unzip” the
sequence. Another way to think about this is converting a list of rows into a list of
columns. The syntax, which looks a bit magical, is:
```
pitchers = [('Nolan', 'Ryan'), ('Roger', 'Clemens'),
            ('Curt', 'Schilling')]
first_names, last_names = zip(*pitchers)
first_names
last_names
```

#### reversed

reversed iterates over the elements of a sequence in reverse order. It returns an iterator, so the reversed values are not produced until you consume it, for example with list or a for loop:
```
list(reversed(range(10)))
```

### dict
dict is likely the most important built-in Python data structure. A more common
name for it is hash map or associative array. It is a flexibly sized collection of key-value
pairs, where key and value are Python objects. One approach for creating one is to use
curly braces {} and colons to separate keys and values:

```
empty_dict = {}
d1 = {'a' : 'some value', 'b' : [1, 2, 3, 4]}
d1
```

You can access, insert, or set elements using the same syntax as for accessing elements
of a list or tuple:
```
d1[7] = 'an integer'
d1
d1['b']
```

You can check if a dictionary contains a key using the in keyword:
```
'b' in d1
```

You can delete values either using the del keyword or the pop method (which simultaneously
returns the value and deletes the key):
```
d1[5] = 'some value'
d1
d1['dummy'] = 'another value'
d1
del d1[5]
d1
ret = d1.pop('dummy')
ret
d1
```

The keys and values methods give you views of the dict’s keys and values, respectively, which can be iterated over or converted to lists:
```
list(d1.keys())
list(d1.values())
```

You can merge one dict into another in place using the update method. Values for keys that already exist are overwritten:
```
d1.update({'b' : 'foo', 'c' : 12})
d1
```

#### Creating dicts from sequences
It’s common to end up with two sequences that you want to pair up
element-wise in a dict. As a first cut, you might write code like this:

```
mapping = {}
for key, value in zip(key_list, value_list):
    mapping[key] = value
```

Since a dict is essentially a collection of 2-tuples, the dict function accepts a sequence of 2-tuples, such as the output of zip:
```
mapping = dict(zip(range(5), reversed(range(5))))
mapping
```

#### Default values

It is common to write logic like this:
```
if key in some_dict:
    value = some_dict[key]
else:
    value = default_value
```

The dict methods get and pop can take a default value to be returned, so that
the above if-else block can be written simply as:
```
value = some_dict.get(key, default_value)
```
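
A short sketch of how the defaults behave, using a made-up `prices` dict. Without a default, `get` returns `None` for a missing key, while `pop` raises a `KeyError`:

```python
prices = {'apple': 1.5, 'pear': 2.0}

found = prices.get('apple', 0.0)     # key exists, so the default is ignored
fallback = prices.get('kiwi', 0.0)   # missing key, default returned
nothing = prices.get('kiwi')         # no default given, returns None

removed = prices.pop('pear', 0.0)    # removes 'pear' and returns its value
not_there = prices.pop('kiwi', 0.0)  # missing key, default returned, dict unchanged
```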

With setting values, a common case is for the values in a dict to be other collections,
like lists. For example, you could imagine categorizing a list of words by their
first letters as a dict of lists:
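
A minimal sketch of that explicit loop, assuming an illustrative `words` list (the list contents are an assumption, not taken from the lecture):

```python
words = ['apple', 'bat', 'bar', 'atom', 'book']
by_letter = {}

for word in words:
    letter = word[0]
    if letter not in by_letter:
        # first word seen for this letter: start a new list
        by_letter[letter] = [word]
    else:
        by_letter[letter].append(word)

by_letter  # {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}
```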

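The setdefault dict method is for precisely this purpose. It returns the value for a key,
inserting the given default first if the key is missing, so no explicit check for an existing key is needed:
```
words = ['apple', 'bat', 'bar', 'atom', 'book']  # example words
by_letter = {}
for word in words:
    letter = word[0]
    by_letter.setdefault(letter, []).append(word)
```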

The built-in collections module has a useful class, defaultdict, which makes this
even easier. To create one, you pass a type or function for generating the default value
for each slot in the dict:
```
from collections import defaultdict
by_letter = defaultdict(list)
for word in words:
    by_letter[word[0]].append(word)
```

#### Valid dict key types

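Dictionary keys must be hashable. Immutable built-in types such as int, float, str, and tuple (when all of its elements are hashable) qualify; mutable types such as list do not.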
```
hash('string')
hash((1, 2, (2, 3)))
hash((1, 2, [2, 3])) # fails because lists are mutable
```

```
d = {}
d[tuple([1, 2, 3])] = 5
d
```

### set

A set is an unordered collection of unique elements.
```
set([2, 2, 2, 1, 3, 3])
{2, 2, 2, 1, 3, 3}
```

Sets support mathematical set operations such as union, intersection, difference, and symmetric difference:
```
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

# union: elements in either set
a.union(b)
a | b
```

```
a.intersection(b)
a & b
```
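
The difference and symmetric difference operations named above follow the same pattern. A brief sketch, redefining the two sets so the block stands alone (the result variable names are illustrative):

```python
a = {1, 2, 3, 4, 5}
b = {3, 4, 5, 6, 7, 8}

# Elements in a that are not in b
only_in_a = a.difference(b)              # same as a - b

# Elements in exactly one of the two sets
in_one_only = a.symmetric_difference(b)  # same as a ^ b
```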

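Each operation is available both as a method and as an operator. Like dict keys, set elements must be hashable, so a list cannot be a set element but a tuple of hashable values can.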

### List, Set, and Dict Comprehensions

List comprehensions are one of the most-loved Python language features. They allow
you to concisely form a new list by filtering the elements of a collection, transforming
the elements passing the filter in one concise expression. They take the basic form:
```
[expr for val in collection if condition]
```

This is equivalent to the following for loop:
```
result = []
for val in collection:
    if condition:
        result.append(expr)
```
The filter condition can be omitted, leaving only the expression. For example, given a
list of strings, we could keep only the strings longer than two characters and convert those to uppercase like this:

```
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']
[x.upper() for x in strings if len(x) > 2]
```
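
For comparison, a sketch of the same comprehension with the filter condition omitted, so every string is transformed (the list is repeated so the block runs on its own; `all_upper` is an illustrative name):

```python
strings = ['a', 'as', 'bat', 'car', 'dove', 'python']

# No `if` clause: every element is kept and uppercased
all_upper = [x.upper() for x in strings]
all_upper  # ['A', 'AS', 'BAT', 'CAR', 'DOVE', 'PYTHON']
```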

Set and dict comprehensions are a natural extension, producing sets and dicts in an
idiomatically similar way instead of lists. A dict comprehension looks like this:
```
dict_comp = {key-expr : value-expr for value in collection if condition}
```

A set comprehension looks like the equivalent list comprehension except with curly
braces instead of square brackets:
```
set_comp = {expr for value in collection if condition}
```

```
unique_lengths = {len(x) for x in strings}
unique_lengths
```

The same result can be expressed more functionally with the map function:
```
set(map(len, strings))
```

```
loc_mapping = {val : index for index, val in enumerate(strings)}
loc_mapping
```

#### Nested list comprehensions

Suppose we have a list of lists containing some English and Spanish names:
```
all_data = [['John', 'Emily', 'Michael', 'Mary', 'Steven'],
            ['Maria', 'Juan', 'Javier', 'Natalia', 'Pilar']]
```

You might have gotten these names from a couple of files and decided to organize
them by language. Now, suppose we wanted to get a single list containing all names
with two or more e’s in them. We could certainly do this with a simple for loop:
```
names_of_interest = []
for names in all_data:
    enough_es = [name for name in names if name.count('e') >= 2]
    names_of_interest.extend(enough_es)
```

A single nested list comprehension does this nicely:
```
result = [name for names in all_data for name in names
          if name.count('e') >= 2]
```

Flatten a list of tuples:
```
some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
flattened = [x for tup in some_tuples for x in tup]
```

The equivalent nested for loop:
```
flattened = []

for tup in some_tuples:
    for x in tup:
        flattened.append(x)
```

A list comprehension inside a list comprehension is different: it produces a list of lists rather than a flat list:
```
[[x for x in tup] for tup in some_tuples]
```

## Key Points
* tuples are immutable
* list.append() adds an element to the end of a list
* enumerate(sequence) yields (index, element) tuples
* dictionary keys must be hashable (for example strings, numbers, and tuples of hashable values)
* zip combines two or more sequences into a sequence of tuples
* zip(*sequence_of_tuples) unzips into separate sequences
* [x for x in sequence if condition] is a compact way to build a filtered list
* a set is like a dictionary with only keys