# In this lesson, we will look at the basics of Python data structures, as well as some special data structures that come up often in mathematical modeling.

## Lists

The first data structure is a list.  Lists are mutable objects (meaning they can be changed).  They are also ordered.  Let's see this in action.

In [1]:
lst_one = [1,2,3]
lst_two = [3,2,1]

We begin by initializing two toy lists, lst_one and lst_two.  Let's confirm that the order matters.  We can do this by asking Python if they're equal.

In [2]:
lst_one == lst_two

False

<blockquote>As expected, Python informs us that these two lists are not equal even though they contain the same elements, because the elements are in different orders.  Notice that we had to use "==" to make this comparison, since that is the equality comparison operator, which returns a Boolean.  A single "=" assigns a value to a variable.</blockquote>

Now, we can try to change the first list.  We will begin with a common error.  Let's say I want to append a "4" to the end of lst_one.  One natural thing to try is the following:

In [3]:
lst_one = lst_one + 4

TypeError: can only concatenate list (not "int") to list

Notice the error message.  It is telling us that we can only concatenate a list with another list, whereas we tried to add the integer "4" to a list.  What we need instead is this:

In [4]:
lst_one = lst_one + [4] # Note that lst_one += [4] would also work.

Now we can check to verify that it has done what we wanted:

In [5]:
lst_one

[1, 2, 3, 4]
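
Concatenating with "+" builds a new list each time.  As a minimal sketch (the names lst_demo and lst_alias are illustrative and not part of the lesson's variables), the built-in list methods append and extend change a list in place instead:

```python
# Illustrative names: lst_demo and lst_alias are not used elsewhere in the lesson.
lst_demo = [1, 2, 3]
lst_alias = lst_demo          # a second name for the same list object

lst_demo.append(4)            # adds one element in place
lst_demo.extend([5, 6])       # adds every element of another list in place
print(lst_demo)               # [1, 2, 3, 4, 5, 6]
print(lst_alias)              # same object, so it shows the same contents

lst_demo = lst_demo + [7]     # builds a new list and rebinds the name lst_demo
print(lst_demo)               # [1, 2, 3, 4, 5, 6, 7]
print(lst_alias)              # still [1, 2, 3, 4, 5, 6]
```

The difference only matters when another name refers to the same list, but it is a common source of surprises.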

Let us use the len() function to verify the number of elements in lst_one:

In [6]:
len(lst_one)

4

To loop through a list, we can either loop through the elements:

In [25]:
for item in lst_one:
    print(item)

1
2
3
4


Or we can use the fact that lists are ordered and loop through the indices of the list (remembering that Python is zero-indexed).

In [26]:
for i in range(len(lst_one)):
    print(lst_one[i])

1
2
3
4
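
When both the index and the element are needed, the built-in enumerate function provides them together.  This is a small sketch using an illustrative list (lst_letters is not one of the lesson's variables):

```python
lst_letters = ["a", "b", "c"]   # illustrative list, not from the lesson

pairs = []
for i, item in enumerate(lst_letters):
    print(i, item)              # index first, then the element at that index
    pairs.append((i, item))
```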


## Dictionaries

The next data structure we will talk about is the <b>dictionary</b>.  A dictionary is a collection of <b>keys</b> and the values (which can themselves be data structures) attached to those keys.  Dictionaries are mutable and indexed by their keys.

Let's begin by creating a toy dictionary to investigate.

In [7]:
dct_toy = {"a": [1,2,3], "b": [4,5,6], "c": [7,8,9], "d": [0]}

As we can see above, we create dictionaries with curly braces.  We can also see that the keys of a dictionary (a, b, c, and d in this case) can point to any other type of data structure.  In fact, you can even have a dictionary of dictionaries.

If I want to find what is stored under one of the keys, I can look it up like this:

In [8]:
dct_toy["a"]

[1, 2, 3]

In [9]:
type(dct_toy)

dict

In [10]:
type(dct_toy["a"])

list

As you can see, Python treats dct_toy and its entries as different data structures: the dictionary is a dict, while the value stored under "a" is a list.  If I try to look up a key that isn't there, this will happen:

In [11]:
dct_toy["e"]

KeyError: 'e'
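
One way to avoid the KeyError is to check for the key first or to use the dictionary's get method.  The sketch below uses a small stand-in dictionary (dct_demo is an illustrative name) rather than dct_toy itself:

```python
dct_demo = {"a": [1, 2, 3], "d": [0]}   # small stand-in for dct_toy

has_e = "e" in dct_demo                  # membership test on the keys
value_e = dct_demo.get("e")              # returns None instead of raising KeyError
value_default = dct_demo.get("e", [])    # or a default value of our choosing
print(has_e, value_e, value_default)
```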

We can also loop through the keys of a dictionary, like this:

In [12]:
for key in dct_toy:
    if len(dct_toy[key]) == 3:
        print("Not zero!")
    elif len(dct_toy[key]) == 1:
        print("Zero!")
    else:
        print("Something's wrong....")

Not zero!
Not zero!
Not zero!
Zero!


In [13]:
for key in dct_toy:
    print(type(key))
    print(type(dct_toy[key]))
    print(dct_toy[key])

<class 'str'>
<class 'list'>
[1, 2, 3]
<class 'str'>
<class 'list'>
[4, 5, 6]
<class 'str'>
<class 'list'>
[7, 8, 9]
<class 'str'>
<class 'list'>
[0]
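
Looping over the keys and then looking up each value works, but the items method hands back both at once.  A minimal sketch, again on an illustrative stand-in dictionary (dct_demo and lengths are not names from the lesson):

```python
dct_demo = {"a": [1, 2, 3], "d": [0]}   # small stand-in for dct_toy

lengths = {}
for key, value in dct_demo.items():      # each step yields a (key, value) pair
    print(key, value)
    lengths[key] = len(value)
```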


## Tuples

The next data structure on our list is the <b>tuple</b>.

Tuples are both <b>ordered</b> and <b>immutable</b>.  They are written with round parentheses.

In [14]:
(x,y) = (3,2)

In [15]:
type((x,y))

tuple

In [16]:
type(x)

int

As you can see, unpacking the tuple lets you use its elements by name (x and y) without indexing into the tuple, although indexing works equally well.

In [17]:
x

3

In [18]:
(x,y)[0]

3

You can also assign a tuple without explicitly using parentheses.

In [19]:
z,w = 3,4

This is a shortcut that is worth knowing, since you will often see it used without comment in code that you look up.
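
To see what immutability means in practice, here is a small sketch (point and error_message are illustrative names).  Trying to change an element of a tuple raises a TypeError, while the packing and unpacking shortcut also gives a compact way to swap two variables:

```python
point = (3, 2)       # illustrative tuple

try:
    point[0] = 10    # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# Swapping two variables relies on the same packing/unpacking shortcut
z, w = 3, 4
z, w = w, z
print(z, w)          # 4 3
```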

## Sets

The next data structure that we will look at is a <b>set</b>.

A set is denoted with curly braces:

In [21]:
a_set = {1,2,3}

In [24]:
len(a_set)

3

Because empty curly braces create an empty dictionary, an empty set must be created like this:

In [22]:
empty_set = set()

In [23]:
len(empty_set)

0

Sets are mutable but, unlike lists, unordered.  So, if you want to loop through them, you can't use an index; instead, loop over the elements directly, as we did with the keys of a dictionary.
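
The following sketch illustrates these properties on an illustrative set (b_set and total are not names from the lesson).  It also shows one more standard property of Python sets: each element appears at most once.

```python
b_set = {1, 2, 2, 3, 3, 3}   # duplicates are dropped
print(len(b_set))             # 3

b_set.add(4)                  # sets are mutable
b_set.add(1)                  # adding an existing element changes nothing
print(2 in b_set)             # True

total = 0
for element in b_set:         # iteration order is not something to rely on
    total += element
print(total)                  # 10
```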

# Numpy

Next, we will investigate numpy arrays and operations with them.  Numpy is the standard Python module for numerical array computing, including linear algebra operations.

To use numpy, we must begin by importing it and we will use the standard convention:

In [27]:
import numpy as np

A numpy array is an n-dimensional grid of values: a vector, a matrix, or, more generally, a tensor if you are familiar with that term.

They can have many different shapes:

In [31]:
array_1 = np.array(3)
array_2 = np.array([1,2])
array_3 = np.array([[1,2],[3,4]])
array_4 = np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
array_5 = np.array([[1,2,3],[4,5,6]])
array_6 = np.array([[1,2],[3,4],[5,6]])

You can check the dimensions of these arrays easily:

In [32]:
print(array_1.shape)
print(array_2.shape)
print(array_3.shape)
print(array_4.shape)
print(array_5.shape)
print(array_6.shape)

()
(2,)
(2, 2)
(2, 2, 2)
(2, 3)
(3, 2)


In [34]:
print(array_4[0].shape)

(2, 2)


As you can see, I can also index into arrays and find the shapes of the resulting subarrays.
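
Beyond picking out a single entry along the first axis, numpy supports slicing with colons along any axis.  A minimal sketch, using an illustrative array arr with the same values as array_5:

```python
import numpy as np

arr = np.array([[1, 2, 3], [4, 5, 6]])   # same values as array_5

first_row = arr[0]          # indexing drops the first axis
second_col = arr[:, 1]      # a slice: every row, column 1
sub_block = arr[:, 1:]      # every row, columns 1 onward

print(first_row.shape)      # (3,)
print(second_col.shape)     # (2,)
print(sub_block.shape)      # (2, 2)
```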

We can also perform different linear algebra operations with numpy.

In [35]:
np.dot(array_5,array_6)

array([[22, 28],
       [49, 64]])

Before performing any operation with numpy, however, you should always check that it will do what you want it to do.  For instance, the following is not matrix multiplication:

In [36]:
array_5*array_6

ValueError: operands could not be broadcast together with shapes (2,3) (3,2) 

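As you can read in the error, numpy attempts something called <b>broadcasting</b>.  The "*" operator multiplies arrays elementwise, and broadcasting is how numpy tries to make arrays of different shapes compatible for such elementwise operations.  This is a very useful feature, but it is not the matrix multiplication with which we are familiar from mathematics.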
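
To make the distinction concrete, here is a small sketch that redefines array_5 and array_6 so it runs on its own (the names elementwise, row, broadcasted, and matmul are illustrative).  It contrasts elementwise multiplication, a case where broadcasting succeeds, and the "@" operator for matrix multiplication:

```python
import numpy as np

array_5 = np.array([[1, 2, 3], [4, 5, 6]])
array_6 = np.array([[1, 2], [3, 4], [5, 6]])

elementwise = array_5 * array_5    # same shapes: multiplies entry by entry
row = np.array([10, 20, 30])       # shape (3,)
broadcasted = array_5 * row        # row is stretched across both rows of array_5
matmul = array_5 @ array_6         # matrix product, same result as np.dot here

print(elementwise)
print(broadcasted)
print(matmul)
```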

Numpy has a way to perform virtually any operation you would want on a matrix or tensor.  Furthermore, numpy's core routines are written in optimized C, so they are much faster than performing these operations through Python loops (to say nothing of being quicker to type).
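
As an illustration of the "quicker to type" point, the sketch below computes the same squares with a Python loop and with a single array expression.  The names values, squares_loop, and squares_vec are illustrative, and no timing is measured here:

```python
import numpy as np

values = np.arange(1, 6)          # array([1, 2, 3, 4, 5])

# Loop version: square each entry one at a time
squares_loop = []
for v in values:
    squares_loop.append(int(v) ** 2)

# Vectorized version: one expression over the whole array
squares_vec = values ** 2

print(squares_loop)
print(squares_vec.tolist())
```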

## Pandas

Pandas is a phenomenally useful Python module, built on top of Numpy, that makes it much easier to handle tabular data (like the tables of a relational database) in Python.  Of course, before using it, we have to import Pandas.  We will do so with the standard convention.

In [37]:
import pandas as pd

You can create a Pandas dataframe in many different ways.  Below, we'll create a toy one in essentially the same way as we've created other objects before in Python.

In [38]:
df_toy = pd.DataFrame({"a": [1,2,3], "b": [4,5,6]})

In [39]:
df_toy

   a  b
0  1  4
1  2  5
2  3  6


In [41]:
df_toy.shape

(3, 2)

In [42]:
df_toy.columns

Index(['a', 'b'], dtype='object')

In [43]:
len(df_toy)

3

In [44]:
list(df_toy)

['a', 'b']

In [45]:
df_toy.T

   0  1  2
a  1  2  3
b  4  5  6


In [46]:
df_toy.index

RangeIndex(start=0, stop=3, step=1)

In [47]:
type(df_toy)

pandas.core.frame.DataFrame

Let's add a new column to this dataframe. 

In [48]:
df_toy["c"] = [7,8,9]

In [49]:
df_toy

   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9


In [50]:
df_toy["a"].dtype

dtype('int64')

In [51]:
df_toy.head() # Or df_toy.tail()

   a  b  c
0  1  4  7
1  2  5  8
2  3  6  9


Now, we will read CSV files into Pandas.

In [53]:
df_train = pd.read_csv("train.csv")

In [54]:
df_test = pd.read_csv("test.csv")

In [55]:
print(df_train.shape)
print(df_test.shape)

(1460, 81)
(1459, 80)


In [57]:
for item in list(df_train):
    if item not in list(df_test):
        print(item)

SalePrice
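
The same comparison of column names can be written without an explicit loop.  The sketch below uses two toy dataframes with made-up values (df_a and df_b are illustrative stand-ins for df_train and df_test, so the CSV files are not needed) and the difference method of the column index:

```python
import pandas as pd

# Toy stand-ins for df_train and df_test; the values are made up.
df_a = pd.DataFrame({"Id": [1, 2], "LotArea": [10, 20], "SalePrice": [100, 200]})
df_b = pd.DataFrame({"Id": [3, 4], "LotArea": [30, 40]})

# Column names that appear in df_a but not in df_b
only_in_a = df_a.columns.difference(df_b.columns)
print(list(only_in_a))   # ['SalePrice']
```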


In [62]:
pd.concat([df_train.drop("SalePrice",axis=1),df_test], axis=0, ignore_index=True)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,2,2008,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,5,2007,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,9,2008,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,2,2006,WD,Abnorml
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,12,2008,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2914,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,6,2006,WD,Normal
2915,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,4,2006,WD,Abnorml
2916,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,9,2006,WD,Abnorml
2917,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal


In [70]:
pd.concat([df_train,df_test],axis=1)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,120.0,0.0,,MnPrv,,0.0,6.0,2010.0,WD,Normal
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0.0,0.0,,,Gar2,12500.0,6.0,2010.0,WD,Normal
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0.0,0.0,,MnPrv,,0.0,3.0,2010.0,WD,Normal
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0.0,0.0,,,,0.0,6.0,2010.0,WD,Normal
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,144.0,0.0,,,,0.0,1.0,2010.0,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0.0,0.0,,,,0.0,4.0,2006.0,WD,Abnorml
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0.0,0.0,,,,0.0,9.0,2006.0,WD,Abnorml
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0.0,0.0,,MnPrv,Shed,700.0,7.0,2006.0,WD,Normal
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0.0,0.0,,,,0.0,11.0,2006.0,WD,Normal


Pandas can also perform SQL-style operations such as inner, left, right, and outer joins.
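
As a minimal sketch of such joins, the example below uses pd.merge on two toy dataframes.  The frames df_left and df_right, the column name "key", and all values are made up for illustration:

```python
import pandas as pd

# Toy frames sharing a "key" column; all values are made up.
df_left = pd.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
df_right = pd.DataFrame({"key": [2, 3, 4], "b": [20, 30, 40]})

inner = pd.merge(df_left, df_right, on="key", how="inner")   # keys in both: 2, 3
left = pd.merge(df_left, df_right, on="key", how="left")     # all left keys: 1, 2, 3
outer = pd.merge(df_left, df_right, on="key", how="outer")   # union of keys: 1, 2, 3, 4

print(inner)
print(left)
print(outer)
```

Rows without a match on the other side are filled with missing values, which is why column "b" in the left join contains one NaN.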

In [68]:
df_train.loc[:,"LotFrontage"].dtype

dtype('float64')

In [69]:
df_train.loc[:,"LotArea"].dtype

dtype('int64')

In [71]:
df_train.loc[:,"LotShape"].dtype

dtype('O')

In [72]:
df_train.loc[:,"LotShape"] = df_train.loc[:,"LotShape"].astype(str)