# Numpy and Pandas
---

@author = Ajay Raj and Roland Chin

An informal introduction to your least-worst enemy in the realm of data science. Numpy is an optimized math library for Python. Most of the optimization occurs in C, and some neat Python backend tinkering allows us to interface with it in Python. The code is vectorized as much as possible, which means that there's a heavy focus on using arrays (treated as n-dimensional vectors) to do operations. This is a shift away from doing looped operations.

For example:
If you wanted to compute the dot product of two arrays [1, 2, 3, 4, 5] and [5, 4, 3, 2, 1], you could either loop through the arrays

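```python
arr1 = [1, 2, 3, 4, 5]
arr2 = [5, 4, 3, 2, 1]

total = 0  # named total to avoid shadowing the built-in sum()
for v1, v2 in zip(arr1, arr2):
    total += v1 * v2
```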

Or, you could perform all the multiplications at once, and then add them together. That's basically what Numpy does behind the scenes. So doing the dot product in numpy is very simple:

```python
import numpy as np

arr1 = np.array(arr1) # casting lists to np arrays
arr2 = np.array(arr2)
dp = arr1.dot(arr2)
```
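To check that the two approaches agree, here is a self-contained sketch using the same two arrays. The names `loop_total` and `vectorized_total` are ours, chosen for illustration.

```python
import numpy as np

arr1 = [1, 2, 3, 4, 5]
arr2 = [5, 4, 3, 2, 1]

# Looped version: multiply pairs one at a time and accumulate.
loop_total = 0
for v1, v2 in zip(arr1, arr2):
    loop_total += v1 * v2

# Vectorized version: NumPy multiplies and sums in compiled code.
vectorized_total = np.array(arr1).dot(np.array(arr2))

print(loop_total, vectorized_total)  # 35 35
```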

## Table of Contents

We will be going over the following:

* **basic operations**

* **array indexing**

* **broadcasting**

* **linear algebra**

* **conditions**

* **table manipulation**

Before we begin: a vector is a one-dimensional array, and a matrix is a two-dimensional array. In NumPy, we represent both with np.array, which also handles higher dimensions. There is also np.matrix, but it is always 2-dimensional and harder to manipulate, and the NumPy documentation no longer recommends it, so we use arrays throughout.

## Basic Operations

In [256]:
import numpy as np

NumPy is a Python library for numerical computing, including linear algebra. Under the hood it stores data in typed arrays and runs operations in compiled code, which makes them fast enough for large-scale data processing. Libraries like Pandas (covered later) are built on top of it.

NumPy holds data in **arrays**.

In [257]:
v = np.array([1, 2, 3, 4, 5]) #creating an array
v

array([1, 2, 3, 4, 5])

You can **index** an element of an array by putting its index number in brackets after the name of the array.

Note: array indices start at **0**.

In [258]:
v[2]

3

In [259]:
v[5] #why does this error?

IndexError: index 5 is out of bounds for axis 0 with size 5

**Exercise**: Let's say you want the sum of the 3rd and 4th elements of v. Click the box in the following cell so you see a flashing cursor, delete the #'s, and write a line of code that does just that.

In [260]:
summed = ###
summed

SyntaxError: invalid syntax (<ipython-input-260-5655bbe3078f>, line 1)

### Indexing 2-D Arrays in Numpy

What is a 2-D array? It's an array of arrays. Also referred to as a matrix. 

This is what a 2D list looks like in vanilla Python.
```python
A = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    ]
```

Accessing the number 5 takes two steps, however. Plain Python has no built-in way to index 2 layers deep in a single operation, so you have to index into the nested lists one at a time:

```python
# getting the number 5 from A
# A[1] = [4, 5, 6]
# A[1][1] = 5
five = A[1][1]
```

When you store an array as an np.array, you are not only gaining a runtime speedup, you're also getting a speedup in writing your code because you now have advanced indexing! 

Now, we'll show how to index in a similar 4x4 array in numpy's array format.

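The function ```randn``` draws samples from the standard normal distribution (the bell-shaped curve with mean 0 and variance 1). Because the values are random, the numbers you get when you run these cells will differ from the ones shown here.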

In [322]:
A = np.random.randn(4,4)
A

array([[-0.42545507, -0.15433492,  0.36459258,  1.71364059],
       [ 0.71037643, -0.72130325,  0.16869925,  0.48739668],
       [ 1.52071333, -1.00802986,  2.31464319, -0.240767  ],
       [ 0.54184979, -0.46216755, -0.39040407, -0.23267452]])

Let's multiply all its values by 10 so it's easier to read.

In [323]:
A = A * 10
A

array([[ -4.25455072,  -1.54334921,   3.64592575,  17.13640592],
       [  7.10376431,  -7.21303245,   1.68699247,   4.87396684],
       [ 15.20713327, -10.08029864,  23.14643195,  -2.40766998],
       [  5.41849794,  -4.62167548,  -3.90404073,  -2.32674517]])

That's better, but there are still a lot of decimals. Let's go 1 step further and convert all the values to integers with the astype function, which casts an array to a specified type. Note that casting to int does not round to the nearest integer: it drops the fractional part, so -1.54 becomes -1.

In [324]:
A = A.astype(int)
A

array([[ -4,  -1,   3,  17],
       [  7,  -7,   1,   4],
       [ 15, -10,  23,  -2],
       [  5,  -4,  -3,  -2]])
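Here is a quick check of what `astype(int)` does to the fractional part, using a few made-up values: the cast drops everything after the decimal point instead of rounding to the nearest integer. If you want nearest-integer rounding, calling `np.round` before the cast is one option.

```python
import numpy as np

values = np.array([-4.25, -1.54, 3.65, 17.14])

truncated = values.astype(int)           # drops the fractional part (toward zero)
rounded = np.round(values).astype(int)   # rounds to nearest first, then casts

print(truncated)  # [-4 -1  3 17]
print(rounded)    # [-4 -2  4 17]
```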

Now let's select certain values.

In [325]:
A[2, 2] # the element in row index 2, column index 2

23

You can also use a colon ```[start:end]``` to select multiple elements. For example:

```[1:5]``` is equivalent to "from element index 1 to 4" (Python doesn't include the last index)

```[1:]``` is equivalent to "from element index 1 to all the way to the end"

```[0:]``` is equivalent to ```[:]```, which selects everything (for a Python list this makes a copy; for a NumPy array it returns a view of the same data)

Omitting the first index defaults to the beginning (0), and leaving the second index blank defaults to the length of whatever is being sliced.
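One detail worth checking for yourself: slicing a Python list produces a new list, whereas slicing a NumPy array produces a view that shares data with the original. The sketch below uses made-up values; `arr.copy()` is the explicit way to get an independent array.

```python
import numpy as np

py_list = [1, 2, 3]
list_slice = py_list[:]      # slicing a list makes a new list
list_slice[0] = 99
print(py_list)               # [1, 2, 3] -> original unchanged

arr = np.array([1, 2, 3])
arr_view = arr[:]            # slicing an array returns a view on the same data
arr_view[0] = 99
print(arr)                   # [99  2  3] -> original changed

arr2 = np.array([1, 2, 3])
arr_copy = arr2.copy()       # ask for an independent copy explicitly
arr_copy[0] = 99
print(arr2)                  # [1 2 3]
```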

In a 2D array, since there's rows and columns, the slice operator will take ```[row start:row end, column start:column end]```.

In [263]:
A[1, :] # this will return the 2nd row of the matrix, with all of its column elements (indices 0 to 3)

array([14, 16,  0,  0])

In [264]:
A[:, 2] # this will return the 3rd column of the matrix, with all of its row elements (indices 0 to 3)

array([-9,  0, -3,  0])

This is how you get the 2x2 matrix at the bottom right hand corner of A.

In [265]:
A[2:, 2:]

array([[-3,  4],
       [ 0,  7]])

**Exercise:** What if you wanted the 2x2 matrix in the middle of A? Uncomment the code below (click in the box so you have a flashing cursor, and delete the #), erase the underscore blanks, and fill in the correct numbers.

In [266]:
# A[_:_, _:_]

In [267]:
A.shape #the shape attribute holds the dimensions of an array, which is useful for telling how many elements there are

(4, 4)

The ```reshape``` function, when called on an array, arranges its elements into a new shape. The total number of elements in the old and new shapes must be equal.

In [268]:
A.reshape((8,2)) # will reshape and fill it in by rows

array([[ -2,  12],
       [ -9,  -3],
       [ 14,  16],
       [  0,   0],
       [  0,   2],
       [ -3,   4],
       [ 27, -11],
       [  0,   7]])

In [269]:
A.reshape((2, 8))

array([[ -2,  12,  -9,  -3,  14,  16,   0,   0],
       [  0,   2,  -3,   4,  27, -11,   0,   7]])
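The element-count rule can be seen on a small array with known contents. This sketch uses `np.arange(16)` instead of random values so the result is reproducible; the name `M` is ours.

```python
import numpy as np

M = np.arange(16).reshape((4, 4))   # 16 elements arranged as 4 rows of 4

print(M.reshape((8, 2)).shape)      # (8, 2): 8 * 2 == 16, so this is allowed
print(M.reshape((2, -1)).shape)     # (2, 8): -1 lets NumPy infer the missing dimension

try:
    M.reshape((5, 3))               # 5 * 3 == 15 != 16
except ValueError as err:
    print("reshape failed:", err)
```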

## Broadcasting

The most important thing NumPy does is **broadcasting**, which means that it allows for arithmetic operations on arrays of different shapes.

Broadcasting is when a smaller array is "repeated" across a larger array so they have compatible shapes, and arithmetic can be done between them.

More information can be found here: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html

The following is an example of array arithmetic:

In [270]:
a = np.array([1.0, 2.0, 3.0])
b = 2.0
a * b

array([2., 4., 6.])

And here's what happens if you multiply 2 arrays, whose sizes are equal.

In [271]:
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])
a * b

array([2., 4., 6.])

The rule of thumb is that NumPy does arithmetic operations elementwise; when one array has size 1 along a dimension (or is missing that dimension entirely), NumPy **broadcasts** it by repeating its values along that dimension.

In [272]:
a = np.array([1.0, 2.0, 3.0])

B = np.zeros((3, 3)) # the zeros function creates an array of all zeros with the given shape, here 3 x 3

a + B

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])

In [273]:
a = np.array([[1.0], [2.0], [3.0]])

B = np.zeros((3, 3))

a + B

array([[1., 1., 1.],
       [2., 2., 2.],
       [3., 3., 3.]])

In [274]:
1 + np.zeros((3, 3))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.]])

What if we want to add a constant vector to each row of a matrix? In the following example, the shapes of the arrays are different, so b is broadcast across every row of a before the elementwise sum is performed.

In [275]:
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([1, 0, 1])
c = a + b
c

array([[ 2,  2,  4],
       [ 5,  5,  7],
       [ 8,  8, 10]])
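One way to picture what happened: broadcasting behaves as if the smaller array were repeated once per row. The sketch below makes that repetition explicit with `np.tile` and confirms the results match. It also shows that shapes which cannot be lined up raise an error. The name `b_repeated` is ours.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])   # shape (3, 3)
b = np.array([1, 0, 1])                           # shape (3,)

# What broadcasting does conceptually: repeat b once per row of a.
b_repeated = np.tile(b, (3, 1))                   # shape (3, 3)

print(np.array_equal(a + b, a + b_repeated))      # True

# Shapes that cannot be lined up raise an error instead of guessing.
try:
    a + np.array([1, 0])                          # shape (2,) vs (3, 3)
except ValueError as err:
    print("cannot broadcast:", err)
```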

You can also sum a matrix along a chosen axis. With axis=0 the values are added down each column (one result per column), and with axis=1 they are added across each row (one result per row). In effect it _**collapses**_ the specified axis.

For example if A is a 2x2 matrix and you sum with axis = 0, it will sum the columns vertically, and with axis = 1, it will sum the rows horizontally.

In [276]:
A = [[1, 3], [4, 5]] #remember, this looks like [1, 3]
np.sum(A, axis=0)    #                          [4, 5]

array([5, 8])

Now try summing with axis = 1.

In [328]:
np.sum(A, axis=1)

array([4, 9])
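The "collapsing" is easiest to see on a non-square matrix, where the two axes have different lengths. This is a small sketch with a made-up 2x3 matrix `M`.

```python
import numpy as np

M = np.arange(6).reshape((2, 3))   # [[0, 1, 2],
                                   #  [3, 4, 5]]

col_sums = np.sum(M, axis=0)       # collapses the 2 rows -> one sum per column
row_sums = np.sum(M, axis=1)       # collapses the 3 columns -> one sum per row

print(col_sums, col_sums.shape)    # [3 5 7] (3,)
print(row_sums, row_sums.shape)    # [ 3 12] (2,)
```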

Let's try an example with another 2D matrix. Uncomment A (again by deleting the #) and erase the underscore blanks, filling in numbers to make the output equal to ```[3, 12, 8]```.

In [329]:
#A = [[2, _, 5], [_, 9, _]]

In [330]:
np.sum(A, axis=0) #test your values by running this

array([ 23, -22,  24,  17])

## Linear Algebra

NumPy also supports many linear algebra operations.

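A vector can be thought of as a matrix with a single column (or a single row). Both vectors and matrices can be written as nested Python lists using brackets, as shown below; NumPy functions convert such lists to arrays when needed.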

In [331]:
vector = [[3], [1], [3]] #this is a vector
vector

[[3], [1], [3]]

In [332]:
matrix = [[1, 2, 3], [3, 1, 2], [2, 1, 3]] #this is a 3x3 matrix
matrix

[[1, 2, 3], [3, 1, 2], [2, 1, 3]]

Let's practice working with matrices.

In [333]:
A = 10 * np.random.rand(3, 3) #this creates a random 3x3 matrix with values in [0, 1), and multiplies each element by 10
A = A.astype(int)
A

array([[9, 8, 5],
       [0, 4, 9],
       [3, 1, 9]])

With matrices in numpy, you can easily transpose with ```T```.

In [334]:
A.T

array([[9, 0, 3],
       [8, 4, 1],
       [5, 9, 9]])

In [335]:
x = np.ones(3) #here we're creating a vector with 3 elements with value 1
x

array([1., 1., 1.])

Another cool thing we can do with NumPy arrays is matrix multiplication with the ```@``` operator, which plain Python lists do not support.

In [336]:
A @ x

array([22., 13., 13.])

In [337]:
np.dot(A, x) # equivalent to above

array([22., 13., 13.])

In [338]:
A * x # elementwise multiplication with broadcasting, not a matrix product! see the broadcasting section

array([[9., 8., 5.],
       [0., 4., 9.],
       [3., 1., 9.]])
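Because x is all ones above, the difference between `@` and `*` is easy to miss. The sketch below reuses the same A with a made-up x of `[1, 2, 3]`, and shows how the two operations relate: summing each row of the elementwise product recovers the matrix-vector product.

```python
import numpy as np

A = np.array([[9, 8, 5],
              [0, 4, 9],
              [3, 1, 9]])
x = np.array([1, 2, 3])

matvec = A @ x                      # matrix-vector product: each row of A dotted with x
elementwise = A * x                 # broadcasting: column j of A scaled by x[j]

print(matvec)                       # [40 35 32]
print(elementwise.sum(axis=1))      # [40 35 32] -> summing each row recovers A @ x
```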

Let's say we want to solve a matrix equation, $Ax = b$, where A and b are as follows:

In [339]:
A = [[2, 3, -2], [1, 0, -4], [2, -1, -6]]
b = [[8], [1], [4]]

To find x, you multiply both sides by the inverse of A. Since $A^{-1}A$ is the 3x3 identity matrix, this leaves $x = A^{-1}b$. Numpy is perfect for this: the inverse can be found with ```np.linalg.inv```.

In [340]:
A_inverse = np.linalg.inv(A)
A_inverse

array([[ 0.33333333, -1.66666667,  1.        ],
       [ 0.16666667,  0.66666667, -0.5       ],
       [ 0.08333333, -0.66666667,  0.25      ]])

Now we have to multiply $A^{-1}$ and $b$. Remember the ```@``` operator?

In [341]:
x = A_inverse @ b
x

array([[5.],
       [0.],
       [1.]])

You could also do it in 1 line, with either of the following:

In [342]:
np.dot(np.linalg.inv(A), b) #np.linalg.inv(A) @ b would also work

array([[5.],
       [0.],
       [1.]])

In [343]:
np.linalg.solve(A, b) #we didn't tell you this, but np.linalg.solve solves a linear matrix equation of the form Ax=B

array([[ 5.],
       [-0.],
       [ 1.]])
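A good habit is to plug the solution back into the equation. The sketch below does that for the same A and b, and confirms the two approaches agree up to floating-point error (which is also where the `-0.` above comes from). In general, `np.linalg.solve` is the usual recommendation over forming the inverse explicitly, since it does less work and tends to be more accurate numerically.

```python
import numpy as np

A = np.array([[2, 3, -2], [1, 0, -4], [2, -1, -6]])
b = np.array([[8], [1], [4]])

x_inv = np.linalg.inv(A) @ b        # explicit inverse, then multiply
x_solve = np.linalg.solve(A, b)     # solves Ax = b without forming the inverse

print(np.allclose(x_inv, x_solve))  # True
print(np.allclose(A @ x_solve, b))  # True: plugging x back in reproduces b
```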

## Other Operations

In [344]:
a = np.random.rand(100)

In [345]:
a.mean() #finds the mean of an array of numbers

0.5235958364215484

In [346]:
a.sum() #finds the sum of an array of numbers

52.35958364215484

In [347]:
np.median(a) #finds the median of an array of numbers

0.5193358683897059

## Conditions

Now we're going to see how we can select certain elements, based on conditions that we specify. Sometimes you don't want all the rows and columns from a matrix you're given.

In [348]:
A = np.arange(2, 20, 3)  #arange is similar to range(): it takes a start, an end (exclusive), and a step size
A

array([ 2,  5,  8, 11, 14, 17])

In [299]:
A[A % 2 == 0] #the modulo operator gives the remainder when x is divided by y, here we want the even numbers

array([ 2,  8, 14])
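It helps to look at the condition on its own: it evaluates to an array of True/False values (a boolean mask), and indexing with that mask keeps only the True positions. The sketch below also combines two conditions, which is a common next step; the name `mask` is ours.

```python
import numpy as np

A = np.arange(2, 20, 3)             # [ 2  5  8 11 14 17]

mask = A % 2 == 0                   # one True/False per element
print(mask)                         # [ True False  True False  True False]
print(A[mask])                      # [ 2  8 14]

# Combine conditions with & (and) or | (or); the parentheses are required.
print(A[(A % 2 == 0) & (A > 5)])    # [ 8 14]
```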

In [300]:
B = np.random.rand(3, 3) 
B

array([[0.58048757, 0.08651757, 0.30534035],
       [0.41696628, 0.31470651, 0.31346983],
       [0.69569685, 0.75116468, 0.63290204]])

**Exercise** What if we only want the numbers that are less than the mean of all the elements of the matrix? Uncomment the following (erase the #) and fill in the blank between the brackets.

In [306]:
#B[_]

## Exercises

### Broadcasting

Uncomment the x and fill in the 2 blanks so that x + y = z.

In [312]:
# x = np.array([_, _])
y = np.array([[3], [4]])
z = np.array([[4, 5], [5, 6]])
#just a reminder that z looks like [4, 5]
#                                  [5, 6]

In [313]:
x + y

array([[4, 5],
       [5, 6]])

### Linear Algebra

In [353]:
x = np.arange(1000).reshape(1000, 1)
b = np.ones((1000, 1))

X = np.append(x, b, axis=1)
#all you need to know is that this is the X you will use in the equation below

Y = 2 * x + 4 * b + np.random.random()  # shape (1000, 1); x[:,0] would broadcast against b into a 1000 x 1000 array
#all you need to know is that this is the Y you will use in the equation below

Use Least Squares Linear Regression to find $\hat{\theta}$, weights on each column of $X$ such that it models $Y$. Remember, the formula for Least Squares Linear Regression is: 

$$X^TX\hat{\theta} = X^TY$$

In [308]:
# theta_hat = np.linalg.solve(___, ___)
theta_hat

Ellipsis

Find the loss of your model. The loss is the squared norm of the residual:
$$ \|Y - X\hat{\theta}\|^2 $$
_Hint: First use np.linalg.norm on the difference of Y and the product of X and theta hat, then square the result._
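For readers wondering where the normal equation above comes from, here is a short sketch of the standard derivation. It minimizes the squared norm of the residual, which has the same minimizer as the norm itself; $L$ is our name for that squared loss.

$$L(\hat{\theta}) = \|Y - X\hat{\theta}\|^2 = (Y - X\hat{\theta})^T (Y - X\hat{\theta})$$

$$\nabla_{\hat{\theta}} L = -2X^T(Y - X\hat{\theta}) = 0 \quad \Longrightarrow \quad X^TX\hat{\theta} = X^TY$$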

In [315]:
# loss = 
loss

Ellipsis

## Final Notes on Numpy

Numpy makes scientific computing in Python possible. It's pretty fantastic. But there are many tiny details that might trip you up when using it in a practical setting. Sometimes it will have to do with using functions properly; other times it will be low-level mistakes.

For a common one that often gets me annoyed, see <a href="http://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html">this link</a>.

For readers who are interested in speeding up these operations across multiple computers, or on graphics cards, see:
* <a href="https://github.com/cupy/cupy">CuPy</a> - Used in popular ML libraries
* <a href="https://github.com/enthought/distarray">DistArray</a> - I really don't know much about this

Also, whenever you cry for help with a numpy function, remember you can always call
    ```help(np.arange) ``` in a cell

# Pandas

Pandas is a commonly used data processing library. 

Data is stored in **DataFrame** objects, which are collections of **Series** objects; each Series represents a column.
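A minimal sketch of that relationship, using a tiny made-up table (these rows are illustrative and not taken from the Titanic data):

```python
import pandas as pd

# A tiny made-up table; the names and ages are invented for illustration.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy"],
    "Age": [31.0, 45.0, 27.0],
})

age = df["Age"]                 # selecting one column gives a Series
print(type(df).__name__)        # DataFrame
print(type(age).__name__)       # Series
print(age.max())                # 45.0
```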

We'll go over an example EDA (exploratory data analysis) and feature engineering process on some data in Pandas.

In [35]:
import pandas as pd

titanic_train = pd.read_csv('data/titanic/train.csv')
titanic_test = pd.read_csv('data/titanic/test.csv')

First, let's look at the data itself.

In [36]:
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Next, let's do some data cleaning. Are there any missing values?

In [1]:
titanic_train.isnull().sum()


The first column with missing values is **Age**. One way we can deal with missing *quantitative* data is **imputing** the missing values with the mean of the column.

We use the <a href=https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html>**.fillna**</a> function of Pandas to do this.

In [38]:
titanic_train['Age'] = titanic_train['Age'].fillna(titanic_train['Age'].mean())
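To see mean imputation on something small enough to check by hand, here is a sketch with a made-up Series containing two gaps. Note that `mean()` skips missing values by default, and that `fillna` returns a new Series without modifying the original.

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0, np.nan])   # made-up values with two gaps

print(ages.isnull().sum())     # 2 missing entries
print(ages.mean())             # 24.0 -> mean() skips NaN by default

filled = ages.fillna(ages.mean())
print(filled.tolist())         # [22.0, 24.0, 26.0, 24.0]
print(ages.isnull().sum())     # still 2: fillna returned a new Series
```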

The next column with missing values is **Cabin**. The **Cabin** column is messy in general, so let's investigate it further. We use the <a href=https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html>.unique</a> method to look at the distinct values in the column.

In [39]:
titanic_train['Cabin'].unique()

array([nan, 'C85', 'C123', 'E46', 'G6', 'C103', 'D56', 'A6', 'C23 C25 C27',
       'B78', 'D33', 'B30', 'C52', 'B28', 'C83', 'F33', 'F G73', 'E31',
       'A5', 'D10 D12', 'D26', 'C110', 'B58 B60', 'E101', 'F E69', 'D47',
       'B86', 'F2', 'C2', 'E33', 'B19', 'A7', 'C49', 'F4', 'A32', 'B4',
       'B80', 'A31', 'D36', 'D15', 'C93', 'C78', 'D35', 'C87', 'B77',
       'E67', 'B94', 'C125', 'C99', 'C118', 'D7', 'A19', 'B49', 'D',
       'C22 C26', 'C106', 'C65', 'E36', 'C54', 'B57 B59 B63 B66', 'C7',
       'E34', 'C32', 'B18', 'C124', 'C91', 'E40', 'T', 'C128', 'D37',
       'B35', 'E50', 'C82', 'B96 B98', 'E10', 'E44', 'A34', 'C104', 'C111',
       'C92', 'E38', 'D21', 'E12', 'E63', 'A14', 'B37', 'C30', 'D20',
       'B79', 'E25', 'D46', 'B73', 'C95', 'B38', 'B39', 'B22', 'C86',
       'C70', 'A16', 'C101', 'C68', 'A10', 'E68', 'B41', 'A20', 'D19',
       'D50', 'D9', 'A23', 'B50', 'A26', 'D48', 'E58', 'C126', 'B71',
       'B51 B53 B55', 'D49', 'B5', 'B20', 'F G63', 'C62 C64', 'E24',

We can also look at the counts of each value in the column.

In [40]:
titanic_train['Cabin'].value_counts().head()

B96 B98        4
G6             4
C23 C25 C27    4
F2             3
C22 C26        3
Name: Cabin, dtype: int64

It seems that each entry is a Floor letter followed by a room number. However, some entries list multiple cabins, and some are even more unusual, such as "T" and "F E69". There are many ways to approach this data, but for now, let's just take the Floor letter from each cabin and place it into a new column.

Note: this may not be the best way to use the Cabin column: if the goal is to predict if a person survived, it may be important to save not just the floor but also the cabin number---i.e. if different people stay in the same room, maybe they all survived or all died.

In [41]:
titanic_train['Floor'] = titanic_train['Cabin'].apply(lambda cabin: cabin[0] if type(cabin) != float else cabin)
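The `type(cabin) != float` check works because missing values are stored as NaN, which is a float. As a sketch of an alternative, the vectorized `.str` accessor takes the first character and propagates missing values on its own. The sample cabins below are made up in the same style as the real column.

```python
import numpy as np
import pandas as pd

cabins = pd.Series(["C85", np.nan, "B57 B59", "F G73"])   # made-up sample values

# Element-by-element version: keep NaN (a float) as is, else take the first character.
via_apply = cabins.apply(lambda cabin: cabin[0] if type(cabin) != float else cabin)

# Vectorized string accessor: first character, missing values stay missing.
via_str = cabins.str[0]

print(via_apply.fillna("null").tolist())   # ['C', 'null', 'B', 'F']
print(via_str.fillna("null").tolist())     # ['C', 'null', 'B', 'F']
```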

In [42]:
titanic_train['Floor'].value_counts()

C    59
B    47
D    33
E    32
A    15
F    13
G     4
T     1
Name: Floor, dtype: int64

In [43]:
titanic_train['Floor'].unique()

array([nan, 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

Let's also take a look at the types of data in some of the rest of the columns.

In [44]:
titanic_train['Sex'].unique()

array(['male', 'female'], dtype=object)

In [45]:
titanic_train['SibSp'].unique()

array([1, 0, 3, 4, 2, 5, 8])

In [46]:
titanic_train['Pclass'].unique()

array([3, 1, 2])

In [47]:
titanic_train['Parch'].unique()

array([0, 1, 2, 5, 3, 4, 6])

In [48]:
titanic_train['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

## Dropping Columns, Inplace

Above, when we said:

In [49]:
titanic_train['Age'] = titanic_train['Age'].fillna(titanic_train['Age'].mean())

We had to assign the result of **.fillna** back to the column. This is because almost all Pandas functions are **non-destructive** by default: if you perform an operation on a column, Pandas returns a new column rather than modifying the old one.

For example, the <a href=https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html>**.drop**</a> method will not remove a column from a DataFrame, it will create a copy of the DataFrame without that column:

In [50]:
titanic_train['dummy'] = 1
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Floor,dummy
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C,1
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,,1


In [51]:
titanic_train.drop('dummy', axis=1).head() # axis=1 means drop columns, not rows; to drop rows instead, pass row index labels with axis=0

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Floor
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,


However, if we pass in **inplace=True**, then Pandas will delete the column in the original DataFrame: many other functions in Pandas have this functionality.

In [52]:
titanic_train['dummy'] = 1
titanic_train.drop('dummy', inplace=True, axis=1)
titanic_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Floor
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,C
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,


Be warned: doing things inplace is dangerous! Say, for example, it took a really long time to load in your database (maybe you had to do some web scraping, or you downloaded it directly from a URL and you lost Internet connection). If you do **drop** operations inplace, without saving the original state of the DataFrame, you could lose data.

In general, it is usually a good idea to save your DataFrame in states throughout your EDA.
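One simple way to save a state is `DataFrame.copy()`, which makes an independent snapshot you can roll back to. The sketch below uses a small made-up frame; the name `checkpoint` is ours.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "dummy": [1, 1]})   # small made-up frame

checkpoint = df.copy()                  # independent snapshot of the current state

df.drop("dummy", axis=1, inplace=True)  # destructive: df itself loses the column
print(list(df.columns))                 # ['a']
print(list(checkpoint.columns))         # ['a', 'dummy'] -> snapshot is untouched

df = checkpoint.copy()                  # roll back to the saved state
print(list(df.columns))                 # ['a', 'dummy']
```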

## One-hot encoding

A lot of the Titanic data is **categorical**. One way to prepare this kind of data for predictive modeling is **one-hot encoding**: a column such as "Pclass", which takes 3 different values, is replaced by 3 columns of 0/1 values, one per category. For example, with the values 1, 2, 3, the value 2 turns into [0 1 0].
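To make the idea concrete before bringing in Pandas, here is a sketch that builds the encoding by hand in NumPy for a made-up sample of class labels. It relies on the labels being exactly 1, 2, 3, so that label k maps to row k - 1 of the identity matrix.

```python
import numpy as np

pclass = np.array([3, 1, 2, 3])           # made-up sample of class labels 1, 2, 3

# Row k - 1 of the 3x3 identity matrix is the one-hot vector for label k.
one_hot = np.eye(3, dtype=int)[pclass - 1]

print(one_hot)
# [[0 0 1]
#  [1 0 0]
#  [0 1 0]
#  [0 0 1]]
```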

We use the <a href=https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html>**get_dummies**</a> function in Pandas.

Let's do this for some of the columns in the data.

In [53]:
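titanic_train_copy = titanic_train.copy() # save the state of your DF

def one_hot(df, columns):
    """Return a new DataFrame with each listed column replaced by its one-hot columns."""
    for column in columns:
        # this means one-hot encode the column, and make the column title Pclass_{value}, for example
        # dtype=int gives 0/1 values (newer pandas versions default to True/False)
        col_onehot = pd.get_dummies(df[column], prefix=column, dtype=int)
        # drop without inplace returns a new DataFrame, so the caller's df is left untouched
        df = df.drop(column, axis=1).join(col_onehot)
    return df

titanic_train_one_hot = one_hot(titanic_train_copy, ['Pclass', 'Sex', 'SibSp', 'Parch'])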

In [54]:
titanic_train_one_hot.head()

Unnamed: 0,PassengerId,Survived,Name,Age,Ticket,Fare,Cabin,Embarked,Floor,Pclass_1,...,SibSp_4,SibSp_5,SibSp_8,Parch_0,Parch_1,Parch_2,Parch_3,Parch_4,Parch_5,Parch_6
0,1,0,"Braund, Mr. Owen Harris",22.0,A/5 21171,7.25,,S,,0,...,0,0,0,1,0,0,0,0,0,0
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,PC 17599,71.2833,C85,C,C,1,...,0,0,0,1,0,0,0,0,0,0
2,3,1,"Heikkinen, Miss. Laina",26.0,STON/O2. 3101282,7.925,,S,,0,...,0,0,0,1,0,0,0,0,0,0
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,113803,53.1,C123,S,C,1,...,0,0,0,1,0,0,0,0,0,0
4,5,0,"Allen, Mr. William Henry",35.0,373450,8.05,,S,,0,...,0,0,0,1,0,0,0,0,0,0


NOTE: By default, the .get_dummies function ignores missing values: a row with a missing value gets 0 in every new column. So when one-hot encoding a column with missing values, first fill them with a placeholder value; it then becomes a category that .get_dummies creates a column for.

In [55]:
titanic_train['Floor'] = titanic_train['Floor'].fillna('null')
pd.get_dummies(titanic_train['Floor'], prefix='Floor').head()

Unnamed: 0,Floor_A,Floor_B,Floor_C,Floor_D,Floor_E,Floor_F,Floor_G,Floor_T,Floor_null
0,0,0,0,0,0,0,0,0,1
1,0,0,1,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,1
3,0,0,1,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1
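The effect of missing values on `get_dummies` can be checked on a made-up three-row sample. As an alternative to filling in a placeholder, the sketch also uses the `dummy_na=True` option, which asks Pandas to add a column for missing values itself. `dtype=int` is passed so the output is 0/1 regardless of pandas version.

```python
import numpy as np
import pandas as pd

floor = pd.Series(["C", np.nan, "E"])   # made-up sample with one missing value

default = pd.get_dummies(floor, prefix="Floor", dtype=int)
print(list(default.columns))            # ['Floor_C', 'Floor_E']
print(default.iloc[1].tolist())         # [0, 0] -> the missing row is all zeros

with_na = pd.get_dummies(floor, prefix="Floor", dummy_na=True, dtype=int)
print(with_na.shape[1])                 # 3 -> one extra column for missing values
print(with_na.iloc[1].tolist())         # [0, 0, 1]
```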


### Exercise

Clean the rest of the columns of the Titanic data set and use <a href=http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>sklearn.LogisticRegression</a> to create a model for the 'Survived' column.

# Answers to Exercises

### Broadcasting

In [56]:
x = np.array([1, 2])
y = np.array([[3], [4]])
x + y # what does this output

array([[4, 5],
       [5, 6]])

### Linear Algebra

In [57]:
x = np.arange(1000).reshape(1000, 1)
b = np.ones((1000, 1))
X = np.append(x, b, axis=1)

Y = 2 * np.arange(1000).reshape(1000, 1) + 4 + np.random.random()

In [58]:
theta_hat = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, Y))
theta_hat

array([[ 2.        ],
       [ 4.41398437]])

In [59]:
loss = np.dot((Y - np.dot(X, theta_hat)).T, (Y - np.dot(X, theta_hat)))
loss

array([[  8.10227167e-19]])
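The dot-product form of the loss and the squared norm from the hint are the same quantity. Here is a sketch that checks this on a smaller synthetic problem in the same style; the 10-point data and the fixed offset of 0.5 are our own choices, made so the result is reproducible. It also shows why the fitted intercept is not exactly 4: the constant offset is absorbed into it.

```python
import numpy as np

# Small synthetic problem: Y = 2x + 4 plus a constant offset of 0.5.
x = np.arange(10).reshape(10, 1)
X = np.append(x, np.ones((10, 1)), axis=1)
Y = 2 * x + 4 + 0.5

theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
residual = Y - X @ theta_hat

loss_dot = (residual.T @ residual).item()      # r^T r, the dot-product form
loss_norm = np.linalg.norm(residual) ** 2      # squared Euclidean norm of r

print(np.allclose(loss_dot, loss_norm))          # True
print(np.allclose(theta_hat.ravel(), [2, 4.5]))  # True: the offset ends up in the intercept
```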