In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Linear Algebra

NumPy also supports many linear algebra operations.

In [None]:
A = 10 * np.random.rand(3, 3)
A = A.astype(int)
A

In [None]:
A.T

In [None]:
x = np.ones(3)
x

In [None]:
A @ x # Matrix vector multiplication

In [None]:
np.dot(A, x) # equivalent to above

In [None]:
A * x # element-wise product with x broadcast across the rows, not a matrix product! see the broadcasting section
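
To make the difference concrete, here is a small sketch with fixed values (`M` and `v` are made up for illustration and are not the random `A` above). `@` computes one dot product per row, while `*` broadcasts `v` across the rows and multiplies element by element.

```python
import numpy as np

# Illustrative values, not the random A from the cells above
M = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
v = np.array([1, 0, 2])

matvec = M @ v        # matrix-vector product, shape (3,)
elementwise = M * v   # v is broadcast across each row, shape (3, 3)

print(matvec)                    # [ 7 16 25]
print(elementwise)
print(elementwise.sum(axis=1))   # summing each row recovers the matrix-vector product
```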

In [None]:
def generate_vector_in_subspace(A):
    return np.dot(A, np.random.rand(A.shape[1], 1))

In [None]:
b = generate_vector_in_subspace(A)
b

In [None]:
np.linalg.solve(A, b)

In [None]:
np.dot(np.linalg.inv(A), b)
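
Both cells above assume `A` is invertible, which a random integer matrix is not guaranteed to be; `np.linalg.solve` raises `LinAlgError` for a singular matrix. A minimal sketch, using a hand-picked invertible matrix `M` (an assumption for illustration), that checks the two approaches agree:

```python
import numpy as np

# Hand-picked invertible matrix (determinant 18), for illustration only
M = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
rhs = M @ np.array([1.0, 2.0, 3.0])   # built so the true solution is [1, 2, 3]

via_solve = np.linalg.solve(M, rhs)   # solves M z = rhs without forming the inverse
via_inv = np.linalg.inv(M) @ rhs      # same answer, but computes the full inverse first

print(via_solve)
print(np.allclose(via_solve, via_inv))
```

In practice `np.linalg.solve` is usually preferred, since it avoids forming the inverse explicitly.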

## Conditions

In [None]:
A = np.arange(1, 10).reshape(3, 3) # arange is similar to range()

In [None]:
cond = (A < 5)
A[cond]

In [None]:
# np.random.rand generates a random matrix of some shape
B = np.random.rand(1, 9).reshape(3, 3)
B

In [None]:
B[cond] # selects the first four elements of the matrix (by row)
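
As a further sketch of the same idea (the names `grid`, `mask`, and `other` are made up for illustration), a boolean mask built from one array can select from, or assign into, any array of the same shape:

```python
import numpy as np

grid = np.arange(1, 10).reshape(3, 3)
mask = grid % 2 == 0             # boolean array with the same shape as grid

other = grid * 10
selected = other[mask]           # entries of `other` where `grid` is even, in row-major order
print(selected)                  # [20 40 60 80]

other[mask] = 0                  # masks can also be used for assignment
print(other)
```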

## Exercises

### Broadcasting

In [None]:
x = np.array([1, 2])
y = np.array([[3], [4]])
x + y # what does this output

### Linear Algebra

In [None]:
x = np.arange(1000).reshape(1000, 1)
b = np.ones((1000, 1))
X = np.append(x, b, axis=1)

Y = 2 * x + 4 * b + np.random.random((1000, 1)) # use x, not x[:,0], so Y keeps shape (1000, 1); noise is drawn per row

Use Least Squares Linear Regression to find $\hat{\theta}$, the weights on each column of $X$ that best model $Y$. Remember, $\hat{\theta}$ satisfies the normal equations:

$$X^TX\hat{\theta} = X^TY$$
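
One way to see where this equation comes from (a sketch, assuming the loss being minimized is the sum of squared residuals) is to write down the loss and its gradient:

$$L(\theta) = \lVert X\theta - Y \rVert_2^2, \qquad \nabla_\theta L(\theta) = 2X^T(X\theta - Y)$$

Setting the gradient to zero at $\theta = \hat{\theta}$ gives $X^TX\hat{\theta} = X^TY$.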

In [None]:
theta_hat = ...
theta_hat

Find the loss of your model, for example the mean squared error between $X\hat{\theta}$ and $Y$.

In [None]:
loss = ...
loss

# Pandas

Pandas is a commonly used data processing library. 

Data is stored in **DataFrame** objects. A DataFrame is a collection of **Series** objects, each of which represents a column.
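
A minimal sketch of that relationship, using a toy DataFrame (the `people` data is made up for illustration):

```python
import pandas as pd

# Toy data, made up for illustration
people = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy'],
    'Age': [31, 45, 27],
})

ages = people['Age']           # selecting one column returns a Series
print(type(people).__name__)   # DataFrame
print(type(ages).__name__)     # Series
print(ages.mean())
```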

We'll go over an example EDA (exploratory data analysis) and feature engineering process on some data in Pandas.

In [None]:
titanic_train = pd.read_csv('titanic/train.csv')
titanic_test = pd.read_csv('titanic/test.csv')

First, let's look at the data itself.

In [None]:
titanic_train.head()

Next, let's do some data cleaning. Are there any missing values?

In [None]:
titanic_train.isnull().sum()

The first column with missing values is **Age**. One way we can deal with missing *quantitative* data is **imputing** the missing values with the mean of the column.

We use the Pandas <a href=https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html>**.fillna**</a> method to do this.

In [None]:
titanic_train['Age'] = titanic_train['Age'].fillna(titanic_train['Age'].mean())
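
To see what mean imputation does on a small scale, here is a sketch on a toy Series (the values are made up for illustration). Note that `.mean()` skips missing values by default, so the fill value is the mean of the observed entries only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 40.0, np.nan])   # toy data

fill_value = ages.mean()            # NaN entries are skipped: (20 + 40) / 2
filled = ages.fillna(fill_value)    # returns a new Series; `ages` itself is unchanged

print(fill_value)                   # 30.0
print(filled.tolist())              # [20.0, 30.0, 40.0, 30.0]
print(ages.isnull().sum())          # 2
```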

The next column with missing values is **Cabin**. The values in this column are irregular, so let's investigate further. We use the <a href=https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html>.unique</a> method to look at the distinct values of the column.

In [None]:
titanic_train['Cabin'].unique()

We can also look at the counts of each value in the column.

In [None]:
titanic_train['Cabin'].value_counts().head()
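
One difference worth keeping in mind when a column has missing values: `.unique` includes NaN among the distinct values, while `.value_counts` leaves it out unless asked. A small sketch on toy values (made up for illustration):

```python
import numpy as np
import pandas as pd

cabins = pd.Series(['C85', np.nan, 'C85', 'E46', np.nan])   # toy values

distinct = cabins.unique()                        # distinct values, including NaN
counts = cabins.value_counts()                    # counts per value; NaN is excluded by default
counts_with_nan = cabins.value_counts(dropna=False)   # include NaN in the counts

print(distinct)
print(counts)
print(counts_with_nan)
```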

Each entry seems to have a Floor letter and a room number. However, some entries list multiple cabins, and some are even more unusual: "T", "F E69". There are many ways to approach this data, but for now, let's just take the Floor letter from each cabin and place it into a new column.

Note: this may not be the best way to use the Cabin column. If the goal is to predict whether a person survived, it may be important to keep not just the floor but also the cabin number; for example, people who stayed in the same room may have all survived or all died.

In [None]:
titanic_train['Floor'] = titanic_train['Cabin'].str[0] # first character of each cabin string; missing values stay NaN
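
The same extraction can be sketched on a few toy values in the style of the Cabin column (made up for illustration), comparing an `apply`-based version against the vectorized `.str[0]` accessor. Both leave missing values as NaN.

```python
import numpy as np
import pandas as pd

cabins = pd.Series(['C85', np.nan, 'F E69', 'T'])   # toy values

# Element-by-element version: take the first character of strings, pass NaN through
with_apply = cabins.apply(lambda cabin: cabin[0] if isinstance(cabin, str) else cabin)

# Vectorized version: .str[0] indexes into each string; NaN stays NaN
with_str = cabins.str[0]

print(with_apply.tolist())
print(with_str.tolist())
```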

In [None]:
titanic_train['Floor'].value_counts()

In [None]:
titanic_train['Floor'].unique()

Let's also take a look at the types of data in some of the rest of the columns.

In [None]:
titanic_train['Sex'].unique()

In [None]:
titanic_train['SibSp'].unique()

In [None]:
titanic_train['Pclass'].unique()

In [None]:
titanic_train['Parch'].unique()

In [None]:
titanic_train['Embarked'].unique()

## Dropping Columns, Inplace

Above, when we said:

In [None]:
titanic_train['Age'] = titanic_train['Age'].fillna(titanic_train['Age'].mean())

We had to assign the result back to the column after calling **.fillna**. This is because almost all Pandas methods are **non-destructive** by default: an operation on a column returns a new column rather than modifying the existing one.

For example, the <a href=https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html>**.drop**</a> method does not remove a column from the original DataFrame; it returns a copy of the DataFrame without that column:

In [None]:
titanic_train['dummy'] = 1
titanic_train.head()

In [None]:
titanic_train.drop('dummy', axis=1).head() # axis=1 drops a column; to drop rows instead, pass row index labels with axis=0 (the default)

However, if we pass in **inplace=True**, Pandas deletes the column from the original DataFrame. Many other Pandas methods accept the same argument.

In [None]:
titanic_train['dummy'] = 1
titanic_train.drop('dummy', inplace=True, axis=1)
titanic_train.head()

Be warned: doing things inplace is dangerous! Say, for example, it took a really long time to load your dataset (maybe you had to do some web scraping, or you downloaded it directly from a URL and have since lost your Internet connection). If you do **drop** operations inplace without saving the original state of the DataFrame, you could lose data.

In general, it is a good idea to save copies of your DataFrame at intermediate states throughout your EDA.
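
A minimal sketch of this pattern on a toy DataFrame (the data and the name `checkpoint` are made up for illustration): taking a `.copy()` before a destructive operation keeps the earlier state available.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})   # toy data

checkpoint = df.copy()               # independent snapshot of the current state
dropped = df.drop('b', axis=1)       # non-destructive: df still has column 'b'
print(list(df.columns), list(dropped.columns))

df.drop('b', axis=1, inplace=True)   # destructive: column 'b' is gone from df
print(list(df.columns))
print(list(checkpoint.columns))      # the snapshot still has both columns
```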

## One-hot encoding

A lot of the Titanic data is **categorical**. One way to prepare this kind of data for predictive modeling is **one-hot encoding**: a column with $k$ distinct values is replaced by $k$ columns of 0s and 1s, one per value. For example, "Pclass" takes the values 1, 2, and 3, so it becomes 3 columns, and a row with Pclass 2 is encoded as [0 1 0].

We use the <a href=https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html>**get_dummies**</a> function in Pandas.
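
Here is a small sketch of the encoding on toy "Pclass" values (made up for illustration). Recent Pandas versions return boolean columns, so the sketch casts to `int` to show 0s and 1s.

```python
import pandas as pd

pclass = pd.Series([1, 2, 3, 2], name='Pclass')   # toy values

encoded = pd.get_dummies(pclass, prefix='Pclass').astype(int)

print(encoded.columns.tolist())   # ['Pclass_1', 'Pclass_2', 'Pclass_3']
print(encoded.values.tolist())    # a row with Pclass 2 becomes [0, 1, 0]
```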

Let's do this for some of the columns in the data.

In [None]:
titanic_train_copy = titanic_train.copy() # save the state of your DF

def one_hot(df, columns):
    for column in columns:
        # one-hot encode the column; the new columns are named {column}_{value}, e.g. Pclass_1
        col_onehot = pd.get_dummies(df[column], prefix=column)
        df = df.drop(column, axis=1) # not inplace, so the DataFrame passed in by the caller is left unmodified
        df = df.join(col_onehot)
    return df

titanic_train_one_hot = one_hot(titanic_train_copy, ['Pclass', 'Sex', 'SibSp', 'Parch'])

In [None]:
titanic_train_one_hot.head()

NOTE: By default, the .get_dummies function creates no column for missing values, so a row with a missing value gets a 0 in every column. When one-hot encoding a column with missing values, first fill them with a placeholder value so that they become a category that .get_dummies will create a column for.

In [None]:
titanic_train['Floor'] = titanic_train['Floor'].fillna('null')
pd.get_dummies(titanic_train['Floor'], prefix='Floor').head()
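
The effect of missing values can be sketched on a toy Series (made up for illustration). The sketch also shows the `dummy_na=True` argument of `get_dummies`, which is an alternative to filling in a placeholder by hand.

```python
import numpy as np
import pandas as pd

floors = pd.Series(['C', np.nan, 'E'])   # toy values

# Default behavior: no column for the missing value, so that row is all zeros
default = pd.get_dummies(floors, prefix='Floor').astype(int)
print(default.columns.tolist())          # ['Floor_C', 'Floor_E']
print(default.sum(axis=1).tolist())      # [1, 0, 1]

# Placeholder approach: the missing value becomes its own category
filled = pd.get_dummies(floors.fillna('null'), prefix='Floor').astype(int)
print(filled.columns.tolist())           # ['Floor_C', 'Floor_E', 'Floor_null']

# Alternative: ask get_dummies to add an indicator column for NaN
with_flag = pd.get_dummies(floors, prefix='Floor', dummy_na=True).astype(int)
print(with_flag.columns.tolist())
```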

### Exercise

Clean the rest of the columns of the Titanic data set and use <a href=http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html>sklearn.linear_model.LogisticRegression</a> to create a model for the 'Survived' column. Try it out on the `titanic_test` data and report your accuracy.