# Lab 01: Pandas, Numpy, and Statsmodels


### Pandas

Pandas is a popular library for data analysis in Python.  The library allows users to work with tabular data, much like R's `data.frame`.  Today, you'll learn how to:

* Load in data
* Select columns and rows
* Assign new columns
* Extract data from columns as numpy arrays



In [3]:
import numpy as np
import pandas as pd
from statsmodels.regression.linear_model import OLS
from statsmodels.regression.quantile_regression import QuantReg
from IPython.display import display

Data comes in many forms, and comma-separated value (CSV) files are among the most common.  You can read a CSV file into Python using `pandas.read_csv`.

In [4]:
# Read in data
df = pd.read_csv('2018_data.csv')

# See the top 5 rows
df.head()


| | created_at | apparentTemperature | humidity | precipIntensity | precipProbability | precipType | pressure | visibility | windBearing | windSpeed | wr |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-02 06:28:33 | -23.27 | 0.83 | 0.0 | 0.0 | NoPrecip | 1031.23 | 16.093 | 240.0 | 16.56 | 9.0 |
| 1 | 2018-01-02 06:58:21 | -23.27 | 0.83 | 0.0 | 0.0 | NoPrecip | 1031.23 | 16.093 | 240.0 | 16.56 | 10.0 |
| 2 | 2018-01-02 07:27:11 | -24.22 | 0.83 | 0.0 | 0.0 | NoPrecip | 1030.51 | 16.093 | 199.0 | 20.54 | 10.0 |
| 3 | 2018-01-02 07:58:38 | -24.22 | 0.83 | 0.0 | 0.0 | NoPrecip | 1030.51 | 16.093 | 199.0 | 20.54 | 7.0 |
| 4 | 2018-01-02 08:27:15 | -19.47 | 0.85 | 0.0 | 0.0 | NoPrecip | 1030.29 | 16.093 | 255.0 | 6.43 | 18.0 |
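`read_csv` also accepts a file-like object, which makes it easy to experiment without the data file. The sketch below is only an illustration: `csv_text` is a hypothetical in-memory stand-in holding three columns of the rows shown above, not the real `2018_data.csv`. It shows that timestamps are read as text unless you ask pandas to parse them with `parse_dates`.

```python
import io
import pandas as pd

# Hypothetical stand-in for a file on disk; values copied from the rows above.
csv_text = """created_at,apparentTemperature,humidity
2018-01-02 06:28:33,-23.27,0.83
2018-01-02 06:58:21,-23.27,0.83
2018-01-02 07:27:11,-24.22,0.83
"""

# By default the timestamp column is read as text, not as datetimes.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# parse_dates converts the named columns to datetimes while reading.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=['created_at'])
print(sample.dtypes)
print(sample.shape)
```

With datetimes in place, attributes such as `sample['created_at'].dt.hour` become available.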


There are several ways to extract a column using pandas. You can use one of:

* `df['column_name']`

* `df.column_name`

* `df.loc[:, 'column_name']`



In [5]:
# Extract the temperature column using the three methods
display(df['apparentTemperature'])

display(df.apparentTemperature)

display(df.loc[:, 'apparentTemperature'])


0      -23.27
1      -23.27
2      -24.22
3      -24.22
4      -19.47
        ...  
6419    -8.02
6420    -8.02
6421    -8.02
6422    -7.17
6423    -8.69
Name: apparentTemperature, Length: 6424, dtype: float64

0      -23.27
1      -23.27
2      -24.22
3      -24.22
4      -19.47
        ...  
6419    -8.02
6420    -8.02
6421    -8.02
6422    -7.17
6423    -8.69
Name: apparentTemperature, Length: 6424, dtype: float64

0      -23.27
1      -23.27
2      -24.22
3      -24.22
4      -19.47
        ...  
6419    -8.02
6420    -8.02
6421    -8.02
6422    -7.17
6423    -8.69
Name: apparentTemperature, Length: 6424, dtype: float64
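Rows can be selected in a similar spirit. The following is a minimal sketch on a small hypothetical frame (`toy`, with values copied from the first rows of the data) rather than on `df` itself; it shows position-based selection with `iloc`, label-based selection with `loc`, and filtering with a boolean mask.

```python
import pandas as pd

# Small hypothetical frame; the values are copied from the first rows shown earlier.
toy = pd.DataFrame({
    'apparentTemperature': [-23.27, -23.27, -24.22, -24.22, -19.47],
    'windSpeed': [16.56, 16.56, 20.54, 20.54, 6.43],
})

# Position-based: the first two rows.
first_two = toy.iloc[:2]

# Label-based: rows labelled 1 through 3 (loc includes the end label).
middle = toy.loc[1:3]

# Condition-based: a boolean mask keeps the rows where it is True.
windy = toy[toy['windSpeed'] > 10]

# loc takes a row selection and a column selection together.
temps_when_windy = toy.loc[toy['windSpeed'] > 10, 'apparentTemperature']

print(len(first_two), len(middle), len(windy))
print(temps_when_windy)
```

Note that `loc` slices include both endpoints, while `iloc` slices follow the usual Python convention and exclude the end.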

Creating a new column in pandas can be done with

`df['new_column'] = ...`

Let's create a new column for the temperature in Fahrenheit.

In [6]:

# Note: pandas columns act like vectors insofar as
# arithmetic operations are applied elementwise.
df['temp_in_Fahrenheit'] = 1.8*df.apparentTemperature + 32

df.head()

| | created_at | apparentTemperature | humidity | precipIntensity | precipProbability | precipType | pressure | visibility | windBearing | windSpeed | wr | temp_in_Fahrenheit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-02 06:28:33 | -23.27 | 0.83 | 0.0 | 0.0 | NoPrecip | 1031.23 | 16.093 | 240.0 | 16.56 | 9.0 | -9.886 |
| 1 | 2018-01-02 06:58:21 | -23.27 | 0.83 | 0.0 | 0.0 | NoPrecip | 1031.23 | 16.093 | 240.0 | 16.56 | 10.0 | -9.886 |
| 2 | 2018-01-02 07:27:11 | -24.22 | 0.83 | 0.0 | 0.0 | NoPrecip | 1030.51 | 16.093 | 199.0 | 20.54 | 10.0 | -11.596 |
| 3 | 2018-01-02 07:58:38 | -24.22 | 0.83 | 0.0 | 0.0 | NoPrecip | 1030.51 | 16.093 | 199.0 | 20.54 | 7.0 | -11.596 |
| 4 | 2018-01-02 08:27:15 | -19.47 | 0.85 | 0.0 | 0.0 | NoPrecip | 1030.29 | 16.093 | 255.0 | 6.43 | 18.0 | -3.046 |


If you have data in Python that you would like to convert into a DataFrame, you can do so as follows:

In [7]:
array_data = np.array([3,2,1])

array_to_df = pd.DataFrame({'a_column_name': array_data})

array_to_df

| | a_column_name |
|---|---|
| 0 | 3 |
| 1 | 2 |
| 2 | 1 |
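Going the other way, from a column to a numpy array, can be done with `to_numpy()`. This is a minimal sketch on a hypothetical three-row frame (`toy`); the same calls should work on any numeric column of `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few temperatures copied from the data above.
toy = pd.DataFrame({'apparentTemperature': [-23.27, -24.22, -19.47]})

# A single column (a Series) becomes a 1-d array of shape (3,).
temps = toy['apparentTemperature'].to_numpy()

# Selecting with a list of names keeps a DataFrame,
# so the result is a 2-d array of shape (3, 1).
temps_2d = toy[['apparentTemperature']].to_numpy()

print(type(temps), temps.shape)
print(temps_2d.shape)
```

The difference between the `(3,)` and `(3, 1)` shapes matters as soon as you start multiplying arrays, which is the topic of the next section.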


___


### Numpy

#### Linear algebra and dot products

In math, a vector $\mathbf{x} \in \mathbb{R}^n$ is usually understood to be a *column vector*.  That is, if I were to write out $\mathbf{x}$ then it would look like

$$ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} $$

When we write $\mathbf{x}^T$ that makes $\mathbf{x}$ a *row vector*.  So if I were to write out $\mathbf{x}^T$ it would look like

$$ \mathbf{x}^T = \begin{bmatrix} x_1 \,, x_2 \,, \cdots \,, x_n \end{bmatrix} $$

When I want to compute a dot product between two vectors $\mathbf{x}$ and $\mathbf{y}$ I write

$$ \mathbf{y}^T \mathbf{x} $$

Which means that this is the product between a *row vector* and a *column vector*. 
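As a small worked example, take $\mathbf{x} = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix}^T$ and $\mathbf{y} = \begin{bmatrix} 4 & 5 & 6 \end{bmatrix}^T$.  Multiplying the row vector by the column vector sums the elementwise products, $\mathbf{y}^T \mathbf{x} = \sum_{i=1}^n y_i x_i$, so

$$ \mathbf{y}^T \mathbf{x} = \begin{bmatrix} 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = 4 \cdot 1 + 5 \cdot 2 + 6 \cdot 3 = 32 $$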


#### One- and two-dimensional arrays in numpy
Intuitively, we expect numpy to behave the same way.  Confusion arises because a vector can be represented as either a 1-d or a 2-d array.



In [8]:

# These are one-dimensional numpy arrays
x = np.array([1,2,3])
y = np.array([4,5,6])

print('x looks like:')
print(x)
print(y)

print('shape of x:', x.shape)
print('shape of y:', y.shape)

print(f'x has {x.ndim} dimension(s)')
print(f'y has {y.ndim} dimension(s)')

x looks like:
[1 2 3]
[4 5 6]
shape of x: (3,)
shape of y: (3,)
x has 1 dimension(s)
y has 1 dimension(s)


In [9]:
# Write x and y as 2-d arrays (column vectors)

x = np.array([[1,2,3]]).T
y = np.array([[4,5,6]]).T

print('x looks like:')
print(x)
print(y)

print('shape of x:', x.shape)
print('shape of y:', y.shape)

print(f'x has {x.ndim} dimension(s)')
print(f'y has {y.ndim} dimension(s)')

x looks like:
[[1]
 [2]
 [3]]
[[4]
 [5]
 [6]]
shape of x: (3, 1)
shape of y: (3, 1)
x has 2 dimension(s)
y has 2 dimension(s)


#### Dot products in numpy 
The dot product between two 2-d arrays behaves as we would expect (note that the result is still technically a 2-d array):

In [10]:
x.T@y

array([[32]])

In [11]:
# If the shapes of the arrays do not align, numpy raises an error.
# Here both x and y are (3, 1) column vectors, so x @ y is not defined:
# the left operand must be a row vector of shape (1, 3).
x@y

ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0, with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 3 is different from 1)
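The rule behind both results is that, for 2-d arrays, `A @ B` needs the number of columns of `A` to equal the number of rows of `B`. The sketch below rebuilds the same `x` and `y` and prints the shapes for the inner product and the outer product; catching the `ValueError` is just a convenience so the example runs to the end.

```python
import numpy as np

x = np.array([[1, 2, 3]]).T   # shape (3, 1)
y = np.array([[4, 5, 6]]).T   # shape (3, 1)

inner = x.T @ y    # (1, 3) @ (3, 1) -> (1, 1)
outer = x @ y.T    # (3, 1) @ (1, 3) -> (3, 3)
print(inner.shape, outer.shape)

# .item() pulls the single value out of the (1, 1) array as a Python scalar.
print(inner.item())

# (3, 1) @ (3, 1) does not align, so numpy raises a ValueError.
try:
    x @ y
except ValueError as err:
    print('x @ y failed:', type(err).__name__)
```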

But take a look at what happens when I compute the dot product between two 1-d arrays using `@`.

In [12]:
# I shouldn't be able to do either of these operations with vectors
# because they don't have the right shape
x = np.array([1,2,3])
y = np.array([-1,2,0])
print(x@y)
print(y@x)
# But as you can see, I can:

3
3


This isn't allowed in linear algebra.  If `x` and `y` are "vectors", then numpy is letting me take the product of two column vectors without transposing one of them.  The answer is right (the dot product is 3), but it shows that 1-d numpy arrays do not behave like row or column vectors: for two 1-d operands, `@` simply computes the inner product.

Indeed, the transpose operator leaves a 1-d array unchanged:

In [34]:
print(x)
print(x.T) #Transposing does not change a 1d array

[1 2 3]
[1 2 3]


A useful function for turning 1-d arrays into 2-d arrays is `reshape`.  We can transform our data into `(n,1)`-shaped arrays using `x.reshape(-1,1)`.  When you pass `-1` to `reshape`, you're telling numpy to infer the size of that dimension.  So if I had an array `z` of 3 elements and called `z.reshape(-1,1)`, the result would be a `(3,1)` array.  We didn't have to tell numpy the size of the first dimension; numpy inferred it from the size of the array.



In [38]:
print(x)

z = x.reshape(-1,1)

print(z)

[1 2 3]
[[1]
 [2]
 [3]]
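The same idea gives row vectors and other shapes. This is a minimal sketch (the names `col`, `row`, and `w` are only for illustration) showing that once the arrays are 2-d, `@` follows the usual linear algebra shape rules, and that `-1` can stand in for any one dimension.

```python
import numpy as np

x = np.array([1, 2, 3])

col = x.reshape(-1, 1)   # column vector, shape (3, 1)
row = x.reshape(1, -1)   # row vector, shape (1, 3)

# With 2-d shapes, @ follows the usual linear algebra rules.
print((row @ col).shape)   # inner product
print((col @ row).shape)   # outer product

# -1 works for any single dimension: 6 elements in 2 columns gives 3 rows.
w = np.arange(6).reshape(-1, 2)
print(w.shape)

# reshape(-1) flattens back to a 1-d array.
print(col.reshape(-1).shape)
```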
