# Review

## Read a text file
  * use the `open` function to open a file: `open(filename)`, where `filename` is a string
    * loop over the file object to read its content line by line
  * `with` statement: `with expression as variable:`
    * the `as variable` part is optional; it binds the value returned by the expression's `__enter__` method to the variable
    * the expression must evaluate to an object with `__enter__` and `__exit__` methods
    * the `__exit__` method of a file closes the file.
  * `with open(filename) as file:` assigns the opened file to the variable `file`
    * after exiting the `with` statement, the file is closed.
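
The `with` protocol described above can be sketched with a toy class. `Tracker` and the temporary file `demo.txt` are made up for this illustration; only `open`, `__enter__`, `__exit__`, and the `closed` attribute of file objects come from Python itself.

```python
import os
import tempfile


class Tracker:
    """Toy context manager that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append("enter")
        return self  # the value returned here is what `as` binds

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append("exit")
        return False  # do not suppress exceptions raised in the body


with Tracker() as tracker:
    tracker.events.append("body")

# A file object follows the same protocol: its __exit__ closes the file.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as file:
    file.write("65.78331,112.9925\n")
closed_after_with = file.closed
```

After the block runs, `tracker.events` lists the three steps in order, and `closed_after_with` reports whether the file was closed on leaving the `with` statement.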

In [12]:
# print the first 10 lines
with open('height-weight.csv','r') as file:
    linenumber = 0
    for line in file:
        print(line)  # each line keeps its trailing newline, so print adds a blank line
        linenumber = linenumber + 1
        if linenumber == 10:
            break

65.78331,112.9925

71.51521,136.4873

69.39874,153.0269

68.2166,142.3354

67.78781,144.2971

68.69784,123.3024

69.80204,141.4947

70.01472,136.4623

67.90265,112.3723

66.78236,120.6672



## Read a Comma Separated Values (CSV) File
* the `csv` module provides the facilities to read CSV files
* The main reading facility is the `reader` function, which returns a reader object
  * to create a reader object, pass an opened file object as the first argument
  * some useful optional keyword arguments
    * `delimiter`: a one-character string used as the delimiter (field separator), defaults to ","
    * `quotechar`: a one-character string to quote fields containing special characters, such as the delimiter or quotechar, defaults to '"'
* You can loop through a `reader` object to access its rows.
  * each row is a list of strings
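
A small sketch of the two keyword arguments in action. The text being parsed is made up, and `io.StringIO` stands in for an opened file so that the example does not need a file on disk.

```python
import csv
import io

# Made-up text: fields are separated by ';' and one field is quoted with single quotes
text = "name;note\n'Smith; J.';tall\n"

reader = csv.reader(io.StringIO(text), delimiter=';', quotechar="'")
rows = list(reader)  # each row is a list of strings
```

The quoted field keeps its embedded `;` because the quote character protects it from being split.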


In [13]:
import csv

# print the first 10 lines
with open('height-weight.csv') as file:
    table = csv.reader(file)
    for row in table:
        print(row)
        if table.line_num == 10:
            break

['65.78331', '112.9925']
['71.51521', '136.4873']
['69.39874', '153.0269']
['68.2166', '142.3354']
['67.78781', '144.2971']
['68.69784', '123.3024']
['69.80204', '141.4947']
['70.01472', '136.4623']
['67.90265', '112.3723']
['66.78236', '120.6672']


In [18]:
# read the csv into a matrix


# A Review on Principal Component Analysis
Suppose there are $n$ samples on $m$ variables $X_j$ for $j=1,2,\dots,m$.

The first principal component is a unit vector $\vec \alpha = [\alpha_1,\dots,\alpha_m]^T$, $\|\vec\alpha\|=1$, such that the linear combination of the variables
$$Y=\sum_{k=1}^m\alpha_k X_k,\; \|\vec\alpha\|=1$$
has the largest variance (i.e., it explains the most variance in the data). It is also the direction of the line that minimizes the sum of squared distances between the line and the data.

The second principal component is a unit vector $\vec \beta = [\beta_1,\dots,\beta_m]^T$ orthogonal to $\vec\alpha$, i.e., $\|\vec\beta\|=1$ and $\vec\beta \perp\vec\alpha$, such that the linear combination of the variables
$$Z=\sum_{k=1}^m\beta_k X_k,\; \|\vec\beta\|=1,\; \vec\beta\perp\vec\alpha$$
has the largest variance (i.e., it explains the most variance that remains in the data after the component along $\vec\alpha$ has been subtracted).
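
One way to see why eigenvectors appear here, sketched under the assumption that $C$ denotes the covariance matrix of the variables and $\lambda$ a Lagrange multiplier (both symbols are introduced only for this sketch): the variance of $Y$ is a quadratic form in $\vec\alpha$, and maximizing it under the unit-length constraint gives an eigenvalue equation.

$$\operatorname{Var}(Y)=\vec\alpha^T C\vec\alpha,\qquad \nabla_{\vec\alpha}\left[\vec\alpha^T C\vec\alpha-\lambda\left(\vec\alpha^T\vec\alpha-1\right)\right]=2C\vec\alpha-2\lambda\vec\alpha=0\;\Longrightarrow\;C\vec\alpha=\lambda\vec\alpha$$

At such a point $\operatorname{Var}(Y)=\vec\alpha^T C\vec\alpha=\lambda$, so the maximum is reached at the eigenvector with the largest eigenvalue. Repeating the argument with the extra constraint $\vec\beta\perp\vec\alpha$ leads to the eigenvector with the second largest eigenvalue.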

* We first shift the data by subtracting the average of each variable from the data to make the data zero mean. 
* Suppose the data is organized as a matrix $X$, where each row is a sample and each column is a variable, i.e., $X_{ij}$ is the value of the variable $j$ for sample $i$.
* The variance-covariance matrix is $X^TX/(n-1)$; the scale factor does not change the eigenvectors, so $X^TX$ can be used in its place.
* The first and second principal components are then the eigenvectors of $X^TX$ associated with the largest and the second largest eigenvalues, respectively.
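
The steps above can be sketched on synthetic data. The data, the random seed, and all variable names below are made up for illustration; this is not the height and weight dataset. Since `eig` does not promise any particular ordering of the eigenvalues, the sketch sorts them explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 500 samples of 2 correlated variables (rows are samples)
x = rng.normal(size=500)
data = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=500)])

# Step 1: shift every column to zero mean
centered = data - data.mean(axis=0)

# Step 2: sample covariance matrix of the centered data
cov = centered.T @ centered / (centered.shape[0] - 1)

# Step 3: eigendecomposition, then sort by decreasing eigenvalue
eigenvalues, eigenvectors = np.linalg.eig(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

first = eigenvectors[:, 0]   # direction with the largest variance
second = eigenvectors[:, 1]  # orthogonal direction with the remaining variance

# Column k of `scores` is the projection of the data onto component k
scores = centered @ eigenvectors
```

The sample variance of each column of `scores` equals the corresponding eigenvalue, which is one way to check that the components were picked in the right order.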

# Group assignment
The file `height-weight.csv` contains 25,000 samples of the height (in inches, in the first column) and weight (in lbs, in the second column) of a population. This dataset was downloaded from [Statistics Online Computational Resource](http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_Dinov_020108_HeightsWeights).

1. Find the first principal component, which is the linear relationship between weight and height.

2. Find the second principal component, which is the deviation from the linear relationship; the variable $Z$ is thus a measure of overweightness for this population.

3. Define a function that calculates the value $Z$ from a weight and a height.

**Hint**: the `eig` function in `numpy.linalg` computes the eigenvalues and eigenvectors of a matrix.
  * `e, P=eig(A)` returns the eigenvalues of the matrix $A$ in `e`, and the associated eigenvectors as the columns of `P`, in the same order.
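
A quick check of how `eig` lays out its results, using a made-up 2×2 symmetric matrix. The point of the sketch is that the eigenvectors are the columns of `P` (not the rows), and that the largest eigenvalue has to be located explicitly rather than assumed to come first.

```python
import numpy as np
from numpy.linalg import eig

# Made-up symmetric matrix, for illustration only
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

e, P = eig(A)

# Column k of P pairs with e[k]; P * e scales column k by e[k],
# so this residual is close to zero when A @ P[:, k] == e[k] * P[:, k]
residual = A @ P - P * e

# Pick the eigenvector that belongs to the largest eigenvalue
largest = np.argmax(e)
first = P[:, largest]
```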

In [47]:
# first and second principal components
import csv

from numpy import matmul, zeros
from numpy.linalg import eig

n = 25000
X = zeros([n, 2])

# read every row into X: column 0 is height (inches), column 1 is weight (lbs)
with open('height-weight.csv') as file:
    table = csv.reader(file)
    for i, row in enumerate(table):
        X[i, 0] = float(row[0])
        X[i, 1] = float(row[1])

print(X)

# center each column, then decompose the sample covariance matrix
Y = X - X.mean(axis=0)
e, P = eig(matmul(Y.T, Y) / (Y.shape[0] - 1))
print(e)
print(P)

# The eigenvectors are the columns of P, in the same order as the eigenvalues in e.
# With H = centered height and W = centered weight:
# first principal component (largest eigenvalue, e[1]) is P[:, 1]:  -0.08336679H - 0.99651893W
# second principal component (smaller eigenvalue, e[0]) is P[:, 0]: -0.99651893H + 0.08336679W

[[ 65.78331 112.9925 ]
 [ 71.51521 136.4873 ]
 [ 69.39874 153.0269 ]
 ...
 [ 64.69855 118.2655 ]
 [ 67.52918 132.2682 ]
 [ 68.87761 124.8742 ]]
[  2.68350923 136.90940491]
[[-0.99651893 -0.08336679]
 [ 0.08336679 -0.99651893]]
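
For task 3, a minimal sketch of such a function, assuming the second principal component is the column of the printed `P` that belongs to the smaller eigenvalue (the first column, since the eigenvalues are printed as roughly 2.68 and 136.9). The function name `overweightness`, its parameters, and the mean values in the usage line are hypothetical; the real means would have to be computed from the data, for example with `X.mean(axis=0)`. The sign of an eigenvector is arbitrary, so the sign of $Z$ is a convention.

```python
# Second principal component read off the printed P (column 0),
# given as weights for (height, weight).
SECOND_COMPONENT = (-0.99651893, 0.08336679)


def overweightness(height, weight, mean_height, mean_weight,
                   component=SECOND_COMPONENT):
    """Return Z: the centered (height, weight) sample projected onto `component`."""
    return (component[0] * (height - mean_height)
            + component[1] * (weight - mean_weight))


# Hypothetical means, for illustration only; compute the real ones from the data.
z = overweightness(70.0, 150.0, mean_height=68.0, mean_weight=127.0)
```

A sample that sits exactly at the mean height and mean weight gets $Z=0$; other samples get a positive or negative value depending on which side of the first component's line they fall.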
