In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# Read the dataset
To read the dataset we use the `read_csv` function from the [pandas library](https://pandas.pydata.org/). In the following cell the dataset is loaded as a "dataframe" (similar to those in R): each column corresponds to a variable (dimension) and each row to a data point.

This dataset consists of $n=9$ __physiological and medical variables (columns)__ measured for $m=768$ __patients (rows)__.

The columns represent the following variables:

+ column 0:  *Pregnancies*: Number of times pregnant
+ column 1:  *Glucose*: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
+ column 2:  *BloodPressure*: Diastolic blood pressure (mm Hg)
+ column 3:  *SkinThickness*: Triceps skin fold thickness (mm)
+ column 4:  *Insulin*: 2-Hour serum insulin (mu U/ml)
+ column 5:  *BMI*: Body mass index (weight in kg/(height in m)^2)
+ column 6:  *DiabetesPedigreeFunction*: Diabetes pedigree function
+ column 7:  *Age*: Age (years)
+ column 8:  *Outcome*: The person is diabetic or not (1 or 0)


In [3]:
import pandas as pd
dataset = pd.read_csv('diabetes.csv',header=0)

To see the first 3 rows of the dataset we call the `head` method with the argument `3`.

In [5]:
dataset.head(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
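
As a small illustration of how `read_csv` and `head` behave, the following sketch loads the three rows shown above from an in-memory string instead of from `diabetes.csv`. The names `csv_text` and `sample` are hypothetical and only used here; the rest of the file is not reproduced.

```python
import io
import pandas as pd

# Three rows copied from the output above (assumption: the index column is omitted).
csv_text = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

# header=0 tells pandas that the first line holds the column names
sample = pd.read_csv(io.StringIO(csv_text), header=0)

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names taken from the header line
print(sample.head(2))        # only the first two rows
```

Here `io.StringIO` stands in for the file on disk, so the sketch runs without the dataset being present.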


# Description of the regression problem 

For this project, we will consider a reduced dataset with the following 5 columns as the __conditions__:
+ Pregnancies
+ BloodPressure
+ SkinThickness
+ BMI
+ Age

Let $X$ be an $m\times 5$ matrix holding the values of each of these condition variables for each patient.

We are going to consider the following column as our __observation__:
+ Glucose

Let $y$ be the vector of length $m$ containing the observation for each patient.

### Goal (Least Squares):
Our goal is to find a vector of __parameters__ $c$ such that:
$$X\cdot c +r = y \quad \text{and}\quad ||r||_2 \text{ is minimized }$$
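
A short sketch of why a $QR$ factorization is useful here, assuming $X$ has full column rank (an assumption, not something checked for this dataset): write $X = QR$, with $Q$ an $m\times 5$ matrix with orthonormal columns and $R$ a $5\times 5$ upper triangular matrix. The minimizer $c$ satisfies the normal equations, which simplify as follows:

$$X^T X c = X^T y \;\Longrightarrow\; R^T Q^T Q R\, c = R^T Q^T y \;\Longrightarrow\; R\, c = Q^T y$$

The last step uses $Q^T Q = I$ and the invertibility of $R^T$, so $c$ can be obtained by back substitution on a triangular system.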

### Construction of $X$ and $y$
In the following cell we construct our matrix $X$ and vector $y$ as NumPy arrays (`np.ndarray`), so you do not need to understand the structure of the dataframe.


In [37]:
# Get only the required variables
dataset_X = dataset[["Pregnancies",
                    "BloodPressure",
                    "SkinThickness",
                    "BMI",
                    "Age"]]
# Get only the observation variable
dataset_y = dataset["Glucose"]
# Extract the underlying NumPy arrays from the dataframe
X = dataset_X.values
y = dataset_y.values

In [38]:
print("type of X: ",type(X))
print("shape of X: ",X.shape)
print("type of y: ",type(y))
print("shape of y: ",y.shape)

type of X:  <class 'numpy.ndarray'>
shape of X:  (768, 5)
type of y:  <class 'numpy.ndarray'>
shape of y:  (768,)


# Questions
+ Plot $y$ as a function of the values of each column of $X$
+ Compute a $QR$ factorization of $X$
+ Use the factorization to compute the vector $c$
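
One possible way to approach these three items is sketched below on synthetic data rather than on the diabetes dataset. The names `X_demo`, `y_demo`, `c_true` and `c_demo` are hypothetical, and the random data is an assumption made only so that the sketch is self-contained. The same steps should carry over to `X` and `y` as constructed above.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (an assumption, NOT the diabetes dataset)
rng = np.random.default_rng(0)
m, n = 100, 5
X_demo = rng.normal(size=(m, n))
c_true = np.array([1.0, -2.0, 0.5, 3.0, -1.0])
y_demo = X_demo @ c_true + 0.1 * rng.normal(size=m)

# 1) Plot y against each column of X, one scatter plot per column
fig, axes = plt.subplots(1, n, figsize=(3 * n, 3), sharey=True)
for j, ax in enumerate(axes):
    ax.scatter(X_demo[:, j], y_demo, s=8)
    ax.set_xlabel(f"column {j}")
axes[0].set_ylabel("y")

# 2) Reduced QR factorization: Q is m x n, R is n x n upper triangular
Q, R = np.linalg.qr(X_demo, mode="reduced")

# 3) Solve R c = Q^T y for the least squares parameters
#    (np.linalg.solve is a general solver, used here for brevity)
c_demo = np.linalg.solve(R, Q.T @ y_demo)

# Cross-check against NumPy's built-in least squares routine
c_ref, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)
print(np.allclose(c_demo, c_ref))
```

Since $R$ is upper triangular, a dedicated back substitution routine would exploit its structure; the general solver is used here only to keep the sketch dependent on NumPy alone.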