# Intro to  Linear Regression in Python with cuML

#### Installing RAPIDS AI

In [0]:
# pull RAPIDS AI install script from notebooks-contrib 
!wget -nc https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/rapids-colab.sh
# install RAPIDS 0.10 nightly
!bash rapids-colab.sh 

import sys, os
# set up system for RAPIDS use
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

## Load data
- Google Colab comes with a `sample_data` directory

In [61]:
!ls sample_data

anscombe.json		      mnist_test.csv
california_housing_test.csv   mnist_train_small.csv
california_housing_train.csv  README.md


#### california_housing
- We are going to use the `california_housing` data sets from this directory
  - i.e. `california_housing_test.csv` and `california_housing_train.csv`
- Start by importing `cuDF` and prep data for Linear Regression

In [44]:
import cudf 

# load train data
train_df = cudf.read_csv('sample_data/california_housing_train.csv')
# load test data
test_df = cudf.read_csv('sample_data/california_housing_train.csv')

train_df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,1.4936,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,1.82,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,1.6509,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,3.1917,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,1.925,65500.0


## Simple Linear Regression
- For basic ("*simple*", or "*2D*") Linear Regression, we will predict `median_house_value` based on `median_income`
  - Go ahead and trim data to just these columns

In [29]:
# trim train data to income and house value
simple_train_df = train_df[['median_income', 'median_house_value']]
# trim test data to income and house value
simple_test_df = test_df[['median_income', 'median_house_value']]

simple_train_df.head()

Unnamed: 0,median_income,median_house_value
0,1.4936,66900.0
1,1.82,80100.0
2,1.6509,85700.0
3,3.1917,73400.0
4,1.925,65500.0


### Predict Values
After we import Linear Regression from `cuML`
1. set train (*simple_X_train*, *y_train*) and test (*simple_X_test*, *y_test*) values 
  - with the x-axis representing just 1 value (`medium_income`)
  - and the y-axis representing just 1 value (`medium_house_value`)
2. fit the model with `medium_income` (*simple_X_train*) and corresponding `medium_house_value` (*y_train*) values 
  - so it can build an understanding of their relationship 
3. predict `median_house_value` (*y_test*) for a test set of `medium_income` (*simple_X_test*) values
  - and compare `median_house_value` predictions to actual median house (*y_test*) values

In [39]:
from cuml import LinearRegression

# set train X, y
simple_X_train = simple_train_df['median_income']
y_train = simple_train_df['median_house_value']

# set test X, y
simple_X_test = simple_test_df['median_income']
y_test = simple_test_df['median_house_value']

# set linear regression model
simple_OLS = LinearRegression()
# fit training data to the model
fit = simple_OLS.fit(simple_X_train, y_train)

# predict median house value of test data
simple_y_hat = fit.predict(simple_X_test)

# calculate mean squared error
simple_MSE = ((y_test - simple_y_hat)**2).sum()

print(simple_MSE)

6.720912665782728e+18


## Multiple Linear Regression 
- our mean squared error for Simple Linear Regression was quite high...
  - let's try Multiple Linear Regression (predicting based on all non-test variables rather than just `median_income`) and see if that's any better
1. set train (*multi_X_train*) and test (*multi_X_test*) values 
  - with the x-axis representing all values that are not `medium_house_value`
    - i.e. `longitude`, `latitude`, `housing_median_age`, `total_rooms`, `total_bedrooms`, `population`, `households`, and `median_income`
  - and the y-axis representing just 1 value (`medium_house_value`)
    - *y_train* and *y_test* do not need to be set as they are the same values from the Simple Linear Regression
2. fit the model with `medium_income` (*multi_X_train*) and corresponding `medium_house_value` (*y_train*) values 
  - so it can build an understanding of their relationship 
3. predict `median_house_value` (*y_test*) for a test set of independent (*multi_X_test*) values
  - and compare `median_house_value` predictions to actual median house (*y_test*) values

In [37]:
# set multiple linear regression train X
multi_X_train = train_df.drop('median_income')

# set multiple linear regression test X
multi_X_test = test_df.drop('median_income')

multi_X_train.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_house_value
0,-114.31,34.19,15.0,5612.0,1283.0,1015.0,472.0,66900.0
1,-114.47,34.4,19.0,7650.0,1901.0,1129.0,463.0,80100.0
2,-114.56,33.69,17.0,720.0,174.0,333.0,117.0,85700.0
3,-114.57,33.64,14.0,1501.0,337.0,515.0,226.0,73400.0
4,-114.57,33.57,20.0,1454.0,326.0,624.0,262.0,65500.0


In [38]:
# set linear regression model
multi_OLS = LinearRegression()
# fit training data to the model
fit = OLS.fit(multi_X_train, y_train)

# predict median house value of test data
multi_y_hat = fit.predict(multi_X_test)

# calculate mean squared error
multi_MSE = ((y_test - y_hat)**2).sum()

print(multi_MSE)


119211472331544.78


#### Wow, both look quite high..
- but which is more off?

In [34]:
# is the simple model better?
simple_MSE <= multi_MSE

False

In [36]:
int(multi_MSE - simple_MSE)

-6720793454310395904