# Support Vector Machine

## Citations / Resources

[Occupancy Dataset](https://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+#): Accurate occupancy detection of an office room from light, temperature, humidity and CO2 measurements using statistical learning models. Luis M. Candanedo, Véronique Feldheim. Energy and Buildings. Volume 112, 15 January 2016, Pages 28-39.

## To Do List
- Target distribution
- Quick overview of linear regression, logistic regression, and regularization (L1, L2, elastic net)
- Explanation of SVM
- Pairs plot
- Randomized Search CV
- How the regularization parameter (`C` in `SVC`) works

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
import sklearn.metrics as metrics
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
import matplotlib.pyplot as plt
import seaborn as sns

from helpers import cust_func

%matplotlib inline
plt.rcParams['figure.figsize'] = [16, 9]
plt.style.use("fivethirtyeight")
%load_ext autoreload
%autoreload 2

In [2]:
data_train = pd.read_csv("data/occupancy_data/datatraining.txt")
data_test = pd.read_csv("data/occupancy_data/datatest.txt")

In [3]:
data_train.head()

| | date | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy |
|---|---|---|---|---|---|---|---|
| 1 | 2015-02-04 17:51:00 | 23.18 | 27.272 | 426.0 | 721.25 | 0.004793 | 1 |
| 2 | 2015-02-04 17:51:59 | 23.15 | 27.2675 | 429.5 | 714.0 | 0.004783 | 1 |
| 3 | 2015-02-04 17:53:00 | 23.15 | 27.245 | 426.0 | 713.5 | 0.004779 | 1 |
| 4 | 2015-02-04 17:54:00 | 23.15 | 27.2 | 426.0 | 708.25 | 0.004772 | 1 |
| 5 | 2015-02-04 17:55:00 | 23.1 | 27.2 | 426.0 | 704.5 | 0.004757 | 1 |


In [4]:
data_train.describe()

| | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy |
|---|---|---|---|---|---|---|
| count | 8143.0 | 8143.0 | 8143.0 | 8143.0 | 8143.0 | 8143.0 |
| mean | 20.619084 | 25.731507 | 119.519375 | 606.546243 | 0.003863 | 0.21233 |
| std | 1.016916 | 5.531211 | 194.755805 | 314.320877 | 0.000852 | 0.408982 |
| min | 19.0 | 16.745 | 0.0 | 412.75 | 0.002674 | 0.0 |
| 25% | 19.7 | 20.2 | 0.0 | 439.0 | 0.003078 | 0.0 |
| 50% | 20.39 | 26.2225 | 0.0 | 453.5 | 0.003801 | 0.0 |
| 75% | 21.39 | 30.533333 | 256.375 | 638.833333 | 0.004352 | 0.0 |
| max | 23.18 | 39.1175 | 1546.333333 | 2028.5 | 0.006476 | 1.0 |


In [5]:
data_test.describe()

| | Temperature | Humidity | Light | CO2 | HumidityRatio | Occupancy |
|---|---|---|---|---|---|---|
| count | 2665.0 | 2665.0 | 2665.0 | 2665.0 | 2665.0 | 2665.0 |
| mean | 21.433876 | 25.353937 | 193.227556 | 717.90647 | 0.004027 | 0.364728 |
| std | 1.028024 | 2.436842 | 250.210906 | 292.681718 | 0.000611 | 0.481444 |
| min | 20.2 | 22.1 | 0.0 | 427.5 | 0.003303 | 0.0 |
| 25% | 20.65 | 23.26 | 0.0 | 466.0 | 0.003529 | 0.0 |
| 50% | 20.89 | 25.0 | 0.0 | 580.5 | 0.003815 | 0.0 |
| 75% | 22.356667 | 26.856667 | 442.5 | 956.333333 | 0.004532 | 1.0 |
| max | 24.408333 | 31.4725 | 1697.25 | 1402.25 | 0.005378 | 1.0 |


In [6]:
X_train = data_train.drop(["Occupancy", "date"], axis=1).copy()
X_test = data_test.drop(["Occupancy", "date"], axis=1).copy()

In [7]:
X_train.head()
#X_test.head()

| | Temperature | Humidity | Light | CO2 | HumidityRatio |
|---|---|---|---|---|---|
| 1 | 23.18 | 27.272 | 426.0 | 721.25 | 0.004793 |
| 2 | 23.15 | 27.2675 | 429.5 | 714.0 | 0.004783 |
| 3 | 23.15 | 27.245 | 426.0 | 713.5 | 0.004779 |
| 4 | 23.15 | 27.2 | 426.0 | 708.25 | 0.004772 |
| 5 | 23.1 | 27.2 | 426.0 | 704.5 | 0.004757 |


In [8]:
y_train = data_train["Occupancy"]
y_test = data_test["Occupancy"]

In [9]:
y_train.head()

1    1
2    1
3    1
4    1
5    1
Name: Occupancy, dtype: int64

## Target Distribution TODO
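
Until this section is filled in, here is a minimal sketch of one way to summarize the target. Because `Occupancy` is coded 0/1, the `mean` row of the `describe()` output above already gives the occupied share: about 0.212 in the training set and 0.365 in the test set. The helper `class_balance` and the toy series below are hypothetical stand-ins so the snippet runs without the data files; in the notebook the same call would presumably take `y_train` or `y_test`.

```python
import pandas as pd


def class_balance(y: pd.Series) -> pd.DataFrame:
    """Return per-class counts and proportions for a label series."""
    counts = y.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "proportion": counts / len(y)})


# Toy labels (hypothetical, NOT the occupancy data): four empty rows, one occupied.
toy_target = pd.Series([0, 0, 0, 0, 1], name="Occupancy")
balance = class_balance(toy_target)
print(balance)
```

A bar chart of the `proportion` column would be a natural next step, but the table alone already shows how far the classes are from a 50/50 split.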

## Scale and SVC pipeline

In [10]:
baseline_pipe = make_pipeline(StandardScaler(), SVC())

In [11]:
baseline_pipe.fit(X_train, y_train)
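
As a side note on what the pipeline does when it is fit: the scaler learns its means and variances from the training rows only, and the SVC is then trained on the scaled values. The sketch below illustrates this on synthetic data from `make_classification` (an assumption for illustration, not the occupancy dataset). It also shows the step names that `make_pipeline` generates, which is how the SVC's regularization parameter `C` would be addressed later in a parameter search.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (hypothetical), with one feature on a much larger scale.
X, y = make_classification(n_samples=400, n_features=5, random_state=0)
X[:, 0] *= 1000.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

pipe = make_pipeline(StandardScaler(), SVC())
pipe.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name.
step_names = list(pipe.named_steps)
print(step_names)

# The scaler's statistics come from the training split only.
scaler = pipe.named_steps["standardscaler"]
means_match_train = np.allclose(scaler.mean_, X_tr.mean(axis=0))
print(means_match_train)

# The SVC step keeps its defaults here; C is reachable as "svc__C".
default_C = pipe.get_params()["svc__C"]
print(default_C)
```

Keeping the scaler inside the pipeline means `score` and `predict` apply the training-set scaling to new rows automatically, so no test-set statistics leak into the fit.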

In [12]:
train_score = baseline_pipe.score(X_train, y_train)
test_score = baseline_pipe.score(X_test, y_test)
print(f"Train Score: {round(train_score, 3)}")
print(f"Test Score: {round(test_score, 3)}")

Train Score: 0.989
Test Score: 0.97


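Red flags going off in my head! Scores this high from an untuned baseline deserve a closer look: only about 21% of the training rows (and 36% of the test rows) are occupied, so accuracy alone can hide how the model treats each class.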

In [14]:
cm = cust_func.nice_conf_mat(y_test,
                             baseline_pipe.predict(X_test))

| | Predicted Negative | Predicted Positive |
|---|---|---|
| True Negative | 1616 | 77 |
| True Positive | 3 | 969 |
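
To read the matrix in terms of per-class metrics, the four counts can be plugged in by hand. This is a rough sketch that assumes the rows are the actual labels, the columns are the predictions, and "positive" means occupied (1). The majority-class baseline is the accuracy of always predicting "not occupied".

```python
# Counts copied from the confusion matrix above (rows assumed to be actual labels).
tn, fp = 1616, 77
fn, tp = 3, 969
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the rows predicted occupied, how many really were
recall = tp / (tp + fn)      # of the rows really occupied, how many were found
majority_baseline = (tn + fp) / total  # always predicting "not occupied"

print(f"Accuracy:          {accuracy:.3f}")
print(f"Precision:         {precision:.3f}")
print(f"Recall:            {recall:.3f}")
print(f"Majority baseline: {majority_baseline:.3f}")
```

Under these assumptions the errors are lopsided: 77 false positives against 3 false negatives, so recall is higher than precision. The same figures should also be available from `metrics.precision_score` and `metrics.recall_score` applied to `y_test` and the pipeline's predictions.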
