### Problem Statement: 
Load the data from “College.csv”, which contains attributes collected about private and public colleges for a
particular year. Predict the private/public status of each college from the other attributes.
Use LabelEncoder to encode the target variable in numerical form. Split the data so that 20% is set aside for
testing. Fit a linear SVM from scikit-learn and observe the accuracy. [Hint: Use LinearSVC]
Preprocess the data using StandardScaler and fit the same model again. Observe the change in accuracy.
Use scikit-learn’s grid search to select the best hyperparameters for a nonlinear SVM. Identify the model with the best score and
its parameters. [Hint: Refer to the model_selection module of scikit-learn]


### Objective: 
Employ an SVM from scikit-learn for binary classification, and measure the impact of data preprocessing and of
hyperparameter search using grid search.


###### Import the Dataset

In [1]:
import pandas as pd
df = pd.read_csv("College.csv")
df.columns

Index(['Private', 'Apps', 'Accept', 'Enroll', 'Top10perc', 'Top25perc',
       'F.Undergrad', 'P.Undergrad', 'Outstate', 'Room.Board', 'Books',
       'Personal', 'PhD', 'Terminal', 'S.F.Ratio', 'perc.alumni', 'Expend',
       'Grad.Rate'],
      dtype='object')

###### Label Encoding


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
X, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
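# LabelEncoder assigns integer codes in sorted class order; assuming 'Private'
# holds "Yes"/"No", private ("Yes") -> 1 and public ("No") -> 0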
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)
print(X_train.shape)

(621, 17)
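
To check which integer each class receives, the fitted encoder can be inspected directly. The sketch below uses a small toy list in place of the real `Private` column; the "Yes"/"No" values are an assumption about that column, and the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the 'Private' column (assumed to hold "Yes"/"No")
labels = ["Yes", "No", "Yes", "Yes", "No"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, and the position of each class is its integer code
mapping = {str(cls): int(code) for code, cls in enumerate(encoder.classes_)}
print(mapping)           # {'No': 0, 'Yes': 1}
print(encoded.tolist())  # [1, 0, 1, 1, 0]

# inverse_transform recovers the original string labels
decoded = encoder.inverse_transform(encoded)
```

Running the same inspection on `target_encoder.classes_` in the notebook would confirm the mapping for the actual data.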


###### Fit the Linear SVC Classifier

In [3]:
from sklearn.svm import LinearSVC, SVC
# Fit a linear SVM on the raw (unscaled) features
classifier = LinearSVC()
classifier.fit(X_train, y_train)
y_predict = classifier.predict(X_test)
# Mean accuracy on the held-out 20%
classifier.score(X_test, y_test)



0.8205128205128205

###### Obtain the Confusion Matrix


In [4]:
from sklearn.metrics import confusion_matrix
# Argument order is (y_true, y_pred): rows are true labels, columns are predictions
print(confusion_matrix(y_test, y_predict))

[[ 12  27]
 [  1 116]]
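
The accuracy reported by `score` can be recovered from the confusion matrix: the diagonal entries (12 and 116) are correct predictions and the off-diagonal entries (27 and 1) are errors. A short worked check, using only the counts printed above:

```python
# Counts read off the confusion matrix printed above
correct = 12 + 116     # diagonal entries: predictions that match the true label
incorrect = 27 + 1     # off-diagonal entries: misclassified colleges
total = correct + incorrect

accuracy = correct / total
print(total, round(accuracy, 4))  # 156 0.8205
```

The total of 156 matches the size of the 20% test split, and the ratio agrees with the 0.8205 returned by `classifier.score`.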


###### Fit the SVC Classifier

In [5]:
# Kernel SVM with default settings (RBF kernel), still on the unscaled features
classifier = SVC()
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

0.9230769230769231

###### Preprocess the Data

In [8]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
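scaler = StandardScaler()
X, y = df.iloc[:, 1:].values, df.iloc[:, 0].values
# Standardize each feature to zero mean and unit variance.
# Note: the scaler is fit on all rows before the split, so the test rows
# influence the scaling statistics.
X = scaler.fit_transform(X)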
target_encoder = LabelEncoder()
y = target_encoder.fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=1)
print(X_train.shape)

(621, 17)
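
The cell above fits the scaler on all rows before splitting. A common alternative is to split first and learn the scaling statistics from the training rows only, so the test rows stay unseen. The sketch below illustrates that ordering on random synthetic data; the array sizes and the names `X_demo`, `y_demo`, `X_tr`, and `X_te` are assumptions for illustration, not the college data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in features and labels (not the college dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, then scale
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=1)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics learned from training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse the training statistics

print(np.allclose(X_tr_scaled.mean(axis=0), 0.0))  # True
print(np.allclose(X_tr_scaled.std(axis=0), 1.0))   # True
```

The scaled test split will generally not have exactly zero mean or unit variance, which is expected: it was transformed with the training statistics.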


###### Refitting the SVC Model

In [9]:
# Same default SVC, now trained on the standardized features
classifier = SVC()
classifier.fit(X_train, y_train)
classifier.score(X_test, y_test)

0.9423076923076923
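
The problem statement asks for the linear model to be refit after scaling, whereas the cells above refit `SVC`. One way to make that comparison is to wrap `StandardScaler` and `LinearSVC` in a pipeline and score it next to the unscaled model. The sketch below does this on synthetic data with deliberately mismatched feature scales; the data, `max_iter`, and `random_state` values are assumptions for illustration, and no particular outcome is implied for the college data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (not the college dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
# Put the features on very different scales, from 1x up to 10000x
X_demo = X_demo * np.logspace(0, 4, 10)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=1)

# Linear SVM on raw features (may emit a convergence warning)
raw_model = LinearSVC(max_iter=5000, random_state=0)
# Same linear SVM, with scaling fitted inside the pipeline on training data only
scaled_model = make_pipeline(StandardScaler(),
                             LinearSVC(max_iter=5000, random_state=0))

raw_score = raw_model.fit(X_tr, y_tr).score(X_te, y_te)
scaled_score = scaled_model.fit(X_tr, y_tr).score(X_te, y_te)
print(f"raw: {raw_score:.3f}  scaled: {scaled_score:.3f}")
```

Applied to the notebook, the same two models could be fit on the unscaled `X_train` from the first split to observe the change in accuracy for `LinearSVC` specifically.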

###### Fitting Grid Search 

In [10]:
import numpy as np
# Candidate values on a log scale: 13 for C (1e-2 to 1e10), 13 for gamma (1e-9 to 1e3)
C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)
param_grid = dict(gamma=gamma_range, C=C_range)
# Cross-validated search over every (C, gamma) pair
grid = GridSearchCV(SVC(), param_grid=param_grid)
grid.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='scale', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': array([1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04, 1.e+05,
       1.e+06, 1.e+07, 1.e+08, 1.e+09, 1.e+10]),
                         'gamma': array([1.e-09, 1.e-08, 1.e-07, 1.e-06, 1.e-05, 1.e-04, 1.e-03, 1.e-02,
       1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
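
The grid spans 13 values of `C` and 13 values of `gamma`, and every combination is cross-validated. A quick count of the work this implies, assuming 5-fold cross-validation (the default for `cv=None` in scikit-learn 0.22 and later; `n_folds` here is an illustrative variable, not a notebook setting):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_range = np.logspace(-2, 10, 13)
gamma_range = np.logspace(-9, 3, 13)
param_grid = {"C": C_range, "gamma": gamma_range}

# ParameterGrid enumerates every (C, gamma) combination
n_candidates = len(ParameterGrid(param_grid))
n_folds = 5  # assumed default number of folds

print(n_candidates, n_candidates * n_folds)  # 169 845
```

With `refit=True`, one additional fit on the full training set follows the search, using the best combination found.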

###### Getting the Best Hyperparameter

In [11]:
# best_score_ is the mean cross-validated accuracy of the best candidate
print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))

The best parameters are {'C': 1000000.0, 'gamma': 1e-07} with a score of 0.94
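
The score reported by `best_score_` is averaged over cross-validation folds of the training data, so it is worth also scoring the refit model on the held-out test split. A minimal sketch on synthetic data follows; the data, the reduced grid `small_grid`, and the variable names are illustrative assumptions, not the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (not the college dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=1)

# A deliberately small grid so the example runs quickly
small_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid=small_grid, cv=5)
search.fit(X_tr, y_tr)

# best_estimator_ is the SVC refit on all of X_tr with the best parameters
held_out_score = search.best_estimator_.score(X_te, y_te)

print(search.best_params_)
print(f"mean CV accuracy:  {search.best_score_:.2f}")
print(f"held-out accuracy: {held_out_score:.2f}")
```

In the notebook, the equivalent call would be `grid.best_estimator_.score(X_test, y_test)`.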
