# Instance based learning

The title of the dataset is ‘Crime and Communities’. It is prepared using real data from socio-economic data from 1990 US Census, law enforcement data from the 1990 US LEMAS survey, and crime data from the 1995 FBI UCR. This dataset contains a total number of 147 attributes and 2216 instances.

This code aims to predict the number of violent crimes per 100K persons in a community from a number of predictive variables using instance based learning.

Advantageous because of its ability to adapt model to previously unseen data.
However, issues with memory complexity of storing all training instances and risk of overfitting.



Trying: Support Vector Machines

In [3]:
%run Preprocessing.ipynb

      communityname state  countyCode  communityCode  fold  population  \
0    Marpletownship    PA        45.0        47616.0     1       23123   
1        Tigardcity    OR         NaN            NaN     1       29344   
2  Gloversvillecity    NY        35.0        29443.0     1       16656   
3       Bemidjicity    MN         7.0         5068.0     1       11245   
4   Springfieldcity    MO         NaN            NaN     1      140494   

   householdsize  racepctblack  racePctWhite  racePctAsian  ...  burglaries  \
0           2.82          0.80         95.57          3.44  ...        57.0   
1           2.43          0.74         94.33          3.43  ...       274.0   
2           2.40          1.70         97.35          0.50  ...       225.0   
3           2.76          0.53         89.16          1.17  ...        91.0   
4           2.45          2.51         95.65          0.90  ...      2094.0   

   burglPerPop  larcenies  larcPerPop  autoTheft  autoTheftPerPop  arsons  \
0  

KeyError: 'ViolentCrimesPerPop'

KeyError: 'ViolentCrimesPerPop'

In [None]:
# import packages
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import time


Split the data

In [None]:
# Split the data into training and test data sets using sklearn.model_selection

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.70, random_state=1)

Normalise the training set

In [None]:

# normalisation computed on the training set
mu = np.mean(x_train, axis=0)
sigma = np.std(x_train, axis=0)

x_train = (x_train - mu)#/sigma
x_test = (x_test - mu)/sigma

In [None]:
np.mean(x_train, axis=0)

In [None]:
# If we have properly sampled the dataset:
# we should get a mean vector for the test set that contains close to zero values.
np.mean(x_test, axis=0)

**Nearest Neighbour Algorithm**

The KNN classifier considers the entire data-set and assigns any new observation the value the majority of the closest K-neighbors.

In order to instantiate this classifier we need to define a distance function. Since we are dealing with continuous features we will define the euclidean distance.

To evaluate how this classifier performs on the test set we will measure its accuracy. Note that evaluating the accuracy on the training set is pointless because this will always be 1 by definition. We will now do also this only for instructive purposes.

We now define the accuracy measure. Remember that the accuracy is equal to the proportion of examples that the classifier predicted correctly.

**KNN on Scikit Learn**

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(n_neighbors=5, metric='euclidean')

In [None]:
# train the classifier
knn_clf.fit(x_train, y_train)

In [None]:
# Evaluate the result of this classifier.
# should not see any difference with respect to the results obtained above with our implementation of the kNN algorithm.

from sklearn.metrics import accuracy_score

y_test_pred = knn_clf.predict(x_test)

print('Test accuracy of kNN', accuracy_score(y_test, y_test_pred))


**Cosine Distance Algorithm**

In [None]:
# cosine distance
knn_clf = KNeighborsClassifier(n_neighbors=5, metric='cosine')

knn_clf.fit(x_train, y_train)

y_test_pred = knn_clf.predict(x_test)

print('Test accuracy of kNN', accuracy_score(y_test, y_test_pred))

Try other 𝑛 values to find a better one.

Although, the danger of doing this hyper-parameter exploration using the test set is that we may overfit these hyper-parameters on the test set. To avoid this, it is better to find the best hyper-parameter values via a validation strategy.

To find these hyper-parameter values we can exploit the grid search of scikit-learn. This will perform a k-fold cross-validation on the training set.

In [None]:
from sklearn.model_selection import GridSearchCV


param_grid = [{
    'weights': ["uniform", "distance"],
    'n_neighbors': range(1, 11),
    'metric':['euclidean', 'manhattan', 'cosine']}]

knn_clf = KNeighborsClassifier()
grid_search = GridSearchCV(knn_clf, param_grid, cv=5, verbose=2)
grid_search.fit(x_train, y_train)

In [None]:
# see what are the best hyper-parameter values found by the cross-validation
grid_search.best_estimator_


In [None]:
# the best estimator is n_neighbours=10
# try this hyper-parameters on the test set.
knn_clf = KNeighborsClassifier(metric='cosine', n_neighbors=10, weights='distance')

knn_clf.fit(x_train, y_train)

y_train_pred = knn_clf.predict(x_train)
y_test_pred = knn_clf.predict(x_test)

# The accuracy measured on the test set is now a better estimate of the accuracy we would expect on unseen examples.
print('Train accuracy of kNN', accuracy_score(y_train, y_train_pred))
print('Test accuracy of kNN', accuracy_score(y_test, y_test_pred))

**Support Vector Machine Algorithm**

In [None]:
start = time.time()

# Import support vector classifier
clf = SVC(kernel='linear') # gaussian
# Fit the x features and y classes
clf.fit(x_train, y_train)

end = time.time()
print((end - start)/60, 'minutes to run SVM classifier')

In [None]:
# Predict
y_pred = clf.predict(x_test)

In [None]:
# Evaluate the model
print("Accuracy:",metrics.accuracy_score(y_test, y_pred)) # Accuracy = 0.7729323308270677



**Hyperparameters**
https://www.datacamp.com/community/tutorials/svm-classification-scikit-learn-python
* Kernel: The main function of the kernel is to transform the given dataset input data into the required form. There are various types of functions such as linear, polynomial, and radial basis function (RBF). Polynomial and RBF are useful for non-linear hyperplane. Polynomial and RBF kernels compute the separation line in the higher dimension. In some of the applications, it is suggested to use a more complex kernel to separate the classes that are curved or nonlinear. This transformation can lead to more accurate classifiers.

* Regularization: Regularization parameter in python's Scikit-learn C parameter used to maintain regularization. Here C is the penalty parameter, which represents misclassification or error term. The misclassification or error term tells the SVM optimization how much error is bearable. This is how you can control the trade-off between decision boundary and misclassification term. A smaller value of C creates a small-margin hyperplane and a larger value of C creates a larger-margin hyperplane.

* Gamma: A lower value of Gamma will loosely fit the training dataset, whereas a higher value of gamma will exactly fit the training dataset, which causes over-fitting. In other words, you can say a low value of gamma considers only nearby points in calculating the separation line, while the a value of gamma considers all the data points in the calculation of the separation line.


**Advantages**
SVM Classifiers offer good accuracy and perform faster prediction compared to Naïve Bayes algorithm. They also use less memory because they use a subset of training points in the decision phase. SVM works well with a clear margin of separation and with high dimensional space.

**Disadvantages**
SVM is not suitable for large datasets because of its high training time and it also takes more time in training compared to Naïve Bayes. It works poorly with overlapping classes and is also sensitive to the type of kernel used.

**Conclusion**

This is not a suitable model for our dataset because it is computationally too expensive.
K-nearest neighbors seems like a better alternative as far as instance-based learning goes.

