<a href="https://colab.research.google.com/github/ChuquEmeka/SVM-Optical-Character-Recognition/blob/main/svm_optical_character_recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## USING SUPPORT VECTOR MACHINES FOR OPTICAL CHARACTER RECOGNITION.

#### PRESENTED BY EDEH EMEKA N.

##### SVMs are well suited to tackle the challenges of image data. They are capable of learning complex patterns without being overly sensitive to noise.

##### I will simulate a process that matches each glyph to one of the 26 letters of the English alphabet.

In [1]:
letter <- read.csv("letterdata.csv", stringsAsFactors = TRUE)

In [2]:
str(letter)

'data.frame':	20000 obs. of  17 variables:
 $ letter: Factor w/ 26 levels "A","B","C","D",..: 20 9 4 14 7 19 2 1 10 13 ...
 $ xbox  : int  2 5 4 7 2 4 4 1 2 11 ...
 $ ybox  : int  8 12 11 11 1 11 2 1 2 15 ...
 $ width : int  3 3 6 6 3 5 5 3 4 13 ...
 $ height: int  5 7 8 6 1 8 4 2 4 9 ...
 $ onpix : int  1 2 6 3 1 3 4 1 2 7 ...
 $ xbar  : int  8 10 10 5 8 8 8 8 10 13 ...
 $ ybar  : int  13 5 6 9 6 8 7 2 6 2 ...
 $ x2bar : int  0 5 2 4 6 6 6 2 2 6 ...
 $ y2bar : int  6 4 6 6 6 9 6 2 6 2 ...
 $ xybar : int  6 13 10 4 6 5 7 8 12 12 ...
 $ x2ybar: int  10 3 3 4 5 6 6 2 4 1 ...
 $ xy2bar: int  8 9 7 10 9 6 6 8 8 9 ...
 $ xedge : int  0 2 3 6 1 0 2 1 1 8 ...
 $ xedgey: int  8 8 7 10 7 8 8 6 6 1 ...
 $ yedge : int  0 4 3 2 5 9 7 2 1 1 ...
 $ yedgex: int  8 10 9 8 10 7 10 7 7 8 ...


In [3]:
# Check each column for missing values
sapply(letter, function(x) sum(is.na(x)))

**SVM requires all features to be numeric and scaled to a fairly small interval. This dataset is already largely prepared: every feature is an integer and no missing values were detected. The rescaling is handled automatically by the package used for model fitting, so I will move on to the training and testing phase.**
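As a rough illustration of the rescaling step, the sketch below standardizes a small made-up data frame with base R's scale(). The `toy` data frame and its values are hypothetical. kernlab's ksvm() applies a comparable zero-mean, unit-variance scaling internally when its `scaled` argument is left at the default of TRUE, so this step is not needed for the model in this notebook.

```r
# Hypothetical toy features (not taken from letterdata.csv)
toy <- data.frame(xbox = c(2, 5, 4, 7), ybox = c(8, 12, 11, 11))

# Centre each column to mean 0 and rescale it to standard deviation 1
toy_scaled <- as.data.frame(scale(toy))

# After scaling, every column has mean ~0 and standard deviation 1
round(colMeans(toy_scaled), 10)
sapply(toy_scaled, sd)
```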

## **MODEL TRAINING**  

**The data has already been randomized by Frey and Slate, so no further shuffling is needed. I will use the first 15,000 observations (75%) as training data and the remaining 5,000 (25%) as test data.**

In [4]:
letter_train <- letter[1:15000,]
letter_test <- letter[15001:20000,]
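The split above relies on the rows already being in random order. If that were not the case, a random split could be sketched as follows. The data frame `toy`, the seed, and the 75% fraction are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical stand-in for a data set whose rows are NOT pre-shuffled
toy <- data.frame(letter = factor(rep(c("A", "B"), each = 10)),
                  xbox   = 1:20)

set.seed(123)                                    # arbitrary seed, for reproducibility
n         <- nrow(toy)
train_idx <- sample(n, size = floor(0.75 * n))   # 75% of the row indices, chosen at random

toy_train <- toy[train_idx, ]    # rows selected for training
toy_test  <- toy[-train_idx, ]   # all remaining rows

nrow(toy_train)  # 15
nrow(toy_test)   # 5
```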

**To train the model, I will use the ksvm() function from the kernlab package for R.**

In [5]:
install.packages("kernlab")


Installing package into ‘/usr/local/lib/R/site-library’
(as ‘lib’ is unspecified)



In [6]:
library("kernlab")

In [7]:
letter_classifier <- ksvm(letter ~ ., data = letter_train, kernel = 'vanilladot')

 Setting default kernel parameters  


In [8]:
letter_classifier

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 1 

Linear (vanilla) kernel function. 

Number of Support Vectors : 6618 

Objective Function Value : -13.2947 -19.6051 -20.8982 -5.6651 -7.2092 -31.5151 -48.3253 -17.6236 -57.0476 -30.532 -15.7162 -31.49 -28.2706 -45.741 -11.7891 -33.3161 -28.2251 -16.5347 -13.2693 -30.88 -29.4259 -7.7099 -11.1685 -29.4289 -13.0857 -9.2631 -144.1105 -52.7747 -71.052 -109.7783 -158.3152 -51.2839 -39.6499 -67.0061 -23.8637 -27.6083 -26.3461 -35.2626 -38.6346 -116.8967 -173.8336 -214.2196 -20.7925 -10.3812 -53.1156 -12.228 -46.6132 -8.6867 -18.9108 -11.0535 -94.5751 -26.5689 -224.0215 -70.5714 -8.3232 -4.5265 -132.5431 -74.6876 -19.5742 -12.7352 -81.7894 -11.6983 -25.4835 -17.582 -23.934 -27.022 -50.7092 -10.9228 -4.3852 -13.7216 -3.8547 -3.5723 -8.419 -36.9773 -47.1418 -172.6874 -42.457 -44.0342 -42.7695 -13.0527 -16.7534 -78.7849 -101.8146 -32.1141 -30.3349 -104.0695 -32.1258 -24.6301 -32.6087 -17.08

## **MODEL PERFORMANCE EVALUATION**

In [9]:
letter_prediction <- predict(letter_classifier, letter_test)
head(letter_prediction)

**I will examine how well the classifier performed by comparing the predicted letter to the true letter in the test dataset.**

In [10]:
table(letter_prediction, letter_test$letter)

                 
letter_prediction   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O
                A 191   0   1   0   0   0   0   0   0   1   0   0   1   2   2
                B   0 157   0   9   2   0   1   3   0   0   1   0   3   0   0
                C   0   0 142   0   5   0  14   3   2   0   2   4   0   0   2
                D   1   1   0 196   0   1   4  12   5   3   4   4   0   6   5
                E   0   0   8   0 164   2   1   1   0   0   3   5   0   0   0
                F   0   0   0   0   0 171   4   2   8   2   0   0   0   0   0
                G   1   1   4   1  10   3 150   2   0   0   1   2   1   0   0
                H   0   3   0   1   0   2   2 122   0   2   4   2   2   5  23
                I   0   0   0   0   0   0   0   0 175  10   0   0   0   0   0
                J   2   2   0   0   0   3   0   2   7 158   0   0   0   0   1
                K   2   1  11   0   0   0   4   6   0   0 148   0   0   2   0
                L   0   0   0   0   1   0   1 

**The diagonal values (191, 157, 142, 196, etc.) give the number of records where the predicted letter matches the true letter.  
I will compute the accuracy rate below.**

In [11]:
match <- letter_prediction == letter_test$letter

In [12]:
table(match)


match
FALSE  TRUE 
  780  4220 

In [13]:
accuracy <- prop.table(table(match)) * 100
accuracy

match
FALSE  TRUE 
 15.6  84.4 

##### **The classifier correctly identified 84.4% of the test records and misclassified the remaining 15.6%.**
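The same accuracy figure can also be read directly off the diagonal of a confusion table, which additionally gives a per-letter breakdown. The sketch below shows the idea on a tiny made-up example; the `truth` and `pred` vectors are hypothetical and are not the notebook's results.

```r
# Hypothetical predictions and true labels (not the notebook's results)
truth <- factor(c("A", "A", "B", "B", "C", "C"), levels = c("A", "B", "C"))
pred  <- factor(c("A", "B", "B", "B", "C", "A"), levels = c("A", "B", "C"))

# Rows are predicted letters, columns are true letters, as in the table above
confusion <- table(pred, truth)

# Overall accuracy: diagonal (correct) counts divided by all counts
overall <- sum(diag(confusion)) / sum(confusion)

# Per-letter recall: correct counts divided by the number of true cases of each letter
per_letter <- diag(confusion) / colSums(confusion)

overall
per_letter
```

Applied to the real predictions, the per-letter values would show which letters the classifier confuses most often.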

## MODEL PERFORMANCE IMPROVEMENT

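**The previous model used the simple linear kernel, 'vanilladot'. A linear kernel can only separate classes with flat boundaries in the original feature space, so I will next try the Radial Basis Function (RBF) kernel, which can capture non-linear relationships, to see whether it improves performance.**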
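To make the difference between the two kernels concrete, the sketch below evaluates both on a pair of made-up feature vectors using base R only. The vectors, the helper function names, and the value of sigma are illustrative assumptions; when the RBF kernel is selected, ksvm() by default estimates sigma from the data.

```r
# Two hypothetical feature vectors
x <- c(2, 8, 3)
y <- c(5, 12, 3)

# Linear ("vanilladot") kernel: the plain dot product
linear_kernel <- function(a, b) sum(a * b)

# Gaussian RBF ("rbfdot") kernel: exp(-sigma * squared Euclidean distance)
rbf_kernel <- function(a, b, sigma = 0.05) exp(-sigma * sum((a - b)^2))

linear_kernel(x, y)  # 115
rbf_kernel(x, y)     # exp(-0.05 * 25)
rbf_kernel(x, x)     # 1: a point is maximally similar to itself
```

The RBF value depends only on the distance between the two points, which is what lets the resulting decision boundary bend around clusters of similar glyphs.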

In [14]:
letter_classifier2 <- ksvm(letter ~ ., data = letter_train, kernel = "rbfdot")

In [15]:
letter_prediction2 <- predict(letter_classifier2, letter_test)

In [16]:
# Flag each test record as correctly (TRUE) or incorrectly (FALSE) classified
match2 <- letter_prediction2 == letter_test$letter

In [17]:
prop.table(table(match2))*100

match2
FALSE  TRUE 
 7.12 92.88 

By switching from the linear kernel to the RBF kernel, I increased the model's accuracy on the test set from 84.4% to about 93% (92.88%).
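The improvement can also be read as a reduction in the error rate. A quick check of the arithmetic, using only the percentages reported above (the variable names are illustrative):

```r
# Test-set error rates (in percent) reported above
error_linear <- 15.6
error_rbf    <- 7.12

# Absolute and relative reduction in the error rate
absolute_drop <- error_linear - error_rbf
relative_drop <- absolute_drop / error_linear * 100

round(absolute_drop, 2)  # 8.48 percentage points
round(relative_drop, 1)  # 54.4 percent fewer errors
```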