# k-NN Classification with German Credit Data
This notebook runs a k-NN analysis on the provided German Credit Data. It also does the following:
* Tests how the misclassification rate varies with the sizes of the training and test sets
* Tests how the misclassification rate varies with the value of K
* Tabulates all of the results

In [1]:
library(class)

## Reading the Data and Factoring
After reading the CSV data file into a data frame, I create a factor for the expected credit risk.
The values are mapped as follows:
* 1 - High
* 2 - Low

In [115]:
data <- read.csv("data/german_credit_data1.csv")

In [116]:
data$Credit.Factor <- factor(data$Credit.Risks, labels = c("high","low"))
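
As a small illustration of this mapping, here is a sketch on a made-up vector (not the real column). factor assigns the labels to the sorted distinct values, so 1 becomes high and 2 becomes low:

```r
# Hypothetical toy vector standing in for data$Credit.Risks
credit_risks <- c(1, 2, 1, 1, 2)

# Labels are matched to the sorted distinct values: 1 -> "high", 2 -> "low"
credit_factor <- factor(credit_risks, labels = c("high", "low"))

levels(credit_factor)        # "high" "low"
as.character(credit_factor)  # "high" "low" "high" "high" "low"
table(credit_factor)         # class counts: high 3, low 2
```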

## Converting Non-Integer Factors to Integer
Many columns in the data frame hold text values rather than integers, so I convert each of them to a factor with integer labels to make them easier for the knn function to use. All NA values are assigned a level of 0.

In [117]:
x <- factor(data$Saving.accounts, labels = c(1,2,3,4))
levels(x) <- c(levels(x),0)
x[is.na(x)] <- 0
data$Saving.accounts <- x

In [118]:
x <- factor(data$Checking.account, labels = c(1,2,3))
levels(x) <- c(levels(x),0)
x[is.na(x)] <- 0
data$Checking.account <- x

In [119]:
x <- factor(data$Sex, labels = c(1,2))
levels(x) <- c(levels(x),0)
x[is.na(x)] <- 0
data$Sex <- x

In [120]:
x <- factor(data$Purpose, labels = c(1,2,3,4,5,6,7,8))
levels(x) <- c(levels(x),0)
x[is.na(x)] <- 0
data$Purpose <- x

In [130]:
x <- factor(data$Housing, labels = c(1,2,3))
levels(x) <- c(levels(x),0)
x[is.na(x)] <- 0
data$Housing <- x
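
The five cells above repeat the same three steps. A minimal sketch of a helper that does the same job in one place follows. It assumes plain integer codes are acceptable in place of factors, and the name encode_with_na_zero is hypothetical, not part of the original notebook:

```r
# Hypothetical helper (not in the original notebook): map each distinct
# value of a column to an integer code 1..m and every missing value to 0.
encode_with_na_zero <- function(column) {
  codes <- as.integer(factor(column))  # codes follow sorted level order; NA stays NA
  codes[is.na(codes)] <- 0L            # missing values get the code 0
  codes
}

# Toy example with made-up values
saving <- c("little", NA, "rich", "moderate", "little")
encode_with_na_zero(saving)  # 1 0 3 2 1

# Possible use on a freshly read data frame, in place of the cells above
# (column names taken from the notebook):
# cols <- c("Saving.accounts", "Checking.account", "Sex", "Purpose", "Housing")
# data[cols] <- lapply(data[cols], encode_with_na_zero)
```

Unlike the cells above, this returns integer vectors, so the columns are numeric before they reach knn.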

## Previewing Data
Along with previewing the data frame, I count the number of rows in the data set; this count is used later to split the data into training and test sets.

In [137]:
n <- nrow(data)
head(data)

X.,Age,Sex,Job,Housing,Saving.accounts,Checking.account,Credit.amount,Duration,Purpose,Credit.Risks,Credit.Factor
0,67,2,2,2,0,1,1169,6,6,1,high
1,22,1,2,2,1,2,5951,48,6,2,low
2,49,2,1,2,1,0,2096,12,4,1,high
3,45,2,2,1,1,1,7882,42,5,1,high
4,53,2,2,1,1,1,4870,24,2,2,low
5,35,2,1,1,0,0,9055,36,4,1,high


## k-NN Classification of Data
The training-set fractions are defined in the vector named sampling, and the K values are defined in the vector named k_vals.
For classification, I use the knn function from R's class package. For every combination of K and training fraction, I draw a fresh random split, compute the misclassification rate on the test set, and store it in a data frame to evaluate later.
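
The random split can be sketched in isolation on a toy row count. The seed and the names n_toy and rate are assumptions made only for this illustration:

```r
set.seed(42)   # assumption: fixed seed, only so the sketch is repeatable
n_toy <- 10    # hypothetical number of rows
rate <- 0.8    # 80% of rows for training, the rest for testing

n_train <- round(rate * n_toy)
training_ids <- sample(seq_len(n_toy), n_train, replace = FALSE)
testing_ids <- setdiff(seq_len(n_toy), training_ids)

length(training_ids)                  # 8
length(testing_ids)                   # 2
intersect(training_ids, testing_ids)  # integer(0): the two sets never overlap
```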

In [145]:
sampling <- c(0.9,0.8,0.75,0.7,0.6,0.5)  # fraction of rows used for training
k_vals <- 1:20
results <- data.frame(integer(),double(),double(),double())
for(k in k_vals){
    for(sampling.rate in sampling){
        # Draw a fresh random train/test split for every (k, sampling.rate) pair
        num.training <- round(sampling.rate * n)
        training_ids <- sample(1:n, num.training, replace=FALSE)
        testing_ids <- setdiff(1:n, training_ids)
        # Count the test rows directly rather than deriving them from the rate
        num.test.set.labels <- length(testing_ids)
        # Drop the label columns and the row index from the predictors
        train <- subset(data[training_ids,], select=-c(Credit.Risks, Credit.Factor, X.))
        test <- subset(data[testing_ids,], select=-c(Credit.Risks, Credit.Factor, X.))
        train_label <- data$Credit.Factor[training_ids]
        test_label <- data$Credit.Factor[testing_ids]
        predicted.labels <- knn(train, test, train_label, k = k)
        num.incorrect.labels <- sum(predicted.labels != test_label)
        misclassification.rate <- num.incorrect.labels / num.test.set.labels
        results <- rbind(results, c(k, num.training, num.test.set.labels, misclassification.rate))
    }
}
names(results) <- c("K","Training.Data.Size","Test.Data.Size","Misclassification.Rate")
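
The misclassification rate used in the loop is the share of test rows whose predicted label differs from the true label. A minimal sketch on made-up labels follows; the function name misclassification_rate is hypothetical:

```r
# Hypothetical helper: fraction of predictions that differ from the true labels
misclassification_rate <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  # Compare as character so factors with different level orders still match up
  mean(as.character(predicted) != as.character(actual))
}

# Made-up labels for illustration only
predicted <- factor(c("high", "low", "high", "high"), levels = c("high", "low"))
actual    <- factor(c("high", "high", "high", "low"), levels = c("high", "low"))

misclassification_rate(predicted, actual)  # 2 of 4 labels differ, so 0.5
```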

In [146]:
results

K,Training.Data.Size,Test.Data.Size,Misclassification.Rate
1,900,100,0.4100000
1,800,200,0.4450000
1,750,250,0.3880000
1,700,300,0.4100000
1,600,400,0.4075000
1,500,500,0.3720000
2,900,100,0.4300000
2,800,200,0.4100000
2,750,250,0.4440000
2,700,300,0.3900000


### Sorting Results with Lowest Misclassification Rate
To find the best fit, I sort the results by misclassification rate; the lower the rate, the better the fit. In this run, the best fit is **K = 11** with a training-test ratio of **90-10**, giving a rate of **0.26**. Because the splits are drawn at random, these values change from run to run.

In [154]:
results[with(results, order(Misclassification.Rate)), ]

Row,K,Training.Data.Size,Test.Data.Size,Misclassification.Rate
61,11,900,100,0.2600000
97,17,900,100,0.2700000
87,15,750,250,0.2720000
113,19,600,400,0.2725000
98,17,800,200,0.2750000
116,20,800,200,0.2750000
69,12,750,250,0.2800000
38,7,800,200,0.2800000
85,15,900,100,0.2800000
91,16,900,100,0.2800000
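
As a possible complement to sorting, the sketch below picks the single best row with which.min and averages the rate per K with aggregate. The data frame toy_results is a hypothetical stand-in holding the first four K = 1 and K = 2 rows of the results table shown earlier. Since each row comes from one random split, a per-K average may be a steadier guide than the top row alone:

```r
# Hypothetical stand-in for `results`, using four rows from the table above
toy_results <- data.frame(
  K = c(1, 1, 2, 2),
  Training.Data.Size = c(900, 800, 900, 800),
  Test.Data.Size = c(100, 200, 100, 200),
  Misclassification.Rate = c(0.410, 0.445, 0.430, 0.410)
)

# Row with the lowest rate (the first one is returned when there is a tie)
best <- toy_results[which.min(toy_results$Misclassification.Rate), ]
best$K  # 1

# Mean misclassification rate for each K across the training-set sizes
mean_by_k <- aggregate(Misclassification.Rate ~ K, data = toy_results, FUN = mean)
mean_by_k  # K = 1: 0.4275, K = 2: 0.42
```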
