# Machine Learning Assignment 2019

## Description of dataset and problem

The dataset was adapted from [archive.ics.uci.edu](https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation) and consists of points in the $(B,G,R)$ colour scheme that are labelled as human skin or not human skin. The dataset consists of $245\ 057$ samples, of which $50\ 859$ represent human skin and $194\ 198$ do not. A datapoint has the format $(B,G,R,C)$, where $B$, $G$ and $R$ are 8-bit integers in $[0,255]$ and $C$ is the class to which the point belongs: a value of $1$ indicates human skin while $2$ indicates not human skin. With this dataset we want to implement classifiers for skin segmentation and evaluate their performance. The two models that will be implemented are logistic regression and naive Bayes.

## Preprocessing of data

Before the data is used in any of the models, the features are normalized and the class labels are altered. The normalization is min-max scaling:
$$x_i'=\frac{x_i-x_{min}}{x_{max}-x_{min}}$$
Taking $x_{min}=0$ and $x_{max}=255$ for the 8-bit channels, this reduces to dividing each value by $255$. The class labels are remapped as $\{1,2\}\to\{0,1\}$, so $0$ indicates human skin and $1$ indicates not human skin. The dataset is then split into three parts: a training set ($60\%$ of the dataset), a validation set ($20\%$) and a testing set ($20\%$).
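
The preprocessing described above can be sketched as follows. This is a minimal illustration on synthetic data, not the procedure that produced the CSV files loaded in this notebook: the function name `preprocess_and_split`, the `seed` parameter and the toy frame are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def preprocess_and_split(df, seed=0):
    """Scale the colour channels, remap labels {1,2} -> {0,1}, split 60/20/20."""
    out = df.copy()
    channels = ['Blue', 'Green', 'Red']
    # 8-bit channels: x_min = 0 and x_max = 255, so min-max scaling is x / 255
    out[channels] = (out[channels] - 0) / (255 - 0)
    out['Class'] = out['Class'] - 1
    # shuffle once so the three parts are drawn from the same distribution
    out = out.sample(frac=1, random_state=seed).reset_index(drop=True)
    n_train = len(out) * 6 // 10
    n_val = len(out) * 2 // 10
    train = out.iloc[:n_train]
    val = out.iloc[n_train:n_train + n_val]
    test = out.iloc[n_train + n_val:]
    return train, val, test

# tiny synthetic frame standing in for the real dataset (hypothetical values)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'Blue': rng.integers(0, 256, 100),
    'Green': rng.integers(0, 256, 100),
    'Red': rng.integers(0, 256, 100),
    'Class': rng.integers(1, 3, 100),
})
train, val, test = preprocess_and_split(toy)
print(len(train), len(val), len(test))
```

Integer arithmetic is used for the split sizes so that the three parts always add up to the full dataset.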


In [1]:
import numpy as np
import pandas as pd

### Loading of dataset & sample points

In [2]:
trainingData=pd.read_csv('training_data.csv',names=['Blue','Green','Red','Class'])
validationData=pd.read_csv('validation_data.csv',names=['Blue','Green','Red','Class'])
testingData=pd.read_csv('testing_data.csv',names=['Blue','Green','Red','Class'])
trainingData.head()

|   | Blue | Green | Red | Class |
|---|---|---|---|---|
| 0 | 74 | 85 | 123 | 1 |
| 1 | 73 | 84 | 122 | 1 |
| 2 | 72 | 83 | 121 | 1 |
| 3 | 70 | 81 | 119 | 1 |
| 4 | 70 | 81 | 119 | 1 |


### Normalization & class reassignment

In [3]:
trainingData[['Blue','Green','Red']]=trainingData[['Blue','Green','Red']].div(255)
trainingData[['Class']]=trainingData[['Class']]-1
validationData[['Blue','Green','Red']]=validationData[['Blue','Green','Red']].div(255)
validationData[['Class']]=validationData[['Class']]-1
testingData[['Blue','Green','Red']]=testingData[['Blue','Green','Red']].div(255)
testingData[['Class']]=testingData[['Class']]-1
trainingData.sample()

|   | Blue | Green | Red | Class |
|---|---|---|---|---|
| 7222 | 0.454902 | 0.501961 | 0.729412 | 0 |


### Logistic regression
Logistic regression is implemented with minibatch gradient descent as the optimization method for determining the optimal values of $\theta=(\theta_0,\theta_1,\theta_2)$. Minibatch gradient descent is a hybrid between batch gradient descent and stochastic gradient descent. Instead of cycling through the entire dataset before performing an update, we cycle through a minibatch that is significantly smaller than the dataset and update afterwards. Unlike stochastic gradient descent, we do not approximate the gradient from a single point.
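
As an illustration of the update scheme just described, here is a vectorized sketch of minibatch gradient descent for logistic regression. It is not the implementation used in this notebook: the function name `minibatch_gd`, its default hyperparameters and the synthetic data are assumptions chosen for the example. Like the model here, it uses one weight per feature and no intercept.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def minibatch_gd(X, y, epochs=50, batch_size=10, alpha=0.5, seed=0):
    """Minibatch gradient descent for logistic regression without an intercept."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))  # new shuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # average gradient of the log-loss over this minibatch only
            grad = Xb.T @ (sigmoid(Xb @ theta) - yb) / len(idx)
            theta = theta - alpha * grad
    return theta

# synthetic, linearly separable toy data (not the skin dataset)
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 3))
y = (X @ np.array([2.0, -1.0, 0.5]) > 0).astype(float)
theta = minibatch_gd(X, y)
accuracy = np.mean((sigmoid(X @ theta) >= 0.5) == (y == 1))
print(theta, accuracy)
```

Each update uses the mean gradient over one minibatch, so the cost of a step depends on `batch_size` rather than on the size of the dataset.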

In [None]:
def sigmoid(theta):
    return lambda x:1/(1+np.exp(-np.dot(theta,x)))

#minibatch gradient descent for logistic regression (no intercept term)
def mgd(epochs,batchSize,alpha):
    #collect the training set as [features,label] pairs
    data=[]
    for index,row in trainingData.iterrows():
        b,g,r,y=row['Blue'],row['Green'],row['Red'],row['Class']
        data.append([[b,g,r],int(y)])
    
    theta=np.random.randn(3)    
    for i in range(epochs):
        #reshuffle every epoch so that the minibatches differ between epochs
        np.random.shuffle(data)
        miniBatches=[data[k:k+batchSize] for k in range(0,len(data),batchSize)]
        for miniBatch in miniBatches:
            h=sigmoid(theta)
            temp=np.zeros(3)
            prior=theta
            for j in range(len(theta)):
                gradients=[alpha*(h(x)-y)*x[j] for x,y in miniBatch]
                #average over the actual batch length (the last batch may be shorter)
                temp[j]=(1/len(miniBatch))*sum(gradients)
            theta=theta-temp
            #stop early once the update step becomes negligible
            eta=theta-prior
            if np.sqrt(np.dot(eta,eta))<10**-5:
                return theta
    return theta
def classify(h,x):
    if h(x)<0.5:
        return 0
    return 1

def accuracy(data,theta):
    score=0
    h=sigmoid(theta)
    
    for row in data:
        if classify(h,row[0])==row[1]:
            score=score+1
    return 100*score/len(data)

#testing how accurate model is by using it on a dataset            
def fit(theta,training=False,test=False):
    data=[]
    if training:
        for index,row in trainingData.iterrows():
            b,g,r,y=row['Blue'],row['Green'],row['Red'],row['Class']
            data.append([[b,g,r],int(y)])
    elif test:
        for index,row in testingData.iterrows():
            b,g,r,y=row['Blue'],row['Green'],row['Red'],row['Class']
            data.append([[b,g,r],int(y)])
    return accuracy(data,theta)
            
theta=mgd(2,10,0.9)
print(theta)

# performance=fit(theta,training=True)
# print("Training score",performance)
performance=fit(theta,test=True)
print("Testing score",performance,"%")

[ 5.56683359  4.65058547 -7.16665764]


### Naive Bayes

In implementing the naive Bayes classifier we assume that each feature follows a normal distribution within each class. This allows us to compute a probability value for a new, unseen point. Since the normal density is strictly positive, probability values of zero will not occur, so there is no need for smoothing.
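
The Gaussian assumption above can be sketched compactly as follows. This is an illustrative variant, not the code used in this notebook: it compares classes in log space, a common choice that avoids multiplying very small densities, and the helper names `fit_gaussian_nb`, `log_joint` and `predict` as well as the synthetic data are assumptions made for the example.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Per-class prior, feature means and feature standard deviations."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0, ddof=1))
    return params

def log_joint(x, prior, mean, std):
    # log p(c) + sum_j log N(x_j | mean_j, std_j), using the independence assumption
    log_pdf = -0.5 * np.log(2 * np.pi * std**2) - (x - mean)**2 / (2 * std**2)
    return np.log(prior) + log_pdf.sum()

def predict(params, x):
    # the class with the largest log joint also has the largest posterior
    return max(params, key=lambda c: log_joint(x, *params[c]))

# synthetic two-class data (not the skin dataset)
rng = np.random.default_rng(0)
X0 = rng.normal(0.3, 0.05, size=(200, 3))
X1 = rng.normal(0.7, 0.05, size=(200, 3))
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)
params = fit_gaussian_nb(X, y)
predictions = np.array([predict(params, x) for x in X])
print(np.mean(predictions == y))
```

Because the evidence term in Bayes' rule is the same for every class, comparing the log joints gives the same decision as comparing the posteriors.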

In [None]:
def gaussian(mean,std):
    return lambda x:(1/np.sqrt(2*np.pi*std**2))*np.exp(-1*((x-mean)**2)/(2*std**2))

#returns the mean and standard deviation of a feature with respect to a class
def summary(name,group):
    sample=trainingData[[name,'Class']]
    sample=sample[sample['Class']==group]
    #select the column as a Series so that mean() and std() return scalars
    mean=sample[name].mean()
    std=sample[name].std()
    
    return mean,std
#returns six normal distributions to be used in Bayes' rule
def pdfs():
    names=['Blue','Green','Red']
    functions=[]
    for j in range(2):
        for name in names:
            mean,std=summary(name,j)
            p=gaussian(mean,std)
            functions.append(p)
            
    return functions
#returns a function that takes in (b,g,r) and returns the posterior probability of class 0
def naiveBayes():
    #class probabilities
    p=len(trainingData[trainingData['Class']==0])/len(trainingData)
    q=1-p
    #normal distributions: f[0..2] belong to class 0, f[3..5] to class 1
    f=pdfs()
    #Bayes' rule: posterior probability of class 0, which is used to classify
    pds=lambda b,g,r:(p*f[0](b)*f[1](g)*f[2](r))/(p*f[0](b)*f[1](g)*f[2](r)+q*f[3](b)*f[4](g)*f[5](r))
    
    return pds

def classify(h,x):
    b,g,r=x
    if h(b,g,r)<0.5:
        return 1
    return 0

def accuracy(data):
    score=0
    h=naiveBayes()
    
    for row in data:
        if classify(h,row[0])==row[1]:
            score=score+1
    return 100*score/len(data)

def fit(training=False,test=False):
    data=[]
    if training:
        for index,row in trainingData.iterrows():
            b,g,r,y=row['Blue'],row['Green'],row['Red'],row['Class']
            data.append([[b,g,r],int(y)])
    elif test:
        for index,row in testingData.iterrows():
            b,g,r,y=row['Blue'],row['Green'],row['Red'],row['Class']
            data.append([[b,g,r],int(y)])
    return accuracy(data)


performance=fit(test=True)
print("Testing score",performance,"%")