# Machine Learning 
## A
### Mehdi Tadayoni
### ID

# Report
## 1. Introduction
The main goal is to implement a Naive Bayes algorithm from scratch that predicts the domain (one of Archaea, Bacteria, Eukaryota or Virus) from the abstracts of research papers about proteins taken from the MEDLINE database.
The task is divided into several parts: preprocessing, the Naive Bayes class (which includes several functions), importing the data, splitting the data into training and test sets, evaluating the model on the test set, and exporting the predictions for the new test data (tst.csv) for submission to Kaggle.

## 2. Preprocessing
In this step, str_arg and cleaned_str are the input and output of the preprocessing function, respectively. The function performs the following tasks:
1. Every character other than letters and whitespace is replaced by a space
2. Multiple spaces are replaced by a single space
3. The input is converted to lower case
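
The three steps above can be sketched on a single sentence. This is only an illustration: `clean_text` is a hypothetical stand-in for the notebook's preprocess_string, and the sample sentence is made up.

```python
import re

def clean_text(text):
    # Hypothetical stand-in for the notebook's preprocess_string.
    # Step 1: replace runs of characters that are neither letters nor whitespace.
    letters_only = re.sub(r"[^a-z\s]+", " ", text, flags=re.IGNORECASE)
    # Step 2: collapse runs of whitespace into a single space.
    single_spaced = re.sub(r"\s+", " ", letters_only)
    # Step 3: convert to lower case.
    return single_spaced.lower()

sample = "HIV-1 protease (PR) cleaves 2 sites."  # made-up input
tokens = clean_text(sample).split()
print(tokens)  # ['hiv', 'protease', 'pr', 'cleaves', 'sites']
```

Note that digits and punctuation disappear entirely, so a token such as "HIV-1" is reduced to "hiv".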

## 3. NaiveBayes Classifier
This is generic code for a Naive Bayes classifier. It defines a NaiveBayes class that contains the following functions:
### 3-1 def addToBow(self,example,id_dic):
This function takes 'example' as input, together with 'id_dic', which indicates the Bag of Words category the example belongs to.
It splits the input (example) on whitespace as a tokenizer and adds every token to
its corresponding dictionary, or Bag of Words.
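
As a rough sketch of the idea, one word-count dictionary can be kept per class and updated token by token. The names `bow_per_class` and `add_to_bow` and the two-class setup are assumptions made for this illustration, not the notebook's exact code.

```python
from collections import defaultdict

class_labels = ["A", "B"]  # illustrative subset of the four labels
# One word-count dictionary (Bag of Words) per class; missing words start at 0.
bow_per_class = [defaultdict(int) for _ in class_labels]

def add_to_bow(example, class_index):
    # Tokenize on whitespace and count each token in the class dictionary.
    for token in example.split():
        bow_per_class[class_index][token] += 1

add_to_bow("membrane protein binds membrane", 1)
print(dict(bow_per_class[1]))  # {'membrane': 2, 'protein': 1, 'binds': 1}
```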
### 3-2 def train(self,dataset,labels):
This function trains the Naive Bayes model. Its main task is to compute a Bag of Words for each class (label). The Python packages re and collections are used for training. In this assignment, pandas is used only to construct cleaned_examples (pd.DataFrame()) during training and to write the Kaggle submission file.
In this step we need the quantities in $\frac{\text{count}(w \mid c) + 1}{\text{count}(c) + |V| + 1} \cdot p(c)$, so we compute:
1. the prior probability of each label (A, B, E, V), $p(c)$
2. the vocabulary size $|V|$
3. the denominator value for each class, $\text{count}(c) + |V| + 1$

All of these steps could be done at test time, but that would be time-consuming, so we precompute them during training and reuse them at test time to speed up predictions.
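
A small worked example of the three precomputed quantities, using the denominator exactly as written in the formula above. The three toy documents and all variable names are invented for illustration.

```python
from collections import Counter

# Toy training set of (label, cleaned abstract) pairs; made up for illustration.
docs = [("B", "cell wall protein"), ("B", "cell membrane"), ("V", "capsid protein")]

# 1. Prior probability p(c) of each label.
class_counts = Counter(label for label, _ in docs)
priors = {c: n / len(docs) for c, n in class_counts.items()}

# Word counts per class (the Bag of Words for each label).
word_counts = {c: Counter() for c in class_counts}
for label, text in docs:
    word_counts[label].update(text.split())

# 2. Vocabulary over all classes, giving |V|.
vocab = set().union(*word_counts.values())

# 3. Denominator count(c) + |V| + 1 for each class.
denominators = {c: sum(word_counts[c].values()) + len(vocab) + 1 for c in word_counts}

# Smoothed likelihood of the word "protein" under class B: (1 + 1) / 11.
protein_given_b = (word_counts["B"]["protein"] + 1) / denominators["B"]
print(len(vocab), denominators, protein_given_b)
```

Here class B has 5 tokens and the vocabulary has 5 distinct words, so its denominator is 5 + 5 + 1 = 11.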
### 3-3 def getExampleProb(self,test_example):
The likelihood (probLike) and posterior (probpost) scores are the main quantities computed in this step. Both are accumulated in log space, and probpost is returned.
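
The computation can be sketched as a sum of logs, which avoids the numerical underflow that multiplying many small probabilities would cause. The function `log_posterior` and the toy counts below are assumptions for illustration; they follow the smoothed formula from section 3-2.

```python
import math

def log_posterior(tokens, word_counts, prior, denominator):
    # Start from the log prior, then add one smoothed log-likelihood per token.
    score = math.log(prior)
    for token in tokens:
        score += math.log((word_counts.get(token, 0) + 1) / denominator)
    return score

tokens = ["cell", "protein"]
# Toy counts, priors and denominators, invented for this example.
bacteria = log_posterior(tokens, {"cell": 2, "wall": 1, "protein": 1, "membrane": 1}, 2 / 3, 11)
virus = log_posterior(tokens, {"capsid": 1, "protein": 1}, 1 / 3, 8)
print(bacteria > virus)  # True
```

The scores are unnormalized log posteriors, which is enough for choosing the most probable class.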
### 3-4 def test(self,test_set)
The main objective of this step is to compute the probability of each test example under every class and to predict the class with the highest posterior probability.
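
Selecting the class amounts to an argmax over the per-class scores. A minimal sketch, with made-up log-posterior values:

```python
import numpy as np

classes = np.array(["A", "B", "E", "V"])
# Hypothetical log posteriors for one abstract, one value per class.
log_posteriors = np.array([-42.7, -35.1, -36.9, -51.3])

# The predicted label is the class whose score is largest (closest to zero).
predicted = classes[np.argmax(log_posteriors)]
print(predicted)  # B
```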
## 4. Splitting the data
The data was split into a training set (80%) and a test set (20%).
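
A sketch of such a split on ten placeholder examples. As in the notebook code further down, the split is sequential (the first 80% of the rows form the training set) rather than shuffled; the arrays here are invented.

```python
import numpy as np

# Placeholder data: ten "abstracts" with labels.
X = np.array([f"abstract {i}" for i in range(10)])
y = np.array(list("ABEVABEVAB"))

split_index = int(0.8 * len(X))  # index of the first test example
train_data, test_data = X[:split_index], X[split_index:]
train_labels, test_labels = y[:split_index], y[split_index:]
print(len(train_data), len(test_data))  # 8 2
```

A sequential split assumes that the rows are not ordered by class; otherwise shuffling first would be the safer choice.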
## 5. Evaluation of test results using the NB model
The accuracy on the test set is about 95%.
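
Accuracy here means the fraction of test examples whose predicted label equals the true label. A minimal sketch with invented labels:

```python
import numpy as np

predicted = np.array(["A", "B", "E", "V", "B"])  # made-up predictions
actual = np.array(["A", "B", "E", "B", "B"])     # made-up true labels

# Element-wise comparison gives booleans; their mean is the accuracy.
accuracy = np.mean(predicted == actual)
print(accuracy)  # 0.8
```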
## 6. Evaluation of new data on Kaggle
Predictions for the new test data (tst.csv) were uploaded to Kaggle; the submission, made on 10/4/2021, scored an accuracy of about 0.96.

In [121]:
import numpy as np 
from collections import defaultdict
import pandas as pd
import re 

#### Let's first write a handy text-preprocessing function, which is not part of the NaiveBayes class

In [122]:
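def preprocess_string(str_arg):
    # replace every run of characters that are neither letters nor whitespace with a space
    cleaned_str=re.sub(r'[^a-z\s]+',' ',str_arg,flags=re.IGNORECASE)
    # collapse runs of whitespace into a single space
    cleaned_str=re.sub(r'\s+',' ',cleaned_str)
    # convert to lower case
    cleaned_str=cleaned_str.lower()
    return cleaned_str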

In [123]:
class NaiveBayes:
    
    def __init__(self,label_cl):
        # array of the unique class labels
        self.classes=label_cl 
        
    def addToBow(self,example,id_dic):
        # np.apply_along_axis passes each row as a one-element array; unwrap it
        if isinstance(example,np.ndarray): example=example[0]
        for token_vocab in example.split(): # for every token in the preprocessed example
            self.bow_dicts[id_dic][token_vocab]+=1
                
#------training function which can train the Naive Bayes Model-------------------------------------------------                
    def train(self,dataset,labels):
        self.examples=dataset
        self.labels=labels
        self.bow_dicts=np.array([defaultdict(lambda:0) for index in range(self.classes.shape[0])])
        
        #converting to numpy arrays
        if not isinstance(self.examples,np.ndarray):
            self.examples=np.array(self.examples)
        if not isinstance(self.labels,np.ndarray):
            self.labels=np.array(self.labels)
            
        #constructing bag of words 
        for ct_index,ct in enumerate(self.classes):
            all_ct_examples=self.examples[self.labels==ct] 
            cleaned_examples=[preprocess_string(ct_example) for ct_example in all_ct_examples]
            cleaned_examples=pd.DataFrame(data=cleaned_examples)
            np.apply_along_axis(self.addToBow,1,cleaned_examples,ct_index)
            
 #---------------------------- Probabilities calculations ------------------------------------------------------------              
      
        labelPro=np.empty(self.classes.shape[0])
        vocabs=[]
        ct_vocab_counts=np.empty(self.classes.shape[0])
        for ct_index,ct in enumerate(self.classes):
           
            # prior probability estimation
            labelPro[ct_index]=np.sum(self.labels==ct)/float(self.labels.shape[0]) 
            
            # total token count for this class, plus 1
            ct_vocab_counts[ct_index]=np.sum(np.array(list(self.bow_dicts[ct_index].values())))+1 
            
            #get all words of this category                                
            vocabs+=self.bow_dicts[ct_index].keys()
                                                     
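        # combine the vocabulary of all classes
        self.vocab=np.unique(np.array(vocabs))
        self.vocab_length=self.vocab.shape[0]
                                                                    
        # denominator for each class: ct_vocab_counts (which already includes +1) + |V| + 1
        denoms=np.array([ct_vocab_counts[ct_index]+self.vocab_length+1 for ct_index,ct in enumerate(self.classes)])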

        # self.ct_info holds one tuple per class: dict at index 0, prior probability at index 1, denominator value at index 2
        self.ct_info=[(self.bow_dicts[ct_index],labelPro[ct_index],denoms[ct_index]) for ct_index,ct in enumerate(self.classes)]
        self.ct_info=np.array(self.ct_info,dtype=object)
                                              
  #---------------------------------------------Estimates posterior probability --------------------------------------------------------                                            
    def getExampleProb(self,test_example):                                
        
        # accumulated log-likelihood of the example under each class
        probLike=np.zeros(self.classes.shape[0])
        
        #finding probability for each label 
        for ct_index,ct in enumerate(self.classes): 
                             
            for token in test_example.split():                                               
                token_counts=self.ct_info[ct_index][0].get(token,0)+1
                
                #getting likelihood                              
                token_prob=token_counts/float(self.ct_info[ct_index][2])                              
                probLike[ct_index]+=np.log(token_prob)
                                              
        # log-posterior for every label: log-likelihood + log-prior
        probpost=np.empty(self.classes.shape[0])
        for ct_index,ct in enumerate(self.classes):
            probpost[ct_index]=probLike[ct_index]+np.log(self.ct_info[ct_index][1])                                  
      
        return probpost
    
 #------------------------------------------- Predicts the label of each test example
    def test(self,test_set):
           
        prediction=[] 
        for example in test_set:                           
            # apply the same preprocessing as for the training set
            cleaned_example=preprocess_string(example) 
            # log-posterior of this example for every class
            probpost=self.getExampleProb(cleaned_example)
            # select the class with the highest value
            prediction.append(self.classes[np.argmax(probpost)])
        return np.array(prediction) 

## Splitting the data  

In [124]:
df = np.genfromtxt("trg.csv", dtype=None, encoding=None, delimiter=",",skip_header=1,usecols=[1,2])
X=df[:,1]
y=df[:,0]

In [125]:
sp=int(0.8*len(df))
train_data, test_data= X[:sp], X[sp:]
train_labels, test_labels= y[:sp], y[sp:]
print (" Number of Training data: ",len(train_data))
print (" Number of Training Labels: ",len(train_labels))

 Number of Training data:  3200
 Number of Training Labels:  3200


In [126]:
print ("Number of Test Examples: ",len(test_data))
print ("Number of Test Labels: ",len(test_labels))

Number of Test Examples:  800
Number of Test Labels:  800


## Training by Naive Bayes algorithm

In [127]:
nb=NaiveBayes(np.unique(train_labels)) 
print ("Training phase starts ")
nb.train(train_data,train_labels)
print ('Training phase finished')

Training phase starts 
Training phase finished


## Evaluation of test result Using the NB Model

In [128]:
nbclasses=nb.test(test_data) 
test_Eval=np.sum(nbclasses==test_labels)/float(test_labels.shape[0]) 

print ("Test Set data: ",test_labels.shape[0])
print ("Test Set Accuracy: ",test_Eval*100,"%")

Test Set data:  800
Test Set Accuracy:  95.0 %


In [129]:
# Loading the test set dataset tst.csv
df_test = np.array(np.genfromtxt("tst.csv", dtype=None, encoding=None, delimiter=",",skip_header=0 ))
Xtest=np.array(df_test[:,1])
#generating predictions....
classes=np.array(nb.test(Xtest))

In [None]:
# pandas is used here to read tst.csv and write the Kaggle submission file
test=pd.read_csv('tst.csv')
Xtest=test.iloc[:,1]
classes=nb.test(Xtest) 
kaggle_df=pd.DataFrame(data=np.column_stack([test["id"].values,classes]),columns=["id",'class'])
kaggle_df.to_csv("mtad786_v1.csv",index=False)
print ('Predictions generated and saved to mtad786_v1.csv')