For this program, you will be coding some of the ideas around Bayes classification.

**For most of this coding assignment, you may not use packages or "canned" code in your program other than simple calls to needed math functions or file I/O; i.e., do not use a Bayes classifier that you did not write. However, you may use other software for making histograms, plotting, or visualization, so those tasks are a good place to use packages.**

The input file is a csv file named *BayesAssign1_??.dat*, located in the Data folder on Google, where ?? is a number. The folder may contain more than one data file; each was built with different parameters.

Each line is an observation: the class indicator, "NEG" or "POS", followed by the numeric value of one feature. NEG indicates the absence of a disease or other characteristic. POS indicates the presence.

The values for each class come from Gaussian distributions. 

Split the data into train and test sets. Determine the statistics for each distribution (mean and variance), NEG and POS, from the training data. Using this information, classify the data in the test set.
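
The classification step can be written as a decision rule. The following is a sketch, not a required derivation. The notation is introduced here: $\mu_c$ and $\sigma_c$ are the mean and standard deviation estimated for class $c$, and $P(c)$ is the prior, assumed to be estimated from the class frequencies in the training set.

$$
p(x \mid c) = \frac{1}{\sigma_c \sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu_c)^2}{2\sigma_c^2}\right), \qquad c \in \{\mathrm{NEG}, \mathrm{POS}\}
$$

$$
P(c \mid x) = \frac{p(x \mid c)\,P(c)}{p(x \mid \mathrm{NEG})\,P(\mathrm{NEG}) + p(x \mid \mathrm{POS})\,P(\mathrm{POS})}
$$

Because the denominator is the same for both classes, it is enough to compare the numerators: predict POS when $p(x \mid \mathrm{POS})\,P(\mathrm{POS}) > p(x \mid \mathrm{NEG})\,P(\mathrm{NEG})$, and NEG otherwise.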

Print to the console the following values:
* estimated mean and standard deviation for each class.
* estimated prior probability of each class
* percentage of the data used for training
* prevalence
* accuracy
* sensitivity
* specificity
* precision (positive predictive value)
* a confusion matrix (does not have to be in matrix form - just label the numbers)
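
For reference, the performance values in the list above can be written in terms of the four confusion-matrix counts. The symbols are notation introduced here: $TP$, $FP$, $TN$, $FN$ are the true/false positive/negative counts on the test set, and $N = TP + FP + TN + FN$.

$$
\text{prevalence} = \frac{TP + FN}{N}, \qquad
\text{accuracy} = \frac{TP + TN}{N}
$$

$$
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{precision} = \frac{TP}{TP + FP}
$$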

You may want to output your classified test data. If so, each line could contain: actual class, predicted class, value.
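
One possible way to write such a file is sketched below. The function name `write_classified` and the example rows are hypothetical and only illustrate the line layout; the sketch writes to an in-memory buffer, and an open file object could be passed instead.

```python
import csv
import io

def write_classified(rows, out_stream):
    """Write one 'actual class, predicted class, value' line per observation."""
    writer = csv.writer(out_stream, lineterminator="\n")
    for actual, predicted, value in rows:
        writer.writerow([actual, predicted, "{:.4f}".format(value)])

# Hypothetical classified observations, for illustration only.
example_rows = [("NEG", "NEG", 0.1234), ("POS", "NEG", 0.9)]

buffer = io.StringIO()
write_classified(example_rows, buffer)
print(buffer.getvalue(), end="")
```

A file written this way can be read back with `csv.reader` or loaded into a spreadsheet.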

You could read this data into other programs for analysis, ...

**Other ideas**

* Make plots, on the same graph, of two Gaussians using the mean and variance you calculated.
* Plot histograms of the original data.
* How do your estimated parameters and performance vary with training fraction?
* Discuss how this might work if one or both of the distributions were uniform in the interval [a, b] instead of both Gaussian.
* Discuss how this might work for more than two classes.
* Discuss how this might work for multi-dimensional (more than one feature) observations. E.g., 2-D Gaussians.
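
As a starting point for the discussion items on uniform distributions and additional classes, here is a minimal sketch. It assumes each class is described by a prior and a density function, and it picks the class with the largest prior-weighted density. The names `gauss_pdf`, `uniform_pdf`, `classify`, and the three example classes with their parameters are all made up for illustration.

```python
import math

def gauss_pdf(x, mean, sigma):
    """Gaussian probability density at x."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def uniform_pdf(x, a, b):
    """Uniform density on [a, b]: constant inside the interval, zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def classify(x, models):
    """Return the class whose prior * density(x) is largest.

    models maps a class name to a (prior, density function) pair.
    """
    best_name, best_score = None, -1.0
    for name, (prior, density) in models.items():
        score = prior * density(x)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical three-class problem: two Gaussian classes and one uniform class.
models = {
    "A": (0.5, lambda x: gauss_pdf(x, 0.0, 1.0)),
    "B": (0.3, lambda x: gauss_pdf(x, 3.0, 1.0)),
    "C": (0.2, lambda x: uniform_pdf(x, 5.0, 7.0)),
}

for x in (0.0, 3.0, 6.0):
    print(x, classify(x, models))
```

Note that a uniform class can never be predicted outside its interval, because its density there is exactly zero. The same comparison carries over to several features if each density function takes a vector instead of a scalar.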






In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
'''
Functions and imports for Bayes Assignment #1
'''
import math
import random
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import colors
from matplotlib.ticker import PercentFormatter

#Return the value of the Gaussian probability density function at x, given mean = m and sigma = s
def gauss_val(x, m, s):
  val = math.exp(-math.pow((x-m)/s,2)/2.0)/(s*math.sqrt(math.pi*2))
  return(val)

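#Generate num labeled values, each drawn from one of two class Gaussians
#  stats = [mean_0, sigma_0, prob_0, mean_1, sigma_1, prob_1]
#  Add uniform noise in [-noise_fact*sigma, +noise_fact*sigma] of the chosen class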
def gen_gauss_list(stats, cl_names, num, noise_fact):
  g_vals = []
  for i in range(num):
    c = random.random()
    #Is the sample from class 0 or from class 1?
    if (c < stats[2]):
      new_val = random.gauss(stats[0], stats[1])
      noise_val = random.uniform(-noise_fact*stats[1], noise_fact*stats[1])
      new_val += noise_val
      g_vals.append([cl_names[0], new_val])
    else:
      new_val = random.gauss(stats[3], stats[4])
      noise_val = random.uniform(-noise_fact*stats[4], noise_fact*stats[4])
      new_val += noise_val      
      g_vals.append([cl_names[1], new_val])
  return(g_vals)

In [None]:
'''
Generate data for Bayes Assignment #1. Assume 2 Gaussians for now

Students do not need this code, and it is only here in case of interest
  or if they want to make their own data to check their code.
  If so, change the output file name below.
Note: parameters given below are not those used for data provided.

'''

if __name__ == "__main__":

  #Class conditional parameters for two Gaussians.
  MEAN_0 = 0.0
  SIG_0  = 1.0
  PROB_0 = 0.5
  MEAN_1 = 1.0
  SIG_1  = 2.0
  PROB_1 = 1 - PROB_0

  STATS = [MEAN_0, SIG_0, PROB_0, MEAN_1, SIG_1, PROB_1]
  CLASS_NAMES = ["NEG", "POS"]

  SEED = 977894657
  random.seed(SEED)

  #Noise can be added as a fraction of the sigma for each class.
  NOISE_FACT = 0.0

  NUM_VALUES = 10000

  #Print parameters
  print("Actual parameters: {:.4f} {:.4f} {:.4f} {:.4f} {:.4f} {:.4f}".format(MEAN_0, SIG_0, PROB_0, \
                                                                              MEAN_1, SIG_1, PROB_1))
  print("Noise factor: {:.4f}".format(NOISE_FACT))
  print()

  val_list = gen_gauss_list(STATS, CLASS_NAMES, NUM_VALUES, NOISE_FACT)

  #Output file name. Students should use their folder and name.
  out_file_name = 'MY OUTPUT FILE NAME'

  #Write output csv file, one "class, value" line per observation
  with open(out_file_name, 'w') as out_file_ptr:
    for cl_name, value in val_list:
      out_str = "{:s}, {:7.4f}\n".format(cl_name, value)
      out_file_ptr.write(out_str)


  print("END")


Actual parameters: 0.0000 1.0000 0.5000 1.0000 2.0000 0.5000
Noise factor: 0.0000

END


**Sample output from running  the data generation code.**

Actual parameters: 0.0000 1.0000 0.5000 1.0000 2.0000 0.5000

Noise factor: 0.0000

END


In [None]:
def perf(input_set):
  '''
  Tally the confusion matrix and derived rates for a classified set.
  Each row holds the actual class at index 0 and the predicted class at index 4.
  '''
  stats = {}
  true_pos = 0
  true_neg = 0
  false_pos = 0
  false_neg = 0
  for lists in input_set:
    if lists[0] == 'POS' and lists[4] == 'POS':
      true_pos += 1
    elif lists[0] == 'NEG' and lists[4] == 'NEG':
      true_neg += 1
    elif lists[0] == 'NEG' and lists[4] == 'POS':
      false_pos += 1
    else:
      false_neg += 1
  stats['true_pos'] = true_pos
  stats['true_neg'] = true_neg
  stats['false_pos'] = false_pos
  stats['false_neg'] = false_neg

  # A rate whose denominator is zero is reported as 0.0 instead of raising an error
  def rate(num, den):
    return num / den if den else 0.0

  stats['Prevalence'] = rate(true_pos + false_neg, len(input_set))
  stats['Accuracy'] = rate(true_pos + true_neg, len(input_set))
  stats['Sensitivity'] = rate(true_pos, true_pos + false_neg)
  stats['Specificity'] = rate(true_neg, true_neg + false_pos)
  # Precision and PPV are two names for the same quantity
  stats['Precision'] = rate(true_pos, true_pos + false_pos)
  stats['PPV'] = stats['Precision']
  return stats





  

In [None]:
'''
Bayes Classification - input data, split train and test sets,
  train by estimating the parameters for the two class Gaussians
  (mean, sigma, and class probability),then classify the test set.
'''

import csv

if __name__ == "__main__":
  input_file = '/content/drive/Shareddrives/CSC373_DMP_Wu_Chenyang/DMP_Classification/Data/Copy of BayesAssign1_03.csv'
  with open(input_file) as csv_input:
    reader1 = csv.reader(csv_input)
    raw_list = []
    for row in reader1:
        row[1] = float(row[1])
        raw_list.append(row)
    #print(raw_list)

    

    # Fraction of the data held out for testing; the remainder is used for training
    ratio = 0.3
    length = len(raw_list)

    index = int(ratio * length)

    test_set = raw_list[:index]
    train_set = raw_list[index:]

    col0 = []
    col1 = []
    neg_count = 0
    pos_count = 0
    # Separate the training values by class and count the members of each class
    for lists in train_set:
      values = lists[1]
      lbls = lists[0]
      if lbls == 'NEG':
        col0.append(values)
        neg_count += 1
      else:
        col1.append(values)
        pos_count += 1
    


    # Estimate the mean and the (population) standard deviation of each class
    mean0 = sum(col0) / len(col0)
    std0 = (sum([((x - mean0) ** 2) for x in col0]) / len(col0))**0.5

    mean1 = sum(col1) / len(col1)
    std1 = (sum([((x - mean1) ** 2) for x in col1]) / len(col1))**0.5


    # Estimate each class prior from its relative frequency in the training set
    prior0 = neg_count / len(train_set)
    prior1 = pos_count / len(train_set)

    # Test set prediction: append the prior-weighted likelihood of each class
    # (indices 2 and 3), then the predicted class (index 4) that perf() reads
    for lists in test_set:
      lists.append(gauss_val(lists[1], mean0, std0) * prior0)
      lists.append(gauss_val(lists[1], mean1, std1) * prior1)
      if lists[2] > lists[3]:
        lists.append("NEG")
      else:
        lists.append("POS")

    perf_stat = perf(test_set)

    # ratio is the held-out test fraction, so the training fraction is 1 - ratio
    print("Training fraction:", 1 - ratio, len(train_set), len(test_set))
    print("Estimated parameters (mean0, sig0, prob0, mean1, sig1, prob1): ",
          mean0, std0, prior0, mean1, std1, prior1)
    print("True POS:", perf_stat['true_pos'])
    print("True NEG:", perf_stat['true_neg'])
    print("False POS:", perf_stat['false_pos'])
    print("False NEG:", perf_stat['false_neg'])
    print("Prevalence:", perf_stat['Prevalence'])
    print("Accuracy:", perf_stat['Accuracy'])
    print("Sensitivity:", perf_stat['Sensitivity'])
    print("Specificity:", perf_stat['Specificity'])
    print("Precision:", perf_stat['Precision'])
    print("PPV:", perf_stat['PPV'])


      
  

 

      
      
            
      


    
  
  


Trainning fraction: 0.3 700 300
Estimated parameters (mean0, sig0, prob0, mean1, sig1, prob1):  0.0016445054945054991 1.017215617492088 0.5 3.0699084415584412 0.9878022977672176 0.5
True POS: 68
True NEG: 210
False POS: 18
False NEG: 4
Prevalence: 0.24
Accuracy: 0.9266666666666666
Sensitivity: 0.9444444444444444
Specificity: 0.9210526315789473
Precision: 0.7906976744186046
PPV: 0.7906976744186046


**Sample output from running the classification code.**

Training fraction: 0.7 10000 7000 3000

Estimated parameters (mean0, sig0, prob0, mean1, sig1, prob1): 0.0046 1.0103 0.4917 1.0268 2.0142 0.4917

True POS:  785

False POS: 217

True NEG:  1294

False NEG: 704


Prevalence:    0.496

Accuracy:      0.693

Sensitivity:   0.527

Specificity:   0.856

Precision:     0.783

PPV:           0.783
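
The rates in this sample output can be reproduced from its four counts, which is a useful sanity check on the formulas. The sketch below is a small stand-alone helper (the name `rates` is made up here) applied to the counts listed above.

```python
def rates(tp, fp, tn, fn):
    """Compute the summary rates from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "Prevalence": (tp + fn) / total,
        "Accuracy": (tp + tn) / total,
        "Sensitivity": tp / (tp + fn),
        "Specificity": tn / (tn + fp),
        "Precision": tp / (tp + fp),
    }

# Counts copied from the sample output above.
sample = rates(tp=785, fp=217, tn=1294, fn=704)
for name, value in sample.items():
    print("{:<13s}{:.3f}".format(name + ":", value))
```

Rounded to three decimals, the results agree with the sample values: 0.496, 0.693, 0.527, 0.856, and 0.783.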


In [None]:
'''
Histogram plotting - or other analysis students may like to do.

I used information from the links below to code histogram plots.

https://matplotlib.org/stable/gallery/statistics/hist.html
https://stackoverflow.com/questions/33203645/how-to-plot-a-histogram-using-matplotlib-in-python-with-a-list-of-data


'''
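
A possible starting point for this cell is sketched below. It assumes the training values have been grouped per class into a dictionary and that a (mean, sigma) pair has been estimated for each class. The names `gaussian_curve`, `plot_classes`, `values_by_class`, and `params_by_class` are hypothetical, and matplotlib is imported inside the plotting function so that the curve helper itself has no dependency.

```python
import math

def gaussian_curve(mean, sigma, num_points=200, width=4.0):
    """Return (xs, ys) sampling the Gaussian pdf over mean +/- width*sigma."""
    lo = mean - width * sigma
    step = 2.0 * width * sigma / (num_points - 1)
    xs = [lo + i * step for i in range(num_points)]
    ys = [math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
          for x in xs]
    return xs, ys

def plot_classes(values_by_class, params_by_class, bins=40):
    """Overlay a density-normalized histogram and the fitted Gaussian per class."""
    import matplotlib.pyplot as plt
    for name, values in values_by_class.items():
        mean, sigma = params_by_class[name]
        # density=True scales the histogram so it is comparable to the pdf
        plt.hist(values, bins=bins, density=True, alpha=0.4, label=name + " data")
        xs, ys = gaussian_curve(mean, sigma)
        plt.plot(xs, ys, label=name + " fitted Gaussian")
    plt.xlabel("feature value")
    plt.ylabel("density")
    plt.legend()
    plt.show()
```

With the variables from the classification cell, a call might look like `plot_classes({"NEG": col0, "POS": col1}, {"NEG": (mean0, std0), "POS": (mean1, std1)})`.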





