# Introduction to Machine Learning in Python: Assignment 1

## Tasks
1. The data set that you need is in one of the sheets of the Excel file Assignment_1_Data_and_Template.xlsx (attached). The other sheets contain shaded cells meant to be filled in by you. Examine the sheets carefully and understand what must be computed or described. Except for cutting and pasting your results in the specific cells provided, do not alter the spreadsheet in any other way. This is the only recognized means of submitting this assignment.
2. Construct separate histograms for male and female heights using 32 bins. Do not use a built-in histogram program. Remember that the histograms are the two lists of 32 counts that you enter into the correct place in the spreadsheet. They are not just pretty pictures!
3. Based on the histograms, compute the probability that an individual is female at heights of 55, 60, 65, 70, 75, and 80 inches.
4. Find the parameters of Gaussian models for the two PDFs that describe the data (means, standard deviations, sample sizes). You may use built-in functions to compute the model parameters.
5. Use Bayes' formula with the model parameters found above to recompute the probability that an individual is female at heights of 55, 60, 65, 70, 75, and 80 inches. Do not use a built-in function for computing the PDF.
6. Repeat steps 2 through 5 using just the first 50 height entries in the data file. What are your observations regarding histogram classifiers and Bayesian classifiers?

## Histogram Classifiers

### Constructing a Histogram Classifier

In [4]:
"""
Function for constructing histograms by building arrays
whose indices represent bins and whose values are counts.
Returns: [HF, HM], the female histogram followed by the male histogram

Parameters:
    X = feature array
    T = class label array ('Female' or 'Male')
    B = number of bins
    xmin = feature range minimum (maps to bin 0)
    xmax = feature range maximum (maps to bin B-1)
"""
import numpy as np

def Build1DHistogramClassifier(X,T,B,xmin,xmax):
    HF = np.zeros(B).astype('int32')
    HM = np.zeros(B).astype('int32')
    binindices = (np.round(((B-1)*(X-xmin)/(xmax-xmin)))).astype('int32')
    for i,b in enumerate(binindices):
        if T[i]=='Female':
            HF[b]+=1
        else:
            HM[b]+=1
    return [HF, HM]

In [5]:
# Example: constructing histograms from a small hand-made sample

classLabelArray = np.array(['Female','Male','Female','Female',
                            'Male','Female','Female','Male'])
featureArray = np.array([67, 50, 40, 65, 55, 40, 40, 55])

out = Build1DHistogramClassifier(featureArray, classLabelArray, 32, 
                                 40, 67)
print(out)

[array([3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=int32), array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)]
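
The bin mapping used above sends `xmin` to bin 0 and `xmax` to bin `B-1`, rounding everything in between to the nearest bin. The following minimal sketch isolates that mapping so it can be checked by hand. The helper names `bin_index` and `bin_center` are hypothetical and do not appear in the assignment code.

```python
import numpy as np

def bin_index(x, B, xmin, xmax):
    """Map feature value(s) x to bin indices 0..B-1 (hypothetical helper)."""
    return np.round((B - 1) * (np.asarray(x) - xmin) / (xmax - xmin)).astype('int32')

def bin_center(b, B, xmin, xmax):
    """Feature value that maps exactly onto bin b under the same mapping."""
    return xmin + b * (xmax - xmin) / (B - 1)

# The distinct heights from the example above, with the same range and bin count
heights = [40, 50, 55, 65, 67]
indices = bin_index(heights, B=32, xmin=40, xmax=67)
print(indices)  # [ 0 11 17 29 31]
```

These indices match the non-zero positions in the example output: bins 0, 29, and 31 for the female entries and bins 11 and 17 for the male entries.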


### Building Histogram Classifiers from the Dataset

In [269]:
# Load the data from the Excel file
import pandas as pd
file = 'Assignment_1_Data_and_Template.xlsx'
try:
    xl = pd.ExcelFile("./DataFiles/" + file)
except FileNotFoundError:
    # Fall back to the parent directory's DataFiles folder
    xl = pd.ExcelFile('../DataFiles/' + file)

df = xl.parse('Data')

genderArray = np.array(df.Gender.values.tolist())

heightArray = np.array((df.Height_Feet * 12) + df.Height_Inches)

heightArrayMax = heightArray.max()

heightArrayMin = heightArray.min()


In [270]:
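# Task 6 only: restrict the training set to the first 50 entries.
# The outputs shown in the cells below come from the full data set,
# i.e. from a run with this line disabled.
# heightArray = heightArray[0:50]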

In [271]:
# Build the histogram classifier for the data set
histograms = Build1DHistogramClassifier(X=heightArray, T=genderArray, B=32,
                                        xmin=heightArrayMin, xmax=heightArrayMax)
print("Feature Array Minimum: %s"%heightArrayMin)
print("Feature Array Maximum: %s"%heightArrayMax)
print("Histograms... \n Female: %s \n Male: %s"%(histograms[0], histograms[1]))


Feature Array Minimum: 52
Feature Array Maximum: 83
Histograms... 
 Female: [   3    5   12   24   44  101  163  260  404  549  693  869 1076 1013  951
  823  695  494  299  217  110   58   20   12    5    0    0    0    0    0
    0    0] 
 Male: [  0   0   0   0   0   0   0   0   1  10  14  53 117 241 369 500 700 787
 849 882 873 779 610 432 274 155  83  38  24   5   3   1]
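
Because every sample lands in exactly one bin, the counts in each histogram should add up to the size of the corresponding class. The sketch below is a quick sanity check that copies the counts from the output above. The variable names are illustrative only.

```python
import numpy as np

# Counts copied from the printed histograms above
female_counts = np.array([3, 5, 12, 24, 44, 101, 163, 260, 404, 549, 693, 869,
                          1076, 1013, 951, 823, 695, 494, 299, 217, 110, 58,
                          20, 12, 5, 0, 0, 0, 0, 0, 0, 0])
male_counts = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 10, 14, 53, 117, 241, 369,
                        500, 700, 787, 849, 882, 873, 779, 610, 432, 274, 155,
                        83, 38, 24, 5, 3, 1])

# Each histogram must have one count per bin
assert len(female_counts) == 32 and len(male_counts) == 32

print(female_counts.sum(), male_counts.sum())  # 8900 7800
```

The totals agree with the training set sizes reported in the Bayesian classifier section (8900 females, 7800 males).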


### Histogram Classifier Query

In [272]:
# Query Inputs
queriesArrayRaw = [55, 60, 65, 70, 75, 80]

In [273]:
# Calculate the bin index from the feature value and the feature range.
# Note: `bins` is the number of bins minus one (B - 1), matching Build1DHistogramClassifier.
def computeQueryBinIndex(bins, query, rangeMin, rangeMax):
    return np.round(bins * (query - rangeMin) / (rangeMax - rangeMin)).astype('int32')


In [274]:
# Convert each query height to its bin index, using the same range as the histograms
queriesArrayAdjusted = []

for i in queriesArrayRaw:
    queriesArrayAdjusted.append(computeQueryBinIndex(32 - 1, i, heightArrayMin,
                                                     heightArrayMax))

print ("Adjusted queries array: %s"%queriesArrayAdjusted)

Adjusted queries array: [3, 8, 13, 18, 23, 28]


In [275]:
# Query the Histogram Classifiers
queryResultsMale = []
queryResultsFemale = []

for i in queriesArrayAdjusted :
    queryResultsFemale.append(histograms[0][i])
    queryResultsMale.append(histograms[1][i])
print ("Query Results Female: %s" % queryResultsFemale)
print ("Query Results Male: %s" % queryResultsMale)

Query Results Female: [24, 404, 1013, 299, 12, 0]
Query Results Male: [0, 1, 241, 849, 432, 24]


In [276]:
probabilityFemale = []

# Calculate probabilities
for i, j in zip(queryResultsFemale, queryResultsMale) :
    probabilityFemale.append(i / (i + j))
print("Probability Female... \n Given Queries %s\n Probabilities: %s"
      %(queriesArrayRaw, probabilityFemale))

Probability Female... 
 Given Queries [55, 60, 65, 70, 75, 80]
 Probabilities: [1.0, 0.9975308641975309, 0.80781499202551832, 0.26045296167247389, 0.027027027027027029, 0.0]
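
The division `i / (i + j)` is undefined when a bin holds no samples of either class. One possible way to make that case explicit is sketched below: it returns NaN for empty bins instead of relying on a division warning. The function name `histogram_posterior_female` is hypothetical, and the counts are the query results printed above.

```python
import numpy as np

def histogram_posterior_female(female_counts, male_counts):
    """P(female | bin) from per-bin counts; NaN where the bin is empty (hypothetical helper)."""
    female = np.asarray(female_counts, dtype=float)
    male = np.asarray(male_counts, dtype=float)
    total = female + male
    # Start from NaN and only divide where the bin actually contains samples
    posterior = np.full(total.shape, np.nan)
    np.divide(female, total, out=posterior, where=total > 0)
    return posterior

# Query results from the cells above
p = histogram_posterior_female([24, 404, 1013, 299, 12, 0],
                               [0, 1, 241, 849, 432, 24])
print(np.round(p, 4))
```

With the full data set every queried bin is populated, so the result reproduces the probabilities above. An empty bin, such as `histogram_posterior_female([0], [0])`, yields NaN.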


## Bayesian Classifier

### Calculate the Mean, Standard Deviation, and Training Set Size for Male and Female Heights

In [279]:
# Separate male and female height arrays
maleHeightArray = []
femaleHeightArray = []

for i, j in zip(heightArray, genderArray):
    if j == 'Male' :
        maleHeightArray.append(i)
    else :
        femaleHeightArray.append(i)
        
meanMaleHeight = np.average(maleHeightArray)
meanFemaleHeight = np.average(femaleHeightArray)
print("Male height average: %f"%meanMaleHeight)
print("Female height average: %f"%meanFemaleHeight)


Male height average: 70.768077
Female height average: 64.725730


In [280]:
# Find the number of Males and Females in the training set
maleTrainingSetSize = len(maleHeightArray)
femaleTrainingSetSize = len(femaleHeightArray)
print("Number of Males: %i" %maleTrainingSetSize)
print("Number of Females: %i" %femaleTrainingSetSize)

Number of Males: 7800
Number of Females: 8900


In [281]:
# Find the standard deviation in the Male and Female training sets
maleHeightStandardDeviation = np.std(maleHeightArray)
femaleHeightStandardDeviation = np.std(femaleHeightArray)
print("Standard deviation of height in MALE training set:\n %f" %maleHeightStandardDeviation)
print("Standard deviation of height in FEMALE training set:\n %f" 
      %femaleHeightStandardDeviation)


Standard deviation of height in MALE training set:
 3.309455
Standard deviation of height in FEMALE training set:
 3.478239
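
Note that `np.std` computes the population standard deviation by default (`ddof=0`, dividing by N). The sample standard deviation divides by N - 1 and is obtained with `ddof=1`. The sketch below shows the difference on a made-up array of five heights. These values are illustrative and are not taken from the assignment data.

```python
import numpy as np

# Made-up heights, for illustration only
sample = np.array([60.0, 62.0, 64.0, 66.0, 68.0])

population_std = np.std(sample)      # ddof=0: divide by N
sample_std = np.std(sample, ddof=1)  # ddof=1: divide by N - 1

print(population_std, sample_std)
```

With thousands of entries per class the two estimates are nearly identical. With only 50 entries they differ by roughly 1 percent, so it is worth stating which one is reported.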


### Calculate the Probabilities at Queries Using the Bayesian Classifier

In [282]:
# Gaussian probability density at `query`, computed without a built-in pdf function
def gaussianDistributionProbabilityDensity (query, mean, stddev):
    return (1/(np.sqrt(np.pi * 2) * stddev)) * np.exp((-1/2) * np.square((query - mean)/stddev))
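
For reference, the density computed by this function, and the posterior probability that the next cell builds from it, can be written as follows. The symbols are chosen here for illustration: $N_F$ and $N_M$ are the female and male training set sizes, and $\mu$ and $\sigma$ are the per-class means and standard deviations.

```latex
p(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\!\left(-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}\right)

P(\text{female} \mid x) =
    \frac{N_F \, p(x \mid \mu_F, \sigma_F)}
         {N_F \, p(x \mid \mu_F, \sigma_F) + N_M \, p(x \mid \mu_M, \sigma_M)}
```

The sample sizes act as unnormalized class priors. Their common normalizer $N_F + N_M$ cancels between the numerator and the denominator.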

In [283]:
# Calculate the probability of being female at each query height
probabilitiesFemale = []
for i in queriesArrayRaw:
    # Class-weighted likelihoods: sample size times Gaussian density
    femaleTerm = femaleTrainingSetSize * gaussianDistributionProbabilityDensity(
        i, meanFemaleHeight, femaleHeightStandardDeviation)
    maleTerm = maleTrainingSetSize * gaussianDistributionProbabilityDensity(
        i, meanMaleHeight, maleHeightStandardDeviation)
    probabilitiesFemale.append(femaleTerm / (femaleTerm + maleTerm))
print("Probability given height that individual is female: %s"%probabilitiesFemale)

Probability given height that individual is female: [0.99946000008647262, 0.98848581542193614, 0.83173037860746613, 0.26104194258468161, 0.030386242179453057, 0.0034390925216817927]


## Comparing Bayesian Classifier and Histogram Classifier
The comparison repeats the steps above with the following setup:
- Use only the first 50 entries in the training set.
- Reuse the original maximum and minimum values.
- Use the same number of bins.

### Histogram Classifier Results

P(female|heights) = [nan, 1.0, 1.0, 0.25, nan, nan]

Here nan marks a 0/0 result: the query's bin contains no male and no female samples.

### Bayesian Classifier Results

P(female|heights) = [0.99994194980202489, 0.99655330803656506, 0.89179013301596843, 0.28515273971975497, 0.031746270866381675, 0.004555443018553729]

Observations: Histogram classifiers become unreliable when the training set is small. With only 50 entries spread over 32 bins, many bins hold few or no samples, so the estimates are coarse (1.0, 1.0, 0.25), and the bins for 55, 75, and 80 inches contain no samples at all, which leaves the probability undefined (0/0). Bayesian classifiers remain usable with smaller sample sizes because they predict probabilities from a smooth mathematical model as opposed to directly from data that may be incomplete. Here the 50-entry Bayesian results stay within roughly 0.06 of the full-data results at every query height.
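
The contrast can be reproduced on a small scale. The sketch below uses a hypothetical toy sample of 14 heights (not the assignment data) with the same range (52 to 83) and bin count (32) as above, and evaluates both classifiers at three query heights.

```python
import numpy as np

# Hypothetical toy sample, for illustration only (not the assignment data)
female = np.array([60, 62, 63, 64, 65, 66, 67], dtype=float)
male = np.array([66, 68, 69, 70, 71, 72, 74], dtype=float)
B, xmin, xmax = 32, 52.0, 83.0

def histogram(x):
    """Count samples per bin with the same rounding rule as the notebook."""
    counts = np.zeros(B, dtype=int)
    for b in np.round((B - 1) * (x - xmin) / (xmax - xmin)).astype(int):
        counts[b] += 1
    return counts

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (np.sqrt(2 * np.pi) * std)

HF, HM = histogram(female), histogram(male)
results = {}
for q in (55, 66, 80):
    b = int(round((B - 1) * (q - xmin) / (xmax - xmin)))
    total = HF[b] + HM[b]
    # Histogram estimate: undefined (NaN) when the bin is empty
    p_hist = HF[b] / total if total > 0 else float('nan')
    # Gaussian (Bayesian) estimate: defined at every height
    f = len(female) * gaussian_pdf(q, female.mean(), female.std())
    m = len(male) * gaussian_pdf(q, male.mean(), male.std())
    results[q] = (p_hist, f / (f + m))
    print(q, results[q])
```

At 55 and 80 inches the histogram estimate is NaN because no toy sample falls in those bins, while the Gaussian model still returns a probability at every query height.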