# Intro to Machine Learning: Assignment 2

## Tasks
1. The data set that you need is in one of the sheets of the Excel file Assignment_2_Data_and_Template.xlsx (attached). The other sheets contain shaded cells meant to be filled in by you. Examine the sheets carefully and understand what must be computed or described. Except for cutting and pasting your results in the specific cells provided, do not alter the spreadsheet in any other way. This is the only recognized means of submitting this assignment.
2. Construct separate 2D histograms of height and handspan for males and for females. You decide on the number of bins to use, making sure there is sufficient resolution and bin-filling. Represent height in rows and handspan in columns, both in ascending order of feature magnitude. Do not use a built-in histogram program.
3. Find the parameters of two 2D Gaussian models for the 2 PDFs to describe the data. Let the first dimension represent height, and the second dimension represent handspan. You may use built-in functions to compute these parameters, but do not use a built-in function to compute the pdf. 
4. Based on the histograms and Gaussian models, compute the likely gender (given as the probability of being female) of individuals with measurements as given below (Height in inches, handspan in centimeters). What are your observations?
5. Extra credit: Reconstruct a histogram using female model parameters that can be compared to the female histogram constructed in Part 2. Similarly, reconstruct a histogram using male model parameters.

## Import Excel File

In [78]:
# Get data from Excel file
import pandas as pd
import numpy as np
file = 'Assignment_2_Data_and_Template.xlsx'
xl = pd.ExcelFile("../DataFiles/" + file)
df = xl.parse('Data')

sexArray = np.array(df.Sex)
heightArray = np.array(df.Height)
handspanArray = np.array(df.HandSpan)

heightArrayMax = heightArray.max()
heightArrayMin = heightArray.min()
handspanArrayMax = handspanArray.max()
handspanArrayMin = handspanArray.min()

print ("Min height: %s" %heightArrayMin)
print ("Max height: %s" %heightArrayMax)
print ("Min handspan: %s" %handspanArrayMin)
print ("Max handspan: %s" %handspanArrayMax)

print (heightArray[0:3])
print (handspanArray[0:3])
print (sexArray[0:3])

Min height: 57.0
Max height: 78.0
Min handspan: 16.0
Max handspan: 25.5
[ 68.  71.  73.]
[ 21.5  23.5  22.5]
['Female' 'Male' 'Male']


## Constructing 2D Histogram

In [79]:
# Rows = height, Cols = handspan, Z = male/female

# Choose the number of bins from the smaller class size with a
# Sturges-style log2 estimate (truncated, without the usual +1)
numberOfMaleSamples = np.count_nonzero(sexArray == 'Male')
numberOfFemaleSamples = np.count_nonzero(sexArray == 'Female')
print("Number of Male Samples: %s" %numberOfMaleSamples)
print("Number of Female Samples: %s" %numberOfFemaleSamples)

numberOfBins = np.log2(min(numberOfMaleSamples,
                           numberOfFemaleSamples)).astype('int32')
print("Number of Bins: %s" %numberOfBins)

Number of Male Samples: 78
Number of Female Samples: 89
Number of Bins: 6
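
For comparison, Sturges' rule in its usual textbook form is k = ceil(log2(n)) + 1, which gives a somewhat larger bin count than the truncated log2 used in the cell above. The sketch below only illustrates the difference; the function name `sturges_bins` is hypothetical and not part of the assignment code.

```python
import math

def sturges_bins(n_samples):
    """Sturges' rule in its usual form: k = ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n_samples)) + 1

# Sample counts as printed above
for label, n in [("male", 78), ("female", 89)]:
    truncated = int(math.log2(n))  # what the notebook cell computes
    print(label, "truncated log2:", truncated, "Sturges:", sturges_bins(n))
```

With only 78 and 89 samples spread over a 2D grid, the smaller count keeps more bins populated, which is one reason to prefer it here.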


In [80]:
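def CalculateBin (FeatureValue, FeatureMax, FeatureMin, NumberOfBins):
    """
    Map a feature value to a bin index in [0, NumberOfBins - 1].

    Parameters:
        FeatureValue: value (or array of values) to bin
        FeatureMax, FeatureMin: largest and smallest value of the feature
        NumberOfBins: number of bins along this feature
    """
    binindex = (np.round(((NumberOfBins-1)*(FeatureValue-FeatureMin)/
                          (FeatureMax-FeatureMin)))).astype('int32')
    return binindex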

def Build2DHistogramClassifier(Feat1, Feat2, ClassArray, NumBins1, 
                               NumBins2, Feat1Max, Feat1Min, 
                               Feat2Max, Feat2Min):
    HM = np.zeros(shape=(NumBins1, NumBins2)).astype('int32')
    HF = np.zeros(shape=(NumBins1, NumBins2)).astype('int32')
    for i, j, k in zip (Feat1, Feat2, ClassArray) :
        row = CalculateBin(i, Feat1Max, Feat1Min, NumBins1)
        col = CalculateBin(j, Feat2Max, Feat2Min, NumBins2)
        if(k == 'Female'):
            HF[row][col] += 1
        else:
            HM[row][col] += 1
            
    return [HF, HM]
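
Because `CalculateBin` rounds to the nearest index, the bin centers sit at `FeatureMin + k * (FeatureMax - FeatureMin) / (NumberOfBins - 1)`, and the first and last bins are centered exactly on the minimum and maximum. The following sketch, with hypothetical helper names `bin_index` and `bin_centers`, shows this for the height range printed earlier (57 to 78 inches, 6 bins). These centers can be handy as row and column labels in the spreadsheet.

```python
import numpy as np

def bin_index(value, lo, hi, n_bins):
    """Same rounding rule as CalculateBin, for a single scalar value."""
    return int(np.round((n_bins - 1) * (value - lo) / (hi - lo)))

def bin_centers(lo, hi, n_bins):
    """Centers of the bins implied by the rounding rule."""
    return lo + (hi - lo) / (n_bins - 1) * np.arange(n_bins)

height_centers = bin_centers(57.0, 78.0, 6)
print(height_centers)                 # centers spaced 4.2 inches apart
print(bin_index(69, 57.0, 78.0, 6))   # row index for a height of 69 inches
```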
    
    

In [84]:
histograms = Build2DHistogramClassifier(heightArray, handspanArray, sexArray,
                          numberOfBins, numberOfBins, 
                            heightArrayMax, heightArrayMin,
                          handspanArrayMax, handspanArrayMin)
histF = histograms[0]
histM = histograms[1]


In [83]:
# Write both histograms to separate sheets of one workbook
malehistogramDf = pd.DataFrame(data=histM)
femalehistogramDf = pd.DataFrame(data=histF)

# The context manager saves and closes the file on exit
# (ExcelWriter.save() is no longer available in recent pandas)
with pd.ExcelWriter("../DataFiles/HistogramData.xlsx") as writer:
    malehistogramDf.to_excel(writer, sheet_name='maleHistogram')
    femalehistogramDf.to_excel(writer, sheet_name='femaleHistogram')

## Query Histograms

In [282]:
queryArray = [(69, 17.5),
             (66, 22),
             (70, 21.5),
             (69, 23.5)]
# queryArray = [(78, 16)]

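probabilitiesList = []
for height, handspan in queryArray:
    row = CalculateBin(height, heightArrayMax, heightArrayMin, numberOfBins)
    col = CalculateBin(handspan, handspanArrayMax, handspanArrayMin,
                       numberOfBins)

    femaleNumber = histF[row, col]
    maleNumber = histM[row, col]
    total = femaleNumber + maleNumber

    # P(female | bin) = female count / total count; an empty bin would
    # give 0/0, so report nan explicitly instead of dividing
    probabilitiesList.append(femaleNumber / total if total > 0 else np.nan)
print (probabilitiesList)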


[nan]
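
A `nan` result like the one above arises when a query falls in a bin that holds no samples of either sex, so the ratio becomes 0/0. The sketch below computes the per-bin probability of being female for a whole grid at once and leaves empty bins as `nan`. The count arrays are made-up toy values, not the assignment data.

```python
import numpy as np

# Made-up 2x2 count grids for illustration only
female_counts = np.array([[4, 1],
                          [0, 0]])
male_counts = np.array([[1, 3],
                        [0, 5]])

total = female_counts + male_counts

# Start from nan and divide only where the bin is populated
p_female = np.full(total.shape, np.nan)
np.divide(female_counts, total, out=p_female, where=total > 0)
print(p_female)
```

An empty bin carries no evidence either way, so reporting it as undefined is arguably more honest than forcing it to 0 or 0.5. The Gaussian model is one way to get an answer for such queries.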




## Bayesian

In [266]:
maleList = []
femaleList = []
maleHeights = []
femaleHeights = []
maleHands = []
femaleHands = []
for i, j, k in zip(heightArray, handspanArray, sexArray) :
    if (k == 'Female'):
        femaleList.append([i, j, k])
        femaleHeights.append(i)
        femaleHands.append(j)
    else:
        maleList.append([i,j,k])
        maleHeights.append(i)
        maleHands.append(j)
print (len(maleList))
print(len(femaleList))
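maleMeanVector = np.matrix([[np.mean(maleHeights) , np.mean(maleHands)]])
femaleMeanVector = np.matrix([[np.mean(femaleHeights), np.mean(femaleHands)]])

# The step that defines the covariance matrices is not shown in this
# cell; np.cov (sample covariance, N-1 normalization) is assumed here
maleCovMatrix = np.cov(maleHeights, maleHands)
femaleCovMatrix = np.cov(femaleHeights, femaleHands)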

print ("Male mean vector",maleMeanVector)
print ("Female mean vector",femaleMeanVector)

print("Male cov matrix:\n %s"%maleCovMatrix)
print("Female cov matrix:\n %s"%femaleCovMatrix)

78
89
Male mean vector [[ 71.28846154  22.30128205]]
Female mean vector [[ 65.25280899  19.6011236 ]]
Male cov matrix:
 [[ 7.08778721  1.80157343]
 [ 1.80157343  2.06064769]]
Female cov matrix:
 [[ 7.75780452  1.65170135]
 [ 1.65170135  1.75670327]]
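
To make the model parameters concrete, the sketch below computes a mean vector and a covariance matrix by hand and checks the result against `np.cov`. The four samples are made-up values, and the N-1 (sample) normalization is an assumption; it matches the default of `np.cov`.

```python
import numpy as np

# Made-up (height, handspan) samples for illustration only
samples = np.array([[60.0, 18.0],
                    [64.0, 19.0],
                    [66.0, 20.5],
                    [70.0, 21.0]])

mean = samples.mean(axis=0)      # one mean per feature (column)
centered = samples - mean        # subtract the mean from every row

# Sample covariance: sum of outer products divided by N - 1
cov_manual = centered.T @ centered / (len(samples) - 1)

# np.cov treats rows as variables unless rowvar=False is passed
cov_numpy = np.cov(samples, rowvar=False)

print(mean)
print(cov_manual)
print(np.allclose(cov_manual, cov_numpy))
```

The diagonal holds the variances of height and handspan; the off-diagonal entry is their covariance, positive here because taller people in the toy data also have larger handspans.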


In [272]:
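def calculateBayesianProbability(Cov1, Cov2, Query, Mean1, Mean2,
                                 Num1, Num2):
    """Posterior probability of class 1, with sample counts as priors."""
    # Class-conditional densities weighted by class size
    x = Num1 * calculatePDF(Cov1, Query, Mean1)
    y = Num2 * calculatePDF(Cov2, Query, Mean2)

    probability = x / (x + y)
    return probability

def calculatePDF(Cov, Query, Mean) :
    """2D Gaussian density at Query (a 1x2 np.matrix), as a 1x1 matrix."""
    diff = np.subtract(Query, Mean)
    normalization = 1 / (2 * np.pi * np.sqrt(np.linalg.det(Cov)))
    # 0.5 rather than 1/2 so the exponent is also correct under
    # Python 2 integer division
    exponent = -0.5 * diff * np.linalg.inv(Cov) * diff.transpose()
    return normalization * np.exp(exponent)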
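
For reference, the two functions above can be written as formulas. The notation is chosen here for illustration: $\mathbf{x}$ is the 1x2 query row vector (height, handspan), $\boldsymbol{\mu}_c$ and $\Sigma_c$ are the mean vector and covariance matrix of class $c$, and $N_F$, $N_M$ are the female and male sample counts, which act as priors.

```latex
% Class-conditional 2D Gaussian density (calculatePDF)
p(\mathbf{x} \mid c) =
  \frac{1}{2\pi \sqrt{\lvert \Sigma_c \rvert}}
  \exp\!\left( -\tfrac{1}{2}
    (\mathbf{x} - \boldsymbol{\mu}_c)\, \Sigma_c^{-1}\,
    (\mathbf{x} - \boldsymbol{\mu}_c)^{\mathsf{T}} \right)

% Posterior probability of being female (calculateBayesianProbability)
P(F \mid \mathbf{x}) =
  \frac{N_F \, p(\mathbf{x} \mid F)}
       {N_F \, p(\mathbf{x} \mid F) + N_M \, p(\mathbf{x} \mid M)}
```

The factor $2\pi$ is $(2\pi)^{d/2}$ with $d = 2$ features. Dividing both counts by $N_F + N_M$ turns them into prior probabilities without changing the ratio.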


In [279]:
query = np.matrix([[69, 23.5]])

f = calculateBayesianProbability(femaleCovMatrix, maleCovMatrix, query,
                                 femaleMeanVector, maleMeanVector,
                                 numberOfFemaleSamples, numberOfMaleSamples)
# m= calculateBayesianProbability( maleCovMatrix, femaleCovMatrix, [71, 22],
#                             maleMeanVector,femaleMeanVector,
#                             numberOfMaleSamples, numberOfFemaleSamples )
# print(f)
print(f)
print("%s"%f)


[[ 0.0564518]]
[[ 0.0564518]]
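
Task 5 (extra credit) asks for histograms reconstructed from the model parameters. One possible approach, sketched below, is to evaluate the Gaussian density at each bin center and multiply by the sample count and the bin area, which approximates the expected count per bin. The means, covariances, sample counts, and feature ranges are copied from the printed outputs above. The helper names are hypothetical, and treating the edge bins as full-width is a simplifying assumption.

```python
import numpy as np

def gaussian_pdf_2d(x, mean, cov):
    """2D Gaussian density at point x, written out without a built-in pdf."""
    d = np.asarray(x, dtype=float) - mean
    norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)

def reconstruct_histogram(mean, cov, n_samples, mins, maxs, n_bins):
    """Expected count per bin: N * pdf(bin center) * bin area."""
    widths = (np.asarray(maxs) - np.asarray(mins)) / (n_bins - 1)
    centers_h = mins[0] + widths[0] * np.arange(n_bins)  # rows: height
    centers_s = mins[1] + widths[1] * np.arange(n_bins)  # cols: handspan
    expected = np.zeros((n_bins, n_bins))
    for r, h in enumerate(centers_h):
        for c, s in enumerate(centers_s):
            density = gaussian_pdf_2d([h, s], mean, cov)
            expected[r, c] = n_samples * density * widths[0] * widths[1]
    return expected

# Parameters as printed in the notebook output
male_mean = np.array([71.28846154, 22.30128205])
male_cov = np.array([[7.08778721, 1.80157343],
                     [1.80157343, 2.06064769]])
female_mean = np.array([65.25280899, 19.6011236])
female_cov = np.array([[7.75780452, 1.65170135],
                       [1.65170135, 1.75670327]])

mins, maxs = (57.0, 16.0), (78.0, 25.5)
male_expected = reconstruct_histogram(male_mean, male_cov, 78, mins, maxs, 6)
female_expected = reconstruct_histogram(female_mean, female_cov, 89,
                                        mins, maxs, 6)
print(np.round(male_expected, 1))
print(np.round(female_expected, 1))
```

The reconstructed grids hold fractional expected counts, so they can be compared bin by bin with `histM` and `histF`. Large differences would suggest that a single Gaussian is a poor fit for that class.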
