# Assignment 2: Income Prediction based on U.S. Census data

Student Name:

Student ID:

# General Info

<b>Due date</b>: Friday, 16 Apr 2021 5pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -10% per day (both week and weekend days counted)

<b>Marks</b>: 15% of mark for class 

<b>Materials</b>: See [Using Jupyter Notebook and Python page](https://canvas.lms.unimelb.edu.au/courses/105477/pages/python-and-jupyter-notebooks?module_item_id=2613813) on Canvas (under Modules>Resources) for information on the basic setup required for this class, including an iPython notebook viewer and some handy python packages including Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. You can use any Python built-in packages, but do not use any other 3rd party packages (the packages listed above are all fine to use); if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parenthesis after the instructions. 

You should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it. While the main focus is on correctness of your methods, you will lose marks if your code is not understandable.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board; we recommend you check it regularly.

<b>Academic misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this ultimately is still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.


## Overview

In this assignment, you will be working with the famous *Adult* data set, which contains demographic and income data from the United States in 1994. The data set provided for this assignment is derived from <a href="http://archive.ics.uci.edu/ml/datasets/Adult">this</a> resource. It consists of about 48,000 individuals, each characterized by a set of 14 attributes. Your task is to predict whether an individual earns up to \\$50,000 a year (<=50K) or more than \\$50,000 per year (>50K).

The attributes are

|ID|Feature Name| Feature Type | Feature Values|
| :-| :-| :-| :-|
|0|age| continuous| |
|1|workclass| categorical | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, ?|
|2|fnlwgt| continuous| |
|3|education|  categorical |Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool|
|4|education-num| continuous| |
|5|marital-status|  categorical |Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse|
|6|occupation|  categorical |Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces, ?|
|7|relationship|  categorical |Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried|
|8|race|  categorical |White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black|
|9|sex|  categorical |Female, Male|
|10|capital-gain| continuous| |
|11|capital-loss| continuous| |
|12|hours-per-week| continuous| |
|13|native-country|  categorical |United-States, Cambodia, England, Puerto-Rico, Canada, Germany, India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, ?|

You can find out more about the individual attributes / values, and the origin of the data set in the <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names"> data set description</a>  


You will build a number of classifiers to predict the income class based on the attributes above.


#### The following instructions hold for every question in the assignment
- leave the order of instances intact, i.e., do not shuffle the data
- do not change the names or types of variables provided by us in the code cells below
- '?' denotes an unknown value, and you should treat it as just another value for its feature in all tasks below.

## Question 1: Loading and pre-processing the data (1 mark)

You were provided with two data files:

**adult.train** contains about 32,000 training instances, one instance per line in comma-separated value (csv) format. Each line contains 15 fields. The first 14 fields correspond to the features listed above; the final field denotes the class label.

**adult.test** is formatted exactly like adult.train, and contains about 16,000 further instances for evaluation.


### 1a Read the data [0.5 marks]

First, you will read in the data and create training features, training labels, test features and test labels. Do not apply any data transformations.

In [659]:
# load data
import copy
from collections import Counter

import numpy as np
x_train = []
y_train = []
x_test = []
y_test = []
with open("adult.train", mode='r') as tr:
    for line in tr:
        atts = line.strip().split(",")
        x_train.append(atts[:-1]) #all atts, excluding the class
        y_train.append(atts[-1])
with open("adult.test", mode='r') as te:
    for line in te:
        atts = line.strip().split(",")
        x_test.append(atts[:-1]) #all atts, excluding the class(label)
        y_test.append(atts[-1])
len(x_train), len(y_train), len(x_test), len(y_test), len(x_train[0]), len(x_test[0])

(32525, 32525, 16262, 16262, 14, 14)

<b>For your testing:</b>

In [660]:
assert len(x_train)==len(y_train)==32525
assert len(x_train[0])==len(x_test[0])==14

### 1b: Attribute Types [0.5 marks]

You will create three feature representations, based on the different attribute types (categorical, numeric) in the original *Adult* data.

**Your tasks**

Denote with $I$ the number of training instances; $N$ the number of numeric features in the data set; and $v_f$ the number of possible values for feature $f$.

1. Create a train data set with only numeric features `x_train_num` (size $(I\times N)$); and equivalently a test data set `x_test_num`
2. Create a train data set with only categorical features *in a 1-hot representation* `x_train_1hot` (size $(I\times\sum_f v_f)$); and equivalently a test data set `x_test_1hot`
3. Create a train data set with both numeric and 1-hot categorical features `x_train_full` (size $(I \times (N+\sum_f v_f))$) where the first $N$ columns represent the numerical features and the remaining columns the categorical features; and equivalently a test data set `x_test_full`

**Note:** You may use classes and functions from ```scikit-learn```.
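As a rough illustration of the 1-hot representation described above, the sketch below encodes two toy categorical columns with `OneHotEncoder`. The toy rows and the names `toy_train`, `toy_test`, `encoder`, `train_1hot` and `test_1hot` are made up for this example. It assumes the encoder is fitted on the training rows only and then reused for the test rows, with `handle_unknown="ignore"` so that a value unseen in training becomes an all-zero block instead of raising an error.

```python
from sklearn.preprocessing import OneHotEncoder

# Toy categorical data (hypothetical rows, not taken from adult.train)
toy_train = [["Private", "Male"], ["State-gov", "Female"], ["?", "Male"]]
toy_test = [["Private", "Female"], ["Local-gov", "Male"]]  # "Local-gov" is unseen in training

# Fit on the training rows only, then apply the same column layout to the test rows
encoder = OneHotEncoder(handle_unknown="ignore")
train_1hot = encoder.fit_transform(toy_train).toarray()
test_1hot = encoder.transform(toy_test).toarray()

# One column per (feature, value) pair seen in training; values are sorted per feature
print([list(values) for values in encoder.categories_])
print(train_1hot.shape, test_1hot.shape)
print(test_1hot)
```

Reusing one fitted encoder keeps the column order identical between train and test, which is what makes column indices comparable later on.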



In [661]:
from sklearn.preprocessing import OneHotEncoder
# check whether a string represents a number (int or float)
def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        pass
    try:
        import unicodedata
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass
    return False
# convert a list of string rows to a list of float rows
def transfer_to_floatversion(numset):
    returnset=[]
    for i in range(len(numset)):
        returnset.append([])
        for j in range(len(numset[i])):
            returnset[i].append(float(numset[i][j]))
    return returnset
#keep only numbers
def transfer_to_numeric(dataset):
    returnset =copy.deepcopy(dataset)
    for i in range(len(dataset)):
        k=0
        for j in range(len(dataset[i])):
            if is_number(dataset[i][j]):
                continue
            else:
                del returnset[i][j-k]
                k=k+1
    return returnset
# keep only the categorical (string) columns and one-hot encode them
# note: the encoder is fitted on whichever dataset is passed in, so train and
# test columns only line up if both contain the same values for every feature
def transfer_to_catonehot(dataset):
    returnset =copy.deepcopy(dataset)
    for i in range(len(dataset)):
        k=0
        for j in range(len(dataset[i])):
            if is_number(dataset[i][j]):
                del returnset[i][j-k]
                k=k+1
            else:
                continue
    ohe = OneHotEncoder()
    ohe.fit(returnset)
    X_trans = ohe.transform(returnset).toarray()
    return X_trans
#keep only str
def transfer_to_cat(dataset):
    returnset =copy.deepcopy(dataset)
    for i in range(len(dataset)):
        k=0
        for j in range(len(dataset[i])):
            if is_number(dataset[i][j]):
                del returnset[i][j-k]
                k=k+1
            else:
                continue
    return returnset
#get cat name or count
def get_cat_name(dataset):
    testtable=np.array(dataset)
    templist=[]
    featurelist=[]
    for i in range(len(dataset[0])):
        templist.append(sorted(set(testtable[:,i])))
    for i in range(len(templist)):
        for j in range(len(templist[i])):
            featurelist.append(templist[i][j])
    return featurelist
# append the one-hot (categorical) columns to the numeric columns, row by row
def combine(numset,strset):
    returnset =copy.deepcopy(numset)
    for i in range(len(numset)):
        for j in range(len(strset[0])):
            returnset[i].append(strset[i][j])
    return returnset

In [662]:
x_train_num=transfer_to_numeric(x_train)
x_test_num=transfer_to_numeric(x_test)
x_train_1hot=transfer_to_catonehot(x_train)
x_test_1hot=transfer_to_catonehot(x_test)
x_train_full = combine(x_train_num,x_train_1hot)
x_test_full = combine(x_test_num,x_test_1hot)
x_train_cat=transfer_to_cat(x_train)

<b> For your testing</b>

In [663]:
assert len(x_train_1hot[0])==98
assert len(x_train_full[0])==104

## Question 2: A 0-R baseline [0.5 marks]

Implement a zero-r baseline, as introduced in the Evaluation lecture.

In [664]:
from collections import Counter
# Zero-R: always predict the majority class of the training labels
majority_class = Counter(y_train).most_common(1)[0][0]
zero_r_prediction = [majority_class] * len(y_test)
result = Counter(y_test)
print("The most frequent occurrence in test dataset is  %s"%(result.most_common(1)))
print(f'The prediction is :{majority_class}')

The most frequent occurrence in test dataset is  [(' <=50K', 12419)]
The prediction is : <=50K


## Question 3: Feature selection [2 marks]

In this question you will implement pointwise mutual information (PMI) for feature selection (Question 3a).

In Question 3b, you will use the implemented function to create a 1-R classifier based on the single 1-hot attribute (i.e., categorical feature value) with the highest PMI for class ">50K":
```
argmax_a pmi(a=1,c=">50K")
```
You will then apply your 1-R classifier to the test instances, and store your predicted labels in `one_r_predictions`.

<b> You should implement PMI from scratch yourself. You may use native Python libraries like math or numpy to help you, but you may not use existing implementations of PMI.</b>



### 3a. Implement PMI [1 mark]

Implement a function to compute PMI.
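For reference, the standard definition of PMI between a feature value $a$ and a class $c$ is given below, with all probabilities estimated as relative frequencies over the instances. The base-2 logarithm is an assumption chosen to match the expected values in the test cell.

$$
\mathrm{PMI}(a, c) = \log_2 \frac{p(a, c)}{p(a)\,p(c)}
$$

For the toy data in the test cell (features `[[1,1],[1,0],[0,1],[0,0]]`, labels `[1,1,0,0]`), the first feature gives $p(a{=}1, c{=}1) = 2/4$, $p(a{=}1) = 1/2$ and $p(c{=}1) = 1/2$, so $\mathrm{PMI} = \log_2(0.5 / 0.25) = 1$. The second feature gives $p(a{=}1, c{=}1) = 1/4$, so $\mathrm{PMI} = \log_2(0.25 / 0.25) = 0$.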


<b>For your testing:</b>

In [665]:
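def pmi(features,labels,tgt_value,tgt_class):
    # PMI(f=tgt_value, c=tgt_class) = log2( P(f,c) / (P(f)*P(c)) ), one value per feature column
    features=np.array(features)
    labelresult=Counter(labels)
    pmis={}
    for i in range(len(features[0])):
        featureresult=Counter(features[:,i])
        # count instances where the feature takes tgt_value and the label is tgt_class
        count=0
        for j in range(len(features)):
            if features[j][i]==tgt_value and labels[j]==tgt_class:
                count=count+1
        if count==0:
            # no co-occurrence: log2(0) is undefined, so the PMI is set to 0 here
            pmis[i]=0
        else:
            Punity=count/len(features)
            Pf=featureresult[tgt_value]/len(features)
            Pl=labelresult[tgt_class]/len(labels)
            pmis[i]=np.log2(Punity/(Pf*Pl))
    return pmis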

In [666]:
test_features = [[1,1], [1,0], [0,1], [0,0]]
test_labels = [1,1,0,0]
test_pmi = pmi(test_features, test_labels, tgt_value=1,tgt_class=1)
assert test_pmi[0]==1.0
assert test_pmi[1]==0.0

### 3b Create 1-R baseline [1 mark]

- Apply your PMI function to the *1_hot feature representation* of training data, and determine the (i) 1-hot feature with *highest* PMI with class '>50K' and (ii) 1-hot feature with *lowest* (most negative) PMI with class '>50K'. Store the name (string) of the corresponding highest/lowest PMI features in `highest_pmi_feature_name` and `lowest_pmi_feature_name`, respectively.

- The feature with *highest* PMI will constitute your 1-R predictor, which you should use to predict the class labels for the test set (`one_r_predictions`)

In [667]:
one_r_predictions = []
# compute the PMI values once and sort them in ascending order
train_pmis=pmi(x_train_1hot,y_train,tgt_value=1,tgt_class=' >50K')
d_order=sorted(train_pmis.items(),key=lambda x:x[1],reverse=False)
cat_names=get_cat_name(x_train_cat)
highest_pmi_feature_name=cat_names[d_order[-1][0]]
lowest_pmi_feature_name=cat_names[d_order[0][0]]
print(f"The feature with highest PMI for the class '>50K' is: {highest_pmi_feature_name}")
print(f"The feature with lowest PMI for the class '>50K' is: {lowest_pmi_feature_name}")
for i in range(len(y_test)):
    if x_test_1hot[i][d_order[-1][0]]== 1:
        one_r_predictions.append(' >50K')
    else:
        one_r_predictions.append(' <=50K')
print(one_r_predictions[:10])

The feature with highest PMI for the class '>50K' is:  Doctorate
The feature with lowest PMI for the class '>50K' is:  Priv-house-serv
[' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K', ' <=50K']


## Question 4: Naive Bayes [3 marks]

We will construct three Naive Bayes classifiers

1. One for instances with only 1-hot encoded (binary) categorical attributes.
2. One for instances with only numerical attributes.
3. One for instances with the full set of numerical *and* categorical attributes, ensuring that your classifier computes posterior class probabilities as $p(y|x) \propto p(x|y)p(y)$.

For each classifier, you will 
1. Train it on the training set
2. Use the models to predict labels for the test set


### 4a Implementing the Naive Bayes classifiers [1.5 marks]

Implement three functions which train a NB classifier given the specified input feature types and predict labels for a given test set.

You may add additional quantities to your `return` statements.

<b> You may (and are, indeed, encouraged to) use the existing NB implementations from `sklearn`. You should use the default parameterizations of these algorithms.</b> 
    
If you choose to implement your classifiers from scratch, please use Laplace smoothing (alpha=1) for the categorical feature NB.
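One possible way to obtain $p(x|y)p(y)$ over mixed attribute types is sketched below. It is not the approach used in the code cells of this notebook, and the helper `nb_mixed_predict` and the toy arrays are hypothetical. The sketch assumes a `GaussianNB` for the numeric columns and a `BernoulliNB` for the 0/1 columns, and combines them under the naive independence assumption: adding the two log-posteriors and subtracting the log-prior once leaves $\log p(x_{num}|y) + \log p(x_{bin}|y) + \log p(y)$ up to a term that does not depend on $y$.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB


def nb_mixed_predict(train_num, train_bin, train_labels, test_num, test_bin):
    """Hypothetical helper: combine Gaussian NB (numeric) and Bernoulli NB (0/1 columns)."""
    gnb = GaussianNB().fit(train_num, train_labels)
    bnb = BernoulliNB().fit(train_bin, train_labels)  # default alpha=1.0 (Laplace smoothing)
    # Each log-posterior contains log p(y) once, so subtract it once to avoid
    # counting the prior twice; the result is proportional to p(x|y)p(y).
    scores = (gnb.predict_log_proba(test_num)
              + bnb.predict_log_proba(test_bin)
              - bnb.class_log_prior_)
    return gnb.classes_[np.argmax(scores, axis=1)]


# Toy data (made up): one numeric column and two binary columns
toy_num = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.9]])
toy_bin = np.array([[1, 0], [1, 0], [1, 1], [0, 1], [0, 1], [0, 0]])
toy_labels = np.array(["low", "low", "low", "high", "high", "high"])

toy_predictions = nb_mixed_predict(
    toy_num, toy_bin, toy_labels,
    np.array([[1.1], [5.1]]), np.array([[1, 0], [0, 1]]),
)
print(toy_predictions)
```

Both models are fitted on the same labels, so their class order and class priors agree, which is what allows the scores to be added column by column.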


In [668]:
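# build one combined data set (train followed by test) for nb_full, so that its
# one-hot encoder sees every feature value
# note: fitting on this set also exposes the classifier to the test instances and
# their labels, so the scores reported for the full NB model are optimistic
wholedata=[]
wholelabel=[]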
with open("adult.train", mode='r') as tr:
    for line in tr:
        atts = line.strip().split(",")
        wholedata.append(atts[:-1]) #all atts, excluding the class
        wholelabel.append(atts[-1])
with open("adult.test", mode='r') as te:
    for line in te:
        atts = line.strip().split(",")
        wholedata.append(atts[:-1]) #all atts, excluding the class(label)
        wholelabel.append(atts[-1])

In [669]:
from sklearn import datasets
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder
# helper: label-encode every feature column
def transfer_to_labelcoder(dataset):  
    wo = LabelEncoder()
    temp=[]
    labelcode=[]
    #transfer train feature LabelEncoder
    for j in range(len(dataset[0])):
            wo.fit(np.array(dataset)[:,j])
            temp.append(wo.transform(np.array(dataset)[:,j]))
    labelcode = [ [row[i] for row in temp] for i in range(len(temp[0])) ]          
    return(labelcode)
def nb_binary_features(train_features, train_labels, test_features):
    # multinomial NB on the 1-hot features (default alpha=1.0, i.e. Laplace smoothing)
    clf = MultinomialNB()
    clf.fit(train_features, train_labels)
    predictions = clf.predict(test_features)
    return predictions
def nb_continuous_features(train_features, train_labels, test_features):
    train=[]
    test=[]
    predictions = []
    train_features=np.array(train_features)
    test_features=np.array(test_features)
    #convert to int 
    for i in range(len(train_features)):
        train.append([])
        for j in range(len(train_features[i])):
            train[i].append(int(train_features[i][j]))
    for i in range(len(test_features)):
        test.append([])
        for j in range(len(test_features[i])):
            test[i].append(int(test_features[i][j]))   
    clf = MultinomialNB()
    clf.fit(train, train_labels)
    predictions = clf.predict(test)
    return predictions
def nb_full(train_features, train_labels, test_features): 
    ohe = OneHotEncoder()
    ohe.fit(train_features)
    X_trans = ohe.transform(train_features).toarray()
    X_trans1 = ohe.transform(test_features).toarray()
    predictions = []
    clf = MultinomialNB()
    clf.fit(X_trans, train_labels)
    predictions = clf.predict(X_trans1)
    return predictions

In [670]:
print(nb_binary_features(x_train_1hot,y_train,x_test_1hot))
print(nb_continuous_features(x_train_num,y_train,x_test_num))
print(nb_full(wholedata,wholelabel,x_test))



[' <=50K' ' <=50K' ' >50K' ... ' >50K' ' <=50K' ' >50K']
[' <=50K' ' <=50K' ' <=50K' ... ' <=50K' ' >50K' ' <=50K']
[' <=50K' ' <=50K' ' >50K' ... ' >50K' ' <=50K' ' >50K']


### 4b Apply your classifiers to the data sets you created in Questions 1 and 3b [0.5 marks]

...namely, the

1. 1-hot categorical features `x_{train,test}_1hot`
2. numerical features `x_{train,test}_num`
3. combined numerical and 1-hot categorical features `x_{train,test}_full`


In [671]:
from collections import Counter

categorical_nb_predictions = nb_binary_features(x_train_1hot,y_train,x_test_1hot)
numeric_nb_predictions = nb_continuous_features(x_train_num,y_train,x_test_num)
full_nb_predictions=nb_full(wholedata,wholelabel,x_test)

print(f"Numerical NB predicted class distribution \t {Counter(numeric_nb_predictions)}")
print(f"Categorical NB predicted class distribution\t {Counter(categorical_nb_predictions)}")
print(f"Full NB predicted class distribution\t {Counter(full_nb_predictions)}")
print(Counter(y_test))

Numerical NB predicted class distribution 	 Counter({' <=50K': 15002, ' >50K': 1260})
Categorical NB predicted class distribution	 Counter({' <=50K': 11212, ' >50K': 5050})
Full NB predicted class distribution	 Counter({' <=50K': 12081, ' >50K': 4181})
Counter({' <=50K': 12419, ' >50K': 3843})



### 4c Explain your Naive Bayes classifier implementation on the full attribute set (numerical and binary features). [1 mark]

Please limit your answer to 2-3 sentences


## Question 5: Logistic Regression [2.5 marks]

Apply a Logistic Regression classifier to the full training data set (`x_{train,test}_full`)

<b> Use the existing implementation in sklearn with default parameters. </b>

### 5a The Logistic Regression classifier [0.5 marks]

In [672]:
from sklearn.linear_model import LogisticRegression
lr_prediction =[]
lr = LogisticRegression()
lr.fit(transfer_to_floatversion(x_train_full), y_train)
lr_prediction=lr.predict(transfer_to_floatversion(x_test_full))
print(Counter(lr_prediction))
print(lr.coef_[0])

Counter({' <=50K': 14837, ' >50K': 1425})
[-7.21387558e-03 -3.73286510e-06 -1.67825964e-03  3.39459956e-04
  7.80962890e-04 -7.92572640e-03 -6.53593223e-05  1.42172837e-05
 -2.11567947e-06 -3.11884460e-04  3.84111813e-05 -2.09294529e-05
 -8.10313599e-06 -3.66649752e-05 -4.98638319e-05 -1.63724905e-05
 -6.12483078e-06 -1.19642566e-05 -2.76905298e-05 -2.05340648e-05
 -8.40118099e-06 -1.18541315e-05  9.81973438e-05  2.90157238e-05
 -2.53763649e-04  6.68810707e-05 -2.35850588e-06  3.32547049e-05
 -1.37519982e-04 -1.64020514e-04  6.97605095e-07  3.77231225e-04
 -1.61455754e-05 -4.71026000e-04 -4.01181347e-05 -4.23821912e-05
 -6.53593223e-05 -1.06488123e-04 -2.19518263e-07 -4.88516576e-05
  1.16364285e-04 -3.89472776e-05 -5.24791514e-05 -5.67940170e-05
 -1.47613316e-04 -7.31367328e-06  9.14190391e-05  6.63184111e-06
 -2.14970139e-05  2.51473327e-06 -2.71304129e-05  3.36872624e-04
 -3.04279456e-04 -4.20716721e-05 -2.48113347e-04 -1.44674253e-04
  4.65025189e-05 -1.58387319e-05 -1.28973282e-05

In [687]:
# Question 5b: map each feature index to its learnt coefficient
print(dict(enumerate(lr.coef_[0])))

{0: -0.0072138755790519555, 1: -3.7328651002995527e-06, 2: -0.0016782596410526564, 3: 0.0003394599563894541, 4: 0.0007809628903558092, 5: -0.007925726398241137, 6: -6.535932227514622e-05, 7: 1.421728366273114e-05, 8: -2.115679465371112e-06, 9: -0.0003118844596275048, 10: 3.841118129949208e-05, 11: -2.0929452932320778e-05, 12: -8.103135985802209e-06, 13: -3.666497518049846e-05, 14: -4.986383194133196e-05, 15: -1.6372490534483023e-05, 16: -6.124830781967234e-06, 17: -1.1964256579556898e-05, 18: -2.769052976958806e-05, 19: -2.05340647527776e-05, 20: -8.401180992116505e-06, 21: -1.1854131506711071e-05, 22: 9.819734383182207e-05, 23: 2.9015723779803417e-05, 24: -0.00025376364861786564, 25: 6.688107072742726e-05, 26: -2.358505884688427e-06, 27: 3.325470494069792e-05, 28: -0.00013751998206199554, 29: -0.00016402051430343027, 30: 6.976050948883326e-07, 31: 0.0003772312253068853, 32: -1.6145575443386998e-05, 33: -0.00047102600010380595, 34: -4.011813465996152e-05, 35: -4.238219121513341e-05, 36

In [691]:
# Question 5b: rank the features by the absolute value of their coefficient (ascending)
totalabs=np.abs(lr.coef_[0])
order=sorted(enumerate(totalabs),key=lambda x:x[1],reverse=False)
print(order)

[(96, 1.0977913983190131e-07), (66, 1.1808353065008397e-07), (65, 1.861519933222313e-07), (38, 2.195182633475964e-07), (81, 2.1977866587789811e-07), (92, 2.502277872997251e-07), (99, 2.8354476877486764e-07), (79, 3.2691190548856737e-07), (80, 3.4493686837915777e-07), (73, 4.1614222783866347e-07), (103, 4.293399901210092e-07), (88, 4.369908732329615e-07), (67, 5.20699344058636e-07), (84, 5.870021145684357e-07), (83, 5.959348450724547e-07), (75, 5.967881264702952e-07), (100, 6.204401655178435e-07), (30, 6.976050948883326e-07), (87, 7.040907151601227e-07), (71, 7.251593278829925e-07), (91, 7.367399911537851e-07), (76, 7.930985076061114e-07), (85, 7.978581891167708e-07), (74, 8.096844945268846e-07), (90, 8.151521486377613e-07), (98, 9.853858072745964e-07), (93, 9.967372289596344e-07), (69, 1.114530963396452e-06), (82, 1.1353380151869265e-06), (78, 1.1935598989497887e-06), (94, 1.4875626800741608e-06), (86, 1.8002532734921485e-06), (77, 1.961205317069596e-06), (8, 2.115679465371112e-06), (9

In [623]:
print(d_order)#PMI

[(39, -5.16672902846202), (48, -4.185537380669371), (64, -3.076843524944825), (62, -2.8302035573617004), (10, -2.7549154300574634), (47, -2.67474508137348), (38, -2.5338387361636814), (27, -2.38871290636816), (71, -2.362598007278703), (11, -2.3269388750711215), (8, -2.2378866055013926), (83, -2.2318413159693526), (13, -2.1954880214434445), (84, -2.035023349250199), (12, -1.9565492912271139), (36, -1.9391482681836592), (49, -1.927988359959914), (28, -1.9045745830342142), (85, -1.9017568183867342), (7, -1.8560015551717777), (96, -1.6917216035702691), (9, -1.6613896033681306), (73, -1.6480002261409512), (26, -1.5640012401941834), (66, -1.5055559611207459), (29, -1.4909108190486453), (72, -1.4069921266371563), (54, -1.385853359579006), (46, -1.2244675924197288), (0, -1.2124820233779183), (30, -1.2124820233779183), (23, -1.2087111180647157), (89, -1.1954880214434447), (94, -1.1954880214434447), (88, -1.157013873628809), (56, -1.1370665448929695), (32, -1.1174855094421714), (82, -1.117485509

### 5b Inspect the learnt coefficients [2 marks]

**Part 1**
Inspect the coefficients (weights) learnt by the classifier, and compare them to the PMI values in Question 3. You should update your code above in order to achieve this. Please indicate through comments which parts of the code are relevant to question 5b.

**Part 2**
What does a high coefficient for a feature imply? For which two features does your model learn the largest coefficients? Compare your answer to the highest PMI features you detected in Q3.

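The larger the absolute value of a feature's coefficient, the more strongly that feature influences the prediction; the sign indicates whether it pushes the prediction towards '>50K' (positive) or '<=50K' (negative).

The two largest coefficients (by absolute value) belong to features 5 and 0, i.e. hours-per-week and age. By contrast, the highest-PMI features are Doctorate and Prof-school.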
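To make the index-to-name lookup less error-prone, a small helper along the following lines could pair each coefficient with its feature name and rank by absolute value. The helper `top_coefficients` is hypothetical, and the toy coefficients are rounded illustrative values, not the full model output.

```python
import numpy as np


def top_coefficients(coefs, feature_names, k=2):
    """Hypothetical helper: the k (name, coefficient) pairs with the largest absolute coefficient."""
    coefs = np.asarray(coefs, dtype=float)
    order = np.argsort(-np.abs(coefs))[:k]  # indices sorted by descending |coefficient|
    return [(feature_names[i], float(coefs[i])) for i in order]


# Illustrative values only (rounded), one name per coefficient
toy_names = ["age", "fnlwgt", "hours-per-week"]
toy_coefs = [-0.0072, -0.0000037, -0.0079]
toy_top = top_coefficients(toy_coefs, toy_names)
print(toy_top)
```

Ranking by absolute value treats strongly negative and strongly positive coefficients alike; note that coefficients of unscaled numeric features and 0/1 features are not directly comparable in magnitude.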

## Question 6: Evaluation [1 mark]

We will evaluate our baselines and classifiers on the instances in the test set. 

Compute 
- accuracy
- macro-averaged F1 score 

for the two baselines, the three NB classifiers and the LR classifier.

**You may use existing implementations and/or Python libraries like numpy, scipy or sklearn.**
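To show what the two metrics measure, here is a minimal from-scratch sketch (the function names `accuracy` and `macro_f1` and the toy labels are made up for this example). It computes one F1 score per class and averages them without weighting, which is the usual meaning of macro-averaging.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)


# Toy imbalanced example: a majority-class predictor on 3 negatives and 1 positive
toy_true = ["<=50K", "<=50K", "<=50K", ">50K"]
toy_pred = ["<=50K"] * 4
print(accuracy(toy_true, toy_pred), round(macro_f1(toy_true, toy_pred), 3))
```

On this toy data the majority-class predictor is right three times out of four, yet its F1 for the minority class is zero, so the macro average drops well below the accuracy.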

In [674]:
from sklearn.metrics import accuracy_score, f1_score

zero_r_acc=accuracy_score(zero_r_prediction ,y_test)
one_r_acc=accuracy_score(one_r_predictions,y_test)
num_nb_acc=accuracy_score(numeric_nb_predictions,y_test)
cat_nb_acc=accuracy_score(categorical_nb_predictions,y_test)
full_nb_acc=accuracy_score(full_nb_predictions,y_test)
lr_acc=accuracy_score(lr_prediction,y_test)

zero_r_f1=f1_score(y_test,zero_r_prediction, average='macro')
one_r_f1=f1_score(y_test,one_r_predictions, average='macro')
num_nb_f1=f1_score(y_test,numeric_nb_predictions, average='macro')
cat_nb_f1=f1_score(y_test,categorical_nb_predictions, average='macro')
full_nb_f1=f1_score(y_test,full_nb_predictions, average='macro')
lr_f1=f1_score(y_test,lr_prediction, average='macro')



print(f"Zero R\t\tAccuracy: {round(zero_r_acc, 2)}\tMacro F1: {round(zero_r_f1, 2)}")
print(f"One  R\t\tAccuracy: {round(one_r_acc, 2)}\tMacro F1: {round(one_r_f1, 2)}")
print(f"NB Num \t\tAccuracy: {round(num_nb_acc, 2)}\tMacro F1: {round(num_nb_f1, 2)}")
print(f"NB Cat \t\tAccuracy: {round(cat_nb_acc, 2)}\tMacro F1: {round(cat_nb_f1, 2)}")
print(f"NB Full \tAccuracy: {round(full_nb_acc, 2)}\tMacro F1: {round(full_nb_f1, 2)}")
print(f"LR \t\tAccuracy: {round(lr_acc, 2)}\tMacro F1: {round(lr_f1, 2)}")

Zero R		Accuracy: 0.76	Macro F1: 0.43
One  R		Accuracy: 0.77	Macro F1: 0.46
NB Num 		Accuracy: 0.79	Macro F1: 0.6
NB Cat 		Accuracy: 0.8	Macro F1: 0.75
NB Full 	Accuracy: 0.88	Macro F1: 0.84
LR 		Accuracy: 0.8	Macro F1: 0.63


## Question 7: Discussion [5 marks]

Critically analyze the performance of the models by answering the following questions. 

**(a)** The three Naive Bayes (NB) classifiers lead to different performance. Which of the three NB classifiers performs best, and why do you think it is the case? **[1 mark]**

**(b)** There is a systematic difference between Accuracy and F1 score, across all models. Describe and explain the difference in the context of the data set. **[1 mark]**

**(c)** Assume that your best classifier will be deployed in a bank to decide whether a customer will be granted a loan, or not. A loan will be granted if the classifier predicts an income of ">50K", otherwise it won't. The bank manager wants to avoid falsely granting a loan to an applicant with insufficient income by all means. Can you reassure the bank manager with the evaluation results in Q6 that your classifier is adequate? If not, describe an alternative evaluation metric and explain why it is more appropriate. **[1 mark]**

**(d)** Do you observe a clear difference in performance between NB and Logistic Regression (LR) in Q6? Referring to the number of parameters and assumptions underlying each model, provide one reason why NB might outperform LR, and one reason why LR might outperform NB. In your answer, refer back to the *Adult* data set used in this assignment. **[1 mark]**

**(e)** Assume that you have access to a (hypothetical) infinitely large U.S. census database (i.e., an enormous version of the *Adult* data set used in this assignment). Is it true that Logistic Regression, but not Naive Bayes, will achieve perfect test set performance? Why (not)? **[1 mark]**

<b>We expect a maximum of 2-3 sentences per question.</b>

_Your answer here_

(a) The third NB classifier (full attribute set) performs best, because it uses all features, each in one-hot form; the resulting sparse representation gives the model a more reasonable feature distribution to fit.

(b) F1 = 2PR/(P+R). There are many more '<=50K' than '>50K' instances, so accuracy is not a good metric here: a model can reach high accuracy by mostly predicting the majority class. Precision and recall on the minority class are then low, which makes the macro-averaged F1 score systematically lower than accuracy.

(c) No, the accuracy results cannot reassure the manager. In binary classification with imbalanced positive and negative examples, accuracy has little reference value. Recall and specificity pay more attention to how the predictions relate to the true labels: once we know which customers truly earn '>50K', we can calculate the recall and specificity scores, and high values indicate that the predictions are good.

(d) NB might outperform LR because NB does not need to optimize its parameters iteratively, so it is the better choice when there is not enough data to fit the coefficients well. LR might outperform NB when the data violate the conditional independence assumption, and when enough detailed data is available to train the model.

(e) No, it depends. If the data set is too large, LR is prone to under-fitting and is generally not very accurate; when the feature space is large, the performance of logistic regression is not very good. There is no free lunch: the choice of algorithm should be considered for the given situation.
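For question (c), a falsely granted loan corresponds to a false positive for the '>50K' class. The sketch below (the helper name `positive_class_metrics` and the toy labels are hypothetical) counts the four confusion-matrix cells and derives precision, recall and specificity from them, so that the cost of false positives is visible separately from overall accuracy.

```python
def positive_class_metrics(y_true, y_pred, positive=" >50K"):
    """Precision, recall and specificity with respect to the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)  # loans granted wrongly
    tn = sum(t != positive and p != positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Toy labels (made up), written with the leading space used in the data files
toy_true = [" >50K", " <=50K", " <=50K", " >50K", " <=50K"]
toy_pred = [" >50K", " >50K", " <=50K", " <=50K", " <=50K"]
toy_metrics = positive_class_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Precision answers "of the loans granted, how many went to applicants who really earn '>50K'", while specificity answers "of the applicants with insufficient income, how many were correctly refused".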

## Final thoughts ...

Do you think it is ethical and safe to build and deploy income-level classifiers based on the set of attributes provided with the data set? Could such a classifier do harm if not developed, tested and deployed carefully?

When predicting the income of a person, should demographic attributes like 'gender' or 'race' even be considered? Due to historical events there are systematic income differences between men/women and people of different ethnicities. But no doubt these are *correlations* and not factors to be used to *predict* income in the future. 

But ML models are only as good as their input data: a naively developed ML algorithm will reflect, and often amplify, biases in the data.

In the last 1-2 weeks of this subject we will discuss ethics in machine learning and look at questions such as
- historical artifacts which lead to biased data sets, and hence potentially biased ML algorithms
- how to determine whether an algorithm is indeed biased
- how to mitigate bias in the data and in algorithms

