# Income Prediction based on U.S. Census data

# General Info

<b>Materials</b>: See [Using Jupyter Notebook and Python page](https://canvas.lms.unimelb.edu.au/courses/105477/pages/python-and-jupyter-notebooks?module_item_id=2613813) on Canvas (under Modules>Resources) for information on the basic setup required for this class, including an iPython notebook viewer and some handy python packages including Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. You can use any Python built-in packages, but do not use any other 3rd party packages (the packages listed above are all fine to use); if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions.

You should be careful to use Python built-in functions and operators when appropriate and pick descriptive variable names that adhere to <a href="https://www.python.org/dev/peps/pep-0008/">Python style requirements</a>. If you think it might be unclear what you are doing, you should comment your code to help the marker make sense of it. While the main focus is on correctness of your methods, you will lose marks if your code is not understandable.

<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on the discussion board; we recommend you check it regularly.

<b>Academic misconduct</b>: For most people, collaboration will form a natural part of the undertaking of this homework, and we encourage you to discuss it in general terms with other students. However, this is ultimately still an individual task, and so reuse of code or other instances of clear influence will be considered cheating. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place.


## Overview

In this assignment, you will be working with the famous *Adult* data set, which contains demographic and income data from the United States in 1994. The dataset provided for this assignment is derived from <a href="http://archive.ics.uci.edu/ml/datasets/Adult">this</a> resource. The data set consists of about 48,000 individuals, each characterized by a set of 14 attributes. Your task is to predict whether an individual earns up to \\$50,000 a year (<=50K) or more than \\$50,000 per year (>50K).

The attributes are

|ID|Feature Name| Feature Type | Feature Values|
| :-| :-| :-| :-|
|0|age| continuous| |
|1|workclass| categorical | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, ?|
|2|fnlwgt| continuous| |
|3|education|  categorical |Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool|
|4|education-num| continuous| |
|5|marital-status|  categorical |Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse|
|6|occupation|  categorical |Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces, ?|
|7|relationship|  categorical |Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried|
|8|race|  categorical |White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black|
|9|sex|  categorical |Female, Male|
|10|capital-gain| continuous| |
|11|capital-loss| continuous| |
|12|hours-per-week| continuous| |
|13|native-country|  categorical |United-States, Cambodia, England, Puerto-Rico, Canada, Germany, India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, ?|

You can find out more about the individual attributes / values, and the origin of the data set, in the <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names">data set description</a>.


You will build a number of classifiers to predict the income class based on the attributes above.


#### The following instructions hold for every question in the assignment
- leave the order of instances intact, i.e., do not shuffle the data
- do not change the names or types of variables provided by us in the code cells below
- '?' denotes an unknown value, and you should treat it as just another value for its feature in all tasks below.

## Question 1: Loading and pre-processing the data

You were provided with two data files:

**adult.train** contains about 32,000 training instances, one instance per line in comma-separated value (CSV) format. Each line contains 15 fields: the first 14 fields correspond to the features listed above, and the final field denotes the class label.

**adult.test** is formatted exactly like adult.train, and contains about 16,000 further instances for evaluation.


### 1a Read the data
First, you will read in the data and create training features, training labels, test features and test labels. Do not apply any data transformations.
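As a rough illustration of the splitting step, the sketch below parses comma-separated lines into feature lists and labels with the standard `csv` module. The helper name `parse_instances` and the two sample lines are illustrative assumptions, not part of the provided files, and the sketch assumes each comma may be followed by a space.

```python
import csv

def parse_instances(lines):
    """Split raw CSV lines into (features, labels); the label is the last field."""
    features, labels = [], []
    # skipinitialspace drops a blank that directly follows a comma
    for fields in csv.reader(lines, skipinitialspace=True):
        if not fields:  # skip empty lines
            continue
        features.append([field.strip() for field in fields[:-1]])
        labels.append(fields[-1].strip())
    return features, labels

# Illustrative lines in the assumed layout: 14 features followed by the class label
sample_lines = [
    "39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
    "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K\n",
    "52, Self-emp-inc, 287927, HS-grad, 9, Married-civ-spouse, Exec-managerial, "
    "Wife, White, Female, 15024, 0, 40, United-States, >50K\n",
]
x_sample, y_sample = parse_instances(sample_lines)
print(len(x_sample[0]), y_sample)
```

Keeping every field as a string here matches the instruction not to apply any data transformations at this stage.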

In [33]:
# load data


data = open("adult.train",'r').readlines()
test_data = open("adult.test",'r').readlines()

x_train = []
y_train = []
x_test = []
y_test = []

############################
## your code begins here
############################
ft_train = []
ft_test = []

#helper function to remove all spaces from a string or a (nested) list of strings
def remove_space(data):
    if isinstance(data, str):
        return data.replace(" ", "")
    return [remove_space(d) for d in data]
                
for line in data:
        dt = line.strip().split(",")
        ft_train.append(dt[:-1]) 
        y_train.append(dt[-1].strip())
    
for line in test_data:
        t_dt = line.strip().split(",")
        ft_test.append(t_dt[:-1]) 
        y_test.append(t_dt[-1].strip())
        
x_train = remove_space(ft_train)
x_test = remove_space(ft_test)
        

############################
## your code ends here
############################

len(x_train), len(y_train), len(x_test), len(y_test), len(x_train[0]), len(x_test[0])
#print(y_test[:10])

(32525, 32525, 16262, 16262, 14, 14)

<b>For your testing:</b>

In [34]:
assert len(x_train)==len(y_train)==32525
assert len(x_train[0])==len(x_test[0])==14

### 1b: Attribute Types

You will create three feature representations, based on the different attribute types (categorical, numeric) in the original *Adult* data.

**Your tasks**

Denote with $I$ the number of training instances; $N$ the number of numeric features in the data set; and $v_f$ the number of possible values for feature $f$.

1. Create a train data set with only numeric features `x_train_num` (size $(I\times N)$); and equivalently a test data set `x_test_num`
2. Create a train data set with only categorical features *in a 1-hot representation* `x_train_1hot` (size $(I\times\sum_f v_f)$); and equivalently a test data set `x_test_1hot`
3. Create a train data set with both numeric and 1-hot categorical features `x_train_full` (size $(I \times (N+\sum_f v_f))$) where the first $N$ columns represent the numerical features and the remaining columns the categorical features; and equivalently a test data set `x_test_full`

**Note:** You may use classes and functions from ```scikit-learn```.
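One point that is easy to miss is that the training and test matrices need identical columns, so the 1-hot vocabulary is best derived from the training data and then reused. The pure-Python sketch below illustrates this on toy rows; the helper names (`build_vocabulary`, `one_hot`) and the toy values are assumptions made for the example, and a library encoder could be used instead.

```python
def build_vocabulary(rows, categorical_ids):
    """Map each (feature id, value) pair seen in the training rows to a column index."""
    pairs = sorted({(f, row[f]) for row in rows for f in categorical_ids})
    return {pair: column for column, pair in enumerate(pairs)}

def one_hot(rows, categorical_ids, vocabulary):
    """Encode rows as 0/1 vectors using a fixed vocabulary."""
    encoded = []
    for row in rows:
        vector = [0] * len(vocabulary)
        for f in categorical_ids:
            column = vocabulary.get((f, row[f]))
            if column is not None:  # values unseen in training stay all-zero
                vector[column] = 1
        encoded.append(vector)
    return encoded

# Toy rows: column 0 is numeric, columns 1 and 2 are categorical
toy_train = [["39", "Private", "Male"], ["52", "State-gov", "Female"], ["23", "Private", "Female"]]
toy_test = [["41", "?", "Male"]]
toy_numeric_ids = [0]
toy_categorical_ids = [1, 2]

vocab = build_vocabulary(toy_train, toy_categorical_ids)
toy_train_1hot = one_hot(toy_train, toy_categorical_ids, vocab)
toy_test_1hot = one_hot(toy_test, toy_categorical_ids, vocab)

# Full representation: numeric columns first, then the 1-hot columns
toy_train_full = [
    [float(row[f]) for f in toy_numeric_ids] + hot
    for row, hot in zip(toy_train, toy_train_1hot)
]
print(toy_train_full[0], toy_test_1hot[0])
```

Encoding the two files independently can produce matrices of different widths whenever a value occurs in only one of them, which is why the vocabulary is fixed up front in this sketch.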



In [35]:
x_train_num = []
x_test_num = []

x_train_1hot = []
x_test_1hot = []

x_train_full = []
x_test_full = []

################################
### Your code begins here ######
################################
import pandas as pd
import numpy as np

Columns_x = ('age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 
           'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 
           'native-country')

Columns_y = 'income-level'

Numeric_columns = ('age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week')

Categorial_columns = ('workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 
                      'sex', 'native-country')

#####transfer train & test data into pandas DataFrame for further usage#####
df_x_train = pd.DataFrame(x_train, columns=list(Columns_x))  #train_features
df_y_train = pd.DataFrame(y_train, columns=[Columns_y])  #train_labels

df_x_test = pd.DataFrame(x_test, columns=list(Columns_x))   #test_features
df_y_test = pd.DataFrame(y_test, columns=[Columns_y])   #test_labels

#numeric features using pandas
x_train_num = df_x_train[list(Numeric_columns)].values.tolist()
x_test_num = df_x_test[list(Numeric_columns)].values.tolist()

#one-hot using pandas
pre_x_1hot_train = pd.DataFrame(df_x_train, columns = list(Categorial_columns))#dataframe contains Categorial_columns
pre_x_1hot_test = pd.DataFrame(df_x_test, columns = list(Categorial_columns))#dataframe contains Categorial_columns

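a_train = pd.get_dummies(pre_x_1hot_train, columns = list(Categorial_columns))
a_test = pd.get_dummies(pre_x_1hot_test, columns = list(Categorial_columns))
#align test columns with the training columns in case a value is missing from one of the files
a_test = a_test.reindex(columns=a_train.columns, fill_value=0)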

x_train_1hot = a_train.values.tolist()
x_test_1hot = a_test.values.tolist()
#print(x_train_1hot[0])

#both numeric and 1-hot categorical features

for i in range(0, len(x_train_num)):
    xtf = x_train_num[i] + x_train_1hot[i]
    x_train_full.append(xtf)
    
for i in range(0, len(x_test_num)):
    xtf_1 = x_test_num[i] + x_test_1hot[i]
    x_test_full.append(xtf_1)

################################
### Your code ends here ########
################################

#print(x_train_1hot[:10])

<b> For your testing</b>

In [36]:
assert len(x_train_1hot[0])==98
assert len(x_train_full[0])==104

## Question 2: A 0-R baseline

Implement a zero-r baseline, as introduced in the Evaluation lecture.
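As a minimal sketch of the idea, Zero-R ignores all features and always predicts the most frequent training label. The function name `zero_r` and the toy labels below are assumptions for illustration only.

```python
from collections import Counter

def zero_r(train_labels, n_test):
    """Predict the most frequent training label for every test instance."""
    majority, _count = Counter(train_labels).most_common(1)[0]
    return majority, [majority] * n_test

toy_majority, toy_predictions = zero_r(["<=50K", ">50K", "<=50K"], n_test=2)
print(toy_majority, toy_predictions)
```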

In [37]:
from collections import Counter

majority_class = ""
zero_r_predictions = []

############################
## your code begins here
############################
majority_class = max(set(y_train), key=y_train.count)

zero_r_predictions = [majority_class for i in range(len(y_test))]

############################
## your code ends here
############################

print(f"The majority class is: {majority_class}")


The majority class is: <=50K


## Question 3: Feature selection

In this question you will implement pointwise mutual information (PMI) for feature selection (Question 3a).

In Question 3b, you will use the implemented function to create a 1-R classifier based on the single 1-hot attribute (i.e., categorical feature value) with the highest PMI for class ">50K":
```
argmax_a pmi(a=1,c=">50K")
```
You will then apply your 1-R classifier to the test instances and store your predicted labels in `one_r_predictions`.

<b> You should implement PMI from scratch yourself. You may use native Python libraries like math or numpy to help you, but you may not use existing implementations of PMI.</b>



### 3a. Implement PMI

Implement a function to compute PMI.
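For reference, a common definition of PMI between a feature value $a=v$ and a class $c$, assuming a base-2 logarithm and probabilities estimated as relative frequencies over the $I$ training instances, is

$$\mathrm{PMI}(a=v,\,c) = \log_2 \frac{p(a=v,\,c)}{p(a=v)\,p(c)}$$

As a worked example under these assumptions, take four instances with features $[1,1], [1,0], [0,1], [0,0]$ and labels $1, 1, 0, 0$, with target value $v=1$ and target class $c=1$. Then $p(c)=\tfrac{2}{4}$ and both features have $p(a=1)=\tfrac{2}{4}$. The first feature has $p(a=1,c)=\tfrac{2}{4}$, giving $\log_2\frac{0.5}{0.25}=1$; the second has $p(a=1,c)=\tfrac{1}{4}$, giving $\log_2\frac{0.25}{0.25}=0$. When $p(a=v,c)=0$ the logarithm is undefined and the value tends to $-\infty$.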


In [38]:
'''
PMI: A function to compute the pointwise mutual information between 
all features in the data set and a target class of interest
Input: - features for N input instances
       - labels for N input instances
       - target feature value (default=1)
       - target class (default='>50K')
Output: a dictionary of the form {feature: pmi} where each 
        feature is denoted by its ID (position in the 1-hot encoded representation)

'''

############################
## your code begins here
############################
import math
import sys

def pmi(features, labels, tgt_value=1, tgt_class=">50K"):
    pmis = {}
           
    helper_lst = extract_elem(features)
    
    #calculate P(A) - features
    p_a = []
    p_a = calculate_p(helper_lst, tgt_value)
    #print(p_a)

    #P(C) - labels
    lenth_c = len(labels)
    count=0
    for elem in labels:
        if elem == tgt_class:
            count+=1
    p_c = count/lenth_c

    #P(A,C) - joint probabilities
    p_ac = []
    
    for row in helper_lst:
        count_jc = 0
        for i in range(0,len(row)):
            if row[i] == tgt_value and labels[i] == tgt_class:
                count_jc += 1
        p_ac.append(count_jc/len(features))

    #indexes in helper_lst are the keys for pmis
    for i in range(len(helper_lst)):
        denominator = p_a[i] * p_c
        if denominator == 0 or p_ac[i] == 0:
            #log2(0) is undefined; the value never co-occurs with the class,
            #so treat its PMI as minus infinity (this also avoids dividing by zero)
            pmis[i] = -math.inf
        else:
            pmis[i] = math.log2(p_ac[i] / denominator)

    return pmis

#Helper function to calculate, for each feature, the probability of tgt_value
def calculate_p(FL_list, tgt_value):
    p = [] #list indexed by feature position
    for row in FL_list:
        count = 0
        for elem in row:
            if elem == tgt_value:
                count += 1
        p.append(count/len(row))
    return p

def extract_elem(lst):
    new_lst = []
    for i in range(0, len(lst[0])):
        new_lst.append([item[i] for item in lst])
    return new_lst

############################
## your code ends here
############################


<b>For your testing:</b>

In [39]:
test_features = [[1,1], [1,0], [0,1], [0,0]]
test_labels = [1,1,0,0]
test_pmi = pmi(test_features, test_labels, tgt_class=1)

# where the index 0 refers to "feature 1" and index 1 refers to "feature 2"
assert test_pmi[0]==1.0
assert test_pmi[1]==0.0

### 3b Create 1-R baseline

- Apply your PMI function to the *1-hot feature representation* of the training data, and determine (i) the 1-hot feature with the *highest* PMI with class '>50K' and (ii) the 1-hot feature with the *lowest* (most negative) PMI with class '>50K'. Store the name (string) of the corresponding highest/lowest PMI features in `highest_pmi_feature_name` and `lowest_pmi_feature_name`, respectively.

- The feature with the *highest* PMI will constitute your 1-R predictor, which you should use to predict the class labels for the test set (`one_r_predictions`)
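One possible reading of these two steps is sketched below: pick the 1-hot column with the largest PMI, then let each value of that column predict the majority training label observed with it. The PMI scores, the toy data and the helper names (`one_r_fit`, `one_r_predict`) are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict

def one_r_fit(column, labels):
    """For each value of one attribute, store the majority training label."""
    labels_by_value = defaultdict(list)
    for value, label in zip(column, labels):
        labels_by_value[value].append(label)
    rule = {
        value: Counter(seen).most_common(1)[0][0]
        for value, seen in labels_by_value.items()
    }
    # overall majority label, used for values that never occurred in training
    fallback = Counter(labels).most_common(1)[0][0]
    return rule, fallback

def one_r_predict(column, rule, fallback):
    """Look up the stored label for each value of the chosen attribute."""
    return [rule.get(value, fallback) for value in column]

# Hypothetical PMI scores keyed by 1-hot column index
toy_pmis = {0: -0.4, 1: 1.3, 2: 0.2}
best_column = max(toy_pmis, key=toy_pmis.get)

toy_x = [[0, 1, 0], [1, 0, 0], [0, 1, 1], [1, 0, 1]]
toy_y = [">50K", "<=50K", ">50K", "<=50K"]
toy_x_test = [[0, 1, 0], [0, 0, 1]]

rule, fallback = one_r_fit([row[best_column] for row in toy_x], toy_y)
toy_one_r = one_r_predict([row[best_column] for row in toy_x_test], rule, fallback)
print(best_column, rule, toy_one_r)
```

Because a 1-hot column only takes the values 0 and 1, the resulting rule has at most two branches.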

In [40]:
one_r_predictions = []

highest_pmi_feature_name = "" # feature with highest PMI
lowest_pmi_feature_name = "" # feature with lowest PMI

############################
## your code begins here
############################
Categorial_list = list(Categorial_columns)
#('workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'native-country')
feature_names = []  #total 98 names
features_count = {} #dictionary stores 8 {Categorial_feature:count}
pmi_aggregate = {} #dictionary stores 8 {Categorial_feature:pmi}
    
def obtain_1r_ftNames(catg_train_data):
    lst = []
    for col in catg_train_data:
        lst.append(col)
    #print(lst)
    return lst

#print(type(feature_names[0]))
def counter_tool(lst, substring):
    count_f = 0 
    for elem in lst:
        if substring in elem:
            count_f+=1
    return count_f

def count_ft(feature_names_lst):
    d = {}
    for item in Categorial_columns:
        num = counter_tool(feature_names_lst, item)
        d[item] = num
    return d

#calculate aggregate pmi for each     
def pmi_cal(lst, substring, pmis_dict):
    pmi_sum = 0
    for elem in lst:
        if substring in elem:
            pmi_sum += pmis_dict[lst.index(elem)]
    return pmi_sum

def pmi_aggr(One_R_pmis):
    pmi_dic={}
    for item in Categorial_columns:
        num = pmi_cal(feature_names, item, One_R_pmis)
    #print(num)
        pmi_dic[item] = num
    return pmi_dic

def max_name(pmi_aggr_dict):
    m_name = max(pmi_aggr_dict, key = pmi_aggr_dict.get)
    return m_name

#dictionary stores {Categorial_subfeatures:pmi}
pmis = pmi(x_train_1hot, y_train)

#a list of Categorial names for all 98 sub-features
feature_names = obtain_1r_ftNames(a_train) #pre_x_1hot_train check Question 1-b

#dictionary stores sum_pmi group by Categorial_feature names
pmi_aggregate = pmi_aggr(pmis) 
#print(pmi_aggregate)

#dictionary stores 8 {Categorial_feature:count(how many sub_features)}
features_count = count_ft(feature_names)

#Categorial feature name with highest/lowest pmi
highest_pmi_feature_name = max_name(pmi_aggregate)  
lowest_pmi_feature_name = min(pmi_aggregate, key = pmi_aggregate.get)

#find the (train/test) feature with highest PMI, extract 'workclass' column
feature_Hpmi = df_x_train[highest_pmi_feature_name].values.tolist()
#print(feature_Hpmi)
x_test_Hpmi = df_x_test[highest_pmi_feature_name].values.tolist()

#use y_train as train labels

#find sub-feature names under 'high_feature_name'
def find_pattern(features_count_dict, high_feature_name, catg_feature_names):
    #select by prefix so this also works when the feature is not the first categorical column
    prefix = high_feature_name + "_"
    return [name[len(prefix):] for name in catg_feature_names if name.startswith(prefix)]

highest_sub_features = find_pattern(features_count, highest_pmi_feature_name, feature_names)
#print(highest_sub_features)

def label_predict(patterns, single_feature_raw_data, single_feature_labels, test_data): #returns a list of 1-r labels under each sub features
    label_predict = []
    final_predict = []

    for wc in patterns:
        a = [] 
        labels = [] 
        for i in range(len(single_feature_raw_data)):
            if single_feature_raw_data[i] == wc:
                labels.append(single_feature_labels[i])
        a.append(max(set(labels), key=labels.count)) 
        #print(a)
        label_predict += a
#print(label_predict)

    for item in test_data:
        for wc in patterns:
            if item == wc:
                final_predict.append(label_predict[patterns.index(wc)])
                
    return final_predict

one_r_predictions = label_predict(highest_sub_features, feature_Hpmi, y_train, x_test_Hpmi)


############################
## your code ends here
############################

print(one_r_predictions[:10])
print(f"The feature with highest PMI for the class '>50K' is: {highest_pmi_feature_name}")
print(f"The feature with lowest PMI for the class '>50K' is: {lowest_pmi_feature_name}")

['<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K', '<=50K']
The feature with highest PMI for the class '>50K' is: workclass
The feature with lowest PMI for the class '>50K' is: native-country


## Question 4: Naive Bayes

We will construct three Naive Bayes classifiers

1. One for instances with only 1-hot encoded (binary) categorical attributes.
2. One for instances with only numerical attributes.
3. One for instances with the full set of numerical *and* categorical attributes, ensuring that your classifier computes posterior class probabilities as $p(y|x) \propto p(x|y)p(y)$.

For each classifier, you will 
1. Train it on the training set
2. Use the models to predict labels for the test set


### 4a Implementing the Naive Bayes classifiers

Implement three functions which train a NB classifier given the specified input feature types and predict labels for a given test set.

You may add additional quantities to your `return` statements.

<b> You may (and are, indeed, encouraged to) use the existing NB implementations from `sklearn`. You should use the default parameterizations of these algorithms.</b> 
    
If you choose to implement your classifiers from scratch, please use Laplace smoothing (alpha=1) for the categorical feature NB.
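When the full classifier is assembled from two separately trained models, one subtlety is that each model's posterior already contains one factor $p(y)$, so multiplying the two posteriors counts the prior twice. The sketch below shows one way to correct for this in log space; the function name `combine_posteriors` and all numbers are hypothetical, and it assumes strictly positive probabilities and the same class order in every list.

```python
import math

def combine_posteriors(post_num, post_cat, prior):
    """Combine two per-class posteriors that were computed with the same class prior.

    p(y|x) is proportional to p(y|x_num) * p(y|x_cat) / p(y), because each
    input posterior already includes one factor p(y).
    """
    log_scores = [
        math.log(pn) + math.log(pc) - math.log(py)
        for pn, pc, py in zip(post_num, post_cat, prior)
    ]
    # normalise with the log-sum-exp trick for numerical stability
    top = max(log_scores)
    total = sum(math.exp(score - top) for score in log_scores)
    return [math.exp(score - top) / total for score in log_scores]

# Hypothetical numbers; class order is ('<=50K', '>50K')
toy_prior = [0.75, 0.25]
toy_post_num = [0.60, 0.40]
toy_post_cat = [0.70, 0.30]
toy_posterior = combine_posteriors(toy_post_num, toy_post_cat, toy_prior)
print(toy_posterior)
```

With `sklearn` models, the per-class posteriors would typically come from `predict_proba`; how the fitted class priors are exposed differs between estimators, so the attribute names are worth checking in the documentation.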


In [41]:

##################################
## your code begins here
##################################
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB

def nb_binary_features(train_features, train_labels, test_features):  #1-hot
    predictions = []

    brn = BernoulliNB()
    brn.fit(train_features, train_labels)
    
    predictions = brn.predict(test_features).tolist()
    return predictions


def nb_continuous_features(train_features, train_labels, test_features): #numeric
    predictions = []
    
    gn = GaussianNB()
    gn.fit(train_features, train_labels)
    
    predictions = gn.predict(test_features).tolist()
    
    return predictions


def nb_full(train_features, train_labels, test_features):   #full 
    predictions = []
    
    #split arrays into two parts: num & one-hot part
    lenth = len(Numeric_columns)
    indices_num = [i for i in range(lenth)]
    indices_1hot = [i for i in range(lenth, len(train_features[0]))]
    #print(indices_1hot)
    
    num_x_train=[]
    num_x_test = []
    
    for item in train_features:
        num_x_train.append(item[indices_num])
        
    for item in test_features:
        num_x_test.append(item[indices_num])

    #print(num_x_test[:10])
    hot_x_train=[]
    hot_x_test = []
    
    for item in train_features:
        hot_x_train.append(item[indices_1hot])
        
    for item in test_features:
        hot_x_test.append(item[indices_1hot])
    
    brn = BernoulliNB()
    brn.fit(hot_x_train, train_labels)
    brn_predict = brn.predict(hot_x_test)
    proba_brn = brn.predict_proba(hot_x_test)
    
    gn = GaussianNB()
    gn.fit(num_x_train, train_labels)
    gn_predict = gn.predict(num_x_test)
    proba_gn = gn.predict_proba(num_x_test)

    proba_brn_1 = []
    proba_brn_2 = []
    
    proba_gn_1 = []
    proba_gn_2 = []
    
    for item in proba_brn:
        proba_brn_1.append(item[0])
        proba_brn_2.append(item[1])
        
    for item in proba_gn:
        proba_gn_1.append(item[0])
        proba_gn_2.append(item[1])
        
    aggr_proba_1 = [a * b for a, b in zip(proba_brn_1, proba_gn_1)]
    aggr_proba_2 = [a * b for a, b in zip(proba_brn_2, proba_gn_2)]
    
    for i in range(len(aggr_proba_1)):
        if aggr_proba_1[i] > aggr_proba_2[i]:
            predictions.append('<=50K')
        else:
            predictions.append('>50K')
    #print(len(predictions))
    return predictions


############################
## your code ends here
############################
#x_train_full_1 = np.array(x_train_full)
#x_train_full_1 = x_train_full_1.astype('int')
#x_test_full_1 = np.array(x_test_full)
#x_test_full_1 = x_test_full_1.astype('int')
#nb_full(x_train_full_1, y_train, x_test_full_1)

### 4b Apply your classifiers to the data sets you created in Question 1b

...namely, the

1. 1-hot categorical features `x_{train,test}_1hot`
2. numerical features `x_{train,test}_num`
3. combined numerical and 1-hot categorical features `x_{train,test}_full`


In [42]:
from collections import Counter

############################
## your code begins here
############################

x_train_num_1 = np.array(x_train_num)
x_train_num_1 = x_train_num_1.astype('int')
x_test_num_1 = np.array(x_test_num)
x_test_num_1 = x_test_num_1.astype('int')
#print(type(x_train_num_1))

x_train_full_1 = np.array(x_train_full)
x_train_full_1 = x_train_full_1.astype('int')
x_test_full_1 = np.array(x_test_full)
x_test_full_1 = x_test_full_1.astype('int')

categorical_nb_predictions = nb_binary_features(a_train, y_train, a_test)
numeric_nb_predictions = nb_continuous_features(x_train_num_1, y_train, x_test_num_1)
full_nb_predictions = nb_full(x_train_full_1, y_train, x_test_full_1)

#print(a_test)
############################
## your code ends here
############################

print(f"Numerical NB predicted class distribution \t {Counter(numeric_nb_predictions)}")
print(f"Categorical NB predicted class distribution\t {Counter(categorical_nb_predictions)}")
print(f"Full NB predicted class distribution\t {Counter(full_nb_predictions)}")


Numerical NB predicted class distribution 	 Counter({'<=50K': 14440, '>50K': 1822})
Categorical NB predicted class distribution	 Counter({'<=50K': 10046, '>50K': 6216})
Full NB predicted class distribution	 Counter({'<=50K': 13393, '>50K': 2869})


## Question 5: Logistic Regression

Apply a Logistic Regression classifier to the full training data set (`x_{train,test}_full`)

<b> Use the existing implementation in sklearn with default parameters. </b>

### 5a The Logistic Regression classifier

In [43]:
from sklearn.linear_model import LogisticRegression

lr_predictions = []

############################
## your code begins here
############################
lreg = LogisticRegression()

lreg.fit(x_train_full_1, y_train)
lr_predictions = lreg.predict(x_test_full_1)

#Q5b
#coefficients (weights) learnt by the classifier
coefficients = lreg.coef_.ravel()
#compare them to the PMI values in Question 3
#find features with the largest coefficients
coe_cat_dict={}
coe_num_dict={} #feature name as keys!##
coe_num = coefficients[:len(Numeric_columns)]
coe_cat = coefficients[len(Numeric_columns):]

for i in range(len(coe_cat)):
    coe_cat_dict[i] = coe_cat[i]
    
for i in range(len(coe_num)):
    coe_num_dict[str(Numeric_columns[i])] = coe_num[i]
    
#get aggregated coe_cat using previous helper function
coe_cat_aggre = pmi_aggr(coe_cat_dict)
all_features_coe_dict = dict(coe_num_dict, **coe_cat_aggre)

#find two features with largest coefficients
sorted_coe = sorted(all_features_coe_dict, key=all_features_coe_dict.get, reverse=True)
f1_largest_coe = sorted_coe[0]
f2_2nd_largest_coe = sorted_coe[1]


############################
## your code ends here
############################

print(f"LR predicted class distribution\t {Counter(lr_predictions)}")
print("\n")
#answers for Q5b part1
print(f"pmi values in Q3: {pmi_aggregate}")
print("\n")
print(f"coefficient values in Q5a: {all_features_coe_dict}")
print("\n")
#answers for Q5b part2
print(f"The feature with the largest coefficients is: {f1_largest_coe}")
print(f"The feature with the second largest coefficients is {f2_2nd_largest_coe}")

LR predicted class distribution	 Counter({'<=50K': 14837, '>50K': 1425})


pmi values in Q3: {'workclass': 1.2450527236150353, 'education': -10.54470295370337, 'marital-status': -6.814652064391762, 'occupation': -12.325804989323599, 'relationship': -8.13279966987268, 'race': -3.172058817071518, 'sex': -0.7928749299032869, 'native-country': -20.394740371340824}


coefficient values in Q5a: {'age': -0.007213875579054285, 'fnlwgt': -3.7328651002938924e-06, 'education-num': -0.0016782596410539507, 'capital-gain': 0.00033945995638995797, 'capital-loss': 0.0007809628903499234, 'hours-per-week': -0.007925726398248574, 'workclass': -0.00035576358532378476, 'education': -0.00035576358532391134, 'marital-status': -0.00035576358532371294, 'occupation': -0.00035576358532383973, 'relationship': -0.00035576358532409284, 'race': -0.00035576358532390614, 'sex': -0.0003557635853237865, 'native-country': -0.0003557635853238577}


The feature with the largest coefficients is: capital-loss
The feature wit

## Question 6: Evaluation

We will evaluate our baselines and classifiers on the instances in the test set. 

Compute 
- accuracy
- macro-averaged F1 score 

for the two baselines, the three NB classifiers and the LR classifier.

**You may use existing implementations and/or Python libraries like numpy, scipy or sklearn.**
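To make the difference between the two metrics concrete, the sketch below computes accuracy and macro-averaged F1 from scratch on a toy example in which every instance is assigned the majority class. The function names and toy labels are assumptions for illustration; a library implementation may be used instead.

```python
def accuracy_simple(gold, predicted):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def macro_f1(gold, predicted):
    """Average the per-class F1 scores, giving every class equal weight."""
    scores = []
    for cls in sorted(set(gold) | set(predicted)):
        tp = sum(g == cls and p == cls for g, p in zip(gold, predicted))
        fp = sum(g != cls and p == cls for g, p in zip(gold, predicted))
        fn = sum(g == cls and p != cls for g, p in zip(gold, predicted))
        denominator = 2 * tp + fp + fn
        # a class that is never predicted correctly contributes an F1 of 0
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)

toy_gold = [">50K", "<=50K", "<=50K", "<=50K"]
toy_pred = ["<=50K", "<=50K", "<=50K", "<=50K"]
toy_acc = accuracy_simple(toy_gold, toy_pred)
toy_f1 = macro_f1(toy_gold, toy_pred)
print(round(toy_acc, 2), round(toy_f1, 2))
```

In this toy case accuracy is 0.75 while macro F1 is only about 0.43, because the minority class receives an F1 of 0 and counts as much as the majority class in the average.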

In [44]:
zero_r_acc = 0
zero_r_f1 = 0

one_r_acc = 0
one_r_f1 = 0

num_nb_acc = 0
num_nb_f1 = 0

cat_nb_acc = 0
cat_nb_f1 = 0

full_nb_acc = 0
full_nb_f1 = 0

lr_acc = 0
lr_f1 = 0

############################
## your code begins here
############################
from sklearn.metrics import f1_score

def accuracy(predict, test_label):
    #'>50K' is treated as the positive class
    TP = 0
    FN = 0
    FP = 0
    TN = 0
    
    for i in range(len(test_label)):
        if (test_label[i] == '>50K') and (predict[i] == '>50K'):
            TP += 1
        elif (test_label[i] == '<=50K') and (predict[i] == '>50K'):
            FP += 1   #predicted positive, actually negative
        elif (test_label[i] == '<=50K') and (predict[i] == '<=50K'):
            TN += 1
        elif (test_label[i] == '>50K') and (predict[i] == '<=50K'):
            FN += 1   #predicted negative, actually positive
        
    return (TP + TN) / (TP + FN + FP + TN)

zero_r_acc = accuracy(zero_r_predictions, y_test)
zero_r_f1 = f1_score(y_test, zero_r_predictions, average='macro')

one_r_acc = accuracy(one_r_predictions, y_test)
one_r_f1 = f1_score(y_test, one_r_predictions, average='macro')

num_nb_acc = accuracy(numeric_nb_predictions, y_test)
num_nb_f1 = f1_score(y_test, numeric_nb_predictions, average='macro')

cat_nb_acc = accuracy(categorical_nb_predictions, y_test)
cat_nb_f1 = f1_score(y_test, categorical_nb_predictions, average='macro')

full_nb_acc = accuracy(full_nb_predictions, y_test)
full_nb_f1 = f1_score(y_test, full_nb_predictions, average='macro')

lr_acc = accuracy(lr_predictions, y_test)
lr_f1 = f1_score(y_test, lr_predictions, average='macro')

############################
## your code ends here
############################

print(f"Zero R\t\tAccuracy: {round(zero_r_acc, 2)}\tMacro F1: {round(zero_r_f1, 2)}")
print(f"One  R\t\tAccuracy: {round(one_r_acc, 2)}\tMacro F1: {round(one_r_f1, 2)}")
print(f"NB Num \t\tAccuracy: {round(num_nb_acc, 2)}\tMacro F1: {round(num_nb_f1, 2)}")
print(f"NB Cat \t\tAccuracy: {round(cat_nb_acc, 2)}\tMacro F1: {round(cat_nb_f1, 2)}")
print(f"NB Full \tAccuracy: {round(full_nb_acc, 2)}\tMacro F1: {round(full_nb_f1, 2)}")
print(f"LR \t\tAccuracy: {round(lr_acc, 2)}\tMacro F1: {round(lr_f1, 2)}")

Zero R		Accuracy: 0.76	Macro F1: 0.43
One  R		Accuracy: 0.77	Macro F1: 0.5
NB Num 		Accuracy: 0.8	Macro F1: 0.64
NB Cat 		Accuracy: 0.76	Macro F1: 0.72
NB Full 	Accuracy: 0.83	Macro F1: 0.75
LR 		Accuracy: 0.8	Macro F1: 0.63
