# Assignment 2: Classification and Evaluation (20 marks)

Student Name: `Ruijie Hu`

Student ID: `1371896`

## General info

<b>Due date</b>: Monday, 1 September 2023 5pm

<b>Submission method</b>: Canvas submission

<b>Submission materials</b>: completed copy of this iPython notebook

<b>Late submissions</b>: -10% per day up to 5 days (both weekdays and weekends count)
<ul>
    <li>one day late, -2.0;</li>
    <li>two days late, -4.0;</li>
    <li>three days late, -6.0;</li>
    <li>four days late, -8.0;</li>
    <li>five days late, -10.0;</li>
</ul>

<b>Marks</b>:  This assignment will be marked out of 20, and make up 20% of your overall mark for this subject.

<b>Materials</b>: See [Using Jupyter Notebook and Python page] on Canvas (under Modules> Coding Resources) for information on the basic setup required for this class, including an iPython notebook viewer and the python packages `numpy`, `pandas`, `matplotlib` and `sklearn`. You can use any Python built-in packages, but do not use any other 3rd party packages; if your iPython notebook doesn't run on the marker's machine, you will lose marks. <b> You should use Python 3</b>.  

<b>Evaluation</b>: Your iPython notebook should run end-to-end without any errors in a reasonable amount of time, and you must follow all instructions provided below, including specific implementation requirements and instructions for what needs to be printed (please avoid printing output we don't ask for). You should edit the sections below where requested, but leave the rest of the code as is. You should leave the output from running your code in the iPython notebook you submit, to assist with marking. The amount each section is worth is given in parentheses after the instructions.


<b>Updates</b>: Any major changes to the assignment will be announced via Canvas. Minor changes and clarifications will be announced on Canvas>Assignments>Assignment2; we recommend you check it regularly.

<b>Academic misconduct</b>: This assignment is an individual task, and so reuse of code or other instances of clear influence will be considered cheating. Please check the <a href="https://canvas.lms.unimelb.edu.au/courses/151131/modules#module_825112">CIS Academic Honesty training</a> for more information. We will be checking submissions for originality and will invoke the University’s <a href="http://academichonesty.unimelb.edu.au/policy.html">Academic Misconduct policy</a> where inappropriate levels of collusion or plagiarism are deemed to have taken place. Content produced by an AI (including, but not limited to ChatGPT) is not your own work, and submitting such content will be treated as a case of academic misconduct, in line with the <a href="https://academicintegrity.unimelb.edu.au/plagiarism-and-collusion/artificial-intelligence-tools-and-technologies"> University's policy</a>.

**IMPORTANT**

Please carefully read and fill out the <b>Authorship Declaration</b> form at the bottom of the page. Failure to fill out this form results in the following deductions:
<UL TYPE="square">
<LI>Missing Authorship Declaration at the bottom of the page, -2.0</LI>
<LI>Incomplete or unsigned Authorship Declaration at the bottom of the page, -1.0</LI>
</UL>


## Overview:
For this assignment, you will apply a number of classifiers to various datasets, explore various evaluation paradigms, and analyze the impact of multiple parameters on the performance of the classifiers. You will then answer a number of conceptual questions about the Naive Bayes classifier, K-nearest neighbors, and a number of baselines based on your observations.

## Data Sets:
In this assignment, you will work with two datasets. Both are adapted from public datasets in the UCI archive:

 - **Adult**: You predict whether an adult earns less than 50K or at least 50K US dollars per year, based on various personal attributes such as age or education level. More information can be found <a href="https://archive.ics.uci.edu/dataset/2/adult">here</a>.
 - **Student**: You predict a student’s final grade {A+, A, B, C, D, F} based on a number of personal and performance-related attributes, such as school, parent’s education level, number of absences, etc. More information can be found <a href="https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success">here</a>.

More information about these datasets can be found in the `readme.txt` file.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import copy
import math
from collections import defaultdict, Counter

In [2]:
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import GaussianNB, BernoulliNB, CategoricalNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))



The scikit-learn version is 1.2.2.


In [3]:
import warnings

# ignore future warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Question 1. Reading and Pre-processing [1.5 marks]

**A)** First, you will read in the data using the `fileName` parameter into a pandas DataFrame. You will also need to input the list of numerical feature names `num_feat` to the function to make your pre-processing easier.

**B)** Second, you will replace missing values, denoted by `?`, using the following two strategies:

   * <b>Continuous features</b>: Replace each missing value with the <b>average feature value</b> in the dataset
   * <b>Categorical features</b>: Replace each missing value with the <b>most frequent value</b> of that feature in the dataset  
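
A minimal sketch of the imputation in step B, assuming a toy DataFrame. The column names `age` and `workclass` and the helper name `impute_missing` are illustrative only and do not come from the assignment files.

```python
import numpy as np
import pandas as pd

def impute_missing(df, num_feat):
    """Replace '?' with the column mean (numeric) or the column mode (categorical)."""
    df = df.replace("?", np.nan)  # returns a new DataFrame; the input is not modified
    for col in df.columns:
        if col in num_feat:
            # Values arrive as strings when the column contained '?'
            df[col] = pd.to_numeric(df[col])
            df[col] = df[col].fillna(df[col].mean())
        else:
            df[col] = df[col].fillna(df[col].mode()[0])
    return df

# Hypothetical toy data standing in for the real CSV contents
toy = pd.DataFrame({
    "age": ["20", "?", "40"],
    "workclass": ["Private", "Private", "?"],
})
clean = impute_missing(toy, num_feat=["age"])
print(clean)
```

Note that `mean()` and `mode()` skip missing values by default, so the `?` entries do not influence the statistics used to fill them.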


**C)** Third, you will use one-hot encoding to convert all nominal (and ordinal) attributes to numeric. You can achieve this by either using `get_dummies()` from the pandas library or `OneHotEncoder()` from the scikit-learn library. The resulting dataset includes all originally numeric features as well as the one-hot encoded features that are now numeric, call this data `num_dataset`.
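
As an illustration of step C, the sketch below one-hot encodes a toy DataFrame with `get_dummies()`. The columns `age` and `education` are assumed for the example; in the real data the class label would typically be kept out of the encoding.

```python
import pandas as pd

# Hypothetical toy data: one numeric and one nominal feature
toy = pd.DataFrame({
    "age": [25, 38, 52],
    "education": ["Bachelors", "HS-grad", "Bachelors"],
})
num_feat = ["age"]

# Encode every column that is not listed as numeric
cat_cols = [c for c in toy.columns if c not in num_feat]
num_dataset = pd.get_dummies(toy, columns=cat_cols, dtype=int)
print(num_dataset)
```

Each nominal column is replaced by one indicator column per distinct value, named `<column>_<value>`, while the numeric columns pass through unchanged.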

**D)** Fourth, you will use **equal-width** binning (4 bins) to convert numerical features into categorical ones. You can achieve this by using `cut()` from the pandas library. The resulting dataset includes all originally categorical features as well as the discretized features that are now categorical; call this data `cat_dataset`.
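
A small sketch of equal-width binning with `cut()`, assuming a toy `age` series that is not taken from the assignment data. With values ranging from 20 to 60, four equal-width bins each span 10 units.

```python
import pandas as pd

# Hypothetical numeric feature
ages = pd.Series([20, 30, 45, 60], name="age")

# bins=4 asks pandas for four intervals of equal width over [min, max];
# labels=False returns the integer bin index instead of Interval objects
age_binned = pd.cut(ages, bins=4, labels=False)
print(age_binned.tolist())
```

Intervals are closed on the right by default, so a value that sits exactly on an edge (such as 30 here) falls into the lower bin.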


In [4]:
import pandas as pd

def preprocess(fileName1, fileName2):
    ## read the csv files
    data1 = pd.read_csv(fileName1)
    data2 = pd.read_csv(fileName2)

    # Drop the first column of data2
    data2 = data2.drop(data2.columns[0], axis=1)

    # Check if the number of rows in both datasets are the same
    if len(data1) != len(data2):
        raise ValueError("The number of rows in the two datasets do not match!")

    # Concatenate data2 (word embeddings) with 'rating' and 'dr-id-adjusted' columns from data1
    merged_data = pd.concat([data1[['dr-id-adjusted', 'rating']], data2], axis=1)

    # Splitting the dataset into features and target
    # Assuming all other columns except 'rating' in merged_data are features
    features = merged_data.drop(columns=['rating'])
    target = merged_data['rating']

    return merged_data, features, target


In [6]:
## read the data
dataset_train, features_train,target_train = preprocess("D:/unimelb-3rd/IML/ASS3/dataset/TRAIN.csv","D:/unimelb-3rd/IML/ASS3/dataset/384EMBEDDINGS_TRAIN.csv")
dataset_val, features_val, target_val = preprocess("D:/unimelb-3rd/IML/ASS3/dataset/VALIDATION.csv","D:/unimelb-3rd/IML/ASS3/dataset/384EMBEDDINGS_VALIDATION.csv")
print(len(dataset_train))
print(len(dataset_val))
dataset_val

43003
5500


Unnamed: 0,dr-id-adjusted,rating,0,1,2,3,4,5,6,7,...,374,375,376,377,378,379,380,381,382,383
0,33620,1,-0.022207,0.064185,0.023957,0.021580,-0.063474,-0.060289,0.057354,0.066238,...,-0.033674,-0.055816,0.013047,-0.035806,-0.016576,0.040326,0.005111,-0.029757,-0.063486,0.000462
1,33620,-1,-0.047829,0.039449,0.025721,0.024461,0.013234,-0.007365,-0.025881,-0.007678,...,0.088275,-0.104800,-0.039734,-0.038932,-0.067038,-0.025953,-0.077584,0.018969,-0.091612,-0.016109
2,33626,1,-0.015018,-0.004742,-0.015077,0.026958,-0.061960,0.000557,0.038323,0.099361,...,0.025605,0.056154,0.024323,-0.017221,-0.075064,0.007564,-0.072174,-0.020699,-0.057820,0.039419
3,33626,1,-0.014154,-0.015275,0.032033,0.045189,-0.076433,-0.001758,-0.019410,0.068679,...,-0.009228,0.029588,-0.022354,-0.013990,-0.082329,0.040468,-0.015963,-0.068362,-0.050644,0.096260
4,33628,1,-0.069949,-0.012617,0.035879,-0.041826,-0.076924,-0.057929,0.026214,-0.029924,...,-0.021068,-0.038854,0.042229,-0.022856,0.026084,0.134875,0.022401,-0.051483,-0.014984,0.006698
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5495,38063,1,0.006699,0.033123,0.003509,0.063899,0.020196,-0.076222,-0.052318,0.042699,...,0.000500,0.067840,0.030330,-0.008184,-0.027531,-0.051565,-0.057745,-0.026203,-0.160159,-0.000017
5496,38064,1,-0.027958,-0.021277,0.022908,0.005755,-0.139231,-0.047026,-0.070944,0.033015,...,0.024188,0.012772,0.103551,-0.042511,-0.064135,0.052720,-0.005511,0.015123,-0.008484,0.012005
5497,38065,1,-0.052219,-0.011069,0.007917,0.008442,-0.076079,-0.029523,-0.030472,0.104013,...,0.018229,0.026702,-0.071201,-0.036976,-0.092275,0.027811,0.011771,0.012291,0.025588,0.010331
5498,38065,1,-0.000394,-0.083509,0.002389,0.039134,-0.119880,-0.060715,-0.020452,0.030014,...,0.017917,0.024247,-0.003603,-0.046223,-0.025451,0.105774,-0.067439,-0.058233,0.051917,0.006985


## Question 2. Baseline methods and Discussion [4.5 marks]
**A)** For 10 rounds, use `train_test_split` to divide the processed `cat_dataset` into 80% train and 20% test. Set `random_state` equal to the loop counter. For example, in the loop
``` python
for i in range(10):
```
make `random_state` equal to `i`.
Use the split datasets to train and test the following models: **[1 mark]**

- Zero-R
- One-R
- Weighted Random

Report the average accuracy over the 10 runs.
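
The evaluation loop described above could be sketched as follows. The toy `X` and `y` stand in for `cat_dataset` and are assumptions, as is passing `random_state=i` to the weighted random baseline for reproducibility. One-R is omitted because scikit-learn does not ship a One-R estimator.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the processed dataset (70% "pos", 30% "neg")
X = pd.DataFrame({"f": [0, 1] * 50})
y = pd.Series(["pos"] * 70 + ["neg"] * 30)

zero_r_acc, w_rand_acc = [], []
for i in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=i)

    # Zero-R: always predict the most frequent training class
    zero_r = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
    zero_r_acc.append(accuracy_score(y_te, zero_r.predict(X_te)))

    # Weighted Random: sample predictions from the training class distribution
    w_rand = DummyClassifier(strategy="stratified", random_state=i).fit(X_tr, y_tr)
    w_rand_acc.append(accuracy_score(y_te, w_rand.predict(X_te)))

print("Zero-R mean accuracy:", round(float(np.mean(zero_r_acc)), 2))
print("Weighted Random mean accuracy:", round(float(np.mean(w_rand_acc)), 2))
```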


In [7]:
## You can define your helper functions for One-R or other baselines in this block
## for One-R at training time, you can break the ties randomly
## for One-R at prediction time, if the test contains an unseen feature value, return the majority class
# One-R classifier
import random
def one_r_classifier(train_x, train_y):
    best_error = float('inf')
    best_rules = {}
    best_attribute = None
    majority_class = train_y.mode()[0]

    for attribute in train_x.columns:
        rules = {}
        error_count = 0

        for value in pd.unique(train_x[attribute]):
            # Get the most common class for this attribute value
            possible_classes = train_y[train_x[attribute] == value].value_counts()

            # If several classes tie for the highest count, pick one of them at random
            tied_classes = possible_classes[possible_classes == possible_classes.max()].index
            chosen_class = random.choice(list(tied_classes))

            rules[value] = chosen_class
            error_count += sum(train_y[train_x[attribute] == value] != chosen_class)

        # If this attribute has a lower error than the best so far, update best_rules and best_attribute
        if error_count < best_error:
            best_error = error_count
            best_rules = rules
            best_attribute = attribute

    # Return the lambda function for prediction and the best attribute
    return lambda x: best_rules.get(x[best_attribute], majority_class), best_attribute
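
For comparison, here is a compact, vectorised sketch of the same One-R idea together with a usage example. It is a simplified variant rather than the cell above: ties are broken deterministically by `mode()` (first class in sorted order) instead of randomly, and the names `one_r_fit` and `one_r_predict` as well as the toy weather data are hypothetical.

```python
import pandas as pd

def one_r_fit(train_x, train_y):
    """Return (best attribute, value -> class rules, majority class)."""
    majority = train_y.mode()[0]
    best = None
    for col in train_x.columns:
        # Most frequent class for each value of this attribute
        rules = train_y.groupby(train_x[col]).agg(lambda s: s.mode()[0])
        errors = (train_x[col].map(rules) != train_y).sum()
        if best is None or errors < best[0]:
            best = (errors, col, rules.to_dict())
    _, col, rules = best
    return col, rules, majority

def one_r_predict(model, x):
    col, rules, majority = model
    # Unseen attribute values map to NaN and fall back to the majority class
    return x[col].map(rules).fillna(majority)

train_x = pd.DataFrame({
    "outlook": ["sunny", "sunny", "rain", "rain"],
    "windy": ["y", "n", "y", "n"],
})
train_y = pd.Series(["no", "no", "yes", "yes"])
test_x = pd.DataFrame({"outlook": ["rain", "overcast"], "windy": ["y", "y"]})

model = one_r_fit(train_x, train_y)
preds = one_r_predict(model, test_x)
print(model[0], preds.tolist())
```

Here `outlook` separates the toy classes perfectly, so it is selected, and the unseen value `overcast` receives the majority class.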


In [8]:
def baselines(dataset_train,dataset_val):

    ZeroR_Acc_1 = []
    WRand_Acc_1 = []

    ## your code here
    report = []
    train_x, train_y = dataset_train.drop(['rating'],axis = 1), dataset_train['rating']
    val_x, val_y = dataset_val.drop(['rating'], axis=1), dataset_val['rating']

    # Train and test Zero-R (always predicts the most frequent training class)
    zero_r = DummyClassifier(strategy='most_frequent')
    zero_r.fit(train_x, train_y)
    zero_r_predictions = zero_r.predict(val_x)
    zero_r_accuracy = accuracy_score(val_y, zero_r_predictions)
    report.append(classification_report(val_y, zero_r_predictions, zero_division=0))
    ZeroR_Acc_1.append(zero_r_accuracy)

    # Train and test Weighted Random (samples predictions from the training class distribution)
    weighted_random = DummyClassifier(strategy='stratified')
    weighted_random.fit(train_x, train_y)
    weighted_random_predictions = weighted_random.predict(val_x)
    weighted_random_accuracy = accuracy_score(val_y, weighted_random_predictions)
    report.append(classification_report(val_y, weighted_random_predictions, zero_division=0))
    WRand_Acc_1.append(weighted_random_accuracy)

    print("Accuracy of ZeroR:", np.mean(ZeroR_Acc_1).round(2))
    print("Accuracy of Weighted Random:", np.mean(WRand_Acc_1).round(2))
    # Uncomment to inspect the per-class reports (index 0: Zero-R, index 1: Weighted Random)
    #print(report[0])
    #print(report[1])


## Results on the training and validation sets loaded above:
baselines(dataset_train,dataset_val)


Accuracy of ZeroR: 0.73
Accuracy of Weighted Random: 0.6
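
As a rough sanity check on the two numbers above: a weighted random classifier that predicts class c with probability equal to its prior p_c is correct with probability equal to the sum of p_c squared over all classes. The sketch below assumes only two classes and that the validation class proportions match the training priors; neither assumption is confirmed by the output above, and `expected_weighted_random_accuracy` is a hypothetical helper.

```python
def expected_weighted_random_accuracy(priors):
    """Expected accuracy when predictions are sampled from the class priors."""
    return sum(p * p for p in priors)

# Assumed priors: majority class at 0.73 (the Zero-R accuracy), remainder in one other class
estimate = expected_weighted_random_accuracy([0.73, 0.27])
print(round(estimate, 4))
```

Under these assumptions the estimate is close to the reported weighted random accuracy, with the remaining gap plausibly due to sampling noise in a single unseeded run.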
