# The University of Melbourne, School of Computing and Information Systems
# COMP90049 Introduction to Machine Learning, 2020 Semester 1
-----
## Project 1: Understanding Student Success with Naive Bayes
-----
###### Student Name(s): Zongcheng Du
###### Python version: Python 3.7
###### Submission deadline: 11am, Wed 22 Apr 2019

This iPython notebook is a template which you will use for your Project 1 submission. 

Marking will be applied on the five functions that are defined in this notebook, and to your responses to the questions at the end of this notebook.

You may change the prototypes of these functions, and you may write other functions, according to your requirements. We would appreciate it if the required functions were prominent/easy to find. 

In [2]:
# This function should open a data file in csv, and transform it into a usable format

# Import necessary library -- pandas
import pandas

def load_data(csvFileName):
    
    # Read data and return data
    data = pandas.read_csv(csvFileName)
    return data

In [3]:
# This function should split a data set into a training set and hold-out test set

# Use train_test_split to randomly split the data into training and test sets
from sklearn.model_selection import train_test_split

def split_data(data):
    
    # All columns except the last are features; the last column is the label
    features = data.iloc[:, : -1]
    label = data.iloc[:, -1]
    
    # Return x_train, x_test, y_train, y_test
    return train_test_split(features, label, test_size = 0.25) # 75% is training data, and 25% is test data.

In [4]:
# This function should build a supervised NB model

def train(x_train, y_train):
    
    """ 
    probInTrain should be {'label 1': {'countNum': number of label 1 instances
                                            'prob': P(y = label1)
                                            'probFeatures': {'feature1' : P(x = feature1| y = label1)
                                                             'feature2' : P(x = feature2| y = label1)
                                                             ...
                                                             }
                                      }
                           'label 2': ...
                           }
    """
    probInTrain = {}
    
    for yClass, yCount in y_train.value_counts().items():
        probInTrain[yClass] = {'countNum': yCount, 'prob': yCount / len(y_train), 'probFeatures':{}}
    
    # Use concat to prepare data for groupby
    trainData = pandas.concat([x_train, y_train], axis = 1) # 1 means concat by column.
    
    # Collect the distinct values of each feature; their count is used in Laplace smoothing
    numOfFeatures = {}
    for oneFeature in x_train.columns:
        numOfFeatures[oneFeature] = x_train[oneFeature].value_counts().index
    
    # Group by data with different Y values
    for yClass, group in trainData.groupby(y_train.values):
        # Select one specified feature in column in x_train
        for oneFeature in x_train.columns:
            probForOneFeature = {}
            featureSummary = group[oneFeature].value_counts()
            for featureName in numOfFeatures[oneFeature]:
                if not featureSummary.get(featureName):
                    featureSummary[featureName] = 0
                
            # Get value and probability for the specified feature
            for featureValue, featureCount in featureSummary.items():
                """
                # without smoothing
                featureProb = featureCount / group[oneFeature].size
                probForOneFeature[featureValue] = featureProb
                """
                
                # Use Laplace smoothing to calculate P(x = featureValue| y = yClass)
                featureProb = (featureCount + 1) / (group[oneFeature].size + len(numOfFeatures[oneFeature]))
                probForOneFeature[featureValue] = featureProb
                
            # Add dict of probForOneFeature into 'probFeatures' in outside dict
            probInTrain[yClass]['probFeatures'][oneFeature] = probForOneFeature
            
    # Return probability of all train data
    return probInTrain
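
As a quick check of the Laplace smoothing used in train(), the sketch below applies the same add-one formula, (count + 1) / (class size + number of distinct values), to a hand-made column. The helper name `laplace_probs` and the toy values are illustrative assumptions, not part of the project data.

```python
from collections import Counter

def laplace_probs(values_in_class, all_values):
    """Add-one smoothed estimate of P(x = v | y) for every v in all_values."""
    counts = Counter(values_in_class)
    # Denominator: instances in the class plus the number of distinct values
    denom = len(values_in_class) + len(all_values)
    return {v: (counts[v] + 1) / denom for v in all_values}

# Hypothetical feature with three possible values; 'high' never occurs in this class
all_values = ["low", "medium", "high"]
values_in_class = ["low", "low", "medium", "low"]

probs = laplace_probs(values_in_class, all_values)
print(probs)
```

The unseen value still receives a small non-zero probability, and the smoothed probabilities still sum to 1, so a single missing combination cannot zero out the whole product in predict().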

In [5]:
# This function should predict the class for an instance or a set of instances, based on a trained model 
def predict(probInTrain, x_test):
    y_predict = []
    for i in range(len(x_test)):
        instance = x_test.iloc[i]
        maxRate = 0
        classSelect = None
        for yClass, yInfo in probInTrain.items():
            # Start from the prior P(y) and multiply in each P(x | y)
            rate = yInfo['prob']
            for oneFeature, probForOneFeature in yInfo['probFeatures'].items():
                # A value never seen in training has no estimate; skip it (factor 1)
                rate *= probForOneFeature.get(instance[oneFeature], 1)
            # Keep the class with the highest score so far
            if classSelect is None or rate > maxRate:
                maxRate = rate
                classSelect = yClass
        y_predict.append(classSelect)
    return y_predict

In [6]:
# This function should evaluate a set of predictions in terms of accuracy
def evaluate(y_test, y_predict):
    correct = 0
    for i in range(0, len(y_test)):
        if y_test.values[i] == y_predict[i]:
            correct += 1
    return correct / len(y_test)

## Questions (you may respond in a cell or cells below):

You should respond to Question 1 and two additional questions of your choice. A response to a question should take about 100–250 words, and make reference to the data wherever possible.

### Question 1: Naive Bayes Concepts and Implementation

- a Explain the ‘naive’ assumption underlying Naive Bayes. (1) Why is it necessary? (2) Why can it be problematic? Link your discussion to the features of the students data set. [no programming required]
- b Implement the required functions to load the student dataset, and estimate a Naive Bayes model. Evaluate the resulting classifier using the hold-out strategy, and measure its performance using accuracy.
- c What accuracy does your classifier achieve? Manually inspect a few instances for which your classifier made correct predictions, and some for which it predicted incorrectly, and discuss any patterns you can find.

### Question 2: A Closer Look at Evaluation

- a You learnt in the lectures that precision, recall and f-1 measure can provide a more holistic and realistic picture of the classifier performance. (i) Explain the intuition behind accuracy, precision, recall, and F1-measure, (ii) contrast their utility, and (iii) discuss the difference between micro and macro averaging in the context of the data set. [no programming required]
- b Compute precision, recall and f-1 measure of your model’s predictions on the test data set (1) separately for each class, and (2) as a single number using macro-averaging. Compare the results against your accuracy scores from Question 1. In the context of the student dataset, and your response to question 2a analyze the additional knowledge you gained about your classifier performance.

### Question 3: Training Strategies 

There are other evaluation strategies, which tend to be preferred over the hold-out strategy you implemented in Question 1.
- a Select one such strategy, (i) describe how it works, and (ii) explain why it is preferable over hold-out evaluation. [no programming required]
- b Implement your chosen strategy from Question 3a, and report the accuracy score(s) of your classifier under this strategy. Compare your outcomes against your accuracy score in Question 1, and explain your observations in the context of your response to question 3a.

### Question 4: Model Comparison

In order to understand whether a machine learning model is performing satisfactorily we typically compare its performance against alternative models. 
- a Choose one (simple) comparison model, explain (i) the workings of your chosen model, and (ii) why you chose this particular model. 
- b Implement your model of choice. How does the performance of the Naive Bayes classifier compare against your additional model? Explain your observations.

### Question 5: Bias and Fairness in Student Success Prediction

As machine learning practitioners, we should be aware of possible ethical considerations around the
applications we develop. The classifier you developed in this assignment could for example be used
to classify college applicants into admitted vs not-admitted – depending on their predicted
grade.
- a Discuss ethical problems which might arise in this application and lead to unfair treatment of the applicants. Link your discussion to the set of features provided in the students data set. [no programming required]
- b Select ethically problematic features from the data set and remove them from the data set. Use your own judgment (there is no right or wrong), and document your decisions. Train your Naive Bayes classifier on the resulting data set containing only ‘unproblematic’ features. How does the performance change in comparison to the full classifier?
- c The approach to fairness we have adopted is called “fairness through unawareness” – we simply deleted any questionable features from our data. Removing all problematic features does not guarantee a fair classifier. Can you think of reasons why removing problematic features is not enough? [no programming required]


### Question 1: Naive Bayes Concepts and Implementation

- a Naive Bayes assumes that the features are conditionally independent of one another given the class label: within a class, the value of one feature tells us nothing about the value of any other feature. In real data this assumption is very strong and almost never holds exactly, which is why it is called 'naive'.
    - (1) The classifier is built on Bayes' rule, which needs the class-conditional likelihood of the whole feature vector. Without the naive assumption, this joint probability would have to be estimated directly, which is practically impossible. With the assumption, it factorises into a product of per-feature probabilities P(x = feature | y = label) that can be estimated by simple counting.
    - (2) Our data set has 30 columns: 29 features plus the label. Without independence, we would need the joint probability distribution over all 29 features, each of which takes several values, that is, a distribution over a 29-dimensional space. Even with current computing power, estimating it by counting is infeasible. Moreover, because the data is sparse, it cannot contain every combination of feature values, so many of the estimated probabilities would be 0, which is difficult for the classifier to handle. The assumption is problematic because related features, for example 'Medu' and 'Fedu' or 'Dalc' and 'Walc', are unlikely to be independent, so their shared evidence is effectively counted more than once.
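
To make the size argument concrete, the following sketch compares the number of probabilities needed per class for a full joint distribution with the number needed under the naive assumption. The list `cardinalities` is a hypothetical assumption (the real number of distinct values per feature may differ); only the count of 29 features comes from the text.

```python
from functools import reduce
from operator import mul

# Hypothetical numbers of distinct values for 29 categorical features
# (an assumption for illustration only)
cardinalities = [2] * 13 + [3] * 4 + [4] * 6 + [5] * 6

# Full joint: one probability per combination of feature values, per class
joint_cells = reduce(mul, cardinalities, 1)

# Naive Bayes: one small table per feature, per class
naive_cells = sum(cardinalities)

print(len(cardinalities), joint_cells, naive_cells)
```

With a few hundred students, almost every cell of the joint table would be empty, whereas each of the per-feature tables can be filled from the same data.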

In [31]:
data = load_data("student.csv")
x_train, x_test, y_train, y_test = split_data(data)
probInTrain = train(x_train, y_train)
y_predict = predict(probInTrain, x_test)
evaluate(y_test, y_predict)

0.32515337423312884

- b The code above implements the Naive Bayes classifier. load_data() reads the CSV file. split_data() returns four subsets; I hold out 25% of the data to test the model. train() builds the supervised NB model (I tried it both with and without Laplace smoothing), and predict() uses the trained model to predict labels for x_test. Finally, evaluate() returns the accuracy of the model.

- c The accuracy is 0.32515337423312884. However, this value is not constant: because split_data() splits the data randomly, the training and test sets differ on every run. Over many runs, the accuracy of my model stayed between 0.3 and 0.4. The following code shows the correctly and incorrectly predicted instances. Comparing the two tables, I found that 'absences' is an important feature. When 'absences' is 'one_to_three' or 'more_than_ten', most predictions are correct, whereas when it is 'four_to_six', almost all predictions are incorrect. When 'absences' is 'none', only about half of the predictions are correct. 'guardian' shows a similar pattern: my model cannot handle instances where 'guardian' is 'other', for which the error rate is as high as 96%.

In [573]:
corr = x_test[y_predict == y_test]
incorr = x_test[y_predict != y_test]
pandas.set_option('display.max_columns', None)
pandas.set_option('display.max_rows', None)
pandas.set_option('display.max_colwidth', 100)
corr
#incorr

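| index | school | reason | traveltime | studytime | failures | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 599 | GP | course | medium | low | low | no | yes | yes | yes | yes | no | 4 | 4 | 4 | 1 | 2 | 5 | none |
| 176 | GP | home | low | low | low | no | no | yes | yes | no | no | 5 | 3 | 4 | 1 | 1 | 5 | more_than_ten |
| 605 | GP | home | low | medium | none | no | no | yes | yes | yes | no | 4 | 4 | 4 | 1 | 1 | 3 | none |
| 475 | MS | other | medium | medium | none | no | no | yes | yes | yes | yes | 4 | 4 | 5 | 1 | 1 | 4 | one_to_three |
| 219 | GP | home | medium | low | low | no | no | yes | no | yes | no | 2 | 2 | 3 | 3 | 4 | 5 | more_than_ten |
| 193 | GP | reputation | low | medium | none | no | yes | yes | yes | yes | yes | 4 | 3 | 5 | 1 | 4 | 2 | more_than_ten |
| 251 | GP | course | high | low | low | no | yes | yes | yes | yes | yes | 5 | 5 | 5 | 1 | 1 | 1 | one_to_three |
| 535 | GP | other | low | medium | none | no | yes | no | yes | yes | no | 4 | 4 | 4 | 1 | 1 | 3 | one_to_three |
| 506 | GP | reputation | low | very_high | none | no | no | yes | yes | yes | yes | 5 | 2 | 3 | 1 | 3 | 3 | one_to_three |
| 299 | MS | reputation | medium | low | none | no | yes | no | yes | yes | yes | 4 | 4 | 5 | 1 | 2 | 5 | none |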
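
The per-value comparison described in Question 1c can also be computed directly rather than read off the tables. The sketch below is one possible way to do it; the toy `x_test`, `y_test` and `y_predict` values (including the labels "A", "B", "C") are invented stand-ins, not results from the student data.

```python
import pandas

# Toy stand-ins for x_test, y_test and y_predict (hypothetical values)
x_test = pandas.DataFrame({"absences": ["none", "none", "one_to_three",
                                        "four_to_six", "one_to_three", "four_to_six"]})
y_test = pandas.Series(["A", "B", "C", "B", "A", "C"])
y_predict = ["A", "C", "C", "A", "A", "B"]

# True where the prediction matches the gold label
is_correct = pandas.Series(y_predict, index=y_test.index) == y_test

# Fraction of correct predictions for each value of the feature
per_value_accuracy = is_correct.groupby(x_test["absences"]).mean()
print(per_value_accuracy)
```

Running the same grouping on each feature in turn would show which feature values the classifier handles well and which it does not.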


### Question 3: Training Strategies 
- a K-fold cross-validation and bagging are two common resampling techniques. Because our data set is small, K-fold cross-validation is likely to give a more reliable estimate than the hold-out strategy.
    - (i) In K-fold cross-validation, we divide the data into K parts (folds) of roughly equal size. In each of K rounds, one fold is held out for testing and the remaining K-1 folds are used for training, so every instance is used for testing exactly once. The K accuracy scores are then averaged to give the final estimate.
    - (ii) K-fold cross-validation has several advantages. With a single hold-out split, the result depends heavily on which instances happen to land in the training and test sets. A model may fit its training data well yet fail to generalise to the test data (overfitting), and a single split can hide or exaggerate this. K-fold cross-validation evaluates the model on K different train/test partitions, so the averaged estimate is more reliable. In addition, hold-out evaluation trains and tests on fixed subsets only, whereas in cross-validation every instance is used for both training and testing, which reduces the bias of the estimate. K is usually set to 5 or 10. Because our data set is small, a large K would leave only a few test instances in each fold and make the per-fold scores noisy, so I use K = 5 in my code.
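
The partitioning in (i) can be sketched without scikit-learn as follows. This is a minimal illustration for the unshuffled case; the helper name `k_fold_indices` is hypothetical and the value of 649 rows is an assumption about the size of the data set.

```python
import numpy as np

def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for unshuffled K-fold splitting."""
    folds = np.array_split(np.arange(n_samples), k)
    for i, test_idx in enumerate(folds):
        # Training indices are all folds except the i-th one
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

n_samples, k = 649, 5  # 649 rows is an assumption about the data set size
fold_sizes = []
for train_idx, test_idx in k_fold_indices(n_samples, k):
    # No instance appears in both the training and the test set of a round
    assert len(np.intersect1d(train_idx, test_idx)) == 0
    fold_sizes.append(len(test_idx))
print(fold_sizes)
```

Each instance lands in exactly one test fold, so the K test sets together cover the whole data set.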

In [549]:
from sklearn.model_selection import KFold
features = data.iloc[:, : -1]
label = data.iloc[:, -1]

kFold = KFold(n_splits = 5)

accSum = 0
for trainNum, testNum in kFold.split(features):
    x_train = features.iloc[trainNum]
    y_train = label.iloc[trainNum]
    x_test = features.iloc[testNum]
    y_test = label.iloc[testNum]
    probInTrain = train(x_train, y_train)
    y_predict = predict(probInTrain, x_test)
    acc = evaluate(y_test, y_predict)
    print(acc)
    accSum += acc
print("Average of 5 times accuracies is {}".format(accSum / 5))

0.35384615384615387
0.38461538461538464
0.3384615384615385
0.3384615384615385
0.34108527131782945
Average of 5 times accuracies is 0.351293977340489


- b I implemented 5-fold cross-validation using the code above. The average accuracy over the 5 folds is 0.351293977340489, slightly higher than the hold-out result in Q1. First, the accuracies in Q1 and Q3 both lie in [0.3, 0.4]; one reasonable explanation is that Naive Bayes has limited ability to model this data set. Second, because the Q3 figure averages over 5 different train/test partitions, it is a more representative estimate than the single split in Q1. Finally, the hold-out result varies from run to run because split_data() draws a new random split each time, whereas the Q3 result is constant because KFold.split() without shuffling always partitions the data in the same way.
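
To put the comparison between Q1 and Q3 on a firmer footing, the sketch below summarises the five fold accuracies reported in the cross-validation output and checks how far the single hold-out score lies from their mean. The two-standard-deviation threshold is an informal rule of thumb chosen for illustration, not a formal significance test.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross-validation output
fold_accuracies = [0.35384615384615387, 0.38461538461538464,
                   0.3384615384615385, 0.3384615384615385,
                   0.34108527131782945]
hold_out_accuracy = 0.32515337423312884  # single hold-out run from Question 1

cv_mean = mean(fold_accuracies)
cv_std = stdev(fold_accuracies)  # sample standard deviation across folds

print("mean = {:.3f}, std = {:.3f}".format(cv_mean, cv_std))
print("hold-out within 2 std of the mean:",
      abs(hold_out_accuracy - cv_mean) <= 2 * cv_std)
```

Reporting the spread alongside the mean shows how much of the gap between the two strategies could be explained by fold-to-fold variation alone.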

### Question 5: Bias and Fairness in Student Success Prediction

- a We are classifying university applicants. I think the ethically problematic information falls into three categories.
    - 1. Gender: although performance in some subjects may differ between men and women, using 'sex' as a feature conflicts with gender equality.
    - 2. Family background: to prevent discrimination, facts about the student's family, such as whether the parents are separated, the home address, the parents' jobs and education levels, and the guardian, should not be used as features.
    - 3. Financial situation: all applicants should be treated equally, so information about the student's financial situation should not be used as a classification feature.
- b I drop 'sex', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'guardian', 'schoolsup' and 'famsup' from my model. Among them, the parents' education level may well affect student achievement, but treating students fairly takes priority. The process and result are shown in the following code:

In [571]:
data = load_data("student.csv")
data = data.drop(columns = ['sex', 'address', 'famsize', 'Pstatus', 'Medu', 'Fedu', 'Mjob', 'Fjob', 'guardian', 'schoolsup', 'famsup' ])
x_train, x_test, y_train, y_test = split_data(data)
probInTrain = train(x_train, y_train)
y_predict = predict(probInTrain, x_test)
evaluate(y_test, y_predict)

0.3987730061349693

The accuracy is 0.3987730061349693, which is higher than in Q1. However, the original data set is small and becomes narrower still after dropping 11 columns, so we cannot rule out that the improvement is due to chance.

- c Naive Bayes assumes that the features are unrelated to one another, but we are dealing with real data. Even after removing all ethically problematic features, the remaining features are still correlated with them to some degree, so the classifier can recover the removed information indirectly and is still not guaranteed to be fair. Secondly, which features count as problematic was decided by my own subjective judgment, and personal judgment cannot be fully unbiased. Finally, even after the removal, the data still contains features that appear irrelevant to the grade and act as noise; for example, the quality of family relationships seems unrelated to the grade.
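
The point about remaining features being related to the removed ones can be illustrated with a small check: how often can a removed attribute be guessed from a feature that was kept? The sketch below uses invented toy rows; pairing 'address' with 'traveltime' is an assumption made for illustration, not a finding from the student data.

```python
import pandas

# Toy data: 'address' is a removed sensitive feature, 'traveltime' is kept.
# All values are invented for illustration.
toy = pandas.DataFrame({
    "address":    ["U", "U", "U", "U", "R", "R", "R", "R"],
    "traveltime": ["low", "low", "low", "medium", "high", "high", "medium", "medium"],
})

# For each value of the kept feature, guess the most common sensitive value
majority_guess = toy.groupby("traveltime")["address"].agg(lambda s: s.mode().iloc[0])
recovered = toy["traveltime"].map(majority_guess)

# Share of rows whose removed attribute can be recovered from the kept one
recovery_rate = (recovered == toy["address"]).mean()
print(majority_guess.to_dict(), recovery_rate)
```

A recovery rate well above the majority-class baseline would suggest that the kept feature acts as a proxy, so dropping the sensitive column alone does not remove its influence.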

