# Section 1-Teaching Assistant Evaluation Dataset (TAE)

This is a classification dataset with 151 instances and 6 attributes (5 features plus the class attribute). The data consist of evaluations of teaching performance, where scores are "low", "medium", or "high". The dataset is available at https://archive.ics.uci.edu/ml/datasets/Teaching+Assistant+Evaluation. There are no missing attribute values.
The attributes of this dataset are:

1. Whether or not the TA is a native English speaker (binary); 1=English speaker, 2=non-English speaker 
2. Course instructor (categorical, 25 categories) 
3. Course (categorical, 26 categories) 
4. Summer or regular semester (binary); 1=Summer, 2=Regular 
5. Class size (numerical) 
6. Class attribute (categorical); 1=Low, 2=Medium, 3=High





# Section 2-Loading of the dataset
This dataset is loaded using the urllib.request module in Python. Once the dataset is loaded, I put it in a dataframe, which is then binarized in Section 3 using the mean of each column. The dataframe is printed for reference.

In [111]:
import pandas as pd
import numpy as np
import statistics
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import urllib.request

url="https://archive.ics.uci.edu/ml/machine-learning-databases/tae/tae.data"

raw_data=urllib.request.urlopen(url)
dataset=np.loadtxt(raw_data,delimiter=",")
# column names, in the order the attributes appear in the data file
feature_names= ["TA Language","Course Instructor","Course","Summer or Regular Semester","Class size","Class attribute"]
df = pd.DataFrame(dataset, columns = feature_names)
print(df)

     TA Language  Course Instructor  Course  Summer or Regular Semester  \
0            1.0               23.0     3.0                         1.0   
1            2.0               15.0     3.0                         1.0   
2            1.0               23.0     3.0                         2.0   
3            1.0                5.0     2.0                         2.0   
4            2.0                7.0    11.0                         2.0   
5            2.0               23.0     3.0                         1.0   
6            2.0                9.0     5.0                         2.0   
7            2.0               10.0     3.0                         2.0   
8            1.0               22.0     3.0                         1.0   
9            2.0               15.0     3.0                         1.0   
10           2.0               10.0    22.0                         2.0   
11           2.0               13.0     1.0                         2.0   
12           2.0         

# Section 3-Binarizing the dataset
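This section binarizes the dataset: each value becomes 1 if it is greater than or equal to the mean of its column and 0 otherwise. The mean of each feature is calculated and printed (for reference), and the same mean is then used as the threshold for that feature in the dataframe. Because the loop runs over every column, the class attribute is binarized as well, which turns the three-level score into a two-class target.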

In [112]:
for column in df:
    col_mean=df[column].mean()
    print("Mean for feature",column,"is",col_mean)
    # binarize the column: '1' if the value is >= the column mean, else '0'
    df[column] = np.where(df[column] >= col_mean, '1', '0')
print(' ')
print(df)


Mean for feature TA Language is 1.80794701987
Mean for feature Course Instructor is 13.642384106
Mean for feature Course is 8.1059602649
Mean for feature Summer or Regular Semester is 1.84768211921
Mean for feature Class size is 27.8675496689
Mean for feature Class attribute is 2.01986754967
 
    TA Language Course Instructor Course Summer or Regular Semester  \
0             0                 1      0                          0   
1             1                 1      0                          0   
2             0                 1      0                          1   
3             0                 0      0                          1   
4             1                 0      1                          1   
5             1                 1      0                          0   
6             1                 0      0                          1   
7             1                 0      0                          1   
8             0                 1      0                          
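
As a small check of what the threshold does to the original codes, the sketch below applies the same >= mean rule to the possible values of two columns, using the means printed above. The helper name binarize_value is an assumption made for illustration and is not part of the notebook.

```python
# Means copied from the printed output above.
MEAN_TA_LANGUAGE = 1.80794701987
MEAN_CLASS_ATTRIBUTE = 2.01986754967

def binarize_value(value, mean):
    """Return '1' if value >= mean, else '0' (same rule as the np.where call)."""
    return '1' if value >= mean else '0'

# TA Language: 1 = English speaker, 2 = non-English speaker
ta_language = {v: binarize_value(v, MEAN_TA_LANGUAGE) for v in (1, 2)}
# Class attribute: 1 = Low, 2 = Medium, 3 = High
class_attribute = {v: binarize_value(v, MEAN_CLASS_ATTRIBUTE) for v in (1, 2, 3)}

print(ta_language)      # {1: '0', 2: '1'}
print(class_attribute)  # {1: '0', 2: '0', 3: '1'}
```

Under this rule Low and Medium both map to 0 and only High maps to 1, so the binarized class attribute separates High from the other two scores.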

# Section 3a- Train-test split on the binarized dataset
In this section we divide the dataset into train and test sets using a 70/30 split. Copies of the dataframe are also made for future use. X_test is printed just to show how the train-test split has worked on our binarized dataset. The shapes of the training and testing data are also printed for reference.

In [113]:
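df1=df.copy()
df2=df1.copy()
df3=df1.iloc[:,0:5] # features: the first five columns
df4=df2.iloc[:,5] # target: the binarized class attribute
# 70/30 split; no random_state is set, so the selected rows differ between runs
X_train, X_test, y_train, y_test = train_test_split(df3,df4,test_size = 0.3)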

print(X_test)

    TA Language Course Instructor Course Summer or Regular Semester Class size
95            1                 0      0                          1          0
45            1                 0      0                          1          0
14            1                 0      1                          1          1
50            1                 0      0                          1          1
21            1                 1      0                          1          0
113           1                 0      0                          1          0
90            0                 1      1                          1          0
89            1                 1      1                          1          0
44            1                 1      0                          0          0
77            1                 1      1                          1          1
37            1                 0      0                          1          1
0             0                 1      0            

In [114]:
print("Shape for X_train is",X_train.shape)
print("Shape for y_train is",y_train.shape)
print("Shape for X_test is",X_test.shape)
print("Shape for y_test is",y_test.shape)

Shape for X_train is (105, 5)
Shape for y_train is (105,)
Shape for X_test is (46, 5)
Shape for y_test is (46,)
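
The shapes above follow from applying the 0.3 test fraction to 151 rows. Here is a minimal sketch of the arithmetic, assuming (as I understand scikit-learn's behavior) that a fractional test_size is rounded up and the remainder goes to training. split_sizes is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of the train and test parts for a fractional test_size."""
    n_test = math.ceil(test_size * n_samples)  # assumed: the test part is rounded up
    n_train = n_samples - n_test               # the rest is used for training
    return n_train, n_test

print(split_sizes(151, 0.3))  # (105, 46)
```

Passing a fixed random_state to train_test_split would make the selected rows repeatable across runs.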


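# Section 4- Training of Bernoulli Naive Bayes
In this section, we train a Bernoulli Naive Bayes classifier using the default parameter settings and use predict_proba to show the predicted class probabilities for the test dataset. Since the class attribute was binarized in Section 3, the two classes are 0 (Low or Medium, labeled No below) and 1 (High, labeled Yes below).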

In [115]:
# Bernoulli NB
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
model=clf.fit(X_train, y_train)
print(model)
print(' ')
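# columns follow the class order: column 0 is class '0' (No), column 1 is class '1' (Yes)
prediction_frame=pd.DataFrame(model.predict_proba(X_test))
sorteddf=prediction_frame.sort_values(by=0,ascending=True) # sort by P(No), lowest first
sorteddf.columns = ['No', 'Yes']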
print(sorteddf)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
 
          No       Yes
35  0.272545  0.727455
11  0.282784  0.717216
21  0.499738  0.500262
13  0.499738  0.500262
6   0.521211  0.478789
37  0.563410  0.436590
25  0.563410  0.436590
15  0.575925  0.424075
8   0.575925  0.424075
34  0.575925  0.424075
38  0.575925  0.424075
42  0.575925  0.424075
14  0.598235  0.401765
44  0.598235  0.401765
31  0.755378  0.244622
30  0.755378  0.244622
33  0.755378  0.244622
0   0.755378  0.244622
22  0.755378  0.244622
45  0.755378  0.244622
1   0.755378  0.244622
5   0.755378  0.244622
4   0.764688  0.235312
12  0.764688  0.235312
29  0.764688  0.235312
28  0.764688  0.235312
3   0.765785  0.234215
18  0.765785  0.234215
10  0.765785  0.234215
32  0.765785  0.234215
20  0.765785  0.234215
27  0.765785  0.234215
41  0.774817  0.225183
40  0.780846  0.219154
24  0.789457  0.210543
7   0.789457  0.210543
17  0.789457  0.210543
19  0.789457  0.210543
26  0.790470  0.209530
23  0.7

# Section 4a- Calculating probability of each class

In [116]:
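probb=clf.class_log_prior_ # log prior of each class, in class order (No, Yes)
cProb0,cProb1=np.exp(probb) # P(No), P(Yes)
# print(cProb0)
# print(cProb1)
ratiosofcProb=cProb0/cProb1
e_cProb=np.log(ratiosofcProb) # log prior ratio, log(P(No)/P(Yes))
print(e_cProb)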


0.78015855755
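
The value printed above is log(P(No)/P(Yes)), which is simply the difference of the two log priors, so the round trip through np.exp and np.log is not strictly needed. The sketch below checks this with plain Python. The class counts 72 and 33 are an assumption: they are not printed in the notebook, but they sum to the 105 training rows and reproduce the printed value.

```python
import math

# Assumed class counts in the training set (not printed in the notebook).
n_no, n_yes = 72, 33
n_total = n_no + n_yes  # 105 training rows

# With fit_prior=True and no class_prior, the prior is the class frequency.
log_prior_no = math.log(n_no / n_total)
log_prior_yes = math.log(n_yes / n_total)

# log(P(No)/P(Yes)) is the difference of the log priors.
log_prior_ratio = log_prior_no - log_prior_yes
print(round(log_prior_ratio, 6))  # 0.780159
```

In the notebook's own variables the same quantity would be probb[0]-probb[1].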


# Section 4b- Getting probability of each feature given a class

This section computes the probability of each feature given a class (No/Yes).
feature_log_prob_ is a Bernoulli Naive Bayes attribute which provides the empirical log probability of features given a class, P(x_i|y).
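
As I understand it, BernoulliNB estimates these probabilities with additive smoothing, P(x_i=1|y) = (N_yi + alpha) / (N_y + 2*alpha), where alpha=1.0 is the default shown in the model printout in Section 4. A minimal sketch of that formula follows. The helper smoothed_log_prob and the counts passed to it are assumptions chosen for illustration (the notebook does not print the counts), although they do reproduce the first column of the feature_log_prob_ values printed by the next cell.

```python
import math

def smoothed_log_prob(n_ones, n_class, alpha=1.0):
    """Log of the smoothed estimate of P(x_i = 1 | y).

    n_ones:  rows of class y where the feature equals 1
    n_class: all rows of class y
    """
    return math.log((n_ones + alpha) / (n_class + 2 * alpha))

# Assumed counts: 61 of 72 "No" rows and 20 of 33 "Yes" rows have TA Language = 1.
print(round(smoothed_log_prob(61, 72), 6))  # -0.176931
print(round(smoothed_log_prob(20, 33), 6))  # -0.510826
```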


In [117]:
featureProb=model.feature_log_prob_
print(featureProb)
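featureProbinexp=np.exp(featureProb) # converting log probabilities back to probabilities
probof1_0=featureProbinexp[0] # P(x_i=1 | No)
probof1_1=featureProbinexp[1] # P(x_i=1 | Yes)

mainRatio=probof1_0/probof1_1
loggratio_10=np.log(mainRatio) # log ratio (No vs Yes) for a feature equal to 1

probof0_0=1-probof1_0 # P(x_i=0 | No)
probof0_1=1-probof1_1 # P(x_i=0 | Yes)
logratio_00=np.log(probof0_0/probof0_1) # log ratio (No vs Yes) for a feature equal to 0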
print('')
print(logratio_00)

[[-0.17693071 -0.64050345 -0.90286771 -0.09937247 -0.69314718]
 [-0.51082562 -0.6649763  -0.9903987  -0.22314355 -0.72213472]]

[-0.90286771 -0.02658231 -0.05556985 -0.74871703 -0.02817088]
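
To see how these pieces fit together, the sketch below combines the log prior ratio from Section 4a with the per-feature log ratios to get the log-odds of No versus Yes for one test row, and then converts the log-odds to a probability. All numbers are copied from the outputs printed above. The function name log_odds_no_vs_yes is an assumption made for illustration. The row used is the first row of X_test printed earlier (index 95).

```python
import math

# Values printed by the notebook.
LOG_PRIOR_RATIO = 0.78015855755  # log(P(No)/P(Yes)) from Section 4a
LOG_PROB_NO = [-0.17693071, -0.64050345, -0.90286771, -0.09937247, -0.69314718]
LOG_PROB_YES = [-0.51082562, -0.6649763, -0.9903987, -0.22314355, -0.72213472]

def log_odds_no_vs_yes(x):
    """Log-odds of class No against class Yes for a binary feature vector x."""
    total = LOG_PRIOR_RATIO
    for xi, lp_no, lp_yes in zip(x, LOG_PROB_NO, LOG_PROB_YES):
        if xi == 1:
            # log(P(x_i=1|No) / P(x_i=1|Yes))
            total += lp_no - lp_yes
        else:
            # log(P(x_i=0|No) / P(x_i=0|Yes))
            total += math.log((1 - math.exp(lp_no)) / (1 - math.exp(lp_yes)))
    return total

row = [1, 0, 0, 1, 0]  # first row of X_test (index 95)
p_no = 1 / (1 + math.exp(-log_odds_no_vs_yes(row)))
print(round(p_no, 4))  # 0.7554
```

The result agrees with the 0.755378 shown for row 0 in the Section 4 output, which suggests that the model's prediction is this sum of log ratios passed through the logistic function.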
