# Section 1-Teaching Assistant Evaluation Dataset (TAE)

This is a classification dataset with 151 instances and 6 attributes (including the class attribute). The data consist of evaluations of teaching assistant performance, scored as "low", "medium", or "high". The dataset is available at https://archive.ics.uci.edu/ml/datasets/Teaching+Assistant+Evaluation. There are no missing attribute values.
The attributes of this dataset are:

1. Whether or not the TA is a native English speaker (binary); 1=English speaker, 2=non-English speaker
2. Course instructor (categorical, 25 categories)
3. Course (categorical, 26 categories)
4. Summer or regular semester (binary); 1=Summer, 2=Regular
5. Class size (numerical)
6. Class attribute (categorical); 1=Low, 2=Medium, 3=High





# Section 2-Loading of the dataset
The dataset is loaded with the urllib.request module in Python. Once loaded, it is placed in a DataFrame, which Section 3 binarizes using the mean of each feature. The DataFrame is printed for reference.

In [2]:
import pandas as pd
import numpy as np
import statistics
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
import urllib.request

url="https://archive.ics.uci.edu/ml/machine-learning-databases/tae/tae.data"

raw_data=urllib.request.urlopen(url)
dataset=np.loadtxt(raw_data,delimiter=",")
feature_names = ["TA Language","Course Instructor","Course","Summer or Regular Semester","Class size","Class attribute"]
df = pd.DataFrame(dataset, columns=feature_names)  # reuse the list instead of repeating it
print(df)

     TA Language  Course Instructor  Course  Summer or Regular Semester  \
0            1.0               23.0     3.0                         1.0   
1            2.0               15.0     3.0                         1.0   
2            1.0               23.0     3.0                         2.0   
3            1.0                5.0     2.0                         2.0   
4            2.0                7.0    11.0                         2.0   
5            2.0               23.0     3.0                         1.0   
6            2.0                9.0     5.0                         2.0   
7            2.0               10.0     3.0                         2.0   
8            1.0               22.0     3.0                         1.0   
9            2.0               15.0     3.0                         1.0   
10           2.0               10.0    22.0                         2.0   
11           2.0               13.0     1.0                         2.0   
12           2.0         
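
The same load can also be done in one step with `pandas.read_csv`, which keeps the integer codes as integers instead of converting them to floats. This is a minimal sketch, assuming the file has no header row (as the `np.loadtxt` call above implies). The helper name `load_tae` is hypothetical, and the three sample rows are made up for illustration; they are not records from the dataset.

```python
import io

import pandas as pd

FEATURE_NAMES = [
    "TA Language", "Course Instructor", "Course",
    "Summer or Regular Semester", "Class size", "Class attribute",
]


def load_tae(source):
    """Read a header-less, comma-separated TAE file into a DataFrame.

    `source` may be a URL, a file path, or an open text buffer.
    """
    return pd.read_csv(source, header=None, names=FEATURE_NAMES)


# Made-up rows for illustration only (NOT taken from tae.data).
sample = io.StringIO("1,4,7,2,20,1\n2,8,1,1,35,3\n2,4,9,2,12,2\n")
sample_df = load_tae(sample)
print(sample_df.dtypes)  # every column is read as an integer type
```

Passing the URL from the cell above, `load_tae(url)`, should give the full 151-row frame, since `read_csv` accepts URLs directly.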

# Section 3-Binarizing the dataset
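This section binarizes every column of the DataFrame by comparing each value with the mean of its column: values greater than or equal to the mean become '1', and all others become '0'. The mean of each feature is printed for reference before it is applied. Because the loop covers every column, the class attribute is binarized as well: its mean is about 2.02, so Low (1) and Medium (2) become '0' and High (3) becomes '1'.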

In [3]:
for column in df:
    mean_value = df[column].mean()
    print("Mean for feature",column,"is",mean_value)
    # Binarize: '1' if the value is at or above the column mean, else '0'
    df[column] = np.where(df[column] >= mean_value, '1', '0')
print(' ')
print(df)


Mean for feature TA Language is 1.80794701987
Mean for feature Course Instructor is 13.642384106
Mean for feature Course is 8.1059602649
Mean for feature Summer or Regular Semester is 1.84768211921
Mean for feature Class size is 27.8675496689
Mean for feature Class attribute is 2.01986754967
 
    TA Language Course Instructor Course Summer or Regular Semester  \
0             0                 1      0                          0   
1             1                 1      0                          0   
2             0                 1      0                          1   
3             0                 0      0                          1   
4             1                 0      1                          1   
5             1                 1      0                          0   
6             1                 0      0                          1   
7             1                 0      0                          1   
8             0                 1      0                          
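
The column-by-column loop can also be written as a single vectorized comparison. The sketch below is one possible alternative, not the method used in this notebook. It assumes integer 0/1 output rather than the '0'/'1' strings produced above, and both the toy frame and the function name `binarize_by_mean` are hypothetical.

```python
import pandas as pd


def binarize_by_mean(frame):
    """Return a new frame: 1 where a value is >= its column mean, else 0."""
    means = frame.mean()                 # one mean per column
    return (frame >= means).astype(int)  # the Series of means aligns on column names


# Toy values for illustration only.
toy = pd.DataFrame({"Class size": [10, 20, 60], "Class attribute": [1, 2, 3]})
binarized = binarize_by_mean(toy)
print(binarized)
```

Unlike the in-place loop, this version leaves the original frame unchanged, so the raw values stay available for later steps.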

# Section 3a- Train-test split on the binarized dataset
In this section we split the dataset into training and test sets using a 70/30 ratio. Copies of the DataFrame are also made for future use. X_test is printed to show how the split works on the binarized dataset, and the shapes of the training and test data are printed for reference.

In [4]:
df1=df.copy()
df2=df1.copy()
df3=df1.iloc[:,0:5]  # features: the first five columns
df4=df2.iloc[:,5]    # target: the binarized class attribute
X_train, X_test, y_train, y_test = train_test_split(df3,df4,test_size = 0.3)  # 70% train, 30% test

print(X_test)

    TA Language Course Instructor Course Summer or Regular Semester Class size
85            1                 0      1                          0          0
134           1                 0      1                          1          0
79            0                 0      0                          0          0
149           1                 1      0                          1          1
126           0                 0      0                          1          1
41            0                 1      0                          1          1
81            1                 0      0                          1          1
62            1                 0      1                          0          0
127           1                 1      0                          1          1
131           0                 1      1                          1          1
107           1                 1      0                          1          0
110           1                 0      0            

In [5]:
print("Shape for X_train is",X_train.shape)
print("Shape for y_train is",y_train.shape)
print("Shape for X_test is",X_test.shape)
print("Shape for y_test is",y_test.shape)

Shape for X_train is (105, 5)
Shape for y_train is (105,)
Shape for X_test is (46, 5)
Shape for y_test is (46,)
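
Because `train_test_split` shuffles the rows randomly, the contents of `X_test` change on every run. A sketch of a reproducible, class-balanced split follows. The `random_state=0` and `stratify` arguments are additions for illustration and were not used above, and the random toy arrays only stand in for the binarized frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(151, 5))  # stand-in for the five binarized features
y_toy = rng.integers(0, 2, size=151)       # stand-in for the binarized class

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,    # same 70/30 ratio as above
    random_state=0,   # fixed seed: the same split on every run
    stratify=y_toy,   # keep the class proportions similar in both parts
)
print(X_tr.shape, X_te.shape)
```

With 151 rows, the 30% test share is rounded up to 46 rows, which leaves 105 for training and matches the shapes printed above.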


# Section 4- Training of Bernoulli Naive Bayes
In this section, we train a Bernoulli Naive Bayes classifier with default parameter settings and use predict_proba to show the predicted class probabilities for the test set.

In [11]:
# Bernoulli NB
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
model=clf.fit(X_train, y_train)
print(model)
print(' ')
prediction_frame=pd.DataFrame(model.predict_proba(X_test))
print(prediction_frame)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
 
           0         1
0   0.370361  0.629639
1   0.792224  0.207776
2   0.112737  0.887263
3   0.754405  0.245595
4   0.504320  0.495680
5   0.414443  0.585557
6   0.815350  0.184650
7   0.370361  0.629639
8   0.754405  0.245595
9   0.430187  0.569813
10  0.713192  0.286808
11  0.781400  0.218600
12  0.713192  0.286808
13  0.370361  0.629639
14  0.355441  0.644559
15  0.824869  0.175131
16  0.792224  0.207776
17  0.355441  0.644559
18  0.815350  0.184650
19  0.781400  0.218600
20  0.098440  0.901560
21  0.726210  0.273790
22  0.824869  0.175131
23  0.781400  0.218600
24  0.766166  0.233834
25  0.781400  0.218600
26  0.081212  0.918788
27  0.815350  0.184650
28  0.520443  0.479557
29  0.781400  0.218600
30  0.792224  0.207776
31  0.713192  0.286808
32  0.824869  0.175131
33  0.277256  0.722744
34  0.815350  0.184650
35  0.726210  0.273790
36  0.781400  0.218600
37  0.754405  0.245595
38  0.754405  0.245595
39  0.3
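
Each row of `predict_proba` sums to 1, and its columns follow the order of `model.classes_`. The sketch below illustrates both points on a tiny made-up dataset (the arrays `X_small` and `y_small` are hypothetical, with string labels chosen to mirror the '0'/'1' values used in this notebook). It also checks that `predict` returns the class with the larger probability.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary features and string labels, for illustration only.
X_small = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1],
                    [0, 1, 0], [1, 1, 1], [0, 0, 0]])
y_small = np.array(['1', '1', '0', '0', '1', '0'])

nb = BernoulliNB().fit(X_small, y_small)
proba = nb.predict_proba(X_small)

print(nb.classes_)          # column order of predict_proba
print(proba.sum(axis=1))    # each row is a probability distribution

# Picking the most probable column reproduces predict().
labels = nb.classes_[np.argmax(proba, axis=1)]
print(labels)
```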

# Section 4a- Sorting of probability distribution
The predicted probabilities are sorted in ascending order of the first column. predict_proba orders its columns by class label, so column 0 is P(No) and column 1 is P(Yes); the columns are renamed accordingly.

In [8]:
sorteddf=prediction_frame.sort_values(by=0,ascending=True)
sorteddf.columns = ['No', 'Yes']
print(sorteddf)

          No       Yes
26  0.081212  0.918788
20  0.098440  0.901560
43  0.112737  0.887263
2   0.112737  0.887263
33  0.277256  0.722744
14  0.355441  0.644559
17  0.355441  0.644559
39  0.364257  0.635743
13  0.370361  0.629639
0   0.370361  0.629639
7   0.370361  0.629639
5   0.414443  0.585557
9   0.430187  0.569813
4   0.504320  0.495680
28  0.520443  0.479557
31  0.713192  0.286808
42  0.713192  0.286808
10  0.713192  0.286808
12  0.713192  0.286808
40  0.713192  0.286808
21  0.726210  0.273790
35  0.726210  0.273790
37  0.754405  0.245595
8   0.754405  0.245595
3   0.754405  0.245595
38  0.754405  0.245595
24  0.766166  0.233834
29  0.781400  0.218600
36  0.781400  0.218600
45  0.781400  0.218600
25  0.781400  0.218600
23  0.781400  0.218600
19  0.781400  0.218600
11  0.781400  0.218600
30  0.792224  0.207776
44  0.792224  0.207776
16  0.792224  0.207776
41  0.792224  0.207776
1   0.792224  0.207776
34  0.815350  0.184650
18  0.815350  0.184650
6   0.815350  0.184650
27  0.81535
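
The row labels in the sorted output are the original positions in the test-set prediction frame, which is why they appear out of order. As a small sketch of the same idea, the column names can be derived from the class labels instead of being assigned by position. The probability values, the `classes` list (standing in for `model.classes_`), and the `label_names` mapping are all made up for illustration.

```python
import pandas as pd

proba = [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]]  # made-up probabilities
classes = ['0', '1']                          # stands in for model.classes_
label_names = {'0': 'No', '1': 'Yes'}         # hypothetical display names

frame = pd.DataFrame(proba, columns=[label_names[c] for c in classes])

# Most confident "Yes" first; a stable sort keeps tied rows in their original order.
ranked = frame.sort_values(by="Yes", ascending=False, kind="stable")
print(ranked)
```

Sorting by "Yes" in descending order gives the same ranking as sorting by "No" in ascending order, because the two columns sum to 1 in every row.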

# Section 4b- Getting probability of each feature given a class

This section reports the probability of each feature given a class (No/Yes).
feature_log_prob_ is a BernoulliNB attribute that holds the empirical log probability of each feature given a class, log P(x_i|y); row 0 corresponds to No and row 1 to Yes.


In [9]:
featureProb=model.feature_log_prob_
feature_prob=pd.DataFrame(featureProb,columns= ["TA Language","Course Instructor","Course","Summer or Regular Semester","Class size"])
print(feature_prob)

   TA Language  Course Instructor    Course  Summer or Regular Semester  \
0    -0.121361          -0.693147 -0.916291                   -0.058841   
1    -0.444686          -0.528067 -0.955511                   -0.331357   

   Class size  
0   -0.664976  
1   -0.773190  
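
Since these values are logarithms, applying `np.exp` recovers the probabilities themselves; for example, exp(-0.693147) is about 0.5. The sketch below, on a tiny made-up dataset, shows how the stored values relate to the smoothed counts (with the default `alpha=1.0`) and how `predict_proba` can be reproduced by hand from them. The arrays and variable names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Made-up binary features and labels, for illustration only.
X_small = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1],
                    [0, 1, 0], [1, 1, 1], [0, 0, 0]])
y_small = np.array([1, 1, 0, 0, 1, 0])

nb = BernoulliNB(alpha=1.0).fit(X_small, y_small)

# Back from log space: P(x_i = 1 | y), one row per class.
p_on = np.exp(nb.feature_log_prob_)

# The same numbers from the raw counts with Laplace smoothing (alpha = 1):
# (rows in the class with x_i = 1, plus alpha) / (rows in the class, plus 2 * alpha)
manual = (nb.feature_count_ + 1.0) / (nb.class_count_[:, None] + 2.0)

# Posterior by hand: log prior plus the sum of log P(x_i | y) over features,
# using P(x_i = 0 | y) = 1 - P(x_i = 1 | y) for features that are off.
log_joint = (X_small @ np.log(p_on).T
             + (1 - X_small) @ np.log(1 - p_on).T
             + nb.class_log_prior_)
posterior = np.exp(log_joint)
posterior /= posterior.sum(axis=1, keepdims=True)  # normalize each row

print(p_on)
print(posterior)
```

The smoothing term keeps every probability strictly between 0 and 1, so a feature value never seen in a class cannot force the posterior for that class to zero.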
