# A Model to Predict the Perceived Usability of Websites

Author: Ibiyo Motunrayo O

Usability is the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use.

This project uses data collected from users to predict their perceived usability of a website. It also aims to identify the features in the dataset that most affect prediction accuracy and the F-beta score.


In [None]:
import warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib")

%matplotlib inline
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from IPython.display import display  # Allows the use of display() for DataFrames
import matplotlib.patches as mpatches

# Import supplementary visualization code visuals.py
import visuals as vs

### Import the data

In [None]:
data = pd.read_csv('Journaldatapython.csv')
data.head()

In [None]:
data = data.drop(columns='UsabilityIndex')
data.head()

### Exploratory Data Analysis

In [None]:
# Total number of records
n_records = data.shape[0]

# Number of records where the user perceived the website as usable (Class == 'Yes')
yes_records = len(data[data['Class'] == 'Yes'])

# Number of records where the user did not perceive the website as usable (Class == 'No')
no_records = len(data[data['Class'] == 'No'])

# Percentage of records in the 'Yes' class
yes_percent = (yes_records / n_records) * 100

print('Total number of records: {}'.format(n_records))
print('Perceived usability with class Yes: {}'.format(yes_records))
print('Perceived usability with class No: {}'.format(no_records))
print('Percentage of perceived usability in the Yes class: {:.2f}%'.format(yes_percent))
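
The class counts above also give a simple reference point: a naive predictor that labels every response 'Yes'. The sketch below computes that predictor's accuracy and F0.5 score in plain Python. The counts passed in (`yes_records=70`, `n_records=100`) are made-up illustration values, not figures from the dataset, and `naive_baseline` is a hypothetical helper name.

```python
def naive_baseline(yes_records, n_records, beta=0.5):
    """Accuracy and F-beta of a predictor that always answers 'Yes'."""
    # Every 'Yes' record is a true positive, every 'No' record a false positive.
    true_pos = yes_records
    false_pos = n_records - yes_records
    accuracy = true_pos / n_records
    precision = true_pos / (true_pos + false_pos)
    recall = 1.0  # an always-'Yes' predictor never misses a 'Yes' record
    beta_sq = beta ** 2
    fbeta = (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
    return accuracy, fbeta

# Made-up counts for illustration only
accuracy, fbeta = naive_baseline(yes_records=70, n_records=100)
print("Naive accuracy: {:.4f}, naive F0.5: {:.4f}".format(accuracy, fbeta))
```

A trained model is only useful if it beats these two numbers on the real class counts.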

**Feature Set Exploration**

- Gender: Male, Female
- Age range: under 16, 16 - 20, 21 - 25, 26 above
- Department: Computer Science, Project Management Technology, Chemistry, Information and Media Technology, Medicine and Surgery, Civil Engineering, Biochemistry, Electrical Electronics Engineering, estate management and valuation, Pharmacy, Law, Art, Agricultural Economics, Geology, Agricultural Extension and Rural Development, Microbiology, Political Science, Education, computer engineering, quantity surveying, Physics, soil science and land management, zoology, mathematics, Telecommunications Engineering, Water acquaculture and Fisheries Technology, Sociology, Building Technology, Statistics, Food science and technology, Psychology, crop science, medical laboratory sciences, Library and Information Science, Entreprenuerial and Bussiness Studies, Criminology, Veterinary medicine, environmental studies, Information and Communication Technology, Mechanical Engineering, Economics, ITE
- Level: Undergraduate, Post graduate, others
- ComputerLiteracy: None, Novice, Intermediate, Expert
- University

**Options**

1. Strongly disagree
2. Disagree
3. Neutral
4. Agree
5. Strongly agree

**Speed (SP)**

- SP1: I need not wait too long to open a page.
- SP2: I am able to quickly complete my tasks using the site.
- SP3: I need not wait too long to download a file.

**Navigation (NAV)**

- NAV1: I can easily navigate this site.
- NAV2: I can easily know where I am at this website.
- NAV3: The website does not open too many new windows when I am moving around.
- NAV4: I don't need to scroll left or right on the website.

**Ease of use (EU)**

- EU1: The website is easy to use.
- EU2: I can use the website without a guide.
- EU3: The website requires few steps to accomplish tasks.

**Content and Content Relevance (CCR)**

- CCR1: The information provided on this website is sufficient for me.
- CCR2: Content such as academic news and publication dates is up to date.
- CCR3: The website offers easy access to required details of the university, such as contact numbers, email address, and postal address.

**Accessibility (ACC)**

- ACC1: The website provides alternative text presentation.
- ACC2: The website is capable of full functionality via only the keyboard.
- ACC3: The navigation is designed to assist users in finding content and determining where they are.

**Aesthetics and presentation (AP)**

- AP1: The website's interface design is attractive.
- AP2: The website has a clean and simple presentation.
- AP3: I am comfortable with the colours used at this website.

**Reliability (REL)**

- REL1: The website is reachable exclusively over HTTPS.
- REL2: The university's website shows a warning message related to malicious software etc.


- UsabilityIndex: Continuous
- Class: No, Yes

In [None]:
data.describe()

In [None]:
data.columns

In [None]:
data.info()

### Data Preprocessing

In this section the raw data is cleaned and normalized for the learning algorithms. This step matters because clean data generally improves the performance of the model.
The following actions are carried out on the dataset:
1. Remove features that could undermine the goal of the prediction, e.g. University and UsabilityIndex.
   University is removed because the goal is to understand users' perceived usability.
   UsabilityIndex risks making the problem too simple to solve, as a simple rule using the usability index alone might give an accuracy of 100%.

2. Derive the area of specialization from the Department feature.
3. Check the dataset for null values.
4. Convert categorical data to numeric data.
   The classification algorithms used in this project do not accept categorical data, so the data is converted to numeric form before it is passed to the learning algorithm. In this step one-hot encoding, label encoders, and custom functions are used to convert categorical data to numeric.
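
As a minimal sketch of step 4, the snippet below one-hot encodes a categorical feature with `pd.get_dummies` and encodes the class label with `LabelEncoder`. The three rows are made up for illustration; only the column names (`Gender`, `SP1`, `Class`) come from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with made-up rows; column names follow the dataset's features.
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "SP1": [4, 2, 5],
    "Class": ["Yes", "No", "Yes"],
})

# One-hot encode the categorical feature; numeric columns pass through unchanged.
features = pd.get_dummies(toy.drop(columns="Class"))

# LabelEncoder assigns integers in sorted label order, so 'No' -> 0 and 'Yes' -> 1.
encoder = LabelEncoder()
target = encoder.fit_transform(toy["Class"])

print(list(features.columns))
print(list(target))
```

Each distinct category becomes its own indicator column, which avoids implying an order between categories such as departments.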
   

In [None]:
# Drop the University feature from the dataset
data = data.drop(columns='University')

In [None]:
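def count_plot(column_name):
    # Draw a bar chart of the value counts for one column
    figure = plt.figure(figsize=(20, 4))
    axs = figure.add_subplot(121)
    axs.set_title('Bar Chart')
    # Pass the column as x explicitly; newer seaborn versions require keyword arguments
    sns.countplot(x=data[column_name], ax=axs)

    plt.show()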

In [None]:
int_col = ['Gender', 'Age range', 'Department', 'Level', 'ComputerLiteracy',
       'SP1', 'SP2', 'SP3', 'NAV1', 'NAV2', 'NAV3', 'NAV4',
       'EU1', 'EU2', 'EU3', 'CCR1', 'CCR2', 'CCR3', 'ACC1', 'ACC2', 'ACC3',
       'AP1', 'AP2', 'AP3', 'REL1', 'REL2', 'Class']

for col in int_col:
    count_plot(col)


In [None]:
department = data['Department'].value_counts()

department

In [None]:
def department_group(department):
    # Remove leading whitespace and lowercase the department name
    return department.lstrip().lower()
    

In [None]:
data['Department'] = data['Department'].apply(department_group)

In [None]:
count_plot('Department')

In [None]:
def discipline(department):
    # Map a normalized department name to its broader discipline group
    if department in ('computer science', 'electrical electronics engineering', 'engineering', 'information and media technology',
                      'civil engineering', 'chemistry', 'computer engineering', 'electrical and electronics engineering',
                      'statistics', 'mathematics', 'food science and technology', 'science', 'physics',
                      'telecommunications engineering', 'mechanical engineering', 'information and communication technology',
                      'criminology', 'microbiology'):
        return 'Science and Engineering'
    elif department in ('geology', 'quantity surveying', 'estate management and valuation', 'soil science and land management',
                        'zoology', 'agricultural economics', 'crop science', 'library and information science', 'environmental studies',
                        'building', 'agricultural extension and rural development', 'water acquaculture and fisheries technology'):
        return 'Agricultural and Environmental Science'
    elif department in ('medicine and surgery', 'pharmacy', 'biochemistry', 'dentistry and dental surgery', 'veterinary medicine',
                        'medical laboratory sciences'):
        return 'Medical Science'
    else:
        # Any department not listed above falls into this group
        return 'Art,Social Science and Education'

In [None]:
data['Department'] = data['Department'].apply(discipline)
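
The grouping applied above could alternatively be written as a dictionary lookup, which keeps the group lists as data rather than branches. This is a sketch only: the group lists are abbreviated for illustration, and `DISCIPLINE_GROUPS`, `DEPARTMENT_TO_GROUP`, and `discipline_lookup` are hypothetical names, not part of the notebook.

```python
# Abbreviated group lists for illustration only; the full lists are in discipline().
DISCIPLINE_GROUPS = {
    "Science and Engineering": ["computer science", "physics", "statistics"],
    "Agricultural and Environmental Science": ["geology", "crop science"],
    "Medical Science": ["pharmacy", "biochemistry"],
}

# Invert the mapping once so each lookup is a single dictionary access.
DEPARTMENT_TO_GROUP = {
    department: group
    for group, departments in DISCIPLINE_GROUPS.items()
    for department in departments
}

def discipline_lookup(department):
    """Return the group for a normalized department name, with the same fallback."""
    return DEPARTMENT_TO_GROUP.get(department, "Art,Social Science and Education")

print(discipline_lookup("physics"))
print(discipline_lookup("law"))
```

As in `discipline`, any department that is not listed falls through to the 'Art,Social Science and Education' group.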

In [None]:
department = data['Department'].value_counts()

department

In [None]:
count_plot('Department')

### One-hot encode all fields

In [None]:
X = data.drop(columns='Class')  # Features that will be used for training the model
y = data['Class']  # The target class

In [None]:
X.head()

In [None]:
X = X.fillna(0)
X.isnull().any()

In [None]:
X_final = pd.get_dummies(data = X)

In [None]:
X_final.columns

In [None]:
# Encode the class labels as integers: 'No' becomes 0 and 'Yes' becomes 1
from sklearn.preprocessing import LabelEncoder
labelEncoder = LabelEncoder()
y = labelEncoder.fit_transform(y)

y

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size =0.2, random_state=40)

In [None]:
print("Training set has {} user responses".format(X_train.shape[0]))
print("Test set has {} user responses".format(X_test.shape[0]))

### Training the model

In [None]:
# Build a training function
from time import time
from sklearn.metrics import fbeta_score, accuracy_score
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    '''
    Train a learner on a subset of the training data and score it.

    inputs of the function:
        - learner: the learning algorithm to be trained and predicted on
        - sample_size: the number of samples to be drawn from the training set
        - X_train: features for the training set
        - y_train: usability class for the training set
        - X_test: features for the testing set
        - y_test: usability class for the testing set
    '''
    results = {}
    
    #train the model
    start = time()
    learner = learner.fit(X_train[:sample_size],y_train[:sample_size])
    end = time()
    
    results['train_time'] = end - start
    
       
    # Predict on the test set and on the training subset
    start = time()
    prediction_test = learner.predict(X_test)
    prediction_train = learner.predict(X_train[:sample_size])
    end = time()
    
    results['pred_time'] = end - start
    
    results['acc_train'] = accuracy_score(y_train[:sample_size], prediction_train)
    
    results ['acc_test'] = accuracy_score (y_test, prediction_test)
    
    # beta must be passed by keyword in current scikit-learn versions
    results['f_train'] = fbeta_score(y_train[:sample_size], prediction_train, beta=0.5)
    
    results['f_test'] = fbeta_score(y_test, prediction_test, beta=0.5)
    
    #Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    
    return results
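
`train_predict` scores each learner with an F-beta score at beta = 0.5, which weights precision more heavily than recall. The sketch below computes the score by hand on binary 0/1 labels so the formula is visible. The labels are made up, and `fbeta_from_labels` is a hypothetical helper; the notebook itself relies on scikit-learn's `fbeta_score`.

```python
def fbeta_from_labels(y_true, y_pred, beta=0.5):
    """F-beta for binary 0/1 labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_neg = sum(1 for t, p in pairs if t == 1 and p == 0)
    if true_pos == 0:
        return 0.0  # no correct positive predictions
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)

# Made-up labels for illustration only
y_true_toy = [1, 1, 1, 0, 0, 1]
y_pred_toy = [1, 0, 1, 1, 0, 1]
score = fbeta_from_labels(y_true_toy, y_pred_toy)
print("F0.5 = {:.4f}".format(score))
```

With beta below 1, a false positive lowers the score more than a false negative does, which suits cases where wrongly labelling a site as usable is the costlier error.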
    

In [None]:
def evaluate(results):
    """
    Visualization code to display results of various learners.
    
    inputs:
      - results: a dictionary keyed by learner name; each value maps the
        training-size index (0, 1, 2) to the dictionary of statistics
        returned by 'train_predict()'
    """
    
    # Create figure
    fig, ax = plt.subplots(2, 3, figsize = (11,10))

    # Constants
    bar_width = 0.3
    colors = ['#A00000','#00A0A0','#00A000']
    
    # Loop over learners, metrics, and training-set sizes to fill the six panels
    for k, learner in enumerate(results.keys()):
        for j, metric in enumerate(['train_time', 'acc_train', 'f_train', 'pred_time', 'acc_test', 'f_test']):
            for i in np.arange(3):
                
                # Draw one bar per learner and training-set size
                ax[j//3, j%3].bar(i+k*bar_width, results[learner][i][metric], width = bar_width, color = colors[k])
                ax[j//3, j%3].set_xticks([0.45, 1.45, 2.45])
                ax[j//3, j%3].set_xticklabels(["50%", "75%", "100%"])
                ax[j//3, j%3].set_xlabel("Training Set Size")
                ax[j//3, j%3].set_xlim((-0.3, 3.0))
    
    # Add unique y-labels
    ax[0, 0].set_ylabel("Time (in seconds)")
    ax[0, 1].set_ylabel("Accuracy Score")
    ax[0, 2].set_ylabel("F-score")
    ax[1, 0].set_ylabel("Time (in seconds)")
    ax[1, 1].set_ylabel("Accuracy Score")
    ax[1, 2].set_ylabel("F-score")
    
    # Add titles
    ax[0, 0].set_title("Model Training")
    ax[0, 1].set_title("Accuracy Score on Training Subset")
    ax[0, 2].set_title("F-score on Training Subset")
    ax[1, 0].set_title("Model Predicting")
    ax[1, 1].set_title("Accuracy Score on Testing Set")
    ax[1, 2].set_title("F-score on Testing Set")
    
    # Add horizontal lines for naive predictors
    #ax[0, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
    #ax[1, 1].axhline(y = accuracy, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
    #ax[0, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
    #ax[1, 2].axhline(y = f1, xmin = -0.1, xmax = 3.0, linewidth = 1, color = 'k', linestyle = 'dashed')
    
    # Set y-limits for score panels
    ax[0, 1].set_ylim((0, 1))
    ax[0, 2].set_ylim((0, 1))
    ax[1, 1].set_ylim((0, 1))
    ax[1, 2].set_ylim((0, 1))

    # Create patches for the legend
    patches = []
    for i, learner in enumerate(results.keys()):
        patches.append(mpatches.Patch(color = colors[i], label = learner))
    plt.legend(handles = patches, bbox_to_anchor = (-.80, 2.53), \
               loc = 'upper center', borderaxespad = 0., ncol = 3, fontsize = 'x-large')
    
    # Aesthetics
    plt.suptitle("Performance Metrics for Three Supervised Learning Models", fontsize = 16, y = 1.10)
    plt.tight_layout()
    plt.show()

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

clf_A = GaussianNB()
clf_B = RandomForestClassifier(random_state=50)
clf_C = SVC(random_state=50)
clf_D = DecisionTreeClassifier(random_state = 50)
clf_E = AdaBoostClassifier(random_state =50)

samples_50 = int((1/2) * len(y_train))
samples_75 = int((75/100) * len(y_train))
samples_100 = len(y_train)

results = {}

# Only three learners are compared here; evaluate() defines three bar colors
for clf in [clf_B, clf_C, clf_E]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_50, samples_75, samples_100]):
        results[clf_name][i] = \
        train_predict(clf, samples, X_train, y_train, X_test, y_test)

# Run metrics visualization for the supervised learning models chosen
evaluate(results)
    