# Homework 5: A comparative study of ML algorithms (Part 1)
<h2>Predictive Analytics using Python</h2>
<h3>Simon Business School</h3>

__Instructor__: Yaron Shaposhnik


__Submission guidelines:__
1. Answer each question in the following cell.
2. __Click "Submit" to submit and execute the automatic grading__ (without this step the submission will not be recorded). 
3. You may submit multiple solutions but your last submission will determine your grade.
4. You may save the notebook and resume the work later.
5. __Refrain from adding any identifiable information to the notebook__. The notebook may be made public.  



## Overview

__Goal:__ The goal of this assignment is to conduct a study that compares the performance of various classification algorithms on multiple datasets. 

__Datasets:__ The folder data contains 98 publicly available datasets from the UCI machine learning repository ([link](http://archive.ics.uci.edu/ml/index.php)). These datasets were collected and converted to a standard format by Dunn and Bertsimas (for more details see [link1](https://github.com/JackDunnNZ/uci-data) and [link2](http://jack.dunn.nz/papers/OptimalClassificationTrees.pdf)):
* Each dataset is stored in a separate folder
* Each folder contains a datafile and the configuration file config.ini specifying the data format
- Data files are stored in csv format and their names end with either ".orig" or ".custom". When both files exist in a folder, we will use the file ending with ".custom"
- Each config.ini file contains information about a dataset, such as
    - separator: the character used to separate columns in the respective csv file
    - header_lines: the number of rows to skip at the top of the datafile (these rows describe the file and contain no data)
    - target_index: the column number of the output variable
    - value_indices: the column numbers of the input variables
    - categoric_indices: column numbers of categorical data
    
__Remarks__:
1. Notice that column numbering in the configuration files begins with 1 (versus 0 in Python)
2. You may use the package [configparser](https://docs.python.org/3.7/library/configparser.html) to read and parse config.ini files
3. The character "?" denotes a null value. After reading a data file, you may drop all lines that contain null values.
4. Out of the 98 datasets, use only the 54 datasets whose names are stored in the file "datasets_selection" (the other datasets pertain to regression problems).


__Assignment__: compare the performance of the following classification algorithms on the 54 datasets: 
- Support vector machine, 
- Logistic Regression, 
- K-nearest neighbors, 
- Decision trees, 
- Quadratic discriminant analysis, 
- Random forests, and 
- AdaBoost
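
As a rough sketch of what the comparison could look like for a single dataset, the seven algorithms can be instantiated with scikit-learn and scored by cross-validation. The synthetic data, the default hyperparameters, and the use of 5-fold accuracy are assumptions made for illustration only; nothing here is prescribed by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for one dataset (assumption: not one of the 54 UCI datasets)
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)

# one estimator per algorithm in the list above, mostly with default settings
classifiers = {
    'Support vector machine': SVC(),
    'Logistic regression': LogisticRegression(max_iter=1000),
    'K-nearest neighbors': KNeighborsClassifier(),
    'Decision tree': DecisionTreeClassifier(random_state=0),
    'Quadratic discriminant analysis': QuadraticDiscriminantAnalysis(),
    'Random forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'AdaBoost': AdaBoostClassifier(random_state=0),
}

# mean 5-fold cross-validated accuracy of each classifier
scores = {name: cross_val_score(clf, X, y, cv=5).mean()
          for name, clf in classifiers.items()}
for name, score in scores.items():
    print(f'{name}: {score:.3f}')
```

Keeping the estimators in a dictionary makes it straightforward to repeat the same loop over many datasets and collect the scores into a table.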


This Jupyter notebook will guide you through the analysis. You will then discuss your key findings, the limitations of the analysis, and compare the use of ML methods in this project to typical ML applications. 

__Tip:__ Start early. The assignment requires a substantial amount of file processing prior to running the learning algorithms and analyzing the results.

In [1]:
import pandas as pd
import numpy as np

# the public folder below contains files that are used in this assignment
public = '../resource/asnlib/publicdata/'

# helper function 
def display_example(EXAMPLE_FOLDER, hw, exercise, question):
    sample_file = EXAMPLE_FOLDER + '%s_%s_%s_sample.csv'%(hw, exercise,question)
    print(pd.read_csv(sample_file, index_col=0))

    
utils = public + 'utils.py'        # grading code
solution = public + '[solution]/'  # solutions 
%run $utils
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Part 1: Exploring a new data format

In this part of the assignment, you will explore the format of the datasets.

The folder `data` (located in the public folder) contains 98 publicly available datasets from the UCI machine learning repository ([link](http://archive.ics.uci.edu/ml/index.php)). These datasets were collected and converted to a standard format by Dunn and Bertsimas (for more details see [link1](https://github.com/JackDunnNZ/uci-data) and [link2](http://jack.dunn.nz/papers/OptimalClassificationTrees.pdf)). Each dataset is stored in a separate folder.

1. Store the names of the subfolders in `data` as a list named `datasets_folders`.

In [2]:
datasets_folders=!ls $public/data

In [3]:
#q1

###
### AUTOGRADER TEST - DO NOT REMOVE
###


2. One of the datasets in `data` is called `abalone`. Store the __relative__ location of the respective folder into the variable `dataset_folder` (see the definition of the variable `public` above).

In [4]:
dataset_folder= '../resource/asnlib/publicdata/data/abalone/'

In [5]:
#q2

###
### AUTOGRADER TEST - DO NOT REMOVE
###


3. Each subfolder of `data` contains a datafile (whose extension is .orig or .custom) and the configuration file `config.ini` specifying the data format. 
Data files are formatted as comma separated values and their names end with either ".orig" or ".custom". If both files exist in a folder, use the file ending with ".custom". 
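
The selection rule can be sketched as follows. The helper name `pick_data_file` and the file names in the temporary folder are hypothetical and only serve the illustration; the sketch also assumes that every folder contains at least one of the two file types.

```python
import os
import tempfile

def pick_data_file(folder):
    """Return the path of the .custom file if present, otherwise the .orig file.

    Hypothetical helper; assumes at least one of the two files exists.
    """
    names = sorted(os.listdir(folder))
    custom = [n for n in names if n.endswith('.custom')]
    orig = [n for n in names if n.endswith('.orig')]
    chosen = custom[0] if custom else orig[0]
    return os.path.join(folder, chosen)

# temporary folder standing in for a dataset folder that holds both file types
with tempfile.TemporaryDirectory() as folder:
    for name in ['config.ini', 'toy.data.orig', 'toy.data.orig.custom']:
        open(os.path.join(folder, name), 'w').close()
    print(os.path.basename(pick_data_file(folder)))  # toy.data.orig.custom
```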

The code below illustrates the file structure and the data format.

In [6]:
# the files inside the folder `abalone`
!ls $dataset_folder

abalone.data.orig  config.ini


In [7]:
# the first 10 rows of the data file (the default of `head`)
f = dataset_folder + 'abalone.data.orig'
!head $f

M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
F,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
F,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
M,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9
F,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19


In [8]:
# the content of the configuration file:
f = dataset_folder + 'config.ini'
!head $f

[info]
name = abalone.data
info_url = http://archive.ics.uci.edu/ml/datasets/Abalone
data_url = http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
target_index = 9
id_indices =
value_indices = 1,2,3,4,5,6,7,8
categoric_indices = 1
separator = comma
header_lines = 0


Each `config.ini` file contains information about a dataset: 
- separator: the character used to separate columns in the respective csv file
- header_lines: the number of rows to skip at the top of the datafile (these rows describe the dataset and contain no data)
- target_index: the column number of the output variable
- value_indices: the column numbers of the input variables
- categoric_indices: column numbers of categorical data
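
To see how these fields look once parsed, here is a minimal sketch that reads the abalone configuration with `configparser`. The configuration is embedded as a string (with the two URL fields left out) so that the example does not depend on the course folder; reading from a string rather than from `config.ini` is an assumption made for illustration.

```python
import configparser

# the abalone configuration shown above, without the URL fields
config_text = """
[info]
name = abalone.data
target_index = 9
id_indices =
value_indices = 1,2,3,4,5,6,7,8
categoric_indices = 1
separator = comma
header_lines = 0
"""

config = configparser.ConfigParser()
config.read_string(config_text)
info = config['info']

print(info['name'])                      # abalone.data
print(info.getint('target_index'))       # 9 (still 1-based at this point)
print(repr(info['id_indices']))          # '' (an empty field is an empty string)
print(info['value_indices'].split(','))  # values are strings until converted
```

Note that every value is returned as a string unless a typed getter such as `getint` is used, and that an empty field is returned as an empty string rather than `None`.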


4. Implement the function `load_config_file(dataset_folder)` which reads the configuration file of the dataset stored in the folder `dataset_folder` into a dictionary. Ignore the fields: info_url and data_url. Include a few additional fields:
- config_file: the location of the configuration file
- folder: the location of the dataset's folder
- data_file: the location of the data file (.custom file if available, otherwise the .orig file)

Note that: 
1. Column numbering in the configuration files begins with 1 (versus 0 in Python). Use Python's convention to store the values of indexes (columns `target_index`,`id_indices`,`categoric_indices`, and `value_indices`).
2. You may use the package [configparser](https://docs.python.org/3.7/library/configparser.html) to read and parse config.ini files
3. Change the value of separator to hold the actual character/string/regular expression used to separate columns: modify "" to "\s+" and "comma" to ",", and keep the semicolon as is. For example, the configuration file may specify _separator=comma_, but you should initialize the respective field as _separator=","_. 
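
The effect of the separator rule can be illustrated with a small lookup table. The names `SEPARATORS` and `to_delimiter` are hypothetical, and the sketch assumes that a semicolon is stored literally as `;` in the configuration file.

```python
import re

# hypothetical lookup table following the rule stated above
SEPARATORS = {'': r'\s+', 'comma': ','}

def to_delimiter(name):
    """Return the delimiter for a config value; other values (e.g. ';') are kept as is."""
    return SEPARATORS.get(name, name)

# an empty separator means "any run of whitespace", hence the regular expression
print(re.split(to_delimiter(''), '1.5   2\t3'))       # ['1.5', '2', '3']
print(re.split(to_delimiter('comma'), 'M,0.455,15'))  # ['M', '0.455', '15']
print(to_delimiter(';'))                              # ;
```

A regular expression is needed for the empty separator because whitespace-separated files may mix tabs and runs of spaces, which a single fixed character cannot match.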


In [9]:
import configparser # package for parsing configuration files
import os           # used to list the files in a dataset folder

def load_config_file(dataset_folder):
    """
    Read the config.ini file of the dataset in `dataset_folder` into a dictionary
    """
    config_file, data_file = get_conf_data_dir(dataset_folder)
    config = configparser.ConfigParser()
    config.read(config_file)
    return {
        "name": config.get('info', 'name'),
        # indices are 1-based in config.ini and 0-based in Python
        "target_index": config.getint('info', 'target_index') - 1,
        "id_indices": format_indices(config.get('info', 'id_indices')),
        "value_indices": format_indices(config.get('info', 'value_indices')),
        "categoric_indices": format_indices(config.get('info', 'categoric_indices')),
        "separator": separator_replace(config.get('info', 'separator')),
        "header_lines": config.getint('info', 'header_lines'),
        "config_file": config_file,
        "folder": dataset_folder,
        "data_file": data_file,
    }
### 
### YOUR CODE HERE
###

def format_indices(indices):
    """
    Convert a comma-separated string of 1-based indices into a list of 0-based integers
    """
    if not indices:
        return []
    return [int(i) - 1 for i in indices.split(',')]

def separator_replace(separator):
    """
    Map the separator named in config.ini to the actual delimiter (string or regex)
    """
    if separator == "":
        return "\\s+"
    elif separator == "comma":
        return ","
    else:
        return ";"

def get_conf_data_dir(data_dir):
    """
    Return the paths of the configuration file and of the data file
    (the .custom file if available, otherwise the .orig file)
    """
    config_file = ''
    data_file_cus = ''
    data_file_ori = ''
    for item in os.listdir(data_dir):
        if item.endswith('ini'):
            config_file = data_dir + item
        elif item.endswith('custom'):
            data_file_cus = data_dir + item
        elif item.endswith('orig'):
            data_file_ori = data_dir + item
    data_file = data_file_cus if data_file_cus else data_file_ori
    return config_file, data_file

For example, for the abalone dataset, your dictionary should contain the following values:

In [10]:
f = public+'5_1_4_complete.json'
!cat $f

{
    "name": "abalone.data",
    "target_index": 8,
    "id_indices": [],
    "value_indices": [
        0,
        1,
        2,
        3,
        4,
        5,
        6,
        7
    ],
    "categoric_indices": [
        0
    ],
    "separator": ",",
    "header_lines": 0,
    "config_file": "../resource/asnlib/publicdata/data/abalone/config.ini",
    "folder": "../resource/asnlib/publicdata/data/abalone/",
    "data_file": "../resource/asnlib/publicdata/data/abalone/abalone.data.orig"
}

Note that the corresponding `config.ini` file is the following: 

In [11]:
f = dataset_folder + 'config.ini'
!cat $f

[info]
name = abalone.data
info_url = http://archive.ics.uci.edu/ml/datasets/Abalone
data_url = http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
target_index = 9
id_indices =
value_indices = 1,2,3,4,5,6,7,8
categoric_indices = 1
separator = comma
header_lines = 0


In [12]:
#q3

# validation code
import json
dataset_dict1 = load_config_file(dataset_folder)
with open(public+'5_1_4_complete.json') as f:
    dataset_dict2 = json.load(f)

assert(dataset_dict1['name']==dataset_dict2['name'])
assert(dataset_dict1['target_index']==dataset_dict2['target_index'])
assert(dataset_dict1['id_indices']==dataset_dict2['id_indices'])
assert(dataset_dict1['categoric_indices']==dataset_dict2['categoric_indices'])
assert(dataset_dict1['separator']==dataset_dict2['separator'])
assert(dataset_dict1['header_lines']==dataset_dict2['header_lines'])
from filecmp import cmp, dircmp
assert(cmp(dataset_dict1['config_file'],dataset_dict2['config_file']))
assert(dircmp(dataset_dict1['folder'],dataset_dict2['folder']))
assert(cmp(dataset_dict1['data_file'],dataset_dict2['data_file']))

In [13]:
#q4

###
### AUTOGRADER TEST - DO NOT REMOVE
###
for folder in datasets_folders[1:]:
    c = load_config_file(public+'data/'+folder+'/')
    try:
        grade_dictionary(solution, c, 5, 1.4, '%s'%folder)
    except Exception as e:
        s = 'Encountered error while parsing dataset: ' + folder + ';\n' + str(e)
        raise(Exception(s))


Success!


5. Implement the function `load_data(dataset_dict)` which takes as input the dataset dictionary defined above, and adds a new field whose name is `data_original`. The field holds a dataframe of the corresponding datafile. Remove rows containing missing values (explicit missing values like `np.nan` or those denoted by "?" in the data). 
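
One possible way to handle the "?" marker is to let pandas treat it as a missing value while reading, and then drop the affected rows. The sketch below uses a small in-memory file whose contents are made up for illustration; it is an alternative to filtering out "?" after loading, not necessarily the approach the grader expects.

```python
import io

import pandas as pd

# toy file contents (assumption: not taken from any of the datasets)
raw = "M,0.455,15\nF,?,9\nI,0.33,7\n"

# na_values='?' turns every "?" field into NaN while the file is parsed
df = pd.read_csv(io.StringIO(raw), sep=',', header=None, na_values='?')
clean = df.dropna()

print(clean.shape)  # (2, 3)
```

A side effect of this choice is that a column containing "?" is still parsed as numeric, so no separate conversion step is needed for it.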

In [14]:
def load_data(dataset_dict):
    """
    Read the data file into dataset_dict['data_original'], dropping rows with missing values
    """
    data = pd.read_csv(
        dataset_dict['data_file'],
        sep=dataset_dict['separator'],
        header=None,
        skiprows=dataset_dict['header_lines'],
    )
    # remove rows that contain "?" (the null marker of these datasets) or NaN
    data = data[~data.isin(['?']).any(axis=1)].dropna()
    # convert columns to numeric where possible
    data = data.apply(pd.to_numeric, errors='ignore')
    dataset_dict['data_original'] = data
    return dataset_dict

###
### YOUR CODE HERE
###

fold="../resource/asnlib/publicdata/data/thyroid-disease-thyroid-0387/"
load_data(load_config_file(fold))

{'name': 'thyroid-disease-thyroid-0387.data',
 'target_index': 29,
 'id_indices': [30],
 'value_indices': [0,
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  17,
  18,
  19,
  20,
  21,
  22,
  23,
  24,
  25,
  26,
  27,
  28],
 'categoric_indices': [1,
  2,
  3,
  4,
  5,
  6,
  7,
  8,
  9,
  10,
  11,
  12,
  13,
  14,
  15,
  16,
  18,
  20,
  22,
  24,
  26,
  28],
 'separator': ',',
 'header_lines': 0,
 'config_file': '../resource/asnlib/publicdata/data/thyroid-disease-thyroid-0387/config.ini',
 'folder': '../resource/asnlib/publicdata/data/thyroid-disease-thyroid-0387/',
 'data_file': '../resource/asnlib/publicdata/data/thyroid-disease-thyroid-0387/thyroid-disease-thyroid-0387.data.orig.custom',
 'data_original':       0  1  2  3  4  5  6  7  8  9     ...         21 22    23 24   25 26  27  \
 167   40  F  f  f  f  f  f  f  f  f    ...        3.9  t  0.83  t    5  t  28   
 5256  35  F  f  f  f  f  f  t  f  f    ...       73.0  t  1.16  

For example, here are the first 5 rows in the parsed dataset `abalone`

display_example(public, 5, 1.5, 'abalone')

In [15]:
#q5

###
### AUTOGRADER TEST - DO NOT REMOVE
###
for folder in datasets_folders:
    dataset_dict = load_config_file(public+'data/'+folder+'/')
    load_data(dataset_dict)    
    try:
        print('Loading',folder,'...')
        grade_dataframe(solution, dataset_dict['data_original'], 5, 1.5, '%s'%folder)
    except Exception as e:
        print('Encountered error while parsing dataset',folder,'; expected')
        display_example(public, 5, 1.5, '%s'%folder)
        raise(e)
    
    
    


Loading thyroid-disease-thyroid-0387 ...
Success!
Loading post-operative-patient ...
Success!
Loading nursery ...
Success!
Loading skin-segmentation ...
Success!
Loading zoo ...
Success!
Loading ozone-level-detection-eight ...
Success!
Loading mammographic-mass ...
Success!
Loading heart-disease-cleveland ...
Success!
Loading balloons-a ...
Success!
Loading blood-transfusion-service-center ...
Success!
Loading thyroid-disease-new-thyroid ...
Success!
Loading mushroom ...
Success!
Loading wall-following-robot-navigation-24 ...
Success!
Loading breast-cancer-wisconsin-prognostic ...
Success!
Loading pima-indians-diabetes ...
Success!
Loading arrhythmia ...
Success!
Loading balloons-d ...
Success!
Loading ecoli ...
Success!
Loading teaching-assistant-evaluation ...
Success!
Loading balance-scale ...
Success!
Loading wine ...
Success!
Loading acute-inflammations-1 ...
Success!
Loading monks-problems-3 ...
Success!
Loading wall-following-robot-navigation-4 ...
Success!
Loading car-evaluatio

6. Implement the function `collect_stats(dataset_dict)` which takes as input a dataset loaded into a dictionary (as specified above), and returns a dictionary whose keys are: 'Name','n','p','p_cat','k'. The corresponding values contain the name of the dataset, the number of observations, number of features, number of categorical features, and the number of classes in each dataset. 
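
As a worked example of the five statistics, consider a toy dataset dictionary with three rows. The dictionary below is made up for illustration and only contains the fields that the statistics depend on.

```python
import pandas as pd

# toy dataset: columns 0-1 are features (column 0 is categorical), column 2 is the target
toy = {
    'name': 'toy.data',
    'target_index': 2,
    'value_indices': [0, 1],
    'categoric_indices': [0],
    'data_original': pd.DataFrame([['M', 0.4, 1], ['F', 0.5, 2], ['M', 0.3, 1]]),
}

stats = {
    'Name': toy['name'],
    'n': len(toy['data_original']),                             # observations
    'p': len(toy['value_indices']),                             # features
    'p_cat': len(toy['categoric_indices']),                     # categorical features
    'k': toy['data_original'][toy['target_index']].nunique(),   # classes
}
print(stats)  # {'Name': 'toy.data', 'n': 3, 'p': 2, 'p_cat': 1, 'k': 2}
```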

In [16]:
def collect_stats(dataset_dict):
    """
    Return the name, dimensions, and number of classes of a loaded dataset
    """
    df = dataset_dict["data_original"]
    stats = {}
    stats["Name"] = dataset_dict["name"]
    stats['n'] = df.shape[0]                                  # number of observations
    stats['p'] = len(dataset_dict['value_indices'])           # number of features
    stats['p_cat'] = len(dataset_dict['categoric_indices'])   # number of categorical features
    stats['k'] = df[dataset_dict['target_index']].nunique()   # number of classes
    return stats
###
### YOUR CODE HERE
###

For example, for the `abalone` dataset the following statistics are collected:

In [17]:
f = public + '5_1.6_abalone_complete.json'
!cat $f

{
    "Name": "abalone.data",
    "n": 4177,
    "p": 8,
    "p_cat": 1,
    "k": 28
}

Use the function `collect_stats` to collect statistics on the datasets whose names appear in the file `datasets_selection` (included in the public folder). The names are those given in the config.ini files; e.g. `breast-cancer-wisconsin.data` is the name of the file located in the folder `breast-cancer-wisconsin-original`. Save the statistics as the dataframe `df_stats` (the rest of the datasets define regression problems but we will focus on classification problems in this assignment).

Initialize the variable `classification_folders` to a list that holds the __folder names__ of the classification problems. For example, the first row in `datasets_selection` contains the name `acute-inflammations-1.data`, which is located in the folder `data/acute-inflammations-1`. The first element in the list `classification_folders` should be `acute-inflammations-1`.
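
The mapping from dataset names to folder names can be sketched as follows. The helper `name_to_folder` and the table `FOLDER_OVERRIDES` are hypothetical; the single override listed is the one mentioned above, and the sketch assumes that all other folder names equal the dataset name without the `.data` suffix.

```python
# hypothetical exceptions: dataset name (without ".data") -> folder name
FOLDER_OVERRIDES = {'breast-cancer-wisconsin': 'breast-cancer-wisconsin-original'}

def name_to_folder(name):
    """Map a dataset name such as 'zoo.data' to its folder name (hypothetical helper)."""
    name = name.strip()                  # drop the trailing newline, if any
    if name.endswith('.data'):
        name = name[:-len('.data')]      # remove the suffix only, not inner occurrences
    return FOLDER_OVERRIDES.get(name, name)

# lines as they would be read from a file, including the newline characters
lines = ['acute-inflammations-1.data\n', 'breast-cancer-wisconsin.data\n']
print([name_to_folder(line) for line in lines])
# ['acute-inflammations-1', 'breast-cancer-wisconsin-original']
```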

In [18]:
import os

# read the selected dataset names (one per line) and derive the folder names
with open(os.path.join(public, 'datasets_selection')) as f:
    classification_folders = [line.strip().replace('.data', '') for line in f]

# this dataset's folder name differs from the name stored in its config.ini
classification_folders[classification_folders.index('breast-cancer-wisconsin')] = 'breast-cancer-wisconsin-original'

# collect the statistics of each classification dataset, one row per dataset
rows = []
for cf in classification_folders:
    dataset_dict = load_data(load_config_file(os.path.join(public, 'data', cf) + '/'))
    rows.append(collect_stats(dataset_dict))
df_stats = pd.DataFrame(rows, columns=['Name', 'k', 'n', 'p', 'p_cat'])
###
### YOUR CODE HERE
###

For example, the first 3 classification datasets are: 'acute-inflammations-1', 'acute-inflammations-2', and 'balance-scale'.

The first few rows in the table `df_stats` are presented below:

In [19]:
display_example(public, 5, 1, 6)

                                    Name  k     n  p  p_cat
0             acute-inflammations-1.data  2   120  6      5
1             acute-inflammations-2.data  2   120  6      5
2                     balance-scale.data  3   625  4      0
3           banknote-authentication.data  2  1372  4      0
4  blood-transfusion-service-center.data  2   748  4      0


In [20]:
#q6

###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [40]:
#q7

###
### AUTOGRADER TEST - DO NOT REMOVE
###
grade_dataframe(solution, df_stats, 5, 1, 6, sort_cols=['Name'])


Success!
