__Datasets:__ The folder data contains 98 publicly available datasets from the UCI machine learning repository ([link](http://archive.ics.uci.edu/ml/index.php)). These datasets were collected and converted to a standard format by Dunn and Bertsimas (for more details see [link1](https://github.com/JackDunnNZ/uci-data) and [link2](http://jack.dunn.nz/papers/OptimalClassificationTrees.pdf)):
* Each dataset is stored in a separate folder
* Each folder contains a datafile and the configuration file config.ini specifying the data format
* Data files are stored in CSV format and their names end with either ".orig" or ".custom". If both files exist in a folder, use the file ending with ".custom"
* Each config.ini file contains information about a dataset:
    - separator: the character used to separate columns in the respective csv file
    - header_lines: the number of rows to be skipped in the datafile as these contain some information about the file but not data
    - target_index: the column number of the output variable
    - value_indices: the column numbers of the input variables
    - categoric_indices: column numbers of categorical data
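
The fields listed above can be read into Python structures with a few lines of code. The following is a minimal sketch, not a required implementation: the helper names `parse_indices` and `load_dataset_info` are hypothetical, and the sample text follows the config.ini layout using the values of the abalone dataset. It converts the 1-based column numbers into 0-based positions.

```python
import configparser

# Sample text in the config.ini layout (values of the abalone dataset).
SAMPLE_CONFIG = """
[info]
name = abalone.data
target_index = 9
id_indices =
value_indices = 1,2,3,4,5,6,7,8
categoric_indices = 1
separator = comma
header_lines = 0
"""


def parse_indices(raw):
    """Turn '1,2,3' (1-based) into [0, 1, 2] (0-based); an empty string gives []."""
    return [int(token) - 1 for token in raw.split(",") if token.strip()]


def load_dataset_info(text):
    """Hypothetical helper: parse config.ini text into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    info = parser["info"]
    return {
        "name": info["name"],
        "target": int(info["target_index"]) - 1,
        "values": parse_indices(info["value_indices"]),
        "categoric": parse_indices(info["categoric_indices"]),
        "separator": info["separator"],
        "header_lines": int(info["header_lines"]),
    }


info = load_dataset_info(SAMPLE_CONFIG)
print(info["target"], info["values"], info["categoric"])
```

Keeping the conversion in one place avoids scattering `-1` adjustments through the rest of the analysis.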
    



__Remarks__:
1. Notice that column numbering in the configuration files begins with 1 (versus 0 in Python)
2. You may use the package [configparser](https://docs.python.org/3.7/library/configparser.html) to read and parse config.ini files
3. The character "?" denotes a null value. After reading a data file, you may drop all lines that contain null values.
4. Out of the 98 datasets, use only the 54 datasets whose names are listed in the file "datasets_selection".
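
Remark 3 can be handled while the file is being read. The sketch below is one possible approach, assuming pandas is used for loading. The three rows are made up for illustration and do not come from any of the datasets. Passing `na_values="?"` makes pandas treat the question mark as missing, so `dropna` then removes the affected rows.

```python
import io

import pandas as pd

# Made-up rows for illustration only; two of them contain the "?" marker.
RAW = "5,1,?,2\n3,?,1,4\n6,2,2,2\n"

# na_values="?" tells pandas to read "?" as a missing value (NaN).
frame = pd.read_csv(io.StringIO(RAW), sep=",", header=None, na_values="?")

# Drop every row that contains at least one missing value.
clean = frame.dropna(axis=0)

print(len(frame), "rows read,", len(clean), "rows kept")
```

Without `na_values`, the "?" entries would be kept as ordinary strings and `dropna` would leave those rows in place.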



__Assignment__: compare the performance of the following classification algorithms on the 54 datasets: 
- Support vector machine, 
- Logistic Regression, 
- K-nearest neighbors, 
- Decision trees, 
- Quadratic discriminant analysis, 
- Random forests, and 
- AdaBoost
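
As a starting point, the seven algorithms can be placed in a dictionary and evaluated in the same way. The following is only a sketch under stated assumptions: the synthetic dataset, the default hyperparameters, and the choice of 5-fold cross-validation are illustrative and are not prescribed by the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for one of the datasets (an assumption for illustration).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

models = {
    "Support vector machine": SVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "K-nearest neighbors": KNeighborsClassifier(),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Mean accuracy over 5 folds; each fold is scored on rows the model did not see.
mean_accuracy = {
    name: cross_val_score(model, X, y, cv=5).mean() for name, model in models.items()
}

for name, score in sorted(mean_accuracy.items(), key=lambda item: -item[1]):
    print(f"{name:24s} {score:.3f}")
```

Using one loop for all models keeps the evaluation protocol identical across algorithms, which makes the comparison easier to defend in the report.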


Submit your solution as a Jupyter notebook and include any other files needed to replicate your analysis. In addition, submit a report (at most 4 pages long) that discusses your methodology, key findings, and the limitations of your analysis. Compare the use of ML methods in this project with typical ML applications.


__Tip:__ Start early. The assignment requires a substantial amount of file processing before you can run the learning algorithms and analyze the results.


## Datasets

In [1]:
!pwd

/c/Users/yaron.shaposhnik/Dropbox/Projects/Teaching/2018/BA/Homeworks/Mini-project 1 sklearn


In [2]:
!ls

Mini-project 1 (MSBA).ipynb
__notes__.txt
data
dataset_stats.csv
datasets_selection
homework 4.ipynb
notes_on_dataset.txt
results.csv
results_accuracy.csv
solution to homework 4 exercise 1.ipynb
solution to homework 4 exercise 2.ipynb
stats.csv
stats_all.csv
temp notes.txt
temp.csv


In [3]:
!ls data

abalone
acute-inflammations-1
acute-inflammations-2
arrhythmia
balance-scale
balloons-a
balloons-b
balloons-c
balloons-d
banknote-authentication
blood-transfusion-service-center
breast-cancer-wisconsin-diagnostic
breast-cancer-wisconsin-original
breast-cancer-wisconsin-prognostic
car-evaluation
chess-king-rook-vs-king
chess-king-rook-vs-king-pawn
climate-model-simulation-crashes
cnae-9
congressional-voting-records
connectionist-bench
connectionist-bench-sonar
contraceptive-method-choice
credit-approval
cylinder-bands
dermatology
echocardiogram
ecoli
fertility
flags
glass-identification
haberman-survival
hayes-roth
heart-disease-cleveland
heart-disease-hungarian
heart-disease-switzerland
heart-disease-va
hepatitis
hill-valley
hill-valley-noise
horse-colic
image-segmentation
indian-liver-patient
ionosphere
iris
lenses
letter-recognition
libras-movement
lung-cancer
magic-gamma-telescope
mammographic-mass
monks-problems-1
monks-problems-2
monks-problems-3
mushroom
nursery
optical-recogniti

In [4]:
!ls data/abalone/

abalone.data.orig
config.ini


In [5]:
!cat data/abalone/config.ini

[info]
name = abalone.data
info_url = http://archive.ics.uci.edu/ml/datasets/Abalone
data_url = http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
target_index = 9
id_indices =
value_indices = 1,2,3,4,5,6,7,8
categoric_indices = 1
separator = comma
header_lines = 0


In [6]:
import configparser
config = configparser.ConfigParser()
config.read('data/abalone/config.ini')

['data/abalone/config.ini']

In [7]:
config['info']['name']

'abalone.data'

In [8]:
config['info']['value_indices']

'1,2,3,4,5,6,7,8'

In [9]:
config['info']['target_index']

'9'

In [10]:
!head data/abalone/abalone.data.orig

M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
I,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8
F,0.53,0.415,0.15,0.7775,0.237,0.1415,0.33,20
F,0.545,0.425,0.125,0.768,0.294,0.1495,0.26,16
M,0.475,0.37,0.125,0.5095,0.2165,0.1125,0.165,9
F,0.55,0.44,0.15,0.8945,0.3145,0.151,0.32,19


In [11]:
config['info']['separator']

'comma'
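
The separator is stored as the word `comma` rather than as the character itself, so it has to be translated before it is passed to pandas. A minimal sketch follows, assuming a hypothetical lookup table `SEPARATOR_TOKENS` and helper `to_pandas_sep`. Only the `comma` token is confirmed by the output above; the other entry is an assumption. The two sample rows are the first two lines of the abalone file shown earlier.

```python
import io

import pandas as pd

# Hypothetical lookup table; only "comma" is confirmed by the abalone config.
SEPARATOR_TOKENS = {
    "comma": ",",
    ";": ";",  # assumed: a semicolon stored literally
}


def to_pandas_sep(token):
    """Translate a config.ini separator token into a pandas `sep` value."""
    try:
        return SEPARATOR_TOKENS[token]
    except KeyError:
        raise ValueError(f"unknown separator token: {token!r}") from None


# First two rows of abalone.data.orig, as printed above.
ROWS = (
    "M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15\n"
    "M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7\n"
)

frame = pd.read_csv(io.StringIO(ROWS), sep=to_pandas_sep("comma"), header=None)
print(frame.shape)
```

Raising an error for an unknown token is a design choice: it stops a file from being silently parsed with the wrong separator.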

In [12]:
# only work with datasets whose name is listed in the file "datasets_selection"
!cat datasets_selection

acute-inflammations-1.data
acute-inflammations-2.data
balance-scale.data
banknote-authentication.data
blood-transfusion-service-center.data
breast-cancer-wisconsin-diagnostic.data
breast-cancer-wisconsin.data
breast-cancer-wisconsin-prognostic.data
car-evaluation.data
chess-king-rook-vs-king-pawn.data
climate-model-simulation-crashes.data
congressional-voting-records.data
connectionist-bench.data
connectionist-bench-sonar.data
contraceptive-method-choice.data
credit-approval.data
cylinder-bands.data
dermatology.data
echocardiogram.data
fertility.data
haberman-survival.data
hayes-roth.data
heart-disease-cleveland.data
hepatitis.data
image-segmentation.data
indian-liver-patient.data
ionosphere.data
iris.data
mammographic-mass.data
monks-problems-1.data
monks-problems-2.data
monks-problems-3.data
optical-recognition-handwritten-digits.data
ozone-level-detection-eight.data
ozone-level-detection-one.data
parkinsons.data
pima-indians-diabetes.data
planning-relax.data
qsar-biodegradation.data
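
The entries in this list are file names without the ".orig" or ".custom" ending, so folder contents need to be normalized before they are compared with it. The sketch below shows one way to do this. The helper names `read_selection`, `pick_data_file`, and `dataset_name` are hypothetical, the sample selection holds only two names from the list above, and the ".orig.custom" naming of custom files is an assumption.

```python
# Two names copied from the selection list above (a shortened sample).
SELECTION_TEXT = "acute-inflammations-1.data\niris.data\n"


def read_selection(text):
    """Return the non-empty lines of the selection file as a set of names."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def pick_data_file(filenames):
    """Prefer the '.custom' file when both variants exist; None if neither does."""
    custom = [name for name in filenames if name.endswith(".custom")]
    orig = [name for name in filenames if name.endswith(".orig")]
    candidates = custom or orig
    return candidates[0] if candidates else None


def dataset_name(filename):
    """Strip a trailing '.custom' and then a trailing '.orig' from a file name."""
    for suffix in (".custom", ".orig"):
        if filename.endswith(suffix):
            filename = filename[: -len(suffix)]
    return filename


selection = read_selection(SELECTION_TEXT)
folder_listing = ["config.ini", "iris.data.orig"]
chosen = pick_data_file(folder_listing)
print(chosen, dataset_name(chosen) in selection)
```

Storing the selection as a set gives constant-time membership checks and ignores blank lines at the end of the file.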

In [13]:
import os
import pandas as pd
from sklearn import svm
from sklearn.preprocessing import LabelEncoder
from sklearn import linear_model
from sklearn import neighbors         
from sklearn import tree
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier

In [14]:
# Results table: one row of scores per dataset (index=[0] creates an initial all-NaN placeholder row)
Results=pd.DataFrame(index=[0],columns=["FileName","SVM_Score","LogisticRegression_Score","KNN_Score","DecisionTress_Score","QDA_Score","Randomforest_Score","AdaBoostScore"])
# read the names of the selected datasets, one per line
os.chdir('/Users/phoebezhu/Desktop/1.Python/Mini-project 1 sklearn (MSBA)')
with open('datasets_selection',mode='r',encoding='utf-8') as file_selection:
    files_to_read=file_selection.read().split('\n')
# the dataset folders are listed from inside the data directory
os.chdir('/Users/phoebezhu/Desktop/1.Python/Mini-project 1 sklearn (MSBA)/data')

In [15]:
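folder_names=!ls   # one folder per dataset (the working directory is the data folder)
for folder in folder_names:
    os.chdir(os.path.join(r'/Users/phoebezhu/Desktop/1.Python/Mini-project 1 sklearn (MSBA)/data',folder))
    y=!ls  #----eg: abalone.data.orig    config.ini----
    # pick the data file, preferring ".custom" over ".orig" when both exist
    work_file=None
    for name in y:
        if name.endswith(".custom"):
            work_file=name
        elif name.endswith(".orig") and work_file is None:
            work_file=name

    # compare the file name without its ".orig" / ".orig.custom" ending against the selection list
    if work_file is not None and work_file.split('.orig')[0] in files_to_read: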
        config = configparser.ConfigParser()
        config.read('config.ini')
        # config.ini numbers columns from 1, pandas from 0
        x_values=config['info']['value_indices'].split(',')
        x_values=[int(z)-1 for z in x_values]
        separator=config['info']['separator']
        # no header rows: header=None; otherwise the last header row becomes the column names
        if int(config['info']['header_lines'])==0:
            header_info=None
        else:
            header_info=int(config['info']['header_lines'])-1
    
    
        # translate the separator token from config.ini into the pattern pandas expects
        if separator=='' and work_file.split('.')[0] in ('seeds','connectionist-bench','monks-problems-1','monks-problems-2','monks-problems-3',
                'planning-relax','statlog-project-landsat-satellite',
                'thyroid-disease-ann-thyroid','statlog-project-german-credit'):
            use_separator=r'\s+'   # whitespace-separated files

        elif separator=='':
            use_separator='\t'

        elif separator=='comma':
            use_separator=','

        elif separator==' ':
            use_separator=r'\s+'

        elif separator==';':
            use_separator=';'
            
        file=pd.read_table(work_file,sep=use_separator,header=header_info)
        # NOTE: "?" is not passed as na_values here, so dropna only removes rows with values pandas treats as missing by default
        file.dropna(axis=0,inplace=True)
        predictors=file[file.columns[x_values]]
        y_values=int(config['info']['target_index'])
        target=file[file.columns[y_values-1]]
        # encode the class labels and any non-numeric predictor columns as integers
        lb=LabelEncoder()
        target_trans= lb.fit_transform(target)
        for column in predictors.columns:
            if predictors[column].dtype == object:
                predictors[column] = lb.fit_transform(predictors[column])

    

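        #-------Support vector machine-------
        # NOTE: every model below is scored on the same rows it was fitted on, so the scores are training accuracies
        clf = svm.SVC()
        clf.fit(predictors, target_trans)
        svm_score=clf.score(predictors, target_trans)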

#-------Logistic Regression------- 
        clf = linear_model.LogisticRegression() 
        clf.fit(predictors, target_trans)  
        logistic_score=clf.score(predictors, target_trans)

#-------K-nearest neighbors----------
        clf = neighbors.KNeighborsClassifier(1)                  
        clf.fit(predictors, target_trans)  
        KNN_score=clf.score(predictors, target_trans)  

#-------Decision trees--------------
        clf = tree.DecisionTreeClassifier(max_depth=4)              
        clf.fit(predictors, target_trans)  
        trees_score=clf.score(predictors, target_trans)  

#----------Quadratic discriminant analysis---------
        qda = QuadraticDiscriminantAnalysis(store_covariance=True)   # parameter was renamed from store_covariances
        qda.fit(predictors, target_trans)  
        QDA_score=qda.score(predictors, target_trans) 

#----------Random forests---------
        clf = RandomForestClassifier()
        clf.fit(predictors, target_trans)  
        randomforest_score=clf.score(predictors, target_trans) 

#----------AdaBoost-----------
        bdt = AdaBoostClassifier(tree.DecisionTreeClassifier(max_depth=4),
                         algorithm="SAMME",
                         n_estimators=200)

        bdt.fit(predictors, target_trans)
        adaboost_score=bdt.score(predictors, target_trans)

        # one-row frame with this dataset's scores, added to the running results table
        d={"FileName":work_file,"SVM_Score":svm_score,"LogisticRegression_Score":logistic_score,
        "KNN_Score":KNN_score,"DecisionTress_Score":trees_score,"QDA_Score":QDA_score,
        "Randomforest_Score":randomforest_score,"AdaBoostScore":adaboost_score}

        df=pd.DataFrame(data=d,index=[0])

        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        Results=pd.concat([Results,df])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  X2 = np.dot(Xm, R * (S ** (-0.5)))
  X2 = np.dot(Xm, R * (S ** (-0.5)))
  u = np.asarray([np.sum(np.log(s)) for s in self.scalings_])
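
The `SettingWithCopyWarning` above is raised because `predictors` is selected from `file` and then modified in place. One common way to avoid it is to take an explicit copy before encoding. The sketch below illustrates the idea on a small made-up frame whose first column mimics the categorical column of the abalone data; it is a suggestion and not the notebook's original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small made-up frame: one categorical column, one numeric column, one target.
frame = pd.DataFrame(
    {0: ["M", "F", "I", "M"], 1: [0.455, 0.53, 0.33, 0.44], 2: [15, 9, 7, 10]}
)

# .copy() makes predictors an independent frame, so assigning to its columns
# neither triggers the warning nor touches the original data.
predictors = frame[frame.columns[[0, 1]]].copy()

for column in predictors.columns:
    # Encode every non-numeric column as integer codes.
    if not pd.api.types.is_numeric_dtype(predictors[column]):
        encoder = LabelEncoder()
        predictors[column] = encoder.fit_transform(predictors[column].tolist())

print(predictors)
```

A fresh `LabelEncoder` per column also keeps each column's fitted classes separate from those of the target.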

In [16]:
Results

| | AdaBoostScore | DecisionTress_Score | FileName | KNN_Score | LogisticRegression_Score | QDA_Score | Randomforest_Score | SVM_Score |
|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 0 | 1.0 | 1.0 | acute-inflammations-1.data.orig.custom | 1.0 | 1.0 | 0.925 | 1.0 | 1.0 |
| 0 | 1.0 | 1.0 | acute-inflammations-2.data.orig.custom | 1.0 | 1.0 | 0.833333 | 1.0 | 1.0 |
| 0 | 1.0 | 0.8272 | balance-scale.data.orig | 1.0 | 0.8768 | 0.9168 | 0.9872 | 0.9184 |
| 0 | 1.0 | 0.962099 | banknote-authentication.data.orig | 1.0 | 0.990525 | 0.985423 | 1.0 | 1.0 |
| 0 | 0.879679 | 0.799465 | blood-transfusion-service-center.data.orig | 0.893048 | 0.770053 | 0.509358 | 0.919786 | 0.898396 |
| 0 | 1.0 | 0.982425 | breast-cancer-wisconsin-diagnostic.data.orig | 1.0 | 0.959578 | 0.973638 | 0.996485 | 1.0 |
| 0 | 1.0 | 0.972818 | breast-cancer-wisconsin.data.orig | 1.0 | 0.964235 | 0.945637 | 0.998569 | 0.997139 |
| 0 | 1.0 | 0.868687 | breast-cancer-wisconsin-prognostic.data.orig | 1.0 | 0.782828 | 0.989899 | 0.979798 | 1.0 |
| 0 | 0.997685 | 0.825231 | car-evaluation.data.orig | 1.0 | 0.696181 | 0.037616 | 0.998843 | 0.961227 |


In [20]:
Results.describe() 

| | AdaBoostScore | DecisionTress_Score | KNN_Score | LogisticRegression_Score | QDA_Score | Randomforest_Score | SVM_Score |
|---|---|---|---|---|---|---|---|
| count | 54.0 | 54.0 | 54.0 | 54.0 | 54.0 | 54.0 | 54.0 |
| mean | 0.984117 | 0.863871 | 0.989049 | 0.84124 | 0.784218 | 0.98534 | 0.939369 |
| std | 0.045454 | 0.124191 | 0.035599 | 0.136574 | 0.224951 | 0.022588 | 0.084882 |
| min | 0.723693 | 0.531313 | 0.7875 | 0.492424 | 0.037616 | 0.909091 | 0.65852 |
| 25% | 0.997775 | 0.784417 | 1.0 | 0.758417 | 0.724484 | 0.985056 | 0.915711 |
| 50% | 1.0 | 0.907475 | 1.0 | 0.876862 | 0.833098 | 0.994558 | 0.974943 |
| 75% | 1.0 | 0.964136 | 1.0 | 0.959895 | 0.952991 | 1.0 | 1.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
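
These scores were computed with `clf.score(predictors, target_trans)` on the same rows used for fitting, so they measure training accuracy. This helps explain why the 1-nearest-neighbor classifier has a median of 1.0: each training point is its own nearest neighbor. The sketch below contrasts training and held-out accuracy. It uses a synthetic dataset with some label noise and a 70/30 split, both of which are assumptions for illustration and not part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with 20% randomly flipped labels (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, flip_y=0.2, random_state=0
)

# Hold out 30% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_accuracy = knn.score(X_train, y_train)  # scored on rows used for fitting
test_accuracy = knn.score(X_test, y_test)     # scored on unseen rows

print(f"training accuracy: {train_accuracy:.3f}")
print(f"held-out accuracy: {test_accuracy:.3f}")
```

Reporting held-out or cross-validated accuracy would make the comparison between the seven algorithms more meaningful than the training scores shown above.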
