<a href="https://colab.research.google.com/github/ParsaHejabi/ComputationalIntelligence-ComputerAssignments/blob/main/HW2/Adult_Income_Dataset_SVM.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SVM on Adult Income Dataset

## Download Dataset with wget and unzip the downloaded file

In [None]:
!rm -f ./adult.zip ./adult.dat ./adult.csv
!wget https://sci2s.ugr.es/keel/dataset/data/missing/adult.zip
!unzip adult.zip

--2020-11-10 09:40:48--  https://sci2s.ugr.es/keel/dataset/data/missing/adult.zip
Resolving sci2s.ugr.es (sci2s.ugr.es)... 150.214.190.154
Connecting to sci2s.ugr.es (sci2s.ugr.es)|150.214.190.154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 634870 (620K) [application/zip]
Saving to: ‘adult.zip’


2020-11-10 09:40:50 (1.02 MB/s) - ‘adult.zip’ saved [634870/634870]

Archive:  adult.zip
  inflating: adult.dat               


## Import libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn import svm
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

## Convert the .dat file to CSV

In [None]:
csv_header = "Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country,Class\n"

# The first 19 lines of adult.dat are header (metadata) lines; data rows start at line index 19.
with open("adult.dat", "r") as f:
    lines = f.readlines()
with open("adult.csv", "w") as f:
    f.write(csv_header)
    for line_number, line in enumerate(lines):
        if line_number >= 19:
            f.write(line)
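
The conversion above skips a fixed number of header lines. As a hedged alternative, the sketch below looks for the `@data` marker that KEEL-style files place before the data rows, so it does not depend on the header length. The `sample_dat` text and the `keel_data_lines` helper are hypothetical illustrations, not part of the original notebook, and the real `adult.dat` header is longer.

```python
import io

# Hypothetical miniature KEEL-style file; the real adult.dat has more header lines.
sample_dat = """@relation adult
@attribute Age real [17.0, 90.0]
@attribute Class {>50K, <=50K}
@inputs Age
@outputs Class
@data
25, <=50K
38, >50K
"""

def keel_data_lines(handle):
    """Yield only the non-empty rows that follow the '@data' marker."""
    in_data = False
    for line in handle:
        if in_data:
            if line.strip():
                yield line
        elif line.strip().lower() == "@data":
            in_data = True

rows = list(keel_data_lines(io.StringIO(sample_dat)))
print(rows)
```

With a real file, the same generator could be fed an open file handle instead of the `io.StringIO` object.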

## Prepare Data

### Information about data columns

**Inputs**
* Age real [17.0, 90.0]

* Workclass {Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked}

* Fnlwgt real [12285.0, 1490400.0]

* Education {Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool}

* Education-num real [1.0, 16.0]

* Marital-status {Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse}

* Occupation {Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces}

* Relationship {Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried}

* Race {White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black}

* Sex {Female, Male}

* Capital-gain real [0.0, 99999.0]

* Capital-loss real [0.0, 4356.0]

* Hours-per-week real [1.0, 99.0]

* Native-country {United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands}

**Output**
* Class {>50K, <=50K}

### Open the CSV file in pandas

In [None]:
data = pd.read_csv('adult.csv')

assert (data.shape == (48842, 15)), "CSV file was not converted successfully!"
data.replace('?', np.nan, inplace=True)
data

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country,Class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### Sanitize the data

Two methods for handling missing (`NaN`) values:

* Use `sanitize_data('MFE')` to replace missing values with the most frequent value of their column.

* Use `sanitize_data('dropNA')` to drop the rows that contain missing values.

In [None]:
def sanitize_data(method='MFE'):
  """Handle missing values in the global `data` DataFrame in place."""
  if method == 'MFE':
    # Fill each column's missing entries with that column's most frequent value.
    columns_with_nan_value = data.columns[data.isna().any()].tolist()
    for column in columns_with_nan_value:
      data.loc[data[column].isnull(), column] = data[column].value_counts().idxmax()

    assert (data.columns[data.isna().any()].tolist() == [])
    assert (data.shape == (48842, 15)), "Sanitization error!"

  elif method == 'dropNA':
    # Drop every row that contains at least one missing value.
    data.dropna(inplace=True)

    assert (data.columns[data.isna().any()].tolist() == [])
    assert (data.shape == (45222, 15)), "Sanitization error!"
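
To make the two strategies concrete, here is a minimal sketch on a made-up four-row frame. The `toy` DataFrame and its values are hypothetical, and unlike `sanitize_data` this version returns new frames instead of modifying a global one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each categorical column.
toy = pd.DataFrame({
    "Workclass": ["Private", np.nan, "Private", "State-gov"],
    "Occupation": ["Sales", "Sales", np.nan, "Tech-support"],
    "Age": [25, 38, 28, 44],
})

# Most-frequent-value imputation: fill each column's gaps with its mode.
filled = toy.fillna(toy.mode().iloc[0])

# Row-dropping alternative: discard any row with at least one missing value.
dropped = toy.dropna()

print(filled)
print(dropped.shape)
```

Imputation keeps all four rows, while dropping keeps only the two complete ones, which mirrors the 48842 versus 45222 row counts asserted in `sanitize_data`.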

#### For now, let's use MFE

In [None]:
sanitize_data(method='MFE')

data

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country,Class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,Private,103497,Some-college,10,Never-married,Prof-specialty,Own-child,White,Female,0,0,30,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
48838,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
48839,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
48840,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


### Encode labels and features, then separate them

In [None]:
data['Class'] = data['Class'].map({'<=50K': 0, '>50K': 1})
Y = data['Class']

Y

0        0
1        0
2        1
3        1
4        0
        ..
48837    0
48838    1
48839    0
48840    0
48841    1
Name: Class, Length: 48842, dtype: int64

In [None]:
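X = data.drop(columns=['Class'])

# LabelEncoder is applied to every column, so numeric columns are also
# replaced by the rank of each value among the column's sorted unique values.
X = X.apply(LabelEncoder().fit_transform)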

X

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country
0,8,3,19329,1,6,4,6,3,2,1,0,0,39,38
1,21,3,4212,11,8,2,4,0,4,1,0,0,49,38
2,11,1,25340,7,11,2,10,0,4,1,0,0,39,38
3,27,3,11201,15,9,2,6,0,2,1,98,0,39,38
4,1,3,5411,15,9,4,9,3,4,0,0,0,29,38
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,10,3,21582,7,11,2,12,5,4,0,0,0,37,38
48838,23,3,10584,11,8,2,6,0,4,1,0,0,39,38
48839,41,3,10316,11,8,6,0,4,4,0,0,0,39,38
48840,5,3,16813,11,8,4,0,3,4,1,0,0,19,38
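
The encoded table shows that numeric columns changed too (for example, an age of 25 became 8). The sketch below reproduces that effect on a hypothetical two-column frame and shows one possible alternative that encodes only the non-numeric columns. This alternative is an assumption for illustration, not what the notebook does, and it would change the downstream results.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "Age": [25, 38, 28, 44],
    "Sex": ["Male", "Female", "Male", "Male"],
})

# Same call as in the notebook: every column is mapped to 0..n_unique-1
# following the sorted order of its unique values.
encoded_all = toy.apply(LabelEncoder().fit_transform)

# Alternative sketch: encode only the non-numeric columns, keep numbers as-is.
categorical = toy.select_dtypes(exclude="number").columns
encoded_cat = toy.copy()
encoded_cat[categorical] = toy[categorical].apply(LabelEncoder().fit_transform)

print(encoded_all)
print(encoded_cat)
```

Rank-encoding a numeric column preserves its ordering but discards the spacing between values, which matters for a distance-based model such as an SVM.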


### Split into training and test sets

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(34189, 14)
(14653, 14)
(34189,)
(14653,)


### Standardization of data

In [None]:
x_train_std = preprocessing.StandardScaler().fit_transform(x_train)

x_train_std

array([[-0.99623881, -0.0890132 ,  0.16928101, ..., -0.2079176 ,
        -0.03320714,  0.26021947],
       [ 1.84184356, -0.0890132 ,  1.20209691, ..., -0.2079176 ,
        -1.89611155,  0.26021947],
       [-0.70515344, -0.0890132 ,  0.96122819, ..., -0.2079176 ,
        -0.03320714, -5.04378777],
       ...,
       [-0.77792478, -0.0890132 ,  0.93492643, ..., -0.2079176 ,
        -0.03320714,  0.26021947],
       [ 0.53195939,  2.60972079, -1.13183798, ..., -0.2079176 ,
        -0.03320714,  0.26021947],
       [ 1.47798685, -0.0890132 , -0.06240097, ..., -0.2079176 ,
        -2.70606999,  0.26021947]])

In [None]:
x_test_std = preprocessing.StandardScaler().fit_transform(x_test)

x_test_std

array([[ 1.28073647, -0.09139615, -1.63028856, ..., -0.19834483,
        -0.03004513,  0.25846651],
       [-0.99258453, -0.09139615, -0.91876336, ..., -0.19834483,
        -0.03004513,  0.25846651],
       [ 0.32740831, -0.09139615, -0.16305109, ..., -0.19834483,
        -0.03004513,  0.25846651],
       ...,
       [-0.03925637, -2.79690697,  0.56660213, ..., -0.19834483,
        -0.03004513,  0.25846651],
       [ 1.64740115,  1.71227772,  0.32728695, ..., -0.19834483,
         0.79131919,  0.25846651],
       [ 1.06073766, -0.09139615, -1.21007074, ..., -0.19834483,
        -0.03004513,  0.25846651]])
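
The two cells above fit a separate `StandardScaler` on the training set and on the test set, so each set is scaled with its own mean and standard deviation. A common alternative is to fit the scaler on the training data only and reuse it for the test data. The sketch below shows that pattern on randomly generated arrays; `toy_train` and `toy_test` are hypothetical stand-ins, and applying this to the notebook would likely change the reported scores slightly.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for x_train and x_test.
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=40.0, scale=10.0, size=(100, 2))
toy_test = rng.normal(loc=40.0, scale=10.0, size=(30, 2))

# Learn the mean and standard deviation from the training data only.
scaler = StandardScaler().fit(toy_train)

# Apply the same training statistics to both sets.
toy_train_std = scaler.transform(toy_train)
toy_test_std = scaler.transform(toy_test)

print(toy_train_std.mean(axis=0))  # approximately zero by construction
print(toy_test_std.mean(axis=0))   # close to zero, but generally not exactly
```

Reusing the fitted scaler keeps test-set statistics out of preprocessing and puts both sets on the same scale the model was trained on.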

## Model creation

In [None]:
clf = svm.SVC()
clf.fit(x_train_std, y_train)
predictions = clf.predict(x_test_std)

## Results

In [None]:
print("SVM Model mean accuracy: ", clf.score(x_test_std, y_test))
print("SVM Model accuracy_score: ", accuracy_score(y_test, predictions))
print("SVM Model precision_score: ", precision_score(y_test, predictions))
print("SVM Model recall_score: ", recall_score(y_test, predictions))
print("SVM Model f1_score: ", f1_score(y_test, predictions))

SVM Model mean accuracy:  0.8605746263563775
SVM Model accuracy_score:  0.8605746263563775
SVM Model precision_score:  0.7728894173602854
SVM Model recall_score:  0.5701754385964912
SVM Model f1_score:  0.6562342251388188
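
As a quick consistency check on the scores above, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the two printed values; the variable names are chosen for illustration only.

```python
# Precision and recall copied from the output above.
precision = 0.7728894173602854
recall = 0.5701754385964912

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f1)  # approximately 0.656, in line with the reported f1_score
```

The gap between precision (about 0.77) and recall (about 0.57) indicates that the model misses a sizeable share of the `>50K` class even though overall accuracy is about 0.86.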
