## Classification Project 

### Everett Stenberg
### Antawn Weg
In this project, my group explores the uses of using logistic regressors and SVMs to classify the income of an individual given their census data. 

In [1]:
# Get the initial imports out 
import pandas 
import numpy
from sklearn.linear_model import LogisticRegression
from sklearn import metrics, svm
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler 
from time import time

Here we will make some initial initializations

In [2]:
dataframe = pandas.read_csv("data.csv",header=None)
feature_set = [] #find unique features to be one-hot encoded 
new_dataset = [] #beginnings of our final prepared data 

At this point, we wanted to reduce the number of one-hot encodings of the place of origin, since it may lead to overfitting or bad predictions. Here, we assigned general regions to the countries in order to reduce the feature size we will eventually have. 

In [3]:
US = [' United-States']

NA = [' Puerto-Rico',' Haiti',' Mexico',' Cuba',' Canada',' Amer-Indian-Eskimo',
      ' Haiti',' Dominican-Republic',' Jamaica',' Outlying-US(Guam-USVI-etc)']

ASIA = [' Iran',' Philippines',' Cambodia',' Thailand',' Laos',
        ' Taiwan',' China',' India',' Japan',' Vietnam',' Hong']

EU = [' England',' Germany',' Italy',' Poland',' France',' Portugal',' Yugoslavia',' Greece',' Hungary',
      ' Holand-Netherlands',' Scotland',' Ireland']

SA = [' Columbia',' Ecuador',' El-Salvador',' Honduras',' Guatemala',' Peru',' Nicaragua',' Trinadad&Tobago']

OTHER = [ ' Other',' South', ' ?']

regions = {'US':US,'NA':NA,'ASIA':ASIA,'EU':EU,'SA':SA,'OTHER':OTHER}

feature_set += list(regions.keys())

Similarly, we wanted to make education a number, which further reduced the size of features and provided more meaningful data than simply one-hot encoding a category of education. Approximate years of education has a cumulative, scalar affect, so I felt this better suited the data

In [4]:
education = {
    ' Bachelors': 17,
    ' HS-grad': 13,
    ' 11th': 12,
    ' Masters': 19,
    ' 9th': 10,
    ' Some-college': 15,
    ' Assoc-acdm': 15,
    ' Assoc-voc': 15,
    ' 7th-8th': 8.5,
    ' Doctorate': 23,
    ' Prof-school': 19,
    ' 5th-6th': 6.5,
    ' 10th': 11,
    ' 1st-4th': 3.5,
    ' Preschool': 0,
    ' 12th': 13}


## Find all one-hot encoding features 

Here, we'll go through our data and check all values of features that will need to be one hot encoded. If we find a unique value (i.e. a new member of the discrete set of relationship types), then well add it to the list of features 

In [5]:
for row in dataframe.iterrows():
    datapoint = list(row[1])
    if not datapoint[1] in feature_set:
        feature_set.append(datapoint[1])
    elif not datapoint[5] in feature_set:
        feature_set.append(datapoint[5])
    elif not datapoint[6] in feature_set:
        feature_set.append(datapoint[6])
    elif not datapoint[7] in feature_set:
        feature_set.append(datapoint[7])
    elif not datapoint[8] in feature_set:
        feature_set.append(datapoint[8])
        
    for r in regions:
        if datapoint[13] in regions[r]:
            datapoint[13] = r
        
    # Add to preparing data
    new_dataset.append(datapoint)

## Clean and Format the Dataset  

Here, we either properly cast the feature for storage in the dataset from the original, or we do a one-hot encoded feature lookup and apply the proper encoding

In [6]:
for i, dp in enumerate(new_dataset):
    
    # initialize a new dict  
    formatted_dp = {}
    # Add manual features 
    formatted_dp["age"] = int(dp[0])
    formatted_dp["fnlwgt"] = int(dp[2])
    formatted_dp["ed_num"] = int(dp[4])
    formatted_dp["gender"] = int((dp[9] == "Male"))
    formatted_dp["capital_gain"] = int(dp[10])
    formatted_dp["capital_loss"] = int(dp[11])
    formatted_dp["hours_per_week"] = int(dp[12])
    formatted_dp["over50k"] = int(dp[-1].strip()[0] == '>')
    # Translate education level
    formatted_dp["education_years"] = education[dp[3]] 
    
    # Add one-hot encoding features 
    for feat in feature_set:
        if feat in dp:
            formatted_dp[feat] = 1
        else:
            formatted_dp[feat] = 0   
    new_dataset[i] = formatted_dp

## Data encoding:

### Continuous Values 
    fnlwgt : The final number the census decided on (no idea??)
    ed_num : No idea what this represents 
    gender : 0 == male, 1 == female 
    
### One Hot Encoded  
    Occupation, Race, Region of Origin , Maritial Status, Relationship Status, Familty Status 

In [7]:
# A look at our dataset
new_dataset[7]

{'age': 52,
 'fnlwgt': 209642,
 'ed_num': 9,
 'gender': 0,
 'capital_gain': 0,
 'capital_loss': 0,
 'hours_per_week': 45,
 'over50k': 1,
 'education_years': 13,
 'US': 1,
 'NA': 0,
 'ASIA': 0,
 'EU': 0,
 'SA': 0,
 'OTHER': 0,
 ' State-gov': 0,
 ' Self-emp-not-inc': 1,
 ' Private': 0,
 ' Married-civ-spouse': 1,
 ' Prof-specialty': 0,
 ' Exec-managerial': 1,
 ' Married-spouse-absent': 0,
 ' Husband': 1,
 ' Never-married': 0,
 ' White': 1,
 ' Black': 0,
 ' Asian-Pac-Islander': 0,
 ' Adm-clerical': 0,
 ' Sales': 0,
 ' Craft-repair': 0,
 ' Transport-moving': 0,
 ' Farming-fishing': 0,
 ' Machine-op-inspct': 0,
 ' Divorced': 0,
 ' Separated': 0,
 ' Federal-gov': 0,
 ' Tech-support': 0,
 ' Local-gov': 0,
 ' Own-child': 0,
 ' ?': 0,
 ' Not-in-family': 0,
 ' Protective-serv': 0,
 ' Other-service': 0,
 ' Unmarried': 0,
 ' Married-AF-spouse': 0,
 ' Handlers-cleaners': 0,
 ' Wife': 0,
 ' Self-emp-inc': 0,
 ' Other-relative': 0,
 ' Widowed': 0,
 ' Amer-Indian-Eskimo': 0,
 ' Other': 0,
 ' Armed-Forc

Now, well compile our data, clean and format it, and turn it into train and test sets - which are numpy arrays

In [8]:
##### reorder the data correctly since it was in a dict 
ordered_data = []
lengths = {}

for d in new_dataset:  
    
    # Create the datapoint
    classification = d['over50k']
    d_as_l = [d[f] for f in d if not f == 'over50k']
    final_datapoint = [classification] + d_as_l
    
    # Add to our datalist
    ordered_data.append(final_datapoint)

    
# Create a numpy array
data_arr = numpy.array([numpy.array(l) for l in ordered_data])

# Shuffle the data
numpy.random.shuffle(data_arr)
x = int(.7 * len(data_arr))
# Pick 70% for training 
x_train, y_train, x_test, y_test = data_arr[:x,1:], data_arr[:x,0], data_arr[x:,1:], data_arr[x:,0]

### Lets make a logistics model to try to predict with

In [13]:
# Create a Logistic Classifier for the dataset 
l_model = LogisticRegression(max_iter=1000)
l_model.fit(x_train,y_train)
# Get the output
lreg_out = l_model.predict(x_test)
print(f"accuracy: {metrics.accuracy_score(lreg_out,y_test)}")
print(f"Confusion matrix:\n{metrics.confusion_matrix(y_test,lreg_out)}\n")

accuracy: 0.7930187327259699
Confusion matrix:
[[7140  247]
 [1775  607]]



I do not consider this to be a bad score! We can accurately predict 4 out of 5 times if someones median income is above or below 50k per year given only limited data features.

### Now, lets get an SVM Classifier

In [11]:
# We'll try several different kernels
svm_models = {'lin':svm.SVC(kernel='linear',gamma='auto'),'rbf':svm.SVC(kernel='rbf',gamma='auto'),'poly':svm.SVC(kernel='poly',gamma='auto')}

for model in svm_models:
    t1 = time()
    print(f'running {model} model',end=';\t')
    pipe = make_pipeline(StandardScaler(),svm_models[model])
    pipe.fit(x_train,y_train)
    print(f"Model fit in {(time()-t1):.2f}s , accuracy: ",end=' ')
    svm_out = pipe.predict(x_test)
    print(f'{model}: {metrics.accuracy_score(svm_out,y_test):.3f}')
    print(f"Confusion matrix:\n{metrics.confusion_matrix(y_test,svm_out)}\n")

running lin model;	Model fit in 71.12s , accuracy:  lin: 0.852
Confusion matrix:
[[6937  450]
 [ 999 1383]]

running rbf model;	Model fit in 43.98s , accuracy:  rbf: 0.852
Confusion matrix:
[[6960  427]
 [1023 1359]]

running poly model;	Model fit in 40.98s , accuracy:  poly: 0.839
Confusion matrix:
[[6920  467]
 [1105 1277]]



Our results seem pretty good! An ~85% percent accuracyrate is not too bad for the SVM model given our limited data! It clearly does a much better job of predicting the data versus the log regressor