# DS7331 Project 2
#### Group 2: Hollie Gardner, Cleveland Johnson, Shelby Provost
[Dataset Source](https://archive-beta.ics.uci.edu/ml/datasets/census+income)<br/>
[Github Repo](https://github.com/ShelbyP27/DS7331-Project)

In [67]:
#import libraries
import pandas as pd
import numpy as np
import os

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# data preprocessing 
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.pipeline import Pipeline

#prediction models
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt

## Data Preparation: Part 1
*Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.*


### Loading and Prepping Data 


In [68]:
# Importing the census dataset using pandas
# Reading the CSV file after converting file to csv and removing superfluous spaces via Excel.
df = pd.read_csv('https://raw.githubusercontent.com/ShelbyP27/DS7331-Project/main/adult-data.csv')

# Getting a first look at the dataset
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [69]:
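# Cleaning up the dataset
df = df.replace(to_replace='?', value=np.nan)  # replace '?' with NaN (not a number)
df.dropna(inplace=True)  # remove rows with missing values
# drop_duplicates() removes rows; duplicated() only returns a boolean mask.
# Note: 'Unnamed: 0' looks like a unique row index, so no row is an exact duplicate while it is present.
df = df.drop_duplicates(keep='first')
df['income'] = df['income'].map({'<=50K': 0, '>50K': 1}).astype(int)  # encode the response as 0/1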

In [70]:

# One-hot encode Categorical 
if 'sex' in df:
    df['IsMale'] = df.sex == 'Male'
    df.IsMale = df.IsMale.astype(np.int64)
    del df['sex']
    
if 'marital-status' in df:
    tmp_df = pd.get_dummies(df['marital-status'], prefix = 'Marital')
    df = pd.concat((df, tmp_df), axis =1)
    del df['marital-status']
    
if 'relationship' in df:
    tmp_df = pd.get_dummies(df['relationship'], prefix = 'Rel')
    df = pd.concat((df, tmp_df), axis =1)
    del df['relationship']

if 'race' in df:
    tmp_df = pd.get_dummies(df['race'], prefix = 'Race')
    df = pd.concat((df, tmp_df), axis =1)
    del df['race']

if 'workclass' in df:
    tmp_df = pd.get_dummies(df['workclass'], prefix = 'Work')
    df = pd.concat((df, tmp_df), axis =1)
    del df['workclass']

if 'occupation' in df:
    tmp_df = pd.get_dummies(df['occupation'], prefix = 'Occupation')
    df = pd.concat((df, tmp_df), axis =1)
    del df['occupation']

if 'education' in df:
    tmp_df = pd.get_dummies(df['education'], prefix = 'Education')
    df = pd.concat((df, tmp_df), axis =1)
    del df['education']

    
# Replace native-country with a binary immigrant attribute (1 = not United-States)
if 'native-country' in df:
    df['immigrant'] = np.where(df['native-country']!= 'United-States', 1, 0)
    del df['native-country']
df.head()
    

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,income,IsMale,Marital_Divorced,Marital_Married-AF-spouse,...,Education_Assoc-acdm,Education_Assoc-voc,Education_Bachelors,Education_Doctorate,Education_HS-grad,Education_Masters,Education_Preschool,Education_Prof-school,Education_Some-college,immigrant
0,39,77516,13,2174,0,40,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0
1,50,83311,13,0,0,13,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0
2,38,215646,9,0,0,40,0,1,1,0,...,0,0,0,0,1,0,0,0,0,0
3,53,234721,7,0,0,40,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,28,338409,13,0,0,40,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
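
The per-column encoding blocks above repeat the same three steps, so they could be condensed into a single `pd.get_dummies` call. The following is a minimal sketch, not the notebook's actual code: the `toy` frame and its values are made up for illustration, only two of the categorical columns are shown, and `dtype=int` is an added assumption so the indicators come out as 0/1 integers.

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned census data.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'marital-status': ['Never-married', 'Married-civ-spouse', 'Divorced'],
    'race': ['White', 'White', 'Black'],
})

# Map each categorical column to the prefix used in the notebook.
prefixes = {'marital-status': 'Marital', 'race': 'Race'}

# get_dummies encodes the listed columns and drops the originals in one call.
encoded = pd.get_dummies(toy, columns=list(prefixes), prefix=prefixes, dtype=int)
print(encoded.columns.tolist())
```

Extending `prefixes` with the remaining columns (`relationship`, `workclass`, `occupation`, `education`) should reproduce the same column names as the longer version.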


In [72]:
#Other options for PCA and LDA for Dimension Reduction
#data_scaled = pd.DataFrame(preprocessing.scale(df),columns = df.columns) 

# PCA
#pca = PCA(n_components=1)
#pca.fit_transform(data_scaled)

# Dump components relations with features:
#print(pd.DataFrame(pca.components_,columns=data_scaled.columns,index = ['PC-1']))

#total_var = pca.explained_variance_ratio_.sum() * 100
#print("Total Variance Explained:", total_var,"%")

#from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
#data_scaled = pd.DataFrame(preprocessing.scale(df),columns = df.columns) 

#lda = LDA(n_components=1)
#X_lda = lda.fit(X, y).transform(X) # fit data and then transform it

#print the components
#print ('lda:', lda.scalings_.T)

#total_var = lda.explained_variance_ratio_.sum() * 100
#print("Total Variance Explained:", total_var,"%")


In [73]:
# Separate features from the response
if 'income' in df:
    y = df['income'].values
    del df['income']
    X = df.values

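# Train / test split (80% train, 20% test)
# Note: the scaler is fit here but never applied, so the models below train on unscaled features.
sc = StandardScaler()
sc.fit(X)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)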


In [81]:
#PCA for dimension reduction
#https://analyticsindiamag.com/principal-component-analysis-in-python/#:~:text=pca%20%3D%20PCA(n_components%20%3D%20number,explained_variance%20%3D%20pca.explained_variance_ratio_
from sklearn.decomposition import PCA
pca = PCA(n_components = 30)
pca.fit(x_train)
variance = pca.explained_variance_ratio_

print('Variance Explained:', variance*100,"%")


Variance Explained: [9.95232633e+01 4.75309942e-01 1.42389817e-03 1.55743850e-06
 1.21233683e-06 5.61936655e-08 5.62882845e-09 2.32987542e-09
 2.11375484e-09 1.99889284e-09 1.72827994e-09 1.54554245e-09
 1.47964899e-09 1.21458328e-09 1.20221359e-09 1.06260452e-09
 1.02849555e-09 9.38935512e-10 8.14531277e-10 7.69183120e-10
 7.50301864e-10 6.96035957e-10 6.40198709e-10 5.64005774e-10
 4.94967647e-10 4.52127370e-10 4.41754337e-10 4.03318240e-10
 3.94532137e-10 3.74967676e-10] %
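
The first component explains about 99.5% of the variance, most likely because PCA was run on unscaled features, so a large-magnitude column such as `fnlwgt` dominates. The sketch below illustrates the effect on synthetic data; the column scales are assumptions chosen to resemble the census attributes, and putting `StandardScaler` in front of `PCA` is a suggested variation rather than what the notebook does.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in: one large-scale column and two small-scale columns.
X_demo = np.column_stack([
    rng.normal(190_000, 100_000, 500),  # fnlwgt-like scale
    rng.normal(40, 12, 500),            # hours-per-week-like scale
    rng.normal(10, 2.5, 500),           # education-num-like scale
])

# PCA on raw values: the large-scale column absorbs almost all variance.
raw_ratio = PCA(n_components=3).fit(X_demo).explained_variance_ratio_

# PCA after standardization: each column contributes on an equal footing.
scaled_pca = make_pipeline(StandardScaler(), PCA(n_components=3)).fit(X_demo)
scaled_ratio = scaled_pca.named_steps['pca'].explained_variance_ratio_

print('unscaled:', np.round(raw_ratio * 100, 2))
print('scaled:  ', np.round(scaled_ratio * 100, 2))
```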


## Data Preparation: Part 2
*Describe the final dataset that is used for classification/regression (include a description of newly formed variables you created).*

--- INSERT DESCRIPTION HERE --- 

## Modeling and Evaluation: Part 1

*Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measures appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.*

--- INSERT EXPLANATION ---- 

## Modeling and Evaluation: Part 2

*Choose the method you will use for dividing your data into training and testing splits (i.e. are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate.*

--- INSERT EXPLANATION ---- 

## Modeling and Evaluation: Part 3

*Create three different classification/regression models (e.g. random forest, KNN, and SVM). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric.*


### Model 1: Logistic Regression

In [None]:
#Logistic Regression

lr = LogisticRegression(C=1.0, random_state=1, solver='lbfgs')

lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)

print('accuracy', mt.accuracy_score(y_test, y_pred))
print('confusion matrix\n', mt.confusion_matrix(y_test, y_pred))

In [None]:
# Print each feature's weight from the fitted model
weights = lr.coef_.T
variable_names = df.columns
for coef, name in zip(weights, variable_names):
    print(name, "has weight of", coef[0])

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')

weights = pd.Series(lr.coef_[0], index=df.columns)
weights.plot(kind='bar')
plt.show()



In [None]:
import plotly
plotly.offline.init_notebook_mode() # run at the start of every notebook

graph1 = {'x': df.columns,
          'y': weights,
       'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights'}  # no error bars are drawn

plotly.offline.iplot(fig)


In [None]:
import math
# Age Interpretation 
print(math.exp(-0.006799438255839271))
# Marital_Married-civ-spouse interpretation
print(math.exp(0.00037492480181653277))

### Initial Logistic Regression ### 
<p>Features are treated as important when their weights are furthest from 0, since those produce the greatest change in the model's output per unit change in the feature. In this initial logistic regression model with all attributes included, the features that appear to be important are age, hours-per-week, capital-loss, education-num, Marital_Never-married, Marital_Married-civ-spouse, Work_Private, Rel_Husband, capital-gain, and Rel_Not-in-family. For example, if the age of an individual increases by one unit, the estimated odds of having an income greater than 50k change by a factor of about 0.993 (exp(-0.0068)). In other words, the odds decrease by about 0.7%. In another example, if a person is a married civilian spouse, the estimated odds of having an income greater than 50k are about 0.04% higher than for someone who is not a married civilian spouse (exp(0.000375) ≈ 1.0004).</p>
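
The odds interpretation follows from exponentiating each logistic regression coefficient. As a small sketch, the two coefficients from the interpretation cell above are copied into a Series (the names `coefs`, `odds_ratio`, and `pct_change` are chosen here for illustration) and converted to odds ratios and percentage changes:

```python
import numpy as np
import pandas as pd

# Coefficients copied from the interpretation cell above.
coefs = pd.Series({
    'age': -0.006799438255839271,
    'Marital_Married-civ-spouse': 0.00037492480181653277,
})

odds_ratio = np.exp(coefs)           # multiplicative change in odds per one-unit increase
pct_change = (odds_ratio - 1) * 100  # the same change expressed in percent

print(pd.DataFrame({'odds_ratio': odds_ratio, 'pct_change': pct_change}))
```

The same two lines apply to the full coefficient vector, for example `np.exp(pd.Series(lr.coef_[0], index=df.columns))`, assuming `lr` has been fit on all columns of `df`.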

In [None]:
# Refit using only the ten features with the largest weights in the full model
top_features = ['age', 'hours-per-week', 'capital-loss', 'education-num',
                'Marital_Never-married', 'Marital_Married-civ-spouse',
                'Work_Private', 'Rel_Husband', 'capital-gain', 'Rel_Not-in-family']
Xnew = df[top_features].values

sc.fit(Xnew)  # note: the scaler is fit but not applied to the data
x_train, x_test, y_train, y_test = train_test_split(Xnew, y, test_size=0.2, random_state=1)
lr.fit(x_train, y_train)
y_pred = lr.predict(x_test)

print('accuracy', mt.accuracy_score(y_test, y_pred))
print('confusion matrix\n', mt.confusion_matrix(y_test, y_pred))

# Plot the weights of the refit model (the earlier `weights` held the full model's values)
graph1 = {'x': top_features,
          'y': lr.coef_[0],
          'type': 'bar'}

fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights'}

plotly.offline.iplot(fig)


In [None]:
# Capital Loss interpretation
math.exp(-0.001613973)

### Simplified Logistic Regression Model ###
<p>In this simplified logistic regression model, the features that appear to be the most important are age, capital-loss, and Marital_Married-civ-spouse. For example, for each one-unit increase in capital loss, the odds of having an income over 50k change by a factor of 0.998. In other words, the odds decrease by about 0.2%.</p>

### Model 2: SVM

In [None]:
#SVM

train_svm = SVC(kernel = 'rbf', gamma=.1, C=1.0)

train_svm.fit(x_train, y_train)

y_pred = train_svm.predict(x_test)

print('Accuracy: %.3f' % accuracy_score(y_test, y_pred))
print('Classification Error: %.3f' % (1 - (accuracy_score(y_test, y_pred))))

In [None]:
# Support vectors
train_svm.support_vectors_

### Non-linear SVM Model ###
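The support vectors of this model are the coordinates of the training observations that lie on or inside the margin around the decision boundary. Because an RBF kernel is used, that boundary is a hyperplane in the kernel's feature space and a non-linear surface in the original attribute space. Only these points determine the position and shape of the decision boundary, which is chosen to maximize the margin between the two classes while penalizing misclassified points.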
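
Printing the raw `support_vectors_` array says little by itself; the counts are usually more informative. The sketch below uses synthetic data from `make_classification` as a stand-in for the census features (an assumption for illustration) and the same `SVC` settings as the model above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)
svm_demo = SVC(kernel='rbf', gamma=0.1, C=1.0).fit(X_demo, y_demo)

# n_support_ holds the number of support vectors for each class.
print('support vectors per class:', svm_demo.n_support_)
# support_vectors_ has one row per support vector and one column per feature.
print('support vector matrix shape:', svm_demo.support_vectors_.shape)
# support_ holds the row indices of the support vectors in the training data.
print('fraction of training rows kept:', len(svm_demo.support_) / len(X_demo))
```

A large fraction of training rows kept as support vectors suggests heavily overlapping classes or an overly flexible kernel setting.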

### Explanation ###
<p>The initial logistic regression model, with all attributes included, resulted in a prediction accuracy of 78.6%. While this model was quick to fit and predict, a second logistic regression model was run with the top 10 attributes by weight. The second model, which includes the attributes age, hours-per-week, capital-loss, education-num, Marital_Never-married, Marital_Married-civ-spouse, Work_Private, Rel_Husband, capital-gain, and Rel_Not-in-family, resulted in a prediction accuracy of 79.9%. In addition to being more accurate, the second model is also faster, since fewer variables are included. Finally, the non-linear SVM model, with the same attributes as the second logistic regression model, resulted in a prediction accuracy of 83.6%. The non-linear SVM model gave the best prediction accuracy; the trade-off is that it takes longer to run than the logistic regression models. At prediction time the SVM only needs its support vectors rather than the full training set, although each prediction still costs more than a logistic regression prediction, which is a single weighted sum of the attributes.</p>
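
The speed and accuracy trade-off described above can be measured directly. This is a minimal sketch under stated assumptions: the data is synthetic (`make_classification`), the split is stratified, and both models are wrapped in a pipeline with `StandardScaler`, which differs from the notebook cells where the scaler is not applied. The timings are only indicative and will vary by machine.

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the ten selected census features.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1, stratify=y_demo)

# Scaling lives inside each pipeline so it is fit on the training split only.
models = {
    'logistic': make_pipeline(StandardScaler(), LogisticRegression(C=1.0, solver='lbfgs')),
    'rbf_svm': make_pipeline(StandardScaler(), SVC(kernel='rbf', gamma=0.1, C=1.0)),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_tr, y_tr)
    fit_s = time.perf_counter() - start

    start = time.perf_counter()
    y_hat = model.predict(X_te)
    predict_s = time.perf_counter() - start

    results[name] = {'accuracy': accuracy_score(y_te, y_hat),
                     'fit_s': fit_s, 'predict_s': predict_s}
    print(name, results[name])
```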

### Model 3: KNN

### Model 4: Random Forest

## Modeling and Evaluation: Part 4
*Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone who might use this model.*

--- INSERT EXPLANATION --- 

## Modeling and Evaluation: Part 5
*Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical methods.*

--- INSERT EXPLANATION --- 

## Modeling and Evaluation: Part 6
*Which attributes from your analysis are more important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.*

--- INSERT EXPLANATION --- 

## Deployment

*How useful is your model for interested parties (i.e. the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc?*

--- INSERT ANSWER --- 


## Exceptional Work

*One idea: Grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?*