# DS7331 Project 3
#### Group 2: Hollie Gardner, Cleveland Johnson, Shelby Provost
[Dataset Source](https://archive-beta.ics.uci.edu/ml/datasets/census+income)<br/>
[Github Repo](https://github.com/ShelbyP27/DS7331-Project)

In [1]:
#import libraries
import pandas as pd
import numpy as np
import os
import sklearn as sk
#import lazypredict

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# data preprocessing 
from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.pipeline import Pipeline

#pca and gridsearch
from sklearn.decomposition import PCA
from sklearn import datasets
from sklearn.model_selection import GridSearchCV

#prediction models
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
from sklearn.neighbors import KNeighborsClassifier

#exceptional work (working on this)
import math 
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from pandas.plotting import register_matplotlib_converters


## Data Preparation

### Loading and Prepping Data 


In [2]:
# Importing the census dataset using pandas
# Reading the CSV file after converting file to csv and removing superfluous spaces via Excel.
df = pd.read_csv('https://raw.githubusercontent.com/ShelbyP27/DS7331-Project/main/adult-data.csv')

# Getting a first look at the dataset
#df.head()

In [3]:
# Cleaning up the data set
df = df.replace(to_replace='?', value=np.nan)  # replace '?' with NaN (not a number)
df.dropna(inplace=True)  # remove rows with missing values
df.drop_duplicates(keep='first', inplace=True)  # remove duplicate rows (duplicated() only flags them)
df['income'] = df['income'].map({'<=50K': 0, '>50K': 1}).astype(int)  # binary-encode the response

In [4]:

# One-hot encoding the Categorical variables
if 'sex' in df:
    df['IsMale'] = df.sex == 'Male'
    df.IsMale = df.IsMale.astype(np.int64)
    del df['sex']
    
if 'marital-status' in df:
    tmp_df = pd.get_dummies(df['marital-status'], prefix = 'Marital')
    df = pd.concat((df, tmp_df), axis =1)
    del df['marital-status']
    
if 'relationship' in df:
    tmp_df = pd.get_dummies(df['relationship'], prefix = 'Rel')
    df = pd.concat((df, tmp_df), axis =1)
    del df['relationship']

if 'race' in df:
    tmp_df = pd.get_dummies(df['race'], prefix = 'Race')
    df = pd.concat((df, tmp_df), axis =1)
    del df['race']

if 'workclass' in df:
    tmp_df = pd.get_dummies(df['workclass'], prefix = 'Work')
    df = pd.concat((df, tmp_df), axis =1)
    del df['workclass']

if 'occupation' in df:
    tmp_df = pd.get_dummies(df['occupation'], prefix = 'Occupation')
    df = pd.concat((df, tmp_df), axis =1)
    del df['occupation']

if 'education' in df:
    tmp_df = pd.get_dummies(df['education'], prefix = 'Education')
    df = pd.concat((df, tmp_df), axis =1)
    del df['education']

    
# Replace native-country with a binary immigrant attribute (1 = not United-States)
if 'native-country' in df:
    df['immigrant'] = np.where(df['native-country']!= 'United-States', 1, 0)
    del df['native-country']

df2 = df.copy()  # independent copy; a plain `df2 = df` would only alias the same DataFrame
    

In [5]:
# Separate features from the response
if 'income' in df:
    y = df['income'].values
    del df['income']
    X = df.values

# Train / Test split with scaled_X
scaled_X = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(scaled_X, y, test_size = .2, random_state=1)


In [6]:
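# Reduced feature set; this cell overwrites the train/test split created in the previous cell
selected_features = ['age', 'hours-per-week', 'capital-loss', 'education-num',
                     'Marital_Never-married', 'Marital_Married-civ-spouse', 'Work_Private',
                     'Rel_Husband', 'capital-gain', 'Rel_Not-in-family']
Xnew = df[selected_features].values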

newscaled_X = StandardScaler().fit_transform(Xnew)
x_train, x_test, y_train, y_test = train_test_split(newscaled_X, y, test_size = .2, random_state=1)

## Business Understanding
*Describe the purpose of the data set you selected (i.e., why was this data collected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific dataset and the stakeholders' needs?*

--- INSERT EXPLANATION----

## Data Understanding
*Part One: Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?*

---- INSERT REVIEW OF ABOVE DATA PREP PROCESS----

*Part Two: Visualize any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.*

----INSERT REVIEW OF ATTRIBUTE SELECTION PROCESS. MAY NEED TO REVISIT #1 & #2----

## Modeling and Evaluation
*Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results.*
 

### Option A: Cluster Analysis

 - Perform cluster analysis using several clustering methods
 - How did you determine a suitable number of clusters for each method?
 - Use internal and/or external validation measures to describe and compare the clusterings and the clusters (some visual methods would be good).
 - Describe your results. What findings are the most interesting and why? 
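
One possible way to approach the "suitable number of clusters" question is to compare mean silhouette scores across candidate values of k. The sketch below is illustrative only: `X_demo` is synthetic data standing in for a scaled feature matrix such as `newscaled_X`, and `silhouette_by_k` is a hypothetical helper, not part of the notebook above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a scaled feature matrix (assumption, not the census data)
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])


def silhouette_by_k(X, k_values, random_state=1):
    """Fit k-means for each k and return {k: mean silhouette score}."""
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return scores


scores = silhouette_by_k(X_demo, range(2, 7))
best_k = max(scores, key=scores.get)  # k with the highest mean silhouette
print(scores)
print("best k:", best_k)
```

The silhouette score is an internal validation measure, so it can be computed without the `income` labels. The same loop could be reused for other clustering methods that expose `fit_predict`.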


### Option B: Association Rule Mining
 - Create frequent itemsets and association rules.
 - Use tables/visualization to discuss the found results.
 - Use several measures for evaluating how interesting different rules are.
 - Describe your results. What findings are the most compelling and why? 
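
As a rough illustration of the measures commonly used to rank rules, the sketch below computes support, confidence, and lift for rules between pairs of items. The `transactions` list is made up for the example, and `support` and `pair_rules` are hypothetical helpers; a full analysis would more likely rely on a dedicated frequent-itemset library.

```python
from itertools import combinations

# Hypothetical transactions: each set holds one record's categorical attribute values
transactions = [
    {"Married", "Private", "Bachelors"},
    {"Married", "Private"},
    {"Never-married", "Private"},
    {"Married", "Self-emp", "Bachelors"},
    {"Never-married", "Private", "Bachelors"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def pair_rules(transactions, min_support=0.4):
    """Return rules lhs -> rhs over item pairs with support, confidence, and lift."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        s_ab = support({a, b}, transactions)
        if s_ab < min_support:
            continue  # skip infrequent pairs
        for lhs, rhs in ((a, b), (b, a)):
            conf = s_ab / support({lhs}, transactions)
            lift = conf / support({rhs}, transactions)
            rules.append({"lhs": lhs, "rhs": rhs, "support": s_ab,
                          "confidence": conf, "lift": lift})
    return rules


rules = pair_rules(transactions)
for r in sorted(rules, key=lambda r: r["lift"], reverse=True):
    print(r)
```

A lift above 1 suggests the two items co-occur more often than they would if independent, while a lift near 1 suggests the rule carries little information even when its confidence is high.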

### Option C: Collaborative Filtering
 - Create user-item matrices or item-item matrices using collaborative filtering
 - Determine performance of the recommendations using different performance measures and explain what each measure means.
 - Use tables/visualization to discuss the found results. Explain each visualization in detail.
 - Describe your results. What findings are the most compelling and why? 
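
To make the item-item idea concrete, here is a minimal sketch that builds item vectors from a user-item dictionary, measures cosine similarity between items, and predicts a missing rating as a similarity-weighted average. The `ratings` data and the helper names (`item_vectors`, `cosine`, `predict`) are assumptions made for illustration only.

```python
import math

# Hypothetical user-item ratings (user -> {item: rating}); a missing key means "not rated"
ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5},
    "u3": {"A": 1, "B": 2, "C": 5},
    "u4": {"B": 1, "C": 4},
}


def item_vectors(ratings):
    """Invert the user-item dictionary into item -> {user: rating}."""
    items = {}
    for user, row in ratings.items():
        for item, r in row.items():
            items.setdefault(item, {})[user] = r
    return items


def cosine(v1, v2):
    """Cosine similarity over the users two items share (0.0 if none)."""
    common = set(v1) & set(v2)
    if not common:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in common)
    n1 = math.sqrt(sum(v1[u] ** 2 for u in common))
    n2 = math.sqrt(sum(v2[u] ** 2 for u in common))
    return dot / (n1 * n2)


def predict(user, item, ratings):
    """Similarity-weighted average of the user's ratings on the other items."""
    items = item_vectors(ratings)
    num = den = 0.0
    for other, r in ratings[user].items():
        if other == item:
            continue
        sim = cosine(items[item], items[other])
        num += sim * r
        den += abs(sim)
    return num / den if den else None


pred = predict("u2", "C", ratings)
print("predicted rating of C for u2:", pred)
```

Raw cosine similarity on non-negative ratings is never negative, so dissimilar items still pull the prediction toward the user's existing ratings. Mean-centering each user's ratings before computing similarity is a common refinement.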

## Deployment
 - Be critical of your performance and tell the reader how your current model might be usable by other parties. Did you achieve your goals? If not, can you rein in the utility of your modeling?
 - How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)?
 - How would you deploy your model for interested parties?
 - What other data should be collected?
 - How often would the model need to be updated, etc.? 

## Exceptional Work
 - You have free rein to provide additional analyses or combine analyses.