# <span style="color:#900C3F;">Census Income Data Analysis</span>

### <span style="color:#5E5C5B;">In this notebook, we build several classification models on census income data, apply them to the data, and analyze the performance of each model.</span> 




### <span style="color:#900C3F;">Steps taken:</span> 

1. **Setting up the environment.**
2. **Preparing Data for Analysis.**
3. **Implementation and Model evaluation.**
4. **Comparing the performance of classification models.**

## <span style="color:#900C3F;">1. Setting up the environment</span>

**Importing required libraries**

Mainly used:
- **Apache Spark**, accessed through **pyspark**, for starting a **Spark session** and reading the data.
- **pandas** for data analysis and manipulation.
- **sklearn** for statistical modeling including classification.
- **plotly** for visualization.



In [None]:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
import pprint
from sklearn import metrics, svm
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
import plotly.graph_objects as go
import plotly.express as px

Creating a **Spark session**

In [None]:
spark = SparkSession.builder.master("local[*]").getOrCreate()

## <span style="color:#900C3F;">2. Preparing Data for Analysis</span>

**Preparing** the dataset, taken from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php), for the analysis.




In [None]:
#reading the file with spark session's method read.csv, converting the csv file to a dataframe
def readCSVtoDF(csvData):
    return spark.read.csv(csvData)

rawData = 'Data/adult.data'

CensusDF = readCSVtoDF(rawData)

AttributeList = CensusDF.columns

Attributes = {'_c0':'age', '_c1':'workclass', '_c2':'fnlwg', '_c3':'education', '_c4':'education_number', 
              '_c5':'marital_status', '_c6':'occupation', '_c7':'relationship', '_c8':'race', '_c9':'sex', 
              '_c10':'capital_gain', '_c11':'capital_loss', '_c12':'hours_per_week', '_c13':'native_country', 
              '_c14':'income'}

categorical_features = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex',
                       'native_country', 'income']
numerical_features = ['age', 'fnlwg', 'education_number', 'capital_gain', 'capital_loss', 'hours_per_week']

#changing the name of attributes with original attribute names
for col in AttributeList:
    CensusDF = CensusDF.withColumnRenamed(col, Attributes[col])
    
#converting dataframe to Pandas Dataframe
def convertDFtoPandas(df):
    return df.select('*').toPandas()
data = convertDFtoPandas(CensusDF)
dataUncleaned = convertDFtoPandas(CensusDF)

### <span>Dataset</span>

The **Census Income data**, extracted by Barry Becker from the 1994 Census database, is a **multivariate dataset** with two **attribute types**: integer and categorical.

**Number of instances and Number of attributes**

In [None]:
data.shape

**Categorical Features**

In [None]:
categorical_features

**Numerical Features**

In [None]:
numerical_features

**Uncleaned dataset**

For the **decision tree** classifiers, we use the uncleaned dataset, which contains some missing values.

In [None]:
dataUncleaned[:20]

**Preprocessing the data** takes several steps. First, we apply **ordinal encoding** to the categorical features listed above, such as workclass, marital_status and occupation. Second, we replace **missing values** with the most frequent value of the column. Third, we **scale** the numerical features listed above to the range [0, 1]. Fourth, we **clean** the data by dropping rows that still contain missing or infinite values. Encoding and scaling rely mainly on the **sklearn.preprocessing** package.
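
The imputation and scaling steps can be illustrated on a toy column. The following is a minimal sketch in plain Python, not the notebook's sklearn code; the helper names `impute_most_frequent` and `min_max_scale`, the `"?"` missing marker and the sample values are assumptions made for illustration.

```python
from collections import Counter

def impute_most_frequent(values, missing="?"):
    """Replace the missing marker with the most frequent non-missing value."""
    observed = [v for v in values if v != missing]
    most_frequent = Counter(observed).most_common(1)[0][0]
    return [most_frequent if v == missing else v for v in values]

def min_max_scale(values):
    """Linearly rescale numbers so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Hypothetical toy data, not taken from the census file.
workclass = ["Private", "?", "State-gov", "Private"]
hours = [20, 40, 60]

imputed = impute_most_frequent(workclass)
scaled = min_max_scale(hours)
print(imputed)
print(scaled)
```

Note that min-max scaling subtracts the column minimum before dividing by the range; dividing by the range alone only lands in [0, 1] when the minimum is 0.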

In [None]:
####### preprocessing data for categorical features #######
#NOTE: the codes compared below (8, 7, 14) are assumed to be the encoded missing-value
#category; check each encoder's categories_ attribute to confirm this for your copy of the data
#workclass
encoder_workclass = OrdinalEncoder()
data.workclass = encoder_workclass.fit_transform(data.workclass.values.reshape(-1, 1))
#mode() returns a Series, so take its first element to get the most frequent value
data.loc[data['workclass'] == 8, 'workclass'] = data['workclass'].mode()[0]

#marital_status
encoder_marital_status = OrdinalEncoder()
data.marital_status = encoder_marital_status.fit_transform(data.marital_status.values.reshape(-1, 1))
data.loc[data['marital_status'] == 7, 'marital_status'] = data['marital_status'].mode()[0]

#occupation
data.loc[data['occupation'].isnull(), 'occupation'] = data['occupation'].mode()[0]
encoder_occupation= OrdinalEncoder()
data.occupation= encoder_occupation.fit_transform(data.occupation.values.reshape(-1,1))
data.loc[data['occupation']== 14, 'occupation'] = data['occupation'].mode()[0]

#relationship
encoder_relationship = OrdinalEncoder()
data.relationship = encoder_relationship.fit_transform(data.relationship.values.reshape(-1, 1))
#no null in this one

#race
encoder_race = OrdinalEncoder()
data.race = encoder_race.fit_transform(data.race.values.reshape(-1, 1))
#no null in this one

#sex
encoder_sex = OrdinalEncoder()
data.sex = encoder_sex.fit_transform(data.sex.values.reshape(-1, 1))
#no null in this one


#native_country
encoder_native_country= OrdinalEncoder()
data.native_country= encoder_native_country.fit_transform(data.native_country.values.reshape(-1,1))
data.loc[data['native_country']== 41, 'native_country'] = data['native_country'].mode()[0]   #mode() returns a Series, so take its first element

#income
encoder_income = OrdinalEncoder()
data.income= encoder_income.fit_transform(data.income.values.reshape(-1, 1))
#no null in this one

#education--- should be changed later

encoder_education= OrdinalEncoder()
data.education= encoder_education.fit_transform(data.education.values.reshape(-1, 1))
#no null in this one


age_scaler = MinMaxScaler(feature_range=(0,1))     #ages to be between 0 and 1 
data.age =age_scaler.fit_transform(data.age.values.reshape(-1, 1))

fnlwg_scaler = MinMaxScaler(feature_range=(0,1))    
data.fnlwg =fnlwg_scaler.fit_transform(data.fnlwg.values.reshape(-1, 1))


education_number_scaler = MinMaxScaler(feature_range=(0,1))    
data.education_number=education_number_scaler.fit_transform(data.education_number.values.reshape(-1, 1))



#min-max scaling of the remaining numerical columns; the values are converted to numbers first
#because spark.read.csv without a schema returns every column as a string
for column in ['capital_gain', 'capital_loss', 'hours_per_week']:
    values = data[column].astype(float)
    data[column] = (values - values.min())/(values.max() - values.min())


def clean_dataset(df):
    assert isinstance(df, pd.DataFrame), "df needs to be a pd.DataFrame"
    df.dropna(inplace=True)
    indices_to_keep = ~df.isin([np.nan, np.inf, -np.inf]).any(axis=1)   #axis must be passed by keyword in recent pandas
    return df[indices_to_keep].astype(np.float64)
data = clean_dataset(data)

In [None]:
data[:20]

## <span style="color:#900C3F;">3. Implementation and Model evaluation</span>

We evaluate 6 classifiers with different methods; namely, **Decision tree using gain ratio, Decision tree using gini index, Naïve Bayes, Artificial neural networks with 1 hidden layer, Artificial neural networks with 2 hidden layers, Support vector machines**. The two decision trees are implemented by hand; for the other classifiers we mainly use the **sklearn** library. 

In [None]:
classifiersList = ['Decision tree using gain ratio', 'Decision tree using gini index', 'Naïve Bayes',
                  'Artificial neural networks with 1 hidden layer', 'Artificial neural networks with 2 hidden layers',
                  'Support vector machines']
accuracyComparison_holdout = []  #all the hold out accuracies for each classifier will be appended

### <span>Decision tree using gain ratio</span>
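
The gain ratio named in this heading divides the information gain of a split by the split information, that is, the entropy of the splitting attribute itself. Below is a minimal plain-Python sketch of that definition. The toy data and the helper name `gain_ratio` are assumptions for illustration, and the sketch is independent of the notebook's own functions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def gain_ratio(feature, target):
    """Information gain of splitting target by feature, divided by the split information."""
    total = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    # Entropy of the target that remains after the split, weighted by group size.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    information_gain = entropy(target) - conditional
    split_info = entropy(feature)
    # A feature with a single value carries no split information; score it as 0.
    return information_gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: the feature separates the two classes perfectly.
sex = ["F", "F", "M", "M"]
income = ["<=50K", "<=50K", ">50K", ">50K"]
print(gain_ratio(sex, income))
```

Dividing by the split information penalizes attributes with many distinct values, which would otherwise look attractive simply because they fragment the data.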

In [None]:
accuracy_gainRatio = []
usedMethods_gainRatio = ['Holdout Method', 'Bagging Ensemble Method']

In [None]:
#calculating the entropy of each column
def entropy(col): 
    values,counts = np.unique(col,return_counts = True) 
    entropy = np.sum([(-counts[i]/np.sum(counts))*np.log2(counts[i]/np.sum(counts)) for i in range(len(values))])
    return entropy

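#calculating the gain ratio of a split on split_attribute
def Gain(dataframe,split_attribute,target_name="income"):   #it will always be income
  
    total_entropy = entropy(dataframe[target_name])    #entropy of the target before the split
    
    #weighted entropy of the target inside each group produced by the split
    weights = dataframe[split_attribute].value_counts(normalize=True)
    group_entropy = dataframe.groupby(split_attribute)[target_name].apply(entropy)
    Weighted_Entropy = (weights*group_entropy).sum()
    
    #information gain, normalised by the entropy of the split attribute itself (gain ratio)
    Information_Gain = total_entropy - Weighted_Entropy
    split_info = entropy(dataframe[split_attribute])
    return Information_Gain/split_info if split_info > 0 else 0.0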

#ID3 algorithm for decision tree using gain 

def ID3(data,originaldata,features,target_attribute_name="income",parent_node = None):
    
    #if all the target column values are same, return that value! no need to grow the tree
    
    if len(np.unique(data[target_attribute_name])) <= 1:            
    
        return np.unique(data[target_attribute_name])[0]
      
   
    
    elif len(features) ==0:               # when the feature space is empty
        return parent_node
    
    
    #growing the tree
    
    else:
      
        #the default value for the parent node is that value that appears the most in that specific feature.
        parent_node = np.unique(data[target_attribute_name])[np.argmax(np.unique(data[target_attribute_name],return_counts=True)[1])]
        
        #calculating Gain for each feature
        item_values = [Gain(data,feature,target_attribute_name) for feature in features]
        
        #choosing the highest feature
        best_feature_index = np.argmax(item_values)
        best_feature = features[best_feature_index] #to be assigned as the root 
        
        #Tree structure
        
        tree = {best_feature:{}}
        
        
        # remove the best feature
        features = [i for i in features if i != best_feature]
        
        #Grow a branch under the root node for each possible value of the root node feature
        
        for value in np.unique(data[best_feature]):
            #Split the dataset along the value of the best-scoring feature, thereby creating sub-datasets
            sub_data = data.where(data[best_feature] == value).dropna()
            
            #Call the ID3 algorithm for each of those sub_datasets with the new parameters --> Here the recursion comes in!
            subtree = ID3(sub_data,originaldata,features,target_attribute_name,parent_node)
            
            #Add the sub tree, grown from the sub_dataset to the tree under the root node
            tree[best_feature][value] = subtree
            
        return(tree)    
    
# for future predictions a query of specified 
#features will be given and prediction function will go through the tree to find the result    
def predict(query,tree,default = 1): 
    
    for key in list(query.keys()): 
        if key in list(tree.keys()):
            #fall back to the default when the tree has no branch for this feature value
            try:
                result = tree[key][query[key]] 
            except KeyError:
                return default
            
            if isinstance(result,dict):
                return predict(query,result,default)

            else:
                return result

def test(data,tree):
    #Create new query instances by simply removing the target feature column from the original dataset and 
    #convert it to a dictionary
    queries = data.iloc[:,:-1].to_dict(orient = "records")
    
    #Create a empty DataFrame in whose columns the prediction of the tree are stored
    predicted = pd.DataFrame(columns=["predicted"]) 
    predicted.sort_index(inplace=True)
    
    #Calculate the prediction accuracy
    for i in range(len(data)):
        predicted.loc[i,"predicted"] = predict(queries[i],tree) 
        

    columns = ['income']
    testdf=pd.DataFrame(data["income"], columns=columns)    
    accuracy = (np.sum(predicted["predicted"] == testdf["income"])/len(data)*100)
    accuracyComparison_holdout.append(accuracy)
    accuracy_gainRatio.append(accuracy) 
    print()
    return accuracy

#### <span>Holdout Method</span>

In [None]:
#holdout
def holdout_entropy(dataset):
    training_data,testing_data = train_test_split(dataset,test_size=0.2, random_state=100)
    training_data_hold = training_data.reset_index(drop=True)
    testing_data_hold = testing_data.reset_index(drop=True)
    tree = ID3(training_data_hold,training_data_hold,training_data_hold.columns[:-1])
    return test(testing_data_hold,tree)
resultGain = holdout_entropy(dataUncleaned)
our_error_ID3 = 100 - resultGain
resultGain

#### <span>Bagging Ensemble Method</span>
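
The cell below trains one tree per slice of the data and averages the resulting accuracies. For comparison, textbook bagging draws bootstrap samples (sampling rows with replacement) and combines the models' predictions by majority vote. A minimal plain-Python sketch of those two ingredients follows; the helper names `bootstrap_sample` and `majority_vote` and the toy inputs are assumptions for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and others are left out."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the label predicted by most models."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)      # fixed seed so the draw is repeatable
rows = list(range(10))      # hypothetical row indices standing in for data rows
bags = [bootstrap_sample(rows, rng) for _ in range(3)]
for bag in bags:
    print(sorted(bag))

# Three hypothetical model predictions for one test row.
print(majority_vote([">50K", "<=50K", ">50K"]))
```

In a full implementation, one model would be trained on each bag and `majority_vote` would be applied to their predictions for every test row.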

In [None]:
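#bagging
#note: each bag is a disjoint slice of the data rather than a bootstrap sample
def bagging_entropy(dataset,n):
    result=0
    rnge =int(len(dataset)/n)    #size of each bag
    for i in range(n):
        start = i*rnge
        end= start+ rnge
        bag = dataset[start:end]    #keep dataset intact so every bag is cut from the full data
        training_data,testing_data = train_test_split(bag,test_size=0.2, random_state=n*100)
        training_data= training_data.reset_index(drop=True)
        testing_data= testing_data.reset_index(drop=True)
        tree = ID3(training_data,training_data,training_data.columns[:-1])
        accuracy_for_each_bag = test(testing_data,tree)
        #test() also records each score in the holdout lists, so remove those entries again
        accuracyComparison_holdout.pop()
        accuracy_gainRatio.pop()
        result+=accuracy_for_each_bag
    accuracy = result/n    #mean accuracy over the n bags
    accuracy_gainRatio.append(accuracy)
    print(accuracy)
bagging_entropy(dataUncleaned,3)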

#### <span>Which method gives the most accurate value for the decision tree classifier using gain ratio?</span>

In [None]:
def accuracyPlot_gainRatio(methodList, accuracyList):
    fig = go.Figure(data=[go.Pie(labels=methodList, values=accuracyList)])
    return fig.update_traces(hoverinfo='label+value', textinfo='percent', marker=dict(colors= ['darkseagreen', 'darkgreen']))
accuracyPlot_gainRatio(usedMethods_gainRatio, accuracy_gainRatio).show()

### <span>Decision tree using gini index</span>
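
The Gini index of a set of labels is one minus the sum of the squared class proportions, and a split is scored by how much it lowers the size-weighted Gini index of the resulting groups. A minimal plain-Python sketch of both quantities follows; the helper names `gini_impurity` and `gini_reduction` and the toy labels are assumptions for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means the set is pure."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

def gini_reduction(feature, target):
    """Gini impurity of the target minus the weighted impurity after splitting by feature."""
    total = len(target)
    groups = {}
    for f, t in zip(feature, target):
        groups.setdefault(f, []).append(t)
    weighted = sum(len(g) / total * gini_impurity(g) for g in groups.values())
    return gini_impurity(target) - weighted

# Hypothetical toy labels with two equally frequent classes.
labels = ["<=50K", "<=50K", ">50K", ">50K"]
print(gini_impurity(labels))
print(gini_reduction(["F", "F", "M", "M"], labels))   # feature separates the classes
print(gini_reduction(["F", "M", "F", "M"], labels))   # feature tells nothing about the class
```

A feature that separates the classes removes all impurity, while an uninformative feature leaves the impurity unchanged.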

In [None]:
accuracy_giniIndex = []
usedMethods_giniIndex = ['Holdout Method', 'Bagging Ensemble Method']

In [None]:
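## decision tree by gini 

#Gini impurity of a column: 1 minus the sum of squared class proportions
def gini(col): 
    values,counts = np.unique(col,return_counts = True) 
    gini = 1 - np.sum([(counts[i]/np.sum(counts))**2 for i in range(len(values))])
    return gini

#calculating the gini difference
def ginidiff(dataframe,split_attribute,target_name="income"):   #it will always be income
  
    total_gini = gini(dataframe[target_name])    #Gini impurity of the target before the split
    
    #weighted Gini impurity of the target inside each group produced by the split
    weights = dataframe[split_attribute].value_counts(normalize=True)
    group_gini = dataframe.groupby(split_attribute)[target_name].apply(gini)
    Weighted_gini = (weights*group_gini).sum()
    
    #reduction in impurity achieved by the split
    gini_diff = total_gini - Weighted_gini
    return gini_diff  

#tree construction with the gini difference as split criterion (named C45 throughout this notebook)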
def C45(data,originaldata,features,target_attribute_name="income",parent_node = None):
    
    #if all the target column values are same, return that value! no need to grow the tree
    
    if len(np.unique(data[target_attribute_name])) <= 1:            
    
        return np.unique(data[target_attribute_name])[0]
      
   
    
    elif len(features) ==0:               # when the feature space is empty
        return parent_node
    
    
    #growing the tree
    
    else:
      
        #the default value for the parent node is that value that appears the most in that specific feature.
        parent_node = np.unique(data[target_attribute_name])[np.argmax(np.unique(data[target_attribute_name],return_counts=True)[1])]
        
        #calculating Gain for each feature
        item_values = [ginidiff(data,feature,target_attribute_name) for feature in features]
        
        #choosing the highest feature
        best_feature_index = np.argmax(item_values)
        best_feature = features[best_feature_index] #to be assigned as the root 
        
        #Tree structure
        
        tree = {best_feature:{}}
        
        
        # remove the best feature
        features = [i for i in features if i != best_feature]
        
        #Grow a branch under the root node for each possible value of the root node feature
        
        for value in np.unique(data[best_feature]):
            #Split the dataset along the value of the best-scoring feature, thereby creating sub-datasets
            sub_data = data.where(data[best_feature] == value).dropna()
            
            #Call the C45 algorithm for each of those sub_datasets with the new parameters --> Here the recursion comes in!
            subtree = C45(sub_data,originaldata,features,target_attribute_name,parent_node)
            
            #Add the sub tree, grown from the sub_dataset to the tree under the root node
            tree[best_feature][value] = subtree
            
        return(tree)    

# for future predictions a query of specified 
#features will be given and prediction function will go through the tree to find the result    
def predict_gini(query,tree,default = 1): 
    
    for key in list(query.keys()): 
        if key in list(tree.keys()):
            #fall back to the default when the tree has no branch for this feature value
            try:
                result = tree[key][query[key]] 
            except KeyError:
                return default
            
            if isinstance(result,dict):
                return predict_gini(query,result,default)

            else:
                return result
            
def test_gini(data,tree):
    #Create new query instances by simply removing the target feature column from the original dataset and 
    #convert it to a dictionary
    queries = data.iloc[:,:-1].to_dict(orient = "records")
    
    #Create a empty DataFrame in whose columns the prediction of the tree are stored
    predicted = pd.DataFrame(columns=["predicted"]) 
    predicted.sort_index(inplace=True)
    
    #Calculate the prediction accuracy
    for i in range(len(data)):
        predicted.loc[i,"predicted"] = predict_gini(queries[i],tree) 
        

    columns = ['income']
    testdf=pd.DataFrame(data["income"], columns=columns)
    accuracy = (np.sum(predicted["predicted"] == testdf["income"])/len(data)*100)
    accuracyComparison_holdout.append(accuracy)
    accuracy_giniIndex.append(accuracy)
    return accuracy

#### <span>Holdout Method</span>

In [None]:
def holdout_gini(dataset):
    training_data,testing_data = train_test_split(dataset,test_size=0.2, random_state=100)
    training_data_gini = training_data.reset_index(drop=True)
    testing_data_gini = testing_data.reset_index(drop=True)
    tree = C45(training_data_gini,training_data_gini,training_data_gini.columns[:-1])
    return test_gini(testing_data_gini,tree)
resultGini = holdout_gini(dataUncleaned)
our_error_C45 = 100 - resultGini
resultGini

#### <span>Bagging Ensemble Method</span>

In [None]:
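#bagging
#note: each bag is a disjoint slice of the data rather than a bootstrap sample
def bagging_gini(dataset,n):
    result=0
    rnge =int(len(dataset)/n)    #size of each bag
    for i in range(n):
        start = i*rnge
        end= start+ rnge
        bag = dataset[start:end]    #keep dataset intact so every bag is cut from the full data
        training_data,testing_data = train_test_split(bag,test_size=0.2, random_state=n*100)
        training_data= training_data.reset_index(drop=True)
        testing_data= testing_data.reset_index(drop=True)
        tree = C45(training_data,training_data,training_data.columns[:-1])
        accuracy_for_each_bag = test_gini(testing_data,tree)
        #test_gini() also records each score in the holdout lists, so remove those entries again
        accuracyComparison_holdout.pop()
        accuracy_giniIndex.pop()
        result+=accuracy_for_each_bag
    accuracy = result/n    #mean accuracy over the n bags
    accuracy_giniIndex.append(accuracy)
    print(accuracy)
bagging_gini(dataUncleaned,3)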

#### <span>Which method gives the most accurate value for the decision tree classifier using gini index?</span>

In [None]:
def accuracyPlot_giniIndex(methodList, accuracyList):
    fig = go.Figure(data=[go.Pie(labels=methodList, values=accuracyList)])
    return fig.update_traces(hoverinfo='label+value', textinfo='percent', marker=dict(colors= ['indianred', 'darkred']))
accuracyPlot_giniIndex(usedMethods_giniIndex, accuracy_giniIndex).show()

### <span>Naïve Bayes</span>

In [None]:
accuracy_naiveBayes = []
usedMethods_naiveBayes = ['Holdout Method', 'Cross-Validation Method', 'Bagging Ensemble Method', 'Boosting Ensemble Method']

#### <span>Holdout Method</span>

In [None]:
def nb_holdout(dataset):
    x = dataset.drop('income', axis=1).values 
    y = dataset['income'].values
    X_train_hold, X_test_hold, Y_train_hold, Y_test_hold = train_test_split(x, y, test_size=0.2, random_state=100)
    model_name = 'Naive Bayes Classifier'
    nb_model = GaussianNB()
    nb_model.fit(X_train_hold,Y_train_hold)
    y_pred = nb_model.predict(X_test_hold) 
    accuracy = metrics.accuracy_score(Y_test_hold, y_pred)*100
    accuracyComparison_holdout.append(accuracy)
    accuracy_naiveBayes.append(accuracy)
    print("Accuracy by Naïve Bayes via hold_out method:", accuracy,"%")
    return accuracy

resultnb = nb_holdout(data)
our_error_NB = 100 - resultnb


#### <span>Cross-Validation Method</span>
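
In k-fold cross-validation the rows are partitioned into k folds; each fold serves once as the test set while the remaining folds form the training set, and the k accuracies are averaged. The sketch below shows only the index bookkeeping in plain Python. The helper name `k_fold_indices` is an assumption for illustration, and unlike the `KFold(shuffle=True)` call used in this notebook it does not shuffle the rows.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each of n_splits contiguous folds."""
    indices = list(range(n_samples))
    # The first n_samples % n_splits folds get one extra row so every row is used once.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Every row appears in exactly one test fold, so each observation is used for evaluation once and for training k - 1 times.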

In [None]:
def nb_crossvalidation(dataset):
    x = dataset.drop('income', axis=1).values 
    y = dataset['income'].values
    cv = KFold(n_splits=10, random_state=1, shuffle=True)
    model_name = 'Naive Bayes Classifier'
    nb_model = GaussianNB()
    scores = cross_val_score(nb_model, x, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # report performance
    accuracy = (np.mean(scores))*100
    accuracy_naiveBayes.append(accuracy)
    print("Accuracy by Naïve Bayes via crossvalidation method: " , (np.mean(scores))*100, '%')
nb_crossvalidation(data)

#### <span>Bagging Ensemble Method</span>

In [None]:
def nb_bagging(dataset, n):
    result =0
    rnge =int(len(dataset)/n)    #size of each bag
    for i in range (n):
        start = i*rnge
        end= start+ rnge
        bag = dataset[start:end]    #keep dataset intact so every bag is cut from the full data
        x = bag.drop('income', axis=1).values 
        y = bag['income'].values
        X_train_hold, X_test_hold, Y_train_hold, Y_test_hold = train_test_split(x, y, test_size=0.2, random_state=n*100)
        model_name = 'Naive Bayes Classifier'
        nb_model = GaussianNB()
        nb_model.fit(X_train_hold,Y_train_hold)
        y_pred = nb_model.predict(X_test_hold)   
        result +=metrics.accuracy_score(Y_test_hold, y_pred)
    accuracy = (result/n)*100    #mean accuracy over the n bags
    accuracy_naiveBayes.append(accuracy)
    print("Accuracy by Naïve Bayes via bagging method:",accuracy,'%')
nb_bagging(data,4)

#### <span>Boosting Ensemble Method</span>
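
Boosting trains weak learners one after another and re-weights the training rows so that later learners focus on the rows earlier ones got wrong. The sketch below is a simplified, plain-Python illustration of a single re-weighting round in the style of SAMME, the variant selected by `algorithm='SAMME'` in the cell that follows. The function name `samme_round` and the toy labels are assumptions for illustration, and the sketch assumes the weighted error lies strictly between 0 and 1.

```python
import math

def samme_round(weights, y_true, y_pred, n_classes=2, learning_rate=1.0):
    """One boosting round: return the learner weight alpha and the updated sample weights."""
    total = sum(weights)
    # Weighted error of the current weak learner.
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = learning_rate * (math.log((1 - error) / error) + math.log(n_classes - 1))
    # Increase the weight of misclassified samples, then renormalise to sum to 1.
    updated = [w * math.exp(alpha) if t != p else w
               for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]

# Hypothetical labels and predictions: one mistake out of four samples.
y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 0]
alpha, new_weights = samme_round([0.25] * 4, y_true, y_pred)
print(alpha, new_weights)
```

After the round, the single misclassified row carries more weight than each correctly classified row, so the next weak learner is pushed to get it right.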

In [None]:
def nb_boosting(dataset):
    x = dataset.drop('income', axis=1).values 
    y = dataset['income'].values
    model_name = 'Naive Bayes Classifier'
    nb_model = GaussianNB()
    #the base model is passed positionally because its keyword name differs between
    #scikit-learn releases (base_estimator in older ones, estimator in newer ones)
    AdaBoost = AdaBoostClassifier(nb_model,n_estimators=400,learning_rate=1, algorithm='SAMME')
    AdaBoost.fit(x,y)
    prediction = AdaBoost.score(x,y)    #note: scored on the training data itself
    accuracy = prediction*100
    accuracy_naiveBayes.append(accuracy)
    print("Accuracy by Naïve Bayes via boosting method:",accuracy,'%')
nb_boosting(data)

#### <span>Which method gives the most accurate value for Naïve Bayes Classifier?</span>

In [None]:
def accuracyPlot_naiveBayes(methodList, accuracyList):
    fig = go.Figure(data=[go.Pie(labels=methodList, values=accuracyList)])
    return fig.update_traces(hoverinfo='label+value', textinfo='percent', marker=dict(colors= ['aliceblue', 'royalblue', 'darkblue', 'lightskyblue']))
accuracyPlot_naiveBayes(usedMethods_naiveBayes, accuracy_naiveBayes).show()

### <span>Artificial neural networks with 1 hidden layer</span>

In [None]:
accuracy_ann1 = []
usedMethods_ann1 = ['Holdout Method', 'Cross-Validation Method', 'Bagging Ensemble Method']

#### <span>Holdout Method</span>

In [None]:
def ann1_holdout(dataset):
    x = dataset.drop('income', axis=1).values 
    y = dataset['income'].values
    X_train_hold, X_test_hold, Y_train_hold, Y_test_hold = train_test_split(x, y, test_size=0.2, random_state=100)
    model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000) #1 hidden layer with 10 hidden units
    model.fit(X_train_hold, Y_train_hold)   #training the model
    y_pred = model.predict(X_test_hold)   #prediction
    accuracy = metrics.accuracy_score(Y_test_hold, y_pred)*100
    accuracyComparison_holdout.append(accuracy)
    accuracy_ann1.append(accuracy)
    print("Accuracy by ANN with 1 layer via hold_out method:", accuracy,"%")
ann1_holdout(data)

#### <span>Cross-Validation Method</span>

In [None]:
def ann1_crossvalidation(dataset):
    x = dataset.drop('income', axis=1).values 
    y = dataset['income'].values
    cv = KFold(n_splits=10, random_state=1, shuffle=True)
    mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000)   #1 hidden layer with 10 hidden units
    scores = cross_val_score(mlp, x, y, scoring='accuracy', cv=cv, n_jobs=-1)
    accuracy = (np.mean(scores))*100
    accuracy_ann1.append(accuracy)
    # report performance
    print("Accuracy by neural network via cross_validation method (1 layer):", (np.mean(scores))*100,'%')
ann1_crossvalidation(data)

#### <span>Bagging Ensemble Method</span>

In [None]:
def ann1_bagging(dataset, n):
    result =0
    rnge =int(len(dataset)/n)    #size of each bag
    for i in range (n):
        start = i*rnge
        end= start+ rnge
        bag = dataset[start:end]    #keep dataset intact so every bag is cut from the full data
        x = bag.drop('income', axis=1).values 
        y = bag['income'].values
        X_train_hold, X_test_hold, Y_train_hold, Y_test_hold = train_test_split(x, y, test_size=0.2, random_state=n*100)
        model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=50 )
        model.fit(X_train_hold,Y_train_hold)
        y_pred = model.predict(X_test_hold)   
        result +=metrics.accuracy_score(Y_test_hold, y_pred)
    accuracy = (result/n)*100    #mean accuracy over the n bags
    accuracy_ann1.append(accuracy)
    print("Accuracy by ANN with 1 layer via bagging method:",accuracy,'%')
ann1_bagging(data,4)

#### <span>Which method gives the most accurate value for Artificial neural networks with 1 hidden layer Classifier?</span>

In [None]:
def accuracyPlot_ann1(methodList, accuracyList):
    fig = go.Figure(data=[go.Pie(labels=methodList, values=accuracyList)])
    return fig.update_traces(hoverinfo='label+value', textinfo='percent', marker=dict(colors= ['lightgoldenrodyellow', 'wheat', 'yellow']))
accuracyPlot_ann1(usedMethods_ann1, accuracy_ann1).show()

### <span>Artificial neural networks with 2 hidden layers</span>

In [None]:
accuracy_ann2 = []
usedMethods_ann2 = ['Holdout Method', 'Cross-Validation Method', 'Bagging Ensemble Method']

#### <span>Holdout Method</span>

In [None]:
def ann2_holdout(dataset):
    x = dataset.drop('income', axis=1).values 
    y = dataset['income'].values
    X_train_hold, X_test_hold, Y_train_hold, Y_test_hold = train_test_split(x, y, test_size=0.2, random_state=100)
    model = MLPClassifier(hidden_layer_sizes=(10,10), max_iter=1000) #2 hidden layers with 10 hidden units each
    model.fit(X_train_hold, Y_train_hold)   #training the model
    y_pred = model.predict(X_test_hold)   #prediction
    accuracy = metrics.accuracy_score(Y_test_hold, y_pred)*100
    accuracyComparison_holdout.append(accuracy)
    accuracy_ann2.append(accuracy)
    print("Accuracy by ANN with 2 layers via hold_out method:", accuracy,"%")
ann2_holdout(data)

#### <span>Cross-Validation Method</span>

In [None]:
def ann2_crossvalidation(dataset):
    x = dataset.drop('income', axis=1).values 
    y = dataset['income'].values
    cv = KFold(n_splits=10, random_state=1, shuffle=True)
    mlp = MLPClassifier(hidden_layer_sizes=(10,10), max_iter=1000)   #2 hidden layers with 10 hidden units each
    scores = cross_val_score(mlp, x, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # report performance
    accuracy=(np.mean(scores))*100
    accuracy_ann2.append(accuracy)
    print("Accuracy by neural network via cross_validation method (2layers):", (np.mean(scores))*100,'%')
ann2_crossvalidation(data)

#### <span>Bagging Ensemble Method</span>

In [None]:
def ann2_bagging(dataset, n):
    result =0
    rnge =int(len(dataset)/n)    #size of each bag
    for i in range (n):
        start = i*rnge
        end= start+ rnge
        bag = dataset[start:end]    #keep dataset intact so every bag is cut from the full data
        x = bag.drop('income', axis=1).values 
        y = bag['income'].values
        X_train_hold, X_test_hold, Y_train_hold, Y_test_hold = train_test_split(x, y, test_size=0.2, random_state=n*100)
        model = MLPClassifier(hidden_layer_sizes=(10,10), max_iter=1000)
        model.fit(X_train_hold,Y_train_hold)
        y_pred = model.predict(X_test_hold)   
        result +=metrics.accuracy_score(Y_test_hold, y_pred)
    accuracy=(result/n)*100    #mean accuracy over the n bags
    accuracy_ann2.append(accuracy)
    print("Accuracy by ANN with 2 layers via bagging method:",accuracy,'%')
ann2_bagging(data,4)

#### <span>Which method gives the most accurate value for Artificial neural networks with 2 hidden layers Classifier?</span>

In [None]:
def accuracyPlot_ann2(methodList, accuracyList):
    fig = go.Figure(data=[go.Pie(labels=methodList, values=accuracyList)])
    return fig.update_traces(hoverinfo='label+value', textinfo='percent', marker=dict(colors= ['lightslategrey', 'darkslategrey', 'lightgrey']))
accuracyPlot_ann2(usedMethods_ann2, accuracy_ann2).show()

### <span>Support vector machines</span>

In [None]:
accuracy_svm = []
usedMethods_svm = ['Holdout Method', 'Cross-Validation Method', 'Bagging Ensemble Method', 'Boosting Ensemble Method']

#### <span>Holdout Method</span>

In [None]:
def svm_holdout(dataset):
    x = dataset.drop('income', axis=1).values 
    y = dataset['income'].values
    X_train_hold, X_test_hold, Y_train_hold, Y_test_hold = train_test_split(x, y, test_size=0.2, random_state=100)
    clf = svm.SVC(kernel='linear') # Linear Kernel
    clf.fit(X_train_hold, Y_train_hold)   #training the model
    y_pred = clf.predict(X_test_hold)   #prediction
    accuracy = metrics.accuracy_score(Y_test_hold, y_pred)*100
    accuracyComparison_holdout.append(accuracy)
    accuracy_svm.append(accuracy)
    print("Accuracy by SVM via hold_out method:", accuracy,"%")
svm_holdout(data)

#### <span>Cross-Validation Method</span>

In [None]:
def svm_crossvalidation(dataset):
    x = dataset.drop('income', axis=1).values 
    y = dataset['income'].values
    cv = KFold(n_splits=10, random_state=1, shuffle=True)
    clf = svm.SVC(kernel='linear') # Linear Kernel
    scores = cross_val_score(clf, x, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # report performance
    accuracy = (np.mean(scores))*100
    accuracy_svm.append(accuracy)
    print("Accuracy by SVM via crossvalidation method: " , (np.mean(scores))*100, '%')
svm_crossvalidation(data)

#### <span>Bagging Ensemble Method</span>

In [None]:
def svm_bagging(dataset, n):
    result =0
    rnge =int(len(dataset)/n)    #size of each bag
    for i in range (n):
        start = i*rnge
        end= start+ rnge
        bag = dataset[start:end]    #keep dataset intact so every bag is cut from the full data
        x = bag.drop('income', axis=1).values 
        y = bag['income'].values
        X_train_hold, X_test_hold, Y_train_hold, Y_test_hold = train_test_split(x, y, test_size=0.2, random_state=n*100)
        clf = svm.SVC(kernel='linear') # Linear Kernel
        clf.fit(X_train_hold, Y_train_hold)   #training the model
        y_pred = clf.predict(X_test_hold)   #prediction
        result +=metrics.accuracy_score(Y_test_hold, y_pred)
    accuracy = (result/n)*100    #mean accuracy over the n bags
    accuracy_svm.append(accuracy)
    print("Accuracy by SVM via bagging method:",accuracy,'%')
svm_bagging(data,4)

#### <span>Boosting Ensemble Method</span>

In [None]:
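dataset2 = data[:10]    #first 10 rows only
def svm_boosting(dataset):
    x = dataset.drop('income', axis=1).values    #use the dataset passed in, not the global data
    y = dataset['income'].values
    model = svm.SVC(kernel='linear') # Linear Kernel
    #the base model is passed positionally because its keyword name differs between
    #scikit-learn releases (base_estimator in older ones, estimator in newer ones)
    AdaBoost = AdaBoostClassifier(model,n_estimators=400,learning_rate=1, algorithm='SAMME')
    AdaBoost.fit(x,y)
    prediction = AdaBoost.score(x,y)    #note: scored on the training data itself
    accuracy = prediction*100
    accuracy_svm.append(accuracy)
    print("Accuracy by SVM via boosting method:",accuracy,'%')
svm_boosting(dataset2)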

#### <span>Which method gives the most accurate value for Support vector machines Classifier?</span>

In [None]:
def accuracyPlot_svm(methodList, accuracyList):
    fig = go.Figure(data=[go.Pie(labels=methodList, values=accuracyList)])
    return fig.update_traces(hoverinfo='label+value', textinfo='percent', marker=dict(colors= ['peachpuff', 'papayawhip', 'orange', 'orangered']))
accuracyPlot_svm(usedMethods_svm, accuracy_svm).show()

## <span style="color:#900C3F;">4. Comparing the performance of classification models</span>

### <span>Comparison of accuracy among all classifiers used with Holdout Method</span>

In [None]:
fig = go.Figure(data=[go.Pie(labels=classifiersList, values=accuracyComparison_holdout)])
fig = fig.update_traces(hoverinfo='label+value', textinfo='percent', marker=dict(colors= ['aliceblue', 'lightskyblue', 'royalblue', 'mediumblue', 'cornflowerblue', 'darkblue']))
fig.show()

### <span>Comparison of error rates between our results and the results of Ronny Kohavi and Barry Becker given on the website for the ID3, C45 and Naïve Bayes algorithms</span>

In [None]:
# Comparing the error rates we obtained after training and testing with the error rates reported by
#Ronny Kohavi and Barry Becker, as given on the website (https://archive.ics.uci.edu/ml/datasets/census+income)

#ID3: Decision tree by gain value
#our_error_ID3 = 
Ronny_Barry_ID3 = 15.64

#C4.5: Decision tree by gini index value
#our_error_C45 = 
Ronny_Barry_C45 = 15.54

#NB: Naive bayes
#our_error_NB = 
Ronny_Barry_NB = 16.12
algorithms = ['ID3 Error', 'C45 Error', 'Naïve Bayes']    #bar chart labels; avoids shadowing the built-in list
our_error = [our_error_ID3, our_error_C45, our_error_NB]
Ronny_Barry_error = [Ronny_Barry_ID3, Ronny_Barry_C45,Ronny_Barry_NB]

In [None]:
import plotly.graph_objects as go
fig = go.Figure(data=[
    go.Bar(name='Our Errors', x=algorithms, y=our_error),
    go.Bar(name='Ronny-Barry Errors', x=algorithms, y=Ronny_Barry_error)
])
fig = fig.update_layout(barmode='group')
fig.show()

## Conclusion

The Census Income data is preprocessed and split into training and testing sets by several methods, such as the holdout method and the bagging ensemble method. Then 6 models, some built on ready-made sklearn models and others implemented manually, are applied to the data to predict whether a person makes over 50K a year.


Thank you! 

#### References: 

- https://archive.ics.uci.edu/ml/datasets/Census+Income

- http://myweb.sabanciuniv.edu/rdehkharghani/files/2016/02/The-Morgan-Kaufmann-Series-in-Data-Management-Systems-Jiawei-Han-Micheline-Kamber-Jian-Pei-Data-Mining.-Concepts-and-Techniques-3rd-Edition-Morgan-Kaufmann-2011.pdf

- https://scikit-learn.org/stable/

- https://plotly.com/python/basic-charts/