Machine learning has now permeated many disciplines, including politics. The current landscape in the US is full of data scientists and other quantitative experts making predictions about ongoing and upcoming elections. Consider the Congressional Voting Records dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records). The dataset contains two files: one with a “.names” suffix and one with a “.data” suffix. The actual data is in the “.data” file, and the “.names” file holds the metadata (i.e., it describes what the different columns mean). Each row of the “.data” file contains one instance and includes both the class label and the features; take care to note the order: the class label comes first, followed by the 16 votes. The machine learning problem here is to take the votes of US members of Congress as input and predict whether each one is a Republican or a Democrat.
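
As a rough illustration of the row layout described above, the sketch below parses three rows copied from the first lines of the “.data” file and separates the class label from the votes. The short column names (`party`, `vote_1` ... `vote_16`) are hypothetical placeholders for this example only, and passing `na_values="?"` is an optional shortcut for turning the missing-vote marker into NaN at read time.

```python
import io
import pandas as pd

# Three rows in the ".data" layout: class label first, then 16 votes.
sample = io.StringIO(
    "republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y\n"
    "republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?\n"
    "democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n\n"
)

# Hypothetical short column names, used only in this illustration.
columns = ["party"] + [f"vote_{i}" for i in range(1, 17)]

# na_values="?" converts the missing-vote marker to NaN while reading.
votes = pd.read_csv(sample, names=columns, header=None, na_values="?")

X = votes.drop(columns="party")  # features: the 16 votes
y = votes["party"]               # class label
print(X.shape, y.tolist())
```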

### Importing libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics


### Reading the data as a dataframe

In [2]:
column_names = ['Class Name', 'Handicapped Infants','Water Project Cost Sharing',
               'Adoption of the budget resolution', 'Physician fee freeze',
                'El salvador aid', 'Religious groups in schools', 'Anti-satellite test ban',
               'Aid to nicaraguan contras', 'MX Missile', 'Immigration', 
                'Synfuels corporation cutback', 'Education spending', 
                'Superfund right to sue', 'Crime', 'Duty free exports', 
                'Export administration act south africa']

# read_csv() reads a comma-separated values (CSV) file into a dataframe.
# The file has no header row, so the column names are supplied explicitly.
dataframe = pd.read_csv('../data/info.data', names=column_names, header=None)
dataframe.head()


Unnamed: 0,Class Name,Handicapped Infants,Water Project Cost Sharing,Adoption of the budget resolution,Physician fee freeze,El salvador aid,Religious groups in schools,Anti-satellite test ban,Aid to nicaraguan contras,MX Missile,Immigration,Synfuels corporation cutback,Education spending,Superfund right to sue,Crime,Duty free exports,Export administration act south africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y


### Descriptive statistics about the dataset

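The describe() method generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

It handles both numeric and object series, as well as DataFrame column sets of mixed data types, and its output depends on what it is given. Every column here holds strings, so the summary reports count, unique, top (the most frequent value) and freq (how often that value occurs). Missing votes are still stored as “?” at this point, so they count as a third value, which is why unique is 3 for every vote column.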
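
To make the shape of that output concrete, here is a minimal sketch on a made-up four-row dataframe (the data and the names `toy`, `summary` and `vote_counts` are invented for the example). It shows the four rows describe() produces for string columns and how value_counts() gives the full frequency table for one column.

```python
import pandas as pd

# Made-up data for illustration; "?" stands in for a missing vote.
toy = pd.DataFrame({
    "party": ["democrat", "republican", "democrat", "democrat"],
    "vote": ["y", "n", "y", "?"],
})

# For string columns describe() reports count, unique, top and freq.
summary = toy.describe()
print(summary)

# value_counts() lists how often each distinct value occurs in one column.
vote_counts = toy["vote"].value_counts()
print(vote_counts)
```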

In [3]:
# describe() - generates descriptive statistics for the dataframe.
dataframe.describe()

Unnamed: 0,Class Name,Handicapped Infants,Water Project Cost Sharing,Adoption of the budget resolution,Physician fee freeze,El salvador aid,Religious groups in schools,Anti-satellite test ban,Aid to nicaraguan contras,MX Missile,Immigration,Synfuels corporation cutback,Education spending,Superfund right to sue,Crime,Duty free exports,Export administration act south africa
count,435,435,435,435,435,435,435,435,435,435,435,435,435,435,435,435,435
unique,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
top,democrat,n,y,y,n,y,y,y,y,y,y,n,n,y,y,n,y
freq,267,236,195,253,247,212,272,239,242,207,216,264,233,209,248,233,269


### Replacing missing values (?) with NaN

In [4]:
# replace() returns a dataframe in which every '?' entry is replaced with NaN.
# np.nan is used because the np.NaN alias was removed in NumPy 2.0.
dataframe = dataframe.replace('?', np.nan)
dataframe.head()

Unnamed: 0,Class Name,Handicapped Infants,Water Project Cost Sharing,Adoption of the budget resolution,Physician fee freeze,El salvador aid,Religious groups in schools,Anti-satellite test ban,Aid to nicaraguan contras,MX Missile,Immigration,Synfuels corporation cutback,Education spending,Superfund right to sue,Crime,Duty free exports,Export administration act south africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y
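
Once the markers are NaN, pandas can count them directly. The sketch below, on a small made-up dataframe (all names and values are illustrative), shows one way to check how many values are missing per column and how many rows contain at least one missing value. The same two calls could be run on the real dataframe before choosing between the three cases that follow.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    "party": ["republican", "democrat", "democrat"],
    "vote_a": ["n", "?", "y"],
    "vote_b": ["?", "?", "n"],
})
toy = toy.replace("?", np.nan)

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that contain at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())

print(missing_per_column)
print(rows_with_missing)
```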


### Case 1: Dropping NaN values

In [5]:
# dropna() with how='any' drops every row that contains at least one NaN value.
dropped_nan_dataframe = dataframe.dropna(how='any')
dropped_nan_dataframe.head()

Unnamed: 0,Class Name,Handicapped Infants,Water Project Cost Sharing,Adoption of the budget resolution,Physician fee freeze,El salvador aid,Religious groups in schools,Anti-satellite test ban,Aid to nicaraguan contras,MX Missile,Immigration,Synfuels corporation cutback,Education spending,Superfund right to sue,Crime,Duty free exports,Export administration act south africa
5,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
19,democrat,y,y,y,n,n,n,y,y,y,n,y,n,n,n,y,y
23,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,y,y
25,democrat,y,n,y,n,n,n,y,y,y,y,n,n,n,n,y,y
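
Dropping rows keeps the original index labels (note the gaps in the index above), and it can discard a large share of the data. The following sketch, on made-up data, shows how one might measure the number of rows lost and optionally renumber the index afterwards. The names are illustrative, and resetting the index is an optional step that the notebook itself does not perform.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    "party": ["republican", "democrat", "democrat", "republican"],
    "vote_a": ["n", np.nan, "y", "n"],
    "vote_b": ["y", "n", "n", np.nan],
})

# Keep only the rows without any missing value.
complete = toy.dropna(how="any")
rows_dropped = len(toy) - len(complete)
kept_labels = complete.index.tolist()  # original labels survive the drop

# Optional: renumber the rows 0..n-1.
complete = complete.reset_index(drop=True)

print(rows_dropped, kept_labels, complete.index.tolist())
```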


### Case 2: Treat missing values as a value 

In [6]:
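# Work on a copy so the original dataframe keeps its NaN values.
missing_values_dataframe = dataframe.copy()
# Fill every NaN with 'm', so "missing" becomes a third category next to 'y' and 'n'.
for column in dataframe.columns:
    missing_values_dataframe[column] = missing_values_dataframe[column].fillna('m')
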
missing_values_dataframe.head()

Unnamed: 0,Class Name,Handicapped Infants,Water Project Cost Sharing,Adoption of the budget resolution,Physician fee freeze,El salvador aid,Religious groups in schools,Anti-satellite test ban,Aid to nicaraguan contras,MX Missile,Immigration,Synfuels corporation cutback,Education spending,Superfund right to sue,Crime,Duty free exports,Export administration act south africa
0,republican,n,y,n,y,y,y,n,n,n,y,m,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,m
2,democrat,m,y,y,m,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,m,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,m,y,y,y,y


### Case 3: Impute missing values 

In [7]:
# Copy the dataframe so that imputing does not modify the original in place.
imputed_dataframe = dataframe.copy()
# Loop through every column and fill its missing values with the column's mode
# (most frequent value). mode() can return several values, so take the first.
for column in dataframe.columns:
    imputed_dataframe[column] = dataframe[column].fillna(dataframe[column].mode()[0])

imputed_dataframe.head()

Unnamed: 0,Class Name,Handicapped Infants,Water Project Cost Sharing,Adoption of the budget resolution,Physician fee freeze,El salvador aid,Religious groups in schools,Anti-satellite test ban,Aid to nicaraguan contras,MX Missile,Immigration,Synfuels corporation cutback,Education spending,Superfund right to sue,Crime,Duty free exports,Export administration act south africa
0,republican,n,y,n,y,y,y,n,n,n,y,n,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
2,democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,n,y,y,y,y
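
The column-by-column loop can also be written as a single call. This is a minimal sketch on made-up data, not the notebook's own code: it takes the first row of `DataFrame.mode()` as the per-column fill values and passes it to `fillna()`, which returns a new dataframe and leaves the input untouched.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration.
toy = pd.DataFrame({
    "vote_a": ["y", "y", np.nan, "n"],
    "vote_b": ["n", np.nan, "n", "y"],
})

# mode() ignores NaN; the first row holds one most-frequent value per column.
column_modes = toy.mode().iloc[0]

# fillna() with a Series fills each column with its matching value.
imputed = toy.fillna(column_modes)

print(imputed)
```

Because `fillna()` returns a new object here, the original `toy` dataframe still contains its NaN values afterwards.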


### Label encoding

In [8]:
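label_encoder = preprocessing.LabelEncoder()

def label_encoding(df):
    # Work on a copy so the caller's dataframe is left unchanged.
    encoded_dataframe = df.copy()
    for column in df.columns:
        # Refit the encoder on each column; integer codes follow the sorted labels.
        encoded_dataframe[column] = label_encoder.fit_transform(df[column])

    return encoded_dataframe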
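
To see what the encoder does to a single column, here is a small sketch on a made-up series of votes (the variable names are illustrative). LabelEncoder sorts the distinct labels and assigns them consecutive integers, so a column containing 'm', 'n' and 'y' maps to 0, 1 and 2.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up votes, including the 'm' marker used for missing values in Case 2.
votes = pd.Series(["n", "y", "m", "y", "n"])

encoder = LabelEncoder()
codes = encoder.fit_transform(votes)

print(list(encoder.classes_))  # sorted labels; position = integer code
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
```

One consequence worth keeping in mind: the integer codes imply an order ('m' < 'n' < 'y') that has no real meaning for votes, and models that treat the inputs as numeric will see that order.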
        

### Cross validation 

In [9]:
# 5-fold cross validation; rows are split in their original order (no shuffling).
kf = model_selection.KFold(n_splits=5, shuffle=False)


def cross_validation(df, model):
    # Encode the features once; the fold indices below are positional.
    X = label_encoding(df.drop(["Class Name"], axis=1))
    y = df["Class Name"]

    precision_scores = []
    recall_scores = []
    f1_scores = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]

        # Train on the training folds and predict on the held-out fold.
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # "weighted" averages the per-class scores, weighted by class support.
        precision_scores.append(metrics.precision_score(y_test, y_pred, average="weighted"))
        recall_scores.append(metrics.recall_score(y_test, y_pred, average="weighted"))
        f1_scores.append(metrics.f1_score(y_test, y_pred, average="weighted"))

    # Report the mean of each metric over all folds.
    print("Precision score =", np.mean(precision_scores))
    print("Recall score =", np.mean(recall_scores))
    print("F1 score =", np.mean(f1_scores))
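
For comparison, scikit-learn can run the same kind of evaluation through `cross_validate`. The sketch below is an alternative under stated assumptions, not the procedure used in this notebook: the data is randomly generated, the label is defined from the first feature purely so the example runs, and it swaps in `StratifiedKFold` with shuffling (which keeps the class proportions similar in each fold) in place of the unshuffled `KFold` above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary "votes"; the label simply follows the first vote.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 5))
y = np.where(X[:, 0] == 1, "republican", "democrat")

# Stratified, shuffled 5-fold split with a fixed seed for repeatability.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Built-in scorer names matching average="weighted" in the metrics calls.
scoring = ["precision_weighted", "recall_weighted", "f1_weighted"]

results = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring=scoring
)

# cross_validate returns one array of per-fold scores per metric.
mean_scores = {name: results[f"test_{name}"].mean() for name in scoring}
print(mean_scores)
```

Fixing `random_state` on both the splitter and the tree makes repeated runs return the same scores, which the plain `DecisionTreeClassifier()` used below does not guarantee.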


## Naive Bayes Classifier

In [10]:
naive_bayes_model = GaussianNB()

### Naive Bayes Classifier - Dropped NaN values

In [11]:
cross_validation(dropped_nan_dataframe, naive_bayes_model)

Precision score = 0.9549755906036499
Recall score = 0.9527289546716003
F1 score = 0.9524824274738493


### Naive Bayes Classifier - Treat missing values as a value

In [12]:
cross_validation(missing_values_dataframe, naive_bayes_model)

Precision score = 0.9227801115023349
Recall score = 0.9172413793103449
F1 score = 0.9177918684144426


### Naive Bayes Classifier - Impute missing values

In [13]:
cross_validation(imputed_dataframe, naive_bayes_model)

Precision score = 0.9267076847968791
Recall score = 0.9241379310344827
F1 score = 0.9244727496192995


## Decision Tree Classifier

In [14]:
decision_tree_model = DecisionTreeClassifier() 

### Decision Tree Classifier - Dropped NaN values 

In [15]:
cross_validation(dropped_nan_dataframe, decision_tree_model)

Precision score = 0.9536189045691241
Recall score = 0.952358926919519
F1 score = 0.9523687131509758


### Decision Tree Classifier - Treat missing values as a value

In [16]:
cross_validation(missing_values_dataframe, decision_tree_model)

Precision score = 0.9435727973988012
Recall score = 0.9402298850574713
F1 score = 0.9402131261091465


### Decision Tree Classifier - Impute missing values


In [17]:
cross_validation(imputed_dataframe, decision_tree_model)

Precision score = 0.9426383047013068
Recall score = 0.9402298850574713
F1 score = 0.9401777624656882
