<h2>CS 4780/5780 Final Project: </h2>
<h3>Election Result Prediction for US Counties</h3>

Names and NetIDs for your group members:

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


<h3>Introduction:</h3>

<p> The final project is about conducting a real-world machine learning project on your own, with everything that is involved. Unlike in programming projects 1-5, where we gave you all the scaffolding and you just filled in the blanks, you now start from scratch. The programming projects provide templates for how to do this, and the most recent video lectures summarize some of the tricks you will need (e.g. feature normalization, feature construction). So, this final project brings realism to how you will use machine learning in the real world.  </p>

The task you will work on is forecasting election results. Economic and sociological factors have been widely used when making predictions on the voting results of US elections, and these factors vary a lot among counties in the United States. In addition, as you may observe from the election maps of recent elections, neighboring counties show similar patterns in their voting results. In this project you will bring the power of machine learning to make predictions for the county-level election results using economic and sociological factors and the geographic structure of US counties. </p>
<p>

<h3>Your Task:</h3>
Please read the project description PDF file carefully and make sure you write your code and answers to all the questions in this Jupyter Notebook. Your answers to the questions are a large portion of your grade for this final project. Please import the packages in this notebook and cite any references you used as mentioned in the project description. You need to print this entire Jupyter Notebook as a PDF file and submit it to Gradescope, and also submit the runnable ipynb version to Canvas for us to run.

<h3>Due Date:</h3>
The final project dataset and template Jupyter notebook are due on <strong>December 15th</strong>. Note that <strong>no late submissions will be accepted</strong> and you cannot use any of your unused slip days.
</p>

![image.png; width="100";](attachment:image.png)

<h2>Part 1: Basics</h2><p>

<h3>1.1 Import:</h3><p>
Please import the packages you need. Note that learning and using packages is recommended but not required for this project. Official tutorials for the suggested packages include:
    
https://scikit-learn.org/stable/tutorial/basic/tutorial.html
    
https://pytorch.org/tutorials/
    
https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
<p>

In [120]:
import os
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
import sklearn.tree
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import  f1_score




In [None]:
!kaggle competitions download -c cs-4780-final-project-county-prediction-basic


Traceback (most recent call last):
  File "/usr/local/bin/kaggle", line 5, in <module>
    from kaggle.cli import main
  File "/usr/local/lib/python2.7/dist-packages/kaggle/__init__.py", line 23, in <module>
    api.authenticate()
  File "/usr/local/lib/python2.7/dist-packages/kaggle/api/kaggle_api_extended.py", line 146, in authenticate
    self.config_file, self.config_dir))
IOError: Could not find kaggle.json. Make sure it's located in /root/.kaggle. Or use the environment method.


<h3>1.2 Weighted Accuracy:</h3><p>
Since our dataset labels are heavily biased, you need to use the following function to compute weighted accuracy throughout your training and validation process and we use this for testing on Kaggle.
<p>

In [23]:
def weighted_accuracy(pred, true):
    assert(len(pred) == len(true))
    num_labels = len(true)
    num_pos = sum(true)
    num_neg = num_labels - num_pos
    frac_pos = num_pos/num_labels
    weight_pos = 1/frac_pos
    weight_neg = 1/(1-frac_pos)
    num_pos_correct = 0
    num_neg_correct = 0
    for pred_i, true_i in zip(pred, true):
        num_pos_correct += (pred_i == true_i and true_i == 1)
        num_neg_correct += (pred_i == true_i and true_i == 0)
    weighted_accuracy = ((weight_pos * num_pos_correct) 
                         + (weight_neg * num_neg_correct))/((weight_pos * num_pos) + (weight_neg * num_neg))
    return weighted_accuracy
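
A note on what this metric measures: because each class weight is the inverse of that class's frequency, the expression reduces algebraically to the mean of the two per-class recalls (often called balanced accuracy). The sketch below checks this numerically on a small made-up label vector. `weighted_accuracy` is copied from the cell above so the block runs on its own; `balanced_accuracy` is a hypothetical helper introduced only for the comparison.

```python
def weighted_accuracy(pred, true):
    # Same computation as the course-provided function above.
    assert len(pred) == len(true)
    num_labels = len(true)
    num_pos = sum(true)
    num_neg = num_labels - num_pos
    frac_pos = num_pos / num_labels
    weight_pos = 1 / frac_pos
    weight_neg = 1 / (1 - frac_pos)
    num_pos_correct = 0
    num_neg_correct = 0
    for pred_i, true_i in zip(pred, true):
        num_pos_correct += (pred_i == true_i and true_i == 1)
        num_neg_correct += (pred_i == true_i and true_i == 0)
    return ((weight_pos * num_pos_correct) + (weight_neg * num_neg_correct)) / (
        (weight_pos * num_pos) + (weight_neg * num_neg))


def balanced_accuracy(pred, true):
    # Hypothetical helper: mean of recall on class 1 and recall on class 0.
    pos = [p for p, t in zip(pred, true) if t == 1]
    neg = [p for p, t in zip(pred, true) if t == 0]
    recall_pos = sum(p == 1 for p in pos) / len(pos)
    recall_neg = sum(p == 0 for p in neg) / len(neg)
    return (recall_pos + recall_neg) / 2


# Made-up labels: 2 positives, 8 negatives.
true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10                  # always predicts the majority class
mixed = [1, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # 1 of 2 positives and 6 of 8 negatives correct

print(weighted_accuracy(all_negative, true), balanced_accuracy(all_negative, true))
print(weighted_accuracy(mixed, true), balanced_accuracy(mixed, true))
```

One consequence worth keeping in mind when reading validation scores: a classifier that predicts the same class for every county gets full recall on one class and zero on the other, so it scores 0.5 under this metric no matter how imbalanced the labels are.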

<h2>Part 2: Baseline Solution</h2><p>
Note that your code should be commented well, and in part 2.4 you can refer to your comments (e.g. # Here is SVM, 
# Here is validation for SVM, etc). Also, we recommend that you do not use the 2012 dataset or the graph dataset to reach the baseline accuracy of 68% in this part. A basic solution with only the 2016 dataset and reasonable model selection will be enough; it will be great if you explore the graph and possibly the 2012 dataset in Part 3.

<h3>2.1 Preprocessing and Feature Extraction:</h3><p>
Given the training dataset and graph information, you need to correctly preprocess the dataset (e.g. feature normalization). For the baseline solution in this part, you might not need to introduce extra features to reach the baseline test accuracy.
<p>

In [24]:
# You may change this but we suggest loading data with the following code and you may need to change
# datatypes and do necessary data transformation after loading the raw data to the dataframe.

#Renamed columns

train = pd.read_csv("drive/MyDrive/CS_4780_Project/train_2016.csv", sep=',',header=0, encoding='unicode_escape')
test=pd.read_csv("drive/MyDrive/CS_4780_Project/test_2016_no_label.csv",sep=',',header=0, encoding='unicode_escape')
#Determining which party won each county, 0= GOP win, 1- DEM win
labels= train["DEM"]>train["GOP"]

#label dataset
labels=labels.astype(int)
#using FIPS code as index
train=train.set_index("FIPS")
test=test.set_index("FIPS")
#dropped county since it is not a numerical feature
train=train.drop('County', axis=1)
test=test.drop('County', axis=1)
#------------------------
train=train.drop('DEM', axis=1)
train=train.drop('GOP', axis=1)

#----------------

#Converting income numbers to floats
train["MedianIncome"]=train["MedianIncome"].str.replace(",","")

train["MedianIncome"]=pd.to_numeric(train ["MedianIncome"], downcast="float")

test["MedianIncome"]=test["MedianIncome"].str.replace(",","")

test["MedianIncome"]=pd.to_numeric(test["MedianIncome"], downcast="float")



print(train)
# Make sure you comment your code clearly and you may refer to these comments in the part 2.4
# TODO

       MedianIncome  MigraRate  ...  BachelorRate  UnemploymentRate
FIPS                            ...                                
18019       51837.0        4.9  ...          20.9               4.2
6035        49793.0      -18.4  ...          12.0               6.9
40081       44914.0       -1.3  ...          15.1               5.3
31153       74374.0        9.2  ...          40.1               2.9
28055       26957.0      -12.8  ...           6.7              14.0
...             ...        ...  ...           ...               ...
36009       46224.0       -3.5  ...          19.1               6.0
55031       50340.0       -2.6  ...          24.0               5.2
27065       51347.0        1.6  ...          14.6               6.5
17139       57447.0       -9.2  ...          19.5               4.6
20185       46104.0       -7.9  ...          23.9               3.7

[1555 rows x 6 columns]
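
The printed frame shows that the columns live on very different scales: MedianIncome is in the tens of thousands while the rates are single or double digits. The task description mentions feature normalization, and the cell above does not apply it, so the following is only a minimal sketch of one common way to do it, using scikit-learn's `StandardScaler`. The rows are copied from the visible part of the printout; treating the first four as "training" rows and the last as held out is an assumption made purely for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A few rows copied from the printed training frame above (visible columns only).
sample = pd.DataFrame(
    {
        "MedianIncome": [51837.0, 49793.0, 44914.0, 74374.0, 26957.0],
        "MigraRate": [4.9, -18.4, -1.3, 9.2, -12.8],
        "BachelorRate": [20.9, 12.0, 15.1, 40.1, 6.7],
        "UnemploymentRate": [4.2, 6.9, 5.3, 2.9, 14.0],
    },
    index=pd.Index([18019, 6035, 40081, 31153, 28055], name="FIPS"),
)

# Illustration only: pretend the first four rows are training data, the last is held out.
fit_rows, held_out = sample.iloc[:4], sample.iloc[4:]

# Fit the scaler on the training rows only, then reuse its statistics elsewhere.
scaler = StandardScaler()
scaled_fit = pd.DataFrame(scaler.fit_transform(fit_rows),
                          index=fit_rows.index, columns=fit_rows.columns)
scaled_held_out = pd.DataFrame(scaler.transform(held_out),
                               index=held_out.index, columns=held_out.columns)

print(scaled_fit.round(2))
print(scaled_held_out.round(2))
```

Fitting the scaler on the training rows and only transforming the validation or test rows keeps information from the held-out data out of the preprocessing step. Decision trees are insensitive to this kind of rescaling, while gradient-trained models such as a multilayer perceptron usually are not.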


In [64]:
#NN preprocessing


nntrain=train.to_numpy()
nnlabels=labels.to_numpy()
nntest=test.to_numpy()

[0 0 0 ... 0 0 0]


<h3>2.2 Use At Least Two Training Algorithms from class:</h3><p>
You need to use at least two training algorithms from class. You can use your code from previous projects or any packages you imported in part 1.1.

In [None]:
#Using a decision tree with information gain (entropy) as the split criterion.
#The criterion alone does not limit tree depth; max_depth and max_leaf_nodes
#are tuned in part 2.3 to keep the tree from overfitting.

D_tree_A=sklearn.tree.DecisionTreeClassifier(criterion="entropy", splitter="random")

#Neural network
#Using a multilayer perceptron neural network (set up and tuned in part 2.3)








<h3>2.3 Training, Validation and Model Selection:</h3><p>
You need to split your data to a training set and validation set or performing a cross-validation for model selection.

In [74]:

#Split training data into training and validation sets

x_train, x_val, y_train, y_val=sklearn.model_selection.train_test_split(train, labels, test_size=0.25)



#Train decision trees with different depths.
#splitter="random" makes every fit different, so the depth search is repeated
#several times and the best depths from all repetitions are averaged.
bestdepths=[]

for x in range (1,10):
  best_acc=-1
  best_depth=None
  #Test every candidate depth once per repetition
  for i in range(1,30):
    D_tree=sklearn.tree.DecisionTreeClassifier(criterion="entropy", splitter= "random",max_depth=i)
    D_tree.fit(x_train,y_train)

    # Measure accuracy each has on validation set using weighted accuracy function
    val_preds=D_tree.predict(x_val)
    acc= weighted_accuracy(val_preds,y_val )
    acc=acc*100

    # Keep the depth with the highest validation accuracy in this repetition
    if acc>best_acc:
      best_acc=acc
      best_depth=i

  # Record one best depth per repetition
  bestdepths.append(best_depth)


#Choose average of best depths, rounded because max_depth must be an integer

finaldp=int(round(sum(bestdepths)/len(bestdepths)))


#Find best number of leaf nodes

acc_dictionary={}
for i in [20,50,100,500,1000]:
    D_tree=sklearn.tree.DecisionTreeClassifier(criterion="entropy", splitter= "random",max_depth=finaldp, max_leaf_nodes=i)
    D_tree.fit(x_train,y_train)
    
    
    # Measure accuracy each has on validation set using weighted accuracy function
    val_preds=D_tree.predict(x_val)
    acc= weighted_accuracy(val_preds,y_val )
    acc=acc*100
    
    

    # Store dictionary of accuracies and the leaf node limits that gave them
    acc_dictionary[acc] = i
best= acc_dictionary[max(acc_dictionary)]
final_max_nodes=best


    










20
50
100
500
1000


In [79]:
#Neural Network

#Split training data into training and validation sets
xnn_train, xnn_val, ynn_train, ynn_val=sklearn.model_selection.train_test_split(nntrain, labels, test_size=0.25)


#Tune activation function
#Create a classifier with each activation function and test to see which function gives the best weighted accuracy
acc_dictionary={}
for func in ['identity', 'logistic', 'tanh', 'relu']:
  net=sklearn.neural_network.MLPClassifier(activation=func)
  net.fit(xnn_train,ynn_train)

  

  valnn_preds=net.predict(xnn_val)

 

  acc= weighted_accuracy(valnn_preds,ynn_val)

  
  acc=acc*100

  # Store dictionary of accuracies and the function that gave them
  acc_dictionary[acc] = func
        
  best= acc_dictionary[max(acc_dictionary)] 

print("best func is", best)
    
    
  




#Tune hidden layer 
acc_dictionary={}
for k in [10,25,50,75,100]:
  net=sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(k,))
  
  net.fit(xnn_train, ynn_train)
  valnn_preds=net.predict(xnn_val)
  acc= weighted_accuracy(valnn_preds,ynn_val)
  print (acc)
  
  

  # Store dictionary of accuracies and the hidden layer size that gave them
  acc_dictionary[acc] = k
        
best= acc_dictionary[max(acc_dictionary)] 
print("best lay num is", best)



#Tune max iter

acc_dictionary={}

for m in [50,100,200,500,1000]:
    net=sklearn.neural_network.MLPClassifier(hidden_layer_sizes=50 ,activation='relu', max_iter=m)
    net.fit(xnn_train, ynn_train)
    valnn_preds=net.predict(xnn_val)
    acc= weighted_accuracy(valnn_preds,ynn_val)
    

    # Store dictionary of accuracies and the max_iter value that gave them
    acc_dictionary[acc] = m
          
best= acc_dictionary[max(acc_dictionary)] 

print("best max_iter is", best)

#Tune alpha

acc_dictionary={}

for alpha in[0.0010, 0.0015,0.0020,0.0100,0.0200] :
    net=sklearn.neural_network.MLPClassifier(hidden_layer_sizes=50 ,activation='relu', max_iter=1000, alpha=alpha)
    net.fit(xnn_train, ynn_train)
    valnn_preds=net.predict(xnn_val)
    acc= weighted_accuracy(valnn_preds,ynn_val)
    
    # Store dictionary of accuracies and the alpha value that gave them
    acc_dictionary[acc] = alpha
          
best= acc_dictionary[max(acc_dictionary)] 

print("best alpha is", best)

#Tuning learning rate init
acc_dictionary={}

for lrate in[0.0001, 0.0005,0.0010, 0.0015,0.0020] :
    net=sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100) ,activation='relu', max_iter=1000, learning_rate_init=lrate)
    net.fit(xnn_train, ynn_train)
    valnn_preds=net.predict(xnn_val)
    acc= weighted_accuracy(valnn_preds,ynn_val)
       
    # Store dictionary of accuracies and the learning rate that gave them
    acc_dictionary[acc] = lrate
          
best= acc_dictionary[max(acc_dictionary)] 

print("best learning rate is", best)








best func is relu
0.5
0.5
0.4488744451490171
0.49999999999999994
0.5
best lay num is 100
0.5
0.5
0.5
0.5
0.5
best max_iter is 1000
0.5
0.5
0.5
0.5
0.5
best alpha is 0.02
0.5
0.49999999999999994
0.5
0.49999999999999994
0.5
best learning rate is 0.002


In [78]:
#train and test best neural network

net=sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100) ,activation='relu', max_iter=1000, alpha=0.01, learning_rate_init=0.002)
net.fit(xnn_train, ynn_train)
nnpreds=net.predict(nntest)

nnsub= pd.DataFrame({"FIPS": test.index ,
             "Result": nnpreds 
             })
nnsub=nnsub.set_index("FIPS")
nnsub.to_csv( "drive/MyDrive/CS_4780_Project/nnsoln.csv")


In [10]:
#train and test best tree

D_tree_final=sklearn.tree.DecisionTreeClassifier(criterion="entropy", splitter= "random",
                                                 max_depth=finaldp,max_leaf_nodes= final_max_nodes )
D_tree_final.fit(x_train,y_train)
preds=D_tree_final.predict(x_val)
val_accy=weighted_accuracy(preds,y_val)

submissionpreds=D_tree_final.predict(test)




submission= pd.DataFrame({"FIPS": test.index ,
             "Result": submissionpreds 
             })

submission=submission.set_index("FIPS")

submission.to_csv( "drive/MyDrive/CS_4780_Project/Basic_Soln1.2.csv")



0.9228295819935691


<h3>2.4 Explanation in Words:</h3><p>
    You need to answer the following questions in the markdown cell after this cell:

2.4.1 How did you preprocess the dataset and features?

The column with county names was removed, since the sklearn decision tree classifier can only take numerical data, and the FIPS codes already present a unique identifier for each county.

The data in the other columns was converted from strings to floats in order for the sklearn classifiers to work well with them.

2.4.2 Which two learning methods from class did you choose and why did you make these choices?

I used a decision tree and neural network. A decision tree is convenient because it

The neural network is a multilayer perceptron. This algorithm is capable of approximating any continuous function, so it is well suited to situations in which the function mapping the features to the labels is a complex one.
Many features have complex relationships with which party wins an election, so an algorithm that is capable of representing such relationships works well for this problem.


2.4.3 How did you do the model selection?

I split the training data into a training and validation set, with 3/4 used for training and 1/4 used for validation.

Decision Tree

I tested the accuracy of decision trees with different maximum leaf node limits and maximum depth limits. Each model was trained with the training set and its accuracy tested on the validation set.

Initially, I used a for loop to test the accuracy of decision trees with different depths. During this process, I found that the best depth was different on each run, so I opted to use nested for loops to repeat the depth search several times and take the average of the best depths.
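
The run-to-run variation described above can also be reduced by scoring each depth with k-fold cross-validation rather than a single train/validation split. The sketch below illustrates that alternative; it is not the procedure used in this notebook. It runs on synthetic data from `make_classification` as a stand-in for the county data, and it uses scikit-learn's built-in `"balanced_accuracy"` scoring (the mean of the per-class recalls) in place of the course's `weighted_accuracy` function.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the county data (not the real dataset).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

mean_scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=0)
    # One score per fold; the mean is a steadier estimate than a single split.
    scores = cross_val_score(tree, X, y, cv=cv, scoring="balanced_accuracy")
    mean_scores[depth] = scores.mean()

# Pick the depth with the highest mean cross-validated score.
best_depth = max(mean_scores, key=mean_scores.get)
print("best depth:", best_depth, "mean score:", round(mean_scores[best_depth], 3))
```

Averaging over folds plays the same role as averaging the best depths over repeated runs, while also yielding an integer depth directly.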


 Neural Network

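For the neural network I used for loops to test a range of values for learning rate, alpha, maximum number of iterations, activation function and hidden layer size. Interestingly, changes to these hyperparameters had very little effect on the prediction accuracy: the weighted accuracy on the validation set stayed at about 0.5 in almost every run, which is the score this metric gives to a classifier that predicts the same class for every county.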

Changing the number of hidden layers and size of each layer also had very little effect on predictive accuracy.


2.4.4 Does the test performance reach a given baseline 68% performance? (Please include a screenshot of Kaggle Submission)

Yes, the decision tree reached around 70% accuracy on Kaggle.




<h2>Part 3: Creative Solution</h2><p>

<h3>3.1 Open-ended Code:</h3><p>
You may follow the steps in part 2 again but making innovative changes like creating new features, using new training algorithms, etc. Make sure you explain everything clearly in part 3.2. Note that reaching the 75% creative baseline is only a small portion of this part. Any creative ideas will receive most points as long as they are reasonable and clearly explained.

In [None]:
# Make sure you comment your code clearly and you may refer to these comments in the part 3.2

#Random Forest
#Use randomized search to run k-fold cross-validation for a large number of different parameter combinations.
#The lists below give the possible values to be considered for each parameter. Each iteration of RandomizedSearchCV
#uses a combination of values from these lists and records the accuracy using the weighted accuracy function.

#'sqrt' is the same setting that 'auto' selected for classifiers
grid = {'n_estimators': [int(q) for q in np.linspace(start=100, stop=1500, num=100)],
               'max_features': ['sqrt'],
               'max_depth': [int(q) for q in (range(10,100))],             #10,100
               'bootstrap':[True, False]}

#scorer=make_scorer(sklearn.metrics.f1_score)
from sklearn.metrics import make_scorer
#make_scorer calls its function as f(y_true, y_pred), while weighted_accuracy expects (pred, true),
#so the arguments are swapped here
scorer = make_scorer(lambda y_true, y_pred: weighted_accuracy(y_pred, y_true), greater_is_better=True)
#Base estimator whose parameters are searched
forest=sklearn.ensemble.RandomForestClassifier()
searcher=sklearn.model_selection.RandomizedSearchCV(estimator = forest, param_distributions = grid, n_iter = 10, cv = 5, scoring=scorer )
searcher.fit(nntrain,nnlabels)

#give best parameters
print(searcher.best_params_)

In [139]:
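#max_features='sqrt' is the same setting that 'auto' selected for classifiers
forest= sklearn.ensemble.RandomForestClassifier(n_estimators=1000, max_features='sqrt', max_depth=18, bootstrap=True)
forest.fit(xnn_train,ynn_train)
frstpred=forest.predict(xnn_val)
#Weighted accuracy of the forest on the validation set
print(weighted_accuracy(frstpred,ynn_val))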


0.7072500528429507


In [136]:
#Create submission
creativepreds=forest.predict(test)

creativesub= pd.DataFrame({"FIPS": test.index ,
             "Result": creativepreds 
             })

creativesub=creativesub.set_index("FIPS")

creativesub.to_csv("drive/MyDrive/CS_4780_Project/creativesol.csv")




<h3>3.2 Explanation in Words:</h3><p>

You need to answer the following questions in a markdown cell after this cell:



3.2.1 How much did you manage to improve performance on the test set compared to part 2? Did you reach the 75% accuracy for the test in Kaggle? (Please include a screenshot of Kaggle Submission)
No

![submission](/drive/MyDrive/CS_4780_Project/submit2.png)


3.2.2 Please explain in detail how you achieved this and what you did specifically and why you tried this.

I used a random forest instead of a single decision tree for the creative solution. Random forests are more robust than single decision trees: because a random forest combines the votes of many decision trees, it limits overfitting and thus reduces generalization error.
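
As a rough illustration of this comparison, the sketch below cross-validates a single unpruned tree and a random forest on the same data and prints the mean and spread of their scores. It uses synthetic, noisy data from `make_classification` rather than the county data, and scikit-learn's `"balanced_accuracy"` scoring rather than the course's `weighted_accuracy`, so the numbers say nothing about the Kaggle result; which model scores higher can vary with the data and the seed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (not the county data).
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           flip_y=0.1, weights=[0.8, 0.2], random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    # 5-fold cross-validated balanced accuracy (mean of per-class recalls).
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    results[name] = (scores.mean(), scores.std())
    print(name, "mean:", round(scores.mean(), 3), "std:", round(scores.std(), 3))
```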

This time, for model selection I used a randomized search with k-fold cross-validation. Instead of testing one parameter at a time using for loops as I did in the basic solution, I tested different parameter combinations simultaneously. This allows one to test more possible models, thus increasing the chance of finding a model that generalizes well.

In deciding what ranges of parameter values to test, special attention was paid to ensuring that n_estimators (the number of trees in the forest) was sizable and that max_depth was not too large, in order to increase generalizability.

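A randomized search was chosen over a grid search (which tests all possible parameter combinations) in order to give the algorithm a reasonable runtime. A grid search may have found better parameters, but with 100 values of n_estimators, 90 values of max_depth and 2 bootstrap settings it would have had to evaluate 18,000 combinations, and the runtime would have been too long.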
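
To make the runtime argument concrete, the sketch below counts the combinations in the search space from part 3.1 with `ParameterGrid` and then runs a deliberately tiny randomized search. The small search uses synthetic data and a reduced grid (`small_grid`, an assumption chosen only so the example finishes quickly) and scikit-learn's `balanced_accuracy_score` as the metric; it is not a rerun of the search in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, make_scorer
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV

# Search space from part 3.1 ('sqrt' is what 'auto' meant for classifiers).
grid = {
    "n_estimators": [int(q) for q in np.linspace(start=100, stop=1500, num=100)],
    "max_features": ["sqrt"],
    "max_depth": list(range(10, 100)),
    "bootstrap": [True, False],
}
n_combinations = len(ParameterGrid(grid))
print("exhaustive grid:", n_combinations, "combinations,", n_combinations * 5, "fits with cv=5")
print("randomized search with n_iter=10:", 10 * 5, "fits with cv=5")

# Tiny demonstration search on synthetic data (not the county data).
X, y = make_classification(n_samples=300, n_features=6, weights=[0.8, 0.2], random_state=0)
small_grid = {"n_estimators": [10, 20, 40], "max_depth": [3, 5, 10], "bootstrap": [True, False]}

# make_scorer wraps a metric that is called as metric(y_true, y_pred).
scorer = make_scorer(balanced_accuracy_score)
searcher = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                              param_distributions=small_grid,
                              n_iter=4, cv=3, scoring=scorer, random_state=0)
searcher.fit(X, y)
print(searcher.best_params_, round(searcher.best_score_, 3))
```

Note the argument order: a metric passed to `make_scorer` receives the true labels first and the predictions second, which is the reverse of the `weighted_accuracy(pred, true)` signature used in this notebook.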


<h2>Part 4: Kaggle Submission</h2><p>
You need to generate a prediction CSV using the following cell from your trained model and submit the direct output of your code to Kaggle. The CSV shall contain TWO columns named exactly "FIPS" and "Result" and 1555 total rows excluding the column names. The "FIPS" column shall contain the FIPS codes of the counties in the same order as in test_2016_no_label.csv, while the "Result" column shall contain the 0 or 1 predictions for the corresponding counties. A sample prediction file can be downloaded from Kaggle.

In [None]:
# TODO

# You may use pandas to generate a dataframe with FIPS and your predictions first 
# and then use to_csv to generate a CSV file.
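
The format requirements above can be checked before uploading. The following is a minimal sketch under stated assumptions: the FIPS codes and predictions are placeholders (in the notebook they would come from `test.index` and a trained model), and the CSV is written to an in-memory buffer instead of a file on Drive.

```python
import io

import pandas as pd

# Placeholder inputs: in the notebook these would be test.index and model predictions.
fips = [18019, 6035, 40081]   # example FIPS codes, kept in the order of the test file
predictions = [0, 1, 0]       # made-up 0/1 predictions

submission = pd.DataFrame({"FIPS": fips, "Result": predictions})

# index=False drops the pandas row index so the file has exactly two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)

# Read the text back and check the format described above.
check = pd.read_csv(io.StringIO(buffer.getvalue()))
assert list(check.columns) == ["FIPS", "Result"]
assert set(check["Result"]).issubset({0, 1})
assert list(check["FIPS"]) == fips

print(buffer.getvalue())
```

For the real submission the same checks would also compare `len(check)` against the number of rows in test_2016_no_label.csv.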

<h2>Part 5: Resources and Literature Used</h2><p>