# ELECTION INSIGHTS: Using hypothesis testing to get political & social analytics.
Joe Ganser

**BACKGROUND**

An election was held between two congressmen in New Jersey, and the Republican candidate lost. A sample of 1,577 voters was drawn from a population of approximately 400,000 registered voters. What can the data tell us about the people who voted?

![NJ.png](NJ.png)

I got this data set from a take-home job interview assignment and felt compelled to explore it. I tried a number of modelling techniques, but the most interesting insights came from hypothesis tests. The tests used permutation resampling, relying on the central limit theorem to treat the resampled mean differences as approximately normal.


**OUTLINE**

I will share the key results of this project first, then describe the code used to produce them at the end of this notebook.

**BASE LINE DATA STRUCTURE**

The raw data was a table in which each row described an individual voter and whether or not that voter voted for the Republican candidate. The "target" was the column "voted_for_republican", and the other columns were descriptive features of each voter. The target column splits the sample into two categories: those who voted for him and those who voted against him. In the sample, 769 voters (49%) voted for the Republican and 808 voted against.

In [3]:
import pandas as pd
raw = pd.read_csv('Uncleaned_data1.csv').drop(['Unnamed: 0'],axis=1)
raw.head()

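| | voted_for_republican | very_freq_voter | freq_voter | voted_03_05_2013 | voted_05_21_2013 | voted_06_07_2016 | age | party | education | ismarried | home_owner | renters | ethnic | sex | hhcount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 1 | 51.0 | D | some_college | N | H | Y | white | M | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 53.0 | D | some_college | Y | | Y | white | M | 1 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 59.0 | R | doctorate | N | H | Y | hispanic | M | 2 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 23.0 | D | bachelors | N | H | N | black | M | 5 |
| 4 | 1 | 1 | 1 | 1 | 1 | 1 | 79.0 | R | HS_drop_out | N | | Y | white | F | 1 |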
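
The class balance quoted above can be checked with a couple of lines of pandas. This is only a sketch: the Series below is rebuilt from the two counts given in the text (769 and 808) and stands in for the real `voted_for_republican` column.

```python
import pandas as pd

# Stand-in for the target column, rebuilt from the counts quoted in the text.
votes = pd.Series([1] * 769 + [0] * 808, name="voted_for_republican")

counts = votes.value_counts()   # number of rows per class
share_for = votes.mean()        # mean of a 0/1 column = share of 1s

print(counts.to_dict(), f"{share_for:.1%}")
```

Because the target is a 0/1 column, its mean is the share of supporters. The same property is what the hypothesis tests later rely on for every categorical feature.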


**KEY RESULTS**

After performing hypothesis tests on the data, I identified several key categorical features indicating that someone would VOTE FOR the Republican candidate, as well as key categorical features indicating that someone would VOTE AGAINST him. Each feature listed in the following two tables had a distribution that differed significantly (p<0.01) between the supporters and non-supporters of the Republican candidate.

The features of people who were very likely to support the Republican candidate (ranked in order of significance):

In [28]:
support = pd.read_csv('support.csv').drop('Unnamed: 0',axis=1)
support.index = support.index+1
support

| | Feature | Favoring | For_Repub_Count | Against_Repub_Count | Difference |
|---|---|---|---|---|---|
| 1 | party_R | For | 179.0 | 97.0 | 82.0 |
| 2 | freq_voter party_R | For | 161.0 | 87.0 | 74.0 |
| 3 | voted_06_07_2016 party_R | For | 145.0 | 74.0 | 71.0 |
| 4 | party_R home_owner_H | For | 127.0 | 57.0 | 70.0 |
| 5 | voted_03_05_2013 party_R | For | 128.0 | 65.0 | 63.0 |
| 6 | voted_05_21_2013 party_R | For | 110.0 | 55.0 | 55.0 |
| 7 | party_R sex_M | For | 93.0 | 53.0 | 40.0 |
| 8 | party_R education_doctorate | For | 61.0 | 25.0 | 36.0 |
| 9 | education_doctorate | For | 61.0 | 25.0 | 36.0 |
| 10 | home_owner_H education_doctorate | For | 52.0 | 17.0 | 35.0 |


So what does each of these rows represent? The first row shows 'party_R', a feature indicating that the voter was a registered Republican. The column 'For_Repub_Count' gives the number of voters with that feature who voted for the Republican candidate, and 'Against_Repub_Count' gives the number who voted against him. So in row one, 179 registered Republicans voted for him and 97 voted against him. It makes sense that being a registered Republican was the strongest indicator that someone would vote Republican.

Further down the table we see combination features. For example, the feature in row four is being a registered Republican AND a home owner. Among voters with both attributes, 127 voted for the Republican and 57 voted against.
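
To make the counting concrete, here is a minimal sketch of how the two count columns could be computed for a single feature and for a combination feature. The five-row `toy` frame and the helper `count_by_vote` are hypothetical and only illustrate the idea; they are not the notebook's data or code.

```python
import pandas as pd

# Hypothetical toy data: five voters, two 0/1 features, and the target.
toy = pd.DataFrame({
    "voted_for_republican": [1, 1, 0, 1, 0],
    "party_R":              [1, 1, 1, 0, 0],
    "home_owner_H":         [1, 0, 1, 1, 0],
})

# A combination feature is 1 only when both parent features are 1.
toy["party_R home_owner_H"] = toy["party_R"] * toy["home_owner_H"]

def count_by_vote(frame, feature):
    """Return (for_count, against_count) among voters who have `feature`."""
    has_feature = frame[frame[feature] == 1]
    for_count = int((has_feature["voted_for_republican"] == 1).sum())
    against_count = int((has_feature["voted_for_republican"] == 0).sum())
    return for_count, against_count

party_counts = count_by_vote(toy, "party_R")
combo_counts = count_by_vote(toy, "party_R home_owner_H")
print(party_counts, combo_counts)
```

The 'Difference' column in the tables is the first count minus the second.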

A similar table was built for the features of voters who were more likely to vote against the Republican. The first row speaks for itself: registered Democrats were more likely to vote against him (even though he did have some Democratic support).

In [24]:
oppose = pd.read_csv('oppose.csv').drop('Unnamed: 0',axis=1)
oppose.index = oppose.index+1
oppose

| | Feature | Favoring | Against_Repub_Count | For_Repub_Count | Difference |
|---|---|---|---|---|---|
| 1 | party_D | Against | 602.0 | 454.0 | 148.0 |
| 2 | freq_voter party_D | Against | 476.0 | 344.0 | 132.0 |
| 3 | voted_03_05_2013 party_D | Against | 353.0 | 221.0 | 132.0 |
| 4 | voted_06_07_2016 party_D | Against | 519.0 | 394.0 | 125.0 |
| 5 | voted_05_21_2013 party_D | Against | 346.0 | 233.0 | 113.0 |
| 6 | education_some_college | Against | 209.0 | 131.0 | 78.0 |
| 7 | party_D education_some_college | Against | 209.0 | 131.0 | 78.0 |
| 8 | voted_03_05_2013 education_bachelors | Against | 170.0 | 100.0 | 70.0 |
| 9 | freq_voter education_some_college | Against | 162.0 | 102.0 | 60.0 |


Lastly, the average age of his supporters was statistically different from the average age of his opponents. His supporters were slightly younger on average, again with p<0.01.
![age_hist.png](age_hist.png)

**So how did I come to these conclusions?**

After the data was prepared, each categorical feature was encoded as a 1 or 0 indicating its presence, and a hypothesis test was run on each feature. Each test asked the same question: is the mean of this feature the same among those who voted for the Republican as among those who didn't? The null hypothesis was that the distribution was the same in both groups.
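
In symbols (the notation is introduced here for illustration and does not appear in the original analysis), let $\bar{x}_{\text{for}}$ and $\bar{x}_{\text{against}}$ be the sample means of a feature among the 769 supporters and the 808 non-supporters. For a 0/1 feature these means are simply the share of each group that has the feature.

$$
H_0:\ \mu_{\text{for}} = \mu_{\text{against}}, \qquad
d_{\text{obs}} = \bar{x}_{\text{for}} - \bar{x}_{\text{against}}, \qquad
Z = \frac{d_{\text{obs}} - \bar{d}_{\text{perm}}}{s_{\text{perm}}}
$$

Here $\bar{d}_{\text{perm}}$ and $s_{\text{perm}}$ are the mean and standard deviation of the mean differences obtained from the random shuffles described in the permutation testing section. As an example using the counts in the first table, 'party_R' has $\bar{x}_{\text{for}} = 179/769 \approx 0.233$ and $\bar{x}_{\text{against}} = 97/808 \approx 0.120$, so $d_{\text{obs}} \approx 0.113$.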

At first sight, the distributions of features among supporters and non-supporters aren't dramatically different, as the following pie charts show.

![Education_ethnicity_sex1.png](Education_ethnicity_sex1.png)

**These distributions look quite similar. How do we deal with this? Use combinations of features to get insights.**

For the most part, it's not simply the individual features that indicate whether someone voted for or against the Republican, but COMBINATIONS of the features present in each voter.

So after dropping the highly correlated features (those with an absolute correlation greater than 0.6) and converting the categorical variables into dummies, I put the table through scikit-learn's PolynomialFeatures, which produces a column for every pairwise combination of features.

In [5]:
from sklearn.preprocessing import PolynomialFeatures
data = pd.read_csv('data_dummified.csv').drop(['Unnamed: 0'],axis=1)
# Columns left out of the interaction step: the target and the two numeric features
exclude_these = ['voted_for_republican','age','hhcount']
features = data.drop(exclude_these,axis=1)
# degree=2 with interaction_only=True keeps each original column and adds every pairwise product
interaction = PolynomialFeatures(degree=2, interaction_only=True,include_bias=False)
interaction_values = interaction.fit_transform(features)
# get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2
c = interaction.get_feature_names_out(features.columns)
interaction_features = pd.DataFrame(interaction_values,columns=c)
data = pd.concat([data[exclude_these],interaction_features],axis=1)
print('New shape is: ',data.shape)
data.head()

New shape is:  (1577, 438)


| | voted_for_republican | age | hhcount | freq_voter | voted_03_05_2013 | voted_05_21_2013 | voted_06_07_2016 | party_AI | party_D | party_DS | ... | ethnic_black ethnic_hispanic | ethnic_black ethnic_indian | ethnic_black ethnic_jewish | ethnic_black ethnic_white | ethnic_hispanic ethnic_indian | ethnic_hispanic ethnic_jewish | ethnic_hispanic ethnic_white | ethnic_indian ethnic_jewish | ethnic_indian ethnic_white | ethnic_jewish ethnic_white |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 51.0 | 1 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0 | 53.0 | 1 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 59.0 | 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 23.0 | 5 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 79.0 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


### PERMUTATION HYPOTHESIS TESTING

Now that the data is cleaned and ready, here's where the hypothesis testing comes into play. The testing is done in a for loop that compares, column by column, the values for those who voted for the Republican with the values for those who didn't. A permutation hypothesis test basically works like this:

![resampling.jpg](resampling.jpg)

More specifically, the steps I used here are as follows:

       1. For each column, split the rows into two groups: those who voted for the 
       Republican and those who didn't.
       2. Order the column so that it starts with the values for the Republican's 
       voters, followed by the values for those who didn't vote for him. (The first 
       769 rows are the supporters; the remaining 808 are the non-supporters.)
       3. Find the mean of the column among those who voted for the Republican, and 
       the mean among those who didn't vote for him.
       4. Save the difference between the two means. This is the observed value.
       5. Shuffle the order of the column randomly.
       6. Take the mean of the first 769 rows and the mean of the remaining 808, and 
       save the difference.
       7. Repeat steps 5 & 6 10,000 times. This builds an approximately normal 
       distribution of mean differences.
       8. Compute the Z-score of the observed difference from step 4 against that 
       distribution.
       9. Find the area (p value) to the right of a positive Z-score or to the left of 
       a negative one.
       10. If |Z| > 2.58 and the one-tailed p < 0.005 (a two-tailed level of 0.01, the 
       thresholds used in the code), reject the null hypothesis that the distribution 
       of that feature is the same for those who supported the Republican candidate 
       and those who didn't.
       11. Perform this test on all features in the data table.
       12. List all features for which we can reject the null hypothesis.
       13. For each categorical feature that rejects the null, list the number of 
       supporters that had this feature and the number of opposers that had it. 
       14. If a feature had more supporters than non-supporters, then a voter with 
       that feature was more likely to vote Republican than otherwise, and vice 
       versa.
       
The code to do this is as follows:

In [17]:
from scipy import stats
from sklearn.metrics import auc
import numpy as np
import warnings
warnings.filterwarnings('ignore')

def seperate_dataframe(data,column):
    converters = list(data[data['voted_for_republican']==1][column])
    non_converters = list(data[data['voted_for_republican']==0][column])
    return converters, non_converters

def permutation_mean(convert_data, non_convert_data):
    """Generate one permutation sample and return its difference in group means."""
    # Concatenate the data sets: data
    data = np.concatenate((convert_data,non_convert_data))
    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)
    # Split at len(convert_data) so the two samples are disjoint and cover every row
    convert_permutation = permuted_data[:len(convert_data)]
    non_convert_permutation = permuted_data[len(convert_data):]
    # Return the difference between the permuted group means
    return convert_permutation.mean() - non_convert_permutation.mean()

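def build_normal_distribution(x,y,number_of_permutations):
    # Collect the permuted mean differences, sorted so the tail area can be integrated later
    means = np.sort([permutation_mean(x,y) for _ in range(number_of_permutations)])
    hmean = np.mean(means)
    hstd = np.std(means)
    Z_score = (means - hmean)/hstd
    # Standard normal density of each Z-score, so that integrating PDF over Zscore gives a probability
    # (a density over the raw means would scale the area by a factor of 1/hstd)
    pdf = stats.norm.pdf(Z_score)
    normal_distribution = pd.DataFrame({'means':means,'Zscore':Z_score,'PDF':pdf})
    return normal_distribution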

def get_guassian_values(x,series):
    one_over_std = 1/(np.std(series)*np.sqrt(2*np.pi))
    factor_numerator = -1*(x - np.mean(series))**2
    factor_denomenator = 2*(np.std(series)**2)
    exponential_factor = factor_numerator/factor_denomenator
    return one_over_std*np.exp(exponential_factor)

def Hypothesis_Test(data,column,number_of_permutations):
    cv,ncv = seperate_dataframe(data,column)
    normal_distribution = build_normal_distribution(cv,ncv,number_of_permutations)
    actual_mean_difference = np.mean(cv) - np.mean(ncv)
    Z_test = (actual_mean_difference-np.mean(normal_distribution['means']))/np.std(normal_distribution['means'])
    Guassian_of_actual = get_guassian_values(actual_mean_difference,normal_distribution['means'])
    if Z_test > 0:
        critical_region = normal_distribution[normal_distribution['Zscore']>Z_test][['Zscore','PDF']]
    elif Z_test<0:
        critical_region = normal_distribution[normal_distribution['Zscore']<Z_test][['Zscore','PDF']]
    else:
        critical_region = normal_distribution[['Zscore','PDF']]
    try:
        p_value = auc(critical_region['Zscore'],critical_region['PDF'])
    except ValueError:
        p_value = 0
    p_value=round(p_value,6)
    if (np.abs(Z_test)>2.58) and (p_value<0.005):
        if p_value==0:
            string= str(column)+'\n'+"Reject Null Hypothesis that both converters and non_converters \n have the same distribution, with p < 0.005 and Z score: "+str(Z_test)
            #print(string)
            return string
        else:
            string = str(column) + "Reject Null Hypothesis that both converters and non_converters \n have the same distribution, with p < "+str(p_value)+' and Z score: '+str(Z_test)
            #print(string)
            return string
    else:
        string = str(column)+" Fail to reject Null Hypothesis"
        #print(string)
        return string
    
tests_on_viewing1 = {}
for feature in data.drop('voted_for_republican',axis=1).columns:
    tests_on_viewing1[feature] = [Hypothesis_Test(data,feature,10000)]
    
tests = pd.DataFrame(tests_on_viewing1).transpose().reset_index().rename(columns={'index':'Feature',0:'Test'})
def pass_fail(x):
    if 'Fail' in x:
        return 'PASS'
    else:
        return "REJECT NULL"
tests['result'] = tests['Test'].apply(lambda x:pass_fail(x))
tests['Count'] = tests['result'].apply(lambda x: 1 if 'REJECT' in x else 0)
percent_of_rejecting_null_hypothesis_test = 100*tests['Count'].sum()/len(tests)
data_significant = data[['voted_for_republican']+list(tests[tests['Count']==1]['Feature'])]
feature_measures1_a = {}
feature_measures1_na = {}
for feature in data_significant.drop('voted_for_republican',axis=1).columns:
    if len(data_significant[feature].unique())==2:
        feature_measures1_a[feature] = [data_significant[data_significant['voted_for_republican']==1][feature].sum()]
        feature_measures1_na[feature] = [data_significant[data_significant['voted_for_republican']==0][feature].sum()]

s1 = pd.DataFrame(feature_measures1_a).transpose().reset_index().rename(columns={'index':'Feature',0:'For_A_Count'})
s2 = pd.DataFrame(feature_measures1_na).transpose().reset_index().rename(columns={'index':'Feature',0:'Against_A_Count'})
signal = pd.merge(s1,s2,how='inner',on='Feature')
signal['Difference'] = signal[['For_A_Count','Against_A_Count']].apply(lambda row: row['For_A_Count'] - row['Against_A_Count'],axis=1)
signal['Favoring']=signal['Difference'].apply(lambda x: 'For' if x>0 else 'Against')
signal=signal[['Feature','Favoring','For_A_Count','Against_A_Count','Difference']]
signal.sort_values(by='Favoring',ascending=True,inplace=True)
signal

| | Feature | Favoring | For_A_Count | Against_A_Count | Difference |
|---|---|---|---|---|---|
| 1 | education_some_college | Against | 131.0 | 209.0 | -78.0 |
| 2 | education_some_college ethnic_white | Against | 48.0 | 91.0 | -43.0 |
| 4 | freq_voter education_some_college | Against | 102.0 | 162.0 | -60.0 |
| 5 | freq_voter party_D | Against | 344.0 | 476.0 | -132.0 |
| 17 | voted_05_21_2013 party_D | Against | 233.0 | 346.0 | -113.0 |
| 8 | party_D | Against | 454.0 | 602.0 | -148.0 |
| 9 | party_D education_some_college | Against | 131.0 | 209.0 | -78.0 |
| 19 | voted_06_07_2016 party_D | Against | 394.0 | 519.0 | -125.0 |
| 15 | voted_03_05_2013 party_D | Against | 221.0 | 353.0 | -132.0 |
| 14 | voted_03_05_2013 education_bachelors | Against | 100.0 | 170.0 | -70.0 |


This table reproduces the key results discussed above.
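
For readers who want a compact version to experiment with, here is a sketch of the same permutation test written as a single function. It is not the code that produced the tables above. It differs in two ways: the normal-approximation p-value comes from `scipy.stats.norm.sf` rather than from integrating a density numerically, and it also reports an empirical p-value (the share of shuffles at least as extreme as the observed difference). The function name `permutation_test` and its parameters are my own choices for illustration.

```python
import numpy as np
from scipy import stats

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Permutation test for a difference in means between two groups.

    Returns the observed difference, its Z-score against the permutation
    distribution, a two-sided normal-approximation p-value, and a
    two-sided empirical p-value.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled = np.concatenate([a, b])
    observed = a.mean() - b.mean()

    diffs = np.empty(n_permutations)
    for i in range(n_permutations):
        shuffled = rng.permutation(pooled)
        # Split at a.size so the two shuffled groups are disjoint.
        diffs[i] = shuffled[:a.size].mean() - shuffled[a.size:].mean()

    z = (observed - diffs.mean()) / diffs.std()
    p_normal = 2 * stats.norm.sf(abs(z))
    # The +1 terms keep the empirical estimate away from exactly zero.
    p_empirical = (np.sum(np.abs(diffs) >= abs(observed)) + 1) / (n_permutations + 1)
    return observed, z, p_normal, p_empirical

# Rebuild the party_R column from the counts in the support table:
# 179 of 769 supporters and 97 of 808 opponents were registered Republicans.
supporters = np.concatenate([np.ones(179), np.zeros(769 - 179)])
opponents = np.concatenate([np.ones(97), np.zeros(808 - 97)])
observed, z, p_normal, p_empirical = permutation_test(
    supporters, opponents, n_permutations=2000
)

# Two-sided 1% cutoff on the Z scale, which is where the 2.58 threshold comes from.
critical_z = stats.norm.ppf(0.995)
print(f"diff={observed:.4f}  Z={z:.2f}  reject={abs(z) > critical_z}")
```

The empirical p-value does not rely on the permutation distribution being normal, so it is a useful cross-check for rare combination features where only a handful of voters have the feature.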