# Assignment 1. Opinion Polling

In this assignment, you will be expected to analyze a dataset on your own and answer questions about your findings. 

You’ve been hired as a consultant to predict how a state school board election will turn out.
* There are three candidates and all voters must vote for one of them: Pearle Goodman, Masako Holley, Genevieve Gallegos.
* The candidate with the final highest vote count wins the election.
* You are given the list of registered voters here:
https://github.com/sjyk/cmsc21800/blob/master/voters.csv
* The state gives you two samples of data, one collected by SurveyMonkey and one collected by Qualtrics:
https://github.com/sjyk/cmsc21800/blob/master/survey_monkey.csv 
https://github.com/sjyk/cmsc21800/blob/master/qualtrics.csv

## Initial Steps
Let's first set up our data analysis environment by loading all of the datasets:

In [33]:
import pandas as pd

voter_roll = pd.read_csv('voters.csv') #reads the data as a csv file

In [34]:
survey_monkey = pd.read_csv('survey_monkey.csv')
qualtrics = pd.read_csv('qualtrics.csv')

survey_monkey[:10]

Unnamed: 0,Voter,Vote
0,Robert Wiltse,Genevieve Gallegos
1,Ellen Delrio,Pearle Goodman
2,Armando Dawson,Genevieve Gallegos
3,Sarah Ybarra,Masako Holley
4,Roger Taylor,Genevieve Gallegos
5,Ricky Applewhite,Genevieve Gallegos
6,Rufus Harrison,Genevieve Gallegos
7,Emmitt Engelking,Genevieve Gallegos
8,Michael Curry,Masako Holley
9,Elmer Jones,Genevieve Gallegos


In [40]:
candidates = set(survey_monkey['Vote']).union(set(qualtrics['Vote']))
candidates

{'Genevieve Gallegos', 'Masako Holley', 'Pearle Goodman'}

In [44]:
survey_monkey.groupby('Vote').count()/survey_monkey['Vote'].count()

Vote,Voter
Genevieve Gallegos,0.59
Masako Holley,0.35
Pearle Goodman,0.06


In [43]:
qualtrics['Vote'].value_counts(normalize=True)

Masako Holley         0.50
Genevieve Gallegos    0.42
Pearle Goodman        0.08
Name: Vote, dtype: float64

In [45]:
survey_monkey = survey_monkey.merge(voter_roll)
qualtrics = qualtrics.merge(voter_roll)
qualtrics[:10]

Unnamed: 0,Voter,Vote,Gender,Age,County
0,Michael Thomas,Masako Holley,male,36-45,Mountain Farm
1,Jeri Cowley,Genevieve Gallegos,female,36-45,Mountain Farm
2,Alexander Moose,Genevieve Gallegos,male,36-45,Mountain Farm
3,Linda Abraham,Masako Holley,female,36-45,Mountain Farm
4,Alfred Blocker,Genevieve Gallegos,male,36-45,Mountain Farm
5,Leona Hill,Masako Holley,female,65+,Black
6,Brittany Silva,Masako Holley,female,46-55,Mountain Farm
7,Lawrence Knight,Masako Holley,male,46-55,Mountain Farm
8,Michael White,Genevieve Gallegos,male,56-65,Mountain Farm
9,Annette Ealy,Masako Holley,female,36-45,Bailey


In [47]:
Nqualtrics = len(qualtrics)
Nsurvey = len(survey_monkey)

Nqualtrics, Nsurvey

(50, 100)

### Q1. The SurveyMonkey data shows Genevieve Gallegos winning with 59% of the vote among 100 people polled, while the Qualtrics data shows her losing with 42% of the vote among 50 people polled. Which of the following best describes the likelihood that a difference this large (17 percentage points or more) happened purely by random chance and not because of an error in the polling process?

In [48]:
"""
Let's calculate the likelihood that a single poll could be off by 17%
"""

import numpy as np

#the maximum variance (b-a)^2/4 
MAX_VARIANCE = 0.25

#calculates the worst-case confidence interval half-widths for a sample of the given size
def ci(size):
    se = np.sqrt(MAX_VARIANCE/size)
    return {'68% +/-': se, '95% +/-': 1.96*se, '99% +/-': 2.57*se}

print('50 polled: ', ci(Nqualtrics))
print('100 polled: ', ci(Nsurvey))

50 polled:  {'68% +/-': 0.07071067811865475, '95% +/-': 0.13859292911256332, '99% +/-': 0.1817264427649427}
100 polled:  {'68% +/-': 0.05, '95% +/-': 0.098, '99% +/-': 0.1285}


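A 17-point gap is larger than the 95% margin of error of either poll on its own (about ±14 points for 50 respondents and about ±10 points for 100), so it is unlikely to be the result of random chance alone. So (c) is the right answer.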
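
The intervals above describe each poll separately. As a complementary sketch, not part of the original solution, the gap between the two polls can also be tested directly. It assumes the polls are independent, uses the same worst-case variance of 0.25, and relies on a normal approximation; `diff_z_score` is a hypothetical helper introduced only for illustration.

```python
import math

MAX_VARIANCE = 0.25  # worst-case variance of a 0/1 response, as in the cell above


def diff_z_score(p1, n1, p2, n2):
    """Number of standard errors separating two independent poll results."""
    # Variances of independent estimates add, so the standard error of the
    # difference combines the worst-case variance of both polls.
    se_diff = math.sqrt(MAX_VARIANCE / n1 + MAX_VARIANCE / n2)
    return (p1 - p2) / se_diff


# SurveyMonkey: 59% of 100 respondents; Qualtrics: 42% of 50 respondents
z = diff_z_score(0.59, 100, 0.42, 50)

# Two-sided tail probability under the normal approximation (an assumption)
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print('z-score of the gap:', round(z, 2))
print('two-sided p-value:', round(p_two_sided, 3))
```

Under these assumptions the gap comes out at roughly two standard errors, which supports the view that chance alone is an unlikely explanation.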

### Q2. The data provider suspects that the SurveyMonkey dataset is biased. What do you think?

In [49]:
cols = ['County', 'Age', 'Gender']
for col in cols:
    print(survey_monkey.groupby(col)[col].count()/100)
    print(qualtrics.groupby(col)[col].count()/50)
    print()

County
Bailey           0.03
Black            0.08
Mountain Farm    0.79
Riverside        0.10
Name: County, dtype: float64
County
Bailey           0.06
Black            0.02
Mountain Farm    0.80
Riverside        0.12
Name: County, dtype: float64

Age
26-35    0.04
36-45    0.16
46-55    0.47
56-65    0.15
65+      0.18
Name: Age, dtype: float64
Age
26-35    0.06
36-45    0.22
46-55    0.34
56-65    0.26
65+      0.12
Name: Age, dtype: float64

Gender
female    0.23
male      0.77
Name: Gender, dtype: float64
Gender
female    0.42
male      0.58
Name: Gender, dtype: float64



The SurveyMonkey sample is 77% male, compared with 58% in the Qualtrics sample, so it looks gender-biased. Let's see if this could have happened by chance.

In [51]:
voter_roll.groupby('Gender')['Gender'].count()/4239
# Men make up a proportion of 0.521585 (about 52.2%) of the whole population

observed_difference = 0.77-0.521585
#calculate the worst case standard error
se = np.sqrt(MAX_VARIANCE/100)
print('Number of se\'s from the expected value', observed_difference/se)

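Number of se's from the expected value 4.968300000000001

A gap of almost 5 standard errors is very unlikely to arise by chance, so the SurveyMonkey sample appears to over-represent men.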
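
The same comparison against the voter roll can be repeated for any demographic column. The following is a minimal sketch under stated assumptions: `compare_to_population` is a hypothetical helper, the toy frames are made-up data rather than the real files, and the standard error uses the population share instead of the worst-case variance.

```python
import numpy as np
import pandas as pd


def compare_to_population(sample, population, col):
    """Compare category shares in a sample with those in the population."""
    sample_share = sample[col].value_counts(normalize=True)
    pop_share = population[col].value_counts(normalize=True)
    # Align on category labels; a category missing from one side gets share 0
    table = pd.DataFrame({'sample': sample_share,
                          'population': pop_share}).fillna(0.0)
    # Standard error of a sample proportion when the true share is known
    se = np.sqrt(table['population'] * (1 - table['population']) / len(sample))
    table['z'] = (table['sample'] - table['population']) / se
    return table


# Made-up toy data for illustration only (not the real voter roll)
toy_population = pd.DataFrame({'Gender': ['male'] * 52 + ['female'] * 48})
toy_sample = pd.DataFrame({'Gender': ['male'] * 77 + ['female'] * 23})

result = compare_to_population(toy_sample, toy_population, 'Gender')
print(result)
```

With the real data, the call would presumably look like `compare_to_population(survey_monkey, voter_roll, 'Gender')`, and likewise for `'Age'` and `'County'`.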


### Q3. Which of the following best describes the margin of error of the Qualtrics poll?

In [52]:
print('50 polled: ', ci(Nqualtrics))

50 polled:  {'68% +/-': 0.07071067811865475, '95% +/-': 0.13859292911256332, '99% +/-': 0.1817264427649427}

At the conventional 95% confidence level, the worst-case margin of error for the 50-person Qualtrics poll is about ±13.9 percentage points.


### Q4.  A news report suggests that Pearle Goodman is dropping out of the election. Is it clear which candidate benefits from her departure?

In [53]:
combined_dataset = pd.concat([survey_monkey, qualtrics])

candidates = ['Genevieve Gallegos', 'Masako Holley', 'Pearle Goodman']
for cand in candidates:
    filtered = combined_dataset[combined_dataset['Vote'] == cand] #get those rows that voted for each candidate
    
    print("--- Breakdown for", cand ,"---")
    
    cols = ['County', 'Age', 'Gender']
    for col in cols:
        
        print(filtered.groupby(col)[col].count()/len(filtered))
    print("++")
    print()

--- Breakdown for Genevieve Gallegos ---
County
Bailey           0.025
Black            0.075
Mountain Farm    0.775
Riverside        0.125
Name: County, dtype: float64
Age
26-35    0.0750
36-45    0.1625
46-55    0.4125
56-65    0.1875
65+      0.1625
Name: Age, dtype: float64
Gender
female    0.0875
male      0.9125
Name: Gender, dtype: float64
++

--- Breakdown for Masako Holley ---
County
Bailey           0.066667
Black            0.050000
Mountain Farm    0.800000
Riverside        0.083333
Name: County, dtype: float64
Age
26-35    0.016667
36-45    0.200000
46-55    0.450000
56-65    0.166667
65+      0.166667
Name: Age, dtype: float64
Gender
female    0.6
male      0.4
Name: Gender, dtype: float64
++

--- Breakdown for Pearle Goodman ---
County
Mountain Farm    0.9
Riverside        0.1
Name: County, dtype: float64
Age
36-45    0.2
46-55    0.4
56-65    0.3
65+      0.1
Name: Age, dtype: float64
Gender
female    0.1
male      0.9
Name: Gender, dtype: float64
++



As you can see above, Pearle Goodman has the same male-female breakdown as Genevieve Gallegos. So it would be reasonable to assume that Goodman's votes would go to Gallegos. However, we also accepted arguments that the sample size was too small to tell.
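
To illustrate the sample-size concern, here is a rough sketch of the worst-case margin of error for the Goodman subgroup. It assumes the subgroup size can be read off the reported shares (6% of 100 SurveyMonkey respondents plus 8% of 50 Qualtrics respondents); the names `n_goodman` and `margin_95` are introduced here for illustration.

```python
import numpy as np

MAX_VARIANCE = 0.25  # worst-case variance of a 0/1 response

# Assumed subgroup size: 6 SurveyMonkey + 4 Qualtrics respondents chose Goodman
n_goodman = round(0.06 * 100) + round(0.08 * 50)

# Worst-case 95% margin of error for any share measured within this subgroup
se = np.sqrt(MAX_VARIANCE / n_goodman)
margin_95 = 1.96 * se

print('Goodman respondents:', n_goodman)
print('95% +/-:', round(margin_95, 2))
```

With only about 10 respondents, the worst-case margin on any demographic share is roughly ±31 percentage points, which is why the "too small to tell" argument is also defensible.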