# Python and R

In [1]:
%load_ext rpy2.ipython
%load_ext autoreload
%autoreload 2

%matplotlib inline  
from matplotlib import rcParams
rcParams['figure.figsize'] = (16, 100)

import warnings
from rpy2.rinterface import RRuntimeWarning
warnings.filterwarnings("ignore")  # Ignore all warnings
# warnings.filterwarnings("ignore", category=RRuntimeWarning)  # Alternative: ignore only R runtime warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML

# show all columns on pandas dataframes
pd.set_option('display.max_columns', None)


In [3]:
%%javascript
// Disable auto-scrolling
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}


In [2]:
%%R

# My commonly used R imports

require('tidyverse')


R[write to console]: Loading required package: tidyverse



── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()



# Read the data



The cell below loads the data in Python:

In [4]:
df = pd.read_csv('raw-polls.csv')
df.sample(5)

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
1081,54510,117136,1710,2000,2000_Sen-G_UT,UT,Sen-G,Sen-G,Harris Insights & Analytics,133,Live Phone,,11/3/00,1638.0,Scott N. Howell,3878,DEM,32.0,Orrin G. Hatch,3879,REP,64.0,,-32.0,11/7/00,31.51,65.58,-34.07,2.07,1.0,
1883,63852,117555,787,2004,2004_Pres-G_NM,NM,Pres-G,Pres-G,Zogby Interactive/JZ Analytics,395,Live Phone,,10/16/04,520.0,John Kerry,157,DEM,53.6,George W. Bush,182,REP,44.1,1.0,9.5,11/2/04,49.05,49.84,-0.79,10.29,0.0,
4563,14861,19701,38,2008,2008_Pres-G_US,US,Pres-G,Pres-G,ABC News/The Washington Post,3,Live Phone,,10/28/08,1327.0,Barack Obama,41,DEM,52.0,John McCain,44,REP,44.0,,8.0,11/4/08,52.88,45.61,7.27,0.73,1.0,for The Washington Post
4828,16423,24952,1569,2008,2008_Sen-G_NC,NC,Sen-G,Sen-G,Public Policy Polling,263,IVR,,11/1/08,2100.0,Kay R. Hagan,2709,DEM,51.0,Elizabeth H. Dole,2710,REP,44.0,3.0,7.0,11/4/08,52.65,44.18,8.47,-1.47,1.0,
6282,30303,36332,665,2012,2012_Pres-G_IA,IA,Pres-G,Pres-G,Public Policy Polling,263,IVR,D,10/24/12,690.0,Barack Obama,16,DEM,49.0,Mitt Romney,9,REP,47.0,,2.0,11/6/12,51.99,46.18,5.81,-3.81,1.0,for unspecified Democratic sponsor


The cell below loads the same data in R:

In [5]:
%%R

df <- read_csv('raw-polls.csv')

df

Rows: 10776 Columns: 31
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): race, location, type_simple, type_detail, pollster, methodology, p...
dbl (17): poll_id, question_id, race_id, year, pollster_rating_id, samplesiz...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 10,776 × 31
   poll_id questio…¹ race_id  year race  locat…² type_…³ type_…⁴ polls…⁵ polls…⁶
     <dbl>     <dbl>   <dbl> <dbl> <chr> <chr>   <chr>   <chr>   <chr>     <dbl>
 1   26013     87909    1455  1998 1998… NY      Gov-G   Gov-G   Blum &…      32
 2   26255     87926    1456  1998 1998… OH      Gov-G   Gov-G   Univer…     346
 3   26026     31266    1736  1998 1998… NV      Sen-G   Sen-G   FM3 Re…      91
 4   26013     31253    1738  1998 1998… NY      Sen-G   Sen-G   Blum &…      32
 5   63632    117103    1738  1998 1998… NY      Sen-G 

# Guided Exploration

In this section you'll make a few charts to explore the data. I'll raise some questions, and you'll dig around in the data to answer them, using summary statistics, charts, or both. You will have to make some methodological choices along the way. Be aware of what choices you're making! I'll ask you about them shortly.


## Question 1: How accurate are polls from the following pollsters?
Characterize the accuracy of each of these pollsters in a sentence or two. Then, write another few sentences justifying your characterization with insights from the data.
- Siena College/The New York Times Upshot
- Jayhawk Consulting
- Fox News/Beacon Research/Shaw & Co. Research
- Brown University
- American Research Group


👉 **Siena College/The New York Times Upshot** 

In [8]:

df_nyt = df[df['pollster'] == "Siena College/The New York Times Upshot"]
df_nyt



Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
8112,47325,74070,52,2016,2016_Gov-G_NC,NC,Gov-G,Gov-G,Siena College/The New York Times Upshot,448,Live Phone,,10/22/16,792.0,Roy A. Cooper,8967,DEM,51.0,Pat McCrory,8959,REP,45.0,,6.0,11/8/16,49.02,48.80,0.22,5.78,1.0,for New York Times | New York Times Upshot
8116,47325,74050,62,2016,2016_Sen-G_NC,NC,Sen-G,Sen-G,Siena College/The New York Times Upshot,448,Live Phone,,10/22/16,792.0,Deborah K. Ross,10153,DEM,47.0,Richard Burr,8963,REP,46.0,,1.0,11/8/16,45.37,51.06,-5.70,6.70,0.0,for New York Times | New York Times Upshot
8137,47325,74047,629,2016,2016_Pres-G_NC,NC,Pres-G,Pres-G,Siena College/The New York Times Upshot,448,Live Phone,,10/22/16,792.0,Hillary Rodham Clinton,9207,DEM,47.5,Donald Trump,9849,REP,40.0,8.0,7.5,11/8/16,46.17,49.83,-3.66,11.16,0.0,for New York Times | New York Times Upshot; av...
8189,47551,74389,86,2016,2016_Sen-G_PA,PA,Sen-G,Sen-G,Siena College/The New York Times Upshot,448,Live Phone,,10/24/16,824.0,Kathleen Alana McGinty,8985,DEM,47.0,Patrick J. Toomey,8966,REP,44.0,,3.0,11/8/16,47.34,48.77,-1.43,4.43,0.0,for New York Times Upshot
8203,47551,74387,640,2016,2016_Pres-G_PA,PA,Pres-G,Pres-G,Siena College/The New York Times Upshot,448,Live Phone,,10/24/16,824.0,Hillary Rodham Clinton,9207,DEM,47.5,Donald Trump,9849,REP,40.5,6.0,7.0,11/8/16,47.46,48.18,-0.72,7.72,0.0,for New York Times Upshot; average of multiple...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10435,72478,136016,6214,2020,2020_Pres-G_AZ,AZ,Pres-G,Pres-G,Siena College/The New York Times Upshot,448,Live Phone,,10/28/20,1252.0,Joseph R. Biden Jr.,13256,DEM,49.0,Donald Trump,13254,REP,43.0,3.0,6.0,11/3/20,49.36,49.06,0.31,5.69,1.0,
10476,72481,136019,6259,2020,2020_Pres-G_WI,WI,Pres-G,Pres-G,Siena College/The New York Times Upshot,448,Live Phone,,10/28/20,1253.0,Joseph R. Biden Jr.,13256,DEM,52.0,Donald Trump,13254,REP,41.0,3.0,11.0,11/3/20,49.45,48.82,0.63,10.37,1.0,
10478,72478,136021,6268,2020,2020_Sen-GS_AZ,AZ,Sen-G,Sen-GS,Siena College/The New York Times Upshot,448,Live Phone,,10/28/20,1252.0,Mark Kelly,13445,DEM,50.0,Martha McSally,13446,REP,43.0,,7.0,11/3/20,51.16,48.81,2.35,4.65,1.0,
10508,72479,136017,6220,2020,2020_Pres-G_FL,FL,Pres-G,Pres-G,Siena College/The New York Times Upshot,448,Live Phone,,10/29/20,1451.0,Joseph R. Biden Jr.,13256,DEM,47.0,Donald Trump,13254,REP,44.0,2.0,3.0,11/3/20,47.86,51.22,-3.36,6.36,0.0,


In [9]:
df_nyt.bias.describe()

count    82.000000
mean      1.422927
std       5.219059
min     -15.010000
25%      -2.075000
50%       1.515000
75%       5.125000
max      11.200000
Name: bias, dtype: float64

👉 **Jayhawk Consulting**

In [23]:
df_jayhawk = df[df['pollster'] == "Jayhawk Consulting Services"]
df_jayhawk

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
7325,36081,49015,5495,2014,2014_House-G_KS-1,KS-1,House-G,House-G,Jayhawk Consulting Services,157,Live Phone,D,10/26/14,400.0,James E. Sherow,5342,DEM,45.0,Tim Huelskamp,5335,REP,38.0,,7.0,11/4/14,32.03,67.97,-35.94,42.94,0.0,for James E. Sherow
9146,56550,90986,330,2018,2018_House-G_KS-1,KS-1,House-G,House-G,Jayhawk Consulting Services,157,Live Phone,D,10/23/18,600.0,Alan LaPolice,11713,DEM,38.0,Roger Marshall,11714,REP,42.0,,-4.0,11/6/18,31.85,68.15,-36.29,32.29,1.0,for Alan LaPolice
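
The filter above needs the full name `Jayhawk Consulting Services`, which is longer than the name given in the question list, and an exact `==` match on the shorter name would select no rows. A minimal sketch of a substring lookup for finding the exact spelling follows. `find_pollster_names` is a hypothetical helper, and the short Series stands in for `df['pollster']`.

```python
import pandas as pd

# Pollster names copied from this notebook; a stand-in for df["pollster"].
pollsters = pd.Series([
    "Siena College/The New York Times Upshot",
    "Jayhawk Consulting Services",
    "Fox News/Beacon Research/Shaw & Co. Research",
    "Brown University",
    "American Research Group",
    "Public Policy Polling",
], name="pollster")


def find_pollster_names(names, fragment):
    """Hypothetical helper: unique names containing `fragment`, ignoring case."""
    # regex=False treats the fragment literally, so "&" or "." need no escaping.
    mask = names.str.contains(fragment, case=False, regex=False, na=False)
    return sorted(names[mask].unique())


print(find_pollster_names(pollsters, "jayhawk"))
```

On the real data, the call would be `find_pollster_names(df['pollster'], 'jayhawk')`.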


👉 **Fox News/Beacon Research/Shaw & Co. Research**

In [14]:
df_fox = df[df['pollster'] == 'Fox News/Beacon Research/Shaw & Co. Research']
df_fox

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
6069,30144,36088,662,2012,2012_Pres-G_FL,FL,Pres-G,Pres-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/18/12,1130.0,Barack Obama,16,DEM,45.0,Mitt Romney,9,REP,48.0,,-3.0,11/6/12,50.01,49.13,0.88,-3.88,0.0,
6080,30142,36086,688,2012,2012_Pres-G_OH,OH,Pres-G,Pres-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/18/12,1131.0,Barack Obama,16,DEM,46.0,Mitt Romney,9,REP,43.0,,3.0,11/6/12,50.67,47.69,2.98,0.02,1.0,
6307,30311,36340,698,2012,2012_Pres-G_VA,VA,Pres-G,Pres-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/24/12,1126.0,Barack Obama,16,DEM,44.0,Mitt Romney,9,REP,46.0,,-2.0,11/6/12,51.16,47.28,3.87,-5.87,0.0,
6510,43261,36642,37,2012,2012_Pres-G_US,US,Pres-G,Pres-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/29/12,1128.0,Barack Obama,16,DEM,46.0,Mitt Romney,9,REP,46.0,,0.0,11/6/12,51.02,47.18,3.85,-3.85,0.5,for FOX News
7329,36137,49088,8271,2014,2014_House-G_US,US,House-G,House-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/26/14,734.0,Generic Candidate,9973,DEM,45.0,Generic Candidate,9974,REP,44.0,,1.0,11/4/14,44.84,50.42,-5.59,6.59,0.0,for FOX News
7394,34414,42123,14,2014,2014_Sen-G_NC,NC,Sen-G,Sen-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/29/14,909.0,Kay R. Hagan,6123,DEM,43.0,Thom Tillis,6129,REP,42.0,4.0,1.0,11/4/14,47.26,48.82,-1.56,2.56,0.0,for FOX News
7401,34369,42157,21,2014,2014_Sen-G_KS,KS,Sen-G,Sen-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/29/14,907.0,Pat Roberts,6154,REP,43.0,Gregory Orman,8355,IND,44.0,3.0,-1.0,11/4/14,53.15,42.53,10.62,,0.0,for FOX News
7405,34314,42096,26,2014,2014_Sen-G_IA,IA,Sen-G,Sen-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/29/14,911.0,Bruce L. Braley,6104,DEM,44.0,Joni K. Ernst,6106,REP,45.0,,-1.0,11/4/14,43.76,52.1,-8.34,7.34,1.0,for FOX News
7408,34314,42008,1226,2014,2014_Gov-G_IA,IA,Gov-G,Gov-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/29/14,911.0,Jack G. Hatch,8648,DEM,36.0,Terry E. Branstad,8647,REP,53.0,,-17.0,11/4/14,37.27,58.99,-21.72,4.72,1.0,for FOX News
7410,34369,42063,1228,2014,2014_Gov-G_KS,KS,Gov-G,Gov-G,Fox News/Beacon Research/Shaw & Co. Research,103,Live Phone,,10/29/14,907.0,Paul Davis,8712,DEM,48.0,Sam Brownback,8711,REP,42.0,4.0,6.0,11/4/14,46.13,49.82,-3.69,9.69,0.0,for FOX News


In [20]:
df_fox.bias.describe()

count    31.000000
mean      3.073226
std       5.096175
min      -5.870000
25%      -0.290000
50%       2.630000
75%       6.060000
max      15.610000
Name: bias, dtype: float64

👉 **Brown University**

In [18]:
df_brown = df[df['pollster'] == 'Brown University']
df_brown

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
384,7278,8927,7150,2000,2000_Pres-D_RI,RI,Pres-P,Pres-D,Brown University,35,Live Phone,,2/20/00,222.0,Al Gore,222,DEM,37.0,Bill Bradley,224,DEM,24.0,,13.0,3/7/00,56.92,40.35,16.57,,1.0,among registered voters
408,64034,117851,7152,2000,2000_Pres-D_VT,VT,Pres-P,Pres-D,Brown University,35,Live Phone,,2/26/00,321.0,Al Gore,222,DEM,56.0,Bill Bradley,224,DEM,35.0,,21.0,3/7/00,54.33,43.89,10.44,,1.0,
592,6416,7883,845,2000,2000_Pres-G_RI,RI,Pres-G,Pres-G,Brown University,35,Live Phone,,10/22/00,370.0,Al Gore,222,DEM,47.0,George W. Bush,241,REP,29.0,8.0,18.0,11/7/00,60.99,31.91,29.08,-11.08,1.0,
602,6416,27199,1707,2000,2000_Sen-G_RI,RI,Sen-G,Sen-G,Brown University,35,Live Phone,,10/22/00,370.0,Robert A. Weygand,3854,DEM,28.0,Lincoln Chafee,3855,REP,52.0,2.0,-24.0,11/7/00,41.15,56.88,-15.73,-8.27,1.0,
1251,25424,88455,1409,2002,2002_Gov-G_RI,RI,Gov-G,Gov-G,Brown University,35,Live Phone,,10/20/02,418.0,Myrth York,12931,DEM,41.0,Donald Carcieri,12932,REP,34.0,,7.0,11/5/02,45.24,54.76,-9.52,16.52,0.0,
1254,25424,30664,1675,2002,2002_Sen-G_RI,RI,Sen-G,Sen-G,Brown University,35,Live Phone,,10/20/02,418.0,Jack Reed,3565,DEM,61.0,Robert G. Tingle,3566,REP,14.0,,47.0,11/5/02,78.43,21.57,56.85,-9.85,1.0,
1257,25424,117562,3067,2002,2002_House-G_RI-1,RI-1,House-G,House-G,Brown University,35,Live Phone,,10/20/02,194.0,Patrick J. Kennedy,13834,DEM,44.0,David W. Rogers,13835,REP,27.0,,17.0,11/5/02,59.88,37.31,22.57,-5.57,1.0,
3746,2383,2964,7404,2008,2008_Pres-D_RI,RI,Pres-P,Pres-D,Brown University,35,Live Phone,,2/29/08,402.0,Hillary Rodham Clinton,45,DEM,42.0,Barack Obama,41,DEM,37.0,,5.0,3/4/08,58.44,40.4,18.04,,1.0,
6985,34002,41321,1239,2014,2014_Gov-G_RI,RI,Gov-G,Gov-G,Brown University,35,Live Phone,,10/16/14,1129.0,Gina M. Raimondo,8744,DEM,41.6,Allan W. Fung,8740,REP,30.5,9.1,11.1,11/4/14,40.7,36.24,4.47,6.63,1.0,
7319,34316,42010,1239,2014,2014_Gov-G_RI,RI,Gov-G,Gov-G,Brown University,35,Live Phone,,10/26/14,500.0,Gina M. Raimondo,8744,DEM,38.0,Allan W. Fung,8740,REP,37.4,11.8,0.6,11/4/14,40.7,36.24,4.47,-3.87,1.0,


In [21]:
df_brown.bias.describe()

count     7.000000
mean     -2.212857
std      10.138818
min     -11.080000
25%      -9.060000
50%      -5.570000
75%       1.380000
max      16.520000
Name: bias, dtype: float64
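
Note that `describe()` reports a count of 7 although the table above has 10 rows: `bias` is missing for the three primary polls, where both candidates are from the same party. The sketch below rebuilds those ten rows from the table and compares the signed `bias` with an absolute error on the margin, which is defined for every row. The `abs_error` column is an assumption of this sketch, not a column in the dataset.

```python
import numpy as np
import pandas as pd

# margin_poll, margin_actual and bias copied from the ten Brown University
# rows shown above; bias is NaN for the three primary (Pres-P) polls.
brown = pd.DataFrame({
    "type_simple": ["Pres-P", "Pres-P", "Pres-G", "Sen-G", "Gov-G",
                    "Sen-G", "House-G", "Pres-P", "Gov-G", "Gov-G"],
    "margin_poll": [13.0, 21.0, 18.0, -24.0, 7.0, 47.0, 17.0, 5.0, 11.1, 0.6],
    "margin_actual": [16.57, 10.44, 29.08, -15.73, -9.52,
                      56.85, 22.57, 18.04, 4.47, 4.47],
    "bias": [np.nan, np.nan, -11.08, -8.27, 16.52,
             -9.85, -5.57, np.nan, 6.63, -3.87],
})

# Series.count() skips NaN, which is why describe() reports 7, not 10.
n_rows = len(brown)
n_with_bias = int(brown["bias"].count())

# Absolute error on the margin is defined for every row, primaries included.
brown["abs_error"] = (brown["margin_poll"] - brown["margin_actual"]).abs()

print(n_rows, n_with_bias)
print(round(brown["bias"].mean(), 2), round(brown["abs_error"].mean(), 2))
```

Whether to include primaries when judging accuracy is one of the methodological choices to note later.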

👉 **American Research Group**

In [19]:
df_grp = df[df['pollster'] == 'American Research Group']
df_grp

Unnamed: 0,poll_id,question_id,race_id,year,race,location,type_simple,type_detail,pollster,pollster_rating_id,methodology,partisan,polldate,samplesize,cand1_name,cand1_id,cand1_party,cand1_pct,cand2_name,cand2_id,cand2_party,cand2_pct,cand3_pct,margin_poll,electiondate,cand1_actual,cand2_actual,margin_actual,bias,rightcall,comment
315,7384,9127,7115,2000,2000_Pres-D_NH,NH,Pres-P,Pres-D,American Research Group,9,Live Phone,,1/27/00,600.0,Al Gore,222,DEM,50.0,Bill Bradley,224,DEM,43.0,,7.0,2/1/00,49.73,45.59,4.14,,1.0,
318,7384,9129,7116,2000,2000_Pres-R_NH,NH,Pres-P,Pres-R,American Research Group,9,Live Phone,,1/27/00,600.0,John McCain,14677,REP,36.0,George W. Bush,241,REP,34.0,17.0,2.0,2/1/00,48.53,30.36,18.17,,1.0,
342,7357,9084,7115,2000,2000_Pres-D_NH,NH,Pres-P,Pres-D,American Research Group,9,Live Phone,,1/30/00,600.0,Al Gore,222,DEM,50.0,Bill Bradley,224,DEM,45.0,,5.0,2/1/00,49.73,45.59,4.14,,1.0,
346,7357,9087,7116,2000,2000_Pres-R_NH,NH,Pres-P,Pres-R,American Research Group,9,Live Phone,,1/30/00,600.0,John McCain,14677,REP,36.0,George W. Bush,241,REP,38.0,16.0,-2.0,2/1/00,48.53,30.36,18.17,,0.0,
357,7345,9062,7120,2000,2000_Pres-R_SC,SC,Pres-P,Pres-R,American Research Group,9,Live Phone,,2/3/00,600.0,George W. Bush,241,REP,42.0,John McCain,14677,REP,45.0,3.0,-3.0,2/19/00,53.39,41.87,11.52,,0.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9580,63413,116434,7694,2020,2020_Pres-D_IA,IA,Pres-P,Pres-D,American Research Group,9,Live Phone,,1/29/20,400.0,Bernard Sanders,13257,DEM,23.0,Pete Buttigieg,13345,DEM,9.0,15.0,14.0,2/3/20,24.71,21.31,3.41,,1.0,
9598,63493,116743,7729,2020,2020_Pres-D_NH,NH,Pres-P,Pres-D,American Research Group,9,Live Phone,,2/9/20,400.0,Bernard Sanders,13257,DEM,28.0,Pete Buttigieg,13345,DEM,20.0,13.0,8.0,2/11/20,25.60,24.28,1.32,,1.0,
10367,72226,135493,6241,2020,2020_Pres-G_NH,NH,Pres-G,Pres-G,American Research Group,9,Live Phone,,10/27/20,600.0,Joseph R. Biden Jr.,13256,DEM,58.0,Donald Trump,13254,REP,39.0,1.0,19.0,11/3/20,52.71,45.36,7.35,11.65,1.0,
10414,72226,135495,6286,2020,2020_Sen-G_NH,NH,Sen-G,Sen-G,American Research Group,9,Live Phone,,10/27/20,600.0,Jeanne Shaheen,13448,DEM,57.0,Corky Messner,14492,REP,40.0,0.0,17.0,11/3/20,56.64,40.99,15.65,1.35,1.0,


In [22]:
df_grp.bias.describe()

count    86.000000
mean      0.113023
std       5.737122
min     -10.100000
25%      -3.502500
50%      -0.560000
75%       2.825000
max      26.760000
Name: bias, dtype: float64

## Question 2: Which pollsters are the most accurate? Which are the least accurate?

👉 Which pollsters are the most accurate?

👉 Which are the least accurate?
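
One possible starting point, sketched under stated assumptions: rank pollsters by mean absolute error on the margin, and optionally drop pollsters with too few polls to judge. `rank_pollsters` and its `min_polls` parameter are hypothetical names, and the rows are copied from tables earlier in this notebook. Other definitions of accuracy (for example `rightcall`, or the signed `bias`) are equally defensible.

```python
import pandas as pd

# Rows copied from tables earlier in this notebook (two Jayhawk, three Fox,
# two Public Policy Polling), keeping only the columns used here.
polls = pd.DataFrame({
    "pollster": ["Jayhawk Consulting Services"] * 2
                + ["Fox News/Beacon Research/Shaw & Co. Research"] * 3
                + ["Public Policy Polling"] * 2,
    "margin_poll": [7.0, -4.0, -3.0, 3.0, -2.0, 7.0, 2.0],
    "margin_actual": [-35.94, -36.29, 0.88, 2.98, 3.87, 8.47, 5.81],
})


def rank_pollsters(df, min_polls=1):
    """Hypothetical helper: rank pollsters by mean absolute error on the margin."""
    errors = df.assign(abs_error=(df["margin_poll"] - df["margin_actual"]).abs())
    summary = errors.groupby("pollster")["abs_error"].agg(
        n_polls="count", mean_abs_error="mean"
    )
    # Dropping pollsters with few polls is itself a methodological choice.
    summary = summary[summary["n_polls"] >= min_polls]
    return summary.sort_values("mean_abs_error")


ranking = rank_pollsters(polls, min_polls=2)
print(ranking)
```

On the full dataset the call would be `rank_pollsters(df, min_polls=...)`. The threshold matters: a pollster with only two polls, like Jayhawk Consulting Services, can land at either extreme of the ranking by chance.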

### Question 2 Reflections

👉 Write a summary paragraph explaining how you decided what constitutes "most accurate" and "least accurate".


👉 In bullet point form, name **methodological choices** you made in the process of determining which pollsters were the most and least accurate.


👉 In bullet point form, list the **limitations** of your approach.
