## Q 10 The statsmodels package (installed in the code cell above) includes built-in datasets. Execute the code below to download data from the American National Election Studies of 1996 and print a detailed description of the schema.


In [12]:
import pandas as pd 
import statsmodels.api as sm
import numpy as np

anes96 = sm.datasets.anes96
print(anes96.NOTE)

::

    Number of observations - 944
    Number of variables - 10

    Variables name definitions::

            popul - Census place population in 1000s
            TVnews - Number of times per week that respondent watches TV news.
            PID - Party identification of respondent.
                0 - Strong Democrat
                1 - Weak Democrat
                2 - Independent-Democrat
                3 - Independent-Indpendent
                4 - Independent-Republican
                5 - Weak Republican
                6 - Strong Republican
            age : Age of respondent.
            educ - Education level of respondent
                1 - 1-8 grades
                2 - Some high school
                3 - High school graduate
                4 - Some college
                5 - College degree
                6 - Master's degree
                7 - PhD
            income - Income of household
                1  - None or less than $2,999
                2  - $3,000-$4,9

### a) The DataFrame (df) contains data on registered voters in the United States, including demographic information and political preference. Using pandas, print the first 5 rows of the DataFrame to get a sense of what the data looks like.

In [50]:
df = anes96.load_pandas().data
df.sample(5)

Unnamed: 0,popul,TVnews,selfLR,ClinLR,DoleLR,PID,age,educ,income,vote,logpopul
66,30.0,5.0,7.0,7.0,2.0,0.0,37.0,4.0,4.0,0.0,3.404525
563,0.0,1.0,4.0,4.0,6.0,1.0,41.0,4.0,19.0,0.0,-2.302585
121,170.0,2.0,4.0,2.0,6.0,6.0,21.0,3.0,8.0,1.0,5.136386
419,0.0,7.0,4.0,3.0,4.0,3.0,39.0,4.0,17.0,0.0,-2.302585
507,0.0,2.0,6.0,2.0,6.0,6.0,38.0,3.0,18.0,1.0,-2.302585


### b) Answer the following questions.
#### i. How many observations are in the DataFrame?
#### ii. How many variables are measured (how many columns)?
#### iii. What is the age of the youngest person in the data? The oldest?
#### iv. How many days a week does the average respondent watch TV news (round to the nearest tenth)?
#### v. Check for missing values. Are there any?

In [76]:
print(df.shape[0]) 
print(df.shape[1])
print(df['age'].min())
print(df['age'].max())
print(df['TVnews'].mean().round())
df.isnull().sum()

944
11
19.0
91.0
4.0


popul       0
TVnews      0
selfLR      0
ClinLR      0
DoleLR      0
PID         0
age         0
educ        0
income      0
vote        0
logpopul    0
dtype: int64

### c) We want to adjust the dataset for our use. Do the following:
#### i. Rename the educ column as education.
#### ii. Create a new column called party based on each respondent's answer to PID.
####  party should equal Democrat if the respondent selected either Strong Democrat or Weak Democrat.
####  party will equal Republican if the respondent selected Strong or Weak Republican for PID and
####  party will equal Independent if they selected anything else.
#### iii. Create a new column called age_group that buckets respondents into the following categories based on their age: 18-24, 25-34, 35-44, 45-54, 55-64, and 65 and over.

In [113]:
df.rename(columns={'educ':'education'}).head(2)

Unnamed: 0,popul,TVnews,selfLR,ClinLR,DoleLR,PID,age,education,income,vote,logpopul,party
0,0.0,7.0,7.0,1.0,6.0,6.0,36.0,education,1.0,1.0,-2.302585,Republican
1,190.0,1.0,3.0,3.0,5.0,1.0,20.0,education,1.0,0.0,5.24755,Democrat


In [101]:
def categorise_parties(pid):
    if pid in [1,2]:
        return 'Democrat'
    elif pid in [6,7]:
        return 'Republican'
    else:
        return 'Independent'

In [109]:
df['party'] = df['PID'].apply(categorise_parties)
df.sample(2)

Unnamed: 0,popul,TVnews,selfLR,ClinLR,DoleLR,PID,age,educ,income,vote,logpopul,party
321,3.0,6.0,5.0,2.0,6.0,5.0,67.0,education,15.0,1.0,1.131402,Independent
631,40.0,0.0,3.0,3.0,6.0,1.0,48.0,education,20.0,0.0,3.691376,Democrat


In [None]:
def age_group(age):
    if age in range(18,25):
        return '18-24'
    elif age in range(25,35):
        return '25-34'
    elif age in range(35,45):
        return '35-44'
    elif age in range(44,55):
        return '45-54'
    elif age in range(55,65):
        return ''