#### **Importing Data and Making a DataFrame**
The statsmodels package (installed in the code cell above) includes built-in datasets. Execute the code below to download data from the American National Election Studies of 1996 and print a detailed description of the schema.

The next cell extracts the Dataset object from the submodule and saves the DataFrame to the variable df. In the questions that follow, use df when referencing the dataset.

#### **Import statements go here**

import pandas as pd
import statsmodels.api as sm
import numpy as np

anes96 = sm.datasets.anes96
print(anes96.NOTE)

dataset_anes96 = anes96.load_pandas()
df = dataset_anes96.data

Q1. DataFrame Basic Properties Exercise
Our DataFrame (df) contains data on registered voters in the United States, including demographic information and political preference. Using pandas, print the first 5 rows of the DataFrame to get a sense of what the data looks like. Next, answer the following questions:

How many observations are in the DataFrame?
How many variables are measured (how many columns)?
What is the age of the youngest person in the data? The oldest?
How many days a week does the average respondent watch TV news (round to the nearest tenth)?
Check for missing values. Are there any?

In [8]:
import pandas as pd 
import statsmodels.api as sm
import numpy as np

anes96 = sm.datasets.anes96
print(anes96.NOTE)








::

    Number of observations - 944
    Number of variables - 10

    Variables name definitions::

            popul - Census place population in 1000s
            TVnews - Number of times per week that respondent watches TV news.
            PID - Party identification of respondent.
                0 - Strong Democrat
                1 - Weak Democrat
                2 - Independent-Democrat
                3 - Independent-Indpendent
                4 - Independent-Republican
                5 - Weak Republican
                6 - Strong Republican
            age : Age of respondent.
            educ - Education level of respondent
                1 - 1-8 grades
                2 - Some high school
                3 - High school graduate
                4 - Some college
                5 - College degree
                6 - Master's degree
                7 - PhD
            income - Income of household
                1  - None or less than $2,999
                2  - $3,000-$4,9

In [68]:
dataset_anes96 = anes96.load_pandas()
df = dataset_anes96.data
print(df.head(5))
print(f"Number of observations: {len(df)}")
print(f"Number of columns: {len(df.columns)}")
print(f"Age of youngest person: {df['age'].min()}")
print(f"Age of oldest person: {df['age'].max()}")
# The exercise asks for the nearest tenth, so round to one decimal place
print(f"Average days per week watching TV news: {round(df['TVnews'].mean(), 1)}")
print(f"Missing values: {df.isnull().sum().sum()}")




   popul  TVnews  selfLR  ClinLR  DoleLR  PID   age  educ  income  vote  \
0    0.0     7.0     7.0     1.0     6.0  6.0  36.0   3.0     1.0   1.0   
1  190.0     1.0     3.0     3.0     5.0  1.0  20.0   4.0     1.0   0.0   
2   31.0     7.0     2.0     2.0     6.0  1.0  24.0   6.0     1.0   0.0   
3   83.0     4.0     3.0     4.0     5.0  1.0  28.0   6.0     1.0   0.0   
4  640.0     7.0     5.0     6.0     4.0  0.0  68.0   6.0     1.0   0.0   

   logpopul  
0 -2.302585  
1  5.247550  
2  3.437208  
3  4.420045  
4  6.461624  
Number of observations: 944
Number of columns: 11
Age of youngest person: 19.0
Age of oldest person: 91.0
Average days per week watching TV news: 3.7
Missing values: 0
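
The same checks can be sketched with `shape`, `min`/`max`, `mean`, and `isna`. The example below is a minimal sketch, not the answer key: `toy` is a hypothetical stand-in for df with made-up values, so the numbers it prints are not ANES results.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are illustrative, not ANES data.
toy = pd.DataFrame({
    "age": [36.0, 20.0, 24.0, 68.0],
    "TVnews": [7.0, 1.0, 7.0, 3.0],
})

n_rows, n_cols = toy.shape                   # observations, variables
youngest = toy["age"].min()
oldest = toy["age"].max()
avg_tv = round(toy["TVnews"].mean(), 1)      # rounded to the nearest tenth
n_missing = int(toy.isna().sum().sum())      # total number of missing cells

print(n_rows, n_cols, youngest, oldest, avg_tv, n_missing)
```

`shape` returns rows and columns in one call, which avoids counting them separately with `len`.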


Q2. Data Processing Exercise
We want to adjust the dataset for our use. Do the following:

Rename the educ column education.
Create a new column called party based on each respondent's answer to PID. party should equal Democrat if the respondent selected either Strong Democrat or Weak Democrat. party will equal Republican if the respondent selected Strong or Weak Republican for PID and Independent if they selected anything else.
Create a new column called age_group that buckets respondents into the following categories based on their age: 18-24, 25-34, 35-44, 45-54, 55-64, and 65 and over.

In [69]:
df = df.rename(columns={"educ": "education"})

def party_from_pid(row):
    # PID 0-1: Strong/Weak Democrat, PID 5-6: Weak/Strong Republican,
    # everything else (2-4) counts as Independent
    if row["PID"] < 2:
        return "Democrat"
    elif row["PID"] > 4:
        return "Republican"
    else:
        return "Independent"


df["party"] = df.apply(party_from_pid, axis=1)
print(df)

def age_bucket(row):
    # Each branch is only reached when the earlier, younger buckets did not match
    if row["age"] < 25:
        return "18-24"
    elif row["age"] < 35:
        return "25-34"
    elif row["age"] < 45:
        return "35-44"
    elif row["age"] < 55:
        return "45-54"
    elif row["age"] < 65:
        return "55-64"
    else:
        return "65 and over"

df["age_group"] = df.apply(age_bucket, axis=1)
print(df)
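
The two derived columns can also be built without a row-wise `apply`. The sketch below is one possible alternative, not the required solution: it uses `np.select` for party and `pd.cut` for age_group on a hypothetical frame `toy` whose rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; PID and age follow the coding described in the exercise.
toy = pd.DataFrame({
    "PID": [0.0, 1.0, 3.0, 6.0],
    "age": [24.0, 25.0, 44.0, 65.0],
})

# The first matching condition wins; anything unmatched falls back to the default.
toy["party"] = np.select(
    [toy["PID"] < 2, toy["PID"] > 4],
    ["Democrat", "Republican"],
    default="Independent",
)

# right=False makes each bin include its left edge: [18, 25), [25, 35), ...
edges = [18, 25, 35, 45, 55, 65, np.inf]
labels = ["18-24", "25-34", "35-44", "45-54", "55-64", "65 and over"]
toy["age_group"] = pd.cut(toy["age"], bins=edges, labels=labels, right=False)

print(toy)
```

Keeping the bin edges and labels in two lists makes a mislabeled bucket easier to spot than in a chain of `elif` branches.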

Q3. Filtering Data Exercise
Use the filtering method to find all the respondents who have the impression that Bill Clinton is moderate or conservative (ClinLR equals 4 or higher). How many respondents are in this subset?

Among these respondents, how many have a household income less than $50,000 and attended at least some college?

In [73]:
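get_subset = df[df["ClinLR"] >= 4]
print(len(get_subset))

# "Among these respondents": filter the subset, not the full DataFrame.
# income < 20 is the cutoff used here for household income under $50,000;
# education > 3 means at least some college.
result = get_subset[(get_subset["income"] < 20) & (get_subset["education"] > 3)]
print(len(result))



282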
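
Because the second question is restricted to the respondents found in the first, the second mask has to be applied to the subset. A minimal sketch on a hypothetical frame `toy` with made-up values, assuming the same income cutoff (codes below 20) that the answer cell uses:

```python
import pandas as pd

# Hypothetical rows; row 0 meets the income and education conditions
# but has ClinLR below 4, so it must not be counted.
toy = pd.DataFrame({
    "ClinLR": [1.0, 4.0, 5.0, 6.0],
    "income": [10.0, 12.0, 22.0, 19.0],
    "education": [4.0, 3.0, 6.0, 4.0],
})

moderate = toy[toy["ClinLR"] >= 4]

# Each condition needs its own parentheses because & binds tighter than < and >.
both = moderate[(moderate["income"] < 20) & (moderate["education"] > 3)]

print(len(moderate), len(both))
```

A count taken this way can never exceed the size of the first subset, which is a quick sanity check on the result.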


Q4. Calculating From Data Exercise
For each of the below match-ups, choose the group that is more likely to vote for Bill Clinton. You can calculate this using the percentage of each group that intends to vote for Clinton (vote).

Another way to think about this: Given that a respondent is a Democrat, there is a \_\_\_\_ percent chance they will vote for Clinton. How does this value change if the respondent is a Republican?

Which match-up was the closest? Which had the biggest difference?

Democrats or Republicans
People younger than 44 or People 44 and older
People who watch TV news at least 6 days a week or People who watch TV news less than 3 days a week
People who live somewhere with a population greater than the average respondent or People who live in a place with a population equal to or less than the average respondent


In [101]:
clinton = df["vote"] == 0  # True where the respondent intends to vote for Clinton

# PID 0-1 are the Democrats and PID 5-6 the Republicans, the same split used for party
dem_pct = clinton[df["PID"] < 2].mean() * 100
rep_pct = clinton[df["PID"] > 4].mean() * 100
print(f" Democrats {dem_pct:.2f} Republican {rep_pct:.2f}")

young_pct = clinton[df["age"] < 44].mean() * 100
old_pct = clinton[df["age"] >= 44].mean() * 100
print(f" Youngerthan44 => {young_pct:.2f} Olderthan44 => {old_pct:.2f}")

# Gap between the two age groups in percentage points; subtract first, then format
age_gap = young_pct - old_pct



 Democrats 96.32 Republican 10.46
 Youngerthan44 => 59.48 Olderthan44 => 57.29
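
Each match-up is a conditional percentage, so it can be computed by grouping a boolean "votes for Clinton" Series. The sketch below is illustrative only: `toy` is a hypothetical frame with made-up rows, and it assumes, as the answer cell does, that vote equal to 0 marks a Clinton vote.

```python
import pandas as pd

# Hypothetical rows, not ANES data.
toy = pd.DataFrame({
    "party": ["Democrat", "Democrat", "Republican", "Republican", "Independent"],
    "vote": [0.0, 0.0, 1.0, 0.0, 1.0],
})

clinton = toy["vote"] == 0

# The mean of a boolean Series is the fraction of True values in each group.
share = clinton.groupby(toy["party"]).mean() * 100

# Size of the match-up gap in percentage points.
gap = abs(share["Democrat"] - share["Republican"])

print(share.round(2))
print(f"gap: {gap:.2f}")
```

Computing the gap as a number first and formatting it afterwards keeps arithmetic out of the f-string format specifier.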

Q6. Voting Across the Aisle
We are interested in learning more about respondents whose political views differ strongly from those of the candidate they expect to vote for. Using selfLR, vote, ClinLR, and DoleLR, work through the following questions. Your interpretation may differ from the answer key.

What is the largest recorded difference between a respondent's political leaning and their impression of their intended candidate's political leaning?
How many respondents exhibit a difference of that magnitude?
Make a separate DataFrame called sway that only includes voters who exhibit a difference greater than |3|.
Among those in sway, are respondents more likely to be voting for a candidate more conservative or more liberal than their own political leaning?
In sway, which candidate is the more popular choice?
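
One possible way to approach these questions is to pick, for each respondent, the rating of the candidate they intend to vote for and subtract their own rating. The sketch below is an interpretation, not the answer key. It assumes that vote equal to 0 means Clinton and 1 means Dole, and that higher values on the LR scales are more conservative. `toy`, `gap`, and the other names are hypothetical, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical rows, not ANES data.
toy = pd.DataFrame({
    "selfLR": [7.0, 3.0, 1.0, 6.0],
    "ClinLR": [1.0, 3.0, 2.0, 2.0],
    "DoleLR": [6.0, 5.0, 6.0, 5.0],
    "vote":   [0.0, 0.0, 1.0, 1.0],
})

# Rating of the intended candidate: ClinLR for vote == 0, DoleLR otherwise.
candidate_lr = np.where(toy["vote"] == 0, toy["ClinLR"], toy["DoleLR"])

# Positive gap: the candidate is seen as more conservative than the respondent.
toy["gap"] = candidate_lr - toy["selfLR"]

largest = toy["gap"].abs().max()
n_largest = int((toy["gap"].abs() == largest).sum())

sway = toy[toy["gap"].abs() > 3]
n_more_conservative = int((sway["gap"] > 0).sum())
n_more_liberal = int((sway["gap"] < 0).sum())
votes_in_sway = sway["vote"].value_counts()

print(largest, n_largest, len(sway), n_more_conservative, n_more_liberal)
print(votes_in_sway)
```

Keeping the sign in `gap` and taking the absolute value only for the magnitude questions lets the same column answer both the "how large" and the "which direction" parts.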


In [74]:
print(df)


     popul  TVnews  selfLR  ClinLR  DoleLR  PID   age  education  income  \
0      0.0     7.0     7.0     1.0     6.0  6.0  36.0        3.0     1.0   
1    190.0     1.0     3.0     3.0     5.0  1.0  20.0        4.0     1.0   
2     31.0     7.0     2.0     2.0     6.0  1.0  24.0        6.0     1.0   
3     83.0     4.0     3.0     4.0     5.0  1.0  28.0        6.0     1.0   
4    640.0     7.0     5.0     6.0     4.0  0.0  68.0        6.0     1.0   
..     ...     ...     ...     ...     ...  ...   ...        ...     ...   
939    0.0     7.0     7.0     1.0     6.0  4.0  73.0        6.0    24.0   
940    0.0     7.0     5.0     2.0     6.0  6.0  50.0        6.0    24.0   
941    0.0     3.0     6.0     2.0     7.0  5.0  43.0        6.0    24.0   
942    0.0     6.0     6.0     2.0     5.0  6.0  46.0        7.0    24.0   
943   18.0     7.0     4.0     2.0     6.0  3.0  61.0        7.0    24.0   

     vote  logpopul       party    age_group  
0     1.0 -2.302585  Republican        2