# Python pandas Practice Problems



In [2]:
# Import statements go here

import pandas as pd
import statsmodels.api as sm

## Importing Data and Making a DataFrame
The statsmodels package (installed in the code cell above) includes built-in datasets. Execute the code below to download data from the [American National Election Studies of 1996](https://www.statsmodels.org/dev/datasets/generated/anes96.html) and print a detailed description of the schema.

The next cell extracts the `Dataset` object from the submodule and saves the `DataFrame` to the variable `df`. In the questions that follow, use `df` when referencing the dataset.

In [3]:
anes96 = sm.datasets.anes96
print(anes96.NOTE)

::

    Number of observations - 944
    Number of variables - 10

    Variables name definitions::

            popul - Census place population in 1000s
            TVnews - Number of times per week that respondent watches TV news.
            PID - Party identification of respondent.
                0 - Strong Democrat
                1 - Weak Democrat
                2 - Independent-Democrat
                3 - Independent-Indpendent
                4 - Independent-Republican
                5 - Weak Republican
                6 - Strong Republican
            age : Age of respondent.
            educ - Education level of respondent
                1 - 1-8 grades
                2 - Some high school
                3 - High school graduate
                4 - Some college
                5 - College degree
                6 - Master's degree
                7 - PhD
            income - Income of household
                1  - None or less than $2,999
                2  - $3,000-$4,9

In [4]:
dataset_anes96 = anes96.load_pandas()
df = dataset_anes96.data

## 1. DataFrame Basic Properties Exercise

Our DataFrame (`df`) contains data on registered voters in the United States, including demographic information and political preference. Using `pandas`, print the first 5 rows of the DataFrame to get a sense of what the data looks like. Next, answer the following questions:


*   How many observations are in the DataFrame?
*   How many variables are measured (how many columns)?
*   What is the age of the youngest person in the data? The oldest?
*   How many days a week does the average respondent watch TV news (round to the nearest tenth)?
*   Check for missing values. Are there any?






In [5]:
# Your code here
df.head()

Unnamed: 0,popul,TVnews,selfLR,ClinLR,DoleLR,PID,age,educ,income,vote,logpopul
0,0.0,7.0,7.0,1.0,6.0,6.0,36.0,3.0,1.0,1.0,-2.302585
1,190.0,1.0,3.0,3.0,5.0,1.0,20.0,4.0,1.0,0.0,5.24755
2,31.0,7.0,2.0,2.0,6.0,1.0,24.0,6.0,1.0,0.0,3.437208
3,83.0,4.0,3.0,4.0,5.0,1.0,28.0,6.0,1.0,0.0,4.420045
4,640.0,7.0,5.0,6.0,4.0,0.0,68.0,6.0,1.0,0.0,6.461624


## 2. Data Processing Exercise

We want to adjust the dataset for our use. Do the following:


*   Rename the `educ` column `education`.
*   Create a new column called `party` based on each respondent's answer to `PID`. `party` should equal `Democrat` if the respondent selected either Strong Democrat or Weak Democrat. `party` will equal `Republican` if the respondent selected Strong or Weak Republican for `PID` and `Independent` if they selected anything else.
*   Create a new column called `age_group` that buckets respondents into the following categories based on their `age`: 18-24, 25-34, 35-44, 45-54, 55-64, and 65 and over. 



In [25]:
# Your code here
df.rename(columns={'educ':'education'},inplace=True)

    

#df['party']=df[(df['PID']==0) | (df['PID']==1)]#Democrat
# df['party']=df[(df['pid']==5) | (df['pid']==6)]#repbulican
# df['party']=df[(df['pid']==2) | (df['pid']==3)| (df['pid']==4)| (df['pid']==5)]#independent
#df[k]='Democrat'

def party(row):
    p=row['PID']
    a=row['age']
    if p in (0,1) and a > 30:
        return 'Democrat'
    elif p in (5,6):
        return 'Republican'
    else:
        return 'Independent'

df['party'] = df.apply(party,axis=1)

df

Unnamed: 0,popul,TVnews,selfLR,ClinLR,DoleLR,PID,age,education,income,vote,logpopul,party
0,0.0,7.0,7.0,1.0,6.0,6.0,36.0,3.0,1.0,1.0,-2.302585,Republican
1,190.0,1.0,3.0,3.0,5.0,1.0,20.0,4.0,1.0,0.0,5.247550,Independent
2,31.0,7.0,2.0,2.0,6.0,1.0,24.0,6.0,1.0,0.0,3.437208,Independent
3,83.0,4.0,3.0,4.0,5.0,1.0,28.0,6.0,1.0,0.0,4.420045,Independent
4,640.0,7.0,5.0,6.0,4.0,0.0,68.0,6.0,1.0,0.0,6.461624,Democrat
...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,7.0,7.0,1.0,6.0,4.0,73.0,6.0,24.0,1.0,-2.302585,Independent
940,0.0,7.0,5.0,2.0,6.0,6.0,50.0,6.0,24.0,1.0,-2.302585,Republican
941,0.0,3.0,6.0,2.0,7.0,5.0,43.0,6.0,24.0,1.0,-2.302585,Republican
942,0.0,6.0,6.0,2.0,5.0,6.0,46.0,7.0,24.0,1.0,-2.302585,Republican


In [29]:

age_bins = [18, 25, 35, 45, 55, 65, 120]
age_labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65+']


df['age_group'] = pd.cut(df['age'], bins=age_bins, labels=age_labels, right=False)
df

Unnamed: 0,popul,TVnews,selfLR,ClinLR,DoleLR,PID,age,education,income,vote,logpopul,party,age_group
0,0.0,7.0,7.0,1.0,6.0,6.0,36.0,3.0,1.0,1.0,-2.302585,Republican,35-44
1,190.0,1.0,3.0,3.0,5.0,1.0,20.0,4.0,1.0,0.0,5.247550,Independent,18-24
2,31.0,7.0,2.0,2.0,6.0,1.0,24.0,6.0,1.0,0.0,3.437208,Independent,18-24
3,83.0,4.0,3.0,4.0,5.0,1.0,28.0,6.0,1.0,0.0,4.420045,Independent,25-34
4,640.0,7.0,5.0,6.0,4.0,0.0,68.0,6.0,1.0,0.0,6.461624,Democrat,65+
...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,7.0,7.0,1.0,6.0,4.0,73.0,6.0,24.0,1.0,-2.302585,Independent,65+
940,0.0,7.0,5.0,2.0,6.0,6.0,50.0,6.0,24.0,1.0,-2.302585,Republican,45-54
941,0.0,3.0,6.0,2.0,7.0,5.0,43.0,6.0,24.0,1.0,-2.302585,Republican,35-44
942,0.0,6.0,6.0,2.0,5.0,6.0,46.0,7.0,24.0,1.0,-2.302585,Republican,45-54


In [28]:
df.dtypes


popul         float64
TVnews        float64
selfLR        float64
ClinLR        float64
DoleLR        float64
PID           float64
age           float64
education     float64
income        float64
vote          float64
logpopul      float64
party          object
age_group    category
dtype: object

In [None]:
PID - Party identification of respondent.
                0 - Strong Democrat
                1 - Weak Democrat
                2 - Independent-Democrat
                3 - Independent-Indpendent
                4 - Independent-Republican
                5 - Weak Republican
                6 - Strong Republican

## 3. Filtering Data Exercise

Use the filtering method to find all the respondents who have the impression that Bill Clinton is moderate or conservative (`ClinLR` equals 4 or higher). How many respondents are in this subset? 

Among these respondents, how many have a household income less than $50,000 and attended at least some college?

In [32]:
# Your code here
k=df['ClinLR']>=4
l=df[k]
df[k].shape[0]
l[(l['income']<50000) & (l['education']>=4) ]





Unnamed: 0,popul,TVnews,selfLR,ClinLR,DoleLR,PID,age,education,income,vote,logpopul,party
3,83.0,4.0,3.0,4.0,5.0,1.0,28.0,6.0,1.0,0.0,4.420045,Democrat
4,640.0,7.0,5.0,6.0,4.0,0.0,68.0,6.0,1.0,0.0,6.461624,Democrat
5,110.0,3.0,3.0,4.0,6.0,1.0,21.0,4.0,1.0,0.0,4.701389,Democrat
6,100.0,7.0,5.0,6.0,4.0,1.0,77.0,4.0,1.0,0.0,4.606170,Democrat
7,31.0,1.0,5.0,4.0,5.0,4.0,21.0,4.0,1.0,0.0,3.437208,Independent
...,...,...,...,...,...,...,...,...,...,...,...,...
908,18.0,7.0,5.0,4.0,6.0,4.0,52.0,7.0,24.0,1.0,2.895912,Independent
911,24.0,7.0,3.0,4.0,5.0,1.0,38.0,7.0,24.0,0.0,3.182212,Democrat
912,18.0,0.0,2.0,4.0,6.0,1.0,51.0,7.0,24.0,0.0,2.895912,Democrat
921,0.0,7.0,2.0,4.0,5.0,2.0,53.0,6.0,24.0,0.0,-2.302585,Independent


## 4. Calculating From Data Exercise

For each of the below match-ups, choose the group that is more likely to vote for Bill Clinton. You can calculate this using the percentage of each group that intends to vote for Clinton (`vote`). Which match-up was the closest? Which had the biggest difference?

Another way to think about this: Given that a respondent is a Democrat, there is a ____ percent chance they will vote for Clinton. How does this value change if the respondent is a Republican?

*   Democrats or Republicans
*   People younger than 44 or People 44 and older
*   People who watch TV news at least 6 days a week or People who watch TV news less than 3 days a week
*   People who live somewhere with a population greater than the average respondent or People who live in a place with a population equal to or less than the average respondent


In [188]:
# Democrats or Republicans
demo_Clinton=df[(df['party']=='Democrat')&(df['vote']==0)].count()
demo_all=df[(df['party']=='Democrat')].count()
rep_Clinton=df[(df['party']=='Republican')&(df['vote']==0)].count()
rep_all=df[(df['party']=='Republican')].count()
print('The percentage of vote for Clinton if party is Democrats ',round((demo_Clinton['party']/demo_all['party'])*100,4))
print('The percentage of vote for Clinton if party is Republican ',round((rep_Clinton['party']/rep_all['party'])*100,4))
#clinton_percentage_democrats = (df[df['party'] == 'Democrat']['vote'] == 0).mean() * 100
#print(clinton_percentage_democrats)
# df[df['party']=='Democrat']
# df[df['age']>44]
# df[df['TVnews']<=6]
# print((df['popul']>df['selfLR'].mean()))



The percentage of vote for Clinton if party is Democrats  96.3158
The percentage of vote for Clinton if party is Republican  10.4615


In [115]:
#People younger than 44 or People 44 and older
younger_Clinton=df[(df['age']<44)&(df['vote']==0)].count()
younger_all=df[(df['age']<44)].count()
older_Clinton=df[(df['age']>44)&(df['vote']==0)].count()
older_all=df[(df['age']>44)].count()
print('The percentage of vote for Clinton from youngpeople ',round((younger_Clinton['party']/younger_all['party'])*100,4))
print('The percentage of vote for Clinton from oldpeople ',round((older_Clinton['party']/older_all['party'])*100,4))




The percentage of vote for Clinton from youngpeople  59.4828
The percentage of vote for Clinton from oldpeople  56.9264


In [116]:
#People who watch TV news at least 6 days a week or People who watch TV news less than 3 days a week
tv6_Clinton=df[(df['TVnews']>6)&(df['vote']==0)].count()
tv6_all=df[(df['TVnews']>6)].count()
tv3_Clinton=df[(df['TVnews']<3)&(df['vote']==0)].count()
tv3_all=df[(df['TVnews']<3)].count()
print('The percentage of vote for Clinton who watch TV news at least 6 days a week ',round((tv6_Clinton['party']/tv6_all['party'])*100,4))
print('The percentage of vote for Clinton People who watch TV news less than 3 days a week ',round((tv3_Clinton['party']/tv3_all['party'])*100,4))



The percentage of vote for Clinton who watch TV news at least 6 days a week  59.7222
The percentage of vote for Clinton People who watch TV news less than 3 days a week  55.496


In [99]:
#People who live somewhere with a population greater than the average respondent or People who live in
#a place with a population equal to or less than the average respondent
mean1=df['popul'].mean()
gp_Clinton=df[(df['popul']>mean1)&(df['vote']==0)].count()
gp_all=df[(df['popul']>mean1)].count()
lp_Clinton=df[(df['popul']<mean1)&(df['vote']==0)].count()
lp_all=df[(df['popul']<mean1)].count()
print('The percentage of vote for Clinton who lives in above population average ',round((gp_Clinton['party']/gp_all['party'])*100,4))
print('The percentage of vote for Clinton People who lives in below population average ',round((lp_Clinton['party']/lp_all['party'])*100,4))








The percentage of vote for Clinton who lives in above population average  72.3404
The percentage of vote for Clinton People who lives in below population average  55.9153


## 5. Grouping Data Exercise

Use the `groupby()` method to bucket respondents by `age_group`. Which age group is the most conservative? Which watches TV news the least?

Next, calculate 5 percentile groups based on income. Group the dataset by these percentiles. Which income bracket is the most liberal? Which is the most conservative? The oldest? Highest educated? 

In [100]:
# Your code here


k=df.groupby('age_group')['TVnews'].count()
k

age_group
18-24     53
25-34    184
35-44    245
45-54    168
55-64    124
65+      170
Name: TVnews, dtype: int64

In [101]:
# k=df[(df.selfLR>6.0)|(df.ClinLR>6.0)|(df.DoleLR>6.0)]

# df['age_group'].mode()

0    35-44
Name: age_group, dtype: category
Categories (6, object): ['18-24' < '25-34' < '35-44' < '45-54' < '55-64' < '65+']

In [102]:
dd = df.loc[:,['selfLR','ClinLR','DoleLR','age_group']]
dd = dd[(dd['selfLR']>6)|(dd['ClinLR']>6) | (dd['DoleLR']>6)]
dd.shape[0]
print(dd['age_group'].mode())

0    25-34
1    35-44
Name: age_group, dtype: category
Categories (6, object): ['18-24' < '25-34' < '35-44' < '45-54' < '55-64' < '65+']


In [106]:
age_group_count = df.groupby('age_group')['age_group'].count()
age_group_count

age_group
18-24     53
25-34    184
35-44    245
45-54    168
55-64    124
65+      170
Name: age_group, dtype: int64

In [107]:
# # conservative age group
# most_conservative = age_group_count.idxmax()
# most_conservative

'35-44'

In [117]:
#Which watches TV news the least

tvage_group = df.groupby('age_group')['TVnews'].count()

# Finding the age group that watches TV news the least
leasttvage_group = tvage_group.idxmin()
print(tvage_group)
print(leasttvage_group)


age_group
18-24     53
25-34    184
35-44    245
45-54    168
55-64    124
65+      170
Name: TVnews, dtype: int64
18-24


In [145]:
k=df['income'].describe([0.20,0.40,0.60,0.80,0.100])
k

count    944.000000
mean      16.331568
std        5.974781
min        1.000000
10%        6.000000
20%       12.000000
40%       16.000000
50%       17.000000
60%       19.000000
80%       21.000000
max       24.000000
Name: income, dtype: float64

In [155]:
#i_percentiles = df['income'].quantile([df['income'].min()-1, 6, 12,16,17,19,21, df['income'].max()+1])
bin_labels = ["10th percentile","20th percentile", "40th percentile", "60th percentile", "80th percentile","100th percentile"]


# Grouping the dataset based on income percentiles right=False
#df['i_groups']=pd.cut(df['income'], bins=i_percentiles, labels=bin_labels[:-1],right=False)
#df['i_groups'] = pd.cut(df['income'], bins=i_percentiles, labels=bin_labels[:-1])
df['i_groups'] = pd.cut(df['income'], bins=[df['income'].min()-1,6,12,16,17,21, float('inf'), labels=bin_labels)
df

Unnamed: 0,popul,TVnews,selfLR,ClinLR,DoleLR,PID,age,education,income,vote,logpopul,party,age_group,i_groups,income_group
0,0.0,7.0,7.0,1.0,6.0,6.0,36.0,3.0,1.0,1.0,-2.302585,Republican,35-44,10th percentile,
1,190.0,1.0,3.0,3.0,5.0,1.0,20.0,4.0,1.0,0.0,5.247550,Democrat,18-24,10th percentile,
2,31.0,7.0,2.0,2.0,6.0,1.0,24.0,6.0,1.0,0.0,3.437208,Democrat,18-24,10th percentile,
3,83.0,4.0,3.0,4.0,5.0,1.0,28.0,6.0,1.0,0.0,4.420045,Democrat,25-34,10th percentile,
4,640.0,7.0,5.0,6.0,4.0,0.0,68.0,6.0,1.0,0.0,6.461624,Democrat,65+,10th percentile,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
939,0.0,7.0,7.0,1.0,6.0,4.0,73.0,6.0,24.0,1.0,-2.302585,Independent,65+,100th percentile,
940,0.0,7.0,5.0,2.0,6.0,6.0,50.0,6.0,24.0,1.0,-2.302585,Republican,45-54,100th percentile,
941,0.0,3.0,6.0,2.0,7.0,5.0,43.0,6.0,24.0,1.0,-2.302585,Republican,35-44,100th percentile,
942,0.0,6.0,6.0,2.0,5.0,6.0,46.0,7.0,24.0,1.0,-2.302585,Republican,45-54,100th percentile,


In [157]:
#Group the dataset by these percentiles

df.groupby('i_groups')['i_groups'].count()

i_groups
10th percentile      98
20th percentile     111
40th percentile     203
60th percentile      62
80th percentile     302
100th percentile    168
Name: i_groups, dtype: int64

In [171]:
#Liberal==2

dd = df.loc[:,['selfLR','ClinLR','DoleLR','i_groups']]
dd = dd[(dd['selfLR']==2)|(dd['ClinLR']==2) | (dd['DoleLR']==2)]
dd.shape
#print(dd['i_groups'])
#print(dd.groupby('i_groups')['i_groups'].count())
print(dd['i_groups'].mode())

0    80th percentile
Name: i_groups, dtype: category
Categories (6, object): ['10th percentile' < '20th percentile' < '40th percentile' < '60th percentile' < '80th percentile' < '100th percentile']


In [174]:
 #Which is the most conservative
    
dd = df.loc[:,['selfLR','ClinLR','DoleLR','i_groups']]
dd = dd[(dd['selfLR']>6)|(dd['ClinLR']>6) | (dd['DoleLR']>6)]
dd.shape[0]
#print(dd.groupby('i_groups')['i_groups'].count())
print(dd['i_groups'].mode())

0    80th percentile
Name: i_groups, dtype: category
Categories (6, object): ['10th percentile' < '20th percentile' < '40th percentile' < '60th percentile' < '80th percentile' < '100th percentile']


In [187]:
#The oldest

dd = df.loc[:,['age','i_groups']]
dd = dd[(dd['age']==dd['age'].max())]
dd.shape[0]
#print(dd.groupby('i_groups')['i_groups'].count())
print(dd['i_groups'].mode())

#print(df[df['age']==91])




0    10th percentile
1    20th percentile
Name: i_groups, dtype: category
Categories (6, object): ['10th percentile' < '20th percentile' < '40th percentile' < '60th percentile' < '80th percentile' < '100th percentile']


In [179]:
#Highest educated?
dd = df.loc[:,['education','i_groups']]
dd = dd[(dd['education']==7)]
dd.shape[0]
#print(dd.groupby('i_groups')['i_groups'].count())
print(dd['i_groups'].mode())








0     80th percentile
1    100th percentile
Name: i_groups, dtype: category
Categories (6, object): ['10th percentile' < '20th percentile' < '40th percentile' < '60th percentile' < '80th percentile' < '100th percentile']


## 6. Voting Across the Aisle

We are interested in learning more about respondents who's political views differ strongly from the candidate they expect to vote for. Using `selfLR`, `vote`, `ClinLR`, and `DoleLR`, work through the following questions. Your interpretation may differ from the answer key.

*   What is the largest recorded difference between a respondent's political leaning and their impression of their intended candidate's political leaning?
*   How many respondents exhibit a difference of that magnitude? 
*   Make a separate DataFrame called `sway` that only includes voters who exhibit a difference greater than |3|.
*   Among those in `sway`, are respondents more likely to be voting for a candidate more conservative or more liberal than their own political leaning?
*   In `sway`, which candidate is the more popular choice?



In [None]:
# Your code here















# BSD 3-Clause License

*Copyright (c) 2022, UC Berkeley School of Information*

*All rights reserved.*

*Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:*

*1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.*

*2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.*

*3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.*

*THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.*