# Data Manipulation with `Python` Exercises

Welcome to one of your first exercise notebooks. 
So what should you expect from these notebooks? 
Well, we will be touching on the concepts and code that we ran through in the previous labs and practices, 
except the majority of the coding will now be done by you. 
The questions we ask will look familiar, although your code may throw a few more errors. 
**We have not seen some of these errors yet, and this is meant to challenge you.** 
Learning to resolve new issues and building a problem-solving vocabulary for internet research are critical to your development as a data scientist.

In these notebooks, we will ask you to write and execute your own code for questions that look similar to what we have covered in the Labs and Practices. However, Exercises will often be a bit more challenging in that 1) you may be working with a new data set that you will have to familiarize yourself with, and 2) you will be asked to write code for problems you have not yet seen.


## Read in the Data

We will be using a different data set for this exercise. It lists every member of the U.S. Congress from January 1947 to February 2014, along with some information about each of them.

Go ahead and read in the `congress-terms.csv` file in the `all_datasets/` directory. Pay particular attention to the encoding. Run the following cell...

In [2]:
import pandas as pd

with open('/dsa/data/all_datasets/congress-terms.csv', 'r', encoding = 'ISO-8859-1' ) as file:
    data = file.read()

    data_lists = data.split("\n")

    list_of_lists = []
    for line in data_lists:
        row = line.split(',')
        list_of_lists.append(row)

    # print the first 11 lists (the header plus 10 rows) to get an idea of what the data looks like
    for row in list_of_lists[0:11]:
        print(' ,'.join(row))

congress ,chamber ,bioguide ,firstname ,middlename ,lastname ,suffix ,birthday ,state ,party ,incumbent ,termstart ,age
80 ,house ,M000112 ,Joseph ,Jefferson ,Mansfield , ,1861-02-09 ,TX ,D ,Yes ,1/3/47 ,85.9
80 ,house ,D000448 ,Robert ,Lee ,Doughton , ,1863-11-07 ,NC ,D ,Yes ,1/3/47 ,83.2
80 ,house ,S000001 ,Adolph ,Joachim ,Sabath , ,1866-04-04 ,IL ,D ,Yes ,1/3/47 ,80.7
80 ,house ,E000023 ,Charles ,Aubrey ,Eaton , ,1868-03-29 ,NJ ,R ,Yes ,1/3/47 ,78.8
80 ,house ,L000296 ,William , ,Lewis , ,1868-09-22 ,KY ,R ,No ,1/3/47 ,78.3
80 ,house ,G000017 ,James ,A. ,Gallagher , ,1869-01-16 ,PA ,R ,No ,1/3/47 ,78
80 ,house ,W000265 ,Richard ,Joseph ,Welch , ,1869-02-13 ,CA ,R ,Yes ,1/3/47 ,77.9
80 ,house ,B000565 ,Sol , ,Bloom , ,1870-03-09 ,NY ,D ,Yes ,1/3/47 ,76.8
80 ,house ,H000943 ,Merlin , ,Hull , ,1870-12-18 ,WI ,R ,Yes ,1/3/47 ,76
80 ,house ,G000169 ,Charles ,Laceille ,Gifford , ,1871-03-15 ,MA ,R ,Yes ,1/3/47 ,75.8


In [1]:
import pandas as pd

with open('/dsa/data/all_datasets/congress-terms.csv', 'r') as file:
    data = file.read()

    data_lists = data.split("\n")

    list_of_lists = []
    for line in data_lists:
        row = line.split(',')
        list_of_lists.append(row)

    # print the first 11 lists (the header plus 10 rows) to get an idea of what the data looks like
    for row in list_of_lists[0:11]:
        print(' ,'.join(row))

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 22701: invalid continuation byte

**Question 1**: You will notice something a little different about reading in this file, particularly the `encoding` parameter. Do a bit of research on what encoding is. What happens when you remove this parameter altogether? Do your best to describe any errors that are thrown.
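As a starting point for that research, here is a minimal sketch of why the choice of encoding matters. The sample bytes are hypothetical and are not taken from the data set. When `open()` is called without `encoding`, Python falls back to the platform's default encoding, which was UTF-8 in the run that produced the error above.

```python
# Hypothetical sample bytes (not from congress-terms.csv).
# 0xe9 is "é" in ISO-8859-1, but it is not a valid byte sequence in UTF-8 here.
raw = b"caf\xe9s"

# ISO-8859-1 maps every byte to exactly one character, so decoding always succeeds.
latin1_text = raw.decode("ISO-8859-1")
print(latin1_text)

# UTF-8 treats 0xe9 as the start of a multi-byte character and expects
# continuation bytes after it, so decoding these bytes raises an error.
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as err:
    utf8_failed = True
    print("UTF-8 decoding failed:", err.reason)
```

The same thing happens at file level: the bytes on disk do not change, only the rule used to turn them into characters.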

**Question 2**: In the `list_of_lists` variable, the last item of each list is the `age` of the member of Congress. This is currently a string. Without using any packages, create a list that contains all of the `age` values stored as floats.

In [8]:
# Execute your code for question 2 here
# -------------------------------------
age_list = []
for row in list_of_lists[1:]:
    age_list.append(float(row[12]))

print(age_list)

[85.9, 83.2, 80.7, 78.8, 78.3, 78.0, 77.9, 76.8, 76.0, 75.8, 74.7, 74.0, 73.5, 73.0, 72.6, 72.5, 72.4, 72.0, 72.0, 71.8, 71.7, 71.6, 71.3, 71.3, 70.8, 70.8, 70.3, 70.1, 69.6, 69.5, 69.4, 68.7, 68.7, 68.4, 68.4, 68.2, 67.7, 67.6, 67.3, 66.8, 66.7, 66.7, 66.6, 66.5, 66.4, 66.4, 66.2, 66.2, 66.2, 66.2, 66.0, 66.0, 65.9, 65.9, 65.8, 65.2, 65.0, 64.8, 64.8, 64.7, 64.7, 64.5, 64.3, 63.9, 63.7, 63.7, 63.5, 63.5, 63.1, 63.0, 63.0, 62.8, 62.8, 62.7, 62.5, 62.4, 62.3, 62.2, 62.0, 62.0, 62.0, 62.0, 61.9, 61.9, 61.9, 61.8, 61.7, 61.7, 61.6, 61.4, 61.3, 61.0, 61.0, 60.9, 60.7, 60.6, 60.4, 60.4, 60.3, 60.2, 60.1, 60.1, 59.9, 59.6, 59.5, 59.5, 59.4, 59.3, 59.3, 59.3, 59.2, 59.2, 59.0, 58.8, 58.7, 58.7, 58.7, 58.5, 58.4, 58.4, 58.3, 58.2, 58.2, 58.1, 57.6, 57.5, 57.5, 57.4, 57.3, 57.3, 57.3, 57.2, 57.2, 57.1, 57.1, 56.9, 56.6, 56.6, 56.6, 56.5, 56.5, 56.5, 56.5, 56.4, 56.3, 56.2, 56.0, 56.0, 55.9, 55.8, 55.8, 55.7, 55.6, 55.6, 55.5, 55.4, 55.3, 55.3, 55.1, 55.0, 55.0, 54.9, 54.9, 54.9, 54.7, 54.6, 54.
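One thing to watch for with this approach: if a file ends with a newline, `split("\n")` leaves an empty final row, and indexing into it raises an `IndexError`. Whether `congress-terms.csv` ends that way is not shown here, so the sketch below uses hypothetical rows in the same shape as `list_of_lists` and simply skips blank values.

```python
# Hypothetical rows shaped like list_of_lists: a header, data rows, and a blank last row.
sample_rows = [
    ["congress", "chamber", "age"],
    ["80", "house", "85.9"],
    ["80", "house", "78"],
    [""],  # what split("\n") leaves behind when the text ends with a newline
]

ages = []
for row in sample_rows[1:]:      # skip the header row
    value = row[-1].strip()      # the age is the last item in each row
    if value:                    # ignore blank rows instead of crashing
        ages.append(float(value))

print(ages)
```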

**Question 3**: Now go ahead and read in the file with `pandas`, and save the data frame to a variable called `df`.

In [11]:
# Execute your code for question 3 here
# -------------------------------------
with open('/dsa/data/all_datasets/congress-terms.csv', 'r', encoding = 'ISO-8859-1') as file:
    df = pd.read_csv(file)

df.head()

,congress,chamber,bioguide,firstname,middlename,lastname,suffix,birthday,state,party,incumbent,termstart,age
0,80,house,M000112,Joseph,Jefferson,Mansfield,,1861-02-09,TX,D,Yes,1/3/47,85.9
1,80,house,D000448,Robert,Lee,Doughton,,1863-11-07,NC,D,Yes,1/3/47,83.2
2,80,house,S000001,Adolph,Joachim,Sabath,,1866-04-04,IL,D,Yes,1/3/47,80.7
3,80,house,E000023,Charles,Aubrey,Eaton,,1868-03-29,NJ,R,Yes,1/3/47,78.8
4,80,house,L000296,William,,Lewis,,1868-09-22,KY,R,No,1/3/47,78.3
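As an alternative sketch, `pd.read_csv` accepts an `encoding` argument directly, so the explicit `open()` call is optional. The example below uses a small in-memory sample (hypothetical names and values, not the real file) so that it runs anywhere. With the real data you would pass the file path instead of the buffer.

```python
import io

import pandas as pd

# Hypothetical two-row sample encoded as ISO-8859-1; this is not the real file.
sample_bytes = "firstname,lastname,age\nZoë,Sample,50.5\nSol,Example,61.0\n".encode("ISO-8859-1")

# read_csv decodes the bytes itself when told which encoding to use.
sample_df = pd.read_csv(io.BytesIO(sample_bytes), encoding="ISO-8859-1")

print(sample_df.head())
```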


**Question 4**: Find a method to print the column headers of the data frame `df`.

In [14]:
# Execute your code for question 4 here
# -------------------------------------
df.columns


Index(['congress', 'chamber', 'bioguide', 'firstname', 'middlename',
       'lastname', 'suffix', 'birthday', 'state', 'party', 'incumbent',
       'termstart', 'age'],
      dtype='object')
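If a plain Python list is easier to work with than an `Index` object, there are a couple of other ways to get the headers. This is a small sketch on a hypothetical data frame, not the congress data.

```python
import pandas as pd

# Hypothetical data frame used only to illustrate the calls.
sample_df = pd.DataFrame(
    {"congress": [80, 81], "chamber": ["house", "senate"], "age": [50.5, 61.0]}
)

# Convert the Index of column labels to a regular list.
column_names = sample_df.columns.tolist()
print(column_names)

# Iterating over a data frame yields its column labels, so list() works too.
print(list(sample_df))
```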

**Question 5**: Congresses are numbered. Notice that there is a column devoted to the Congress number. This column is conveniently called `congress`. Create a subset of the data frame that contains the 80th Congress only, and call this subset `congress80`. 

In [16]:
# Execute your code for question 5 here
# -------------------------------------
congress80 = df[df['congress'] == 80]
congress80.tail()

,congress,chamber,bioguide,firstname,middlename,lastname,suffix,birthday,state,party,incumbent,termstart,age
550,80,senate,C000021,Harry,Pulliam,Cain,,1/10/06,WA,R,No,1/3/47,41.0
551,80,senate,K000292,William,Fife,Knowland,,6/26/08,CA,R,Yes,1/3/47,38.5
552,80,senate,J000093,William,Ezra,Jenner,,7/21/08,IN,R,No,1/3/47,38.5
553,80,senate,M000315,Joseph,Raymond,McCarthy,,11/14/08,WI,R,No,1/3/47,38.1
554,80,senate,L000428,Russell,Billiu,Long,,11/3/18,LA,D,Yes,1/3/47,28.2
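It can help to look at the filter in two steps. The comparison produces a boolean Series (a mask), and indexing with that mask keeps only the rows where it is `True`. The sketch below uses a hypothetical three-row data frame. The `.copy()` call is an optional extra that makes the subset independent of the original.

```python
import pandas as pd

# Hypothetical data frame used only to illustrate boolean filtering.
sample_df = pd.DataFrame(
    {"congress": [80, 80, 81], "chamber": ["house", "senate", "house"]}
)

# Step 1: compare every value in the column, giving one True/False per row.
mask = sample_df["congress"] == 80
print(mask.tolist())

# Step 2: keep only the rows where the mask is True.
subset = sample_df[mask].copy()
print(subset)
```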


**Question 6**: Now, from this `congress80` subset, use a method that will count the rows for House members and then again for Senate members.

In [25]:
# Execute your code for question 6 here
# -------------------------------------
print("House:", len(congress80[congress80['chamber'] == 'house']))
print("Senate:", len(congress80[congress80['chamber'] == 'senate']))


House: 453
Senate: 102
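Another option is `value_counts()`, which counts every distinct value in a column in one call. Here is a sketch on a hypothetical data frame. Applied to `congress80['chamber']`, it should report the same totals as the two `len()` calls.

```python
import pandas as pd

# Hypothetical data frame used only to illustrate value_counts().
sample_df = pd.DataFrame({"chamber": ["house", "house", "senate"]})

# One count per distinct value, returned as a Series indexed by that value.
counts = sample_df["chamber"].value_counts()

print("House:", counts["house"])
print("Senate:", counts["senate"])
```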


# Save your notebook, then `File > Close and Halt`