# Reading and Writing Data

In this assignment you will read and write data. This folder contains three data files, with the extensions `csv`, `json`, and `pkl`:

* data.csv
* data.json
* data.pkl

Each file uses a different format. Run the following **on the command line** to see the first lines of each file:

```sh
head data.csv  # or data.json, or data.pkl
```

You'll see that there is some method to the madness, but each file has its peculiarities. The full dataset consists of 100 records, and each file contains only a portion of it, so you will need to read all of the files and combine them into one standard format with which you are comfortable. Aim for a layout in which every "row" has the same format.

After you've standardized all of the data, answer the following questions, writing each answer to its own csv file named `question_1.csv`, `question_2.csv`, and so on. In addition, show all your work in an IPython notebook.

1. What are the unique countries in the dataset?
2. What are the unique email domains in the dataset?
3. What are the first names of everyone that does not have a P.O. Box address?
4. What are the names of the first 5 people when you sort the data by Country?
5. What are the names of the first 5 people when you sort the data by phone number?

### Restrictions
You should use these standard library imports:

```python
import json
import csv
import pickle
```

Some of you may be familiar with a Python package called `pandas`, which would greatly speed up this sort of file processing. The point of this homework is to do the work manually. You can use `pandas` to check your work independently if you are so inclined. Don't worry if you are not familiar with `pandas`; we will do this homework as a class exercise using `pandas` in the near future.


### Comments

- You may use regular expressions to extract data from each row, but you do not have to if you see no need for them. The Python regular expression module is called `re`.
- You may want to use the `operator` module to help in sorting.
- There are many data structures and formats that you might use to solve this problem.  You will have to decide if you want to keep the information for each person together as one record or all the information for each of the fields together.
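
The last comment above describes two layouts: one record per person, or one collection per field. The following is a minimal sketch of both, using two values copied from `data.csv`; the variable names (`records`, `columns`, `to_columns`, `to_records`) are hypothetical and only illustrate the choice.

```python
# Row-oriented: one dictionary per person.
records = [
    {"Name": "Hillary Benton", "Country": "Togo"},
    {"Name": "Morgan Y. Little", "Country": "Nauru"},
]

# Column-oriented: one list per field, aligned by position.
columns = {
    "Name": ["Hillary Benton", "Morgan Y. Little"],
    "Country": ["Togo", "Nauru"],
}

# Rows -> columns: collect each field across all records.
to_columns = {field: [rec[field] for rec in records] for field in records[0]}

# Columns -> rows: zip the per-field lists back together.
fields = list(columns)
to_records = [
    dict(zip(fields, values))
    for values in zip(*(columns[f] for f in fields))
]

print(to_columns == columns, to_records == records)
```

Row-oriented records tend to be easier to sort and filter per person, while column-oriented data makes it easy to look at every value of one field at once.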

**Hints**

* You can put these files into sensible structures such as lists or dictionaries. The async covers how to do this for csv and json. For pickle, this might help: https://wiki.python.org/moin/UsingPickle

* `.items()` or `.keys()` can be useful for dictionaries.
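
As a rough illustration of these hints, the sketch below pickles a small nested dictionary, loads it back, and walks it with `.keys()` and `.items()`. The `sample` dictionary is a hypothetical excerpt shaped like the contents of `data.pkl`; it works in memory so it can run without the data files.

```python
import pickle

# Hypothetical excerpt: field name -> {record index -> value}.
sample = {
    "Name": {40: "Garrison Lindsey", 41: "Jenna Mercado"},
    "Country": {40: "Zambia", 41: "Burkina Faso"},
}

# Serialize to bytes and load back. pickle.dump / pickle.load do the same
# with a file opened in binary mode ("wb" / "rb").
restored = pickle.loads(pickle.dumps(sample))

# .keys() gives the field names; .items() yields (field, values) pairs.
field_names = list(restored.keys())
for field, values in restored.items():
    print(field, len(values))
```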



In [35]:
import json
import csv
import pickle
import pprint

## Load and Process Data

In [50]:
# Load Data
dataset = []
with open('data.csv', 'r', newline='') as data_csv:
    reader = csv.reader(data_csv)
    dataset.extend(reader)  # each row is a list of strings

# Examine Data
for row in dataset:
    print(row) 

['', 'Name', 'Phone', 'Address', 'City', 'Country', 'Email']
['0', 'Hillary Benton', '1-243-669-7472', '144-1225 In Road', 'Navsari', 'Togo', 'rutrum.magna.Cras@eudolor.edu']
['1', 'Morgan Y. Little', '155-3483', 'Ap #909-6656 Ac St.', 'Kitimat', 'Nauru', 'pede.sagittis.augue@quis.ca']
['2', 'Camden Z. Blair', '123-5058', 'P.O. Box 441, 6183 Ligula St.', 'Casanova Elvo', 'Palestine, State of', 'consectetuer.rhoncus.Nullam@ultrices.org']
['3', 'Alexandra E. Saunders', '1-637-740-7614', '305-496 Morbi Rd.', 'Biggleswade', 'Malawi', 'dui.Fusce@duinecurna.org']
['4', 'Hanae P. Walsh', '901-2461', '7058 Dapibus St.', 'Dhuy', 'Qatar', 'Morbi@tinciduntpedeac.com']
['5', 'Jescie Sargent', '265-1176', '421-5501 Cursus. St.', 'Tulsa', 'Holy See (Vatican City State)', 'non.egestas.a@ullamcorper.co.uk']
['6', 'Kessie Morgan', '945-0713', 'Ap #481-6631 Vehicula Rd.', 'Pedro Aguirre Cerda', 'Bonaire, Sint Eustatius and Saba', 'est@vitaeeratVivamus.net']
['7', 'Bevis M. Santos', '227-9994', 'P.O. Box
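
An alternative to indexing rows by position is `csv.DictReader`, which uses the header row as keys. This is a minimal sketch, not the approach used in this notebook; it reads from an in-memory stand-in for `data.csv` and assumes the file quotes fields that contain commas.

```python
import csv
import io

# In-memory stand-in for data.csv (two rows copied from the output above).
sample_csv = io.StringIO(
    ",Name,Phone,Address,City,Country,Email\n"
    "0,Hillary Benton,1-243-669-7472,144-1225 In Road,Navsari,Togo,"
    "rutrum.magna.Cras@eudolor.edu\n"
    '2,Camden Z. Blair,123-5058,"P.O. Box 441, 6183 Ligula St.",'
    'Casanova Elvo,"Palestine, State of",'
    "consectetuer.rhoncus.Nullam@ultrices.org\n"
)

# Each row becomes a dictionary keyed by the header names.
people = list(csv.DictReader(sample_csv))
print(people[1]["Country"])
```

Note that the first header cell is empty, so the record index ends up under the key `''`.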

In [51]:
# Load Data
with open("data.json") as json_file:
    data_json = json.load(json_file)

# Examine Data
pprint.pprint(data_json)
for k,v in data_json.items():
    print(k, len(v))

{'Address': {'20': '916-8087 Vehicula Rd.',
             '21': '878-2231 Suspendisse Rd.',
             '22': 'P.O. Box 572, 7680 Ullamcorper Ave',
             '23': '563-4105 Donec Avenue',
             '24': '462-2112 In Rd.',
             '25': '420-7327 Facilisis Street',
             '26': '561-7476 Eget St.',
             '27': '1247 Nonummy Rd.',
             '28': 'Ap #603-3303 Libero. St.',
             '29': 'P.O. Box 975, 4593 Ante. Street',
             '30': '3696 Augue Ave',
             '31': 'P.O. Box 365, 6109 Metus. Rd.',
             '32': 'Ap #861-8699 Non Ave',
             '33': '371-7266 Tortor Avenue',
             '34': '4167 Nunc Ave',
             '35': 'Ap #302-2966 Cum Av.',
             '36': 'Ap #275-2917 Curabitur Rd.',
             '37': '6930 Duis Road',
             '38': '1511 Lobortis Ave',
             '39': 'Ap #711-213 Sagittis Avenue'},
 'City': {'20': 'Le Mans',
          '21': 'Wilhelmshaven',
          '22': 'Sangli',
          '23': 'Wabamu


In [52]:
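# Append Data
# Note: the index x is stored as an int here, while the CSV rows and the pickle
# rows below store it as a string; use str(x) if every column must be a string.
for x in range(20,40):
    newline = [x, data_json['Name'][str(x)], data_json['Phone'][str(x)], data_json['Address'][str(x)], 
           data_json['City'][str(x)], data_json['Country'][str(x)], data_json['Email'][str(x)]]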
    print(newline)
    dataset.append(newline)

[20, 'Paul Merrill', '1-313-739-3854', '916-8087 Vehicula Rd.', 'Le Mans', 'Somalia', 'diam.Pellentesque@suscipitest.ca']
[21, 'Brynne S. Barr', '939-4818', '878-2231 Suspendisse Rd.', 'Wilhelmshaven', 'Samoa', 'euismod.et.commodo@nisi.co.uk']
[22, 'Cyrus Buckley', '266-3123', 'P.O. Box 572, 7680 Ullamcorper Ave', 'Sangli', 'Taiwan', 'accumsan.laoreet.ipsum@Quisqueimperdiet.ca']
[23, 'Chloe Burnett', '828-0406', '563-4105 Donec Avenue', 'Wabamun', 'Morocco', 'nec.orci.Donec@Suspendisse.co.uk']
[24, 'Zachery Wilcox', '1-611-756-4723', '462-2112 In Rd.', 'Barddhaman', 'Hong Kong', 'eget@tinciduntaliquamarcu.com']
[25, 'Casey Mcgowan', '1-155-558-4461', '420-7327 Facilisis Street', 'Pfungstadt', 'Iran', 'tellus.faucibus.leo@Sedpharetrafelis.org']
[26, 'Cole X. Hopper', '1-328-505-0545', '561-7476 Eget St.', 'Saint John', 'Macao', 'urna@Lorem.net']
[27, 'Tara Bender', '1-757-378-4079', '1247 Nonummy Rd.', 'Avellino', 'Dominica', 'libero.Integer@ligulaelitpretium.org']
[28, 'Malik Grimes', 
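
The loop above hard-codes the index range and the field list. A more general version might look like the following sketch; `columns_to_rows` is a hypothetical helper, and the two sample dictionaries are small excerpts shaped like `data_json` (string keys) and `data_pkl` (integer keys).

```python
def columns_to_rows(columns, fields):
    """Turn {field: {index: value}} into [index, value, ...] rows."""
    # Sort indexes numerically so the order is stable whether the keys
    # are strings (JSON) or ints (pickle).
    indexes = sorted(columns[fields[0]], key=int)
    # Store the index as a string so it matches the rows read from CSV.
    return [[str(i)] + [columns[f][i] for f in fields] for i in indexes]


json_like = {
    "Name": {"21": "Brynne S. Barr", "20": "Paul Merrill"},
    "Country": {"21": "Samoa", "20": "Somalia"},
}
pkl_like = {
    "Name": {40: "Garrison Lindsey"},
    "Country": {40: "Zambia"},
}

rows = columns_to_rows(json_like, ["Name", "Country"])
rows += columns_to_rows(pkl_like, ["Name", "Country"])
for row in rows:
    print(row)
```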

In [55]:
# Load Data
with open("data.pkl", "rb") as pkl_file:
    data_pkl = pickle.load(pkl_file)

# Examine Data
pprint.pprint(data_pkl)
for k,v in data_pkl.items():
    print(k, len(v))

{'Address': {40: 'P.O. Box 466, 7919 In Av.',
             41: 'P.O. Box 484, 9648 Sit Avenue',
             42: 'P.O. Box 254, 2688 Luctus, Street',
             43: 'Ap #682-9992 Neque Rd.',
             44: '245-8811 Ut St.',
             45: 'P.O. Box 383, 139 A Ave',
             46: '7989 Magna Rd.',
             47: '7312 Tristique St.',
             48: 'P.O. Box 720, 9179 Fermentum Street',
             49: '200-5702 Mollis St.',
             50: 'Ap #221-1593 Fringilla St.',
             51: 'P.O. Box 133, 5382 Enim Ave',
             52: 'Ap #869-5869 Neque Avenue',
             53: '2992 Vitae Rd.',
             54: '6427 Eros Avenue',
             55: 'P.O. Box 133, 6862 Diam Road',
             56: 'P.O. Box 679, 7373 Mollis Ave',
             57: 'P.O. Box 642, 2289 Volutpat. Street',
             58: '221-3908 Pellentesque Av.',
             59: '581-1223 Aliquam Rd.',
             60: '2398 Lectus, Road',
             61: '1274 Nullam St.',
             62: '1229 Nisl.

In [59]:
# Append Data
for x in range(40,100):
    newline = [str(x), data_pkl['Name'][x], data_pkl['Phone'][x], data_pkl['Address'][x], 
           data_pkl['City'][x], data_pkl['Country'][x], data_pkl['Email'][x]]
    print(newline)
    dataset.append(newline)

['40', 'Garrison Lindsey', '420-1477', 'P.O. Box 466, 7919 In Av.', 'Dunbar', 'Zambia', 'ipsum.ac@quam.net']
['41', 'Jenna Mercado', '102-2189', 'P.O. Box 484, 9648 Sit Avenue', 'Pollena Trocchia', 'Burkina Faso', 'Nulla.facilisis.Suspendisse@urnanec.net']
['42', 'Drake Savage', '1-790-105-7695', 'P.O. Box 254, 2688 Luctus, Street', 'Hastings', 'Tunisia', 'quis.accumsan.convallis@fringilla.edu']
['43', 'Rana Z. Colon', '486-7539', 'Ap #682-9992 Neque Rd.', 'Gespeg', 'Canada', 'rutrum.non.hendrerit@etlacinia.com']
['44', 'Melodie Knox', '1-479-861-6093', '245-8811 Ut St.', 'Whitehorse', 'Norway', 'arcu@laciniaorciconsectetuer.ca']
['45', 'Cooper T. Horton', '768-1000', 'P.O. Box 383, 139 A Ave', 'Fernie', 'Israel', 'sit@eros.ca']
['46', 'Eaton Nelson', '746-8562', '7989 Magna Rd.', 'Ludlow', 'Cocos (Keeling) Islands', 'eu@aenim.ca']
['47', 'Lucian W. Lynn', '1-392-783-0634', '7312 Tristique St.', 'Tirrases', 'Western Sahara', 'purus.Duis.elementum@ut.org']
['48', 'Sydney Anderson', '1-6

In [64]:
# Final Checks - Should be 101 rows (header + 100 records)
print(len(dataset))
# Make sure all rows have 7 entries
for row in dataset:
    print(len(row), end=' ')

101
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 
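
The checks above can also be written as assertions, which fail loudly instead of relying on a visual scan of the output. This is a sketch under the assumption that the table has one header row followed by the records; `check_rows` and the sample table are hypothetical.

```python
def check_rows(rows, expected_records, expected_width):
    """Raise AssertionError if the header-plus-records table is malformed."""
    assert len(rows) == expected_records + 1, (
        f"expected {expected_records} records plus a header"
    )
    bad = [row for row in rows if len(row) != expected_width]
    assert not bad, f"{len(bad)} rows do not have {expected_width} fields"
    # Record indexes (first column) should be unique when compared as strings.
    ids = [str(row[0]) for row in rows[1:]]
    assert len(set(ids)) == len(ids), "duplicate record indexes"
    return True


sample = [
    ["", "Name", "Country"],
    ["0", "Hillary Benton", "Togo"],
    [20, "Paul Merrill", "Somalia"],
]
ok = check_rows(sample, expected_records=2, expected_width=3)
print(ok)
```

For the full dataset the call would be `check_rows(dataset, 100, 7)`.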

## 1. What are the unique countries in the dataset?


In [89]:
# Check Variable Positions
print(dataset[0])
print()

# Extract Countries into a Set, which removes duplicates automatically
question_1 = {row[5] for row in dataset[1:]}
print(question_1)
print(len(question_1))

['', 'Name', 'Phone', 'Address', 'City', 'Country', 'Email']

{'Venezuela', 'Nauru', 'Dominica', 'Norway', 'Isle of Man', 'Sint Maarten', 'Mayotte', 'Switzerland', 'Lesotho', 'Bahamas', 'Turks and Caicos Islands', 'Ireland', 'Wallis and Futuna', 'Tajikistan', 'Canada', 'Malawi', 'Macao', 'Niger', 'Iraq', 'Austria', 'Cambodia', 'Zambia', 'Czech Republic', 'Jamaica', 'Slovakia', 'Moldova', 'Taiwan', 'Qatar', 'Kyrgyzstan', 'Uruguay', 'Indonesia', 'Mauritius', 'Antigua and Barbuda', 'Egypt', 'Guyana', 'Western Sahara', 'Guatemala', 'Botswana', 'Uganda', 'Togo', 'Croatia', 'Algeria', 'Latvia', 'Palestine, State of', 'Bonaire, Sint Eustatius and Saba', 'Saint Vincent and The Grenadines', 'Afghanistan', 'Israel', 'Marshall Islands', 'Faroe Islands', 'Italy', 'Turkmenistan', 'Estonia', 'Iran', 'Ghana', 'Morocco', 'Tunisia', 'Cayman Islands', 'Burkina Faso', 'Holy See (Vatican City State)', 'Dominican Republic', 'American Samoa', 'South Georgia and The South Sandwich Islands', 'Northern Mariana
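
A set has no guaranteed order, so the answer may be written in a different order on each run. If a reproducible file is wanted, one option is to sort the unique values first. A minimal sketch with a hypothetical excerpt of the Country column:

```python
# Hypothetical excerpt of the Country column, with duplicates.
countries = ["Togo", "Nauru", "Malawi", "Togo", "Qatar", "Nauru"]

# set() removes duplicates; sorted() then fixes the order.
unique_countries = sorted(set(countries))
print(unique_countries)  # ['Malawi', 'Nauru', 'Qatar', 'Togo']
```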

## 2. What are the unique email domains in the dataset?


In [88]:
# Practice extracting everything after the '@' symbol
print(dataset[1][6])
print(dataset[1][6].index('@'))
print(dataset[1][6][dataset[1][6].index('@') + 1:])
print()

# Extract the domain for each user, and put it into a set to find uniques
question_2 = {row[6][row[6].index('@') + 1:] for row in dataset[1:]}
print(question_2)
print(len(question_2))

rutrum.magna.Cras@eudolor.edu
17
eudolor.edu

{'arcu.org', 'erat.co.uk', 'anteblanditviverra.co.uk', 'convallisdolor.co.uk', 'urnanec.net', 'ametrisus.com', 'miac.edu', 'erat.org', 'egestas.net', 'facilisisvitae.ca', 'aenim.ca', 'consectetuercursuset.org', 'risusaultricies.co.uk', 'Curae.com', 'ornare.org', 'dictumeu.org', 'enim.org', 'tinciduntaliquamarcu.com', 'Namligula.edu', 'Lorem.net', 'quam.net', 'arcuVestibulumante.org', 'tinciduntpedeac.com', 'cursuset.net', 'ullamcorper.co.uk', 'ornarelectusjusto.net', 'nuncinterdum.edu', 'Suspendisse.co.uk', 'purus.ca', 'ut.org', 'Maurismolestie.co.uk', 'Aenean.edu', 'ettristiquepellentesque.ca', 'eudolor.edu', 'maurisipsum.edu', 'vitae.edu', 'Curabiturvellectus.net', 'Aliquamfringilla.com', 'est.ca', 'Duisdignissimtempor.com', 'purus.net', 'etlacinia.com', 'temporaugueac.com', 'suscipitest.ca', 'nullamagna.edu', 'laciniaorciconsectetuer.ca', 'lectusNullam.co.uk', 'cursusinhendrerit.edu', 'duinecurna.org', 'liberoDonec.net', 'dapibus.ca', 'o
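
The slicing above raises `ValueError` if an address has no `@`. A slightly more defensive variant could use `str.rpartition`, as in this sketch. The helper name `email_domain` is hypothetical, and lowercasing the result is an optional extra step that the notebook does not perform (domain names are case-insensitive, so `Lorem.net` and `lorem.net` would otherwise count as two domains).

```python
def email_domain(address):
    """Return the part after the last '@', lowercased; '' if there is no '@'."""
    _, sep, domain = address.rpartition("@")
    return domain.lower() if sep else ""


emails = ["rutrum.magna.Cras@eudolor.edu", "urna@Lorem.net", "no-at-sign"]
domains = [email_domain(e) for e in emails]
print(domains)  # ['eudolor.edu', 'lorem.net', '']
```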

## 3. What are the first names of everyone that does not have a P.O. Box address?


In [96]:
# Because the dataset is small, we can examine it manually to find the different ways that P.O. Boxes are referenced
print(dataset[0])
print([row[3] for row in dataset[1:]])

['', 'Name', 'Phone', 'Address', 'City', 'Country', 'Email']
['144-1225 In Road', 'Ap #909-6656 Ac St.', 'P.O. Box 441, 6183 Ligula St.', '305-496 Morbi Rd.', '7058 Dapibus St.', '421-5501 Cursus. St.', 'Ap #481-6631 Vehicula Rd.', 'P.O. Box 575, 4033 Mi St.', 'Ap #763-5990 Nec, Av.', 'Ap #841-1623 Vitae Avenue', '9269 Libero Ave', 'P.O. Box 677, 2311 Aliquet. Road', '7438 Amet, Rd.', 'P.O. Box 432, 9085 Nulla Ave', '1768 Magna. Road', 'P.O. Box 497, 8354 Habitant St.', '217-9163 Lobortis Road', 'Ap #929-9420 Vivamus Rd.', '910-8300 Varius Rd.', '7458 Sapien. St.', '916-8087 Vehicula Rd.', '878-2231 Suspendisse Rd.', 'P.O. Box 572, 7680 Ullamcorper Ave', '563-4105 Donec Avenue', '462-2112 In Rd.', '420-7327 Facilisis Street', '561-7476 Eget St.', '1247 Nonummy Rd.', 'Ap #603-3303 Libero. St.', 'P.O. Box 975, 4593 Ante. Street', '3696 Augue Ave', 'P.O. Box 365, 6109 Metus. Rd.', 'Ap #861-8699 Non Ave', '371-7266 Tortor Avenue', '4167 Nunc Ave', 'Ap #302-2966 Cum Av.', 'Ap #275-2917 Cura

In [98]:
# Okay. 'P.O. Box' is one match. Let's strip those out and see what else remains
print([row[3] for row in dataset[1:] if 'P.O. Box' not in row[3]])
# It looks like there are no other representations of P.O. Box in the data!

['144-1225 In Road', 'Ap #909-6656 Ac St.', '305-496 Morbi Rd.', '7058 Dapibus St.', '421-5501 Cursus. St.', 'Ap #481-6631 Vehicula Rd.', 'Ap #763-5990 Nec, Av.', 'Ap #841-1623 Vitae Avenue', '9269 Libero Ave', '7438 Amet, Rd.', '1768 Magna. Road', '217-9163 Lobortis Road', 'Ap #929-9420 Vivamus Rd.', '910-8300 Varius Rd.', '7458 Sapien. St.', '916-8087 Vehicula Rd.', '878-2231 Suspendisse Rd.', '563-4105 Donec Avenue', '462-2112 In Rd.', '420-7327 Facilisis Street', '561-7476 Eget St.', '1247 Nonummy Rd.', 'Ap #603-3303 Libero. St.', '3696 Augue Ave', 'Ap #861-8699 Non Ave', '371-7266 Tortor Avenue', '4167 Nunc Ave', 'Ap #302-2966 Cum Av.', 'Ap #275-2917 Curabitur Rd.', '6930 Duis Road', '1511 Lobortis Ave', 'Ap #711-213 Sagittis Avenue', 'Ap #682-9992 Neque Rd.', '245-8811 Ut St.', '7989 Magna Rd.', '7312 Tristique St.', '200-5702 Mollis St.', 'Ap #221-1593 Fringilla St.', 'Ap #869-5869 Neque Avenue', '2992 Vitae Rd.', '6427 Eros Avenue', '221-3908 Pellentesque Av.', '581-1223 Aliqua

In [102]:
# Now we want the first names.  We can actually do this in the same comprehension, but keep row[1] instead of row[3]
# Then, keep only the first word in the name
question_3 = [row[1].split(' ')[0] for row in dataset[1:] if 'P.O. Box' not in row[3]]
print(question_3)

['Hillary', 'Morgan', 'Alexandra', 'Hanae', 'Jescie', 'Kessie', 'Flynn', 'Charles', 'Cairo', 'Thane', 'Genevieve', 'Tatyana', 'Meredith', 'Rajah', 'Gabriel', 'Paul', 'Brynne', 'Chloe', 'Zachery', 'Casey', 'Cole', 'Tara', 'Malik', 'Colby', 'Cameron', 'Gail', 'Harding', 'Idona', 'Warren', 'Clayton', 'Alana', 'Mason', 'Rana', 'Melodie', 'Eaton', 'Lucian', 'Jane', 'Yen', 'Freya', 'Rama', 'Lawrence', 'Cherokee', 'Michael', 'Kay', 'Arden', 'Chantale', 'Calvin', 'Walter', 'Berk', 'Timothy', 'Ariana', 'Mason', 'Keane', 'Maggy', 'Talon', 'Devin', 'Orli', 'Wing', 'Inez', 'Kyle', 'Selma', 'Gwendolyn', 'Gary', 'Drake', 'Blossom', 'Joan', 'Buffy', 'Walker', 'Blake', 'Yardley', 'Lenore', 'Edan', 'Quintessa', 'Reuben', 'Yoshio', 'Rebecca', 'Shana', 'Adara']
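
In this dataset the literal string `'P.O. Box'` appears to be the only spelling, so a substring test is enough. If other spellings were possible, a regular expression could cover them. The sketch below assumes variants such as `PO Box` or `P. O. Box`, which do not occur in the data shown here; the pattern name `PO_BOX` is hypothetical.

```python
import re

# Assumed variants: "P.O. Box", "PO Box", "P. O. Box", in any letter case.
PO_BOX = re.compile(r"\bP\.?\s*O\.?\s*Box\b", re.IGNORECASE)

addresses = [
    "P.O. Box 441, 6183 Ligula St.",
    "Ap #909-6656 Ac St.",
    "PO Box 12",
    "7058 Dapibus St.",
]

# Keep only the addresses that do not mention a P.O. Box.
without_po_box = [a for a in addresses if not PO_BOX.search(a)]
print(without_po_box)  # ['Ap #909-6656 Ac St.', '7058 Dapibus St.']
```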


## 4. What are the names of the first 5 people when you sort the data by Country?

In [109]:
# Here's a printing function just to make things nicer
def nice_print(data, rows):
    for row in data[:rows]:
        print(row)
        
nice_print(dataset, 6)

['', 'Name', 'Phone', 'Address', 'City', 'Country', 'Email']
['0', 'Hillary Benton', '1-243-669-7472', '144-1225 In Road', 'Navsari', 'Togo', 'rutrum.magna.Cras@eudolor.edu']
['1', 'Morgan Y. Little', '155-3483', 'Ap #909-6656 Ac St.', 'Kitimat', 'Nauru', 'pede.sagittis.augue@quis.ca']
['2', 'Camden Z. Blair', '123-5058', 'P.O. Box 441, 6183 Ligula St.', 'Casanova Elvo', 'Palestine, State of', 'consectetuer.rhoncus.Nullam@ultrices.org']
['3', 'Alexandra E. Saunders', '1-637-740-7614', '305-496 Morbi Rd.', 'Biggleswade', 'Malawi', 'dui.Fusce@duinecurna.org']
['4', 'Hanae P. Walsh', '901-2461', '7058 Dapibus St.', 'Dhuy', 'Qatar', 'Morbi@tinciduntpedeac.com']


In [127]:
# Skip the header row, then sort by the Country column (index 5)
sort_dataset = sorted(dataset[1:], key=lambda row: row[5])
nice_print(sort_dataset, 5)

question_4 = [row[1] for row in sort_dataset[:5]]
print(question_4)

['18', 'Rajah Carrillo', '1-576-789-5730', '910-8300 Varius Rd.', 'Bertiolo', 'Afghanistan', 'sapien.gravida.non@cursuset.net']
['79', 'Gwendolyn Crosby', '692-9172', '997 Posuere Rd.', 'San Miguel', 'Albania', 'molestie.orci.tincidunt@feugiat.org']
['92', 'Edan Cortez', '1-223-433-5209', '159-6608 Eu, St.', 'High Level', 'Algeria', 'Mauris.vestibulum.neque@odio.ca']
['81', 'Knox L. Cash', '535-9704', 'P.O. Box 469, 4278 Condimentum Rd.', 'Gönen', 'American Samoa', 'nunc@erat.co.uk']
['12', 'Thane Burch', '1-894-978-3696', '7438 Amet, Rd.', 'Algeciras', 'Anguilla', 'lobortis.quis.pede@Namligula.edu']
['Rajah Carrillo', 'Gwendolyn Crosby', 'Edan Cortez', 'Knox L. Cash', 'Thane Burch']
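
The assignment comments suggest the `operator` module for sorting. As a sketch, `operator.itemgetter` can replace the lambda, and passing two positions gives a tie-breaker (here, name) for people from the same country. The three rows are copied from the dataset; the variable names are hypothetical.

```python
from operator import itemgetter

rows = [
    ["0", "Hillary Benton", "1-243-669-7472", "144-1225 In Road",
     "Navsari", "Togo", "rutrum.magna.Cras@eudolor.edu"],
    ["1", "Morgan Y. Little", "155-3483", "Ap #909-6656 Ac St.",
     "Kitimat", "Nauru", "pede.sagittis.augue@quis.ca"],
    ["3", "Alexandra E. Saunders", "1-637-740-7614", "305-496 Morbi Rd.",
     "Biggleswade", "Malawi", "dui.Fusce@duinecurna.org"],
]

# Sort by Country (index 5), then by Name (index 1) to break ties.
by_country = sorted(rows, key=itemgetter(5, 1))
names = [row[1] for row in by_country]
print(names)
```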



## 5. What are the names of the first 5 people when you sort the data by phone number?

In [129]:
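# Skip the header row, then sort by the Phone column (index 2).
# Phone numbers are compared as strings, so '1-120-...' sorts before '123-...'.
sort_dataset = sorted(dataset[1:], key=lambda row: row[2])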
nice_print(sort_dataset, 5)

question_5 = [row[1] for row in sort_dataset[:5]]
print(question_5)

['16', 'Tatyana H. French', '1-120-782-6047', '217-9163 Lobortis Road', 'Salles', 'Eritrea', 'Curabitur@magna.com']
['49', 'Jane Joyner', '1-131-574-3183', '200-5702 Mollis St.', 'HavrŽ', 'Austria', 'magna.sed.dui@diamPellentesquehabitant.com']
['73', 'Devin L. Boone', '1-132-242-8605', '1488 Dignissim Ave', 'Teruel', 'Switzerland', 'nec.tempus.scelerisque@ettristiquepellentesque.ca']
['89', 'Naida Guthrie', '1-138-699-9182', 'P.O. Box 656, 5397 Gravida. Ave', 'Tulita', 'Sudan', 'sit.amet@tinciduntvehicularisus.org']
[25, 'Casey Mcgowan', '1-155-558-4461', '420-7327 Facilisis Street', 'Pfungstadt', 'Iran', 'tellus.faucibus.leo@Sedpharetrafelis.org']
['Tatyana H. French', 'Jane Joyner', 'Devin L. Boone', 'Naida Guthrie', 'Casey Mcgowan']
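
The sort above compares the phone numbers as raw strings, so the hyphens take part in the comparison. Whether that is the intended reading of "sort by phone number" is a judgment call. As one possible alternative, this sketch sorts on the digits only; `phone_key` is a hypothetical helper, and the keys are still compared as strings, not as numbers.

```python
import re


def phone_key(phone):
    """Sort key that ignores punctuation by keeping only the digits."""
    return re.sub(r"\D", "", phone)


phones = ["155-3483", "1-243-669-7472", "123-5058", "1-120-782-6047"]

as_strings = sorted(phones)                 # hyphens affect the order
by_digits = sorted(phones, key=phone_key)   # only digits affect the order

print(as_strings)
print(by_digits)
```

With these four numbers the two orderings differ: `'123-5058'` moves ahead of `'1-243-669-7472'` once the hyphens are ignored.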


## Output Results to CSV

In [133]:
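def write_answer(file_name, solution):
    # newline='' lets the csv module control line endings (avoids blank lines on Windows)
    with open(file_name, 'w', newline='') as writefile:
        writer = csv.writer(writefile)
        writer.writerow(solution)  # the whole answer goes on a single row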

In [135]:
write_answer('question_1.csv', question_1)
write_answer('question_2.csv', question_2)
write_answer('question_3.csv', question_3)
write_answer('question_4.csv', question_4)
write_answer('question_5.csv', question_5)
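
`write_answer` puts every value on one row. If one value per row is preferred, a variant might look like this sketch, which also reads the file back to confirm the round trip. The helper names `write_column` and `read_column` and the temporary file name are hypothetical.

```python
import csv
import os
import tempfile


def write_column(file_name, values):
    """Write one value per row."""
    with open(file_name, "w", newline="") as handle:
        csv.writer(handle).writerows([value] for value in values)


def read_column(file_name):
    """Read the first field of every row back into a list."""
    with open(file_name, newline="") as handle:
        return [row[0] for row in csv.reader(handle)]


# Round trip through a temporary file so nothing in the folder is overwritten.
path = os.path.join(tempfile.mkdtemp(), "question_4_example.csv")
names = ["Rajah Carrillo", "Gwendolyn Crosby", "Palestine, State of"]
write_column(path, names)
read_back = read_column(path)
print(read_back == names)
```

The last sample value contains a comma; the csv module quotes it on writing and restores it on reading.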