# Working with CSV data files

Tabular data is often saved in a CSV-like format. The delimiter is usually a comma, but it might be a tab or some other character. By convention, CSV files have a `.csv` extension (tab-delimited files often have a `.tsv` extension). CSV is such a common format that Python has a built-in `csv` module. In addition, other Python libraries designed for data analysis (such as **pandas** or **NumPy**) often have their own functions for working with CSV files. In some cases, these functions are simply "wrappers" around Python's built-in CSV-related functions.

There isn't an actual standard format for CSV files. It's more accurate to say that there are various dialects of CSV.



## Read and write `kc_house_data.csv` with pandas

We will start with a file of housing price data available from Kaggle at https://www.kaggle.com/datasets/harlfoxem/housesalesprediction. The file is named `kc_house_data.csv` and it's available in the `data` folder. 

Here's a little snippet of the file opened in a text editor.

In [8]:
from IPython.display import Image
Image(filename='images/kc_house_data_snippet.png')


The file is pretty typical for CSV files in that:

- it contains a header line to be used as column names
- each row contains a number of values separated by commas (CSV = comma separated values)
- some of the values in each row are numeric while others are "text" and are enclosed in double quotes
- the text columns are intended to be read into some non-numeric data type such as a string

A few other things to note:

- the `id` column, while it looks numeric, isn't something we are going to do math with. It's just an identifier, and the double quotes around it reinforce the fact that we should interpret it as a label of some sort.
- the `date` column is also non-numeric, with a `'T'` separating the date and time. Depending on what we are using this data for, we may have to convert this datetime-like string into a true Python datetime value.
- the zip code is like the `id` column: it looks numeric but it's really a label



Often we just want to read a CSV file into a pandas dataframe. For this we can use the pandas `read_csv()` function. It has a huge number of input arguments for customizing how you want the file read. Let's start with everything at the default values.

In [12]:
import pandas as pd

In [14]:
kc1 = pd.read_csv('data/kc_house_data.csv')
kc1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  int64  
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  int64  
 17  lat            21613 non-null  float64
 18  long  

In [16]:
kc1.head()

,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,...,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,7129300520,20141013T000000,221900.0,3,1.0,1180,5650,1.0,0,0,...,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,6414100192,20141209T000000,538000.0,3,2.25,2570,7242,2.0,0,0,...,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,5631500400,20150225T000000,180000.0,2,1.0,770,10000,1.0,0,0,...,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,2487200875,20141209T000000,604000.0,4,3.0,1960,5000,1.0,0,0,...,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,1954400510,20150218T000000,510000.0,3,2.0,1680,8080,1.0,0,0,...,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


It seems that:

- pandas ignored the quotation marks and made its best guess as to the appropriate data type for those columns.
- `id` ended up as `int64` while `floors` is a `float64` (some "split-level" houses might have 1.5 floors)
- the `date` column got interpreted as a string `object`

Before pandas 1.0, all strings got stored as an `object`, but now there is a `StringDtype` which is recommended for string columns. See https://pandas.pydata.org/docs/user_guide/text.html#text-types.

While we could fix up some of these datatypes once the CSV file has been read into a `DataFrame`, let's see what we can do with `read_csv` arguments.

In [18]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols:

Wow, there are a ton of options for customizing the behavior of `read_csv()` to deal with many common problems associated with ingesting text files:

- delimiter isn't always a comma
- skipping some lines at top of the file
- parsing dates
- handling commas used as a thousands separator or decimal point
- only importing a subset of the columns and/or rows
- dealing with missing data
- modifying data types
- transforming values during import
- dealing with quotation marks with string fields
- dealing with bad lines
- ... and many more. 
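To make a few of these options concrete, here is a minimal sketch that combines several of them. The text being parsed is made up for illustration (it is not one of the files used in this notebook) and is read from an in-memory string with `io.StringIO` so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited export: a title line above the header,
# European-style numbers (dot = thousands separator, comma = decimal point),
# and the word "missing" used as a missing-value marker.
raw_text = (
    "Exported sales report\n"
    "region;units;revenue\n"
    "north;1.200;3.450,75\n"
    "south;missing;980,10\n"
    "west;300;missing\n"
)

sales = pd.read_csv(
    io.StringIO(raw_text),
    sep=";",                # the delimiter isn't a comma
    skiprows=1,             # skip the title line at the top of the file
    thousands=".",          # 1.200 is read as 1200
    decimal=",",            # 3.450,75 is read as 3450.75
    na_values=["missing"],  # treat this marker as missing data
)

print(sales)
print(sales.dtypes)
```

Because each numeric column contains a missing value, pandas stores both `units` and `revenue` as floats.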

For the `id` column, we could move it into the index by using the `index_col=0` option.

In [21]:
kc2a = pd.read_csv('data/kc_house_data.csv', index_col=0)
kc2a.info()

<class 'pandas.core.frame.DataFrame'>
Index: 21613 entries, 7129300520 to 1523300157
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           21613 non-null  object 
 1   price          21613 non-null  float64
 2   bedrooms       21613 non-null  int64  
 3   bathrooms      21613 non-null  float64
 4   sqft_living    21613 non-null  int64  
 5   sqft_lot       21613 non-null  int64  
 6   floors         21613 non-null  float64
 7   waterfront     21613 non-null  int64  
 8   view           21613 non-null  int64  
 9   condition      21613 non-null  int64  
 10  grade          21613 non-null  int64  
 11  sqft_above     21613 non-null  int64  
 12  sqft_basement  21613 non-null  int64  
 13  yr_built       21613 non-null  int64  
 14  yr_renovated   21613 non-null  int64  
 15  zipcode        21613 non-null  int64  
 16  lat            21613 non-null  float64
 17  long           21613 non-null  float64
 1

If we don't want `id` to be the index, we could specify its datatype using the `dtype` argument. We can pass in a dictionary that specifies the column name and its intended datatype. While we are at it, we can make `zipcode` a string as well.

In [24]:
kc2b = pd.read_csv('data/kc_house_data.csv', dtype={'id': 'string', 'zipcode': 'string'})
kc2b.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 21 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   id             21613 non-null  string 
 1   date           21613 non-null  object 
 2   price          21613 non-null  float64
 3   bedrooms       21613 non-null  int64  
 4   bathrooms      21613 non-null  float64
 5   sqft_living    21613 non-null  int64  
 6   sqft_lot       21613 non-null  int64  
 7   floors         21613 non-null  float64
 8   waterfront     21613 non-null  int64  
 9   view           21613 non-null  int64  
 10  condition      21613 non-null  int64  
 11  grade          21613 non-null  int64  
 12  sqft_above     21613 non-null  int64  
 13  sqft_basement  21613 non-null  int64  
 14  yr_built       21613 non-null  int64  
 15  yr_renovated   21613 non-null  int64  
 16  zipcode        21613 non-null  string 
 17  lat            21613 non-null  float64
 18  long  
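One practical reason to read label-like columns as strings is that numeric parsing silently drops leading zeros. The sketch below uses made-up rows (the `id` and `zipcode` values are invented so that they start with zeros) to compare the default behavior with the `dtype` approach.

```python
import io

import pandas as pd

# Made-up rows; the id and zipcode values start with 0 on purpose.
raw_text = "id,zipcode,price\n0007,02134,350000\n0042,98178,221900\n"

# Default type inference treats both label columns as integers.
as_numbers = pd.read_csv(io.StringIO(raw_text))

# Asking for strings keeps the labels exactly as written in the file.
as_labels = pd.read_csv(
    io.StringIO(raw_text), dtype={"id": "string", "zipcode": "string"}
)

print(as_numbers["zipcode"].tolist())  # the leading zero is lost
print(as_labels["zipcode"].tolist())   # the text is preserved
```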

We can use the `parse_dates` argument to convert `date` from a string to an actual pandas `datetime64` column. If pandas can't parse the dates properly with its default parsing, we can give it more help: older pandas versions accept a custom parser function through the `date_parser` argument, while pandas 2.0 and later deprecate that argument in favor of an explicit `date_format` string.

The `usecols` argument lets us specify a subset of the columns in the file to read into the dataframe.

In [27]:
kc3 = pd.read_csv('data/kc_house_data.csv', 
                  dtype={'id': 'string', 'zipcode': 'string'}, 
                  parse_dates=['date'],
                  usecols=['id', 'date', 'price', 'sqft_living', 'zipcode'])
kc3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   id           21613 non-null  string        
 1   date         21613 non-null  datetime64[ns]
 2   price        21613 non-null  float64       
 3   sqft_living  21613 non-null  int64         
 4   zipcode      21613 non-null  string        
dtypes: datetime64[ns](1), float64(1), int64(1), string(2)
memory usage: 844.4 KB
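If `parse_dates` ever guesses wrong, one fallback that works across pandas versions is to read the column as text and convert it afterwards with `pd.to_datetime` and an explicit format string. This is only a sketch of that alternative, using two rows copied from the `head()` output above and an in-memory string in place of the real file.

```python
import io

import pandas as pd

# Two rows taken from the kc_house_data.csv preview shown earlier.
raw_text = (
    "id,date,price\n"
    "7129300520,20141013T000000,221900.0\n"
    "6414100192,20141209T000000,538000.0\n"
)

homes = pd.read_csv(io.StringIO(raw_text), dtype={"id": "string"})

# Spell out the layout: 4-digit year, month, day, a literal 'T', then HHMMSS.
homes["date"] = pd.to_datetime(homes["date"], format="%Y%m%dT%H%M%S")

print(homes)
```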


In [29]:
kc3.head()

,id,date,price,sqft_living,zipcode
0,7129300520,2014-10-13,221900.0,1180,98178
1,6414100192,2014-12-09,538000.0,2570,98125
2,5631500400,2015-02-25,180000.0,770,98028
3,2487200875,2014-12-09,604000.0,1960,98136
4,1954400510,2015-02-18,510000.0,1680,98074


Now let's write out `kc3` to a new CSV file using `to_csv`, which is a `DataFrame` method.

In [32]:
pd.DataFrame.to_csv?

Signature:
pd.DataFrame.to_csv(
    self,
    path_or_buf: 'FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None' = None,
    *,
    sep: 'str' = ',',
    na_rep: 'str' = '',
    float_format: 'str | Callable | None' = None,
    columns: 'Sequence[Hashable] | None' = None,
    header: 'bool_t | list[str]' = True,
    index: 'bool_t' = True,

We'll start with the default behavior, which is to include the dataframe index in the resulting CSV file.

In [35]:
kc3.to_csv('data/kc3.csv')

What does the result look like? The default integer index gets included, but there is no corresponding column name in the header row.

In [38]:
with open('data/kc3.csv') as in_kc3:
    kc3_lines = in_kc3.readlines()

for line in kc3_lines[:5]:
    print(line, end='')

,id,date,price,sqft_living,zipcode
0,7129300520,2014-10-13,221900.0,1180,98178
1,6414100192,2014-12-09,538000.0,2570,98125
2,5631500400,2015-02-25,180000.0,770,98028
3,2487200875,2014-12-09,604000.0,1960,98136


This time we pass `index=False` so that the index won't get included.

In [41]:
kc3.to_csv('data/kc3.csv', index=False)

with open('data/kc3.csv') as in_kc3:
    kc3_lines = in_kc3.readlines()

for line in kc3_lines[:5]:
    print(line, end='')

id,date,price,sqft_living,zipcode
7129300520,2014-10-13,221900.0,1180,98178
6414100192,2014-12-09,538000.0,2570,98125
5631500400,2015-02-25,180000.0,770,98028
2487200875,2014-12-09,604000.0,1960,98136
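The index choice matters when the file is read back in. The following sketch uses a tiny two-row dataframe (values borrowed from the preview above) and in-memory strings rather than files. It shows that an index written without a column name comes back as a column called `Unnamed: 0` unless we tell `read_csv` to treat it as the index.

```python
import io

import pandas as pd

frame = pd.DataFrame(
    {"id": ["7129300520", "6414100192"], "price": [221900.0, 538000.0]}
)

# With no path given, to_csv() returns the CSV text as a string.
with_index = frame.to_csv()                # default is index=True
without_index = frame.to_csv(index=False)

# Reading the first version back with default settings adds an extra column.
reread_default = pd.read_csv(io.StringIO(with_index))
# index_col=0 turns that first column back into the index.
reread_as_index = pd.read_csv(io.StringIO(with_index), index_col=0)
# The file written with index=False round-trips without the extra column.
reread_clean = pd.read_csv(io.StringIO(without_index))

print(list(reread_default.columns))
print(list(reread_as_index.columns))
print(list(reread_clean.columns))
```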


Like `read_csv()`, the `to_csv()` method has a myriad of options for controlling how the dataframe is written to the CSV file.
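As a small illustration, the sketch below uses a few of the arguments that appear in the `to_csv` signature above (`sep`, `na_rep`, `float_format`, `columns`, and `index`). The dataframe is made up for the example, including the missing price, and the result is returned as a string instead of being written to disk.

```python
import pandas as pd

# Made-up dataframe; the second price is deliberately missing.
frame = pd.DataFrame(
    {
        "id": ["7129300520", "6414100192"],
        "price": [221900.0, None],
        "sqft_living": [1180, 2570],
    }
)

tsv_text = frame.to_csv(
    sep="\t",                 # write tab-delimited text instead of commas
    index=False,              # leave out the row index
    columns=["id", "price"],  # write only these columns
    float_format="%.2f",      # two decimal places for floats
    na_rep="NA",              # how missing values appear in the file
)

print(tsv_text)
```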

## No header line

Sometimes a CSV file won't contain a header line. Other times, the header will contain additional lines along with the column names at the top of the file.

The file `data/siteA_loc1_shallow.csv` contains stream temperature readings from a sensor placed in a specific stream for several weeks. Each row consists of a datetime, a 'C' or 'F' indicating Celsius or Fahrenheit, and the temperature reading. Unfortunately, the file has no header line.

In [45]:
with open('data/siteA_loc1_shallow.csv') as in_siteA:
    siteA_lines = in_siteA.readlines()

for line in siteA_lines[:5]:
    print(line, end='')

9/2/2009 10:50,C,27
9/2/2009 14:50,C,20.5
9/2/2009 18:50,C,20.5
9/2/2009 22:50,C,20
9/3/2009 2:50,C,20


This is easily dealt with using the `names` argument for `read_csv`.

In [48]:
siteA_df = pd.read_csv('data/siteA_loc1_shallow.csv', names=['datetime','scale','temperature'])
siteA_df.head()

,datetime,scale,temperature
0,9/2/2009 10:50,C,27.0
1,9/2/2009 14:50,C,20.5
2,9/2/2009 18:50,C,20.5
3,9/2/2009 22:50,C,20.0
4,9/3/2009 2:50,C,20.0
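To see why the `names` argument matters here, this sketch reads the first three rows shown above (from an in-memory string rather than the file) in three different ways.

```python
import io

import pandas as pd

# First three rows of the sensor file as printed above.
raw_text = "9/2/2009 10:50,C,27\n9/2/2009 14:50,C,20.5\n9/2/2009 18:50,C,20.5\n"

# Default settings: the first reading is mistaken for the header row.
wrong = pd.read_csv(io.StringIO(raw_text))

# header=None keeps every row as data and numbers the columns 0, 1, 2.
numbered = pd.read_csv(io.StringIO(raw_text), header=None)

# names= supplies the labels and also tells pandas there is no header row.
named = pd.read_csv(
    io.StringIO(raw_text), names=["datetime", "scale", "temperature"]
)

print(list(wrong.columns), len(wrong))
print(list(numbered.columns), len(numbered))
print(list(named.columns), len(named))
```

With the defaults, one reading is silently lost to the header, which is easy to miss in a file with thousands of rows.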


## A more complex header

This next example is more complicated. It involves some stream flow data from the [USGS](https://www.usgs.gov/). The data is tab delimited but contains a bunch of comments at the top of the file, followed by the header line, which is then followed by a format specification line. Here's what the first 35 lines of the file look like.

In [51]:
with open('data/clinton_river_hard.csv') as in_crh:
    crh_lines = in_crh.readlines()

for line in crh_lines[:35]:
    print(line, end='')

# Some of the data that you have obtained from this U.S. Geological Survey database
# may not have received Director's approval. Any such data values are qualified
# as provisional and are subject to revision. Provisional data are released on the
# condition that neither the USGS nor the United States Government may be held liable
# for any damages resulting from its use.
#
# Additional info: https://help.waterdata.usgs.gov/policies/provisional-data-statement
#
# File-format description:  https://help.waterdata.usgs.gov/faq/about-tab-delimited-output
# Automated-retrieval info: https://help.waterdata.usgs.gov/faq/automated-retrievals
#
# Contact:   gs-w_support_nwisweb@usgs.gov
# retrieved: 2021-05-27 10:22:40 EDT       (nadww01)
#
# Data for the following 1 site(s) are contained in this file
#    USGS 04161000 CLINTON RIVER AT AUBURN HILLS, MI
# -----------------------------------------------------------------------------------
#
# Data provided for site 04161000
#            TS   par

The key to getting this data read into a pandas `DataFrame` was careful reading of the pandas documentation for `read_csv`. Here's our challenge:

* We need to skip the row immediately after the header row.
* The header row itself follows a whole bunch of other rows to be skipped, all of them starting with '#'.

So, we can use the `comment='#'` argument to skip the rows above the header. When we specify the row number for the header, pandas does NOT count comment lines, and it starts counting at 0. From the perspective of `read_csv`, the header line is row 0 once the comments are ignored. **However**, to skip the row below the header we use the `skiprows` argument, and it does **NOT** ignore the comment lines when counting. Pasting part of the file into a text editor reveals that we need to skip row number 30 (again, counting from 0). So, let's cheat and hard code the row number to skip, just to show that this works. The `skiprows` argument accepts either a list of row numbers to skip or a single integer giving the number of rows to skip at the start of the file. To make this work, we need the list form.

In [54]:
flowdata_test = pd.read_csv('data/clinton_river_hard.csv', sep = '\t', comment='#', 
                       header = 0, skiprows = [30], parse_dates = ['datetime']) 

flowdata_test.head()

,agency_cd,site_no,datetime,tz_cd,279557_00065,279557_00065_cd,279558_63160,279558_63160_cd,72218_00060,72218_00060_cd
0,USGS,4161000,2021-01-01 00:00:00,EST,1.59,A,847.69,P,95.2,A
1,USGS,4161000,2021-01-01 00:15:00,EST,1.6,A,847.7,P,96.7,A
2,USGS,4161000,2021-01-01 00:30:00,EST,1.6,A,847.7,P,96.7,A
3,USGS,4161000,2021-01-01 00:45:00,EST,1.61,A,847.71,P,98.2,A
4,USGS,4161000,2021-01-01 01:00:00,EST,1.61,A,847.71,P,98.2,A


And here's what happens if we try telling pandas to skip 1 row by passing `skiprows=1`. An integer means "skip this many lines at the start of the file", so only the first comment line is skipped and we end up with the format specification row in our dataframe. So, if the number of comment lines is always the same, then we can use the approach above to skip the row we want. But of course, in real life, the number of comment lines will usually vary, and that is certainly the case with this data from the USGS. For example, depending on the type of gauge, a different number of metrics might be captured, and that affects the number of rows in the section starting with "Data provided".

In [None]:
pd.read_csv('data/clinton_river_hard.csv', sep = '\t', comment='#', 
                       header = 0, skiprows = 1, parse_dates = ['datetime']).head()


So, how might we handle this more complex case of a variable number of comment lines and skipping one line after the header? Well, I actually address this problem with this same data source in my [Advanced Analytics with Python course](http://www.sba.oakland.edu/faculty/isken/courses/mis6900/index.html). In addition to this line skipping, I also show how to use Python to automatically grab the data from the USGS website and do some datetime manipulations related to time zones. If you are interested, you can check out the screencast and associated Jupyter notebook from [this course web page](http://www.sba.oakland.edu/faculty/isken/courses/mis6900/getting_data_from_web.html#web-apis-data-wrangling-with-pandas).
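As a rough sketch of one possible approach (not necessarily the one used in the course), we can scan the file once to find the first line that doesn't start with `#`, and then pass the line number that follows it to `skiprows`. The helper name `format_row_number`, the shortened comment block, and the simplified `flow` column below are all made up for illustration. The sketch also reads `site_no` as a string so that the leading zero in the site number is kept.

```python
import io

import pandas as pd


def format_row_number(lines, comment="#"):
    """Return the 0-based file line number of the row right after the header.

    Hypothetical helper: assumes the header is the first non-blank line
    that does not start with the comment character.
    """
    for line_number, line in enumerate(lines):
        if line.strip() and not line.startswith(comment):
            # This line is the header; the format specification row follows it.
            return line_number + 1
    raise ValueError("no header line found")


# Made-up stand-in for the USGS file: comments, header, format row, data.
raw_text = (
    "# made-up comment line\n"
    "# another comment line\n"
    "#\n"
    "agency_cd\tsite_no\tdatetime\tflow\n"
    "5s\t15s\t20d\t14n\n"
    "USGS\t04161000\t2021-01-01 00:00\t95.2\n"
    "USGS\t04161000\t2021-01-01 00:15\t96.7\n"
)

row_to_skip = format_row_number(raw_text.splitlines())

flow = pd.read_csv(
    io.StringIO(raw_text),
    sep="\t",
    comment="#",
    header=0,
    skiprows=[row_to_skip],       # file line number, comment lines included
    dtype={"site_no": "string"},  # keep the leading zero in the site number
    parse_dates=["datetime"],
)

print(row_to_skip)
print(flow)
```

With a real file, the same helper could be given the list returned by `readlines()` before calling `read_csv` on the path.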

## Using the `csv` package

You may have data in a CSV file that you want to work with in other ways than just reading it into a pandas `DataFrame`. For example, you may have some row by row processing that needs to be done before moving the data to some new file or database. The `csv` package makes it easy to iterate through a CSV file using a `reader` object. 

Much like pandas, there are various input arguments for controlling the reading process in terms of dealing with things like whitespace, quote characters, and delimiters. 
In addition, the `csv` package uses the notion of *dialects* to group together commonly used input argument values. The default dialect is named `'excel'` and is designed to handle CSV files generated by Excel. As you might guess, this dialect includes things like using a comma for the delimiter and Windows style line endings. If you want all the details, see this [SO post](https://stackoverflow.com/questions/49204639/what-exactly-are-the-csv-modules-dialect-settings-for-excel-tab).
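To get a feel for what a dialect bundles together, this short sketch lists the dialects registered in the `csv` module and prints a few of their settings. It then reads a made-up tab-delimited snippet with the built-in `'excel-tab'` dialect.

```python
import csv
import io

# Show the delimiter, quote character and line ending for each dialect.
for name in csv.list_dialects():
    dialect = csv.get_dialect(name)
    print(name, repr(dialect.delimiter), repr(dialect.quotechar),
          repr(dialect.lineterminator))

# Made-up tab-delimited text, read with the 'excel-tab' dialect.
tab_text = "station_id\tname\nA-01\t3rd Ave\n"
rows = list(csv.reader(io.StringIO(tab_text), dialect="excel-tab"))
print(rows)
```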

The `csv` package also supports writing CSV files using a `writer` object. 

Let's see some simple examples. We'll start by just iterating through a CSV file containing station information for a bike sharing system.

### Reading a CSV file with `csv`

In [None]:
import csv

In [None]:
# Open the file in read mode (the csv docs recommend newline='' for csv files)
with open('data/station.csv', 'r', newline='') as csvfile:
    # Create a reader object
    station_reader = csv.reader(csvfile)

    # Iterate through the reader and print out each row that is read
    for station_row in station_reader:
        print(station_row)


We see that each row is read into a list containing the individual column values in that row. Notice that the header is no different than any other row - it's just a list. Also notice that there is no "container" data structure that is preserving all these lists. 

In [None]:
station_row
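Since the header is just another row, a common pattern is to pull it off the reader with `next()` before looping over the data rows. The sketch below uses a few made-up station rows held in an in-memory string. The column names and values are illustrative, not the real contents of `station.csv`.

```python
import csv
import io

# Made-up station data for illustration.
csv_text = "station_id,name,lat\nA-01,3rd Ave,47.61\nB-02,Pine St,47.62\n"

with io.StringIO(csv_text) as csvfile:
    reader = csv.reader(csvfile)
    header = next(reader)  # consume the first row, which holds the column names
    rows = list(reader)    # the remaining rows are the data

print(header)
print(rows)
```

Storing the rows in a list is what preserves them after the loop. The reader itself can only be stepped through once.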

Here are some questions to answer as a review of basic Python concepts:

**Q**: What is `open`?

**Q**: What are the arguments passed into `open` and what are they specifying?

**Q**: What is `csvfile` and what is its data type?

**Q**: What type of thing is returned by `open`?

**Q**: What is the difference between `station_reader` and `csv.reader`?

**Q**: Would the following line of code be ok or would it cause an error immediately when it's executed?

    peanut_butter = csv.reader(csvfile)
    
Remember, the general form of a `for` loop block is:

    for variable in <iterable>:
        do one or more things with variable

An *iterable* is a collection of objects that we can iterate over (or step through).

So, in this case:

* variable: `station_row`
* collection or iterable: `station_reader`
* something: `print(station_row)`

**Q**: What kind of data is `station_row`?

**Q**: If `station_reader` is a collection, what's it a collection of?

So, if we want to store these lists (rows) for further processing, we need to do it ourselves. For example, we might store this data as a list of lists. Also, let's do a little data processing as we go:

- only keep the `station_id`, `lat`, `long`, and `current_dockcount` fields
- don't keep any stations that have been decommissioned

In [None]:
from pprint import pprint
# Create empty list to serve as container
stations = []

# Open the file in read mode (the csv docs recommend newline='' for csv files)
with open('data/station.csv', 'r', newline='') as csvfile:
    # Create a reader object
    station_reader = csv.reader(csvfile)

    # Iterate through the reader 
    for station_row in station_reader:
        # Skip decommissioned stations: keep rows whose last field is blank
        if len(station_row[-1].strip()) == 0:
            # Just grab the columns we want
            new_station_row = [station_row[i] for i in [0, 2, 3, -2]]
            # Append list (row) to container list
            stations.append(new_station_row)

# pretty print the stations list
pprint(stations)


Now the station information is stored as a big list of lists and we've only kept the columns we are interested in.
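One thing to keep in mind is that, by default, every value the `csv` reader hands back is a string, even if it looks like a number. If we want to do math with the coordinates or dock counts, we have to convert them ourselves. Here's a minimal sketch using two made-up rows in the same four-column layout (station id, latitude, longitude, dock count).

```python
# Made-up rows shaped like the filtered stations list: all values are strings.
stations = [
    ["A-01", "47.61", "-122.34", "18"],
    ["B-02", "47.62", "-122.33", "16"],
]

# Convert the coordinates to floats and the dock count to an integer.
typed_stations = [
    [station_id, float(lat), float(long), int(dockcount)]
    for station_id, lat, long, dockcount in stations
]

total_docks = sum(row[3] for row in typed_stations)
print(typed_stations)
print(total_docks)
```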

### Writing a CSV file with `csv`

Now let's write this list of lists back out to a new CSV file. For this we use a `writer` object along with its `writerows` method. 

In [None]:
# Open a new file in write mode; newline='' lets the csv writer control line endings
with open('data/station_location.csv', 'w', newline='') as csvfile:
    # Create a writer object
    station_writer = csv.writer(csvfile)
    # Write out all the lists (rows) in the stations list (the collection or iterable)
    station_writer.writerows(stations)
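If the list of rows doesn't include a header row, we can write one ourselves with `writerow` before calling `writerows`. This sketch writes to an in-memory buffer instead of a file so we can inspect the result. The rows are made up, and the header names are the four fields listed earlier.

```python
import csv
import io

header = ["station_id", "lat", "long", "current_dockcount"]
# Made-up rows for illustration.
stations = [
    ["A-01", "47.61", "-122.34", "18"],
    ["B-02", "47.62", "-122.33", "16"],
]

# newline='' keeps the buffer from translating the writer's line endings.
buffer = io.StringIO(newline="")
station_writer = csv.writer(buffer)
station_writer.writerow(header)     # a single row: the column names
station_writer.writerows(stations)  # all of the data rows

written = buffer.getvalue()
print(written)
```

The default `'excel'` dialect ends each row with `\r\n`, which is why opening real output files with `newline=''` is recommended.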



## Answers to review questions

Here are answers to the review questions posed above:

**Q**: What is `open`?

`open` is a function used to open files.

**Q**: What are the arguments passed into `open` and what are they specifying?

The first argument, `'data/station.csv'`, is a string specifying the location of the file to be opened.
The second argument, `'r'`, specifies the *mode* in which the file is opened. 'r' stands for read mode.

**Q**: What is `csvfile` and what is its data type?

It is a variable whose type is a *file object*. Specifically, it's ...

In [None]:
type(csvfile)

**Q**: What type of thing is returned by `open`?

We just answered that, it's a file object.

**Q**: What is the difference between `station_reader` and `csv.reader`?

`csv.reader` is a function in the `csv` module that creates reader objects. `station_reader` is a variable we created to store the reader object returned by calling `csv.reader(csvfile)`.



**Q**: Would the following line of code be ok or would it cause an error immediately when it's executed?

    peanut_butter = csv.reader(csvfile)

It would be perfectly fine. It's a strange name for a variable, but so be it.

Remember, the general form of a `for` loop block is:

    for variable in <iterable>:
        do one or more things with variable

An *iterable* is a collection of objects that we can iterate over (or step through).

So, in this case:

* variable: `station_row`
* collection or iterable: `station_reader`
* something: `print(station_row)`

**Q**: What kind of data is `station_row`?

It's a list containing the values in one row of the data file.

**Q**: If `station_reader` is a collection, what's it a collection of?

It's a collection of lists, each of which is one row of the CSV file.

### Read data into dictionaries instead of lists
As we saw above, `csv.reader` returns each line of the file as a list. Sometimes that's what we want, but sometimes we'd rather have each line go into a different type of data structure. Instead of lists, let's read each line into a dictionary keyed by the column names in the header row. Thankfully, the `csv` library has a `DictReader` class that does exactly that.

In [None]:
import csv

# Use a with block so the file is closed when we're done
with open('data/data-text.csv', 'r', newline='') as csvfile:
    reader = csv.DictReader(csvfile)

    for row in reader:
        print(row)

In [None]:
# Open the file in read mode (the csv docs recommend newline='' for csv files)
with open('data/station.csv', 'r', newline='') as csvfile:
    # Create a reader object
    station_reader = csv.DictReader(csvfile)

    # Iterate through the reader and print out each row that is read
    for station_row in station_reader:
        print(station_row)

In [None]:
type(station_row)

In [None]:
for key, value in station_row.items():
    print(key, " = ", value)

Compare the output of this program with the original version. Obviously, one prints lists and this one prints dictionaries. However, what is one advantage of this dictionary based version?
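One possible answer is that fields can be looked up by name instead of by position, so the code keeps working even if the column order changes. The sketch below redoes the earlier station filtering that way and writes the result with `csv.DictWriter`. The rows and the `decommission_date` column name are assumptions made for illustration, since the real header of `station.csv` isn't shown here.

```python
import csv
import io

# Made-up station data; decommission_date is an assumed column name.
csv_text = (
    "station_id,name,lat,long,current_dockcount,decommission_date\n"
    "A-01,3rd Ave,47.61,-122.34,18,\n"
    "B-02,Pine St,47.62,-122.33,16,8/1/2016\n"
)

keep = ["station_id", "lat", "long", "current_dockcount"]
active_stations = []

with io.StringIO(csv_text) as csvfile:
    for row in csv.DictReader(csvfile):
        # Keep stations whose decommission date is blank.
        if not row["decommission_date"].strip():
            active_stations.append({field: row[field] for field in keep})

# Write the filtered rows, header included, to an in-memory buffer.
out = io.StringIO(newline="")
writer = csv.DictWriter(out, fieldnames=keep)
writer.writeheader()
writer.writerows(active_stations)

print(active_stations)
print(out.getvalue())
```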

## The bottom line
We've learned how to use Python's built-in `csv` library to read CSV files into common data structures such as lists and dictionaries. This is a common precursor to doing data cleanup and other data preparation tasks before moving on to data analysis. As we saw with pandas, most Python data analysis packages also have their own CSV reader functions.

Programming takes practice. Problem solving takes practice. Both Jupyter notebooks and PyCharm provide great environments for practicing your Python programming.