# Reading in CSV file

**Even though CSV stands for Comma-Separated Values, the values can be separated by any plain text character, e.g. '|', '\t', '/' etc. Python's `csv` module is built to handle these different CSV formats.**
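
**As a minimal sketch of that point, the same reader can parse pipe- or tab-separated text when it is given a `delimiter` argument. The in-memory strings below are made up for illustration and stand in for real files.**

```python
import csv
import io

# Hypothetical in-memory samples standing in for real files
pipe_text = "Rank|Country|Gold\n1|United States|39\n"
tab_text = "Rank\tCountry\tGold\n1\tUnited States\t39\n"

# The delimiter argument tells csv.reader which character separates fields
pipe_rows = list(csv.reader(io.StringIO(pipe_text), delimiter='|'))
tab_rows = list(csv.reader(io.StringIO(tab_text), delimiter='\t'))

print(pipe_rows)
print(tab_rows)
```

**Both calls should produce the same rows, because only the separator differs between the two samples.**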

**Using samples with different formats, read the data into Python objects.**

**Always open the file in your Python editor before reading it in, to check the format and the number of values, or fields, in each row.**

    Rank,Country,Gold,Silver,Bronze,Total
    1,United States,39,41,33,113
    2,China,38,32,18,88
    3,Japan,27,14,17,58
    ...

In [1]:
import csv

In [2]:
csv_file = 'data/OlympicMedals_2020.csv'

In [3]:
with open(csv_file, encoding='utf-8', newline='') as csv_data:
    reader = csv.reader(csv_data)
    for row in reader:
        print(row)

['Rank', 'Country', 'Gold', 'Silver', 'Bronze', 'Total']
['1', 'United States', '39', '41', '33', '113']
['2', 'China', '38', '32', '18', '88']
['3', 'Japan', '27', '14', '17', '58']
['4', 'Great Britain', '22', '21', '22', '65']
['5', 'ROC', '20', '28', '23', '71']
['6', 'Australia', '17', '7', '22', '46']
['7', 'Netherlands', '10', '12', '14', '36']
['8', 'France', '10', '12', '11', '33']
['9', 'Germany', '10', '11', '16', '37']
['10', 'Italy', '10', '10', '20', '40']
['11', 'Canada', '7', '6', '11', '24']
['12', 'Brazil', '7', '6', '8', '21']
['13', 'New Zealand', '7', '6', '7', '20']
['14', 'Cuba', '7', '3', '5', '15']
['15', 'Hungary', '6', '7', '7', '20']
['16', 'South Korea', '6', '4', '10', '20']
['17', 'Poland', '4', '5', '5', '14']
['18', 'Czech Republic', '4', '4', '3', '11']
['19', 'Kenya', '4', '4', '2', '10']
['20', 'Norway', '4', '2', '2', '8']
['21', 'Jamaica', '4', '1', '4', '9']
['22', 'Spain', '3', '8', '6', '17']
['23', 'Sweden', '3', '6', '0', '9']
['24', 'Switzerl

**Each row is returned as a list, showing the number of fields (6) with a comma between each field. The first row almost always holds the column headers for the fields, i.e. 'Rank', 'Country', 'Gold', 'Silver', 'Bronze' and 'Total'. Using the `csv` module, you can convert the data to dictionaries that use the headers as keys, with the corresponding values. You could also store the headers separately and then format the data in any way you want.**

In [4]:
with open(csv_file, encoding='utf-8', newline='') as csv_data:
    # newline='' leaves any '\r' in place, so strip both line-ending characters
    headers = csv_data.readline().rstrip('\r\n').split(',')
    print(f"Column headers are: {headers}")
    reader = csv.reader(csv_data)
    for row in reader:
        print(row)

Column headers are: ['Rank', 'Country', 'Gold', 'Silver', 'Bronze', 'Total']
['1', 'United States', '39', '41', '33', '113']
['2', 'China', '38', '32', '18', '88']
['3', 'Japan', '27', '14', '17', '58']
['4', 'Great Britain', '22', '21', '22', '65']
['5', 'ROC', '20', '28', '23', '71']
['6', 'Australia', '17', '7', '22', '46']
['7', 'Netherlands', '10', '12', '14', '36']
['8', 'France', '10', '12', '11', '33']
['9', 'Germany', '10', '11', '16', '37']
['10', 'Italy', '10', '10', '20', '40']
['11', 'Canada', '7', '6', '11', '24']
['12', 'Brazil', '7', '6', '8', '21']
['13', 'New Zealand', '7', '6', '7', '20']
['14', 'Cuba', '7', '3', '5', '15']
['15', 'Hungary', '6', '7', '7', '20']
['16', 'South Korea', '6', '4', '10', '20']
['17', 'Poland', '4', '5', '5', '14']
['18', 'Czech Republic', '4', '4', '3', '11']
['19', 'Kenya', '4', '4', '2', '10']
['20', 'Norway', '4', '2', '2', '8']
['21', 'Jamaica', '4', '1', '4', '9']
['22', 'Spain', '3', '8', '6', '17']
['23', 'Sweden', '3', '6', '0', '

**`csv.reader` starts on the next line, i.e. excluding the headers, because the file object has already moved past the first line. HOWEVER, you cannot read from the reader object outside the `with` statement, because the file it reads from has been closed by then.**
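
**A small sketch of both points, assuming an in-memory sample in place of the medals file. Here `next()` is used as an alternative to `readline()` for taking the header row, and the rows to keep are copied into a list before the `with` block closes the file.**

```python
import csv
import io

# Made-up sample standing in for data/OlympicMedals_2020.csv
sample = "Rank,Country,Gold\n1,United States,39\n2,China,38\n"

with io.StringIO(sample, newline='') as csv_data:
    reader = csv.reader(csv_data)
    headers = next(reader)   # first row: the column headers
    rows = list(reader)      # remaining rows, stored while the file is open

# The list is still usable here, but the reader's underlying file is closed
print(headers)
print(rows)

try:
    next(reader)
    reader_error = None
except ValueError as error:
    reader_error = error
    print(f"Reader is unusable after the with block: {error}")
```

**Storing the rows in a list costs memory for large files, so for big inputs it may be better to do all the processing inside the `with` block.**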

**Note that all numbers are parsed as string objects, so you will need to convert them to integers yourself. Nothing in the parsed rows tells you which values were meant as text and which as numbers. The `csv` module provides a `quoting` option for files in which string data like names and labels is surrounded with quotation marks, which lets the reader tell string and non-string values apart.**
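
**One hedged way to do that conversion by hand, assuming every value made up only of decimal digits should become an `int`. The `to_int_if_numeric` helper is hypothetical, not part of the `csv` module, and it does not handle negative numbers or decimals.**

```python
import csv
import io

# First rows of the medals data, held in memory for illustration
sample = (
    "Rank,Country,Gold,Silver,Bronze,Total\n"
    "1,United States,39,41,33,113\n"
    "2,China,38,32,18,88\n"
)

def to_int_if_numeric(value):
    """Return int(value) for digit-only strings, otherwise the original string."""
    return int(value) if value.isdecimal() else value

reader = csv.reader(io.StringIO(sample))
headers = next(reader)   # keep the header row as strings
medals = [[to_int_if_numeric(value) for value in row] for row in reader]

print(medals)
```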

## Quoting CSV data

**Open the CSV file in Jupyter to see the double quotation marks around the names, which show that string values have been quoted correctly in the CSV source.**

    "Cereal","Calories","Fat","Protein","Fibre","Vitamin E"
    "Barley",556,1.7,32.9,10.1,13.8
    "Durum",339,5,27.4,4.09,9.7
    "Fonio",240,1,4,1.7,0.05
    ...
    
**When reading the content in, you can pass the `quoting` argument to indicate that quotation marks around a value mean it is a string, i.e. non-numeric.**

In [5]:
csv_filename = 'data/cereal_grains.csv'

In [6]:
with open(csv_filename, encoding='utf-8', newline='') as csv_data:
    reader = csv.reader(csv_data)
    for row in reader:
        print(row)

['Cereal', 'Calories', 'Fat', 'Protein', 'Fibre', 'Vitamin E']
['Barley', '556', '1.7', '32.9', '10.1', '13.8']
['Durum', '339', '5', '27.4', '4.09', '9.7']
['Fonio', '240', '1', '4', '1.7', '0.05']
['Maize', '442', '7.4', '37.45', '6.15', '11.03']
['Millet', '484', '2', '37.9', '13.4', '9.15']
['Oats', '231', '9.2', '35.1', '10.3', '3.73']
['Rice (Brown)', '346', '2.8', '38.1', '9.9', '0.8']
['Rice (White)', '345', '3.6', '37.6', '5.4', '0.1']
['Rye', '422', '2', '31.4', '18.2', '21.2']
['Sorghum', '316', '3', '37.8', '9.92', '9.15']
['Triticale', '338', '1.81', '36.6', '19', '0.9']
['Wheat', '407', '1.2', '27.8', '12.9', '13.8']


**As you can see, the parser returns all values as strings when the file is read in without any customization. Pass the `csv.QUOTE_NONNUMERIC` constant to specify that non-numeric values have been quoted.**

In [7]:
with open(csv_filename, encoding='utf-8', newline='') as csv_data:
    reader = csv.reader(csv_data, quoting=csv.QUOTE_NONNUMERIC)
    for row in reader:
        print(row)

['Cereal', 'Calories', 'Fat', 'Protein', 'Fibre', 'Vitamin E']
['Barley', 556.0, 1.7, 32.9, 10.1, 13.8]
['Durum', 339.0, 5.0, 27.4, 4.09, 9.7]
['Fonio', 240.0, 1.0, 4.0, 1.7, 0.05]
['Maize', 442.0, 7.4, 37.45, 6.15, 11.03]
['Millet', 484.0, 2.0, 37.9, 13.4, 9.15]
['Oats', 231.0, 9.2, 35.1, 10.3, 3.73]
['Rice (Brown)', 346.0, 2.8, 38.1, 9.9, 0.8]
['Rice (White)', 345.0, 3.6, 37.6, 5.4, 0.1]
['Rye', 422.0, 2.0, 31.4, 18.2, 21.2]
['Sorghum', 316.0, 3.0, 37.8, 9.92, 9.15]
['Triticale', 338.0, 1.81, 36.6, 19.0, 0.9]
['Wheat', 407.0, 1.2, 27.8, 12.9, 13.8]


**Voila! But this only works if non-numeric values are quoted in the source CSV file (the reader raises `ValueError` for an unquoted value that is not a number), and if you are happy for the numbers to be converted to floats. Converting to integers is another step.**

**If the string values are not quoted in the source CSV, like the Olympic Medals dataset, `QUOTE_NONNUMERIC` is not an option and the numbers have to be converted by hand. What the module can still do is work out the file's format for you: use `Sniffer().sniff()` to 'sniff' out what the separator is and which character, if any, is used to quote strings.**
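
**That extra step could look like the sketch below, which assumes whole-number floats such as `556.0` should become `int` and everything else is left alone. The in-memory sample copies the first rows of the cereal file, trimmed to three columns.**

```python
import csv
import io

# First rows of the cereal data, trimmed to three columns for illustration
sample = '"Cereal","Calories","Fat"\n"Barley",556,1.7\n"Durum",339,5\n'

# QUOTE_NONNUMERIC turns every unquoted field into a float
reader = csv.reader(io.StringIO(sample), quoting=csv.QUOTE_NONNUMERIC)

cereals = []
for row in reader:
    # Convert floats with no fractional part to int; leave strings and other floats alone
    cereals.append([
        int(value) if isinstance(value, float) and value.is_integer() else value
        for value in row
    ])

print(cereals)
```

**Whether a column such as 'Fat' should mix `int` and `float` values is a design choice; converting per column rather than per value may suit real data better.**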

**The Sniffer creates a 'dialect', which groups the various format options in a single object.**

In [8]:
# NOTE - you can read in text files with csv module

input_filename = 'data/country_info.txt'

with open(input_filename, encoding='utf-8', newline='') as countries_data:
    sample = ""
    for line in range(3):
        sample += countries_data.readline()
    
    countries_dialect = csv.Sniffer().sniff(sample)
    
    # Tell Python to go back to start of file
    countries_data.seek(0)
    country_reader = csv.reader(countries_data, dialect=countries_dialect)
    
    for row in country_reader:
        print(row)

['Country', 'Capital', 'CC', 'CC3', 'IAC', 'TimeZone', 'Currency']
['Afghanistan', 'Kabul', 'AF', 'AFG', '+93', 'UTC+04:30', 'Afghan afghani']
['Aland Islands', 'Mariehamn', 'AX', 'ALA', '+358', 'UTC+02:00', 'Euro']
['Albania', 'Tirana', 'AL', 'ALB', '+355', 'UTC+01:00', 'Albanian lek']
['Algeria', 'Algiers', 'DZ', 'DZA', '+213', 'UTC', 'Algerian dinar']
['American Samoa', 'Pago Pago', 'AS', 'ASM', '+1 684', 'UTC-11:00', '']
['Andorra', 'Andorra la Vella', 'AD', 'AND', '+376', 'UTC+01:00', 'Euro']
['Angola', 'Luanda', 'AO', 'AGO', '+244', 'UTC+01:00', 'Angolan kwanza']
['Anguilla', 'The Valley', 'AI', 'AIA', '+1 264', 'UTC-04:00', 'East Caribbean dollar']
['Antarctica', '', 'AQ', 'ATA', '', '', '']
['Antigua and Barbuda', "St. John's", 'AG', 'ATG', '+1 268', 'UTC-04:00', 'East Caribbean dollar']
['Argentina', 'Buenos Aires', 'AR', 'ARG', '+54', 'UTC-03:00', 'Argentine peso']
['Armenia', 'Yerevan', 'AM', 'ARM', '+374', 'UTC+04:00', 'Armenian dram']
['Aruba', 'Oranjestad', 'AW', 'ABW', '

**As you can see, Sniffer detected that the separator was a pipe character. You can view all the attributes of the dialect object using the `dir()` function. The `csv.get_dialect()` function only works for dialects that have first been registered by name with `csv.register_dialect()`.**

In [9]:
print(dir(countries_dialect))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_name', '_valid', '_validate', 'delimiter', 'doublequote', 'escapechar', 'lineterminator', 'quotechar', 'quoting', 'skipinitialspace']


In [10]:
# List attributes you are interested in
attributes = [
    'delimiter', 
    'doublequote', 
    'escapechar', 
    'lineterminator', 
    'quotechar', 
    'quoting', 
    'skipinitialspace'
]

for attribute in attributes:
    print(f"{attribute}: {getattr(countries_dialect, attribute)}")

delimiter: |
doublequote: False
escapechar: None
lineterminator: 

quotechar: "
quoting: 0
skipinitialspace: False


* DELIMITER: **The `delimiter` is `|` character.**

* DOUBLEQUOTE: **A quotation mark inside a quoted value is not represented by 'doubling-up', i.e. two quotation marks in a row.**

* ESCAPECHAR: **There is no escape character to escape the `delimiter` or `quotechar` characters.**

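* LINETERMINATOR: **The value looks blank because the terminator is made of non-printing characters, not because there is none. The reader ignores this setting in any case; it only matters when writing.**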

* QUOTECHAR: **The way of quoting strings, or values containing special characters, is with double quotation marks.** 

* QUOTING: **`0` is the value of `csv.QUOTE_MINIMAL`, the default. When reading, no type conversion is done (unlike `QUOTE_NONNUMERIC`); when writing, only fields containing special characters are quoted.**

* SKIPINITIALSPACE: **Do not ignore leading whitespaces, just after each separator, i.e. keep them. Leading whitespaces are trimmed if set to `True`.**
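
**The same attributes can also be passed to `csv.reader` as keyword arguments instead of through a dialect. As a rough sketch on a made-up line, this is what `quotechar` and `doublequote=True` (the `csv.reader` default, unlike the sniffed value above) do when a quoted field contains both the delimiter and quotation marks.**

```python
import csv
import io

# Made-up line: the first field contains a pipe and doubled quotation marks
line = '"Pipe | inside ""quotes"""|second field\n'

reader = csv.reader(io.StringIO(line), delimiter='|', quotechar='"', doublequote=True)
fields = next(reader)

# The quoted pipe stays inside the first field, and each "" collapses to one "
print(fields)
```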

**You can edit the dialect object if you want to change an attribute value, like setting `skipinitialspace` to True. Run the program again with the update applied to the dialect before the reader is created.**

In [11]:
with open(input_filename, encoding='utf-8', newline='') as countries_data:
    sample = ""
    for line in range(3):
        sample += countries_data.readline()
    
    countries_dialect = csv.Sniffer().sniff(sample)
    # Update dialect to trim leading whitespaces
    countries_dialect.skipinitialspace = True
    
    # Tell Python to go back to start of file
    countries_data.seek(0)
    country_reader = csv.reader(countries_data, dialect=countries_dialect)
    
    for row in country_reader:
        print(row)

['Country', 'Capital', 'CC', 'CC3', 'IAC', 'TimeZone', 'Currency']
['Afghanistan', 'Kabul', 'AF', 'AFG', '+93', 'UTC+04:30', 'Afghan afghani']
['Aland Islands', 'Mariehamn', 'AX', 'ALA', '+358', 'UTC+02:00', 'Euro']
['Albania', 'Tirana', 'AL', 'ALB', '+355', 'UTC+01:00', 'Albanian lek']
['Algeria', 'Algiers', 'DZ', 'DZA', '+213', 'UTC', 'Algerian dinar']
['American Samoa', 'Pago Pago', 'AS', 'ASM', '+1 684', 'UTC-11:00', '']
['Andorra', 'Andorra la Vella', 'AD', 'AND', '+376', 'UTC+01:00', 'Euro']
['Angola', 'Luanda', 'AO', 'AGO', '+244', 'UTC+01:00', 'Angolan kwanza']
['Anguilla', 'The Valley', 'AI', 'AIA', '+1 264', 'UTC-04:00', 'East Caribbean dollar']
['Antarctica', '', 'AQ', 'ATA', '', '', '']
['Antigua and Barbuda', "St. John's", 'AG', 'ATG', '+1 268', 'UTC-04:00', 'East Caribbean dollar']
['Argentina', 'Buenos Aires', 'AR', 'ARG', '+54', 'UTC-03:00', 'Argentine peso']
['Armenia', 'Yerevan', 'AM', 'ARM', '+374', 'UTC+04:00', 'Armenian dram']
['Aruba', 'Oranjestad', 'AW', 'ABW', '

In [12]:
for attribute in attributes:
    print(f"{attribute}: {getattr(countries_dialect, attribute)}")

delimiter: |
doublequote: False
escapechar: None
lineterminator: 

quotechar: "
quoting: 0
skipinitialspace: True


**Now the `skipinitialspace` attribute is set to True, and any whitespace directly after a separator is removed when reading in the file.**

**The `lineterminator` prints as an empty line, but that does not mean it is empty. To see which characters it really holds, you can use the `repr()` function.**
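
**The country file has no spaces after its separators, so the rows printed above look the same as before. A made-up sample with a space after each pipe shows the difference; this is only a sketch, with the options passed as keyword arguments rather than through a sniffed dialect.**

```python
import csv
import io

# Made-up sample with a space after every pipe
sample = "Country| Capital| CC\nAlbania| Tirana| AL\n"

# Default: the space after each delimiter is kept as part of the field
kept = list(csv.reader(io.StringIO(sample), delimiter='|'))

# skipinitialspace=True: spaces immediately after the delimiter are ignored
trimmed = list(csv.reader(io.StringIO(sample), delimiter='|', skipinitialspace=True))

print(kept)
print(trimmed)
```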

In [14]:
for attribute in attributes:
    print(f"{attribute}: {repr(getattr(countries_dialect, attribute))}")

delimiter: '|'
doublequote: False
escapechar: None
lineterminator: '\r\n'
quotechar: '"'
quoting: 0
skipinitialspace: True


**The `lineterminator` actually contains carriage return (`\r`) and line feed, or newline, (`\n`) characters.**
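
**Returning to `csv.get_dialect()`, which was mentioned earlier: the sketch below registers a dialect under a name and then looks it up. The name `'pipes'` is arbitrary and chosen only for this example.**

```python
import csv

# Register a dialect by name; 'pipes' is an arbitrary name for this sketch
csv.register_dialect('pipes', delimiter='|', skipinitialspace=True)

# A registered dialect can now be fetched by name
registered = csv.get_dialect('pipes')
print(registered.delimiter, registered.skipinitialspace)
print('pipes' in csv.list_dialects())

# Looking up a name that was never registered raises csv.Error
try:
    csv.get_dialect('not-registered')
    lookup_failed = False
except csv.Error:
    lookup_failed = True

# Remove the dialect again so the registry is left as it was found
csv.unregister_dialect('pipes')
```

**Once registered, the name itself can be passed as the `dialect` argument, e.g. `csv.reader(file, dialect='pipes')`.**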

## Using `csv.DictReader()` class

**The `DictReader()` class behaves like the normal `csv.reader()` function and accepts the same formatting parameters, except it parses each row directly into a Python dictionary. By default, the values in the first row are used as the keys.**

**CSV File:**

    "Cereal","Calories","Fat","Protein","Fibre","Vitamin E"
    "Barley",556,1.7,32.9,10.1,13.8
    "Durum",339,5,27.4,4.09,9.7
    ...

In [13]:
csv_filename = 'data/cereal_grains.csv'

with open(csv_filename, encoding='utf-8', newline='') as cereals:
    reader = csv.DictReader(cereals)
    for row in reader:
        print(row)

{'Cereal': 'Barley', 'Calories': '556', 'Fat': '1.7', 'Protein': '32.9', 'Fibre': '10.1', 'Vitamin E': '13.8'}
{'Cereal': 'Durum', 'Calories': '339', 'Fat': '5', 'Protein': '27.4', 'Fibre': '4.09', 'Vitamin E': '9.7'}
{'Cereal': 'Fonio', 'Calories': '240', 'Fat': '1', 'Protein': '4', 'Fibre': '1.7', 'Vitamin E': '0.05'}
{'Cereal': 'Maize', 'Calories': '442', 'Fat': '7.4', 'Protein': '37.45', 'Fibre': '6.15', 'Vitamin E': '11.03'}
{'Cereal': 'Millet', 'Calories': '484', 'Fat': '2', 'Protein': '37.9', 'Fibre': '13.4', 'Vitamin E': '9.15'}
{'Cereal': 'Oats', 'Calories': '231', 'Fat': '9.2', 'Protein': '35.1', 'Fibre': '10.3', 'Vitamin E': '3.73'}
{'Cereal': 'Rice (Brown)', 'Calories': '346', 'Fat': '2.8', 'Protein': '38.1', 'Fibre': '9.9', 'Vitamin E': '0.8'}
{'Cereal': 'Rice (White)', 'Calories': '345', 'Fat': '3.6', 'Protein': '37.6', 'Fibre': '5.4', 'Vitamin E': '0.1'}
{'Cereal': 'Rye', 'Calories': '422', 'Fat': '2', 'Protein': '31.4', 'Fibre': '18.2', 'Vitamin E': '21.2'}
{'Cereal': '

**You can use the `csv.DictReader()` class to read any delimited text file, not only `.csv` files, directly into dictionary format. It does not parse JSON; that is the job of the `json` module.**

**Using `country_info.txt`, the aim is for the country names to be the keys, with the remaining values in nested dictionaries. As a first step, read each row into a dictionary and look the fields up by heading. Treat the text file as if it were CSV, i.e. include the `newline` argument.**

**Text File:**

    Country|Capital|CC|CC3|IAC|TimeZone|Currency
    Afghanistan|Kabul|AF|AFG|+93|UTC+04:30|Afghan afghani
    Aland Islands|Mariehamn|AX|ALA|+358|UTC+02:00|Euro
    Albania|Tirana|AL|ALB|+355|UTC+01:00|Albanian lek
    ...

In [22]:
countries_input = 'data/country_info.txt'

with open(countries_input, encoding='utf-8', newline='') as countries:
    dict_reader = csv.DictReader(countries, delimiter='|')
    
    for row in dict_reader:
        print(f"{row['Country']}")
        print(f"\tThe capital is {row['Capital']}")
        

Afghanistan
	The capital is Kabul
Aland Islands
	The capital is Mariehamn
Albania
	The capital is Tirana
Algeria
	The capital is Algiers
American Samoa
	The capital is Pago Pago
Andorra
	The capital is Andorra la Vella
Angola
	The capital is Luanda
Anguilla
	The capital is The Valley
Antarctica
	The capital is 
Antigua and Barbuda
	The capital is St. John's
Argentina
	The capital is Buenos Aires
Armenia
	The capital is Yerevan
Aruba
	The capital is Oranjestad
Australia
	The capital is Canberra
Austria
	The capital is Vienna
Azerbaijan
	The capital is Baku
Bahamas
	The capital is Nassau
Bahrain
	The capital is Manama
Bangladesh
	The capital is Dhaka
Barbados
	The capital is Bridgetown
Belarus
	The capital is Minsk
Belgium
	The capital is Brussels
Belize
	The capital is Belmopan
Benin
	The capital is Porto-Novo
Bermuda
	The capital is Hamilton
Bhutan
	The capital is Thimphu
Bolivia
	The capital is Sucre
Bonaire
	The capital is Kralendijk
Bosnia and Herzegovina
	The capital is Sarajevo
Bo
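
**One way to get a dictionary keyed by country name, with the remaining fields nested inside, is sketched below. It assumes country names are unique and uses a short in-memory sample with three of the columns from `country_info.txt`.**

```python
import csv
import io

# Three columns of the country data, held in memory for illustration
sample = "Country|Capital|CC\nAfghanistan|Kabul|AF\nAlbania|Tirana|AL\n"

countries = {}
for row in csv.DictReader(io.StringIO(sample), delimiter='|'):
    # Remove the country name from the row and use it as the outer key
    name = row.pop('Country')
    countries[name] = dict(row)

print(countries['Albania']['Capital'])
print(countries)
```

**If two rows shared a country name, the later row would silently replace the earlier one, so real data may need a duplicate check.**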

### Customized key names

**The column headings start with capital letters, which does not match the lowercase style commonly used for dictionary keys in Python code. You can specify a customized list of key names with the `fieldnames` argument, which is especially useful when there are no column headings in the source file at all.**

**The `csv` module comes with three built-in dialects to deal with common format types, like Excel:**

* **`csv.excel`**
* **`csv.excel_tab`**
* **`csv.unix_dialect`**

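**You can change the settings of a dialect to apply new rules to how the data is read in. Subclass the built-in dialect rather than assigning to `csv.excel` directly, because `csv.excel` is shared by every reader and writer in the program.**

In [25]:
# Subclass csv.excel so the built-in dialect itself stays unchanged
class PipeDialect(csv.excel):
    # Change delimiter to pipe char (instead of comma)
    delimiter = '|'

excel_dialect = PipeDialect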

In [29]:
with open(countries_input, encoding='utf-8', newline='') as countries:
    # newline='' leaves any '\r' in place, so strip both line-ending characters
    headings = countries.readline().rstrip('\r\n').split(excel_dialect.delimiter)
    # Lowercase the headings
    headings = [heading.casefold() for heading in headings]
    
    dict_reader = csv.DictReader(countries, fieldnames=headings, dialect=excel_dialect)
    
    for row in dict_reader:
        print(row)

{'country': 'Afghanistan', 'capital': 'Kabul', 'cc': 'AF', 'cc3': 'AFG', 'iac': '+93', 'timezone': 'UTC+04:30', 'currency': 'Afghan afghani'}
{'country': 'Aland Islands', 'capital': 'Mariehamn', 'cc': 'AX', 'cc3': 'ALA', 'iac': '+358', 'timezone': 'UTC+02:00', 'currency': 'Euro'}
{'country': 'Albania', 'capital': 'Tirana', 'cc': 'AL', 'cc3': 'ALB', 'iac': '+355', 'timezone': 'UTC+01:00', 'currency': 'Albanian lek'}
{'country': 'Algeria', 'capital': 'Algiers', 'cc': 'DZ', 'cc3': 'DZA', 'iac': '+213', 'timezone': 'UTC', 'currency': 'Algerian dinar'}
{'country': 'American Samoa', 'capital': 'Pago Pago', 'cc': 'AS', 'cc3': 'ASM', 'iac': '+1 684', 'timezone': 'UTC-11:00', 'currency': ''}
{'country': 'Andorra', 'capital': 'Andorra la Vella', 'cc': 'AD', 'cc3': 'AND', 'iac': '+376', 'timezone': 'UTC+01:00', 'currency': 'Euro'}
{'country': 'Angola', 'capital': 'Luanda', 'cc': 'AO', 'cc3': 'AGO', 'iac': '+244', 'timezone': 'UTC+01:00', 'currency': 'Angolan kwanza'}
{'country': 'Anguilla', 'capi
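
**For a source with no heading row at all, `fieldnames` supplies the keys and every line is treated as data. A minimal sketch, assuming a made-up headerless sample and key names chosen for this example:**

```python
import csv
import io

# Made-up sample with no heading row
sample = "Afghanistan|Kabul|AF\nAlbania|Tirana|AL\n"

# Key names chosen for this example; they are not read from the data
keys = ['country', 'capital', 'cc']

reader = csv.DictReader(io.StringIO(sample), fieldnames=keys, delimiter='|')
records = [dict(row) for row in reader]

print(records)
```

**Because `fieldnames` is given, the first line is not consumed as headings, so no data row is lost.**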