# COSC 526 - Assignment 01
### January 22, 2020
---

For this assignment, you must work in groups of one or two students. Each person is responsible for writing their own code, but the group will discuss its solution together.  In this notebook, we provide you with basic functions for completing the assignment.  *You will need to modify existing code and write new code to find a solution*.  Each member of the group must upload their own work to GitHub (**Bear with us! You will upload your solution to your GitHub repository during the next lecture, next week**).

This assignment is **due on Jan 17, 2020 (before 3:35PM ET).**

# Problem 1
In this problem we will explore reading in and parsing [delimiter-separated values](https://en.wikipedia.org/wiki/Delimiter-separated_values) stored in files.  We will start with [comma-separated values](https://en.wikipedia.org/wiki/Comma-separated_values) and then move on to [tab-separated values](https://en.wikipedia.org/wiki/Tab-separated_values).

### Problem 1a: Comma-Separated Values (CSV)

From [Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values): In computing, a comma-separated values (CSV) file stores tabular data (numbers and text) in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas. The use of the comma as a field separator is the source of the name for this file format.
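As a minimal sketch of the record and field structure described above, the snippet below splits a small in-memory CSV string into records and fields. The sample text and its column names are hypothetical and are not taken from the provided `data.csv`.

```python
# Hypothetical sample text; the provided data.csv has different contents.
sample_csv = "name,age\nAda,36\nGrace,85\n"

# Each line of the text is one record.
records = sample_csv.splitlines()

# Each record is a list of fields separated by commas.
fields = [record.split(",") for record in records]

print(fields)
# [['name', 'age'], ['Ada', '36'], ['Grace', '85']]
```

Note that every field is read as a string, so numeric columns need an explicit conversion (for example with `int`) before doing arithmetic.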

If you were to consider the CSV file as a matrix, each line would represent a row and each comma would separate two adjacent columns.  In the provided CSV file, the first row consists of a header that "names" each column.  In this problem, ...

- Count (and print) the number of rows of data (header is excluded) in the csv file
- Count (and print) the number of columns of data in the csv file
- Calculate (and print) the average of the values that are in the "age" column
  - You can assume each age in the file is an integer, but the average should be calculated as a float

In [1]:
def parse_delimited_file(filename, delimiter=","):
    # Open and read in all lines of the file
    # (I do not recommend readlines for LARGE files)
    # `open`: ref [1]
    # `readlines`: ref [2]
    with open(filename, 'r', encoding='utf8') as dsvfile:
        lines = dsvfile.readlines()

    # Strip off the newline from the end of each line
    # HINT: ref [3]
    # Using list comprehension is the recommended pythonic way to iterate through lists
    # HINT: ref [4]
    lines = [line.rstrip('\n') for line in lines]
    
    # Split each line based on the delimiter (which, in this case, is the comma)
    # HINT: ref [5]
    split_lines = [line.split(delimiter) for line in lines]
    
    # Separate the header from the data
    # HINT: ref [6]
    header = split_lines[0]
    data_lines = split_lines[1:]
    
    # Find "age" within the header
    # (i.e., calculating the column index for "age")
    # HINT: ref [7]
    age_index = header.index("age")

    # Calculate the number of data rows and columns
    # HINT: ref [8]
    num_data_rows = len(data_lines)
    num_data_cols = len(header)
    
    # Sum the "age" values
    # HINT: ref [9]
    sum_age = 0
    for row in data_lines:
        sum_age += int(row[age_index])
        
    # Calculate the average age
    avg_age = sum_age / num_data_rows
    
    # Print the results
    # `format`: ref [10]
    print("Number of rows of data: {}".format(num_data_rows))
    print("Number of cols: {}".format(num_data_cols))
    print("Average Age: {}".format(avg_age))
    
# Parse the provided csv file
parse_delimited_file('data.csv')

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875


**Expected Output:**
```
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
```
**References:**
- [1: open](https://docs.python.org/3.6/library/functions.html#open)
- [2: readlines](https://docs.python.org/3.6/library/io.html#io.IOBase.readlines)
- [3: rstrip](https://docs.python.org/3.6/library/stdtypes.html#str.rstrip)
- [4: list comprehension](https://docs.python.org/3/tutorial/datastructures.html#list-comprehensions)
- [5: split](https://docs.python.org/3.6/library/stdtypes.html#str.split)
- [6: slice](https://docs.python.org/3.6/glossary.html#term-slice)
- [7: "more on lists"](https://docs.python.org/3.6/tutorial/datastructures.html#more-on-lists)
- [8: len](https://docs.python.org/3.6/library/functions.html#len)
- [9: int](https://docs.python.org/3.6/library/functions.html#int)
- [10: format](https://docs.python.org/3.6/library/stdtypes.html#str.format)
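The comment in the solution code above notes that `readlines` is not recommended for large files. One possible alternative, sketched here under the assumption that the input has a header row containing an "age" column, is to iterate over the file object line by line so that only one line is held in memory at a time. The function name `summarize_ages_streaming` and the sample data are hypothetical.

```python
import io


def summarize_ages_streaming(lines, delimiter=","):
    """Return (row count, column count, average age) from an iterable of lines."""
    line_iter = iter(lines)

    # The first line is assumed to be the header.
    header = next(line_iter).rstrip("\n").split(delimiter)
    age_index = header.index("age")

    num_rows = 0
    sum_age = 0
    for line in line_iter:
        # Skip blank lines (e.g., a trailing empty line at the end of the file).
        if not line.strip():
            continue
        fields = line.rstrip("\n").split(delimiter)
        sum_age += int(fields[age_index])
        num_rows += 1

    # Avoid dividing by zero when the file has a header but no data rows.
    avg_age = sum_age / num_rows if num_rows else float("nan")
    return num_rows, len(header), avg_age


# Hypothetical in-memory sample standing in for an open file.
sample = io.StringIO("name,age\nAda,36\nGrace,85\n")
print(summarize_ages_streaming(sample))
# (2, 2, 60.5)
```

Because an open file object is itself an iterable of lines, the same function could be called as `summarize_ages_streaming(dsvfile)` inside a `with open(...)` block.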


### Problem 1b: Tab-Separated Values (TSV)

From [Wikipedia](https://en.wikipedia.org/wiki/Tab-separated_values): A tab-separated values (TSV) file is a simple text format for storing data in a tabular structure, e.g., database table or spreadsheet data, and a way of exchanging information between databases. Each record in the table is one line of the text file. Each field value of a record is separated from the next by a tab character. The TSV format is thus a type of the more general delimiter-separated values format.

In this problem, repeat the analyses performed in the previous problem, but for the provided tab-delimited file.

**NOTE:** the order of the columns has changed in this file.  If you hardcoded the position of the "age" column, think about how you can generalize the `parse_delimited_file` function to work for any delimited file with an "age" column.
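As a small illustration of the note above, looking up the column position in the header (instead of hardcoding it) gives the right index regardless of column order. The header strings below are hypothetical examples, not the actual headers of the provided files.

```python
# Hypothetical headers with the same columns in a different order.
csv_header = "name,age,field".split(",")
tsv_header = "field\tname\tage".split("\t")

# list.index finds the position of "age" in each header.
csv_age_index = csv_header.index("age")
tsv_age_index = tsv_header.index("age")

print(csv_age_index, tsv_age_index)
# 1 2
```

If the requested column is missing, `list.index` raises a `ValueError`, which is a reasonable signal that the file does not have the expected layout.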

In [2]:
# Further reading on optional arguments, like "delimiter": http://www.diveintopython.net/power_of_introspection/optional_arguments.html
parse_delimited_file('data.tsv', delimiter="\t")

Number of rows of data: 8
Number of cols: 3
Average Age: 70.875


**Expected Output:**
```
Number of rows of data: 8
Number of cols: 3
Average Age: 70.875
```
---

# Problem 2
If you opened the `data.csv` file, you may have noticed some non-English letters in the names column.  These characters are represented using [Unicode](https://en.wikipedia.org/wiki/Unicode), a standard for representing many different types and forms of text.  Python 3 [natively supports](https://docs.python.org/3/howto/unicode.html) Unicode, but many tools do not.  Some tools require text to be formatted with [ASCII](https://en.wikipedia.org/wiki/ASCII).
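To make the difference concrete, the sketch below checks whether a string can be encoded as ASCII. The helper name `can_encode_ascii` is hypothetical; it relies only on the built-in `str.encode` method, which raises `UnicodeEncodeError` for characters outside the ASCII range.

```python
def can_encode_ascii(text):
    """Return True if every character in text is representable in ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(can_encode_ascii("Erwin Schrödinger"))   # False
print(can_encode_ascii("Erwin Schroedinger"))  # True
```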

Convert the Unicode-formatted names into ASCII-formatted names, and save the names out to a file named `data-ascii.txt` (one name per line).  We have provided you with a [transliteration dictionary](https://german.stackexchange.com/questions/4992/conversion-table-for-diacritics-e-g-%C3%BC-%E2%86%92-ue) that maps several common Unicode characters to their ASCII transliteration.  Use this dictionary to convert the Unicode strings to ASCII.

In [4]:
translit_dict = {
    "ä" : "ae",
    "ö" : "oe",
    "ü" : "ue",
    "Ä" : "Ae",
    "Ö" : "Oe",
    "Ü" : "Ue", 
    "ł" : "l",
    "ō" : "o",
}

with open("data.csv", 'r', encoding='utf8') as csvfile:
    lines = csvfile.readlines()

# Strip off trailing whitespace (including the newline) from the end of each line
lines = [line.rstrip() for line in lines]
    
# Split each line based on the delimiter (which, in this case, is the comma)
split_lines = [line.split(",") for line in lines]

# Separate the header from the data
header = split_lines[0]
data_lines = split_lines[1:]
    
# Find "name" within the header
name_index = header.index("name")

# Extract the names from the rows
unicode_names = [line[name_index] for line in data_lines]

# Iterate over the names
translit_names = []
for unicode_name in unicode_names:
    # Perform the replacements in the translit_dict
    # HINT: ref [1]
    translit_name = unicode_name
    for key, value in translit_dict.items():
        translit_name = translit_name.replace(key, value)
    translit_names.append(translit_name)

# Write out the names to a file named "data-ascii.txt"
# HINT: ref [2]
with open("data-ascii.txt", 'w') as outfile:
    for name in translit_names:
        outfile.write(name + "\n")

# Verify that the names were converted and written out correctly
with open("data-ascii.txt", 'r') as infile:
    for line in infile:
        print(line.rstrip())

Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie


**Expected Output:**
```
Richard Phillips Feynman
Shin'ichiro Tomonaga
Julian Schwinger
Rudolf Ludwig Moessbauer
Erwin Schroedinger
Paul Dirac
Maria Sklodowska-Curie
Pierre Curie
```

**References:**
- [1: replace](https://docs.python.org/3.6/library/stdtypes.html#str.replace)
- [2: file object methods](https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects)
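As an alternative to the nested `replace` loop in the solution above, the same mapping can be applied in a single pass with the built-in `str.maketrans` and `str.translate`. This is only a sketch of one option: it works here because every key in the dictionary is a single character, which `str.maketrans` requires when given a dictionary.

```python
translit_dict = {
    "ä": "ae",
    "ö": "oe",
    "ü": "ue",
    "Ä": "Ae",
    "Ö": "Oe",
    "Ü": "Ue",
    "ł": "l",
    "ō": "o",
}

# Build a translation table once; keys must be single characters.
translit_table = str.maketrans(translit_dict)

# Apply the table to two of the names from the data.
print("Rudolf Ludwig Mössbauer".translate(translit_table))
# Rudolf Ludwig Moessbauer
print("Maria Skłodowska-Curie".translate(translit_table))
# Maria Sklodowska-Curie
```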

# Practical Tasks:

This set of practical tasks is **due on Jan 17, 2020 (before 8:00AM ET).**

**Definitions:**
- **GitHub:** web-based hosting service for version control used to distribute and collect assignments as well as other class materials (e.g., slides, code, and datasets)
- **Git:** the version control software on which GitHub is built

**Practical Tasks:**
- Create your own GitHub account
- Submit your GitHub username to the Google form: https://forms.gle/CKugke8Dzqjm9tQ89
- Install Git on your laptop

# Free-Form Questions:

The answers to the following questions are **due on Jan 17, 2020 (before 3:35PM ET).**

Your solutions for Problems 1 & 2 probably share a lot of code in common. You might even have copied-and-pasted from Problem 1 into Problem 2. Refactor `parse_delimited_file` to be useful in both problems.

In [4]:
# Add your code here
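One possible direction for this refactoring, offered only as a sketch, is to separate parsing from analysis: one helper returns the header and data rows, and another extracts a named column. The helper names `read_delimited` and `get_column` and the sample lines are hypothetical.

```python
def read_delimited(lines, delimiter=","):
    """Split an iterable of lines into a header list and a list of data rows."""
    rows = [line.rstrip("\n").split(delimiter) for line in lines if line.strip()]
    return rows[0], rows[1:]


def get_column(header, data_rows, name):
    """Return the values (as strings) of the column called name."""
    index = header.index(name)
    return [row[index] for row in data_rows]


# Hypothetical sample lines standing in for the contents of a file.
sample_lines = ["name,age\n", "Ada,36\n", "Grace,85\n"]
header, data_rows = read_delimited(sample_lines)

# Problem 1 style use: numeric column.
ages = [int(age) for age in get_column(header, data_rows, "age")]
print(sum(ages) / len(ages))  # 60.5

# Problem 2 style use: text column.
names = get_column(header, data_rows, "name")
print(names)  # ['Ada', 'Grace']
```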

Are there any pre-built Python packages that could help you solve these problems? If yes, refactor your solutions to use those packages.  

In [5]:
# Add your code here
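One candidate is the `csv` module from the Python standard library. The sketch below uses `csv.DictReader`, which reads the header automatically and yields each row as a mapping from column name to value. The function name `summarize_ages` and the sample data are hypothetical, and the code assumes the input has at least one data row.

```python
import csv
import io


def summarize_ages(fileobj, delimiter=","):
    """Return (row count, column count, average age) using csv.DictReader."""
    reader = csv.DictReader(fileobj, delimiter=delimiter)
    # Each row maps column names to string values.
    ages = [int(row["age"]) for row in reader]
    # Assumes at least one data row; otherwise the division below fails.
    return len(ages), len(reader.fieldnames), sum(ages) / len(ages)


# Hypothetical tab-separated sample standing in for an open file.
sample_tsv = io.StringIO("age\tname\n36\tAda\n85\tGrace\n")
print(summarize_ages(sample_tsv, delimiter="\t"))
# (2, 2, 60.5)
```

When reading from a real file, the `csv` documentation recommends opening it with `newline=''` so that the module can handle line endings itself.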

Tell us about your experience (for each question, provide a couple of sentences).
- Describe the challenges you faced in addressing these tasks and how you overcame these challenges.
- Did you work with other students on this assignment? If yes, how did you help them? How did they help you? 

*Write your answers here*