# Problem 1: Counting transcript types

In this problem, you'll read in data from a file and use dictionaries to work with the data. You're also going to learn how to *write* your results to a file. Make sure you have the file `enriched.txt` in the same folder where you keep this in-class activity notebook. (This file should already be in the ps3 folder when you download it.)

This problem is similar to the Gene Ontology dictionary problem from problem set 2. You'll apply the same core idea, but you'll read in the data from a file and write some results to a new file.

## Tracking counts of transcript types with a dictionary

For this exercise, your data is a list of transcripts whose expression changed during an experiment. As a first-pass analysis, your goal is to see what kinds of transcripts were affected (miRNAs, protein-coding genes, lncRNAs, etc.). You're not going to perform a rigorous statistical analysis of enrichment at this point; you're just going to use your Python skills to summarize one aspect of the data.

Your task is to:

1. Open the file 'enriched.txt'
2. Read the file line by line and count the occurrences of each transcript type.
3. Write the results, in alphabetical order, to an output file, then close both files.

To keep track of the counts for each type of transcript, you'll use a dictionary. In this dictionary, the *keys* will be the transcript types, and the *values* will be the number of times each type of transcript occurred in the data.

Remember from the lecture 3 activity and homework problem 0 what the general steps are for reading in data from a file:

**Step 1:** Open the file.

**Step 2:** Read in the header line by itself with `readline()`.

**Step 3:** Use a `for` loop to read and process the rest of the lines, *stripping* trailing newlines and *splitting* each line into a list of data elements.

**Step 4:** Do something with the data elements - such as assigning them to lists.

Steps 1, 2, and 3 are generally the same any time you work with file data. **Step 4** changes depending on the kind of analysis you want to do. In this case, **step 4** is to keep a running total of transcript types. You'll work through this in the cells below. 
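As a reminder, steps 1 through 3 might look roughly like the sketch below. To keep it self-contained, it uses `io.StringIO` as an in-memory stand-in for a file returned by `open()`. The column names and values are made up for illustration and are not the contents of `enriched.txt`.

```python
import io

# Step 1: "open" the file. StringIO is only a stand-in for open('some_file.txt');
# the two columns and their values here are hypothetical.
demo_file = io.StringIO('sample\tcolor\ns1\tred\ns2\tblue\n')

# Step 2: read the header line by itself.
header = demo_file.readline().strip().split()

# Step 3: loop over the remaining lines, stripping and splitting each one.
rows = []
for line in demo_file:
    fields = line.strip().split()
    # Step 4 goes here; this sketch simply collects the split lines.
    rows.append(fields)

demo_file.close()
print(header)
print(rows)
```

Because `readline()` has already consumed the header, the `for` loop starts at the first data line.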

### Warmup problem: Running counts and adding entries to a dictionary on the fly

Let's use a small dataset to write our code to keep track of counts with a dictionary. Run the cell below to create the list `transcripts`. 

In [None]:
transcripts = ['protein_coding', 'miRNA', 'protein_coding', 'miRNA', 'snRNA', 'protein_coding']

Next we need to create a dictionary to hold the counts of each type of transcript. We could create a dictionary that began with counts of 0 for each transcript type like this:

```python
t_types = {'protein_coding':0, 'miRNA':0, 'snRNA':0}
```

We would then increment the counts by one each time we saw that type:

```python
t_types['miRNA'] += 1 # the += adds the value to the right to the variable on the left
```

But it's tedious to write out all of the dictionary entries in advance. Also, what if you encounter a transcript type that's not already in your dictionary? 
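To see what goes wrong, here is a small sketch that tries to increment a key that was never defined. The `try`/`except` is there only so the example runs to the end instead of stopping at the error; `'lncRNA'` is just an example of a type missing from the dictionary.

```python
t_types = {'protein_coding': 0, 'miRNA': 0, 'snRNA': 0}

t_types['miRNA'] += 1  # fine: 'miRNA' is already a key

try:
    t_types['lncRNA'] += 1  # 'lncRNA' was never added to the dictionary
except KeyError as err:
    print('KeyError:', err)

print(t_types)
```

Python raises a `KeyError` because `+=` has to look up the current value before it can add to it.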

**Recap from problem set 2, problem 1 (Gene Ontology Dictionary):**

A better way is to define an empty dictionary, and then add new entries as you encounter them. For this, we can use the keywords `in` and `not in` to check whether a key is already in the dictionary. Below is a simple example of code that checks whether a dictionary entry exists. If the entry *doesn't* exist, one is created. If the entry does exist, the code increments the count by 1:

```python
### Counting coin tosses

coin_tosses = {} # empty dictionary
toss = 'heads' # the data

if toss not in coin_tosses:
    coin_tosses[toss] = 1 # create a new entry and set value to 1
else:
    coin_tosses[toss] += 1 # increment existing entry
```

Below, apply the logic of the coin toss example to write code to count transcripts. Write the code to do the following:

1. Create an empty dictionary called `transcript_counts`.
2. Use a `for` loop to iterate over the list `transcripts`, which we created above.
3. In the block of the `for` loop, add new transcript types to the dictionary as they come up, and increment the count of types that are already in the dictionary.

Use `print()` to see the results.

In [None]:
# YOUR ANSWER HERE

print(transcript_counts)

In [None]:
assert len(transcript_counts) == 3
assert transcript_counts['miRNA'] == 2

To solve the main problem, you'll essentially run that same `for` loop on the data in the `enriched.txt` file.

### Steps 1 & 2: Open the file `enriched.txt` and examine the header line

Run the cell to open the file and read in the first line (the header line). You don't need to assign this line to a variable, but use `print()` to look at it. In which data column are the transcript (i.e., gene) types? If the header elements were in a list, what index number would you need to access `geneType`?

(Note that you can also learn what the different columns are by opening the text file from the Jupyter dashboard.)
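If you want to double-check an index number, the list method `.index()` can do it for you. The header string below is a made-up stand-in. Only the column name `geneType` comes from this problem, and its position in the real file may well be different.

```python
# Hypothetical header line; the real header of enriched.txt may look different.
example_header = 'geneName\tgeneType\tlog2FoldChange\n'

columns = example_header.strip().split()
print(columns)
print(columns.index('geneType'))  # position of 'geneType' in this made-up header
```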

In [None]:
# run this cell and examine the output

file = open('enriched.txt')
print(file.readline())

### Steps 3 & 4: Define an empty dictionary, loop over each line of the file, count transcripts

Now your task is to read in the file data and create a dictionary of transcript type counts, just as you did above.

In the cell below, define an empty dictionary called `transcript_counts`. Then write a `for` loop to read in the rest of the file lines from `enriched.txt`. Each line should be stripped of the trailing newline and split into individual data elements.

Once you've processed the line, use the `if`...`else` code you wrote above to add a count for the appropriate transcript type in the dictionary `transcript_counts`.

**HINT:** In the block of the **`for`** loop, after using `.split()`, you will have a list of elements. Which list element holds the transcript type?

**NOTE:** Feel free to look at your code from the lecture activity to review how to process file lines. But don't just copy and paste the code – *type* it out. Physically typing code is crucial for becoming fluent in any programming language.

In [None]:
# YOUR ANSWER HERE

print(transcript_counts) 

In [None]:
# Run this cell to check your answers

assert transcript_counts['protein_coding'] == 15021
assert transcript_counts['snoRNA'] == 241
assert transcript_counts['non_coding'] == 2
assert transcript_counts['polymorphic_pseudogene'] == 38

### Step 5: Write the results to a file.

Now that you've counted transcript types, how do you save the results in a file that you can return to later? 

The first step is to open a new file that you can *write* to. The syntax will be mostly familiar:

```python
output = open('transcript.counts.txt','w') # 'w' makes the file writeable.
```

Next, you write lines to the file as strings, like this:

```python
# Notice that the line ends with a newline! .write() doesn't add one automatically.
line = 'protein_coding 15021\n' 
output.write(line)
```
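A quick way to convince yourself that `.write()` adds no newline is to write a few strings and read them back. This sketch uses a scratch file in a temporary folder; the file name `newline_demo.txt` is made up for the demo.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary folder for this demo only.
demo_path = os.path.join(tempfile.mkdtemp(), 'newline_demo.txt')

demo_out = open(demo_path, 'w')
demo_out.write('first')     # no newline, so the next write continues this line
demo_out.write('second\n')  # the newline ends the line
demo_out.write('third\n')
demo_out.close()

demo_in = open(demo_path)
demo_lines = demo_in.readlines()
demo_in.close()
print(demo_lines)
```

The file ends up with two lines, not three, because `'first'` and `'second\n'` run together.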

We write *strings* to files. But the counts in our dictionary are integers. In fact, we want to write a mix of strings (the transcript type) and integers (the counts). Unlike `print()`, `.write()` accepts only a single string argument, so it can't handle this mix directly, as in this example:

```python
# key is the variable holding the dictionary key (a string).
print(key, transcript_counts[key], "\n") # works fine.
output.write(key, transcript_counts[key], "\n") # GIVES ERROR
```
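The sketch below shows both ways this can fail. It uses `io.StringIO` as an in-memory stand-in for a writable file, and the values `'heads'` and `3` are made up. A real file opened with `'w'` also raises a `TypeError` in both cases, though the exact message may differ.

```python
import io

demo_out = io.StringIO()  # in-memory stand-in for a writable text file
caught = []               # names of the errors raised below

try:
    demo_out.write('heads', 3, '\n')  # three arguments: not allowed
except TypeError as err:
    caught.append(type(err).__name__)
    print('TypeError:', err)

try:
    demo_out.write(3)  # one argument, but an integer rather than a string
except TypeError as err:
    caught.append(type(err).__name__)
    print('TypeError:', err)
```

So `.write()` needs exactly one argument, and that argument has to be a string.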

So how would you combine the dictionary key (a string), the counts (an integer), and a newline (another string) into a single string (with a tab between the key and the counts)? You'll want a string that looks like this:

```python
'protein_coding\t15021\n'
```

Try to figure out how to do this in the cell below. Do the following:

1. Open a writeable file called `transcript.counts.txt`.
2. Loop over the transcript types in the dictionary *in alphabetical order*. (Use `sorted(my_dictionary.keys())`.) Write both the key (i.e., the transcript type) and the value (the count) to the file.
3. Close both the input and output files using `.close()`. 
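As a small illustration of looping over keys in sorted order, here is a sketch with a made-up dictionary of coin-toss counts (not the transcript data):

```python
coin_tosses = {'tails': 4, 'heads': 6, 'edge': 1}  # made-up counts

sorted_sides = sorted(coin_tosses.keys())  # a new list of keys in sorted order
for side in sorted_sides:
    print(side, coin_tosses[side])
```

`sorted()` returns a new list and leaves the dictionary itself unchanged. It compares strings character by character, so uppercase letters sort before lowercase ones.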

**HINT:** To create output strings from the dictionary data, try using string conversion (`str()`) and string concatenation (`+`) to link the key and the value. Another option is the `.join()` method, which we'll illustrate later in the class.
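For instance, string conversion and concatenation can be combined as in this sketch, which uses made-up values rather than the transcript data:

```python
side = 'heads'  # made-up key (a string)
count = 6       # made-up value (an integer)

# str() turns the integer into a string so that + can join all the pieces.
out_line = side + '\t' + str(count) + '\n'
print(repr(out_line))  # repr() shows the tab and newline as \t and \n
```

Leaving out `str()` would raise a `TypeError`, because `+` can't join a string and an integer.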

In [None]:
# Output file
output = open('transcript.counts.txt','w')

# For loop writing the dictionary to file:
# YOUR ANSWER HERE
output.close()
file.close()

Open your `transcript.counts.txt` to see how it looks. If you don't like the result, try again. You don't have to delete the file – `open('transcript.counts.txt','w')` will overwrite the old one.
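If you'd like to see the overwriting behavior for yourself, this sketch opens the same scratch file twice with `'w'`. The file lives in a temporary folder and its name, `overwrite_demo.txt`, is made up for the demo.

```python
import os
import tempfile

# Hypothetical scratch file, created in a temporary folder for this demo only.
scratch_path = os.path.join(tempfile.mkdtemp(), 'overwrite_demo.txt')

first = open(scratch_path, 'w')
first.write('old contents\n')
first.close()

second = open(scratch_path, 'w')  # reopening with 'w' empties the file
second.write('new contents\n')
second.close()

check = open(scratch_path)
contents = check.read()
check.close()
print(repr(contents))
```

Only the second write survives, which is why rerunning your cell replaces the earlier output instead of adding to it.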