# Project 3, Part I

*Nicholas Donahoe nrd485*

**This project is due on May 3rd, 2018 at 7:00pm.**

The complete submission will consist of four parts:

1. Jupyter Notebook (.ipynb) file
2. Jupyter Notebook converted to pdf
3. R Markdown (.Rmd) file
4. R Markdown converted to pdf

Before submitting the Jupyter notebook part, please re-run all cells by clicking "Kernel" and selecting "Restart & Run All."

## Background and motivation

*Streptococcus pneumoniae is a bacterium that causes pneumococcal disease, which can take the form of pneumonia, meningitis, or bacteremia. The bacterium spreads through coughing, sneezing, and close contact with an infected person. Children, the elderly, and people with weakened immune systems are typically more prone to contracting the disease. I chose to analyze Streptococcus pneumoniae because multiple family members have contracted the disease in the past, and I want to find out more about the bacterium.*

## Question

*What does the variation among the different strains of Streptococcus pneumoniae indicate?*

## Introduction

*The complete genomes of five Streptococcus pneumoniae strains were obtained from the NCBI nucleotide database. The strains of interest were Xen35, 19F, 335, 11A, and D39V. For each strain, the number of coding sequences (CDS), the number of coding sequences annotated as hypothetical proteins, and the number of genes were counted. A coding sequence is the portion of a gene that encodes a protein. A hypothetical protein is a protein whose name is not known. Genes are sequences in the DNA that can contain both coding and non-coding regions. Analyzing how these counts vary among the strains can provide insight into why the differences exist.*
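
The three counts described above can be sketched as a small function. This is a minimal illustration, not the project's acquisition code: `count_features` is a hypothetical helper, and the `SimpleNamespace` objects are made-up stand-ins that only mimic the `.type` and `.qualifiers` attributes of a Biopython feature. It also uses a case-insensitive substring test instead of a regular expression, which is an assumption about how loosely "hypothetical protein" should be matched.

```python
from types import SimpleNamespace


def count_features(features):
    """Count CDS features, hypothetical-protein CDSs, and gene features.

    `features` is any iterable of objects exposing `.type` (a string) and
    `.qualifiers` (a dict mapping qualifier names to lists of strings).
    """
    counts = {"CDS_count": 0, "hyp_CDS_count": 0, "Gene_count": 0}
    for feature in features:
        if feature.type == "CDS":
            counts["CDS_count"] += 1
            # The first "product" value holds the annotated protein name, if any.
            product = feature.qualifiers.get("product", [""])[0]
            if "hypothetical protein" in product.lower():
                counts["hyp_CDS_count"] += 1
        elif feature.type == "gene":
            counts["Gene_count"] += 1
    return counts


# Made-up stand-in features (assumption: not taken from any real genome).
demo_features = [
    SimpleNamespace(type="gene", qualifiers={"locus_tag": ["demo_0001"]}),
    SimpleNamespace(type="CDS", qualifiers={"product": ["hypothetical protein"]}),
    SimpleNamespace(type="gene", qualifiers={"locus_tag": ["demo_0002"]}),
    SimpleNamespace(type="CDS", qualifiers={"product": ["DNA polymerase III subunit beta"]}),
    SimpleNamespace(type="CDS", qualifiers={}),  # CDS with no product qualifier
]

demo_counts = count_features(demo_features)
print(demo_counts)
```

Because the helper relies only on `.type` and `.qualifiers`, it should also accept the `features` list of a parsed GenBank record, although that has not been verified here.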

## Data acquisition code

In [1]:
# Download data from NCBI Entrez, process as necessary, and write into a .csv file.
import re
from Bio import Entrez, SeqIO
Entrez.email = "nicholasdonahoe@gmail.com" 

# Download sequence record for nucleotide id CP025256.1, CP025076.1, CP026670.1, CP018838.1, CP027540.1  (Streptococcus pneumoniae Strains: Xen35, 19F, 335, 11A, D39V)
Count_Dict = {} # Empty dictionary to store all the information that's counted
Count = 0 # Counter used to differentiate what strains are assigned what counts
Strain = ["D39V", "19F", "Xen35", "335", "11A"] # List of all the strains, in the same order as Identifier
Identifier = ["CP027540.1", "CP025076.1", "CP025256.1", "CP026670.1", "CP018838.1"] # List of the NCBI accession IDs, one per strain

# For loop to loop through each ID for each strain
for ID in Identifier: 
    # Accesses and downloads each strain's information from the nucleotide database
    download_handle = Entrez.efetch(db="nucleotide", id=ID, rettype="gb", retmode="text")
    data = SeqIO.read(download_handle, "genbank") # Read the single GenBank record directly from the handle
    download_handle.close()
    
    CDS_count = 0 # Counter for the number of coding sequences in each Streptococcus pneumoniae strain's genome
    hyp_CDS_count = 0 # Counter for the number of hypothetical proteins in each Streptococcus pneumoniae strain's genome
    Gene_count = 0 # Counter for the number of genes in each Streptococcus pneumoniae strain's genome
    
    # Counts all CDSs, the hypothetical ones, and the genes
    for feature in data.features:
        if feature.type == 'CDS':
            CDS_count += 1
            if "product" in feature.qualifiers:
                product = feature.qualifiers["product"][0]
                match = re.search(r"[Hh]ypothetical protein", product) # Regex to match "hypothetical protein" with an upper- or lower-case H
                if match:
                    hyp_CDS_count += 1
        if feature.type == 'gene':
            Gene_count += 1
    
    All_Dict = {"CDS_count":CDS_count, "hyp_CDS_count":hyp_CDS_count, "Gene_count":Gene_count} # Dictionary stores the count type and its corresponding count
    Count_Dict[Strain[Count]] = All_Dict # Assigns the counts collected to each individual strain based on the overall count value 
    Count += 1 

final_list = [] # Empty list 
for strain_name in Count_Dict: # Loops through the dictionary that contains all the counts for each strain
    Counts = Count_Dict[strain_name] 
    temp_list = [strain_name, str(Counts["CDS_count"]), str(Counts["hyp_CDS_count"]), str(Counts["Gene_count"])] # List of strain name and converted integer values to strings 
    string = ",".join(temp_list) # String of temporary list
    final_list.append(string) # Appends the strings together into a final list

data_table = "\n".join(final_list) # Formats the list into a string to resemble a table

# Writes the csv for the information gathered
with open("StreptoPneumae.csv", "w") as file: 
    
    # Writes variable names as the header
    file.write("Strain,CDS_Count,hyp_CDS_Count,Gene_Count\n")
    file.write(data_table)
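
As a possible alternative to joining strings by hand, the same table could be written with Python's standard `csv` module and read back to check its shape. This is a sketch under assumptions: `write_counts_csv` is a hypothetical helper, the strain names and counts are placeholders rather than results from the analysis, and the file is written to a temporary directory so that `StreptoPneumae.csv` is not touched.

```python
import csv
import os
import tempfile

HEADER = ["Strain", "CDS_Count", "hyp_CDS_Count", "Gene_Count"]


def write_counts_csv(path, count_dict):
    """Write one row per strain, using the same header as the code above."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(HEADER)
        for strain_name, counts in count_dict.items():
            writer.writerow([
                strain_name,
                counts["CDS_count"],
                counts["hyp_CDS_count"],
                counts["Gene_count"],
            ])


# Placeholder values only (assumption: NOT real counts for any strain).
placeholder_counts = {
    "strain_A": {"CDS_count": 10, "hyp_CDS_count": 2, "Gene_count": 12},
    "strain_B": {"CDS_count": 8, "hyp_CDS_count": 1, "Gene_count": 9},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "placeholder_counts.csv")
    write_counts_csv(csv_path, placeholder_counts)
    # Read the file back to confirm the header and the number of data rows.
    with open(csv_path, newline="") as handle:
        rows = list(csv.reader(handle))

print(rows[0])
print(len(rows) - 1, "data rows")
```

Using `csv.writer` handles quoting automatically, which would matter if a strain name ever contained a comma. For the five names used here, the hand-built string gives the same layout.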
        