# Basic Python 2

Dr. Granger is interested in studying the relationship between the
length of house-elves' ears and aspects of their DNA. This research is
part of a larger project attempting to understand why house-elves
possess such powerful magic. She has obtained DNA samples and ear
measurements from a small group of house-elves to conduct a preliminary
analysis (prior to submitting a grant application to the Ministry of
Magic) and she would like you to conduct the analysis for her (she might
know everything there is to know about magic, but she sure doesn't know
much about computers). She has placed the file on the web for you to
[download]({{ site.github.url }}/data/houseelf_earlength_dna_data.csv).

You might be able to do this analysis by hand in Excel, but counting all
of those bases would be a lot of work, and besides, Dr. Granger seems to
always get funded, which means that you'll be doing this again soon with a
much larger dataset. So, you decide to write a script so that it will be
easy to do the analysis again.

Write a Python script that:

1.  Imports the data into a data structure of your choice
2.  Loops over the rows in the dataset
3.  For each row in the dataset checks to see if the ear length is large
    (>10 cm) or small (<=10 cm) and determines the GC-content of the
    DNA sequence (i.e., the percentage of bases that are either G or C)
4.  Stores this information in a table where the first column has the ID
    for the individual, the second column contains the string 'large' or
    the string 'small' depending on the size of the individuals ears,
    and the third column contains the GC content of the DNA sequence.
5.  Prints the average GC-content for both large-eared elves and
    small-eared elves to the screen.
6.  Exports the table of individual level GC values to a CSV (comma
    delimited text) file titled `grangers_analysis.csv`.

This code should use functions to break the code up into manageable
pieces. For example, here's a function for importing the data from the
web:

    def get_data_from_web(url):
        webpage = urllib.urlopen(url)
        datareader = csv.reader(webpage)
        data = []
        for row in datareader:
            data.append(row)
        return data

This function imports the data as a list of lists. Another good option would be
to use either a Pandas data frame or a Numpy array. An example function using
Pandas looks like:

```
def get_data_from_web(url):
    data = pd.read_csv(url)
	return data
```

Throughout the assignment feel free to use whatever data structures you
prefer. Ask your instructor if you have questions about the best choices.

In [1]:
import pandas as pd
import numpy as np

In [2]:
url = 'https://nyu-cds.github.io/courses/data/houseelf_earlength_dna_data.csv'

def get_data_from_web(url):
    webpage = urllib.urlopen(url)
    datareader = csv.reader(webpage)
    data = []
    for row in datareader:
        data.append(row)
    return data

def get_data_from_web(url):
    data = pd.read_csv(url)
    return data

data = get_data_from_web(url)

In [3]:
data.head()

Unnamed: 0,id,earlength,dnaseq
0,17A,5.1,CCGCATCTTGACTTAACTGACATATTACCATAGATGACTAGCCATG...
1,24P,7.5,GCTATGACTTGCTTAGCTACGTATGAAGGAAGAAACTTTTGTGTAT...
2,09Q,12.2,CCGCCGATTGATACAGGGGACGGTGACGTCGTCATAGATTCGGCAC...
3,65Y,9.9,GCAGGAGAAGTTCTTAACCTTCTCGTAGGACGTCAACCTATTCTTT...
4,19N,10.0,TCTTCATCCTTATCAAAGTTTGGAGTCAATGATCAGGATTATTGCC...


In [4]:
def measure_ear(data):
    data['size'] = np.where(data['earlength'] > 10, 'big', 'small')


def create_cols(data):
    data['Perc_G'] = 0
    data['Perc_C'] = 0
    
def avg_dna(data):
    for i in range(len(data)):
        data.iloc[i, 4] = (data.iloc[i, 2].count('G') * len(data.iloc[:, 2])) / 10
        data.iloc[i, 5] = (data.iloc[i, 2].count('C') * len(data.iloc[:, 2])) / 10
        
def change_cols(data):
    data = data[['id', 'size', 'Perc_G', 'Perc_C', 'earlength', 'dnaseq']]
    
def avg_ear(data):
    avg_small = data[data['size'] == 'small'].mean().iloc[:2]
    avg_big = data[data['size'] == 'big'].mean().iloc[:2]
    dic_data = {'Small G': [avg_small[0]],
                'Small C': [avg_small[1]],
                'Big G': [avg_big[0]],
                'Big C': [avg_big[1]],
               }
    avg_data = pd.DataFrame(dic_data, columns = ['Small G', 'Small C', 'Big G', 'Big C'], index = ['Avg'])
    return avg_data

In [5]:
#Gera uma nova coluna com a medida das orelhas - pequena(small) ou grande(big)
measure_ear(data)

#cria novas colunas
create_cols(data)

#adiciona as novas colunas a media de DNA - (G, C)
avg_dna(data)

#Altera ordem das colunas
change_cols(data)

In [6]:
#retorna um DataFrame com as medias das orelhas pequenas C e
print(avg_ear(data))

     Small G    Small C   Big G  Big C
Avg     8.35  20.333333  13.975   30.5


In [7]:
data.head(5)

Unnamed: 0,id,earlength,dnaseq,size,Perc_G,Perc_C
0,17A,5.1,CCGCATCTTGACTTAACTGACATATTACCATAGATGACTAGCCATG...,small,20.0,21.0
1,24P,7.5,GCTATGACTTGCTTAGCTACGTATGAAGGAAGAAACTTTTGTGTAT...,small,22.0,17.0
2,09Q,12.2,CCGCCGATTGATACAGGGGACGGTGACGTCGTCATAGATTCGGCAC...,big,30.0,27.0
3,65Y,9.9,GCAGGAGAAGTTCTTAACCTTCTCGTAGGACGTCAACCTATTCTTT...,small,19.0,21.0
4,19N,10.0,TCTTCATCCTTATCAAAGTTTGGAGTCAATGATCAGGATTATTGCC...,small,17.0,19.0
