
# Basic Python 2
Dr. Granger is interested in studying the relationship between the length of house-elves’ ears and aspects of their DNA. This research is part of a larger project attempting to understand why house-elves possess such powerful magic. She has obtained DNA samples and ear measurements from a small group of house-elves to conduct a preliminary analysis (prior to submitting a grant application to the Ministry of Magic) and she would like you to conduct the analysis for her (she might know everything there is to know about magic, but she sure doesn’t know much about computers). She has placed the file on the web for you to [download](https://nyu-cds.github.io/courses/data/houseelf_earlength_dna_data.csv).

You might be able to do this analysis by hand in Excel, but counting all of those bases would be a lot of work, and besides, Dr. Granger seems to always get funded, which means that you’ll be doing this again soon with a much larger dataset. So, you decide to write a script so that it will be easy to do the analysis again.

Write a Python script that:

1. Imports the data into a data structure of your choice
2. Loops over the rows in the dataset
3. For each row in the dataset checks to see if the ear length is large (>10 cm) or small (<=10 cm) and determines the GC-content of the DNA sequence (i.e., the percentage of bases that are either G or C)
4. Stores this information in a table where the first column has the ID for the individual, the second column contains the string ‘large’ or the string ‘small’ depending on the size of the individual’s ears, and the third column contains the GC content of the DNA sequence.
5. Prints the average GC-content for both large-eared elves and small-eared elves to the screen.
6. Exports the table of individual-level GC values to a CSV (comma-delimited text) file titled grangers_analysis.csv.

This code should use functions to break the code up into manageable pieces. For example, here’s a function for importing the data from the web:


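```
import csv
import urllib.request

def get_data_from_web(url):
    # urlopen returns bytes in Python 3, so decode each line (UTF-8 assumed)
    webpage = urllib.request.urlopen(url)
    lines = (line.decode('utf-8') for line in webpage)
    datareader = csv.reader(lines)
    data = []
    for row in datareader:
        data.append(row)
    return data
```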



This function imports the data as a list of lists. Another good option would be to use either a pandas DataFrame or a NumPy array. An example function using pandas looks like:

```
import pandas as pd

def get_data_from_web(url):
    # read_csv accepts a URL as well as a local file path
    data = pd.read_csv(url)
    return data
```
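With the list-of-lists option, `csv.reader` returns every value as a string and includes the header as the first row, so the ear lengths need converting before they can be compared with 10. The sketch below illustrates this on a small in-memory sample. `SAMPLE_CSV` and `parse_rows` are hypothetical names, and the sequences are shortened stand-ins rather than the real data.

```python
import csv
import io

# Hypothetical stand-in for the downloaded file; sequences are shortened for illustration.
SAMPLE_CSV = """id,earlength,dnaseq
17A,5.1,CCGCATCTTG
09Q,12.2,CCGCCGATTG
"""

def parse_rows(text):
    """Return (header, rows), converting earlength from str to float."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # the first row holds the column names
    rows = [[elf_id, float(length), seq] for elf_id, length, seq in reader]
    return header, rows

header, rows = parse_rows(SAMPLE_CSV)
print(header)
print(rows[1])
```

A pandas DataFrame does this type conversion automatically, which is one reason it may be the more convenient choice here.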

## Imports the data into a data structure of your choice

In [1]:
import pandas as pd
from google.colab import drive
from google.colab import files
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
df = pd.read_csv('/content/drive/MyDrive/DataScience /DataSets/houseelf_earlength_dna_data.csv')
df.head()

    id  earlength                                             dnaseq
0  17A        5.1  CCGCATCTTGACTTAACTGACATATTACCATAGATGACTAGCCATG...
1  24P        7.5  GCTATGACTTGCTTAGCTACGTATGAAGGAAGAAACTTTTGTGTAT...
2  09Q       12.2  CCGCCGATTGATACAGGGGACGGTGACGTCGTCATAGATTCGGCAC...
3  65Y        9.9  GCAGGAGAAGTTCTTAACCTTCTCGTAGGACGTCAACCTATTCTTT...
4  19N       10.0  TCTTCATCCTTATCAAAGTTTGGAGTCAATGATCAGGATTATTGCC...


## For each row in the dataset checks to see if the ear length is large (>10 cm) or small (<=10 cm) and determines the GC-content of the DNA sequence (i.e., the percentage of bases that are either G or C)

In [12]:
def calculate_gc_content(dna_sequence):
    gc_count = dna_sequence.count('G') + dna_sequence.count('C')
    total_bases = len(dna_sequence)
    if total_bases == 0:
        return 0
    return (gc_count / total_bases) * 100

# Create two separate DataFrames for large (>10 cm) and small (<=10 cm) ear lengths
large_ear_data = df[df['earlength'] > 10].copy()  # .copy() avoids SettingWithCopyWarning
small_ear_data = df[df['earlength'] <= 10].copy()

# Label each group with its ear-size category
large_ear_data['Size'] = 'large'
small_ear_data['Size'] = 'small'

# Calculate the GC content (in percent) for each DataFrame
large_ear_data['GC Content'] = large_ear_data['dnaseq'].apply(calculate_gc_content)
small_ear_data['GC Content'] = small_ear_data['dnaseq'].apply(calculate_gc_content)

# Print the results
print("Large Ear Data:")
print(large_ear_data)

print("Small Ear Data:")
print(small_ear_data)

Large Ear Data:
    id  earlength                                             dnaseq   Size  \
2  09Q       12.2  CCGCCGATTGATACAGGGGACGGTGACGTCGTCATAGATTCGGCAC...  large   
5  92K       14.6  ACCGATGGACAATGATTCGGGTAGCACCAGGAGTCCGTAGCGCGTG...  large   
7  98C       17.8  CTGCATGCTAGGTTGACACGCCTGCACTGCTCGAAGAAAATATGCG...  large   
9  88Q       11.3  GATTGCTCGCACATGAGCAAAACGGTAGAGCGTCACTTTCAGCCCT...  large   

   GC Content  
2        57.0  
5        62.0  
7        63.0  
9        52.0  
Small Ear Data:
    id  earlength                                             dnaseq   Size  \
0  17A        5.1  CCGCATCTTGACTTAACTGACATATTACCATAGATGACTAGCCATG...  small   
1  24P        7.5  GCTATGACTTGCTTAGCTACGTATGAAGGAAGAAACTTTTGTGTAT...  small   
3  65Y        9.9  GCAGGAGAAGTTCTTAACCTTCTCGTAGGACGTCAACCTATTCTTT...  small   
4  19N       10.0  TCTTCATCCTTATCAAAGTTTGGAGTCAATGATCAGGATTATTGCC...  small   
6  33W        8.2  CAGCTTGACTCGGTCTGTTAGGCCACGATTACGTGAGTTAGGGCTC...  small   
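8  75G        9.4  CTTATTTAGATAACATGATTAGCCGAAGTTGTACGGGATATCCACC...  small   

   GC Content  
0        41.0  
1        39.0  
3        40.0  
4        36.0  
6        52.0  
8        47.0  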
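The cell above filters the DataFrame into two groups rather than looping over rows. For comparison, here is a minimal sketch of the row-by-row approach the exercise describes, using only the standard library. The helper names `gc_content` and `ear_size` and the short sequences in `rows` are hypothetical and chosen for illustration.

```python
def gc_content(seq):
    """Percentage of bases in seq that are G or C (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid dividing by zero on an empty sequence
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

def ear_size(length_cm, threshold=10.0):
    """'large' if strictly above the threshold, otherwise 'small'."""
    return "large" if length_cm > threshold else "small"

# Hypothetical rows of (id, earlength, dnaseq); the sequences are made up.
rows = [
    ("17A", 5.1, "ATGCATAT"),
    ("19N", 10.0, "GGCA"),
    ("09Q", 12.2, "GGCC"),
]

# Build the result table one row at a time.
table = []
for elf_id, earlength, dnaseq in rows:
    table.append((elf_id, ear_size(earlength), gc_content(dnaseq)))

for entry in table:
    print(entry)
```

Note that an ear length of exactly 10.0 cm counts as small, matching the `<= 10` filter used in the notebook.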

## Stores this information in a table where the first column has the ID for the individual, the second column contains the string ‘large’ or the string ‘small’ depending on the size of the individual’s ears, and the third column contains the GC content of the DNA sequence.

In [5]:
# Combine the two DataFrames into one
result_data = pd.concat([large_ear_data, small_ear_data], ignore_index=True)

# Keep only the three columns the table requires: ID, size, and GC content
result_data = result_data[['id', 'Size', 'GC Content']]

# Print the resulting table
print(result_data)

    id   Size  GC Content
0  09Q  large        57.0
1  92K  large        62.0
2  98C  large        63.0
3  88Q  large        52.0
4  17A  small        41.0
5  24P  small        39.0
6  65Y  small        40.0
7  19N  small        36.0
8  33W  small        52.0
9  75G  small        47.0


## Prints the average GC-content for both large-eared elves and small-eared elves to the screen.

In [6]:
# Calculate average GC-content for large and small ears
avg_gc_large_ear = large_ear_data['GC Content'].mean()
avg_gc_small_ear = small_ear_data['GC Content'].mean()

# Print the average GC-content
print("Average GC-content for large-eared elves:", avg_gc_large_ear)
print("Average GC-content for small-eared elves:", avg_gc_small_ear)

Average GC-content for large-eared elves: 58.5
Average GC-content for small-eared elves: 42.5
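If the size labels and GC values live in a single table, the two averages can also be computed in one step with `groupby`. The sketch below assumes a small hypothetical table named `table`, filled with a subset of the GC values printed earlier, so its means differ from the full-dataset averages above.

```python
import pandas as pd

# Hypothetical miniature table using a subset of the values printed earlier.
table = pd.DataFrame({
    "id": ["17A", "24P", "09Q", "92K"],
    "Size": ["small", "small", "large", "large"],
    "GC Content": [41.0, 39.0, 57.0, 62.0],
})

# One mean per ear-size category, indexed by the 'Size' label.
avg_gc = table.groupby("Size")["GC Content"].mean()
print(avg_gc)
```

This avoids keeping two separate DataFrames in sync and extends naturally if more size categories are added later.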


## Exports the table of individual-level GC values to a CSV (comma-delimited text) file titled grangers_analysis.csv.

In [7]:
# Export the table to a CSV file
result_data.to_csv('grangers_analysis.csv', index=False)
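For a table stored as a list of rows instead of a DataFrame, the standard library's `csv.writer` can produce the same kind of file. This is a minimal sketch under that assumption. The `table` contents, the column names in the header row, and the temporary output directory are illustrative choices, not part of the original notebook.

```python
import csv
import os
import tempfile

# Hypothetical two-row table of (id, size, GC content).
table = [("09Q", "large", 57.0), ("17A", "small", 41.0)]

# Write into a temporary directory so the sketch does not clobber real files.
out_path = os.path.join(tempfile.mkdtemp(), "grangers_analysis.csv")
with open(out_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "size", "gc_content"])  # header row
    writer.writerows(table)

# Read the file back to confirm what was written.
with open(out_path, newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```

Opening the file with `newline=""` follows the `csv` module's recommendation and prevents blank lines between rows on Windows.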