DNA Matching 🧬👥

Introduction 📖

DNA is a sequence of molecules (nucleotides) arranged into a particular shape. Each nucleotide of DNA contains one of four different bases: adenine (A), cytosine (C), guanine (G), or thymine (T). Some portions of this sequence are the same across almost all humans, but other portions of the sequence have a higher genetic diversity and thus vary more across the population.

DNA tends to have high genetic diversity in Short Tandem Repeats (STRs). An STR is a short sequence of DNA bases that tends to repeat consecutively numerous times at specific locations inside of a person’s DNA. The number of times any particular STR repeats varies among individuals. In the DNA samples below, for example, Alice has the STR AGAT repeated 4 times in her DNA, while Bob has the same STR repeated 5 times.

Alice - CTAGATAGATAGATAGATGACTA
Bob - CTAGATAGATAGATAGATAGATT
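
The repeat counts quoted for Alice and Bob can be checked with a short script. This is only a minimal sketch and not code from the repository: the helper name max_repeats is made up for illustration, and it uses a regular-expression lookahead rather than whatever approach dna.py takes.

```python
import re


def max_repeats(sequence, unit):
    """Return the longest run of back-to-back copies of `unit` in `sequence`."""
    # The lookahead tries every start position, so overlapping runs are not missed.
    pattern = f"(?=((?:{re.escape(unit)})+))"
    runs = re.findall(pattern, sequence)
    return max((len(run) // len(unit) for run in runs), default=0)


alice = "CTAGATAGATAGATAGATGACTA"
bob = "CTAGATAGATAGATAGATAGATT"

print(max_repeats(alice, "AGAT"))  # 4
print(max_repeats(bob, "AGAT"))    # 5
```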

In its simplest form, a DNA database can be formatted as a CSV file, where each row corresponds to an individual, and each column corresponds to a particular STR.

Name	AGAT	AATG	TATC
Alice	28	42	14
Bob	17	22	19
Charlie	36	18	25

The data in the above table, for example, shows that Alice has the sequence AGAT repeated 28 times consecutively somewhere in her DNA, the sequence AATG repeated 42 times, and TATC repeated 14 times.
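
To make the layout concrete, the table above can be loaded with Python's csv module. The sketch below assumes a comma-separated file whose first column is headed name in lowercase (the key that dna.py looks up); the function load_database and the inline sample text are hypothetical and exist only for this example.

```python
import csv
import io

# The sample table from above as comma-separated text (assumed layout).
SAMPLE_CSV = """name,AGAT,AATG,TATC
Alice,28,42,14
Bob,17,22,19
Charlie,36,18,25
"""


def load_database(text):
    """Return (rows, strs): one dict per individual and the STR column names."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    # Every column after the first one holds the count for one STR.
    strs = reader.fieldnames[1:]
    return rows, strs


rows, strs = load_database(SAMPLE_CSV)
print(strs)                  # ['AGAT', 'AATG', 'TATC']
print(rows[0]["name"])       # Alice
print(int(rows[0]["AGAT"]))  # 28
```

Note that csv.DictReader returns every value as a string, so the counts have to be converted with int() before they are compared with counts computed from a sequence.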

The program takes a DNA sequence and a CSV file containing STR counts for a list of individuals, then prints the name of the individual whose STR counts match the sequence. If no individual matches, it prints No match.

How to Run the Program 🗔

Programming Language Needed ⌨️

Python 3

Execute ▶️

Start by cloning the repository to your local machine.

git clone https://github.com/ErTucci674/dna-matching.git

Choose a database file and a sequence file, then run the following command from the repository folder:

python dna.py (data.csv path) (sequence.txt path)

e.g.

python dna.py databases/large.csv sequences/6.txt

Files and Code 🗃️

Lists and Tables 📄

In the sequences folder there are 20 different DNA series that can be used to test the program. Each sequence is stored in a text-format file.

In the databases folder there are two CSV files: large.csv and small.csv. Each file contains a table of individuals' STR counts, similar to the one shown in the Introduction.

Main File ⚡

The main file that drives the whole program is dna.py. The csv library is used to read the CSV file, and the sys library is used to read the arguments the user enters in the terminal.

The program requires, as its first command-line argument, the path of the CSV file containing the STR counts for a list of individuals. Its second command-line argument is the path of the text file containing the DNA sequence to identify.

import csv
import sys

# Exactly two arguments are expected after the script name
len_argv = len(sys.argv)
if len_argv != 3:
    print("Usage: python dna.py data.csv sequence.txt")
    sys.exit(1)

The if statement above checks that sys.argv holds exactly three items: the script name and the two file paths. If it does not, the program prints a usage message and exits.

Each row of the CSV file is read as a dictionary by csv.DictReader, and the rows are collected in the list data_dict.

# Read every row of the CSV file into a list of dictionaries;
# the with block closes the file once the rows have been read
with open(sys.argv[1], "r") as data_file:
    data_reader = csv.DictReader(data_file)
    data_dict = list(data_reader)

The second file, which holds the DNA sequence, is read in one go with the file object's read() method. The longest_match() function is then called inside a for loop to count the longest consecutive run of each STR in that sequence.
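
The body of longest_match() is not reproduced in this README, so the following is only a sketch of how such a function might work, not the code in dna.py. The signature longest_match(sequence, subsequence) is an assumption, and the sample sequence and str_list are hard-coded here, whereas in the program they would come from the text file and the CSV header.

```python
def longest_match(sequence, subsequence):
    """Return the longest run of consecutive copies of subsequence in sequence."""
    if not subsequence:
        return 0
    longest = 0
    sub_len = len(subsequence)
    for start in range(len(sequence)):
        count = 0
        # Step forward one copy at a time while the next chunk still matches.
        while sequence[start + count * sub_len:start + (count + 1) * sub_len] == subsequence:
            count += 1
        longest = max(longest, count)
    return longest


# Hypothetical inputs for illustration only.
sequence = "CTAGATAGATAGATAGATGACTA"
str_list = ["AGAT", "AATG", "TATC"]

# Map each STR to its longest run in the sequence.
dna_dict = {}
for strand in str_list:
    dna_dict[strand] = longest_match(sequence, strand)

print(dna_dict)  # {'AGAT': 4, 'AATG': 0, 'TATC': 0}
```

Storing the counts as integers in a dictionary keyed by STR keeps the later comparison simple: each CSV value only needs an int() conversion before it is compared with the computed count.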

The last for loop checks whether this combination of STR counts matches any row of the given CSV file. If it does, the corresponding name is printed; otherwise, No match is shown.

dna_match = "No match"
for person in data_dict:
    # A person matches only if every STR count equals the count found in the sequence
    # (the loop variable is named strand to avoid shadowing the built-in str)
    if all(int(person[strand]) == dna_dict[strand] for strand in str_list):
        dna_match = person["name"]
        break

Reference Links 🔗

Databases and Sequences files - Harvard University Online Course (edx50)

Licence 🖋️

This project is licensed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence.
