Python code written for the edX course CSE181 (Genomic Data Science): analyzing biological sequences, identifying motifs and consensus sequences, and applying statistical and random search methods to genomic data.
- week1_frequent_words: construct frequency arrays and identify the most frequent subsequences in genomes (applied to e_coli.txt and vibrio_cholerae_genome.txt)
- week2_frequent_words_w_mismatches: distinguish leading and lagging strands by G-C content, then count subsequences while allowing for mismatches and reverse complements, giving a more realistic set of most frequently repeated subsequences for locating the origin of replication (ori)
- week3_motif_matrices: build NumPy arrays of regulatory motifs, compute probability distributions of motif matrices, construct the consensus motif, and identify candidate regulatory motif sets with brute-force and "greedy" searches
- week4_random_motif_search: use a more accurate random search, with iterative refinement and repeated runs, to identify likely regulatory motifs
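
To give a feel for the first two weeks, here is a minimal sketch of two of the ideas listed above: counting the most frequent subsequences of length k, and tracking the running difference between G and C counts along a genome. The function names `frequent_words` and `min_skew_positions` are hypothetical and are not taken from the repository, and the second function assumes that the G-C analysis is done as a running (#G - #C) skew whose minimum marks a candidate ori region.

```python
from collections import Counter


def frequent_words(text, k):
    """Return the set of most frequent length-k subsequences in text.

    Hypothetical helper for illustration; overlapping occurrences are counted.
    """
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    if not counts:  # text shorter than k
        return set()
    top = max(counts.values())
    return {kmer for kmer, n in counts.items() if n == top}


def min_skew_positions(genome):
    """Return the positions where the running (#G - #C) skew is lowest.

    Assumption: a minimum of this skew is used as a candidate ori location.
    Position i refers to the skew after reading the first i bases.
    """
    skew = [0]
    for base in genome:
        step = 1 if base == "G" else -1 if base == "C" else 0
        skew.append(skew[-1] + step)
    lowest = min(skew)
    return [i for i, s in enumerate(skew) if s == lowest]
```

For example, `frequent_words("ATATA", 3)` returns `{"ATA"}`, and `min_skew_positions("CCGG")` returns `[2]`. A `Counter` keyed by the subsequence itself is used here for brevity; the frequency arrays mentioned for week 1 index subsequences by number instead.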