In [22]:
def clean_fastq_file(input_file, output_file):
    """
    Remove colons, hyphens, and spaces from the sequence ID lines of a FASTQ file,
    leaving the sequence, '+' separator, and quality lines unchanged.

    Args:
        input_file (str): Path to input FASTQ file
        output_file (str): Path to output FASTQ file
    """
    with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
        # Process file four lines at a time (FASTQ format)
        line_count = 0
        for line in infile:
            # Every 4th line starting with line 1 is a sequence identifier (starts with @)
            if line_count % 4 == 0 and line.startswith('@'):
                # Clean the ID line (remove special characters and spaces)
                cleaned_id = line.strip().replace(":", "").replace("-", "").replace(" ", "")
                outfile.write(cleaned_id + '\n')
            else:
                # Write other lines (sequence, + line, and quality scores) as they are
                outfile.write(line)
            line_count += 1

# Example usage
if __name__ == "__main__":
    input_fastq = "/content/100-MN_1.fastq"  # Replace with your input file path
    output_fastq = "cleaned_output.fastq"  # Replace with desired output file path

    try:
        clean_fastq_file(input_fastq, output_fastq)
        print(f"Successfully processed FASTQ file. Cleaned output saved to: {output_fastq}")
    except FileNotFoundError:
        print("Error: Input file not found!")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

Successfully processed FASTQ file. Cleaned output saved to: cleaned_output.fastq
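
Removing the separators outright concatenates the header fields, so two distinct headers can collapse to the same cleaned ID. Below is a minimal sketch of a uniqueness check, assuming Illumina-style headers. The `clean_id` and `duplicated_ids` helpers and the sample headers are hypothetical and not part of the notebook.

```python
from collections import Counter


def clean_id(header: str) -> str:
    """Apply the same character removal as clean_fastq_file to one header line."""
    return header.strip().replace(":", "").replace("-", "").replace(" ", "")


def duplicated_ids(headers):
    """Return the cleaned IDs that occur more than once, sorted."""
    counts = Counter(clean_id(h) for h in headers)
    return sorted(cid for cid, n in counts.items() if n > 1)


# Made-up headers: the first two differ only in where the x/y coordinates split
# (1923:66018 versus 19236:6018), so they collide once ':' is removed.
headers = [
    "@RUN:1:1105:1923:66018 1:N:0:ACGT",
    "@RUN:1:1105:19236:6018 1:N:0:ACGT",
    "@RUN:1:1105:502:6020 1:N:0:ACGT",
]
dups = duplicated_ids(headers)
print(dups)  # → ['@RUN111051923660181N0ACGT']
```

If such collisions matter downstream, replacing the characters with a neutral separator such as `_` instead of deleting them is one possible alternative.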


In [1]:
! sudo apt install seqtk muscle

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  muscle seqtk
0 upgraded, 2 newly installed, 0 to remove and 49 not upgraded.
Need to get 274 kB of archives.
After this operation, 794 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 muscle amd64 1:3.8.1551-2build1 [244 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 seqtk amd64 1.3-2 [30.2 kB]
Fetched 274 kB in 0s (1,223 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 2.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Select

In [3]:
! pip install -q condacolab
import condacolab
condacolab.install()

⏬ Downloading https://github.com/conda-forge/miniforge/releases/download/23.11.0-0/Mambaforge-23.11.0-0-Linux-x86_64.sh...
📦 Installing...
📌 Adjusting configuration...
🩹 Patching environment...
⏲ Done in 0:00:13
🔁 Restarting kernel...


In [6]:
! pip install colab-xterm
%load_ext colabxterm
%env TERM=xterm

Collecting colab-xterm
  Downloading colab_xterm-0.2.0-py3-none-any.whl.metadata (1.2 kB)
Collecting ptyprocess~=0.7.0 (from colab-xterm)
  Downloading ptyprocess-0.7.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting tornado>5.1 (from colab-xterm)
  Downloading tornado-6.4.1-cp38-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.5 kB)
Downloading colab_xterm-0.2.0-py3-none-any.whl (115 kB)
Downloading ptyprocess-0.7.0-py2.py3-none-any.whl (13 kB)
Downloading tornado-6.4.1-cp38-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (436 kB)

In [1]:
! conda install bioconda::emboss

Channels:
 - conda-forge
 - bioconda
Platform: linux-64
Collecting package metadata (repodata.json):

In [2]:
! conda install bioconda::fasttree

Channels:
 - conda-forge
 - bioconda
Platform: linux-64
Collecting package metadata (repodata.json): done
Solving environment: done


    current version: 23.11.0
    latest version: 24.9.2

Please update conda by running

    $ conda update -n base -c conda-forge conda



## Package Plan ##

  environment location: /usr/local

  added / updated specs:
    - bioconda::fasttree


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    fasttree-2.1.11            |       h031d066_4         261 KB  bioconda
    ------------------------------------------------------------
                                           Total:         261 KB

The following NEW packages will be INSTALLED:

  fasttree           bioconda/linux-64::fasttree-2.1.11-h031d066_4 



Downloading and Extracting Packages:
                         

In [23]:
! seqret -sequence /content/cleaned_output.fastq -outseq /content/cleaned_100-MN_1.fasta

Read and write (return) sequences
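
For readers without EMBOSS installed, the FASTQ-to-FASTA step can be approximated in plain Python. This is a rough sketch, not a reimplementation of `seqret`: it assumes strict four-line FASTQ records with no blank lines, and unlike `seqret` it does not wrap sequence lines. The function name `fastq_to_fasta` and the demo reads are made up for illustration.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write one FASTA record per four-line FASTQ record; return the record count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        sequence = fastq_handle.readline().strip()
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        if not header.startswith("@"):
            raise ValueError(f"Expected '@' at record {count + 1}, got {header!r}")
        fasta_handle.write(">" + header[1:].strip() + "\n" + sequence + "\n")
        count += 1
    return count


# Tiny in-memory demonstration with made-up reads
demo_fastq = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n")
demo_fasta = io.StringIO()
n_records = fastq_to_fasta(demo_fastq, demo_fasta)
print(n_records)  # → 2
print(demo_fasta.getvalue(), end="")
```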


In [15]:
!pip install biopython

Collecting biopython
  Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting numpy (from biopython)
  Downloading numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Downloading biopython-1.84-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Downloading numpy-2.1.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.3 MB)

In [24]:
from itertools import islice

from Bio import SeqIO

# Input and output FASTA file paths
input_file = "/content/cleaned_100-MN_1.fasta"
output_file = "/content/sample_cleaned_100-MN_1.fasta"

# Stream the first 100 records to a new file without loading the whole input into memory
with open(input_file, "r") as infile, open(output_file, "w") as outfile:
    records = islice(SeqIO.parse(infile, "fasta"), 100)  # Lazily take the first 100 records
    SeqIO.write(records, outfile, "fasta")


In [25]:
! muscle -in /content/sample_cleaned_100-MN_1.fasta -out /content/aligned_sample_cleaned_100-MN_1.fasta


MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

sample_cleaned_100-MN_1 100 seqs, lengths min 301, max 301, avg 301
00:00:00     16 MB(4%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00     16 MB(4%)  I
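
Before building a tree, it can be worth confirming that every row of the alignment has the same length, since that is what a valid multiple alignment looks like. The following is a minimal sketch using only the standard library. The `read_fasta` and `alignment_lengths` helpers and the demo records are hypothetical, and duplicate record IDs would silently overwrite each other in this simple parser.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into a dict mapping record ID to its full sequence."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current_id = line[1:].split()[0]  # ID is the first word after '>'
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


def alignment_lengths(records):
    """Return the set of distinct sequence lengths; an alignment should yield one."""
    return {len(seq) for seq in records.values()}


# Made-up two-record alignment with wrapped sequence lines and gap characters
demo = io.StringIO(">a\nAC-GT\nAA\n>b\nACGGT\nA-\n")
aligned = read_fasta(demo)
lengths = alignment_lengths(aligned)
print(lengths)  # → {7}
```

To check the real output, the same functions could be called on an open handle to `/content/aligned_sample_cleaned_100-MN_1.fasta`.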

In [26]:
! FastTree -nt /content/aligned_sample_cleaned_100-MN_1.fasta > /content/tree_file.newick

FastTree Version 2.1.11 Double precision (No SSE3)
Alignment: /content/aligned_sample_cleaned_100-MN_1.fasta
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jukes-Cantor, CAT approximation with 20 rate categories
Initial topology in 0.03 seconds
Refining topology: 26 rounds ME-NNIs, 2 rounds ME-SPRs, 13 rounds ML-NNIs
Total branch-length 6.102 after 0.31 sec
ML-NNI round 1: LogLk = -9630.340 NNIs 13 max delta 6.15 Time 0.45
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.911 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -8420.677 NNIs 7 max delta 1.65 Time 0.56
ML-NNI round 3: LogLk = -8419.553 NNIs 2 max delta 0.56 Time 0.62
ML-NNI round 4: LogLk = -8419.551 NNIs 0 max

In [27]:
! pip install scipy

Collecting scipy
  Downloading scipy-1.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Downloading scipy-1.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (41.2 MB)
Installing collected packages: scipy
Successfully installed scipy-1.14.1


In [29]:
from Bio import Phylo
import pandas as pd

# Load the Newick tree
tree = Phylo.read("/content/tree_file.newick", "newick")

# Compute pairwise patristic distances (summed branch lengths) between all tips
tips = tree.get_terminals()
names = [tip.name for tip in tips]
matrix = [[tree.distance(t1, t2) for t2 in tips] for t1 in tips]

# Wrap the matrix in a DataFrame labelled by tip name on both axes
dist_matrix = pd.DataFrame(matrix, index=names, columns=names)
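
The distance between two tips here is the sum of branch lengths on the path that joins them through their most recent common ancestor. The sketch below illustrates that idea on a hand-built toy tree. It is not Biopython's implementation, and the `parent` mapping, node names, and branch lengths are invented for the example.

```python
def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node][0]
        path.append(node)
    return path


def patristic_distance(a, b, parent):
    """Sum the branch lengths between tips a and b via their common ancestor."""
    ancestors_b = set(path_to_root(b, parent))
    # The first ancestor of a that is also an ancestor of b is their common ancestor
    common = next(n for n in path_to_root(a, parent) if n in ancestors_b)

    def length_up_to(node, stop):
        total = 0.0
        while node != stop:
            up, branch_length = parent[node]
            total += branch_length
            node = up
        return total

    return length_up_to(a, common) + length_up_to(b, common)


# Toy tree ((A:0.5,B:0.25)X:1.0,C:2.0)root stored as child -> (parent, branch length)
parent = {
    "A": ("X", 0.5),
    "B": ("X", 0.25),
    "X": ("root", 1.0),
    "C": ("root", 2.0),
}
print(patristic_distance("A", "B", parent))  # → 0.75
print(patristic_distance("A", "C", parent))  # → 3.5
```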


In [30]:
dist_matrix

Unnamed: 0,M017719000000000L7C7K111051923660181N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K11105502660201N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111052293960181N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111052719360191N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111052730060201N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K11105526260181N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111052085860191N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111051322660191N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K11105746360191N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K11105789260161N0TCTCGCGC+ATAGAGGC,...,M017719000000000L7C7K111052771460171N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111051157860171N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111051198460171N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111052064060171N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K11105779760201N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111051276560191N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111052198860171N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111051347760201N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K11105513560191N0TCTCGCGC+ATAGAGGC,M017719000000000L7C7K111051796160201N0TCTCGCGC+ATAGAGGC
M017719000000000L7C7K111051923660181N0TCTCGCGC+ATAGAGGC,0.000000,0.200849,0.289529,0.289529,0.326902,0.397270,0.405897,0.405897,0.415963,0.472225,...,0.470873,0.470873,0.477373,0.477142,0.478121,0.485061,0.517419,0.507648,0.510184,0.505637
M017719000000000L7C7K11105502660201N0TCTCGCGC+ATAGAGGC,0.200849,0.000000,0.385611,0.385611,0.422984,0.493352,0.501979,0.501979,0.512045,0.568307,...,0.566955,0.566955,0.573454,0.573224,0.574203,0.581142,0.613501,0.603730,0.606266,0.601719
M017719000000000L7C7K111052293960181N0TCTCGCGC+ATAGAGGC,0.289529,0.385611,0.000000,0.000000,0.203937,0.274305,0.282932,0.282932,0.292998,0.349260,...,0.462004,0.462004,0.468503,0.468273,0.469252,0.476191,0.508550,0.498779,0.501315,0.496768
M017719000000000L7C7K111052719360191N0TCTCGCGC+ATAGAGGC,0.289529,0.385611,0.000000,0.000000,0.203937,0.274305,0.282932,0.282932,0.292998,0.349260,...,0.462004,0.462004,0.468503,0.468273,0.469252,0.476191,0.508550,0.498779,0.501315,0.496768
M017719000000000L7C7K111052730060201N0TCTCGCGC+ATAGAGGC,0.326902,0.422984,0.203937,0.203937,0.000000,0.197209,0.205837,0.205837,0.215902,0.272164,...,0.499377,0.499377,0.505876,0.505645,0.506624,0.513564,0.545922,0.536152,0.538688,0.534141
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
M017719000000000L7C7K111051276560191N0TCTCGCGC+ATAGAGGC,0.485061,0.581142,0.476191,0.476191,0.513564,0.583932,0.592559,0.592559,0.602625,0.658887,...,0.052529,0.052529,0.059028,0.058798,0.026642,0.000000,0.033073,0.023302,0.211273,0.206726
M017719000000000L7C7K111052198860171N0TCTCGCGC+ATAGAGGC,0.517419,0.613501,0.508550,0.508550,0.545922,0.616290,0.624918,0.624918,0.634983,0.691245,...,0.084888,0.084888,0.091387,0.091156,0.059001,0.033073,0.000000,0.009771,0.243631,0.239084
M017719000000000L7C7K111051347760201N0TCTCGCGC+ATAGAGGC,0.507648,0.603730,0.498779,0.498779,0.536152,0.606520,0.615147,0.615147,0.625213,0.681475,...,0.075117,0.075117,0.081616,0.081385,0.049230,0.023302,0.009771,0.000000,0.233861,0.229314
M017719000000000L7C7K11105513560191N0TCTCGCGC+ATAGAGGC,0.510184,0.606266,0.501315,0.501315,0.538688,0.609056,0.617683,0.617683,0.627749,0.684011,...,0.197086,0.197086,0.203585,0.203354,0.204333,0.211273,0.243631,0.233861,0.000000,0.190712


In [None]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab
%matplotlib inline

**Learn the Basics** || [Quickstart](quickstart_tutorial.html) ||
[Tensors](tensorqs_tutorial.html) || [Datasets &
DataLoaders](data_tutorial.html) ||
[Transforms](transforms_tutorial.html) || [Build
Model](buildmodel_tutorial.html) ||
[Autograd](autogradqs_tutorial.html) ||
[Optimization](optimization_tutorial.html) || [Save & Load
Model](saveloadrun_tutorial.html)

Learn the Basics
================

Authors: [Suraj Subramanian](https://github.com/subramen), [Seth
Juarez](https://github.com/sethjuarez/), [Cassie
Breviu](https://github.com/cassiebreviu/), [Dmitry
Soshnikov](https://soshnikov.com/), [Ari
Bornstein](https://github.com/aribornstein/)

Most machine learning workflows involve working with data, creating
models, optimizing model parameters, and saving the trained models. This
tutorial introduces you to a complete ML workflow implemented in
PyTorch, with links to learn more about each of these concepts.

We'll use the FashionMNIST dataset to train a neural network that
predicts if an input image belongs to one of the following classes:
T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker,
Bag, or Ankle boot.
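
As a small illustration, the ten classes can be held in a list so that an integer prediction maps back to a readable name. This sketch assumes the standard FashionMNIST label order, in which indices 0 through 9 follow the order the classes are listed above. The names `FASHION_MNIST_CLASSES` and `label_name` are made up for the example.

```python
# Class names indexed by label, assuming the standard FashionMNIST ordering (0-9)
FASHION_MNIST_CLASSES = [
    "T-shirt/top",
    "Trouser",
    "Pullover",
    "Dress",
    "Coat",
    "Sandal",
    "Shirt",
    "Sneaker",
    "Bag",
    "Ankle boot",
]


def label_name(index: int) -> str:
    """Map an integer class label to its human-readable name."""
    if not 0 <= index < len(FASHION_MNIST_CLASSES):
        raise ValueError(f"Label must be in 0..9, got {index}")
    return FASHION_MNIST_CLASSES[index]


print(label_name(9))  # → Ankle boot
```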

*This tutorial assumes a basic familiarity with Python and Deep Learning
concepts.*

Running the Tutorial Code
-------------------------

You can run this tutorial in a couple of ways:

-   **In the cloud**: This is the easiest way to get started! Each
    section has a "Run in Microsoft Learn" and "Run in Google Colab"
    link at the top, which opens an integrated notebook in Microsoft
    Learn or Google Colab, respectively, with the code in a fully-hosted
    environment.
-   **Locally**: This option requires you to set up PyTorch and
    TorchVision on your local machine first ([installation
    instructions](https://pytorch.org/get-started/locally/)). Download
    the notebook or copy the code into your favorite IDE.

How to Use this Guide
---------------------

If you're familiar with other deep learning frameworks, check out
[0. Quickstart](quickstart_tutorial.html) first to quickly familiarize
yourself with PyTorch's API.

If you're new to deep learning frameworks, head right into the first
section of our step-by-step guide: [1. Tensors](tensorqs_tutorial.html).

