#### __ISP isolates: double-carbapenemase gene occurrences__

Using the ResFinder and AMRfinder results, build a dataframe of the form:

| Sample | Carbapenemase gene | .fastq reads | Illumina |
| ------ | ----------------- | ------------ | -------- |
| ... | ... | ... | ...|
| VA162_2022 | blaSHV-2a | YES | NO |
| ... | ... | ... | ...| 

In [2]:
import polars as pl

In [67]:
# Carbapenemase genes occurrences in .fastq reads (ResFinder)

resfinder = pl.read_csv("ResFinder/only_carbapenemase.tsv",
                        separator="\t",)

resfinder = resfinder.drop(["Identity",
                            "Phenotype"])

resfinder = resfinder.rename({"Sample": "sample"})
resfinder = resfinder.rename({"Resistance gene": "genID"})

# Carbapenemase genes occurrences in only illumina-assemblies (AMRfinder)

amrfinder = pl.read_csv("AMRfinder/AMRfinder_only_carbapenemase.tsv",
                        separator="\t",)

amrfinder = amrfinder.drop(["Protein identifier", "Strand",
                            "Contig id", "Scope", "Element type",
                            "Start", "Element subtype","HMM description",
                            "Stop","HMM id","Accession of closest sequence",
                            "Alignment length","% Coverage of reference sequence",
                            "Reference sequence length","Target length","Method",
                            "% Identity to reference sequence","Name of closest sequence",
                            "Sequence name","Class","Subclass"])

amrfinder = amrfinder.rename({"Name": "sample"})
amrfinder = amrfinder.rename({"Gene symbol": "genID"})

display(f"ResFinder nrows count: {resfinder.shape[0]}")
print(resfinder.head(4))

display(f"AMRfinder nrows count: {amrfinder.shape[0]}")
print(amrfinder.head(4))

resfinder_genID = (resfinder["genID"].unique()).to_list()
amrfinder_genID = (amrfinder["genID"].unique()).to_list()

display(f"Unique genes in ResFinder: {len(resfinder_genID)}")
display(resfinder_genID)
display(f"Unique genes in AMRfinder: {len(amrfinder_genID)}")
display(amrfinder_genID)

genID = list(set(resfinder_genID + amrfinder_genID))
display(f"Unique genes in both ResFinder and AMRfinder: {len(genID)}")
display(genID)

samples = list(set(resfinder["sample"].unique()))
display(samples)

'ResFinder nrows count: 40'

shape: (4, 2)
┌────────────┬─────────────┐
│ genID      ┆ sample      │
│ ---        ┆ ---         │
│ str        ┆ str         │
╞════════════╪═════════════╡
│ blaSHV-2a  ┆ VA1046_2020 │
│ blaSHV-27  ┆ VA1046_2020 │
│ blaSHV-110 ┆ VA1046_2020 │
│ blaNDM-7   ┆ VA1046_2020 │
└────────────┴─────────────┘


'AMRfinder nrows count: 15'

shape: (4, 2)
┌─────────────┬──────────┐
│ sample      ┆ genID    │
│ ---         ┆ ---      │
│ str         ┆ str      │
╞═════════════╪══════════╡
│ VA1046_2020 ┆ blaNDM-7 │
│ VA1184_2021 ┆ blaVIM-1 │
│ VA1788_2021 ┆ blaKPC-3 │
│ VA2464_2020 ┆ blaNDM-7 │
└─────────────┴──────────┘


'Unique genes in ResFinder: 23'

['blaVIM-2',
 'blaKPC-3',
 'blaVIM-1',
 'blaACT-14',
 'blaKPC-2',
 'blaSHV-11',
 'blaACT-7',
 'blaNDM-1',
 'blaSHV-148',
 'blaSHV-1',
 'blaACT-15',
 'blaSHV-158',
 'blaSHV-110',
 'blaACT-6',
 'blaSHV-2a',
 'blaNDM-7',
 'blaSHV-27',
 'blaACT-5',
 'blaSHV-81',
 'blaACT-10',
 'blaSHV-190',
 'blaSHV-40',
 'blaACT-16']

'Unique genes in AMRfinder: 6'

['blaVIM-2', 'blaVIM-1', 'blaKPC-3', 'blaNDM-1', 'blaKPC-2', 'blaNDM-7']

'Unique genes in both ResFinder and AMRfinder: 23'

['blaSHV-11',
 'blaACT-6',
 'blaACT-5',
 'blaVIM-1',
 'blaNDM-1',
 'blaSHV-148',
 'blaSHV-190',
 'blaNDM-7',
 'blaKPC-2',
 'blaACT-15',
 'blaACT-10',
 'blaSHV-2a',
 'blaVIM-2',
 'blaKPC-3',
 'blaACT-16',
 'blaSHV-110',
 'blaSHV-40',
 'blaSHV-27',
 'blaACT-7',
 'blaSHV-1',
 'blaSHV-81',
 'blaACT-14',
 'blaSHV-158']

['VA1788_2021',
 'VA1046_2020',
 'VA1184_2021',
 'VA2464_2020',
 'VA61_2022',
 'VA1565_2021',
 'VA2588_2020',
 'VA585_2022',
 'VA692_2022',
 'VA418_2022',
 'VA1101_2021',
 'VA2067_2020']
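
The counts printed above (23 genes from ResFinder, 6 from AMRfinder, 23 in the union) suggest that every AMRfinder gene is also reported by ResFinder. A minimal sketch of that check with plain Python sets follows; the two lists are copied from the output above, and the variable names `resfinder_genes`, `amrfinder_genes`, and `only_resfinder` are illustrative rather than taken from the notebook.

```python
# Gene lists copied from the notebook output above.
resfinder_genes = [
    "blaVIM-2", "blaKPC-3", "blaVIM-1", "blaACT-14", "blaKPC-2", "blaSHV-11",
    "blaACT-7", "blaNDM-1", "blaSHV-148", "blaSHV-1", "blaACT-15", "blaSHV-158",
    "blaSHV-110", "blaACT-6", "blaSHV-2a", "blaNDM-7", "blaSHV-27", "blaACT-5",
    "blaSHV-81", "blaACT-10", "blaSHV-190", "blaSHV-40", "blaACT-16",
]
amrfinder_genes = ["blaVIM-2", "blaVIM-1", "blaKPC-3", "blaNDM-1", "blaKPC-2", "blaNDM-7"]

# True when AMRfinder reports no gene that ResFinder misses.
amrfinder_is_subset = set(amrfinder_genes) <= set(resfinder_genes)

# Genes that only ResFinder reports (here the blaSHV and blaACT variants).
only_resfinder = sorted(set(resfinder_genes) - set(amrfinder_genes))

print(amrfinder_is_subset)
print(len(only_resfinder), only_resfinder)
```

In the notebook itself the same comparison could be run directly on `resfinder_genID` and `amrfinder_genID`.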

In [96]:
# Build every (sample, gene) pair: the gene list repeated once per sample and
# each sample repeated once per gene (12 samples x 23 genes = 276 rows).
genIDx12 = genID * len(samples)
samplesx23 = [sample for sample in samples for _ in range(len(genID))]

carb = pl.DataFrame(
    {
        "sample": samplesx23,
        "genID": genIDx12,
    }
)

print(carb.shape)
print(carb.head(2))

carb.write_csv("CP_CRE.tsv", separator="\t")

(276, 2)
shape: (2, 2)
┌─────────────┬───────────┐
│ sample      ┆ genID     │
│ ---         ┆ ---       │
│ str         ┆ str       │
╞═════════════╪═══════════╡
│ VA1788_2021 ┆ blaSHV-11 │
│ VA1788_2021 ┆ blaACT-6  │
└─────────────┴───────────┘


In [97]:
resfinder.write_csv("ResFinder_carbapem_occurrences.tsv", separator="\t")
amrfinder.write_csv("AMRfinder_carbapem_occurrences.tsv", separator="\t")

I have a 276-row dataframe called carb that contains every sample paired with every gene, of the form:

|sample|genID|
|---|---|
|VA1046_2020|blaSHV-11|
|...|...|

I need to add a new column to carb called fastq_reads whose value is YES or NO depending on whether the sample and genID pair is found in another dataframe called resfinder, which has 40 rows and has the form:

|genID|sample|
|---|---|
|blaSHV-2a|VA1184_2021|
|...|...|

Give me the code to do this with polars.

In [93]:
# Mark every (sample, genID) pair reported by ResFinder, then left-join the
# marker onto carb; pairs without a match come back null and are set to "NO".
resfinder_hits = resfinder.unique().with_columns(pl.lit("YES").alias("fastq_reads"))

merged = carb.join(resfinder_hits, on=["sample", "genID"], how="left")

merged = merged.with_columns(pl.col("fastq_reads").fill_null(pl.lit("NO")))

display(merged.head())

In [None]:
['VA1788_2021',
 'VA1046_2020',
 'VA1184_2021',
 'VA2464_2020',
 'VA61_2022',
 'VA1565_2021',
 'VA2588_2020',
 'VA585_2022',
 'VA692_2022',
 'VA418_2022',
 'VA1101_2021',
 'VA2067_2020']

In [None]:
# Label each sample once per data source (fastq reads and Illumina assembly).
sample_labels = [f"{sample}_{source}"
                 for source in ("fastq", "illumina")
                 for sample in samples]

# pl.DataFrame requires equal-length columns, so the 24 labels are cross-joined
# with the gene list (assuming one row per label-gene pair is intended).
CARB_occurrences = pl.DataFrame({"samples": sample_labels}).join(
    pl.DataFrame({"genID": genID}), how="cross"
)