**Table of contents**<a id='toc0_'></a>    
- [Logs for the methylation data app](#toc1_)    
  - [Introduction](#toc1_1_)    
    - [21-11-2024](#toc1_1_1_)    
  - [Loading in the data](#toc1_2_)    
    - [21-11-2024](#toc1_2_1_)    
    - [22-11-2024](#toc1_2_2_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Logs for the methylation data app](#toc0_)

## <a id='toc1_1_'></a>[Introduction](#toc0_)
### <a id='toc1_1_1_'></a>[21-11-2024](#toc0_)

This logbook will describe the process of creating visualisations, ideas. These visualisations and ideas will be used to create an application for research students.
This application will take DNA methylation data as input. This app will make it easier for the students to look into their generated data, and it will help them with understanding their data.


## <a id='toc1_2_'></a>[Loading in the data](#toc0_)
### <a id='toc1_2_1_'></a>[21-11-2024](#toc0_)

I would like to combine the data from all the files into one single file, with the id in the column of the df.
This way i could compare different conditions to eachother.

The first code-block is to load in the used libraries.

In [3]:
import os
import seaborn as sns
import polars as pl
import pandas as pd
import re

In [4]:
barcodes_names: pl.dataframe = pl.read_csv("/home/redman/jaar2/data/Methylatie/barcodes.csv")

barcodes_names = barcodes_names.with_columns(controle_n = pl.int_range(pl.len()).over(" description")+1)
barcodes_names = barcodes_names.with_columns(group_and_n = pl.concat_str([pl.col(' description'), pl.col("controle_n")]))
print(barcodes_names.head())



shape: (5, 4)
┌─────────┬──────────────────────┬────────────┬───────────────────────┐
│ barcode ┆  description         ┆ controle_n ┆ group_and_n           │
│ ---     ┆ ---                  ┆ ---        ┆ ---                   │
│ i64     ┆ str                  ┆ i64        ┆ str                   │
╞═════════╪══════════════════════╪════════════╪═══════════════════════╡
│ 11      ┆  Jurkat_DMSO_control ┆ 1          ┆  Jurkat_DMSO_control1 │
│ 12      ┆  Jurkat_betuline     ┆ 1          ┆  Jurkat_betuline1     │
│ 13      ┆  Healthy_control     ┆ 1          ┆  Healthy_control1     │
│ 14      ┆  Jurkat_betuline     ┆ 2          ┆  Jurkat_betuline2     │
│ 15      ┆  Jurkat_DMSO_control ┆ 2          ┆  Jurkat_DMSO_control2 │
└─────────┴──────────────────────┴────────────┴───────────────────────┘


This generates a data frame that contains the barcode and also the description of the barcode
The column called group_and_n contains the description with a control group number

This is needed to label the different groups in the df that will contain all of the data

Which will be loaded in the code below this block

In [5]:
path: str = "/home/redman/jaar2/data/Methylatie/analysis"
def load_files(path: str) -> pl.dataframe:
    test: list = []
    col_names: list = ["chr", "start", "end", "frac", "valid", "group_name"]
    resulting_df: pd.DataFrame = pl.DataFrame(
        {"chr":[],
         "start":[],
         "end":[],
         "frac":[],
         "valid":[],
         "group_name":[]}
    )
    files: list = os.listdir(path)

    for file in files:
        if os.path.isfile(f"{path}/{file}") and file.endswith("methylatie_ALL.csv"):
            temp_df = pd.read_csv(f"{path}/{file}", sep="\t")
            temp_df = pl.from_pandas(temp_df)
            barcode_num = re.findall(r"\d+", file)

            name_group = barcodes_names.filter(pl.col("barcode").cast(pl.String) == barcode_num[0]).select("group_and_n")
            temp_df = temp_df.with_columns(pl.lit(name_group).alias("group_name"))
            resulting_df = pl.concat([temp_df, resulting_df])
    
    return resulting_df
    
df: pl.DataFrame = load_files(path=path)

All of the csv files are now loaded into 1 polars dataframe

In [7]:

print(df.head())


shape: (5, 6)
┌──────┬───────┬───────┬──────┬───────┬───────────────────┐
│ chr  ┆ start ┆ end   ┆ frac ┆ valid ┆ group_name        │
│ ---  ┆ ---   ┆ ---   ┆ ---  ┆ ---   ┆ ---               │
│ str  ┆ i64   ┆ i64   ┆ f64  ┆ i64   ┆ str               │
╞══════╪═══════╪═══════╪══════╪═══════╪═══════════════════╡
│ chr1 ┆ 61624 ┆ 61625 ┆ 0.0  ┆ 1     ┆  Jurkat_betuline1 │
│ chr1 ┆ 61802 ┆ 61803 ┆ 0.0  ┆ 1     ┆  Jurkat_betuline1 │
│ chr1 ┆ 61900 ┆ 61901 ┆ 0.0  ┆ 1     ┆  Jurkat_betuline1 │
│ chr1 ┆ 61921 ┆ 61922 ┆ 1.0  ┆ 1     ┆  Jurkat_betuline1 │
│ chr1 ┆ 61929 ┆ 61930 ┆ 1.0  ┆ 1     ┆  Jurkat_betuline1 │
└──────┴───────┴───────┴──────┴───────┴───────────────────┘


### <a id='toc1_2_2_'></a>[22-11-2024](#toc0_)
I now have a data frame with the methylation data with a column called group_name that holds the name of the group of which the data comes from