# Introduction to the DataConfig Object

The `DataConfig` class is a vital component of the GenAIRR package, designed to manage and organize the various configurations needed for immunoglobulin sequence generation, allele usage, trimming, and mutation simulations. This object serves as a centralized hub for storing and accessing all the essential data required during simulations and analyses.

## Key Attributes of DataConfig
- **family_use_dict**: Manages the usage frequencies of gene families, helping to simulate realistic gene family distributions. (Currently unused: alleles are selected uniformly instead.)
- **gene_use_dict**: Similar to `family_use_dict`, but focuses on individual gene usage frequencies.
- **trim_dicts**: Contains information on how to trim gene segments (V, D, J) during sequence generation.
- **NP_transitions & NP_first_bases**: These dictionaries define the transition probabilities and initial base probabilities for NP regions (the junctional N/P nucleotide insertions between gene segments), which are crucial for simulating realistic sequences.
- **NP_lengths**: Provides the distribution of NP region lengths, adding another layer of realism to the sequence generation process.
- **v_alleles, d_alleles, j_alleles, c_alleles**: These dictionaries store allele information for V, D, J, and C gene segments, respectively, organized by family.
- **correction_maps**: Maps used for correcting or adjusting sequences or simulation parameters, ensuring that generated sequences meet specific criteria.
- **asc_tables**: Stores allele sequence cluster (ASC) tables, which group alleles based on sequence similarity and other criteria, providing insights into allele relationships.

The `DataConfig` object is integral to ensuring that simulations and sequence analyses are conducted with accurate and relevant data. Throughout this notebook, we will explore how to utilize `DataConfig` to configure and manage data effectively for your specific research needs.

Note that GenAIRR requires all of the attributes above to be present and correctly formatted. Keep this in mind if you modify an existing `DataConfig` or create a custom one.
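A quick presence check before running a simulation can catch a missing attribute early. The sketch below is an illustration only and not part of GenAIRR: `REQUIRED_ATTRIBUTES` simply mirrors the list above, `missing_attributes` is a hypothetical helper, and a `SimpleNamespace` stands in for a real `DataConfig` so the snippet runs on its own. It checks presence only, not format.

```python
from types import SimpleNamespace

# Attribute names copied from the list above.
REQUIRED_ATTRIBUTES = (
    "family_use_dict", "gene_use_dict", "trim_dicts",
    "NP_transitions", "NP_first_bases", "NP_lengths",
    "v_alleles", "d_alleles", "j_alleles", "c_alleles",
    "correction_maps", "asc_tables",
)

def missing_attributes(config, required=REQUIRED_ATTRIBUTES):
    """Return the required attribute names that the object does not define."""
    return [name for name in required if not hasattr(config, name)]

# Stand-in object for illustration; pass a real DataConfig instance instead.
toy_config = SimpleNamespace(trim_dicts={}, NP_lengths={}, NP_first_bases={})
print(missing_attributes(toy_config))
```

An empty list means every expected attribute is defined on the object.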

Let's begin by diving into the structure and examples of the `DataConfig` object!


In [1]:
# Load the built-in heavy-chain configuration
from GenAIRR.data import builtin_heavy_chain_data_config
from GenAIRR.dataconfig import DataConfig

heavychain_dataconfig = builtin_heavy_chain_data_config()

## Trim Dictionary (trim_dicts)

The Trim Dictionary is a multi-level dictionary housed within each `DataConfig` object. This structure organizes trimming information based on gene and side (e.g., 5' or 3'). The keys follow the format of `gene_side`, such as `V_3` for the 3' end of the V gene.

For each `gene_side` key, there are sub-keys representing all the gene families available in the reference. Each family sub-key holds the genes in that family (for example, `IGHVF1-G1`), and each gene maps the possible trimming lengths to the likelihood of each length being selected.

Modifying this dictionary within the `DataConfig` object allows you to control the trimming lengths applied to specific gene-side and family combinations during sequence generation.
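The lookup-and-sample step can be sketched as follows. This is an illustration under stated assumptions rather than GenAIRR's internal code: `toy_trim_dicts` is a hand-made stand-in with made-up probabilities, `sample_trim_length` is a hypothetical helper, and the gene-level key (`IGHVF1-G1`) follows the nesting that `trim_dicts['V_3']['IGHVF1']` returns.

```python
import random

# Toy stand-in: gene_side -> family -> gene -> {trim length: probability}.
# The probabilities are made up for illustration.
toy_trim_dicts = {
    "V_3": {
        "IGHVF1": {
            "IGHVF1-G1": {0: 0.5, 1: 0.3, 2: 0.2},
        }
    }
}

def sample_trim_length(trim_dicts, gene_side, family, gene, rng=random):
    """Draw one trimming length, weighted by its probability."""
    distribution = trim_dicts[gene_side][family][gene]
    lengths = list(distribution.keys())
    weights = list(distribution.values())
    return rng.choices(lengths, weights=weights, k=1)[0]

# Overriding the distribution changes the sampling behaviour, here "never trim".
toy_trim_dicts["V_3"]["IGHVF1"]["IGHVF1-G1"] = {0: 1.0}
print(sample_trim_length(toy_trim_dicts, "V_3", "IGHVF1", "IGHVF1-G1"))  # prints 0
```

When editing a real distribution, keep the probabilities summing to 1 so that the weights remain interpretable as likelihoods.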


In [2]:
heavychain_dataconfig.trim_dicts['V_3']['IGHVF1']

{'IGHVF1-G1': defaultdict(float,
             {0: 0.2136996827285346,
              1: 0.0882906999801705,
              2: 0.2842987804878049,
              3: 0.1517759766012294,
              4: 0.1780314297045409,
              5: 0.0283127602617489,
              6: 0.0390206722189173,
              7: 0.0038915328177672,
              8: 0.0087621455482847,
              9: 0.0017536684513186,
              10: 0.00098527662105889,
              11: 0.0007064247471743,
              12: 4.9573666468372e-05,
              13: 7.4360499702558e-05,
              14: 3.09835415427325e-05,
              15: 1.85901249256395e-05,
              16: 4.33769581598255e-05,
              17: 5.57703747769185e-05,
              18: 4.33769581598255e-05,
              19: 3.09835415427325e-05,
              20: 6.1967083085465e-06,
              21: 1.2393416617093e-05,
              22: 2.4786833234186e-05,
              23: 6.1967083085465e-06,
              25: 6.1967083085465e-06,
       

## NP Region Generation Parameters

The `DataConfig` object contains three crucial components that guide the generation of NP regions during sequence simulation.

### 1. NP First Bases (`NP_first_bases`)

The first component is `NP_first_bases`, which is a multi-level dictionary. The top-level key represents the specific NP region of interest, either `NP1` or `NP2`. In cases where there is no D allele, such as in light chains, only `NP1` exists. The inner dictionary provides the probabilities of the NP region starting with each of the four nucleotides (A, T, C, G).

When simulating NP regions in GenAIRR, a first-order Markov chain is used. The `NP_first_bases` dictionary provides the initial state probabilities for this Markov chain. For example, when generating the `NP1` region, the first nucleotide is sampled based on the weights (likelihoods) defined in `dataconfig.NP_first_bases["NP1"]`.

### 2. Markov Chain Transition Matrix (`NP_transitions`)

The second component is the Markov chain transition matrix, stored in the `NP_transitions` dictionary. This is also a multi-level dictionary with several layers that define how the NP region evolves as nucleotides are added.

- **Top-Level Key**: Similar to `NP_first_bases`, the first key in `NP_transitions` specifies the NP region type (`NP1` or `NP2`). For instance, `dataconfig.NP_transitions['NP1']` retrieves the transition matrix used for generating the `NP1` region.
  
- **Second-Level Key**: The next level in the dictionary corresponds to the position within the NP region. For example, if you are generating the 5th nucleotide in the sequence, you would use `dataconfig.NP_transitions['NP1'][4]` to access the relevant transition probabilities.

- **Third-Level Key**: At this level, the key corresponds to the current nucleotide observed at the specific position. If the 4th position in the generated NP region is a "T", you would query `dataconfig.NP_transitions['NP1'][4]["T"]`. This returns a distribution that allows you to sample the next nucleotide (5th in this case), continuing the process for the entire length of the NP region.

This loop repeats until the NP region reaches its predetermined length.
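The procedure described above can be sketched in a few lines. This is a minimal illustration, not GenAIRR's implementation: the tables are toy data with uniform transitions, `sample` and `generate_np_region` are hypothetical helpers, and the code assumes the position key is the 0-based index of the most recently generated nucleotide.

```python
import random

# Toy tables for illustration only; real values come from a DataConfig object.
first_bases = {"NP1": {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Nesting: region -> position -> current nucleotide -> {next nucleotide: probability}
transitions = {
    "NP1": {pos: {base: dict(uniform) for base in "ACGT"} for pos in range(10)}
}

def sample(distribution, rng):
    """Draw one key from a {value: probability} dictionary."""
    return rng.choices(list(distribution), weights=list(distribution.values()), k=1)[0]

def generate_np_region(region, length, first_bases, transitions, rng=None):
    """Generate an NP region with a position-dependent first-order Markov chain.

    Assumption: the position key is the 0-based index of the current
    (most recently generated) nucleotide.
    """
    rng = rng or random.Random()
    if length == 0:
        return ""
    # The initial state comes from the first-base distribution.
    sequence = [sample(first_bases[region], rng)]
    # Each later nucleotide depends on the position and the current nucleotide.
    for position in range(length - 1):
        current = sequence[-1]
        sequence.append(sample(transitions[region][position][current], rng))
    return "".join(sequence)

print(generate_np_region("NP1", 6, first_bases, transitions, rng=random.Random(0)))
```

The sketch does not handle a position or nucleotide that is missing from the transition table; a real table may need a fallback for that case.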

### 3. NP Region Length Distribution (`NP_lengths`)

The third component is the NP region length distribution, stored in the `NP_lengths` dictionary. This is a two-level dictionary where the top-level key specifies the NP region (`NP1` or `NP2`). The value for each key is a distribution of likelihoods over the possible lengths for that NP region.

This distribution defines the variety of lengths that can occur in the NP regions during simulation, allowing for more realistic sequence generation.
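Drawing a length from such a distribution, and sanity-checking it, might look like the sketch below. The numbers in `toy_np_lengths` are invented for illustration, and `is_normalized`, `expected_length`, and `sample_np_length` are hypothetical helpers rather than GenAIRR functions.

```python
import math
import random

# Toy length distribution for illustration: region -> {length: probability}.
toy_np_lengths = {"NP1": {0: 0.1, 1: 0.2, 2: 0.4, 3: 0.3}}

def is_normalized(distribution, tolerance=1e-6):
    """Check that the probabilities sum to 1 within a tolerance."""
    return math.isclose(sum(distribution.values()), 1.0, abs_tol=tolerance)

def expected_length(distribution):
    """Mean NP length implied by the distribution."""
    return sum(length * prob for length, prob in distribution.items())

def sample_np_length(np_lengths, region, rng=random):
    """Draw one NP region length, weighted by its probability."""
    distribution = np_lengths[region]
    return rng.choices(list(distribution), weights=list(distribution.values()), k=1)[0]

print(is_normalized(toy_np_lengths["NP1"]))
print(sample_np_length(toy_np_lengths, "NP1"))
```

The same helpers can be pointed at `heavychain_dataconfig.NP_lengths['NP1']`, since it has the same `{length: probability}` shape.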


In [3]:
heavychain_dataconfig.NP_first_bases

{'NP1': {'A': 0.11170254294101757,
  'C': 0.24612237873865697,
  'G': 0.28458098488427797,
  'T': 0.3575940934360475},
 'NP2': {'A': 0.16727374243138382,
  'C': 0.3712668132552802,
  'G': 0.27509575963731403,
  'T': 0.18636368467602188}}

In [4]:
heavychain_dataconfig.NP_transitions['NP1'][0]['T']

{'A': 0.16940029106419394,
 'C': 0.39025137900329543,
 'G': 0.23279850358164553,
 'T': 0.2075498263508651}

In [5]:
heavychain_dataconfig.NP_lengths['NP1']

{0: 0.05265143785167222,
 1: 0.04075110575704825,
 2: 0.05608972515275564,
 3: 0.07007809677350897,
 4: 0.07858552952587226,
 5: 0.07975324952254308,
 6: 0.07758633439139435,
 7: 0.07449997706055178,
 8: 0.06841003801143569,
 9: 0.05997446760685735,
 10: 0.05486857812958784,
 11: 0.04679214939837378,
 12: 0.04113627082942587,
 13: 0.034565642417839555,
 14: 0.0284863084612022,
 15: 0.024700616824390856,
 16: 0.020714718727390977,
 17: 0.01551780635633689,
 18: 0.013459847504451441,
 19: 0.010925362147151564,
 20: 0.009099157910326753,
 21: 0.007894751794817339,
 22: 0.006256334057027708,
 23: 0.005212570520841891,
 24: 0.004889068681041339,
 25: 0.003603860948874375,
 26: 0.0031465202209132598,
 27: 0.0026087014130785004,
 28: 0.0022806267502488496,
 29: 0.0018500272959719036,
 30: 0.0017848341530929469,
 31: 0.0014223901128996882,
 32: 0.0012977542751098972,
 33: 0.0008708713162783527,
 34: 0.0013054161439008335,
 35: 0.0007111989391616005,
 36: 0.0005481891501906322,
 37: 0.000612558