# Lesson 1 - 2025/10/03

## Why Visual Studio Code and Jupyter Notebook?
Visual Studio Code is a code editor that supports Jupyter notebook documents. A Jupyter Notebook document is a document (usually ending with the `.ipynb` extension) containing an ordered list of input/output cells which can contain code, text, mathematics, plots and rich media.

## What is Python?

Python is a general-purpose, high-level **[interpreted language](https://en.wikipedia.org/wiki/Interpreted_language)**. An interpreted language is a programming language whose implementations mostly execute instructions directly, without first compiling the whole program into machine-language instructions. The **interpreter** executes the program directly, translating each statement into a sequence of one or more subroutines and then into another language (often machine code).

## Why Python?
- Readability (improving communication)
- Built-in libraries and high-level data structures (dictionaries, sets, lists, tuples, ...)
- Cross-platform (a program written in Python can run on any computer that has a Python interpreter)
- Active community (important in daily work).

## Variables

In Python, we can assign a value to a variable, using the equals sign `=`.

In [2]:
# chromosome length total
GRCh38p13_length = 3096649726

#print(GRCh38p13_length)

GRCh38p13_length

3096649726

`print` is a Python function that displays the value passed to it on the screen. In a Jupyter notebook it is often unnecessary, because the value of the last expression in a cell is displayed automatically.

Note that:
- you don't have to declare the variable
- you don't have to specify what type of value the variable is going to refer to.

Python is a **dynamically typed** programming language. Dynamic programming languages do at runtime many of the things that static programming languages do during compilation. In Python's case, type checking is performed at runtime.
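
As a small illustration of dynamic typing, the sketch below binds the same name first to an integer and then to a string. The variable names are made up for this example, and `type` is the built-in function described later in this lesson.

```python
# The same name can refer to values of different types over time.
sample = 42
first_type = type(sample)     # the type is looked up at runtime

sample = 'GATTACA'            # no declaration needed to switch type
second_type = type(sample)

print(first_type)
print(second_type)
```

The type belongs to the value, not to the name, which is why no declaration is needed.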

A variable is a name for a value. Variable names
- can include letters, digits, and underscores
- cannot start with a digit
- are case sensitive.
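
The naming rules above can be tried out directly. The names below are invented for illustration. The string method `isidentifier()` is used here only as a convenient check of whether a piece of text has the shape of a valid name (it also returns `True` for reserved words such as `for`, which still cannot be used as variable names).

```python
# Valid names: letters, digits, and underscores, not starting with a digit.
gene_count = 3
geneCount2 = 4

# Names are case sensitive: these are two different variables.
length = 7
Length = 70
print(length, Length)

# Check the shape of some candidate names without assigning to them.
print('gene_count'.isidentifier())   # True
print('2nd_gene'.isidentifier())     # False: starts with a digit
print('gene-count'.isidentifier())   # False: '-' is not allowed
```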

Basic data types in Python include integers, strings, and floating-point numbers. The arithmetic operators below work on the numeric types:

| Operator     | Name           | Description                                            |
|--------------|----------------|--------------------------------------------------------|
| ``a + b``    | Addition       | Sum of ``a`` and ``b``                                 |
| ``a - b``    | Subtraction    | Difference of ``a`` and ``b``                          |
| ``a * b``    | Multiplication | Product of ``a`` and ``b``                             |
| ``a / b``    | Division       | Quotient of ``a`` and ``b``                            |
| ``a // b``   | Floor division | Quotient of ``a`` and ``b``, removing fractional parts |
| ``a % b``    | Modulus        | Integer remainder from division of ``a`` by ``b``      |
| ``a ** b``   | Exponentiation | ``a`` raised to the power of ``b``                     |
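
A quick way to get a feel for the operators in the table is to apply each of them to the same pair of numbers. The values `7` and `2` are arbitrary choices for this sketch.

```python
a = 7
b = 2

print(a + b)    # 9
print(a - b)    # 5
print(a * b)    # 14
print(a / b)    # 3.5 (division always returns a float)
print(a // b)   # 3
print(a % b)    # 1
print(a ** b)   # 49

# Floor division rounds down (toward negative infinity), not toward zero.
print(-7 // 2)  # -4
print(-7 % 2)   # 1
```

The last two lines show why "removing fractional parts" only describes `//` for non-negative results.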

## Strings
Strings can be seen as sequences of characters.

In [3]:
'GATTACA'

'GATTACA'

They can be delimited by either single or double quotation marks. A quotation mark of the same kind used inside the string must be escaped with a backslash.

In [4]:
'GATTACA's'

SyntaxError: unterminated string literal (detected at line 1) (3194273073.py, line 1)

In [5]:
'GATTACA\'s'

"GATTACA's"

In [6]:
"GATTACA's"

"GATTACA's"

In [7]:
'Hello "world"'

'Hello "world"'

You can use the position’s index in square brackets to get the character at that position. Indices are numbered from 0.

In [8]:
dna_seq = 'GATTACA'

dna_seq[0]

'G'

In [9]:
print(dna_seq[-1])
print(dna_seq[-2])

A
C


In [10]:
dna_seq_length = len(dna_seq)

print(dna_seq[dna_seq_length - 1])
print(dna_seq[dna_seq_length - 2])

A
C


The `len()` function returns the number of items (length) in an object.

You can use the `+` and `*` operators on strings.

In [11]:
'PRO' + 'TE' + 'IN'

'PROTEIN'

In [12]:
'-' * 10

'----------'

In [13]:
piece = 'TACA'

print('GAT' + piece)
print('GA' + '-' * 3 + 'CA')

GATTACA
GA---CA


In [14]:
# what happens?

print(1 + '2')

TypeError: unsupported operand type(s) for +: 'int' and 'str'

This is not allowed because it is ambiguous: should `1 + '2'` be `3` or `'12'`? Converting one operand explicitly with `int` or `str` removes the ambiguity.

In [15]:
print(1 + int('2'))
print(str(1) + '2')

3
12


The built-in function `type` returns what type a value has.

In [16]:
print(type(1))
print(type('2'))

<class 'int'>
<class 'str'>


### Operations on strings

In [2]:
piece_of_a_protein = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNZVVIKVCEFQFCNDPFLGVY"

len(piece_of_a_protein)

145

To get the number of serine amino acids:

In [18]:
piece_of_a_protein.count('S')

12

The string `count()` method returns the number of occurrences of a substring in the given string.
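
`count()` also accepts substrings longer than one character, in which case only non-overlapping occurrences are counted. The short sequence below is made up for this illustration.

```python
dna = 'GATTACATTTA'   # made-up sequence for this example

print(dna.count('T'))    # 5: every single 'T'
print(dna.count('TT'))   # 2: 'TTT' contains only one non-overlapping 'TT'
print(dna.count('X'))    # 0: an absent substring is not an error
```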

In [19]:
piece_of_a_protein.count('S') * 100 / len(piece_of_a_protein)

8.275862068965518

In [20]:
print(str(piece_of_a_protein.count('S') * 100 / len(piece_of_a_protein)) + '%')

8.275862068965518%


We can call the `.format()` method on a **format string**, where the values we want to insert are represented by `{}` placeholders. `format()` takes care of the type conversion too.

In [21]:
print('Percentage of serine amino acids: {:.3f}%'.format(
    piece_of_a_protein.count('S') * 100 / len(piece_of_a_protein)
))

Percentage of serine amino acids: 8.276%


It is similar to C's `printf("Percentage of serine amino acids: %.3f%%", value);`.
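
A format string can hold several placeholders, each with its own format specification, and placeholders can also be named. The sketch below reuses the serine count (12) and sequence length (145) obtained above; the variable names are chosen for this example.

```python
residue = 'S'
count = 12     # number of serines found above
total = 145    # length of the protein fragment above

# Positional placeholders are filled in order; {:.1f} keeps one decimal.
template = '{} occurs {} times in {} residues ({:.1f}%)'
message = template.format(residue, count, total, count * 100 / total)
print(message)   # S occurs 12 times in 145 residues (8.3%)

# Named placeholders make longer templates easier to read.
named = '{name}: {value:.2f}%'.format(name='serine', value=count * 100 / total)
print(named)     # serine: 8.28%
```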

### Slicing
A part of a string is called a substring. In general, for any sequence type (such as a string, a list, or a tuple), we can take a part of it, a **slice**.

We take a slice by using `[start:stop]`, where `start` is replaced with the index of the first element we want and `stop` is replaced with the index of the element just after the last element we want. The difference between `stop` and `start` is the slice's length.

In [22]:
'01234'[0:2]

'01'

In [23]:
print('01234[:2]', '01234'[:2])
print('01234[2:]', '01234'[2:])
print('01234[:] ', '01234'[:])

01234[:2] 01
01234[2:] 234
01234[:]  01234
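
The rule that the slice length equals `stop - start` can be checked directly. It holds as long as both indices fall inside the string; an index past the end is silently clipped. The string and indices below are arbitrary.

```python
digits = '0123456789'
start = 2
stop = 6

piece = digits[start:stop]
print(piece)                         # 2345
print(len(piece) == stop - start)    # True

# A stop index beyond the end is clipped instead of raising an error.
tail = digits[7:100]
print(tail)                          # 789
```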


In [24]:
'''
Missense mutation description
- the variant description is preceded by a letter indicating the type of reference sequence used: `p.` is for protein sequences;
- amino acids in three-letter code, first letter in upper case;
- wild-type amino acid, position in the protein sequence, mutant amino acid
'''

mutation = 'p.Ser33Ala'

In [25]:
# Take the wild-type amino acid



In [25]:
mutation[2:5]

'Ser'

In [14]:
# Remove the "p."


In [26]:
print(mutation[2:10], mutation[2:])

Ser33Ala Ser33Ala


In [29]:
# Take the mutant amino acid (mutation = 'p.Ser33Ala')



In [27]:
mutation[7:]

'Ala'

In [28]:
mutation2 = 'p.Leu280Val'

mutation2[7:]

'0Val'

In [29]:
print(mutation[-3:], mutation2[-3:])

Ala Val
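
Combining a fixed start with negative indices gives slices that work whatever the number of digits in the position. The sketch below assumes the description always starts with `p.` and uses three-letter codes, as in the examples above; the variable names are chosen for illustration.

```python
mutation2 = 'p.Leu280Val'

wild_type = mutation2[2:5]          # three letters right after 'p.'
mutant = mutation2[-3:]             # last three letters
position = int(mutation2[5:-3])     # everything in between, as a number

print(wild_type, position, mutant)  # Leu 280 Val
```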


#### Mutability

Taking a slice does not change the contents of the original string. Instead, the slice is a copy of part of the original string. In Python, a string is **immutable**.

An immutable object is an object whose state can't be modified after its creation.

In [30]:
mutation = 'p.Ser33Ala'
print('Memory address of "mutation":      {}'.format(id(mutation)))
print('Memory address of "mutation[-3:]": {}'.format(id(mutation[-3:])))

mutation = 'p.Ser100Ala'
print('Memory address of "mutation":      {}'.format(id(mutation)))

Memory address of "mutation":      138920552147056
Memory address of "mutation[-3:]": 138920551860272
Memory address of "mutation":      138920552147888
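
Because strings are immutable, a "modified" string always has to be built as a new object. In the sketch below the failing assignment is left as a comment so that the cell runs to the end; the variable names are chosen for this example.

```python
dna_seq = 'GATTACA'

# dna_seq[0] = 'C'   # would raise:
# TypeError: 'str' object does not support item assignment

# Build a new string from pieces of the old one instead.
mutated = 'C' + dna_seq[1:]
print(mutated)    # CATTACA

# String methods also return new strings and leave the original alone.
lowered = dna_seq.lower()
print(lowered)    # gattaca
print(dna_seq)    # GATTACA
```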


Built-in functions come with their own documentation. Use the built-in function `help` to display it.

In [31]:
help(id)

Help on built-in function id in module builtins:

id(obj, /)
    Return the identity of an object.
    
    This is guaranteed to be unique among simultaneously existing objects.
    (CPython uses the object's memory address.)



Immutability ensures that an object remains constant throughout your program. Mutable objects, in contrast, can change, whether you expect it or not.
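
One common way a mutable object changes unexpectedly is through a second name that refers to the same object. The sketch below uses a list (the mutable type introduced in the next section) and `id` to show the difference between a second name and a real copy. All names are chosen for illustration.

```python
original = ['p.Ser31Ala', 'p.Pro38Leu']

alias = original                # a second name for the SAME list
alias.append('p.Asn100Lys')     # also visible through 'original'
print(original)                 # now holds three elements

independent = original[:]       # a full slice creates a new list
independent.append('p.Leu110Val')
print(len(original), len(independent))   # 3 4

print(id(alias) == id(original))         # True: same object
print(id(independent) == id(original))   # False: separate object
```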

### List

In [32]:
mutations = [] # Empty list

mutations = [
    'p.Ser31Ala', 'p.Pro38Leu', 'p.Asn100Lys'
]

mutations

['p.Ser31Ala', 'p.Pro38Leu', 'p.Asn100Lys']

Lists are **mutable**, and hence, they can be altered even after their creation.

In [33]:
mutations.append('p.LEU110VAL')

mutations

['p.Ser31Ala', 'p.Pro38Leu', 'p.Asn100Lys', 'p.LEU110VAL']

Lists are a useful tool for preserving a sequence of data and further iterating over it. They can contain duplicated elements.

Lists need not be homogeneous: a single list may contain values of different types, such as integers, strings, and other objects.

In [34]:
mutations.extend([13, 4.0, True, 'p.Tyr341Leu', 'AUG', 'p.Tyr0Le', 'p.Asn1.3Lys', 'p.Arg0Leu'])

mutations

['p.Ser31Ala',
 'p.Pro38Leu',
 'p.Asn100Lys',
 'p.LEU110VAL',
 13,
 4.0,
 True,
 'p.Tyr341Leu',
 'AUG',
 'p.Tyr0Le',
 'p.Asn1.3Lys',
 'p.Arg0Leu']
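
The two cells above use `append` and `extend`, which are easy to confuse: `append` adds its argument as a single element, while `extend` adds each element of its argument. A minimal sketch with made-up values:

```python
residues = ['Ser', 'Ala']

residues.append('Leu')             # adds one element
residues.extend(['Val', 'Pro'])    # adds two elements
print(residues)                    # ['Ser', 'Ala', 'Leu', 'Val', 'Pro']

# Appending a list adds the whole list as ONE nested element.
residues.append(['Gly', 'Lys'])
print(len(residues))               # 6
print(residues[-1])                # ['Gly', 'Lys']
```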

### Tuple

In [3]:
mutations_tuple = () # Empty tuple

mutations_tuple = (
    'p.Ser31Ala', 'p.Pro39Leu', 'p.Asn100Lys'
)

mutations_tuple

('p.Ser31Ala', 'p.Pro39Leu', 'p.Asn100Lys')

Tuples are **immutable**.

In [36]:
mutations_tuple[2] = 'p.Asn101Lys'

TypeError: 'tuple' object does not support item assignment
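
As with strings, the way around this is to build a new tuple rather than change the existing one. The sketch below combines a slice with a one-element tuple; the name `updated` is chosen for this example.

```python
mutations_tuple = ('p.Ser31Ala', 'p.Pro39Leu', 'p.Asn100Lys')

# Keep the first two elements and add a corrected third one.
# A one-element tuple needs a trailing comma: ('p.Asn101Lys',)
updated = mutations_tuple[:2] + ('p.Asn101Lys',)

print(updated)           # the new tuple
print(mutations_tuple)   # the original is unchanged
```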

### Conditionals

Conditional statements let you control what pieces of code are run based on the value of some Boolean condition.

In [None]:
print('mutations_tuple:', mutations_tuple)

if piece_of_a_protein[30] == 'S' and piece_of_a_protein[38] == 'P':
    print('aminos are correct')
else:
    print('something is wrong', piece_of_a_protein[30], piece_of_a_protein[38])

mutations_tuple: ('p.Ser31Ala', 'p.Pro39Leu', 'p.Asn100Lys')
aminos are correct


The colon (`:`) indicates that a new "code block" is starting. Subsequent indented lines are part of that code block.
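
When there are more than two cases, `elif` adds further conditions between `if` and `else`, each with its own indented block. The sketch below is an assumption-laden toy: it labels only two residues and treats everything else as "other".

```python
mutation = 'p.Ser33Ala'
wild_type = mutation[2:5]

if wild_type == 'Ser':
    label = 'serine'
elif wild_type == 'Pro':
    label = 'proline'
else:
    label = 'other'

# This line is not indented, so it runs whichever branch was taken.
print(label)   # serine
```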

### Loops

Loops execute a piece of code repeatedly.

In [38]:
index = 0

while index < len(mutations):
    print(mutations[index])
    index += 1

p.Ser31Ala
p.Pro38Leu
p.Asn100Lys
p.LEU110VAL
13
4.0
True
p.Tyr341Leu
AUG
p.Tyr0Le
p.Asn1.3Lys
p.Arg0Leu


In [39]:
for mut in mutations:
    print(mut)

p.Ser31Ala
p.Pro38Leu
p.Asn100Lys
p.LEU110VAL
13
4.0
True
p.Tyr341Leu
AUG
p.Tyr0Le
p.Asn1.3Lys
p.Arg0Leu


Lists and tuples are **iterable** objects: a `for` loop can visit their elements one by one.
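
Strings are iterable as well: a `for` loop over a string visits its characters one at a time. The sketch below counts adenines by hand and compares the result with `count()`; the variable names are chosen for illustration.

```python
dna_seq = 'GATTACA'

a_count = 0
for base in dna_seq:        # one character per iteration
    if base == 'A':
        a_count += 1

print(a_count)                          # 3
print(a_count == dna_seq.count('A'))    # True

# A tuple is looped over in exactly the same way as a list.
for mut in ('p.Ser31Ala', 'p.Pro39Leu'):
    print(mut)
```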

In [40]:
for mut in mutations:
    if mut == 13:
        break
    print(mut)

p.Ser31Ala
p.Pro38Leu
p.Asn100Lys
p.LEU110VAL


The `break` statement terminates the loop containing it.

`for` loops can also have an `else` clause. The `else` clause executes after the loop completes **normally**, that is, without encountering a `break` statement.

In [41]:
for mut in mutations:
    if mut == 13:
        break
    print(mut)
else:
    print('element not found')

p.Ser31Ala
p.Pro38Leu
p.Asn100Lys
p.LEU110VAL


In [42]:
for mut in mutations:
    if mut == 113:
        break
    print(mut)
else:
    print('element not found')

p.Ser31Ala
p.Pro38Leu
p.Asn100Lys
p.LEU110VAL
13
4.0
True
p.Tyr341Leu
AUG
p.Tyr0Le
p.Asn1.3Lys
p.Arg0Leu
element not found
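
A typical use of `for`/`else` is a search: the `break` branch handles the "found" case and the `else` clause handles the "not found" case. The sketch below looks for the first entry that is not a string, using a shortened, made-up version of the list above.

```python
entries = ['p.Ser31Ala', 'p.Pro38Leu', 13, 'p.Asn100Lys']

for entry in entries:
    if type(entry) != str:
        result = 'found a non-string entry: {}'.format(entry)
        break
else:
    # Runs only if the loop finished without hitting break.
    result = 'all entries are strings'

print(result)   # found a non-string entry: 13
```

With the `13` removed from `entries`, the loop would run to the end and `result` would be set by the `else` clause instead.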
