# Python 2 HSUTCC: Session 12: String, Bytes, Files, I/O

# Strings

You can include a special character in a string literal by escaping it with `\` (backslash):

In [1]:
s1 = '\' This is a quote'
s2 = '\" This is a double quote'
s3 = '\t This is a tab'
s4 = '\n This is a newline character'
print(s1, s2, s3, s4, sep="\n")

' This is a quote
" This is a double quote
	 This is a tab

 This is a newline character


* The full list of supported escape sequences can be found in the documentation:

https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

### String Formatting
Strings can be formatted in three different ways:

```python
"{!s}".format("hello")

```

```python
"%s, %s, how are you?" % ("Hello", "Sally")
```

```python
price = -25.6321
f"The price is {price:.4f}"
```


In [6]:
"{!s}".format("hello")

'hello'

In [None]:
"{0} {1}".format("One", "two")

'One two'

In [9]:
"{1} {2} {name}".format("one", "two", name="North")

IndexError: Replacement index 2 out of range for positional args tuple

In [14]:
"%s, %s, how are you?" % ("Hello", "Vetit")

'Hello, Vetit, how are you?'

In [39]:
price = -1250283.6821
name = "Vetit"
print_func = print

f"The price is {price:.2f}, the name is {name}, print_func: {print_func}"

'The price is -1250283.68, the name is Vetit, print_func: <built-in function print>'
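
The part after the colon in a replacement field is a format specification, and the same specification works in both `str.format` and f-strings. A minimal sketch, where the variable names `amount` and `label` are made up for illustration:

```python
amount = -1250283.6821
label = "Vetit"

# ',' adds thousands separators, '.2f' keeps two decimal places
with_commas = f"{amount:,.2f}"

# '>10' right-aligns in a field 10 characters wide, '<10' left-aligns
right = f"[{label:>10}]"
left = "[{:<10}]".format(label)

print(with_commas)  # -1,250,283.68
print(right)        # [     Vetit]
print(left)         # [Vetit     ]
```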

# Encodings

## ASCII
http://en.wikipedia.org/wiki/ASCII
ASCII is a 7-bit character encoding, so ASCII codes range from 0 to 127.

$2^7 =128$

`ord()`

Converts a character to its Unicode code point (the same as its ASCII code for characters in the range 0-127)

$41_{16} = 4 \times 16^1 + 1 \times 16^0=65$

In [40]:
char = 'A'
ord(char) # 1000001 = 65

65

In [41]:
char = '!'
ord(char)

33

In [42]:
char = 'あ'
ord(char)

12354

$2^{14} = 16384$, so the code point 12354 needs 14 bits.

Some characters therefore have code points outside the ASCII range 0-127:

In [43]:
char = 'ก'
ord(char)

3585

`chr()`

Converts a code point (an ASCII code for values 0-127) to its character

In [44]:
chr(65)

'A'

In [75]:
chr(0x110000-1)

'\U0010ffff'

In [79]:
0x110000-1

1114111

In [81]:
2**20

1048576

Or with a hex code like `0x68`:

$68_{16} = 6 \times 16^1 + 8 \times 16^0 = 104$

In [82]:
ord('h')

104

In [84]:
chr(0x68)

'h'

In [87]:
map(int, [1,3])

<map at 0x114152ec0>

- Base 16 (hexadecimal): RAM addresses, colors.
- Base 10 (decimal): human interpretation.
- Base 2 (binary): what the computer understands.

In [96]:
hex(int('10101010010101', 2)) # 2A95

'0x2a95'
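
The conversions between bases can be chained in both directions. The sketch below reuses the binary string from the cell above; `format()` is shown as an assumed alternative for getting the digits without the `0x` / `0b` prefix:

```python
n = int("10101010010101", 2)   # parse a binary string into an int
as_hex = hex(n)                # int -> hexadecimal string with '0x' prefix
as_bin = bin(n)                # int -> binary string with '0b' prefix
back = int(as_hex, 16)         # int() accepts the '0x' prefix when base=16

print(n, as_hex, as_bin, back)

# format() produces the same digits without a prefix; '08x' zero-pads to 8 digits
print(format(n, "x"), format(n, "b"), format(n, "08x"))
```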

In [None]:
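# Conversion helpers seen so far:
#   hex(n)     int -> hexadecimal string
#   ord(c)     character -> code point
#   chr(n)     code point -> character
#   bin(n)     int -> binary string
#   int(s, 2)  binary string -> int
hex(137), bin(int('10101', 2))

('0x89', '0b10101')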

In [None]:
hex(ord("A")) # 65 -> 0x41
hex(ord("あ")) # 12354 -> 0x3042
hex(ord("ก")) # 3585 -> 0xe01

'0xe01'

What about those whose ordinal is out of ASCII range (0-127)?

## Unicode Standard

* Unicode is a standard for encoding text in different languages.
* Version 8.0 of the standard contains more than 120,000 symbols.
* In essence, Unicode is a mapping that assigns each symbol a unique number (its code point).

There are several encodings of Unicode code points: UTF-8, UTF-16, UTF-32, UCS-2, UCS-4.
They use different numbers of bytes per character. The fixed 2-byte UCS-2 cannot represent every character, while the others cover the full range.

* UTF-8: 1-4 bytes (from U+0000 to U+10FFFF)
    - Can be used for ASCII and non-ASCII character
    - ASCII takes 1 byte
    - Non-ASCII takes 2-4 bytes
    - Character 'A' is 0x41 (1 byte) in UTF-8
    - Character 'ก' is 0xE0 0xB8 0x81 (3 bytes) in UTF-8

* UTF-16: 2 or 4 bytes
    - UTF-16 uses 2 bytes (16 bits) for most characters.
    - For characters outside the Basic Multilingual Plane (the BMP covers U+0000 to U+FFFF), it uses a pair of 16-bit code units, called a surrogate pair, making 4 bytes in total.
    - UTF-16 is popular for Windows and Java applications.
    - The letter 'A' in UTF-16 is 0x0041 (2 bytes).
    - The character '😊' is 0xD83D 0xDE0A (4 bytes, using a surrogate pair).

* UTF-32: 4 bytes fixed
    - Every single character has 4 bytes.
    - **What are the drawbacks?**
    - The letter 'A' in UTF-32 is 0x00000041 (4 bytes).
    - The character '😊' is 0x0001F60A (4 bytes).

* UCS-2 & UCS-4 (2&4-byte Universal Character Set)
    - UCS-2 and UCS-4 are older encodings that use a fixed 2 and 4 bytes (16 and 32 bits) per character, respectively.
    - UCS-2 can only represent characters within the Basic Multilingual Plane (BMP), which includes characters from U+0000 to U+FFFF.
    - Because it doesn't support surrogate pairs, UCS-2 cannot represent characters outside the BMP (e.g., emojis and some rare symbols).
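
The byte counts listed above can be checked directly with `str.encode`. This is a small sketch rather than part of the original session; the `-le` codec variants are chosen here only so that no byte-order mark is added to the result:

```python
# "A", Thai letter ko kai (U+0E01) and a smiling-face emoji (U+1F60A),
# written as escapes so the example does not depend on the file's encoding
samples = ["A", "\u0e01", "\U0001F60A"]

# The "-le" variants are used so that no byte-order mark (BOM) is added
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

sizes = {
    ch: {enc: len(ch.encode(enc)) for enc in encodings}
    for ch in samples
}

for ch, per_encoding in sizes.items():
    print(f"U+{ord(ch):04X}", per_encoding)
```

UTF-32 always spends 4 bytes, even on plain ASCII text, which is one answer to the drawback question above.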

> How is each string represented in Python?

In [None]:
list(map(ord, "hello")) # ASCII

[104, 101, 108, 108, 111]

These are not UTF-8 bytes! They are the `Unicode code points` of each character.

In [None]:
list(map(ord, "ควาย")) # UCS-2

[3588, 3623, 3634, 3618]

To get the UTF-8 encoding, we use `.encode(ENCODER)` to encode the desired string.

In [1]:
utf8_encoded = "A".encode("utf-8")
print(utf8_encoded)
utf8_bytes = list(utf8_encoded)
print(len(utf8_bytes), utf8_bytes)

b'A'
1 [65]


In [3]:
utf8_encoded = "ฟ".encode("utf-8")
print(utf8_encoded)
utf8_bytes = list(utf8_encoded)
print(len(utf8_bytes), utf8_bytes)

b'\xe0\xb8\x9f'
3 [224, 184, 159]


From the output above, the character is encoded as 3 values: `ฟ` corresponds to 224, 184, 159. Therefore, `ฟ` takes 3 bytes in UTF-8.

To make this clearer, let's convert the bytes to hexadecimal values using `hex()`, this time encoding with UTF-16 (the leading `0xff 0xfe` in the output is the byte-order mark):

In [10]:
utf16_encoded = "ควาย".encode("utf-16")
print(utf16_encoded)
print(list(utf16_encoded))
utf16_hex = [hex(b) for b in utf16_encoded]
print(utf16_hex)

b'\xff\xfe\x04\x0e\'\x0e2\x0e"\x0e'
[255, 254, 4, 14, 39, 14, 50, 14, 34, 14]
['0xff', '0xfe', '0x4', '0xe', '0x27', '0xe', '0x32', '0xe', '0x22', '0xe']


In [8]:
utf16_encoded = "ควาย".encode("utf-16")
print(list(utf16_encoded))

[255, 254, 4, 14, 39, 14, 50, 14, 34, 14]


NICE!

A character outside the `BMP` requires the full 4 bytes:

In [149]:
print(ord('🍆')) # UCS-4
utf8_encoded = '🍆'.encode('utf-8')
list(map(hex, utf8_encoded))

127814


['0xf0', '0x9f', '0x8d', '0x86']
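
For comparison, the same code point in UTF-16 becomes a surrogate pair. The arithmetic below is a sketch of how that pair is derived; the names `offset`, `high` and `low` are chosen here for illustration:

```python
code_point = 0x1F346             # 127814, the value printed by ord() above

offset = code_point - 0x10000    # 20-bit value for code points above U+FFFF
high = 0xD800 + (offset >> 10)   # top 10 bits -> high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low (trail) surrogate

# big-endian UTF-16 without a BOM, so the bytes read in the same order
encoded = chr(code_point).encode("utf-16-be")

print(hex(high), hex(low), encoded.hex())  # 0xd83c 0xdf46 d83cdf46
```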

### A little bit more about Unicode and strings
Python supports escape sequences for Unicode characters

In [150]:
"\u4EE4", "\U00000068"

('令', 'h')

In [154]:
"\N{DOMINO TILE HORIZONTAL-06-04}"

'🁟'

## Bytes and Encodings

### Bytes

- The `bytes` type is an immutable sequence of bytes.

In [156]:
b"\00\42\24\00"

b'\x00"\x14\x00'

Bytes and strings are tightly connected:

In [6]:
encoded = "hi สวัสดี".encode("utf-8")
encoded

b'hi \xe0\xb8\xaa\xe0\xb8\xa7\xe0\xb8\xb1\xe0\xb8\xaa\xe0\xb8\x94\xe0\xb8\xb5'

And it can be decoded with `.decode(DECODER)` method:

In [7]:
encoded.decode("utf-8")

'hi สวัสดี'

You can encode a text string into a sequence of bytes with any of the available codecs (Python supports more than 100 encodings).

In [8]:
encoded_utf8 = "hi สวัสดี привет 🍆💦".encode('utf_8_sig')
encoded_experiment = "hi สวัสดี привет 🍆💦".encode('utf-16')
print(encoded_utf8)
print(encoded_experiment)

b'\xef\xbb\xbfhi \xe0\xb8\xaa\xe0\xb8\xa7\xe0\xb8\xb1\xe0\xb8\xaa\xe0\xb8\x94\xe0\xb8\xb5 \xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82 \xf0\x9f\x8d\x86\xf0\x9f\x92\xa6'
b"\xff\xfeh\x00i\x00 \x00*\x0e'\x0e1\x0e*\x0e\x14\x0e5\x0e \x00?\x04@\x048\x042\x045\x04B\x04 \x00<\xd8F\xdf=\xd8\xa6\xdc"


What if we try to decode with a different codec?

In [9]:
print(encoded_utf8.decode('utf-16'))
print(encoded_experiment.decode('utf-16'))

믯梿⁩룠Ꞹ룠ꪸ룠떸퀠톿퀀킸킲통₂鿰蚍鿰Ꚓ
hi สวัสดี привет 🍆💦
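
A wrong codec does not always produce silent garbage; often it raises an exception instead. The sketch below decodes UTF-8 bytes as ASCII and shows how the `errors` argument changes the outcome (the sample text is an assumption, written with escapes so it does not depend on the file's encoding):

```python
# "hi " followed by a Thai greeting, encoded as UTF-8
data = "hi \u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35".encode("utf-8")

try:
    data.decode("ascii")             # errors="strict" is the default
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("strict:", exc.reason)

ignored = data.decode("ascii", errors="ignore")    # drops undecodable bytes
replaced = data.decode("ascii", errors="replace")  # U+FFFD for each bad byte

print(repr(ignored), replaced.count("\ufffd"))
```

Both `ignore` and `replace` lose information, so they are best kept for cases where a damaged result is acceptable.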


# Files and I/O

`io` stands for input/output.
The `io` module in Python provides tools to handle both input and output streams in a consistent way, whether the data comes from a file, memory, or another source.

### The `open()` function

- Text and binary files in Python are two different stories.
- To create a file object, use the `open` function, which takes one positional argument:
the path to the file.

In [20]:
import os

os.chdir('..')
os.getcwd() # Get current working directory

'c:\\'

In [5]:
open("Python2/north/vetit.txt")

<_io.TextIOWrapper name='Python2/north/vetit.txt' mode='r' encoding='UTF-8'>

- The `open` function has quite a lot of arguments; we are going to look at:
    - `mode`: defines the mode in which to open the file, possible values:
        - "r" , "w" , "x" , "a" , "+" , "b" , "t" .
    - for text files it's also possible to specify `encoding` and `errors`.

Examples:

In [None]:

open("./sample.txt", "r") # for reading only
open("./sample.txt", "r+b") # for reading and writing in binary mode (bytes, not text)
## BEWARE of "w" mode because it deletes (truncates) the existing content!
open("./sample.txt", "w") # for writing
open("./sample.txt", "w+b") # for writing and reading in binary mode (also truncates)
open("./sample.txt", "a+") # for appending and reading
# for creating a new file and writing to it; raises an error if the file already exists
open("./sample.txt", "x")

FileExistsError: [Errno 17] File exists: './sample.txt'
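
The effect of each mode is easier to see on a throwaway file. The following sketch works in a temporary directory, and the file name `modes_demo.txt` is made up; it uses `with`, which closes the file automatically at the end of the block:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "modes_demo.txt")   # hypothetical file name

    with open(path, "x") as f:       # "x": create, fail if it already exists
        f.write("first\n")

    try:
        open(path, "x")              # second attempt: the file exists now
        x_failed = False
    except FileExistsError:
        x_failed = True

    with open(path, "a") as f:       # "a": keep content, write at the end
        f.write("second\n")
    with open(path) as f:            # default mode is "r" (read text)
        after_append = f.read()

    with open(path, "w") as f:       # "w": truncates before writing
        f.write("third\n")
    with open(path) as f:
        after_write = f.read()

print(x_failed, repr(after_append), repr(after_write))
```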

Open a text file with encoding "utf-16" for appending and reading, raising an exception on encoding errors (`errors="strict"`, the default):

In [None]:
open("sample.txt", mode="a+", encoding="utf-16", errors="strict").read()

PermissionError: [Errno 13] Permission denied: 'sample.txt'

### Read

In [1]:
handle = open('sample.txt')
result = handle.read() # read whole file to string
print(result) # read to the end of the file
print('>>>', handle.read(), '<<<<') # file pointer is already at the end -> empty string

foo
bar
boo
boobooboobooboobooboo
>>>  <<<<


Or read a specific number of characters from a file:

In [2]:
handle = open('sample.txt')
print(handle.read(6)) # read first 6 char
print(handle.read(2)) # read another 2 char after first 6

foo
ba
r



Why does reading again at the end of the file return an empty string?

`seek()`

Because the file pointer (the "seeker") is already at the end, so when we try to read further there is nothing left. To move the pointer to a certain position (for example 0, the beginning), we use `<FILE>.seek(<NUMBER>)`

In [None]:
print(handle.seek(33)) # position of the file pointer
print(handle.read(10)) # empty string cause it's the end of file

33



`tell()`

This method returns the current position of the `seeker` (the file pointer)

In [17]:
handle = open('sample.txt')
handle.seek(8)
print(handle.read(16))
print(handle.tell())

boo
booboobooboo
24


`readline()` and `readlines()`

The `readline` and `readlines` methods read one line or all lines, respectively. You can specify a
maximum number of characters to read:

In [19]:
handle = open('Taylor_Swift.txt')
handle.seek(0)
handle.readline()

"We could leave the Christmas lights up 'til January\n"

In [20]:
handle.seek(0)
len(handle.readline())

52

In [21]:
handle.seek(0)
handle.readline(3)

'We '

In [22]:
handle.readline()

"could leave the Christmas lights up 'til January\n"

Read all lines and store them in a `list` of strings, one per line:

In [23]:
handle.seek(0)
print(handle.readlines())
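
A file object can also be iterated directly, which yields one line at a time instead of loading every line at once. A minimal sketch on a temporary file (the file name `lines_demo.txt` is made up for the example):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "lines_demo.txt")   # hypothetical file name
    with open(path, "w") as f:
        f.write("foo\nbar\nboo\n")

    with open(path) as f:
        # iterating the file yields one line at a time, newline included
        stripped = [line.rstrip("\n") for line in f]

    with open(path) as f:
        all_lines = f.readlines()                # whole file as a list

print(stripped)
print(all_lines)
```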



### Write

The `write` method writes a string to a file:

In [72]:
handle = open("sample.txt", "w")
handle.write("avada kedavra\n")
handle.write("Big O")
handle.write("?")
handle.flush()

When you continue writing, the text goes in at the current `seeker` position:

In [73]:
handle.write("?")
handle.flush()

In [74]:
handle.tell()

21

In [75]:
handle.seek(0)
handle.write("!!!")
handle.flush()

Be aware that in `w` mode the existing content of the file is deleted (truncated) as soon as the file is opened

In [78]:
handle2 = open("sample.txt", "w")

You can also write multiple lines with `writelines`:

In [83]:
handle2 = open("sample.txt", "w")
lines = ["foo", "bar", "boo"]
result = list(map(lambda x: x + "\n", lines))
handle2.writelines(result)
handle2.flush()

What happens if we open with `mode="a"` and then call `seek(0)`?

In [91]:
handle = open("sample.txt", "a")
handle.tell()
handle.write("boo")

3

In [85]:
handle.write("boo")

3

In [None]:
handle.flush()

In [None]:
handle.seek(0)
handle.tell()

0

In [None]:
handle.write("!!")
handle.flush()
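
The experiment above can be repeated on a scratch file to see where the text ends up. In this sketch (temporary directory, made-up file name) the write after `seek(0)` still lands at the end of the file, which is how append mode behaves on common platforms:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "append_demo.txt")  # hypothetical file name
    with open(path, "w") as f:
        f.write("foo\n")

    f = open(path, "a")
    f.seek(0)        # move the pointer to the start ...
    f.write("!!")    # ... but in append mode the write still goes to the end
    f.close()

    with open(path) as f:
        content = f.read()

print(repr(content))
```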

### IMPORTANT! `close()`

You need to close the file in order to:
1. Free system resources
2. Ensure data is written to the disk
3. Avoid data corruption

In [92]:
handle2.close()
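
A common way to make sure `close()` is never forgotten is the `with` statement, which closes the file when the block ends, even if an exception is raised inside it. A small sketch on a temporary file (the file name is made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "with_demo.txt")    # hypothetical file name

    with open(path, "w") as handle:
        handle.write("avada kedavra\n")
        inside = handle.closed       # still open inside the block

    outside = handle.closed          # closed as soon as the block ended

    try:
        handle.write("more")         # writing to a closed file fails
        write_failed = False
    except ValueError:
        write_failed = True

print(inside, outside, write_failed)  # False True True
```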

`.flush()`

`flush()` immediately writes the buffered content to the file without closing it

In [None]:
handle.flush() # `handle2` was closed above; flushing a closed file raises ValueError

### `io` module basics

- The `io` module contains basic classes to work with text and binary data.
- The `io.StringIO` class creates a file object from a string, and `io.BytesIO` creates one from bytes:

In [None]:
s = ""
b = b"\x00"
b

b'\x00'

file -> open(file) -> TextIO


In [None]:
import io
handle = io.StringIO("foo\nbar")
handle.readline()

'foo\n'

In [None]:
handle.write("boo")
handle.getvalue()

'foo\nboo'

`s.seek(offset, whence)`

- `offset` → number of characters (or bytes in binary mode) to move
- `whence` → reference point: where to start counting
- Writing overwrites existing characters if the pointer is in the middle
- Writing appends if the pointer is at the end

`.getvalue()`

- Only works for in-memory file objects (`StringIO`, `BytesIO`)
- It's like "peeking inside the file" without touching the pointer
- Useful for checking the current state after multiple writes or reads
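
These rules can be tried on a small in-memory buffer. The sketch below uses the `io.SEEK_END` constant for `whence`; the buffer content and variable names are made up for the example:

```python
import io

buf = io.StringIO("foobar")

buf.seek(3)                 # whence defaults to io.SEEK_SET (start of stream)
buf.write("XY")             # overwrites "ba"; the pointer is now at 5
middle = buf.getvalue()

buf.seek(0, io.SEEK_END)    # jump to the end (text streams only allow offset 0 here)
buf.write("!")              # pointer is at the end, so this appends
pos_before = buf.tell()
final = buf.getvalue()      # reads everything without moving the pointer
pos_after = buf.tell()

print(middle, final, pos_before, pos_after)  # fooXYr fooXYr! 7 7
```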

`io.BytesIO` works similarly:

In [None]:
import io
handle = io.BytesIO(b"foobar")
handle.read(3)

b'foo'

But you cannot create a `BytesIO` from text, or a `StringIO` from bytes:

In [97]:
handle_wrong = io.BytesIO("This is text not a Byte!")
handle_wrong.read()

TypeError: a bytes-like object is required, not 'str'
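
When bytes need to be read as text, one option is to wrap the binary stream in `io.TextIOWrapper`, which decodes while reading. This is a sketch with made-up sample data, not something the session itself covered:

```python
import io

# "hi " plus a Thai greeting, stored as UTF-8 bytes in a binary stream
raw = io.BytesIO("hi \u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35\n".encode("utf-8"))

first_bytes = raw.read(3)           # binary stream -> bytes
raw.seek(0)                         # rewind before wrapping

text = io.TextIOWrapper(raw, encoding="utf-8")   # decodes while reading
line = text.readline()              # text stream -> str

print(first_bytes, repr(line), type(line).__name__)
```

This is also what `open()` returns for text files, as the `_io.TextIOWrapper` output earlier in the notebook shows.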

# Task

Write a function `cut_suffix` which takes a string and a suffix. The function should return the string
without the given suffix.

```python
cut_suffix("foobar", "bar")
>>> "foo"

cut_suffix("foobar", "boo")
>>> "foobar"
```

In [None]:
# Your work here
def cut_suffix(word: str, suffix: str) -> str:
    """
    Remove the given suffix from the string if it exists.
    
    Args:
        word (str): The original string.
        suffix (str): The suffix to remove.
        
    Returns:
        str: The string without the suffix if it was present, otherwise the original string.
    """
    # -len(suffix) counts from the right end of the string, so the slice
    # word[-len(suffix):] takes the last len(suffix) characters of the word.
    if len(suffix) <= len(word) and word[-len(suffix):] == suffix:
        return word[:-len(suffix)] # everything except the last len(suffix) characters (the suffix)
    return word

# Suffix to remove
print(cut_suffix("foobar", "bar"))
print(cut_suffix("foobar", "boo"))
print(cut_suffix("foobar", "boooooooooooo"))


foo
foobar
foobar
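
For comparison, here are two shorter variants. They are sketches, not the expected solution: `str.removesuffix` assumes Python 3.9 or newer, and both function names are made up for the example:

```python
def cut_suffix_builtin(word: str, suffix: str) -> str:
    """Hypothetical variant of cut_suffix built on str.removesuffix (Python 3.9+)."""
    return word.removesuffix(suffix)


def cut_suffix_endswith(word: str, suffix: str) -> str:
    """Hypothetical variant for older versions, using str.endswith."""
    # The `suffix and` check matters: with an empty suffix, word[:-0]
    # would be an empty string instead of the whole word.
    if suffix and word.endswith(suffix):
        return word[: -len(suffix)]
    return word


print(cut_suffix_builtin("foobar", "bar"), cut_suffix_endswith("foobar", "boo"))
```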


Write a function `boxed` which takes a string and two arguments: a symbol `fill` and a number
`pad`. The result of calling `boxed` should be the string surrounded by `fill` symbols, as
shown in the example.

```python
print(boxed("Hello world", fill="*", pad=2))
print(boxed("Fishy", fill="#", pad=1))
```

result:

```md
*****************
** Hello world **
*****************

#########
# Fishy #
#########
```

In [None]:
# Your work here
def boxed(text: str, fill: str, pad: int) -> str:
    """
    Creates a text box around a string with a specified fill character and padding.

    Args:
        text (str): The string to put inside the box.
        fill (str): The character used to create the box border.
        pad (int): Number of fill characters to add as padding around the text.

    Returns:
        str: The formatted string with the given fill character and padding.
    """
    
    # length of the top/bottom line
    total_length = len(text) + (pad * 2) + 2
    
    # top and bottom lines
    border = fill * total_length
    
    # middle line with padding
    middle = f"{fill * pad} {text} {fill * pad}"
    
    # combine everything
    return f"{border}\n{middle}\n{border}"

# Examples
print(boxed("Hello world", fill="*", pad=2))
print()
print(boxed("Fishy", fill="#", pad=1))


*****************
** Hello world **
*****************

#########
# Fishy #
#########
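
The middle line can also be built with `str.center`, which pads both sides with a fill character. This is a sketch of an alternative, and it assumes `fill` is exactly one character because `str.center` requires that:

```python
def boxed_center(text: str, fill: str, pad: int) -> str:
    """Hypothetical variant of boxed that lets str.center add the padding."""
    inner = f" {text} "                   # one space on each side of the text
    width = len(inner) + 2 * pad          # total length of every line
    border = fill * width
    middle = inner.center(width, fill)    # pads both sides with `fill`
    return f"{border}\n{middle}\n{border}"


print(boxed_center("Fishy", fill="#", pad=1))
```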



---



**She-bang** is the sequence `#!`, which is used in Unix-like systems to run executable scripts. The **she-bang**
is always written on the first line of the script. The **she-bang** is followed by the path to an interpreter program,
for example:

`#! /bin/sh`

`#!/usr/bin/env python -v`

> Look at more example of she-bang here: http://en.wikipedia.org/wiki/Shebang_(Unix)


Write a function `parse_shebang` which takes a path to an executable script and returns the path to the
interpreter program if the script contains a `she-bang`, and None otherwise.
For the scripts in the example above:
```python
parse_shebang("./example1.txt")
>>> "/bin/sh"

parse_shebang("./example2.txt")
>>> "/usr/bin/env python -v"
```

In [None]:
# Your work here
def parse_shebang(file_path: str) -> str|None:
    """
    Reads the first line of a script file and returns the interpreter path if it contains a she-bang.

    Parameters:
        file_path (str): The path to the script file.

    Returns:
        str or None: The interpreter path specified in the script if she-bang is present, None if the file does not start with a she-bang.
                     
    Notes:
        - The function reads only the first line of the file.
        - Leading/trailing whitespace, tabs, and newline characters are removed from the result.
    """
      
    file = open(file_path, "r")
    first_line = file.readline()
    file.close()

    if len(first_line) >= 2 and first_line[0] == "#" and first_line[1] == "!":
        return first_line[2:].strip() # strip can remove spaces at the start and the end of string, tabs and newlines
    else:
        return None
    
# len(first_line) >= 2 avoids an index-out-of-range error in the 2nd and 3rd conditions when the first line is empty or has fewer than 2 characters


In [30]:
# Testing Result: example 1
parse_shebang("Python2-session12-file-for-she-bang-task/example1.txt")

'/bin/sh'

In [31]:
# Testing Result: example 2
parse_shebang("Python2-session12-file-for-she-bang-task/example2.txt")

'/usr/bin/env python -v'
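
The same check can be written with `str.startswith`, which is safe on empty lines and makes the length test unnecessary. The function name and the temporary test files below are made up for this sketch:

```python
import os
import tempfile
from typing import Optional


def parse_shebang_with(file_path: str) -> Optional[str]:
    """Hypothetical variant of parse_shebang using `with` and str.startswith."""
    with open(file_path, "r") as file:    # closed even if readline() raises
        first_line = file.readline()
    if first_line.startswith("#!"):       # safe on empty or 1-character lines
        return first_line[2:].strip()
    return None


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "example.sh")     # made-up file for the demo
    with open(script, "w") as f:
        f.write("#! /bin/sh\necho hello\n")
    plain = os.path.join(tmp, "plain.txt")       # made-up file without she-bang
    with open(plain, "w") as f:
        f.write("no she-bang here\n")

    found = parse_shebang_with(script)
    missing = parse_shebang_with(plain)

print(found, missing)  # /bin/sh None
```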

# Special Task (Bonus 4%)

A probabilistic language model describes pieces of text in some language in terms of random processes.
One of the simplest language models can be stated as follows. Let's assume that we know the set of all
words in a language. Let's generate the words of a sentence from left to right, one by one:

- Randomly take the first two words from the set of all words.
- Generate each i-th word from the two previous words, the (i - 1)-th and (i - 2)-th.

Let's try to build a language model based on lyrics of Taylor Swift's songs!

1. Write a function `words` which takes a text file and returns a list of words from a file:
```python
import io
handle = io.StringIO("""Can we always be this close forever and ever?
And ah, take me out, and take me home forever and ever.""")
words(handle)
>>> ['Can', 'we', 'always', 'be', 'this', 'close', 'forever', 'and', 'ever?\n', 'And', 'ah,', 'take', 'me', 'out,', 'and', 'take', 'me', 'home', 'forever', 'and', 'ever.']
```

**Mind that punctuation and new-line characters stay unchanged!!!**

In [None]:
import io

def words(file_handle: io.StringIO) -> list[str]:
    """
    Reads the content of a file-like object and returns a list of words.

    Args:
        file_handle (io.StringIO): A file-like object containing text.

    Returns:
        list[str]: A list of words in the text, split by whitespace.
    """
    content = file_handle.read()
    return content.split()

import io
handle = io.StringIO("""Can we always be this close forever and ever?
And ah, take me out, and take me home forever and ever.""")

print(words(handle))

['Can', 'we', 'always', 'be', 'this', 'close', 'forever', 'and', 'ever?', 'And', 'ah,', 'take', 'me', 'out,', 'and', 'take', 'me', 'home', 'forever', 'and', 'ever.']
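
Note that `str.split()` with no argument splits on every kind of whitespace, so the newline after `ever?` is dropped, while the task example keeps it. One possible way to keep it, sketched here under the assumption that words are separated by single spaces, is to split each line on `" "` only (the function name is made up):

```python
import io


def words_keep_newlines(file_handle) -> list[str]:
    """Hypothetical variant of words() that splits on spaces only."""
    result = []
    for line in file_handle:              # each line keeps its trailing "\n"
        # skip the empty pieces produced by repeated spaces
        result.extend(piece for piece in line.split(" ") if piece)
    return result


handle = io.StringIO("""Can we always be this close forever and ever?
And ah, take me out, and take me home forever and ever.""")
print(words_keep_newlines(handle))
```

With this variant an empty line in the input would show up as a `"\n"` word of its own, which may or may not be wanted.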


---

2. Write a function `transition_matrix` which takes a list of words and returns a dictionary. This
dictionary for every pair of words `(u, v)` contains a list of words `w` which follow words `u` and `v` in the
input list of words. For the example above:

```python
language = words(handle)
m = transition_matrix(language)
m[("take", "me")]
>>> ["out,", "home"]

m[("we", "always")]
>>> ["be"]

m[("forever", "and")]
>>> ["ever?\n", "ever."]
```

In [None]:
import io

def words(file_handle: io.StringIO) -> list[str]:
    """
    Reads the content of a file-like object and returns a list of words.

    Args:
        file_handle (io.StringIO): A file-like object containing text.

    Returns:
        List[str]: A list of words in the text, split by whitespace.
    """

    text = file_handle.read()
    result = text.split()
    return result

def transition_matrix(words_list: list[str]) -> dict[tuple[str, str], list[str]]:
    """
    Creates a transition matrix (dictionary)
    
    Each key is a tuple of two consecutive words (u, v), and the value
    is a list of words w that follow the pair (u, v)

    Args:
        words_list (list[str]): List of words from a text.

    Returns:
        dict[tuple[str, str], list[str]]: The transition matrix.
    """

    matrix = {} # empty dict that will store the pairs of words as keys and list of following words as values

    for i in range(len(words_list) - 2): # stop 2 short of the end so that i + 2 is always a valid index
        u = words_list[i]
        v = words_list[i + 1]
        w = words_list[i + 2]

        key = (u, v)

        if key not in matrix:
            matrix[key] = []

        matrix[key].append(w)

    return matrix

handle = io.StringIO("""Can we always be this close forever and ever?
And ah, take me out, and take me home forever and ever.""")


lyric = words(handle) # lyric is a list of each individual word from the text

m = transition_matrix(lyric) # create dictionary

print(m[("take", "me")])
print(m[("we", "always")])
print(m[("forever", "and")])
print(m[("And", "ah,")])


['out,', 'home']
['be']
['ever?', 'ever.']
['take']
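
The index arithmetic can be avoided by zipping the list with two shifted copies of itself, and `collections.defaultdict` removes the need for the `if key not in matrix` check. This is a sketch of an equivalent variant with a made-up name and sample text:

```python
from collections import defaultdict


def transition_matrix_zip(words_list: list[str]) -> dict[tuple[str, str], list[str]]:
    """Hypothetical variant of transition_matrix using zip and defaultdict."""
    matrix = defaultdict(list)            # missing keys start as an empty list
    # zip stops at the shortest input, so only complete triples are produced
    for u, v, w in zip(words_list, words_list[1:], words_list[2:]):
        matrix[(u, v)].append(w)
    return dict(matrix)                   # plain dict: unknown pairs raise KeyError


sample = "and take me out, and take me home".split()
m = transition_matrix_zip(sample)
print(m[("take", "me")])  # ['out,', 'home']
```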


---

3. Write a function `markov_chain` which generates sentences of a given size. The function takes three
parameters:
- a list of words, the result of calling the `words` function,
- a dictionary, built with the `transition_matrix` function,
- an integer: the number of words in the sentence to be generated.


As a reminder, here is how to generate random sentences. Generate the words of a sentence from left to right,
one by one:
- Randomly take the first two words from the list of all words.
- Generate each `i`-th word from the previous two words, the `(i - 1)`-th and `(i - 2)`-th (with the help of `transition_matrix`).
- If this pair does not exist (it is not in the `transition_matrix` dictionary), then the `i`-th word is taken randomly from the set of all
words.

You will need functions `random.randint` and `random.choice`.

In [None]:
import random

def markov_chain(words_list: list[str], matrix: dict[tuple[str, str], list[str]], n_words: int) -> list[str]:
    """
    Generates a sentence of n_words using a Markov chain.

    Args:
        words_list (list): List of all words (from words()).
        matrix (dict[tuple[str, str], list[str]]): Transition matrix from transition_matrix().
        n_words (int): Number of words in the generated sentence.

    Returns:
        list[str]: A list of words forming the generated sentence.
    """
    
    # Pick two first words randomly
    first_word = random.randint(0, len(words_list) - 1)
    second_word = random.randint(0, len(words_list) - 1)
    sentence = [words_list[first_word], words_list[second_word]]

    # Generate remaining words
    for index in range(2, n_words):
        u, v = sentence[index-2], sentence[index-1]
        key = (u, v)
        if key in matrix:
            next_word = random.choice(matrix[key]) # randomly choose a value from that key
        else:
            next_word = random.choice(words_list)
        sentence.append(next_word)

    return sentence


handle = io.StringIO("""We could leave the Christmas lights up 'til January
And this is our place, we make the rules
And there's a dazzling haze, a mysterious way about you dear
Have I known you 20 seconds or 20 years?
Can I go where you go?
Can we always be this close forever and ever?
And ah, take me out, and take me home
You're my, my, my, my
Lover""")

lyric = words(handle)
m = transition_matrix(lyric)

sentence = markov_chain(lyric, m, 8)
print(" ".join(sentence))


And a Have dear place, lights ah, the


- `randint` chooses an INDEX
- `choice` chooses an ITEM from the list

output = And a Have dear place, lights ah, the

**Third word** index 2
sentence = ["And", "a"]
u = "And"
v = "a"
key = ("And", "a")
next_word = "Have" (chosen at random from the whole list of words, because the key ("And", "a") does not exist in the matrix from the previous task)

**Fourth word** index 3
sentence = ["And", "a", "Have"]
u = "a"
v = "Have"
key = ("a", "Have") (not in matrix)
next_word = "dear"

**Fifth word** index 4
sentence = ["And", "a", "Have", "dear"]
u = "Have"
v = "dear"
key = ("Have", "dear") (not in matrix)
next_word = "place,"

sentence = ["And", "a", "Have", "dear", "place,"]

...
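
Because the output is random, a trace like the one above cannot be reproduced on the next run. One way to make runs repeatable, sketched here with a made-up function name and a tiny sample text, is to give the generator its own seeded `random.Random` instance:

```python
import random


def markov_chain_seeded(words_list, matrix, n_words, seed):
    """Hypothetical variant of markov_chain with its own seeded generator."""
    rng = random.Random(seed)             # independent of the global random state
    sentence = [rng.choice(words_list), rng.choice(words_list)]
    while len(sentence) < n_words:
        followers = matrix.get((sentence[-2], sentence[-1]))
        # fall back to any word when the pair was never seen
        sentence.append(rng.choice(followers or words_list))
    return sentence[:n_words]


lyric = "and take me out, and take me home".split()
matrix = {}
for u, v, w in zip(lyric, lyric[1:], lyric[2:]):
    matrix.setdefault((u, v), []).append(w)

first = markov_chain_seeded(lyric, matrix, 8, seed=12)
second = markov_chain_seeded(lyric, matrix, 8, seed=12)
print(first == second, len(first))  # True 8
```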

---

4. Write a function `taylor_swifter()` which takes a path to a file `taylor_swift.txt` and an
integer, the length of a sentence, and returns a sentence of the specified length in Taylor Swift's
language.

```python
print(taylor_swifter("./taylor_swift.txt", 30))

>>>'well dancing pack I got this music in my head tell me to the garden? In the garden, would you trust me If I told you it was never mine'
```

In [None]:
def taylor_swifter(file_path: str, length: int) -> str:
    """
    Generates a sentence using words from Taylor_Swift.txt

    Args:
        file_path (str): Path to the text file containing Taylor Swift lyrics.
        length (int): Number of words to generate.

    Returns:
        str: Generated sentence as a single string.
    """

    file = open(file_path, "r")
    text = file.read()
    file.close()

    words_list = text.split()
    matrix = transition_matrix(words_list)
    sentence_list = markov_chain(words_list, matrix, length)
    
    return " ".join(sentence_list)

print(taylor_swifter("./Taylor_Swift.txt", 30))


it's me again that I'm don't even is shake the I I wait by the ache in you put there by the door with you The lingering question kept me


> The lyrics come from well-known Taylor Swift songs and are used for educational purposes only.