<a href="https://colab.research.google.com/github/Ghonem22/Learning/blob/main/Python3%20object%20oriented%20programming/Ch8%2C%20Strings%20and%20Serialization/Strings_and_Serialization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CH8: Strings and Serialization

## What will we cover in this chapter:
    
* The complexities of strings, bytes, and byte arrays
* The ins and outs of string formatting
* A few ways to serialize data
* The mysterious regular expression

## Strings

Python strings are all represented in Unicode, a character definition standard that can represent virtually any character in any language on the planet. So, let's think of Python 3 strings as an immutable sequence of Unicode characters.

---
**What's Unicode?**

Think of Unicode as a massive version of the ASCII table, one that has 1,114,112 possible code points. That's 0 through 1,114,111, or 0 through 17 * 2^16 - 1, or 0x10FFFF in hexadecimal. In fact, ASCII is a perfect subset of Unicode: the first 128 code points in the Unicode table correspond precisely to the ASCII characters.
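
The numbers above can be checked from the interpreter. This is a minimal sketch that relies only on the built-ins `ord` and `chr` and on `sys.maxunicode`; the variable names are chosen here for illustration.

```python
import sys

# The highest code point Python can represent is 0x10FFFF.
max_code_point = sys.maxunicode
total_code_points = 17 * 2**16  # 17 planes of 65,536 code points each

print(hex(max_code_point))      # 0x10ffff
print(total_code_points)        # 1114112

# ord() maps a character to its code point; chr() goes the other way.
print(ord("a"))                 # 97, the same value as in ASCII
print(chr(233))                 # é, outside ASCII but still one code point
```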

### String manipulation

In [None]:
a = "hello"
b = 'world'
c = '''a multiple
    line string'''
d = """More
    multiple"""
e = ("Three " "Strings "
    "Together")

In [None]:
e

'Three Strings Together'

**The str class has numerous methods on it to make manipulating strings easier.**



In [None]:
s = "hello world"

# how many times a given substring shows up in the string
s.count('l')

3

In [None]:
# the position of a given substring within the original string
s.find('l')

2

In [None]:
s.rfind('l')

9

In [None]:
s.index('h')

0

In [None]:
s.rindex('h')

0

* **The 'r' methods (rfind, rindex, ...) start searching from the end of the string**

* **The find methods return -1 if the substring can't be found**

* **index raises a ValueError if the substring can't be found**
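
The difference between the two families is easiest to see with a substring that is not present. A small sketch, assuming the same `s` as above (the other variable names are illustrative only):

```python
s = "hello world"

# find/rfind signal "not found" with -1.
missing_position = s.find("z")
print(missing_position)              # -1

# index/rindex raise ValueError instead.
try:
    s.index("z")
except ValueError as error:
    index_error = str(error)
    print("index failed:", index_error)

# Searching from the right returns the position of the last occurrence.
print(s.rfind("o"), s.rindex("o"))   # 7 7
```

Use `find` when a missing substring is a normal case you want to test for, and `index` when it should be treated as an error.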

In [None]:
# capitalize first char
s.capitalize()

'Hello world'

In [None]:
# capitalize first char of every word
s.title()

'Hello World'

In [None]:
s = "hello world, how are you"
s2 = s.split(' ')
s2

['hello', 'world,', 'how', 'are', 'you']

In [None]:
'#'.join(s2)

'hello#world,#how#are#you'

In [None]:
s.replace(' ', '**')


'hello**world,**how**are**you'

In [None]:
# The partition and rpartition methods split the string at only the first or last
# occurrence of the substring, and return a tuple of three values: the characters
# before the substring, the substring itself, and the characters after it.

s.partition(' ')

('hello', ' ', 'world, how are you')
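
To compare `partition` with `rpartition`, and to see what happens when the separator is missing, here is a short sketch using the same sentence (the unpacked names are illustrative):

```python
s = "hello world, how are you"

# partition splits at the first occurrence of the separator.
before, separator, after = s.partition(" ")
print((before, separator, after))        # ('hello', ' ', 'world, how are you')

# rpartition splits at the last occurrence instead.
r_before, r_separator, r_after = s.rpartition(" ")
print((r_before, r_separator, r_after))  # ('hello world, how are', ' ', 'you')

# When the separator is absent, partition returns the whole string first.
no_separator = s.partition("#")
print(no_separator)                      # ('hello world, how are you', '', '')
```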

### String formatting

In [None]:
template = "Hello {}, you are currently {}."
print(template.format('Dusty', 'writing'))

Hello Dusty, you are currently writing.


In [None]:
template = "Hello {0}, you are {1}. Your name is {0}."
print(template.format('Dusty', 'writing'))

Hello Dusty, you are writing. Your name is Dusty.


In [None]:
print("{} {label} {}".format("x", "y", label="z"))

x z y


In [None]:
emails = ("a@example.com", "b@example.com")
message = {
    'subject': "You Have Mail!",
    'message': "Here's some mail for you!"
    }

template = """
    From: <{0[0]}>
    To: <{0[1]}>
    Subject: {message[subject]}
    {message[message]}"""

print(template.format(emails, message=message))


    From: <a@example.com>
    To: <b@example.com>
    Subject: You Have Mail!
    Here's some mail for you!


### Object lookups

**We can also pass arbitrary objects as parameters, and use the dot notation to look up attributes on those objects.**

In [None]:
class EMail:
    def __init__(self, from_addr, to_addr, subject, message):
        self.from_addr = from_addr
        self.to_addr = to_addr
        self.subject = subject
        self.message = message


email = EMail("a@example.com", "b@example.com",
                "You Have Mail!",
                "Here's some mail for you!")

In [None]:
template = """
            From: <{0.from_addr}>
            To: <{0.to_addr}>
            Subject: {0.subject}
            {0.message}"""

print(template.format(email))


            From: <a@example.com>
            To: <b@example.com>
            Subject: You Have Mail!
            Here's some mail for you!


### Making it look right

**What if the variables need a bit of coercion to make them look right in the output?**

In [None]:
subtotal = 12.32
tax = subtotal * 0.07
total = subtotal + tax


print("Sub: ${0} Tax: ${1} Total: ${total}".format(
            subtotal, tax, total=total))

Sub: $12.32 Tax: $0.8624 Total: $13.182400000000001


In [None]:
print("Sub: ${0:0.2f} Tax: ${1:0.2f} "
    "Total: ${total:0.2f}".format(
    subtotal, tax, total=total))

Sub: $12.32 Tax: $0.86 Total: $13.18


**We can also specify that each number should take up a particular number of characters on the screen by placing a value before the period in the precision.**

In [None]:
orders = [('burger', 2, 5),
        ('fries', 3.5, 1),
        ('cola', 1.75, 3)]


print("PRODUCT QUANTITY PRICE SUBTOTAL")

for product, price, quantity in orders:
    subtotal = price * quantity
    print("{0:10s}{1: ^9d} ${2: <8.2f}${3: >7.2f}".format(
        product, quantity, price, subtotal))

PRODUCT QUANTITY PRICE SUBTOTAL
burger        5     $2.00    $  10.00
fries         1     $3.50    $   3.50
cola          3     $1.75    $   5.25


* **{0:10s}: The s means it is a string variable, and the 10 means it should take up ten characters.**
* **{1: ^9d}: The d represents an integer value. The 9 tells us the value should take up nine characters. The caret character ^ tells us that the number should be centered within this available padding, and the space before it is the fill character.**
* **{2: <8.2f} and {3: >7.2f}: The < and > symbols align the numbers to the left or right, respectively, within a minimum width of eight or seven characters.**
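
Each specifier can also be tried on its own with the built-in `format` function, which takes the part after the colon. A minimal sketch using the values from the cola row; `repr` is used so the padding is visible:

```python
# Each call applies one format specifier to one value.
name_cell = format("cola", "10s")        # strings are left-aligned by default
quantity_cell = format(3, " ^9d")        # centered in 9 characters
price_cell = format(1.75, " <8.2f")      # left-aligned, 2 decimal places
subtotal_cell = format(5.25, " >7.2f")   # right-aligned, 2 decimal places

for cell in (name_cell, quantity_cell, price_cell, subtotal_cell):
    print(repr(cell))
# 'cola      '
# '    3    '
# '1.75    '
# '   5.25'
```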


In [None]:
import datetime
print("{0:%Y-%m-%d %I:%M%p }".format(
    datetime.datetime.now()))

2021-12-11 01:31PM 


## Strings are Unicode

* We have described strings as immutable sequences of Unicode characters. But Unicode isn't really a storage format.

* In contrast, bytes are the lowest-level storage format in computing. Each byte represents 8 bits, usually described as an integer between 0 and 255, or a hexadecimal equivalent between 0x00 and 0xFF.

* Bytes don't represent anything specific; a sequence of bytes may store characters of an encoded string, or pixels in an image.

* If we print a bytes object, any bytes that map to ASCII representations will be printed as their original character.

* The character "a" is represented by the same byte as the integer 97, which is the hexadecimal number 0x61. Specifically, all of these are an interpretation of the binary pattern 01100001.

* Many I/O operations only know how to deal with bytes, even if the bytes object refers to textual data. It is therefore vital to know how to convert between bytes and Unicode.
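
The claim that "a", 97, 0x61, and 01100001 are the same byte can be checked directly. A short sketch using only built-ins (the variable names are illustrative):

```python
data = b"a"

# Indexing a bytes object yields an integer between 0 and 255.
first_byte = data[0]
print(first_byte)                  # 97
print(hex(first_byte))             # 0x61
print(format(first_byte, "08b"))   # 01100001

# Bytes that map to ASCII print as characters; others print as \x escapes.
mixed = bytes([0x63, 0x6C, 0xE9])
print(mixed)                       # b'cl\xe9'
```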

### Converting bytes to text

* **We can convert a sequence of bytes to Unicode text using the .decode method on the bytes class.**
* **This method accepts a string naming the character encoding.**
* **There are many such names; common ones for Western languages include ASCII, UTF-8, and latin-1.**


In [None]:
characters = b'\x63\x6c\x69\x63\x68\xe9'   # the b prefix marks a bytes literal
print(characters)
print(characters.decode("latin-1"))

b'clich\xe9'
cliché


### Converting text to bytes

In [None]:
characters = "cliché"
print(characters.encode("UTF-8"))
print(characters.encode("latin-1"))
print(characters.encode("CP437"))


b'clich\xc3\xa9'
b'clich\xe9'
b'clich\x82'


**The accented character is represented as a different byte for each encoding; if we use the wrong one when we are decoding bytes to text, we get the wrong character.**
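
To make the "wrong character" case concrete, the sketch below encodes with UTF-8 and then decodes with other encodings. The variable names are illustrative:

```python
utf8_bytes = "cliché".encode("UTF-8")

# Decoding with the wrong encoding does not fail here; it silently
# produces the wrong text (each of the two UTF-8 bytes becomes a character).
wrong_text = utf8_bytes.decode("latin-1")
print(wrong_text)                  # clichÃ©

# ASCII cannot represent bytes above 127, so this raises instead.
try:
    utf8_bytes.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)                # True

# Decoding with the encoding that was used to encode round-trips cleanly.
right_text = utf8_bytes.decode("UTF-8")
print(right_text)                  # cliché
```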

## Regular expressions

**Regular expressions are used to solve a common problem: Given a string, determine whether that string matches a given pattern and, optionally, collect substrings that contain relevant information. They can be used to answer questions like:**

* **Is this string a valid URL?**
* **What is the date and time of all warning messages in a log file?**
* **Which users in /etc/passwd are in a given group?**
* **What username and document were requested by the URL a visitor typed?**


### Matching patterns

**Regular expressions rely on special characters to match unknown strings**

In [None]:
import re

search_string = "hello world"
pattern = "hello world"

match = re.match(pattern, search_string)
if match:
    print("regex matches")

regex matches

**Note that the match function matches the pattern against the beginning of the string:**

* **If the pattern were "ello world", no match would be found.**
* **The pattern "hello wo" would match successfully.**
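
The sketch below checks these cases and contrasts `re.match` with `re.search`, which scans the whole string rather than only its beginning:

```python
import re

search_string = "hello world"

# re.match only succeeds when the pattern matches at position 0.
match_results = {}
for pattern in ("hello world", "hello wo", "ello world"):
    match_results[pattern] = re.match(pattern, search_string) is not None
    print(repr(pattern), "matches" if match_results[pattern] else "does not match")

# re.search finds the pattern anywhere in the string.
found = re.search("ello world", search_string)
print(found.start())   # 1
```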

In [None]:
# This code reads its arguments from sys.argv, so it is meant to be run from the command line

import sys
import re

pattern = sys.argv[1]
search_string = sys.argv[2]
match = re.match(pattern, search_string)

if match:
    template = "'{}' matches pattern '{}'"
else:
    template = "'{}' does not match pattern '{}'"
    
print(template.format(search_string, pattern))

# After saving this code in a file named regex_generic.py, run these commands from the command line:
# python regex_generic.py "hello worl" "hello world"         # This will match
# python regex_generic.py "ello world" "hello world"         # This won't match



### Matching a selection of characters

**A period in the pattern matches any single character, so you don't care what that character is:**

* 'hello world' matches pattern 'hel.o world'

* 'helpo world' matches pattern 'hel.o world'

* 'hel o world' matches pattern 'hel.o world'

* 'helo world' does not match pattern 'hel.o world': this does not match because there is no character at the period's position in the pattern.

---
**We can put a set of characters inside square brackets to match any one of those
characters.** 

* 'hello world' matches pattern 'hel[lp]o world'
* 'helpo world' matches pattern 'hel[lp]o world'
* 'helPo world' does not match pattern 'hel[lp]o world'

---
**The dash character, in a character set, will create a range. This is especially useful if you want to match "all lower case letters", "all letters", or "all numbers" as follows:**

* 'hello world' does not match pattern 'hello [a-z] world'
* 'hello b world' matches pattern 'hello [a-z] world'
* 'hello B world' matches pattern 'hello [a-zA-Z] world'
* 'hello 2 world' matches pattern 'hello [a-zA-Z0-9] world'
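
The cases in this section can be verified with a small helper. The `matches` function is a hypothetical convenience wrapper around `re.match`, not part of the `re` module:

```python
import re

def matches(pattern, search_string):
    """Hypothetical helper: True if pattern matches at the start of search_string."""
    return re.match(pattern, search_string) is not None

cases = [
    ("hel.o world", "helpo world"),               # period matches any one character
    ("hel.o world", "helo world"),                # nothing left for the period to consume
    ("hel[lp]o world", "helPo world"),            # sets are case-sensitive
    ("hello [a-z] world", "hello b world"),       # range of lowercase letters
    ("hello [a-zA-Z0-9] world", "hello 2 world"), # several ranges in one set
]

results = [matches(pattern, text) for pattern, text in cases]
print(results)   # [True, False, False, True, True]
```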

### Escaping characters

---
**Here's a regular expression to match two digit decimal numbers between 0.00 and 0.99:**

* '0.05' matches pattern '0\.[0-9][0-9]'
* '005' does not match pattern '0\.[0-9][0-9]'
* '0,05' does not match pattern '0\.[0-9][0-9]'

---

This backslash escape sequence is used for a variety of special characters in regular expressions. You can use \[ to insert a square bracket without starting a character class, and \( to insert a parenthesis, which we'll later see is also a special character. Escape sequences can also represent special characters such as newlines (\n) and tabs (\t).


Further, some character classes can be represented more succinctly using escape strings; \s represents whitespace characters, \w represents letters, numbers, and underscore, and \d represents a digit:


* '(abc]' matches pattern '\(abc\]'
* ' 1a' matches pattern '\s\d\w'
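* '\t5n' does not match pattern '\s\d\w' when \t arrives as a literal backslash followed by t (as it does from the command line); an actual tab character would match \s
* '5n' does not match pattern '\s\d\w': there is no leading whitespace character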
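
When these patterns are written in Python source, raw strings (the `r` prefix) keep the backslashes intact for the regex engine. A brief sketch; note how a real tab differs from a literal backslash followed by `t`:

```python
import re

decimal = r"0\.[0-9][0-9]"      # escaped period matches only a literal "."
print(bool(re.match(decimal, "0.05")))     # True
print(bool(re.match(decimal, "005")))      # False

shorthand = r"\s\d\w"           # whitespace, digit, word character
space_case = bool(re.match(shorthand, " 1a"))
tab_case = bool(re.match(shorthand, "\t5n"))       # "\t" is a real tab here
literal_case = bool(re.match(shorthand, r"\t5n"))  # backslash + "t", no whitespace
bare_case = bool(re.match(shorthand, "5n"))        # no whitespace at all

print(space_case, tab_case, literal_case, bare_case)   # True True False False
```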

### Matching multiple characters:

**The asterisk (*) character says that the previous pattern can be matched zero or more times. In the examples below, the previous pattern is the l character.**

* 'hello' matches pattern 'hel*o'
* 'heo' matches pattern 'hel*o'
* 'helllllo' matches pattern 'hel*o'

---

**We can combine the asterisk with patterns that match multiple characters: .* matches any string, whereas [a-z]* matches any run of lowercase letters, including the empty string:**

* 'A string.' matches pattern '[A-Z][a-z]* [a-z]*\.'
* 'No .' matches pattern '[A-Z][a-z]* [a-z]*\.'
* '' matches pattern '[a-z]*.*'

The plus (+) sign in a pattern behaves similarly to an asterisk; it states that the previous pattern can be repeated one or more times. Unlike with the asterisk, the pattern is not optional: it must appear at least once.

The question mark (?) ensures a pattern shows up exactly zero or one times, but not more. 

* '0.4' matches pattern '\d+\.\d+'
* '1.002' matches pattern '\d+\.\d+'
* '1.' does not match pattern '\d+\.\d+'
* '1%' matches pattern '\d?\d%'
* '99%' matches pattern '\d?\d%'
* '999%' does not match pattern '\d?\d%'
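
The quantifier examples above can be reproduced as follows. This is a sketch; the pattern and dictionary names are chosen here for illustration:

```python
import re

decimal_number = r"\d+\.\d+"   # one or more digits, a period, one or more digits
percentage = r"\d?\d%"         # one or two digits followed by a percent sign

decimal_results = {
    text: bool(re.match(decimal_number, text)) for text in ("0.4", "1.002", "1.")
}
percentage_results = {
    text: bool(re.match(percentage, text)) for text in ("1%", "99%", "999%")
}

print(decimal_results)      # {'0.4': True, '1.002': True, '1.': False}
print(percentage_results)   # {'1%': True, '99%': True, '999%': False}
```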

### Grouping patterns together

**What if we want a repeating sequence of characters?**
Enclosing any set of patterns in parentheses allows them to be treated as a single pattern when applying repetition operations. Compare these patterns:

* 'abccc' matches pattern 'abc{3}'
* 'abccc' does not match pattern '(abc){3}'
* 'abcabcabc' matches pattern '(abc){3}'

---
1. 'Eat.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'

The first word starts with a capital, followed by zero or more lowercase letters.
Then, we enter a parenthetical that matches a single space followed by a word of
one or more lowercase letters. This entire parenthetical is repeated zero or more
times, and the pattern is terminated with a period. There cannot be any other
characters after the period, as indicated by the $ matching the end of string.

2. 'Eat more good food.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'

3. 'A good meal.' matches pattern '[A-Z][a-z]*( [a-z]+)*\.$'
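
A short sketch of the grouping examples. The last input, "eat more.", is an extra case added here to show a failure (it does not start with a capital letter):

```python
import re

# Without parentheses, {3} applies only to the "c".
print(bool(re.match(r"abc{3}", "abccc")))          # True
# With parentheses, {3} applies to the whole "abc" group.
print(bool(re.match(r"(abc){3}", "abccc")))        # False
print(bool(re.match(r"(abc){3}", "abcabcabc")))    # True

sentence = r"[A-Z][a-z]*( [a-z]+)*\.$"
sentence_results = {
    text: bool(re.match(sentence, text))
    for text in ("Eat.", "Eat more good food.", "A good meal.", "eat more.")
}
print(sentence_results)
# {'Eat.': True, 'Eat more good food.': True, 'A good meal.': True, 'eat more.': False}
```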

## Getting information from regular expressions

In [None]:
pattern = r"^[a-zA-Z.]+@([a-z.]*\.[a-z]+)$"   # raw string keeps \. intact for the regex engine
search_string = "some.user@example.com"

match = re.match(pattern, search_string)
if match:
    domain = match.groups()[0]
    print(domain)

example.com


In [None]:
match.groups()

('example.com',)

The groups method returns a tuple of all the groups matched inside the pattern,
which you can index to access a specific value. The groups are ordered from left to
right.

* The search function finds the first instance of a matching pattern

* The findall function behaves similarly to search, except that it finds all non-overlapping instances of the matching pattern, not just the first one


* If there are no groups in the pattern, re.findall will return a list of strings, where each value is a complete substring from the source string that matches the pattern

* If there is exactly one group in the pattern, re.findall will return a list of strings where each value is the contents of that group
*  If there are multiple groups in the pattern, then re.findall will return a list of tuples where each tuple contains a value from a matching group, in order

In [None]:
import re
re.findall('a.', 'abacadefagah')

['ab', 'ac', 'ad', 'ag', 'ah']

In [None]:
re.findall('a(.)', 'abacadefagah')

['b', 'c', 'd', 'g', 'h']

In [None]:
re.findall('(a)(.)', 'abacadefagah')

[('a', 'b'), ('a', 'c'), ('a', 'd'), ('a', 'g'), ('a', 'h')]

In [None]:
re.findall('((a)(.))', 'abacadefagah')

[('ab', 'a', 'b'),
 ('ac', 'a', 'c'),
 ('ad', 'a', 'd'),
 ('ag', 'a', 'g'),
 ('ah', 'a', 'h')]
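
For comparison, `re.search` returns a single match object for the first occurrence rather than a list. A minimal sketch on the same string, using the match object's `group` and `start` methods:

```python
import re

text = "abacadefagah"

# search stops at the first match and returns a match object (or None).
first = re.search(r"a(.)", text)
print(first.group(0))   # ab  (the whole match)
print(first.group(1))   # b   (the first group)
print(first.start())    # 0   (where the match begins)

# findall returns every non-overlapping match; with one group, just that group.
all_groups = re.findall(r"a(.)", text)
print(all_groups)       # ['b', 'c', 'd', 'g', 'h']
```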

## Serializing objects

In the context of data storage, serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted and reconstructed later.

In serialization, an object is transformed into a format that can be stored, so as to be able to deserialize it later and recreate the original object from the serialized format.

---
**Pickle:**

Pickling is the process whereby a Python object hierarchy is converted into a byte stream (usually not human readable) that can be written to a file; this is also known as serialization. Unpickling is the reverse operation, whereby a byte stream is converted back into a working Python object hierarchy.

Pickle is operationally the simplest way to store an object. The Python pickle module is an object-oriented way to store objects directly in a special storage format.

---
**What can it do?**
* Pickle can store and reproduce dictionaries and lists very easily.
* It stores object attributes and restores them to the same state.

---
**What can't pickle do?**
* It does not save an object's code, only its attribute values.
* It cannot store file handles or connection sockets.

In short, pickling is a way to store variables such as lists and class instances in files and retrieve them later.
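
The file-handle limitation can be observed directly. This sketch assumes CPython's behavior of raising `TypeError` for open file objects, and opens `os.devnull` only to have a harmless handle to work with:

```python
import os
import pickle

# Trying to pickle an open file handle fails instead of saving its state.
with open(os.devnull, "rb") as handle:
    try:
        pickle.dumps(handle)
        pickled_ok = True
    except TypeError as error:
        pickled_ok = False
        print("cannot pickle a file handle:", error)

print(pickled_ok)   # False
```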

---

* The dump method accepts an object to be written and a file-like object to write the serialized bytes to. This object must have a write method (or it wouldn't be file-like), and that method must know how to handle a bytes argument.

* The load method does exactly the opposite; it reads a serialized object from a file-like object. This object must have the proper file-like read and readline methods, each of which must, of course, return bytes. The pickle module will load the object from these bytes and the load method will return the fully reconstructed object.
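
Because only a file-like interface is required, an in-memory `io.BytesIO` buffer works as well as a real file. A minimal sketch; the sample dictionary is made up for illustration:

```python
import io
import pickle

some_data = {"name": "example", "values": [1, 2, 3]}

# BytesIO provides write, read, and readline, so it counts as file-like.
buffer = io.BytesIO()
pickle.dump(some_data, buffer)

buffer.seek(0)                      # rewind before reading back
loaded = pickle.load(buffer)

print(loaded == some_data)          # True: same contents
print(loaded is some_data)          # False: a newly reconstructed object

# dumps/loads skip the file-like object and work with bytes directly.
raw = pickle.dumps(some_data)
print(type(raw))                    # <class 'bytes'>
round_tripped = pickle.loads(raw)
```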



In [None]:
import pickle
some_data = ["a list", "containing", 5,
            "values including another list",
            ["inner", "list"]]

with open("pickled_list", 'wb') as file:
    pickle.dump(some_data, file)
    
with open("pickled_list", 'rb') as file:
    loaded_data = pickle.load(file)
    
print(loaded_data)
assert loaded_data == some_data

['a list', 'containing', 5, 'values including another list', ['inner', 'list']]
