## Code alongs - Fundamentals part 2

## Error handling 

- syntax errors
- runtime errors
- logical errors

In [1]:
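# misspelled function name -> NameError, which is raised at runtime (a true syntax error would be e.g. a missing parenthesis)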
prin("hej")

NameError: name 'prin' is not defined
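To make the difference between the first two error types concrete, here is a minimal sketch. It assumes the built-in `compile()` is used only to trigger parsing without running anything; the variable names are illustrative.

```python
# A real syntax error: the closing parenthesis is missing, so the source cannot be parsed.
broken_source = 'print("hej"'

try:
    compile(broken_source, "<example>", "exec")
    syntax_result = "compiled fine"
except SyntaxError as err:
    syntax_result = type(err).__name__

# A misspelled name parses fine and only fails when the line is executed.
try:
    prin("hej")
    runtime_result = "ran fine"
except NameError as err:
    runtime_result = type(err).__name__

print(syntax_result)   # SyntaxError
print(runtime_result)  # NameError
```

A syntax error stops the whole cell before any line runs, while a runtime error appears only when the faulty line is reached.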

In [2]:
print("hej")

hej


In [4]:
# runtime error: the list has 19 elements, so the valid indices are 0 to 18
numbers = list(range(19))
numbers[19]

IndexError: list index out of range

In [6]:
numbers[18]

18
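The off-by-one behind the IndexError can be checked directly. A small sketch, where `last_index` is just an illustrative name:

```python
numbers = list(range(19))

last_index = len(numbers) - 1   # indexing starts at 0, so the last index is len - 1
print(last_index)               # 18
print(numbers[last_index])      # 18
print(numbers[-1])              # 18, negative indices count from the end
```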

In [12]:
import numpy as np

radius = 10
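# the area of a circle is np.pi*radius**2
# logical error: a wrong formula, e.g. np.pi*radius, runs without any exception but gives a wrong result

area_circle = radius**2 * np.pi
print(f"{area_circle = :.2f} a.u.")

# :.2f keeps two decimal places; a.u. is short for "arbitrary units"

area_circle = 314.16 a.u.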
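Since a logical error raises no exception, one way to catch it is to compare the code against values worked out by hand. The sketch below assumes two hypothetical functions, `circle_area` and a deliberately wrong `circle_area_buggy`, and uses `math.pi` instead of NumPy to stay self-contained.

```python
import math

def circle_area(radius):
    return math.pi * radius**2

def circle_area_buggy(radius):
    # logical error: the radius is not squared
    return math.pi * radius

# For radius 2 the area should be 4*pi.
correct_ok = math.isclose(circle_area(2), 4 * math.pi)
buggy_ok = math.isclose(circle_area_buggy(2), 4 * math.pi)

# Radius 1 hides the bug, because 1**2 == 1.
buggy_hidden = math.isclose(circle_area_buggy(1), math.pi)

print(correct_ok, buggy_ok, buggy_hidden)  # True False True
```

The last check shows why the choice of test value matters: some inputs give the right answer even with the wrong formula.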


## Handle errors

try - except

In [13]:
age = input("Enter your age")
age

'-42'

In [21]:
while True:
    try:
        # type casting might raise a ValueError
        age = int(input("Enter your age"))
        if not 0 <= age <= 125:
            raise ValueError(f"Age must be between 0 and 125 not {age}")
        break
    except ValueError as err:
        print(err)

age

invalid literal for int() with base 10: 'dsdj'
Age must be between 0 and 125 not 567
Age must be between 0 and 125 not -23


12
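The same validation can be moved into a function so it can be tried without typing at a prompt. This is a sketch under assumptions: `parse_age` and its `min_age`/`max_age` parameters are hypothetical names, and the list of strings stands in for what a user might enter.

```python
def parse_age(raw_text, min_age=0, max_age=125):
    """Convert user text to an age, raising ValueError if it is not valid."""
    age = int(raw_text)  # raises ValueError for text such as 'dsdj'
    if not min_age <= age <= max_age:
        raise ValueError(f"Age must be between {min_age} and {max_age} not {age}")
    return age

messages = []
for raw_text in ["dsdj", "567", "-23", "12"]:
    try:
        age = parse_age(raw_text)
        break  # stop at the first valid age
    except ValueError as err:
        messages.append(str(err))

print(messages)
print(age)  # 12
```

Both the failed conversion and the range check end up in the same `except ValueError` branch, just as in the loop above.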

## Functions

- avoid spaghetti code: disorganized, hard-to-maintain code that lacks clear logic and structure
- change one place
- DRY - Don't Repeat Yourself
- organize code
- make code modular
- break down complex programs


### oneline if - else

In [22]:
# number1 and number2 are parameters
def smallest_of_two(number1, number2):
    return number1 if number1 < number2 else number2   # Java: number1 < number2 ? number1 : number2

# positional arguments
smallest_of_two(2, -5)

-5

In [23]:
# keyword arguments
smallest_of_two(number1=-5, number2=-5)

-5

In [25]:
smallest_of_two(-5, number2=-9)

-9
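The calling styles above are interchangeable as long as positional arguments come before keyword arguments. A short sketch that collects the results of each style (the list `results` is only for illustration):

```python
def smallest_of_two(number1, number2):
    return number1 if number1 < number2 else number2

results = [
    smallest_of_two(2, -5),                  # positional
    smallest_of_two(number1=2, number2=-5),  # keyword
    smallest_of_two(number2=-5, number1=2),  # keyword arguments may come in any order
    smallest_of_two(2, number2=-5),          # positional first, then keyword
]

print(results)     # [-5, -5, -5, -5]
print(min(2, -5))  # the built-in min gives the same answer: -5
```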

**default value**

Note: in Markdown, double asterisks (`**`) mark bold text, e.g. `**text**` renders as bold "text".

In [4]:
# x o o o o 
# x x o o o 
# x x x o o 
# x x x x o
# x x x x x 

def draw_ascii_pattern(number_rows=5):
    print(number_rows*"0")

# note default value: number_rows = 5
draw_ascii_pattern()

draw_ascii_pattern(3)

00000
000


In [7]:
def draw_ascii_pattern(number_rows=5):
    for i in range(number_rows):
        print(number_rows*"0 ")

draw_ascii_pattern()

0 0 0 0 0 
0 0 0 0 0 
0 0 0 0 0 
0 0 0 0 0 
0 0 0 0 0 


In [10]:
def draw_ascii_pattern(number_rows=5):
    for i in range(number_rows):
        print(f'{i*"x " + (number_rows-i)*"0 "}')

draw_ascii_pattern()

0 0 0 0 0 
x 0 0 0 0 
x x 0 0 0 
x x x 0 0 
x x x x 0 


In [11]:
draw_ascii_pattern(3)

0 0 0 
x 0 0 
x x 0 
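A variant that returns the rows instead of printing them is easier to check. This is only a sketch: the function name `ascii_pattern_rows` and the extra default parameters `filled` and `empty` are assumptions, and the rows start with one `x` as in the hand-drawn comment.

```python
def ascii_pattern_rows(number_rows=5, filled="x", empty="o"):
    """Return the pattern as a list of strings, one per row."""
    rows = []
    for i in range(1, number_rows + 1):
        symbols = [filled] * i + [empty] * (number_rows - i)
        rows.append(" ".join(symbols))
    return rows

for row in ascii_pattern_rows(3):
    print(row)
# x o o
# x x o
# x x x

print(len(ascii_pattern_rows()))  # 5 rows when the default value is used
```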


\*args

- In Markdown, the backslash in front of `*args` escapes the asterisk so that it is not parsed as emphasis syntax.
- arbitrary number of positional arguments

In [12]:
def mean_(*args):
    print(args)

mean_(1,2,3,4)

(1, 2, 3, 4)


In [13]:
mean_(1,2)

(1, 2)


In [14]:
def mean_(*args):
    sum_ = 0

    for arg in args:
        sum_ += arg
    return sum_/len(args)

mean_(1,2,3)

2.0
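Calling `mean_()` with no arguments would divide by zero. One possible way to handle that, sketched here with the built-in `sum()` and an explicit check, which also shows how an existing list can be unpacked into `*args`:

```python
def mean_(*args):
    if not args:
        # without this check, len(args) == 0 would cause a ZeroDivisionError
        raise ValueError("mean_() needs at least one number")
    return sum(args) / len(args)

values = [1, 2, 3]
print(mean_(*values))  # the star unpacks the list into separate arguments -> 2.0

try:
    mean_()
except ValueError as err:
    empty_message = str(err)

print(empty_message)
```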

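\**kwargs

- accepts an arbitrary number of keyword arguments
- In Python, `kwargs` is a common naming convention, not a required name. Any valid variable name works instead, for example `options` as used here.

What `**` means:

- `**` is syntax that packs the keyword arguments passed to the function into a dictionary.
- In `**options`, `options` is the variable name that receives the keyword arguments.
- Key point: `kwargs` in `**kwargs` is not a special keyword, just an ordinary variable name that can be replaced by any other name.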

In [16]:
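# **options is the syntax for accepting an arbitrary number of keyword arguments.
# When the function is called with key=value arguments, they are packed into a dictionary and stored in the options variable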


def print_kwargs(**options):
    print(options)
    print(f"{options.keys() = }")
    print(f"{options.values() = }")

print_kwargs(a = 5, is_active = True, age = 33)

{'a': 5, 'is_active': True, 'age': 33}
options.keys() = dict_keys(['a', 'is_active', 'age'])
options.values() = dict_values([5, True, 33])
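`*args` and `**kwargs` can be combined in one function, and `**` also works in the other direction to unpack a dictionary into keyword arguments. A minimal sketch with a hypothetical function `describe`:

```python
def describe(*args, **options):
    """Report how many positional arguments and which keywords were received."""
    return f"{len(args)} positional, keywords: {sorted(options)}"

settings = {"is_active": True, "age": 33}

# ** unpacks the dictionary into is_active=True, age=33
summary = describe(1, 2, a=5, **settings)
print(summary)  # 2 positional, keywords: ['a', 'age', 'is_active']
```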


## File handling

In [19]:
with open("data/ml_text_raw.txt", 'r') as file:
    print(file.read())

SUperViseD    LEARNinG IS a    PaRt    oF MaCHinE    LEARniNG,   wheRE aLgORithms    LEARn FRom a tRAINIng    DaTa Set. THese   aLgORithms   TRY   TO    MaKE   SeNSe   Of    ThE    DaTa   BY    MaTChiNG    INpUtS   TO    CoRResPonDInG   OutpUTs. In    suPERviseD    LEARNing,    EACH    DaTa   PoINt in    ThE   tRAINIng    Set    IS    LaBELEd WiTH    ThE    CoRReCT    OutpUT,    WHich   aLLOWS   thE ALgORithM    To    LEARn   FRom ThE    ExAMPles. THis   alLOWS   thE    ALgORithM    To    MaKe    PREDIcTions    On    UnSEEN    DaTa, BaSED On    ITs    TRaiNIng. iT    IS    USEd FoR    taSKS SuCH    AS CLaSSIFICaTion, WheRE ThE    GoAL    IS    To    aSSIGn    a LaBEL To    InpUt DaTa,    anD REGrESsIoN, WheRE ThE    GoAL    IS    To    PREDIcT    a CoNtINuoUS    OutpUT VaRIabLE. SuPERviseD    LEARNing    HaS    MaNY    APPLIcatIoNS In    ArEas LIke    Image ReCOGNitiON, NatuRaL    LaNGuaGE PRoCESSINg,    anD FiNaNCiaL FoRECasting.
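A file object remembers its position, so `read()` returns the full content only the first time. The sketch below uses `io.StringIO` as a stand-in for a real file (an assumption made so that no file on disk is needed); a file opened with `open()` behaves the same way.

```python
import io

# io.StringIO stands in for a real text file here.
file = io.StringIO("first line\nsecond line")

first_read = file.read()    # the whole content
second_read = file.read()   # '' because the position is already at the end

file.seek(0)                # move back to the start
third_read = file.read()    # the whole content again

print(repr(first_read))
print(repr(second_read))
print(first_read == third_read)  # True
```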


In [23]:
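import re     # re: regular expressions

with open("data/ml_text_raw.txt", 'r') as file:
    raw_text = file.read()
    # calling file.read() a second time here would return an empty string, since the file has already been read to the end


# \s matches any whitespace character; each run of one or more whitespace characters becomes a single space
text_fixed_spacing = re.sub(r"\s+", " ", raw_text)
text_fixed_spacing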




'SUperViseD LEARNinG IS a PaRt oF MaCHinE LEARniNG, wheRE aLgORithms LEARn FRom a tRAINIng DaTa Set. THese aLgORithms TRY TO MaKE SeNSe Of ThE DaTa BY MaTChiNG INpUtS TO CoRResPonDInG OutpUTs. In suPERviseD LEARNing, EACH DaTa PoINt in ThE tRAINIng Set IS LaBELEd WiTH ThE CoRReCT OutpUT, WHich aLLOWS thE ALgORithM To LEARn FRom ThE ExAMPles. THis alLOWS thE ALgORithM To MaKe PREDIcTions On UnSEEN DaTa, BaSED On ITs TRaiNIng. iT IS USEd FoR taSKS SuCH AS CLaSSIFICaTion, WheRE ThE GoAL IS To aSSIGn a LaBEL To InpUt DaTa, anD REGrESsIoN, WheRE ThE GoAL IS To PREDIcT a CoNtINuoUS OutpUT VaRIabLE. SuPERviseD LEARNing HaS MaNY APPLIcatIoNS In ArEas LIke Image ReCOGNitiON, NatuRaL LaNGuaGE PRoCESSINg, anD FiNaNCiaL FoRECasting.'
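The effect of the `\s+` pattern is easier to see on a short string. A small sketch with made-up input (`messy`), including a common alternative based on `split()` and `join()`:

```python
import re

messy = "  a \t b\n\n   c "

# Every run of spaces, tabs, or newlines becomes exactly one space.
collapsed = re.sub(r"\s+", " ", messy)
print(repr(collapsed))   # ' a b c '

# split() without arguments also splits on any whitespace and drops the ends.
rejoined = " ".join(messy.split())
print(repr(rejoined))    # 'a b c'
```

Note that `re.sub` keeps one space at each end, while the `split()`/`join()` version removes leading and trailing whitespace as well.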

In [25]:
text_fixed_spacing.split(". ")

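# ". " is the separator: a period followed by a space. The string is cut wherever ". " occurs.
# Notes:
# The separator is not included: the returned pieces no longer contain ". ".
# Trailing separator: if the string ended with ". ", an empty string would be returned as the last piece.
# Here the text ends with "." without a trailing space, so the last piece keeps its period.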

['SUperViseD LEARNinG IS a PaRt oF MaCHinE LEARniNG, wheRE aLgORithms LEARn FRom a tRAINIng DaTa Set',
 'THese aLgORithms TRY TO MaKE SeNSe Of ThE DaTa BY MaTChiNG INpUtS TO CoRResPonDInG OutpUTs',
 'In suPERviseD LEARNing, EACH DaTa PoINt in ThE tRAINIng Set IS LaBELEd WiTH ThE CoRReCT OutpUT, WHich aLLOWS thE ALgORithM To LEARn FRom ThE ExAMPles',
 'THis alLOWS thE ALgORithM To MaKe PREDIcTions On UnSEEN DaTa, BaSED On ITs TRaiNIng',
 'iT IS USEd FoR taSKS SuCH AS CLaSSIFICaTion, WheRE ThE GoAL IS To aSSIGn a LaBEL To InpUt DaTa, anD REGrESsIoN, WheRE ThE GoAL IS To PREDIcT a CoNtINuoUS OutpUT VaRIabLE',
 'SuPERviseD LEARNing HaS MaNY APPLIcatIoNS In ArEas LIke Image ReCOGNitiON, NatuRaL LaNGuaGE PRoCESSINg, anD FiNaNCiaL FoRECasting.']
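How `split(". ")` treats the end of the string can be checked on a short example. A sketch with made-up strings:

```python
# The text ends with "." only: the last piece keeps its period.
no_trailing_space = "A. B. C.".split(". ")
print(no_trailing_space)    # ['A', 'B', 'C.']

# The text ends with ". ": an empty string appears as the last piece.
with_trailing_space = "A. B. C. ".split(". ")
print(with_trailing_space)  # ['A', 'B', 'C', '']

# Empty pieces can be filtered out without touching real sentences.
non_empty = [piece for piece in with_trailing_space if piece]
print(non_empty)            # ['A', 'B', 'C']
```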

In [26]:
[text.capitalize() for text in text_fixed_spacing.split(". ")]

# A list comprehension: split text_fixed_spacing into sentences at ". ", then call capitalize() on each piece (first letter upper case, the rest lower case) to build a new list

['Supervised learning is a part of machine learning, where algorithms learn from a training data set',
 'These algorithms try to make sense of the data by matching inputs to corresponding outputs',
 'In supervised learning, each data point in the training set is labeled with the correct output, which allows the algorithm to learn from the examples',
 'This allows the algorithm to make predictions on unseen data, based on its training',
 'It is used for tasks such as classification, where the goal is to assign a label to input data, and regression, where the goal is to predict a continuous output variable',
 'Supervised learning has many applications in areas like image recognition, natural language processing, and financial forecasting.']
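`capitalize()` does more than raise the first letter: it also lowercases everything else, which is what cleans up the mixed casing here. A short comparison with related string methods, using a made-up sample string:

```python
text = "suPERviseD LEARNing"

capitalized = text.capitalize()           # first letter upper, the rest lower
titled = text.title()                     # first letter of every word upper
first_only = text[0].upper() + text[1:]   # only the first letter changed

print(capitalized)  # Supervised learning
print(titled)       # Supervised Learning
print(first_only)   # SuPERviseD LEARNing
```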

In [29]:
# text.strip() removes leading and trailing whitespace from each sentence.
# rstrip(".") removes the period that split(". ") leaves on the last sentence.

sentences = [text.strip().rstrip(".").capitalize() for text in text_fixed_spacing.split(". ")]
sentences

['Supervised learning is a part of machine learning, where algorithms learn from a training data set',
 'These algorithms try to make sense of the data by matching inputs to corresponding outputs',
 'In supervised learning, each data point in the training set is labeled with the correct output, which allows the algorithm to learn from the examples',
 'This allows the algorithm to make predictions on unseen data, based on its training',
 'It is used for tasks such as classification, where the goal is to assign a label to input data, and regression, where the goal is to predict a continuous output variable',
 'Supervised learning has many applications in areas like image recognition, natural language processing, and financial forecasting']

In [30]:
cleaned_text = ".\n\n".join(sentences) + "."   # join() adds no separator after the last sentence, so append the final period
print(cleaned_text)

# ".\n\n".join(sentences) merges the list of sentences into one string.
# The separator ".\n\n" ends each sentence with a period and leaves a blank line before the next one, which gives a paragraph-like layout.


Supervised learning is a part of machine learning, where algorithms learn from a training data set.

These algorithms try to make sense of the data by matching inputs to corresponding outputs.

In supervised learning, each data point in the training set is labeled with the correct output, which allows the algorithm to learn from the examples.

This allows the algorithm to make predictions on unseen data, based on its training.

It is used for tasks such as classification, where the goal is to assign a label to input data, and regression, where the goal is to predict a continuous output variable.

Supervised learning has many applications in areas like image recognition, natural language processing, and financial forecasting.
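`join()` places the separator only between the elements, never after the last one. A minimal sketch with made-up lists:

```python
words = ["a", "b", "c"]
print(", ".join(words))             # a, b, c

# Elements must be strings, so numbers are converted first.
numbers_text = ", ".join(str(n) for n in [1, 2, 3])
print(numbers_text)                 # 1, 2, 3

# With ".\n\n" as the separator, the last element gets no period.
paragraphs = ".\n\n".join(["One", "Two"])
print(repr(paragraphs))             # 'One.\n\nTwo'
```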


In [31]:
with open("data/cleaned_ml_text.txt", 'w') as file:
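    file.write(cleaned_text)   # write the string to the file

# with open() opens the file; the with statement makes sure the file is closed automatically when the block ends, so there is no need to call file.close().
# 'w' is write mode: the file is created if it does not exist; if it already exists, its contents are cleared before the new content is written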
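The difference between write mode `'w'` and append mode `'a'` can be tried safely in a temporary folder. This is a sketch under assumptions: the file name `example.txt` is hypothetical, and `tempfile` is used so that nothing in `data/` is touched.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "example.txt"   # hypothetical file name

    with open(path, "w") as file:
        file.write("first version")

    # Opening with 'w' again clears the old content before writing.
    with open(path, "w") as file:
        file.write("second version")
    after_w = path.read_text()

    # 'a' keeps the existing content and adds to the end.
    with open(path, "a") as file:
        file.write("\nappended line")
    after_a = path.read_text()

print(after_w)
print(after_a)
```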