In this assignment, you will process data with Python libraries such as NumPy and pandas. Detailed descriptions are given in the following code blocks.

In this part, follow the instructions and implement the functions with NumPy. Use the following normalization formula:

$$
x_i = \frac{x_i}{\sqrt{\sum_{j} x_j^2}}
$$
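
As a small worked instance of the formula above, the sketch below normalizes a two-element vector. The vector `[3, 4]` and the variable names are illustrative choices, not part of the assignment data.

```python
import numpy as np

# Illustrative vector (an assumption, not taken from the dataset).
x = np.array([3.0, 4.0])

# Denominator of the formula: sqrt(3^2 + 4^2) = sqrt(25) = 5
denominator = np.sqrt(np.sum(x ** 2))

# Divide every element by the same denominator.
x_normalized = x / denominator
print(x_normalized)  # [0.6 0.8]
```

After normalization, the squared elements sum to 1 (up to floating-point rounding), which is a quick way to sanity-check an implementation.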

In [40]:
import numpy as np

def add(x, y):
    """A basic add function to add numpy.array or python numbers.
    For example: the output of add(5.3, 2) is 7.3.

    Parameters
    ----------
    x: python number or numpy.array
    y: python number or numpy.array

    Returns
    -------
    z: python number or numpy.array

    """
    # isinstance(a, b) checks whether a is an instance of type b. No type
    # check is needed here: the + operator already handles Python numbers,
    # numpy arrays, and mixed operands (via broadcasting).
    return x + y
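
The short sketch below illustrates why a single `x + y` covers every operand combination that `add` must support. It uses the `+` operator directly rather than calling `add`, and the sample values are illustrative.

```python
import numpy as np

a = np.array([1, 2])
b = np.array([3, 4])

array_plus_array = a + b       # element-wise addition
array_plus_scalar = a + 10     # the scalar is broadcast to every element
scalar_plus_scalar = 5.3 + 2   # plain Python arithmetic, result is a float

print(array_plus_array)          # [4 6]
print(array_plus_scalar)         # [11 12]
print(type(scalar_plus_scalar))  # <class 'float'>
```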


In [41]:
import numpy as np

def norm(matrix, axis=0):
    """
    A function to normalize a certain dimension of a matrix.
    Normalizes matrix such that for each element x_i in the matrix,
    x_i = x_i / sqrt(sum_j(x_j^2)) along the specified axis.

    Parameters
    ----------
    matrix : numpy.array
        The input matrix of two dimensions.
    axis : int
        Axis along which normalization is performed.

    Returns
    -------
    numpy.array
        The normalized matrix.
    """
    # Calculate the norm (denominator) along the specified axis
    norms = np.sqrt(np.sum(matrix**2, axis=axis, keepdims=True))
    
    # Avoid division by zero
    norms[norms == 0] = 1
    
    # Normalize the matrix
    normalized_matrix = matrix / norms
    return normalized_matrix
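
The sketch below shows, on a small illustrative matrix, what `keepdims=True` and the zero guard contribute. It repeats the same steps inline instead of calling `norm`, and the matrix values are an assumption chosen to include an all-zero row.

```python
import numpy as np

# Illustrative matrix with one all-zero row (not from the assignment data).
m = np.array([[3.0, 4.0],
              [0.0, 0.0]])

# keepdims=True keeps the reduced axis with length 1, so the result has
# shape (2, 1) and broadcasts against the (2, 2) matrix row by row.
row_norms = np.sqrt(np.sum(m ** 2, axis=1, keepdims=True))
print(row_norms.shape)  # (2, 1)

# Replace zero norms with 1 so all-zero rows stay zero instead of
# producing NaN from 0 / 0.
row_norms[row_norms == 0] = 1
rows_normalized = m / row_norms

# With axis=0 the reduction runs down each column instead.
col_norms = np.sqrt(np.sum(m ** 2, axis=0, keepdims=True))
cols_normalized = m / col_norms
```

Here `rows_normalized` has a first row of `[0.6, 0.8]` and an unchanged zero row, while `cols_normalized` scales each column by its own norm.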




In [85]:
import pandas as pd
import numpy as np


def load_data():
    """Load and preprocess the dataset."""
    
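    # Step 1: load the dataset
    data = pd.read_csv("dataset.csv")

    # Step 2: drop rows that contain missing values or cells that are only "_"
    data.replace("_", np.nan, inplace=True)
    data.dropna(inplace=True)

    # Step 3: strip stray "_" characters from cells that mix "_" with digits
    data.replace("_", "", inplace=True, regex=True)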
    
   
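    # Step 4: convert to float so the means can be computed, then append
    # the per-column means as the last row
    data = data.astype(float)
    column_means = data.mean()
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    data = pd.concat([data, column_means.to_frame().T], ignore_index=True)

    # Step 5: write the processed data to a new file for inspection
    data.to_csv("processed_dataset.csv", index=False, sep=',')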
    
    return data
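
To see the cleaning steps without the CSV file, the sketch below applies them to a tiny in-memory table. The column names `a` and `b` and every value are hypothetical stand-ins, since the real dataset's contents are not shown here. The mean row is appended with `pd.concat`, which works on pandas versions where `DataFrame.append` is no longer available.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content: one cell is only "_", one mixes "_" with a
# digit, and one is empty.
csv_text = "a,b\n1,2\n_,4\n3_,6\n5,\n"
data = pd.read_csv(io.StringIO(csv_text))

# Cells that are exactly "_" become NaN, then rows with any NaN are dropped.
data = data.replace("_", np.nan).dropna()

# Remaining "_" characters inside values such as "3_" are stripped,
# after which every column can be converted to float.
data = data.replace("_", "", regex=True).astype(float)

# Append the per-column means as a final row.
column_means = data.mean()
data = pd.concat([data, column_means.to_frame().T], ignore_index=True)
print(data)
```

Two of the four input rows survive the cleaning, so the result has three rows: the two cleaned rows followed by the row of column means.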


Run the following code to get the dataset of this work.

Use pandas to load the dataset "dataset2.csv" and process it following the instructions in the docstrings.

After you implement these functions, you may run the following code to check your answer.

In [86]:
assert add(5,6) == 11
assert add(3.2,1.0) == 4.2
assert type(add(4., 4)) == float
np.testing.assert_allclose(add(np.array([1,2]), np.array([3,4])),
                np.array([4,6]))

data = np.array([[2, 4, 6], [1, 3, 5], [3, 6, 9]])
normalized_data = norm(data, axis=1)
assert np.allclose(normalized_data, [[0.26726124, 0.53452248, 0.80178373],
                    [0.16903085, 0.50709255, 0.84515425],
                    [0.26726124, 0.53452248, 0.80178373]])

data = load_data()

assert len(data) == 328  # 327 rows before the mean row is appended, 328 after

assert np.allclose(data.values[-1], [293734.71875, 1.9938838481903076, 29.373088836669922,
                      9.994159698486328, 1739.4874267578125, 794.76171875, 388.9890441894531])
