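**Question 1: Without encoding, can a machine learning algorithm work directly with characters such as S/C/Q?**
- Answer: Most machine learning algorithms cannot process string-valued categorical variables directly; they require numeric input. We therefore usually need to encode such variables first.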

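**Question 2: What is the difference between ordinary (label) encoding and one-hot encoding?**
- **Label encoding can introduce a spurious ordinal relationship in some algorithms**, for example 1 < 2 < 3, even though S, C, and Q may have no such order.
- One-hot encoding introduces no such ordinal assumption, but it increases the dimensionality of the data.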
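
To make the difference concrete, here is a minimal sketch on a toy `Embarked` column (the three sample values are made up for illustration, not taken from the dataset). Note that scikit-learn's `LabelEncoder` assigns codes in sorted class order starting from 0, so the codes are 0, 1, 2 rather than 1, 2, 3.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column for illustration only (not the Titanic data)
embarked = pd.Series(['S', 'C', 'Q'], name='Embarked')

# Label encoding: one integer column, codes follow sorted class order (C=0, Q=1, S=2)
label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(embarked)
print(list(label_encoder.classes_), list(codes))

# One-hot encoding: one indicator column per category, no implied order
onehot = pd.get_dummies(embarked, prefix='Embarked')
print(onehot.columns.tolist())
```

The label-encoded version stays a single column, while the one-hot version grows to three columns, which is the dimensionality cost mentioned above.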

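**Question 3: With label encoding, does a typical model see the feature only as numeric rather than categorical, so that one-hot encoding is required?**
- With label encoding, the model treats the feature as numeric, not categorical. This can cause problems in some algorithms (especially distance-based ones), because the algorithm may assume the codes are ordered. One-hot encoding avoids this problem. As the random forest results later in this notebook suggest, not every model is sensitive to the choice.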
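
The effect on distance-based algorithms can be checked by hand. The sketch below is an illustration only: it assumes the codes C=0, Q=1, S=2 (the order `LabelEncoder` would assign) and compares Euclidean distances between categories under each encoding. The helper `category_distances` is a hypothetical name introduced here.

```python
import numpy as np

# Label codes as LabelEncoder would assign them: C=0, Q=1, S=2
label_codes = {'C': np.array([0.0]), 'Q': np.array([1.0]), 'S': np.array([2.0])}

# One-hot vectors in the column order (C, Q, S)
onehot_codes = {
    'C': np.array([1.0, 0.0, 0.0]),
    'Q': np.array([0.0, 1.0, 0.0]),
    'S': np.array([0.0, 0.0, 1.0]),
}

def category_distances(codes):
    """Euclidean distance between every pair of categories."""
    pairs = [('C', 'Q'), ('Q', 'S'), ('C', 'S')]
    return {pair: float(np.linalg.norm(codes[pair[0]] - codes[pair[1]])) for pair in pairs}

label_dist = category_distances(label_codes)
onehot_dist = category_distances(onehot_codes)
print(label_dist)   # C and S end up twice as far apart as C and Q
print(onehot_dist)  # every pair is equally far apart
```

Under label encoding the distances depend entirely on the arbitrary code order; under one-hot encoding all categories are equidistant.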

## 1. Data cleaning

In [3]:
import pandas as pd
# Load the dataset
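titanic_data = pd.read_excel(r"C:\Users\hp\Desktop\泰坦尼克号.xlsx")
# Drop 'Cabin', then fill missing 'Age' with the median and missing 'Embarked' with the mode
titanic_data_cleaned = titanic_data.drop(columns=['Cabin'])
titanic_data_cleaned['Age'] = titanic_data_cleaned['Age'].fillna(titanic_data_cleaned['Age'].median())
titanic_data_cleaned['Embarked'] = titanic_data_cleaned['Embarked'].fillna(titanic_data_cleaned['Embarked'].mode()[0])
# Keep only the two columns used below; .copy() avoids a SettingWithCopyWarning when a column is added later
titanic_data_cleaned = titanic_data_cleaned[['Embarked', 'Fare']].copy()
# Check for remaining missing values and preview the data
titanic_data_cleaned.isnull().sum(), titanic_data_cleaned.head()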

  warn("Workbook contains no default style, apply openpyxl's default")


(Embarked    0
 Fare        0
 dtype: int64,
   Embarked     Fare
 0        S   7.2500
 1        C  71.2833
 2        S   7.9250
 3        S  53.1000
 4        S   8.0500)

## 2. Without encoding, can a machine learning algorithm work directly with characters such as S/C/Q?

In [5]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X = titanic_data_cleaned[['Embarked']]
y = titanic_data_cleaned['Fare']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

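# Initialise both results so the final expression works whether or not fitting succeeds
error_message = None
mse = None
try:
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    y_pred = lr.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
except Exception as e:
    # Fitting on raw strings is expected to fail; keep the message for inspection
    error_message = str(e)
error_message, mse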

("could not convert string to float: 'S'", None)

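- As expected, when we try to use the unencoded "Embarked" column, the linear regression model raises an error because it cannot handle string data.
- The error message is: "could not convert string to float: 'S'".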
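
One way to avoid this error without hand-encoding the column is to put the encoder inside the model pipeline. The sketch below is not part of the original experiment: it assumes scikit-learn's `ColumnTransformer` and `OneHotEncoder`, and the small DataFrame `toy` is made-up illustrative data, not the Titanic file.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up fares for illustration only
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', 'C', 'Q', 'Q'],
    'Fare':     [8.0, 10.0, 60.0, 80.0, 7.0, 9.0],
})

# One-hot encode 'Embarked' inside the pipeline; sparse_threshold=0 keeps the output dense
encode = ColumnTransformer(
    [('embarked', OneHotEncoder(handle_unknown='ignore'), ['Embarked'])],
    sparse_threshold=0,
)
model = make_pipeline(encode, LinearRegression())

# The raw string column can now be passed directly to fit and predict
model.fit(toy[['Embarked']], toy['Fare'])
pred = model.predict(pd.DataFrame({'Embarked': ['S', 'C', 'Q']}))
print(pred)
```

Keeping the encoder in the pipeline also means the same encoding is applied automatically at prediction time.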

## 3. Conventional machine learning algorithm with label encoding

In [8]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
titanic_data_cleaned['Embarked_encoded'] = label_encoder.fit_transform(titanic_data_cleaned['Embarked'])
X_encoded = titanic_data_cleaned[['Embarked_encoded']]
X_train_encoded, X_test_encoded, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)
lr_encoded = LinearRegression()
lr_encoded.fit(X_train_encoded, y_train)
y_pred_encoded = lr_encoded.predict(X_test_encoded)
mse_encoded = mean_squared_error(y_test, y_pred_encoded)
mse_encoded

1397.3169194574098

## 3. Conventional machine learning algorithm with one-hot encoding

In [9]:
# Use one-hot encoding for 'Embarked' column
embarked_onehot = pd.get_dummies(titanic_data_cleaned['Embarked'], prefix='Embarked')
titanic_data_onehot = pd.concat([titanic_data_cleaned, embarked_onehot], axis=1)
# Split the data into training and testing sets again
X_onehot = titanic_data_onehot[['Embarked_C', 'Embarked_Q', 'Embarked_S']]
X_train_onehot, X_test_onehot, y_train, y_test = train_test_split(X_onehot, y, test_size=0.2, random_state=42)
# Fit a linear regression model with one-hot encoded 'Embarked'
lr_onehot = LinearRegression()
lr_onehot.fit(X_train_onehot, y_train)
y_pred_onehot = lr_onehot.predict(X_test_onehot)
mse_onehot = mean_squared_error(y_test, y_pred_onehot)
mse_onehot

1320.5549858205027

- With one-hot encoding, the mean squared error (MSE) is 1320.55, slightly lower than the 1397.32 obtained with label encoding.
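
One plausible reason for the gap: with a single label-encoded column, linear regression must place the three ports on one straight line, so the prediction for the middle code (Q) is forced to the midpoint of the predictions for C and S. With one-hot columns, each port gets its own fitted level. A small sketch with made-up fares (not the Titanic data), assuming the codes C=0, Q=1, S=2:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up fares for illustration; codes follow LabelEncoder order C=0, Q=1, S=2
codes = np.array([0, 0, 1, 1, 2, 2])
fares = np.array([60.0, 80.0, 7.0, 9.0, 8.0, 10.0])

# Label encoding: a single numeric feature, so predictions lie on one line
lr_label = LinearRegression().fit(codes.reshape(-1, 1), fares)
pred_label = lr_label.predict(np.array([[0], [1], [2]]))

# One-hot encoding: one indicator per port, so each port gets its own level
onehot = np.eye(3)[codes]
lr_onehot = LinearRegression().fit(onehot, fares)
pred_onehot = lr_onehot.predict(np.eye(3))

print(pred_label)   # Q's prediction is the average of C's and S's
print(pred_onehot)  # matches the per-port mean fares: 70, 8, 9
```

In this toy data Q's true mean fare is far from the midpoint of C and S, so the label-encoded fit is visibly worse; how large the effect is on real data depends on how the category means happen to line up with the code order.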

## 4. Ensemble learning algorithm with label encoding

In [11]:
from sklearn.ensemble import RandomForestRegressor
# Fit a random forest regressor with label encoded 'Embarked'
rf_encoded = RandomForestRegressor(random_state=42)
rf_encoded.fit(X_train_encoded, y_train)
y_pred_rf_encoded = rf_encoded.predict(X_test_encoded)
mse_rf_encoded = mean_squared_error(y_test, y_pred_rf_encoded)
mse_rf_encoded

1320.6125260638626

## 4. Ensemble learning algorithm with one-hot encoding

In [13]:
# Fit a random forest regressor with one-hot encoded 'Embarked'
rf_onehot = RandomForestRegressor(random_state=42)
rf_onehot.fit(X_train_onehot, y_train)
y_pred_rf_onehot = rf_onehot.predict(X_test_onehot)
mse_rf_onehot = mean_squared_error(y_test, y_pred_rf_onehot)
mse_rf_onehot

1320.6125260638626

## 5. Summary

    Linear regression (label encoding):   MSE = 1397.32
    Linear regression (one-hot encoding): MSE = 1320.55
    Random forest (label encoding):       MSE = 1320.61
    Random forest (one-hot encoding):     MSE = 1320.61

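- The machine learning algorithm tested here (linear regression) cannot accept string-format data; the column must be encoded first.
- Conventional algorithms such as linear regression treat label codes as ordered numeric values, which made the fit worse in this experiment (MSE 1397.32 vs. 1320.55).
- The ensemble algorithm (random forest) produced exactly the same MSE under both encodings in this experiment.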
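
A likely explanation for the identical random forest results is that tree-based models split on thresholds: with only three codes, splits such as `code <= 0.5` and `code <= 1.5` can isolate each port just as one-hot indicators would. The sketch below uses a single `DecisionTreeRegressor` as a simplified stand-in for the random forest, on made-up fares (not the Titanic data).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up fares for illustration; codes follow LabelEncoder order C=0, Q=1, S=2
codes = np.array([0, 0, 1, 1, 2, 2])
fares = np.array([60.0, 80.0, 7.0, 9.0, 8.0, 10.0])
onehot = np.eye(3)[codes]

# Fit one fully grown tree per encoding
tree_label = DecisionTreeRegressor(random_state=42).fit(codes.reshape(-1, 1), fares)
tree_onehot = DecisionTreeRegressor(random_state=42).fit(onehot, fares)

# Predict the fare for each port under both encodings
pred_label = tree_label.predict(np.array([[0], [1], [2]]))
pred_onehot = tree_onehot.predict(np.eye(3))
print(pred_label, pred_onehot)
```

Both trees end up predicting the per-port mean fare, so the choice of encoding makes no difference here. With many categories, a tree may need more splits to separate label codes, so this equivalence should not be assumed in general.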