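### homework1: Practice decision trees and tune their parameters to improve accuracy; practice random forests with grid-search tuning to find the best-parameter model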

In [90]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

In [91]:
"""
Predict survival on the Titanic with a decision tree
:return: None
"""
# Load the data
titan = pd.read_csv("../data/titanic.txt")
titan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   row.names  1313 non-null   int64  
 1   pclass     1313 non-null   object 
 2   survived   1313 non-null   int64  
 3   name       1313 non-null   object 
 4   age        633 non-null    float64
 5   embarked   821 non-null    object 
 6   home.dest  754 non-null    object 
 7   room       77 non-null     object 
 8   ticket     69 non-null     object 
 9   boat       347 non-null    object 
 10  sex        1313 non-null   object 
dtypes: float64(1), int64(2), object(8)
memory usage: 113.0+ KB


In [92]:
titan

Unnamed: 0,row.names,pclass,survived,name,age,embarked,home.dest,room,ticket,boat,sex
0,1,1st,1,"Allen, Miss Elisabeth Walton",29.0000,Southampton,"St Louis, MO",B-5,24160 L221,2,female
1,2,1st,0,"Allison, Miss Helen Loraine",2.0000,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
2,3,1st,0,"Allison, Mr Hudson Joshua Creighton",30.0000,Southampton,"Montreal, PQ / Chesterville, ON",C26,,(135),male
3,4,1st,0,"Allison, Mrs Hudson J.C. (Bessie Waldo Daniels)",25.0000,Southampton,"Montreal, PQ / Chesterville, ON",C26,,,female
4,5,1st,1,"Allison, Master Hudson Trevor",0.9167,Southampton,"Montreal, PQ / Chesterville, ON",C22,,11,male
...,...,...,...,...,...,...,...,...,...,...,...
1308,1309,3rd,0,"Zakarian, Mr Artun",,,,,,,male
1309,1310,3rd,0,"Zakarian, Mr Maprieder",,,,,,,male
1310,1311,3rd,0,"Zenn, Mr Philip",,,,,,,male
1311,1312,3rd,0,"Zievens, Rene",,,,,,,female


In [93]:
x = titan[['pclass', 'age', 'sex']].copy()  # copy, so later assignments to x do not act on a slice of titan
y = titan['survived']

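df.info(): shows a DataFrame's structure and column dtypes; useful for checking completeness (non-null counts) and types.

df.describe(): returns summary statistics for the numeric columns; useful for examining their distribution.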

In [94]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pclass  1313 non-null   object 
 1   age     633 non-null    float64
 2   sex     1313 non-null   object 
dtypes: float64(1), object(2)
memory usage: 30.9+ KB


By default, df.describe() computes statistics only for numeric columns (int, float, and so on). With include='all', it also summarizes non-numeric columns (count, unique, top, freq).

In [95]:
x.describe()

Unnamed: 0,age
count,633.0
mean,31.194181
std,14.747525
min,0.1667
25%,21.0
50%,30.0
75%,41.0
max,71.0


In [96]:
x.describe(include='all')

Unnamed: 0,pclass,age,sex
count,1313,633.0,1313
unique,3,,2
top,3rd,,male
freq,711,,850
mean,,31.194181,
std,,14.747525,
min,,0.1667,
25%,,21.0,
50%,,30.0,
75%,,41.0,


In [97]:
# Missing values must be handled; fill them with the mean age
mean_age = x['age'].mean()
print(mean_age)
x.loc[:, 'age'] = x.loc[:, 'age'].fillna(mean_age)

31.19418104265403
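The cell above fills missing ages with the mean of the full dataset, before the train/test split. As a minimal sketch of an alternative, the block below learns the mean from the training rows only and reuses it for the test rows, so the test data does not influence the fill value. The frame `toy` is made-up data, not the Titanic file, and using `SimpleImputer` is an assumption rather than what this notebook does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the 'age' column (not real Titanic values)
toy = pd.DataFrame({"age": [20.0, np.nan, 40.0, 60.0, np.nan, 30.0, 50.0, 10.0]})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

imputer = SimpleImputer(strategy="mean")
train_filled = imputer.fit_transform(toy_train[["age"]])  # learn the mean from training rows only
test_filled = imputer.transform(toy_test[["age"]])        # reuse that same mean for the test rows

print(imputer.statistics_)  # the mean learned from the training split
```

On a dataset this size the difference from a global mean is likely small, but the pattern carries over to any statistic learned from data.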


In [98]:
# After filling the missing values
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1313 entries, 0 to 1312
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pclass  1313 non-null   object 
 1   age     1313 non-null   float64
 2   sex     1313 non-null   object 
dtypes: float64(1), object(2)
memory usage: 30.9+ KB


In [99]:
# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=4)

The next cells only explore the structure of the dataset; no decision tree model is used yet.

In [100]:
# head() is a common pandas method that shows the first rows of a DataFrame or Series (5 by default); handy for a quick check of structure and content.
x_train.head()

Unnamed: 0,pclass,age,sex
598,2nd,30.0,male
246,1st,62.0,male
905,3rd,31.194181,female
300,1st,31.194181,female
509,2nd,64.0,male


In [101]:
type(x_train)

pandas.core.frame.DataFrame

In [102]:
sum(y_train)

334

In [103]:
# Number of female passengers in the training set
x_train[x_train['sex'] == 'female'].count()

pclass    341
age       341
sex       341
dtype: int64

In [104]:
x_train

Unnamed: 0,pclass,age,sex
598,2nd,30.000000,male
246,1st,62.000000,male
905,3rd,31.194181,female
300,1st,31.194181,female
509,2nd,64.000000,male
...,...,...,...
360,2nd,31.194181,male
709,3rd,28.000000,male
439,2nd,34.000000,male
174,1st,46.000000,male


In [105]:
y_train

598     0
246     0
905     0
300     0
509     0
       ..
360     0
709     0
439     0
174     0
1146    0
Name: survived, Length: 984, dtype: int64

In [106]:
# Compare survival between women and men
z = x_train.copy()  # z holds the features and the target together
z['survived'] = y_train  # add the target values to z

In [107]:
z[z['sex'] == 'female']['survived'].value_counts()  # survival counts among women

survived
1    230
0    111
Name: count, dtype: int64

In [108]:
z[z['sex'] == 'male']['survived'].value_counts()  # survival counts among men

survived
0    539
1    104
Name: count, dtype: int64

In [109]:
y_train.value_counts()  # overall survival counts

survived
0    650
1    334
Name: count, dtype: int64

In [110]:
x_train.loc[:, 'sex'].value_counts()  # distribution of sex

sex
male      643
female    341
Name: count, dtype: int64
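The per-sex counts above can also be read as survival rates. A minimal sketch, assuming a small made-up frame `toy` rather than the real training data: since `survived` is 0/1, its group mean is the survival rate, and `pd.crosstab` puts all the counts in one table.

```python
import pandas as pd

# Hypothetical toy sample (not the real Titanic rows)
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male", "male"],
    "survived": [1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the survival rate within each group
rate_by_sex = toy.groupby("sex")["survived"].mean()
print(rate_by_sex)

# Counts for every (sex, survived) pair in a single table
counts = pd.crosstab(toy["sex"], toy["survived"])
print(counts)
```

Applied to the notebook's `z`, the same two calls would replace the separate `value_counts()` cells.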

The following cells train the decision tree model.

In [111]:
x_train.to_dict(orient="records")

[{'pclass': '2nd', 'age': 30.0, 'sex': 'male'},
 {'pclass': '1st', 'age': 62.0, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'female'},
 {'pclass': '1st', 'age': 31.19418104265403, 'sex': 'female'},
 {'pclass': '2nd', 'age': 64.0, 'sex': 'male'},
 {'pclass': '1st', 'age': 31.19418104265403, 'sex': 'female'},
 {'pclass': '3rd', 'age': 24.0, 'sex': 'female'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '2nd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 21.0, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '2nd', 'age': 23.0, 'sex': 'female'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'male'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'female'},
 {'pclass': '3rd', 'age': 31.19418104265403, 'sex': 'female'},
 {'pclass': '1st', 'age': 4

In [112]:
type(x_train.to_dict(orient="records"))  # to_dict turns the DataFrame into a list of dicts: one dict per sample, with column names as keys

list

In [113]:
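# Feature engineering: categorical features -> one-hot encoding
# Note: the name `dict` shadows Python's built-in dict; it is kept because later cells refer to it
dict = DictVectorizer(sparse=False)

# Extract features from the dicts: to_dict(orient="records") turns the DataFrame into a list with one dict per row, keyed by column name
# DictVectorizer takes an iterable of dicts (mappings), so this list of dicts can be passed in directly
x_train = dict.fit_transform(x_train.to_dict(orient="records"))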
print(type(x_train))
print(dict.get_feature_names_out())

<class 'numpy.ndarray'>
['age' 'pclass=1st' 'pclass=2nd' 'pclass=3rd' 'sex=female' 'sex=male']


In [114]:
x_test = dict.transform(x_test.to_dict(orient="records"))

In [115]:
print(x_train)

[[30.          0.          1.          0.          0.          1.        ]
 [62.          1.          0.          0.          0.          1.        ]
 [31.19418104  0.          0.          1.          1.          0.        ]
 ...
 [34.          0.          1.          0.          0.          1.        ]
 [46.          1.          0.          0.          0.          1.        ]
 [31.19418104  0.          0.          1.          0.          1.        ]]


In [116]:
# Predict with a decision tree; try changing max_depth, or setting criterion to "entropy"
# An overly complex tree tends to overfit
dec = DecisionTreeClassifier()

# Train
dec.fit(x_train, y_train)

# Accuracy on the test set
print("Prediction accuracy:", dec.score(x_test, y_test))

# Export the tree structure
# export_graphviz(dec, out_file="tree.dot",
#                 feature_names=['age', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', 'female', 'male'])

export_graphviz(dec, out_file="tree.dot",
                feature_names=dict.get_feature_names_out())


Prediction accuracy: 0.8085106382978723
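The `.dot` file written by `export_graphviz` needs Graphviz to render. As a lighter way to inspect a fitted tree, `sklearn.tree.export_text` prints the split rules as plain text. A minimal sketch on synthetic data; the dataset and the feature names `f0` to `f3` are placeholders, not the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four numeric features
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# A shallow tree keeps the printed rules short
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

rules = export_text(tree, feature_names=["f0", "f1", "f2", "f3"])
print(rules)
```

In the notebook, the equivalent call would pass `dec` and the feature names from the fitted vectorizer.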


Tuning the decision tree's hyperparameters

In [117]:
# Tune the decision tree's parameters
# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=4)
# Feature engineering: categorical features -> one-hot encoding
dict = DictVectorizer(sparse=False)

# Extract features from the list of dicts
x_train = dict.fit_transform(x_train.to_dict(orient="records"))
x_test = dict.transform(x_test.to_dict(orient="records"))

# print(x_train)
# Predict with a decision tree; limiting max_depth improved accuracy here.
# min_impurity_decrease=0.01: a node is split only if the split lowers the (weighted) impurity by at least 0.01
# Hyperparameters are set to improve the model's performance
dec = DecisionTreeClassifier(max_depth=7, min_impurity_decrease=0.01, min_samples_split=20)

dec.fit(x_train, y_train)

# Accuracy on the test set
print("Prediction accuracy:", dec.score(x_test, y_test))

# Export the tree structure
export_graphviz(dec, out_file="tree1.dot",
                feature_names=dict.get_feature_names_out())

Prediction accuracy: 0.8206686930091185
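The hyperparameters above were chosen by hand. The same search can be automated with `GridSearchCV`, as a minimal sketch on synthetic data shows below. The grid values are illustrative guesses, not recommendations, and `make_classification` stands in for the Titanic features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=4)

# Hypothetical grid: every combination is evaluated with 3-fold cross-validation
param_grid = {
    "max_depth": [3, 5, 7],
    "min_samples_split": [2, 20],
    "criterion": ["gini", "entropy"],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid=param_grid, cv=3)
search.fit(X_tr, y_tr)

print(search.best_params_)        # combination with the best mean CV score
print(search.score(X_te, y_te))   # test accuracy of the refitted best model
```

Selecting parameters by cross-validation on the training set keeps the test set out of the tuning loop, which hand-tuning against `dec.score(x_test, y_test)` does not.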


Predicting with a random forest model, with hyperparameter tuning

In [118]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=4)
# Feature engineering: categorical features -> one-hot encoding
dict = DictVectorizer(sparse=False)

# Extract features from the list of dicts
x_train = dict.fit_transform(x_train.to_dict(orient="records"))
x_test = dict.transform(x_test.to_dict(orient="records"))

In [119]:
# Predict with a random forest (with hyperparameter tuning); n_jobs=-1 uses all available CPU cores
rf = RandomForestClassifier(n_jobs=-1)
# n_estimators is the number of trees in the forest (the number of base classifiers); other values to try: 120, 200, 300, 500, 800, 1200
# max_samples is the number of samples drawn to train each tree
# Random forest is a bagging-type ensemble
param = {"n_estimators": [1500, 2000, 5000], "max_depth": [2, 3, 5, 8, 15, 25]}

# Grid search with cross-validation
gc = GridSearchCV(rf, param_grid=param, cv=3)

gc.fit(x_train, y_train)

print("Accuracy:", gc.score(x_test, y_test))

print("Selected parameters:", gc.best_params_)

print("Best model:", gc.best_estimator_)

Accuracy: 0.8328267477203647
Selected parameters: {'max_depth': 3, 'n_estimators': 2000}
Best model: RandomForestClassifier(max_depth=3, n_estimators=2000, n_jobs=-1)


In [120]:
print("Cross-validation results for each hyperparameter combination:", gc.cv_results_)

Cross-validation results for each hyperparameter combination: {'mean_fit_time': array([1.07055863, 1.38813424, 3.48044658, 1.00253455, 1.30010192,
       3.31668591, 0.9691515 , 1.3292559 , 3.33942143, 0.99864586,
       1.39611896, 3.44516007, 1.02673841, 1.41205319, 3.46519899,
       1.02915231, 1.40738654, 5.30615091]), 'std_fit_time': array([0.02574293, 0.02523532, 0.14288436, 0.02235626, 0.02914515,
       0.07743438, 0.00998786, 0.03773662, 0.03388782, 0.02705515,
       0.0250025 , 0.07694435, 0.04178913, 0.03101283, 0.04621442,
       0.00125954, 0.01517   , 1.25383713]), 'mean_score_time': array([0.10427872, 0.121586  , 0.3025496 , 0.08857139, 0.13095888,
       0.31017939, 0.10004632, 0.12815452, 0.28758558, 0.10323318,
       0.12515203, 0.3379751 , 0.11790999, 0.13269051, 0.32806849,
       0.10068615, 0.14861886, 0.51131288]), 'std_score_time': array([0.00497233, 0.0008563 , 0.01691403, 0.00029253, 0.01529304,
       0.01475169, 0.00112137, 0.01043924, 0.00980162, 0.01017795,
       0.00487578, 0.03588041, 0.01293443
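The raw `cv_results_` dict printed above is hard to read. A minimal sketch of a more readable view: load it into a DataFrame and keep only the parameter and score columns. The data is synthetic and the grid is deliberately small so that it runs quickly; neither matches the notebook's actual search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small hypothetical grid
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
grid = {"n_estimators": [10, 20], "max_depth": [2, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid=grid, cv=3)
search.fit(X, y)

# cv_results_ is a dict of equal-length arrays, so it maps directly onto a DataFrame
results = pd.DataFrame(search.cv_results_)
cols = ["param_n_estimators", "param_max_depth",
        "mean_test_score", "std_test_score", "rank_test_score"]
summary = results[cols].sort_values("rank_test_score")
print(summary)
```

Each row is one parameter combination; `rank_test_score` equal to 1 marks the combination reported by `best_params_`.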