# TensorFlow with GPU

This notebook provides an introduction to computing on a [GPU](https://cloud.google.com/gpu) in Colab. In this notebook you will connect to a GPU, and then run some basic TensorFlow operations on both the CPU and a GPU, observing the speedup provided by using the GPU.


## Enabling and testing the GPU

First, you'll need to enable GPUs for the notebook:

- Navigate to Edit→Notebook Settings
- Select GPU from the Hardware Accelerator drop-down

Next, we'll confirm that we can connect to the GPU with TensorFlow:

In [1]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0
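
The XGBoost cells later in this notebook use CUDA directly rather than going through TensorFlow, so it can also help to confirm the GPU at the driver level. The following is a minimal sketch, assuming the `nvidia-smi` tool is on the PATH (as it typically is on Colab GPU runtimes). The helper name `list_nvidia_gpus` is hypothetical and not part of the original notebook.

```python
import shutil
import subprocess


def list_nvidia_gpus():
    """Return the GPU names reported by nvidia-smi, or [] if none are visible."""
    # nvidia-smi is usually absent on CPU-only runtimes.
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=30,
        )
    except (subprocess.SubprocessError, OSError):
        # The tool exists but could not reach a driver or device.
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpu_names = list_nvidia_gpus()
print(gpu_names if gpu_names else "No NVIDIA GPU visible")
```

An empty list here means that `device="cuda"` in the XGBoost cells is unlikely to work on the current runtime.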


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# import numpy as np
# labels = np.load('/content/drive/Shareddrives/Data245/Project/Dataset/labels.npy')

In [None]:
# # Get the unique values and their count
# unique_values = np.unique(labels)
# num_unique_values = len(unique_values)

# print("Unique values in the array:", unique_values)
# print("Number of unique values:", num_unique_values)

Unique values in the array: ['!' '"' '#' ... 'zirconia' 'zirconium-95' 'zone']
Number of unique values: 13550


In [None]:
# from collections import Counter
# # Count the number of occurrences of each element
# label_counts = Counter(labels)

# # Get the 500 most frequent labels and their counts
# top_labels_counts = label_counts.most_common(500)

# # Print results
# for label, count in top_labels_counts:
#     print(f"Label {label} appears {count} times")
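
As a small illustration of the counting approach in the commented-out cell above, here is a sketch on a toy list (the counts in `toy_labels` are made up for the example). `Counter.most_common(n)` returns the `n` most frequent items as `(item, count)` pairs in descending order of count.

```python
from collections import Counter

# Toy data: the counts here are illustrative only.
toy_labels = ["zone", "zirconia", "zone", "acid", "zone", "acid"]

label_counts = Counter(toy_labels)

# Keep only the two most frequent labels.
top_2 = label_counts.most_common(2)
for label, count in top_2:
    print(f"Label {label} appears {count} times")
```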

In [24]:
# Load the top100 proportional PCA features and the matching labels from the shared drive

import numpy as np
features_pca = np.load('/content/drive/Shareddrives/Data245/Project/Dataset/top100_features_Proportional_pca.npy')
labels = np.load('/content/drive/Shareddrives/Data245/Project/Dataset/top100_Labels_Proportional.npy')

In [26]:
print("labels:", labels.shape)
print("features_pca:", features_pca.shape)

labels: (60261,)
features_pca: (60261, 201)


In [None]:
# # Take the first 100 rows from labels and features_pca (the _1000 suffix is a leftover name)

# labels_1000 = labels[:100]
# features_pca_1000 = features_pca[:100]

In [None]:
# from sklearn.preprocessing import LabelEncoder
# from sklearn.model_selection import train_test_split

# # Create a LabelEncoder instance
# label_encoder = LabelEncoder()

# # Convert the string class labels to integers
# y_encoded = label_encoder.fit_transform(labels_1000)

# # Assign the PCA-processed features to X
# X = features_pca_1000

# # Split the dataset using the encoded labels and the features
# X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)


In [27]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# create LabelEncoder instance
label_encoder = LabelEncoder()

# Convert the string category label to an integer
y_encoded = label_encoder.fit_transform(labels)

# Assign the PCA-processed features to X
X = features_pca

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, random_state=42)
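
To make the encoding step concrete, here is a pure-Python sketch of the mapping that `LabelEncoder` builds: the distinct classes are sorted and each one is assigned its position in that order. The function name `fit_label_mapping` and the toy labels are hypothetical; the real encoder works on NumPy arrays.

```python
def fit_label_mapping(labels):
    """Return (classes, class_to_index) with classes in sorted order."""
    classes = sorted(set(labels))
    class_to_index = {cls: idx for idx, cls in enumerate(classes)}
    return classes, class_to_index


toy = ["zone", "acid", "zone", "zirconia"]
classes, class_to_index = fit_label_mapping(toy)

# Forward transform: strings to integer codes.
encoded = [class_to_index[label] for label in toy]

# Inverse transform: integer codes back to strings.
decoded = [classes[idx] for idx in encoded]

print(classes)
print(encoded)
print(decoded)
```

The inverse step is the same lookup that `label_encoder.inverse_transform` performs when predictions are converted back to strings.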


In [13]:
print(X_train.dtype)
print(y_train.dtype)

float64
int64
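
The features are stored as `float64`. XGBoost generally works with 32-bit floats internally, so casting the features to `float32` before training may halve the host memory they occupy at little cost. The sketch below shows the effect on a random stand-in array (`X_demo` is hypothetical; only its column count matches the notebook's data).

```python
import numpy as np

# Stand-in for the PCA features: 1000 rows, 201 columns, float64 by default.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 201))

# Cast to 32-bit floats; this makes a copy with half the bytes per value.
X_demo32 = X_demo.astype(np.float32)

print(X_demo.dtype, X_demo.nbytes)
print(X_demo32.dtype, X_demo32.nbytes)

# Largest absolute change introduced by the cast.
print(np.max(np.abs(X_demo - X_demo32)))
```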


In [28]:
import datetime
import time

import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score

# Define the model (histogram tree method, trained on the GPU)
xgb_model = xgb.XGBClassifier(use_label_encoder=False, eval_metric='mlogloss',
                              tree_method="hist", device="cuda")

# Define the parameter search space
param_distributions = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 150],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(estimator=xgb_model, param_distributions=param_distributions,
                                   n_iter=10,  # number of random parameter samples; adjust as needed
                                   scoring='accuracy', cv=3, verbose=1, random_state=42)

# Run the random search
start_time = time.time()
random_search.fit(X_train, y_train)
end_time = time.time()

print(f"Training started at: {datetime.datetime.fromtimestamp(start_time)}")
print(f"Training ended at: {datetime.datetime.fromtimestamp(end_time)}")
print(f"Training duration: {end_time - start_time} seconds")

# Print the best parameters and the best cross-validation score
print("Best parameters found: ", random_search.best_params_)
print("Best cross-validation score: {:.2f}".format(random_search.best_score_))

# Evaluate the best model on the test set
best_model = random_search.best_estimator_
predictions = best_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy on test set: {:.2f}".format(accuracy))

Fitting 3 folds for each of 10 candidates, totalling 30 fits
Training started at: 2024-04-28 04:11:36.471732
Training ended at: 2024-04-28 04:34:27.665627
Training duration: 1371.1938955783844 seconds
Best parameters found:  {'subsample': 1.0, 'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 1.0}
Best cross-validation score: 0.56
Accuracy on test set: 0.58
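
To put the search above in perspective, the following sketch works out how much of the parameter space was sampled and roughly how long each fit took. It reuses the search space, `n_iter`, `cv`, and the reported duration from the cell above. It assumes the default `refit=True`, which adds one final fit on the full training set to the 30 cross-validation fits.

```python
from itertools import product

param_distributions = {
    'max_depth': [3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.2],
    'n_estimators': [50, 100, 150],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
}
n_iter, cv = 10, 3
duration_seconds = 1371.1938955783844  # from the training output above

# Every possible combination of the listed values.
total_combinations = len(list(product(*param_distributions.values())))

# Fraction of the space that the random search actually tried.
coverage = n_iter / total_combinations

# Cross-validation fits, plus one refit of the best candidate (assumed refit=True).
cv_fits = n_iter * cv
total_fits = cv_fits + 1
avg_seconds_per_fit = duration_seconds / total_fits

print(f"Combinations in the space: {total_combinations}")
print(f"Share sampled: {coverage:.1%}")
print(f"Fits run: {total_fits}, about {avg_seconds_per_fit:.1f} s each on average")
```

Only a small share of the space is explored, so raising `n_iter` may find better settings at a roughly proportional cost in time. The per-fit figure is only an average, since candidates with more trees or deeper trees take longer.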


In [None]:
import xgboost as xgb

# Create DMatrix data structure
dtrain = xgb.DMatrix(X_train, label=y_train)
dtest = xgb.DMatrix(X_test, label=y_test)

# Set parameters
params = {
    'max_depth': 6,         # maximum depth of tree
    'eta': 0.3,             # learning rate
    'objective': 'multi:softmax',  # Objective function, softmax for multi-class classification
    'num_class': len(set(y_encoded)),  # Number of categories
    'tree_method': 'hist',  # Histogram-based tree construction
    'device': 'cuda'        # Run training on the GPU
}

# Number of training rounds
num_round = 100

# Training
bst = xgb.train(params, dtrain, num_round)

# Predict
preds = bst.predict(dtest)

# Evaluation
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, preds)
print(f"Accuracy: {accuracy:.2f}")


Accuracy: 0.44
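
With the `multi:softmax` objective used above, `bst.predict` returns one class index per row. If per-class probabilities are wanted instead, the `multi:softprob` objective returns one probability per class for each row, and the predicted class is the position of the largest value. The sketch below shows that argmax step and the accuracy calculation in plain Python on made-up probabilities (`toy_probs`, `toy_truth`, and `argmax_rows` are hypothetical).

```python
def argmax_rows(prob_rows):
    """Return the index of the largest probability in each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in prob_rows]


# Made-up class probabilities for 3 rows and 3 classes.
toy_probs = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.2, 0.6],
]
toy_truth = [1, 2, 2]

toy_preds = argmax_rows(toy_probs)

# Accuracy is the share of rows where the predicted class equals the true class.
toy_accuracy = sum(p == t for p, t in zip(toy_preds, toy_truth)) / len(toy_truth)

print(toy_preds)
print(f"Accuracy: {toy_accuracy:.2f}")
```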


In [None]:
# If needed, convert the predicted integer labels back to the original strings
y_pred_labels = label_encoder.inverse_transform(preds.astype(int))