11

Random Forest Classifier:

Random forests already have built-in feature importance measures derived from the mean decrease in impurity or the mean decrease in accuracy of the tree nodes when a particular feature is used for splitting. 

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

random_forest = RandomForestClassifier(random_state=1)
random_forest.fit(X_train, y_train)
feature_importances = random_forest.feature_importances_

k = 10
top_k_indices = feature_importances.argsort()[-k:][::-1]

X_train_selected = X_train.iloc[:, top_k_indices]
X_test_selected = X_test.iloc[:, top_k_indices]

random_forest_selected = RandomForestClassifier(random_state=1)
random_forest_selected.fit(X_train_selected, y_train)

y_pred_selected = random_forest_selected.predict(X_test_selected)
accuracy_selected = accuracy_score(y_test, y_pred_selected)
auc_selected = roc_auc_score(y_test, random_forest_selected.predict_proba(X_test_selected)[:, 1])

print("Random Forest with Feature Selection - Accuracy:", accuracy_selected)
print("Random Forest with Feature Selection - AUC:", auc_selected)

Random Forest with Feature Selection - Accuracy: 1.0
Random Forest with Feature Selection - AUC: 0.3214285714285714


Support Vector Machine (SVM):

For SVM, we can use techniques like Recursive Feature Elimination (RFE) or SelectKBest to select the most informative features before training the model.

In [25]:
from sklearn.svm import SVC
from sklearn.feature_selection import RFE

svm_model = SVC(kernel='linear')
rfe = RFE(estimator=svm_model, n_features_to_select=k, step=1)
rfe.fit(X_train, y_train)

X_train_selected = X_train.iloc[:, rfe.support_]
X_test_selected = X_test.iloc[:, rfe.support_]

svm_selected = SVC(kernel='linear', probability=True)
svm_selected.fit(X_train_selected, y_train)

y_proba_selected = svm_selected.predict_proba(X_test_selected)[:, 1]
auc_selected = roc_auc_score(y_test, y_proba_selected)

print("SVM with Feature Selection - Accuracy:", accuracy_selected)
print("SVM with Feature Selection - AUC:", auc_selected)

SVM with Feature Selection - Accuracy: 1.0
SVM with Feature Selection - AUC: 0.14285714285714285


12

In [26]:
random_forest_selected = RandomForestClassifier(random_state=1)
random_forest_selected.fit(X_train_selected, y_train)
y_pred_selected_rf = random_forest_selected.predict(X_test_selected)
accuracy_selected_rf = accuracy_score(y_test, y_pred_selected_rf)
auc_selected_rf = roc_auc_score(y_test, random_forest_selected.predict_proba(X_test_selected)[:, 1])

svm_selected = SVC(kernel='linear', probability=True)
svm_selected.fit(X_train_selected, y_train)
y_pred_selected_svm = svm_selected.predict(X_test_selected)
accuracy_selected_svm = accuracy_score(y_test, y_pred_selected_svm)
auc_selected_svm = roc_auc_score(y_test, svm_selected.predict_proba(X_test_selected)[:, 1])

print("Random Forest with Feature Selection - Accuracy:", accuracy_selected_rf)
print("Random Forest with Feature Selection - AUC:", auc_selected_rf)
print("SVM with Feature Selection - Accuracy:", accuracy_selected_svm)
print("SVM with Feature Selection - AUC:", auc_selected_svm)

Random Forest with Feature Selection - Accuracy: 0.975
Random Forest with Feature Selection - AUC: 0.35714285714285715
SVM with Feature Selection - Accuracy: 0.9583333333333334
SVM with Feature Selection - AUC: 0.026571428571428548


- Both classifiers demonstrated similar accuracy in predicting class labels, with the Random Forest classifier achieving a slightly higher AUC score.

- While both classifiers performed well in terms of accuracy, the Random Forest classifier's higher AUC score suggests that it may be more effective in ranking instances by their likelihood of belonging to the positive class.

13

In [27]:
from sklearn.linear_model import LogisticRegression

logistic_regression = LogisticRegression(max_iter=1000) 
logistic_regression.fit(X, y)

coefficients = logistic_regression.coef_[0]
feature_coefficients = dict(zip(X.columns, coefficients))
sorted_coefficients = sorted(feature_coefficients.items(), key=lambda x: abs(x[1]), reverse=True)

print("Top Predictor Variables and Coefficients:")
for feature, coefficient in sorted_coefficients[:5]:
    print(f"{feature}: {coefficient}")

Top Predictor Variables and Coefficients:
pe: 1.0853612237222734
sc: 0.946551056245395
al: 0.9013667373936388
hemo: -0.5496185568031066
htn: 0.5469077051944204


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


- Larger absolute coefficients indicate stronger influence of the corresponding predictor variables on the classification decision.

- By examining the top predictor variables with large coefficients, we can identify key clinical indicators that are significantly associated with the presence of chronic kidney disease.

14

In [29]:
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, roc_auc_score

kmeans = KMeans(n_clusters=3, random_state=42)
X_train_clusters = kmeans.fit_predict(X_train)

classifiers = {}

unique_clusters = set(X_train_clusters)

for cluster in unique_clusters:
    X_train_cluster = X_train[X_train_clusters == cluster]
    y_train_cluster = y_train[X_train_clusters == cluster] 
    classifier = RandomForestClassifier(random_state=42)
    classifier.fit(X_train_cluster, y_train_cluster)
    classifiers[cluster] = classifier

X_test_clusters = kmeans.predict(X_test)

y_pred_new = []  

for cluster, classifier in classifiers.items():
    X_test_cluster = X_test[X_test_clusters == cluster]
    y_test_cluster = y_test[X_test_clusters == cluster]
    y_pred_cluster = classifier.predict(X_test_cluster)
    y_pred_new.extend(y_pred_cluster)  

    print(f"Cluster: {cluster}")
    print(classification_report(y_test_cluster, y_pred_cluster))

accuracy_new = accuracy_score(y_test, y_pred_new)
auc_new = roc_auc_score(y_test, y_pred_new)  

print("Comparison:")
print("12")
print("Random Forest with Feature Selection - Accuracy:", accuracy_selected_rf)
print("Random Forest with Feature Selection - AUC:", auc_selected_rf)
print("SVM with Feature Selection - Accuracy:", accuracy_selected_svm)
print("SVM with Feature Selection - AUC:", auc_selected_svm)
print("14")
print("New Classifier - Accuracy:", accuracy_new)
print("New Classifier - AUC:", auc_new)

  super()._check_params_vs_input(X, default_n_init=10)


Cluster: 0
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        35
           2       1.00      1.00      1.00        50

    accuracy                           1.00        85
   macro avg       1.00      1.00      1.00        85
weighted avg       1.00      1.00      1.00        85

Cluster: 1
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        24

    accuracy                           1.00        24
   macro avg       1.00      1.00      1.00        24
weighted avg       1.00      1.00      1.00        24

Cluster: 2
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        11

    accuracy                           1.00        11
   macro avg       1.00      1.00      1.00        11
weighted avg       1.00      1.00      1.00        11

Comparison:
12
Random Forest with Feature Selection - Accuracy: 0.975
Random Forest with Featu

In contrast, the accuracy in 14 is lower than that in 12, but the AUC has increased.

15

Haoyang Huang, 400302198, Question 1-5

Xinshan Li, 400248868, Question 6-10, Set up Github

Yining Fu, 400300139, Question 11-15

16

https://github.com/Terry2314/ASSIGNMENT6