READ
Dataset URL: https://archive.ics.uci.edu/dataset/327/phishing+websites



In [3]:
import arff
import pandas as pd
import os
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# ========== Paths ==========
BASE_DIR = os.getcwd()
DATA_PATH = os.path.join(BASE_DIR, "phishing.arff")

# ========== Read data from the .arff file ==========
with open(DATA_PATH, "r") as f:
    data = arff.load(f)

columns = [col[0] for col in data["attributes"]]
df = pd.DataFrame(data["data"], columns=columns).astype(int)

# ========== Overview ==========
print("🧾 Dataset overview:")
print(f"- Number of rows (samples): {df.shape[0]}")
print(f"- Number of attributes (features): {df.shape[1] - 1}")
print("\n📌 List of attributes:")
for i, col in enumerate(df.columns[:-1]):
    print(f"{i+1:2d}. {col}")

# ========== Feature analysis ==========
X = df.drop("Result", axis=1)
y = df["Result"]

# Use RandomForest to estimate feature importance
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)
importances = rf.feature_importances_

# Build a DataFrame for a clearer display
importance_df = pd.DataFrame({
    "Feature": X.columns,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

print("\n🔥 Top 10 most important features (by RandomForest):")
print(importance_df.head(10).to_string(index=False))

# ========== Top features by Mutual Information ==========
selector = SelectKBest(score_func=mutual_info_classif, k=10)
selector.fit(X, y)
mi_scores = pd.DataFrame({
    "Feature": X.columns,
    "MI Score": selector.scores_
}).sort_values(by="MI Score", ascending=False)

print("\n💡 Top 10 most important features (by Mutual Information):")
print(mi_scores.head(10).to_string(index=False))


🧾 Dataset overview:
- Number of rows (samples): 11055
- Number of attributes (features): 30

📌 List of attributes:
 1. having_IP_Address
 2. URL_Length
 3. Shortining_Service
 4. having_At_Symbol
 5. double_slash_redirecting
 6. Prefix_Suffix
 7. having_Sub_Domain
 8. SSLfinal_State
 9. Domain_registeration_length
10. Favicon
11. port
12. HTTPS_token
13. Request_URL
14. URL_of_Anchor
15. Links_in_tags
16. SFH
17. Submitting_to_email
18. Abnormal_URL
19. Redirect
20. on_mouseover
21. RightClick
22. popUpWidnow
23. Iframe
24. age_of_domain
25. DNSRecord
26. web_traffic
27. Page_Rank
28. Google_Index
29. Links_pointing_to_page
30. Statistical_report

🔥 Top 10 most important features (by RandomForest):
                    Feature  Importance
             SSLfinal_State    0.318529
              URL_of_Anchor    0.262463
                web_traffic    0.070082
          having_Sub_Domain    0.060848
              Links_in_tags    0.041492
              Prefix_S

At a glance, this is a binary classification problem: is a website phishing or not?

DATA DESCRIPTION:
There are 30 attributes; features (and labels) take discrete values:
-1: phishing
1: legitimate
0: suspicious
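
The coding scheme above can be made explicit in code. The following is a minimal sketch, not part of the original notebook: the `toy` rows, the `CODE_NAMES` mapping, and the `is_phishing` column are illustrative assumptions. It maps the codes to readable names and recodes the target so that phishing becomes the positive class.

```python
import pandas as pd

# Hypothetical mapping that follows the codes listed above.
CODE_NAMES = {-1: "phishing", 0: "suspicious", 1: "legitimate"}

# Toy rows for illustration only; these are not real records from the dataset.
toy = pd.DataFrame({
    "SSLfinal_State": [-1, 0, 1, 1],
    "Result": [-1, -1, 1, 1],
})

# Readable view of one feature column.
toy["SSLfinal_State_name"] = toy["SSLfinal_State"].map(CODE_NAMES)

# Recode the target: phishing (-1) becomes 1, legitimate (1) becomes 0.
# This is convenient when computing precision/recall for the phishing class.
toy["is_phishing"] = (toy["Result"] == -1).astype(int)

print(toy)
```

Recoding the target is optional; scikit-learn classifiers accept the original -1/1 labels directly.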

LIBRARIES / MODELS USED:
- Decision Tree.
- Random Forest. Uses bagging.
- K-Nearest Neighbors
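
One way to compare the three models listed above is cross-validated accuracy. The sketch below is an assumption-laden illustration: it uses synthetic data (`X_demo`, `y_demo`, and the labelling rule are made up) so that it runs without `phishing.arff`, and the hyperparameters are arbitrary defaults rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 500 rows, 10 features coded -1/0/1.
rng = np.random.default_rng(42)
X_demo = rng.integers(-1, 2, size=(500, 10))
# Hypothetical labelling rule so the models have something to learn.
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 0, 1, -1)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    # Mean accuracy over 5 cross-validation folds.
    cv = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    scores[name] = cv.mean()
    print(f"{name:20s} accuracy = {scores[name]:.3f}")
```

To run this on the real data, replace `X_demo` and `y_demo` with the `X` and `y` built in the notebook cell above.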

1. Introduction
2. Theoretical foundations of Machine Learning
3. Data and processing
4. Model and training
5. Evaluating results
6. Real-world application
7. Conclusion and future work
8. Appendix: Code, demo

In [4]:
df.head()

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,...,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1
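
Beyond eyeballing the first rows, a quick check of the value domain and class balance can confirm that every column uses only the codes -1, 0, and 1. This is a minimal sketch: the `demo` frame copies three columns from the preview above, and the names `domains`, `all_valid`, and `class_counts` are illustrative.

```python
import pandas as pd

# Three columns copied from the five-row preview above.
demo = pd.DataFrame({
    "having_IP_Address": [-1, 1, 1, 1, 1],
    "URL_Length": [1, 1, 0, 0, 0],
    "Result": [-1, -1, -1, -1, 1],
})

# Distinct codes per column, sorted for readability.
domains = {col: sorted(demo[col].unique().tolist()) for col in demo.columns}

# True only if every cell holds one of the expected codes.
all_valid = bool(demo.isin([-1, 0, 1]).all().all())

# How many rows fall into each class of the target.
class_counts = demo["Result"].value_counts().to_dict()

print(domains)
print("all values valid:", all_valid)
print("class counts:", class_counts)
```

On the full dataset the same three statements can be applied to `df` directly.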
