## Smishing Detection Model

Ankit Kumar Jain, B.B. Gupta,
Rule-Based Framework for Detection of Smishing Messages in Mobile Environment,
Procedia Computer Science,
Volume 125,
2018,
Pages 617-623,
ISSN 1877-0509,
https://doi.org/10.1016/j.procs.2017.12.079.
(https://www.sciencedirect.com/science/article/pii/S1877050917328478)
Abstract: Smishing is a cyber-security attack, which utilizes Short Message Service (SMS) to steal personal credentials of mobile users. The trust level of users on their smart devices has attracted attackers for performing various mobile security attacks like Smishing. In this paper, we implement the rule-based data mining classification approach in the detection of smishing messages. The proposed approach identified nine rules which can efficiently filter smishing SMS from the genuine one. Further, our approach applies rule-based classification algorithms to train these outstanding rules. Since the SMS text messages are very short and generally written in Lingo language, we have used text normalization to convert them into standard form to obtain better rules. The performance of the proposed approach is evaluated, and it achieved more than 99% true negative rate. Furthermore, the proposed approach is very efficient for the detection of the zero hour attack too.
Keywords: Smishing; Mobile Phishing; Data mining; Short messaging service; Machine learning

> Reproduction of results

In [1]:
import pandas
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import ruleset  # module providing rule1..rule9 (not shown in this notebook; presumably project-local)

### 1. Data preprocessing

In [2]:
# Import the CSV dataset as a dataframe (pandas was imported in cell 1)
df = pandas.read_csv('SMSSpamCollectionDataset.csv', encoding='latin-1')
df = df[['label', 'text']]

# Display the dataframe; pandas shows the first and last rows and elides the middle

df

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5567 | spam | This is the 2nd time we have tried 2 contact u... |
| 5568 | ham | Will Ì_ b going to esplanade fr home? |
| 5569 | ham | Pity, * was in mood for that. So...any other s... |
| 5570 | ham | The guy did some bitching but I acted like i'd... |


In [3]:
# Download the required NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)

def normalize_text(original_text) -> str:
    '''
    Takes an SMS message.

    Returns the normalized text (a more conservative normalization, for better performance).
    '''

    # Convert to lowercase
    text = original_text.lower()

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize (fall back to a simpler method if the punkt data is unavailable)
    try:
        words = nltk.word_tokenize(text)
    except LookupError:
        # word_tokenize raises LookupError when its data is missing; fall back to whitespace splitting
        words = text.split()

    # Remove only the most common stopwords (more conservative)
    # IMPORTANT: preserve financial and mathematical symbols even though they are short
    common_stopwords = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could', 'should'}

    # Financial and mathematical symbols that matter for spam/smishing detection
    important_symbols = {
        # Financial symbols
        '$', '£', '€', '¥', '₹', '¢',
        # Mathematical symbols
        '+', '-', '*', '/', '=', '<', '>', '≤', '≥', '≠', '±', '×', '÷',
        # Other important symbols
        '%', '#', '@', '&', '!', '?'
    }

    # Keep a word if: it is not a stopword AND (it has more than 1 character OR it is an important symbol)
    words = [word for word in words if word not in common_stopwords and (len(word) > 1 or word in important_symbols)]

    # No stemming is applied, so words stay intact.
    # Drop tokens of 1-2 characters (this includes short numbers such as '2' or '20')
    # unless they are important symbols
    words = [word for word in words if len(word) > 2 or word in important_symbols]

    # Join words back into a string
    normalized_text = ' '.join(words)

    return normalized_text

# Test with an example
sample_text = "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005!"
print(f"Original: {sample_text}")
print(f"Normalized: {normalize_text(sample_text)}")

Original: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005!
Normalized: free entry wkly comp win cup final tkts 21st may 2005 !
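One side effect of the length filter is worth noting: any token of one or two characters that is not in `important_symbols` is discarded, and that includes short numbers such as `2` or `20`. The sketch below reproduces only the two filtering steps on an already tokenized message. The helper name `filter_tokens` and the reduced stopword and symbol sets are assumptions made for illustration; they are not part of the notebook.

```python
# Hypothetical helper mirroring the two filters in normalize_text.
# The reduced sets below are illustrative, not the notebook's full sets.
COMMON_STOPWORDS = {"the", "a", "an", "in", "to", "of"}
IMPORTANT_SYMBOLS = {"$", "£", "€", "%", "+", "=", "!", "?"}


def filter_tokens(tokens):
    """Drop stopwords, then drop tokens shorter than 3 characters unless they are kept symbols."""
    kept = [t for t in tokens
            if t not in COMMON_STOPWORDS and (len(t) > 1 or t in IMPORTANT_SYMBOLS)]
    return [t for t in kept if len(t) > 2 or t in IMPORTANT_SYMBOLS]


# Tokens as a tokenizer might produce them for "Get $500 + 20% bonus = $600 total!"
tokens = ["get", "$", "500", "+", "20", "%", "bonus", "=", "$", "600", "total", "!"]
filtered = filter_tokens(tokens)
print(" ".join(filtered))
```

In this sketch the amount `500` survives but the percentage value `20` does not, so any downstream rule that depends on short numeric values would not see them.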


(Debugging why the dollar sign was disappearing, and verifying that normalization still works...)

In [4]:
# DEBUG: trace step by step where the $ is being lost
def debug_normalize_text(original_text):
    print(f"1. Original: '{original_text}'")
    
    # Convert to lowercase
    text = original_text.lower()
    print(f"2. Lowercase: '{text}'")
    
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    print(f"3. Remove spaces: '{text}'")
    
    # Tokenize
    try:
        words = nltk.word_tokenize(text)
        print(f"4. NLTK tokenize: {words}")
    except LookupError:
        words = text.split()
        print(f"4. Simple split: {words}")
    
    # Remove stopwords
    common_stopwords = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could', 'should'}
    important_symbols = {
        # Financial symbols
        '$', '£', '€', '¥', '₹', '¢', 
        # Mathematical symbols
        '+', '-', '*', '/', '=', '<', '>', '≤', '≥', '≠', '±', '×', '÷',
        # Other important symbols
        '%', '#', '@', '&', '!', '?'
    }
    
    # Keep a word if: it is not a stopword AND (it has more than 1 character OR it is an important symbol)
    words_after_stopwords = [word for word in words if word not in common_stopwords and (len(word) > 1 or word in important_symbols)]
    print(f"5. After stopwords removal: {words_after_stopwords}")
    
    # Drop short tokens (1-2 characters) but preserve important symbols
    words_after_length = [word for word in words_after_stopwords if len(word) > 2 or word in important_symbols]
    print(f"6. After length filter (>2): {words_after_length}")
    
    result = ' '.join(words_after_length)
    print(f"7. Final result: '{result}'")
    print("-" * 50)
    return result

# Test with problematic examples
test_cases = [
    "Get $500 bonus!",
    "Only $19.99 today!",
    "Win £1000 prize!",
    "Cost €250 only"
]

for test in test_cases:
    debug_normalize_text(test)

1. Original: 'Get $500 bonus!'
2. Lowercase: 'get $500 bonus!'
3. Remove spaces: 'get $500 bonus!'
4. NLTK tokenize: ['get', '$', '500', 'bonus', '!']
5. After stopwords removal: ['get', '$', '500', 'bonus', '!']
6. After length filter (>2): ['get', '$', '500', 'bonus', '!']
7. Final result: 'get $ 500 bonus !'
--------------------------------------------------
1. Original: 'Only $19.99 today!'
2. Lowercase: 'only $19.99 today!'
3. Remove spaces: 'only $19.99 today!'
4. NLTK tokenize: ['only', '$', '19.99', 'today', '!']
5. After stopwords removal: ['only', '$', '19.99', 'today', '!']
6. After length filter (>2): ['only', '$', '19.99', 'today', '!']
7. Final result: 'only $ 19.99 today !'
--------------------------------------------------
1. Original: 'Win £1000 prize!'
2. Lowercase: 'win £1000 prize!'
3. Remove spaces: 'win £1000 prize!'
4. NLTK tokenize: ['win', '£1000', 'prize', '!']
5. After stopwords removal: ['win', '£1000', 'prize', '!']
6. After length filter (>2): ['win', '£10

In [5]:
# Test with examples that contain mathematical and financial symbols
math_test_cases = [
    "Get $500 + 20% bonus = $600 total!",
    "Discount: 50% - 10% = 40% off today!",
    "Calculate: 2 + 2 = 4, 10 * 5 = 50",
    "Price: £100 > £80 (save £20!)",
    "Rate: 5% < 10% but > 2%",
    "Win €250 × 2 = €500 prize!",
    "Cost: $19.99 ÷ 2 = $9.99 each"
]

print("=== TEST WITH MATHEMATICAL AND FINANCIAL SYMBOLS ===")
for i, text in enumerate(math_test_cases, 1):
    normalized = normalize_text(text)
    print(f"\n{i}. Original: {text}")
    print(f"   Normalized: {normalized}")
    
    # Check whether the important symbols were preserved
    important_symbols_in_text = {
        'financial': ['$', '£', '€', '¥', '₹', '¢'],
        'mathematical': ['+', '-', '*', '/', '=', '<', '>', '≤', '≥', '≠', '±', '×', '÷'],
        'other': ['%', '#', '@', '&', '!', '?']
    }
    
    # Symbols that survived are listed as-is; symbols that were lost are prefixed with ❌
    found_symbols = []
    for category, symbols in important_symbols_in_text.items():
        for sym in symbols:
            if sym in text and sym in normalized:
                found_symbols.append(f"{sym}({category})")
            elif sym in text and sym not in normalized:
                found_symbols.append(f"❌{sym}({category})")
    
    if found_symbols:
        print(f"   Preserved symbols: {', '.join(found_symbols)}")

=== TEST WITH MATHEMATICAL AND FINANCIAL SYMBOLS ===

1. Original: Get $500 + 20% bonus = $600 total!
   Normalized: get $ 500 + % bonus = $ 600 total !
   Preserved symbols: $(financial), +(mathematical), =(mathematical), %(other), !(other)

2. Original: Discount: 50% - 10% = 40% off today!
   Normalized: discount % - % = % off today !
   Preserved symbols: -(mathematical), =(mathematical), %(other), !(other)

3. Original: Calculate: 2 + 2 = 4, 10 * 5 = 50
   Normalized: calculate + = * =
   Preserved symbols: +(mathematical), *(mathematical), =(mathematical)

4. Original: Price: £100 > £80 (save £20!)
   Normalized: price £100 > £80 save £20 !
   Preserved symbols: £(financial), >(mathematical), !(other)

5. Original: Rate: 5% < 10% but > 2%
   Normalized: rate % < % > %
   Preserved symbols: <(mathematical), >(mathematical), %(other)

6. Original: Win €250 × 2 = €500 prize!
   Normalized: win €250 × = €500 prize !
   Preserved symbols: €(financial), =(mathema

In [6]:
# Apply the normalize_text function to the text column
df['normalized_text'] = df['text'].apply(normalize_text)

# Display the first few rows to see the normalized text
print(df[['text', 'normalized_text']].head())

                                                text  \
0  Go until jurong point, crazy.. Available only ...   
1                      Ok lar... Joking wif u oni...   
2  Free entry in 2 a wkly comp to win FA Cup fina...   
3  U dun say so early hor... U c already then say...   
4  Nah I don't think he goes to usf, he lives aro...   

                                     normalized_text  
0  until jurong point crazy available only bugis ...  
1                         lar ... joking wif oni ...  
2  free entry wkly comp win cup final tkts 21st m...  
3         dun say early hor ... already then say ...  
4    nah n't think goes usf lives around here though  


### 2. Feature extraction



In [7]:
# Apply each rule function from the ruleset module to create one column per rule
for i in range(1, 10):
    df[f'rule{i}'] = df['normalized_text'].apply(getattr(ruleset, f'rule{i}'))

# Display the dataframe with all rule columns
print("Shape after adding rule columns:", df.shape)
df.head()

Shape after adding rule columns: (5572, 12)


| | label | text | normalized_text | rule1 | rule2 | rule3 | rule4 | rule5 | rule6 | rule7 | rule8 | rule9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... | until jurong point crazy available only bugis ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1 | ham | Ok lar... Joking wif u oni... | lar ... joking wif oni ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | free entry wkly comp win cup final tkts 21st m... | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |
| 3 | ham | U dun say so early hor... U c already then say... | dun say early hor ... already then say ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... | nah n't think goes usf lives around here though | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
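The `ruleset` module itself is not shown in this notebook, so its exact logic is unknown. As a rough sketch only, assuming each rule is a function that takes the normalized text and returns 1 when its pattern is present and 0 otherwise, two such rules might look like the following. The names `has_phone_number` and `has_email` and both regular expressions are hypothetical; they only echo the later remarks that rule4 relates to phone numbers and rule9 to email addresses.

```python
import re

# Hypothetical patterns; the real ruleset module may use different ones.
PHONE_PATTERN = re.compile(r"\d{5,}")                      # a run of 5 or more digits
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")    # simplified email shape


def has_phone_number(text: str) -> int:
    """Return 1 if the text contains a long digit run, else 0."""
    return int(bool(PHONE_PATTERN.search(text)))


def has_email(text: str) -> int:
    """Return 1 if the text contains something shaped like an email address, else 0."""
    return int(bool(EMAIL_PATTERN.search(text)))


for message in ["call 08712300220 now", "mail user@example.com", "mail user @ example.com"]:
    print(message, "->", has_phone_number(message), has_email(message))
```

Note that the third message, where `@` stands as a separate token, does not match the email pattern. If normalization separates `@` the way the debug output shows it separating `$`, a rule of this kind applied to `normalized_text` could never fire. That would be one possible explanation for a rule column that is zero everywhere, but it remains a guess without the module's source.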


In [8]:
import os
# Unset MPLBACKEND so that matplotlib falls back to its default backend
if 'MPLBACKEND' in os.environ:
    del os.environ['MPLBACKEND']

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Convert labels to binary values (ham=0, spam=1)
df['binary_label'] = df['label'].map({'ham': 0, 'spam': 1})

# Extract features (all rule columns) and target variable
X = df[['rule1', 'rule2', 'rule3', 'rule4', 'rule5', 'rule6', 'rule7', 'rule8', 'rule9']]
y = df['binary_label']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train a Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions on test data
y_pred = dt_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print("Decision Tree Classifier Results:")
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(class_report)
print("\nConfusion Matrix:")
print(conf_matrix)

# Calculate feature importances
feature_importances = dt_classifier.feature_importances_
feature_names = X.columns
importance_df = pandas.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df = importance_df.sort_values('Importance', ascending=False)

print("\nFeature Importance:")
print(importance_df)

# Print true negative rate as mentioned in the paper
tn, fp, fn, tp = conf_matrix.ravel()
tnr = tn / (tn + fp)
print(f"\nTrue Negative Rate: {tnr * 100:.2f}%")
print(f"True Positive Rate (Sensitivity/Recall): {tp / (tp + fn) * 100:.2f}%")
print(f"False Positive Rate: {fp / (fp + tn) * 100:.2f}%")
print(f"False Negative Rate: {fn / (fn + tp) * 100:.2f}%")

Decision Tree Classifier Results:
Accuracy: 98.92%

Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       965
           1       0.97      0.95      0.96       150

    accuracy                           0.99      1115
   macro avg       0.98      0.97      0.98      1115
weighted avg       0.99      0.99      0.99      1115


Confusion Matrix:
[[960   5]
 [  7 143]]

Feature Importance:
  Feature  Importance
3   rule4    0.902025
4   rule5    0.045402
0   rule1    0.016887
1   rule2    0.009734
2   rule3    0.009129
6   rule7    0.006816
5   rule6    0.005381
7   rule8    0.004627
8   rule9    0.000000

True Negative Rate: 99.48%
True Positive Rate (Sensitivity/Recall): 95.33%
False Positive Rate: 0.52%
False Negative Rate: 4.67%
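
The rates above follow directly from the confusion matrix. As a check, the arithmetic can be reproduced without scikit-learn. The sketch below hard-codes the four counts from the test-set matrix, where rows are true labels and columns are predictions, with ham as class 0.

```python
# Counts copied from the test-set confusion matrix above
tn, fp, fn, tp = 960, 5, 7, 143

accuracy = (tp + tn) / (tp + tn + fp + fn)
tnr = tn / (tn + fp)        # true negative rate: share of ham correctly kept
tpr = tp / (tp + fn)        # true positive rate (recall): share of spam caught
precision = tp / (tp + fp)  # share of flagged messages that really are spam

print(f"Accuracy:  {accuracy:.2%}")
print(f"TNR:       {tnr:.2%}")
print(f"TPR:       {tpr:.2%}")
print(f"Precision: {precision:.2%}")
```

Because ham outnumbers spam by roughly six to one in the test split (965 vs 150), the true negative rate and overall accuracy are both dominated by the ham class; the true positive rate is the more demanding figure here.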


In [9]:
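# Make predictions on the training data (to compare against the test-set results above)
# Note: this overwrites y_pred, accuracy, class_report and conf_matrix from the previous cell
y_pred = dt_classifier.predict(X_train)

# Evaluate the model on the training set
accuracy = accuracy_score(y_train, y_pred)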
class_report = classification_report(y_train, y_pred)
conf_matrix = confusion_matrix(y_train, y_pred)

# Print the results
print("Decision Tree Classifier Results:")
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(class_report)
print("\nConfusion Matrix:")
print(conf_matrix)

# Calculate feature importances
feature_importances = dt_classifier.feature_importances_
feature_names = X.columns
importance_df = pandas.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df = importance_df.sort_values('Importance', ascending=False)

print("\nFeature Importance:")
print(importance_df)

# Print true negative rate as mentioned in the paper
tn, fp, fn, tp = conf_matrix.ravel()
tnr = tn / (tn + fp)
print(f"\nTrue Negative Rate: {tnr * 100:.2f}%")
print(f"True Positive Rate (Sensitivity/Recall): {tp / (tp + fn) * 100:.2f}%")
print(f"False Positive Rate: {fp / (fp + tn) * 100:.2f}%")
print(f"False Negative Rate: {fn / (fn + tp) * 100:.2f}%")

Decision Tree Classifier Results:
Accuracy: 97.78%

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      3860
           1       0.95      0.88      0.91       597

    accuracy                           0.98      4457
   macro avg       0.97      0.94      0.95      4457
weighted avg       0.98      0.98      0.98      4457


Confusion Matrix:
[[3832   28]
 [  71  526]]

Feature Importance:
  Feature  Importance
3   rule4    0.902025
4   rule5    0.045402
0   rule1    0.016887
1   rule2    0.009734
2   rule3    0.009129
6   rule7    0.006816
5   rule6    0.005381
7   rule8    0.004627
8   rule9    0.000000

True Negative Rate: 99.27%
True Positive Rate (Sensitivity/Recall): 88.11%
False Positive Rate: 0.73%
False Negative Rate: 11.89%
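
One way to read the two evaluations together is to compare training and test accuracy: a model that overfits typically scores noticeably higher on its training data than on held-out data. The sketch below recomputes both accuracies from the two confusion matrices above; the helper name `accuracy_from_counts` is introduced here for illustration.

```python
def accuracy_from_counts(tn, fp, fn, tp):
    """Accuracy from the four cells of a binary confusion matrix."""
    return (tn + tp) / (tn + fp + fn + tp)


# Counts copied from the training-set and test-set confusion matrices
train_acc = accuracy_from_counts(tn=3832, fp=28, fn=71, tp=526)
test_acc = accuracy_from_counts(tn=960, fp=5, fn=7, tp=143)
gap = train_acc - test_acc

print(f"Train accuracy: {train_acc:.2%}")
print(f"Test accuracy:  {test_acc:.2%}")
print(f"Train - test:   {gap * 100:.2f} percentage points")
```

Here the gap is negative, meaning the model does slightly better on the held-out split than on the training split, which is the opposite of the usual overfitting signature. With only nine binary features, many distinct messages share the same feature vector, so the tree cannot memorize individual training examples.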


In [10]:
## Overfitting analysis

# Use cross-validation to check for overfitting
from sklearn.model_selection import cross_val_score, validation_curve, learning_curve
import numpy as np

# 1. 5-fold cross-validation
print("=== OVERFITTING ANALYSIS ===\n")

# Cross-validation
cv_scores = cross_val_score(dt_classifier, X, y, cv=5, scoring='accuracy')
print("Cross-validation (5-fold):")
print(f"Scores: {cv_scores}")
print(f"Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"Standard deviation: {cv_scores.std():.4f}")

# A high standard deviation across folds means unstable performance, which can be a sign of overfitting
if cv_scores.std() > 0.02:
    print("⚠️  WARNING: High standard deviation may indicate overfitting!")
else:
    print("✅ Low standard deviation - model appears stable")

print("\n" + "="*50)

=== OVERFITTING ANALYSIS ===

Cross-validation (5-fold):
Scores: [0.9793722  0.9838565  0.97576302 0.97666068 0.97486535]
Mean: 0.9781 (+/- 0.0065)
Standard deviation: 0.0032
✅ Low standard deviation - model appears stable
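
The summary line reports the mean of the five fold scores plus or minus two standard deviations. The sketch below recomputes it from the printed scores using only the standard library. It assumes the population standard deviation, which is what NumPy's `std` computes by default.

```python
from statistics import mean, pstdev

# Fold scores copied from the cross-validation output above
fold_scores = [0.9793722, 0.9838565, 0.97576302, 0.97666068, 0.97486535]

mean_score = mean(fold_scores)
std_score = pstdev(fold_scores)  # population standard deviation (ddof = 0)

print(f"Mean: {mean_score:.4f} (+/- {2 * std_score:.4f})")
print(f"Standard deviation: {std_score:.4f}")
```

The 0.02 threshold used in the cell is a rule of thumb chosen in the notebook; a low spread across folds shows that performance is stable across splits of this dataset, not that it will transfer to messages from a different source.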



In [11]:
# 2. Feature distribution analysis
print("\n=== FEATURE ANALYSIS ===")

rule_cols = ['rule1', 'rule2', 'rule3', 'rule4', 'rule5', 'rule6', 'rule7', 'rule8', 'rule9']

# Count how many messages trigger each rule
feature_distribution = df[rule_cols].sum()
total_samples = len(df)

print("Rule distribution (how many SMS trigger each rule):")
for rule, count in feature_distribution.items():
    percentage = (count / total_samples) * 100
    print(f"{rule}: {count}/{total_samples} ({percentage:.1f}%)")

# Count how many SMS trigger no rule at all
no_rules_triggered = df[df[rule_cols].sum(axis=1) == 0]
print(f"\nSMS that trigger NO rule: {len(no_rules_triggered)}")
print(f"Of these, spam: {len(no_rules_triggered[no_rules_triggered['label'] == 'spam'])}")
print(f"Of these, ham: {len(no_rules_triggered[no_rules_triggered['label'] == 'ham'])}")

# Check the correlation between rules
correlation_matrix = df[rule_cols].corr()
print("\nHighest correlations between rules:")
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > 0.3:  # moderate or high correlation
            print(f"{correlation_matrix.columns[i]} vs {correlation_matrix.columns[j]}: {corr_value:.3f}")

print("\n" + "="*50)


=== FEATURE ANALYSIS ===
Rule distribution (how many SMS trigger each rule):
rule1: 357/5572 (6.4%)
rule2: 777/5572 (13.9%)
rule3: 318/5572 (5.7%)
rule4: 711/5572 (12.8%)
rule5: 1844/5572 (33.1%)
rule6: 146/5572 (2.6%)
rule7: 199/5572 (3.6%)
rule8: 1509/5572 (27.1%)
rule9: 0/5572 (0.0%)

SMS that trigger NO rule: 2516
Of these, spam: 14
Of these, ham: 2502

Highest correlations between rules:
rule2 vs rule4: 0.402
rule3 vs rule4: 0.560
rule3 vs rule8: 0.348
rule4 vs rule5: 0.437
rule4 vs rule8: 0.383



In [12]:
# 3. Detailed analysis of Rule4 (the dominant feature)
print("\n=== DETAILED ANALYSIS OF RULE4 ===")

# Analyze the performance of rule4 on its own
rule4_analysis = df.groupby(['rule4', 'label']).size().unstack(fill_value=0)
print("Rule4 vs Label distribution:")
print(rule4_analysis)

# Compute the accuracy we would get using rule4 alone
rule4_only_accuracy = ((rule4_analysis.loc[0, 'ham'] + rule4_analysis.loc[1, 'spam']) / len(df))
print("\nIf we used ONLY Rule4:")
print(f"Accuracy: {rule4_only_accuracy:.4f} ({rule4_only_accuracy*100:.2f}%)")

# Count how many spam/ham messages trigger rule4
spam_with_rule4 = len(df[(df['label'] == 'spam') & (df['rule4'] == 1)])
total_spam = len(df[df['label'] == 'spam'])
ham_with_rule4 = len(df[(df['label'] == 'ham') & (df['rule4'] == 1)])
total_ham = len(df[df['label'] == 'ham'])

print(f"\nRule4 detects {spam_with_rule4}/{total_spam} spam ({spam_with_rule4/total_spam*100:.1f}%)")
print(f"Rule4 is triggered by {ham_with_rule4}/{total_ham} ham ({ham_with_rule4/total_ham*100:.1f}%)")

# Flag rule4 as highly spam-specific when it fires on very few ham messages
if ham_with_rule4/total_ham < 0.02:  # fewer than 2% of ham trigger rule4
    print("⚠️  Rule4 may be very specific to spam!")

print("\n" + "="*50)


=== DETAILED ANALYSIS OF RULE4 ===
Rule4 vs Label distribution:
label   ham  spam
rule4            
0      4766    95
1        59   652

If we used ONLY Rule4:
Accuracy: 0.9724 (97.24%)

Rule4 detects 652/747 spam (87.3%)
Rule4 is triggered by 59/4825 ham (1.2%)
⚠️  Rule4 may be very specific to spam!
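
Treating rule4 on its own as a one-line classifier ("flag the message as spam whenever rule4 fires") makes the contingency table above easier to interpret. The sketch below derives the usual metrics from the four counts; the variable names are chosen here for illustration.

```python
# Counts copied from the rule4 vs label table above
ham_without_rule4, spam_without_rule4 = 4766, 95
ham_with_rule4, spam_with_rule4 = 59, 652
total = ham_without_rule4 + spam_without_rule4 + ham_with_rule4 + spam_with_rule4

precision = spam_with_rule4 / (spam_with_rule4 + ham_with_rule4)            # flagged messages that are spam
recall = spam_with_rule4 / (spam_with_rule4 + spam_without_rule4)           # spam messages that get flagged
false_positive_rate = ham_with_rule4 / (ham_with_rule4 + ham_without_rule4)  # ham messages wrongly flagged
rule4_accuracy = (ham_without_rule4 + spam_with_rule4) / total

print(f"Precision: {precision:.1%}")
print(f"Recall: {recall:.1%}")
print(f"False positive rate: {false_positive_rate:.1%}")
print(f"Accuracy: {rule4_accuracy:.2%}")
```

Read this way, the full decision tree improves on rule4 alone mainly by recovering part of the spam that rule4 misses.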



In [13]:
# 4. Test without the dominant Rule4
print("\n=== TEST WITHOUT RULE4 ===")

# Train a model without rule4
X_without_rule4 = df[['rule1', 'rule2', 'rule3', 'rule5', 'rule6', 'rule7', 'rule8', 'rule9']]
y = df['binary_label']

# Split the data
X_train_no4, X_test_no4, y_train_no4, y_test_no4 = train_test_split(
    X_without_rule4, y, test_size=0.2, random_state=42
)

# Train the new model
dt_no_rule4 = DecisionTreeClassifier(random_state=42)
dt_no_rule4.fit(X_train_no4, y_train_no4)

# Make predictions
y_pred_no4 = dt_no_rule4.predict(X_test_no4)
accuracy_no4 = accuracy_score(y_test_no4, y_pred_no4)

print(f"Accuracy WITHOUT Rule4: {accuracy_no4:.4f} ({accuracy_no4*100:.2f}%)")
# Note: `accuracy` was reassigned in cell 9 and holds the training accuracy (97.78%),
# not the test accuracy (98.92%), so the drop below is measured against the training score
print(f"Performance drop: {(accuracy - accuracy_no4)*100:.2f} percentage points")

# Feature importance without rule4
feature_imp_no4 = dt_no_rule4.feature_importances_
feature_names_no4 = X_without_rule4.columns
importance_df_no4 = pandas.DataFrame({'Feature': feature_names_no4, 'Importance': feature_imp_no4})
importance_df_no4 = importance_df_no4.sort_values('Importance', ascending=False)

print("\nNew importance distribution (without rule4):")
for _, row in importance_df_no4.iterrows():
    print(f"{row['Feature']}: {row['Importance']:.3f}")

# Cross-validation without rule4
cv_scores_no4 = cross_val_score(dt_no_rule4, X_without_rule4, y, cv=5, scoring='accuracy')
print(f"\nCross-validation without rule4: {cv_scores_no4.mean():.4f} (+/- {cv_scores_no4.std() * 2:.4f})")

print("\n" + "="*50)


=== TEST WITHOUT RULE4 ===
Accuracy WITHOUT Rule4: 0.9354 (93.54%)
Performance drop: 4.24 percentage points

New importance distribution (without rule4):
rule3: 0.497
rule5: 0.225
rule2: 0.168
rule8: 0.035
rule1: 0.029
rule7: 0.029
rule6: 0.017
rule9: 0.000

Cross-validation without rule4: 0.9363 (+/- 0.0073)
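
The drop printed above depends on which baseline it is measured against. The cell uses the variable `accuracy`, which the earlier training-set evaluation reassigned, so the 4.24 points are relative to the training accuracy of 97.78% rather than the test accuracy of 98.92%. The sketch below works out both versions from figures already reported in this notebook; the variable names are introduced here for illustration.

```python
# Baselines recomputed from the confusion matrices reported earlier
train_accuracy = (3832 + 526) / 4457   # training-set evaluation
test_accuracy = (960 + 143) / 1115     # test-set evaluation

# Test accuracy of the model trained without rule4, as printed above
accuracy_without_rule4 = 0.9354

drop_vs_train = train_accuracy - accuracy_without_rule4
drop_vs_test = test_accuracy - accuracy_without_rule4

print(f"Drop vs training accuracy: {drop_vs_train * 100:.2f} percentage points")
print(f"Drop vs test accuracy:     {drop_vs_test * 100:.2f} percentage points")
```

Since the model without rule4 is scored on the test split, the like-for-like comparison is the second one, which puts the cost of removing rule4 at a little over five percentage points.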



In [14]:
# 5. Conclusions about overfitting
print("\n=== CONCLUSIONS ABOUT OVERFITTING ===")

print("✅ EVIDENCE AGAINST OVERFITTING:")
print("• Stable cross-validation (std = 0.32%)")
print("• Model without rule4 still reaches 93.54% accuracy")
print("• 2516 SMS trigger no rule at all: 2502 ham and only 14 spam")

print("\n⚠️  POSSIBLE CONCERNS:")
print("• Rule4 dominates with 90.2% importance")
print("• Rule4 alone already gives 97.24% accuracy")
print("• Rule4 fires on only 1.2% of ham vs 87.3% of spam")
print("• Rule9 (email) is never triggered (0% of cases)")

print("\n🔍 INTERPRETATION:")
print("• There is no classic overfitting (the model generalizes well)")
print("• But there is EXCESSIVE DEPENDENCE on rule4 (phone numbers)")
print("• This may be a generalization problem for other types of spam")
print("• Rule4 may be acting as a 'shortcut' that is very specific to this dataset")

# Note: `accuracy` still holds the training accuracy from cell 9 (97.78%), not the test accuracy (98.92%)
print("\n📊 NUMERICAL SUMMARY:")
print(f"• Original accuracy: {accuracy*100:.2f}%")
print(f"• Accuracy without rule4: {accuracy_no4*100:.2f}%")
print(f"• Contribution of rule4: {(accuracy - accuracy_no4)*100:.2f} percentage points")
print(f"• Rule4 = {(accuracy - accuracy_no4)/accuracy*100:.1f}% of total performance")

print("\n" + "="*60)


=== CONCLUSIONS ABOUT OVERFITTING ===
✅ EVIDENCE AGAINST OVERFITTING:
• Stable cross-validation (std = 0.32%)
• Model without rule4 still reaches 93.54% accuracy
• 2516 SMS trigger no rule at all: 2502 ham and only 14 spam

⚠️  POSSIBLE CONCERNS:
• Rule4 dominates with 90.2% importance
• Rule4 alone already gives 97.24% accuracy
• Rule4 fires on only 1.2% of ham vs 87.3% of spam
• Rule9 (email) is never triggered (0% of cases)

🔍 INTERPRETATION:
• There is no classic overfitting (the model generalizes well)
• But there is EXCESSIVE DEPENDENCE on rule4 (phone numbers)
• This may be a generalization problem for other types of spam
• Rule4 may be acting as a 'shortcut' that is very specific to this dataset

📊 NUMERICAL SUMMARY:
• Original accuracy: 97.78%
• Accuracy without rule4: 93.54%
• Contribution of rule4: 4.24 percentage points
• Rule4 = 4.3% of total performance

