## Smishing Detection Model

Ankit Kumar Jain, B.B. Gupta,
Rule-Based Framework for Detection of Smishing Messages in Mobile Environment,
Procedia Computer Science,
Volume 125,
2018,
Pages 617-623,
ISSN 1877-0509,
https://doi.org/10.1016/j.procs.2017.12.079.
(https://www.sciencedirect.com/science/article/pii/S1877050917328478)
Abstract: Smishing is a cyber-security attack, which utilizes Short Message Service (SMS) to steal personal credentials of mobile users. The trust level of users on their smart devices has attracted attackers for performing various mobile security attacks like Smishing. In this paper, we implement the rule-based data mining classification approach in the detection of smishing messages. The proposed approach identified nine rules which can efficiently filter smishing SMS from the genuine one. Further, our approach applies rule-based classification algorithms to train these outstanding rules. Since the SMS text messages are very short and generally written in Lingo language, we have used text normalization to convert them into standard form to obtain better rules. The performance of the proposed approach is evaluated, and it achieved more than 99% true negative rate. Furthermore, the proposed approach is very efficient for the detection of the zero hour attack too.
Keywords: Smishing; Mobile Phishing; Data mining; Short messaging service; Machine learning

> Reproduction of results

In [1]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import ruleset

#### 1. Data preprocessing

In [2]:
# Import the CSV dataset as a dataframe
# Since pandas is already imported in cell 1, we can use it directly
df = pd.read_csv('SMSSpamCollectionDataset.csv', encoding='latin-1')
df = df[['label', 'text']]

# Display the first few rows to get a glimpse of the data

df

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ì_ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [3]:
# Download the required NLTK resources
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)

def normalize_text(original_text) -> str:
    '''
    Receives an SMS.

    Returns the normalized text (kept conservative for better performance).
    '''

    # Convert to lowercase
    text = original_text.lower()

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()

    # Tokenize (fall back to a simpler method if the punkt data is unavailable)
    try:
        words = nltk.word_tokenize(text)
    except LookupError:
        # Fallback: simple whitespace tokenization
        words = text.split()

    # Remove only the most common stopwords (more conservative)
    # IMPORTANT: preserve financial and mathematical symbols even though they are short
    common_stopwords = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'with', 'by', 'is', 'are', 'was', 'were', 'be', 'been', 'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could', 'should'}

    # Financial and mathematical symbols that matter for spam/smishing detection
    important_symbols = {
        # Financial symbols
        '$', '£', '€', '¥', '₹', '¢',
        # Mathematical symbols
        '+', '-', '*', '/', '=', '<', '>', '≤', '≥', '≠', '±', '×', '÷',
        # Other important symbols
        '%', '#', '@', '&', '!', '?'
    }

    # Keep a word if: it is not a stopword AND (it has more than 1 char OR it is an important symbol)
    words = [word for word in words if word not in common_stopwords and (len(word) > 1 or word in important_symbols)]

    # Do NOT apply aggressive stemming - keep words intact
    # Preserve important symbols and relevant words; this second pass also drops
    # every remaining token of two characters or fewer (e.g. "fa")
    words = [word for word in words if len(word) > 2 or word in important_symbols]

    # Join words back into a string
    normalized_text = ' '.join(words)

    return normalized_text

# Test with an example
sample_text = "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005!"
print(f"Original: {sample_text}")
print(f"Normalized: {normalize_text(sample_text)}")

Original: Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005!
Normalized: free entry wkly comp win cup final tkts 21st may 2005 !


(Debugging why the dollar sign was disappearing, and checking that normalization still works...)
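The two list comprehensions in `normalize_text` reduce to a single condition: a token survives if it is an important symbol, or if it is a non-stopword longer than two characters. The sketch below isolates that condition on a hand-written token list (the tokens and the abbreviated sets are illustrative assumptions, not data from the notebook). It shows that a standalone `$` survives only because it is in the symbol set.

```python
# Abbreviated, illustrative versions of the sets used in normalize_text.
COMMON_STOPWORDS = {"the", "a", "an", "in", "to", "of"}
IMPORTANT_SYMBOLS = {"$", "£", "%", "!"}


def keep_token(token: str) -> bool:
    """Keep important symbols, plus non-stopwords longer than two characters."""
    if token in IMPORTANT_SYMBOLS:
        return True
    return token not in COMMON_STOPWORDS and len(token) > 2


# Hypothetical tokens; in the notebook they come from nltk.word_tokenize.
tokens = ["win", "$", "1000", "in", "2", "fa", "cup", "!"]
kept = [t for t in tokens if keep_token(t)]
print(kept)  # ['win', '$', '1000', 'cup', '!']
```

Removing `"$"` from `IMPORTANT_SYMBOLS` in this sketch makes the length check drop it, which is one way a currency symbol can vanish during normalization.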

### Balancing the dataset

In [4]:
df_ham = df[df["label"] == "ham"]
df_spam = df[df["label"] == "spam"]

min_len = min(len(df_ham), len(df_spam))


df_ham_sample = df_ham.sample(n=min_len, random_state=42)
df_spam_sample = df_spam.sample(n=min_len, random_state=42)

big_df = df
df = pd.concat([df_ham_sample,df_spam_sample])
df


Unnamed: 0,label,text
3714,ham,"I am late,so call you tomorrow morning.take ca..."
1311,ham,U r too much close to my heart. If u go away i...
548,ham,Wait &lt;#&gt; min..
1324,ham,Can you call me plz. Your number shows out of ...
3184,ham,MAYBE IF YOU WOKE UP BEFORE FUCKING 3 THIS WOU...
...,...,...
504,spam,+123 Congratulations - in this week's competit...
737,spam,Hi. Customer Loyalty Offer:The NEW Nokia6650 M...
1928,spam,Call from 08702490080 - tells u 2 call 0906635...
3228,spam,Ur cash-balance is currently 500 pounds - to m...
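The balancing step above is random undersampling: every class is cut down to the size of the smallest one. A minimal standard-library sketch of the same idea follows; the `undersample` helper and the toy rows are hypothetical and only illustrate the technique.

```python
import random


def undersample(rows, label_key="label", seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    return balanced


# Toy data: 8 ham messages and 3 spam messages.
rows = [{"label": "ham", "text": f"ham {i}"} for i in range(8)]
rows += [{"label": "spam", "text": f"spam {i}"} for i in range(3)]

balanced = undersample(rows)
counts = {label: sum(r["label"] == label for r in balanced) for label in ("ham", "spam")}
print(counts)  # {'ham': 3, 'spam': 3}
```

The cost of this approach is that most of the majority class is discarded, so the balanced frame is much smaller than the original.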


In [5]:
# Apply the normalize_text function to the text column
df['normalized_text'] = df['text'].apply(normalize_text)

# Display the first few rows to see the normalized text
print(df[['text', 'normalized_text']].head())

                                                   text  \
3714  I am late,so call you tomorrow morning.take ca...   
1311  U r too much close to my heart. If u go away i...   
548                              Wait  &lt;#&gt;  min..   
1324  Can you call me plz. Your number shows out of ...   
3184  MAYBE IF YOU WOKE UP BEFORE FUCKING 3 THIS WOU...   

                                        normalized_text  
3714  late call you tomorrow morning.take care sweet...  
1311       too much close heart away shattered plz stay  
548                                      wait & # & min  
1324  can you call plz your number shows out coverag...  
3184     maybe you woke before fucking this n't problem  


### 2. Feature extraction



In [6]:
# Apply each rule function from the ruleset module to create new columns
df['rule1'] = df['normalized_text'].apply(ruleset.rule1)
df['rule2'] = df['normalized_text'].apply(ruleset.rule2)
df['rule3'] = df['normalized_text'].apply(ruleset.rule3)
df['rule4'] = df['normalized_text'].apply(ruleset.rule4)
df['rule5'] = df['normalized_text'].apply(ruleset.rule5)
df['rule6'] = df['normalized_text'].apply(ruleset.rule6)
df['rule7'] = df['normalized_text'].apply(ruleset.rule7)
df['rule8'] = df['normalized_text'].apply(ruleset.rule8)
df['rule9'] = df['normalized_text'].apply(ruleset.rule9)

# Display the dataframe with all rule columns
print("Shape after adding rule columns:", df.shape)
df.head()

Shape after adding rule columns: (1494, 12)


Unnamed: 0,label,text,normalized_text,rule1,rule2,rule3,rule4,rule5,rule6,rule7,rule8,rule9
3714,ham,"I am late,so call you tomorrow morning.take ca...",late call you tomorrow morning.take care sweet...,1,0,0,0,1,0,0,1,0
1311,ham,U r too much close to my heart. If u go away i...,too much close heart away shattered plz stay,0,0,0,0,0,0,0,0,0
548,ham,Wait &lt;#&gt; min..,wait & # & min,0,0,0,0,0,0,0,0,0
1324,ham,Can you call me plz. Your number shows out of ...,can you call plz your number shows out coverag...,0,0,0,0,1,0,0,0,0
3184,ham,MAYBE IF YOU WOKE UP BEFORE FUCKING 3 THIS WOU...,maybe you woke before fucking this n't problem,0,0,0,0,1,0,0,0,0
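The `ruleset` module is not shown in this notebook, so the exact rules are unknown here. As a rough sketch of the pattern, each rule can be a function that maps a message to 0 or 1, and feature extraction applies all of them. The two regular expressions and the names `rule_phone` and `rule_url` below are assumptions made for illustration, not the notebook's actual rules.

```python
import re

# Hypothetical patterns; the real ruleset module may use different ones.
PHONE_RE = re.compile(r"\b\d{5,}\b")          # a run of five or more digits
URL_RE = re.compile(r"(https?://|www\.)\S+")  # a very loose URL pattern


def rule_phone(text: str) -> int:
    """Return 1 if the text contains something that looks like a phone number."""
    return int(bool(PHONE_RE.search(text)))


def rule_url(text: str) -> int:
    """Return 1 if the text contains something that looks like a URL."""
    return int(bool(URL_RE.search(text)))


RULES = {"rule_phone": rule_phone, "rule_url": rule_url}


def extract_features(text: str) -> dict:
    """Apply every rule to one message and collect the binary features."""
    return {name: rule(text) for name, rule in RULES.items()}


print(extract_features("call 09066350750 now"))
print(extract_features("see you at www.example.com"))
```

Keeping the rules in a dictionary makes it possible to build all feature columns in one loop instead of one assignment per rule.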


In [7]:
import os
# Remove the problematic environment variable
if 'MPLBACKEND' in os.environ:
    del os.environ['MPLBACKEND']

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Convert labels to binary values (ham=0, spam=1)
df['binary_label'] = df['label'].map({'ham': 0, 'spam': 1})

# Extract features (all rule columns) and target variable
X = df[['rule1', 'rule2', 'rule3', 'rule4', 'rule5', 'rule6', 'rule7', 'rule8', 'rule9']]
y = df['binary_label']

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train a Decision Tree Classifier
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

# Make predictions on test data
y_pred = dt_classifier.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the results
print(f"Decision Tree Classifier Results:")
print(f"Accuracy: {accuracy * 100:.2f}%")
print("\nClassification Report:")
print(class_report)
print("\nConfusion Matrix:")
print(conf_matrix)

# Calculate feature importances
feature_importances = dt_classifier.feature_importances_
feature_names = X.columns
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
importance_df = importance_df.sort_values('Importance', ascending=False)

print("\nFeature Importance:")
print(importance_df)

# Print true negative rate as mentioned in the paper
tn, fp, fn, tp = conf_matrix.ravel()
tnr = tn / (tn + fp)
print(f"\nTrue Negative Rate: {tnr * 100:.2f}%")
print(f"True Positive Rate (Sensitivity/Recall): {tp / (tp + fn) * 100:.2f}%")
print(f"False Positive Rate: {fp / (fp + tn) * 100:.2f}%")
print(f"False Negative Rate: {fn / (fn + tp) * 100:.2f}%")

Decision Tree Classifier Results:
Accuracy: 93.98%

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.97      0.94       145
           1       0.97      0.91      0.94       154

    accuracy                           0.94       299
   macro avg       0.94      0.94      0.94       299
weighted avg       0.94      0.94      0.94       299


Confusion Matrix:
[[141   4]
 [ 14 140]]

Feature Importance:
  Feature  Importance
3   rule4    0.805397
1   rule2    0.067014
4   rule5    0.044107
6   rule7    0.035224
2   rule3    0.017686
7   rule8    0.014607
0   rule1    0.011468
5   rule6    0.004496
8   rule9    0.000000

True Negative Rate: 97.24%
True Positive Rate (Sensitivity/Recall): 90.91%
False Positive Rate: 2.76%
False Negative Rate: 9.09%
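The rates above follow directly from the four cells of the confusion matrix. As a worked check, the sketch below recomputes them in plain Python from the printed matrix (rows are true classes, columns are predicted classes, with ham = 0 and spam = 1).

```python
# Confusion matrix printed above: [[TN, FP], [FN, TP]]
conf_matrix = [[141, 4], [14, 140]]
(tn, fp), (fn, tp) = conf_matrix

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
tnr = tn / (tn + fp)            # true negative rate (specificity)
tpr = tp / (tp + fn)            # true positive rate (recall)
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"Accuracy:  {accuracy * 100:.2f}%")   # 93.98%
print(f"TNR:       {tnr * 100:.2f}%")        # 97.24%
print(f"TPR:       {tpr * 100:.2f}%")        # 90.91%
print(f"Precision: {precision * 100:.2f}%")  # 97.22%
print(f"F1:        {f1 * 100:.2f}%")         # 93.96%
```

The 97.24% true negative rate is the figure to compare with the more than 99% reported in the paper's abstract.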


In [8]:
## Overfitting analysis

# Use cross-validation to check for overfitting
from sklearn.model_selection import cross_val_score, validation_curve, learning_curve
import numpy as np

# 1. 5-fold cross-validation
print("=== OVERFITTING ANALYSIS ===\n")

# Cross-validation
cv_scores = cross_val_score(dt_classifier, X, y, cv=5, scoring='accuracy')
print(f"Cross-validation (5-fold):")
print(f"Scores: {cv_scores}")
print(f"Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
print(f"Standard deviation: {cv_scores.std():.4f}")

# A very high standard deviation across folds may indicate overfitting
if cv_scores.std() > 0.02:
    print("⚠️  WARNING: High standard deviation may indicate overfitting!")
else:
    print("✅ Low standard deviation - model appears stable")

print("\n" + "="*50)

=== OVERFITTING ANALYSIS ===

Cross-validation (5-fold):
Scores: [0.93645485 0.92976589 0.93311037 0.93979933 0.93624161]
Mean: 0.9351 (+/- 0.0068)
Standard deviation: 0.0034
✅ Low standard deviation - model appears stable
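The summary figures can be reproduced by hand from the five fold scores. NumPy's `std()` defaults to the population standard deviation, so `statistics.pstdev` is the matching standard-library function. Note that a small spread across folds measures stability; comparing training accuracy with validation accuracy would be a more direct check for overfitting.

```python
import statistics

# Fold scores copied from the cross-validation output above.
scores = [0.93645485, 0.92976589, 0.93311037, 0.93979933, 0.93624161]

mean = statistics.fmean(scores)
pop_std = statistics.pstdev(scores)    # matches numpy's default std (ddof=0)
sample_std = statistics.stdev(scores)  # ddof=1, slightly larger

print(f"Mean: {mean:.4f} (+/- {2 * pop_std:.4f})")   # Mean: 0.9351 (+/- 0.0068)
print(f"Population std: {pop_std:.4f}")              # 0.0034
print(f"Sample std:     {sample_std:.4f}")           # 0.0038
```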



In [9]:
# 2. Feature distribution analysis
print("\n=== FEATURE ANALYSIS ===")

# Check the distribution of the rules
feature_distribution = df[['rule1', 'rule2', 'rule3', 'rule4', 'rule5', 'rule6', 'rule7', 'rule8', 'rule9']].sum()
total_samples = len(df)

print("Rule distribution (how many SMS trigger each rule):")
for rule, count in feature_distribution.items():
    percentage = (count / total_samples) * 100
    print(f"{rule}: {count}/{total_samples} ({percentage:.1f}%)")

# Check how many SMS did not trigger any rule
no_rules_triggered = df[(df[['rule1', 'rule2', 'rule3', 'rule4', 'rule5', 'rule6', 'rule7', 'rule8', 'rule9']].sum(axis=1) == 0)]
print(f"\nSMS that trigger NO rule: {len(no_rules_triggered)}")
print(f"Of these, spam: {len(no_rules_triggered[no_rules_triggered['label'] == 'spam'])}")
print(f"Of these, ham: {len(no_rules_triggered[no_rules_triggered['label'] == 'ham'])}")

# Check the correlation between rules
correlation_matrix = df[['rule1', 'rule2', 'rule3', 'rule4', 'rule5', 'rule6', 'rule7', 'rule8', 'rule9']].corr()
print(f"\nHighest correlations between rules:")
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > 0.3:  # Moderate or high correlation
            print(f"{correlation_matrix.columns[i]} vs {correlation_matrix.columns[j]}: {corr_value:.3f}")

print("\n" + "="*50)


=== FEATURE ANALYSIS ===
Rule distribution (how many SMS trigger each rule):
rule1: 184/1494 (12.3%)
rule2: 442/1494 (29.6%)
rule3: 291/1494 (19.5%)
rule4: 595/1494 (39.8%)
rule5: 849/1494 (56.8%)
rule6: 40/1494 (2.7%)
rule7: 106/1494 (7.1%)
rule8: 640/1494 (42.8%)
rule9: 0/1494 (0.0%)

SMS that trigger NO rule: 425
Of these, spam: 14
Of these, ham: 411

Highest correlations between rules:
rule2 vs rule4: 0.428
rule2 vs rule5: 0.364
rule2 vs rule8: 0.334
rule3 vs rule4: 0.494
rule3 vs rule5: 0.367
rule3 vs rule8: 0.507
rule4 vs rule5: 0.607
rule4 vs rule8: 0.506
rule5 vs rule8: 0.334
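One possible reason a rule never fires is that normalization changes the text before the rule sees it. The normalized output earlier (`wait & # & min`) suggests the tokenizer separates characters such as `&` and `#` from their neighbours, and NLTK's Treebank-style tokenizer appears to treat `@` the same way. The sketch below mimics that padding with a plain regex and tests a hypothetical email pattern; both `EMAIL_RE` and `pad_symbols` are assumptions, since the actual rule is not shown.

```python
import re

# Hypothetical email pattern; the real rule may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def pad_symbols(text: str) -> str:
    """Mimic a tokenizer step that surrounds ; @ # $ % & with spaces."""
    padded = re.sub(r"([;@#$%&])", r" \1 ", text)
    return re.sub(r"\s+", " ", padded).strip()


raw = "send your details to winner@example.com"
padded = pad_symbols(raw)

print(padded)                         # send your details to winner @ example.com
print(bool(EMAIL_RE.search(raw)))     # True
print(bool(EMAIL_RE.search(padded)))  # False
```

If this is what happens, applying the email rule to the raw `text` column instead of `normalized_text` would be a quick way to check.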



In [10]:
# 3. Detailed analysis of Rule4 (the dominant rule)
print("\n=== DETAILED ANALYSIS OF RULE4 ===")

# Analyze the performance of rule4 on its own
rule4_analysis = df.groupby(['rule4', 'label']).size().unstack(fill_value=0)
print("Rule4 vs Label distribution:")
print(rule4_analysis)

# Compute the accuracy we would get using only rule4
rule4_only_accuracy = ((rule4_analysis.loc[0, 'ham'] + rule4_analysis.loc[1, 'spam']) / len(df))
print(f"\nIf we used ONLY Rule4:")
print(f"Accuracy: {rule4_only_accuracy:.4f} ({rule4_only_accuracy*100:.2f}%)")

# Check how many spam/ham messages triggered rule4
spam_with_rule4 = len(df[(df['label'] == 'spam') & (df['rule4'] == 1)])
total_spam = len(df[df['label'] == 'spam'])
ham_with_rule4 = len(df[(df['label'] == 'ham') & (df['rule4'] == 1)])
total_ham = len(df[df['label'] == 'ham'])

print(f"\nRule4 detects {spam_with_rule4}/{total_spam} spam ({spam_with_rule4/total_spam*100:.1f}%)")
print(f"Rule4 is triggered by {ham_with_rule4}/{total_ham} ham ({ham_with_rule4/total_ham*100:.1f}%)")

# This shows whether rule4 is very specific to spam
if ham_with_rule4/total_ham < 0.02:  # Fewer than 2% of ham messages trigger rule4
    print("⚠️  Rule4 may be too specific to spam!")

print("\n" + "="*50)


=== DETAILED ANALYSIS OF RULE4 ===
Rule4 vs Label distribution:
label  ham  spam
rule4           
0      747   152
1        0   595

If we used ONLY Rule4:
Accuracy: 0.8983 (89.83%)

Rule4 detects 595/747 spam (79.7%)
Rule4 is triggered by 0/747 ham (0.0%)
⚠️  Rule4 may be too specific to spam!
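The contingency table also explains why rule4 dominates the tree. scikit-learn's decision tree uses Gini impurity by default, and a split on rule4 produces one perfectly pure branch. The sketch below computes the impurity decrease of that single split from the counts in the table. It uses the full balanced dataset rather than the training split, so it illustrates the mechanism and is not expected to reproduce the reported importance value.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


# (ham, spam) counts taken from the Rule4 vs Label table.
parent = (747, 747)
left = (747, 152)   # rule4 == 0
right = (0, 595)    # rule4 == 1

n = sum(parent)
weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
decrease = gini(parent) - weighted_children

print(f"Parent impurity:   {gini(parent):.4f}")      # 0.5000
print(f"rule4 == 0 branch: {gini(left):.4f}")        # 0.2810
print(f"rule4 == 1 branch: {gini(right):.4f}")       # 0.0000
print(f"Impurity decrease: {decrease:.4f}")          # 0.3309
```

A single split removes roughly two thirds of the initial impurity of 0.5, which leaves little for the remaining rules to contribute.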



In [11]:
# 4. Test without the dominant Rule4
print("\n=== TEST WITHOUT RULE4 ===")

# Train a model without rule4
X_without_rule4 = df[['rule1', 'rule2', 'rule3', 'rule5', 'rule6', 'rule7', 'rule8', 'rule9']]
y = df['binary_label']

# Split the data
X_train_no4, X_test_no4, y_train_no4, y_test_no4 = train_test_split(
    X_without_rule4, y, test_size=0.2, random_state=42
)

# Train the new model
dt_no_rule4 = DecisionTreeClassifier(random_state=42)
dt_no_rule4.fit(X_train_no4, y_train_no4)

# Make predictions
y_pred_no4 = dt_no_rule4.predict(X_test_no4)
accuracy_no4 = accuracy_score(y_test_no4, y_pred_no4)

print(f"Accuracy WITHOUT Rule4: {accuracy_no4:.4f} ({accuracy_no4*100:.2f}%)")
print(f"Performance drop: {(accuracy - accuracy_no4)*100:.2f} percentage points")

# Feature importance without rule4
feature_imp_no4 = dt_no_rule4.feature_importances_
feature_names_no4 = X_without_rule4.columns
importance_df_no4 = pd.DataFrame({'Feature': feature_names_no4, 'Importance': feature_imp_no4})
importance_df_no4 = importance_df_no4.sort_values('Importance', ascending=False)

print(f"\nNew importance distribution (without rule4):")
for idx, row in importance_df_no4.iterrows():
    print(f"{row['Feature']}: {row['Importance']:.3f}")

# Cross-validation without rule4
cv_scores_no4 = cross_val_score(dt_no_rule4, X_without_rule4, y, cv=5, scoring='accuracy')
print(f"\nCross-validation without rule4: {cv_scores_no4.mean():.4f} (+/- {cv_scores_no4.std() * 2:.4f})")

print("\n" + "="*50)


=== TEST WITHOUT RULE4 ===
Accuracy WITHOUT Rule4: 0.8562 (85.62%)
Performance drop: 8.36 percentage points

New importance distribution (without rule4):
rule5: 0.628
rule8: 0.136
rule2: 0.097
rule7: 0.049
rule3: 0.048
rule1: 0.030
rule6: 0.011
rule9: 0.000

Cross-validation without rule4: 0.8768 (+/- 0.0152)



In [12]:
# 5. Conclusions on overfitting - UPDATED FINAL ANALYSIS
print("\n=== CONCLUSIONS ON OVERFITTING (BALANCED DATASET) ===")

# First, collect the relevant values from the previous analyses
cv_std = cv_scores.std()
no_rules_count = len(no_rules_triggered)
no_rules_spam = len(no_rules_triggered[no_rules_triggered['label'] == 'spam'])
no_rules_ham = len(no_rules_triggered[no_rules_triggered['label'] == 'ham'])

# Get the importance of rule4 in the current model
rule4_importance = importance_df[importance_df['Feature'] == 'rule4']['Importance'].iloc[0]

print("✅ EVIDENCE AGAINST OVERFITTING:")
print(f"• Stable cross-validation (std = {cv_std:.3f} = {cv_std*100:.1f}%)")
print(f"• Model without rule4 still reaches {accuracy_no4*100:.2f}% accuracy")
print(f"• {no_rules_count} SMS trigger no rule ({no_rules_ham} ham, {no_rules_spam} spam)")
print(f"• Balanced dataset: {total_spam} spam vs {total_ham} ham")

print("\n⚠️  POSSIBLE CONCERNS:")
print(f"• Rule4 dominates with {rule4_importance*100:.1f}% importance")
print(f"• Rule4 alone already gives {rule4_only_accuracy*100:.2f}% accuracy")
print(f"• Rule4 detects {spam_with_rule4/total_spam*100:.1f}% of spam but only {ham_with_rule4/total_ham*100:.1f}% of ham")

# Check rule9
rule9_count = feature_distribution['rule9']
print(f"• Rule9 (email) triggers only {rule9_count} cases ({rule9_count/total_samples*100:.1f}%)")

print("\n🔍 INTERPRETATION:")
print("• NO classic overfitting (the model generalizes well)")
print("• Balanced dataset reduces bias, but rule4 still dominates")
print("• EXCESSIVE DEPENDENCE on rule4 (phone numbers)")
print("• The model found a 'shortcut' that is very specific to this dataset")
print("• Risk of false negatives on spam without phone numbers")

print(f"\n📊 UPDATED NUMERICAL SUMMARY:")
print(f"• Accuracy with all features: {accuracy*100:.2f}%")
print(f"• Accuracy without rule4: {accuracy_no4*100:.2f}%")
print(f"• Performance loss without rule4: {(accuracy - accuracy_no4)*100:.2f} percentage points")
print(f"• Dependence on rule4: {(accuracy - accuracy_no4)/accuracy*100:.1f}% of total performance")

print(f"\n🎯 BALANCE METRICS:")
print(f"• True Negative Rate: {tnr*100:.2f}%")
print(f"• True Positive Rate: {tp/(tp+fn)*100:.2f}%")
print(f"• Precision: {tp/(tp+fp)*100:.2f}%")
print(f"• F1-Score: {2*tp/(2*tp+fp+fn)*100:.2f}%")

print(f"\n💡 RECOMMENDATIONS:")
print("1. Investigate rule4 - it may be too permissive with numbers")
print("2. Improve the least-used rules (rule6, rule7, rule9)")
print("3. Test with a dataset of spam without phone numbers")
print("4. Consider an ensemble of models to reduce the dependence")
print("5. Apply regularization to balance feature importance")

print("\n" + "="*70)


=== CONCLUSIONS ON OVERFITTING (BALANCED DATASET) ===
✅ EVIDENCE AGAINST OVERFITTING:
• Stable cross-validation (std = 0.003 = 0.3%)
• Model without rule4 still reaches 85.62% accuracy
• 425 SMS trigger no rule (411 ham, 14 spam)
• Balanced dataset: 747 spam vs 747 ham

⚠️  POSSIBLE CONCERNS:
• Rule4 dominates with 80.5% importance
• Rule4 alone already gives 89.83% accuracy
• Rule4 detects 79.7% of spam but only 0.0% of ham
• Rule9 (email) triggers only 0 cases (0.0%)

🔍 INTERPRETATION:
• NO classic overfitting (the model generalizes well)
• Balanced dataset reduces bias, but rule4 still dominates
• EXCESSIVE DEPENDENCE on rule4 (phone numbers)
• The model found a 'shortcut' that is very specific to this dataset
• Risk of false negatives on spam without phone numbers

📊 UPDATED NUMERICAL SUMMARY:
• Accuracy with all features: 93.98%
• Accuracy without rule4: 85.62%
• Perda de performa