### Feature Engineering Best Practices: Handling Text Data
**Question**: Load a dataset with text data (e.g., SMS Spam Collection), perform text
preprocessing, and extract numerical features using TF-IDF.
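
The preprocessing step can go further than lowercasing and stripping punctuation. Removing punctuation outright fuses a URL such as `http://bit.ly/12345` into the single token `httpbitly12345`. The sketch below is one possible variant, not the method this notebook uses: it swaps URLs and numbers for placeholder tokens before touching punctuation. The function name `preprocess_with_placeholders` and the tokens `urltoken` and `numtoken` are hypothetical names chosen for illustration.

```python
import re

# Hypothetical placeholder tokens; the names are illustrative only.
URL_TOKEN = "urltoken"
NUM_TOKEN = "numtoken"

def preprocess_with_placeholders(text):
    """Lowercase, replace URLs and numbers with placeholders, strip punctuation."""
    text = text.lower()
    # Replace URLs first, while the "://" and "." that identify them still exist.
    text = re.sub(r"(?:https?://|www\.)\S+", f" {URL_TOKEN} ", text)
    # Collapse each number (digits with optional , or . separators) into one token.
    text = re.sub(r"\d+(?:[.,]\d+)*", f" {NUM_TOKEN} ", text)
    # Drop apostrophes so "you've" becomes "youve" rather than "you ve".
    text = re.sub(r"['’]", "", text)
    # Turn the remaining punctuation into spaces so neighbouring words do not fuse.
    text = re.sub(r"[^\w\s]", " ", text)
    # Normalise runs of whitespace to single spaces.
    return " ".join(text.split())

sample = ("Congratulations! You've been selected for a $1000 Walmart gift card. "
          "Go to http://bit.ly/12345")
print(preprocess_with_placeholders(sample))
```

Whether placeholders help depends on the task. For spam detection, the presence of a link or an amount of money is often more informative than its exact value, which is the assumption behind this variant.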

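To show what the TF-IDF weights mean, the following sketch computes them by hand using only the standard library. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default: raw term counts, `idf = ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The helper names `tokenize` and `tfidf` and the three toy documents are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring TfidfVectorizer's default token pattern.
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        # Guard against division by zero for documents with no tokens.
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

# Toy documents, invented for this example.
docs = ["free ticket free", "call me", "free call"]
vocab, rows = tfidf(docs)
print(vocab)
for row in rows:
    print([round(w, 3) for w in row])
```

Terms that occur in fewer documents receive a larger IDF, so within "call me" the rarer word "me" gets more weight than "call". Every non-empty row has unit length.
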
In [1]:
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Step 1: Build a small inline sample in the style of the SMS Spam Collection
# (a stand-in for loading the real dataset from disk)
data = {
    "label": ["ham", "spam", "ham", "spam", "ham"],
    "message": [
        "Hey, are we still meeting today?",
        "WINNER!! You have won a free ticket. Reply YES to claim.",
        "Please call me when you get this message.",
        "Congratulations! You've been selected for a $1000 Walmart gift card. Go to http://bit.ly/12345",
        "Can you send me the report by tomorrow?"
    ]
}

df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

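# Step 2: Text preprocessing function
def preprocess_text(text):
    """Lowercase the text and strip punctuation."""
    text = text.lower()
    # Drop every character that is neither a word character nor whitespace.
    # Note: this fuses tokens, e.g. "you've" -> "youve" and
    # "http://bit.ly/12345" -> "httpbitly12345".
    text = re.sub(r'[^\w\s]', '', text)
    return text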

df['clean_message'] = df['message'].apply(preprocess_text)

print("\nAfter preprocessing:")
print(df[['message', 'clean_message']])

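# Step 3: Extract TF-IDF features
# With default settings, TfidfVectorizer lowercases the input and keeps only
# tokens of two or more word characters, so single-letter words such as "a"
# do not appear in the vocabulary.
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(df['clean_message'])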

print("\nTF-IDF feature names:")
print(vectorizer.get_feature_names_out())

print("\nTF-IDF feature matrix (as dense array):")
print(tfidf_matrix.toarray())


Original DataFrame:
  label                                            message
0   ham                   Hey, are we still meeting today?
1  spam  WINNER!! You have won a free ticket. Reply YES...
2   ham          Please call me when you get this message.
3  spam  Congratulations! You've been selected for a $1...
4   ham            Can you send me the report by tomorrow?

After preprocessing:
                                             message  \
0                   Hey, are we still meeting today?   
1  WINNER!! You have won a free ticket. Reply YES...   
2          Please call me when you get this message.   
3  Congratulations! You've been selected for a $1...   
4            Can you send me the report by tomorrow?   

                                       clean_message  
0                     hey are we still meeting today  
1  winner you have won a free ticket reply yes to...  
2           please call me when you get this message  
3  congratulations youve been selected for a 10