---

Student Information:
Name: 邱子恩

Student ID: 113065526

GitHub ID: LiLChiu5388

### Instructions

1. First: do the **take home** exercises in the [DM2024-Lab1-Master](https://github.com/didiersalazar/DM2024-Lab1-Master.git). You may need to copy some cells from the Lab notebook to this notebook. __This part is worth 20% of your grade.__


2. Second: follow the same process from the [DM2024-Lab1-Master](https://github.com/didiersalazar/DM2024-Lab1-Master.git) on **the new dataset**. You don't need to explain all details as we did (some **minimal comments** explaining your code are useful though).  __This part is worth 30% of your grade.__
    - Download [the new dataset](https://huggingface.co/datasets/Senem/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data). The dataset contains `sentiment` and `comment` columns, with the sentiment labels being 'nostalgia' and 'not nostalgia'. Read the specifications of the dataset for background details.
    - You are allowed to use and modify the `helper` functions in the folder of the first lab session (notice they may need modification) or create your own.


3. Third: please attempt the following tasks on **the new dataset**. __This part is worth 30% of your grade.__
    - Generate meaningful **new data visualizations**. Refer to online resources and the Data Mining textbook for inspiration and ideas. 
    - Generate **TF-IDF features** from the tokens of each text. This will generate a document-term matrix; however, the weights will be computed differently (using the TF-IDF value of each word per document as opposed to the word frequency). Refer to this Scikit-learn [guide](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html).
    - Implement a simple **Naive Bayes classifier** that automatically classifies the records into their categories. Use both the TF-IDF features and word frequency features to build two separate classifiers. Note that for the TF-IDF features you might need to use a different type of NB classifier from the one in the Master Notebook. Comment on the differences. Refer to this [article](https://hub.packtpub.com/implementing-3-naive-bayes-classifiers-in-scikit-learn/).


4. Fourth: In the lab, we applied each step really quickly just to illustrate how to work with your dataset. There are some things that are not ideal or not the most efficient/meaningful. Each dataset can be handled differently as well. What are the inefficient parts you noticed? How can you improve the data preprocessing for these specific datasets? __This part is worth 10% of your grade.__


5. Fifth: It's hard for us to follow if your code is messy, so please **tidy up your notebook** and **add minimal comments where needed**. __This part is worth 10% of your grade.__


You can submit your homework following these guidelines: [Git Intro & How to hand your homework](https://github.com/didiersalazar/DM2024-Lab1-Master/blob/main/Git%20Intro%20%26%20How%20to%20hand%20your%20homework.ipynb). Make sure to commit and save your changes to your repository __BEFORE the deadline (October 27th 11:59 pm, Sunday)__. 

In [2]:
### Begin Assignment Here

In [16]:
import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import plotly as py
import math
import PAMI
import umap
import matplotlib.pyplot as plt  # plt is used by the plotting cells below
%matplotlib inline

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def prepare_dataset(data_path):
    
    df = pd.read_csv(data_path)
    
    # 1. Drop duplicates
    df = df.drop_duplicates()
    
    # 2. Handle missing values
    df = df.dropna(subset=['sentiment', 'comment'])
    
    # 3. Text preprocessing function
    # Build the stopword set and the lemmatizer once instead of once per comment
    stop_words = set(nltk.corpus.stopwords.words('english'))
    lemmatizer = nltk.WordNetLemmatizer()

    def preprocess_text(text):
        # Convert to lowercase
        text = str(text).lower()

        # Tokenization
        tokens = nltk.word_tokenize(text)

        # Remove stopwords
        tokens = [token for token in tokens if token not in stop_words]

        # Lemmatization
        tokens = [lemmatizer.lemmatize(token) for token in tokens]

        return ' '.join(tokens)

    df['processed_comment'] = df['comment'].apply(preprocess_text)
    
    # 4. Convert sentiment to numerical values for nostalgia classification
    sentiment_map = {'nostalgia': 1, 'not nostalgia': 0}  
    df['sentiment_label'] = df['sentiment'].map(sentiment_map)
    
    print("Unique sentiments in original data:", df['sentiment'].unique())
    print("Unique sentiment labels after mapping:", df['sentiment_label'].unique())
    print("\nSentiment distribution:")
    print(df['sentiment'].value_counts())
    
    return df

# create feature vectors
def create_features(df, max_features=5000):
    vectorizer = CountVectorizer(max_features=max_features)
    X = vectorizer.fit_transform(df['processed_comment'])
    
    print(f"\nNumber of features created: {X.shape[1]}")
    print("Sample feature names:", list(vectorizer.get_feature_names_out())[:10])
    
    return X, vectorizer

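# Reading an hf:// path with pandas requires the huggingface_hub package
# (pip install huggingface_hub); without it, read_csv raises the ImportError shown below
data_path = "hf://datasets/Senem/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data/Nostalgic_Sentiment_Analysis_of_YouTube_Comments_Data.csv"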
processed_df = prepare_dataset(data_path)
X, vectorizer = create_features(processed_df)
y = processed_df['sentiment_label'].values

print("\nDataset Statistics:")
print(f"Total number of samples: {len(processed_df)}")
print(f"Number of nostalgia comments: {sum(y == 1)}")
print(f"Number of non-nostalgia comments: {sum(y == 0)}")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Arthur\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Arthur\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Arthur\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


ImportError: Install huggingface_hub to access HfFileSystem

In [14]:
# TEST necessary for when working with external scripts
%load_ext autoreload
%autoreload 2

Data Transformation & Data Preprocessing

In [None]:
def format_comments(comments, sentiments):
    """Format comments and sentiments into a list of dictionaries"""
    return [{'comment': comment, 'sentiment': sentiment} 
            for comment, sentiment in zip(comments, sentiments)]

# Assumes processed_df is the DataFrame prepared in the earlier cell
data = pd.DataFrame.from_records(format_comments(processed_df['processed_comment'], 
                                               processed_df['sentiment']))

print(f"Dataset size: {len(data)}")

print("\nFirst two comments:")
for comment in data["comment"][:2]:
    print(comment)

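# Assign by position with .values: processed_df keeps its original (possibly non-contiguous)
# index after rows were dropped, while data has a fresh 0..n-1 index
data['sentiment_value'] = processed_df['sentiment_label'].values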

data['comment_length'] = data['comment'].str.len()
data['sentiment_intensity'] = data['comment_length'].apply(lambda x: 
    'high' if x > 100 else 'medium' if x > 50 else 'low')

# Add a timestamp column (if the original data has one)
if 'timestamp' in processed_df.columns:
    data['timestamp'] = processed_df['timestamp'].values

print("\nFirst 10 records (comment and sentiment):")
print(data.loc[:9, ['comment', 'sentiment']])  # .loc slices are inclusive, so 0..9 gives 10 rows

print("\nUsing iloc for the first 10 comments:")
print(data.iloc[:10, 0])

print("\nLast 10 records:")
print(data.iloc[-10:])

data['word_count'] = data['comment'].str.split().str.len()

# Build a per-sentiment summary (comment count and word-count statistics)
sentiment_summary = data.groupby('sentiment').agg({
    'comment': 'count',
    'word_count': ['mean', 'median', 'std']
}).round(2)

print("\nSentiment Distribution Summary:")
print(sentiment_summary)

In [10]:
# Draw a 20% stratified sample and plot it
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

X_sample, _ = train_test_split(data, test_size=0.8, stratify=data['sentiment'], random_state=42)


plt.figure(figsize=(10, 6))
sns.countplot(data=X_sample, x='sentiment')
plt.title('Sentiment Distribution in Sample')
plt.show()


plt.figure(figsize=(10, 6))
sns.boxplot(data=X_sample, x='sentiment', y='word_count')
plt.title('Word Count Distribution by Sentiment')
plt.show()

NameError: name 'data' is not defined

In [None]:
from textblob import TextBlob
from scipy.stats import skew

def create_text_features(df):
    # Text-based features
    # Note: 'comment' holds the lowercased, preprocessed text here, so caps_ratio is always 0;
    # compute it on the raw comment if capitalization matters. max(len(x), 1) avoids division by zero.
    df['caps_ratio'] = df['comment'].apply(lambda x: sum(1 for c in x if c.isupper()) / max(len(x), 1))
    df['has_question'] = df['comment'].apply(lambda x: '?' in x).astype(int)
    df['has_exclamation'] = df['comment'].apply(lambda x: '!' in x).astype(int)
    
    # Sentiment intensity features
    df['polarity'] = df['comment'].apply(lambda x: TextBlob(x).sentiment.polarity)
    df['subjectivity'] = df['comment'].apply(lambda x: TextBlob(x).sentiment.subjectivity)
    
    # Statistical features (empty comments get an average length of 0 instead of NaN)
    df['word_avg_length'] = df['comment'].apply(
        lambda x: np.mean([len(word) for word in x.split()]) if x.split() else 0.0)
    df['word_skew'] = df['comment'].apply(lambda x: skew([len(word) for word in x.split()]))
    
    return df

data = create_text_features(data)

In [None]:
# Feature Subset Selection
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

feature_cols = ['word_count', 'caps_ratio', 'polarity', 'subjectivity', 'word_avg_length']
X_features = data[feature_cols]

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_features)

# Select top k features
k = 3
selector = SelectKBest(chi2, k=k)
X_selected = selector.fit_transform(X_scaled, data['sentiment_value'])

# Get selected feature names
selected_features = np.array(feature_cols)[selector.get_support()]
print("Selected features:", selected_features)

In [None]:
# Attribute Transformation

from sklearn.preprocessing import StandardScaler, PowerTransformer

def transform_attributes(df, features):
   
    scaler = StandardScaler()
    df_scaled = pd.DataFrame(scaler.fit_transform(df[features]), 
                           columns=[f"{col}_scaled" for col in features])
    
    # Power transformation
    pt = PowerTransformer()
    df_power = pd.DataFrame(pt.fit_transform(df[features]), 
                          columns=[f"{col}_power" for col in features])
    
    # Log transformation for positive features
    df_log = pd.DataFrame()
    for col in features:
        if (df[col] > 0).all():
            df_log[f"{col}_log"] = np.log1p(df[col])
    
    return pd.concat([df_scaled, df_power, df_log], axis=1)

transformed_features = transform_attributes(data, selected_features)

In [None]:
# Dimensionality Reduction
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

numeric_features = data[selected_features]

# PCA
pca = PCA(n_components=2)
pca_result = pca.fit_transform(numeric_features)

# t-SNE
tsne = TSNE(n_components=2, random_state=42)
tsne_result = tsne.fit_transform(numeric_features)

# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Plot PCA
scatter1 = ax1.scatter(pca_result[:, 0], pca_result[:, 1], 
                      c=data['sentiment_value'], cmap='viridis')
ax1.set_title('PCA Results')
plt.colorbar(scatter1, ax=ax1)

# Plot t-SNE
scatter2 = ax2.scatter(tsne_result[:, 0], tsne_result[:, 1], 
                      c=data['sentiment_value'], cmap='viridis')
ax2.set_title('t-SNE Results')
plt.colorbar(scatter2, ax=ax2)

plt.tight_layout()
plt.show()

# Print explained variance ratio for PCA
print("PCA explained variance ratio:", pca.explained_variance_ratio_)

In [None]:
from sklearn.preprocessing import KBinsDiscretizer, Binarizer

# Discretization
kbd = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
discretized_features = pd.DataFrame(kbd.fit_transform(data[selected_features]), 
                                  columns=[f"{col}_disc" for col in selected_features])

# Binarization
binarizer = Binarizer()
binary_features = pd.DataFrame(binarizer.fit_transform(data[selected_features]), 
                             columns=[f"{col}_bin" for col in selected_features])

data = pd.concat([data, discretized_features, binary_features], axis=1)

Data Exploration

In [None]:
# Document Similarity Analysis

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import seaborn as sns

# Select three sample documents (df only exists inside prepare_dataset, so use processed_df)
sample_docs = processed_df['processed_comment'].head(3).tolist()

# Create TF-IDF vectors
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(sample_docs)

# Calculate cosine similarity
similarity_matrix = cosine_similarity(tfidf_matrix)


plt.figure(figsize=(8, 6))
sns.heatmap(similarity_matrix, 
            annot=True, 
            cmap='YlOrRd', 
            xticklabels=['Doc 1', 'Doc 2', 'Doc 3'],
            yticklabels=['Doc 1', 'Doc 2', 'Doc 3'])
plt.title('Document Similarity Matrix')
plt.show()

# Print the original documents for reference
print("\nOriginal documents used for comparison:")
for i, doc in enumerate(processed_df['comment'].head(3), 1):
    print(f"\nDocument {i}:")
    print(doc)
    print("Processed version:")
    print(processed_df['processed_comment'].iloc[i-1])

# Additional similarity analysis
print("\nPairwise similarity scores:")
for i in range(len(sample_docs)):
    for j in range(i+1, len(sample_docs)):
        print(f"Similarity between Document {i+1} and Document {j+1}: {similarity_matrix[i][j]:.4f}")

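# Analyze the combined vocabulary of the three documents
# (separate names so the global X and vectorizer from the first cell are not overwritten)
sample_vectorizer = CountVectorizer()
X_sample_counts = sample_vectorizer.fit_transform(sample_docs)
vocabulary = sample_vectorizer.get_feature_names_out()

print("\nVocabulary across the sample documents:")
print("Vocabulary size:", len(vocabulary))
print("Sample terms:", sorted(vocabulary)[:10])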

# Document statistics
print("\nDocument Statistics:")
for i, doc in enumerate(sample_docs, 1):
    words = doc.split()
    print(f"\nDocument {i}:")
    print(f"Word count: {len(words)}")
    print(f"Unique words: {len(set(words))}")

***Requirement: TF-IDF features

In [None]:
#reference from internet, Scikit-learn, GAI
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# 1. Create TF-IDF features
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,
    min_df=5,              # Min frequency
    max_df=0.95,           # Max frequency
    ngram_range=(1, 2)     # both unigrams and bigrams
)

tfidf_matrix = tfidf_vectorizer.fit_transform(processed_df['processed_comment'])

feature_names = tfidf_vectorizer.get_feature_names_out()

print(f"Number of TF-IDF features: {len(feature_names)}")
print(f"TF-IDF matrix shape: {tfidf_matrix.shape}")

# 2. Analyze top terms by TF-IDF scores
def get_top_tfidf_terms(tfidf_matrix, feature_names, n_top=10):
    # Sum tfidf values
    sums = np.array(tfidf_matrix.sum(axis=0)).flatten()
    # Sort term-score pairs
    terms_scores = [(term, score) for term, score in zip(feature_names, sums)]
    terms_scores = sorted(terms_scores, key=lambda x: x[1], reverse=True)
    return terms_scores[:n_top]

top_terms = get_top_tfidf_terms(tfidf_matrix, feature_names)

plt.figure(figsize=(12, 6))
terms, scores = zip(*top_terms)
plt.barh(terms, scores)
plt.title('Top Terms by Overall TF-IDF Score')
plt.xlabel('TF-IDF Score Sum')
plt.ylabel('Terms')
plt.show()

# 3. Analyze TF-IDF scores by sentiment
def get_sentiment_specific_tfidf(tfidf_matrix, sentiment_labels, feature_names, sentiment_value, n_top=10):
    
    sentiment_docs = tfidf_matrix[sentiment_labels == sentiment_value]
    avg_scores = np.array(sentiment_docs.mean(axis=0)).flatten()
    terms_scores = [(term, score) for term, score in zip(feature_names, avg_scores)]
    return sorted(terms_scores, key=lambda x: x[1], reverse=True)[:n_top]

nostalgia_terms = get_sentiment_specific_tfidf(tfidf_matrix, y, feature_names, 1)
non_nostalgia_terms = get_sentiment_specific_tfidf(tfidf_matrix, y, feature_names, 0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# nostalgia
terms, scores = zip(*nostalgia_terms)
ax1.barh(terms, scores)
ax1.set_title('Top Terms in Nostalgia Comments')
ax1.set_xlabel('Average TF-IDF Score')

# non-nostalgia
terms, scores = zip(*non_nostalgia_terms)
ax2.barh(terms, scores)
ax2.set_title('Top Terms in Non-Nostalgia Comments')
ax2.set_xlabel('Average TF-IDF Score')

plt.tight_layout()
plt.show()

# 4. Create document similarity matrix using TF-IDF
from sklearn.metrics.pairwise import cosine_similarity

sample_size = min(1000, tfidf_matrix.shape[0])
rng = np.random.default_rng(42)  # fixed seed so the sample is reproducible
sample_indices = rng.choice(tfidf_matrix.shape[0], sample_size, replace=False)
sample_matrix = tfidf_matrix[sample_indices]
similarity_matrix = cosine_similarity(sample_matrix)

plt.figure(figsize=(10, 8))
sns.heatmap(similarity_matrix[:50, :50], cmap='YlOrRd')
plt.title('Document Similarity Matrix (50 Randomly Sampled Documents)')
plt.show()

# 5. Analyze TF-IDF distribution
# Use the sparse matrix's own count of stored non-zero entries (np.count_nonzero targets dense arrays)
sparsity = 1.0 - tfidf_matrix.nnz / float(tfidf_matrix.shape[0] * tfidf_matrix.shape[1])
print(f"\nTF-IDF matrix sparsity: {sparsity:.2%}")

tfidf_values = tfidf_matrix.data
plt.figure(figsize=(10, 6))
plt.hist(tfidf_values, bins=50, edgecolor='black')
plt.title('Distribution of TF-IDF Values')
plt.xlabel('TF-IDF Value')
plt.ylabel('Frequency')
plt.show()


tfidf_features = pd.DataFrame.sparse.from_spmatrix(
    tfidf_matrix,
    columns=feature_names
)

print("\nTF-IDF Statistics:")
print(f"Average number of non-zero terms per document: {np.diff(tfidf_matrix.indptr).mean():.2f}")
print(f"Maximum TF-IDF value: {tfidf_matrix.max():.4f}")
print(f"Minimum non-zero TF-IDF value: {tfidf_matrix.data.min():.4f}")
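To make the difference between word-frequency weights and TF-IDF weights concrete, here is a minimal sketch that recomputes TF-IDF by hand and compares it with `TfidfVectorizer` under its default settings (smoothed IDF, L2 row normalization). The three short documents are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus, used only for illustration
docs = [
    "old song bring back memory",
    "love this song",
    "memory of school day",
]

# Raw term counts: the word-frequency document-term matrix
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts by IDF, then L2-normalize each document row
manual_tfidf = counts * idf
manual_tfidf = manual_tfidf / np.linalg.norm(manual_tfidf, axis=1, keepdims=True)

# Compare with the library result under default parameters
sk_tfidf = TfidfVectorizer().fit_transform(docs).toarray()
matches = np.allclose(manual_tfidf, sk_tfidf)
print("Manual TF-IDF matches TfidfVectorizer:", matches)
```

Terms that occur in several documents (such as "song" here) receive a lower IDF than terms confined to one document, which is what separates these weights from plain counts.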

***Requirement: Data visualizations

In [None]:
#reference from internet, GAI
import seaborn as sns
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from collections import Counter

# 1. Sentiment Distribution Plot
plt.figure(figsize=(10, 6))
sns.countplot(data=processed_df, x='sentiment')
plt.title('Distribution of Nostalgia vs Non-Nostalgia Comments')
plt.xlabel('Sentiment Category')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.show()

# 2. Comment Length Analysis
processed_df['comment_length'] = processed_df['comment'].str.len()

plt.figure(figsize=(12, 5))
sns.boxplot(data=processed_df, x='sentiment', y='comment_length')
plt.title('Comment Length Distribution by Sentiment')
plt.xlabel('Sentiment Category')
plt.ylabel('Comment Length')
plt.show()

# 3. Word Count Distribution
processed_df['word_count'] = processed_df['processed_comment'].str.split().str.len()

plt.figure(figsize=(10, 6))
sns.histplot(data=processed_df, x='word_count', hue='sentiment', bins=50, multiple="layer", alpha=0.6)
plt.title('Word Count Distribution by Sentiment')
plt.xlabel('Number of Words')
plt.ylabel('Frequency')
plt.show()

# 4. Word Clouds for Nostalgia and Non-Nostalgia Comments
plt.figure(figsize=(15, 6))

# Nostalgia WordCloud
plt.subplot(1, 2, 1)
nostalgia_text = ' '.join(processed_df[processed_df['sentiment'] == 'nostalgia']['processed_comment'])
wordcloud_nostalgia = WordCloud(width=800, height=400, background_color='white').generate(nostalgia_text)
plt.imshow(wordcloud_nostalgia)
plt.title('Most Common Words in Nostalgia Comments')
plt.axis('off')

# Non-nostalgia WordCloud
plt.subplot(1, 2, 2)
non_nostalgia_text = ' '.join(processed_df[processed_df['sentiment'] == 'not nostalgia']['processed_comment'])
wordcloud_non_nostalgia = WordCloud(width=800, height=400, background_color='white').generate(non_nostalgia_text)
plt.imshow(wordcloud_non_nostalgia)
plt.title('Most Common Words in Non-Nostalgia Comments')
plt.axis('off')

plt.tight_layout()
plt.show()

# 5. Top Words Comparison
def get_top_words(text, n=20):
    words = ' '.join(text).split()
    return Counter(words).most_common(n)

nostalgia_words = get_top_words(processed_df[processed_df['sentiment'] == 'nostalgia']['processed_comment'])
non_nostalgia_words = get_top_words(processed_df[processed_df['sentiment'] == 'not nostalgia']['processed_comment'])

plt.figure(figsize=(15, 8))

# Plot for nostalgia
plt.subplot(1, 2, 1)
words, counts = zip(*nostalgia_words)
plt.barh(words, counts)
plt.title('Top 20 Words in Nostalgia Comments')
plt.xlabel('Frequency')

# Plot for non-nostalgia
plt.subplot(1, 2, 2)
words, counts = zip(*non_nostalgia_words)
plt.barh(words, counts)
plt.title('Top 20 Words in Non-Nostalgia Comments')
plt.xlabel('Frequency')

plt.tight_layout()
plt.show()

# 6. Time Series Analysis (if timestamp is available)
if 'timestamp' in processed_df.columns:
    processed_df['date'] = pd.to_datetime(processed_df['timestamp'])
    daily_sentiments = processed_df.groupby([processed_df['date'].dt.date, 'sentiment']).size().unstack()
    
    plt.figure(figsize=(15, 6))
    daily_sentiments.plot(kind='line')
    plt.title('Sentiment Trends Over Time')
    plt.xlabel('Date')
    plt.ylabel('Number of Comments')
    plt.legend(title='Sentiment')
    plt.show()

# 7. Character Length Distribution
plt.figure(figsize=(12, 6))
sns.violinplot(data=processed_df, x='sentiment', y='comment_length')
plt.title('Comment Length Distribution (Violin Plot)')
plt.xlabel('Sentiment Category')
plt.ylabel('Number of Characters')
plt.show()

# 8. Correlation Heatmap (if more than one numerical feature exists)
numerical_features = processed_df.select_dtypes(include=[np.number]).columns
if len(numerical_features) > 1:
    plt.figure(figsize=(10, 8))
    sns.heatmap(processed_df[numerical_features].corr(), annot=True, cmap='coolwarm')
    plt.title('Correlation Heatmap of Numerical Features')
    plt.show()

Data Classification

***Requirement: Naive Bayes classifier

In [22]:
#reference from internet, GAI
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import MinMaxScaler

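def train_and_evaluate_nb_classifiers(X, text_column='text',
                                      label_column='category', name_column='category_name'):
    """
    Train and compare two Naive Bayes classifiers (word counts vs. TF-IDF).

    Parameters:
    X (pd.DataFrame): DataFrame containing text data and categories
    text_column (str): Name of the column containing text data
    label_column (str): Name of the column containing numeric labels
    name_column (str): Name of the column containing readable label names

    For the nostalgia dataset, pass text_column='processed_comment',
    label_column='sentiment_label', name_column='sentiment'.
    
    Returns:
    dict: Vectorizer, classifier, accuracy and report for each classifier
    """
    # Create category mapping
    category_mapping = dict(X[[label_column, name_column]].drop_duplicates().values)
    target_names = [category_mapping[label] for label in sorted(category_mapping.keys())]
    
    X_train_raw, X_test_raw, y_train, y_test = train_test_split(
        X[text_column], X[label_column], test_size=0.3, random_state=42
    )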
    
    # 1. Word Frequency Based Classifier
    print("Training Word Frequency Based Classifier...")
    count_vectorizer = CountVectorizer(max_features=5000)
    X_train_counts = count_vectorizer.fit_transform(X_train_raw)
    X_test_counts = count_vectorizer.transform(X_test_raw)
    
    nb_freq_classifier = MultinomialNB()
    nb_freq_classifier.fit(X_train_counts, y_train)
    y_pred_freq = nb_freq_classifier.predict(X_test_counts)
    
    freq_accuracy = accuracy_score(y_test, y_pred_freq)
    freq_report = classification_report(y_test, y_pred_freq, 
                                     target_names=target_names, 
                                     digits=4)
    
    # 2. TF-IDF Based Classifier
    print("\nTraining TF-IDF Based Classifier...")
    tfidf_vectorizer = TfidfVectorizer(max_features=5000)
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_raw)
    X_test_tfidf = tfidf_vectorizer.transform(X_test_raw)
    
    # Convert sparse matrix to dense and scale features
    X_train_tfidf_dense = X_train_tfidf.toarray()
    X_test_tfidf_dense = X_test_tfidf.toarray()
    
    scaler = MinMaxScaler()
    X_train_tfidf_scaled = scaler.fit_transform(X_train_tfidf_dense)
    X_test_tfidf_scaled = scaler.transform(X_test_tfidf_dense)
    
    # Use GaussianNB for TF-IDF features
    nb_tfidf_classifier = GaussianNB()
    nb_tfidf_classifier.fit(X_train_tfidf_scaled, y_train)
    y_pred_tfidf = nb_tfidf_classifier.predict(X_test_tfidf_scaled)
    
    tfidf_accuracy = accuracy_score(y_test, y_pred_tfidf)
    tfidf_report = classification_report(y_test, y_pred_tfidf, 
                                       target_names=target_names, 
                                       digits=4)
    
    print("\n=== Word Frequency Based Results ===")
    print(f"Accuracy: {freq_accuracy:.4f}")
    print("\nClassification Report:")
    print(freq_report)
    
    print("\n=== TF-IDF Based Results ===")
    print(f"Accuracy: {tfidf_accuracy:.4f}")
    print("\nClassification Report:")
    print(tfidf_report)
    
    return {
        'frequency_based': {
            'vectorizer': count_vectorizer,
            'classifier': nb_freq_classifier,
            'accuracy': freq_accuracy,
            'report': freq_report
        },
        'tfidf_based': {
            'vectorizer': tfidf_vectorizer,
            'classifier': nb_tfidf_classifier,
            'accuracy': tfidf_accuracy,
            'report': tfidf_report
        }
    }

In [26]:
# I implemented GaussianNB with TF-IDF features (abbreviated GauNB below)
"""
Differences (vs. MultinomialNB with TF-IDF, abbreviated MulNB):
1. Feature: GauNB treats TF-IDF values as continuous and models each feature with a normal distribution,
while MulNB interprets feature values as counts, which TF-IDF weights are not.
In practice MulNB can still work with fractional values such as TF-IDF, so this is a modeling choice.
2. Task: GauNB may be more suitable for capturing subtle differences in word weights within a document,
while MulNB is more suitable for handling discrete, count-like features in document classification tasks.
3. Performance: GauNB may perform better when the distinctions between categories rest on word usage patterns,
while MulNB may perform better when there are clear keyword differences between categories.
Note: GauNB needs a dense array, so it uses more memory than MulNB, which accepts the sparse matrix.
"""

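The comparison above can be tried on a small scale. The following is a minimal sketch, assuming a tiny hand-written corpus (hypothetical comments, not rows from the dataset), that fits both classifiers on the same TF-IDF features. It only shows the difference in input handling; it says nothing about which model scores better on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Hypothetical comments, written for illustration only
texts = [
    "this song takes me back to my childhood",
    "i remember listening to this in the old days",
    "memories of my school years come back",
    "those were the good old days",
    "great guitar solo in this track",
    "the singer has a beautiful voice",
    "i love the beat of this song",
    "nice video and great sound quality",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = nostalgia, 0 = not nostalgia

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(texts)

# MultinomialNB accepts the sparse, non-negative TF-IDF matrix directly
mnb = MultinomialNB().fit(X_tfidf, labels)

# GaussianNB requires a dense array, which can be costly for large vocabularies
gnb = GaussianNB().fit(X_tfidf.toarray(), labels)

new_comment = tfidf.transform(["this takes me back to the old days"])
pred_mnb = mnb.predict(new_comment)
pred_gnb = gnb.predict(new_comment.toarray())
print("MultinomialNB:", pred_mnb, "GaussianNB:", pred_gnb)
```

On the full dataset, both variants could be evaluated on the same train/test split so that their reports are directly comparable.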

***Requirement: Thoughts of the lab

## Current Limitations and Inefficiencies

1. Missing Values:

    A. Current Approach: Directly removing missing values or filling them with a simple mean.

    B. Issues:

        1) Risk of losing valuable information, especially if missing data follows a specific pattern.
        2) Using mean imputation may distort the true data distribution.
        3) No consideration of the missing data mechanism (MCAR, MAR, MNAR).
    
    C. Possible improvements:

        I. Recommendations:
            1) Analyze missing patterns to understand the missing data mechanism.
            2) Use advanced imputation techniques, such as MICE (Multiple Imputation by Chained Equations), which accounts for relationships between features.
            3) Add an indicator for "missingness" as a new feature, which may have predictive value.
        
        II. Reasons:
            1) Retains more original information.
            2) Preserves the underlying data structure.
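As a rough illustration of the missing-value recommendations, the sketch below inspects the share of missing values, imputes with scikit-learn's `IterativeImputer` (a chained-equations style imputer that is still marked experimental), and appends missingness indicator columns. The small feature table and the `_missing` column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical numeric features with gaps, not taken from the dataset
features = pd.DataFrame({
    "word_count": [12, 30, np.nan, 8, 25, 40],
    "polarity": [0.1, 0.5, 0.4, np.nan, 0.3, 0.6],
})

# Inspect how much is missing per column before choosing a strategy
missing_share = features.isna().mean()
print(missing_share)

# add_indicator=True appends one 0/1 "was missing" column per feature that has gaps
imputer = IterativeImputer(random_state=0, add_indicator=True)
imputed = imputer.fit_transform(features)

columns = list(features.columns) + [f"{c}_missing" for c in features.columns]
imputed_df = pd.DataFrame(imputed, columns=columns)
print(imputed_df)
```

The indicator columns let a downstream model use the fact that a value was missing, which may matter when the data is not missing completely at random.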
        
2. Feature Engineering:

    A. Current Approach: Creating numerous frequency-based features.

    B. Issues:

        1) Overly sparse features, leading to higher computational cost.
        2) Lack of consideration for the semantic relationships between words.
        3) Potential for high correlation among features.

    C. Possible improvements:

        I. Recommendations:
            1) Use advanced text features, such as word embeddings (e.g., Word2Vec, BERT).
            2) Incorporate domain knowledge, such as specialized keyword dictionaries.
        II. Reasons:
            1) Captures deeper semantic information.
            2) Increases feature interpretability.
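The keyword-dictionary idea can be sketched in a few lines. The lexicon `NOSTALGIA_KEYWORDS` and the helper `keyword_features` below are hypothetical names introduced for illustration; a real lexicon would need to be curated from domain knowledge.

```python
import pandas as pd

# Hypothetical nostalgia lexicon (an assumption, not part of the dataset)
NOSTALGIA_KEYWORDS = {"remember", "memory", "memories", "childhood", "old", "miss", "back"}

def keyword_features(comment: str) -> dict:
    """Count lexicon hits in a comment and express them as a share of its tokens."""
    tokens = comment.lower().split()
    hits = sum(token in NOSTALGIA_KEYWORDS for token in tokens)
    return {
        "keyword_hits": hits,
        "keyword_ratio": hits / len(tokens) if tokens else 0.0,
    }

comments = pd.Series([
    "I remember my childhood summers",
    "great guitar solo",
])
keyword_df = pd.DataFrame(comments.apply(keyword_features).tolist())
print(keyword_df)
```

Features like these are dense and easy to interpret, so they could complement the sparse frequency-based features rather than replace them.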


3. Dimensionality Reduction:

    A. Current Approach: Direct application of PCA.

    B. Issues:

        1) Dimensionality reduction without prior feature importance assessment.
        2) Risk of losing essential class-separating information.

    C. Possible improvements:

        I. Recommendations:
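            1) Perform feature selection first, for example with Lasso (L1-regularized) regression; Ridge only shrinks coefficients and does not remove features.
            2) Employ nonlinear dimensionality reduction methods (e.g., t-SNE or UMAP).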
        II. Reasons:
            1) Retains more classification information.

4. Binarization:

    A. Current Approach: Fixed threshold.

    B. Issues:

        1) Threshold selection lacks data-driven support.
        2) No consideration of the distributional characteristics of features.
        3) Potential information loss.

    C. Possible improvements:

        I. Recommendations:
            1) Distribution-based thresholds
            2) Supervised binarization, choosing thresholds that maximize class separation and minimize information loss.
        II. Reasons:
            1) Adapts to data characteristics.
            2) Retains essential information.
            3) Enhances classification performance.
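A data-driven threshold can be obtained in several ways. The sketch below uses a depth-1 decision tree (a decision stump) to pick the cut point that best separates the two classes on a single feature. This is one assumed approach among many, and the word counts and labels are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical word counts and labels, not taken from the dataset
word_count = np.array([3, 5, 6, 8, 20, 22, 25, 30]).reshape(-1, 1)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A depth-1 tree learns the single split that minimizes class impurity
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(word_count, labels)
threshold = float(stump.tree_.threshold[0])

# Binarize with the learned threshold instead of a fixed constant
binary_feature = (word_count.ravel() > threshold).astype(int)
print("Learned threshold:", threshold)
print("Binarized feature:", binary_feature)
```

Compared with `Binarizer()` and its default threshold of 0, the cut point here depends on both the feature distribution and the labels.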

5. Text Preprocessing:

    A. Recommendations:
        1) Include more NLP steps, such as lemmatization rather than stemming.
    B. Reason: Retains more semantic information, enhancing feature quality.

6. Class Imbalance Handling:

    A. Recommendations:
        1) Use advanced oversampling methods like SMOTE or ADASYN.
        2) Combine oversampling and undersampling.
    B. Reason: Prevents model bias towards majority classes, improving minority class recognition.
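As a simpler stand-in for SMOTE or ADASYN, the sketch below balances classes by random oversampling with `sklearn.utils.resample`. The toy DataFrame is hypothetical and deliberately imbalanced; whether the real dataset needs rebalancing should be checked from its sentiment counts first.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical imbalanced data: 2 nostalgia comments vs. 8 non-nostalgia comments
df = pd.DataFrame({
    "comment": [f"comment {i}" for i in range(10)],
    "sentiment_label": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

majority = df[df["sentiment_label"] == 0]
minority = df[df["sentiment_label"] == 1]

# Sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

# Combine and shuffle so the classes are interleaved
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["sentiment_label"].value_counts())
```

Oversampling should be applied to the training split only; duplicating rows before the train/test split would leak copies of training comments into the test set.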