### Student Information
Name: 蘇立光

Student ID: 110000162

GitHub ID: 92088440

Kaggle name: Emperor Augusto

Kaggle private scoreboard snapshot: 

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home exercises** in the [DM2024-Lab2-master Repo](https://github.com/didiersalazar/DM2024-Lab2-Master). You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework) regarding Emotion Recognition on Twitter by this link: https://www.kaggle.com/competitions/dm-2024-isa-5810-lab-2-homework. The scoring will be given according to your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (0.6N + 1 - x) / (0.6N) * 10 + 20 points, where N is the total number of participants, and x is your rank. (E.g., if there are 100 participants and you rank 3rd, your score will be (0.6 * 100 + 1 - 3) / (0.6 * 100) * 10 + 20 = 29.67% out of 30%.)
    Submit your last submission **BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)**. Make sure to take a screenshot of your position at the end of the competition and store it as '''pic0.png''' under the **img** folder of this repository and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment on it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained.


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Nov. 26th, 11:59 pm, Tuesday)__. 
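
The leaderboard formula in item 2 above can be checked numerically. The sketch below is one reading of the rule, assuming that ranks beyond 0.6N receive the flat 20 points; the function name `kaggle_points` is illustrative and not part of the course material.

```python
def kaggle_points(n_participants: int, rank: int) -> float:
    """Points (out of 30) for a given private-leaderboard rank.

    Assumption: ranks beyond 0.6 * N fall in the bottom 40% and get the flat 20.
    """
    cutoff = 0.6 * n_participants
    if rank > cutoff:
        return 20.0
    return (cutoff + 1 - rank) / cutoff * 10 + 20


# Worked example from the instructions: 100 participants, rank 3.
print(round(kaggle_points(100, 3), 2))  # 29.67
```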

In [3]:
### Begin Assignment Here

## Initial fetch and data preprocessing

### Fetch data

In [4]:
import pandas as pd
import numpy as np

tweets_file = "/kaggle/input/lab2-dataset/tweets_DM.json"
emotion_file = "/kaggle/input/lab2-dataset/emotion.csv"
data_identification_file = "/kaggle/input/lab2-dataset/data_identification.csv"

tweets_df = pd.read_json(tweets_file, lines=True)
emotion_df = pd.read_csv(emotion_file)
data_identification_df = pd.read_csv(data_identification_file)

FileNotFoundError: File /kaggle/input/lab2-dataset/tweets_DM.json does not exist

In [None]:
tweets_df

### Extract and merge the data

In [None]:
# Extract data from `_source` column
tweets_df['tweet_id'] = tweets_df['_source'].apply(lambda x: x['tweet']['tweet_id'])
tweets_df['text'] = tweets_df['_source'].apply(lambda x: x['tweet']['text'])
tweets_df['hashtags'] = tweets_df['_source'].apply(lambda x: x['tweet'].get('hashtags', []))

# And merge the data with emotion_df and data_identification_df
merged_df = tweets_df.merge(emotion_df, on="tweet_id", how="left")
merged_df = merged_df.merge(data_identification_df, on="tweet_id", how="left")

In [None]:
merged_df
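
The extraction above relies on each `_source` entry being a nested dict with a `tweet` key. A minimal sketch on two made-up records (the IDs, texts, and `toy_*` names are invented for illustration) shows the same flattening done with `pd.json_normalize`, plus a left merge with `indicator=True` to see which rows found a label:

```python
import pandas as pd

# Two made-up records mimicking the nested layout the cell above relies on.
toy_tweets = pd.DataFrame({
    "_source": [
        {"tweet": {"tweet_id": "0x01", "text": "so happy today", "hashtags": ["joy"]}},
        {"tweet": {"tweet_id": "0x02", "text": "what a mess", "hashtags": []}},
    ]
})
toy_emotion = pd.DataFrame({"tweet_id": ["0x01"], "emotion": ["joy"]})

# json_normalize flattens the nested dicts into dotted column names in one pass.
flat = pd.json_normalize(toy_tweets["_source"].tolist())
flat.columns = [c.replace("tweet.", "", 1) for c in flat.columns]

# A left merge keeps every tweet; indicator=True adds a `_merge` column
# showing which rows found a label ('both') and which did not ('left_only').
toy_merged = flat.merge(toy_emotion, on="tweet_id", how="left", indicator=True)
print(toy_merged[["tweet_id", "emotion", "_merge"]])
```

Rows marked `left_only` correspond to tweets without an entry in the label table, which is what one would expect for the test split.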

### Fetch training and testing data

In [None]:
# Separate training and testing data defined in data_identification_df
train_df = merged_df[merged_df['identification'] == 'train'].copy()
test_df = merged_df[merged_df['identification'] == 'test'].copy()

In [None]:
# Check content
print("\nTraining Data Sample:")
print(train_df.head())

print("\nTesting Data Sample:")
print(test_df.head())

In [None]:
# Check shape
print("Training Data Shape:", train_df.shape)
print("Testing Data Shape:", test_df.shape)

# Check for NaN value(s) in training
print("\nMissing Values in Training Data:")
print(train_df['emotion'].isnull().sum())

# Check for NaN value(s) in testing
print("\nMissing Values in Testing Data:")
print(test_df['emotion'].isnull().sum())

### Data Analysis

In [None]:
import matplotlib.pyplot as plt

class_counts = train_df['emotion'].value_counts()

plt.figure(figsize=(10, 6))
class_counts.plot(kind='bar', color='skyblue')
plt.title('Class Distribution')
plt.xlabel('Emotion Class')
plt.ylabel('Num of Samples')
plt.xticks(rotation=45)
plt.grid(axis='y')
plt.show()

print("Class Counts:")
print(class_counts)

print("\nOccurrence Rate:")
print(class_counts / len(train_df))
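
If the counts above turn out to be imbalanced, one common option is inverse-frequency class weighting. The sketch below computes the "balanced" heuristic, `n_samples / (n_classes * class_count)`, on made-up labels. It is an illustration only and is not used in the training cell of this notebook.

```python
from collections import Counter

# Made-up labels standing in for train_df['emotion'].
toy_labels = ["joy"] * 6 + ["sadness"] * 3 + ["anger"] * 1

counts = Counter(toy_labels)
n_samples = len(toy_labels)
n_classes = len(counts)

# "Balanced" heuristic: weight = n_samples / (n_classes * class_count),
# so rarer classes receive proportionally larger weights.
class_weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}
print(class_weights)
```

Keras' `fit` expects `class_weight` keys to be integer class indices, so the string keys here would first need to be mapped through the label encoder.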

## Encoding and model run

### TF-IDF Vectorizer

In [None]:
import re
import nltk
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Preprocessing function
def preprocess_text(text):
    # Remove URLs (anything starting with http or www)
    text = re.sub(r"http\S+|www\S+", '', text) 
    # Remove special characters and numbers
    text = re.sub(r"[^a-zA-Z\s]", '', text)
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # Convert to lowercase
    return text.lower()

train_df['text'] = train_df['text'].apply(preprocess_text)
test_df['text'] = test_df['text'].apply(preprocess_text)

# Do TF-IDF
tfidf = TfidfVectorizer(max_features=10000, stop_words='english', ngram_range=(1, 2), tokenizer=nltk.word_tokenize)
tfidf.fit(train_df['text'])

# Transform the data using TF-IDF
X_train = tfidf.transform(train_df['text'])
y_train = train_df['emotion']

X_test = tfidf.transform(test_df['text'])
y_test = test_df['emotion']

In [None]:
# Check shape
print('X_train.shape: ', X_train.shape)
print('y_train.shape: ', y_train.shape)
print('X_test.shape: ', X_test.shape)
print('y_test.shape: ', y_test.shape)
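
To make the effect of the cleaning rules concrete, the sketch below reruns the same three regular expressions on a made-up tweet (the sample string is invented). It shows that digits, punctuation, emoji, and the `#` symbol are dropped while the hashtag word itself survives, and that apostrophes are removed so contractions collapse into a single token.

```python
import re

def preprocess_text(text):
    text = re.sub(r"http\S+|www\S+", "", text)   # drop URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)      # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace
    return text.lower()

# Made-up tweet for illustration.
sample = "Can't WAIT for 2024!!! #excited 😀 https://t.co/abc123"
print(preprocess_text(sample))  # cant wait for excited
```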

### Label encoding

In [None]:
## deal with label (string -> one-hot)
import keras
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
label_encoder.fit(y_train)
print('check label: ', label_encoder.classes_)
print('\n## Before convert')
print('y_train[0:4]:\n', y_train[0:4])
print('\ny_train.shape: ', y_train.shape)
print('y_test.shape: ', y_test.shape)

def label_encode(le, labels):
    enc = le.transform(labels)
    return keras.utils.to_categorical(enc)

def label_decode(le, one_hot_label):
    dec = np.argmax(one_hot_label, axis=1)
    return le.inverse_transform(dec)

y_train = label_encode(label_encoder, y_train)
#y_test = label_encode(label_encoder, y_test)

print('\n\n## After convert')
print('y_train[0:4]:\n', y_train[0:4])
print('\ny_train.shape: ', y_train.shape)
print('y_test.shape: ', y_test.shape)
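
The encode/decode round trip above can be reproduced with NumPy alone, which may help when checking what `label_encode` and `label_decode` return. This is a sketch on made-up class names; it mirrors the behavior (sorted classes, one-hot rows, argmax decoding) rather than the exact Keras or scikit-learn internals.

```python
import numpy as np

# Made-up classes, sorted the way LabelEncoder.classes_ is.
classes = np.array(["anger", "joy", "sadness"])
labels = np.array(["joy", "anger", "sadness", "joy"])

# String -> integer index (position in the sorted class array).
indices = np.searchsorted(classes, labels)

# Integer index -> one-hot row, the same layout to_categorical produces.
one_hot = np.eye(len(classes))[indices]

# One-hot (or softmax probabilities) -> string label via argmax.
decoded = classes[np.argmax(one_hot, axis=1)]
print(one_hot)
print(decoded)
```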


### Split training data

In [None]:
from sklearn.model_selection import train_test_split

# The test labels are empty, so hold out part of the training data instead:
# split it into 90% training and 10% validation (test_size=0.1)

X_t, X_val, y_t, y_val = train_test_split(
    X_train, 
    y_train, 
    test_size=0.1, 
    random_state=42, 
    stratify=y_train
)
print("X_t.shape: ", X_t.shape)
print("y_t.shape: ", y_t.shape)
print("X_val.shape: ", X_val.shape)
print("y_val.shape: ", y_val.shape)
print("X_test.shape: ", X_test.shape)

In [None]:
# I/O check
input_shape = X_t.shape[1]
print('input_shape: ', input_shape)

output_shape = len(label_encoder.classes_)
print('output_shape: ', output_shape)

### Define model

In [None]:
from keras.models import Model
from keras.layers import Input, Dense, Dropout, BatchNormalization, Softmax, LeakyReLU
from keras.regularizers import l2
from keras.optimizers import SGD

# Using the DNN from the master code

# input layer
model_input = Input(shape=(input_shape, ))  # TF-IDF feature dimension
X = model_input

# 1st hidden layer
X_W1 = Dense(units=256)(X)
H1 = BatchNormalization()(X_W1)
H1 = LeakyReLU(alpha=0.01)(H1)
H1 = Dropout(0.5)(H1)

# 2nd hidden layer
H1_W2 = Dense(units=256)(H1)
H2 = BatchNormalization()(H1_W2)
H2 = LeakyReLU(alpha=0.01)(H2)
H2 = Dropout(0.5)(H2)

# 3rd hidden layer
H2_W3 = Dense(units=256)(H2)
H3 = BatchNormalization()(H2_W3)
H3 = LeakyReLU(alpha=0.01)(H3)
H3 = Dropout(0.5)(H3)

# output layer
H4_W5 = Dense(units=output_shape)(H3)  # one unit per emotion class
H5 = Softmax()(H4_W5)

model_output = H5

# create model
model = Model(inputs=[model_input], outputs=[model_output])

# Note: this SGD optimizer is defined but not used; the model is compiled with Adam below
sgd = SGD(learning_rate=0.1, momentum=0.9, nesterov=True)

# loss function & optimizer
model.compile(optimizer="adam",
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# show model construction
model.summary()
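
As a rough cross-check of what `model.summary()` reports, the parameter count of this architecture can be worked out by hand. The sketch assumes the TF-IDF vocabulary reaches the `max_features=10000` cap and that there are 8 emotion classes; both numbers are assumptions, so substitute the printed `input_shape` and `output_shape` values.

```python
def dense_params(n_in: int, n_out: int) -> int:
    # One weight per input/output pair plus one bias per output unit.
    return n_in * n_out + n_out

def batchnorm_params(n_features: int) -> int:
    # gamma and beta (trainable) plus moving mean and variance (non-trainable).
    return 4 * n_features

input_dim = 10000   # assumption: vocabulary reaches max_features
n_classes = 8       # assumption: replace with len(label_encoder.classes_)
hidden = 256

total = (
    dense_params(input_dim, hidden) + batchnorm_params(hidden)
    + 2 * (dense_params(hidden, hidden) + batchnorm_params(hidden))
    + dense_params(hidden, n_classes)
)
print(f"{total:,}")
```

Under these assumptions almost all parameters sit in the first Dense layer, so the size of the TF-IDF vocabulary dominates the model size.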

In [None]:
from keras.callbacks import CSVLogger


# Training settings
csv_logger = CSVLogger('training_log.csv')
epochs = 15
batch_size = 128

# Train the model (no class weights are applied here)
history = model.fit(
    X_t,
    y_t,
    epochs=epochs, 
    batch_size=batch_size, 
    callbacks=[csv_logger],
    validation_data=(X_val, y_val)
)

print('Training complete!')
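
With a fixed number of epochs, validation loss can start rising before the last epoch. A small helper such as the hypothetical `best_epoch` below can locate the epoch with the lowest validation loss in a dict shaped like `history.history`; the numbers are made up for illustration.

```python
def best_epoch(history_dict: dict, metric: str = "val_loss") -> int:
    """Return the 1-based epoch with the lowest value of `metric`."""
    values = history_dict[metric]
    return min(range(len(values)), key=values.__getitem__) + 1

# Made-up numbers in the shape of Keras' `history.history`.
toy_history = {
    "loss": [1.60, 1.30, 1.10, 0.95],
    "val_loss": [1.50, 1.35, 1.38, 1.45],
}
print(best_epoch(toy_history))  # 2
```

In this toy run the training loss keeps falling while the validation loss bottoms out at epoch 2, the usual sign of overfitting.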

## Do prediction

In [None]:
## Predict
pred_result = model.predict(X_test, batch_size=128)

# The test set has no labels, so test accuracy cannot be computed here.
# Take the most probable class index for each tweet.
pred_labels = np.argmax(pred_result, axis=1)

# Decode the class indices to their string labels
pred_emotions = label_encoder.inverse_transform(pred_labels)

# Write the predictions into test_df
test_df['emotion'] = pred_emotions

# Prepare the submission file, renaming `tweet_id` to `id` to match the submission format
submission_df = test_df[['tweet_id', 'emotion']].rename(columns={'tweet_id': 'id'})
submission_file = "submission.csv"

# Save to CSV
submission_df.to_csv(submission_file, index=False)

print(f"Submission file created: {submission_file}")
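
Before uploading, it can be worth checking the submission frame for the most common mistakes. The helper below is a hypothetical sketch (the name `check_submission` and the toy data are invented); it checks the column names, duplicate IDs, missing predictions, and labels outside an allowed set.

```python
import pandas as pd

def check_submission(df: pd.DataFrame, allowed_labels: set) -> list:
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if list(df.columns) != ["id", "emotion"]:
        problems.append(f"unexpected columns: {list(df.columns)}")
        return problems
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    if df["emotion"].isna().any():
        problems.append("missing predictions")
    unknown = set(df["emotion"].dropna()) - allowed_labels
    if unknown:
        problems.append(f"unknown labels: {sorted(unknown)}")
    return problems

# Made-up submission for illustration.
toy = pd.DataFrame({"id": ["0x01", "0x02"], "emotion": ["joy", "anger"]})
print(check_submission(toy, {"joy", "anger", "sadness"}))  # []
```

In this notebook the call would presumably be `check_submission(submission_df, set(label_encoder.classes_))`.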