# DM2024 Lab 2 Homework: Emotion Recognition on Twitter

## Objective
The goal of this notebook is to develop a machine learning model that predicts the emotion behind tweets from the provided dataset. This involves:
- Cleaning and preprocessing the text data.
- Engineering meaningful features.
- Experimenting with different models and evaluating their performance.
- Submitting predictions in the required format for the Kaggle competition.

## Importing Required Libraries

In [7]:
import pandas as pd
import numpy as np
import nltk
import matplotlib.pyplot as plt
import seaborn as sns
import umap
import gensim
import tensorflow
import keras
import ollama
import langchain
import langchain_community
import langchain_core
import bs4
import chromadb
import gradio

%matplotlib inline

print("gensim: " + gensim.__version__)
print("tensorflow: " + tensorflow.__version__)
print("keras: " + keras.__version__)

gensim: 4.3.3
tensorflow: 2.18.0
keras: 3.6.0


## Data Loading

In [8]:
import pandas as pd

from pathlib import Path

base_path = Path(r'C:\Users\katy6\dm-2024-isa-5810-lab-2-homework (2)')
tweets_df = pd.read_json(base_path / 'tweets_DM.json', lines=True)
emotion_df = pd.read_csv(base_path / 'emotion.csv')  # emotion labels
data_identification_df = pd.read_csv(base_path / 'data_identification.csv')  # train/test assignment
sample_submission_df = pd.read_csv(base_path / 'sampleSubmission.csv')  # submission template
# Inspect the first few rows of each dataset
print("Tweets Data:")
print(tweets_df.head())

print("\nEmotion Data:")
print(emotion_df.head())

print("\nData Identification:")
print(data_identification_df.head())

print("\nSample Submission:")
print(sample_submission_df.head())

Tweets Data:
   _score          _index                                            _source  \
0     391  hashtag_tweets  {'tweet': {'hashtags': ['Snapchat'], 'tweet_id...   
1     433  hashtag_tweets  {'tweet': {'hashtags': ['freepress', 'TrumpLeg...   
2     232  hashtag_tweets  {'tweet': {'hashtags': ['bibleverse'], 'tweet_...   
3     376  hashtag_tweets  {'tweet': {'hashtags': [], 'tweet_id': '0x1cd5...   
4     989  hashtag_tweets  {'tweet': {'hashtags': [], 'tweet_id': '0x2de2...   

            _crawldate   _type  
0  2015-05-23 11:42:47  tweets  
1  2016-01-28 04:52:09  tweets  
2  2017-12-25 04:39:20  tweets  
3  2016-01-24 23:53:05  tweets  
4  2016-01-08 17:18:59  tweets  

Emotion Data:
   tweet_id       emotion
0  0x3140b1       sadness
1  0x368b73       disgust
2  0x296183  anticipation
3  0x2bd6e1           joy
4  0x2ee1dd  anticipation

Data Identification:
   tweet_id identification
0  0x28cc61           test
1  0x29e452          train
2  0x2b3819          train
3  0x2d

Purpose: The `clean_text` function in the next cell cleans raw tweet text by:
- Removing URLs, mentions, hashtags (including the tagged word), and punctuation.
- Converting the text to lowercase.
- Tokenizing the text into individual words.
- Removing English stopwords to focus on meaningful words.
- Lemmatizing the remaining words (reducing them to their base form).

In [None]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk

# Make sure the required NLTK resources are downloaded
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

# Load the data
tweets_df = pd.read_json(r'C:\Users\katy6\dm-2024-isa-5810-lab-2-homework (2)\tweets_DM.json', lines=True)
emotion_df = pd.read_csv(r'C:\Users\katy6\dm-2024-isa-5810-lab-2-homework (2)\emotion.csv')  # emotion labels
data_identification_df = pd.read_csv(r'C:\Users\katy6\dm-2024-isa-5810-lab-2-homework (2)\data_identification.csv')  # train/test assignment

# Extract tweet_id and text from each tweet
tweets_df['tweet_id'] = tweets_df['_source'].apply(lambda x: x['tweet']['tweet_id'])
tweets_df['text'] = tweets_df['_source'].apply(lambda x: x['tweet']['text'])

# Merge the datasets
merged_df = data_identification_df.merge(tweets_df[['tweet_id', 'text']], on='tweet_id', how='left')
merged_df = merged_df.merge(emotion_df, on='tweet_id', how='left')

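# Build the stopword set and the lemmatizer once, not on every token or call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Define the text-cleaning function
def clean_text(text):
    text = re.sub(r"http\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)  # remove mentions
    text = re.sub(r"#\w+", "", text)  # remove hashtags (including the tagged word)
    text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    text = text.lower()  # convert to lowercase
    tokens = word_tokenize(text)
    # Drop stopwords, then lemmatize the remaining tokens
    tokens = [LEMMATIZER.lemmatize(word) for word in tokens if word not in STOP_WORDS]
    return " ".join(tokens)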

# Apply the cleaning function
merged_df['cleaned_text'] = merged_df['text'].apply(clean_text)

# Inspect the cleaned result
print(merged_df[['text', 'cleaned_text']].head())

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\katy6\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\katy6\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\katy6\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [18]:
# Check the structure of _source
print(tweets_df['_source'].iloc[0])  # show the _source structure of the first row

{'tweet': {'hashtags': ['Snapchat'], 'tweet_id': '0x376b20', 'text': 'People who post "add me on #Snapchat" must be dehydrated. Cuz man.... that\'s <LH>'}}


In [20]:
tweets_df['tweet_id'] = tweets_df['_source'].apply(lambda x: x.get('tweet', {}).get('tweet_id', ''))
tweets_df['text'] = tweets_df['_source'].apply(lambda x: x.get('tweet', {}).get('text', ''))
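The two `apply` calls above walk each `_source` dictionary by hand. As an alternative, here is a minimal sketch that flattens the nested records in one pass with `pd.json_normalize`. The first record is copied from the `_source` output shown earlier; the second record and the name `flat` are hypothetical and exist only for illustration.

```python
import pandas as pd

# First record copied from the `_source` output above; the second is hypothetical.
records = [
    {'tweet': {'hashtags': ['Snapchat'], 'tweet_id': '0x376b20',
               'text': 'People who post "add me on #Snapchat" must be dehydrated. Cuz man.... that\'s <LH>'}},
    {'tweet': {'hashtags': [], 'tweet_id': '0x000000', 'text': 'placeholder tweet'}},
]

# json_normalize expands nested keys into dotted column names such as 'tweet.text'.
flat = pd.json_normalize(records)

# Strip the leading 'tweet.' prefix so the columns match the names used elsewhere.
flat = flat.rename(columns=lambda name: name.replace('tweet.', '', 1))

print(flat[['tweet_id', 'text']])
```

In the notebook this would presumably be called as `pd.json_normalize(tweets_df['_source'].tolist())`. Unlike the `.get(..., '')` version, keys that are missing from a record come back as `NaN` rather than as empty strings.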

In [21]:
test_text = "Check out this amazing #tutorial on @Twitter: https://example.com!"
cleaned = clean_text(test_text)
print("Original:", test_text)
print("Cleaned:", cleaned)

Original: Check out this amazing #tutorial on @Twitter: https://example.com!
Cleaned: check amazing
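The test output shows that the hashtag word `tutorial` disappears together with the `#` sign. If hashtag words turn out to carry emotional signal, one possible variant keeps the word and drops only the symbol. The sketch below covers just the regex steps (no tokenizing, stopword removal, or lemmatizing); the function name `strip_tweet_markup` and its `keep_hashtag_words` flag are hypothetical and not part of the notebook.

```python
import re

def strip_tweet_markup(text, keep_hashtag_words=True):
    """Regex-only cleaning: URLs, mentions, hashtags, punctuation, lowercase."""
    text = re.sub(r"http\S+", "", text)        # remove URLs
    text = re.sub(r"@\w+", "", text)           # remove mentions
    if keep_hashtag_words:
        text = re.sub(r"#(\w+)", r"\1", text)  # drop '#', keep the tagged word
    else:
        text = re.sub(r"#\w+", "", text)       # drop the whole hashtag
    text = re.sub(r"[^\w\s]", "", text)        # remove punctuation
    return " ".join(text.lower().split())      # lowercase, collapse whitespace

sample = "Check out this amazing #tutorial on @Twitter: https://example.com!"
kept = strip_tweet_markup(sample)
dropped = strip_tweet_markup(sample, keep_hashtag_words=False)
print(kept)     # check out this amazing tutorial on
print(dropped)  # check out this amazing on
```

Whether keeping hashtag words helps is an empirical question; comparing both variants on a held-out split would be the way to find out.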


In [22]:
print(merged_df[['text', 'cleaned_text']].head())

                                                text  \
0  @Habbo I've seen two separate colours of the e...   
1  Huge Respect🖒 @JohnnyVegasReal talking about l...   
2  Yoooo we hit all our monthly goals with the ne...   
3  @FoxNews @KellyannePolls No serious self respe...   
4  @KIDSNTS @PICU_BCH @uhbcomms @BWCHBoss Well do...   

                                        cleaned_text  
0  ive seen two separate colour elegant furni hom...  
1  huge respect talking losing dad cancerif dont ...  
2         yoooo hit monthly goal new app two week lh  
3  serious self respecting individual belief much...  
4                        well done team lh every one  


## Split Data into Training and Testing Sets and Vectorize with TF-IDF

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Separate the training and test data
train_data = merged_df[merged_df['identification'] == 'train']
test_data = merged_df[merged_df['identification'] == 'test']

# TF-IDF vectorization (fit on the training text only, then reuse for the test text)
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_data['cleaned_text'])
X_test = vectorizer.transform(test_data['cleaned_text'])

# Extract the emotion labels of the training data
y_train = train_data['emotion']
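To make the weighting step less opaque, the following pure-Python sketch reproduces the formula that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is an illustration only. It splits on whitespace instead of using scikit-learn's tokenizer, ignores `max_features`, and the helper name `tfidf_rows` and the three toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, as used by TfidfVectorizer's defaults.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows

# Toy documents (assumption), loosely modeled on the cleaned tweets above.
docs = ["well done team", "huge respect team", "team hit monthly goal"]
rows = tfidf_rows(docs)
for doc, row in zip(docs, rows):
    print(doc, {term: round(weight, 3) for term, weight in row.items()})
```

In the first document, `team` appears in every document and so receives a smaller weight than `well` or `done`, which is the behavior the vectorizer relies on to downplay common words.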

In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Train the model
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

# Evaluate the model (on the training data itself)
y_pred_train = model.predict(X_train)
print(classification_report(y_train, y_pred_train))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


              precision    recall  f1-score   support

       anger       0.57      0.12      0.20     39867
anticipation       0.54      0.44      0.48    248935
     disgust       0.38      0.25      0.30    139101
        fear       0.63      0.24      0.34     63999
         joy       0.47      0.81      0.60    516017
     sadness       0.40      0.34      0.37    193437
    surprise       0.67      0.11      0.19     48729
       trust       0.50      0.18      0.26    205478

    accuracy                           0.47   1455563
   macro avg       0.52      0.31      0.34   1455563
weighted avg       0.48      0.47      0.44   1455563
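The two summary rows differ because the classes are imbalanced: the macro average counts all eight emotions equally, while the weighted average weights each emotion by its support. The arithmetic can be checked directly from the per-class rows. The inputs below are the rounded F1 values printed in the report, so the results are approximate.

```python
# (f1-score, support) per class, copied from the classification report above.
report = {
    "anger": (0.20, 39867),
    "anticipation": (0.48, 248935),
    "disgust": (0.30, 139101),
    "fear": (0.34, 63999),
    "joy": (0.60, 516017),
    "sadness": (0.37, 193437),
    "surprise": (0.19, 48729),
    "trust": (0.26, 205478),
}

total_support = sum(support for _, support in report.values())

# Macro average: plain mean of the per-class F1 scores.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

print(total_support)          # 1455563
print(f"{macro_f1:.2f}")      # 0.34
print(f"{weighted_f1:.2f}")   # 0.44
```

The weighted figure is higher because `joy`, by far the largest class, is also the class with the best F1 score; the macro figure exposes how weak the model is on small classes such as `anger` and `surprise`.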



In [27]:
test_data = test_data.copy()
test_data['emotion'] = model.predict(X_test)
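The classification report above is computed on the same tweets the model was trained on, so it likely overstates how well the model will do on unseen data. A minimal sketch of a held-out check follows, assuming a stratified 80/20 split and macro F1 as the metric. The toy texts and labels stand in for `train_data['cleaned_text']` and `train_data['emotion']`, and every variable name in the block is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Toy stand-ins (assumption) for the cleaned training text and its labels.
joy_texts = ["happy day", "love great news", "so happy today", "great fun party",
             "love sunny morning", "happy smile friend", "great win tonight",
             "fun trip love", "joyful happy song", "great happy moment"]
sad_texts = ["sad lonely night", "cry miss you", "so sad today", "lost lonely heart",
             "miss old friend", "sad rainy morning", "cry alone tonight",
             "lonely sad song", "heartbroken miss home", "sad lost moment"]
texts = joy_texts + sad_texts
labels = ["joy"] * len(joy_texts) + ["sadness"] * len(sad_texts)

# Stratify so that both splits keep the same class proportions.
train_texts, val_texts, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

# Fit the vectorizer on the training split only to avoid leaking validation text.
val_vectorizer = TfidfVectorizer(max_features=5000)
X_tr = val_vectorizer.fit_transform(train_texts)
X_val = val_vectorizer.transform(val_texts)

val_model = LogisticRegression(max_iter=1000)
val_model.fit(X_tr, y_tr)

val_f1 = f1_score(y_val, val_model.predict(X_val), average="macro")
print(f"Validation macro F1: {val_f1:.2f}")
```

With the real data, the same pattern would give a score that is more comparable to the Kaggle leaderboard than the training-set report.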

In [28]:
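# Predict emotions for the test set
test_data['emotion'] = model.predict(X_test)

# Build the submission file with the column names used here ('id', 'emotion')
submission = test_data[['tweet_id', 'emotion']].rename(columns={'tweet_id': 'id'})
submission.to_csv('submission.csv', index=False)

print("Submission file generated successfully: submission.csv")

Submission file generated successfully: submission.csv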
