### Student Information
Name: Fan Kai Jie

Student ID: X1120029

GitHub ID: FanKJ13

Kaggle name: Fan Kai Jie

Kaggle private scoreboard snapshot:

![Snapshot](img/pic0.png)

---

### Instructions

1. First: __This part is worth 30% of your grade.__ Do the **take home** exercises in the DM2023-Lab2-master. You may need to copy some cells from the Lab notebook to this notebook. 


2. Second: __This part is worth 30% of your grade.__ Participate in the in-class [Kaggle Competition](https://www.kaggle.com/t/09b1d0f3f8584d06848252277cb535f2) on Emotion Recognition on Twitter. Your score is determined by your place in the Private Leaderboard ranking: 
    - **Bottom 40%**: Get 20% of the 30% available for this section.

    - **Top 41% - 100%**: Get (60-x)/6 + 20 points, where x is your ranking in the leaderboard (e.g., if you rank 3rd, your score will be (60-3)/6 + 20 = 29.5% out of 30%).   
    Submit your last submission __BEFORE the deadline (Dec. 27th 11:59 pm, Wednesday)__. Make sure to take a screenshot of your position at the end of the competition, store it as `pic0.png` under the **img** folder of this repository, and rerun the cell **Student Information**.
    

3. Third: __This part is worth 30% of your grade.__ A report of your work developing the model for the competition (you can use code and comment it). This report should include your preprocessing steps, your feature engineering steps, and an explanation of your model. You can also mention different things you tried and insights you gained. 


4. Fourth: __This part is worth 10% of your grade.__ It's hard for us to follow if your code is messy :'(, so please **tidy up your notebook** and **add minimal comments where needed**.


Upload your files to your repository then submit the link to it on the corresponding e-learn assignment.

Make sure to commit and save your changes to your repository __BEFORE the deadline (Dec. 31st 11:59 pm, Sunday)__. 

In [1]:
### Begin Assignment Here

In [2]:
# Import packages
import csv
import json
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to C:\Users\Fan Kai
[nltk_data]     Jie\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Fan Kai
[nltk_data]     Jie\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
# Load tweets data
data = []
with open('tweets_DM.json', 'r') as f:
    for i in f:
        data.append(json.loads(i))

# Flatten json into dataframe
df = pd.json_normalize(data)
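
To illustrate what the flattening step produces, here is a minimal sketch on two made-up records. The record shape is an assumption inferred from the column names used later in this notebook (`_source.tweet.tweet_id`, `_source.tweet.text`); the values are hypothetical.

```python
import json
import pandas as pd

# Two hypothetical JSON lines shaped like nested tweet records
# (field values are made up for illustration)
lines = [
    '{"_source": {"tweet": {"tweet_id": "0x1", "text": "hello #world"}}}',
    '{"_source": {"tweet": {"tweet_id": "0x2", "text": "good morning"}}}',
]

# Parse one JSON object per line
records = [json.loads(line) for line in lines]

# json_normalize joins nested keys with "." to form flat column names
flat = pd.json_normalize(records)
print(list(flat.columns))
```

The dotted column names are why the `tweet_id` column has to be renamed before it can be used as a merge key.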

In [None]:
# Rename _source.tweet.tweet_id to tweet_id for merging with emotion and identification data later
df.rename(columns={'_source.tweet.tweet_id' : 'tweet_id'}, inplace=True)

In [None]:
# Load emotion and identification data
emotion = pd.read_csv('emotion.csv')
identification = pd.read_csv('data_identification.csv')

In [None]:
# Visualise emotion data
emotion

In [None]:
# Visualise identification data
identification

In [None]:
# Visualise dataframe
df

In [None]:
# Merge df and identification data first as they have the same number of rows
overall = pd.merge(df, identification, on='tweet_id')

In [None]:
# Split overall dataframe into train and test dataframes
train = overall[overall['identification'] == 'train']
test = overall[overall['identification'] == 'test']

In [None]:
# Visualise train dataframe
train

In [None]:
# Merge train dataframe with emotion dataframe only as emotion dataframe only consists of training labels
train = pd.merge(train, emotion, on='tweet_id')
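
Since an inner merge silently drops rows whose key has no match, it can be worth checking the result. The sketch below uses two hypothetical miniature tables (not the real data) and the `validate` argument of `pd.merge` to guard against duplicated keys.

```python
import pandas as pd

# Hypothetical miniature versions of the train features and the label table
features = pd.DataFrame({"tweet_id": ["a", "b", "c"], "text": ["t1", "t2", "t3"]})
labels = pd.DataFrame({"tweet_id": ["c", "a", "b"], "emotion": ["joy", "anger", "sadness"]})

# validate="one_to_one" raises MergeError if a tweet_id is duplicated on either side
merged = pd.merge(features, labels, on="tweet_id", how="inner", validate="one_to_one")

# An inner merge drops unmatched ids without warning, so compare row counts afterwards
assert len(merged) == len(features)
print(merged)
```

The merged rows follow the key order of the left table, so the labels stay aligned with the features.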

In [None]:
# Split the labels column out after merging, so that the index is aligned with the train dataframe
y_train = train.pop('emotion').to_frame()

In [None]:
# Visualise y_train dataframe
y_train

In [None]:
# Visualise train dataframe to confirm that it is only left with the features
train

In [None]:
# Visualise test dataframe after merging
test

In [None]:
# Check if there are any missing values in train dataframe
train.isna().sum()

# Conclusion: There are no missing values in train dataframe

In [None]:
# Check if there are any missing values in test dataframe
test.isna().sum()

# Conclusion: There are no missing values in test dataframe


In [None]:
# Pre-process tweets by removing irrelevant characters and converting everything to lowercase
def preprocess_tweet(tweet):
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE) # To remove URLs
    tweet = re.sub(r'@\w+', '', tweet) # To remove mentions
    tweet = re.sub(r'#', '', tweet) # To remove the '#' symbol (the hashtag word itself is kept)
    tweet = re.sub(r'\d+', '', tweet) # To remove numbers
    tweet = tweet.lower() # Convert to lowercase
    tweet = re.sub(r'\s+', ' ', tweet).strip() # To remove extra whitespace
    return tweet
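
To see what this kind of cleaning does end to end, here is a minimal sketch applied to a made-up tweet. The helper name `clean_tweet` and the sample text are hypothetical; the mention pattern used here is `@\w+`, which matches an `@` followed by word characters.

```python
import re

def clean_tweet(tweet: str) -> str:
    """Hypothetical helper applying URL, mention, '#', digit and whitespace cleaning."""
    tweet = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # drop URLs
    tweet = re.sub(r"@\w+", "", tweet)                   # drop @mentions
    tweet = tweet.replace("#", "")                       # drop '#', keep the hashtag word
    tweet = re.sub(r"\d+", "", tweet)                    # drop digits
    tweet = tweet.lower()                                # normalise case
    return re.sub(r"\s+", " ", tweet).strip()            # collapse whitespace

sample = "@Alice I LOVE this!! #happy https://t.co/abc 123"
print(clean_tweet(sample))  # i love this!! happy
```

Trying the function on a handful of hand-written tweets like this is a quick way to confirm that each pattern removes what its comment claims.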

In [None]:
# Pre-process tweets in train dataframe
train_processed = train['_source.tweet.text'].apply(preprocess_tweet).tolist()

In [None]:
# Visualise train_processed (a list of cleaned tweet strings)
train_processed

In [None]:
# As with the train dataframe, pre-process the tweets in the test dataframe
test_processed = test['_source.tweet.text'].apply(preprocess_tweet).tolist()

In [None]:
# Visualise test_processed (a list of cleaned tweet strings)
test_processed

In [None]:
# Initialise the vectorizer
vectorizer = TfidfVectorizer(max_features=1000, stop_words=stopwords.words('english'))

In [None]:
# Fit and transform train_processed dataframe
X = vectorizer.fit_transform(train_processed)

In [None]:
# Split the TF-IDF features and labels into train and validation sets
# (pass the 'emotion' column as a 1-D Series, which is the label shape scikit-learn expects)
X_train, X_val, y_train, y_val = train_test_split(X, y_train['emotion'], test_size=0.2, random_state=42)
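
If the emotion classes turn out to be imbalanced (not verified here), a stratified split keeps the class proportions the same in both parts. The sketch below uses hypothetical toy data and the `stratify` argument of `train_test_split`; it is an optional variation, not what the cell above does.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 "joy" rows and 2 "anger" rows
X_toy = [[i] for i in range(10)]
y_toy = ["joy"] * 8 + ["anger"] * 2

# stratify=y_toy preserves the 4:1 class ratio in both halves
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=42, stratify=y_toy
)
print(Counter(y_tr), Counter(y_va))
```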

In [None]:
# Initialise the RandomForest model 
model = RandomForestClassifier(n_estimators=100, random_state=42)

In [None]:
# Train the model 
model.fit(X_train, y_train)

In [None]:
# Make predictions with validation data first
predictions = model.predict(X_val)

In [None]:
# Evaluate the model
print(classification_report(y_val, predictions))

In [None]:
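# Transform test_processed with the vectorizer already fitted on the training tweets.
# Only transform() is used here: refitting would learn a different vocabulary,
# and the feature columns would no longer match what the model was trained on.
X_test = vectorizer.transform(test_processed)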
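
A note on the difference between `fit_transform` and `transform` for a `TfidfVectorizer`: the minimal sketch below, using made-up documents, shows that `transform` keeps the column layout learned from the training text, whereas fitting again on other text learns a different vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical miniature corpora standing in for the processed tweets
train_docs = ["happy day", "sad day"]
test_docs = ["angry night", "happy night"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs)  # learns the vocabulary from the training text only
train_vocab = dict(vec.vocabulary_)

# transform() reuses the learned vocabulary, so the columns match the training matrix
X_te = vec.transform(test_docs)
print(X_tr.shape, X_te.shape)

# Fitting on the test text learns a different vocabulary, so a given column
# index may refer to a different word than it did during training
refit_vocab = dict(TfidfVectorizer().fit(test_docs).vocabulary_)
print(train_vocab, refit_vocab)
```

Words that appear only in the test text (here "angry" and "night") are simply ignored by `transform`, which is the expected behaviour for unseen vocabulary.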

In [None]:
# Make predictions with test data
actual_pred = model.predict(X_test)

In [None]:
# Visualise actual_pred array
actual_pred

In [None]:
# Check that each prediction is a single string. When a bare string is passed to csvwriter.writerow,
# it is treated as a sequence of characters, so each letter is written into its own column.
# For example, instead of 'sadness', it wrote 's', 'a', 'd', 'n', 'e', 's', 's' into the csv file.

actual_pred[0]

In [None]:
# Separate out the tweet_id
test_id = test['tweet_id']

In [None]:
# Check if test_id is in the format I want
list(test_id)

In [None]:
# Write the predictions into the csv file first
csv_file_name = 'output.csv'

with open(csv_file_name, 'w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    
    for row in actual_pred:
        if not isinstance(row, list):
            row = [row]  # Wrap a single value in a list so that the string is not split into characters
        csvwriter.writerow(row)

In [None]:
# Set headers
new_column_data = list(test_id)
new_column_header = 'id'
existing_column_header = 'emotion'

# Read the existing data from the above CSV file
existing_data = []
with open('output.csv', 'r', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        existing_data.append(row)

# Combine the new column data with the existing data
combined_data = [[new_column_header, existing_column_header]]
for i, row in enumerate(existing_data):
    combined_data.append([new_column_data[i], row[0]])

# Write the combined data to a new CSV file
with open('updated_emotions.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(combined_data)
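
As an alternative to writing and re-reading an intermediate file, the same two-column layout can be built in one step with pandas. This is a minimal sketch with hypothetical stand-ins for `list(test_id)` and `actual_pred`; the `id` and `emotion` headers are the ones used above.

```python
import pandas as pd

# Hypothetical stand-ins for list(test_id) and actual_pred
ids = ["0x1", "0x2", "0x3"]
preds = ["joy", "sadness", "anger"]

# One row per tweet: the id column first, then the predicted emotion
submission = pd.DataFrame({"id": ids, "emotion": preds})

# to_csv returns the CSV text when no path is given;
# pass a file name instead to write it to disk
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Because each prediction is stored as one cell of a column, the character-splitting issue with `csv.writer` does not arise.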