# Emotion Prediction from Text

## Introduction

This project applies machine learning techniques to analyze emotion data and build a predictive model, and deploys a Streamlit app to test the model in a production environment.

The goal is to explore the dataset, select appropriate features, train and evaluate models, interpret the results, and practise light deployment using Streamlit and FastAPI.

The notebook documents each step, including data preparation, modeling decisions, and evaluation outcomes.

## Problem Statement

Emotion is a fundamental aspect of human communication and influences decision-making, behaviour, and social interaction. With the growing use of digital platforms, a large proportion of human expression now occurs through text, such as social media posts, reviews, messages, and online feedback. However, emotional cues that are naturally conveyed through tone, facial expression, and body language are often absent or ambiguous in text-based communication.

This creates a challenge for individuals, organisations, and systems that rely on text data to understand user intent, sentiment, and psychological state. Manually analysing emotional content at scale is time-consuming, subjective, and impractical. Predicting emotion through text using machine learning provides a scalable and consistent way to identify emotional patterns, enabling improved user understanding, mental-health monitoring, customer support, and human-computer interaction.


## Data Description

The dataset used in this study was obtained from the TweetEval benchmark available on Hugging Face. TweetEval is a unified evaluation framework for tweet classification tasks and aggregates several well-established Twitter datasets.

Specifically, this project uses the emotion recognition subset of TweetEval, which is derived from the SemEval-2018 shared task “Affect in Tweets” (Mohammad et al., 2018). The original SemEval dataset was formulated as a multi-label classification problem covering eleven emotion categories, where a single tweet could express multiple emotions simultaneously.

In TweetEval, this dataset was re-purposed into a multi-class classification task by retaining only tweets annotated with a single emotion label. Due to the limited number of single-label instances, the dataset was further restricted to the four most frequent emotions: anger, joy, sadness, and optimism. Each tweet in the dataset is therefore associated with exactly one emotion label.

The dataset consists solely of tweet text and its corresponding emotion label, making it suitable for **supervised emotion classification** based on textual content alone.
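
The single-label filtering described above can be illustrated with a small sketch. The toy annotations, the `to_single_label` helper, and the hard-coded list of kept classes are all hypothetical; this is not the procedure TweetEval actually ran, only a minimal picture of how a multi-label dataset can be reduced to a multi-class one.

```python
from collections import Counter

# Hypothetical multi-label annotations: each tweet maps to a set of emotions.
toy_annotations = [
    ("tweet a", {"anger"}),
    ("tweet b", {"joy", "optimism"}),  # more than one label, dropped
    ("tweet c", {"joy"}),
    ("tweet d", {"sadness"}),
    ("tweet e", {"fear"}),             # single label, but not a kept class
    ("tweet f", {"optimism"}),
    ("tweet g", set()),                # no label, dropped
]

# The four classes named in the text; fixed here instead of derived by frequency.
KEPT = ("anger", "joy", "optimism", "sadness")


def to_single_label(annotations, kept=KEPT):
    """Keep tweets that carry exactly one emotion and whose emotion is in `kept`."""
    result = []
    for text, emotions in annotations:
        if len(emotions) == 1:
            (emotion,) = emotions
            if emotion in kept:
                result.append((text, emotion))
    return result


single_label = to_single_label(toy_annotations)
label_counts = Counter(label for _, label in single_label)
```

After filtering, every remaining tweet has exactly one label, which is what allows a standard multi-class classifier to be trained on the data.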


## Assumptions and Scope


### Environment and Imports

In [40]:
import pandas as pd
import numpy as np
import requests
import matplotlib.pyplot as plt
import seaborn as sns

### Data Loading

In [89]:
# Load the first page of the training split from the Hugging Face datasets-server API
url = "https://datasets-server.huggingface.co/rows"

params = {
    "dataset": "cardiffnlp/tweet_eval",
    "config": "emotion",
    "split": "train",
    "offset": 0,
    "length": 100,
}

data = None  # stays None if the request fails
try:
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    data = response.json()
    # Optional debugging output (only valid once the response exists)
    # print("Status:", response.status_code)
    # print("Content type:", response.headers.get("Content-Type"))
except requests.exceptions.Timeout:
    print("Request timed out")
except requests.exceptions.RequestException as e:
    print("Request failed:", e)


In [80]:
# Function to download every row of a split from the API

BASE_URL = "https://datasets-server.huggingface.co/rows"

def download_split(dataset, config, split, page_size=100):
    """Page through the rows endpoint and return all rows as a DataFrame."""
    offset = 0
    records = []

    while True:
        params = {
            "dataset": dataset,
            "config": config,
            "split": split,
            "offset": offset,
            "length": page_size,
        }

        # A timeout prevents the loop from hanging on a stalled connection
        resp = requests.get(BASE_URL, params=params, timeout=30)
        data = resp.json()

        # Stop when there are no more rows.
        # A response without a "rows" key (e.g. an error body) also ends the loop.
        rows = data.get("rows", [])
        if not rows:
            break

        # Keep only the actual row content
        records.extend(r["row"] for r in rows)

        offset += page_size

    return pd.DataFrame(records)
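
The offset-based paging used by `download_split` can be exercised without network access by separating the loop from the HTTP call. The sketch below is an illustration under assumptions: `paginate`, `make_fake_fetcher`, and the `fetch_page(offset, length)` callable are hypothetical names, and the fake payload only mimics the `{"rows": [{"row": ...}]}` shape that the function above reads.

```python
def paginate(fetch_page, page_size=100):
    """Collect rows from a paged source until an empty page is returned.

    `fetch_page(offset, length)` is a hypothetical callable returning a dict
    shaped like {"rows": [{"row": {...}}, ...]}.
    """
    offset = 0
    records = []
    while True:
        payload = fetch_page(offset, page_size)
        rows = payload.get("rows", [])
        if not rows:
            break
        records.extend(item["row"] for item in rows)
        offset += page_size
    return records


def make_fake_fetcher(total_rows):
    """Build an in-memory stand-in for the HTTP request (for testing only)."""
    def fetch_page(offset, length):
        stop = min(offset + length, total_rows)
        return {
            "rows": [
                {"row": {"text": f"tweet {i}", "label": i % 4}}
                for i in range(offset, stop)
            ]
        }
    return fetch_page


# 250 fake rows are served as pages of 100, 100, 50, then an empty page.
records = paginate(make_fake_fetcher(total_rows=250), page_size=100)
```

Keeping the fetch step injectable makes it possible to check the stopping condition and the handling of a short final page before running the real download.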


In [85]:
## Call function to extract all training data

tweet_df = download_split("cardiffnlp/tweet_eval", "emotion", "train")

In [82]:
tweet_df.head()

| | text | label |
|---|---|---|
| 0 | “Worry is a down payment on a problem you may ... | 2 |
| 1 | My roommate: it's okay that we can't spell bec... | 0 |
| 2 | No but that's so cute. Atsu was probably shy a... | 1 |
| 3 | Rooneys fucking untouchable isn't he? Been fuc... | 0 |
| 4 | it's pretty depressing when u hit pan on ur fa... | 3 |


In [87]:
# Map integer labels to emotion names using the label metadata in the API response

label_names = response.json()["features"][1]["type"]["names"]

tweet_df["emotion"] = tweet_df["label"].map(lambda x: label_names[x])

tweet_df.head(5)


| | text | label | emotion |
|---|---|---|---|
| 0 | “Worry is a down payment on a problem you may ... | 2 | optimism |
| 1 | My roommate: it's okay that we can't spell bec... | 0 | anger |
| 2 | No but that's so cute. Atsu was probably shy a... | 1 | joy |
| 3 | Rooneys fucking untouchable isn't he? Been fuc... | 0 | anger |
| 4 | it's pretty depressing when u hit pan on ur fa... | 3 | sadness |
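
The mapping above reads the label names by position (`["features"][1]`). A slightly more defensive variant looks the feature up by column name, so it would not depend on column order. This is a sketch under assumptions: `get_label_names` is a hypothetical helper, and `sample_payload` is a hand-written stand-in modeled on the structure the notebook indexes into; the `"name"` key in particular is assumed, not shown in the notebook.

```python
# Hand-written stand-in for the API response; only the parts used here are included.
sample_payload = {
    "features": [
        {"name": "text", "type": {"dtype": "string"}},
        {"name": "label", "type": {"names": ["anger", "joy", "optimism", "sadness"]}},
    ]
}


def get_label_names(payload, column="label"):
    """Return the class names of `column`, looked up by name rather than position."""
    for feature in payload.get("features", []):
        if feature.get("name") == column:
            names = feature.get("type", {}).get("names")
            if names is None:
                raise ValueError(f"Column {column!r} has no class names")
            return list(names)
    raise KeyError(f"Column {column!r} not found in payload features")


label_names = get_label_names(sample_payload)

# Integer id -> emotion name, matching the label/emotion pairs in the table above.
id_to_emotion = dict(enumerate(label_names))
```

With a dictionary like `id_to_emotion`, the DataFrame column could be filled with `tweet_df["label"].map(id_to_emotion)` instead of a lambda.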


In [96]:
# Save the dataset to CSV
try:
    tweet_df.to_csv("/Users/soliufatai/Documents/Personal Documents/Data Science_ML_AI_Krish Naik/Complete-Data-Science-With-Machine-Learning-And-NLP-2024-main/2-Introduction/Intro/emotion-text-ml/data/tweet_eval_train.csv", index=False)
    print("Dataframe successfully saved")
except OSError as e:  # IOError is an alias of OSError in Python 3
    print(f"Error saving dataframe: {e}")

Dataframe successfully saved
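
The save step above uses an absolute path that only exists on one machine. One possible alternative, sketched here under assumptions, builds the output path relative to a project root with `pathlib` and creates the `data` directory if it is missing. `build_output_path` is a hypothetical helper, and the temporary directory merely stands in for the project root.

```python
import tempfile
from pathlib import Path


def build_output_path(project_root, filename="tweet_eval_train.csv"):
    """Return <project_root>/data/<filename>, creating the data directory if needed."""
    data_dir = Path(project_root) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir / filename


# Demonstration in a throwaway directory so nothing is written to the real project.
with tempfile.TemporaryDirectory() as tmp:
    output_path = build_output_path(tmp)
    parent_exists = output_path.parent.is_dir()
    relative = output_path.relative_to(tmp).as_posix()
```

In the notebook this could be used as `tweet_df.to_csv(build_output_path(Path.cwd()), index=False)`, assuming the notebook is run from the project root.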


In [90]:
tweet_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3257 entries, 0 to 3256
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   text     3257 non-null   object
 1   label    3257 non-null   int64 
 2   emotion  3257 non-null   object
dtypes: int64(1), object(2)
memory usage: 76.5+ KB
