# 0. Import the data from Hugging Face

In [5]:
#!pip install datasets

In [6]:
from datasets import load_dataset
dataset = load_dataset("dair-ai/emotion")

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.


In [7]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

In [8]:
import numpy as np
import matplotlib.pyplot as plt
import json
import pandas as pd

In [9]:
dataset.data

{'train': MemoryMappedTable
 text: string
 label: int64
 ----
 text: [["i didnt feel humiliated","i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake","im grabbing a minute to post i feel greedy wrong","i am ever feeling nostalgic about the fireplace i will know that it is still on the property","i am feeling grouchy",...,"i should have been depressed but i was actually feeling inspired","i feel like not enough people my age actually think that most are pretty devastated that their s have come and gone","i get home i laze around in my pajamas feeling grouchy","i am feeling pretty homesick this weekend","i started out feeling really optimistic and driven for this paper coz it was gonna teach me the meaning and ways of being a leader"],["i need to do the best i possibly can do and even when i get out at i feel too listless to study like right now","i drove us to the car parts place and terry feels like im safe to drive again so yip

In [10]:
dataset.column_names

{'train': ['text', 'label'],
 'validation': ['text', 'label'],
 'test': ['text', 'label']}

In [11]:
dataset['train']

Dataset({
    features: ['text', 'label'],
    num_rows: 16000
})

### Note: sadness (0), joy (1), love (2), anger (3), fear (4), surprise (5)
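The integer labels are easier to read once they are mapped back to emotion names. A minimal sketch, using only the mapping stated in the note above; the names `ID2LABEL`, `LABEL2ID`, and `decode_labels` are hypothetical helpers, not part of the `datasets` library.

```python
# Label ids and names, copied from the note above.
ID2LABEL = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}
LABEL2ID = {name: idx for idx, name in ID2LABEL.items()}


def decode_labels(label_ids):
    """Turn a sequence of integer labels into emotion names."""
    return [ID2LABEL[i] for i in label_ids]


print(decode_labels([0, 0, 3, 2]))  # ['sadness', 'sadness', 'anger', 'love']
```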

# 0. EDA

In [12]:
# It’s easy to get high accuracy if one class is very common (just label everything as that class)
# In this case, precision and recall can be more useful
pd.Series(dataset['train'][:]['label']).value_counts()

1    5362
0    4666
3    2159
4    1937
2    1304
5     572
Name: count, dtype: int64
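To make the comment about accuracy concrete, the sketch below computes the accuracy of a trivial classifier that always predicts the most frequent class. The counts are copied from the output above; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above (training split).
train_counts = {1: 5362, 0: 4666, 3: 2159, 4: 1937, 2: 1304, 5: 572}

total = sum(train_counts.values())
majority_label = max(train_counts, key=train_counts.get)

# Accuracy obtained by labelling every training item with the majority class.
baseline_accuracy = train_counts[majority_label] / total

print(total, majority_label, round(baseline_accuracy, 4))  # 16000 1 0.3351
```

Such a classifier is right about a third of the time while having zero recall on the other five classes, which is why per-class metrics are worth reporting here.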

## Which Evaluation Metrics to use:
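Because the classes are imbalanced, accuracy alone can be misleading. Precision and recall are more informative, and with more than two labels they are computed per class and then averaged: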

Macro-average (compute P/R per class, then average over classes):
Useful if performance on all classes is equally important.

Micro-average (pool the counts over all items, then compute P/R once):
Useful if performance on all items is equally important.

The F1 score, the harmonic mean of precision and recall, can be a good metric too.
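The difference between the two averages shows up clearly on a small imbalanced example. The following is a minimal pure-Python sketch, not the notebook's evaluation code; `precision_recall_report` is a hypothetical helper and the toy labels are invented.

```python
from collections import Counter


def precision_recall_report(y_true, y_pred):
    """Return macro- and micro-averaged precision and recall for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # an item of class t was missed

    def safe_div(a, b):
        return a / b if b else 0.0  # convention: undefined ratios count as 0

    per_class_p = [safe_div(tp[c], tp[c] + fp[c]) for c in labels]
    per_class_r = [safe_div(tp[c], tp[c] + fn[c]) for c in labels]
    total_tp = sum(tp.values())
    return {
        "macro_precision": sum(per_class_p) / len(labels),
        "macro_recall": sum(per_class_r) / len(labels),
        "micro_precision": safe_div(total_tp, total_tp + sum(fp.values())),
        "micro_recall": safe_div(total_tp, total_tp + sum(fn.values())),
    }


# Toy data: class 0 is common, class 1 is rare and always missed.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
report = precision_recall_report(y_true, y_pred)
print(report)
```

Here the micro-averages are both 0.8 (the same as accuracy, which always holds for single-label classification), while macro-recall drops to 0.5 because the missed rare class counts as much as the common one.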

## Basic Info of Logistic classifier

Logistic regression is a probabilistic classifier and a discriminative model. It returns the most likely class for an input $x$:

$y^* = \text{argmax}_y P(Y = y \,|\, X = x)$

It models only the probability of the class given the input, not the distribution of the raw data itself.
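For more than two classes, logistic regression commonly obtains $P(Y = y \,|\, X = x)$ by applying a softmax to one linear score per class. The sketch below illustrates the argmax rule under that assumption; the weights, biases, and feature vector are invented for illustration and are not fitted to this dataset.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(weights, biases, x):
    """Return (most likely class, class probabilities) for feature vector x."""
    scores = [
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    ]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return best, probs


# Hypothetical 3-class model over 2 features (values invented for illustration).
weights = [[1.0, -1.0], [0.0, 2.0], [-1.0, 0.5]]
biases = [0.0, 0.0, 0.0]
y_star, probs = predict(weights, biases, x=[0.5, 1.0])
print(y_star, probs)
```

The scores for the three classes are -0.5, 2.0, and 0.0, so class 1 is returned.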

Naive Bayes, by contrast, is a generative model: it models the joint distribution of the class and the data:

$y^* = \text{argmax}_y P(y \,|\, x) = \text{argmax}_y \left( P(x \,|\, y) P(y) \right)$

-> Generative story: first sample a label $y$ from $P(y)$, then sample an item $x$ from $P(x \,|\, y)$.
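The two-step generative story and the resulting decision rule can be sketched with toy distributions. Everything below is hypothetical: the probabilities are invented, and each item $x$ is a single word, so the naive Bayes factorisation over several features is not needed.

```python
import random

# Hypothetical toy distributions, invented for illustration only.
P_Y = {"joy": 0.6, "sadness": 0.4}  # prior P(y)
P_X_GIVEN_Y = {  # class-conditional P(x | y) over single words
    "joy": {"happy": 0.7, "tired": 0.2, "alone": 0.1},
    "sadness": {"happy": 0.1, "tired": 0.4, "alone": 0.5},
}


def sample(dist, rng):
    """Draw one key from a {outcome: probability} dict."""
    outcomes = list(dist)
    return rng.choices(outcomes, weights=[dist[o] for o in outcomes], k=1)[0]


def generate(rng):
    """Generative story: pick a label with P(y), then an item with P(x | y)."""
    y = sample(P_Y, rng)
    x = sample(P_X_GIVEN_Y[y], rng)
    return x, y


def classify(x):
    """argmax_y P(x | y) P(y), which selects the same y as argmax_y P(y | x)."""
    return max(P_Y, key=lambda y: P_X_GIVEN_Y[y][x] * P_Y[y])


rng = random.Random(0)
pairs = [generate(rng) for _ in range(5)]
print(pairs)
print(classify("happy"), classify("alone"))  # joy sadness
```

Dividing by $P(x)$ is unnecessary for classification because it is the same for every candidate label.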

Discriminative models draw boundaries in the data space, while generative models try to model how data is placed throughout the space. A generative model explains how the data was generated, while a discriminative model focuses on predicting the labels of the data.

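A discriminative linear classifier (logistic regression) needs $O(n)$ training examples to approach its asymptotic error, where $n$ is the number of features (the VC dimension of a linear classifier grows linearly in $n$). In contrast, a generative linear classifier (naive Bayes) needs only $O(\log{n})$ examples, although its asymptotic error is typically higher.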


-> *Andrew Y. Ng and Michael I. Jordan. 2001. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic (NIPS'01). MIT Press, Cambridge, MA, USA, 841–848.*

# 1. Train, validation, and test sets for the raw data

In [13]:
dataset['train'][0:4]

{'text': ['i didnt feel humiliated',
  'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
  'im grabbing a minute to post i feel greedy wrong',
  'i am ever feeling nostalgic about the fireplace i will know that it is still on the property'],
 'label': [0, 0, 3, 2]}

In [14]:
# X_train stores only the text of the training set
X_train = dataset['train'][:]['text'].copy()

In [15]:
y_train = dataset['train'][:]['label'].copy()

In [16]:
X_val = dataset['validation'][:]['text'].copy()

In [17]:
y_val = dataset['validation'][:]['label'].copy()

In [18]:
X_test = dataset['test'][:]['text'].copy()

In [19]:
y_test = dataset['test'][:]['label'].copy()

In [20]:
X_train[0:4]

['i didnt feel humiliated',
 'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
 'im grabbing a minute to post i feel greedy wrong',
 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property']

In [21]:
y_train[0:4]

[0, 0, 3, 2]

In [22]:
X_val[0:4]

['im feeling quite sad and sorry for myself but ill snap out of it soon',
 'i feel like i am still looking at a blank canvas blank pieces of paper',
 'i feel like a faithful servant',
 'i am just feeling cranky and blue']

In [23]:
y_val[0:4]

[0, 0, 2, 3]

In [24]:
X_test[0:4]

['im feeling rather rotten so im not very ambitious right now',
 'im updating my blog because i feel shitty',
 'i never make her separate from me because i don t ever want her to feel like i m ashamed with her',
 'i left with my bouquet of red and yellow tulips under my arm feeling slightly more optimistic than when i arrived']

In [25]:
y_test[0:4]

[0, 0, 0, 1]

In [26]:
X_all = np.concatenate((X_train, X_val, X_test), axis=0)
X_all.shape

(20000,)

In [27]:
y_all = np.concatenate((y_train, y_val, y_test), axis = 0)
y_all.shape

(20000,)
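Concatenating the three splits discards the information about which split each row came from. If the merged data needs to be re-split later, one option is to keep an extra column of split names. A minimal sketch with plain lists and tiny stand-ins for the real splits; the `origins` list is a hypothetical addition, not something the notebook creates.

```python
# Tiny stand-ins for the real splits (the notebook's lists have 16000/2000/2000 items).
splits = {
    "train": (["i didnt feel humiliated", "i am feeling grouchy"], [0, 3]),
    "validation": (["i feel like a faithful servant"], [2]),
    "test": (["im updating my blog because i feel shitty"], [0]),
}

texts, labels, origins = [], [], []
for name, (split_texts, split_labels) in splits.items():
    assert len(split_texts) == len(split_labels)  # texts and labels must stay aligned
    texts.extend(split_texts)
    labels.extend(split_labels)
    origins.extend([name] * len(split_texts))  # remember where each row came from

# Recover one split later by filtering on the origin column.
val_texts = [t for t, o in zip(texts, origins) if o == "validation"]
print(len(texts), origins, val_texts)
```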

In [28]:
X_all

array(['i didnt feel humiliated',
       'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake',
       'im grabbing a minute to post i feel greedy wrong', ...,
       'i feel that i am useful to my people and that gives me a great feeling of achievement',
       'im feeling more comfortable with derby i feel as though i can start to step out my shell',
       'i feel all weird when i have to meet w people i text but like dont talk face to face w'],
      dtype='<U300')

In [29]:
y_all

array([0, 0, 3, ..., 1, 1, 4])

In [30]:
data = pd.DataFrame({
    'category': y_all,
    'text': X_all
})
data.head()

| | category | text |
|---|---|---|
| 0 | 0 | i didnt feel humiliated |
| 1 | 0 | i can go from feeling so hopeless to so damned... |
| 2 | 3 | im grabbing a minute to post i feel greedy wrong |
| 3 | 2 | i am ever feeling nostalgic about the fireplac... |
| 4 | 3 | i am feeling grouchy |


In [31]:
from pathlib import Path
file_path = Path("../dataset/dataset_raw.csv")
file_path.parent.mkdir(parents=True, exist_ok=True)  # create ../dataset if it does not exist yet
data.to_csv(file_path, index=False)