<a href="https://colab.research.google.com/github/oaarnikoivu/dissertation/blob/master/Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset Analysis

### Imports

In [0]:
import pandas as pd 
import numpy as np
import nltk
import re

In [56]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Load Data

In [0]:
text = []
labels = []

# Read the pipe-delimited ISEAR file, skipping the header row.
with open('/content/drive/My Drive/datasets/isear.csv') as isear_data:
  next(isear_data)  # header row
  for line in isear_data:
    fields = line.split('|')
    text.append(fields[40])    # field 40 holds the text
    labels.append(fields[36])  # field 36 holds the emotion label

df = pd.DataFrame({'Text': text, 'Emotion': labels})
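
The manual split above can also be sketched with the standard-library `csv` module, which makes the `|` delimiter and the header skip explicit. The sample below is a made-up three-column stand-in for the real file, and the column positions `LABEL_COL` and `TEXT_COL` are assumptions for illustration (the loader above reads fields 36 and 40), so treat this as a sketch rather than the loader actually used.

```python
import csv
import io

import pandas as pd

# Hypothetical pipe-delimited sample standing in for isear.csv.
sample = io.StringIO(
    "ID|Field1|SIT\n"
    "1|joy|Felt happy on my birthday.\n"
    "2|fear|Heard a noise at night.\n"
)

LABEL_COL, TEXT_COL = 1, 2  # assumed positions for this toy sample only

reader = csv.reader(sample, delimiter='|')
next(reader)  # skip the header row
rows = [(row[TEXT_COL], row[LABEL_COL]) for row in reader]

sample_df = pd.DataFrame(rows, columns=['Text', 'Emotion'])
print(sample_df)
```

For the real file, `sample` would be replaced by the opened file handle and the two constants by 36 and 40.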

In [58]:
df.head()

| | Text | Emotion |
|---|---|---|
| 0 | During the period of falling in love, each tim... | joy |
| 1 | When I was involved in a traffic accident. | fear |
| 2 | When I was driving home after several days of... | anger |
| 3 | When I lost the person who meant the most to me. | sadness |
| 4 | The time I knocked a deer down - the sight of ... | disgust |


In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7666 entries, 0 to 7665
Data columns (total 2 columns):
Text       7666 non-null object
Emotion    7666 non-null object
dtypes: object(2)
memory usage: 119.9+ KB


### Data & Text Preprocessing

The Emotion column is categorical, so we map each class label to an integer.

In [60]:
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['Emotion']))}
print(class_mapping)

# Use the mapping dictionary to transform the class labels into integers
df['Emotion'] = df['Emotion'].map(class_mapping)
df.head()

{'anger': 0, 'disgust': 1, 'fear': 2, 'guilt': 3, 'joy': 4, 'sadness': 5, 'shame': 6}


| | Text | Emotion |
|---|---|---|
| 0 | During the period of falling in love, each tim... | 4 |
| 1 | When I was involved in a traffic accident. | 2 |
| 2 | When I was driving home after several days of... | 0 |
| 3 | When I lost the person who meant the most to me. | 5 |
| 4 | The time I knocked a deer down - the sight of ... | 1 |
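
Because the labels are now integers, it can be handy to keep an inverse mapping for turning predictions back into emotion names. The sketch below uses a toy frame (not the ISEAR data), and `inv_class_mapping` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy labels for illustration only.
toy = pd.DataFrame({'Emotion': ['joy', 'fear', 'anger', 'joy']})

# np.unique returns the sorted unique labels, so indices follow alphabetical order.
class_mapping = {label: idx for idx, label in enumerate(np.unique(toy['Emotion']))}

# Swap keys and values to go from integer back to label.
inv_class_mapping = {idx: label for label, idx in class_mapping.items()}

encoded = toy['Emotion'].map(class_mapping)
decoded = encoded.map(inv_class_mapping)

print(encoded.tolist())
print(decoded.tolist())
```

Mapping and then inverse-mapping should reproduce the original labels, which makes for a quick sanity check on the encoding.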


Let's inspect a sample of the text to decide which preprocessing steps are needed.

In [61]:
df.loc[3552, 'Text']

'It was a complex situation concerning a relationship with a á boyfriend, I had broken the relationship for some reasons. á Meanwhile, as I felt it, I had most sorrow.'

### Cleaning text data with Regular Expressions

In [0]:
def preprocessor(text):
    text = re.sub('á', '', text)         # drop the stray 'á' characters seen in the raw text
    text = re.sub('  ', ' ', text)       # collapse the double spaces this can leave behind
    text = re.sub(r'<[^>]*>', '', text)  # remove all HTML markup

    # lowercase, then replace each run of non-word characters with a single space
    text = re.sub(r'\W+', ' ', text.lower())

    return text

In [63]:
preprocessor("HELLO!!! [] this is a (:test!)")

'hello this is a test '
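
The explicit `'á'` substitution is needed because, in Python 3, `\W` is Unicode-aware by default and treats `'á'` as a word character. The sketch below illustrates this on a short made-up string and shows `re.ASCII` as one possible alternative. Note that `re.ASCII` would also strip legitimate accented letters, so it is not a drop-in replacement for the approach above.

```python
import re

sample = 'a á boyfriend'

# Default (Unicode) matching: 'á' counts as a word character, so it survives.
unicode_result = re.sub(r'\W+', ' ', sample)

# With re.ASCII, \W also matches non-ASCII characters such as 'á'.
ascii_result = re.sub(r'\W+', ' ', sample, flags=re.ASCII)

print(repr(unicode_result))
print(repr(ascii_result))
```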

### Apply the preprocessor to the text

In [0]:
df['Text'] = df['Text'].apply(preprocessor)

Some rows contain only the placeholder 'no response'. We remove these rows because they add no information to our data.

In [65]:
df.loc[250, 'Text']

' no response '

In [0]:
df = df[~df['Text'].str.contains(r'\sno response\s')]

In [74]:
df['Text'][250]  # raises KeyError: row 250 was dropped by the filter above

KeyError: ignored
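
The `KeyError` is expected: boolean filtering keeps the original index labels, so label 250 no longer exists once its row is dropped. A minimal sketch on a toy frame (made-up texts in the preprocessor's output format) shows the gap in the index and one optional way to renumber the rows with `reset_index`.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({'Text': ['i felt happy ', ' no response ', 'i was afraid ']})

# Keep only the rows that do not match the pattern.
kept = toy[~toy['Text'].str.contains(r'\sno response\s')]

# Label 1 is gone, so kept['Text'][1] would raise KeyError.
print(kept.index.tolist())

# Optional: renumber the rows 0..n-1 if contiguous labels are needed later.
renumbered = kept.reset_index(drop=True)
print(renumbered.index.tolist())
```

Whether to reset the index is a design choice: keeping the original labels preserves the link to the source rows, while resetting avoids surprises in later label-based lookups.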