# Data cleaning

This notebook walks through the data cleaning process step by step. Every dataset that is used in the project and needs cleaning is processed here and converted to a CSV file so that it can be easily used in the project.

In [237]:
import pandas as pd
import numpy as np
from requests_html import AsyncHTMLSession
from bs4 import BeautifulSoup
import os
import re

## EmotioNL

EmotioNL is a Dutch emotion dataset of Dutch sentences, each labeled with one of six emotions (joy, sadness, fear, neutral, love and anger). The dataset is split into two parts:

1. A dataset of 1000 tweets
2. A dataset of 1000 sentences from different tv shows

Both parts are stored as text files and contain columns that are unnecessary for this project. We filter each file so that only the required columns remain and convert both to CSV files.

### EmotioNL TV shows

Get the path to the file

In [238]:
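# os.path.join avoids hard-coded backslashes, which are invalid escape sequences in a plain string
base_path = os.path.join("datasets", "EmotioNL_dataset_open")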
filename = "EmotioNL_captions.txt"

pf = os.path.join(base_path, filename)
pf

'datasets\\EmotioNL_dataset_open\\EmotioNL_captions.txt'
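
The printed path uses Windows separators. As a minimal sketch, assuming the same folder layout as above, the path could also be built with pathlib so that it is portable across operating systems. The variable names base_dir and captions_path are only illustrative.

```python
from pathlib import Path

# Folder and file names are taken from the cell above; the layout is assumed.
base_dir = Path("datasets") / "EmotioNL_dataset_open"
captions_path = base_dir / "EmotioNL_captions.txt"

# as_posix() renders the path with forward slashes on every platform.
print(captions_path.as_posix())
```

A Path object can be passed directly to pd.read_csv, so the rest of the notebook would not need to change.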

The current dataset consists of 7 columns.

In [239]:
shows_df = pd.read_csv(pf, sep="\t")
shows_df

Unnamed: 0,ID,Text,Show,Valence,Arousal,Dominance,Category
0,0,Kweenie of da gij al es goed hebt stilgestaan ...,OV,0.303732,0.513015,0.664205,neutral
1,1,Santé.,BG,0.712377,0.492071,0.619150,joy
2,10,Ja jongens. Dees is nie gezond ze. Pfff. Dees ...,BG,0.338710,0.827688,0.260467,fear
3,100,Ik had de vorige keer ook al zoiets van: nie m...,OV,0.438437,0.714087,0.471438,anger
4,101,Tis wat het is. Tis ook een belangrijke job eh...,BZL,0.696067,0.552524,0.705887,neutral
...,...,...,...,...,...,...,...
995,995,"Jah ze werken harder eh als wij in België, en ...",BZL,0.344995,0.849752,0.543079,sadness
996,996,Ah gij. Wa een ramp zijt gij. Allemaal hetzelfde.,OV,0.257216,0.690051,0.462198,anger
997,997,"Der sta Lego van Lord of the Rings, van Star W...",BG,0.681752,0.434160,0.673334,neutral
998,998,Nu zijn we precies aant kamperen.,BZL,0.503272,0.293365,0.529894,joy


This dataset contains neither empty values nor duplicate rows.

In [240]:
dupes = shows_df[shows_df.duplicated()]
empty = np.where(pd.isnull(shows_df))

len(dupes), len(empty[0])

(0, 0)
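
Since the same check is repeated for the tweets file, it could be wrapped in a small helper. This is only a sketch: the helper name count_duplicates_and_missing and the toy rows are made up for illustration and do not come from EmotioNL.

```python
import pandas as pd


def count_duplicates_and_missing(df: pd.DataFrame) -> tuple:
    """Return (number of fully duplicated rows, number of missing cells)."""
    n_duplicates = int(df.duplicated().sum())
    n_missing = int(df.isna().sum().sum())
    return n_duplicates, n_missing


# Toy frame with one repeated row and one missing text (invented data).
toy_df = pd.DataFrame({
    "Text": ["a", "a", "b", None],
    "Category": ["joy", "joy", "fear", "anger"],
})
print(count_duplicates_and_missing(toy_df))
```

Unlike len(empty[0]), which counts missing cells through NumPy indices, this version stays within pandas and returns plain integers.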

For this project, we only need the columns Text and Category.

In [241]:
new_s_df = shows_df.drop(["ID", "Show", "Valence", "Arousal", "Dominance"], axis=1)
new_s_df

Unnamed: 0,Text,Category
0,Kweenie of da gij al es goed hebt stilgestaan ...,neutral
1,Santé.,joy
2,Ja jongens. Dees is nie gezond ze. Pfff. Dees ...,fear
3,Ik had de vorige keer ook al zoiets van: nie m...,anger
4,Tis wat het is. Tis ook een belangrijke job eh...,neutral
...,...,...
995,"Jah ze werken harder eh als wij in België, en ...",sadness
996,Ah gij. Wa een ramp zijt gij. Allemaal hetzelfde.,anger
997,"Der sta Lego van Lord of the Rings, van Star W...",neutral
998,Nu zijn we precies aant kamperen.,joy
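
An alternative to listing the columns to drop is to select the columns to keep, which stays correct if the source file ever gains extra columns. A minimal sketch on a made-up toy frame (the rows and the name keep_columns are illustrative):

```python
import pandas as pd

# Toy frame with a subset of the EmotioNL column names (values invented).
toy_df = pd.DataFrame({
    "ID": [0, 1],
    "Text": ["x", "y"],
    "Show": ["OV", "BG"],
    "Category": ["neutral", "joy"],
})

keep_columns = ["Text", "Category"]
# .copy() gives an independent frame rather than a view on toy_df.
selected_df = toy_df[keep_columns].copy()
print(list(selected_df.columns))
```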


Now we convert the new dataframe to a usable CSV file to train our models with.

In [256]:
# the default write mode already overwrites an existing file
new_s_df.to_csv(os.path.join("datasets", "emotionl_shows.csv"), index=False)

In [257]:
pd.read_csv(os.path.join("datasets", "emotionl_shows.csv"))

Unnamed: 0,Text,Category
0,Kweenie of da gij al es goed hebt stilgestaan ...,neutral
1,Santé.,joy
2,Ja jongens. Dees is nie gezond ze. Pfff. Dees ...,fear
3,Ik had de vorige keer ook al zoiets van: nie m...,anger
4,Tis wat het is. Tis ook een belangrijke job eh...,neutral
...,...,...
995,"Jah ze werken harder eh als wij in België, en ...",sadness
996,Ah gij. Wa een ramp zijt gij. Allemaal hetzelfde.,anger
997,"Der sta Lego van Lord of the Rings, van Star W...",neutral
998,Nu zijn we precies aant kamperen.,joy
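
Because the text column contains commas and quotes, a quick round trip can confirm that nothing is lost when writing and reading CSV. The sketch below uses an in-memory buffer and toy rows (the first sentence is invented) rather than the real file:

```python
import io

import pandas as pd

# Toy rows; the first text mixes a comma and double quotes on purpose.
toy_df = pd.DataFrame({
    "Text": ['Jah, "echt" waar', "Santé."],
    "Category": ["sadness", "joy"],
})

buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)  # same options as the export above
buffer.seek(0)
restored_df = pd.read_csv(buffer)

# Compare the values column by column.
same_text = restored_df["Text"].tolist() == toy_df["Text"].tolist()
same_category = restored_df["Category"].tolist() == toy_df["Category"].tolist()
print(same_text and same_category)
```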


### EmotioNL Tweets

Get the path to the file

In [23]:
filename = "EmotioNL_tweets.txt"

pf = os.path.join(base_path, filename)
pf

'datasets\\EmotioNL_dataset_open\\EmotioNL_tweets.txt'

The dataset consists of 6 columns.

In [25]:
tweets_df = pd.read_csv(pf, sep="\t")
tweets_df

Unnamed: 0,ID,Tweet ID,Valence,Arousal,Dominance,Category
0,4,['914793795094446082'],0.553364,0.550661,0.732408,anger
1,7,['902122434714841090'],0.074637,0.754539,0.732349,anger
2,8,['864400774251589632'],0.395079,0.838521,0.768225,anger
3,11,['894957660889591810'],0.069375,0.626347,0.647728,anger
4,15,['905476032320462848'],0.150416,0.697961,0.584830,anger
...,...,...,...,...,...,...
995,949,['848596718740545536'],0.409241,0.635159,0.132824,sadness
996,951,['912331156317540358'],0.174462,0.964262,0.349197,sadness
997,963,['913090078582431744'],0.171635,0.781369,0.226771,sadness
998,976,['834451051407360000'],0.078343,0.422965,0.071597,sadness


The dataset has no empty or duplicate values

In [144]:
dupes = tweets_df[tweets_df.duplicated()]
empty = np.where(pd.isnull(tweets_df))

len(dupes), len(empty[0])

(0, 0)

We only need the tweet ID and the category.

In [145]:
new_t_df = tweets_df.drop(["ID", "Valence", "Arousal", "Dominance"], axis=1)
new_t_df

Unnamed: 0,Tweet ID,Category
0,['914793795094446082'],anger
1,['902122434714841090'],anger
2,['864400774251589632'],anger
3,['894957660889591810'],anger
4,['905476032320462848'],anger
...,...,...
995,['848596718740545536'],sadness
996,['912331156317540358'],sadness
997,['913090078582431744'],sadness
998,['834451051407360000'],sadness


This dataset needs considerably more processing: we have to retrieve the text of each tweet from its tweet ID. For this we request the tweet's URL with the requests_html module.

In [152]:
# Create an HTML session. Jupyter already runs its own event loop in the background,
# so we use AsyncHTMLSession, whose requests can be awaited inside a notebook.
asession = AsyncHTMLSession()

A tweet ID in the dataset is wrapped in brackets and quotes that need to be removed.

In [156]:
current_id = new_t_df["Tweet ID"][0]
# strip the surrounding brackets and quotes
new_id = current_id.strip("[").strip("]").strip("'")
print(current_id, new_id)

['914793795094446082'] 914793795094446082
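
The stored value looks like the text form of a Python list, so another option is to parse it instead of stripping characters. This is a sketch under the assumption that every cell holds exactly one ID; the helper name parse_tweet_id is hypothetical.

```python
import ast


def parse_tweet_id(raw: str) -> str:
    """Turn a string such as "['914793795094446082']" into the bare ID."""
    ids = ast.literal_eval(raw)  # safely evaluates the list literal
    # Assumption: each cell contains a list with exactly one ID.
    if not isinstance(ids, list) or len(ids) != 1:
        raise ValueError(f"expected a one-element list, got {raw!r}")
    return str(ids[0])


print(parse_tweet_id("['914793795094446082']"))
```

Parsing fails loudly on a malformed cell, whereas chained strip calls would silently pass it through.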


In [188]:
# sample of 3 tweets using the request method
texts = []
for i in range(3):
    # strip the brackets and quotes around the tweet ID
    tweet_id = tweets_df["Tweet ID"][i].strip("[").strip("]").strip("'")
    # call the url
    url = "https://twitter.com/anyuser/status/{}".format(tweet_id)
    r = await asession.get(url)
    # render the page, then sleep 2 seconds so the content can load
    await r.html.arender(sleep=2)
    # fetch the raw html code
    resp = r.html.raw_html
    # process it using bs4
    soup = BeautifulSoup(resp, "lxml")
    # the page title holds the author name and the tweet text
    text = soup.title.text
    texts.append(text)

In [189]:
texts

['Voeten on Twitter: "@LINDAnieuws Maandagmorgen en extra druk😣Dat is nu voordeel van niet meer werken op maandag al dit nieuws lezen!" / X',
 'Les on Twitter: "@9Owen1 Echt...zweet en shag..en dat een hele dag😡Je zal ermee op kantoor zitten😝" / X',
 'bilblebons on Twitter: "@ohzitdatzo @telegraaf @VVD D66Pechtold zei 1e BELANGRIJKSTE!poging😂dream on Pechtold gelul id ruimte! PVV id 2e grtste!wat e kleuterklasje.pesten uitsluiten z nie 2017" / X']

The last step is to remove the emojis from the tweets, along with the Twitter name at the start of each title and the trailing "/ X".

In [234]:
# compile the emoji pattern once, outside the loop
emoji_pattern = re.compile("["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]",
    flags=re.UNICODE)

n_text = []
for text in texts:
    r_text = emoji_pattern.sub(r'', text)  # remove emojis
    r_text = re.sub(r'^.*?@', '@', r_text)  # drop everything before the first @
    r_text = re.sub(r'".*', "", r_text)  # drop the closing quote and everything after it
    n_text.append(r_text)

In [235]:
n_text

['@LINDAnieuws Maandagmorgen en extra drukDat is nu voordeel van niet meer werken op maandag al dit nieuws lezen!',
 '@9Owen1 Echt...zweet en shag..en dat een hele dagJe zal ermee op kantoor zitten',
 '@ohzitdatzo @telegraaf @VVD D66Pechtold zei 1e BELANGRIJKSTE!pogingdream on Pechtold gelul id ruimte! PVV id 2e grtste!wat e kleuterklasje.pesten uitsluiten z nie 2017']
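
The cleaning above relies on every tweet starting with an @ mention, and removing an emoji glues the neighbouring words together (for example "drukDat"). A minimal sketch of a variant that extracts the quoted text directly and replaces emojis with a space is shown below. The title layout it assumes (name, then on Twitter:, then the quoted text, then / X) is inferred from the three sample titles, and the names clean_title and TITLE_PATTERN are hypothetical.

```python
import re

# Same four emoji ranges as in the cell above.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags
    "]+"
)
# Assumed title layout: <name> on Twitter: "<tweet text>" / X
TITLE_PATTERN = re.compile(r'^.*? on Twitter: "(.*)" / X$', re.DOTALL)


def clean_title(title: str) -> str:
    """Return the tweet text without author prefix, '/ X' suffix and emojis."""
    match = TITLE_PATTERN.match(title)
    # Fall back to the full title if the layout is not recognised.
    text = match.group(1) if match else title
    text = EMOJI_PATTERN.sub(" ", text)  # a space keeps neighbouring words apart
    return re.sub(r"\s+", " ", text).strip()


example = 'Les on Twitter: "@9Owen1 Echt...zweet en shag..en dat een hele dag😡Je zal ermee op kantoor zitten😝" / X'
print(clean_title(example))
```

Only the four emoji ranges listed are covered, so symbols outside them would remain in the text.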