# Data Cleaning

The WASSA2017 data came as .txt files, which had to be converted to .csv format.
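The conversion itself is not shown in this notebook. The following is a minimal sketch of one way to do it, assuming the .txt files are tab-separated, have no header row, and hold four fields per line (id, tweet, emotion label, score). The function name `wassa_txt_to_frame` and the column names are illustrative, chosen to match the columns that appear later in this notebook.

```python
import csv
import io

import pandas as pd

# Hypothetical column names, chosen to match the columns shown later in this notebook.
WASSA_COLUMNS = ["id", "tweet", "tweettype", "score"]

def wassa_txt_to_frame(source):
    """Read a tab-separated WASSA-style file (path or file-like object) into a DataFrame."""
    return pd.read_csv(
        source,
        sep="\t",
        header=None,             # assumption: the .txt files have no header row
        names=WASSA_COLUMNS,
        quoting=csv.QUOTE_NONE,  # tweets may contain stray quote characters
    )

# Small in-memory sample standing in for a real .txt file.
sample = io.StringIO("10857\t@ZubairSabirPTI pls dont insult the word 'Molna'\tanger\t0.479\n")
frame = wassa_txt_to_frame(sample)

# Writing without the index avoids an extra unnamed column in the .csv output.
csv_text = frame.to_csv(index=False)
```

With a real file, `source` would be the path to the .txt file and `frame.to_csv(path, index=False)` would write the converted data.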

In [1]:
# Importing the pandas library
import pandas as pd

We decided to combine the two datasets (WASSA2017 and CROWDFLOWER) so that we would have more data to train our model and make better predictions.
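As a rough illustration of what such a combination looks like, the sketch below stacks two toy frames with `pd.concat`. The frame names and rows are invented; only the column names follow the two datasets. Columns that exist in only one source are filled with NaN for rows coming from the other source, which is the pattern visible in the raw dataset below.

```python
import pandas as pd

# Toy stand-ins for the two sources (rows are invented for illustration).
wassa = pd.DataFrame(
    {"id": [1], "tweet": ["so angry"], "tweettype": ["anger"], "score": [0.5]}
)
crowdflower = pd.DataFrame(
    {"tweet_id": [2], "sentiment": ["love"], "author": ["someone"], "content": ["Happy Mothers Day"]}
)

# Stack the rows; columns missing from one frame become NaN for its rows.
combined = pd.concat([wassa, crowdflower], ignore_index=True, sort=False)

# Count the missing values per column to see which columns need merging.
print(combined.isna().sum())
```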

In [3]:
# Reading the dataset
text_emotion_recognition = pd.read_csv("../data/Original Data/text_emotion_recognition.csv")

In [4]:
# Viewing the raw dataset
text_emotion_recognition.head()

Unnamed: 0.1,Unnamed: 0,id,tweet,tweettype,score,tweet_id,sentiment,author,content
0,0,10857.0,@ZubairSabirPTI pls dont insult the word 'Molna',anger,0.479,,,,
1,1,10858.0,@ArcticFantasy I would have almost took offens...,anger,0.458,,,,
2,2,10859.0,@IllinoisLoyalty that Rutgers game was an abom...,anger,0.562,,,,
3,3,10860.0,@CozanGaming that's what lisa asked before she...,anger,0.5,,,,
4,4,10861.0,Sometimes I get mad over something so minuscul...,anger,0.708,,,,


As expected, the two datasets used different columns, which had to be reconciled before the data was passed to our model for training.

In [5]:
# This function drops the columns listed in columns_to_drop (in place) and returns the dataframe
def dropColumns(df, columns_to_drop):
  df.drop(columns=columns_to_drop, inplace=True)
  return df

In [6]:
# This function merges each pair of columns in columns_dict into the key column and returns the dataframe.
# Missing values are replaced with empty strings before concatenating, so no literal 'nan' text ends up in the result.
# Note that str.lstrip('nan') / str.rstrip('nan') would strip any leading or trailing 'n' and 'a' characters
# (turning 'anger' into 'ger' and 'fun' into 'fu'), so they are not used here.
def concatenateColumns(df, columns_dict):
  for key, value in columns_dict.items():
    df[key] = df[key].fillna('').astype(str) + ' ' + df[value].fillna('').astype(str)
    df[key] = df[key].str.strip()
  return df

Unnecessary columns (tweet_id and author from the CROWDFLOWER dataset, id and score from the WASSA2017 dataset) were removed, as they can hamper model prediction.

In [7]:
# Dictionary to be used as merging strategy: each key column absorbs its value column
columns_dict = {'tweet':'content', 'tweettype':'sentiment'}
columns_to_drop = ['id','tweet_id','Unnamed: 0', 'author', 'sentiment', 'score', 'content']
text_emotion_recognition = concatenateColumns(text_emotion_recognition, columns_dict)

# Extracting columns
author = text_emotion_recognition['author']
score = text_emotion_recognition['score']
text_emotion_recognition.index.name = 'tweet_id'

#Dropping columns
text_emotion_recognition = dropColumns(text_emotion_recognition, columns_to_drop)
text_emotion_recognition

tweet_id,tweet,tweettype
0,@ZubairSabirPTI pls dont insult the word 'Molna',anger
1,@ArcticFantasy I would have almost took offens...,anger
2,@IllinoisLoyalty that Rutgers game was an abom...,anger
3,@CozanGaming that's what lisa asked before she...,anger
4,Sometimes I get mad over something so minuscul...,anger
...,...,...
43955,@JohnLloydTaylor,neutral
43956,Happy Mothers Day All my love,love
43957,Happy Mother's Day to all the mommies out ther...,love
43958,@niariley WASSUP BEAUTIFUL!!! FOLLOW ME!! PEE...,happiness


In [8]:
sentiment_categories = list(text_emotion_recognition['tweettype'].unique())
sentiment_categories

['anger',
 'fear',
 'joy',
 'sadness',
 'empty',
 'enthusiasm',
 'neutral',
 'worry',
 'surprise',
 'love',
 'fun',
 'hate',
 'happiness',
 'boredom',
 'relief']

Finally, after merging the two datasets, 15 categories were obtained: joy, happiness, enthusiasm, fun, sadness, worry, neutral, empty, hate, anger, fear, love, boredom, relief, and surprise.
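With this many labels it can help to check how evenly they are represented. The sketch below uses `value_counts` on a toy Series whose values are invented, not taken from the dataset. On the real data the same calls would be applied to `text_emotion_recognition['tweettype']`.

```python
import pandas as pd

# Invented labels standing in for the real 'tweettype' column.
labels = pd.Series(["anger", "joy", "anger", "neutral", "joy", "anger"], name="tweettype")

# Absolute counts per label, most frequent first.
counts = labels.value_counts()

# Relative share of each label, rounded to two decimals.
shares = labels.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```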

In [9]:
text_emotion_recognition.to_csv('../data/Original Data/text_emotion_recognition_updated.csv')
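Because `to_csv` writes the index by default, the saved file starts with a `tweet_id` column. The sketch below reads such a file back, using an in-memory buffer and a one-row toy frame in place of the real path and data. Passing `index_col` restores the index. When the index has no name and `index_col` is omitted, the old index comes back as an `Unnamed: 0` column, which is how columns like those in the raw dataset above typically arise.

```python
import io

import pandas as pd

# One-row toy frame with a named index, mirroring the saved layout.
df = pd.DataFrame({"tweet": ["Happy Mothers Day All my love"], "tweettype": ["love"]})
df.index.name = "tweet_id"

buffer = io.StringIO()
df.to_csv(buffer)  # the index is written as the first column, named tweet_id
buffer.seek(0)
restored = pd.read_csv(buffer, index_col="tweet_id")

# With an unnamed index and no index_col, the index is read back as "Unnamed: 0".
unnamed = io.StringIO()
df.rename_axis(None).to_csv(unnamed)
unnamed.seek(0)
reread = pd.read_csv(unnamed)
```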