## Task 1: Loading the dataset

In [1]:
import nltk
import csv
import spacy
import pandas as pd

In [2]:
Tweets_df = pd.read_csv("train.txt", delimiter="\t")
Tweets_df

Unnamed: 0,ID,Tweet,Affect Dimension,Intensity Class
0,2017-En-30153,@liamch88 yeah! :) playing well,valence,0: neutral or mixed emotional state can be inf...
1,2017-En-40929,At least I don't have a guy trying to discoura...,valence,0: neutral or mixed emotional state can be inf...
2,2017-En-22012,UPLIFT: If you're still discouraged it means y...,valence,0: neutral or mixed emotional state can be inf...
3,2017-En-30837,"...at your age, the heyday in the blood is tam...",valence,0: neutral or mixed emotional state can be inf...
4,2017-En-30838,i was so embarrassed when she saw us i was lik...,valence,-2: moderately negative emotional state can be...
...,...,...,...,...
1176,2017-En-30999,@DumTeeDum don't think Ian knew of Pavel. He k...,valence,1: slightly positive emotional state can be in...
1177,2017-En-40498,i lost my wallet lol.... again....,valence,-2: moderately negative emotional state can be...
1178,2017-En-30436,"repentance, and trusting in Christ. It is lowl...",valence,1: slightly positive emotional state can be in...
1179,2017-En-30584,@FoRDaYS14 😳 chewing what? #smile #arcdental #...,valence,0: neutral or mixed emotional state can be inf...


In [3]:
Tweets_df.columns

Index(['ID', 'Tweet', 'Affect Dimension', 'Intensity Class'], dtype='object')

In [4]:
num_instances = len(Tweets_df)
num_instances

1181

## Task 2: Perform basic transformations

In [5]:
!pip install emoji



In [6]:
import sys
print(sys.executable)

/opt/anaconda3/envs/textmining/bin/python


In [7]:
import re
import emoji

In [8]:
def transformTweets(tweet):
    tweet = tweet.lower()  # lowercase the tweet

    Hashtags = re.findall(r"#(\w+)", tweet)  # words that follow a '#'
    Account = re.findall(r"@(\w+)", tweet)  # account names that follow an '@'
    Emoji = [char for char in tweet if char in emoji.EMOJI_DATA]  # single-character emoji
    # Drop hashtags and mentions, then every character that is not a letter, digit, or whitespace
    CleanedTweet = re.sub(r"[#@]\w+|[^a-zA-Z0-9\s]", "", tweet).strip()

    return (Hashtags, Account, Emoji, CleanedTweet)

In [9]:
#sample_tweet = re.sub(r"[#@]\w+|[^a-zA-Z0-9\s]|","",df['Tweet'].iloc[0]).strip()

Tweets_df[['Hashtags', 'Account', 'Emoji', 'CleanedTweet']] = Tweets_df['Tweet'].apply(
    lambda x: pd.Series(transformTweets(x)))

In [10]:
for x in range(7,10):
    print(str(x-6))
    print('**Original Tweet**: '+Tweets_df['Tweet'][x])
    print('**Cleaned Tweet**: '+Tweets_df['CleanedTweet'][x])

1
**Original Tweet**: At the regular cheerfulness of any emotion, her own supper, and to give the ladies were soon as they have been entirely at
**Cleaned Tweet**: at the regular cheerfulness of any emotion her own supper and to give the ladies were soon as they have been entirely at
2
**Original Tweet**: A pessimist sees the difficulty in every opportunity, an optimist sees the opportunity in every difficulty' -Sir Winston Churchill-
**Cleaned Tweet**: a pessimist sees the difficulty in every opportunity an optimist sees the opportunity in every difficulty sir winston churchill
3
**Original Tweet**: Just because I'm hurting \nDoesn't mean I'm hurt \nDoesn't mean I didn't get \nWhat I deserved \nNo better and no worse #lost  @coldplay
**Cleaned Tweet**: just because im hurting ndoesnt mean im hurt ndoesnt mean i didnt get nwhat i deserved nno better and no worse


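The cleaned version of each original tweet shown in the cell output above has been lowercased. In addition, the tweets have been stripped of emojis, account names, hashtags, and punctuation. Note that the literal "\n" sequences in the third example lose only their backslash, which leaves a stray "n" attached to the following word (for example, "ndoesnt").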

## Task 3: Tokenization (two approaches)

In [11]:
# Example showing tokenization of the Tweets into words for an instance

from nltk.tokenize import word_tokenize

In [12]:
word_tokenize(Tweets_df.CleanedTweet.tolist()[0])

['yeah', 'playing', 'well']

In [13]:
def tokenizeTweets(tweets):
    tokens = word_tokenize(tweets)
    
    return [tokens]

In [14]:
# Load the small English pipeline; only its tokenizer is used below
nlp = spacy.load("en_core_web_sm")

def spacytokenizeTweets(tweets):
    tokenizer = nlp.tokenizer
    doc = tokenizer(tweets)
    words = [token.text for token in doc]
    return [words]

In [15]:
Tweets_df[['TokenizedTweet']]=Tweets_df['CleanedTweet'].apply(
    lambda x: pd.Series(tokenizeTweets(x)))

In [16]:
Tweets_df[['SpacyTokenizedTweet']]=Tweets_df['CleanedTweet'].apply(
    lambda x: pd.Series(spacytokenizeTweets(x)))

In [17]:
Tweets_df[:5]

Unnamed: 0,ID,Tweet,Affect Dimension,Intensity Class,Hashtags,Account,Emoji,CleanedTweet,TokenizedTweet,SpacyTokenizedTweet
0,2017-En-30153,@liamch88 yeah! :) playing well,valence,0: neutral or mixed emotional state can be inf...,[],[liamch88],[],yeah playing well,"[yeah, playing, well]","[yeah, , playing, well]"
1,2017-En-40929,At least I don't have a guy trying to discoura...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],at least i dont have a guy trying to discourag...,"[at, least, i, dont, have, a, guy, trying, to,...","[at, least, i, do, nt, have, a, guy, trying, t..."
2,2017-En-22012,UPLIFT: If you're still discouraged it means y...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],uplift if youre still discouraged it means you...,"[uplift, if, youre, still, discouraged, it, me...","[uplift, if, you, re, still, discouraged, it, ..."
3,2017-En-30837,"...at your age, the heyday in the blood is tam...",valence,0: neutral or mixed emotional state can be inf...,"[shakespeareaninsults, hamlet, elizabethan, wi...",[theartofcharm],[],at your age the heyday in the blood is tame,"[at, your, age, the, heyday, in, the, blood, i...","[at, your, age, the, heyday, in, the, blood, i..."
4,2017-En-30838,i was so embarrassed when she saw us i was lik...,valence,-2: moderately negative emotional state can be...,[],[],[😩],i was so embarrassed when she saw us i was lik...,"[i, was, so, embarrassed, when, she, saw, us, ...","[i, was, so, embarrassed, when, she, saw, us, ..."


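The main difference I observed was the time taken by the two methods: spaCy took a little longer. On this dataset both produced mostly similar tokens, but two differences are visible in the output above. First, spaCy emits an extra whitespace token when the cleaned text contains consecutive spaces, as in the first row. Second, spaCy splits contractions whose apostrophe was removed during cleaning (for example, "dont" becomes "do" and "nt", and "youre" becomes "you" and "re"), whereas NLTK keeps them as single tokens.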
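
The extra whitespace token can be traced back to the cleaning step: removing "!" and ":)" from the first tweet leaves two consecutive spaces in the cleaned text. The following is a minimal sketch, not part of the original notebook, that reproduces this with the same regular expression and shows one possible fix. The helper names `clean_tweet` and `collapse_whitespace` are hypothetical.

```python
import re

def clean_tweet(tweet: str) -> str:
    """Lowercase, then drop hashtags, mentions, and non-alphanumeric characters."""
    return re.sub(r"[#@]\w+|[^a-zA-Z0-9\s]", "", tweet.lower()).strip()

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace with a single space."""
    return re.sub(r"\s+", " ", text).strip()

raw = "@liamch88 yeah! :) playing well"
cleaned = clean_tweet(raw)
print(repr(cleaned))                       # 'yeah  playing well' (two spaces)
print(repr(collapse_whitespace(cleaned)))  # 'yeah playing well'
```

Applying a normalization like this to CleanedTweet before tokenizing should keep spaCy from producing the whitespace token, since its tokenizer only emits such tokens for extra whitespace.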

## Task 4: Remove stopwords

In [18]:
from nltk.corpus import stopwords
# Ensure stopwords are downloaded
nltk.download('stopwords')

# Convert stopwords to a set for fast lookup
stop_words = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/shivamsinghrawat/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
def StopWordsTweets(TokenizedTweet):
    StopwordRemovedTweet = [word for word in TokenizedTweet if word not in stop_words]
    return [StopwordRemovedTweet]
# [word for word in tokens if word.lower() not in stop_words]

In [20]:
Tweets_df[['StopwordRemovedTweet']]=Tweets_df['TokenizedTweet'].apply(
    lambda x: pd.Series(StopWordsTweets(x)))

In [21]:
Tweets_df[:5]

Unnamed: 0,ID,Tweet,Affect Dimension,Intensity Class,Hashtags,Account,Emoji,CleanedTweet,TokenizedTweet,SpacyTokenizedTweet,StopwordRemovedTweet
0,2017-En-30153,@liamch88 yeah! :) playing well,valence,0: neutral or mixed emotional state can be inf...,[],[liamch88],[],yeah playing well,"[yeah, playing, well]","[yeah, , playing, well]","[yeah, playing, well]"
1,2017-En-40929,At least I don't have a guy trying to discoura...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],at least i dont have a guy trying to discourag...,"[at, least, i, dont, have, a, guy, trying, to,...","[at, least, i, do, nt, have, a, guy, trying, t...","[least, dont, guy, trying, discourage, anymore..."
2,2017-En-22012,UPLIFT: If you're still discouraged it means y...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],uplift if youre still discouraged it means you...,"[uplift, if, youre, still, discouraged, it, me...","[uplift, if, you, re, still, discouraged, it, ...","[uplift, youre, still, discouraged, means, you..."
3,2017-En-30837,"...at your age, the heyday in the blood is tam...",valence,0: neutral or mixed emotional state can be inf...,"[shakespeareaninsults, hamlet, elizabethan, wi...",[theartofcharm],[],at your age the heyday in the blood is tame,"[at, your, age, the, heyday, in, the, blood, i...","[at, your, age, the, heyday, in, the, blood, i...","[age, heyday, blood, tame]"
4,2017-En-30838,i was so embarrassed when she saw us i was lik...,valence,-2: moderately negative emotional state can be...,[],[],[😩],i was so embarrassed when she saw us i was lik...,"[i, was, so, embarrassed, when, she, saw, us, ...","[i, was, so, embarrassed, when, she, saw, us, ...","[embarrassed, saw, us, like, knvfkkjg, thinks,..."


## Task 5: Stemming

In [22]:
from nltk.stem import PorterStemmer 
ps = PorterStemmer()

In [23]:
def StemmingTweets(StopwordRemovedTweet):
    StemmedTweet = [ps.stem(word) for word in StopwordRemovedTweet]
    return [StemmedTweet]

In [24]:
Tweets_df[['StemmedTweet']]=Tweets_df['StopwordRemovedTweet'].apply(
    lambda x: pd.Series(StemmingTweets(x)))

In [25]:
Tweets_df[:5]

Unnamed: 0,ID,Tweet,Affect Dimension,Intensity Class,Hashtags,Account,Emoji,CleanedTweet,TokenizedTweet,SpacyTokenizedTweet,StopwordRemovedTweet,StemmedTweet
0,2017-En-30153,@liamch88 yeah! :) playing well,valence,0: neutral or mixed emotional state can be inf...,[],[liamch88],[],yeah playing well,"[yeah, playing, well]","[yeah, , playing, well]","[yeah, playing, well]","[yeah, play, well]"
1,2017-En-40929,At least I don't have a guy trying to discoura...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],at least i dont have a guy trying to discourag...,"[at, least, i, dont, have, a, guy, trying, to,...","[at, least, i, do, nt, have, a, guy, trying, t...","[least, dont, guy, trying, discourage, anymore...","[least, dont, guy, tri, discourag, anymor, wan..."
2,2017-En-22012,UPLIFT: If you're still discouraged it means y...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],uplift if youre still discouraged it means you...,"[uplift, if, youre, still, discouraged, it, me...","[uplift, if, you, re, still, discouraged, it, ...","[uplift, youre, still, discouraged, means, you...","[uplift, your, still, discourag, mean, your, l..."
3,2017-En-30837,"...at your age, the heyday in the blood is tam...",valence,0: neutral or mixed emotional state can be inf...,"[shakespeareaninsults, hamlet, elizabethan, wi...",[theartofcharm],[],at your age the heyday in the blood is tame,"[at, your, age, the, heyday, in, the, blood, i...","[at, your, age, the, heyday, in, the, blood, i...","[age, heyday, blood, tame]","[age, heyday, blood, tame]"
4,2017-En-30838,i was so embarrassed when she saw us i was lik...,valence,-2: moderately negative emotional state can be...,[],[],[😩],i was so embarrassed when she saw us i was lik...,"[i, was, so, embarrassed, when, she, saw, us, ...","[i, was, so, embarrassed, when, she, saw, us, ...","[embarrassed, saw, us, like, knvfkkjg, thinks,...","[embarrass, saw, us, like, knvfkkjg, think, st..."


In [26]:
from nltk.stem.snowball import SnowballStemmer

In [27]:
stemmer = SnowballStemmer("english")

In [28]:
def SnowballStemmingTweets(StopwordRemovedTweet):
    StemmedTweet = [stemmer.stem(word) for word in StopwordRemovedTweet]
    return [StemmedTweet]

In [29]:
Tweets_df[['SnowballStemmedTweet']]=Tweets_df['StopwordRemovedTweet'].apply(
    lambda x: pd.Series(SnowballStemmingTweets(x)))

In [30]:
Tweets_df[:5]

Unnamed: 0,ID,Tweet,Affect Dimension,Intensity Class,Hashtags,Account,Emoji,CleanedTweet,TokenizedTweet,SpacyTokenizedTweet,StopwordRemovedTweet,StemmedTweet,SnowballStemmedTweet
0,2017-En-30153,@liamch88 yeah! :) playing well,valence,0: neutral or mixed emotional state can be inf...,[],[liamch88],[],yeah playing well,"[yeah, playing, well]","[yeah, , playing, well]","[yeah, playing, well]","[yeah, play, well]","[yeah, play, well]"
1,2017-En-40929,At least I don't have a guy trying to discoura...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],at least i dont have a guy trying to discourag...,"[at, least, i, dont, have, a, guy, trying, to,...","[at, least, i, do, nt, have, a, guy, trying, t...","[least, dont, guy, trying, discourage, anymore...","[least, dont, guy, tri, discourag, anymor, wan...","[least, dont, guy, tri, discourag, anymor, wan..."
2,2017-En-22012,UPLIFT: If you're still discouraged it means y...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],uplift if youre still discouraged it means you...,"[uplift, if, youre, still, discouraged, it, me...","[uplift, if, you, re, still, discouraged, it, ...","[uplift, youre, still, discouraged, means, you...","[uplift, your, still, discourag, mean, your, l...","[uplift, your, still, discourag, mean, your, l..."
3,2017-En-30837,"...at your age, the heyday in the blood is tam...",valence,0: neutral or mixed emotional state can be inf...,"[shakespeareaninsults, hamlet, elizabethan, wi...",[theartofcharm],[],at your age the heyday in the blood is tame,"[at, your, age, the, heyday, in, the, blood, i...","[at, your, age, the, heyday, in, the, blood, i...","[age, heyday, blood, tame]","[age, heyday, blood, tame]","[age, heyday, blood, tame]"
4,2017-En-30838,i was so embarrassed when she saw us i was lik...,valence,-2: moderately negative emotional state can be...,[],[],[😩],i was so embarrassed when she saw us i was lik...,"[i, was, so, embarrassed, when, she, saw, us, ...","[i, was, so, embarrassed, when, she, saw, us, ...","[embarrassed, saw, us, like, knvfkkjg, thinks,...","[embarrass, saw, us, like, knvfkkjg, think, st...","[embarrass, saw, us, like, knvfkkjg, think, st..."


The Snowball stemmer and the Porter stemmer are similar in terms of functionality. After reading online about the differences between the two, I found that the Snowball English stemmer (also known as Porter2) is a revised version of the Porter algorithm, and that the Snowball framework supports several additional languages besides English. In the sample rows shown above, the visible stems are identical, and I was not able to observe any difference in speed either.
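
To look for cases where the two stemmers disagree, a small side-by-side comparison can help. The sketch below is an illustration rather than part of the original notebook: `compare_stemmers` and `differing` are hypothetical helpers, and the probe words are arbitrary choices. Which words differ, if any, depends on the installed NLTK version, so no expected output is shown.

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

def compare_stemmers(words):
    """Return (word, porter_stem, snowball_stem) for each input word."""
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    return [(w, porter.stem(w), snowball.stem(w)) for w in words]

def differing(rows):
    """Keep only the rows where the two stemmers disagree."""
    return [row for row in rows if row[1] != row[2]]

# Hypothetical probe words; swap in tokens from StopwordRemovedTweet as needed
rows = compare_stemmers(["playing", "discourage", "fairly", "generously"])
for word, porter_stem, snowball_stem in rows:
    print(f"{word:12} porter={porter_stem:12} snowball={snowball_stem}")
print("Disagreements:", differing(rows))
```

Running the same comparison over every token in the dataset would give a more reliable picture than inspecting the first five rows.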

## In a short paragraph, discuss any other observations on the dataset.

I think the dataset offers good variety: it contains hashtags, emojis, and account names. Some of the tweets also contain literal "\n" escape sequences, which was a little unusual, but it made sense as they were extracted using Twitter's API.

## In a short paragraph, discuss any other observations on the data preprocessing process.

I had some preprocessing challenges, such as emojis that appear as icons (e.g., 😩) as well as emoticons built from character combinations such as ":)" in some of the tweets. I was able to extract the icons but had to discard the emoticons made of special characters.
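
One way to keep text emoticons would be to extract them with a regular expression before punctuation is stripped. The following is a minimal sketch under that assumption. The pattern `EMOTICON_RE` and the helper `extract_emoticons` are hypothetical, and the pattern covers only a few common shapes (eyes, optional nose, mouth), so it will miss many emoticons and may occasionally match ordinary text.

```python
import re

# Hypothetical, deliberately small pattern: eyes, optional nose, mouth
EMOTICON_RE = re.compile(r"[:;=]-?[)(DPp]")

def extract_emoticons(tweet: str) -> list:
    """Return text emoticons found in the raw (not yet lowercased) tweet."""
    return EMOTICON_RE.findall(tweet)

print(extract_emoticons("@liamch88 yeah! :) playing well"))  # [':)']
print(extract_emoticons("so sad :-( but ok ;D"))             # [':-(', ';D']
```

The function is applied to the raw tweet rather than the lowercased one so that emoticons such as ";D" keep their original form.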

## Besides the preprocessing steps that you have taken above, are there other things that can be done to improve the “cleanliness” of the data and can aid in text mining? List as many as possible!

The first would be lemmatization, which would provide a much more intuitive root word than stemming. However, we would have to run a downstream algorithm and compare metrics such as recall and accuracy to be certain of its effectiveness.

The second would be to handle escape sequences: after removing special characters, some words were left with a stray "n" at the beginning, which made no sense (e.g., "ndoesnt" instead of "doesnt").
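
A possible fix for the stray "n" is to replace the literal backslash-n sequences with a space before the punctuation is removed. The sketch below assumes, based on the printed tweets, that the data contains a backslash followed by the letter "n" rather than a real newline character; it handles both. The function name `clean_tweet_with_newline_fix` is hypothetical and the rest mirrors the cleaning regular expression used in Task 2.

```python
import re

def clean_tweet_with_newline_fix(tweet: str) -> str:
    tweet = tweet.lower()
    # Turn literal backslash-n sequences and real newlines into spaces first
    tweet = re.sub(r"\\n|\n", " ", tweet)
    # Same cleaning as before: drop hashtags, mentions, and punctuation
    tweet = re.sub(r"[#@]\w+|[^a-z0-9\s]", "", tweet)
    # Collapse the runs of spaces left behind by the removals
    return re.sub(r"\s+", " ", tweet).strip()

# Raw string, so \n below is a backslash followed by 'n', as in the dataset
sample = r"Just because I'm hurting \nDoesn't mean I'm hurt #lost  @coldplay"
print(clean_tweet_with_newline_fix(sample))
# just because im hurting doesnt mean im hurt
```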

## Bonus Task (OPTIONAL): Additional preprocessing. Implement one of the additional preprocessing techniques you listed.

For this task, I perform lemmatization instead of stemming.

In [33]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

In [34]:
def LemmatizeTweets(StopwordRemovedTweet):
    # lemmatize() treats each word as a noun unless a part-of-speech tag is passed
    LemmatizedTweet = [lemmatizer.lemmatize(word) for word in StopwordRemovedTweet]
    return [LemmatizedTweet]

In [35]:
Tweets_df[['LemmatizedTweet']]=Tweets_df['StopwordRemovedTweet'].apply(
    lambda x: pd.Series(LemmatizeTweets(x)))

In [37]:
Tweets_df[:5]

Unnamed: 0,ID,Tweet,Affect Dimension,Intensity Class,Hashtags,Account,Emoji,CleanedTweet,TokenizedTweet,SpacyTokenizedTweet,StopwordRemovedTweet,StemmedTweet,SnowballStemmedTweet,LemmatizedTweet
0,2017-En-30153,@liamch88 yeah! :) playing well,valence,0: neutral or mixed emotional state can be inf...,[],[liamch88],[],yeah playing well,"[yeah, playing, well]","[yeah, , playing, well]","[yeah, playing, well]","[yeah, play, well]","[yeah, play, well]","[yeah, playing, well]"
1,2017-En-40929,At least I don't have a guy trying to discoura...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],at least i dont have a guy trying to discourag...,"[at, least, i, dont, have, a, guy, trying, to,...","[at, least, i, do, nt, have, a, guy, trying, t...","[least, dont, guy, trying, discourage, anymore...","[least, dont, guy, tri, discourag, anymor, wan...","[least, dont, guy, tri, discourag, anymor, wan...","[least, dont, guy, trying, discourage, anymore..."
2,2017-En-22012,UPLIFT: If you're still discouraged it means y...,valence,0: neutral or mixed emotional state can be inf...,[],[],[],uplift if youre still discouraged it means you...,"[uplift, if, youre, still, discouraged, it, me...","[uplift, if, you, re, still, discouraged, it, ...","[uplift, youre, still, discouraged, means, you...","[uplift, your, still, discourag, mean, your, l...","[uplift, your, still, discourag, mean, your, l...","[uplift, youre, still, discouraged, mean, your..."
3,2017-En-30837,"...at your age, the heyday in the blood is tam...",valence,0: neutral or mixed emotional state can be inf...,"[shakespeareaninsults, hamlet, elizabethan, wi...",[theartofcharm],[],at your age the heyday in the blood is tame,"[at, your, age, the, heyday, in, the, blood, i...","[at, your, age, the, heyday, in, the, blood, i...","[age, heyday, blood, tame]","[age, heyday, blood, tame]","[age, heyday, blood, tame]","[age, heyday, blood, tame]"
4,2017-En-30838,i was so embarrassed when she saw us i was lik...,valence,-2: moderately negative emotional state can be...,[],[],[😩],i was so embarrassed when she saw us i was lik...,"[i, was, so, embarrassed, when, she, saw, us, ...","[i, was, so, embarrassed, when, she, saw, us, ...","[embarrassed, saw, us, like, knvfkkjg, thinks,...","[embarrass, saw, us, like, knvfkkjg, think, st...","[embarrass, saw, us, like, knvfkkjg, think, st...","[embarrassed, saw, u, like, knvfkkjg, think, s..."


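There is a clear difference between the words obtained after stemming and after lemmatizing. In the above sample of 5 tweets, we can see that the words obtained after lemmatization are far more intuitive and closer to the original words (for example, "discourage" rather than "discourag"). The lemmatizer is not perfect, though: because WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, "playing" is left unchanged and "us" is reduced to "u".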