### Pre-Process

The acquired tweets must be pre-processed before they can be used to train the selected models.

**Several pre-processing steps were applied to clean every tweet.**

In [3]:
import pandas as pd
import itertools 
import emoji
import pickle
from ekphrasis.classes.preprocessor import TextPreProcessor
from ekphrasis.classes.tokenizer import SocialTokenizer
from ekphrasis.dicts.emoticons import emoticons
from ekphrasis.dicts.noslang import slangdict


# Load the raw tweets and drop the saved index column, which is not needed
df = pd.read_csv('df_before_preprocess1.csv', encoding='utf-8')
del df['Unnamed: 0']

df

,Tweet,Class
0,bout time!!!!! \xe2\x9c\x8a\xf0\x9f\x8f\xbd\xf...,Happy
1,not thrilled that the dog wanted to walk so ea...,Happy
2,get up and get your butt on the ride! #loveyou...,Happy
3,i am surrounded by internalised homophobia.\xf...,Happy
4,@peacexxanna your nightmare,Fear
...,...,...
57288,hate is an acid that decays its own container....,Happy
57289,making progress on the #yogaroom\n\n7 boxes of...,Happy
57290,i need this again\xf0\x9f\xa5\xba @euphoriahbo...,Happy
57291,always add a side of bacon.\n\ncommitment is a...,Happy


### First Preprocessing step

* Change the encoding so that emoticons and emojis can be used
 * Each emoji and emoticon is translated into a keyword. For example, " :) " is translated to <smiley_face>
* Replace " ’ " with '
* Replace newline symbols
* Replace some extra symbols to reduce punctuation
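
The encoding change in the first bullet can be illustrated on a single string. This is a minimal sketch using only the standard library; the helper name `decode_escaped_utf8` and the sample text are illustrative assumptions, not part of the notebook.

```python
import codecs


def decode_escaped_utf8(text):
    """Turn literal escapes such as \\xe2\\x9c\\x8a back into real characters."""
    # Step 1: interpret the backslash escapes; each \xNN becomes one character
    unescaped = codecs.decode(text, "unicode_escape")
    # Step 2: map those characters back to the raw bytes they stand for
    raw_bytes = unescaped.encode("latin1", "ignore")
    # Step 3: read the raw bytes as UTF-8 to recover the original symbol
    return raw_bytes.decode("utf-8")


sample = r"bout time \xe2\x9c\x8a"  # hypothetical tweet with an escaped emoji
print(decode_escaped_utf8(sample))
```

Note that the `'ignore'` flag in step 2 silently drops any character that does not fit in Latin-1, so the round trip assumes the text consists of ASCII plus escaped bytes.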



In [4]:

# Replacing symbols
# Tweets are stored with escaped UTF-8 bytes (e.g. \xe2\x9c\x8a); decode them back to real characters
df['Tweet'] = df['Tweet'].str.decode('unicode_escape').str.encode('latin1', 'ignore').str.decode('utf-8')
df['Tweet'] = df['Tweet'].str.replace("’", "'")  # normalize right curly apostrophes
df['Tweet'] = df['Tweet'].str.replace("‘", "'")  # normalize left curly apostrophes
df['Tweet'] = df['Tweet'].str.replace('"', '')  # drop double quotes
df['Tweet'] = df['Tweet'].str.replace("'", "")  # drop apostrophes
df['Tweet'] = df['Tweet'].str.replace('\n', '..')  # for newlines
df['Tweet'] = df['Tweet'].str.replace('&', 'and')  # reducing punctuation
df['Tweet'] = df['Tweet'].str.replace(',', '')  # getting rid of commas


# Replacing emojis:
# delimiters adds whitespace before and after the converted emoji name
df['Tweet'] = df['Tweet'].apply(lambda tweet: emoji.demojize(tweet, delimiters=(" ", " ")))
df['Tweet']

0        bout time!!!!!  raised_fist_medium_skin_tone  ...
1        not thrilled that the dog wanted to walk so ea...
2        get up and get your butt on the ride! #loveyou...
3        i am surrounded by internalised homophobia. up...
4                              @peacexxanna your nightmare
                               ...                        
57288    hate is an acid that decays its own container....
57289    making progress on the #yogaroom....7 boxes of...
57290    i need this again pleading_face  @euphoriahbo ...
57291    always add a side of bacon.....commitment is a...
57292    in a world that yearns for positivity and happ...
Name: Tweet, Length: 57293, dtype: object

### Second Preprocessing step

* Expand contractions and slang into their complete form. For example, I've => I have, It's => It is
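
As a rough illustration of this kind of dictionary lookup, here is a minimal sketch. The two-entry table `toy_expansions` and the function name `expand_words` are hypothetical; the real dictionaries are loaded from pickle files and their exact keys are not shown in this notebook.

```python
# Hypothetical two-entry lookup table, assumed for this sketch only
toy_expansions = {"ive": "i have", "cant": "can not"}


def expand_words(tweet, mapping):
    """Replace every whitespace-separated word that appears in `mapping`."""
    return " ".join(mapping.get(word, word) for word in tweet.split())


print(expand_words("ive told you i cant", toy_expansions))
# "ive," is not a key in the table, so it is left unchanged
print(expand_words("ive, told you", toy_expansions))
```

Because the lookup works on whole whitespace-separated tokens and is case-sensitive, a word with attached punctuation or different casing is not expanded.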

In [1]:

# Extra replacements => https://github.com/charlesmalafosse/FastText-sentiment-analysis-for-tweets/blob/master/betsentiment_sentiment_analysis_fasttext.py
with open('extra1.p', 'rb') as pkl_file:
    extra = pickle.load(pkl_file)

# Slang dictionary => http://pydoc.net/ekphrasis/0.4.7/ekphrasis.dicts.noslang.slangdict/
with open('slang.p', 'rb') as pkl_file:
    slang = pickle.load(pkl_file)


def check_1(tweet):
    # For extra: replace each whitespace-separated word found in the dictionary
    reformed = [extra.get(word, word) for word in tweet.split()]
    tweet = " ".join(reformed)
    # For slang: same lookup on the result
    reformed = [slang.get(word, word) for word in tweet.split()]
    return " ".join(reformed)


df['Tweet'] = df['Tweet'].apply(check_1)



### Third Preprocessing step


**Ekphrasis Pipeline**: (https://github.com/cbaziotis/ekphrasis)

* Normalize values such as emails, URLs and dates, since their exact content is irrelevant (we only need their type)
* Annotate some values (not applied to hashtags)
* Fix HTML tokens that might have escaped the earlier steps
* Segment some words: for example retrogaming => retro gaming (based on Twitter vocabulary)
* Correct the spelling of some words (based on Twitter)
* Do not perform segmentation on hashtags
* Unpack some extra words such as can't => can not
* Tokenize and then rejoin to perform the operations
* Use dictionaries to replace words after tokenizing
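
To make the normalization idea concrete, here is a simplified stand-in written with plain regular expressions. It is not how ekphrasis implements this step (the library's expressions are far more thorough); the patterns and the function name `normalize_tokens` are assumptions chosen for illustration.

```python
import re

# Deliberately simple patterns, assumed for this sketch only
URL_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")


def normalize_tokens(tweet):
    """Replace URLs and user mentions with a placeholder naming their type."""
    tweet = URL_RE.sub("<url>", tweet)
    return USER_RE.sub("<user>", tweet)


print(normalize_tokens("@peacexxanna look at https://example.com now"))
```

The model then sees only that a URL or a mention was present, not its specific value.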



In [6]:
# Ekphrasis pipeline

text_processor = TextPreProcessor(
    # terms that will be normalized
    normalize = ['url', 'email', 'percent', 'money', 'phone', 'user', 'time', 'date', 'number'],
    
    # terms that will be annotated =>flagged
    annotate = {"allcaps", "elongated", "repeated",'emphasis', 'censored','hashtags'},
    fix_html = True,  # fix HTML tokens
    
    # corpus from which the word statistics are going to be used for word segmentation 
    segmenter = "twitter", 
    
    # corpus from which the word statistics are going to be used for spell correction
    corrector = "twitter", 
    
    unpack_hashtags = True,  # perform word segmentation on hashtags <-removes the hashtag symbol and treats it as a word
    unpack_contractions = True,  # Unpack contractions (can't -> can not)
    spell_correct_elong = False,  # spell correction for elongated words
    
    
    
    # Tokenizes and then rejoins while getting rid of some terms.
    # Set hashtags to True to keep hashtags; other token types can be kept the same way.
    # See the kwargs documented in:
    # https://github.com/cbaziotis/ekphrasis/blob/master/ekphrasis/classes/tokenizer.py
    tokenizer = SocialTokenizer(lowercase = True , hashtags = True , emojis = True).tokenize,
    
    # list of dictionaries, for replacing tokens extracted from the text
    # with other expressions. slang is a dictionary created and saved as a pickle;
    # only the emoticons dictionary is passed here.
    # documentation for dictionaries: http://pydoc.net/ekphrasis/0.4.7/ekphrasis.dicts.emoticons/
    dicts = [emoticons]
)


Reading twitter - 1grams ...
Reading twitter - 2grams ...
Reading twitter - 1grams ...



In [7]:
# Applying the pipeline: pre_process_doc returns a list of tokens, which is rejoined into one string
df['Tweet'] = df['Tweet'].apply(lambda tweet: " ".join(text_processor.pre_process_doc(tweet)))

### Some extra steps:
* Replace repeated < user > or < url > tags with a single keyword
* For example < user > < user > < user > < user > => < user >
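
The collapsing relies on `itertools.groupby`, which groups runs of equal neighbouring items. The sketch below uses a made-up token list to show that keeping one key per run collapses every consecutive duplicate, followed by a variant that restricts the collapsing to the two tags. The names `collapsed_all`, `collapsed_tags` and `TAGS` are illustrative only.

```python
import itertools

tokens = "<user> <user> <user> so so good <url> <url>".split()

# groupby yields one (key, run) pair per run of equal neighbours,
# so keeping only the keys collapses every consecutive duplicate
collapsed_all = [word for word, _ in itertools.groupby(tokens)]

# Variant that collapses only the placeholder tags and keeps other repeats
TAGS = {"<user>", "<url>"}
collapsed_tags = []
for word, run in itertools.groupby(tokens):
    collapsed_tags.extend([word] if word in TAGS else run)

print(" ".join(collapsed_all))
print(" ".join(collapsed_tags))
```

The first form is shorter, but it also merges ordinary repeated words such as "so so", which may or may not be desirable for emotion classification.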

In [8]:
# Collapse runs of repeated <user> and <url> tags
# Example: <user> <user> <user> <user> -> <user>
def repeated(tweet):
    if '<user>' not in tweet and '<url>' not in tweet:
        return tweet
    cleaned_words = []
    for word, run in itertools.groupby(tweet.split()):
        # Only runs of tags are collapsed; other repeated words are kept as they are
        cleaned_words.extend([word] if word in ('<user>', '<url>') else run)
    return " ".join(cleaned_words)

df['Tweet'] = df['Tweet'].apply(repeated)
df = df[1:]  # drop the first row

### Dataset is ready

In [10]:
df.to_csv('D:/Big Data/project/final sets/final1.csv')