# 5. Feature Engineering and Feature Selection

In [1]:
import pandas as pd
from collections import Counter
import matplotlib.pyplot as plt
import warnings
import numpy as np

warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', None)

In [2]:
#na_filter set to False as otherwise empty strings are interpreted as NaN
#raw string so the backslashes in the Windows path are not read as escape sequences
df_tweets_cleaned = pd.read_csv(r'..\data\Tweets_cleaned.csv', encoding='utf-8', na_filter=False)

## Feature Engineering
Let's generate some features we could possibly use. Some features, such as `emojis_flag`, `emoticons_flag`, and `hashtags_flag`, are already generated. Below are the features we are engineering:

1. `emojis_num` denotes the number of emojis used in a tweet.
2. `emoticons_num` denotes the number of emoticons used in a tweet.
3. `hashtags_num` denotes the number of hashtags used in a tweet.
4. `numbers_flag` denotes whether the tweet contains numbers (written either as Arabic numerals or as English words).
5. `numbers_num` denotes how many numbers a tweet contains.
   We noticed that numbers (hours, times, dollar amounts, flight numbers, etc.) were used in quite a few negative tweets. This is why we are generating both a binary flag and a count of the numbers used in a tweet.
6. `char_length_original` denotes the length of the user's original tweet. This includes everything (@ mentions, RT retweets, hyperlinks, etc.).
7. `char_length_user` denotes the length of the user's cleaned tweet. The length is based on the column `text_cleaned`.
   We also noticed that negative tweets were, on average, longer than positive tweets in terms of character length.
8. `mentions_num` denotes the number of @ mentions a tweet has.
9. `retweet_flag` denotes whether the user's tweet retweets another tweet (normally one from an airline, rarely one from another user).
   There is no need to create a count for retweets in a user's tweet because it is always 1.
10. `http_flag` denotes whether the user's tweet has an HTTP link.
    There is no need to create a count for HTTP links in a user's tweet because it is always 1 too.

The True/False `_flag` columns will need to be converted into binary flags (i.e. True/False into 1/0).
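
As a minimal sketch of that conversion, the snippet below casts every `_flag` column from booleans to 0/1 in one step. The toy DataFrame and its values are made up for illustration, and the sketch assumes the flag columns hold real booleans rather than the strings `"True"`/`"False"`.

```python
import pandas as pd

# Toy frame standing in for df_tweets_cleaned; the values are invented.
df = pd.DataFrame({
    "emojis_flag": [True, False, True],
    "hashtags_flag": [False, False, True],
    "emojis_num": [2, 0, 1],
})

# Select every column whose name ends in "_flag" and cast booleans to 0/1.
flag_columns = [col for col in df.columns if col.endswith("_flag")]
df[flag_columns] = df[flag_columns].astype(int)

print(df)
```

Selecting by the `_flag` suffix means any flag column added later is picked up without editing the list by hand.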

Any of the `_num` columns will likely need to be rescaled to the range 0 to 1. For the `char_length_original` column, all tweets are from 2015, when tweets were limited to 140 characters, so we can simply divide `char_length_original` by 140.
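
One possible way to do this rescaling is sketched below. Min-max scaling is an assumption here (the text only says "0 to 1"), and the helper `min_max_scale`, the `_scaled` column names, and the sample values are hypothetical.

```python
import pandas as pd

# Toy frame standing in for df_tweets_cleaned; the values are invented.
df = pd.DataFrame({
    "hashtags_num": [0, 2, 4],
    "char_length_original": [70, 140, 35],
})

def min_max_scale(series):
    """Rescale a numeric Series to [0, 1]; a constant column maps to all zeros."""
    span = series.max() - series.min()
    if span == 0:
        return series * 0.0
    return (series - series.min()) / span

# Count columns: min-max scaling (an assumed choice of scaler).
df["hashtags_num_scaled"] = min_max_scale(df["hashtags_num"])

# Character length: divide by the 140-character limit; clip in case any
# row turns out to be longer than the limit.
TWEET_CHAR_LIMIT = 140
df["char_length_original_scaled"] = (
    df["char_length_original"] / TWEET_CHAR_LIMIT
).clip(upper=1.0)

print(df)
```

If the data is later split into training and test sets, the minimum and maximum would normally be taken from the training set only and reused on the test set.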

We will also need to vectorize the words in the tweets. There are several ways to do this: we could use `word2vec`, `emoji2vec`, or a combination of the two called `phrase2vec`.

Lastly, we will need to convert `airline_sentiment` into 0 or 1. Because we care about classifying negative-sentiment tweets, and not about whether a tweet is positive or neutral, we group the positive and neutral tweets as `non-negative`. All `non-negative` tweets are class `0`, whereas all `negative` tweets are class `1`.
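
The grouping described above could be sketched as follows. The target column name `label` is a hypothetical choice, and the sample rows are made up.

```python
import pandas as pd

# Toy frame standing in for df_tweets_cleaned; the values are invented.
df = pd.DataFrame({
    "airline_sentiment": ["negative", "neutral", "positive", "negative"],
})

# negative -> 1; positive and neutral (the "non-negative" group) -> 0.
df["label"] = (df["airline_sentiment"] == "negative").astype(int)

print(df)
```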

### Generate columns `emojis_num`, `emoticons_num`, and `hashtags_num`
Generate the basic count features `emojis_num`, `emoticons_num`, and `hashtags_num` from the existing `emojis`, `emoticons`, and `hashtags` columns.

In [3]:
#creates emojis_num column
def create_emojis_num(df):
    df['emojis_num'] = 0 
    
    for i, row in df.iterrows():    
        if df.at[i, 'emojis_flag']:
            tweet_emojis = df.at[i, 'emojis']
            #strip brackets, quote, and spaces
            tweet_emojis_list = list(tweet_emojis.strip('[]').replace("\'", "").strip().split(","))
            emoji_counter = 0
            
            for emoji in tweet_emojis_list:           
                emoji_counter = emoji_counter + 1
            
            df.at[i, 'emojis_num'] = emoji_counter
        else:
            df.at[i, 'emojis_num'] = 0
            
    return df

#creates emoticons_num column
def create_emoticons_num(df):
    df['emoticons_num'] = 0 
    
    for i, row in df.iterrows():    
        if df.at[i, 'emoticons_flag']:
            tweet_emoticons = df.at[i, 'emoticons']
            #strip brackets, quote, and spaces
            tweet_emoticons_list = list(tweet_emoticons.strip('[]').replace("\'", "").strip().split(","))
            emoticons_counter = 0
            
            for emoticon in tweet_emoticons_list:            
                emoticons_counter = emoticons_counter + 1
            
            df.at[i, 'emoticons_num'] = emoticons_counter
        else:
            df.at[i, 'emoticons_num'] = 0
            
    return df

#creates hashtags_num column
def create_hashtags_num(df):
    df['hashtags_num'] = 0 
    
    for i, row in df.iterrows():    
        if df.at[i, 'hashtags_flag']:
            tweet_hashtags = df.at[i, 'hashtags']
            #strip brackets, quote, and spaces
            tweet_hashtags_list = list(tweet_hashtags.strip('[]').replace("\'", "").strip().split(","))
            hashtags_counter = 0
            
            for hashtag in tweet_hashtags_list:            
                hashtags_counter = hashtags_counter + 1
            
            df.at[i, 'hashtags_num'] = hashtags_counter
        else:
            df.at[i, 'hashtags_num'] = 0
            
    return df

In [4]:
df_tweets_cleaned = create_emojis_num(df_tweets_cleaned)
df_tweets_cleaned = create_emoticons_num(df_tweets_cleaned)
df_tweets_cleaned = create_hashtags_num(df_tweets_cleaned)
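
The three functions above share the same counting logic, so a more compact variant is possible. The sketch below is one option, assuming the list columns are stored as strings such as `"['#late', '#never']"` (as they appear after reading the CSV) and that no item contains a comma. The helper name `count_list_items` is hypothetical.

```python
import pandas as pd

def count_list_items(list_string):
    """Count comma-separated items in a string such as "['a', 'b']"."""
    inner = list_string.strip("[]").strip()
    if not inner:
        # An empty list "[]" has no items.
        return 0
    return len(inner.split(","))

# Toy frame standing in for df_tweets_cleaned; the values are invented.
df = pd.DataFrame({"hashtags": ["[]", "['#fail']", "['#late', '#never']"]})
df["hashtags_num"] = df["hashtags"].apply(count_list_items)

print(df)
```

Because an empty list yields 0 directly, this version does not need to consult the corresponding `_flag` column.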

In [5]:
#df_tweets_cleaned.loc[df_tweets_cleaned['hashtags_flag'] == True]

### Generate columns `numbers_flag`, `numbers_num`
Generate a binary flag and a count of how many times numbers were used in a tweet. Numbers can be written either as digits or as English words. spaCy sometimes treats English number words as stop words (e.g. "twelve" is a stop word in tweet `568911315026063361`, but "thirty" is not in tweet `568237684277141504`), so some of them were removed from `lemmas_list`. We therefore generate the number features from the column `text_cleaned_no_abbreviations`, using the spaCy token attribute `like_num` to determine which tokens resemble numbers.

In [6]:
df_tweets_cleaned.loc[df_tweets_cleaned['tweet_id'] == 568911315026063361]

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,text,text_cleaned,text_cleaned_time_removed,emojis_flag,emojis,emoticons_flag,emoticons,text_cleaned_without_emojis_emoticons,hashtags,text_cleaned_without_emojis_emoticons_hashtags,hashtags_flag,text_cleaned_lower_case,text_cleaned_no_abbreviations,text_list_no_stop_words,lemmas_list,emojis_num,emoticons_num,hashtags_num
2767,568911315026063361,negative,1.0,Late Flight,1.0,United,@united 5.5 hours Late Flightr I've been in tr...,5.5 hours Late Flightr I've been in transit f...,5.5 hours Late Flightr I've been in transit f...,False,[],False,[],5.5 hours Late Flightr I've been in transit f...,[],5.5 hours Late Flightr I've been in transit f...,False,5.5 hours late flightr i've been in transit f...,5.5 hours late flightr i've been in transit f...,hours late flightr transit total hours change ...,hour late flightr transit total hour change pl...,0,0,0


In [7]:
#load spacy model
import spacy

nlp = spacy.load('en_core_web_md')

In [8]:
#this function will create the columns numbers_flag and numbers_num
def create_numbers_columns(df):
    df['numbers_flag'] = False
    df['numbers_num'] = 0
    
    for i, row in df.iterrows():   
        if i % 1000 == 0:
            print('at row number: ' + str(i))

        text = df.at[i, 'text_cleaned_no_abbreviations']
        #print(type(text))

        like_num_count = 0
        
        #tokenize text into list of tokens
        #print(text)
        
        token_list = nlp(text)

        #iterate through our tokens and count the number of nums
        for token in token_list:
            #print(token)
            if token.like_num:
                like_num_count = like_num_count + 1

        #at the end, we set our new columns
        if like_num_count != 0:
            df.at[i, 'numbers_flag'] = True
            df.at[i, 'numbers_num'] = like_num_count            
    
    return df

In [9]:
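#Sanity check on a single tweet; .copy() so the function does not write into a slice of the full DataFrame
create_numbers_columns(df_tweets_cleaned.loc[df_tweets_cleaned['tweet_id'] == 568911315026063361].copy())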
#create_numbers_columns(df_tweets_cleaned.loc[df_tweets_cleaned['tweet_id'] == 570093964059156481])

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,text,text_cleaned,text_cleaned_time_removed,emojis_flag,emojis,emoticons_flag,emoticons,text_cleaned_without_emojis_emoticons,hashtags,text_cleaned_without_emojis_emoticons_hashtags,hashtags_flag,text_cleaned_lower_case,text_cleaned_no_abbreviations,text_list_no_stop_words,lemmas_list,emojis_num,emoticons_num,hashtags_num,numbers_flag,numbers_num
2767,568911315026063361,negative,1.0,Late Flight,1.0,United,@united 5.5 hours Late Flightr I've been in tr...,5.5 hours Late Flightr I've been in transit f...,5.5 hours Late Flightr I've been in transit f...,False,[],False,[],5.5 hours Late Flightr I've been in transit f...,[],5.5 hours Late Flightr I've been in transit f...,False,5.5 hours late flightr i've been in transit f...,5.5 hours late flightr i've been in transit f...,hours late flightr transit total hours change ...,hour late flightr transit total hour change pl...,0,0,0,True,3


In [10]:
df_tweets_cleaned = create_numbers_columns(df_tweets_cleaned)

at row number: 0
at row number: 1000
at row number: 2000
at row number: 3000
at row number: 4000
at row number: 5000
at row number: 6000
at row number: 7000
at row number: 8000
at row number: 9000
at row number: 10000
at row number: 11000
at row number: 12000
at row number: 13000
at row number: 14000


In [11]:
#df_tweets_cleaned.loc[df_tweets_cleaned['numbers_flag'] == True]

### Generate columns `char_length_original`, `char_length_user`
Generate columns with the number of characters in original tweet, and cleaned tweet from column `text_cleaned`.

In [12]:
#this function will create the columns char_length_original and char_length_user
#note: the cleaned length is measured on text_cleaned_no_abbreviations
def create_char_length_columns(df):
    df['char_length_original'] = 0
    df['char_length_user'] = 0
    
    for i, row in df.iterrows():
        text = df.at[i, 'text']
        cleaned_text = df.at[i, 'text_cleaned_no_abbreviations']
        
        df.at[i, 'char_length_original'] = len(text)
        df.at[i, 'char_length_user'] = len(cleaned_text)
    
    return df

In [13]:
df_tweets_cleaned = create_char_length_columns(df_tweets_cleaned)
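
For reference, the same lengths can be computed without a row loop using the pandas string accessor. This is only a sketch on made-up rows, using the same two source columns as the function above.

```python
import pandas as pd

# Toy frame standing in for df_tweets_cleaned; the values are invented.
df = pd.DataFrame({
    "text": ["@united late again", "thanks!"],
    "text_cleaned_no_abbreviations": ["late again", "thanks!"],
})

# Series.str.len() returns the character count of each string.
df["char_length_original"] = df["text"].str.len()
df["char_length_user"] = df["text_cleaned_no_abbreviations"].str.len()

print(df)
```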

### Generate columns `mentions_num`, `retweet_flag`, and `http_flag`
Generate the columns `mentions_num` (number of mentions in a tweet), `retweet_flag` (whether a tweet contains a retweet), and `http_flag` (whether a tweet contains an HTTP link).

In [25]:
import re

#this function will create mentions_num column
def create_mentions_num(df):
    df['mentions_num'] = 0
    
    for i, row in df.iterrows():
        text = df.at[i, 'text']
        #'@' followed by at least one word character (\w already covers digits)
        regex_to_find = r'@\w+'
        
        regex_hits_list = re.findall(regex_to_find, text)
        df.at[i, 'mentions_num'] = len(regex_hits_list)
    
    return df

#this function will create retweet_flag column
def create_retweet_flag(df):
    df['retweet_flag'] = False
    
    for i, row in df.iterrows():
        text = df.at[i, 'text']
        regex_to_find = r'RT \@.*'
        
        regex_hits_list = re.findall(regex_to_find, text)
        if (len(regex_hits_list) != 0):
            df.at[i, 'retweet_flag'] = True
    
    return df

#this function will create http_flag column
def create_http_flag(df):
    df['http_flag'] = False
    
    for i, row in df.iterrows():
        text = df.at[i, 'text']
        #'s?' matches http or https; 's*' would also accept 'httpsss'
        regex_to_find = r'https?://[^\s]*'
        
        regex_hits_list = re.findall(regex_to_find, text)
        if (len(regex_hits_list) != 0):
            df.at[i, 'http_flag'] = True
    
    return df

In [28]:
df_tweets_cleaned = create_mentions_num(df_tweets_cleaned)
df_tweets_cleaned = create_retweet_flag(df_tweets_cleaned)
df_tweets_cleaned = create_http_flag(df_tweets_cleaned)
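
To see what the three features look like on a single tweet, here is a small standalone sketch using only the standard library. The sample tweet is made up, the function name `extract_tweet_features` is hypothetical, and the patterns are this sketch's own choices rather than necessarily the exact ones used in the functions above.

```python
import re

# Patterns chosen for this sketch (assumptions, see the lead-in).
MENTION_PATTERN = re.compile(r"@\w+")        # "@" plus at least one word character
RETWEET_PATTERN = re.compile(r"RT @")        # retweet marker followed by a handle
LINK_PATTERN = re.compile(r"https?://\S+")   # http or https link

def extract_tweet_features(text):
    """Return the mention count and the retweet/link flags for one tweet."""
    return {
        "mentions_num": len(MENTION_PATTERN.findall(text)),
        "retweet_flag": RETWEET_PATTERN.search(text) is not None,
        "http_flag": LINK_PATTERN.search(text) is not None,
    }

# Invented example tweet, not taken from the dataset.
sample = "@united thanks! RT @united: sorry for the delay http://t.co/abc123"
features = extract_tweet_features(sample)
print(features)
```

Compiling the patterns once at module level avoids rebuilding them for every row when the function is applied to the whole column.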
