# Data preparation

Machine learning methods usually require the dataset to be converted into numerical vectors, where each dimension represents a different feature. In our case, we need to find a good way of encoding a given tweet in such a form. There are several possible ways of doing that, but first let's dive into the dataset and see if there is anything we can do to improve the quality of the messages we have.
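
To make the idea of "one dimension per feature" concrete, here is a minimal sketch of a bag-of-words encoding, which is only one of several possible encodings and not necessarily the one used later. The messages and the `to_vector` helper are hypothetical and exist only for illustration.

```python
from collections import Counter

# Toy messages standing in for tweets (hypothetical, not taken from the dataset)
messages = ["flight was late", "late again", "great flight"]

# Vocabulary: every distinct word gets its own dimension, in sorted order
vocabulary = sorted({word for message in messages for word in message.split()})


def to_vector(message, vocabulary):
    """Encode a message as a list of per-word counts over the vocabulary."""
    counts = Counter(message.split())
    return [counts[word] for word in vocabulary]


vectors = [to_vector(message, vocabulary) for message in messages]
print(vocabulary)
print(vectors)
```

Every distinct spelling becomes its own dimension, which is why cleaning up the text first keeps the vectors small.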

As humans, we barely notice the difference between capital and lowercase letters, but a computer treats words that differ only in case as completely different ones. There are probably several other corrections we need to apply, as our dataset was written by random people who often don't care about grammatical correctness. Let's analyze the dataset and see if we can correct some common issues.

In [3]:
%store -r tweets

As a first step, we are going to analyze the frequencies of all the words.

In [2]:
# Split each tweet into a list of words
tweet_words = tweets["text"].str.split()
tweet_words.head()

0             [@VirginAmerica, What, @dhepburn, said.]
1    [@VirginAmerica, plus, you've, added, commerci...
2    [@VirginAmerica, I, didn't, today..., Must, me...
3    [@VirginAmerica, it's, really, aggressive, to,...
4    [@VirginAmerica, and, it's, a, really, big, ba...
Name: text, dtype: object

For each tweet we now have a list of its words, but to analyze global frequencies we need to combine all the lists.

In [3]:
import pandas as pd

# Chain all the lists into a single-column DataFrame
words = tweet_words.apply(pd.Series)\
                   .stack()\
                   .reset_index(drop=True)\
                   .to_frame(name="word")
words.head()

Unnamed: 0,word
0,@VirginAmerica
1,What
2,@dhepburn
3,said.
4,@VirginAmerica


In [4]:
words.groupby("word")\
     .size()\
     .reset_index(name="count")

Unnamed: 0,word,count
0,!,48
1,!!,22
2,!!!,11
3,!!!!,3
4,!!!!!,3
5,!!!!!!,1
6,!=,1
7,!?,2
8,!?!?,1
9,!Cancelled,1
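
The same frequency count can be sketched without pandas, which may make the logic easier to follow. The tweets below are made up for illustration, and the snippet uses only the standard library.

```python
from collections import Counter
from itertools import chain

# Hypothetical raw tweets, used only for illustration
raw_tweets = [
    "@VirginAmerica Late again!!!",
    "@united late AGAIN !!",
    "Late flight",
]

# Split each tweet on whitespace, then chain the per-tweet lists together
tokens = list(chain.from_iterable(tweet.split() for tweet in raw_tweets))
frequencies = Counter(tokens)

print(len(frequencies))  # number of distinct tokens
print(frequencies.most_common(3))
```

Note that `Late` and `late`, or `again!!!` and `AGAIN`, end up as separate entries, because nothing has been normalized yet.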


Our dataset contains 30105 unique words. As we can see, there are some common issues:
* lowercase and uppercase spellings of the same word are treated as different words
* as our dataset is taken from Twitter, there are a lot of hashtags and mentions of other users
* repeated emojis form separate words - 😭😭😭 and 😭😭😭😭 are completely different, even though from a human perspective they're almost the same
* some words are enclosed in quotation marks
* there are a lot of repeated exclamation marks, question marks, etc.
* spacing is inconsistent - for instance, somebody didn't put a space after a period

We need to preprocess the dataset to get rid of all these issues, as they may confuse further processing.

In [5]:
import re

# https://gist.github.com/Alex-Just/e86110836f3f93fe7932290526529cd1
EMOJI_REGEX = re.compile("([\U00010000-\U0010ffff])", re.UNICODE)
DUPLICATED_SYMBOL_REGEX = re.compile(r"([^a-z0-9])\1+", re.UNICODE | re.I)
PUNCTUATION_MARKS_REGEX = re.compile(r"([,\.\!\?\[\]\(\)])", re.UNICODE)


def preprocess_text(raw_text):
    # Convert all the letters to lowercase
    text = raw_text.lower()
    # Remove hashtag symbol and "at" for user mentions
    text = text.replace("#", "")
    text = text.replace("@", "")
    # Separate consecutive emojis with spaces
    text = EMOJI_REGEX.sub("\\1 ", text)
    # Remove quotation marks
    text = text.replace("\"", "")
    text = text.replace("'", "")
    # Fix misused spacing by surrounding punctuation marks with spaces
    text = PUNCTUATION_MARKS_REGEX.sub(" \\1 ", text)
    # Collapse runs of the same non-alphanumeric character (including
    # repeated spaces) into a single occurrence
    text = DUPLICATED_SYMBOL_REGEX.sub("\\1", text)
    # Return preprocessed value
    return text
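
To see what each step does, the function can be tried on a single made-up tweet. The sketch below repeats the patterns and steps from the cell above so that it runs on its own; the sample text is hypothetical.

```python
import re

# Same patterns and steps as preprocess_text above, repeated so that the
# snippet is self-contained
EMOJI_REGEX = re.compile("([\U00010000-\U0010ffff])")
DUPLICATED_SYMBOL_REGEX = re.compile(r"([^a-z0-9])\1+", re.I)
PUNCTUATION_MARKS_REGEX = re.compile(r"([,\.\!\?\[\]\(\)])")


def preprocess_text(raw_text):
    text = raw_text.lower()
    text = text.replace("#", "").replace("@", "")
    text = EMOJI_REGEX.sub("\\1 ", text)
    text = text.replace("\"", "").replace("'", "")
    text = PUNCTUATION_MARKS_REGEX.sub(" \\1 ", text)
    text = DUPLICATED_SYMBOL_REGEX.sub("\\1", text)
    return text


# Hypothetical tweet; \U0001F62D is the "loudly crying face" emoji
sample = '@VirginAmerica "Thanks"!!! Flight was late.Again \U0001F62D\U0001F62D #fail'
tokens = preprocess_text(sample).split()
print(tokens)
```

Each repeated exclamation mark and each emoji becomes its own token, and `late.Again` is split into `late`, `.` and `again`.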

With a simple preprocessing function in place, let's see how it affects the dictionary.

In [6]:
# Divide each tweet by its words, but perform the preprocessing first
tweet_words = tweets["text"].apply(preprocess_text).str.split()
# Chain all the lists into a single-column DataFrame
words = tweet_words.apply(pd.Series)\
                   .stack()\
                   .reset_index(drop=True)\
                   .to_frame(name="word")
words_occurences = words.groupby("word").size().reset_index(name="count")
words_occurences

Unnamed: 0,word,count
0,!,5312
1,$,47
2,$&amp;,1
3,$+,1
4,$0,3
5,$1,2
6,$10,2
7,$100,16
8,$1000,6
9,$1000cost-,1


We've successfully reduced the size of our dictionary to 17161 words. The next step is to analyze the words that occur only once, in order to recognize further issues such as misspellings.

In [7]:
words_occurences[words_occurences["count"] == 1]

Unnamed: 0,word,count
2,$&amp;,1
3,$+,1
9,$1000cost-,1
10,$1038,1
11,$1051,1
12,$10voucherwhatajoke,1
13,$1130,1
14,$12,1
15,$120,1
22,$154,1


It seems we have a lot of similar entries, for instance words starting with a dollar sign. Let's group the words by their first character and see if there is something we can correct in the data.

In [8]:
words_occurences[words_occurences["count"] == 1]\
    .groupby(lambda idx: words_occurences["word"][idx][0])["word"]\
    .apply(list)\
    .to_frame()

Unnamed: 0,word
$,"[$&amp;, $+, $1000cost-, $1038, $1051, $10vouc..."
%,[%]
&,"[&amp;$250, &amp;&amp;, &amp;feel, &amp;only, ..."
*,"[*alliance, *any, *anything*, *bops, *cough*, ..."
+,"[+$400/ticket, +-10pm, +1-703-464-0200, +20min..."
-,"[-&gt;southwestair, -0, -17, -17mph, -1st, -30..."
/,"[/dying, /i, /pbi, /ua795]"
0,"[0%, 0/3, 000ft, 000lbs, 0011, 0162389030167, ..."
1,"[1&amp;2, 1+hour, 1-15, 1-2888155964, 1-3, 1-3..."
2,"[2-, 2-1/2, 2-4, 2/10, 2/11/15, 2/13, 2/14/15,..."


From a human point of view, some of these special characters (the dollar sign, for instance) carry useful information about the meaning of a sentence, so we are going to keep them.

## Exercise

As we can see, there are some more issues with the data, for instance:
- HTML entities are left encoded (`<` as `&lt;`, `>` as `&gt;`, etc.)
- leading special characters, like ❤️from, :arrived, =we, /dying, \*any

The goal of this exercise is to review the dictionary once again to find some more issues, and to include corrections for all the problems found in our **preprocess_text** function. The source code can be found in *exercise/exercise_01.py*. Please modify the file with your changes before going further.
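
As a possible starting point, and not the reference solution, the two issues listed above could be handled roughly as follows. The `clean_token` helper, the `LEADING_SYMBOLS_REGEX` name and the set of symbols that are kept are all assumptions made for this sketch; `html.unescape` comes from the Python standard library.

```python
import html
import re

# Assumption: symbols such as $, %, &, + and - are kept, while any other
# leading character that is neither a word character nor whitespace is
# stripped. The name and the kept set are illustrative only.
LEADING_SYMBOLS_REGEX = re.compile(r"^[^\w\s$%&+\-]+")


def clean_token(token):
    """Decode HTML entities, then drop leading special characters."""
    token = html.unescape(token)
    return LEADING_SYMBOLS_REGEX.sub("", token)


for raw in ["&lt;3", ":arrived", "=we", "/dying", "*any", "$100", "1&amp;2"]:
    print(repr(raw), "->", repr(clean_token(raw)))
```

Inside **preprocess_text**, the decoding would most likely have to happen before the punctuation marks are separated, and the stripping would apply per word rather than to the whole text; both choices are left to the reader.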

In [1]:
%run exercise/exercise_01.py

In [5]:
import pandas as pd

# Divide each tweet by its words, but perform the preprocessing first
tweet_words = tweets["text"].apply(preprocess_text).str.split()
# Chain all the lists into a single-column DataFrame
words = tweet_words.apply(pd.Series)\
                   .stack()\
                   .reset_index(drop=True)\
                   .to_frame(name="word")
words_occurences = words.groupby("word")\
                        .size()\
                        .reset_index(name="count")
words_occurences.sort_values("count", ascending=False)

Unnamed: 0,word,count
17,.,19104
15116,to,8644
14891,the,6056
8875,i,5408
0,!,5312
1969,?,4678
1975,a,4478
13,",",4199
16724,you,4128
15696,united,4103


In [6]:
words_occurences[words_occurences["count"] == 1]\
    .groupby(lambda idx: words_occurences["word"][idx][0])["word"]\
    .apply(list)\
    .to_frame()

Unnamed: 0,word
$,"[$&, $+]"
%,[%]
&,[&$]
+,"[+$, +-]"
0,"[0%, 0/3, 000ft, 000lbs, 0011, 0162389030167, ..."
1,"[1&2, 1+hour, 1-15, 1-2888155964, 1-3, 1-30014..."
2,"[2-, 2-1/2, 2-4, 2/, 2/10, 2/11/15, 2/13, 2/14..."
3,"[3-1/2, 3-10, 3-5, 3-6, 3-8, 3-a, 3-yr-old, 3/..."
4,"[4-10, 4-4, 4-5, 4-50, 4-minute, 4/1/15-4/17/1..."
5,"[5%, 5-, 5-12, 5-19, 5-8pm, 5-minute, 5/17, 5/..."
