# Data preparation

For machine learning methods, we usually need to convert each record of the dataset into a numeric vector in which every dimension represents a different feature. In our case, we need a good way of encoding a given tweet in that form. There are several ways to do this, but first let's dive into the dataset and see whether we can improve the quality of the messages we have.

As humans, we read a word the same way whether it is written in capital or small letters, but to a computer these are completely different strings. We probably need to apply several other corrections as well, since our dataset was written by random people who often don't care about grammatical correctness. Let's analyze the dataset and see if we can correct some common issues.
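
To make the case issue concrete, here is a minimal sketch that uses only the standard library. The two sample messages are made up for illustration and are not taken from the dataset. It counts distinct tokens with and without lowercasing.

```python
from collections import Counter

# Hypothetical sample messages, not taken from the dataset
samples = ["Late flight again", "late Flight AGAIN"]

# Count tokens exactly as written: every casing variant is a separate entry
raw_counts = Counter(word for s in samples for word in s.split())

# Count tokens after lowercasing: the casing variants collapse into one entry
lower_counts = Counter(word for s in samples for word in s.lower().split())

print(len(raw_counts))    # distinct tokens before lowercasing
print(len(lower_counts))  # distinct tokens after lowercasing
```

In this toy input every word appears in two casings, so lowercasing halves the number of distinct tokens.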

In [2]:
%store -r tw

As a first step, we are going to analyze the frequencies of all the words.

In [4]:
# Split each tweet into its words
tw_words = tw["text"].str.split()
tw_words.head()

0             [@VirginAmerica, What, @dhepburn, said.]
1    [@VirginAmerica, plus, you've, added, commerci...
2    [@VirginAmerica, I, didn't, today..., Must, me...
3    [@VirginAmerica, it's, really, aggressive, to,...
4    [@VirginAmerica, and, it's, a, really, big, ba...
Name: text, dtype: object

For each tweet we received a list of its words, but to analyze global frequencies, we need to combine all the lists together.

In [8]:
import pandas as pd

# Flatten all the lists into a single one-column DataFrame of words
grp_words = tw_words.explode()\
                    .dropna()\
                    .reset_index(drop=True)\
                    .to_frame(name="word")
grp_words.head(20)

Unnamed: 0,word
0,@VirginAmerica
1,What
2,@dhepburn
3,said.
4,@VirginAmerica
5,plus
6,you've
7,added
8,commercials
9,to
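
The flattening step can be tried on a toy input. This sketch uses `Series.explode`, assuming pandas 0.25 or newer, on a made-up Series that stands in for `tw_words`; the name `toy_words` is only illustrative.

```python
import pandas as pd

# Hypothetical stand-in for tw_words: one list of tokens per tweet
toy_words = pd.Series([["@airline", "thanks"], ["@airline", "late", "again"]])

# explode puts every list element on its own row;
# dropna guards against tweets that produced an empty list
flat = toy_words.explode().dropna().reset_index(drop=True)

print(flat.tolist())
```

Each token ends up on its own row, in the order in which it appeared in the tweets.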


In [9]:
grp_words.groupby("word")\
     .size()\
     .reset_index(name="count")

Unnamed: 0,word,count
0,!,48
1,!!,22
2,!!!,11
3,!!!!,3
4,!!!!!,3
...,...,...
30100,🙏,7
30101,🙏🙏🙏😢😢😢🙏🙏🙏,1
30102,🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏😢😢😢😢😢😢😢😢🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏🙏,1
30103,🚫,1
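
As a possible alternative to `groupby(...).size()`, the frequencies could also be computed with `Series.value_counts`, which sorts the result so that the most frequent tokens come first. The sketch below runs on a made-up list of tokens, not on the real data.

```python
import pandas as pd

# Hypothetical flat list of tokens, standing in for grp_words["word"]
toy_tokens = pd.Series(["!", "late", "!!", "late", "!", "late"], name="word")

# value_counts groups equal tokens and sorts them by frequency, highest first
counts = toy_tokens.value_counts()

print(counts.head(3))
```

Note that `!` and `!!` are still counted as two different tokens here, just like in the output above.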


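The raw counts show that runs of punctuation such as !, !! and !!!, as well as long emoji sequences, end up as separate dictionary entries. Let's preprocess each row to normalize them.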

In [10]:
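import re

# Any character outside the Basic Multilingual Plane, which covers most emoji
EMOJI_REGEX = re.compile("([\U00010000-\U0010ffff])", re.UNICODE)
# A non-alphanumeric character repeated two or more times in a row
DUPLICATED_SYMBOL_REGEX = re.compile(r"([^a-z0-9])\1+", re.UNICODE | re.I)
# Common punctuation marks and brackets
PUNCTUATION_REGEX = re.compile(r"([,\.\!\?\[\]\(\)])", re.UNICODE)


def preprocess_text(raw_text):
    # Lowercase everything so that "Flight" and "flight" become the same word
    text = raw_text.lower()
    # Drop hashtag and mention markers, keeping the word itself
    text = text.replace("#", "")
    text = text.replace("@", "")
    # Put a space after each emoji so that it becomes a separate token
    text = EMOJI_REGEX.sub(r"\1 ", text)
    # Remove quotation marks and apostrophes
    text = text.replace("\"", "")
    text = text.replace("'", "")
    # Surround punctuation marks with spaces so they become separate tokens
    text = PUNCTUATION_REGEX.sub(r" \1 ", text)
    # Collapse runs of the same non-alphanumeric character, including the
    # extra spaces introduced above, into a single character
    text = DUPLICATED_SYMBOL_REGEX.sub(r"\1", text)
    # Return preprocessed value
    return text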
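
To see what the individual steps do, the sketch below applies the same three patterns to a made-up tweet. The patterns are repeated under lowercase, purely illustrative names so that the snippet runs on its own; the tweet is not taken from the dataset.

```python
import re

# Same patterns as in the cell above, repeated so this sketch is self-contained
emoji_re = re.compile("([\U00010000-\U0010ffff])")
punctuation_re = re.compile(r"([,\.\!\?\[\]\(\)])")
repeated_symbol_re = re.compile(r"([^a-z0-9])\1+", re.I)

# Hypothetical tweet, not taken from the dataset
text = "@SomeAirline Why???  So LATE!! 🙏🙏"

# Lowercase and drop the mention marker
text = text.lower().replace("@", "")
# Separate emoji and punctuation marks from the neighbouring characters
text = emoji_re.sub(r"\1 ", text)
text = punctuation_re.sub(r" \1 ", text)
# Collapse the runs of spaces created by the two substitutions above
text = repeated_symbol_re.sub(r"\1", text)

tokens = text.split()
print(tokens)
```

Because the punctuation marks are surrounded by spaces before the repeated characters are collapsed, a run such as `???` becomes three separate `?` tokens rather than a single one. This would explain why the count of `!` grows so much after preprocessing.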

We now have a simple preprocessing function, so let's see how it affects the dictionary.

In [11]:
# Preprocess each tweet first, then split it into its words
tw_words = tw["text"].apply(preprocess_text).str.split()
# Flatten all the lists into a single one-column DataFrame of words
grp_words = tw_words.explode()\
                    .dropna()\
                    .reset_index(drop=True)\
                    .to_frame(name="word")
words_occ = grp_words.groupby("word").size().reset_index(name="count")
words_occ

Unnamed: 0,word,count
0,!,5312
1,$,47
2,$&amp;,1
3,$+,1
4,$0,3
...,...,...
17156,🙌,6
17157,🙏,119
17158,🚪,1
17159,🚫,1


We've successfully reduced the size of our dictionary from 30104 to 17160 distinct words. The next step is to analyze the words that occur only once, in order to recognize further issues such as spelling mistakes.

In [12]:
words_occ[words_occ["count"] == 1]

Unnamed: 0,word,count
2,$&amp;,1
3,$+,1
9,$1000cost-,1
10,$1038,1
11,$1051,1
...,...,...
17150,😵,1
17154,🙈,1
17155,🙉,1
17158,🚪,1


It seems we have a lot of similar entries, for instance words starting with a dollar sign. Let's group the words by their first character and see if there is something we can correct in the data.

In [13]:
words_occ[words_occ["count"] == 1]\
    .groupby(lambda idx: words_occ["word"][idx][0])["word"]\
    .apply(list)\
    .to_frame()

Unnamed: 0,word
$,"[$&amp;, $+, $1000cost-, $1038, $1051, $10vouc..."
%,[%]
&,"[&amp;$250, &amp;&amp;, &amp;feel, &amp;only, ..."
*,"[*alliance, *any, *anything*, *bops, *cough*, ..."
+,"[+$400/ticket, +-10pm, +1-703-464-0200, +20min..."
...,...
😵,[😵]
🙈,[🙈]
🙉,[🙉]
🚪,[🚪]
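
The same grouping could also be written without a lambda, by passing the first character of each word, taken with the `.str` accessor, as the grouping key. This is only a sketch on a made-up miniature of `words_occ`; the counts are not real.

```python
import pandas as pd

# Hypothetical miniature of words_occ, not real counts from the dataset
toy_occ = pd.DataFrame({
    "word": ["$5", "$fee", "*any", "late", "lost"],
    "count": [1, 1, 1, 7, 1],
})

# Keep only the words that occur exactly once
singletons = toy_occ[toy_occ["count"] == 1]

# Group by the first character of each word and collect the words into lists
by_first_char = singletons.groupby(singletons["word"].str[0])["word"].apply(list)

print(by_first_char.to_dict())
```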


From a human point of view, some punctuation marks carry useful information about the meaning of a sentence, so we are going to keep them.

## Exercise

As we can see, there are some more issues with the data, for instance:
- HTML entities are left encoded (`<` as `&lt;`, `>` as `&gt;`, etc.)
- leading special characters, such as `❤️from`, `:arrived`, `=we`, `/dying`, `*any`

The goal of this exercise is to review the dictionary once more, find further issues, and add corrections for all of them to our **preprocess_text** function. The source code can be found in *exercise/exercise_01.py*. Please modify the file with your changes before going further.
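
As a starting hint rather than a full solution, the two issues listed above could be handled per token roughly as follows. The helper name `clean_token` and the exact set of characters to strip are assumptions made for this sketch; `html.unescape` comes from the Python standard library.

```python
import html
import re

# Hypothetical pattern: special characters directly in front of a word.
# "$" is excluded on purpose so that amounts such as "$1038" stay intact,
# and the lookahead keeps standalone punctuation tokens such as "!" untouched.
LEADING_SYMBOLS = re.compile(r"^[^\w\s$]+(?=\w)")


def clean_token(token):
    # Decode HTML entities such as &amp; or &lt; back into plain characters
    token = html.unescape(token)
    # Drop special characters that are glued to the front of a word
    return LEADING_SYMBOLS.sub("", token)


for sample in ["&amp;feel", "*any", ":arrived", "$1038", "!"]:
    print(sample, "->", clean_token(sample))
```

Whether a leading symbol should be dropped or kept as a separate token is a design choice; the sketch simply drops it.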

In [14]:
%run exercise/exercise_01.py

In [15]:
import pandas as pd

# Preprocess each tweet first, then split it into its words
tw_words = tw["text"].apply(preprocess_text).str.split()
# Flatten all the lists into a single one-column DataFrame of words
grp_words = tw_words.explode()\
                    .dropna()\
                    .reset_index(drop=True)\
                    .to_frame(name="word")
words_occ = grp_words.groupby("word")\
                        .size()\
                        .reset_index(name="count")
words_occ.sort_values("count", ascending=False)

Unnamed: 0,word,count
237,.,19104
15270,to,8644
15045,the,6055
9064,i,5407
0,!,5312
...,...,...
10165,limits,1
10166,lin,1
10167,lindaswc,1
10168,lindsay,1


In [16]:
words_occ[words_occ["count"] == 1]\
    .groupby(lambda idx: words_occ["word"][idx][0])["word"]\
    .apply(list)\
    .to_frame()

Unnamed: 0,word
$,"[$&amp;, $+, $1000cost-, $1038, $1051, $10vouc..."
%,[%]
&,"[&amp;$250, &amp;&amp;, &amp;feel, &amp;only, ..."
*,"[*alliance, *any, *anything*, *bops, *cough*, ..."
+,"[+$400/ticket, +-10pm, +1-703-464-0200, +20min..."
...,...
😵,[😵]
🙈,[🙈]
🙉,[🙉]
🚪,[🚪]
