## Main points
- Data Exploration
- NLP Pre-processing

## Data:
We will use the following data (both datasets can be downloaded from the Kiro platform):
1. top_english_movies: a dataset with the top movies rated on IMDb
2. Twitter dataset: a corpus that contains text from a specific social media platform

## Sources:
- NLTK: https://www.nltk.org/
- PANDAS: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html
- SPACY: https://spacy.io/usage/spacy-101

## 1. Basic Pandas Functions

In [4]:
import string
import re
import pandas as pd
import nltk
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews

# print nltk version
nltk.__version__

'3.9.1'

Download useful functions and data from nltk.

In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('twitter_samples')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

### Pandas
Pandas is a Python library for working with data sets, in particular tabular data. It provides functions for analyzing, cleaning, exploring, and manipulating data. With Pandas you can answer questions about the data, such as: what is the average value, the maximum, or the minimum? Pandas can also delete rows that are not relevant or that contain wrong values, such as empty or NULL values; this is called cleaning the data. Pandas is usually imported under the alias pd.

One of the most important objects in Pandas is the DataFrame: a two-dimensional data structure, like a table with rows and columns. A simple way to store large data sets is a CSV (comma-separated values) file. CSV files contain plain text in a well-known format that can be read by most tools, including Excel and Pandas.
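As a minimal sketch of the operations described above, the following builds a small DataFrame, computes a few summary statistics, and drops rows with missing values. The column names and values are invented for illustration and are not part of the course data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the course datasets
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C", "Movie D"],
    "rating": [8.0, 9.0, None, 7.0],
})

# Summary statistics skip missing values (NaN) by default
print(toy_df["rating"].mean())  # average value
print(toy_df["rating"].max())   # maximum value
print(toy_df["rating"].min())   # minimum value

# Cleaning: drop every row that contains a missing value
clean_df = toy_df.dropna()
print(clean_df.shape)
```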

In [6]:
# Load DataFrame from a csv file using 'read_csv' function
movies_db = pd.read_csv('/content/top_english_movies.csv')
movies_db.head()

Unnamed: 0.1,Unnamed: 0,movie_name,movie_year,movie_rating,user_votes
0,0,The Shawshank Redemption,1994,9.3,2.8M
1,1,The Godfather,1972,9.2,2M
2,2,The Dark Knight,2008,9.0,2.8M
3,3,The Godfather Part II,1974,9.0,1.3M
4,4,12 Angry Men,1957,9.0,839K


In [7]:
# Select (display) one specific column
movies_db['movie_name']

Unnamed: 0,movie_name
0,The Shawshank Redemption
1,The Godfather
2,The Dark Knight
3,The Godfather Part II
4,12 Angry Men
...,...
245,12 Monkeys
246,Papillon
247,Blood Diamond
248,Blade Runner 2049


A Pandas Series is like a column in a table. It is a one-dimensional array holding data of any type. The built-in function type allows us to discover the type of its input.

In [8]:
type(movies_db['movie_name'])

In [9]:
movies_db.shape

(250, 5)

In [10]:
movies_db.columns

Index(['Unnamed: 0', 'movie_name', 'movie_year', 'movie_rating', 'user_votes'], dtype='object')

In [11]:
# change the name of the column
# inplace=True modifies the existing DataFrame instead of returning a new one
movies_db.rename(columns={'Unnamed: 0': 'Position'}, inplace=True)

### How to modify column values or create a new column?

In [12]:
movies_db['Position'] = movies_db['Position'] + 1
movies_db.head()

Unnamed: 0,Position,movie_name,movie_year,movie_rating,user_votes
0,1,The Shawshank Redemption,1994,9.3,2.8M
1,2,The Godfather,1972,9.2,2M
2,3,The Dark Knight,2008,9.0,2.8M
3,4,The Godfather Part II,1974,9.0,1.3M
4,5,12 Angry Men,1957,9.0,839K


In [13]:
movies_db['movie_rating_5'] = movies_db['movie_rating'] / 2
movies_db.head()

Unnamed: 0,Position,movie_name,movie_year,movie_rating,user_votes,movie_rating_5
0,1,The Shawshank Redemption,1994,9.3,2.8M,4.65
1,2,The Godfather,1972,9.2,2M,4.6
2,3,The Dark Knight,2008,9.0,2.8M,4.5
3,4,The Godfather Part II,1974,9.0,1.3M,4.5
4,5,12 Angry Men,1957,9.0,839K,4.5


In [14]:
# the dtype attribute tells us the type of the data stored in a column
movies_db['movie_year'].dtype

dtype('int64')

In [15]:
movies_db['movie_rating'].dtype

dtype('float64')

In [16]:
movies_db['user_votes'].dtype

dtype('O')

In [17]:
# we can select a subset of rows given a certain condition
movies_db[movies_db['movie_rating_5'] > 4.25]

Unnamed: 0,Position,movie_name,movie_year,movie_rating,user_votes,movie_rating_5
0,1,The Shawshank Redemption,1994,9.3,2.8M,4.65
1,2,The Godfather,1972,9.2,2M,4.6
2,3,The Dark Knight,2008,9.0,2.8M,4.5
3,4,The Godfather Part II,1974,9.0,1.3M,4.5
4,5,12 Angry Men,1957,9.0,839K,4.5
5,6,Schindler's List,1993,9.0,1.4M,4.5
6,7,The Lord of the Rings: The Return of the King,2003,9.0,1.9M,4.5
7,8,Pulp Fiction,1994,8.9,2.2M,4.45
8,9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8,2M,4.4
9,10,Forrest Gump,1994,8.8,2.2M,4.4


In [18]:
movies_db[movies_db['movie_rating_5'] >= 4.60]['movie_name']

Unnamed: 0,movie_name
0,The Shawshank Redemption
1,The Godfather


In [19]:
movies_db[movies_db['movie_rating_5'] >= 4.60][['movie_name','user_votes']]

Unnamed: 0,movie_name,user_votes
0,The Shawshank Redemption,2.8M
1,The Godfather,2M


In [20]:
movies_db.head(7)

Unnamed: 0,Position,movie_name,movie_year,movie_rating,user_votes,movie_rating_5
0,1,The Shawshank Redemption,1994,9.3,2.8M,4.65
1,2,The Godfather,1972,9.2,2M,4.6
2,3,The Dark Knight,2008,9.0,2.8M,4.5
3,4,The Godfather Part II,1974,9.0,1.3M,4.5
4,5,12 Angry Men,1957,9.0,839K,4.5
5,6,Schindler's List,1993,9.0,1.4M,4.5
6,7,The Lord of the Rings: The Return of the King,2003,9.0,1.9M,4.5


What if we want to select a specific row? It depends on the index we define.

In [21]:
# setting Position column as a new index column
movies_db.set_index('Position', inplace=True)
movies_db.head()

Unnamed: 0_level_0,movie_name,movie_year,movie_rating,user_votes,movie_rating_5
Position,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,The Shawshank Redemption,1994,9.3,2.8M,4.65
2,The Godfather,1972,9.2,2M,4.6
3,The Dark Knight,2008,9.0,2.8M,4.5
4,The Godfather Part II,1974,9.0,1.3M,4.5
5,12 Angry Men,1957,9.0,839K,4.5


In [22]:
movies_db.loc[5]

Unnamed: 0,5
movie_name,12 Angry Men
movie_year,1957
movie_rating,9.0
user_votes,839K
movie_rating_5,4.5
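
`loc` selects rows by index label, while `iloc` selects them by integer position. Once `Position` (starting at 1) is the index, the two differ by one. A small sketch on a made-up three-row table that reuses only the first three titles shown above:

```python
import pandas as pd

toy = pd.DataFrame(
    {"movie_name": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"]},
    index=pd.Index([1, 2, 3], name="Position"),
)

by_label = toy.loc[3, "movie_name"]      # row whose index label is 3
by_position = toy.iloc[2]["movie_name"]  # third row, counting from 0
print(by_label, "|", by_position)
```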


### Exercise 1 (Pandas)
How many movies from the 20th century (released before 2000) are in the dataset?

In [23]:
movies_db['movie_year']

Unnamed: 0_level_0,movie_year
Position,Unnamed: 1_level_1
1,1994
2,1972
3,2008
4,1974
5,1957
...,...
246,1995
247,1973
248,2006
249,2017


In [24]:
(movies_db[movies_db['movie_year'] < 2000]).shape[0]


153

## JSON Format

JSON stands for JavaScript Object Notation and it is a lightweight format for storing and transporting data.
JSON data is written as name/value pairs.
A name/value pair consists of a field name (in double quotes), followed by a colon, followed by a value.
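
To make the name/value structure concrete, here is a small sketch using Python's standard `json` module. The record is made up and only loosely shaped like a tweet; the field names are assumptions for illustration.

```python
import json

# A made-up JSON document, heavily simplified
raw = '{"id": 1, "text": "hello world", "user": {"screen_name": "example_user"}}'

record = json.loads(raw)  # JSON text -> Python dict
print(record["text"])
print(record["user"]["screen_name"])

# Python dict -> JSON text again, pretty-printed
print(json.dumps(record, indent=2))
```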

In [25]:
tweets = twitter_samples.strings("tweets.20150430-223406.json")
tweets

['RT @KirkKus: Indirect cost of the UK being in the EU is estimated to be costing Britain £170 billion per year! #BetterOffOut #UKIP',
 'VIDEO: Sturgeon on post-election deals http://t.co/BTJwrpbmOY',
 'RT @LabourEoin: The economy was growing 3 times faster on the day David Cameron became Prime Minister than it is today.. #BBCqt http://t.co…',
 'RT @GregLauder: the UKIP east lothian candidate looks about 16 and still has an msn addy http://t.co/7eIU0c5Fm1',
 "RT @thesundaypeople: UKIP's housing spokesman rakes in £800k in housing benefit from migrants.  http://t.co/GVwb9Rcb4w http://t.co/c1AZxcLh…",
 'RT @Nigel_Farage: Make sure you tune in to #AskNigelFarage tonight on BBC 1 at 22:50! #UKIP http://t.co/ogHSc2Rsr2',
 'RT @joannetallis: Ed Milliband is an embarrassment. Would you want him representing the UK?!  #bbcqt vote @Conservatives',
 "RT @abstex: The FT is backing the Tories. On an unrelated note, here's a photo of FT leader writer Jonathan Ford (next to Boris) http://t.c…",
 "RT

Given the list of tweets, we create a Pandas DataFrame.

In [26]:
df_tweet = pd.DataFrame(tweets, columns=['text'])
df_tweet

Unnamed: 0,text
0,RT @KirkKus: Indirect cost of the UK being in ...
1,VIDEO: Sturgeon on post-election deals http://...
2,RT @LabourEoin: The economy was growing 3 time...
3,RT @GregLauder: the UKIP east lothian candidat...
4,RT @thesundaypeople: UKIP's housing spokesman ...
...,...
19995,RT @UKLabour: .@Ed_Miliband: we're not going t...
19996,RT @DisabledScot: @blairmcdougall @ScotlandTon...
19997,RT @Staircase2: @VividRicky exactly but that a...
19998,Actually agreed with %95 of what farage was sa...


## Preprocessing

1. Case Folding
2. Removal
3. Tokenization
4. Stemming
5. POS
6. Lemmatization

### Case Folding

Two ways to apply the same processing technique:

In [27]:
df_tweet.text = df_tweet.text.str.lower()
df_tweet.head(2)

Unnamed: 0,text
0,rt @kirkkus: indirect cost of the uk being in ...
1,video: sturgeon on post-election deals http://...


In [28]:
df_tweet['text'] = df_tweet['text'].str.lower()
df_tweet.head(2)

Unnamed: 0,text
0,rt @kirkkus: indirect cost of the uk being in ...
1,video: sturgeon on post-election deals http://...


### Removal
#### Regular Expression

A regular expression (regex for short) is a sequence of characters that specifies a match pattern in text. Such patterns are typically used by string-searching algorithms for "find" or "find and replace" operations on strings. We use Python's re library; its pattern syntax is documented at https://docs.python.org/3/library/re.html. Some examples follow.

In [29]:
import re

string_examples = ['ABCDEFabcdef', 'I live in Europe', 'I live in europe', '*&%@#!}{', 'His dad is 45 years old', '123458', 'Europe is a continent']
df_re = pd.DataFrame(string_examples, columns=['Text'])

df_re.Text.str.contains('europe')

Unnamed: 0,Text
0,False
1,False
2,True
3,False
4,False
5,False
6,False


*[ ]* is used to indicate a set of characters. In a set:
- Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
- Ranges of characters can be indicated by giving two characters and separating them by a '-'. For example, [a-z] will match any lowercase ASCII letter, and [0-5][0-9] will match all the two-digit numbers from 00 to 59.

In [30]:
df_re.Text.str.contains('[Ee]urope')

Unnamed: 0,Text
0,False
1,True
2,True
3,False
4,False
5,False
6,True


In [31]:
df_re.Text.str.contains(r'[a-zA-Z0-9]', regex=True)

Unnamed: 0,Text
0,True
1,True
2,True
3,False
4,True
5,True
6,True


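\w: matches any alphanumeric character as well as the underscore. For ASCII text it is equivalent to [a-zA-Z0-9_]; with Python's default Unicode matching it also matches letters and digits from other alphabets.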

In [32]:
df_re.Text.str.contains(r'\w', regex=True)

Unnamed: 0,Text
0,True
1,True
2,True
3,False
4,True
5,True
6,True


In [33]:
df_re.Text.str.contains(r'[0-9]', regex=True)

Unnamed: 0,Text
0,False
1,False
2,False
3,False
4,True
5,True
6,False


$: it matches the end of the string or just before the newline at the end of the string.

In [34]:
df_re.Text.str.contains('e$')

Unnamed: 0,Text
0,False
1,True
2,True
3,False
4,False
5,False
6,False


+: Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

In [35]:
df_re.Text.str.replace('45+', '100', regex=True)

Unnamed: 0,Text
0,ABCDEFabcdef
1,I live in Europe
2,I live in europe
3,*&%@#!}{
4,His dad is 100 years old
5,1231008
6,Europe is a continent


In [36]:
df_re.Text.str.replace('[Ee]urope', 'Sud America', regex=True)

Unnamed: 0,Text
0,ABCDEFabcdef
1,I live in Sud America
2,I live in Sud America
3,*&%@#!}{
4,His dad is 45 years old
5,123458
6,Sud America is a continent


- . : In the default mode, this matches any character except a newline.
- ^ : Matches the start of the string.
- "*" : Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

In [37]:
df_re.Text.str.replace('^I.*', 'Sud America', regex=True)

Unnamed: 0,Text
0,ABCDEFabcdef
1,Sud America
2,Sud America
3,*&%@#!}{
4,His dad is 45 years old
5,123458
6,Europe is a continent


- \s: Matches whitespace characters (which includes [ \t\n\r\f\v])
- \S: Matches any character which is not a whitespace character. This is the opposite of \s.
- ?: Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

In [38]:
df_re.Text.str.replace(r'.*\s[Ee]urope', 'I love eating Sushi', regex=True)

Unnamed: 0,Text
0,ABCDEFabcdef
1,I love eating Sushi
2,I love eating Sushi
3,*&%@#!}{
4,His dad is 45 years old
5,123458
6,Europe is a continent
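
The three quantifiers described so far (`+`, `*`, `?`) can be compared side by side. A minimal sketch using `re.fullmatch`, which succeeds only when the whole string matches; the sample strings are chosen for illustration:

```python
import re

samples = ["a", "ab", "abbb"]
patterns = [r"ab+", r"ab*", r"ab?"]

# For each pattern, keep the samples that match it entirely
results = {p: [s for s in samples if re.fullmatch(p, s)] for p in patterns}

for pattern, matched in results.items():
    print(pattern, matched)
```

`ab+` needs at least one 'b', `ab*` accepts any number of them (including none), and `ab?` accepts at most one.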


#### URLs

In [39]:
# regular expressions
text_ex = 'Learn more about regular expressions on https://docs.python.org/3/library/re.html'
re.search(r'https?://\S+|www\.\S+', text_ex)

<re.Match object; span=(40, 81), match='https://docs.python.org/3/library/re.html'>

In [40]:
print('\033[1m' + 'Text before removing URLs' +'\033[0;0m')
print(df_tweet.iloc[1]['text'])

df_tweet.text = df_tweet.text.str.replace(r'https?://\S+|www\.\S+', '', regex=True)

print('\033[1m' + 'Text after removing URLs' +'\033[0;0m')
print(df_tweet.iloc[1]['text'])

Text before removing URLs
video: sturgeon on post-election deals http://t.co/btjwrpbmoy
Text after removing URLs
video: sturgeon on post-election deals 


#### Emoji

In [41]:
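# Reference: https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def remove_emoji(text):
    # Note: some ranges below are very broad and also match non-emoji
    # characters (e.g. CJK text); this is acceptable for English tweets.
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F"  # emoticons
                               u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                               u"\U0001F680-\U0001F6FF"  # transport & map symbols
                               u"\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
                               u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
                               u"\U00002702-\U000027B0"  # dingbats
                               u"\U000024C2-\U0001F251"  # enclosed characters (very broad range)
                               u"\U0001f926-\U0001f937"  # gestures (face palm ... shrug)
                               u"\U00010000-\U0010ffff"  # all supplementary planes
                               u"\u2640-\u2642"          # gender symbols
                               u"\u2600-\u2B55"          # misc symbols up to heavy large circle
                               u"\u200d"                 # zero width joiner
                               u"\u23cf"                 # eject symbol
                               u"\u23e9"                 # fast-forward symbol
                               u"\u231a"                 # watch
                               u"\ufe0f"                 # variation selector-16
                               u"\u3030"                 # wavy dash
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)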

In [42]:
text = "Italy is on 🔥🔥. Lets win!!!!"
remove_emoji(text)

'Italy is on . Lets win!!!!'

Problem:
What is the problem in the previous example? Does the 'clean' sentence have the same meaning (sentiment) as the original?
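
One possible way around this, sketched under assumptions: instead of deleting emoji, replace each one with its Unicode name so that the sentiment cue survives as a word. The helper `emoji_to_text` and the narrow code-point range are illustrative choices, not part of the notebook.

```python
import re
import unicodedata

# Illustrative range only: covers many (not all) pictographic emoji
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF]")

def emoji_to_text(text):
    """Replace each matched emoji with its lowercase Unicode name."""
    def name_of(match):
        # Fall back to a generic word if the character has no Unicode name
        return " " + unicodedata.name(match.group(), "emoji").lower() + " "
    replaced = EMOJI_RE.sub(name_of, text)
    return " ".join(replaced.split())  # collapse the extra spaces

print(emoji_to_text("Italy is on \U0001F525\U0001F525. Lets win!!!!"))
```

Whether a textual name helps or hurts depends on the downstream task, so this is a design choice to evaluate rather than a rule.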

A lambda function is a small anonymous function. It can take any number of arguments, but can only have one expression. The power of lambda functions shows best when they are passed as anonymous functions to another function.

In [43]:
x = lambda a: a + 10 if a < 10 else a - 2
print(x(50))

48
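
To illustrate a lambda passed to another function, here is a short sketch with the built-ins `sorted` and `map`; the word list is arbitrary.

```python
words = ["tokenization", "nlp", "pandas", "re"]

# key= receives an anonymous function that maps each word to its length
by_length = sorted(words, key=lambda w: len(w))
print(by_length)

# map applies the lambda to every element of the list
upper = list(map(lambda w: w.upper(), words))
print(upper)
```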


The *apply* method takes a function (here a lambda) and applies it to every element of the column, i.e. row by row.

In [44]:
df_tweet.text = df_tweet.text.apply(lambda x: remove_emoji(x))
df_tweet.head(2)

Unnamed: 0,text
0,rt @kirkkus: indirect cost of the uk being in ...
1,video: sturgeon on post-election deals


#### Numbers

\d: Matches any decimal digit. This includes [0-9].

In [45]:
df_tweet.text = df_tweet.text.str.replace(r'\d+', '', regex=True)
df_tweet.head()

Unnamed: 0,text
0,rt @kirkkus: indirect cost of the uk being in ...
1,video: sturgeon on post-election deals
2,rt @laboureoin: the economy was growing times...
3,rt @greglauder: the ukip east lothian candidat...
4,rt @thesundaypeople: ukip's housing spokesman ...


### Exercise 2 (RE)

The column 'user_votes' in the movies dataset contains strings. Can you create a new column holding the same numbers as integers?

In [46]:
movies_db.head()

Unnamed: 0_level_0,movie_name,movie_year,movie_rating,user_votes,movie_rating_5
Position,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,The Shawshank Redemption,1994,9.3,2.8M,4.65
2,The Godfather,1972,9.2,2M,4.6
3,The Dark Knight,2008,9.0,2.8M,4.5
4,The Godfather Part II,1974,9.0,1.3M,4.5
5,12 Angry Men,1957,9.0,839K,4.5


In [58]:
type(movies_db['user_votes'].iloc[0])

str

In [59]:
movies_db['int_user_votes'] = movies_db['user_votes'].apply(lambda x: float(x.replace('K', '')) * 1000 if 'K' in x else float(x.replace('M', ''))*1000000)

In [60]:
movies_db['int_user_votes'].iloc[0]

2800000.0
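
The column created above holds floats (`2800000.0`), while the exercise asks for integers. A possible variant is sketched below with a hypothetical helper `votes_to_int`. It uses a regular expression to separate the number from its suffix and rounds before casting, because multiplying a float by a large factor does not always give an exact whole number.

```python
import re

MULTIPLIERS = {"K": 1_000, "M": 1_000_000, "": 1}

def votes_to_int(votes):
    """Convert strings such as '839K' or '2.8M' to an int."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KM]?)", votes.strip())
    if match is None:
        raise ValueError(f"unexpected vote format: {votes!r}")
    number, suffix = match.groups()
    # round() guards against floating-point results just below a whole number
    return int(round(float(number) * MULTIPLIERS[suffix]))

print([votes_to_int(v) for v in ["2.8M", "2M", "839K"]])
```

With the DataFrame from this notebook, something like `movies_db['user_votes'].apply(votes_to_int)` should then give an integer column, assuming every value follows one of these formats.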

### Stopwords

Stopwords are words that are filtered out (i.e., stopped) before or after processing natural language data (text) because they are considered 'insignificant'. There is no single universal list of stopwords used by all natural language processing tools, nor any agreed-upon rules for identifying them, and indeed not all tools even use such a list. Therefore, any group of words can be chosen as the stopwords for a given purpose, task, or domain.

In [61]:
STOPWORDS = stopwords.words('english')
STOPWORDS

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [63]:
STOPWORDS.extend(['ciao'])
'ciao' in STOPWORDS

True

In [64]:
stopwords.words('italian')

['ad',
 'al',
 'allo',
 'ai',
 'agli',
 'all',
 'agl',
 'alla',
 'alle',
 'con',
 'col',
 'coi',
 'da',
 'dal',
 'dallo',
 'dai',
 'dagli',
 'dall',
 'dagl',
 'dalla',
 'dalle',
 'di',
 'del',
 'dello',
 'dei',
 'degli',
 'dell',
 'degl',
 'della',
 'delle',
 'in',
 'nel',
 'nello',
 'nei',
 'negli',
 'nell',
 'negl',
 'nella',
 'nelle',
 'su',
 'sul',
 'sullo',
 'sui',
 'sugli',
 'sull',
 'sugl',
 'sulla',
 'sulle',
 'per',
 'tra',
 'contro',
 'io',
 'tu',
 'lui',
 'lei',
 'noi',
 'voi',
 'loro',
 'mio',
 'mia',
 'miei',
 'mie',
 'tuo',
 'tua',
 'tuoi',
 'tue',
 'suo',
 'sua',
 'suoi',
 'sue',
 'nostro',
 'nostra',
 'nostri',
 'nostre',
 'vostro',
 'vostra',
 'vostri',
 'vostre',
 'mi',
 'ti',
 'ci',
 'vi',
 'lo',
 'la',
 'li',
 'le',
 'gli',
 'ne',
 'il',
 'un',
 'uno',
 'una',
 'ma',
 'ed',
 'se',
 'perché',
 'anche',
 'come',
 'dov',
 'dove',
 'che',
 'chi',
 'cui',
 'non',
 'più',
 'quale',
 'quanto',
 'quanti',
 'quanta',
 'quante',
 'quello',
 'quelli',
 'quella',
 'quelle',
 'q

In [65]:
stopwords.words('russian')

['и',
 'в',
 'во',
 'не',
 'что',
 'он',
 'на',
 'я',
 'с',
 'со',
 'как',
 'а',
 'то',
 'все',
 'она',
 'так',
 'его',
 'но',
 'да',
 'ты',
 'к',
 'у',
 'же',
 'вы',
 'за',
 'бы',
 'по',
 'только',
 'ее',
 'мне',
 'было',
 'вот',
 'от',
 'меня',
 'еще',
 'нет',
 'о',
 'из',
 'ему',
 'теперь',
 'когда',
 'даже',
 'ну',
 'вдруг',
 'ли',
 'если',
 'уже',
 'или',
 'ни',
 'быть',
 'был',
 'него',
 'до',
 'вас',
 'нибудь',
 'опять',
 'уж',
 'вам',
 'ведь',
 'там',
 'потом',
 'себя',
 'ничего',
 'ей',
 'может',
 'они',
 'тут',
 'где',
 'есть',
 'надо',
 'ней',
 'для',
 'мы',
 'тебя',
 'их',
 'чем',
 'была',
 'сам',
 'чтоб',
 'без',
 'будто',
 'чего',
 'раз',
 'тоже',
 'себе',
 'под',
 'будет',
 'ж',
 'тогда',
 'кто',
 'этот',
 'того',
 'потому',
 'этого',
 'какой',
 'совсем',
 'ним',
 'здесь',
 'этом',
 'один',
 'почти',
 'мой',
 'тем',
 'чтобы',
 'нее',
 'сейчас',
 'были',
 'куда',
 'зачем',
 'всех',
 'никогда',
 'можно',
 'при',
 'наконец',
 'два',
 'об',
 'другой',
 'хоть',
 'после',
 'на

In [66]:
stopwords.words('chinese')

['一',
 '一下',
 '一些',
 '一切',
 '一则',
 '一天',
 '一定',
 '一方面',
 '一旦',
 '一时',
 '一来',
 '一样',
 '一次',
 '一片',
 '一直',
 '一致',
 '一般',
 '一起',
 '一边',
 '一面',
 '万一',
 '上下',
 '上升',
 '上去',
 '上来',
 '上述',
 '上面',
 '下列',
 '下去',
 '下来',
 '下面',
 '不一',
 '不久',
 '不仅',
 '不会',
 '不但',
 '不光',
 '不单',
 '不变',
 '不只',
 '不可',
 '不同',
 '不够',
 '不如',
 '不得',
 '不怕',
 '不惟',
 '不成',
 '不拘',
 '不敢',
 '不断',
 '不是',
 '不比',
 '不然',
 '不特',
 '不独',
 '不管',
 '不能',
 '不要',
 '不论',
 '不足',
 '不过',
 '不问',
 '与',
 '与其',
 '与否',
 '与此同时',
 '专门',
 '且',
 '两者',
 '严格',
 '严重',
 '个',
 '个人',
 '个别',
 '中小',
 '中间',
 '丰富',
 '临',
 '为',
 '为主',
 '为了',
 '为什么',
 '为什麽',
 '为何',
 '为着',
 '主张',
 '主要',
 '举行',
 '乃',
 '乃至',
 '么',
 '之',
 '之一',
 '之前',
 '之后',
 '之後',
 '之所以',
 '之类',
 '乌乎',
 '乎',
 '乘',
 '也',
 '也好',
 '也是',
 '也罢',
 '了',
 '了解',
 '争取',
 '于',
 '于是',
 '于是乎',
 '云云',
 '互相',
 '产生',
 '人们',
 '人家',
 '什么',
 '什么样',
 '什麽',
 '今后',
 '今天',
 '今年',
 '今後',
 '仍然',
 '从',
 '从事',
 '从而',
 '他',
 '他人',
 '他们',
 '他的',
 '代替',
 '以',
 '以上',
 '以下',
 '以为',
 '以便',
 '以免',
 '以前',
 '以及',
 '以后',
 '以外',
 '以後',
 

In [67]:
ex = 'Hey ciao I am a student in the course of Artificial Intelligence. I live in Milan'
print(ex.split(' '))

['Hey', 'ciao', 'I', 'am', 'a', 'student', 'in', 'the', 'course', 'of', 'Artificial', 'Intelligence.', 'I', 'live', 'in', 'Milan']


In [68]:
print(ex.split('a'))

['Hey ci', 'o I ', 'm ', ' student in the course of Artifici', 'l Intelligence. I live in Mil', 'n']


In [69]:
'a'.join(ex.split('a'))

'Hey ciao I am a student in the course of Artificial Intelligence. I live in Milan'

In [70]:
[word for word in ex.split() if word not in STOPWORDS]

['Hey',
 'I',
 'student',
 'course',
 'Artificial',
 'Intelligence.',
 'I',
 'live',
 'Milan']

In [71]:
' '.join([word for word in text.split() if word not in STOPWORDS])

'Italy 🔥🔥. Lets win!!!!'

In [72]:
def remove_stopwords(text):
    return ' '.join([word for word in text.split() if word not in STOPWORDS])

text = 'I am a student in the course of Artificial Intelligence. I live in Milan'

print(remove_stopwords(text))

I student course Artificial Intelligence. I live Milan


In [73]:
df_tweet.text = df_tweet.text.apply(lambda text: remove_stopwords(text))
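
In the examples above 'I' survives because the comparison is case-sensitive and the NLTK list is lowercase (the tweets were lowercased earlier, so they are not affected). Below is a sketch of a case-insensitive variant. `TOY_STOPWORDS` is a tiny hand-written stand-in for `STOPWORDS`, and `remove_stopwords_ci` is a hypothetical name. Using a set rather than a list also makes each membership test faster.

```python
# Tiny stand-in for the NLTK list, for illustration only
TOY_STOPWORDS = {"i", "am", "a", "in", "the", "of"}

def remove_stopwords_ci(text, stopwords=TOY_STOPWORDS):
    """Drop stopwords, comparing in lowercase but keeping the original casing."""
    return " ".join(w for w in text.split() if w.lower() not in stopwords)

example = "I am a student in the course of Artificial Intelligence. I live in Milan"
print(remove_stopwords_ci(example))
```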

### Punctuations

In [74]:
punc = string.punctuation
punc

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [75]:
dict.fromkeys(punc, ' ')

{'!': ' ',
 '"': ' ',
 '#': ' ',
 '$': ' ',
 '%': ' ',
 '&': ' ',
 "'": ' ',
 '(': ' ',
 ')': ' ',
 '*': ' ',
 '+': ' ',
 ',': ' ',
 '-': ' ',
 '.': ' ',
 '/': ' ',
 ':': ' ',
 ';': ' ',
 '<': ' ',
 '=': ' ',
 '>': ' ',
 '?': ' ',
 '@': ' ',
 '[': ' ',
 '\\': ' ',
 ']': ' ',
 '^': ' ',
 '_': ' ',
 '`': ' ',
 '{': ' ',
 '|': ' ',
 '}': ' ',
 '~': ' '}

maketrans + translate: create a mapping table with str.maketrans() and pass it to the translate() method to replace each key character with its value character.

In [78]:
PUNCTUATIONS = string.punctuation.replace('#', '')

def remove_punctuations(text):
    trans = str.maketrans(dict.fromkeys(PUNCTUATIONS, ' '))
    return text.translate(trans)

In [76]:
def remove_whitespaces(text):
    return " ".join(text.split())

In [79]:
df_tweet.text = df_tweet.text.apply(lambda x: remove_whitespaces(x))
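
The cells above define `remove_punctuations` but only show `remove_whitespaces` being applied to the DataFrame. Here is a sketch of how the two could be chained on a single string. Both functions are repeated so the example is self-contained, and the sample string is made up.

```python
import string

# Keep '#' so that hashtags survive, as in the cell above
PUNCTUATIONS = string.punctuation.replace('#', '')
TRANS = str.maketrans(dict.fromkeys(PUNCTUATIONS, ' '))

def remove_punctuations(text):
    # Every punctuation character becomes a space
    return text.translate(TRANS)

def remove_whitespaces(text):
    # Collapse runs of whitespace left behind by the replacement
    return " ".join(text.split())

sample = "rt @kirkkus: indirect cost of the uk... #ukip!"
cleaned = remove_whitespaces(remove_punctuations(sample))
print(cleaned)
```

Running the whitespace step second matters, since replacing punctuation with spaces introduces the extra blanks in the first place.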

### Tokenization

Tokenization is one of the most important steps in text pre-processing. Whether we are working with traditional NLP techniques or using advanced deep-learning techniques, we cannot skip this step. In simple words, tokenization is the process of splitting a phrase, sentence, paragraph, or one or more text documents into smaller units. Each of these smaller units is called a token. These tokens can be anything: a word, a subword, even a character. Different algorithms follow different processes in performing tokenization.

Consider the following raw text: “Let us learn tokenization.” A word-based tokenization algorithm breaks the sentence into words. This is the most commonly used tokenization technique: it splits a piece of text into words based on a delimiter, most often the space, which here gives [“Let”, “us”, “learn”, “tokenization.”]. You can also split the text using more than one delimiter, such as spaces and punctuation marks; depending on the delimiters you use, you will get different word-level tokens. Word-based tokenization can easily be done with a custom regex or Python’s split() method. Example:
“Is it weird I don’t like coffee?”
By performing word-based tokenization with space as the delimiter, we get:
[“Is”, “it”, “weird”, “I”, “don’t”, “like”, “coffee?”]
If we look at the tokens “don’t” and “coffee?”, we notice that these words have punctuation attached to them. Suppose our corpus contains another sentence such as “I love coffee.” This time there will be a token “coffee.”, so the model may learn different representations of the same word coffee (“coffee?” and “coffee.”), which makes the representation of words (tokens) suboptimal.

We should take punctuation into account while tokenizing because we do not want our model to learn a different representation of the same word for every punctuation mark that can follow it. If we allowed that, the number of representations the model has to learn would explode (each word × the number of punctuation marks used in the language). So, let’s take punctuation into account:
[“Is”, “it”, “weird”, “I”, “don”, “’”, “t”, “like”, “coffee”, “?”]

This is better than what we had earlier. However, the tokenizer has produced three tokens for the word “don’t”: “don”, “’”, and “t”. A better tokenization of “don’t” would be “do” and “n’t”: if the model later sees the word “doesn’t”, it will be tokenized into “does” and “n’t”, and since the model has already learned about “n’t”, it can apply that knowledge here.
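
The two word-level strategies discussed above can be sketched with plain Python. The regular expression is one simple choice among many, not the rule used by any particular library, and the sentence uses a straight apostrophe.

```python
import re

sentence = "Is it weird I don't like coffee?"

# Delimiter: whitespace only, so punctuation stays attached to the words
space_tokens = sentence.split()
print(space_tokens)

# Runs of word characters, or single punctuation marks, as separate tokens
punct_tokens = re.findall(r"\w+|[^\w\s]", sentence)
print(punct_tokens)
```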

In [80]:
text = 'I don\'t like training a neural model from zero, I prefer using a a pre-trained model from huggingface.com'

In [81]:
text.split(' ')

['I',
 "don't",
 'like',
 'training',
 'a',
 'neural',
 'model',
 'from',
 'zero,',
 'I',
 'prefer',
 'using',
 'a',
 'a',
 'pre-trained',
 'model',
 'from',
 'huggingface.com']

In [83]:
from nltk.tokenize import word_tokenize
nltk.download('punkt_tab')

word_tokenize(text)

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


['I',
 'do',
 "n't",
 'like',
 'training',
 'a',
 'neural',
 'model',
 'from',
 'zero',
 ',',
 'I',
 'prefer',
 'using',
 'a',
 'a',
 'pre-trained',
 'model',
 'from',
 'huggingface.com']

In [84]:
from nltk.tokenize import wordpunct_tokenize

wordpunct_tokenize(text)

['I',
 'don',
 "'",
 't',
 'like',
 'training',
 'a',
 'neural',
 'model',
 'from',
 'zero',
 ',',
 'I',
 'prefer',
 'using',
 'a',
 'a',
 'pre',
 '-',
 'trained',
 'model',
 'from',
 'huggingface',
 '.',
 'com']

### Stemming and Lemmatization

**Stemming** is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form -- generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. A stemmer for English operating on the stem *cat* should identify such strings as *cats*, *catlike* and *catty*. A stemming algorithm might also reduce the words *fishing*, *fished*, and *fisher* to the stem *fish*. The stem need not be a word: for example, the Porter algorithm reduces *argue*, *argued*, *argues*, *arguing*, and *argus* to the stem *argu*.

**Lemmatization** is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighbouring sentences or even an entire document. In many languages, words appear in several inflected forms. For example, in English, the verb to walk may appear as walk, walked, walks or walking. The base form, walk, that one might look up in a dictionary, is called the lemma for the word. The association of the base form with a part of speech is often called a lexeme of the word.

Lemmatization is closely related to stemming. The difference is that **a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech**. However, **stemmers are typically easier to implement and run faster**.

For instance:
- The word *better* has *good* as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
- The word *walk* is the base form for the word *walking*, and hence this is matched in both stemming and lemmatization.
- The word *meeting* can be either the base form of a noun or a form of a verb (*to meet*) depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
- The word *saw* is another example: a stemming algorithm may reduce “saw” down to “s”, whereas a lemmatization algorithm considers whether “saw” is a noun (the hand tool for cutting) or a verb (the past tense of *to see*) based on the context in which it is used. If it’s a noun it returns “saw”, and if it’s a verb it returns “see”.
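
The contrast in the examples above can be sketched with two deliberately naive functions. Both `toy_stem` and `toy_lemmatize` are hypothetical helpers written for this illustration; they are far simpler than the Porter stemmer or a real lemmatizer, and the lookup table only contains the words discussed above.

```python
def toy_stem(word):
    """Strip a few common suffixes without looking at context."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Lemmas depend on the part of speech, so the key is (word, POS)
TOY_LEMMAS = {
    ("better", "ADJ"): "good",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
    ("saw", "NOUN"): "saw",
    ("saw", "VERB"): "see",
}

def toy_lemmatize(word, pos):
    """Dictionary look-up that falls back to the word itself."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_stem("walking"), toy_stem("better"), toy_stem("meeting"))
print(toy_lemmatize("better", "ADJ"))
print(toy_lemmatize("meeting", "NOUN"), toy_lemmatize("meeting", "VERB"))
```

The stemmer gives the same answer for *meeting* regardless of how it is used, while the look-up needs the part of speech to choose between the two lemmas.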

In [87]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.lancaster import LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

word_list = ["friend", "friendship", "friends", "friendships","stabil","destabilize","misunderstanding","railroad","moonlight","football","studies"]
print("{0:20}{1:20}{2:20}".format("Word","Porter Stemmer","lancaster Stemmer"))
for word in word_list:
    print("{0:20}{1:20}{2:20}".format(word,porter.stem(word),lancaster.stem(word)))

Word                Porter Stemmer      lancaster Stemmer   
friend              friend              friend              
friendship          friendship          friend              
friends             friend              friend              
friendships         friendship          friend              
stabil              stabil              stabl               
destabilize         destabil            dest                
misunderstanding    misunderstand       misunderstand       
railroad            railroad            railroad            
moonlight           moonlight           moonlight           
football            footbal             footbal             
studies             studi               study               


### Part of Speech

A part of speech (POS) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are assigned to the same part of speech generally display similar syntactic behaviour (they play similar roles within the grammatical structure of sentences), sometimes similar morphological behaviour in that they undergo inflection for similar properties, and even similar semantic behaviour. Commonly listed English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, numeral, article, and determiner.

In [91]:
nltk.download('averaged_perceptron_tagger_eng')

# the POS model must be applied over word-tokenized text
text = nltk.word_tokenize("Today the lab is in Room 1234. The teacher is Marco and the lecture is about pre-processing techinques.")
print(text)

# Use the recommended part of speech tagger
print(nltk.pos_tag(text))

[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


['Today', 'the', 'lab', 'is', 'in', 'Room', '1234', '.', 'The', 'teacher', 'is', 'Marco', 'and', 'the', 'lecture', 'is', 'about', 'pre-processing', 'techinques', '.']
[('Today', 'NN'), ('the', 'DT'), ('lab', 'NN'), ('is', 'VBZ'), ('in', 'IN'), ('Room', 'NNP'), ('1234', 'CD'), ('.', '.'), ('The', 'DT'), ('teacher', 'NN'), ('is', 'VBZ'), ('Marco', 'NNP'), ('and', 'CC'), ('the', 'DT'), ('lecture', 'NN'), ('is', 'VBZ'), ('about', 'IN'), ('pre-processing', 'JJ'), ('techinques', 'NNS'), ('.', '.')]
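
The tags returned above follow the Penn Treebank convention, where related tags share a prefix (for example `NN`, `NNS`, `NNP` are all nouns). As a rough sketch, the tagged tokens can be grouped into the coarse parts of speech listed earlier. The mapping `COARSE` and the helper `coarse_pos` are assumptions introduced here, not part of NLTK, and the mapping is a simplification (for instance, `IN` also covers subordinating conjunctions).

```python
from collections import defaultdict

# Tagged tokens copied from the output above (a subset, to keep the example short).
tagged = [("Today", "NN"), ("the", "DT"), ("lab", "NN"), ("is", "VBZ"),
          ("in", "IN"), ("Room", "NNP"), ("1234", "CD"), ("teacher", "NN"),
          ("Marco", "NNP"), ("and", "CC"), ("pre-processing", "JJ"),
          ("techinques", "NNS")]

# Hypothetical mapping from the first two characters of a tag to a coarse class.
COARSE = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb",
          "DT": "determiner", "IN": "preposition", "CC": "conjunction",
          "CD": "numeral"}

def coarse_pos(tag):
    """Map a Penn Treebank tag to a coarse class; unknown tags become 'other'."""
    return COARSE.get(tag[:2], "other")

# Group the words by coarse class, preserving their order in the sentence.
by_class = defaultdict(list)
for word, tag in tagged:
    by_class[coarse_pos(tag)].append(word)

for pos_class, words in by_class.items():
    print(f"{pos_class:12} {words}")
```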


spaCy is an open-source software library for advanced natural language processing. Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production use.

In [92]:
# Install spaCy (used below for lemmatization)
import sys
!{sys.executable} -m pip install spacy

# Download spaCy's small English pipeline (the 'en' shortcut is deprecated since spaCy v3.0)
!{sys.executable} -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [93]:
import spacy

# Example of how the spaCy pipeline works
# Load the 'en_core_web_sm' pipeline, disabling the parser and NER components,
# which are not needed for lemmatization
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])


sentence = "The striped bats are hanging on their feet for best. We're living in Milan now."

# Process the sentence with the loaded pipeline object `nlp`
doc = nlp(sentence)

" ".join([token.lemma_ for token in doc])

'the stripe bat be hang on their foot for good . we be live in Milan now .'

In [94]:
def lemmSentence(text):
    doc = nlp(text)
    return " ".join([token.lemma_ for token in doc])

In [95]:
# Whitespace tokenization followed by Porter stemming
def stemSentenceTokenSpace(text):
    # tokens are defined by the whitespace between words
    token_words = text.split()
    # stem each token and join with single spaces (avoids a leading space)
    return " ".join(porter.stem(word) for word in token_words)

In [96]:
df_tweet['Stemming_text'] = df_tweet['text'].map(stemSentenceTokenSpace)
df_tweet['Lemmatization_text'] = df_tweet['text'].map(lemmSentence)

In [97]:
df_tweet

Unnamed: 0,text,Stemming_text,Lemmatization_text
0,rt @kirkkus: indirect cost uk eu estimated cos...,rt @kirkkus: indirect cost uk eu estim cost b...,rt @kirkku : indirect cost uk eu estimate cost...
1,video: sturgeon post-election deals,video: sturgeon post-elect deal,video : sturgeon post - election deal
2,rt @laboureoin: economy growing times faster d...,rt @laboureoin: economi grow time faster day ...,rt @laboureoin : economy grow time fast day da...
3,rt @greglauder: ukip east lothian candidate lo...,rt @greglauder: ukip east lothian candid look...,rt @greglauder : ukip east lothian candidate l...
4,rt @thesundaypeople: ukip's housing spokesman ...,rt @thesundaypeople: ukip' hous spokesman rak...,rt @thesundaypeople : ukip 's housing spokesma...
...,...,...,...
19995,rt @uklabour: .@ed_miliband: we're going deal ...,rt @uklabour: .@ed_miliband: we'r go deal snp...,rt @uklabour : .@ed_miliband : we be go deal s...
19996,rt @disabledscot: @blairmcdougall @scotlandton...,rt @disabledscot: @blairmcdougal @scotlandton...,rt @disabledscot : @blairmcdougall @scotlandto...
19997,rt @staircase: @vividricky exactly alleged com...,rt @staircase: @vividricki exactli alleg comm...,rt @staircase : @vividricky exactly allege com...
19998,actually agreed % farage sayin rhen#voteukip,actual agre % farag sayin rhen#voteukip,actually agree % farage sayin rhen#voteukip
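
The table shows that stemming every whitespace token also alters usernames (for example `@vividricky` becomes `@vividricki`). One possible refinement, sketched below under stated assumptions, is to leave tokens that start with `@` or `#` untouched. The function `stem_keep_handles` and the stand-in stemmer `strip_plural_s` are hypothetical names introduced for illustration; in the notebook one would pass `porter.stem` as the `stem` argument.

```python
def stem_keep_handles(text, stem):
    """Stem whitespace tokens, leaving @mentions and #hashtags untouched."""
    return " ".join(
        word if word.startswith(("@", "#")) else stem(word)
        for word in text.split()
    )

# Stand-in stemmer for illustration only: drops a final "s" from longer words.
def strip_plural_s(word):
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

print(stem_keep_handles("rt @greglauder: ukip candidates #votes", strip_plural_s))
```

Tokens such as `.@ed_miliband:` start with a period, so this simple prefix check would not protect them; handling those cases would need a proper tweet tokenizer.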


### Ex. 3 (Text Processing)

Given the Twitter dataset, what are the most common hashtags? Hint: use the `Counter` class from the `collections` module: https://docs.python.org/3/library/collections.html#collections.Counter

In [98]:
list_tweet = list(df_tweet['text'])

In [99]:
list_tweet[0]

'rt @kirkkus: indirect cost uk eu estimated costing britain £ billion per year! #betteroffout #ukip'

In [100]:
list_hashtags = []

# collect every space-separated token that contains '#'
for tweet in list_tweet:
    list_hashtags += [word for word in tweet.split(' ') if '#' in word]

In [101]:
list_hashtags

['#betteroffout',
 '#ukip',
 '#bbcqt',
 '#asknigelfarage',
 '#ukip',
 '#bbcqt',
 '#snp',
 '#bbcqt',
 '#snpbecause',
 '#tomorrowspaperstoday',
 '#bbcpapers',
 '#snp',
 '#conservative',
 '#bbcqt',
 '#snp',
 '#bbcqt',
 '#bbcqt',
 '#bbcqt',
 '#prettyplease',
 '#bbc',
 '#leadersdebate',
 '#nigelfarage',
 '#ukip',
 '#ge',
 '#ge',
 '#votesnp',
 '#votesnp',
 '#newsnight',
 '#david',
 '#hugahusky',
 '#davidcamerontweet',
 '#greatstorm',
 '#stjude',
 '#bbcqt',
 '#b…',
 '#sunnation',
 '#bbcbias',
 '#ukip',
 '#farageforever',
 '#bbcqt',
 '#tories',
 '#ge',
 '#bbcqt',
 '#bbcqt',
 '#bbcqt',
 '#bbcqt',
 '#bb…',
 '#snpbecause',
 '#snp',
 '#votesnp',
 '#v…',
 '#ge',
 '#bbcqt',
 '#bb…',
 '#snpbecause',
 '#snp',
 '#votesnp',
 '#ukip',
 '#bbcqt',
 "#alexsalmond's",
 '#referendum',
 '#snp',
 '#alexsalmond',
 '#bbcqt',
 '#redtories?',
 '#miliband',
 '#snp',
 '#ge',
 '#snp',
 '#mili…',
 '#bbcqt',
 '#bbcqt',
 '#bbcqt',
 '#questiontime',
 '#snpbecause',
 '#asknicola',
 '#bbcqt',
 '#votesnp',
 '#ukip',
 '#bbcqt',
 ...]

In [102]:
from collections import Counter

Counter(list_hashtags).most_common(10)

[('#bbcqt', 2526),
 ('#asknigelfarage', 1110),
 ('#ukip', 1082),
 ('#ge', 800),
 ('#snp', 646),
 ('#votesnp', 286),
 ('#askfarage', 220),
 ('#labour', 218),
 ('#plaid', 133),
 ('#tories', 126)]
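
The token-based approach above keeps attached punctuation, so `#redtories?` and `#alexsalmond's` are counted separately from `#redtories` and `#alexsalmond`. A possible alternative, shown here as a minimal sketch, extracts hashtags with a regular expression. The pattern `#\w+` and the helper `count_hashtags` are assumptions, and the sample list is hand-made (only its first tweet comes from the dataset), so the counts below say nothing about the full corpus.

```python
import re
from collections import Counter

# Assumed pattern: '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")

def count_hashtags(tweets):
    """Count hashtags across an iterable of tweet strings."""
    counts = Counter()
    for tweet in tweets:
        counts.update(HASHTAG_RE.findall(tweet))
    return counts

# Small hand-made sample; only the first tweet is taken from the dataset.
sample = [
    "rt @kirkkus: indirect cost uk eu estimated costing britain £ billion per year! #betteroffout #ukip",
    "is this #redtories? or #alexsalmond's plan #ukip",
]

counts = count_hashtags(sample)
print(counts.most_common(1))  # [('#ukip', 2)]
```

Hashtags truncated in the source data (such as `#b…`) would still be counted in their shortened form, so some noise remains either way.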