### Preprocessing


 



This notebook is part of my NLP crash course, which I teach in Saudi Arabia. To provide context, you'll find examples demonstrating the application of common NLP preprocessing steps to Arabic text. I'll be adding more notebooks soon.

Please note that this notebook was created live, so there might be some typos. I'll be making corrections and improvements shortly. This is the third notebook in the course. The first part was an introduction to NLP, and the second part focused on exploratory data analysis (EDA) of text data.

**Garbage in -> Garbage out**



##### Preprocessing depends on multiple factors:
- **Language**: English -> lowercase vs. Arabic -> strip tashkeel (diacritics)
- **Data source**: tweets -> spelling correction vs. research papers -> never apply spelling correction

- **Task**: sentiment analysis -> stop word removal vs. translation -> never apply stop word removal

**Here in sentiment analysis (order does matter):**

- Elongation -> "Loooove"
- Remove non-useful text (handles, links, ...) -> @xxx or www.bla.com
- Lemmatization -> (play, plays, played, playing)
- Expand contractions -> "I'm" != "I am"
- Lowercase -> Car != car
- Stop word removal
- Text normalization -> 4you, for you
- Spelling correction -> helo, hell0
- Remove punctuation -> ?;:@
- Remove numbers
- Stemming (if you didn't apply lemmatization) -> (play, plays, played, playing)
- Remove extra spaces (strip)
- Drop long/short tweets

**Steps you may do in other tasks (Advanced preprocessing):**

- Language detection
- Code mixing -> عربي and English -> (SSD كان كويس)
- Transliteration -> SSD kan kways



Note: **Order does matter**

Ex: if you remove punctuation such as `@` first, you can no longer match mentions or handles like `@blabla`. Remove handles first, then remove punctuation.
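
To make the ordering issue concrete, here is a minimal sketch that runs the two steps in both orders on one made-up tweet. The constant names are illustrative; the patterns are the same handle and punctuation patterns used later in this notebook.

```python
import re

HANDLE_PATTERN = r"@\w+"        # mentions / handles
PUNCT_PATTERN = r"[^a-zA-Z\s]"  # anything that is not a Latin letter or whitespace

tweet = "@blabla thanks for the help!"

# Wrong order: removing punctuation first strips "@", so the handle pattern no longer matches
wrong = re.sub(HANDLE_PATTERN, "", re.sub(PUNCT_PATTERN, "", tweet))

# Right order: remove handles first, then punctuation
right = re.sub(PUNCT_PATTERN, "", re.sub(HANDLE_PATTERN, "", tweet))

print(repr(wrong))  # the handle text "blabla" is still there
print(repr(right))  # the handle is gone
```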

#### Reading Data

In [None]:
#!unzip $path # CLI
#!unzip  "/content/training.1600000.processed.noemoticon.csv.zip"

In [None]:
import pandas as pd

df = pd.read_csv("/kaggle/input/sentiment140/training.1600000.processed.noemoticon.csv", encoding='latin-1', names=['sentiment', 'id', 'time', 'q', 'user', 'tweet'])
df.head()

In [None]:
df.drop(columns=['id', 'q', 'user', 'time'], inplace=True)

In [None]:
from sklearn.model_selection import train_test_split

train_val, test = train_test_split(df, test_size=0.2, random_state=42)
train, val = train_test_split(train_val, test_size=0.2, random_state=42)

#### Quick note on the number of unique words



In [None]:
the_whole_text = df['tweet'].str.cat(sep=' ')

In [None]:
unique_words = set(the_whole_text.split())
unique_words

In [None]:
len(unique_words)

*Example*:

In [None]:
sen = "i love love it love"

print(sen.split())

print(set(sen.split()))

print(len(set(sen.split())))

In [None]:
# Let's write it as a function

def get_unique_words(data):
    the_whole_text = data['tweet'].str.cat(sep=' ')
    unique_words = set(the_whole_text.split())
    return unique_words

In [None]:
len(get_unique_words(train))

Note: **The same preprocessing pipeline must be applied to the val and test sets (and to production data)**

#### Elongation


Ex: diiiiiid,  looooove


Note: In English, a letter is almost never repeated more than twice in a row, e.g. "hello"

Example on a single string

Recommended: Always check your regular expression at https://regex101.com/

In [None]:
import re

# ([a-zA-Z])\1{2,} matches a letter followed by two or more copies of itself
sen = "I looooove lOOOOve it. i will call him"
re.sub(r"([a-zA-Z])\1{2,}", r"\1", sen)

In [None]:
# You can write it this way
# Step 1: define the function
def remove_elongation(tweet):
    return re.sub(r"([a-zA-Z])\1{2,}", r"\1", tweet)

# Step 2: apply it to every tweet
train['tweet'] = train['tweet'].apply(remove_elongation)

In [None]:
# Or you can write it this way (recommended): use a lambda function
train['tweet'] = train['tweet'].apply(lambda twt: re.sub(r"([a-zA-Z])\1{2,}", r"\1", twt ))

In [None]:
len(get_unique_words(train)) # number of unique words reduced by 20,000
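
One side effect worth knowing: the pattern above collapses a run of three or more letters to a single letter, so a word that legitimately has a double letter loses it ("coooool" becomes "col"). A minimal sketch comparing that behaviour with a variant that keeps two letters; the two-letter variant and both function names are assumptions for illustration, not the choice made in this notebook.

```python
import re

def collapse_to_one(text):
    # Same behaviour as the pattern above: 3+ repeats become a single letter
    return re.sub(r"([a-zA-Z])\1{2,}", r"\1", text)

def collapse_to_two(text):
    # Variant (assumption): 3+ repeats become two letters
    return re.sub(r"([a-zA-Z])\1{2,}", r"\1\1", text)

for word in ["looooove", "coooool", "helllo"]:
    print(word, "->", collapse_to_one(word), "|", collapse_to_two(word))
```

Neither variant is right for every word; a spelling-correction step later in the pipeline can clean up what is left.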

Note: In Arabic, the equivalent step is removing tatweel (the elongation character)

In [None]:
#!pip install pyarabic
from pyarabic import araby

sen = "بـــــــــــــــــسم الله"
araby.strip_tatweel(sen)

#### Remove non-useful text (handles, links, ..)

In [None]:
# \w matches any word character (for ASCII text, equivalent to [a-zA-Z0-9_])
# + matches the previous token one or more times, as many times as possible

user_pattern = r"@\w+" # or explicitly: r"@[a-zA-Z0-9_]+"
hash_pattern = r"#\w+"
url_pattern = r"https?\S+|www\.\S+|\S+\.sa|\S+\.com" # starts with http(s), starts with www., or contains .sa / .com

#non_useful_pattern = r"@\w+|#\w+|https?\S+|www\.\S+|\S+\.sa|\S+\.com"
non_useful_pattern = r'|'.join([user_pattern, hash_pattern, url_pattern])
non_useful_pattern

In [None]:
train['tweet'] = train['tweet'].apply(lambda twt: re.sub(non_useful_pattern, "", twt))

In [None]:
len(get_unique_words(train)) # 302,000 removed

In [None]:
tweet = """
Feel free to follow me on  linkedin https://www.linkedin.com/in/alielkassas/ or github https://github.com/alielkassas
http.google.com
https://regex101.com/
#Happy_coding #python 
regex.com
nic.sa
@ work
"""
re.sub(non_useful_pattern, "", tweet)

Again, this depends on the data source. If you scraped the web, you will also need to remove HTML tags, e.g. with `<\S*>`
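
A minimal sketch of that idea for scraped text. Note that `<\S*>` does not match tags whose attributes contain spaces, so this sketch assumes a slightly broader pattern, `<[^>]+>`. The `strip_html` helper and the sample string are made up for illustration, and the standard-library `html.unescape` is used to decode entities such as `&quot;`.

```python
import html
import re

TAG_PATTERN = r"<[^>]+>"  # assumption: "<", then anything except ">", then ">"

def strip_html(text):
    # Drop the tags, then turn entities such as &quot; or &amp; back into characters
    return html.unescape(re.sub(TAG_PATTERN, "", text))

scraped = '<p class="review">I &quot;love&quot; this <b>phone</b> &amp; case</p>'
print(strip_html(scraped))
```

For messy real-world pages a proper HTML parser is usually safer than a regular expression.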

#### Lemmatization -> (play, plays, played, playing)

**It is placed at this point in the order because lemmatization needs to POS-tag the text before looking words up in the dictionary**

Lemmatization steps (it takes time):
1. POS tagging
2. Lookup in the dictionary


Note: Apply **stemming or lemmatization, not both**

Ex:

cats -> cat

was -> be

meeting -> it could be meeting (noun) or meet (verb), depending on the POS


In [None]:
# Will not apply it to the dataset, as I will do stemming later on
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatizer.lemmatize("We are meeting right now")

This raises an error asking you to download the WordNet dictionary

In [None]:
# Download wordnet dictionary
import nltk
nltk.download('wordnet')

In [None]:
lemmatizer.lemmatize("meeting") # meeting(noun) -> meeting

In [None]:
lemmatizer.lemmatize("meeting", 'v') # meeting (verb)-> meet

In [None]:
lemmatizer.lemmatize('cats')

In [None]:
lemmatizer.lemmatize? # Returns the input word unchanged if it cannot be found in WordNet.

In [None]:
def lemmatize(tweet):
    tweet_lemmas = []
    for word in tweet.split():
        # lemmatize each word (treated as a noun by default)
        tweet_lemmas.append(lemmatizer.lemmatize(word))

    return " ".join(tweet_lemmas)

tweet = "Love playing"
lemmatize(tweet)

In [None]:
#train['tweet'] = train['tweet'].apply(lemmatize)

##### List comprehension

In [None]:
# recommended

" ".join([lemmatizer.lemmatize(word) for word in tweet.split()])

#train['tweet'] = train['tweet'].apply(lambda tweet: " ".join([lemmatizer.lemmatize(word) for word in tweet.split()]))
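
The calls above omit the POS argument, so `lemmatize` treats every word as a noun (its default). A minimal sketch of how tagger output could be mapped to the POS letters WordNet expects. The `penn_to_wordnet` helper and the hand-written tags are illustrative assumptions, not part of NLTK; in practice the tags would come from a POS tagger.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (rough mapping)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, which is also the lemmatizer's default

# Hand-written (word, tag) pairs, standing in for real tagger output
tagged = [("We", "PRP"), ("are", "VBP"), ("meeting", "VBG"), ("now", "RB")]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

Each pair can then be passed on as `lemmatizer.lemmatize(word, pos)`, which is how "meeting" tagged as a verb ends up as "meet".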

#### Expand contractions -> "I'm" -> "I am"


Sometimes it's ambiguous, e.g. "it's" -> "it is" or "it has"

Another example:
```
ain't -> am not
ain't -> are not
ain't -> is not
ain't -> has not
ain't -> have not
```

We have a default, which is `are not`

Some libraries resolve this ambiguity using embeddings, like `pycontractions`
https://pypi.org/project/pycontractions/

In [None]:
"I'm" == "I am" # False: to the model these are two different strings

In [None]:
# Will not use it this way; this is just for clarification
contractions = {
    "ain't": "am not",
    "aren't": "are not",
    "can't": "can not",
    "'cause": "because",
    "could've": "could have",
    "couldn't": "could not",
    "didn't": "did not",
    "doesn't": "does not",
    "don't": "do not",
    "hadn't": "had not",
    "hasn't": "has not",
    "haven't": "have not",
    "he's": "he is",
    "how's": "how is",
    "I'd": "I would",
    "I'll": "I will",
}

In [None]:
!pip install contractions

In [None]:
from contractions import contractions_dict

contractions_dict

In [None]:
tweet = "I'm teaching. how's going with you?"

expanded_tweet = []
for word in tweet.split():
    if word in contractions_dict:
        expanded_tweet.append(contractions_dict.get(word))
    else:
        expanded_tweet.append(word)

' '.join(expanded_tweet)

In [None]:
" ".join([contractions_dict.get(word, word) for word in tweet.split()])

In [None]:
train['tweet'] = train['tweet'].apply(lambda tweet: " ".join([contractions_dict.get(word, word) for word in tweet.split()]))
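
The split-based lookup above only matches tokens that are exactly equal to a dictionary key, so "DON'T" or a contraction glued to punctuation is left as-is. A minimal regex-based sketch that is case-insensitive and works inside running text. `SMALL_CONTRACTIONS` is a tiny made-up subset used only to keep the example self-contained, and `expand_contractions` is a hypothetical helper name.

```python
import re

# Tiny illustrative mapping (hypothetical subset of a real contractions dictionary)
SMALL_CONTRACTIONS = {"i'm": "I am", "don't": "do not", "how's": "how is"}

# One alternation of all keys, matched case-insensitively on word boundaries
CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in SMALL_CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)

def expand_contractions(text):
    # Look up the lowercased match so "DON'T" and "don't" share one entry
    return CONTRACTION_RE.sub(lambda m: SMALL_CONTRACTIONS[m.group(0).lower()], text)

print(expand_contractions("I'm fine, DON'T worry. how's it going?"))
```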

#### Case-normalization

The model will count cat & Cat as two different words

Again, preprocessing is language dependent. Many languages, like Arabic, don't have upper and lower case

In [None]:
"Blabla".lower()

In [None]:
train['tweet'] = train['tweet'].str.lower()

In [None]:
len(get_unique_words(train)) # reduced by 100,000  -> {cat, Cat}

#### Stop words removal

I will apply it for classification or topic modeling, but never for translation

English -> is, now, he, why

Arabic -> في , على, هذا

You can add more stop words based on the context, e.g. "السيد"

Stop word lists differ from one library to another. **They are not universal**

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

In [None]:
stopwords.words('arabic')

In [None]:
stopwords.words('english')

In [None]:
custom_stopwords = set(stopwords.words('english')) # a set makes it easy to add or remove words

In [None]:
tweet = "i am not happy today"
# the default stop word list contains "not", so the negation is lost
[word for word in tweet.split() if word not in custom_stopwords]

In [None]:
# Don't count "not" as a stop word
# equivalent: custom_stopwords = custom_stopwords - {"not"}
custom_stopwords -= {"not"}
[word for word in tweet.split() if word not in custom_stopwords]

In [None]:
# Don't remove any negation words: take them out of the stop word set
custom_stopwords -= {'no',
 'nor',
 'not',
 'same',
 'so',
 'too',
 'don',
 "don't",
 'aren',
 "aren't",
 'couldn',
 "couldn't",
 'didn',
 "didn't",
 'doesn',
 "doesn't",
 'hadn',
 "hadn't",
 'hasn',
 "hasn't",
 'haven',
 "haven't",
 'isn',
 "isn't",
 'mightn',
 "mightn't",
 'mustn',
 "mustn't",
 'needn',
 "needn't",
 "shan't",
 'shouldn',
 "shouldn't",
 'wasn',
 "wasn't",
 'weren',
 "weren't",
 'won',
 "won't",
 'wouldn',
 "wouldn't"}

custom_stopwords

In [None]:
# Add context-specific words to the stop word list
# equivalent: custom_stopwords = custom_stopwords | {...}
custom_stopwords |= {"today", "now", "&quot;"}

In [None]:
len(custom_stopwords)

In [None]:
train['tweet'] = train['tweet'].apply(lambda tweet: " ".join([word for word in tweet.split() if word not in custom_stopwords]))

#### Normalization

In [None]:
custom_normalization = {"4you": "for you",
                        "2morrow":"tomorrow",
                        "ksa":"saudi arabia",
                        "sa":"saudi arabia"}
#train['tweet'] = train['tweet'].apply(lambda tweet: " ".join([custom_normalization.get(word, word) for word in tweet.split()]))
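
A minimal sketch of how such a mapping could be applied to a single string. `normalize_tokens` and `slang_map` are hypothetical names used only for this illustration.

```python
def normalize_tokens(tweet, mapping):
    # dict.get(word, word) returns the replacement, or the word itself if it is not mapped
    return " ".join(mapping.get(word, word) for word in tweet.split())

slang_map = {"4you": "for you", "2morrow": "tomorrow"}  # illustrative subset
print(normalize_tokens("gift 4you 2morrow", slang_map))
```

This is another place where order matters: if numbers are removed first, "4you" turns into "you" and the mapping never matches.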

#### Spelling correction -> helo, hell0

In [None]:
!pip install pyspellchecker==0.5.6

In [None]:
# https://pypi.org/project/pyspellchecker/
# will run it only on an example

from spellchecker import SpellChecker

spell = SpellChecker()
spell.correction('slep')

In [None]:
tweet = "i likd the pr0cess"
' '.join([spell.correction(word) for word in tweet.split()])

In [None]:
#train['tweet'] = train['tweet'].apply(lambda tweet: ' '.join([spell.correction(word) for word in tweet.split()]))

#### Remove punctuation & numbers

Ex: @ work in 2004

In [None]:
import string

string.punctuation

In [None]:
punct_pattern = r"[^a-zA-Z\s]" # anything that is not a Latin letter or whitespace -> removes punctuation & numbers

tweet = "I am doing well. How are you? I was working @ xxx in 2008"

re.sub(punct_pattern, "", tweet)

Note that this pattern also removes any non-Latin letters

In [None]:
tweet = "in 2009 blbla happend and I was @ القاهرةunivertiy. Do you remember?"
re.sub(punct_pattern, "", tweet)
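
If the data contains Arabic (or any other non-Latin script) that should be kept, a Unicode-aware pattern is one option. A minimal sketch, relying on the fact that `\w` matches letters of any script for `str` patterns in Python 3; `UNICODE_PUNCT_PATTERN` and the sample tweet are assumptions made for this illustration.

```python
import re

# Remove anything that is not a word character or whitespace, plus digits and underscores.
# Letters of any script (Latin, Arabic, ...) are kept.
UNICODE_PUNCT_PATTERN = r"[^\w\s]|[\d_]"

tweet = "in 2009 I was @ جامعة القاهرة. Do you remember?"
cleaned = re.sub(UNICODE_PUNCT_PATTERN, "", tweet)
print(cleaned)
```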

In [None]:
train['tweet'] = train['tweet'].apply(lambda tweet: re.sub(punct_pattern, "", tweet))

In [None]:
len(get_unique_words(train)) # you. you, you? # 250,000 words dropped

#### Stemming (if you didn't apply lemmatization)


Ex: play, plays, played, playing, player -> play

Chops off prefixes and suffixes

In [None]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

stemmer.stem("playing")

In [None]:
stemmer.stem("king")

In [None]:
train['tweet'] = train['tweet'].apply(lambda tweet: " ".join([stemmer.stem(word) for word in tweet.split()]))

In [None]:
len(get_unique_words(train)) # 40,000

##### Note: You can have your own stemmer

In [None]:
from nltk.stem import RegexpStemmer
rs = RegexpStemmer('ing$|s$|ed$|y$', min=4) # min: only stem words with at least this many characters
rs.stem('playing')

In [None]:
rs.stem('king') # you need to customize your rules for edge cases -> the Porter stemmer handles this case

##### Note: Stemming in Arabic

In [None]:
from nltk.stem import ISRIStemmer

arabic_stemmer = ISRIStemmer() # separate name so the English PorterStemmer above is not overwritten

arabic_stemmer.stem("الوقت")

In [None]:
arabic_stemmer.stem("يلعبون")

#### Remove extra spaces

In [None]:
train["tweet"] = train["tweet"].str.strip()
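
`str.strip()` only removes leading and trailing whitespace. The earlier `re.sub` steps replace matches with an empty string, which can leave double spaces inside a tweet. A minimal sketch of collapsing those as well; `collapse_spaces` is a hypothetical helper name.

```python
def collapse_spaces(text):
    # split() with no arguments splits on any run of whitespace and drops empty pieces
    return " ".join(text.split())

print(repr(collapse_spaces("  thanks   for \t the  help \n")))
```

On the DataFrame, the same idea could be written as `train["tweet"].str.split().str.join(" ")`.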

#### Drop long/short tweets (some empty tweets)

In [None]:
len(train['tweet']) # number of rows

In [None]:
#train['tweet'].apply(len)
train['length'] = train['tweet'].str.len()
train['length']

In [None]:
train[train['length'] < 4]

**An empty string is not counted as null**

**NaN != ""**

In [None]:
train.isnull().sum()

In [None]:
train[train['length'] > 140]

In [None]:
train = train[~((train['length'] > 140) | (train['length'] < 4))]

In [None]:
train.drop(columns='length', inplace=True)

#### Language Detection

In [None]:
#https://pypi.org/project/langdetect/

!pip install langdetect

In [None]:
from langdetect import detect
detect("War doesn't show who's right, just who's left.")

An alternative library: https://github.com/pemistahl/lingua-py

In [None]:
!pip install lingua-language-detector

In [None]:
from lingua import Language, LanguageDetectorBuilder
languages = [Language.ENGLISH, Language.FRENCH, Language.GERMAN, Language.SPANISH, Language.ARABIC]
detector = LanguageDetectorBuilder.from_languages(*languages).build()
detector.detect_language_of("languages are awesome")


In [None]:
detector.detect_language_of("Welcome in this course معالجة اللغات الطبيعية")
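
For code-mixed text like the sentence above, a detector that returns a single language cannot tell you that two languages are present. A rough script-level check can flag the mix. This is only a sketch: the helper names and the use of the basic Arabic Unicode block (`\u0600` to `\u06FF`) are assumptions, and it will not catch transliterated text such as "SSD kan kways".

```python
import re

ARABIC_RE = re.compile(r"[\u0600-\u06FF]")  # basic Arabic Unicode block
LATIN_RE = re.compile(r"[A-Za-z]")

def script_counts(text):
    """Count tokens containing Arabic letters and tokens containing Latin letters."""
    tokens = text.split()
    arabic = sum(1 for tok in tokens if ARABIC_RE.search(tok))
    latin = sum(1 for tok in tokens if LATIN_RE.search(tok))
    return {"arabic": arabic, "latin": latin}

def is_code_mixed(text):
    # Mixed if at least one token of each script is present
    counts = script_counts(text)
    return counts["arabic"] > 0 and counts["latin"] > 0

print(script_counts("SSD كان كويس"))
print(is_code_mixed("languages are awesome"))
```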

#### Put it all in one pipeline

In [None]:
user_pattern = r"@\w+"
hash_pattern = r"#\w+"
url_pattern = r"https?\S+|www\.\S+|\S+\.sa|\S+\.com" # starts with http(s), starts with www., or contains .sa / .com

non_useful_pattern = r'|'.join([user_pattern, hash_pattern, url_pattern])
elongation_pattern = r"([a-zA-Z])\1{2,}"
punct_pattern = r"[^a-zA-Z\s]"

stemmer = PorterStemmer() # make sure the English stemmer is used

def preprocess(tweet):

    # Remove elongation
    tweet = re.sub(elongation_pattern, r"\1", tweet)

    # Remove non-useful text
    tweet = re.sub(non_useful_pattern, "", tweet)

    # Tokenization
    tweet = tweet.split()

    # Lemmatization (skipped because stemming is applied below)
    #lemmatizer = WordNetLemmatizer()
    #tweet = [lemmatizer.lemmatize(word) for word in tweet]

    # Expand contractions
    tweet = [contractions_dict.get(word, word) for word in tweet]

    # Lowercase
    tweet = [word.lower() for word in tweet]

    # Remove stop words (custom_stopwords keeps the negation words)
    tweet = [word for word in tweet if word not in custom_stopwords]

    # Spell checker (slow on a large dataset)
    tweet = [spell.correction(word) for word in tweet]

    # Remove punctuation & numbers, then drop tokens that became empty
    tweet = [re.sub(punct_pattern, "", word) for word in tweet]
    tweet = [word for word in tweet if word]

    # Stemming
    tweet = [stemmer.stem(word) for word in tweet]

    # Concat
    tweet = " ".join(tweet)

    # Remove extra spaces
    tweet = tweet.strip()

    return tweet

In [None]:
train['tweet'] = train['tweet'].apply(preprocess)
val['tweet'] = val['tweet'].apply(preprocess)
test['tweet'] = test['tweet'].apply(preprocess)


#### Save clean data

In [None]:
train.to_csv('preprocessed_train.csv')
#val.to_csv('preprocessed_val.csv')
#test.to_csv('preprocessed_test.csv')

### Quick look at spaCy

In [None]:
#https://spacy.io/usage/spacy-101

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text, token.pos_, token.dep_, token.lemma_, token.is_stop)

In [None]:
spacy.explain("ADP")

In [None]:
for ent in doc.ents:
  print(ent.text, ent.label_)