In [5]:
raw_data = '''"I just finished a great book on NLP! 📖 #NLP #DataScience https://example.com/nlp-book","Wow, this is a terrible movie. So boringggg. 👎😠","I don't know what to think about this. There are too many facts, figures and numbers: 1, 2, 3.","The product is amazing!!! It's perfect.","The product is not so good. It is a really bad quality product.","So so, not impressed. So much hype, so little to show for it."
'''

In [11]:
# remove URL
import re
def remove_url(text):
    exp = r'https?://\S+|www\.\S+'
    result = re.sub(exp, r'', text)
    return result

url_remove = remove_url(raw_data)
url_remove

'"I just finished a great book on NLP! 📖 #NLP #DataScience  this is a terrible movie. So boringggg. 👎😠","I don\'t know what to think about this. There are too many facts, figures and numbers: 1, 2, 3.","The product is amazing!!! It\'s perfect.","The product is not so good. It is a really bad quality product.","So so, not impressed. So much hype, so little to show for it."\n'
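Note that in the output above the first review runs straight into the second: `\S+` keeps matching until the next whitespace, so it also consumed the `","Wow,` that followed the URL. The following is a minimal sketch of a stricter pattern, assuming URLs in this dataset never contain quotes or commas. The name remove_url_strict is hypothetical and not part of the notebook.

```python
import re

# Assumption: URLs here never contain quotes or commas, so the match may
# stop at either one as well as at whitespace.
URL_PATTERN = re.compile(r'''(?:https?://|www\.)[^\s"',]+''')

def remove_url_strict(text):
    """Remove URLs without consuming the delimiters that follow them."""
    return URL_PATTERN.sub('', text)

sample = '"Great book! https://example.com/nlp-book","Wow, this is a terrible movie."'
print(remove_url_strict(sample))
```

The trade-off is that a URL which legitimately contains a comma would be cut short, so this choice should be checked against the real data.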

In [16]:
# removing punctuation
import string
exclude = string.punctuation

def remove_punc(text):
    res = text.translate(str.maketrans('','',exclude))
    return res
punc_remove = remove_punc(url_remove)
punc_remove

'I just finished a great book on NLP 📖 NLP DataScience  this is a terrible movie So boringggg 👎😠I dont know what to think about this There are too many facts figures and numbers 1 2 3The product is amazing Its perfectThe product is not so good It is a really bad quality productSo so not impressed So much hype so little to show for it\n'
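In the output above, deleting the `","` separators fused neighbouring reviews into tokens such as 3The and perfectThe. One possible alternative, sketched below, maps punctuation to spaces and then collapses the extra whitespace. Apostrophes are dropped first so that contractions still become dont, as in the original output. The function name replace_punc_with_space is hypothetical.

```python
import string

# Map every punctuation character to a space instead of deleting it.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))

def replace_punc_with_space(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    text = text.replace("'", "")  # keep contractions as one token ("dont")
    return ' '.join(text.translate(PUNCT_TO_SPACE).split())

sample = '"So boringggg. 👎😠","I don\'t know."'
print(replace_punc_with_space(sample))
```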

In [19]:
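# removing stop words
import nltk
from nltk.corpus import stopwords

# Requires the stop word corpus: run nltk.download('stopwords') once beforehand.
# A set makes the membership test faster; the result is unchanged.
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = [word for word in text.split() if word.lower() not in stop_words]
    return " ".join(words)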

stopwords_remove = remove_stopwords(punc_remove)
stopwords_remove

'finished great book NLP 📖 NLP DataScience terrible movie boringggg 👎😠I dont know think many facts figures numbers 1 2 3The product amazing perfectThe product good really bad quality productSo impressed much hype little show'
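One side effect is visible in the output above: "not so good" was reduced to "good", because not is itself a stop word. For sentiment-style tasks it may be preferable to keep negations. The sketch below uses a small hypothetical stop word set so that it runs on its own. In the notebook, the base set would be the NLTK list instead.

```python
# Hypothetical, abbreviated stop word list for illustration only; in the
# notebook this would be nltk.corpus.stopwords.words('english').
BASE_STOP_WORDS = {'the', 'is', 'it', 'a', 'so', 'not', 'no', 'nor'}

# Words that flip the meaning of a sentence and should survive filtering.
NEGATIONS = {'not', 'no', 'nor'}
STOP_WORDS_KEEP_NEG = BASE_STOP_WORDS - NEGATIONS

def remove_stopwords_keep_negation(text):
    """Drop stop words but keep negation words."""
    words = [w for w in text.split() if w.lower() not in STOP_WORDS_KEEP_NEG]
    return ' '.join(words)

print(remove_stopwords_keep_negation('The product is not so good'))
```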

In [20]:
# removing emojis
import re
def remove_emoji(text):
    emoji_pattern = re.compile("["
                               u"\U0001F600-\U0001F64F" # EMOTICONS
                               u"\U0001F300-\U0001F5FF" # SYMBOLS & PICTOGRAPHS
                               u"\U0001F680-\U0001F6FF" # TRANSPORT & MAP SYMBOLS
                               u"\U0001F1E0-\U0001F1FF" # FLAGS (iOS)
                               u"\U00002702-\U000027B0" # DINGBATS
                               u"\U000024C2-\U0001F251" # ENCLOSED CHARACTERS (very wide range: also covers CJK and other non-emoji text)
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

emoji_remove = remove_emoji(stopwords_remove)
emoji_remove

'finished great book NLP  NLP DataScience terrible movie boringggg I dont know think many facts figures numbers 1 2 3The product amazing perfectThe product good really bad quality productSo impressed much hype little show'
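The last range in the pattern above, \U000024C2-\U0001F251, is very wide: it spans CJK ideographs and many other non-emoji characters, which is harmless for this English sample but would delete text in multilingual data. A narrower sketch follows. It drops that range and adds \U0001F900-\U0001F9FF, which is an assumption about which newer emoji matter. The names BROAD_RANGE, NARROW_EMOJI and remove_emoji_narrow are hypothetical.

```python
import re

# The catch-all range from the original pattern, isolated for comparison.
BROAD_RANGE = re.compile("[\U000024C2-\U0001F251]+")

# Narrower pattern: only dedicated emoji blocks.
NARROW_EMOJI = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # regional indicators (flags)
    "\U00002702-\U000027B0"  # dingbats
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs (assumption)
    "]+"
)

def remove_emoji_narrow(text):
    """Remove emoji while leaving non-emoji scripts untouched."""
    return NARROW_EMOJI.sub('', text)

sample = '日本語 👎😠 ok'
print(repr(BROAD_RANGE.sub('', sample)))   # the Japanese text is removed too
print(repr(remove_emoji_narrow(sample)))   # only the emoji are removed
```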

In [22]:
pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [23]:
from textblob import TextBlob

In [24]:
# spelling correction
txtblb = TextBlob(emoji_remove)
spell_correction = txtblb.correct().string
spell_correction

'finished great book NLP  NLP DataScience terrible movie boringggg I dont know think many facts figures numbers 1 2 The product amazing perfectThe product good really bad quality production impressed much hope little show'

As we can see here, TextBlob did not work well: the word boringggg is unchanged, and some tokens were altered for the worse (hype became hope, productSo became production). We need a more robust library for spell checking, especially with non-standard spellings or slang. Two popular alternatives are pyspellchecker and SymSpell.

In [25]:
pip install pyspellchecker

Collecting pyspellchecker
  Downloading pyspellchecker-0.8.3-py3-none-any.whl.metadata (9.5 kB)
Downloading pyspellchecker-0.8.3-py3-none-any.whl (7.2 MB)

In [26]:
from spellchecker import SpellChecker
spell = SpellChecker()

In [33]:
# spell.correction() returns None when it finds no candidate for a word.
# Joining a list that contains None raises
# "TypeError: sequence item 5: expected str instance, NoneType found"
# (item 5 is the 6th word, since Python lists are 0-indexed),
# so fall back to the original word in that case.
def word_corrected(text):
    res = [spell.correction(word) or word for word in text.split()]
    return " ".join(res)

corrected_words = word_corrected(emoji_remove)
corrected_words

"finished great book nap nap DataScience terrible movie boringggg I don't know think many facts figures numbers 1 2 the product amazing perfective product good really bad quality products impressed much hype little show"

In [34]:
# boringggg is still not fixed, so collapse repeated characters with a regex first
def normalize_repetitions(text):
    # (\w) captures one word character; \1{2,} requires that same character
    # at least two more times, so runs of three or more are matched.
    # The whole run is replaced with a single instance of the character (\1).
    return re.sub(r'(\w)\1{2,}', r'\1', text)

def word_corrected(text):
    text = normalize_repetitions(text)
    res = [spell.correction(word) or word for word in text.split()]
    return " ".join(res)

corrected_words = word_corrected(emoji_remove)
corrected_words

"finished great book nap nap DataScience terrible movie boring I don't know think many facts figures numbers 1 2 the product amazing perfective product good really bad quality products impressed much hype little show"

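The output above still turns NLP into nap, because the spell checker treats the acronym as a misspelling. One way to guard against this is to skip tokens that should never be corrected. The sketch below is a hypothetical helper (correct_tokens is not part of pyspellchecker) that takes the correction function as a parameter. In the notebook one would pass spell.correction. A small dictionary stands in here so the example runs on its own.

```python
def correct_tokens(text, correct, protected=frozenset()):
    """Apply `correct` to each token, skipping protected and all-caps tokens."""
    out = []
    for word in text.split():
        # Leave acronyms (e.g. NLP) and explicitly protected terms untouched.
        if word in protected or (word.isupper() and len(word) > 1):
            out.append(word)
            continue
        fixed = correct(word)  # may return None, as spell.correction can
        out.append(fixed if fixed is not None else word)
    return ' '.join(out)

# Stand-in corrector for demonstration only; it mimics the NLP -> nap mistake.
fake_fixes = {'grat': 'great', 'NLP': 'nap'}
demo = correct_tokens('grat book NLP DataScience xyzzy',
                      fake_fixes.get,
                      protected={'DataScience'})
print(demo)
```

Treating every all-caps token as an acronym is a heuristic. Text written in capitals for emphasis would bypass correction as well.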
\1: This is a backreference that refers to whatever the first capturing group (\w) matched. For the input boringggg, the group ends up capturing g, so \1 means "the character g".
In the context of boringggg:
The regex engine scans from the beginning of the string.
It tries b, o, r, i and n in turn, but none of them is followed by two more copies of itself, so no match starts there.
At the first g, (\w) captures g, and \1{2,} then requires at least two more consecutive instances of that same character.
The engine finds ggg after the captured g. Since 3 >= 2, the whole run gggg matches and is replaced by a single g, giving boring.
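The walkthrough above can be checked on a few more words. This short sketch also shows a limitation: collapsing a run to one character is wrong for words that legitimately contain a double letter. Collapsing to two characters (\1\1) is a common alternative that leaves the final cleanup to the spell checker. The constant name REPEAT is hypothetical.

```python
import re

# A word character followed by at least two more copies of itself.
REPEAT = re.compile(r'(\w)\1{2,}')

for word in ['boringggg', 'sooooo', 'good', 'coool']:
    collapsed_to_one = REPEAT.sub(r'\1', word)
    collapsed_to_two = REPEAT.sub(r'\1\1', word)
    print(word, '->', collapsed_to_one, '|', collapsed_to_two)
```

Here good is left alone because oo is a run of only two characters, while coool becomes col with \1 but cool with \1\1.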