# String Data Cleaning & Preprocessing Steps

Here we use Python and regular expressions to perform some basic text cleaning and preprocessing steps:


# 1. Cleaning of Text

In [1]:
import re

The re.sub() function in Python is used to replace all occurrences of a pattern in a string with a replacement string. The first argument is the pattern to be matched, and the second argument is the replacement string. The third argument is the string that the function will operate on.
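As a quick illustration of the argument order, the sketch below replaces every hyphen in a made-up string. The sample string and the optional `count` argument are not part of the original notebook; they are only here to show how the call is put together.

```python
import re

sample = "2023-01-15"  # hypothetical sample string

# Argument order: pattern, replacement, input string
all_replaced = re.sub(r"-", "/", sample)

# The optional count argument limits how many matches are replaced
first_only = re.sub(r"-", "/", sample, count=1)

print(all_replaced)  # 2023/01/15
print(first_only)    # 2023/01-15
```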

In the example below, re.sub(r'[^\w\s]','',text) removes every character that is neither a word character nor whitespace, which in practice strips punctuation and symbols.

The regular expression [^\w\s] matches any character that is not a word character (\w) or a whitespace character (\s).

\w matches any alphanumeric character (letter, digit, or underscore)
\s matches any whitespace character (space, tab, newline, etc.)
The ^ inside the square brackets negates the set, so it matches any character that is not in the set.

The '' is the replacement string. Because it is empty, every matched character is simply deleted, which removes all punctuation and symbols from the text while keeping letters, digits, underscores, and whitespace.
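To see exactly which characters the pattern picks out, a small sketch with a made-up sample string may help. It uses re.findall() to list the matches before removing them, and it shows that the underscore is kept because \w includes it.

```python
import re

sample = "user_name: Ali! (age 30)"  # hypothetical sample string

# Characters matched by [^\w\s], i.e. the ones that re.sub() would delete
removed = re.findall(r"[^\w\s]", sample)

# The underscore survives because \w includes it
cleaned = re.sub(r"[^\w\s]", "", sample)

print(removed)  # [':', '!', '(', ')']
print(cleaned)  # user_name Ali age 30
```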

In [2]:
text='     my # Name is !@#$^&*+,Abdul Remhan  2324325    '

text=re.sub(r'[^\w\s]','',text)  #remove punctuation

In [3]:
text

'     my  Name is Abdul Remhan  2324325    '

In [4]:
text=text.lower()   #convert to lower case
text

'     my  name is abdul remhan  2324325    '

The re.sub(r'\d+', '', text) function is used to remove all digits from the string.

\d is a special character that matches any digit, and the + after it indicates that one or more consecutive digits should be matched. So, \d+ matches one or more consecutive digits in the string.

The re.sub() function then replaces each run of digits it finds with an empty string, which removes the digits from the text.

In [5]:
text=re.sub(r'\d+','',text)  #Remove the digits
text

'     my  name is abdul remhan      '
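To make the role of the + quantifier concrete, the following sketch compares \d+ with \d on a made-up string. When the replacement is an empty string both patterns leave the same text behind; the difference is only in how the matches are grouped.

```python
import re

sample = "room 101 floor 3"  # hypothetical sample string

runs = re.findall(r"\d+", sample)    # each run of digits is one match
singles = re.findall(r"\d", sample)  # each digit is its own match

# Removing the matches gives the same text either way
same = re.sub(r"\d+", "", sample) == re.sub(r"\d", "", sample)

print(runs)     # ['101', '3']
print(singles)  # ['1', '0', '1', '3']
print(same)     # True
```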

In [6]:
text=text.strip()   #remove leading and trailing whitespace
text

'my  name is abdul remhan'

### After all of this explanation, we can combine the steps into one function for text cleaning

In [7]:
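def cleaning_text(txt):
    txt=re.sub(r'[^\w\s]','',txt)  #remove punctuation
    txt=txt.lower()                #convert to lower case
    txt=re.sub(r'\d+','',txt)      #remove the digits
    txt=txt.strip()                #remove leading and trailing whitespace
    return txt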

In [8]:
cleaning_text('     My name is Abdul Rehman .I am In class 9 .My Email is: Abdul .rehman@gmail.com  I am 21 years old    ')

'my name is abdul rehman i am in class  my email is abdul rehmangmailcom  i am  years old'
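Notice that the output above still contains double spaces where digits and punctuation were removed. One possible variant, which is not part of the original notebook, also collapses runs of whitespace with \s+. The name `cleaning_text_compact` and the sample input are made up for illustration.

```python
import re

def cleaning_text_compact(txt):
    """Hypothetical variant of cleaning_text() that also collapses whitespace."""
    txt = re.sub(r"[^\w\s]", "", txt)  # remove punctuation
    txt = txt.lower()                  # convert to lower case
    txt = re.sub(r"\d+", "", txt)      # remove digits
    txt = re.sub(r"\s+", " ", txt)     # collapse runs of whitespace into one space
    return txt.strip()                 # trim leading and trailing whitespace

result = cleaning_text_compact("  I am In class 9 .I am 21 years old  ")
print(result)  # i am in class i am years old
```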

# 2. Tokenization of Text

Tokenization is a critical step in natural language processing as it allows you to work with individual words and sentences, which can be useful for tasks such as text classification, information retrieval, and text generation. In addition, it allows for the removal of stopwords and stemming/lemmatization of the words, which in turn can improve the results of the NLP tasks.

NLTK (Natural Language Toolkit) is a Python library that provides tools for working with human language data such as text. It covers tasks such as tokenization, stemming, lemmatization, parsing, and semantic reasoning, and it includes wrappers for industrial-strength NLP libraries.

Punkt is a sentence tokenizer that learns abbreviation and sentence-boundary statistics from raw text without labelled data. NLTK ships pre-trained Punkt models for several languages, including English, German, and Spanish.

In [9]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

word_tokenize() is a function provided by the nltk.tokenize module that splits a piece of text into words. It takes a string as input and returns a list of word tokens.


In [10]:
from nltk.tokenize import word_tokenize  #function for word tokenization
some_txt='my name is abdul rehman i am in class  my email is abdul rehmangmailcom  i am  years old'
word_tokens=word_tokenize(some_txt)
word_tokens

['my',
 'name',
 'is',
 'abdul',
 'rehman',
 'i',
 'am',
 'in',
 'class',
 'my',
 'email',
 'is',
 'abdul',
 'rehmangmailcom',
 'i',
 'am',
 'years',
 'old']

sent_tokenize() is a function provided by the nltk.tokenize module, which is used to tokenize a piece of text into sentences. It takes a string of text as input and returns a list of sentences.

In [11]:
from nltk.tokenize import sent_tokenize  #function for sentence tokenization
some_txt='my name is abdul rehman i am in class  my email is abdul rehmangmailcom  i am  years old'
sent_tokens=sent_tokenize(some_txt)
sent_tokens

['my name is abdul rehman i am in class  my email is abdul rehmangmailcom  i am  years old']
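The result is a single sentence because the cleaning step removed the periods that mark sentence boundaries. The sketch below illustrates the effect with a rough regular-expression splitter. It is only a crude stand-in for Punkt and not how sent_tokenize() works internally; the sample text and the helper name `rough_sentences` are made up.

```python
import re

def rough_sentences(text):
    """Split after '.', '!' or '?' followed by whitespace (a crude stand-in for Punkt)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

raw = "My name is Abdul Rehman. I am in class 9. I am 21 years old."
cleaned = re.sub(r"[^\w\s]", "", raw)  # same punctuation removal as in section 1

print(len(rough_sentences(raw)))      # 3
print(len(rough_sentences(cleaned)))  # 1
```

If sentence boundaries matter for the task, it is usually safer to split into sentences before removing punctuation.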

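sent_tokenize() uses the Punkt tokenizer, a pre-trained model built with an unsupervised algorithm. word_tokenize() first splits the text into sentences with Punkt and then splits each sentence into words with a Treebank-style word tokenizer.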

# 3. Removing Stop Words From Text

Stop words are common words that are typically removed from text data before performing natural language processing tasks such as text classification or text generation. 

In [12]:
from nltk.corpus import stopwords  #importing the stopwords corpus

In [13]:
nltk.download('stopwords')  #download the stopwords corpus

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\User\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [14]:
stop_words=list(stopwords.words('english'))  #list of all the English stop words
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [15]:
#make a list of the words which are not stop words

filtered_words=[word for word in word_tokens if word not in stop_words] 
filtered_words

['name',
 'abdul',
 'rehman',
 'class',
 'email',
 'abdul',
 'rehmangmailcom',
 'years',
 'old']
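Membership tests on a list scan its elements one by one, so for longer texts it can help to convert the stop words to a set first. Below is a minimal sketch of that idea. The small hand-written stop word set is an assumption that stands in for NLTK's English list so the example runs without any downloads.

```python
# Small hand-picked subset, standing in for stopwords.words('english')
stop_word_set = {"my", "is", "i", "am", "in"}

tokens = ["my", "name", "is", "abdul", "rehman", "i", "am", "in", "class"]

# Same list comprehension as above, but with average constant-time set lookups
filtered = [word for word in tokens if word not in stop_word_set]

print(filtered)  # ['name', 'abdul', 'rehman', 'class']
```

With the real data, the equivalent step would be to wrap the NLTK list in set() before filtering.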

# 4. Conversion of Words into Their Base Roots

### Stemming or Lemmatization of the Words

Stemming and lemmatization both reduce words to a base or root form. This can be useful for tasks such as text classification or information retrieval, because it groups together different forms of the same word (for example, 'years' and 'year').

In [16]:
from nltk.stem import PorterStemmer  #importing the Porter stemmer

In [17]:
stemer=PorterStemmer()
stemed_words=[stemer.stem(word) for word in filtered_words] #making the list of the stemmed words
stemed_words

['name',
 'abdul',
 'rehman',
 'class',
 'email',
 'abdul',
 'rehmangmailcom',
 'year',
 'old']

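Lemmatization serves the same purpose as stemming but works differently: instead of stripping suffixes by rule, it maps each word to its dictionary form (lemma). In NLTK it is available through a separate class, WordNetLemmatizer.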
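The contrast between rule-based stemming and dictionary-based lemmatization can be illustrated with a deliberately simplified sketch. The suffix rules and the lemma dictionary below are toy assumptions made up for illustration; they are neither the Porter algorithm nor NLTK's lemmatizer.

```python
# Toy suffix rules (hypothetical, far simpler than the Porter algorithm)
SUFFIXES = ("ing", "es", "s")

# Toy lemma dictionary (hypothetical, standing in for a real lexicon such as WordNet)
LEMMAS = {"studies": "study", "years": "year", "better": "good"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary, falling back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["studies", "years", "better"]:
    print(w, toy_stem(w), toy_lemmatize(w))
# studies studi study
# years year year
# better better good
```

The stem 'studi' is not a real word, while the lemma 'study' is, which is the usual trade-off between the two approaches.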

# 5. Encoding of Tokenized Text

To encode tokenized text as numbers, you can use texts_to_matrix() or texts_to_sequences() from the Keras Tokenizer class. Both rely on the vocabulary that the tokenizer builds with fit_on_texts(), but they return different representations.

In [18]:
!pip install keras



### Difference Between the texts_to_matrix() and texts_to_sequences() Functions

In short, texts_to_matrix() converts a list of texts into a bag-of-words representation, while texts_to_sequences() converts a list of texts into a list of lists of integers, where each integer corresponds to a word in the text.
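To make the two output shapes concrete, here is a minimal pure-Python sketch of the idea. It imitates the behaviour described above and is not the Keras implementation; the helper names and sample texts are made up, and index 0 is left unused to mirror the Keras convention of reserving it.

```python
from collections import Counter

texts = ["abdul likes python", "abdul likes data"]  # hypothetical sample texts

# Build a vocabulary: the most frequent word gets index 1, index 0 stays unused
counts = Counter(word for text in texts for word in text.split())
word_index = {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts):
    """One list of word indices per text; order and length follow the text."""
    return [[word_index[w] for w in text.split()] for text in texts]

def to_binary_matrix(texts):
    """One fixed-width row per text; a 1 marks each word that occurs in it."""
    width = len(word_index) + 1
    rows = []
    for seq in to_sequences(texts):
        row = [0] * width
        for idx in seq:
            row[idx] = 1
        rows.append(row)
    return rows

print(to_sequences(texts))      # [[1, 2, 3], [1, 2, 4]]
print(to_binary_matrix(texts))  # [[0, 1, 1, 1, 0], [0, 1, 1, 0, 1]]
```

The sequences keep word order but vary in length, while the matrix rows all have the same width but discard word order.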

In [19]:
!pip install tensorflow



In [20]:
import keras

In [21]:
#The Tokenizer converts text into a numerical form that machine learning models can accept
from keras.preprocessing.text import Tokenizer

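In [43]:
tokenizer = Tokenizer(oov_token='<OOV>')  #assumed setup, missing from this export; an OOV token is consistent with the 10-column output
tokenizer.fit_on_texts(stemed_words)      #build the vocabulary before encoding
encoded_text = tokenizer.texts_to_matrix(stemed_words)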

In [44]:
encoded_text

array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])