**Processing the text**

- Lowercase Conversion: Convert the text to lowercase.
- Remove Stopwords: Remove common words such as "and", "is", and "the" using a stopword list.
- Tokenization: Split the text into individual words or phrases.
- Remove Punctuation and Special Characters: Clean the text by removing non-alphanumeric characters.
- Lemmatization/Stemming: Reduce words to their base or root form to avoid duplication, e.g. "running" -> "run".


Example: "this course provides an in-depth introduction to python programming, focusing on data analysis, machine learning and automation."

After processing: ["course", "provides", "introduction", "python", "programming", "focusing", "data", "analysis", "machine", "learning", "automation"]
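
The steps above can be chained in a few lines. The following is a minimal sketch using only the standard library. The tiny `STOPWORDS` set and the `preprocess` helper are illustrative assumptions (not the NLTK stopword list used later), and lemmatization is left out.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"this", "an", "in", "to", "on", "and", "is", "the", "a"}

def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # non-alphanumerics become spaces
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]

sample = ("This course provides an in-depth introduction to Python programming, "
          "focusing on data analysis, machine learning and automation.")
print(preprocess(sample))
```

With this stopword list, "depth" from "in-depth" survives because the hyphen is replaced by a space. Whether to keep such fragments depends on the task.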

**Lowercasing**

In [None]:
text = "This provides an in-depth introduction to Python programming, focusing on data analysis, machine learning, and automation."

text = text.lower()

print(text)

**Removing Stopwords**

Three libraries are used here to remove stopwords:

- NLTK
- spaCy
- Textcleaner

In [None]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('punkt_tab')
nltk.download('stopwords')

tokens = word_tokenize(text)  # split into word and punctuation tokens
stop = set(stopwords.words('english'))  # a set gives fast membership tests
filtered = [w for w in tokens if w.lower() not in stop]
print(tokens)
print(filtered)


spaCy could not be installed in this Jupyter notebook, so the spaCy version is in a separate file:

E:\C125\python\data cleaning\spacy.py

In [None]:
from textcleaner import clean

text = "This is a sample text with some HTML tags like <br> and special characters like &amp; and some numbers 123."
cleaned_text = clean(text)
print(cleaned_text)

text = "This is a sample text with some \t tabs and \n new lines."
cleaned_text = clean(text, tabs=True, new_lines=True)
print(cleaned_text)

text = "This is a sample text with some extra spaces.   "
cleaned_text = clean(text, extra_spaces=True)
print(cleaned_text)

text = "This is a sample text with some punctuation!?"
cleaned_text = clean(text, punctuation=True)
print(cleaned_text)

text = "This is a sample text with some numbers 12345."
cleaned_text = clean(text, numbers=True)
print(cleaned_text)


text = "This is a sample text with some accents like éàçüö."
cleaned_text = clean(text, accents=True)
print(cleaned_text)

text = "This is a sample text with some brackets like []{}()<>."
cleaned_text = clean(text, brackets=True)
print(cleaned_text)

text = "This is a sample text with some quotes like ''\"\"."
cleaned_text = clean(text, quotes=True)
print(cleaned_text)

**Data Cleaning with the re module**

In [None]:
#basic data cleaning

import re

text = " This course covers advanced topics in <b>Deep Learning</b> and Neural Networks.   Extra spaces"

text = text.lower()
text = re.sub(r"<.*?>", "", text)  # remove HTML tags
text = re.sub(r"&amp;", "&", text)  # decode the HTML entity &amp; to &
text = re.sub(r"[^\x00-\x7F]+", "", text)  # remove non-ASCII characters
text = text.replace('\n', ' ').replace('\t', ' ')  # replace newlines and tabs with spaces
text = re.sub(r"\s+", " ", text).strip()  # collapse extra spaces and trim the ends

print(text)



**Tokenization**

Methods to tokenize:
- NLTK:
    - word_tokenize(): uses a Treebank-style word tokenizer, which handles contractions (e.g. "don't" becomes "do" and "n't") and punctuation reasonably well.
    - sent_tokenize(): splits text into sentences.
    - TreebankWordTokenizer(): the Treebank tokenizer that word_tokenize() is based on. You can use it directly for more control.

    Other tokenizers: NLTK also provides WhitespaceTokenizer, WordPunctTokenizer, and regular-expression-based tokenizers for more specialized use cases.

- spaCy: its tokenization is rule-based and optimized for speed and accuracy. It is integrated into spaCy's processing pipeline.

- Other libraries/methods:
    - str.split(): the simplest form of tokenization.

    - Regular expressions (re.findall()): useful for more complex tokenization patterns.

    - Hugging Face Tokenizers: the tokenizers library from Hugging Face is widely used for transformer-based models. It offers subword tokenization algorithms such as WordPiece, Byte-Pair Encoding (BPE), and Unigram (as used by SentencePiece), which are often used in deep learning models for NLP.
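
To give a feel for what subword tokenization does, here is a simplified sketch of the greedy longest-match-first splitting that WordPiece-style tokenizers apply at inference time. It is plain Python, not the Hugging Face API. The `wordpiece_tokenize` helper and the toy `VOCAB` are hypothetical, and real vocabularies are learned from a corpus.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: treat the whole word as unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary for illustration only.
VOCAB = {"token", "##ization", "##ize", "##s", "play", "##ing"}

print(wordpiece_tokenize("tokenization", VOCAB))  # ['token', '##ization']
print(wordpiece_tokenize("playing", VOCAB))       # ['play', '##ing']
print(wordpiece_tokenize("xyz", VOCAB))           # ['[UNK]']
```

Splitting rare words into known pieces is what lets a model with a fixed vocabulary handle words it never saw whole.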

In [None]:
#word_tokenize(), sent_tokenize() and TreebankWordTokenizer

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tokenize.treebank import TreebankWordTokenizer

nltk.download('punkt_tab')

text = "this is an example sentence. It's Tokenized"
tokens = word_tokenize(text)

sentences = sent_tokenize(text)

tokens2 = TreebankWordTokenizer().tokenize(text)

print(tokens)
print(sentences)
print(tokens2)

In [None]:
#spaCy tokenization is in the Python file data cleaning(spacy).py
#str.split()
text = "this is a sample text"
tokens = text.split()
print(tokens)

#using a regular expression

import re

text = "This is an example-with-hyphens."
tokens = re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)  # Matches words (optionally hyphenated) or single punctuation characters
print(tokens)
# Output: ['This', 'is', 'an', 'example-with-hyphens', '.']

**Removing Punctuation and special characters**

In [None]:
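# using re
import re

text = "this is a sample text with 4$%(#)"
text = re.sub(r"[^\w\s]", " ", text)  # replace everything except word characters and whitespace
text = re.sub(r"\s+", " ", text).strip()  # collapse the spaces left behind
print(text)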
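
An alternative that avoids regular expressions is `str.translate` with a translation table. This is a minimal sketch, assuming only ASCII punctuation needs to be removed (`string.punctuation` does not cover Unicode punctuation such as curly quotes). The sample sentence is made up for illustration.

```python
import string

text = "Hello, world! It costs $4 (approx.)"

# Map every ASCII punctuation character to a space.
table = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Translate, then split and re-join to collapse the leftover spaces.
cleaned = " ".join(text.translate(table).split())
print(cleaned)  # Hello world It costs 4 approx
```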



**Lemmatization or Stemming of a word**

- Stemming: A simpler approach that chops off suffixes (and sometimes prefixes) of words based on heuristics. It often results in non-words, e.g. a naive suffix stripper turns "running" into "runn", and NLTK's Porter stemmer turns "studies" into "studi".
- Lemmatization: A more sophisticated approach that uses a vocabulary and morphological analysis to find the base or dictionary form of a word (the lemma). It produces actual words, e.g. "running" becomes "run".
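
The contrast can be illustrated with a toy example. This is a rough sketch only: `toy_stem` is a hypothetical suffix stripper (much cruder than a real stemmer such as Porter), and `LEMMAS` is a hypothetical hand-written dictionary standing in for the full vocabulary a real lemmatizer relies on.

```python
def toy_stem(word):
    """Crude suffix stripping: remove the first matching suffix, no dictionary check."""
    for suffix in ("ing", "ies", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini lemma dictionary for illustration only.
LEMMAS = {"running": "run", "studies": "study", "better": "good", "mice": "mouse"}

def toy_lemmatize(word):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)

for w in ["running", "studies", "better", "mice"]:
    print(w, "->", toy_stem(w), "|", toy_lemmatize(w))
```

The stemmer produces fragments like "runn" and "stud" and cannot relate "better" to "good", while the dictionary-based lookup returns real words but only for entries it knows.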

**Done in file: data cleaning(Stemming and lemmatization).py**