<h2>Natural Language Processing Using spaCy and Python</h2>

Natural Language Processing (NLP) is a field focused on enabling computers to understand and process human language. 

<h4>Tokenization in NLP</h4>

Tokenization is the process of breaking text into pieces called tokens. Some tokenizers discard punctuation marks (, . “ ‘) and spaces; spaCy's tokenizer keeps punctuation marks as separate tokens. It takes Unicode text as input and outputs a sequence of Token objects.
- Types of tokenization
  <ul>
      <li>Word Tokenization</li>
      This method divides text into individual words. It's the most common form of tokenization and works well for languages with clear word boundaries. Example: the sentence "Natural language processing is fascinating." becomes ["Natural", "language", "processing", "is", "fascinating"].

   <li>Sentence Tokenization</li>
  This technique splits text into sentences rather than words. It’s useful for tasks that require analyzing the structure of a document.
   <li>Character Tokenization</li>
   This method divides text into individual characters. Example: "NLP" becomes ["N", "L", "P"].
   <li>Subword Tokenization</li>
   This method breaks text into units that are larger than single characters but smaller than full words. It’s useful for handling out-of-vocabulary words in NLP tasks. Example: the word "unhappiness" could be tokenized into ["un", "happiness"].
   <li>Custom Tokenization</li>
   In some cases, custom rules are created based on specific requirements, such as tokenizing hashtags or identifying domain-specific phrases.
  </ul>
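As a rough illustration of the character and custom variants listed above, here is a minimal sketch in plain Python (no spaCy needed). The function names and the regular expression are made up for this example; a real project would tune the pattern to its own data.

```python
import re


def char_tokenize(text):
    """Character tokenization: each non-whitespace character is a token."""
    return [ch for ch in text if not ch.isspace()]


# Hypothetical pattern for illustration only: keep hashtags whole,
# then match plain words, then single punctuation marks.
HASHTAG_AWARE = re.compile(r"#\w+|\w+|[^\w\s]")


def custom_tokenize(text):
    """Custom tokenization: a rule that treats '#tag' as one token."""
    return HASHTAG_AWARE.findall(text)


print(char_tokenize("NLP"))
# ['N', 'L', 'P']
print(custom_tokenize("Loving #NLP with spaCy!"))
# ['Loving', '#NLP', 'with', 'spaCy', '!']
```

The order of the alternatives in the pattern matters: the hashtag branch comes first so that "#NLP" is not split into "#" and "NLP".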
Here I will only deal with word and sentence tokenization.

In [1]:
# Word tokenization
from spacy.lang.en import English
# English() creates a blank English pipeline: it contains only the tokenizer,
# with no tagger, parser, NER or word vectors
nlp = English()
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""
# Calling the "nlp" object on a string returns a document made up of tokens
token_doc = nlp(text)
# Collect the text of every token in the document
token_list = [token.text for token in token_doc]
print(token_list)

['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']


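- Notice that `spaCy` produces a list that contains each token as a separate item. Punctuation marks, the newline character, and contraction parts such as "n't" and "'re" are tokens too, not only whole words.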
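If only the word-like tokens are wanted, punctuation and whitespace tokens can be filtered out afterwards. The sketch below assumes the `is_punct` and `is_space` token attributes are enough for the task; `blank_nlp` and `words_only` are names chosen for this example, not part of spaCy.

```python
from spacy.lang.en import English

# A separate blank pipeline (tokenizer only), so the "nlp" object
# used in the other cells is left untouched.
blank_nlp = English()


def words_only(text):
    """Return token texts, skipping punctuation and whitespace tokens."""
    return [
        token.text
        for token in blank_nlp(text)
        if not (token.is_punct or token.is_space)
    ]


print(words_only("You shouldn't get discouraged!\nYou've got this!"))
```

Contraction parts such as "n't" and "'ve" contain letters, so they are kept; only tokens made up entirely of punctuation or whitespace are dropped.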

In [4]:
# Sentence tokenization
# Add the rule-based sentence splitter to the pipeline created above
nlp.add_pipe("sentencizer")
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""
doc = nlp(text)
# doc.sents yields one span per sentence; collect them in a list
sentences = list(doc.sents)
for i, sentence in enumerate(sentences):
    print(f"Sentence {i + 1}: {sentence.text}")

Sentence 1: When learning data science, you shouldn't get discouraged!
Sentence 2: 
Challenges and setbacks aren't failures, they're just part of the journey.
Sentence 3: You've got this!


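Again, spaCy has parsed the text into the format we want, this time producing the sentences found in our source text. Sentence 2 begins with the newline character from the input, which is why its text is printed on the following line.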



<h4>Cleaning Text Data: Removing Stopwords</h4>