Natural Language Processing using Python (January 2025) - Lab Activities
Day 1:
- Removing stop words
- POS tagging
- Generating antonyms
- Generating synonyms
Day 2:
- Calculating frequency of top N words
- Using the NLTK and re modules to tokenize text and remove stop words and special characters
- Using the wordcloud library to generate a word cloud from the cleaned text
Day 3: Apply CountVectorizer to the input data and answer the following:
- Count Vectorizer Matrix
- Vocabulary (unique words in corpus)
- Calculating frequency of top N words (top 1 word = most frequent term)
- Finding out words that appear in all/maximum sentences/documents
- (Doubts):
- What would happen if we set stop_words='english' in the CountVectorizer?
 - Should we manually explain the tone of 'happy', or write code for it (as we did to pick out the term that occurs in all documents)?
Day 4: Multinomial Naive Bayes model on 'Movie Review' dataset:
- Preprocessing steps, using the NLTK and re modules, to:
- Tokenize text
- Remove stop words
- Remove special characters
- Word Cloud for the data
- Multinomial Naive Bayes (MultinomialNB) model
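A minimal sketch of the Day 4 model, assuming a CountVectorizer + MultinomialNB pipeline; the four-review dataset below is a tiny stand-in for the actual Movie Review data, with 1 = positive and 0 = negative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in for the movie-review dataset (1 = positive, 0 = negative)
reviews = [
    "a wonderful heartfelt film with great acting",
    "brilliant plot and a great cast",
    "terrible pacing and a boring script",
    "dull characters in an awful movie",
]
labels = [1, 1, 0, 0]

# Vectorize counts (dropping English stop words), then fit Multinomial NB
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(reviews, labels)

pred = model.predict(["great film with a brilliant cast"])
```

In the lab itself, the NLTK/re cleaning steps listed above would be applied before vectorization.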
Capstone Project: Major concepts used:
- Word tokenization
- Stop Words
- Lemmatization