# Week 6, Day 1: Introduction to Natural Language Processing

## Learning Objectives
- Understand NLP fundamentals
- Learn text preprocessing techniques
- Master basic NLP tasks
- Practice implementing NLP pipelines

## Topics Covered
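1. Text Preprocessing (tokenization, stopword removal, stemming, lemmatization)
2. Text Representation (bag of words, TF-IDF)
3. Basic NLP Tasks (part-of-speech tagging, named entity recognition)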

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

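# Download required NLTK data
# Note: recent NLTK releases (3.9 and later) load the tokenizer from 'punkt_tab'
# instead of 'punkt', so both are requested here to cover either version.
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')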

## 1. Text Preprocessing

In [None]:
def text_preprocessing_example():
    # Sample text
    text = """
    Natural Language Processing (NLP) is a branch of artificial intelligence
    that helps computers understand, interpret, and manipulate human language.
    NLP combines computational linguistics, machine learning, and deep learning
    to process and analyze large amounts of natural language data.
    """
    
    # Tokenization
    sentences = sent_tokenize(text)
    words = word_tokenize(text)
    
    # Remove stopwords and punctuation-only tokens such as "(" and ","
    stop_words = set(stopwords.words('english'))
    filtered_words = [
        word.lower() for word in words
        if word.isalpha() and word.lower() not in stop_words
    ]
    
    # Stemming
    stemmer = PorterStemmer()
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    
    # Lemmatization
    # (without a pos argument, WordNetLemmatizer treats every word as a noun)
    lemmatizer = WordNetLemmatizer()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
    
    # Print results
    print("Original Text:")
    print(text)
    
    print("\nSentences:")
    for i, sent in enumerate(sentences, 1):
        print(f"{i}. {sent.strip()}")
    
    print("\nOriginal Words:")
    print(words[:20])
    
    print("\nFiltered Words (without stopwords):")
    print(filtered_words[:20])
    
    print("\nStemmed Words:")
    print(stemmed_words[:20])
    
    print("\nLemmatized Words:")
    print(lemmatized_words[:20])

text_preprocessing_example()
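
To make the first two steps above less of a black box, here is a minimal sketch of sentence splitting, word tokenization, and stopword filtering in plain Python. It is not how NLTK implements these steps: the regular expressions are deliberately naive, and `TOY_STOPWORDS`, `simple_sent_tokenize`, `simple_word_tokenize`, and `remove_stopwords` are hypothetical names introduced only for this illustration.

```python
import re

# Hypothetical, deliberately tiny stopword set; NLTK's English list is much longer.
TOY_STOPWORDS = {"is", "a", "of", "it", "the", "and", "to"}


def simple_sent_tokenize(text):
    """Split after '.', '!' or '?' followed by whitespace (naive: breaks on abbreviations)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def simple_word_tokenize(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def remove_stopwords(tokens, stopwords=TOY_STOPWORDS):
    """Lowercase tokens, then drop stopwords and non-alphabetic tokens."""
    return [t.lower() for t in tokens if t.isalpha() and t.lower() not in stopwords]


sample = "NLP is a branch of AI. It helps computers understand language!"
sentences = simple_sent_tokenize(sample)
tokens = simple_word_tokenize(sample)
content_words = remove_stopwords(tokens)

print(sentences)
print(tokens)
print(content_words)
```

Note that punctuation comes out of the tokenizer as its own token, which is why the filtering step checks `isalpha()` as well as the stopword set.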

## 2. Text Representation

In [None]:
def text_representation_example():
    # Sample documents
    documents = [
        "Natural language processing is fascinating.",
        "Machine learning algorithms are powerful.",
        "Deep learning revolutionized NLP.",
        "Processing natural language requires understanding."
    ]
    
    # Bag of Words
    count_vectorizer = CountVectorizer()
    bow_matrix = count_vectorizer.fit_transform(documents)
    
    # TF-IDF
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(documents)
    
    # Create DataFrames for visualization
    bow_df = pd.DataFrame(
        bow_matrix.toarray(),
        columns=count_vectorizer.get_feature_names_out()
    )
    
    tfidf_df = pd.DataFrame(
        tfidf_matrix.toarray(),
        columns=tfidf_vectorizer.get_feature_names_out()
    )
    
    # Visualize results
    plt.figure(figsize=(15, 5))
    
    plt.subplot(121)
    sns.heatmap(bow_df, annot=True, fmt='.0f', cmap='YlOrRd')
    plt.title('Bag of Words Representation')
    plt.xlabel('Words')
    plt.ylabel('Documents')
    
    plt.subplot(122)
    sns.heatmap(tfidf_df, annot=True, fmt='.2f', cmap='YlOrRd')
    plt.title('TF-IDF Representation')
    plt.xlabel('Words')
    plt.ylabel('Documents')
    
    plt.tight_layout()
    plt.show()

text_representation_example()
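
The numbers in the TF-IDF heatmap can be reproduced by hand. The sketch below assumes scikit-learn's documented defaults for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). The function `tf_idf` is a hypothetical helper, and the documents are pre-lowercased with punctuation removed so that a plain `split()` can stand in for the vectorizer's tokenizer.

```python
import math
from collections import Counter

# The four sample documents, lowercased and stripped of punctuation by hand.
documents = [
    "natural language processing is fascinating",
    "machine learning algorithms are powerful",
    "deep learning revolutionized nlp",
    "processing natural language requires understanding",
]


def tf_idf(docs):
    """Return (vocabulary, rows), where each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenized for word in set(tokens))

    # Smoothed inverse document frequency (assumed scikit-learn default).
    idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[word] * idf[word] for word in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm for w in weights])
    return vocabulary, rows


vocabulary, rows = tf_idf(documents)
for word, weight in zip(vocabulary, rows[0]):
    if weight > 0:
        print(f"{word}: {weight:.2f}")
```

In the first document, "fascinating" appears in only one document and so receives a higher weight than "natural", which appears in two. That is the sense in which TF-IDF down-weights words shared across the corpus.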

## 3. Basic NLP Tasks

In [None]:
def basic_nlp_tasks():
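    # Part-of-Speech Tagging
    # (NLTK 3.9+ uses the '_eng' / '_tab' resource names; both are requested)
    nltk.download('averaged_perceptron_tagger')
    nltk.download('averaged_perceptron_tagger_eng')
    pos_text = "The quick brown fox jumps over the lazy dog"
    pos_tags = nltk.pos_tag(word_tokenize(pos_text))
    
    # Named Entity Recognition
    # Separate variable names keep the POS results above from being overwritten.
    nltk.download('maxent_ne_chunker')
    nltk.download('maxent_ne_chunker_tab')
    nltk.download('words')
    ner_text = "Apple Inc. was founded by Steve Jobs in California"
    ner_pos_tags = nltk.pos_tag(word_tokenize(ner_text))
    named_entities = nltk.ne_chunk(ner_pos_tags)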
    
    # Print results
    print("Part-of-Speech Tagging:")
    for word, tag in pos_tags:
        print(f"{word}: {tag}")
    
    print("\nNamed Entity Recognition:")
    print(named_entities)

basic_nlp_tasks()

## Practical Exercises

In [None]:
# Exercise 1: Text Classification

def text_classification_exercise():
    # Sample dataset
    texts = [
        "This movie was fantastic! Great acting and plot.",
        "Terrible waste of time. Poor acting and boring story.",
        "Amazing film, highly recommended!",
        "Don't waste your money on this movie.",
        "Excellent performance by the entire cast."
    ]
    labels = [1, 0, 1, 0, 1]  # 1: positive, 0: negative
    
    print("Task: Create a simple sentiment classifier")
    print("1. Preprocess the texts")
    print("2. Create text representations")
    print("3. Train a classifier")
    print("4. Evaluate performance")
    
    # Your code here

text_classification_exercise()
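
One possible starting point for the four steps, offered as a sketch rather than a reference solution: a from-scratch multinomial Naive Bayes classifier with add-one smoothing. The exercise does not prescribe a classifier, so this choice is an assumption, and `tokenize`, `train_naive_bayes`, and `predict` are hypothetical helper names. It is written without scikit-learn so the counting is visible.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Step 1: lowercase and keep alphabetic runs (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(texts, labels):
    """Steps 2-3: count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return class_counts, word_counts, vocabulary


def predict(text, model):
    """Return the class with the highest log-probability under the model."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)


texts = [
    "This movie was fantastic! Great acting and plot.",
    "Terrible waste of time. Poor acting and boring story.",
    "Amazing film, highly recommended!",
    "Don't waste your money on this movie.",
    "Excellent performance by the entire cast.",
]
labels = [1, 0, 1, 0, 1]  # 1: positive, 0: negative

model = train_naive_bayes(texts, labels)

# Step 4 (sanity check only): accuracy on the training texts themselves.
predictions = [predict(text, model) for text in texts]
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print("Training accuracy:", accuracy)
print("New review:", predict("Great film, amazing cast", model))
```

With only five examples, the training accuracy says little about generalization. A fuller solution would hold out test data or use cross-validation on a larger dataset.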

In [None]:
# Exercise 2: Text Analysis

def text_analysis_exercise():
    text = """
    The field of artificial intelligence has seen remarkable progress in recent years.
    Machine learning algorithms have revolutionized many industries, from healthcare
    to finance. Natural language processing, a subset of AI, has enabled computers
    to understand and generate human language with increasing accuracy.
    """
    
    print("Task: Analyze the text")
    print("1. Calculate word frequencies")
    print("2. Find key phrases")
    print("3. Identify main topics")
    print("4. Visualize results")
    
    # Your code here

text_analysis_exercise()

## MCQ Quiz

1. What is tokenization?
   - a) Text compression
   - b) Breaking text into units
   - c) Text generation
   - d) Language translation

2. What are stopwords?
   - a) Important keywords
   - b) Common words that carry little meaning on their own
   - c) Punctuation marks
   - d) Named entities

3. What is stemming?
   - a) Word creation
   - b) Reducing words to root form
   - c) Part-of-speech tagging
   - d) Sentence parsing

4. What does TF-IDF measure?
   - a) Word count
   - b) How important a word is to a document within a corpus
   - c) Text similarity
   - d) Grammar accuracy

5. What is lemmatization?
   - a) Word counting
   - b) Reducing words to their dictionary (base) form
   - c) Text summarization
   - d) Language detection

6. What is part-of-speech tagging?
   - a) Word counting
   - b) Grammar checking
   - c) Word role labeling
   - d) Text generation

7. What is named entity recognition?
   - a) Grammar checking
   - b) Identifying and classifying named entities (people, organizations, places)
   - c) Word counting
   - d) Text translation

8. What is the bag of words model?
   - a) Grammar model
   - b) Word frequency representation
   - c) Translation model
   - d) Language model

9. What is a corpus in NLP?
   - a) Text editor
   - b) Collection of texts
   - c) Programming language
   - d) Machine learning model

10. What is the purpose of text preprocessing?
    - a) Text generation
    - b) Clean and standardize text
    - c) Language translation
    - d) Grammar checking

Answers: 1-b, 2-b, 3-b, 4-b, 5-b, 6-c, 7-b, 8-b, 9-b, 10-b