### Introduction to Text Classification
Text classification is a Natural Language Processing (NLP) task in which a piece of text is assigned a label from a predefined set of categories, based on its content.

- Examples:
   - Email filtering (spam vs. not spam)
   - Sentiment analysis (positive, negative, neutral)
   - News categorization (politics, sports, technology)
- Importance: automates the understanding of large volumes of text

#### Applications of Text Classification
- Sentiment Analysis
- Document Classification
- Spam Detection
- Language Detection
- Customer Support
- Search Engine Optimization

#### Methods used in Text Classification
- Rule-Based Systems
- Machine Learning Models
- Deep Learning Models 
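
As a point of comparison for the three families above, a rule-based system can be as small as a keyword lookup. The sketch below is illustrative only: the keyword sets and the `rule_based_sentiment` function are hypothetical and not part of any library.

```python
import re

# Hypothetical keyword lists, chosen only for illustration.
POSITIVE_WORDS = {"love", "good", "nice", "great"}
NEGATIVE_WORDS = {"boring", "terrible", "bad", "hated"}


def rule_based_sentiment(text: str) -> str:
    """Label text by counting hand-picked positive and negative keywords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos_hits = sum(token in POSITIVE_WORDS for token in tokens)
    neg_hits = sum(token in NEGATIVE_WORDS for token in tokens)
    if pos_hits > neg_hits:
        return "pos"
    if neg_hits > pos_hits:
        return "neg"
    return "neutral"  # no keyword matched, or a tie


print(rule_based_sentiment("I love this movie"))
print(rule_based_sentiment("Boring and terrible"))
```

Rules like these are transparent but brittle. Machine learning and deep learning models instead learn the association between words and labels from labeled examples.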

#### Building a Text Classifier Using Different Libraries
1. Import the required libraries
2. Load and preprocess the text data
3. Extract features using Bag of Words or TF-IDF
4. Train a simple Naive Bayes classifier
5. Test and evaluate the classifier
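
The feature extraction step can be illustrated without any library. The following is a minimal sketch of bag-of-words counts and one common textbook TF-IDF variant (term frequency times `log(N / df)`); libraries such as scikit-learn apply smoothed variants, so their numbers will differ. The function names and the two-document corpus are made up for this example.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Count how often each token occurs in one document."""
    return Counter(tokens)


def tf_idf(corpus):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / document length and idf = log(N / df), an
    unsmoothed variant chosen here for readability.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = bag_of_words(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


corpus = [
    ["good", "acting", "good", "movie"],
    ["bad", "acting"],
]
print(bag_of_words(corpus[0]))
print(tf_idf(corpus))
```

Under this variant, a term that appears in every document (here `acting`) receives a weight of 0, because it does not help tell the documents apart.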

#### spaCy
- Built for production use: optimized for large-scale data and real-time NLP applications.
- Known for its speed and efficiency.
- Supports advanced preprocessing such as lemmatization, POS tagging, dependency parsing, named entity recognition (NER), and word embeddings.

- spaCy 101: A brief introduction
- spaCy Course: Advanced NLP with spaCy

#### Text Classification with spaCy
- Load spaCy and process text
- Extract linguistic features (POS, entities, etc.)
- Train a classifier using features
- Evaluate the classifier
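
The first two steps above might look like the sketch below. It assumes spaCy is installed and uses `spacy.blank("en")`, which provides only the tokenizer and lexical attributes; POS tags and named entities require a trained pipeline (for example `en_core_web_sm`, loaded with `spacy.load`). The `spacy_features` helper is a hypothetical name.

```python
import spacy

# A blank English pipeline: tokenizer plus lexical attributes,
# with no trained components (so no POS tags or entities).
nlp = spacy.blank("en")


def spacy_features(sentence):
    """Build a {token: True} feature dict from alphabetic, non-stop-word tokens."""
    doc = nlp(sentence)
    return {
        token.lower_: True
        for token in doc
        if token.is_alpha and not token.is_stop
    }


print(spacy_features("I love this movie"))
```

The resulting dictionaries have the same shape as NLTK feature sets, so they could be passed to a classifier such as NLTK's `NaiveBayesClassifier` for the training and evaluation steps.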

# Text Classification with NLTK (Natural Language Toolkit)

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.classify import NaiveBayesClassifier

In [3]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\sandh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\movie_reviews.zip.


True

In [4]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sandh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sandh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [6]:
training_data = [
    ("I love this movie", "pos"),
    ("Good and Nice acting", "pos"),
    ("Boring and Terrible", "neg"),
    ("Bad acting and I hated this", "neg")]

In [8]:
words = word_tokenize("I love this movie".lower())
words

['i', 'love', 'this', 'movie']

In [12]:
stop_words = stopwords.words('english')
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 ...]

In [13]:
words_revised = [word for word in words if word.isalpha() and word not in stop_words]
words_revised

['love', 'movie']

In [16]:
def extract_features(sentence):
    # Lowercase and tokenize: "I love this movie" -> ['i', 'love', 'this', 'movie']
    words = word_tokenize(sentence.lower())
    stop_words = set(stopwords.words('english'))
    # Keep alphabetic tokens only (drops punctuation and numbers) and remove stop words
    words_revised = [word for word in words if word.isalpha() and word not in stop_words]
    # Return a feature dict in the {feature_name: value} format NLTK classifiers expect
    return {word: True for word in words_revised}

In [17]:
extract_features("I love this movie")

{'love': True, 'movie': True}

In [20]:
# Convert training_data into feature sets
training_features = [(extract_features(sentence), label) for sentence, label in training_data]
training_features

[({'love': True, 'movie': True}, 'pos'),
 ({'good': True, 'nice': True, 'acting': True}, 'pos'),
 ({'boring': True, 'terrible': True}, 'neg'),
 ({'bad': True, 'acting': True, 'hated': True}, 'neg')]

In [22]:
classifier = NaiveBayesClassifier.train(training_features)
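
To see how confident the classifier is, rather than only its top label, `prob_classify` returns a probability distribution over the labels. The sketch below retrains on the same four feature sets, written out literally so that it runs without downloading any NLTK data; the query `{'movie': True}` is just an example input.

```python
from nltk.classify import NaiveBayesClassifier

# The feature sets derived from training_data, written out literally.
training_features = [
    ({'love': True, 'movie': True}, 'pos'),
    ({'good': True, 'nice': True, 'acting': True}, 'pos'),
    ({'boring': True, 'terrible': True}, 'neg'),
    ({'bad': True, 'acting': True, 'hated': True}, 'neg'),
]
classifier = NaiveBayesClassifier.train(training_features)

# Probability distribution over labels for a single feature set.
dist = classifier.prob_classify({'movie': True})
for label in sorted(dist.samples()):
    print(label, round(dist.prob(label), 3))

# Features whose presence or absence best separates the labels.
classifier.show_most_informative_features(5)
```

With only four training sentences, these probabilities mostly reflect smoothing and a handful of word counts, so they should be read as a demonstration of the API rather than as meaningful confidence scores.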

In [39]:
test_sentences = [
    "I really enjoyed this movie",
    "Boring and Terrible"
]

In [40]:
for t_sent in test_sentences:
    f_extract = extract_features(t_sent)
    predicted_label = classifier.classify(f_extract)
    print(f"Sentence: '{t_sent}' => Predicted Sentiment: {predicted_label}")

Sentence: 'I really enjoyed this movie' => Predicted Sentiment: pos
Sentence: 'Boring and Terrible' => Predicted Sentiment: neg
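
Evaluating the classifier, the last step in the workflow, needs labeled test data. A minimal sketch using `nltk.classify.accuracy` follows; the two held-out feature sets are invented for illustration, and a sample this small says nothing reliable about real-world performance.

```python
from nltk.classify import NaiveBayesClassifier, accuracy

# Same training feature sets as in the notebook, written out literally.
training_features = [
    ({'love': True, 'movie': True}, 'pos'),
    ({'good': True, 'nice': True, 'acting': True}, 'pos'),
    ({'boring': True, 'terrible': True}, 'neg'),
    ({'bad': True, 'acting': True, 'hated': True}, 'neg'),
]
classifier = NaiveBayesClassifier.train(training_features)

# Hypothetical labeled test set, already converted to feature dicts.
test_features = [
    ({'love': True}, 'pos'),
    ({'terrible': True, 'acting': True}, 'neg'),
]

# Fraction of test examples whose predicted label matches the given label.
print(accuracy(classifier, test_features))
```

In practice the test set would be a held-out portion of a larger labeled corpus, passed through the same `extract_features` function as the training data.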
